Bug
Resolution: Unresolved
Minor
None
Jenkins server 2.263.3 (OpenJDK 1.8.0_275-b01)
Agent Ubuntu 20.04 (OpenJDK build 11.0.9.1+1)
Hi Jenkins team,
we noticed that our Jenkins cloud agents occasionally fail to connect to the Jenkins server.
This happens approximately 1-2 times a week on a particular agent.
The client connection starts but stalls right after the remote identity has been confirmed:
Feb 05, 2021 3:45:26 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: bert-os-bos-108-bmqio-2-1
Feb 05, 2021 3:45:26 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Feb 05, 2021 3:45:26 PM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 4.5
Feb 05, 2021 3:45:26 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /var/jenkins/localssd/remoting as a remoting work directory
Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among https://my.buildserver.io/
Feb 05, 2021 3:45:27 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
  Agent address: my.buildserver.io
  Agent port: 50000
  Identity: b7:98:61:3e:17:26:eb:80:c6:79:cf:a3:aa:aa:aa:aa
Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to my.buildserver.io:50000
Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect
Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Remote identity confirmed: b7:98:61:3e:17:26:eb:80:c6:79:cf:a3:aa:aa:aa:aa
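To spot this state automatically, a small check over the agent log might help. The following is only a sketch: it treats a log whose last "Remote identity confirmed" message has no later success message as stalled. The success marker `INFO: Connected` is an assumption about what a healthy agent logs and may differ between Remoting versions; the function name `handshake_stalled` is made up for illustration.

```python
# Marker taken from the agent log above.
IDENTITY_CONFIRMED = "Remote identity confirmed"
# ASSUMPTION: a successful connection logs this line; adjust for your Remoting version.
CONNECTED = "INFO: Connected"


def handshake_stalled(log_text: str) -> bool:
    """Return True if the last identity confirmation has no success marker after it."""
    last_confirmed = log_text.rfind(IDENTITY_CONFIRMED)
    if last_confirmed == -1:
        # The handshake never reached the identity check, so this is a different failure.
        return False
    return CONNECTED not in log_text[last_confirmed:]
```

In practice such a check would only be meaningful after a grace period, since a healthy agent also passes through this state briefly during startup.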
At the same time, the Jenkins server log shows the agent's connection attempt, and netstat on the server shows the connection as established:
2021-02-05 15:45:27.203+0000 [id=154585] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #205 from /$BUILDAGENT-IP:52654

bash-5.0# netstat -natup |grep $BUILDAGENT-IP
tcp 0 0 172.29.0.2:50000 $BUILDAGENT-IP:52654 ESTABLISHED 6/java
=========================================================================
Debugging on the agent side shows that the process waits indefinitely for a thread.
Running strace on the agent PID shows that the parent process is waiting for the (child) PID 3011, which hangs forever:
# strace -p 2916
strace: Process 2916 attached
futex(0x7fba3184d9d0, FUTEX_WAIT, 3011, NULL

root@bert-os-bos-108-bmqio-2-1:/var/jenkins/localssd/remoting/logs# strace -p 3011
strace: Process 3011 attached
futex(0x7fba2c015278, FUTEX_WAIT_PRIVATE, 0, NULL
strace child pid 3011:
root@bert-os-bos-108-bmqio-2-1:/var/jenkins/localssd/remoting/logs# strace -p 3011 -ff
strace: Process 3011 attached with 22 threads
[pid 3024] futex(0x7fba2c1b197c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3020] futex(0x7fba2c00a0e0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 3014] futex(0x7fba2c05d680, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 3034] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 3023] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 3022] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 3018] futex(0x7fba2c15d07c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3013] futex(0x7fba2c05d07c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3012] futex(0x7fba2c02cce0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2916] futex(0x7fba3184d9d0, FUTEX_WAIT, 3011, NULL <unfinished ...>
[pid 3027] futex(0x7fba2c02cce0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 3026] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 3032] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid 3017] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 3011] futex(0x7fba2c015278, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3015] futex(0x7fba2c0c9d7c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3021] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 3030] futex(0x7fba2c385c28, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3019] futex(0x7fba2c15ee7c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3031] epoll_wait(11, <unfinished ...>
[pid 3016] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 3025] restart_syscall(<... resuming interrupted restart_syscall ...>) = -1 ETIMEDOUT (Connection timed out)
[pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=322548318}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=372848350}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=423172343}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=473504715}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=523781139}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 3034] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 3034] futex(0x7fb9dc1d4d28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3034] futex(0x7fb9dc1d4d78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3766, tv_nsec=508365589}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 3025] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=574152656}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 3016] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 3016] futex(0x7fba2c0cba28, FUTEX_WAKE_PRIVATE, 1) = 0
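With 22 threads in the trace, a per-thread summary makes it easier to see where each one is parked. The sketch below is an illustration only, not part of the original investigation: it records the first syscall strace reported for each thread ID and counts threads per syscall. The helper names `blocked_syscalls` and `summarize` are made up, and the regular expression assumes the `[pid N] syscall(` layout shown in the output above.

```python
import re
from collections import Counter

# Matches entries such as "[pid 3024] futex(0x7fba2c1b197c, FUTEX_WAIT_PRIVATE, ..."
# Resumption entries like "[pid 3034] <... restart_syscall resumed>" do not match.
CALL = re.compile(r"\[pid\s+(\d+)\]\s+(\w+)\(")


def blocked_syscalls(strace_text: str) -> dict:
    """Map each thread ID to the first syscall strace reported for it."""
    first_call = {}
    for pid, syscall in CALL.findall(strace_text):
        # setdefault keeps the initial (blocked) call and ignores later activity.
        first_call.setdefault(int(pid), syscall)
    return first_call


def summarize(strace_text: str) -> Counter:
    """Count how many threads were first seen in each syscall."""
    return Counter(blocked_syscalls(strace_text).values())
```

Feeding the captured `strace -p 3011 -ff` output to `summarize` gives a quick count of threads sitting in `futex`, `restart_syscall`, and `epoll_wait`. A JVM thread dump would still be needed to map these thread IDs to Java stack traces.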
lsof:
# lsof -p 2916
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 2916 root cwd DIR 8,1 4096 2 /
java 2916 root rtd DIR 8,1 4096 2 /
java 2916 root txt REG 8,1 14560 525860 /usr/lib/jvm/java-11-openjdk-amd64/bin/java
java 2916 root mem REG 8,1 17371136 526322 /usr/lib/jvm/java-11-openjdk-amd64/lib/server/classes.jsa
java 2916 root mem REG 8,1 199456 526019 /usr/lib/jvm/java-11-openjdk-amd64/lib/libsunec.so
java 2916 root mem REG 8,1 3035952 5882 /usr/lib/locale/locale-archive
java 2916 root mem REG 8,1 101320 3443 /usr/lib/x86_64-linux-gnu/libresolv-2.31.so
java 2916 root mem REG 8,1 96320 526013 /usr/lib/jvm/java-11-openjdk-amd64/lib/libnet.so
java 2916 root mem REG 8,1 1518110 5887 /usr/lib/locale/C.UTF-8/LC_COLLATE
java 2916 root mem REG 8,1 141978067 526023 /usr/lib/jvm/java-11-openjdk-amd64/lib/modules
java 2916 root mem REG 8,1 31176 3436 /usr/lib/x86_64-linux-gnu/libnss_dns-2.31.so
java 2916 root mem REG 8,1 14392 525994 /usr/lib/jvm/java-11-openjdk-amd64/lib/libextnet.so
java 2916 root mem REG 8,1 75840 526014 /usr/lib/jvm/java-11-openjdk-amd64/lib/libnio.so
java 2916 root mem REG 8,1 201272 5888 /usr/lib/locale/C.UTF-8/LC_CTYPE
java 2916 root mem REG 8,1 50 5893 /usr/lib/locale/C.UTF-8/LC_NUMERIC
java 2916 root mem REG 8,1 3360 5896 /usr/lib/locale/C.UTF-8/LC_TIME
java 2916 root mem REG 8,1 270 5891 /usr/lib/locale/C.UTF-8/LC_MONETARY
java 2916 root mem REG 8,1 48 5885 /usr/lib/locale/C.UTF-8/LC_MESSAGES/SYS_LC_MESSAGES
java 2916 root mem REG 8,1 27002 3724 /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
java 2916 root mem REG 8,1 34872 526022 /usr/lib/jvm/java-11-openjdk-amd64/lib/libzip.so
java 2916 root mem REG 8,1 51832 3437 /usr/lib/x86_64-linux-gnu/libnss_files-2.31.so
java 2916 root mem REG 8,1 34 5894 /usr/lib/locale/C.UTF-8/LC_PAPER
java 2916 root mem REG 8,1 62 5892 /usr/lib/locale/C.UTF-8/LC_NAME
java 2916 root mem REG 8,1 131 5886 /usr/lib/locale/C.UTF-8/LC_ADDRESS
java 2916 root mem REG 8,1 47 5895 /usr/lib/locale/C.UTF-8/LC_TELEPHONE
java 2916 root mem REG 8,1 32768 774997 /tmp/hsperfdata_root/2916
java 2916 root mem REG 8,1 182000 526001 /usr/lib/jvm/java-11-openjdk-amd64/lib/libjava.so
java 2916 root mem REG 8,1 67720 526021 /usr/lib/jvm/java-11-openjdk-amd64/lib/libverify.so
java 2916 root mem REG 8,1 40040 3444 /usr/lib/x86_64-linux-gnu/librt-2.31.so
java 2916 root mem REG 8,1 104984 3422 /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
java 2916 root mem REG 8,1 1369352 3431 /usr/lib/x86_64-linux-gnu/libm-2.31.so
java 2916 root mem REG 8,1 1952928 3423 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28
java 2916 root mem REG 8,1 19164144 526028 /usr/lib/jvm/java-11-openjdk-amd64/lib/server/libjvm.so
java 2916 root mem REG 8,1 157224 3442 /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
java 2916 root mem REG 8,1 18816 3430 /usr/lib/x86_64-linux-gnu/libdl-2.31.so
java 2916 root mem REG 8,1 108936 3823 /usr/lib/x86_64-linux-gnu/libz.so.1.2.11
java 2916 root mem REG 8,1 2029224 3429 /usr/lib/x86_64-linux-gnu/libc-2.31.so
java 2916 root mem REG 8,1 23 5890 /usr/lib/locale/C.UTF-8/LC_MEASUREMENT
java 2916 root mem REG 8,1 30952 526005 /usr/lib/jvm/java-11-openjdk-amd64/lib/libjimage.so
java 2916 root mem REG 8,1 71832 525985 /usr/lib/jvm/java-11-openjdk-amd64/lib/jli/libjli.so
java 2916 root mem REG 8,1 191472 3425 /usr/lib/x86_64-linux-gnu/ld-2.31.so
java 2916 root mem REG 8,1 252 5889 /usr/lib/locale/C.UTF-8/LC_IDENTIFICATION
java 2916 root 0r CHR 1,3 0t0 6 /dev/null
java 2916 root 1u unix 0xffff89d4863a5000 0t0 33660 type=STREAM
java 2916 root 2u unix 0xffff89d4863a5000 0t0 33660 type=STREAM
java 2916 root 3r REG 8,1 141978067 526023 /usr/lib/jvm/java-11-openjdk-amd64/lib/modules
java 2916 root 4r REG 8,1 1521553 526324 /var/jenkins/agent.jar
java 2916 root 5r CHR 1,8 0t0 10 /dev/random
java 2916 root 6r CHR 1,9 0t0 11 /dev/urandom
java 2916 root 7wW REG 259,0 0 7077892 /var/jenkins/localssd/remoting/logs/remoting.log.0.lck
java 2916 root 8u unix 0xffff89d48b8dd800 0t0 34053 type=STREAM
java 2916 root 9w REG 259,0 1455 7077893 /var/jenkins/localssd/remoting/logs/remoting.log.0
java 2916 root 10u unix 0xffff89d487de6c00 0t0 34938 type=STREAM
java 2916 root 11u a_inode 0,14 0 11385 [eventpoll]
java 2916 root 12r FIFO 0,13 0t0 34057 pipe
java 2916 root 13w FIFO 0,13 0t0 34057 pipe
java 2916 root 14u IPv6 34950 0t0 TCP bert-os-bos-108-bmqio-2-1.c.ilabs-playground.internal:52654->$BUILDSERVER-IP.bc.googleusercontent.com:50000 (ESTABLISHED)
The agent connection process can only be resumed by killing the hanging child process.
Sending SIGQUIT to the child process causes the agent to establish the previously stalled connection:
# kill -SIGQUIT 3011
# strace -p 2916
strace: Process 2916 attached
futex(0x7fba3184d9d0, FUTEX_WAIT, 3011, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=3531, si_uid=0} ---
futex(0x7fba2c00a0e0, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigreturn({mask=[]}) = 202