- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Component/s: remoting
- Environment: Jenkins server 2.263.3 (OpenJDK 1.8.0_275-b01)
  Agent Ubuntu 20.04 (OpenJDK build 11.0.9.1+1)
Hi Jenkins team,
we noticed that our Jenkins cloud agents occasionally fail to connect to the Jenkins server.
This happens approximately 1-2 times a week on a particular agent.
The client connection starts but stalls immediately after the agent has confirmed the remote identity:
Feb 05, 2021 3:45:26 PM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: bert-os-bos-108-bmqio-2-1
Feb 05, 2021 3:45:26 PM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Feb 05, 2021 3:45:26 PM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 4.5
Feb 05, 2021 3:45:26 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /var/jenkins/localssd/remoting as a remoting work directory
Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among https://my.buildserver.io/
Feb 05, 2021 3:45:27 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
  Agent address: my.buildserver.io
  Agent port: 50000
  Identity: b7:98:61:3e:17:26:eb:80:c6:79:cf:a3:aa:aa:aa:aa
Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to my.buildserver.io:50000
Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Trying protocol: JNLP4-connect
Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Remote identity confirmed: b7:98:61:3e:17:26:eb:80:c6:79:cf:a3:aa:aa:aa:aa
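To spot this state automatically, a small check on the agent log could look like the sketch below. It is only an illustration: the function name `looks_stalled` is made up, and it assumes that a successful handshake is followed by a line containing "INFO: Connected", which should be verified against a log from a healthy agent before relying on it.

```python
def looks_stalled(log_text, marker="Remote identity confirmed", done="INFO: Connected"):
    """Return True if the last handshake marker is not followed by a completion line.

    Both `marker` and `done` are assumptions about the log wording; adjust them
    to match the output of a healthy agent.
    """
    idx = log_text.rfind(marker)
    if idx == -1:
        return False  # the handshake never reached the identity check
    return done not in log_text[idx:]


# Tail of the log from this report: the identity line is the last entry.
stalled_tail = (
    "Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status\n"
    "INFO: Trying protocol: JNLP4-connect\n"
    "Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status\n"
    "INFO: Remote identity confirmed: b7:98:61:3e:17:26:eb:80:c6:79:cf:a3:aa:aa:aa:aa\n"
)

print(looks_stalled(stalled_tail))
```

A watchdog would also need a grace period, since a log that ends at the identity line is normal for a brief moment during every connection attempt.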
At the same time, the Jenkins server log shows the agent's connection attempt, and netstat on the server shows the connection as established:
2021-02-05 15:45:27.203+0000 [id=154585] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #205 from /$BUILDAGENT-IP:52654

bash-5.0# netstat -natup |grep $BUILDAGENT-IP
tcp 0 0 172.29.0.2:50000 $BUILDAGENT-IP:52654 ESTABLISHED 6/java
=========================================================================
Debugging on the agent side shows that the process is waiting indefinitely for a thread.
Running strace against the agent PID shows that the parent process (2916) is waiting for (child) PID 3011, which hangs forever:
# strace -p 2916
strace: Process 2916 attached
futex(0x7fba3184d9d0, FUTEX_WAIT, 3011, NULL

root@bert-os-bos-108-bmqio-2-1:/var/jenkins/localssd/remoting/logs# strace -p 3011
strace: Process 3011 attached
futex(0x7fba2c015278, FUTEX_WAIT_PRIVATE, 0, NULL
strace child pid 3011:
root@bert-os-bos-108-bmqio-2-1:/var/jenkins/localssd/remoting/logs# strace -p 3011 -ff
strace: Process 3011 attached with 22 threads
[pid 3024] futex(0x7fba2c1b197c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3020] futex(0x7fba2c00a0e0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 3014] futex(0x7fba2c05d680, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 3034] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 3023] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 3022] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 3018] futex(0x7fba2c15d07c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3013] futex(0x7fba2c05d07c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3012] futex(0x7fba2c02cce0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2916] futex(0x7fba3184d9d0, FUTEX_WAIT, 3011, NULL <unfinished ...>
[pid 3027] futex(0x7fba2c02cce0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 3026] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 3032] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid 3017] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 3011] futex(0x7fba2c015278, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3015] futex(0x7fba2c0c9d7c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3021] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
[pid 3030] futex(0x7fba2c385c28, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3019] futex(0x7fba2c15ee7c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3031] epoll_wait(11, <unfinished ...>
[pid 3016] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 3025] restart_syscall(<... resuming interrupted restart_syscall ...>) = -1 ETIMEDOUT (Connection timed out)
[pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=322548318}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=372848350}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=423172343}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=473504715}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=523781139}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 3034] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 3034] futex(0x7fb9dc1d4d28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3034] futex(0x7fb9dc1d4d78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3766, tv_nsec=508365589}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 3025] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=574152656}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 3016] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
[pid 3016] futex(0x7fba2c0cba28, FUTEX_WAKE_PRIVATE, 1) = 0
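With 22 threads the trace is hard to read by eye. The following sketch groups thread IDs by the last syscall each one was seen entering, which makes it easier to see which threads sit in a futex wait and which one is polling the socket. The helper name `summarize` and the regular expression are illustrative assumptions, not part of strace or Jenkins; the sample lines are taken from the trace above.

```python
import re
from collections import defaultdict

# Matches the start of a call, e.g. "[pid 3031] epoll_wait(11, <unfinished ...>".
# Continuation lines such as "[pid 3034] <... restart_syscall resumed>" do not match.
CALL_RE = re.compile(r"\[pid\s+(\d+)\]\s+([a-z_0-9]+)\(")


def summarize(strace_text):
    """Map each syscall name to the sorted thread IDs whose latest call it is."""
    last_call = {}
    for match in CALL_RE.finditer(strace_text):
        last_call[int(match.group(1))] = match.group(2)
    by_call = defaultdict(list)
    for tid, call in last_call.items():
        by_call[call].append(tid)
    return {call: sorted(tids) for call, tids in by_call.items()}


sample = """\
[pid 3024] futex(0x7fba2c1b197c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3031] epoll_wait(11, <unfinished ...>
[pid 3034] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 3011] futex(0x7fba2c015278, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 3034] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
"""

for call, tids in summarize(sample).items():
    print(call, tids)
```

Because the pattern is applied with `finditer` over the whole text, the same helper also works on output that was pasted as a single line.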
lsof:
# lsof -p 2916
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 2916 root cwd DIR 8,1 4096 2 /
java 2916 root rtd DIR 8,1 4096 2 /
java 2916 root txt REG 8,1 14560 525860 /usr/lib/jvm/java-11-openjdk-amd64/bin/java
java 2916 root mem REG 8,1 17371136 526322 /usr/lib/jvm/java-11-openjdk-amd64/lib/server/classes.jsa
java 2916 root mem REG 8,1 199456 526019 /usr/lib/jvm/java-11-openjdk-amd64/lib/libsunec.so
java 2916 root mem REG 8,1 3035952 5882 /usr/lib/locale/locale-archive
java 2916 root mem REG 8,1 101320 3443 /usr/lib/x86_64-linux-gnu/libresolv-2.31.so
java 2916 root mem REG 8,1 96320 526013 /usr/lib/jvm/java-11-openjdk-amd64/lib/libnet.so
java 2916 root mem REG 8,1 1518110 5887 /usr/lib/locale/C.UTF-8/LC_COLLATE
java 2916 root mem REG 8,1 141978067 526023 /usr/lib/jvm/java-11-openjdk-amd64/lib/modules
java 2916 root mem REG 8,1 31176 3436 /usr/lib/x86_64-linux-gnu/libnss_dns-2.31.so
java 2916 root mem REG 8,1 14392 525994 /usr/lib/jvm/java-11-openjdk-amd64/lib/libextnet.so
java 2916 root mem REG 8,1 75840 526014 /usr/lib/jvm/java-11-openjdk-amd64/lib/libnio.so
java 2916 root mem REG 8,1 201272 5888 /usr/lib/locale/C.UTF-8/LC_CTYPE
java 2916 root mem REG 8,1 50 5893 /usr/lib/locale/C.UTF-8/LC_NUMERIC
java 2916 root mem REG 8,1 3360 5896 /usr/lib/locale/C.UTF-8/LC_TIME
java 2916 root mem REG 8,1 270 5891 /usr/lib/locale/C.UTF-8/LC_MONETARY
java 2916 root mem REG 8,1 48 5885 /usr/lib/locale/C.UTF-8/LC_MESSAGES/SYS_LC_MESSAGES
java 2916 root mem REG 8,1 27002 3724 /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
java 2916 root mem REG 8,1 34872 526022 /usr/lib/jvm/java-11-openjdk-amd64/lib/libzip.so
java 2916 root mem REG 8,1 51832 3437 /usr/lib/x86_64-linux-gnu/libnss_files-2.31.so
java 2916 root mem REG 8,1 34 5894 /usr/lib/locale/C.UTF-8/LC_PAPER
java 2916 root mem REG 8,1 62 5892 /usr/lib/locale/C.UTF-8/LC_NAME
java 2916 root mem REG 8,1 131 5886 /usr/lib/locale/C.UTF-8/LC_ADDRESS
java 2916 root mem REG 8,1 47 5895 /usr/lib/locale/C.UTF-8/LC_TELEPHONE
java 2916 root mem REG 8,1 32768 774997 /tmp/hsperfdata_root/2916
java 2916 root mem REG 8,1 182000 526001 /usr/lib/jvm/java-11-openjdk-amd64/lib/libjava.so
java 2916 root mem REG 8,1 67720 526021 /usr/lib/jvm/java-11-openjdk-amd64/lib/libverify.so
java 2916 root mem REG 8,1 40040 3444 /usr/lib/x86_64-linux-gnu/librt-2.31.so
java 2916 root mem REG 8,1 104984 3422 /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
java 2916 root mem REG 8,1 1369352 3431 /usr/lib/x86_64-linux-gnu/libm-2.31.so
java 2916 root mem REG 8,1 1952928 3423 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28
java 2916 root mem REG 8,1 19164144 526028 /usr/lib/jvm/java-11-openjdk-amd64/lib/server/libjvm.so
java 2916 root mem REG 8,1 157224 3442 /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
java 2916 root mem REG 8,1 18816 3430 /usr/lib/x86_64-linux-gnu/libdl-2.31.so
java 2916 root mem REG 8,1 108936 3823 /usr/lib/x86_64-linux-gnu/libz.so.1.2.11
java 2916 root mem REG 8,1 2029224 3429 /usr/lib/x86_64-linux-gnu/libc-2.31.so
java 2916 root mem REG 8,1 23 5890 /usr/lib/locale/C.UTF-8/LC_MEASUREMENT
java 2916 root mem REG 8,1 30952 526005 /usr/lib/jvm/java-11-openjdk-amd64/lib/libjimage.so
java 2916 root mem REG 8,1 71832 525985 /usr/lib/jvm/java-11-openjdk-amd64/lib/jli/libjli.so
java 2916 root mem REG 8,1 191472 3425 /usr/lib/x86_64-linux-gnu/ld-2.31.so
java 2916 root mem REG 8,1 252 5889 /usr/lib/locale/C.UTF-8/LC_IDENTIFICATION
java 2916 root 0r CHR 1,3 0t0 6 /dev/null
java 2916 root 1u unix 0xffff89d4863a5000 0t0 33660 type=STREAM
java 2916 root 2u unix 0xffff89d4863a5000 0t0 33660 type=STREAM
java 2916 root 3r REG 8,1 141978067 526023 /usr/lib/jvm/java-11-openjdk-amd64/lib/modules
java 2916 root 4r REG 8,1 1521553 526324 /var/jenkins/agent.jar
java 2916 root 5r CHR 1,8 0t0 10 /dev/random
java 2916 root 6r CHR 1,9 0t0 11 /dev/urandom
java 2916 root 7wW REG 259,0 0 7077892 /var/jenkins/localssd/remoting/logs/remoting.log.0.lck
java 2916 root 8u unix 0xffff89d48b8dd800 0t0 34053 type=STREAM
java 2916 root 9w REG 259,0 1455 7077893 /var/jenkins/localssd/remoting/logs/remoting.log.0
java 2916 root 10u unix 0xffff89d487de6c00 0t0 34938 type=STREAM
java 2916 root 11u a_inode 0,14 0 11385 [eventpoll]
java 2916 root 12r FIFO 0,13 0t0 34057 pipe
java 2916 root 13w FIFO 0,13 0t0 34057 pipe
java 2916 root 14u IPv6 34950 0t0 TCP bert-os-bos-108-bmqio-2-1.c.ilabs-playground.internal:52654->$BUILDSERVER-IP.bc.googleusercontent.com:50000 (ESTABLISHED)
The agent's connection process can only be resumed by killing the hanging child process.
Sending SIGQUIT to the child process causes the agent to establish the previously stalled connection:
# kill -SIGQUIT 3011
# strace -p 2916
strace: Process 2916 attached
futex(0x7fba3184d9d0, FUTEX_WAIT, 3011, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
--- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=3531, si_uid=0} ---
futex(0x7fba2c00a0e0, FUTEX_WAKE_PRIVATE, 1) = 1
rt_sigreturn({mask=[]}) = 202