Jenkins / JENKINS-64820

Intermittent agent connection issues

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component: remoting
    • Labels: None
    • Environment: Jenkins server 2.263.3 (OpenJDK 1.8.0_275-b01)
      Agent Ubuntu 20.04 (OpenJDK build 11.0.9.1+1)

      Hi Jenkins team,

      We noticed that our Jenkins cloud agents intermittently fail to connect to the Jenkins server.
      This happens approximately 1-2 times a week on a particular agent.
      The client connection starts but stalls immediately after the remote identity has been confirmed:

      Feb 05, 2021 3:45:26 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up agent: bert-os-bos-108-bmqio-2-1
      Feb 05, 2021 3:45:26 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Feb 05, 2021 3:45:26 PM hudson.remoting.Engine startEngine
      INFO: Using Remoting version: 4.5
      Feb 05, 2021 3:45:26 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using /var/jenkins/localssd/remoting as a remoting work directory
      Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among https://my.buildserver.io/
      Feb 05, 2021 3:45:27 PM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
      INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
      Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Agent discovery successful
      Agent address: my.buildserver.io
      Agent port: 50000
      Identity: b7:98:61:3e:17:26:eb:80:c6:79:cf:a3:aa:aa:aa:aa
      Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to my.buildserver.io:50000
      Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Trying protocol: JNLP4-connect
      Feb 05, 2021 3:45:27 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Remote identity confirmed: b7:98:61:3e:17:26:eb:80:c6:79:cf:a3:aa:aa:aa:aa
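
A stall like the one in the agent log above can be spotted mechanically: the log ends on the "Remote identity confirmed" status and no later status follows. The sketch below is only an illustration and not part of Remoting. The function names are hypothetical, and it assumes that the log uses the "INFO: ..." status format shown above and that a healthy handshake writes further INFO lines afterwards.

```python
import re

# Matches the status lines of the agent log, e.g. "INFO: Handshaking".
STATUS_RE = re.compile(r"^\s*INFO: (.*)$")


def last_status(log_text):
    """Return the text of the last 'INFO:' line in the log, or None."""
    last = None
    for line in log_text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            last = match.group(1)
    return last


def looks_stalled(log_text):
    """True if the log ends on 'Remote identity confirmed' (assumed stall marker)."""
    status = last_status(log_text)
    return status is not None and status.startswith("Remote identity confirmed")
```

Such a check could be run periodically against remoting.log.0 on the agent, together with a timeout, to distinguish a handshake that is merely slow from one that is stuck.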
      

       

      At the same time, the Jenkins server log shows the agent's connection attempt, and the connection is established on the server side:

       

      2021-02-05 15:45:27.203+0000 [id=154585] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #205 from /$BUILDAGENT-IP:52654
      
      bash-5.0# netstat -natup |grep $BUILDAGENT-IP
      tcp 0 0 172.29.0.2:50000 $BUILDAGENT-IP:52654 ESTABLISHED 6/java

      =========================================================================

       

       

      Debugging the agent side shows that the process is waiting on a thread indefinitely.

      Running strace against the agent PID shows that the parent process is waiting for (child) PID 3011, which hangs forever:

       

      # strace -p 2916
      strace: Process 2916 attached
      futex(0x7fba3184d9d0, FUTEX_WAIT, 3011, NULL
      
      root@bert-os-bos-108-bmqio-2-1:/var/jenkins/localssd/remoting/logs# strace -p 3011
      strace: Process 3011 attached
      futex(0x7fba2c015278, FUTEX_WAIT_PRIVATE, 0, NULL
      

      strace of child PID 3011 and its threads:

       

       

      root@bert-os-bos-108-bmqio-2-1:/var/jenkins/localssd/remoting/logs# strace -p 3011 -ff
      strace: Process 3011 attached with 22 threads
      [pid 3024] futex(0x7fba2c1b197c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
      [pid 3020] futex(0x7fba2c00a0e0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
      [pid 3014] futex(0x7fba2c05d680, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
      [pid 3034] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
      [pid 3023] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
      [pid 3022] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
      [pid 3018] futex(0x7fba2c15d07c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
      [pid 3013] futex(0x7fba2c05d07c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
      [pid 3012] futex(0x7fba2c02cce0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
      [pid 2916] futex(0x7fba3184d9d0, FUTEX_WAIT, 3011, NULL <unfinished ...>
      [pid 3027] futex(0x7fba2c02cce0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY <unfinished ...>
      [pid 3026] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
      [pid 3032] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
      [pid 3017] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
      [pid 3011] futex(0x7fba2c015278, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
      [pid 3015] futex(0x7fba2c0c9d7c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
      [pid 3021] restart_syscall(<... resuming interrupted futex ...> <unfinished ...>
      [pid 3030] futex(0x7fba2c385c28, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
      [pid 3019] futex(0x7fba2c15ee7c, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
      [pid 3031] epoll_wait(11, <unfinished ...>
      [pid 3016] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
      [pid 3025] restart_syscall(<... resuming interrupted restart_syscall ...>) = -1 ETIMEDOUT (Connection timed out)
      [pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
      [pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=322548318}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
      [pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
      [pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=372848350}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
      [pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
      [pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=423172343}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
      [pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
      [pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=473504715}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
      [pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
      [pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=523781139}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
      [pid 3034] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
      [pid 3034] futex(0x7fb9dc1d4d28, FUTEX_WAKE_PRIVATE, 1) = 0
      [pid 3034] futex(0x7fb9dc1d4d78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3766, tv_nsec=508365589}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
      [pid 3025] <... futex resumed>) = -1 ETIMEDOUT (Connection timed out)
      [pid 3025] futex(0x7fba2c1b3f28, FUTEX_WAKE_PRIVATE, 1) = 0
      [pid 3025] futex(0x7fba2c1b3f78, FUTEX_WAIT_BITSET_PRIVATE, 0, {tv_sec=3765, tv_nsec=574152656}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
      [pid 3016] <... restart_syscall resumed>) = -1 ETIMEDOUT (Connection timed out)
      [pid 3016] futex(0x7fba2c0cba28, FUTEX_WAKE_PRIVATE, 1) = 0
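
With 22 threads in the trace, it helps to reduce the output to the syscall each thread is currently blocked in. The following is a minimal sketch, assuming the `[pid N] call(... <unfinished ...>` and `[pid N] <... call resumed>` line shapes shown above; the function names are hypothetical. Threads that repeatedly return ETIMEDOUT (such as pid 3025, which wakes roughly every 50 ms according to the tv_nsec values) are periodic timed waits and drop out of the result once their call finishes.

```python
import re
from collections import Counter

# "[pid 3024] futex(0x..., ..." : a thread entering a syscall
ENTER_RE = re.compile(r"^\s*\[pid\s+(\d+)\]\s+(\w+)\(")
# "[pid 3034] <... restart_syscall resumed>) = ..." : a pending syscall returning
RESUMED_RE = re.compile(r"^\s*\[pid\s+(\d+)\]\s+<\.\.\. \w+ resumed>")


def blocked_calls(strace_text):
    """Map each thread id to the syscall it entered and has not yet returned from."""
    pending = {}
    for line in strace_text.splitlines():
        resumed = RESUMED_RE.match(line)
        if resumed:
            pending.pop(resumed.group(1), None)
            continue
        entered = ENTER_RE.match(line)
        if not entered:
            continue
        tid, call = entered.group(1), entered.group(2)
        if line.rstrip().endswith("<unfinished ...>"):
            pending[tid] = call          # still blocked in this call
        else:
            pending.pop(tid, None)       # call completed on the same line
    return pending


def summarize(strace_text):
    """Count how many threads are blocked in each syscall."""
    return Counter(blocked_calls(strace_text).values())
```

The summary only reflects the moment the trace ends, so it is a starting point for picking threads to inspect further, not proof of a deadlock.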
      

       

       

      lsof output for the agent process:

       

      # lsof -p 2916
      COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
      java 2916 root cwd DIR 8,1 4096 2 /
      java 2916 root rtd DIR 8,1 4096 2 /
      java 2916 root txt REG 8,1 14560 525860 /usr/lib/jvm/java-11-openjdk-amd64/bin/java
      java 2916 root mem REG 8,1 17371136 526322 /usr/lib/jvm/java-11-openjdk-amd64/lib/server/classes.jsa
      java 2916 root mem REG 8,1 199456 526019 /usr/lib/jvm/java-11-openjdk-amd64/lib/libsunec.so
      java 2916 root mem REG 8,1 3035952 5882 /usr/lib/locale/locale-archive
      java 2916 root mem REG 8,1 101320 3443 /usr/lib/x86_64-linux-gnu/libresolv-2.31.so
      java 2916 root mem REG 8,1 96320 526013 /usr/lib/jvm/java-11-openjdk-amd64/lib/libnet.so
      java 2916 root mem REG 8,1 1518110 5887 /usr/lib/locale/C.UTF-8/LC_COLLATE
      java 2916 root mem REG 8,1 141978067 526023 /usr/lib/jvm/java-11-openjdk-amd64/lib/modules
      java 2916 root mem REG 8,1 31176 3436 /usr/lib/x86_64-linux-gnu/libnss_dns-2.31.so
      java 2916 root mem REG 8,1 14392 525994 /usr/lib/jvm/java-11-openjdk-amd64/lib/libextnet.so
      java 2916 root mem REG 8,1 75840 526014 /usr/lib/jvm/java-11-openjdk-amd64/lib/libnio.so
      java 2916 root mem REG 8,1 201272 5888 /usr/lib/locale/C.UTF-8/LC_CTYPE
      java 2916 root mem REG 8,1 50 5893 /usr/lib/locale/C.UTF-8/LC_NUMERIC
      java 2916 root mem REG 8,1 3360 5896 /usr/lib/locale/C.UTF-8/LC_TIME
      java 2916 root mem REG 8,1 270 5891 /usr/lib/locale/C.UTF-8/LC_MONETARY
      java 2916 root mem REG 8,1 48 5885 /usr/lib/locale/C.UTF-8/LC_MESSAGES/SYS_LC_MESSAGES
      java 2916 root mem REG 8,1 27002 3724 /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
      java 2916 root mem REG 8,1 34872 526022 /usr/lib/jvm/java-11-openjdk-amd64/lib/libzip.so
      java 2916 root mem REG 8,1 51832 3437 /usr/lib/x86_64-linux-gnu/libnss_files-2.31.so
      java 2916 root mem REG 8,1 34 5894 /usr/lib/locale/C.UTF-8/LC_PAPER
      java 2916 root mem REG 8,1 62 5892 /usr/lib/locale/C.UTF-8/LC_NAME
      java 2916 root mem REG 8,1 131 5886 /usr/lib/locale/C.UTF-8/LC_ADDRESS
      java 2916 root mem REG 8,1 47 5895 /usr/lib/locale/C.UTF-8/LC_TELEPHONE
      java 2916 root mem REG 8,1 32768 774997 /tmp/hsperfdata_root/2916
      java 2916 root mem REG 8,1 182000 526001 /usr/lib/jvm/java-11-openjdk-amd64/lib/libjava.so
      java 2916 root mem REG 8,1 67720 526021 /usr/lib/jvm/java-11-openjdk-amd64/lib/libverify.so
      java 2916 root mem REG 8,1 40040 3444 /usr/lib/x86_64-linux-gnu/librt-2.31.so
      java 2916 root mem REG 8,1 104984 3422 /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
      java 2916 root mem REG 8,1 1369352 3431 /usr/lib/x86_64-linux-gnu/libm-2.31.so
      java 2916 root mem REG 8,1 1952928 3423 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.28
      java 2916 root mem REG 8,1 19164144 526028 /usr/lib/jvm/java-11-openjdk-amd64/lib/server/libjvm.so
      java 2916 root mem REG 8,1 157224 3442 /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
      java 2916 root mem REG 8,1 18816 3430 /usr/lib/x86_64-linux-gnu/libdl-2.31.so
      java 2916 root mem REG 8,1 108936 3823 /usr/lib/x86_64-linux-gnu/libz.so.1.2.11
      java 2916 root mem REG 8,1 2029224 3429 /usr/lib/x86_64-linux-gnu/libc-2.31.so
      java 2916 root mem REG 8,1 23 5890 /usr/lib/locale/C.UTF-8/LC_MEASUREMENT
      java 2916 root mem REG 8,1 30952 526005 /usr/lib/jvm/java-11-openjdk-amd64/lib/libjimage.so
      java 2916 root mem REG 8,1 71832 525985 /usr/lib/jvm/java-11-openjdk-amd64/lib/jli/libjli.so
      java 2916 root mem REG 8,1 191472 3425 /usr/lib/x86_64-linux-gnu/ld-2.31.so
      java 2916 root mem REG 8,1 252 5889 /usr/lib/locale/C.UTF-8/LC_IDENTIFICATION
      java 2916 root 0r CHR 1,3 0t0 6 /dev/null
      java 2916 root 1u unix 0xffff89d4863a5000 0t0 33660 type=STREAM
      java 2916 root 2u unix 0xffff89d4863a5000 0t0 33660 type=STREAM
      java 2916 root 3r REG 8,1 141978067 526023 /usr/lib/jvm/java-11-openjdk-amd64/lib/modules
      java 2916 root 4r REG 8,1 1521553 526324 /var/jenkins/agent.jar
      java 2916 root 5r CHR 1,8 0t0 10 /dev/random
      java 2916 root 6r CHR 1,9 0t0 11 /dev/urandom
      java 2916 root 7wW REG 259,0 0 7077892 /var/jenkins/localssd/remoting/logs/remoting.log.0.lck
      java 2916 root 8u unix 0xffff89d48b8dd800 0t0 34053 type=STREAM
      java 2916 root 9w REG 259,0 1455 7077893 /var/jenkins/localssd/remoting/logs/remoting.log.0
      java 2916 root 10u unix 0xffff89d487de6c00 0t0 34938 type=STREAM
      java 2916 root 11u a_inode 0,14 0 11385 [eventpoll]
      java 2916 root 12r FIFO 0,13 0t0 34057 pipe
      java 2916 root 13w FIFO 0,13 0t0 34057 pipe
      java 2916 root 14u IPv6 34950 0t0 TCP bert-os-bos-108-bmqio-2-1.c.ilabs-playground.internal:52654->$BUILDSERVER-IP.bc.googleusercontent.com:50000 (ESTABLISHED)
      

       

       

      The agent connection can only be resumed by sending a signal to the hanging child process.

      Sending SIGQUIT to the child process causes the agent to complete the previously stalled connection:

       

      # kill -SIGQUIT 3011
      
      # strace -p 2916
      strace: Process 2916 attached
      futex(0x7fba3184d9d0, FUTEX_WAIT, 3011, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
      --- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=3531, si_uid=0} ---
      futex(0x7fba2c00a0e0, FUTEX_WAKE_PRIVATE, 1) = 1
      rt_sigreturn({mask=[]}) = 202
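
On HotSpot JVMs, SIGQUIT normally makes the JVM print a thread dump to its standard output rather than exit, so the dump produced by the kill above may show which Java thread was stuck. The sketch below is a hedged illustration with a hypothetical function name. It assumes the JDK 11 dump format, in which each thread header line starts with the quoted thread name and carries a `nid=0x...` field holding the native thread id in hexadecimal; that id corresponds to the pid values strace prints.

```python
import re

# Thread header line of a HotSpot thread dump (assumed JDK 11 format):
# "name" #1 prio=5 ... tid=0x... nid=0x... state [0x...]
HEADER_RE = re.compile(r'^"(?P<name>[^"]+)".*\bnid=0x(?P<nid>[0-9a-fA-F]+)')


def threads_by_native_id(dump_text):
    """Map the decimal native thread id (as shown by strace) to the Java thread name."""
    result = {}
    for line in dump_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            result[int(match.group("nid"), 16)] = match.group("name")
    return result
```

For example, pid 3011 from the strace output corresponds to `nid=0xbc3` in a thread dump, so looking up 3011 in the returned mapping would name the thread the parent is waiting for.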
      

       

       

       

            Assignee: Unassigned
            Reporter: Tamara (herzogt)