• Bug
    • Resolution: Fixed
    • Critical
    • core
    • Jenkins 2.245 or 2.235.3
      Java 1.8.0_171 or 1.8.0_172
    • 2.253

      We are seeing the bug originally described in JENKINS-62181. Remote agents hang on launch. This is using a supposedly patched release.

      In our experience, the agent launches correctly the first time after Jenkins is rebooted. Then, after the agent times out (in-demand delay 1, idle delay 5) and goes down, it cannot be restarted; it deadlocks on launch.

      It hangs here:

      <===[JENKINS REMOTING CAPACITY]===>channel started
      Remoting version: 4.3
      This is a Unix agent

      I don't know how to put in a test for Java deadlocks. Please advise. This is blocking usage.
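
On the question of testing for Java deadlocks: one possible starting point, as a minimal sketch, is the JDK's `ThreadMXBean`, which can report threads stuck in a cycle of object monitors. The class and method names below (`DeadlockProbe`, `monitorDeadlockedThreadNames`, `startDemoDeadlock`) are hypothetical and not part of Jenkins or Remoting. Note the assumption that the hang is a pure monitor cycle; a hang where one thread sits in `Object.wait()` waiting for a remote response may not be reported by this check.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical helper for illustration; not part of Jenkins or Remoting.
final class DeadlockProbe {

    /** Names of threads stuck in a cycle of object monitors (empty if none). */
    static List<String> monitorDeadlockedThreadNames() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findMonitorDeadlockedThreads(); // null when no cycle exists
        List<String> names = new ArrayList<>();
        if (ids == null) {
            return names;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info != null) { // a thread may have exited between the two calls
                names.add(info.getThreadName());
            }
        }
        return names;
    }

    /** Starts two daemon threads that take the same two monitors in opposite order. */
    static void startDemoDeadlock() {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        startDaemon("probe-1", () -> lockInOrder(a, b, bothHoldFirst));
        startDaemon("probe-2", () -> lockInOrder(b, a, bothHoldFirst));
    }

    private static void startDaemon(String name, Runnable body) {
        Thread t = new Thread(body, name);
        t.setDaemon(true); // the deliberately stuck threads must not keep the JVM alive
        t.start();
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await(); // wait until the other thread holds its first monitor
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // never reached in the demo: the other thread holds 'second'
            }
        }
    }
}
```

A test could call `startDemoDeadlock()`, poll `monitorDeadlockedThreadNames()` for a short while, and expect both demo thread names to appear. Applying the same check to a real agent launch would require running it inside the affected JVM, which is not shown here.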

          [JENKINS-63082] Deadlock launching remote agent

          Marcus Philip added a comment - - edited

          We are also seeing agents launched over SSH hang, starting immediately after yesterday's core upgrade 2.222.2 -> 2.235.3 (at the same time we upgraded all plugins to the newest versions, roughly 3 months' worth of updates).

          Interestingly, not all agents hang. Maybe the working ones have kept their connection and would behave the same way if they had to reconnect? I dare not try...

          I set agent java logging to FINEST, but not much more info there AFAICT:

          SSHLauncher{host='jenkins-slave11.test.aza.nu', port=22, credentialsId='9fb9152c-c904-4af7-88b0-1597a5e924a0', jvmOptions='', javaPath='', prefixStartSlaveCmd='', suffixStartSlaveCmd='', launchTimeoutSeconds=60, maxNumRetries=10, retryWaitTime=15, sshHostKeyVerificationStrategy=hudson.plugins.sshslaves.verifiers.NonVerifyingKeyVerificationStrategy, tcpNoDelay=true, trackCredentials=true}
          [07/28/20 22:09:33] [SSH] Opening SSH connection to jenkins-slave11.test.aza.nu:22.
          [07/28/20 22:09:33] [SSH] WARNING: SSH Host Keys are not being verified. Man-in-the-middle attacks may be possible against this connection.
          [07/28/20 22:09:34] [SSH] Authentication successful.
          [07/28/20 22:09:34] [SSH] The remote user's environment is:
          BASH=/usr/bin/bash
          BASHOPTS=cmdhist:extquote:force_fignore:hostcomplete:interactive_comments:progcomp:promptvars:sourcepath
          BASH_ALIASES=()
          BASH_ARGC=()
          BASH_ARGV=()
          BASH_CMDS=()
          BASH_EXECUTION_STRING=set
          BASH_LINENO=()
          BASH_SOURCE=()
          BASH_VERSINFO=([0]="4" [1]="2" [2]="46" [3]="2" [4]="release" [5]="x86_64-redhat-linux-gnu")
          BASH_VERSION='4.2.46(2)-release'
          DIRSTACK=()
          EUID=1002
          GROUPS=()
          HOME=/home/jenkins
          HOSTNAME=jenkins-slave11
          HOSTTYPE=x86_64
          IFS=$' \t\n'
          JAVA_HOME=/usr/java/latest
          LANG=en_US.UTF-8
          LOGNAME=jenkins
          M2_HOME=/usr/share/apache-maven
          MACHTYPE=x86_64-redhat-linux-gnu
          MAIL=/var/mail/jenkins
          OPTERR=1
          OPTIND=1
          OSTYPE=linux-gnu
          PATH=/usr/local/bin:/usr/bin:/usr/share/apache-maven/bin
          PIPESTATUS=([0]="0")
          PPID=9423
          PS4='+ '
          PWD=/home/jenkins
          SHELL=/bin/bash
          SHELLOPTS=braceexpand:hashall:interactive-comments
          SHLVL=1
          SSH_CLIENT='10.87.2.30 39108 22'
          SSH_CONNECTION='10.87.2.30 39108 10.87.3.20 22'
          TERM=dumb
          UID=1002
          USER=jenkins
          XDG_RUNTIME_DIR=/run/user/1002
          XDG_SESSION_ID=34
          _=M2_HOME
          [07/28/20 22:09:34] [SSH] Checking java version of /opt/jenkins/jdk/bin/java
          Couldn't figure out the Java version of /opt/jenkins/jdk/bin/java
          bash: /opt/jenkins/jdk/bin/java: No such file or directory[07/28/20 22:09:34] [SSH] Checking java version of java
          [07/28/20 22:09:34] [SSH] java -version returned 1.8.0_172.
          [07/28/20 22:09:34] [SSH] Starting sftp client.
          [07/28/20 22:09:34] [SSH] Copying latest remoting.jar...
          Source agent hash is E5FEC468D6F172BF394E1F2571EA686C. Installed agent hash is E5FEC468D6F172BF394E1F2571EA686C
          Verified agent jar. No update is necessary.
          Expanded the channel window size to 4MB
          [07/28/20 22:09:34] [SSH] Starting agent process: cd "/opt/jenkins" && java  -jar remoting.jar -workDir /opt/jenkins -jar-cache /opt/jenkins/remoting/jarCache
          Jul 28, 2020 10:09:34 PM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
          INFO: Using /opt/jenkins/remoting as a remoting work directory
          Jul 28, 2020 10:09:34 PM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
          INFO: Both error and output logs will be printed to /opt/jenkins/remoting
          <===[JENKINS REMOTING CAPACITY]===>channel started
          Remoting version: 4.3
          This is a Unix agent
          Jul 28, 2020 10:09:35 PM hudson.remoting.Channel send
          FINE: Send Response:UserRequest:hudson.logging.LogRecorder$SetLevel@46d1c13a(hudson.remoting.UserRequest$NormalResponse)
          Jul 28, 2020 10:09:35 PM hudson.remoting.Channel$1 handle
          FINE: Received UserRequest:hudson.slaves.ChannelPinger$SetUpRemotePing@2905
          Jul 28, 2020 10:09:35 PM hudson.remoting.Channel$1 handle
          FINE: Completed command UserRequest:hudson.slaves.ChannelPinger$SetUpRemotePing@2905. It took 1ms
          Jul 28, 2020 10:09:35 PM hudson.remoting.RemoteClassLoader findClass
          FINER: fetch3(hudson.slaves.ChannelPinger$SetUpRemotePing)
          Jul 28, 2020 10:09:35 PM hudson.remoting.Channel send
          FINE: Send RPCRequest:hudson.remoting.RemoteClassLoader$IClassLoader.fetch3[java.lang.String](2)
          Jul 28, 2020 10:10:35 PM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
          FINER: rate(1min) = 0.0±0.0/sec; rate(5min) = 0.0±0.0/sec; rate(15min) = 0.0±0.0/sec; rate(total) = 0.0±0.0/sec; N = 11
          Jul 28, 2020 10:11:35 PM hudson.remoting.RemoteInvocationHandler$Unexporter reportStats
          FINER: rate(1min) = 0.0±0.0/sec; rate(5min) = 0.0±0.0/sec; rate(15min) = 0.0±0.0/sec; rate(total) = 0.0±0.0/sec; N = 23

           

          I did a jstack as well:

          [jenkins@jenkins-slave01.test.aza.nu ~]$ jstack -l 106882
          2020-07-29 15:18:26
          Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.172-b11 mixed mode):
          
          "Attach Listener" #27 daemon prio=9 os_prio=0 tid=0x00007f6588001000 nid=0x15f3b waiting on condition [0x0000000000000000]
             java.lang.Thread.State: RUNNABLE
          
             Locked ownable synchronizers:
          	- None
          
          "pool-1-thread-1 for channel id=21742 / waiting for channel id=21" #12 prio=5 os_prio=0 tid=0x00007f6568003800 nid=0x1a196 in Object.wait() [0x00007f65b4cf8000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Request.call(Request.java:177)
          	- locked <0x0000000771400cc0> (a hudson.remoting.RemoteInvocationHandler$RPCRequest)
          	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:288)
          	at com.sun.proxy.$Proxy5.fetch3(Unknown Source)
          	at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:211)
          	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
          	- locked <0x0000000771400bb0> (a hudson.remoting.RemoteClassLoader)
          	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
          	at java.lang.Class.forName0(Native Method)
          	at java.lang.Class.forName(Class.java:348)
          	at hudson.remoting.MultiClassLoaderSerializer$Input.resolveClass(MultiClassLoaderSerializer.java:132)
          	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1866)
          	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1749)
          	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2040)
          	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1571)
          	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:431)
          	at hudson.remoting.UserRequest.deserialize(UserRequest.java:290)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:189)
          	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
          	at hudson.remoting.Request$2.run(Request.java:369)
          	at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          
             Locked ownable synchronizers:
          	- <0x000000077146c7a0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
          
          "Channel reader thread: channel" #11 prio=5 os_prio=0 tid=0x00007f65d436e000 nid=0x1a195 waiting for monitor entry [0x00007f65b4dfa000]
             java.lang.Thread.State: BLOCKED (on object monitor)
          	at hudson.slaves.SlaveComputer$SlaveInitializer$1.publish(SlaveComputer.java:1027)
          	at java.util.logging.Logger.log(Logger.java:738)
          	at java.util.logging.Logger.doLog(Logger.java:765)
          	at java.util.logging.Logger.log(Logger.java:851)
          	at hudson.remoting.Channel$1.handle(Channel.java:608)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:85)
          
             Locked ownable synchronizers:
          	- None
          
          "RemoteInvocationHandler [#1]" #10 daemon prio=5 os_prio=0 tid=0x00007f65d4368000 nid=0x1a194 in Object.wait() [0x00007f65b4efb000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
          	at java.lang.Object.wait(Native Method)
          	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
          	- locked <0x0000000771676d50> (a java.lang.ref.ReferenceQueue$Lock)
          	at hudson.remoting.RemoteInvocationHandler$Unexporter.run(RemoteInvocationHandler.java:600)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:111)
          	at java.lang.Thread.run(Thread.java:748)
          
             Locked ownable synchronizers:
          	- None
          
          "Service Thread" #8 daemon prio=9 os_prio=0 tid=0x00007f65d4206000 nid=0x1a191 runnable [0x0000000000000000]
             java.lang.Thread.State: RUNNABLE
          
             Locked ownable synchronizers:
          	- None
          
          "C1 CompilerThread2" #7 daemon prio=9 os_prio=0 tid=0x00007f65d41c8800 nid=0x1a190 waiting on condition [0x0000000000000000]
             java.lang.Thread.State: RUNNABLE
          
             Locked ownable synchronizers:
          	- None
          
          "C2 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007f65d41c7000 nid=0x1a18f waiting on condition [0x0000000000000000]
             java.lang.Thread.State: RUNNABLE
          
             Locked ownable synchronizers:
          	- None
          
          "C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007f65d41c4000 nid=0x1a18e waiting on condition [0x0000000000000000]
             java.lang.Thread.State: RUNNABLE
          
             Locked ownable synchronizers:
          	- None
          
          "Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007f65d41c2800 nid=0x1a18d runnable [0x0000000000000000]
             java.lang.Thread.State: RUNNABLE
          
             Locked ownable synchronizers:
          	- None
          
          "Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007f65d418f000 nid=0x1a18c in Object.wait() [0x00007f65b5d2b000]
             java.lang.Thread.State: WAITING (on object monitor)
          	at java.lang.Object.wait(Native Method)
          	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
          	- locked <0x00000007714110c0> (a java.lang.ref.ReferenceQueue$Lock)
          	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
          	at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:216)
          
             Locked ownable synchronizers:
          	- None
          
          "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f65d418a800 nid=0x1a18b in Object.wait() [0x00007f65b5e2c000]
             java.lang.Thread.State: WAITING (on object monitor)
          	at java.lang.Object.wait(Native Method)
          	at java.lang.Object.wait(Object.java:502)
          	at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
          	- locked <0x0000000771411278> (a java.lang.ref.Reference$Lock)
          	at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
          
             Locked ownable synchronizers:
          	- None
          
          "main" #1 prio=5 os_prio=0 tid=0x00007f65d4009000 nid=0x1a183 in Object.wait() [0x00007f65db793000]
             java.lang.Thread.State: TIMED_WAITING (on object monitor)
          	at java.lang.Object.wait(Native Method)
          	at hudson.remoting.Channel.join(Channel.java:1182)
          	- locked <0x0000000771400d08> (a hudson.remoting.Channel)
          	at hudson.remoting.Launcher.main(Launcher.java:796)
          	at hudson.remoting.Launcher.runWithStdinStdout(Launcher.java:718)
          	at hudson.remoting.Launcher.run(Launcher.java:398)
          	at hudson.remoting.Launcher.main(Launcher.java:298)
          
             Locked ownable synchronizers:
          	- None
          
          "VM Thread" os_prio=0 tid=0x00007f65d4183000 nid=0x1a18a runnable
          
          "GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f65d401e800 nid=0x1a184 runnable
          
          "GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f65d4020800 nid=0x1a185 runnable
          
          "GC task thread#2 (ParallelGC)" os_prio=0 tid=0x00007f65d4022000 nid=0x1a186 runnable
          
          "GC task thread#3 (ParallelGC)" os_prio=0 tid=0x00007f65d4024000 nid=0x1a187 runnable
          
          "GC task thread#4 (ParallelGC)" os_prio=0 tid=0x00007f65d4026000 nid=0x1a188 runnable
          
          "GC task thread#5 (ParallelGC)" os_prio=0 tid=0x00007f65d4027800 nid=0x1a189 runnable
          
          "VM Periodic Task Thread" os_prio=0 tid=0x00007f65d4213000 nid=0x1a192 waiting on condition
          
          JNI global references: 231
           

          I see the same thing as in JENKINS-62181:

          "Channel reader thread: channel" #11 prio=5 os_prio=0 tid=0x00007f65d436e000 nid=0x1a195 waiting for monitor entry [0x00007f65b4dfa000]
             java.lang.Thread.State: BLOCKED (on object monitor)
          	at hudson.slaves.SlaveComputer$SlaveInitializer$1.publish(SlaveComputer.java:1027)

           
          Note that when I manually kill the agent Java process (it is indeed started on the agent!) I get this log:

          ERROR: Connection terminated
          java.io.EOFException
          	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2679)
          	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3154)
          	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
          	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
          	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
          	at hudson.remoting.Command.readFrom(Command.java:142)
          	at hudson.remoting.Command.readFrom(Command.java:128)
          	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          Caused: java.io.IOException: Unexpected termination of the channel
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
          Agent JVM has terminated. Exit code=143
          [07/30/20 15:52:24] Launch failed - cleaning up connection
          [07/30/20 15:52:24] [SSH] Connection closed.
          

           
          So the SSH connection seems OK. It seems to be a Java code problem.

          Please let me know if there's any more information you need, or procedures you would like me to test.


          Marcus Philip added a comment -

          I tried to use another connection method, namely Launch agent via execution of command on the master.

          It works. I get this log (where you can see the command as well):

          [07/30/20 16:10:01] Launching agent
          $ ssh -o "StrictHostKeyChecking=no" jenkins@jenkins-slave01.test.aza.nu bin/start-agent.sh
          <===[JENKINS REMOTING CAPACITY]===>channel started
          Remoting version: 4.3
          This is a Unix agent
          Evacuated stdout
          Agent successfully connected and online
          

          The sh script is just the one suggested in docs:

          #!/bin/sh
          exec java -jar ~/bin/agent.jar
          


          Jesse Glick added a comment -

          The line acquiring the monitor, for reference.

          Unfortunately the thread dump does not specify which monitor the thread is being blocked on. Neither of the methods of LogRecord being called ought to acquire a monitor. I am guessing this is some weird thing involving remote class loading.
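
As a hedged aside on identifying the contended monitor: the JDK's `ThreadInfo` exposes `getLockName()` and `getLockOwnerName()` for a blocked thread, so a small in-process report like the sketch below might name the monitor and its owner. `BlockedThreadReport` is a hypothetical helper, not part of Jenkins or Remoting, and it is an assumption that the JVM would expose more here than the jstack output above did.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Jenkins or Remoting.
final class BlockedThreadReport {

    /** One line per BLOCKED thread: thread name, contended monitor, current owner. */
    static List<String> describeBlockedThreads() {
        List<String> lines = new ArrayList<>();
        // false, false: skip the optional locked-monitor and synchronizer details.
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                lines.add(info.getThreadName()
                        + " waits for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return lines;
    }
}
```

The report would have to run inside the hung agent JVM to be useful, which this thread does not show how to do.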


          Jesse Glick added a comment -

          Filed jenkins #4886. What we need is a user willing to run a prerelease build and check whether it actually solves the problem or not.


          Marcus Philip added a comment -

          I am willing to try a pre-release build, but it will not be possible before Monday.


          Jesse Glick added a comment -

          https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/main/jenkins-war/2.251-rc30290.4454a28de2c1/jenkins-war-2.251-rc30290.4454a28de2c1.war is available for testing.

          Marcus Philip added a comment -

          I have tested the patch and it seems to fix the problem.

          I have relaunched several agents after starting Jenkins master with this patch and it works fine.

          Thanks for the quick response. Hoping to see this soon in an LTS patch.


          Jesse Glick added a comment -

          marcus_phi thanks for testing!


          Julianus Pfeuffer added a comment -

          Is this in 2.251? It is not in the Changelog.

          In any case, it is a huge blocker for us if it is not in any LTS anyhow. When is this coming? After an update we cannot connect to half of our slaves anymore. With exactly this error.

          Jesse Glick added a comment -

          jpfeuffer the fix is still under review. You can help by testing the binary linked above and verifying whether it makes the issue go away for you.


          Julianus Pfeuffer added a comment -

          Yes, it seems to work now. At first I still had some hiccups (on far fewer slaves) because some hung Java processes were still running on the slaves, but after restarting those, all of them are up again.

          Vegar Andersen added a comment -

          I have upgraded to version 2.253 and we are still experiencing problems with this exact issue. After upgrading and restarting, more than half of our nodes deadlock in the manner described above. I have not been able to identify a procedure that will dependably bring them up again. Is anyone else having better luck with this?

          Jesse Glick added a comment -

          vegar_andersen please attach a thread dump.


          Fredrik de Vibe added a comment - - edited

          jglick I just attached one, it's from the same Jenkins instance that vegar_andersen refers to.


          Jesse Glick added a comment -

          f5k vegar_andersen thanks, I have filed your deadlock as JENKINS-63458.


            jglick Jesse Glick
            gehlhaar Dan Gehlhaar
            Votes: 3
            Watchers: 9