
Slave hung in startup phase with missing logging in the GUI

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • core, ssh-slaves-plugin
    • None
    • Jenkins 1.554.2

      I noticed today that a number of jobs were stuck in Jenkins. They were waiting for a slave machine to start automatically, but it never did. The GUI indicated that the slave was starting, but there was nothing in the logs. A manual restart had no effect. I decided to put Jenkins into shutdown mode, stop all running jobs, flush the queues and stop all build slaves. Once there was absolutely no activity on Jenkins, I tried to start the stuck slave manually. Again, nothing happened. It seems like Jenkins is hung in the build start phase. I attach my list of plugins and a thread dump.

      Can anyone confirm Jenkins is hung based on the thread dump?
      Any known workarounds?

          [JENKINS-23560] Slave hung in startup phase with missing logging in the GUI

          Daniel Beck added a comment -

          What about local logging on the slave machine? Anything interesting there?

          How was the slave being started (SSH, …)?


          Artur Szostak added a comment -

          I think I should adjust one comment in my description: "It seems like Jenkins is hung in the build start phase." => "It seems like Jenkins is hung in the slave start phase."
          Anyway, nothing interesting in the logs on the slave side since the Jenkins slave JAR never gets started. I don't see any process related to Jenkins on the slave machine.

          I did notice a similar situation that I was able to dig into a little more. It involved a physically hung machine that needed a reboot while a Jenkins slave instance was running. It could be related. I sent an email to the developer list to see if anyone could give credibility to my hypothesis, but unfortunately got no reply. I reproduce the email text I sent below with the information I gathered. Perhaps it's related, since the symptoms were essentially the same.
          _____________________________________________________________________________________________

          Hi,

          I am trying to debug the following symptom:
          Jenkins started a slave. The slave died (machine hung, never had a chance to communicate back to master). Jenkins tries to restart it, but is not able to. When trying to restart the slave manually, nothing happens. The slave logs are empty and remain empty, with the spinning icon just running.

          I had a look at the thread dump and saw a number of threads blocked and waiting for the following thread:

          "Computer.threadPoolForRemoting 14515" Id=577044 Group=main WAITING on com.trilead.ssh2.channel.Channel@76dd2191
          at java.lang.Object.wait(Native Method)

          • waiting on com.trilead.ssh2.channel.Channel@76dd2191
            at java.lang.Object.wait(Object.java:503)
            at com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
            at com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
            at com.trilead.ssh2.Session.<init>(Session.java:41)
            at com.trilead.ssh2.Connection.openSession(Connection.java:1129)
          • locked com.trilead.ssh2.Connection@1e553bf
            at com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
            at com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
            at hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
          • locked hudson.plugins.sshslaves.SSHLauncher@3c884383
            at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:547)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
            at java.util.concurrent.FutureTask.run(FutureTask.java:262)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:744)

          Number of locked synchronizers = 1

          • java.util.concurrent.ThreadPoolExecutor$Worker@638f4d22

          Looking at the code for hudson.plugins.sshslaves.SSHLauncher.java in afterDisconnect, I see no hint of code that deals with timeouts. Looking further up the stack, I wonder what happens when openSessionChannel tries to make a connection to the slave but the slave dies on the other side. The code does not look like it times out. If this is the case, and whatever is on the other side of the channel that is supposed to respond is also dead, it would seem to me that waitUntilChannelOpen will never return and will hang forever. Thus, the hudson.plugins.sshslaves.SSHLauncher lock will never be released and other threads wanting this lock will block forever, i.e. an effective deadlock.

          Can anyone confirm or refute my logic here? This certainly seems like it could explain my symptoms.

          Kind regards.

          Artur
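
          The hypothesis in the email above can be illustrated with a minimal, self-contained Java sketch. It does not use the Trilead or ssh-slaves code: FakeChannel, waitUntilOpen, tryOpen and the timeout values are hypothetical names made up for this illustration. It only shows the general point that an unbounded Object.wait() for a confirmation that never arrives does not return, while a bounded wait lets the caller give up and release whatever locks it holds.

```java
import java.util.concurrent.TimeoutException;

public class ChannelWaitSketch {
    /** Hypothetical stand-in for a channel whose peer may never answer. */
    static final class FakeChannel {
        private boolean open; // guarded by this

        synchronized void markOpen() {
            open = true;
            notifyAll();
        }

        /** Waits until the channel is open; a timeout of 0 or less means wait indefinitely. */
        synchronized void waitUntilOpen(long timeoutMillis)
                throws InterruptedException, TimeoutException {
            long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            while (!open) {
                if (timeoutMillis <= 0) {
                    wait(); // unbounded: never returns if the peer is dead
                } else {
                    long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                    if (remainingMillis <= 0) {
                        throw new TimeoutException("peer did not confirm the channel");
                    }
                    wait(remainingMillis); // bounded: re-checked after every wake-up
                }
            }
        }
    }

    /** Returns true if the channel opened in time, false if the wait gave up. */
    static boolean tryOpen(FakeChannel channel, long timeoutMillis) {
        try {
            channel.waitUntilOpen(timeoutMillis);
            return true;
        } catch (TimeoutException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        FakeChannel dead = new FakeChannel(); // nobody will ever call markOpen()
        System.out.println("dead peer opened: " + tryOpen(dead, 200));

        FakeChannel alive = new FakeChannel();
        alive.markOpen();
        System.out.println("live peer opened: " + tryOpen(alive, 200));
    }
}
```

          Passing a timeout of 0 to tryOpen for the dead channel would reproduce the hang described in the email, which is why the sketch never does so.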


          Artur Szostak added a comment -

          Oh, and the slave is configured with the following settings:

          Usage = Utilize this slave as much as possible
          Launch method = Launch slave agents on Unix machines via SSH
          Port = 22
          JVM Options = -Djava.awt.headless=true
          Availability = Take this slave on-line when in demand and off-line when idle
          In demand delay = 0
          Idle delay = 60
          Environment variables is checked, with things like ARCH, BUILD_TYPE, CCACHE_DIR, OS, etc. set.
          Prepare job environment is also checked to set up PATH and LD_LIBRARY_PATH (can't seem to set these two with just the "Environment variables" settings).
          Configured credentials to log in to the build account.

          All other options are default/empty.


          Artur Szostak added a comment -

          This is starting to become a blocker. It is happening regularly (2-3 times a week) and forcing a restart of Jenkins. I have updated to Jenkins version 1.554.3 with no improvement.

          I think I also managed to catch the problem at a slightly earlier moment this time. A number of hours after a restart of Jenkins, a job got stuck on one of the build slaves and was aborted. When I logged into the machine, there was no sign of the Jenkins slave Java process running. However, in the Jenkins GUI the slave appeared to be running, or at least starting. About 10 minutes before the stuck job, another job ran and failed with the following messages in the build log:
          ____________________________________________________________________________
          Started by upstream project "admin-validate-slave-configs" build number 456
          originally caused by:
          Started by timer
          [EnvInject] - Loading node environment variables.
          [EnvInject] - [ERROR] - SEVERE ERROR occurs: java.lang.InterruptedException

          Deleting project workspace...
          Collecting metadata...
          Metadata collection done.
          Finished: FAILURE
          ____________________________________________________________________________

          The thread stack dumps that I believe are relevant:
          ____________________________________________________________________________
          "Channel reader thread: ma016213" Id=23008 Group=main WAITING on com.trilead.ssh2.channel.Channel@1d6e8511
          at java.lang.Object.wait(Native Method)

          • waiting on com.trilead.ssh2.channel.Channel@1d6e8511
            at java.lang.Object.wait(Object.java:503)
            at com.trilead.ssh2.channel.FifoBuffer.read(FifoBuffer.java:212)
            at com.trilead.ssh2.channel.Channel$Output.read(Channel.java:127)
            at com.trilead.ssh2.channel.ChannelManager.getChannelData(ChannelManager.java:946)
            at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:58)
            at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:79)
            at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:77)
            at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
            at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
            at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
            at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
            at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
            at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
            at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
            at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
            at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
            at java.lang.Throwable.readObject(Throwable.java:914)
            at sun.reflect.GeneratedMethodAccessor201.invoke(Unknown Source)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:606)
            at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
            at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
            at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
            at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
            at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
            at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
            at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
            at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
            at hudson.remoting.Command.readFrom(Command.java:92)
            at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
            at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

          "Executor #0 for ma016213 : executing admin-validate-slave-configs » ma016213 #456 / waiting for hudson.remoting.Channel@21fb7f0a:ma016213" Id=254 Group=main TIMED_WAITING on hudson.remoting.UserRequest@e493bde
          at java.lang.Object.wait(Native Method)

          • waiting on hudson.remoting.UserRequest@e493bde
            at hudson.remoting.Request.call(Request.java:146)
            at hudson.remoting.Channel.call(Channel.java:722)
            at hudson.FilePath.act(FilePath.java:1003)
            at org.jenkinsci.plugins.envinject.service.EnvironmentVariablesNodeLoader.gatherEnvironmentVariablesNode(EnvironmentVariablesNodeLoader.java:44)
            at org.jenkinsci.plugins.envinject.EnvInjectListener.loadEnvironmentVariablesNode(EnvInjectListener.java:81)
            at org.jenkinsci.plugins.envinject.EnvInjectListener.setUpEnvironment(EnvInjectListener.java:39)
            at hudson.model.AbstractBuild$AbstractBuildExecution.createLauncher(AbstractBuild.java:637)
            at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:543)
            at hudson.model.Run.execute(Run.java:1684)
            at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
            at hudson.model.ResourceController.execute(ResourceController.java:88)
            at hudson.model.Executor.run(Executor.java:231)

          "pool-48-thread-1 / waiting for hudson.remoting.Channel@21fb7f0a:ma016213" Id=22727 Group=main TIMED_WAITING on hudson.remoting.UserRequest@14c4e0d3
          at java.lang.Object.wait(Native Method)

          • waiting on hudson.remoting.UserRequest@14c4e0d3
            at hudson.remoting.Request.call(Request.java:146)
            at hudson.remoting.Channel.call(Channel.java:722)
            at org.jenkinsci.modules.slave_installer.impl.ComputerListenerImpl.onOnline(ComputerListenerImpl.java:32)
            at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:503)
            at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:345)
            at hudson.plugins.sshslaves.SSHLauncher.startSlave(SSHLauncher.java:901)
            at hudson.plugins.sshslaves.SSHLauncher.access$400(SSHLauncher.java:126)
            at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:658)
            at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:642)
            at java.util.concurrent.FutureTask.run(FutureTask.java:262)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:744)

          Number of locked synchronizers = 1

          • java.util.concurrent.ThreadPoolExecutor$Worker@7f77eb2e
            ____________________________________________________________________________

          What is interesting is that the job indicated in the stack dump for "Executor #0" is the one that failed earlier, before the job that actually got stuck was run.
          For your info: admin-validate-slave-configs is a matrix job that simply runs a Python script on each slave every morning to check that all paths, tools and environment variables are set up correctly.

          The following is a dump of the slave's log:
          ____________________________________________________________________________
          [07/30/14 06:11:30] [SSH] Opening SSH connection to ma016213:22.
          [07/30/14 06:11:31] [SSH] Authentication successful.
          [07/30/14 06:11:32] [SSH] The remote users environment is:
          BASH=/bin/bash
          BASH_ARGC=()
          BASH_ARGV=()
          BASH_EXECUTION_STRING=set
          BASH_LINENO=()
          BASH_SOURCE=()
          BASH_VERSINFO=([0]="3" [1]="2" [2]="51" [3]="1" [4]="release" [5]="x86_64-apple-darwin13")
          BASH_VERSION='3.2.51(1)-release'
          DIRSTACK=()
          EUID=502
          GROUPS=()
          HOME=/Users/buildacc
          HOSTNAME=ma016213
          HOSTTYPE=x86_64
          IFS=$' \t\n'
          LOGNAME=buildacc
          MACHTYPE=x86_64-apple-darwin13
          MAIL=/var/mail/buildacc
          OPTERR=1
          OPTIND=1
          OSTYPE=darwin13
          PATH=/usr/bin:/bin:/usr/sbin:/sbin
          PPID=30224
          PS4='+ '
          PWD=/Users/buildacc
          SHELL=/bin/bash
          SHELLOPTS=braceexpand:hashall:interactive-comments
          SHLVL=1
          SSH_CLIENT='**.*.*.** 50759 22'
          SSH_CONNECTION='**.*.*.** 50759 **.*.*.** 22'
          TERM=dumb
          TMPDIR=/var/folders/07/53xr552x5yq4dsbjnlkh0l4m0000gp/T/
          UID=502
          USER=buildacc
          _=bash
          [07/30/14 06:11:32] [SSH] Checking java version of java
          [07/30/14 06:11:34] [SSH] java -version returned 1.7.0_55.
          [07/30/14 06:11:34] [SSH] Starting sftp client.
          [07/30/14 06:11:34] [SSH] Copying latest slave.jar...
          [07/30/14 06:11:34] [SSH] Copied 364,754 bytes.
          Expanded the channel window size to 4MB
          [07/30/14 06:11:34] [SSH] Starting slave process: cd "/Users/buildacc/jenkinsBuild" && java -Djava.awt.headless=true -jar slave.jar
          <===[JENKINS REMOTING CAPACITY]===>@@^@channel started
          Slave.jar version: 2.36
          This is a Unix slave
          Evacuated stdout
          ____________________________________________________________________________


          Artur Szostak added a comment -

          Additional thread dumps that could be relevant to the previous comment:

          "Computer.threadPoolForRemoting 144" Id=22726 Group=main WAITING on java.util.concurrent.FutureTask@497a56f6
          at sun.misc.Unsafe.park(Native Method)

          • waiting on java.util.concurrent.FutureTask@497a56f6
            at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
            at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:425)
            at java.util.concurrent.FutureTask.get(FutureTask.java:187)
            at java.util.concurrent.AbstractExecutorService.invokeAll(AbstractExecutorService.java:243)
            at java.util.concurrent.Executors$DelegatedExecutorService.invokeAll(Executors.java:648)
            at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:691)
          • locked hudson.plugins.sshslaves.SSHLauncher@f4631f7
            at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:228)
            at java.util.concurrent.FutureTask.run(FutureTask.java:262)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:744)

          Number of locked synchronizers = 1

          • java.util.concurrent.ThreadPoolExecutor$Worker@451047f5

          "Computer.threadPoolForRemoting 744" Id=38276 Group=main BLOCKED on hudson.plugins.sshslaves.SSHLauncher@f4631f7 owned by "Computer.threadPoolForRemoting 144" Id=22726
          at hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1152)

          • blocked on hudson.plugins.sshslaves.SSHLauncher@f4631f7
            at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:547)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
            at java.util.concurrent.FutureTask.run(FutureTask.java:262)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:744)

          Number of locked synchronizers = 1

          • java.util.concurrent.ThreadPoolExecutor$Worker@3bd8d3ab

          "Computer.threadPoolForRemoting 823" Id=40700 Group=main BLOCKED on hudson.plugins.sshslaves.SSHLauncher@f4631f7 owned by "Computer.threadPoolForRemoting 144" Id=22726
          at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:639)

          • blocked on hudson.plugins.sshslaves.SSHLauncher@f4631f7
            at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:228)
            at java.util.concurrent.FutureTask.run(FutureTask.java:262)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:744)

          Number of locked synchronizers = 1

          • java.util.concurrent.ThreadPoolExecutor$Worker@5f27ca48

          "Computer.threadPoolForRemoting 849" Id=41501 Group=main BLOCKED on hudson.plugins.sshslaves.SSHLauncher@f4631f7 owned by "Computer.threadPoolForRemoting 144" Id=22726
          at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:639)

          • blocked on hudson.plugins.sshslaves.SSHLauncher@f4631f7
            at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:228)
            at java.util.concurrent.FutureTask.run(FutureTask.java:262)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:744)

          Number of locked synchronizers = 1

          • java.util.concurrent.ThreadPoolExecutor$Worker@36a88ffe


          aszostak said

          In hudson.plugins.sshslaves.SSHLauncher.java, in afterDisconnect, I see no hint of code that deals with timeouts. Looking further up the stack, I wonder what happens when openSessionChannel tries to make a connection to the slave but the slave dies on the other side. The code does not look like it times out. If this is the case, and whatever is on the other side of the channel that is supposed to respond is also dead, it would seem to me that waitUntilChannelOpen will never return and will hang forever. Thus, the hudson.plugins.sshslaves.SSHLauncher lock will never be released and other threads wanting this lock will block forever, i.e. an effective deadlock.

          I agree with that analysis. The SFTPv3Client constructor does not have any explicit timeout support. It relies on the global timeout support provided in trilead... which does not exist... only reasonable solution would be to fork that work off into a side thread and join that thread with a timeout...
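
          The approach suggested above (fork the blocking work off into a side thread and join it with a timeout) could look roughly like the sketch below. This is only an illustration of the general technique, not the code that went into the ssh-slaves plugin: the class name TimedCleanup and the method runWithTimeout are hypothetical, the blocking cleanup is stood in for by a plain Runnable, and Java 8 lambdas are used for brevity.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Jenkins or the ssh-slaves plugin.
public class TimedCleanup {

    /**
     * Runs a possibly blocking task on a side thread and waits at most
     * timeoutMillis for it. Returns true if the task finished in time,
     * false if it timed out, failed, or the caller was interrupted.
     */
    public static boolean runWithTimeout(Runnable task, long timeoutMillis) {
        ExecutorService side = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "cleanup-side-thread");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<?> result = side.submit(task);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Only requests an interrupt; a thread stuck in socket I/O may ignore it.
            result.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            result.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the task itself threw an exception
        } finally {
            side.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A task that returns immediately completes within the timeout.
        boolean fast = runWithTimeout(() -> { }, 1000);
        // A task that blocks far longer than the timeout is abandoned.
        boolean stuck = runWithTimeout(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException ignored) {
                // interrupted by cancel(true); just exit
            }
        }, 200);
        System.out.println(fast + " " + stuck);
    }
}
```

          The point of the design is that the caller always gets control back after the timeout, so a lock held around the call (such as the SSHLauncher monitor in the thread dumps) can be released even if the remote side never answers. The side thread itself may stay stuck if the underlying socket read ignores interruption, which is why it is marked as a daemon thread here.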


          Code changed in jenkins
          User: Stephen Connolly
          Path:
          src/main/java/hudson/plugins/sshslaves/SSHLauncher.java
          http://jenkins-ci.org/commit/ssh-slaves-plugin/1118394240fef4554f8b84010c1b88c4513cefa0
          Log:
          [FIXED JENKINS-23560] Time-out the afterDisconnect cleanup of the remote slave.jar

          • without a timeout this can be left in the hands of trilead's infinite blocking on the socket


          Code changed in jenkins
          User: Stephen Connolly
          Path:
          src/main/java/hudson/plugins/sshslaves/SSHLauncher.java
          http://jenkins-ci.org/commit/ssh-slaves-plugin/84caa2c24558da8fc73c41ae7f8d8bd785d327e3
          Log:
          Merge pull request #26 from jenkinsci/jenkins-23560

          [FIXED JENKINS-23560] Time-out the afterDisconnect cleanup of the remote slave.jar

          Compare: https://github.com/jenkinsci/ssh-slaves-plugin/compare/1cf28a11c7fc...84caa2c24558


          Artur Szostak added a comment -

          ssh-slaves plugin version 1.10 has been running for over a year on our Jenkins instance with over 50 build slaves. This problem has not occurred since the plugin was updated. Thus I confirm that the fix worked and am closing the ticket.


            Assignee: Kohsuke Kawaguchi (kohsuke)
            Reporter: Artur Szostak (aszostak)