- Bug
- Resolution: Fixed
- Major
- None
- Jenkins 1.554.2
I noticed today a number of jobs stuck in Jenkins. They were waiting for a slave machine to start automatically, but it never did. Jenkins indicated that the slave was starting, but there was nothing in the logs. A manual restart had no effect. I decided to put Jenkins in shutdown mode, stop all running jobs, flush the queues and stop all build slaves. Once there was absolutely no activity on Jenkins, I tried to start the stuck slave manually. Again, nothing happened. It seems like Jenkins is hung in the build start phase. I attach my list of plugins and a thread dump.
Can anyone confirm Jenkins is hung based on the thread dump?
Any known workarounds?
- plugin_list.txt (6 kB)
- thread_dump.txt (160 kB)
[JENKINS-23560] Slave hung in startup phase with missing logging in the GUI
I think I should adjust one comment in my description: "It seems like Jenkins is hung in the build start phase." => "It seems like Jenkins is hung in the slave start phase."
Anyway, there is nothing interesting in the logs on the slave side, since the Jenkins slave JAR never gets started. I don't see any process related to Jenkins on the slave machine.
I did notice a similar situation that I was able to dig into a little more. It involved a physically hung machine that needed a reboot while a Jenkins slave instance was running. It could be related. I sent an email to the developer list to see if anyone could lend credibility to my hypothesis, but unfortunately got no reply. I reproduce the text of that email below with the information I gathered. Perhaps it's related, since the symptoms were essentially the same.
_____________________________________________________________________________________________
Hi,
I am trying to debug the following symptom:
Jenkins started a slave. The slave died (the machine hung and never had a chance to communicate back to the master). Jenkins tries to restart it, but is not able to. When trying to restart the slave manually, nothing happens. The slave log is empty and stays empty, with the spinning icon running indefinitely.
I had a look at the thread dump and saw a number of threads blocked and waiting for the following thread:
"Computer.threadPoolForRemoting 14515" Id=577044 Group=main WAITING on com.trilead.ssh2.channel.Channel@76dd2191
at java.lang.Object.wait(Native Method)
- waiting on com.trilead.ssh2.channel.Channel@76dd2191
at java.lang.Object.wait(Object.java:503)
at com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
at com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
at com.trilead.ssh2.Session.<init>(Session.java:41)
at com.trilead.ssh2.Connection.openSession(Connection.java:1129)
- locked com.trilead.ssh2.Connection@1e553bf
at com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
at com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
at hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
- locked hudson.plugins.sshslaves.SSHLauncher@3c884383
at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:547)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@638f4d22
Looking at the code for hudson.plugins.sshslaves.SSHLauncher.java in afterDisconnect, I see no hint of code that deals with timeouts. Looking further up the stack, I wonder what happens when openSessionChannel tries to make a connection to the slave but the slave dies on the other side. The code does not look like it times out. If this is the case, and whatever is on the other side of the channel that is supposed to respond is also dead, it would seem to me that waitUntilChannelOpen will never return and will hang forever. Thus, the hudson.plugins.sshslaves.SSHLauncher lock will never be released and other threads wanting this lock will block forever, i.e. an effective deadlock.
Can anyone confirm or refute my logic here? This certainly seems it could explain my symptoms.
Kind regards.
Artur
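The reasoning in the email above can be illustrated with a small, self-contained Java sketch. It is not the plugin's code: the Launcher class, its channel field and the thread names are hypothetical stand-ins. It only shows the general JVM behaviour the hypothesis relies on: Object.wait() releases the monitor of the object being waited on, but not any other monitor the thread already holds, so a second synchronized call on the same object stays BLOCKED.

```java
import java.util.concurrent.CountDownLatch;

public class HeldLockDemo {

    /** Hypothetical stand-in for a launcher object; NOT the real SSHLauncher. */
    static class Launcher {
        private final Object channel = new Object();
        private final CountDownLatch entered = new CountDownLatch(1);

        // Mimics a cleanup method that waits for a reply that never arrives.
        synchronized void afterDisconnect() throws InterruptedException {
            synchronized (channel) {
                entered.countDown();
                // No timeout: this releases 'channel' only; the Launcher monitor stays held.
                channel.wait();
            }
        }

        // Any other synchronized method needs the same Launcher monitor.
        synchronized void launch() {
        }
    }

    /** Returns the state of a thread that calls launch() while afterDisconnect() is stuck. */
    public static Thread.State stateOfSecondCaller() throws InterruptedException {
        Launcher launcher = new Launcher();

        Thread cleanup = new Thread(() -> {
            try {
                launcher.afterDisconnect();
            } catch (InterruptedException ignored) {
                // interrupted at the end of the demo
            }
        }, "cleanup");
        cleanup.setDaemon(true);
        cleanup.start();
        launcher.entered.await(); // cleanup now holds the Launcher monitor

        Thread starter = new Thread(launcher::launch, "starter");
        starter.setDaemon(true);
        starter.start();

        // Give the second thread up to two seconds to reach the monitor.
        long deadline = System.currentTimeMillis() + 2000;
        while (starter.getState() != Thread.State.BLOCKED
                && starter.isAlive()
                && System.currentTimeMillis() < deadline) {
            Thread.sleep(10);
        }
        Thread.State state = starter.getState();
        cleanup.interrupt(); // unblock the demo so both threads can finish
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("starter state: " + stateOfSecondCaller());
    }
}
```

Whether this is what actually happens inside trilead and the ssh-slaves plugin has to be confirmed against their sources; the sketch only shows that the suspected pattern is sufficient to produce the BLOCKED threads.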
Oh, and the slave is configured with the following settings:
Usage = Utilize this slave as much as possible
Launch method = Launch slave agents on Unix machines via SSH
Port = 22
JVM Options = -Djava.awt.headless=true
Availability = Take this slave on-line when in demand and off-line when idle
In demand delay = 0
Idle delay = 60
Environment variables is checked with things like ARCH, BUILD_TYPE, CCACHE_DIR, OS, etc set.
Prepare job environment is also checked to set up PATH and LD_LIBRARY_PATH (I can't seem to set these two with just the "Environment variables" settings).
Configured credentials to login to the build account.
All other options are default/empty.
This is starting to become a blocker. It is happening regularly (2-3 times a week) and forcing a restart of Jenkins. I have updated to Jenkins version 1.554.3 with no improvement.
I think I also managed to catch the problem at a slightly earlier moment this time. A number of hours after a restart of Jenkins, a job got stuck on one of the build slaves and was aborted. When I logged into the machine, there was no sign of the Jenkins slave Java process running. However, in the Jenkins GUI the slave appeared to be running, or at least starting. About 10 minutes before the stuck job, another job ran and failed with the following messages in the build log:
____________________________________________________________________________
Started by upstream project "admin-validate-slave-configs" build number 456
originally caused by:
Started by timer
[EnvInject] - Loading node environment variables.
[EnvInject] - [ERROR] - SEVERE ERROR occurs: java.lang.InterruptedException
Deleting project workspace...
Collecting metadata...
Metadata collection done.
Finished: FAILURE
____________________________________________________________________________
The thread stack dumps that I believe are relevant:
____________________________________________________________________________
"Channel reader thread: ma016213" Id=23008 Group=main WAITING on com.trilead.ssh2.channel.Channel@1d6e8511
at java.lang.Object.wait(Native Method)
- waiting on com.trilead.ssh2.channel.Channel@1d6e8511
at java.lang.Object.wait(Object.java:503)
at com.trilead.ssh2.channel.FifoBuffer.read(FifoBuffer.java:212)
at com.trilead.ssh2.channel.Channel$Output.read(Channel.java:127)
at com.trilead.ssh2.channel.ChannelManager.getChannelData(ChannelManager.java:946)
at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:58)
at com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:79)
at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:77)
at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)
at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
at java.lang.Throwable.readObject(Throwable.java:914)
at sun.reflect.GeneratedMethodAccessor201.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at hudson.remoting.Command.readFrom(Command.java:92)
at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
"Executor #0 for ma016213 : executing admin-validate-slave-configs » ma016213 #456 / waiting for hudson.remoting.Channel@21fb7f0a:ma016213" Id=254 Group=main TIMED_WAITING on hudson.remoting.UserRequest@e493bde
at java.lang.Object.wait(Native Method)
- waiting on hudson.remoting.UserRequest@e493bde
at hudson.remoting.Request.call(Request.java:146)
at hudson.remoting.Channel.call(Channel.java:722)
at hudson.FilePath.act(FilePath.java:1003)
at org.jenkinsci.plugins.envinject.service.EnvironmentVariablesNodeLoader.gatherEnvironmentVariablesNode(EnvironmentVariablesNodeLoader.java:44)
at org.jenkinsci.plugins.envinject.EnvInjectListener.loadEnvironmentVariablesNode(EnvInjectListener.java:81)
at org.jenkinsci.plugins.envinject.EnvInjectListener.setUpEnvironment(EnvInjectListener.java:39)
at hudson.model.AbstractBuild$AbstractBuildExecution.createLauncher(AbstractBuild.java:637)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:543)
at hudson.model.Run.execute(Run.java:1684)
at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:231)
"pool-48-thread-1 / waiting for hudson.remoting.Channel@21fb7f0a:ma016213" Id=22727 Group=main TIMED_WAITING on hudson.remoting.UserRequest@14c4e0d3
at java.lang.Object.wait(Native Method)
- waiting on hudson.remoting.UserRequest@14c4e0d3
at hudson.remoting.Request.call(Request.java:146)
at hudson.remoting.Channel.call(Channel.java:722)
at org.jenkinsci.modules.slave_installer.impl.ComputerListenerImpl.onOnline(ComputerListenerImpl.java:32)
at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:503)
at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:345)
at hudson.plugins.sshslaves.SSHLauncher.startSlave(SSHLauncher.java:901)
at hudson.plugins.sshslaves.SSHLauncher.access$400(SSHLauncher.java:126)
at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:658)
at hudson.plugins.sshslaves.SSHLauncher$2.call(SSHLauncher.java:642)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@7f77eb2e
____________________________________________________________________________
What is interesting is that the job indicated in the stack dump for "Executor #0" is the one that failed earlier, before the job that actually got stuck was run.
For your info: admin-validate-slave-configs is a matrix job that simply runs a Python script on each slave every morning to check that all paths, tools and environment variables are set up correctly.
The following is a dump of the slave's log:
____________________________________________________________________________
[07/30/14 06:11:30] [SSH] Opening SSH connection to ma016213:22.
[07/30/14 06:11:31] [SSH] Authentication successful.
[07/30/14 06:11:32] [SSH] The remote users environment is:
BASH=/bin/bash
BASH_ARGC=()
BASH_ARGV=()
BASH_EXECUTION_STRING=set
BASH_LINENO=()
BASH_SOURCE=()
BASH_VERSINFO=([0]="3" [1]="2" [2]="51" [3]="1" [4]="release" [5]="x86_64-apple-darwin13")
BASH_VERSION='3.2.51(1)-release'
DIRSTACK=()
EUID=502
GROUPS=()
HOME=/Users/buildacc
HOSTNAME=ma016213
HOSTTYPE=x86_64
IFS=$' \t\n'
LOGNAME=buildacc
MACHTYPE=x86_64-apple-darwin13
MAIL=/var/mail/buildacc
OPTERR=1
OPTIND=1
OSTYPE=darwin13
PATH=/usr/bin:/bin:/usr/sbin:/sbin
PPID=30224
PS4='+ '
PWD=/Users/buildacc
SHELL=/bin/bash
SHELLOPTS=braceexpand:hashall:interactive-comments
SHLVL=1
SSH_CLIENT='**.*.*.** 50759 22'
SSH_CONNECTION='**.*.*.** 50759 **.*.*.** 22'
TERM=dumb
TMPDIR=/var/folders/07/53xr552x5yq4dsbjnlkh0l4m0000gp/T/
UID=502
USER=buildacc
_=bash
[07/30/14 06:11:32] [SSH] Checking java version of java
[07/30/14 06:11:34] [SSH] java -version returned 1.7.0_55.
[07/30/14 06:11:34] [SSH] Starting sftp client.
[07/30/14 06:11:34] [SSH] Copying latest slave.jar...
[07/30/14 06:11:34] [SSH] Copied 364,754 bytes.
Expanded the channel window size to 4MB
[07/30/14 06:11:34] [SSH] Starting slave process: cd "/Users/buildacc/jenkinsBuild" && java -Djava.awt.headless=true -jar slave.jar
<===[JENKINS REMOTING CAPACITY]===>@@^@channel started
Slave.jar version: 2.36
This is a Unix slave
Evacuated stdout
____________________________________________________________________________
Additional thread dumps that could be relevant to the previous comment:
"Computer.threadPoolForRemoting 144" Id=22726 Group=main WAITING on java.util.concurrent.FutureTask@497a56f6
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.FutureTask@497a56f6
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:425)
at java.util.concurrent.FutureTask.get(FutureTask.java:187)
at java.util.concurrent.AbstractExecutorService.invokeAll(AbstractExecutorService.java:243)
at java.util.concurrent.Executors$DelegatedExecutorService.invokeAll(Executors.java:648)
at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:691)
- locked hudson.plugins.sshslaves.SSHLauncher@f4631f7
at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:228)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@451047f5
"Computer.threadPoolForRemoting 744" Id=38276 Group=main BLOCKED on hudson.plugins.sshslaves.SSHLauncher@f4631f7 owned by "Computer.threadPoolForRemoting 144" Id=22726
at hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1152)
- blocked on hudson.plugins.sshslaves.SSHLauncher@f4631f7
at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:547)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@3bd8d3ab
"Computer.threadPoolForRemoting 823" Id=40700 Group=main BLOCKED on hudson.plugins.sshslaves.SSHLauncher@f4631f7 owned by "Computer.threadPoolForRemoting 144" Id=22726
at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:639)
- blocked on hudson.plugins.sshslaves.SSHLauncher@f4631f7
at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:228)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@5f27ca48
"Computer.threadPoolForRemoting 849" Id=41501 Group=main BLOCKED on hudson.plugins.sshslaves.SSHLauncher@f4631f7 owned by "Computer.threadPoolForRemoting 144" Id=22726
at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:639)
- blocked on hudson.plugins.sshslaves.SSHLauncher@f4631f7
at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:228)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker@36a88ffe
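The "BLOCKED on ... owned by ..." relationships in the dumps above can also be extracted programmatically rather than read by eye. The following is a minimal sketch using the standard java.lang.management API; the class name, the demo threads and the launcherLock object are hypothetical, and it reports on the JVM it runs in, so to inspect a Jenkins master it would have to run inside that JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class BlockedThreadReport {

    /** Returns one line per BLOCKED thread: who is blocked, on which monitor, and who owns it. */
    public static List<String> blockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                report.add("\"" + info.getThreadName() + "\" BLOCKED on " + info.getLockName()
                        + " owned by \"" + info.getLockOwnerName() + "\" Id=" + info.getLockOwnerId());
            }
        }
        return report;
    }

    public static void main(String[] args) throws Exception {
        Object launcherLock = new Object(); // hypothetical stand-in for the launcher monitor
        CountDownLatch held = new CountDownLatch(1);

        // One thread takes the monitor and keeps it, like the stuck launch() call.
        Thread owner = new Thread(() -> {
            synchronized (launcherLock) {
                held.countDown();
                try {
                    Thread.sleep(60_000);
                } catch (InterruptedException ignored) {
                    // interrupted at the end of the demo
                }
            }
        }, "owner");
        // A second thread wants the same monitor and cannot get it.
        Thread waiter = new Thread(() -> {
            synchronized (launcherLock) {
            }
        }, "waiter");
        owner.setDaemon(true);
        waiter.setDaemon(true);

        owner.start();
        held.await();
        waiter.start();
        while (waiter.getState() != Thread.State.BLOCKED) {
            Thread.sleep(10);
        }

        blockedThreads().forEach(System.out::println);
        owner.interrupt();
    }
}
```

Grouping the report lines by owner makes it easy to spot a single thread, such as "Computer.threadPoolForRemoting 144" above, that every other thread is queued behind.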
aszostak said:
hudson.plugins.sshslaves.SSHLauncher.java in afterDisconnect I see no hint of code that deals with timeouts. Looking further up the stack, I wonder what happens when openSessionChannel tries to make a connection to the slave but the slave dies on the other side. The code does not look like it times out. If this is the case, and whatever is on the other side of the channel that is supposed to respond is also dead, it would seem to me that waitUntilChannelOpen will never return and will hang forever. Thus, the hudson.plugins.sshslaves.SSHLauncher lock will never be released and other threads wanting this lock will block forever, i.e. an effective deadlock.
I agree with that analysis. The SFTPv3Client constructor does not have any explicit timeout support. It relies on global timeout support in trilead, which does not exist. The only reasonable solution would be to fork that work off into a side thread and join that thread with a timeout.
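The "side thread plus timed join" idea can be sketched in plain Java as follows. This is an illustration of the general technique only, not the change that went into the plugin: the class name, the runWithTimeout helper and the timeout values are assumptions made for the example, and it uses an ExecutorService with Future.get(timeout) in place of a raw Thread.join(timeout).

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedCleanup {

    /**
     * Runs a possibly-blocking task on a side thread and waits at most timeoutMillis for it.
     * Returns true if the task finished in time, false if the caller gave up on it.
     */
    public static boolean runWithTimeout(Runnable task, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService side = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "cleanup-side-thread");
            t.setDaemon(true); // a stuck cleanup must not keep the JVM alive
            return t;
        });
        Future<?> future = side.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Interrupt the side thread; the caller moves on whether or not it reacts.
            future.cancel(true);
            return false;
        } finally {
            side.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        Object neverNotified = new Object();
        // Simulates a wait on a channel whose remote end is dead.
        Runnable stuck = () -> {
            synchronized (neverNotified) {
                try {
                    neverNotified.wait();
                } catch (InterruptedException ignored) {
                    // cancelled by the timeout
                }
            }
        };
        System.out.println("quick task finished: " + runWithTimeout(() -> { }, 1000));
        System.out.println("stuck task finished: " + runWithTimeout(stuck, 200));
    }
}
```

The point of the design is that the calling thread always returns, so any monitor it holds is released even when the remote end never answers. If the blocked call ignores interruption, the side thread may linger, which is why it is created as a daemon thread here.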
Code changed in jenkins
User: Stephen Connolly
Path:
src/main/java/hudson/plugins/sshslaves/SSHLauncher.java
http://jenkins-ci.org/commit/ssh-slaves-plugin/1118394240fef4554f8b84010c1b88c4513cefa0
Log:
[FIXED JENKINS-23560] Time-out the afterDisconnect cleanup of the remote slave.jar
- without a timeout this can be left in the hands of trilead's infinite blocking on the socket
Code changed in jenkins
User: Stephen Connolly
Path:
src/main/java/hudson/plugins/sshslaves/SSHLauncher.java
http://jenkins-ci.org/commit/ssh-slaves-plugin/84caa2c24558da8fc73c41ae7f8d8bd785d327e3
Log:
Merge pull request #26 from jenkinsci/jenkins-23560
[FIXED JENKINS-23560] Time-out the afterDisconnect cleanup of the remote slave.jar
Compare: https://github.com/jenkinsci/ssh-slaves-plugin/compare/1cf28a11c7fc...84caa2c24558
ssh-slaves plugin version 1.10 has been running for over a year on our Jenkins instance with over 50 build slaves. This problem has not occurred since the plugin was updated. I therefore confirm that the fix worked and am closing the ticket.
What about local logging on the slave machine? Anything interesting there?
How was the slave being started (SSH, …)?