Bug
Resolution: Fixed
Critical
None
mesos-plugin 0.6.0 (slightly modified)
It seems that when JenkinsScheduler.statusUpdate() tries to stop the scheduler at the same time as a slave's retention timer tries to stop that slave, the two threads can end up in a deadlock.
This happens because the timer thread locks the MesosImpl instance while the statusUpdate() thread holds SUPERVISOR_LOCK. MesosImpl then tries to terminate the slave and waits for the statusUpdate() thread to release SUPERVISOR_LOCK. However, statusUpdate() in turn needs the lock on MesosImpl in order to stop the scheduler, so neither thread can proceed.
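To make the cycle concrete, here is a minimal standalone sketch of the same lock pattern. It does not use any plugin code: MESOS_MONITOR and SUPERVISOR_LOCK are hypothetical stand-ins for the MesosImpl monitor and the scheduler's ReentrantLock, and the timer side uses tryLock with a timeout (an assumption made only so the demo terminates instead of hanging like the real threads).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockCycleSketch {
    // Hypothetical stand-ins for the MesosImpl monitor and SUPERVISOR_LOCK.
    private static final Object MESOS_MONITOR = new Object();
    private static final ReentrantLock SUPERVISOR_LOCK = new ReentrantLock();

    /** Returns true if the timer-like thread got SUPERVISOR_LOCK while holding the monitor. */
    static boolean timerPathGotLock(long timeoutMillis) throws InterruptedException {
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch monitorHeld = new CountDownLatch(1);
        boolean[] acquired = {false};

        // Plays the role of statusUpdate(): takes SUPERVISOR_LOCK, then wants the monitor.
        Thread statusUpdate = new Thread(() -> {
            SUPERVISOR_LOCK.lock();
            try {
                lockHeld.countDown();
                monitorHeld.await();
                synchronized (MESOS_MONITOR) {
                    // stopScheduler() would run here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                SUPERVISOR_LOCK.unlock();
            }
        });

        // Plays the role of the retention timer: takes the monitor, then wants SUPERVISOR_LOCK.
        Thread timer = new Thread(() -> {
            try {
                lockHeld.await();
                synchronized (MESOS_MONITOR) {
                    monitorHeld.countDown();
                    // Bounded wait so the sketch terminates; a plain lock() would block forever.
                    acquired[0] = SUPERVISOR_LOCK.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
                    if (acquired[0]) {
                        SUPERVISOR_LOCK.unlock();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        statusUpdate.start();
        timer.start();
        timer.join();
        statusUpdate.join();
        return acquired[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("timer path acquired lock: " + timerPathGotLock(300));
    }
}
```

In this sketch the timer-like thread never obtains the lock within the timeout, because the other thread cannot release it until it gets the monitor. With an unbounded lock() call, as in the plugin, both threads would wait on each other indefinitely.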
This is the thread dump (I use a slightly modified version of the Mesos plugin 0.6.0, so the line numbers are probably not 100% accurate):
"Thread-2516073" - Thread t@2898790
   java.lang.Thread.State: BLOCKED
    at org.jenkinsci.plugins.mesos.Mesos$MesosImpl.stopScheduler(Mesos.java:141)
    - waiting to lock <62132b60> (a org.jenkinsci.plugins.mesos.Mesos$MesosImpl) owned by "jenkins.util.Timer [#9]" t@66
    at org.jenkinsci.plugins.mesos.JenkinsScheduler.supervise(JenkinsScheduler.java:749)
    at org.jenkinsci.plugins.mesos.JenkinsScheduler.statusUpdate(JenkinsScheduler.java:634)

   Locked ownable synchronizers:
    - locked <3af5466a> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

"jenkins.util.Timer [#9]" - Thread t@66
   java.lang.Thread.State: WAITING
    at sun.misc.Unsafe.park(Native Method)
    - waiting to lock <3af5466a> (a java.util.concurrent.locks.ReentrantLock$NonfairSync) owned by "Thread-2516073" t@2898790
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
    at org.jenkinsci.plugins.mesos.JenkinsScheduler.supervise(JenkinsScheduler.java:725)
    at org.jenkinsci.plugins.mesos.JenkinsScheduler.terminateJenkinsSlave(JenkinsScheduler.java:220)
    - locked <55398768> (a org.jenkinsci.plugins.mesos.JenkinsScheduler)
    at org.jenkinsci.plugins.mesos.Mesos$MesosImpl.stopJenkinsSlave(Mesos.java:157)
    - locked <62132b60> (a org.jenkinsci.plugins.mesos.Mesos$MesosImpl)
    at org.jenkinsci.plugins.mesos.MesosComputerLauncher.terminate(MesosComputerLauncher.java:122)
    at org.jenkinsci.plugins.mesos.MesosSlave.terminate(MesosSlave.java:91)
    at org.jenkinsci.plugins.mesos.MesosRetentionStrategy.check(MesosRetentionStrategy.java:70)
    - locked <75b63404> (a org.jenkinsci.plugins.mesos.MesosRetentionStrategy)
    at org.jenkinsci.plugins.mesos.MesosRetentionStrategy.check(MesosRetentionStrategy.java:26)
    at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:66)
    at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:54)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)

   Locked ownable synchronizers:
    - locked <703c7665> (a java.util.concurrent.ThreadPoolExecutor$Worker)
I tried to solve the problem myself, but all the synchronized calls tied my brain in knots. My only guess is that the multiple synchronized cross-calls between MesosImpl and JenkinsScheduler are the root of the problem.
Maybe some Java whiz can solve it.
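One common way out of this kind of cycle is to make every code path take the two locks in the same order. The sketch below illustrates that idea only; it is not the plugin's actual fix, and the names (guarded, MESOS_MONITOR, SUPERVISOR_LOCK as a local field) are hypothetical stand-ins rather than plugin APIs.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderSketch {
    // Hypothetical stand-ins for SUPERVISOR_LOCK and the MesosImpl monitor.
    private static final ReentrantLock SUPERVISOR_LOCK = new ReentrantLock();
    private static final Object MESOS_MONITOR = new Object();

    // Single rule for every caller: SUPERVISOR_LOCK first, then the monitor.
    static void guarded(Runnable action) {
        SUPERVISOR_LOCK.lock();
        try {
            synchronized (MESOS_MONITOR) {
                action.run();
            }
        } finally {
            SUPERVISOR_LOCK.unlock();
        }
    }

    /** Runs two competing loops through guarded(); returns true if both finish in time. */
    static boolean runBothPaths(int iterations, long timeoutSeconds) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        int[] counter = {0}; // only touched while both locks are held
        try {
            Runnable loop = () -> {
                for (int i = 0; i < iterations; i++) {
                    guarded(() -> counter[0]++);
                }
            };
            Future<?> timerLike = pool.submit(loop);        // stands in for the retention timer
            Future<?> statusUpdateLike = pool.submit(loop); // stands in for statusUpdate()
            timerLike.get(timeoutSeconds, TimeUnit.SECONDS);
            statusUpdateLike.get(timeoutSeconds, TimeUnit.SECONDS);
            return counter[0] == 2 * iterations;
        } catch (TimeoutException e) {
            return false; // a task did not finish in time
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("both paths completed: " + runBothPaths(10_000, 10));
    }
}
```

Because no thread ever holds the monitor while waiting for the ReentrantLock, the circular wait seen in the thread dump cannot form. In the plugin this would presumably mean either taking SUPERVISOR_LOCK before entering the synchronized MesosImpl methods, or releasing it before calling back into MesosImpl.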
PS: I also posted this on the GitHub issues page, because it seems to be more active (https://github.com/jenkinsci/mesos-plugin/issues/97).