- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Component/s: mesos-plugin
- None
- Environment: mesos-plugin 0.6.0 (slightly modified)
It seems that when JenkinsScheduler.statusUpdate() tries to stop the Scheduler while the retention Timer of a Slave tries to stop that Slave, the two threads can end up in a deadlock.
The cause appears to be a lock-order inversion: the Timer thread locks the MesosImpl instance, while the statusUpdate() thread holds the SUPERVISOR_LOCK. MesosImpl then tries to terminate the Slave and waits for the statusUpdate() thread to release the SUPERVISOR_LOCK. However, statusUpdate() itself seems to need the lock on MesosImpl when it tries to stop the Scheduler, so neither thread can proceed.
This is the thread dump (I use a slightly modified version of Mesos plugin 0.6.0, so the line numbers are probably not 100% accurate):
"Thread-2516073" - Thread t@2898790
    java.lang.Thread.State: BLOCKED
        at org.jenkinsci.plugins.mesos.Mesos$MesosImpl.stopScheduler(Mesos.java:141)
        - waiting to lock <62132b60> (a org.jenkinsci.plugins.mesos.Mesos$MesosImpl) owned by "jenkins.util.Timer [#9]" t@66
        at org.jenkinsci.plugins.mesos.JenkinsScheduler.supervise(JenkinsScheduler.java:749)
        at org.jenkinsci.plugins.mesos.JenkinsScheduler.statusUpdate(JenkinsScheduler.java:634)
    Locked ownable synchronizers:
        - locked <3af5466a> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

"jenkins.util.Timer [#9]" - Thread t@66
    java.lang.Thread.State: WAITING
        at sun.misc.Unsafe.park(Native Method)
        - waiting to lock <3af5466a> (a java.util.concurrent.locks.ReentrantLock$NonfairSync) owned by "Thread-2516073" t@2898790
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
        at org.jenkinsci.plugins.mesos.JenkinsScheduler.supervise(JenkinsScheduler.java:725)
        at org.jenkinsci.plugins.mesos.JenkinsScheduler.terminateJenkinsSlave(JenkinsScheduler.java:220)
        - locked <55398768> (a org.jenkinsci.plugins.mesos.JenkinsScheduler)
        at org.jenkinsci.plugins.mesos.Mesos$MesosImpl.stopJenkinsSlave(Mesos.java:157)
        - locked <62132b60> (a org.jenkinsci.plugins.mesos.Mesos$MesosImpl)
        at org.jenkinsci.plugins.mesos.MesosComputerLauncher.terminate(MesosComputerLauncher.java:122)
        at org.jenkinsci.plugins.mesos.MesosSlave.terminate(MesosSlave.java:91)
        at org.jenkinsci.plugins.mesos.MesosRetentionStrategy.check(MesosRetentionStrategy.java:70)
        - locked <75b63404> (a org.jenkinsci.plugins.mesos.MesosRetentionStrategy)
        at org.jenkinsci.plugins.mesos.MesosRetentionStrategy.check(MesosRetentionStrategy.java:26)
        at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:66)
        at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:54)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
    Locked ownable synchronizers:
        - locked <703c7665> (a java.util.concurrent.ThreadPoolExecutor$Worker)
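For anyone who wants to reproduce the pattern in isolation, here is a minimal sketch of the same lock-order inversion. It is not the plugin's code: the class `LockInversionDemo`, its method, and the thread names are made up for illustration, and a plain `Object` monitor stands in for the synchronized MesosImpl instance. It uses the standard `ThreadMXBean.findDeadlockedThreads()` call, which reports cycles involving both monitors and `ReentrantLock`s.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of the lock pattern in this report; not the plugin's actual code.
class LockInversionDemo {

    /** Starts two daemon threads that lock in opposite order and returns
     *  how many threads the JVM reports as deadlocked (0 if none are found). */
    static int countDeadlockedThreads() {
        final Object mesosMonitor = new Object();                // stands in for the MesosImpl monitor
        final ReentrantLock supervisorLock = new ReentrantLock(); // stands in for SUPERVISOR_LOCK
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Like the retention timer: monitor first, then the supervisor lock.
        Thread timer = new Thread(() -> {
            synchronized (mesosMonitor) {
                arriveAndWait(bothHoldFirstLock);
                supervisorLock.lock();   // blocks: held by the other thread
                supervisorLock.unlock();
            }
        }, "timer-model");

        // Like statusUpdate(): supervisor lock first, then the monitor.
        Thread statusUpdate = new Thread(() -> {
            supervisorLock.lock();
            try {
                arriveAndWait(bothHoldFirstLock);
                synchronized (mesosMonitor) { // blocks: held by the other thread
                }
            } finally {
                supervisorLock.unlock();
            }
        }, "status-update-model");

        // Daemon threads, so the stuck threads do not keep the JVM alive.
        timer.setDaemon(true);
        statusUpdate.setDaemon(true);
        timer.start();
        statusUpdate.start();

        // Poll for up to about five seconds until the JVM sees the cycle.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 100; i++) {
            long[] ids = threads.findDeadlockedThreads();
            if (ids != null) {
                return ids.length;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    // Signals that this thread holds its first lock, then waits for the other thread.
    private static void arriveAndWait(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling `LockInversionDemo.countDeadlockedThreads()` should report the two model threads as deadlocked, which matches the shape of the dump above: one thread BLOCKED on a monitor, the other parked on a ReentrantLock.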
I tried to solve the problem myself, but I got lost in all the synchronized calls. My best guess is that the synchronized cross calls between MesosImpl and JenkinsScheduler, each made while the caller still holds its own lock, are the problem.
Maybe some Java whiz can find a proper fix.
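One direction that might work, sketched under assumptions: make the decision while holding the supervisor lock, but release it before calling into the other synchronized object. The classes `SchedulerOwner` and `SupervisorSketch`, the `pendingTasks` counter, and the method names are hypothetical stand-ins, not the plugin's API, and I have not checked this against the real code.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the object whose methods are synchronized (the MesosImpl role).
class SchedulerOwner {
    private boolean running = true;

    synchronized void stopScheduler() {
        running = false;
    }

    synchronized boolean isRunning() {
        return running;
    }
}

// Hypothetical stand-in for the supervising side (the JenkinsScheduler role).
class SupervisorSketch {
    private final ReentrantLock supervisorLock = new ReentrantLock();
    private final SchedulerOwner owner;
    private int pendingTasks; // guarded by supervisorLock

    SupervisorSketch(SchedulerOwner owner, int pendingTasks) {
        this.owner = owner;
        this.pendingTasks = pendingTasks;
    }

    void taskFinished() {
        boolean shouldStop;
        supervisorLock.lock();
        try {
            // Only read and update our own state while holding the lock.
            if (pendingTasks > 0) {
                pendingTasks--;
            }
            shouldStop = (pendingTasks == 0);
        } finally {
            supervisorLock.unlock();
        }
        // The cross call happens after the lock is released, so this thread
        // never waits for the owner's monitor while holding supervisorLock.
        if (shouldStop) {
            owner.stopScheduler();
        }
    }
}
```

The trade-off is that the state can change between the decision and the call, so the real code would have to tolerate a stop request that arrives slightly late or is repeated.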
PS: I also posted this on the GitHub issues page, because it seems to be more active (https://github.com/jenkinsci/mesos-plugin/issues/97).