Bug
Resolution: Fixed
Major
Released in: 2.283 (Mar 9, 2021), 2.277.2 (Apr 7, 2021)
support-core-plugin has detected a Deadlock
============== Deadlock Found ==============
"Executor #-1 for master : executing xxxx.xxxx@57a2067d" id=6472210 (0x62c212) state=WAITING cpu=0%
    - waiting on <0x270b04ac> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    - locked <0x270b04ac> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      owned by "Executor #-1 for master" id=6472207 (0x62c20f)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at hudson.model.Queue._withLock(Queue.java:1437)
    at hudson.model.ResourceController.execute(ResourceController.java:81)
    at hudson.model.Executor.run(Executor.java:428)

"Executor #-1 for master" id=6472207 (0x62c20f) state=BLOCKED cpu=0%
    - waiting to lock <0x195cc02c> (a hudson.model.queue.FutureImpl)
      owned by "Executor #-1 for master : executing xxxxxxxx #183" id=6218806 (0x5ee436)
    at hudson.model.queue.FutureImpl.addExecutor(FutureImpl.java:96)
    at hudson.model.queue.WorkUnit.setExecutor(WorkUnit.java:73)
    at hudson.model.Executor$1.call(Executor.java:359)
    at hudson.model.Executor$1.call(Executor.java:346)
    at hudson.model.Queue._withLock(Queue.java:1458)
    at hudson.model.Queue.withLock(Queue.java:1319)
    at hudson.model.Executor.run(Executor.java:346)

"Executor #-1 for master : executing xxxxxxxxx #183" id=6218806 (0x5ee436) state=WAITING cpu=76%
    - waiting on <0x270b04ac> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    - locked <0x270b04ac> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      owned by "Executor #-1 for master" id=6472207 (0x62c20f)
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at hudson.model.Queue.cancel(Queue.java:732)
    at hudson.model.queue.FutureImpl.cancel(FutureImpl.java:82)
My idea (although it is complicated to reproduce this kind of race condition in a unit test):
Thread A calls Queue._withLock and so acquires the lock held in the instance field ReentrantLock lock (https://github.com/jenkinsci/jenkins/blob/e065e79d9b19822593260f9db27d4e5b16939ef3/core/src/main/java/hudson/model/Queue.java#L1381).
Thread B calls FutureImpl.cancel. This method has a synchronized block on the Queue instance (the same instance as above, since Jenkins has a single Queue): https://github.com/jenkinsci/jenkins/blob/e065e79d9b19822593260f9db27d4e5b16939ef3/core/src/main/java/hudson/model/queue/FutureImpl.java#L74
While holding the monitor of the Queue instance, Thread B calls Queue.cancel, which tries to acquire the lock from the instance field, but that lock is already held by Thread A.
Thread A cannot finish and release the lock, because Thread B holds the synchronized monitor on the Queue instance.
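The lock ordering described above can be reproduced in isolation. The following is only a minimal sketch, not Jenkins code: `queueLock` and `monitor` are hypothetical stand-ins for the ReentrantLock field in Queue and for the object used in the synchronized block. Two daemon threads take the two locks in opposite order, and the JDK's `ThreadMXBean.findDeadlockedThreads()` is used to confirm the cycle, much like the thread dump above reports it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderInversionSketch {

    // Signal that this thread holds its first lock, then wait for the other thread.
    static void arriveAndWait(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static boolean contains(long[] ids, long id) {
        for (long candidate : ids) {
            if (candidate == id) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if the JVM reports both threads as deadlocked within 5 seconds. */
    static boolean provokeDeadlock() {
        ReentrantLock queueLock = new ReentrantLock(); // hypothetical stand-in for Queue's lock field
        Object monitor = new Object();                 // hypothetical stand-in for the synchronized object
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Thread A: ReentrantLock first, then the monitor.
        Thread a = new Thread(() -> {
            queueLock.lock();
            try {
                arriveAndWait(bothHoldFirstLock);
                synchronized (monitor) {
                    // never reached: thread B owns the monitor
                }
            } finally {
                queueLock.unlock();
            }
        }, "thread-A");

        // Thread B: monitor first, then the ReentrantLock.
        Thread b = new Thread(() -> {
            synchronized (monitor) {
                arriveAndWait(bothHoldFirstLock);
                queueLock.lock(); // parks forever: thread A owns the lock
                queueLock.unlock();
            }
        }, "thread-B");

        // Daemon threads, so the stuck pair does not keep the JVM alive.
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        b.start();

        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        try {
            while (System.nanoTime() < deadline) {
                long[] ids = bean.findDeadlockedThreads(); // null when no cycle is found
                if (ids != null && contains(ids, a.getId()) && contains(ids, b.getId())) {
                    return true;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + provokeDeadlock());
    }
}
```

The latch only serves to make the interleaving deterministic; in a real instance the same cycle depends on timing, which is why it is hard to cover with a unit test.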
The solution seems to be to remove the synchronized block on the Queue instance here https://github.com/jenkinsci/jenkins/blob/e065e79d9b19822593260f9db27d4e5b16939ef3/core/src/main/java/hudson/model/queue/FutureImpl.java#L74, since Queue already uses a lock.
This looks like a safe change (again, it is not easy to write a unit test that proves it).
The other solution is for the caller to use queue.cancel instead of FutureImpl.cancel.
PR https://github.com/jenkinsci/jenkins/pull/5305
This commit introduced a new strategy using a Lock https://github.com/jenkinsci/jenkins/commit/92147c3597308bc05e6448ccc41409fcc7c05fd7 but did not change the FutureImpl class to stop synchronizing on the Queue instance.
A possible workaround is to use queue.cancel(FutureImpl.task), so that cancellation goes through the Lock from Queue.
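To illustrate the idea behind the workaround, here is a rough sketch in which cancellation only ever takes the queue's own lock. It is not the code from the PR: `SketchQueue`, `SketchFuture`, and `runContended` are hypothetical names, and tasks are plain strings. Because every path acquires a single lock, there is no second lock to take in the wrong order.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

class SingleLockCancelSketch {

    /** Hypothetical stand-in for Queue: all state changes go through one ReentrantLock. */
    static class SketchQueue {
        private final ReentrantLock lock = new ReentrantLock();
        private final Set<String> waiting = new HashSet<>();

        void withLock(Runnable body) {
            lock.lock();
            try {
                body.run();
            } finally {
                lock.unlock();
            }
        }

        void add(String task) {
            withLock(() -> waiting.add(task));
        }

        boolean cancel(String task) {
            lock.lock();
            try {
                return waiting.remove(task);
            } finally {
                lock.unlock();
            }
        }
    }

    /** Hypothetical stand-in for FutureImpl: cancel() holds no monitor of its own. */
    static class SketchFuture {
        private final SketchQueue queue;
        private final String task;

        SketchFuture(SketchQueue queue, String task) {
            this.queue = queue;
            this.task = task;
        }

        boolean cancel() {
            // Delegates straight to the queue, so only the queue's lock is involved.
            return queue.cancel(task);
        }
    }

    /** Runs an adder and a canceller concurrently; true if both finish within the timeout. */
    static boolean runContended(int rounds) {
        SketchQueue queue = new SketchQueue();
        Thread adder = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                queue.add("task-" + i);
            }
        }, "adder");
        Thread canceller = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                new SketchFuture(queue, "task-" + i).cancel();
            }
        }, "canceller");
        adder.setDaemon(true);
        canceller.setDaemon(true);
        adder.start();
        canceller.start();
        try {
            adder.join(5000);
            canceller.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !adder.isAlive() && !canceller.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both threads finished: " + runContended(1000));
    }
}
```

A passing run does not prove the absence of a deadlock in general; the point of the sketch is only that a single lock leaves no ordering to get wrong.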