-
Bug
-
Resolution: Unresolved
-
Critical
-
None
This bug occurs with version 1.37 of the plugin.
EC2Cloud.java has several synchronized methods that can be called from various timers. Our installation heavily utilizes the spot market and we have a high number of nodes in our fleet.
Under load you can easily get into a situation where one thread is terminating an instance and at the same time another is trying to provision a new one.
In this case we get a deadlock: one thread is provisioning (and, while provisioning, also tries to remove the inactive slaves) while another thread is trying to reconnect to the dead slaves.
The deadlock seems to happen when the price of some spot instance type rises above the maximum price we have set; in the AWS console we then see spot requests in open status with price-too-low.
"jenkins.util.Timer [#6]" #73 daemon prio=5 os_prio=0 tid=0x00007ffaf0216800 nid=0x46fa waiting for monitor entry [0x00007ffa74aad000]
java.lang.Thread.State: BLOCKED (on object monitor)
at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:640)
- waiting to lock <0x0000000727baa970> (a hudson.plugins.ec2.AmazonEC2Cloud)
at hudson.plugins.ec2.EC2AbstractSlave.getInstance(EC2AbstractSlave.java:279)
at hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:438)
at hudson.plugins.ec2.EC2AbstractSlave.isAlive(EC2AbstractSlave.java:406)
at hudson.plugins.ec2.EC2SpotSlave.terminate(EC2SpotSlave.java:73)
at hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:346)
at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:123)
at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:85)
at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:43)
at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
at hudson.model.Queue._withLock(Queue.java:1334)
at hudson.model.Queue.withLock(Queue.java:1211)
at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- <0x00000006800423d0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
- <0x0000000682713fc8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
- <0x0000000725d120b8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
"jenkins.util.Timer [#2]" #68 daemon prio=5 os_prio=0 tid=0x00007ffaf8003000 nid=0x46f5 waiting on condition [0x00007ffb24da9000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000682713fc8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
at hudson.model.Queue._withLock(Queue.java:1332)
at hudson.model.Queue.withLock(Queue.java:1211)
at jenkins.model.Nodes.removeNode(Nodes.java:237)
at jenkins.model.Jenkins.removeNode(Jenkins.java:2089)
at hudson.plugins.ec2.EC2Cloud.countCurrentEC2Slaves(EC2Cloud.java:422)
at hudson.plugins.ec2.EC2Cloud.getPossibleNewSlavesCount(EC2Cloud.java:502)
at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:522)
- locked <0x0000000727baa970> (a hudson.plugins.ec2.AmazonEC2Cloud)
at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:551)
at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:714)
at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- <0x0000000680033dd0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
- <0x0000000683447ff8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
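Read together, the two traces show a lock-order inversion: Timer 6 holds the Queue lock <0x0000000682713fc8> and waits for the AmazonEC2Cloud monitor <0x0000000727baa970>, while Timer 2 holds that monitor and parks waiting for the Queue lock. The following is only a minimal sketch of that cycle, not plugin code: `queueLock` and `cloudMonitor` are hypothetical stand-ins for the two locks, and the latch is there just to force the unlucky interleaving. It uses the standard `ThreadMXBean.findDeadlockedThreads()` call to confirm that the JVM sees the cycle.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

final class LockCycleSketch {

    /**
     * Starts two daemon threads that take two locks in opposite order and
     * returns how many threads the JVM reports as deadlocked
     * (0 if none is reported within about five seconds).
     */
    static int reproduce() {
        // Hypothetical stand-in for the lock behind hudson.model.Queue.withLock.
        final ReentrantLock queueLock = new ReentrantLock();
        // Hypothetical stand-in for the AmazonEC2Cloud instance monitor.
        final Object cloudMonitor = new Object();
        // Released once both threads hold their first lock.
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        Thread retention = new Thread(() -> {
            queueLock.lock();                    // like ComputerRetentionWork inside Queue.withLock
            try {
                awaitOther(bothHoldFirstLock);
                synchronized (cloudMonitor) {    // like the synchronized EC2Cloud.connect()
                    // never reached while the other thread holds the monitor
                }
            } finally {
                queueLock.unlock();
            }
        }, "retention-sketch");

        Thread provisioner = new Thread(() -> {
            synchronized (cloudMonitor) {        // like getNewOrExistingAvailableSlave()
                awaitOther(bothHoldFirstLock);
                queueLock.lock();                // like Jenkins.removeNode() -> Queue.withLock
                queueLock.unlock();
            }
        }, "provisioner-sketch");

        // Daemon threads, so the stuck pair does not keep the JVM alive.
        retention.setDaemon(true);
        provisioner.setDaemon(true);
        retention.start();
        provisioner.start();

        // May throw UnsupportedOperationException on JVMs without synchronizer monitoring.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        try {
            for (int i = 0; i < 100; i++) {
                long[] ids = threads.findDeadlockedThreads();
                if (ids != null) {
                    return ids.length;
                }
                Thread.sleep(50);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return 0;
    }

    private static void awaitOther(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + reproduce());
    }
}
```

In the real installation the interleaving is not forced by a latch; it only needs the retention timer and the provisioning timer to overlap, which becomes more likely with many nodes.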
I have attached the complete jstack log.
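One possible way to break the cycle would be to avoid calling anything that needs the Queue lock while the EC2Cloud monitor is held. The sketch below illustrates that idea under assumptions and is not the plugin's implementation: `NarrowMonitorSketch`, `markDead`, `removeDeadNodes` and `retentionCheck` are hypothetical names, and `queueLock` stands in for the lock behind `Queue.withLock`. The provisioning path takes a snapshot under the monitor, releases the monitor, and only then takes the queue lock, so it never holds both locks at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

final class NarrowMonitorSketch {
    // Hypothetical stand-in for the lock behind hudson.model.Queue.withLock.
    private final ReentrantLock queueLock = new ReentrantLock();
    // Nodes found dead; guarded by this object's monitor (the "cloud" monitor).
    private final List<String> deadNodes = new ArrayList<>();
    // Number of nodes removed so far; guarded by queueLock.
    private int removed;

    synchronized void markDead(String node) {
        deadNodes.add(node);
    }

    // Provisioning path: snapshot under the monitor, release it, then take the queue lock.
    int removeDeadNodes() {
        List<String> snapshot;
        synchronized (this) {
            snapshot = new ArrayList<>(deadNodes);
            deadNodes.clear();
        }
        for (String node : snapshot) {
            queueLock.lock();          // the cloud monitor is not held here
            try {
                removed++;             // stands in for the actual node removal
            } finally {
                queueLock.unlock();
            }
        }
        return snapshot.size();
    }

    // Retention path: queue lock first, then the monitor, as in the first trace.
    boolean retentionCheck() {
        queueLock.lock();
        try {
            synchronized (this) {
                return deadNodes.isEmpty();
            }
        } finally {
            queueLock.unlock();
        }
    }

    int removedCount() {
        queueLock.lock();
        try {
            return removed;
        } finally {
            queueLock.unlock();
        }
    }

    /** Runs both paths concurrently; returns the removed count, or -1 if a thread got stuck. */
    static int runConcurrently(int rounds) {
        NarrowMonitorSketch cloud = new NarrowMonitorSketch();
        Thread provisioner = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                cloud.markDead("node-" + i);
                cloud.removeDeadNodes();
            }
        });
        Thread retention = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                cloud.retentionCheck();
            }
        });
        provisioner.setDaemon(true);
        retention.setDaemon(true);
        provisioner.start();
        retention.start();
        try {
            provisioner.join(10_000);
            retention.join(10_000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
        if (provisioner.isAlive() || retention.isAlive()) {
            return -1;
        }
        return cloud.removedCount();
    }

    public static void main(String[] args) {
        System.out.println("removed: " + runConcurrently(1000));
    }
}
```

The trade-off is that the count and the removal are no longer one atomic step, so whether this is acceptable for the slave-count logic in EC2Cloud would need to be checked against the real code.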
[JENKINS-48161] Deadlock caused by synchronized methods in EC2Cloud
Attachment | New: jstack_jenkins.log [ 40464 ] |