Type: Bug
Resolution: Fixed
Priority: Blocker
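This is against version 1.35.
EC2Cloud.java has several synchronized methods that can be called from various timers. getNewOrExistingAvailableSlave() and connect() are the problematic ones in this case. Our installation relies heavily on the spot market, and we have a large number of nodes in our fleet.
Under load, it is easy to get into a situation where one thread is terminating an instance while another is trying to provision a new one. In the thread dump below, the thread terminating a spot slave holds the Queue lock and waits for the EC2Cloud monitor in connect(), while the provisioning thread holds the EC2Cloud monitor in getNewOrExistingAvailableSlave() and waits for the Queue lock. The two threads deadlock, and every request that needs the Queue lock then blocks behind them. The liberal use of synchronized methods in EC2Cloud is not safe. A finer-grained locking strategy, or a move to a lockless one, is advisable.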
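To illustrate what a finer-grained strategy could look like, here is a minimal sketch, not the plugin's actual code. NarrowLockSketch, QUEUE_LOCK, cloudLock and pruneDeadSlaves() are hypothetical names: QUEUE_LOCK stands in for the Jenkins Queue lock and cloudLock for the EC2Cloud monitor. The idea is to hold the cloud lock only while touching in-memory state, and to call into anything that takes the Queue lock only after the cloud lock has been released, so the two locks are never held in the order that produces the cycle.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for EC2Cloud; none of these names are the plugin's real API.
class NarrowLockSketch {
    // Stand-in for the Jenkins Queue lock (the ReentrantLock seen in the dump).
    static final ReentrantLock QUEUE_LOCK = new ReentrantLock();

    // Private lock object instead of synchronized methods on the cloud itself.
    private final Object cloudLock = new Object();
    private final List<String> slaves = new ArrayList<>();
    // Records what was handed to the "queue" side; only touched under QUEUE_LOCK.
    final List<String> removed = new ArrayList<>();

    void addSlave(String id) {
        synchronized (cloudLock) {
            slaves.add(id);
        }
    }

    int slaveCount() {
        synchronized (cloudLock) {
            return slaves.size();
        }
    }

    // Drops slaves whose id starts with "dead-" and returns how many were dropped.
    int pruneDeadSlaves() {
        List<String> dead = new ArrayList<>();
        synchronized (cloudLock) {
            // Only in-memory bookkeeping happens while the cloud lock is held.
            for (Iterator<String> it = slaves.iterator(); it.hasNext();) {
                String id = it.next();
                if (id.startsWith("dead-")) {
                    dead.add(id);
                    it.remove();
                }
            }
        }
        // The cloud lock is released here, so taking the queue lock cannot
        // complete a cloud-lock -> queue-lock wait cycle.
        for (String id : dead) {
            QUEUE_LOCK.lock();
            try {
                removed.add(id); // stand-in for a call such as removeNode(...)
            } finally {
                QUEUE_LOCK.unlock();
            }
        }
        return dead.size();
    }

    public static void main(String[] args) {
        NarrowLockSketch cloud = new NarrowLockSketch();
        cloud.addSlave("live-1");
        cloud.addSlave("dead-2");
        int pruned = cloud.pruneDeadSlaves();
        System.out.println("pruned " + pruned + ", remaining " + cloud.slaveCount());
    }
}
```

The trade-off is that the slave count is a snapshot: another thread may change the list between the two phases, so callers would need to tolerate slightly stale numbers.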
{{------------------------------------------------------------------------------------------------------------
T1 "Handling POST /view/Adhoc/job/admin_FailedSourceReplayRunner/build from xxx.xx.xxx.xx : RequestHandlerThread2247"
– parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
which is held by T2 "EC2 alive slaves monitor thread"
------------------------------------------------------------------------------------------------------------
"Handling POST /view/Adhoc/job/admin_FailedSourceReplayRunner/build from xxx.xx.xxx.xx : RequestHandlerThread2247":
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at hudson.model.Queue.schedule2(Queue.java:556)
at hudson.model.Queue.schedule2(Queue.java:679)
at hudson.model.Queue.schedule(Queue.java:672)
at hudson.model.ParametersDefinitionProperty._doBuild(ParametersDefinitionProperty.java:173)
-------------------------------------------------------------------------------------------------------
T2 "EC2 alive slaves monitor thread"
– waiting to lock <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
which is held by T3 "jenkins.util.Timer 7"
-------------------------------------------------------------------------------------------------------
"EC2 alive slaves monitor thread":
at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:619)
- waiting to lock <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
at hudson.plugins.ec2.EC2SpotSlave.getSpotRequest(EC2SpotSlave.java:114)
at hudson.plugins.ec2.EC2SpotSlave.getInstanceId(EC2SpotSlave.java:155)
at hudson.plugins.ec2.EC2Computer._describeInstanceOnce(EC2Computer.java:165)
at hudson.plugins.ec2.EC2Computer._describeInstance(EC2Computer.java:149)
at hudson.plugins.ec2.EC2Computer.describeInstance(EC2Computer.java:107)
at hudson.plugins.ec2.EC2Computer.getUptime(EC2Computer.java:133)
at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:104)
at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:85)
at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:43)
at hudson.slaves.SlaveComputer$4.run(SlaveComputer.java:717)
at hudson.model.Queue._withLock(Queue.java:1320)
at hudson.model.Queue.withLock(Queue.java:1197)
at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:714)
at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:118)
at hudson.model.AbstractCIBase.access$000(AbstractCIBase.java:44)
at hudson.model.AbstractCIBase$2.run(AbstractCIBase.java:186)
at hudson.model.Queue._withLock(Queue.java:1320)
at hudson.model.Queue.withLock(Queue.java:1197)
at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:169)
at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1338)
at jenkins.model.Nodes$4.run(Nodes.java:219)
at hudson.model.Queue._withLock(Queue.java:1320)
at hudson.model.Queue.withLock(Queue.java:1197)
at jenkins.model.Nodes.removeNode(Nodes.java:210)
at jenkins.model.Jenkins.removeNode(Jenkins.java:1860)
at hudson.plugins.ec2.EC2SpotSlave.terminate(EC2SpotSlave.java:101)
-------------------------------------------------------------------------------------------------------------
T3 "jenkins.util.Timer 7"
– parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
which is held by T2 "EC2 alive slaves monitor thread"
-------------------------------------------------------------------------------------------------------------
"jenkins.util.Timer 7":
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at hudson.model.Queue._withLock(Queue.java:1318)
at hudson.model.Queue.withLock(Queue.java:1197)
at jenkins.model.Nodes.removeNode(Nodes.java:210)
at jenkins.model.Jenkins.removeNode(Jenkins.java:1860)
at hudson.plugins.ec2.EC2Cloud.countCurrentEC2Slaves(EC2Cloud.java:414)
at hudson.plugins.ec2.EC2Cloud.getPossibleNewSlavesCount(EC2Cloud.java:483)
at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:503)
- locked <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:532)
at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:701)
at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:307)
at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:60)
at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:798)}}
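For anyone trying to confirm a cycle like the one above on their own instance, the JVM can report it directly. The following is a rough sketch using the standard java.lang.management API. DeadlockProbe and describeDeadlocks() are made-up names for illustration, and the output format is my own, not that of the tool that produced the dump above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper; not part of Jenkins or the EC2 plugin.
class DeadlockProbe {
    // Returns one line per deadlocked thread, or an empty string if none are found.
    static String describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Covers both object monitors (synchronized) and ownable synchronizers
        // such as ReentrantLock; returns null when no deadlock exists.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info == null) {
                continue; // thread ended between the two calls
            }
            sb.append('"').append(info.getThreadName()).append('"')
              .append(" waits for ").append(info.getLockName())
              .append(" held by \"").append(info.getLockOwnerName()).append("\"\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String report = describeDeadlocks();
        System.out.println(report.isEmpty() ? "no deadlock detected" : report);
    }
}
```

This only helps if it runs inside the affected JVM, for example from a watchdog thread that does not itself need the Queue lock.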
Relates to: JENKINS-45074 spot-fleet plugin deadlocks master when scaling up (Resolved)