Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
Description
This is against ec2-plugin 1.35.
EC2Cloud.java has several synchronized methods that can be called from various timers. getNewOrExistingAvailableSlave() and connect() are the problematic ones in this case. Our installation heavily utilizes the spot market and we have a high number of nodes in our fleet.
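Under load you can easily get into a situation where one thread is terminating an instance while another is trying to provision a new one, and the two deadlock. In the trace below, the terminating thread holds the Jenkins Queue lock and waits for the EC2Cloud monitor in connect(), while the provisioning thread holds that monitor in getNewOrExistingAvailableSlave() and waits for the Queue lock; every other thread that needs the Queue lock then blocks behind them. The liberal use of synchronized methods in EC2Cloud is not safe. A finer-grained locking strategy, or a move to a lock-free strategy, is advisable.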
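To make the "finer-grained locking" suggestion concrete, here is a minimal sketch under stated assumptions. It is not the plugin's code and not the fix that was eventually merged: NarrowLockCloudSketch, clientLock, capacityLock, countLiveNodes and the other names are hypothetical stand-ins. The idea is that connect() only needs a small private lock for lazy client creation, and that node removal (which needs the Queue lock) is deferred until the bookkeeping lock has been released, so no thread ever holds both.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for EC2Cloud; none of these names are the plugin's real API.
final class NarrowLockCloudSketch {
    // Stand-in for the Jenkins Queue lock that appears in the trace.
    private final ReentrantLock queueLock = new ReentrantLock();
    // Guards only the lazy creation of the client.
    private final Object clientLock = new Object();
    // Guards only the node bookkeeping.
    private final Object capacityLock = new Object();

    private volatile Object client; // placeholder for an EC2 client object
    private final List<String> nodes = new ArrayList<>();
    private final List<String> removed = new ArrayList<>();

    // Double-checked lazy init: callers never need the object's own monitor.
    Object connect() {
        Object c = client;
        if (c == null) {
            synchronized (clientLock) {
                c = client;
                if (c == null) {
                    c = new Object();
                    client = c;
                }
            }
        }
        return c;
    }

    void addNode(String name) {
        synchronized (capacityLock) {
            nodes.add(name);
        }
    }

    // Drops terminated nodes from the bookkeeping and returns the live count.
    int countLiveNodes(Collection<String> terminated) {
        List<String> stale = new ArrayList<>();
        int live;
        synchronized (capacityLock) {
            for (Iterator<String> it = nodes.iterator(); it.hasNext();) {
                String name = it.next();
                if (terminated.contains(name)) {
                    it.remove();
                    stale.add(name);
                }
            }
            live = nodes.size();
        }
        // capacityLock is released here, so the queue lock is never
        // acquired while another lock of this class is held.
        for (String name : stale) {
            removeNode(name);
        }
        return live;
    }

    private void removeNode(String name) {
        queueLock.lock();
        try {
            removed.add(name);
        } finally {
            queueLock.unlock();
        }
    }

    List<String> removedNodes() {
        queueLock.lock();
        try {
            return new ArrayList<>(removed);
        } finally {
            queueLock.unlock();
        }
    }
}
```

With this shape, a thread that already holds the queue lock can still call connect() without waiting on whoever is counting capacity, which is the dependency that closes the cycle in this report. Whether the real plugin can defer node removal this way is an assumption that would need checking against its actual invariants.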
{{------------------------------------------------------------------------------------------------------------
T1 "Handling POST /view/Adhoc/job/admin_FailedSourceReplayRunner/build from xxx.xx.xxx.xx : RequestHandlerThread2247"
– parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
which is held by T2 "EC2 alive slaves monitor thread"
------------------------------------------------------------------------------------------------------------
"Handling POST /view/Adhoc/job/admin_FailedSourceReplayRunner/build from xxx.xx.xxx.xx : RequestHandlerThread2247":
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at hudson.model.Queue.schedule2(Queue.java:556)
at hudson.model.Queue.schedule2(Queue.java:679)
at hudson.model.Queue.schedule(Queue.java:672)
at hudson.model.ParametersDefinitionProperty._doBuild(ParametersDefinitionProperty.java:173)
-------------------------------------------------------------------------------------------------------
T2 "EC2 alive slaves monitor thread"
– waiting to lock <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
which is held by T3 "jenkins.util.Timer 7"
-------------------------------------------------------------------------------------------------------
"EC2 alive slaves monitor thread":
at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:619)
- waiting to lock <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
at hudson.plugins.ec2.EC2SpotSlave.getSpotRequest(EC2SpotSlave.java:114)
at hudson.plugins.ec2.EC2SpotSlave.getInstanceId(EC2SpotSlave.java:155)
at hudson.plugins.ec2.EC2Computer._describeInstanceOnce(EC2Computer.java:165)
at hudson.plugins.ec2.EC2Computer._describeInstance(EC2Computer.java:149)
at hudson.plugins.ec2.EC2Computer.describeInstance(EC2Computer.java:107)
at hudson.plugins.ec2.EC2Computer.getUptime(EC2Computer.java:133)
at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:104)
at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:85)
at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:43)
at hudson.slaves.SlaveComputer$4.run(SlaveComputer.java:717)
at hudson.model.Queue._withLock(Queue.java:1320)
at hudson.model.Queue.withLock(Queue.java:1197)
at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:714)
at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:118)
at hudson.model.AbstractCIBase.access$000(AbstractCIBase.java:44)
at hudson.model.AbstractCIBase$2.run(AbstractCIBase.java:186)
at hudson.model.Queue._withLock(Queue.java:1320)
at hudson.model.Queue.withLock(Queue.java:1197)
at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:169)
at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1338)
at jenkins.model.Nodes$4.run(Nodes.java:219)
at hudson.model.Queue._withLock(Queue.java:1320)
at hudson.model.Queue.withLock(Queue.java:1197)
at jenkins.model.Nodes.removeNode(Nodes.java:210)
at jenkins.model.Jenkins.removeNode(Jenkins.java:1860)
at hudson.plugins.ec2.EC2SpotSlave.terminate(EC2SpotSlave.java:101)
-------------------------------------------------------------------------------------------------------------
T3 "jenkins.util.Timer 7"
– parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
which is held by T2 "EC2 alive slaves monitor thread"
-------------------------------------------------------------------------------------------------------------
"jenkins.util.Timer 7":
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at hudson.model.Queue._withLock(Queue.java:1318)
at hudson.model.Queue.withLock(Queue.java:1197)
at jenkins.model.Nodes.removeNode(Nodes.java:210)
at jenkins.model.Jenkins.removeNode(Jenkins.java:1860)
at hudson.plugins.ec2.EC2Cloud.countCurrentEC2Slaves(EC2Cloud.java:414)
at hudson.plugins.ec2.EC2Cloud.getPossibleNewSlavesCount(EC2Cloud.java:483)
at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:503)
- locked <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:532)
at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:701)
at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:307)
at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:60)
at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:798)}}
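The cycle between T2 and T3 in the trace above is a lock-order inversion, and it can be reproduced in isolation. The following is a minimal sketch, not plugin code: the class, thread and variable names are hypothetical, a plain ReentrantLock stands in for the Jenkins Queue lock, and a plain Object stands in for the AmazonEC2Cloud monitor. It uses the standard ThreadMXBean.findDeadlockedThreads() call to confirm the deadlock; the two threads are daemons so the JVM can still exit.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical reproduction of the lock cycle; not taken from the plugin.
final class LockOrderInversionSketch {

    // Starts two daemon threads that take the same two locks in opposite
    // order, then returns how many threads the JVM reports as deadlocked.
    static int reproduce() {
        final ReentrantLock queueLock = new ReentrantLock(); // stand-in for the Queue lock
        final Object cloudMonitor = new Object();            // stand-in for the cloud monitor
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Like T2: holds the queue lock, then needs the cloud monitor.
        Thread terminator = new Thread(() -> {
            queueLock.lock();
            try {
                arriveAndAwait(bothHoldFirstLock);
                synchronized (cloudMonitor) {
                    // never reached once the cycle has formed
                }
            } finally {
                queueLock.unlock();
            }
        }, "sketch-terminator");

        // Like T3: holds the cloud monitor, then needs the queue lock.
        Thread provisioner = new Thread(() -> {
            synchronized (cloudMonitor) {
                arriveAndAwait(bothHoldFirstLock);
                queueLock.lock();
                queueLock.unlock();
            }
        }, "sketch-provisioner");

        terminator.setDaemon(true);
        provisioner.setDaemon(true);
        terminator.start();
        provisioner.start();

        // Poll for up to about five seconds until the JVM sees the cycle.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 100; i++) {
            long[] ids = threads.findDeadlockedThreads();
            if (ids != null) {
                return ids.length;
            }
            pause(50);
        }
        return 0;
    }

    private static void arriveAndAwait(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + reproduce());
    }
}
```

The latch only makes the interleaving deterministic; in production the same cycle forms by timing alone, which is why it shows up under load. Making both code paths acquire the two locks in the same order, or never holding one while requesting the other, removes the cycle.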
Issue Links
- relates to: JENKINS-45074 spot-fleet plugin deadlocks master when scaling up (Resolved)
trose, I saw the PR, but you closed it. I just took the other PR, related to the NPE checking for the instanceId: https://github.com/jenkinsci/ec2-plugin/pull/215
Did you have problems testing https://github.com/jenkinsci/ec2-plugin/pull/214?
Does 215 supersede 214?