Jenkins / JENKINS-37483

Deadlock caused by synchronized methods in EC2Cloud

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Component: ec2-plugin

      This is against version 1.35 of ec2-plugin.

      EC2Cloud.java has several synchronized methods that can be called from various timers. getNewOrExistingAvailableSlave() and connect() are the problematic ones in this case. Our installation makes heavy use of the spot market, and we have a large number of nodes in our fleet.

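      Under load, you can easily get into a situation where one thread is terminating an instance while another is trying to provision a new one. In the thread dump below, the terminating thread holds the Queue lock and waits for the EC2Cloud monitor, while the provisioning thread holds the EC2Cloud monitor and waits for the Queue lock. The liberal use of synchronized methods in EC2Cloud is therefore not safe. A finer-grained locking strategy, or a move to a lock-free strategy, is advisable.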
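
One possible shape for the finer-grained locking suggested above, as a minimal sketch rather than the plugin's actual code: connect() guards only the lazy creation of the client with a private lock, so callers never need the monitor of the cloud object itself. `CloudSketch`, `FakeClient`, and the `Supplier` factory are hypothetical names introduced for illustration; they are not ec2-plugin APIs.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the EC2 client; not the plugin's real type.
final class FakeClient {
    final int id;
    FakeClient(int id) { this.id = id; }
}

// Hypothetical cloud class: connect() does not lock the whole cloud object.
final class CloudSketch {
    private final Object connectLock = new Object(); // guards only client creation
    private volatile FakeClient client;              // safely published once built
    private final Supplier<FakeClient> factory;

    CloudSketch(Supplier<FakeClient> factory) {
        this.factory = factory;
    }

    FakeClient connect() {
        FakeClient c = client;           // fast path: no lock once initialised
        if (c == null) {
            synchronized (connectLock) { // private lock, held only while the client is built
                c = client;              // re-check: another thread may have built it
                if (c == null) {
                    c = factory.get();
                    client = c;
                }
            }
        }
        return c;
    }
}
```

With this layout, a thread that already holds the Queue lock and calls connect() would at most wait for the short client-creation section, not for a provisioning call that holds the cloud monitor.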

      {{------------------------------------------------------------------------------------------------------------
      T1 "Handling POST /view/Adhoc/job/admin_FailedSourceReplayRunner/build from xxx.xx.xxx.xx : RequestHandlerThread2247"
      – parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      which is held by T2 "EC2 alive slaves monitor thread"
      ------------------------------------------------------------------------------------------------------------

      "Handling POST /view/Adhoc/job/admin_FailedSourceReplayRunner/build from xxx.xx.xxx.xx : RequestHandlerThread2247":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
        at hudson.model.Queue.schedule2(Queue.java:556)
        at hudson.model.Queue.schedule2(Queue.java:679)
        at hudson.model.Queue.schedule(Queue.java:672)
        at hudson.model.ParametersDefinitionProperty._doBuild(ParametersDefinitionProperty.java:173)

      -------------------------------------------------------------------------------------------------------
      T2 "EC2 alive slaves monitor thread"
      – waiting to lock <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
      which is held by T3 "jenkins.util.Timer 7"
      -------------------------------------------------------------------------------------------------------

      "EC2 alive slaves monitor thread":
        at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:619)
        - waiting to lock <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
        at hudson.plugins.ec2.EC2SpotSlave.getSpotRequest(EC2SpotSlave.java:114)
        at hudson.plugins.ec2.EC2SpotSlave.getInstanceId(EC2SpotSlave.java:155)
        at hudson.plugins.ec2.EC2Computer._describeInstanceOnce(EC2Computer.java:165)
        at hudson.plugins.ec2.EC2Computer._describeInstance(EC2Computer.java:149)
        at hudson.plugins.ec2.EC2Computer.describeInstance(EC2Computer.java:107)
        at hudson.plugins.ec2.EC2Computer.getUptime(EC2Computer.java:133)
        at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:104)
        at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:85)
        at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:43)
        at hudson.slaves.SlaveComputer$4.run(SlaveComputer.java:717)
        at hudson.model.Queue._withLock(Queue.java:1320)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:714)
        at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:118)
        at hudson.model.AbstractCIBase.access$000(AbstractCIBase.java:44)
        at hudson.model.AbstractCIBase$2.run(AbstractCIBase.java:186)
        at hudson.model.Queue._withLock(Queue.java:1320)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:169)
        at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1338)
        at jenkins.model.Nodes$4.run(Nodes.java:219)
        at hudson.model.Queue._withLock(Queue.java:1320)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at jenkins.model.Nodes.removeNode(Nodes.java:210)
        at jenkins.model.Jenkins.removeNode(Jenkins.java:1860)
        at hudson.plugins.ec2.EC2SpotSlave.terminate(EC2SpotSlave.java:101)

      -------------------------------------------------------------------------------------------------------------
      T3 "jenkins.util.Timer 7"
      – parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      which is held by T2 "EC2 alive slaves monitor thread"
      -------------------------------------------------------------------------------------------------------------

      "jenkins.util.Timer 7":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
        at hudson.model.Queue._withLock(Queue.java:1318)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at jenkins.model.Nodes.removeNode(Nodes.java:210)
        at jenkins.model.Jenkins.removeNode(Jenkins.java:1860)
        at hudson.plugins.ec2.EC2Cloud.countCurrentEC2Slaves(EC2Cloud.java:414)
        at hudson.plugins.ec2.EC2Cloud.getPossibleNewSlavesCount(EC2Cloud.java:483)
        at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:503)
        - locked <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
        at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:532)
        at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:701)
        at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:307)
        at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:60)
        at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:798)}}
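
The cycle in the dump is T2 holding the Queue lock while waiting for the cloud monitor, and T3 holding the cloud monitor while waiting for the Queue lock inside removeNode. One way to break such a cycle, sketched here under assumptions, is to never wait for the queue lock while the cloud lock is held: stale nodes are only recorded during the count and removed after the cloud lock is released. `ProvisionSketch` and all of its members are hypothetical; whether stale-node removal can be deferred this way in the real plugin is an assumption, not something the report confirms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Predicate;

// Hypothetical model of the two locks in the dump; none of these names are plugin APIs.
final class ProvisionSketch {
    private final ReentrantLock queueLock = new ReentrantLock(); // stands in for the Queue lock
    private final Object cloudLock = new Object();               // stands in for the cloud monitor
    private final List<String> nodes = new ArrayList<>();        // guarded by queueLock

    void addNode(String name) {
        queueLock.lock();
        try { nodes.add(name); } finally { queueLock.unlock(); }
    }

    // Needs only the queue lock, like the removeNode call in the trace.
    void removeNode(String name) {
        queueLock.lock();
        try { nodes.remove(name); } finally { queueLock.unlock(); }
    }

    List<String> snapshotNodes() {
        queueLock.lock();
        try { return new ArrayList<>(nodes); } finally { queueLock.unlock(); }
    }

    // Counts live nodes. The queue lock is taken and released before the cloud
    // lock is acquired, and again only after the cloud lock is released, so the
    // two locks are never nested in this method.
    int countLiveNodes(Predicate<String> isStale) {
        List<String> snapshot = snapshotNodes();
        List<String> stale = new ArrayList<>();
        int live = 0;
        synchronized (cloudLock) {
            for (String n : snapshot) {
                if (isStale.test(n)) stale.add(n); else live++;
            }
        }
        for (String n : stale) {
            removeNode(n); // deferred until cloudLock is no longer held
        }
        return live;
    }
}
```

The trade-off is that the count works on a snapshot, so a node added or removed concurrently may not be reflected until the next provisioning pass.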

            francisu Francis Upton
            trose Todd Rose
            Votes: 4
            Watchers: 8