  1. Jenkins
  2. JENKINS-37483

Deadlock caused by synchronized methods in EC2Cloud


Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • ec2-plugin

    Description

      This is against version 1.35 of the ec2-plugin.

      EC2Cloud.java has several synchronized methods that can be called from various timers. getNewOrExistingAvailableSlave() and connect() are the problematic ones in this case. Our installation heavily utilizes the spot market and we have a high number of nodes in our fleet.

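      Under load you can easily get into a situation where one thread is terminating an instance while another is trying to provision a new one. In the dump below, the thread terminating a spot slave (T2) holds the Queue lock and waits for the AmazonEC2Cloud monitor inside connect(), while the provisioning thread (T3) holds that monitor inside getNewOrExistingAvailableSlave() and waits for the Queue lock. Every other thread that needs the Queue lock, such as the build request in T1, then blocks behind them. The liberal use of synchronized methods in EC2Cloud is not safe. A finer-grained locking strategy, or a move to a lockless strategy, is advisable.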
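      As a rough illustration of what a finer-grained strategy could look like, here is a minimal sketch. It is not the plugin's code: NarrowLockSketch, stateLock, pruneAndCount, and QUEUE_LOCK are hypothetical names, and QUEUE_LOCK only stands in for the Jenkins Queue lock. The idea is to guard the object's own bookkeeping with a private lock and to release it before calling anything that takes the Queue lock, so the two locks are never held at the same time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a cloud class; not the plugin's actual code.
class NarrowLockSketch {
    // Stand-in for the Jenkins Queue lock that appears in the thread dump.
    static final ReentrantLock QUEUE_LOCK = new ReentrantLock();

    // Private lock that guards only this object's own bookkeeping.
    private final Object stateLock = new Object();
    private final List<String> slaves = new ArrayList<>();
    // Guarded by QUEUE_LOCK.
    private final List<String> removedFromQueue = new ArrayList<>();

    void addSlave(String id) {
        synchronized (stateLock) {
            slaves.add(id);
        }
    }

    // Drops terminated slaves and returns how many remain.
    int pruneAndCount(Set<String> terminatedIds) {
        List<String> stale = new ArrayList<>();
        int remaining;
        synchronized (stateLock) {
            for (String id : slaves) {
                if (terminatedIds.contains(id)) {
                    stale.add(id);
                }
            }
            slaves.removeAll(stale);
            remaining = slaves.size();
        }
        // stateLock is released here, so taking QUEUE_LOCK cannot form a cycle.
        for (String id : stale) {
            removeNode(id);
        }
        return remaining;
    }

    private void removeNode(String id) {
        if (Thread.holdsLock(stateLock)) {
            throw new IllegalStateException("must not hold stateLock here");
        }
        QUEUE_LOCK.lock();
        try {
            removedFromQueue.add(id);
        } finally {
            QUEUE_LOCK.unlock();
        }
    }

    List<String> removed() {
        QUEUE_LOCK.lock();
        try {
            return new ArrayList<>(removedFromQueue);
        } finally {
            QUEUE_LOCK.unlock();
        }
    }

    public static void main(String[] args) {
        NarrowLockSketch cloud = new NarrowLockSketch();
        cloud.addSlave("i-1");
        cloud.addSlave("i-2");
        int remaining = cloud.pruneAndCount(Collections.singleton("i-2"));
        System.out.println(remaining + " remaining, removed " + cloud.removed());
    }
}
```

      The trade-off is that the count and the node removal are no longer one atomic step, so callers have to tolerate a slightly stale count. Whether that is acceptable for EC2Cloud is a judgment call for the maintainers.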

      {{------------------------------------------------------------------------------------------------------------
      T1 "Handling POST /view/Adhoc/job/admin_FailedSourceReplayRunner/build from xxx.xx.xxx.xx : RequestHandlerThread2247"
      – parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      which is held by T2 "EC2 alive slaves monitor thread"
      ------------------------------------------------------------------------------------------------------------

      "Handling POST /view/Adhoc/job/admin_FailedSourceReplayRunner/build from xxx.xx.xxx.xx : RequestHandlerThread2247":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
        at hudson.model.Queue.schedule2(Queue.java:556)
        at hudson.model.Queue.schedule2(Queue.java:679)
        at hudson.model.Queue.schedule(Queue.java:672)
        at hudson.model.ParametersDefinitionProperty._doBuild(ParametersDefinitionProperty.java:173)

      -------------------------------------------------------------------------------------------------------
      T2 "EC2 alive slaves monitor thread"
      - waiting to lock <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
      which is held by T3 "jenkins.util.Timer 7"
      -------------------------------------------------------------------------------------------------------

      "EC2 alive slaves monitor thread":
        at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:619)
        - waiting to lock <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
        at hudson.plugins.ec2.EC2SpotSlave.getSpotRequest(EC2SpotSlave.java:114)
        at hudson.plugins.ec2.EC2SpotSlave.getInstanceId(EC2SpotSlave.java:155)
        at hudson.plugins.ec2.EC2Computer._describeInstanceOnce(EC2Computer.java:165)
        at hudson.plugins.ec2.EC2Computer._describeInstance(EC2Computer.java:149)
        at hudson.plugins.ec2.EC2Computer.describeInstance(EC2Computer.java:107)
        at hudson.plugins.ec2.EC2Computer.getUptime(EC2Computer.java:133)
        at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:104)
        at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:85)
        at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:43)
        at hudson.slaves.SlaveComputer$4.run(SlaveComputer.java:717)
        at hudson.model.Queue._withLock(Queue.java:1320)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:714)
        at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:118)
        at hudson.model.AbstractCIBase.access$000(AbstractCIBase.java:44)
        at hudson.model.AbstractCIBase$2.run(AbstractCIBase.java:186)
        at hudson.model.Queue._withLock(Queue.java:1320)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:169)
        at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1338)
        at jenkins.model.Nodes$4.run(Nodes.java:219)
        at hudson.model.Queue._withLock(Queue.java:1320)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at jenkins.model.Nodes.removeNode(Nodes.java:210)
        at jenkins.model.Jenkins.removeNode(Jenkins.java:1860)
        at hudson.plugins.ec2.EC2SpotSlave.terminate(EC2SpotSlave.java:101)

      -------------------------------------------------------------------------------------------------------------
      T3 "jenkins.util.Timer 7"
      – parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      which is held by T2 "EC2 alive slaves monitor thread"
      -------------------------------------------------------------------------------------------------------------

      "jenkins.util.Timer 7":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
        at hudson.model.Queue._withLock(Queue.java:1318)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at jenkins.model.Nodes.removeNode(Nodes.java:210)
        at jenkins.model.Jenkins.removeNode(Jenkins.java:1860)
        at hudson.plugins.ec2.EC2Cloud.countCurrentEC2Slaves(EC2Cloud.java:414)
        at hudson.plugins.ec2.EC2Cloud.getPossibleNewSlavesCount(EC2Cloud.java:483)
        at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:503)
        - locked <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
        at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:532)
        at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:701)
        at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:307)
        at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:60)
        at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:798)}}
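      The cycle between T2 and T3 can be reproduced in isolation with two threads that take a ReentrantLock and an object monitor in opposite order. The sketch below is only an illustration: LockCycleRepro, queueLock, and cloud are hypothetical stand-ins, not Jenkins or plugin classes. It uses the standard ThreadMXBean.findDeadlockedThreads() call to confirm the deadlock, and assumes a JVM that supports monitoring of ownable synchronizers (HotSpot does).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical reproduction of the lock cycle; not Jenkins code.
class LockCycleRepro {

    // Returns the ids of the deadlocked threads, or an empty array on timeout.
    static long[] reproduce() {
        final ReentrantLock queueLock = new ReentrantLock(); // stand-in for the Queue lock
        final Object cloud = new Object();                   // stand-in for the cloud monitor
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Like T2: holds the queue lock, then wants the cloud monitor.
        Thread monitorLike = new Thread(() -> {
            queueLock.lock();
            try {
                awaitBoth(bothHoldFirstLock);
                synchronized (cloud) {
                    // never reached while the cycle exists
                }
            } finally {
                queueLock.unlock();
            }
        }, "monitor-like (T2)");

        // Like T3: holds the cloud monitor, then wants the queue lock.
        Thread provisionerLike = new Thread(() -> {
            synchronized (cloud) {
                awaitBoth(bothHoldFirstLock);
                queueLock.lock();
                queueLock.unlock();
            }
        }, "provisioner-like (T3)");

        // Daemon threads so the stuck pair does not keep the JVM alive.
        monitorLike.setDaemon(true);
        provisionerLike.setDaemon(true);
        monitorLike.start();
        provisionerLike.start();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (System.nanoTime() < deadline) {
            long[] ids = threads.findDeadlockedThreads();
            if (ids != null) {
                return ids;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return new long[0];
    }

    // Signals that this thread holds its first lock, then waits for the other one.
    private static void awaitBoth(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long[] ids = reproduce();
        System.out.println("deadlocked threads found: " + ids.length);
    }
}
```

      If both threads took the two locks in the same order, or released the first before asking for the second, the cycle could not form.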

            People

              francisu Francis Upton
              trose Todd Rose