  1. Jenkins
  2. JENKINS-37483

Deadlock caused by synchronized methods in EC2Cloud


Details

    Description

      This is against version 1.35 of the EC2 plugin.

      EC2Cloud.java has several synchronized methods that can be called from various timers. getNewOrExistingAvailableSlave() and connect() are the problematic ones in this case. Our installation heavily utilizes the spot market and we have a high number of nodes in our fleet.

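      Under load you can easily get into a situation where one thread is terminating an instance while another is trying to provision a new one. In the dump below, the terminating thread holds the Jenkins Queue lock and waits for the EC2Cloud monitor, while the provisioning thread holds the EC2Cloud monitor and waits for the Queue lock, so neither can proceed. The liberal use of synchronized methods in EC2Cloud is not safe. A finer-grained locking strategy, or a move to a lockless strategy, is advisable.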
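
One way to read the "finer-grained locking" suggestion is sketched here. This is not the plugin's code and not the fix that was eventually merged: `QueueLockStandIn`, `FineGrainedCloud`, and `LockOrderDemo` are hypothetical names, and the queue lock is a plain `ReentrantLock` standing in for the Jenkins Queue lock. The idea is that the cloud object has no synchronized methods, a small private lock guards only the instance counter, and that lock is never held while the queue lock is being acquired, so the only nesting order is queue lock first, state lock second.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the Jenkins Queue lock (an assumption, not the real API). */
final class QueueLockStandIn {
    private static final ReentrantLock LOCK = new ReentrantLock();

    static void withLock(Runnable action) {
        LOCK.lock();
        try {
            action.run();
        } finally {
            LOCK.unlock();
        }
    }
}

/** Hypothetical cloud with no synchronized methods; stateLock guards only the counter. */
final class FineGrainedCloud {
    private final ReentrantLock stateLock = new ReentrantLock();
    private final int cap;
    private int active;

    FineGrainedCloud(int cap) {
        this.cap = cap;
    }

    /** Provisioning path: queue-locked cleanup runs first, with no cloud lock held. */
    boolean provision() {
        QueueLockStandIn.withLock(() -> { /* stale-node removal would go here */ });
        stateLock.lock();
        try {
            if (active >= cap) {
                return false;
            }
            active++;
            return true;
        } finally {
            stateLock.unlock();
        }
    }

    /** Termination path: runs under the queue lock and takes stateLock only briefly. */
    void terminate() {
        QueueLockStandIn.withLock(() -> {
            stateLock.lock();
            try {
                if (active > 0) {
                    active--;
                }
            } finally {
                stateLock.unlock();
            }
        });
    }

    int active() {
        stateLock.lock();
        try {
            return active;
        } finally {
            stateLock.unlock();
        }
    }
}

final class LockOrderDemo {
    /** Runs one provisioning and one terminating thread; returns true if both finish in time. */
    static boolean runConcurrently(FineGrainedCloud cloud, int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> { for (int i = 0; i < iterations; i++) cloud.provision(); });
        pool.submit(() -> { for (int i = 0; i < iterations; i++) cloud.terminate(); });
        pool.shutdown();
        try {
            return pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        FineGrainedCloud cloud = new FineGrainedCloud(10);
        System.out.println("finished without deadlock: " + runConcurrently(cloud, 10_000));
    }
}
```

Because no code path acquires the queue lock while holding `stateLock`, the cycle between the two locks cannot form. The trade-off is that the capacity check and the cleanup are no longer one atomic step, which may or may not be acceptable for the real plugin.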

      {{------------------------------------------------------------------------------------------------------------
      T1 "Handling POST /view/Adhoc/job/admin_FailedSourceReplayRunner/build from xxx.xx.xxx.xx : RequestHandlerThread2247"
      - parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      which is held by T2 "EC2 alive slaves monitor thread"
      ------------------------------------------------------------------------------------------------------------

      "Handling POST /view/Adhoc/job/admin_FailedSourceReplayRunner/build from xxx.xx.xxx.xx : RequestHandlerThread2247":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
        at hudson.model.Queue.schedule2(Queue.java:556)
        at hudson.model.Queue.schedule2(Queue.java:679)
        at hudson.model.Queue.schedule(Queue.java:672)
        at hudson.model.ParametersDefinitionProperty._doBuild(ParametersDefinitionProperty.java:173)

      -------------------------------------------------------------------------------------------------------
      T2 "EC2 alive slaves monitor thread"
      - waiting to lock <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
      which is held by T3 "jenkins.util.Timer 7"
      -------------------------------------------------------------------------------------------------------

      "EC2 alive slaves monitor thread":
        at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:619)
        - waiting to lock <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
        at hudson.plugins.ec2.EC2SpotSlave.getSpotRequest(EC2SpotSlave.java:114)
        at hudson.plugins.ec2.EC2SpotSlave.getInstanceId(EC2SpotSlave.java:155)
        at hudson.plugins.ec2.EC2Computer._describeInstanceOnce(EC2Computer.java:165)
        at hudson.plugins.ec2.EC2Computer._describeInstance(EC2Computer.java:149)
        at hudson.plugins.ec2.EC2Computer.describeInstance(EC2Computer.java:107)
        at hudson.plugins.ec2.EC2Computer.getUptime(EC2Computer.java:133)
        at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:104)
        at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:85)
        at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:43)
        at hudson.slaves.SlaveComputer$4.run(SlaveComputer.java:717)
        at hudson.model.Queue._withLock(Queue.java:1320)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:714)
        at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:118)
        at hudson.model.AbstractCIBase.access$000(AbstractCIBase.java:44)
        at hudson.model.AbstractCIBase$2.run(AbstractCIBase.java:186)
        at hudson.model.Queue._withLock(Queue.java:1320)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:169)
        at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1338)
        at jenkins.model.Nodes$4.run(Nodes.java:219)
        at hudson.model.Queue._withLock(Queue.java:1320)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at jenkins.model.Nodes.removeNode(Nodes.java:210)
        at jenkins.model.Jenkins.removeNode(Jenkins.java:1860)
        at hudson.plugins.ec2.EC2SpotSlave.terminate(EC2SpotSlave.java:101)

      -------------------------------------------------------------------------------------------------------------
      T3 "jenkins.util.Timer 7"
      - parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      which is held by T2 "EC2 alive slaves monitor thread"
      -------------------------------------------------------------------------------------------------------------

      "jenkins.util.Timer 7":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
        at hudson.model.Queue._withLock(Queue.java:1318)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at jenkins.model.Nodes.removeNode(Nodes.java:210)
        at jenkins.model.Jenkins.removeNode(Jenkins.java:1860)
        at hudson.plugins.ec2.EC2Cloud.countCurrentEC2Slaves(EC2Cloud.java:414)
        at hudson.plugins.ec2.EC2Cloud.getPossibleNewSlavesCount(EC2Cloud.java:483)
        at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:503)
        - locked <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
        at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:532)
        at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:701)
        at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:307)
        at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:60)
        at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:798)}}
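
A cycle like the one in this dump, which mixes an object monitor with a `ReentrantLock`, can also be detected from inside the JVM. The following is a minimal sketch using the standard `java.lang.management` API; the class name `DeadlockProbe` and the output format are assumptions made for illustration, and nothing here is part of Jenkins or the EC2 plugin.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper: reports threads deadlocked on monitors or ownable synchronizers. */
final class DeadlockProbe {
    /** Returns one line per deadlocked thread, or an empty string if no cycle is found. */
    static String describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Covers both synchronized monitors and java.util.concurrent locks; null means no deadlock.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            return "";
        }
        StringBuilder report = new StringBuilder();
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info == null) {
                continue; // thread exited between the two calls
            }
            report.append('"').append(info.getThreadName()).append('"')
                  .append(" waits for ").append(info.getLockName())
                  .append(" held by \"").append(info.getLockOwnerName()).append("\"\n");
        }
        return report.toString();
    }

    public static void main(String[] args) {
        String report = describeDeadlocks();
        System.out.println(report.isEmpty() ? "no deadlock detected" : report);
    }
}
```

Run periodically from a monitoring thread, a probe like this would report the waiting thread, the contended lock, and the owner for each participant, which is the same information summarized by hand in the T1/T2/T3 headers above.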


          Activity

            francisu Francis Upton added a comment -

            trose I saw the PR, but you closed it. I just took the other PR related to the NPE checking for the instanceId. https://github.com/jenkinsci/ec2-plugin/pull/215

            Did you have problems in testing https://github.com/jenkinsci/ec2-plugin/pull/214?

            Does 215 supersede 214?

            trose Todd Rose added a comment -

            iirc, 215 includes the deadlock avoidance fixes that were also in 214.  We've been running our custom version of the plugin for over a year and haven't seen deadlock problems.  Our use cases with Jenkins are a bit non-traditional - we use it to manage a fleet of several hundred instances, so we run into timing and performance issues all the time that nobody else probably ever sees.  Anyway, I think the PR you merged should be ok.

            trose Todd Rose added a comment -

            Note that the PR should fix the first two deadlocks reported in this ticket.  I don't think it will have any effect on the most recent one re: the fleet plugin.

            francisu Francis Upton added a comment -

            Fixed in 1.37 (upcoming)

            doydoy Ben Bullock added a comment -

            Thanks trose - I've found a similar JIRA for the fleet plugin and will monitor there (JENKINS-45074)


            People

              francisu Francis Upton
              trose Todd Rose
              Votes: 4
              Watchers: 8
