Deadlock caused by synchronized methods in EC2Cloud

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component: ec2-plugin
    • None

      This bug occurs with version 1.37 of the plugin.

      EC2Cloud.java has several synchronized methods that can be called from various timers. Our installation makes heavy use of the spot market and we have a large number of nodes in our fleet.

      Under load you can easily get into a situation where one thread is terminating an instance while another is trying to provision a new one.

      In our case the deadlock occurs when one thread is provisioning (provisioning also tries to remove inactive slaves) while another thread is trying to reconnect to dead slaves.

      The deadlock seems to happen when the price of some spot instance type is higher than the maximum we have set; in the AWS console we then see instances in open status with price-too-low.
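The situation described above can be modelled as two threads taking the same two locks in opposite order. The sketch below is only an illustration, not plugin code: `queueLock` stands in for the `ReentrantLock` behind `hudson.model.Queue.withLock` and `cloudMonitor` for the monitor of the `EC2Cloud` instance, as they appear in the stack traces in this report. The class name, thread names, and the polling loop are assumptions made for the example. It uses the standard `ThreadMXBean.findDeadlockedThreads()` call to confirm the cycle.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderInversionDemo {

    /** Starts two daemon threads that lock in opposite order and returns the ids of deadlocked threads. */
    static long[] runAndDetect() throws InterruptedException {
        // Hypothetical stand-in for the lock taken by Queue.withLock.
        ReentrantLock queueLock = new ReentrantLock();
        // Hypothetical stand-in for the monitor used by the synchronized EC2Cloud methods.
        Object cloudMonitor = new Object();
        // Makes sure each thread holds its first lock before asking for the second.
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Retention side: queue lock first, then the cloud monitor.
        Thread retention = new Thread(() -> {
            queueLock.lock();
            try {
                bothHoldFirstLock.countDown();
                await(bothHoldFirstLock);
                synchronized (cloudMonitor) {
                    // never reached once the cycle has formed
                }
            } finally {
                queueLock.unlock();
            }
        }, "retention-check");

        // Provisioning side: cloud monitor first, then the queue lock.
        Thread provisioner = new Thread(() -> {
            synchronized (cloudMonitor) {
                bothHoldFirstLock.countDown();
                await(bothHoldFirstLock);
                queueLock.lock();
                queueLock.unlock();
            }
        }, "provisioner");

        // Daemon threads so the JVM can still exit although they stay blocked.
        retention.setDaemon(true);
        provisioner.setDaemon(true);
        retention.start();
        provisioner.start();

        // Poll the JVM's deadlock detector for up to about five seconds.
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 100; i++) {
            long[] ids = mx.findDeadlockedThreads();
            if (ids != null) {
                return ids;
            }
            Thread.sleep(50);
        }
        return new long[0];
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked threads: " + runAndDetect().length);
    }
}
```

`findDeadlockedThreads()` covers cycles that mix object monitors and ownable synchronizers such as `ReentrantLock`, which is the mix seen here.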

      "jenkins.util.Timer 6" #73 daemon prio=5 os_prio=0 tid=0x00007ffaf0216800 nid=0x46fa waiting for monitor entry [0x00007ffa74aad000]
      java.lang.Thread.State: BLOCKED (on object monitor)
        at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:640)
        - waiting to lock <0x0000000727baa970> (a hudson.plugins.ec2.AmazonEC2Cloud)
        at hudson.plugins.ec2.EC2AbstractSlave.getInstance(EC2AbstractSlave.java:279)
        at hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:438)
        at hudson.plugins.ec2.EC2AbstractSlave.isAlive(EC2AbstractSlave.java:406)
        at hudson.plugins.ec2.EC2SpotSlave.terminate(EC2SpotSlave.java:73)
        at hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:346)
        at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:123)
        at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:85)
        at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:43)
        at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
        at hudson.model.Queue._withLock(Queue.java:1334)
        at hudson.model.Queue.withLock(Queue.java:1211)
        at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
        at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

      Locked ownable synchronizers:
        - <0x00000006800423d0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
        - <0x0000000682713fc8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        - <0x0000000725d120b8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

      "jenkins.util.Timer 2" #68 daemon prio=5 os_prio=0 tid=0x00007ffaf8003000 nid=0x46f5 waiting on condition [0x00007ffb24da9000]
      java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x0000000682713fc8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        at hudson.model.Queue._withLock(Queue.java:1332)
        at hudson.model.Queue.withLock(Queue.java:1211)
        at jenkins.model.Nodes.removeNode(Nodes.java:237)
        at jenkins.model.Jenkins.removeNode(Jenkins.java:2089)
        at hudson.plugins.ec2.EC2Cloud.countCurrentEC2Slaves(EC2Cloud.java:422)
        at hudson.plugins.ec2.EC2Cloud.getPossibleNewSlavesCount(EC2Cloud.java:502)
        at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:522)
        - locked <0x0000000727baa970> (a hudson.plugins.ec2.AmazonEC2Cloud)
        at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:551)
        at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:714)
        at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
        at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
        at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
        at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)

      Locked ownable synchronizers:
        - <0x0000000680033dd0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
        - <0x0000000683447ff8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

      I have attached the complete jstack log.
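One common way to break a cycle like the one in the traces is to avoid calling into the queue lock while the cloud monitor is held. The sketch below is an assumption about how that could look, not the plugin's actual code or fix: the provisioning side only decides which nodes are stale while holding the monitor, and performs the removal (which needs the queue lock) after releasing it. `NarrowedCloudLock`, its methods, and the `stale-` naming convention are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class NarrowedCloudLock {
    private final ReentrantLock queueLock;                   // hypothetical stand-in for the Jenkins queue lock
    private final List<String> nodes = new ArrayList<>();    // guarded by this object's monitor
    private final List<String> removed = new ArrayList<>();  // guarded by queueLock

    NarrowedCloudLock(ReentrantLock queueLock) {
        this.queueLock = queueLock;
    }

    synchronized void addNode(String name) {
        nodes.add(name);
    }

    // Retention side: the caller already holds queueLock when it gets here.
    synchronized int connect() {
        return nodes.size();
    }

    // Provisioning side: decide under the monitor, remove after releasing it.
    int countLiveAndPruneStale() {
        List<String> stale = new ArrayList<>();
        int live;
        synchronized (this) {
            for (String n : nodes) {
                if (n.startsWith("stale-")) {   // hypothetical staleness check
                    stale.add(n);
                }
            }
            nodes.removeAll(stale);
            live = nodes.size();
        }
        for (String n : stale) {
            queueLock.lock();   // the monitor is not held here, so the order is never inverted
            try {
                removed.add(n);
            } finally {
                queueLock.unlock();
            }
        }
        return live;
    }

    // Runs both sides concurrently; returns true if both finish within the timeout.
    static boolean runBothSides(long timeoutMillis) throws InterruptedException {
        ReentrantLock queueLock = new ReentrantLock();
        NarrowedCloudLock cloud = new NarrowedCloudLock(queueLock);
        Thread retention = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                queueLock.lock();
                try {
                    cloud.connect();
                } finally {
                    queueLock.unlock();
                }
            }
        }, "retention-check");
        Thread provisioner = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                cloud.addNode("stale-" + i);
                cloud.addNode("live-" + i);
                cloud.countLiveAndPruneStale();
            }
        }, "provisioner");
        retention.setDaemon(true);
        provisioner.setDaemon(true);
        retention.start();
        provisioner.start();
        retention.join(timeoutMillis);
        provisioner.join(timeoutMillis);
        return !retention.isAlive() && !provisioner.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("finished without deadlock: " + runBothSides(5000));
    }
}
```

The trade-off is that the node list and the set of removed nodes are briefly out of step between the two phases, so real code would need to tolerate a node that has been selected for removal but not yet removed.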
       

       

          [JENKINS-48161] Deadlock caused by synchronized methods in EC2Cloud

          Andrea Vavassori added a comment - Hi, today we updated the plugin from 1.38 to 1.39, but we had to downgrade to 1.38 again because with the new version Jenkins went into deadlock (twice within a few hours).

          Vitaly Gorbunov added a comment - andrea_vavassori, is 1.38 OK for you, or is the problem just less severe than with 1.39? We have deadlocks with 1.36 as well, so I'm wondering whether we should upgrade or downgrade to mitigate the issue.

          Andrea Vavassori added a comment - With 1.38 it is OK if we don't use spot instances. If we use spot instances (it seems only when the spot price is higher than the price we configured as the maximum), we get the deadlock in 1.38 too.

          Alex Taylor added a comment - andrea_vavassori This very much looks like JENKINS-53858 which was resolved in 1.42. Could I get you to compare that JIRA with your current one and see if it is the same thing?

            francisu Francis Upton
            andrea_vavassori Andrea Vavassori
            Votes: 2
            Watchers: 3