  1. Jenkins
  2. JENKINS-37483

Deadlock caused by synchronized methods in EC2Cloud


Details

    Description

      This is against version 1.35 of the EC2 plugin.

      EC2Cloud.java has several synchronized methods that can be called from various timers. getNewOrExistingAvailableSlave() and connect() are the problematic ones in this case. Our installation heavily utilizes the spot market and we have a high number of nodes in our fleet.

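      Under load it is easy to reach a state where one thread is terminating an instance while another is trying to provision a new one. In the dump below, T3 holds the AmazonEC2Cloud monitor inside getNewOrExistingAvailableSlave() and waits for the Queue lock, while T2 holds the Queue lock and waits for the same monitor in connect(). The liberal use of synchronized methods in EC2Cloud is not safe. A finer-grained locking strategy, or a move to a lock-free design, is advisable.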
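One way to picture a finer-grained strategy is sketched below. It is only an illustration under stated assumptions, not the plugin's code and not necessarily what any pull request changed: `CloudSketch`, `queueLock`, `countLiveNodes()` and the `dead-` naming are hypothetical stand-ins for EC2Cloud, the Jenkins Queue lock, countCurrentEC2Slaves() and stale nodes. The point is that node removal, which needs the Queue lock, is deferred until the cloud monitor has been released.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical stand-ins: queueLock plays the role of the Jenkins Queue lock,
 * and the CloudSketch monitor plays the role of the EC2Cloud instance lock.
 */
public class CloudSketch {
    public static final ReentrantLock queueLock = new ReentrantLock();

    private final List<String> nodes = new ArrayList<>();
    private final List<String> removed = new ArrayList<>();

    public CloudSketch(List<String> initial) {
        nodes.addAll(initial);
    }

    /** Stand-in for Jenkins.removeNode(): it needs the queue lock. */
    private void removeNode(String name) {
        queueLock.lock();
        try {
            removed.add(name);
        } finally {
            queueLock.unlock();
        }
    }

    /**
     * Counts live nodes. Dead nodes are only collected while the cloud monitor
     * is held; removeNode() runs after the monitor is released, so this thread
     * never waits for the queue lock while owning the cloud monitor.
     */
    public int countLiveNodes() {
        List<String> dead = new ArrayList<>();
        int live;
        synchronized (this) {
            nodes.removeIf(n -> {
                boolean isDead = n.startsWith("dead-");
                if (isDead) {
                    dead.add(n);
                }
                return isDead;
            });
            live = nodes.size();
        }
        for (String n : dead) {
            removeNode(n); // takes the queue lock with no cloud monitor held
        }
        return live;
    }

    /** Stand-in for connect(): called by a thread that already holds the queue lock. */
    public synchronized String connect() {
        return "client";
    }

    public List<String> removedNodes() {
        return removed;
    }

    public static void main(String[] args) throws InterruptedException {
        CloudSketch cloud = new CloudSketch(List.of("live-1", "dead-2", "live-3"));
        // Retention-check thread: queue lock first, then the cloud monitor.
        Thread retention = new Thread(() -> {
            queueLock.lock();
            try {
                cloud.connect();
            } finally {
                queueLock.unlock();
            }
        });
        retention.start();
        int live = cloud.countLiveNodes();
        retention.join();
        System.out.println(live + " live, removed " + cloud.removedNodes());
    }
}
```

With this shape the only nested acquisition is Queue lock then cloud monitor, so the two threads cannot wait on each other in a cycle.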

      {{------------------------------------------------------------------------------------------------------------
      T1 "Handling POST /view/Adhoc/job/admin_FailedSourceReplayRunner/build from xxx.xx.xxx.xx : RequestHandlerThread2247"
      - parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      which is held by T2 "EC2 alive slaves monitor thread"
      ------------------------------------------------------------------------------------------------------------

      "Handling POST /view/Adhoc/job/admin_FailedSourceReplayRunner/build from xxx.xx.xxx.xx : RequestHandlerThread2247":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
        at hudson.model.Queue.schedule2(Queue.java:556)
        at hudson.model.Queue.schedule2(Queue.java:679)
        at hudson.model.Queue.schedule(Queue.java:672)
        at hudson.model.ParametersDefinitionProperty._doBuild(ParametersDefinitionProperty.java:173)

      -------------------------------------------------------------------------------------------------------
      T2 "EC2 alive slaves monitor thread"
      - waiting to lock <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
      which is held by T3 "jenkins.util.Timer 7"
      -------------------------------------------------------------------------------------------------------

      "EC2 alive slaves monitor thread":
        at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:619)
        - waiting to lock <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
        at hudson.plugins.ec2.EC2SpotSlave.getSpotRequest(EC2SpotSlave.java:114)
        at hudson.plugins.ec2.EC2SpotSlave.getInstanceId(EC2SpotSlave.java:155)
        at hudson.plugins.ec2.EC2Computer._describeInstanceOnce(EC2Computer.java:165)
        at hudson.plugins.ec2.EC2Computer._describeInstance(EC2Computer.java:149)
        at hudson.plugins.ec2.EC2Computer.describeInstance(EC2Computer.java:107)
        at hudson.plugins.ec2.EC2Computer.getUptime(EC2Computer.java:133)
        at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:104)
        at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:85)
        at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:43)
        at hudson.slaves.SlaveComputer$4.run(SlaveComputer.java:717)
        at hudson.model.Queue._withLock(Queue.java:1320)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:714)
        at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:118)
        at hudson.model.AbstractCIBase.access$000(AbstractCIBase.java:44)
        at hudson.model.AbstractCIBase$2.run(AbstractCIBase.java:186)
        at hudson.model.Queue._withLock(Queue.java:1320)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:169)
        at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1338)
        at jenkins.model.Nodes$4.run(Nodes.java:219)
        at hudson.model.Queue._withLock(Queue.java:1320)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at jenkins.model.Nodes.removeNode(Nodes.java:210)
        at jenkins.model.Jenkins.removeNode(Jenkins.java:1860)
        at hudson.plugins.ec2.EC2SpotSlave.terminate(EC2SpotSlave.java:101)

      -------------------------------------------------------------------------------------------------------------
      T3 "jenkins.util.Timer 7"
      - parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      which is held by T2 "EC2 alive slaves monitor thread"
      -------------------------------------------------------------------------------------------------------------

      "jenkins.util.Timer 7":
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x000000060090c078> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
        at hudson.model.Queue._withLock(Queue.java:1318)
        at hudson.model.Queue.withLock(Queue.java:1197)
        at jenkins.model.Nodes.removeNode(Nodes.java:210)
        at jenkins.model.Jenkins.removeNode(Jenkins.java:1860)
        at hudson.plugins.ec2.EC2Cloud.countCurrentEC2Slaves(EC2Cloud.java:414)
        at hudson.plugins.ec2.EC2Cloud.getPossibleNewSlavesCount(EC2Cloud.java:483)
        at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:503)
        - locked <0x000000061ef25978> (a hudson.plugins.ec2.AmazonEC2Cloud)
        at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:532)
        at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:701)
        at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:307)
        at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:60)
        at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:798)}}

          Activity

            trose Todd Rose added a comment -

            I think the quickest fix for this is to make the non-static connect() method synchronize on the class object. connect() is really the only thing that I can see that can be invoked from a lot of different contexts and threads.
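A rough sketch of the idea of taking connect() off the instance monitor follows. It is an assumption-laden illustration rather than the plugin's code: `ConnectSketch`, `connectionLock`, `client` and `provision(Runnable)` are hypothetical names, and a dedicated lock object is used here in place of the class object mentioned above. What it shows is that a caller of connect() no longer has to wait for a thread that holds the instance monitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class ConnectSketch {
    // Dedicated lock that guards only the cached client; a stand-in for
    // locking on something other than the cloud instance monitor.
    private final Object connectionLock = new Object();
    private Object client; // hypothetical cached API client
    public final AtomicInteger created = new AtomicInteger();

    /** Lazily creates the client without touching the instance monitor ("this"). */
    public Object connect() {
        synchronized (connectionLock) {
            if (client == null) {
                client = new Object();
                created.incrementAndGet();
            }
            return client;
        }
    }

    /** Stand-in for provisioning, which still holds the instance monitor. */
    public synchronized void provision(Runnable work) {
        work.run();
    }

    public static void main(String[] args) throws InterruptedException {
        ConnectSketch cloud = new ConnectSketch();
        CountDownLatch holding = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // This thread sits inside provision(), owning the instance monitor.
        Thread provisioner = new Thread(() -> cloud.provision(() -> {
            holding.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        provisioner.start();
        holding.await();

        // connect() proceeds even though the instance monitor is held elsewhere.
        Object c = cloud.connect();
        release.countDown();
        provisioner.join();
        System.out.println("connected while provisioning: " + (c != null));
    }
}
```

The trade-off is that connect() and the remaining synchronized methods are no longer mutually exclusive, so any state they share would need its own protection.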

            trose Todd Rose added a comment - https://github.com/jenkinsci/ec2-plugin/pull/214
            rraboy Randall Raboy added a comment -

            I am seeing the same deadlock in our setup:

            (JVM-related frames omitted)

            Handling POST /cloud/ec2-us-west-2/provision from 172.16.6.210 : RequestHandlerThread[#969] - threadId:76774 - state:WAITING
            stackTrace:
            java.lang.Thread.State: WAITING
            at sun.misc.Unsafe.park(Native Method)
            - waiting to lock <e1f3b58> (a java.util.concurrent.locks.ReentrantLock$NonfairSync) owned by "jenkins.util.Timer [#3]" t@36
            ...
            at hudson.model.Queue._withLock(Queue.java:1307)
            at hudson.model.Queue.withLock(Queue.java:1186)
            at jenkins.model.Nodes.removeNode(Nodes.java:237)
            at jenkins.model.Jenkins.removeNode(Jenkins.java:2084)
            at hudson.plugins.ec2.EC2Cloud.countCurrentEC2Slaves(EC2Cloud.java:420)
            at hudson.plugins.ec2.EC2Cloud.getPossibleNewSlavesCount(EC2Cloud.java:499)
            at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:518)
            - locked <65f5826a> (a hudson.plugins.ec2.AmazonEC2Cloud)
            at hudson.plugins.ec2.EC2Cloud.doProvision(EC2Cloud.java:340)
            ...
            Locked ownable synchronizers:
            - locked <112a6eb5> (a java.util.concurrent.ThreadPoolExecutor$Worker)
            
            jenkins.util.Timer [#3] - threadId:36 - state:BLOCKED
            stackTrace:
            java.lang.Thread.State: BLOCKED
            at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:634)
            - waiting to lock <65f5826a> (a hudson.plugins.ec2.AmazonEC2Cloud) owned by "Handling POST /cloud/ec2-us-west-2/provision from 172.16.6.210 : RequestHandlerThread[#969]" t@76774
            at hudson.plugins.ec2.EC2AbstractSlave.getInstance(EC2AbstractSlave.java:277)
            at hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:429)
            at hudson.plugins.ec2.EC2AbstractSlave.isAlive(EC2AbstractSlave.java:397)
            at hudson.plugins.ec2.EC2SpotSlave.terminate(EC2SpotSlave.java:73)
            at hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:344)
            at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:136)
            at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:85)
            at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:43)
            at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
            at hudson.model.Queue._withLock(Queue.java:1309)
            at hudson.model.Queue.withLock(Queue.java:1186)
            at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
            at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:50)
            ...
            Locked ownable synchronizers:
            - locked <e1f3b58> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
            

            I noticed that this deadlock only happens when I switch from on-demand instances to spot requests. On-demand works well. I see the same deadlock when using the EC2 Fleet plugin.

            Jenkins version 2.32.2
            EC2 plugin: 1.36

            doydoy Ben Bullock added a comment -

            I'm using a spot fleet and the EC2 Fleet plugin to provision slaves, and I am also encountering this deadlock:

             

            Sep 07, 2017 4:28:35 PM jenkins.metrics.api.Metrics$HealthChecker execute
            WARNING: Some health checks are reporting as unhealthy: [thread-deadlock : [jenkins.util.Timer [#3] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@22807dcd (owned by jenkins.util.Timer [#6]):
            at sun.misc.Unsafe.park(Native Method)
            at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
            at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
            at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
            at hudson.model.Queue._withLock(Queue.java:1340)
            at hudson.model.Queue.withLock(Queue.java:1219)
            at jenkins.model.Nodes.removeNode(Nodes.java:237)
            at jenkins.model.Jenkins.removeNode(Jenkins.java:2121)
            at com.amazon.jenkins.ec2fleet.EC2FleetCloud.addNewSlave(EC2FleetCloud.java:360)
            at com.amazon.jenkins.ec2fleet.EC2FleetCloud.updateStatus(EC2FleetCloud.java:318)
            at com.amazon.jenkins.ec2fleet.CloudNanny.doRun(CloudNanny.java:42)
            at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:51)
            at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
            at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:748)
            , AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] [#129] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@22807dcd (owned by jenkins.util.Timer [#6]):
            at sun.misc.Unsafe.park(Native Method)
            at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
            at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
            at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
            at hudson.model.Queue.maintain(Queue.java:1420)
            at hudson.model.Queue$1.call(Queue.java:321)
            at hudson.model.Queue$1.call(Queue.java:318)
            at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:108)
            at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:98)
            at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:110)
            at java.lang.Thread.run(Thread.java:748)
            , jenkins.util.Timer [#6] locked on com.amazon.jenkins.ec2fleet.EC2FleetCloud@3885e4e5 (owned by jenkins.util.Timer [#3]):
            at com.amazon.jenkins.ec2fleet.IdleRetentionStrategy.check(IdleRetentionStrategy.java:38)
            at com.amazon.jenkins.ec2fleet.IdleRetentionStrategy.check(IdleRetentionStrategy.java:15)
            at hudson.slaves.SlaveComputer$4.run(SlaveComputer.java:730)
            at hudson.model.Queue._withLock(Queue.java:1342)
            at hudson.model.Queue.withLock(Queue.java:1219)
            at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:727)
            at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:120)
            at hudson.model.AbstractCIBase.access$000(AbstractCIBase.java:45)
            at hudson.model.AbstractCIBase$2.run(AbstractCIBase.java:192)
            at hudson.model.Queue._withLock(Queue.java:1342)
            at hudson.model.Queue.withLock(Queue.java:1219)
            

            Jenkins version 2.60.2

            EC2 Fleet plugin version 1.1.4
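For anyone who wants to check for this condition outside of the health check, a comparable probe can be sketched with the JDK's own `ThreadMXBean`. This is a minimal sketch, not the metrics plugin's implementation; `DeadlockProbe` and `describeDeadlocks()` are hypothetical names. `findDeadlockedThreads()` covers both object monitors and ownable synchronizers such as `ReentrantLock`, which matters here because the cycle mixes the two.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class DeadlockProbe {

    /** Returns one line per deadlocked thread, or an empty list if there are none. */
    public static List<String> describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> out = new ArrayList<>();

        // Returns null when no threads are deadlocked.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            return out;
        }

        // Request locked monitors and locked ownable synchronizers as well.
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            out.add(info.getThreadName()
                    + " waiting on " + info.getLockName()
                    + " owned by " + info.getLockOwnerName());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> report = describeDeadlocks();
        if (report.isEmpty()) {
            System.out.println("no deadlocked threads");
        } else {
            report.forEach(System.out::println);
        }
    }
}
```

Run inside the affected JVM (for example from a periodic task), this would list the same waiting-on and owned-by pairs that the warning above reports.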

             

            francisu Francis Upton added a comment -

            trose I saw the PR, but you closed it. I just took the other PR related to the NPE checking for the instanceId. https://github.com/jenkinsci/ec2-plugin/pull/215

            Did you have problems in testing https://github.com/jenkinsci/ec2-plugin/pull/214?

            Does 215 supersede 214?

            trose Todd Rose added a comment -

            iirc, 215 includes the deadlock avoidance fixes that were also in 214.  We've been running our custom version of the plugin for over a year and haven't seen deadlock problems.  Our use cases with Jenkins are a bit non-traditional - we use it to manage a fleet of several hundred instances, so we run into timing and performance issues all the time that nobody else probably ever sees.  Anyway, I think the PR you merged should be ok.

            trose Todd Rose added a comment -

            Note that the PR should fix the first two deadlocks reported in this ticket. I don't think it will have any effect on the most recent one, which involves the fleet plugin.

            francisu Francis Upton added a comment -

            Fixed in 1.37 (upcoming)

            doydoy Ben Bullock added a comment -

            Thanks trose - I've found a similar JIRA for the fleet plugin and will monitor there (JENKINS-45074)


            People

              francisu Francis Upton
              trose Todd Rose
              Votes: 4
              Watchers: 8
