• Bug
    • Resolution: Fixed
    • Critical
    • ec2-plugin
    • Jenkins v2.144, ubuntu 14.04, ec2-plugin 1.38, 1.39, 1.40-SNAPSHOT (private-160d794a-masondonahue), 1.40.1

      We seem to be running into an issue about once per day where multiple threads deadlock trying to access and update resources within the EC2 plugin.

      We have several jobs that add substantial numbers of subjobs (~40) to the build queue, and they thus invoke the Pipeline step `ec2 cloud: 'AWS Cloud', template: 'Micro'` several times to preallocate enough EC2 nodes to run them all (though it looks like this behavior will no longer be necessary in ec2-plugin 1.40).

      In addition, manually provisioning a node through the UI or manually deleting a node also seems to have a chance of deadlocking if it happens while the automatic provisioning or unprovisioning process is running.

       

      The following thread dump shows the three deadlocked threads on 1.40-SNAPSHOT (master as of Friday afternoon).

      Warning, the following threads are deadlocked : Handling POST /job/Selenium%20Tests/job/PAID-1256%252Fenable-paid-tests/build from 172.26.3.39 : qtp125130493-18700, jenkins.util.Timer [#3], jenkins.util.Timer [#6]
      
       "jenkins.util.Timer [#3]" daemon prio=5 WAITING
         sun.misc.Unsafe.park(Native Method)
         java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
         java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
         java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
         java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
         java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
         java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
         hudson.model.Queue._withLock(Queue.java:1437)
         hudson.model.Queue.withLock(Queue.java:1300)
         jenkins.model.Nodes.updateNode(Nodes.java:193)
         jenkins.model.Jenkins.updateNode(Jenkins.java:2077)
         hudson.model.Node.save(Node.java:140)
         hudson.util.PersistedList.onModified(PersistedList.java:173)
         hudson.util.PersistedList.replaceBy(PersistedList.java:85)
         hudson.model.Slave.<init>(Slave.java:198)
         hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:134)
         hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49)
         hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42)
         hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:899)
         hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:606)
         hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:578)
         hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:415)
         hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:542)
         hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:557)
         hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
         hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
         hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
         hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
         hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
         jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
         java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
         java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
         java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
         java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
         java.lang.Thread.run(Thread.java:748)
      
       "jenkins.util.Timer [#6]" daemon prio=5 BLOCKED
         hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:671)
         hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:47)
         hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:452)
         hudson.plugins.ec2.EC2AbstractSlave.isAlive(EC2AbstractSlave.java:420)
         hudson.plugins.ec2.EC2OndemandSlave.terminate(EC2OndemandSlave.java:68)
         hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:360)
         hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:126)
         hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:88)
         hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:46)
         hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
         hudson.model.Queue._withLock(Queue.java:1380)
         hudson.model.Queue.withLock(Queue.java:1257)
         hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
         hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
         jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
         java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
         java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
         java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
         java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
         java.lang.Thread.run(Thread.java:748)
      
      "Handling POST /job/Selenium%20Tests/job/PAID-1256%252Fenable-paid-tests/build from 172.26.3.39 : qtp125130493-18700" prio=5 WAITING
        sun.misc.Unsafe.park(Native Method)
        java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
        java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        hudson.model.Queue.schedule2(Queue.java:587)
        hudson.model.Queue.schedule2(Queue.java:713)
        jenkins.model.ParameterizedJobMixIn.doBuild(ParameterizedJobMixIn.java:217)
        jenkins.model.ParameterizedJobMixIn$ParameterizedJob.doBuild(ParameterizedJobMixIn.java:408)
        java.lang.invoke.LambdaForm$DMH/227306521.invokeInterface_L4_V(LambdaForm$DMH)
        java.lang.invoke.LambdaForm$BMH/1196970080.reinvoke(LambdaForm$BMH)
        java.lang.invoke.LambdaForm$MH/457755914.invoker(LambdaForm$MH)
        java.lang.invoke.LambdaForm$MH/145876426.invokeExact_MT(LambdaForm$MH)
        java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
        org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:343)
        org.kohsuke.stapler.Function.bindAndInvoke(Function.java:184)
        org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:117)
        org.kohsuke.stapler.MetaClass$1.doDispatch(MetaClass.java:129)
        org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
        org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:734)
        org.kohsuke.stapler.Stapler.invoke(Stapler.java:864)
        org.kohsuke.stapler.MetaClass$5.doDispatch(MetaClass.java:248)
        org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
        org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:734)
        org.kohsuke.stapler.Stapler.invoke(Stapler.java:864)
        org.kohsuke.stapler.MetaClass$5.doDispatch(MetaClass.java:248)
        org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
        org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:734)
        org.kohsuke.stapler.Stapler.invoke(Stapler.java:864)
        org.kohsuke.stapler.Stapler.invoke(Stapler.java:668)
        org.kohsuke.stapler.Stapler.service(Stapler.java:238)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
        org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1655)
        hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
        org.jenkinsci.plugins.ssegateway.Endpoint$SSEListenChannelFilter.doFilter(Endpoint.java:243)
        hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
        io.jenkins.blueocean.auth.jwt.impl.JwtAuthenticationFilter.doFilter(JwtAuthenticationFilter.java:61)
        hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
        io.jenkins.blueocean.ResourceCacheControl.doFilter(ResourceCacheControl.java:134)
        hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
        jenkins.metrics.impl.MetricsFilter.doFilter(MetricsFilter.java:125)
        hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
        net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:239)
        net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:215)
        net.bull.javamelody.PluginMonitoringFilter.doFilter(PluginMonitoringFilter.java:88)
        org.jvnet.hudson.plugins.monitoring.HudsonMonitoringFilter.doFilter(HudsonMonitoringFilter.java:114)
        hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
        hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:157)
        org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
        hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:99)
        org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
        hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:84)
        hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51)
        hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        jenkins.security.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:117)
        hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125)
        hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        org.acegisecurity.ui.rememberme.RememberMeProcessingFilter.doFilter(RememberMeProcessingFilter.java:142)
        hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271)
        hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:93)
        hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
      

      We upgraded to 1.40-SNAPSHOT after running into similar global deadlocks in 1.38 and 1.39, which I can attach stack dumps for, but since current master has a lot of reworking of the locking code, I'm not sure if they'll be useful.
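      For anyone who wants to check for this condition outside the Jenkins thread dump page, here is a minimal sketch using the standard JDK `ThreadMXBean`. The class name `DeadlockProbe` and the method `describeDeadlocks` are hypothetical, and this is not necessarily how Jenkins itself produces the "threads are deadlocked" warning shown above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Jenkins or the ec2-plugin.
public class DeadlockProbe {

    // Returns a short report of deadlocked threads in the current JVM.
    static String describeDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Covers both object monitors and ownable synchronizers such as
        // ReentrantLock; returns null when no deadlock is found.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            return "no deadlock detected";
        }
        StringBuilder sb = new StringBuilder("deadlocked threads:\n");
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info == null) {
                continue; // thread ended between the two calls
            }
            sb.append(info.getThreadName())
              .append(" waiting on ").append(info.getLockName())
              .append(" owned by ").append(info.getLockOwnerName())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describeDeadlocks());
    }
}
```

      Run inside the affected JVM (for example from a script console or a test harness), this reports which lock each deadlocked thread is waiting on and which thread owns it.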

          [JENKINS-53858] Deadlock on EC2 resources

          FABRIZIO MANFREDI added a comment -

          Yes, with 1.40 pre-warming the instances should no longer be needed; the plugin can bring up 40 nodes per minute (as long as you don't hit the AWS API limits). Anyway, I have started investigating the deadlock.

          Mason Donahue added a comment -

          Just an update with a stackdump from 1.40.1:

          10/4/18 2:34 PM
          
          ===== Threads on dev-jenkins-master-useast1b-01@172.25.33.234 =====
          
          Warning, the following threads are deadlocked : GitHubPushTrigger [#4], jenkins.util.Timer [#2], jenkins.util.Timer [#9]
          
          "GitHubPushTrigger [#4]" prio=5 WAITING
          	sun.misc.Unsafe.park(Native Method)
          	java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          	java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
          	java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
          	java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
          	java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
          	java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
          	hudson.model.Queue.schedule2(Queue.java:587)
          	jenkins.model.ParameterizedJobMixIn.scheduleBuild2(ParameterizedJobMixIn.java:156)
          	jenkins.model.ParameterizedJobMixIn.scheduleBuild(ParameterizedJobMixIn.java:116)
          	jenkins.model.ParameterizedJobMixIn.scheduleBuild(ParameterizedJobMixIn.java:105)
          	com.cloudbees.jenkins.GitHubPushTrigger$1.run(GitHubPushTrigger.java:143)
          	hudson.util.SequentialExecutionQueue$QueueEntry.run(SequentialExecutionQueue.java:119)
          	java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	java.lang.Thread.run(Thread.java:748)
          
          "jenkins.util.Timer [#2]" daemon prio=5 BLOCKED
          	hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:673)
          	hudson.plugins.ec2.EC2AbstractSlave.stop(EC2AbstractSlave.java:314)
          	hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:362)
          	hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:126)
          	hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:88)
          	hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:46)
          	hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
          	hudson.model.Queue._withLock(Queue.java:1380)
          	hudson.model.Queue.withLock(Queue.java:1257)
          	hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
          	hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
          	jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
          	java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
          	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
          	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
          	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	java.lang.Thread.run(Thread.java:748)
          
          "jenkins.util.Timer [#9]" daemon prio=5 WAITING
          	sun.misc.Unsafe.park(Native Method)
          	java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          	java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
          	java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
          	java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
          	java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
          	java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
          	hudson.model.Queue._withLock(Queue.java:1437)
          	hudson.model.Queue.withLock(Queue.java:1300)
          	jenkins.model.Nodes.updateNode(Nodes.java:193)
          	jenkins.model.Jenkins.updateNode(Jenkins.java:2077)
          	hudson.model.Node.save(Node.java:140)
          	hudson.util.PersistedList.onModified(PersistedList.java:173)
          	hudson.util.PersistedList.replaceBy(PersistedList.java:85)
          	hudson.model.Slave.<init>(Slave.java:198)
          	hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:134)
          	hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49)
          	hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42)
          	hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:918)
          	hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:624)
          	hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:572)
          	hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:432)
          	hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:544)
          	hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:559)
          	hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
          	hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
          	hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
          	hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
          	hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
          	jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
          	java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
          	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
          	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
          	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	java.lang.Thread.run(Thread.java:748)
          

          Mason Donahue added a comment -

          And as for the actual objects being held:

          Oct 04, 2018 4:34:47 PM jenkins.metrics.api.Metrics$HealthChecker execute
          WARNING: Some health checks are reporting as unhealthy: [thread-deadlock : [jenkins.util.Timer [#8] locked on hudson.plugins.ec2.AmazonEC2Cloud@61f05bdb (owned by jenkins.util.Timer [#9]):
                 at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:673)
                 at hudson.plugins.ec2.EC2AbstractSlave.stop(EC2AbstractSlave.java:314)
                 at hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:362)
                 at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:126)
                 at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:88)
                 at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:46)
                 at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
                 at hudson.model.Queue._withLock(Queue.java:1380)
                 at hudson.model.Queue.withLock(Queue.java:1257)
                 at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
                 at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
                 at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
                 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
                 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
                 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
                 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                 at java.lang.Thread.run(Thread.java:748)
          , jenkins.util.Timer [#9] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@731b93cc (owned by jenkins.util.Timer [#8]):
                 at sun.misc.Unsafe.park(Native Method)
                 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
                 at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
                 at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
                 at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
                 at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
                 at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
                 at hudson.model.Queue._withLock(Queue.java:1437)
                 at hudson.model.Queue.withLock(Queue.java:1300)
                 at jenkins.model.Nodes.updateNode(Nodes.java:193)
                 at jenkins.model.Jenkins.updateNode(Jenkins.java:2077)
                 at hudson.model.Node.save(Node.java:140)
                 at hudson.util.PersistedList.onModified(PersistedList.java:173)
                 at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
                 at hudson.model.Slave.<init>(Slave.java:198)
                 at hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:134)
                 at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49)
                 at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42)
                 at hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:918)
                 at hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:624)
                 at hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:596)
                 at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:432)
                 at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:544)
                 at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:559)
                 at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
                 at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
                 at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
                 at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
                 at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
                 at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
                 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
                 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
                 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
                 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                 at java.lang.Thread.run(Thread.java:748)
          ]] 
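
          The health check output above reads as a lock-order inversion: Timer [#8] holds the Queue `ReentrantLock` (taken in `Queue.withLock` by `ComputerRetentionWork`) and is blocked on the `AmazonEC2Cloud` monitor, while Timer [#9] holds that monitor and is waiting for the Queue lock inside `Nodes.updateNode`. Below is a minimal sketch of one common remedy, acquiring the two locks in the same order on both paths. All names (`LockOrderSketch`, `queueLock`, `cloudMonitor`, `provision`, `retentionCheck`) are hypothetical stand-ins, and this is not presented as the fix the plugin actually adopted.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-alone sketch; not code from Jenkins or the ec2-plugin.
public class LockOrderSketch {
    // Stand-in for the Jenkins Queue lock (a ReentrantLock in the dump).
    static final ReentrantLock queueLock = new ReentrantLock();
    // Stand-in for the AmazonEC2Cloud instance monitor.
    static final Object cloudMonitor = new Object();
    // Only modified while both locks are held.
    static int completed = 0;

    // Provisioning path: queue lock first, then the cloud monitor.
    static void provision() {
        queueLock.lock();
        try {
            synchronized (cloudMonitor) {
                completed++;
            }
        } finally {
            queueLock.unlock();
        }
    }

    // Retention path: same order, so neither thread can hold one lock
    // while waiting for the other in the opposite order.
    static void retentionCheck() {
        queueLock.lock();
        try {
            synchronized (cloudMonitor) {
                completed++;
            }
        } finally {
            queueLock.unlock();
        }
    }

    // Runs both paths concurrently and returns the number of completed calls.
    static int runBoth(int iterations) {
        completed = 0;
        Thread a = new Thread(() -> {
            for (int i = 0; i < iterations; i++) provision();
        });
        Thread b = new Thread(() -> {
            for (int i = 0; i < iterations; i++) retentionCheck();
        });
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return completed;
    }

    public static void main(String[] args) {
        System.out.println("completed = " + runBoth(10_000));
    }
}
```

          If `provision` instead took `cloudMonitor` first and then `queueLock`, the two threads could each end up holding one lock while waiting for the other, which matches the pattern in the dump. Another option is to avoid calling anything that needs the Queue lock while the cloud monitor is held.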


          FABRIZIO MANFREDI added a comment -

          Are you using multiple cloud configurations? I found a potential race condition.

          From the stack trace, this looks more like a core deadlock.
          oleg_nenashev, what do you think of the deadlock reported by the user?

          Mason Donahue added a comment -

          We are currently using one cloud, but multiple configurations within that (same AMI, but machine types and labels differ).


          Perrin Morrow added a comment -

          We started using our new Jenkins instance on Monday, which is running the very latest of everything. I was seeing this exact same deadlock several times an hour, and having to kill the threads via the JavaMelody monitoring UI when it happened.

          It's happened only two or three times in the last two days, but I think that's because I drastically increased the idle timeouts for all the agents, so it is terminating and launching them far less often.

          It looks like two timer tasks are acquiring the queue lock and EC2Cloud object monitor, but in different order?

          1. the idle check timer task locks the queue while checking for idle agents, then tries to synchronize on the EC2Cloud object when it needs to terminate one.
          2. the node provisioner timer task synchronizes on the EC2Cloud object while it launches an agent, but in the construction of hudson.model.Slave, it tries to acquire the queue lock.
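The opposing lock order described in the two steps above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the plugin's code: `queueLock` and `cloudLock` are hypothetical stand-ins for the Jenkins queue lock and the EC2Cloud monitor, and `tryLock` with a timeout is used only so the example terminates instead of hanging the way the real threads do.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderSketch {

    /** Holds `first`, waits until the peer holds its own first lock, then tries `second`. */
    static boolean holdThenTry(ReentrantLock first, ReentrantLock second,
                               CountDownLatch bothHoldFirst, CountDownLatch bothTried) {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();   // from here on, each thread owns exactly one lock
            // A blocking lock() here would never return; the timeout keeps the sketch finite.
            boolean gotSecond = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (gotSecond) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await();       // keep `first` until the peer has finished its attempt
            return gotSecond;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.unlock();
        }
    }

    /** Returns {idleCheckGotCloudLock, provisionerGotQueueLock}. */
    public static boolean[] runOpposingOrders() {
        ReentrantLock queueLock = new ReentrantLock();  // hypothetical stand-in: queue lock
        ReentrantLock cloudLock = new ReentrantLock();  // hypothetical stand-in: EC2Cloud monitor
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] results = new boolean[2];

        // Idle check: queue lock first, then the cloud lock.
        Thread idleCheck = new Thread(() ->
                results[0] = holdThenTry(queueLock, cloudLock, bothHoldFirst, bothTried));
        // Provisioner: cloud lock first, then the queue lock.
        Thread provisioner = new Thread(() ->
                results[1] = holdThenTry(cloudLock, queueLock, bothHoldFirst, bothTried));
        idleCheck.start();
        provisioner.start();
        try {
            idleCheck.join();
            provisioner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        boolean[] r = runOpposingOrders();
        System.out.println("idle check got cloud lock: " + r[0]);
        System.out.println("provisioner got queue lock: " + r[1]);
    }
}
```

Neither thread obtains its second lock, because each one is waiting on a lock the other still holds. If both paths took the two locks in the same order, the cycle could not form.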


          David Hayes added a comment -

          Seeing this issue also on ec2-plugin 1.40, with a single cloud configuration, on Jenkins 2.138.2.

          "EC2 alive slaves monitor thread" daemon prio=5 BLOCKED
          	hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:638)
          	hudson.plugins.ec2.EC2AbstractSlave.getInstance(EC2AbstractSlave.java:279)
          	hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:438)
          	hudson.plugins.ec2.EC2AbstractSlave.isAlive(EC2AbstractSlave.java:406)
          	hudson.plugins.ec2.EC2SlaveMonitor.execute(EC2SlaveMonitor.java:43)
          	hudson.model.AsyncPeriodicWork$1.run(AsyncPeriodicWork.java:101)
          	java.lang.Thread.run(Thread.java:748)
          "jenkins.util.Timer [#1]" daemon prio=5 BLOCKED
          	hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:638)
          	hudson.plugins.ec2.EC2AbstractSlave.stop(EC2AbstractSlave.java:300)
          	hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:348)
          	hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:123)
          	hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:85)
          	hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:43)
          	hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
          	hudson.model.Queue._withLock(Queue.java:1380)
          	hudson.model.Queue.withLock(Queue.java:1257)
          	hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
          	hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
          	jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
          	java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
          	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
          	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
          	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	java.lang.Thread.run(Thread.java:748)
          


          FABRIZIO MANFREDI added a comment - Can you try this snapshot : https://repo.jenkins-ci.org/snapshots/org/jenkins-ci/plugins/ec2/1.42-SNAPSHOT/ec2-1.42-20181106.195515-1.hpi

          Perrin Morrow added a comment -

          Using that snapshot causes a NullPointerException to be thrown as soon as the plugin tries to launch an instance:

          SEVERE: Timer task hudson.slaves.NodeProvisioner$NodeProvisionerInvoker@6d31793d failed
          java.lang.NullPointerException
                  at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:587)
                  at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:598)
                  at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
                  at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
                  at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
                  at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
                  at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
                  at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                  at java.lang.Thread.run(Thread.java:748) 

          I think that's because slaveCountingLock is transient, so I added this line to readResolve() to reinitialise it after deserialisation:

          slaveCountingLock = new ReentrantLock();

          It's now launching slaves OK in our staging environment but I wasn't able to reproduce the original deadlock problem there anyway. I'm going to try my build of the plugin in our live server (where we were seeing it many times a day) as soon as it's quiet enough to restart.
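The effect described above, a `transient` lock coming back as `null` after deserialisation unless `readResolve()` recreates it, can be reproduced in isolation. The sketch below is a hedged illustration rather than the plugin's code: it uses plain Java serialization to stay self-contained (Jenkins persists its configuration through XStream, so that substitution is an assumption), and the class names `WithoutFix` and `WithFix` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.concurrent.locks.ReentrantLock;

public class TransientLockSketch {

    /** Hypothetical stand-in for a cloud object whose lock is not persisted. */
    public static class WithoutFix implements Serializable {
        private static final long serialVersionUID = 1L;
        // The initializer runs on construction only, not when the object is deserialized.
        protected transient ReentrantLock slaveCountingLock = new ReentrantLock();

        public boolean hasLock() {
            return slaveCountingLock != null;
        }
    }

    /** Same object, with the lock recreated after deserialization. */
    public static class WithFix extends WithoutFix {
        private static final long serialVersionUID = 1L;

        protected Object readResolve() {
            slaveCountingLock = new ReentrantLock();
            return this;
        }
    }

    /** Serializes the object to a byte array and reads it back. */
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T roundTrip(T original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                         new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without readResolve: " + roundTrip(new WithoutFix()).hasLock());
        System.out.println("with readResolve:    " + roundTrip(new WithFix()).hasLock());
    }
}
```

A freshly constructed object always has its lock, which may explain why a problem like this only shows up once a saved configuration is loaded.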

           


          Mason Donahue added a comment -

          I put up another PR to hopefully take care of the ReentrantLock being null and another NPE I saw.


          Simon Weber added a comment -

          I was able to recreate the problem pretty easily in staging, and master + https://github.com/jenkinsci/ec2-plugin/pull/321 has fixed it so far for me. I'm planning to roll it out in prod asap.


          FABRIZIO MANFREDI added a comment -

          I am testing the patch as well, and for now it seems to be working fine, thanks.

          Jason Axley added a comment -

          I'm testing the hotfix as well, pulled from https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/plugins/ec2/1.42-rc815.252b09cce4fe/

          Will see if we get any deadlock issues later today or tomorrow.


          Jason Axley added a comment -

          Been running the hotfix for 48 hours with no recurrence of the deadlock issue.


          FABRIZIO MANFREDI added a comment -

          Are the nodes retired correctly based on the idle timeout? With the last SNAPSHOT this test failed; I am investigating.

          Klaus Schniedergers added a comment -

          For what it's worth - I noticed frequent lockups during mass node provisioning with 1.41 (on Jenkins 2.150).

          I deployed 1.42-rc823.17ad3043e0e0 and haven't seen any more lockups since.

          Nicolas Zin added a comment -

          Same issue here.
          Do you know when the official 1.42 will be released?


          Ashish Sanagar added a comment -

          How do I get the 1.42 release? It's not available in the Jenkins plugin manager updates, and I don't see it available for download either.

          David Frank added a comment -

          We've been experiencing deadlocks similar to this - more so since upgrading to 1.41.

          As some others have asked, is there an ETA on when 1.42 will be released?


          Nicolas Zin added a comment -

          Hi guys,

           

          With one of the snapshot versions of 1.42 (1.42-rc833.4f1c51128070 to be precise), I ran into an issue (not sure if you have already seen it):

          if I run a Groovy script on a remote slave, the slave is killed.
          The script is really simple, so I don't think the script itself is at fault; maybe it is the fact that it is a Groovy script?

          import jenkins.model.Jenkins
          import hudson.plugins.ec2.EC2OndemandSlave

          // Strip the labels from on-demand EC2 slaves that have been up for more than 24 hours
          Jenkins.instance.nodes
                  .grep { it instanceof EC2OndemandSlave }
                  .grep { it.toComputer().uptime > 24 * 1000 * 3600 }
                  .each {
              out.println "Removing labels from ${it.name}"
              it.labelString = ""
              it.save()
          }


          Greg Smith added a comment -

          Hate to pile on, but is there any ETA for a release of 1.42 yet? It's been about a month since this ticket was updated, and the wiki for the plugin says this release is still pending.


          FABRIZIO MANFREDI added a comment -

          It was released yesterday, and I have started to receive feedback.

          Please let me know if you find something that is not working.

          Sam Gleske added a comment -

          Either this bug still exists or I'm experiencing a similar but slightly different bug. Just in case, I filed a new issue with debug details in JENKINS-56986.

          If you feel mine is a duplicate feel free to close my issue and re-open this issue. I have populated new details (I think after reading comments).


          Sam Gleske added a comment -

          False alarm. I closed my issue as a duplicate because I realized my version did not include the deadlock fix.


          Nicolas De Loof added a comment - - edited

          Issue is still present. 

          hudson.plugins.ec2.EC2Cloud#connect() uses double-checked locking, which is known to be broken in Java unless the target field is marked `volatile` (otherwise another thread may observe a stale or partially published value).

          I don't think hudson.plugins.ec2.EC2Cloud#connect(AWSCredentialsProvider, java.net.URL) even needs to be synchronized, as the lazy-init lock is required in the context of an EC2Cloud instance, not the whole class.
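For reference, the double-checked locking idiom with a `volatile` field and a per-instance lock looks roughly like the sketch below. It is an illustration under stated assumptions, not the content of the proposed fix: `LazyConnectSketch`, `initLock`, and the `Object` standing in for the EC2 client are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LazyConnectSketch {
    // Per-instance lock: lazy init is guarded for this object only, not for the whole class.
    private final Object initLock = new Object();
    private final AtomicInteger created = new AtomicInteger();

    // volatile makes the write below safely visible to threads that skip the lock.
    private volatile Object connection;  // hypothetical stand-in for the EC2 client

    public Object connect() {
        Object local = connection;           // first check, without locking
        if (local == null) {
            synchronized (initLock) {
                local = connection;          // second check, under the lock
                if (local == null) {
                    local = new Object();    // the expensive client creation would go here
                    created.incrementAndGet();
                    connection = local;
                }
            }
        }
        return local;
    }

    public int createdCount() {
        return created.get();
    }

    /** Calls connect() from several threads and reports how many clients were created. */
    public static int connectFromManyThreads(int threadCount) {
        LazyConnectSketch cloud = new LazyConnectSketch();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(cloud::connect);
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return cloud.createdCount();
    }

    public static void main(String[] args) {
        System.out.println("clients created: " + connectFromManyThreads(8));
    }
}
```

Once the field is set, callers return without taking any lock, so threads that only need the existing client no longer contend on a monitor.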

           

           

          => proposed fix https://github.com/jenkinsci/ec2-plugin/pull/349 

           

           


          FABRIZIO MANFREDI added a comment - - edited

          Several improvements have been made to the locking, and PR 363 has been merged in 1.44.1.


            thoulen FABRIZIO MANFREDI
            masond Mason Donahue