Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-53858

Deadlock on EC2 resources

    XMLWordPrintable

Details

    • Bug
    • Status: Fixed but Unreleased (View Workflow)
    • Critical
    • Resolution: Fixed
    • ec2-plugin
    • Jenkins v2.144, ubuntu 14.04, ec2-plugin 1.38, 1.39, 1.40-SNAPSHOT (private-160d794a-masondonahue), 1.40.1

    Description

      We seem to be running into an issue about once per day where multiple threads deadlock trying to access and update resources within the EC2 plugin.

      We have several jobs that add substantial numbers of subjobs (~40) to the build queue, and they thus invoke the Pipeline step `ec2 cloud: 'AWS Cloud', template: 'Micro'` several times to preallocate enough EC2 nodes to run them all (though it looks like this behavior will no longer be necessary in ec2-plugin 1.40).

      In addition, it seems that manually provisioning a node through the UI or manually deleting a node has a chance of deadlocking if it runs at the same time as the provisioning or unprovisioning process happens.

       

      The following stacktrace shows the three threads running in 1.40-SNAPSHOT (master as of Friday afternoon).

      Warning, the following threads are deadlocked : Handling POST /job/Selenium%20Tests/job/PAID-1256%252Fenable-paid-tests/build from 172.26.3.39 : qtp125130493-18700, jenkins.util.Timer [#3], jenkins.util.Timer [#6]
      
       "jenkins.util.Timer [#3]" daemon prio=5 WAITING
         sun.misc.Unsafe.park(Native Method)
         java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
         java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
         java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
         java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
         java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
         java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
         hudson.model.Queue._withLock(Queue.java:1437)
         hudson.model.Queue.withLock(Queue.java:1300)
         jenkins.model.Nodes.updateNode(Nodes.java:193)
         jenkins.model.Jenkins.updateNode(Jenkins.java:2077)
         hudson.model.Node.save(Node.java:140)
         hudson.util.PersistedList.onModified(PersistedList.java:173)
         hudson.util.PersistedList.replaceBy(PersistedList.java:85)
         hudson.model.Slave.<init>(Slave.java:198)
         hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:134)
         hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49)
         hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42)
         hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:899)
         hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:606)
         hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:578)
         hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:415)
         hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:542)
         hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:557)
         hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
         hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
         hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
         hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
         hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
         jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
         java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
         java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
         java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
         java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
         java.lang.Thread.run(Thread.java:748) "jenkins.util.Timer [#6]" daemon prio=5 BLOCKED
         hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:671)
         hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:47)
         hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:452)
         hudson.plugins.ec2.EC2AbstractSlave.isAlive(EC2AbstractSlave.java:420)
         hudson.plugins.ec2.EC2OndemandSlave.terminate(EC2OndemandSlave.java:68)
         hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:360)
         hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:126)
         hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:88)
         hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:46)
         hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
         hudson.model.Queue._withLock(Queue.java:1380)
         hudson.model.Queue.withLock(Queue.java:1257)
         hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
         hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
         jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
         java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
         java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
         java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
         java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
         java.lang.Thread.run(Thread.java:748)
      
      "Handling POST /job/Selenium%20Tests/job/PAID-1256%252Fenable-paid-tests/build from 172.26.3.39 : qtp125130493-18700" prio=5 WAITING
        sun.misc.Unsafe.park(Native Method)
        java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
        java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        hudson.model.Queue.schedule2(Queue.java:587)
        hudson.model.Queue.schedule2(Queue.java:713)
        jenkins.model.ParameterizedJobMixIn.doBuild(ParameterizedJobMixIn.java:217)
        jenkins.model.ParameterizedJobMixIn$ParameterizedJob.doBuild(ParameterizedJobMixIn.java:408)
        java.lang.invoke.LambdaForm$DMH/227306521.invokeInterface_L4_V(LambdaForm$DMH)
        java.lang.invoke.LambdaForm$BMH/1196970080.reinvoke(LambdaForm$BMH)
        java.lang.invoke.LambdaForm$MH/457755914.invoker(LambdaForm$MH)
        java.lang.invoke.LambdaForm$MH/145876426.invokeExact_MT(LambdaForm$MH)
        java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
        org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:343)
        org.kohsuke.stapler.Function.bindAndInvoke(Function.java:184)
        org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:117)
        org.kohsuke.stapler.MetaClass$1.doDispatch(MetaClass.java:129)
        org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
        org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:734)
        org.kohsuke.stapler.Stapler.invoke(Stapler.java:864)
        org.kohsuke.stapler.MetaClass$5.doDispatch(MetaClass.java:248)
        org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
        org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:734)
        org.kohsuke.stapler.Stapler.invoke(Stapler.java:864)
        org.kohsuke.stapler.MetaClass$5.doDispatch(MetaClass.java:248)
        org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
        org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:734)
        org.kohsuke.stapler.Stapler.invoke(Stapler.java:864)
        org.kohsuke.stapler.Stapler.invoke(Stapler.java:668)
        org.kohsuke.stapler.Stapler.service(Stapler.java:238)
        javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:865)
        org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1655)
        hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
        org.jenkinsci.plugins.ssegateway.Endpoint$SSEListenChannelFilter.doFilter(Endpoint.java:243)
        hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
        io.jenkins.blueocean.auth.jwt.impl.JwtAuthenticationFilter.doFilter(JwtAuthenticationFilter.java:61)
        hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
        io.jenkins.blueocean.ResourceCacheControl.doFilter(ResourceCacheControl.java:134)
        hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
        jenkins.metrics.impl.MetricsFilter.doFilter(MetricsFilter.java:125)
        hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
        net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:239)
        net.bull.javamelody.MonitoringFilter.doFilter(MonitoringFilter.java:215)
        net.bull.javamelody.PluginMonitoringFilter.doFilter(PluginMonitoringFilter.java:88)
        org.jvnet.hudson.plugins.monitoring.HudsonMonitoringFilter.doFilter(HudsonMonitoringFilter.java:114)
        hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
        hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:157)
        org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
        hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:99)
        org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1642)
        hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:84)
        hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51)
        hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        jenkins.security.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:117)
        hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125)
        hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        org.acegisecurity.ui.rememberme.RememberMeProcessingFilter.doFilter(RememberMeProcessingFilter.java:142)
        hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271)
        hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
        jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:93)
        hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
      

      We upgraded to 1.40-SNAPSHOT after running into similar global deadlocks in 1.38 and 1.39, which I can attach stack dumps for, but since current master has a lot of reworking of the locking code, I'm not sure if they'll be useful.

      Attachments

        Issue Links

          Activity

            masond Mason Donahue created issue -
            thoulen FABRIZIO MANFREDI made changes -
            Field Original Value New Value
            Status Open [ 1 ] In Progress [ 3 ]

            Yes, with the 1.40 should not needed anymore the pre warming the instances, the plugin is able to raise  40 nodes per minutes (if you don't reach the AWS api limits). Anyway I started to investigate the deadlock.

            thoulen FABRIZIO MANFREDI added a comment - Yes, with the 1.40 should not needed anymore the pre warming the instances, the plugin is able to raise  40 nodes per minutes (if you don't reach the AWS api limits). Anyway I started to investigate the deadlock.
            masond Mason Donahue added a comment -

            Just an update with a stackdump from 1.40.1:

            10/4/18 2:34 PM
            
            ===== Threads on dev-jenkins-master-useast1b-01@172.25.33.234 =====
            
            Warning, the following threads are deadlocked : GitHubPushTrigger [#4], jenkins.util.Timer [#2], jenkins.util.Timer [#9]
            
            "GitHubPushTrigger [#4]" prio=5 WAITING
            	sun.misc.Unsafe.park(Native Method)
            	java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
            	java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
            	java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
            	java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
            	java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
            	java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
            	hudson.model.Queue.schedule2(Queue.java:587)
            	jenkins.model.ParameterizedJobMixIn.scheduleBuild2(ParameterizedJobMixIn.java:156)
            	jenkins.model.ParameterizedJobMixIn.scheduleBuild(ParameterizedJobMixIn.java:116)
            	jenkins.model.ParameterizedJobMixIn.scheduleBuild(ParameterizedJobMixIn.java:105)
            	com.cloudbees.jenkins.GitHubPushTrigger$1.run(GitHubPushTrigger.java:143)
            	hudson.util.SequentialExecutionQueue$QueueEntry.run(SequentialExecutionQueue.java:119)
            	java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            	java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	java.lang.Thread.run(Thread.java:748)
            
            "jenkins.util.Timer [#2]" daemon prio=5 BLOCKED
            	hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:673)
            	hudson.plugins.ec2.EC2AbstractSlave.stop(EC2AbstractSlave.java:314)
            	hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:362)
            	hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:126)
            	hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:88)
            	hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:46)
            	hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
            	hudson.model.Queue._withLock(Queue.java:1380)
            	hudson.model.Queue.withLock(Queue.java:1257)
            	hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
            	hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
            	jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
            	java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            	java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
            	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
            	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
            	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	java.lang.Thread.run(Thread.java:748)
            
            "jenkins.util.Timer [#9]" daemon prio=5 WAITING
            	sun.misc.Unsafe.park(Native Method)
            	java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
            	java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
            	java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
            	java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
            	java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
            	java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
            	hudson.model.Queue._withLock(Queue.java:1437)
            	hudson.model.Queue.withLock(Queue.java:1300)
            	jenkins.model.Nodes.updateNode(Nodes.java:193)
            	jenkins.model.Jenkins.updateNode(Jenkins.java:2077)
            	hudson.model.Node.save(Node.java:140)
            	hudson.util.PersistedList.onModified(PersistedList.java:173)
            	hudson.util.PersistedList.replaceBy(PersistedList.java:85)
            	hudson.model.Slave.<init>(Slave.java:198)
            	hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:134)
            	hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49)
            	hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42)
            	hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:918)
            	hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:624)
            	hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:572)
            	hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:432)
            	hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:544)
            	hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:559)
            	hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
            	hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
            	hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
            	hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
            	hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
            	jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
            	java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            	java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
            	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
            	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
            	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	java.lang.Thread.run(Thread.java:748)
            
            masond Mason Donahue added a comment - Just an update with a stackdump from 1.40.1: 10/4/18 2:34 PM ===== Threads on dev-jenkins-master-useast1b-01@172.25.33.234 ===== Warning, the following threads are deadlocked : GitHubPushTrigger [#4], jenkins.util.Timer [#2], jenkins.util.Timer [#9] "GitHubPushTrigger [#4]" prio=5 WAITING sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209) java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285) hudson.model.Queue.schedule2(Queue.java:587) jenkins.model.ParameterizedJobMixIn.scheduleBuild2(ParameterizedJobMixIn.java:156) jenkins.model.ParameterizedJobMixIn.scheduleBuild(ParameterizedJobMixIn.java:116) jenkins.model.ParameterizedJobMixIn.scheduleBuild(ParameterizedJobMixIn.java:105) com.cloudbees.jenkins.GitHubPushTrigger$1.run(GitHubPushTrigger.java:143) hudson.util.SequentialExecutionQueue$QueueEntry.run(SequentialExecutionQueue.java:119) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) java.util.concurrent.FutureTask.run(FutureTask.java:266) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) java.lang. Thread .run( Thread .java:748) "jenkins.util.Timer [#2]" daemon prio=5 BLOCKED hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:673) hudson.plugins.ec2.EC2AbstractSlave.stop(EC2AbstractSlave.java:314) hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:362) hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:126) hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:88) hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:46) hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72) hudson.model.Queue._withLock(Queue.java:1380) hudson.model.Queue.withLock(Queue.java:1257) hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63) hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) java.lang. Thread .run( Thread .java:748) "jenkins.util.Timer [#9]" daemon prio=5 WAITING sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209) java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285) hudson.model.Queue._withLock(Queue.java:1437) hudson.model.Queue.withLock(Queue.java:1300) jenkins.model.Nodes.updateNode(Nodes.java:193) jenkins.model.Jenkins.updateNode(Jenkins.java:2077) hudson.model.Node.save(Node.java:140) hudson.util.PersistedList.onModified(PersistedList.java:173) hudson.util.PersistedList.replaceBy(PersistedList.java:85) hudson.model.Slave.<init>(Slave.java:198) hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:134) hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49) hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42) hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:918) hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:624) hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:572) hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:432) hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:544) hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:559) hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715) hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320) hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61) hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809) hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) java.lang. Thread .run( Thread .java:748)
            masond Mason Donahue added a comment -

            And as for the actual objects being held:

            Oct 04, 2018 4:34:47 PM jenkins.metrics.api.Metrics$HealthChecker execute
            WARNING: Some health checks are reporting as unhealthy: [thread-deadlock : [jenkins.util.Timer [#8] locked on hudson.plugins.ec2.AmazonEC2Cloud@61f05bdb (owned by jenkins.util.Timer [#9]):
                   at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:673)
                   at hudson.plugins.ec2.EC2AbstractSlave.stop(EC2AbstractSlave.java:314)
                   at hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:362)
                   at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:126)
                   at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:88)
                   at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:46)
                   at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
                   at hudson.model.Queue._withLock(Queue.java:1380)
                   at hudson.model.Queue.withLock(Queue.java:1257)
                   at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
                   at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
                   at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
                   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
                   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
                   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
                   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                   at java.lang.Thread.run(Thread.java:748)
            , jenkins.util.Timer [#9] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@731b93cc (owned by jenkins.util.Timer [#8]):
                   at sun.misc.Unsafe.park(Native Method)
                   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
                   at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
                   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
                   at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
                   at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
                   at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
                   at hudson.model.Queue._withLock(Queue.java:1437)
                   at hudson.model.Queue.withLock(Queue.java:1300)
                   at jenkins.model.Nodes.updateNode(Nodes.java:193)
                   at jenkins.model.Jenkins.updateNode(Jenkins.java:2077)
                   at hudson.model.Node.save(Node.java:140)
                   at hudson.util.PersistedList.onModified(PersistedList.java:173)
                   at hudson.util.PersistedList.replaceBy(PersistedList.java:85)
                   at hudson.model.Slave.<init>(Slave.java:198)
                   at hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:134)
                   at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49)
                   at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42)
                   at hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:918)
                   at hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:624)
                   at hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:596)
                   at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:432)
                   at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:544)
                   at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:559)
                   at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
                   at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
                   at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
                   at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
                   at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
                   at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
                   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
                   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
                   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
                   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                   at java.lang.Thread.run(Thread.java:748)
            ]] 
            masond Mason Donahue added a comment - And as for the actual objects being held: Oct 04, 2018 4:34:47 PM jenkins.metrics.api.Metrics$HealthChecker execute WARNING: Some health checks are reporting as unhealthy: [thread-deadlock : [jenkins.util.Timer [#8] locked on hudson.plugins.ec2.AmazonEC2Cloud@61f05bdb (owned by jenkins.util.Timer [#9]): at hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:673) at hudson.plugins.ec2.EC2AbstractSlave.stop(EC2AbstractSlave.java:314) at hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:362) at hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:126) at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:88) at hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:46) at hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72) at hudson.model.Queue._withLock(Queue.java:1380) at hudson.model.Queue.withLock(Queue.java:1257) at hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang. Thread .run( Thread .java:748) , jenkins.util.Timer [#9] locked on java.util.concurrent.locks.ReentrantLock$NonfairSync@731b93cc (owned by jenkins.util.Timer [#8]): at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199) at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285) at hudson.model.Queue._withLock(Queue.java:1437) at hudson.model.Queue.withLock(Queue.java:1300) at jenkins.model.Nodes.updateNode(Nodes.java:193) at jenkins.model.Jenkins.updateNode(Jenkins.java:2077) at hudson.model.Node.save(Node.java:140) at hudson.util.PersistedList.onModified(PersistedList.java:173) at hudson.util.PersistedList.replaceBy(PersistedList.java:85) at hudson.model.Slave.<init>(Slave.java:198) at hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:134) at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:49) at hudson.plugins.ec2.EC2OndemandSlave.<init>(EC2OndemandSlave.java:42) at hudson.plugins.ec2.SlaveTemplate.newOndemandSlave(SlaveTemplate.java:918) at hudson.plugins.ec2.SlaveTemplate.toSlaves(SlaveTemplate.java:624) at hudson.plugins.ec2.SlaveTemplate.provisionOndemand(SlaveTemplate.java:596) at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:432) at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:544) at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:559) at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715) at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320) at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61) at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang. Thread .run( Thread .java:748) ]]

            You using multiple cloud configuration ? I found a potential race condition .

            From the stack trace seems more a core deadlock
            oleg_nenashev What do you think of the deadlock reported by the user ? 
             

            thoulen FABRIZIO MANFREDI added a comment - You using multiple cloud configuration ? I found a potential race condition . From the stack trace seems more a core deadlock oleg_nenashev What do you think of the deadlock reported by the user ?   
            masond Mason Donahue added a comment -

            We are currently using one cloud, but multiple configurations within that (same AMI, but machine types and labels differ).

            masond Mason Donahue added a comment - We are currently using one cloud, but multiple configurations within that (same AMI, but machine types and labels differ).
            masond Mason Donahue made changes -
            Environment Jenkins v2.144, ubuntu 14.04, ec2-plugin 1.38, 1.39, 1.40-SNAPSHOT (private-160d794a-masondonahue) Jenkins v2.144, ubuntu 14.04, ec2-plugin 1.38, 1.39, 1.40-SNAPSHOT (private-160d794a-masondonahue), 1.40.1
            perrin Perrin Morrow added a comment -

            We started using our new Jenkins instance on Monday, which is running the very latest of everything. I was seeing this exact same deadlock several times an hour, and having to kill the threads via the JavaMelody monitoring UI when it happened.

            It's happened only two or three times in the last two days, but I think that's because I drastically increased the idle timeouts for all the agents, so it is terminating and launching them far less often.

            It looks like two timer tasks are acquiring the queue lock and EC2Cloud object monitor, but in different order?

            1. the idle check timer task locks the queue while checking for idle agents, then tries to synchronize on the EC2Cloud object when it needs to terminate one.
            2. the node provisioner timer task synchronizes on the EC2Cloud object while it launches an agent, but in the construction of hudson.util.Slave, it tries to acquire the queue lock.
            perrin Perrin Morrow added a comment - We started using our new Jenkins instance on Monday, which is running the very latest of everything. I was seeing this exact same deadlock several times an hour, and having to kill the threads via the JavaMelody monitoring UI when it happened. It's happened only two or three times in the last two days, but I think that's because I drastically increased the idle timeouts for all the agents, so it is terminating and launching them far less often. It looks like two timer tasks are acquiring the queue lock and EC2Cloud object monitor, but in different order? the idle check timer task locks the queue while checking for idle agents, then tries to synchronize on the EC2Cloud object when it needs to terminate one. the node provisioner timer task synchronizes on the EC2Cloud object while it launches an agent, but in the construction of hudson.util.Slave, it tries to acquire the queue lock.
            evidex David Hayes added a comment -

            Seeing this issue also on ec2-plugin 1.40, with a single cloud configuration, on Jenkins 2.138.2.

             

            "EC2 alive slaves monitor thread" daemon prio=5 BLOCKED"EC2 alive slaves monitor thread" daemon prio=5 BLOCKED hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:638) hudson.plugins.ec2.EC2AbstractSlave.getInstance(EC2AbstractSlave.java:279) hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:438) hudson.plugins.ec2.EC2AbstractSlave.isAlive(EC2AbstractSlave.java:406) hudson.plugins.ec2.EC2SlaveMonitor.execute(EC2SlaveMonitor.java:43) hudson.model.AsyncPeriodicWork$1.run(AsyncPeriodicWork.java:101) java.lang.Thread.run(Thread.java:748) 
            "jenkins.util.Timer [#1]" daemon prio=5 BLOCKED
            	hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:638)
            	hudson.plugins.ec2.EC2AbstractSlave.stop(EC2AbstractSlave.java:300)
            	hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:348)
            	hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:123)
            	hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:85)
            	hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:43)
            	hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72)
            	hudson.model.Queue._withLock(Queue.java:1380)
            	hudson.model.Queue.withLock(Queue.java:1257)
            	hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63)
            	hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
            	jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
            	java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            	java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
            	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
            	java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
            	java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            	java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            	java.lang.Thread.run(Thread.java:748)
            
            evidex David Hayes added a comment - Seeing this issue also on ec2-plugin 1.40, with a single cloud configuration, on Jenkins 2.138.2.   "EC2 alive slaves monitor thread" daemon prio=5 BLOCKED "EC2 alive slaves monitor thread" daemon prio=5 BLOCKED hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:638) hudson.plugins.ec2.EC2AbstractSlave.getInstance(EC2AbstractSlave.java:279) hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:438) hudson.plugins.ec2.EC2AbstractSlave.isAlive(EC2AbstractSlave.java:406) hudson.plugins.ec2.EC2SlaveMonitor.execute(EC2SlaveMonitor.java:43) hudson.model.AsyncPeriodicWork$1.run(AsyncPeriodicWork.java:101) java.lang. Thread .run( Thread .java:748) "jenkins.util.Timer [#1]" daemon prio=5 BLOCKED hudson.plugins.ec2.EC2Cloud.connect(EC2Cloud.java:638) hudson.plugins.ec2.EC2AbstractSlave.stop(EC2AbstractSlave.java:300) hudson.plugins.ec2.EC2AbstractSlave.idleTimeout(EC2AbstractSlave.java:348) hudson.plugins.ec2.EC2RetentionStrategy.internalCheck(EC2RetentionStrategy.java:123) hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:85) hudson.plugins.ec2.EC2RetentionStrategy.check(EC2RetentionStrategy.java:43) hudson.slaves.ComputerRetentionWork$1.run(ComputerRetentionWork.java:72) hudson.model.Queue._withLock(Queue.java:1380) hudson.model.Queue.withLock(Queue.java:1257) hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:63) hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) java.lang. Thread .run( Thread .java:748)
            thoulen FABRIZIO MANFREDI added a comment - Can you try this snapshot : https://repo.jenkins-ci.org/snapshots/org/jenkins-ci/plugins/ec2/1.42-SNAPSHOT/ec2-1.42-20181106.195515-1.hpi
            perrin Perrin Morrow added a comment -

            Using that snapshot causes a NullPointerException to be thrown as soon as the plugin tries to launch an instance:

            SEVERE: Timer task hudson.slaves.NodeProvisioner$NodeProvisionerInvoker@6d31793d failed
            java.lang.NullPointerException
                    at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:587)
                    at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:598)
                    at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
                    at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
                    at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61)
                    at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809)
                    at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72)
                    at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
                    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
                    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
                    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
                    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                    at java.lang.Thread.run(Thread.java:748) 

            I think that's because slaveCountingLock is transient, so I added this line to readResolve() to reinitialise it after deserialisation:

            slaveCountingLock = new ReentrantLock();

            It's now launching slaves OK in our staging environment but I wasn't able to reproduce the original deadlock problem there anyway. I'm going to try my build of the plugin in our live server (where we were seeing it many times a day) as soon as it's quiet enough to restart.

             

            perrin Perrin Morrow added a comment - Using that snapshot causes a NullPointerException  to be thrown as soon as the plugin tries to launch an instance: SEVERE: Timer task hudson.slaves.NodeProvisioner$NodeProvisionerInvoker@6d31793d failed java.lang.NullPointerException at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:587) at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:598) at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715) at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320) at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:61) at hudson.slaves.NodeProvisioner$NodeProvisionerInvoker.doRun(NodeProvisioner.java:809) at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:72) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) I think that's because slaveCountingLock is transient, so I added this line to readResolve() to reinitialise it after deserialisation: slaveCountingLock = new ReentrantLock(); It's now launching slaves OK in our staging environment but I wasn't able to reproduce the original deadlock problem there anyway. I'm going to try my build of the plugin in our live server (where we were seeing it many times a day) as soon as it's quiet enough to restart.  
            masond Mason Donahue added a comment -

            I put up another PR to hopefully take care of the reentrantlock being null and another NPE I saw.

            masond Mason Donahue added a comment - I put up another PR to hopefully take care of the reentrantlock being null and another NPE I saw.
            simonmweber Simon Weber added a comment -

            I was able to recreate the problem pretty easily in staging, and master + https://github.com/jenkinsci/ec2-plugin/pull/321 has fixed it so far for me. I'm planning to roll it out in prod asap.

            simonmweber Simon Weber added a comment - I was able to recreate the problem pretty easily in staging, and master + https://github.com/jenkinsci/ec2-plugin/pull/321 has fixed it so far for me. I'm planning to roll it out in prod asap.

            I am testing as well the patch and for now seems working fine, thanks.

            thoulen FABRIZIO MANFREDI added a comment - I am testing as well the patch and for now seems working fine, thanks.
            jaxley Jason Axley added a comment -

            I'm testing the hotfix as well, pulled from https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/plugins/ec2/1.42-rc815.252b09cce4fe/

            Will see if we get any deadlock issues later today or tomorrow.

            jaxley Jason Axley added a comment - I'm testing the hotfix as well, pulled from https://repo.jenkins-ci.org/incrementals/org/jenkins-ci/plugins/ec2/1.42-rc815.252b09cce4fe/ Will see if we get any deadlock issues later today or tomorrow.
            jaxley Jason Axley added a comment -

            Been running the hotfix for 48 hours with no recurrence of the deadlock issue.

            jaxley Jason Axley added a comment - Been running the hotfix for 48 hours with no recurrence of the deadlock issue.

            Are the node retired correctly base on the idle timeout ?  with the last SNAPSHOT this test failed, I am investigating.

            thoulen FABRIZIO MANFREDI added a comment - Are the node retired correctly base on the idle timeout ?  with the last SNAPSHOT this test failed, I am investigating.

            For what it's worth - I noticed frequent lockups during mass node provisioning with 1.41 (on jenkins 2.150).

            I deployed 1.42-rc823.17ad3043e0e0 and haven't seen any more lockups since.

            schniedergers Klaus Schniedergers added a comment - For what it's worth - I noticed frequent lockups during mass node provisioning with 1.41 (on jenkins 2.150). I deployed 1.42-rc823.17ad3043e0e0 and haven't seen any more lockups since.
            nzin Nicolas Zin added a comment -

            Same issue here.
            Do you know where the official 1.42 will be release?

            nzin Nicolas Zin added a comment - Same issue here. Do you know where the official 1.42 will be release?

            How to get 1.42 release? its not available on jenkins plugin manager updates. and dont see available for download as well.

            ashish_sanagar Ashish Sanagar added a comment - How to get 1.42 release? its not available on jenkins plugin manager updates. and dont see available for download as well.
            dfrank David Frank added a comment -

            We've been experiencing deadlocks similar to this - more so since upgrading to 1.41.

            As some others have asked, is there an ETA on when 1.42 will be released?

            dfrank David Frank added a comment - We've been experiencing deadlocks similar to this - more so since upgrading to 1.41. As some others have asked, is there an ETA on when 1.42 will be released?
            nzin Nicolas Zin added a comment -

            Hi guys,

             

            with one of the snapshot version of the 1.42 (1.42-rc833.4f1c51128070 to be precise), I got an issue (not sure if you already face it):

            if I run a groovy script on a remote slave, the slave is killed.
            The script is really simple, so I dont think it is the script that is at risk, maybe more the fact that it is a groovy script?

            import jenkins.model.Jenkins
            import hudson.plugins.ec2.EC2OndemandSlaveJenkins.instance.nodes
                    .grep { it instanceof EC2OndemandSlave }
                    .grep { it.toComputer().uptime > 24 * 1000 * 3600 }
                    .each {
                out.println "Removing labels from ${it.name}"
                it.labelString = ""
                it.save()
            } 
            nzin Nicolas Zin added a comment - Hi guys,   with one of the snapshot version of the 1.42 (1.42-rc833.4f1c51128070 to be precise), I got an issue (not sure if you already face it): if I run a groovy script on a remote slave, the slave is killed. The script is really simple, so I dont think it is the script that is at risk, maybe more the fact that it is a groovy script? import jenkins.model.Jenkins import hudson.plugins.ec2.EC2OndemandSlaveJenkins.instance.nodes .grep { it instanceof EC2OndemandSlave } .grep { it.toComputer().uptime > 24 * 1000 * 3600 } .each { out.println "Removing labels from ${it.name}" it.labelString = "" it.save() }
            sverhoef Stefan Verhoeff made changes -
            Link This issue is related to JENKINS-54187 [ JENKINS-54187 ]
            gregcovertsmith Greg Smith added a comment -

            Hate to pile on, but is there any ETA for a release of 1.42 yet?  Its been about a month since this ticket was updated, and the wiki for the plugin says this release is still pending.

            gregcovertsmith Greg Smith added a comment - Hate to pile on, but is there any ETA for a release of 1.42 yet?  Its been about a month since this ticket was updated, and the wiki for the plugin says this release is still pending.

            It has been released yesterday, I started to received the feedback. 

            Please let me know if you find something is not working.

            thoulen FABRIZIO MANFREDI added a comment - It has been released yesterday, I started to received the feedback.  Please let me know if you find something is not working.
            thoulen FABRIZIO MANFREDI made changes -
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]
            sag47 Sam Gleske made changes -
            Link This issue is related to JENKINS-56986 [ JENKINS-56986 ]
            sag47 Sam Gleske added a comment -

            Either this bug still exists or maybe I'm experiencing a slightly different bug but similar. Just in case, I filed a new issue with debug details in JENKINS-56986.

            If you feel mine is a duplicate feel free to close my issue and re-open this issue. I have populated new details (I think after reading comments).

            sag47 Sam Gleske added a comment - Either this bug still exists or maybe I'm experiencing a slightly different bug but similar. Just in case, I filed a new issue with debug details in JENKINS-56986 . If you feel mine is a duplicate feel free to close my issue and re-open this issue. I have populated new details (I think after reading comments).
            sag47 Sam Gleske made changes -
            Link This issue is duplicated by JENKINS-56986 [ JENKINS-56986 ]
            sag47 Sam Gleske added a comment -

            False alarm. I closed my issue as a duplicate because I realized my version did not include the deadlock fix.

            sag47 Sam Gleske added a comment - False alarm. I closed my issue as a duplicate because I realized my version did not include the deadlock fix.
            ndeloof Nicolas De Loof made changes -
            Resolution Fixed [ 1 ]
            Status Closed [ 6 ] Reopened [ 4 ]
            ndeloof Nicolas De Loof added a comment - - edited

            Issue is still present. 

            hudson.plugins.ec2.EC2Cloud#connect() uses double-check-locking, which is known to be broken in java until the target field is marked a `volatile` to avoid cache conflicts.

            I don't think hudson.plugins.ec2.EC2Cloud#connect(AWSCredentialsProvider, java.net.URL) even need to be synchronized, as the lazy-init lock is required in the context of an EC2Cloud instance, not the whole class. 

             

             

            => proposed fix https://github.com/jenkinsci/ec2-plugin/pull/349 

             

             

            ndeloof Nicolas De Loof added a comment - - edited Issue is still present.  hudson.plugins.ec2.EC2Cloud#connect() uses double-check-locking, which is known to be broken in java until the target field is marked a `volatile` to avoid cache conflicts. I don't think hudson.plugins.ec2.EC2Cloud#connect(AWSCredentialsProvider, java.net.URL) even need to be synchronized, as the lazy-init lock is required in the context of an EC2Cloud instance, not the whole class.      => proposed fix  https://github.com/jenkinsci/ec2-plugin/pull/349      
            thoulen FABRIZIO MANFREDI added a comment - - edited

            Several improvement has been done in the locking and the PR 363 has been merged in 1.44.1

            thoulen FABRIZIO MANFREDI added a comment - - edited Several improvement has been done in the locking and the PR 363 has been merged in 1.44.1
            thoulen FABRIZIO MANFREDI made changes -
            Resolution Fixed [ 1 ]
            Status Reopened [ 4 ] Fixed but Unreleased [ 10203 ]

            People

              thoulen FABRIZIO MANFREDI
              masond Mason Donahue
              Votes:
              18 Vote for this issue
              Watchers:
              42 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: