Jenkins / JENKINS-39552

After restart, interrupted pipeline deadlocks waiting for executor

      Description

      I had a pipeline build running, and then restarted Jenkins. After coming up again, I had this in the log for one of the parallel steps in the build:

      Resuming build at Mon Nov 07 13:11:05 CET 2016 after Jenkins restart
      Waiting to resume part of Atlassian Bitbucket » honey » master #4: ???
      Waiting to resume part of Atlassian Bitbucket » honey » master #4: Waiting for next available executor on bcubuntu32

      The last message repeated every few minutes. The slave bcubuntu32 has only one executor, and it seems that this executor was "used up" by the very task that was waiting for an available executor.

      After I went into the configuration and changed number of executors to 2, the build continued as normal.

      A possibly related issue: before the restart, I put Jenkins in quiet mode, but the same build agent hung at the end of the pipeline part that was running and never finished the build. In the end I restarted without waiting for that part to finish.

      How to reproduce

      • In a fresh Jenkins instance, set the number of executors on the master to 1
      • Create job-1 and job-2 as follows (the first script is job-1, the second is job-2)
        // job-1
        node {
            parallel "parallel-1": {
                sh "true"
            }, "parallel-2": {
                sh "true"
            }
        }
        build 'job-2'

        // job-2
        node {
            sh "sleep 300"
        }
        

      Start a build of job-1, wait for the node block in job-2 to start, then restart Jenkins.

      When it comes back online, you'll see a deadlock.

      It seems job-1 is trying to come back on the node it used before the restart, even though its current state doesn't require any node.
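
      The cycle described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how Jenkins is implemented: the node's executors are modelled as semaphore permits, and it is assumed (one plausible reading of the report) that job-1's stale claim is granted first, so job-2 finds no free executor while job-1 in turn waits for job-2 to finish. The class and method names are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Toy model only: executors are permits; none of this is Jenkins code. */
public class ExecutorDeadlockModel {

    /**
     * Returns true if job-2 can get an executor back after the restart,
     * given that job-1 has already reclaimed the executor it used before.
     */
    public static boolean job2CanResume(int executors) {
        Semaphore node = new Semaphore(executors);
        try {
            // Assumption: job-1 is past its node block, but on resume it still
            // claims the executor it held before the restart.
            node.acquire();

            // job-2 needs an executor to continue its "sleep 300" node block.
            boolean resumed = node.tryAcquire(200, TimeUnit.MILLISECONDS);
            if (resumed) {
                node.release();
            }
            // job-1 would give its permit back only after job-2 finishes
            // (it is blocked in build 'job-2'), which closes the cycle.
            return resumed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("1 executor:  job-2 resumes = " + job2CanResume(1));
        System.out.println("2 executors: job-2 resumes = " + job2CanResume(2));
    }
}
```

      With one permit the second claim times out, and with two it succeeds, which matches the observation that raising the executor count to 2 let the build continue.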

            Activity

            estyrke Emil Styrke created issue -
            estyrke Emil Styrke made changes -
            Field Original Value New Value
            Epic Link JENKINS-35399 [ 171192 ]
            elatt Erik Lattimore made changes -
            Link This issue relates to JENKINS-43587 [ JENKINS-43587 ]
            piratejohnny Jon B added a comment -

            What's the corrective action here?

            estyrke Emil Styrke added a comment -

            As a workaround for this specific case, I increased the number of executors by one (temporarily). But I have never experienced this since that one time, so I don't know if this is still an issue.

            vlatombe Vincent Latombe made changes -
            Attachment pipeline_restart_deadlock.png [ 37800 ]
            vlatombe Vincent Latombe added a comment -

            Reproducer:

            • In a fresh Jenkins instance, set the number of executors on the master to 1
            • Create job-1 and job-2 as follows (the first script is job-1, the second is job-2)
              // job-1
              node {
                  parallel "parallel-1": {
                      sh "true"
                  }, "parallel-2": {
                      sh "true"
                  }
              }
              build 'job-2'

              // job-2
              node {
                  sh "sleep 300"
              }

            Start a build of job-1, wait for the node block in job-2 to start, then restart Jenkins.

            When it comes back online, you'll see a deadlock.

            It seems job-1 is trying to come back on the node it used before restart, even though its current state doesn't require any node.

            mkobit Mike Kobit added a comment -

            We see similar restart problems. We are not sure of the cause yet, but it deadlocks the queue and prevents any builds at all from joining the build queue.

            Waiting to resume part of Bitbucket Projects » NOC » NOC/noc-jobs » master #56: ???
            Waiting to resume part of Bitbucket Projects » NOC » NOC/noc-jobs » master #56: ???
            Waiting to resume part of Bitbucket Projects » NOC » NOC/noc-jobs » master #56: ???
            

            Recovery requires a hard kill of that run followed by a Jenkins restart.

            piratejohnny Jon B added a comment - - edited

            I am using GitHub webhooks to listen for certain events, and I then invoke Pipeline jobs in Jenkins. We use multiple stages and also the parallel build feature.

            I am worried that there appear to be many cases where Jenkins gets into a screwed-up state when certain things happen, such as a Jenkins restart. I sometimes have dozens of jobs in progress and many more queued, and this helps power the CI at our whole company.

            It was very worrisome to see that the pipeline jobs that were in flight at the time of a restart were all stuck with messages that seemed to indicate the pipeline was essentially deadlocked... something about it being paused or trying to resume.

            Should I consider avoiding the Jenkins pipeline plugin due to it being too brittle for use in a developer's critical workflow?

            mkobit Mike Kobit added a comment -

            We see this frequently, and are very concerned with the survivability of pipeline jobs. Jenkins is rendered unusable for some reason (possibly due to nodes disappearing underneath them?). We see builds in the queue with the ??? and have no idea how to resolve the issues.

            mkobit Mike Kobit added a comment -

            In our build logs in multiple places:

            java.io.IOException: bitbucket_projects/dp/read-api/PR-235 #1 did not yet start
                    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:884)
                    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:65)
                    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:57)
                    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
                    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
                    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:178)
                    at jenkins.model.Jenkins.<init>(Jenkins.java:997)
                    at hudson.model.Hudson.<init>(Hudson.java:86)
                    at hudson.model.Hudson.<init>(Hudson.java:82)
                    at hudson.WebAppMain$3.run(WebAppMain.java:235)
            
            mkobit Mike Kobit added a comment -

            From a thread dump:

            "AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] [#24]" Id=647 Group=main WAITING on com.google.common.util.concurrent.AbstractFuture$Sync@47eda1e1
            	at sun.misc.Unsafe.park(Native Method)
            	-  waiting on com.google.common.util.concurrent.AbstractFuture$Sync@47eda1e1
            	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
            	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
            	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
            	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
            	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:275)
            	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:111)
            	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:248)
            	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadSynchronously(CpsStepContext.java:237)
            	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:294)
            	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:61)
            	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getNode(ExecutorStepExecution.java:259)
            	at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.categoriesForPipeline(ThrottleQueueTaskDispatcher.java:411)
            	at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRun(ThrottleQueueTaskDispatcher.java:168)
            	at hudson.model.Queue.isBuildBlocked(Queue.java:1184)
            	at hudson.model.Queue.maintain(Queue.java:1505)
            	at hudson.model.Queue$1.call(Queue.java:320)
            	at hudson.model.Queue$1.call(Queue.java:317)
            	at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:108)
            	at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:98)
            	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
            	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            	at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:110)
            	at java.lang.Thread.run(Thread.java:745)
            
            	Number of locked synchronizers = 1
            	- java.util.concurrent.locks.ReentrantLock$NonfairSync@5613fb44
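
            The trace above shows the queue maintenance thread parked in a future's get() while it owns one ReentrantLock. A minimal sketch of why that would stall everything else, assuming the lock is the one guarding the build queue (the dump does not say so explicitly); the class, method, and variable names are hypothetical and none of this is Jenkins code:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Toy model only: the lock and future stand in for what the dump shows. */
public class QueueStallModel {

    /**
     * Returns true if another thread can take the queue lock while the
     * "maintenance" thread waits on the given future with the lock held.
     */
    public static boolean othersCanUseQueue(CompletableFuture<String> nodeLookup) {
        ReentrantLock queueLock = new ReentrantLock();
        CountDownLatch lockTaken = new CountDownLatch(1);

        Thread maintenance = new Thread(() -> {
            queueLock.lock();
            try {
                lockTaken.countDown();
                nodeLookup.get(); // blocks until the lookup completes
            } catch (Exception e) {
                // interrupted or failed lookup: fall through and release
            } finally {
                queueLock.unlock();
            }
        });
        maintenance.setDaemon(true);
        maintenance.start();

        try {
            lockTaken.await();
            // Stand-in for any other thread that wants to schedule a build.
            boolean acquired = queueLock.tryLock(300, TimeUnit.MILLISECONDS);
            if (acquired) {
                queueLock.unlock();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            maintenance.interrupt(); // let the model thread exit
        }
    }
}
```

            If the future never completes, the second thread's tryLock times out, which is consistent with no builds being able to join the queue; with an already completed future the lock is released almost immediately.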
            
            abayer Andrew Bayer added a comment -

            Mike Kobit - that sounds like JENKINS-44747, fyi. This issue here predates the change in Throttle Concurrent Builds, so is probably caused by something else.

            mkobit Mike Kobit added a comment -

            Thanks Andrew Bayer - I'll follow that issue.

            I'm starting to think that my issue may be a different one. We saw a lot of weirdness with Jenkins restarts and lots of LinkageError from a few user pipelines. Those pipelines added a bunch of load statements (some nested) that reloaded the same resources, which may have caused our issue. Still unsure, but we haven't seen it happen again since we fixed that in the last day.

            jamesdumay James Dumay made changes -
            Labels: pipeline -> cloudbees-internal-pipeline pipeline
            vlatombe Vincent Latombe made changes -
            Description Added the "How to reproduce" section (steps, job-1 and job-2 scripts, and the pipeline_restart_deadlock.png thumbnail); the original text is otherwise unchanged.
            cloudbees CloudBees Inc. made changes -
            Remote Link This issue links to "CloudBees Internal CD-29 (Web Link)" [ 19126 ]
            abayer Andrew Bayer made changes -
            Component/s workflow-durable-task-step-plugin [ 21715 ]
            Component/s pipeline [ 21692 ]
            vivek Vivek Pandey made changes -
            Labels: cloudbees-internal-pipeline pipeline -> cloudbees-internal-pipeline pipeline triaged-2018-11
            macdrega Joerg Schwaerzler added a comment - - edited

            We're facing the same issue here - Jenkins 2.190.2, workflow-durable-task-step-plugin: 2.35.
            Are there any updates or workarounds available in the meantime?

            Seems like we can work around this issue by temporarily adding a second executor...

            dnusbaum Devin Nusbaum added a comment -

            There are various issues described here, but I think the main issue in the description is a duplicate of JENKINS-53709 (fixed in Pipeline: Groovy version 2.56) or JENKINS-41791 (fixed in Pipeline: Groovy 2.66). There is also a possibility that the fix for JENKINS-63164 (released in Pipeline: Groovy version 2.82) would fix this issue. Given that, I am going to go ahead and close this issue as a duplicate.

            Joerg Schwaerzler Assuming you are running the latest version of Pipeline: Groovy plugin, I would open a new issue and describe the behavior you are seeing, including the Pipeline having the problem, a build log from when the problem happened (ideally the entire build folder zipped), and any exceptions in the Jenkins system logs when the problem occurred.

            dnusbaum Devin Nusbaum made changes -
            Link This issue duplicates JENKINS-53709 [ JENKINS-53709 ]
            dnusbaum Devin Nusbaum made changes -
            Resolution Duplicate [ 3 ]
            Status Open [ 1 ] Closed [ 6 ]
            dnusbaum Devin Nusbaum made changes -
            Link This issue relates to JENKINS-41791 [ JENKINS-41791 ]

              People

              Assignee: Unassigned
              Reporter: estyrke Emil Styrke
              Votes: 11
              Watchers: 22