
[JENKINS-39552] After restart, interrupted pipeline deadlocks waiting for executor

      I had a pipeline build running, and then restarted Jenkins. After it came back up, I saw this in the log for one of the parallel steps in the build:

      Resuming build at Mon Nov 07 13:11:05 CET 2016 after Jenkins restart
      Waiting to resume part of Atlassian Bitbucket » honey » master #4: ???
      Waiting to resume part of Atlassian Bitbucket » honey » master #4: Waiting for next available executor on bcubuntu32

      The last message then repeated every few minutes. The slave bcubuntu32 has only one executor, and it seems this executor was "used up" by the very task that was waiting for an available executor...

      After I went into the configuration and changed the number of executors to 2, the build continued as normal.
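
      The same workaround can also be applied from the script console; here is a minimal Groovy sketch, assuming the single-executor node is the built-in (master) node, as in the reproducer below. For an agent such as bcubuntu32, the executor count is changed in that node's configuration instead.

        // Script console sketch: temporarily raise the built-in node's executor
        // count so the task stuck "Waiting for next available executor" can resume.
        import jenkins.model.Jenkins

        def jenkins = Jenkins.get()   // Jenkins.getInstance() on older cores
        jenkins.setNumExecutors(2)    // was 1; revert once the build has resumed
        jenkins.save()                // persist the change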

      A possibly related issue: before the restart, I put Jenkins in quiet mode, but the same build agent hung at the end of the pipeline part that was running and never finished the build. In the end I restarted without waiting for the part to finish.

      How to reproduce

      • In a fresh Jenkins instance, set the master's executor count to 1
      • Create job-1 and job-2 as follows:

        job-1:

        node {
            parallel "parallel-1": {
                sh "true"
            }, "parallel-2": {
                sh "true"
            }
        }
        build 'job-2'

        job-2:

        node {
            sh "sleep 300"
        }
        

      Start a build of job-1, wait for job-2's node block to start, then restart Jenkins.

      When Jenkins comes back online, you'll see a deadlock.

      It seems job-1 is trying to come back on the node it used before the restart, even though its current state doesn't require any node.

          Comments

          Vincent Latombe added a comment -

          Reproducer:

          • In a fresh Jenkins instance, set the master's executor count to 1
          • Create job-1 and job-2 as follows:

            job-1:

            node {
                parallel "parallel-1": {
                    sh "true"
                }, "parallel-2": {
                    sh "true"
                }
            }
            build 'job-2'

            job-2:

            node {
                sh "sleep 300"
            }
            

          Start a build of job-1, wait for job-2's node block to start, then restart Jenkins.

          When Jenkins comes back online, you'll see a deadlock.

          It seems job-1 is trying to come back on the node it used before the restart, even though its current state doesn't require any node.

          Mike Kobit added a comment -

          We see similar restart problems. We're not sure of the cause yet, but it deadlocks the queue and prevents any builds at all from entering it.

          Waiting to resume part of Bitbucket Projects » NOC » NOC/noc-jobs » master #56: ???
          Waiting to resume part of Bitbucket Projects » NOC » NOC/noc-jobs » master #56: ???
          Waiting to resume part of Bitbucket Projects » NOC » NOC/noc-jobs » master #56: ???
          

          Requires a hard kill on that run and then a Jenkins restart.
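
          For reference, the hard kill can also be issued from the script console; a minimal Groovy sketch, where the job path and build number are placeholders rather than the actual ones from this report:

            // Script console sketch: hard-kill a stuck Pipeline run.
            import jenkins.model.Jenkins
            import org.jenkinsci.plugins.workflow.job.WorkflowJob

            def job = Jenkins.get().getItemByFullName('folder/my-job', WorkflowJob)
            def run = job?.getBuildByNumber(56)   // substitute the stuck build number
            run?.doKill()                         // forcibly terminates the flow execution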

          Jon B added a comment - edited

          I am using GitHub webhooks to listen for certain events and then invoke Pipeline jobs in Jenkins. We use multiple stages as well as the parallel build feature.

          I am worried that there appear to be many cases where Jenkins gets into a screwed-up state when certain things happen, such as when Jenkins needs to restart. I will sometimes have dozens of jobs in progress and many more queued, and this powers the CI for our whole company.

          It was very worrisome to see how the pipeline jobs that were in flight at the time of a restart were all stuck with messages that seemed to indicate the pipeline was essentially deadlocked... something about it being paused or trying to resume.

          Should I consider avoiding the Jenkins pipeline plugin due to it being too brittle for use in a developer's critical workflow?

          Mike Kobit added a comment -

          We see this frequently and are very concerned about the survivability of Pipeline jobs. Jenkins is rendered unusable for some reason (possibly because nodes disappear underneath the builds?). We see builds in the queue with the "???" label and have no idea how to resolve them.
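
          To at least see what the "???" entries correspond to, the queue can be inspected from the script console; a read-only Groovy sketch using core Jenkins APIs:

            // Script console sketch: list queued items with the reason each is blocked.
            import jenkins.model.Jenkins

            Jenkins.get().queue.items.each { item ->
                println "${item.id}: ${item.task.fullDisplayName} - ${item.why}"
            }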

          Mike Kobit added a comment -

          In our build logs in multiple places:

          java.io.IOException: bitbucket_projects/dp/read-api/PR-235 #1 did not yet start
                  at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:884)
                  at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:65)
                  at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:57)
                  at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
                  at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
                  at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:178)
                  at jenkins.model.Jenkins.<init>(Jenkins.java:997)
                  at hudson.model.Hudson.<init>(Hudson.java:86)
                  at hudson.model.Hudson.<init>(Hudson.java:82)
                  at hudson.WebAppMain$3.run(WebAppMain.java:235)
          

          Mike Kobit added a comment -

          From thread dump

          "AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] [#24]" Id=647 Group=main WAITING on com.google.common.util.concurrent.AbstractFuture$Sync@47eda1e1
          	at sun.misc.Unsafe.park(Native Method)
          	-  waiting on com.google.common.util.concurrent.AbstractFuture$Sync@47eda1e1
          	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
          	at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:275)
          	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:111)
          	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadGroupSynchronously(CpsStepContext.java:248)
          	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.getThreadSynchronously(CpsStepContext.java:237)
          	at org.jenkinsci.plugins.workflow.cps.CpsStepContext.doGet(CpsStepContext.java:294)
          	at org.jenkinsci.plugins.workflow.support.DefaultStepContext.get(DefaultStepContext.java:61)
          	at org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask.getNode(ExecutorStepExecution.java:259)
          	at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.categoriesForPipeline(ThrottleQueueTaskDispatcher.java:411)
          	at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRun(ThrottleQueueTaskDispatcher.java:168)
          	at hudson.model.Queue.isBuildBlocked(Queue.java:1184)
          	at hudson.model.Queue.maintain(Queue.java:1505)
          	at hudson.model.Queue$1.call(Queue.java:320)
          	at hudson.model.Queue$1.call(Queue.java:317)
          	at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:108)
          	at jenkins.util.AtmostOneTaskExecutor$1.call(AtmostOneTaskExecutor.java:98)
          	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at hudson.remoting.AtmostOneThreadExecutor$Worker.run(AtmostOneThreadExecutor.java:110)
          	at java.lang.Thread.run(Thread.java:745)
          
          	Number of locked synchronizers = 1
          	- java.util.concurrent.locks.ReentrantLock$NonfairSync@5613fb44
          

          Andrew Bayer added a comment -

          mkobit - that sounds like JENKINS-44747, FYI. The issue here predates the change in Throttle Concurrent Builds, so it is probably caused by something else.

          Mike Kobit added a comment -

          Thanks abayer - I'll follow that issue.

          I'm starting to think that my issue may be different. We saw a lot of weirdness with Jenkins restarts and many LinkageErrors from a few user Pipelines; they use a bunch of load statements (some nested) and reload the same resources, which may have caused our issue. Still unsure, but we haven't seen it happen again since we fixed that in the last day.
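
          For context, the load pattern described above might look roughly like this (purely illustrative; the file names and method are hypothetical):

            // Jenkinsfile sketch: nested load steps reloading shared resources.
            node {
                def common = load 'ci/common.groovy'  // common.groovy itself calls
                common.deploy()                       // load 'ci/util.groovy' again
            }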

          Joerg Schwaerzler added a comment - edited

          We're facing the same issue here: Jenkins 2.190.2, workflow-durable-task-step plugin 2.35.
          Are there any updates or workarounds available in the meantime?

          It seems we can work around this issue by temporarily adding a second executor...

          Devin Nusbaum added a comment -

          There are various issues described here, but I think the main issue in the description is a duplicate of JENKINS-53709 (fixed in Pipeline: Groovy version 2.56) or JENKINS-41791 (fixed in Pipeline: Groovy 2.66). There is also a possibility that the fix for JENKINS-63164 (released in Pipeline: Groovy version 2.82) would fix this issue. Given that, I am going to go ahead and close this issue as a duplicate.

          macdrega Assuming you are running the latest version of the Pipeline: Groovy plugin, I would open a new issue and describe the behavior you are seeing, including the Pipeline that has the problem, a build log from when the problem happened (ideally the entire build folder, zipped), and any exceptions in the Jenkins system logs from when the problem occurred.

            Assignee: Unassigned
            Reporter: Emil Styrke (estyrke)
            Votes: 11
            Watchers: 22
