Hi,

      We found a potential bug that can only be replicated in pipeline jobs. Essentially, when a job is running and a Jenkins restart occurs, the job is left hanging indefinitely:

      Resuming build at Tue Jan 03 10:37:18 UTC 2017 after Jenkins restart
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      ...
      

      I noticed that this behaviour does not occur with other job types, e.g. freestyle.

      Here is a simple test pipeline script:

      node('XXXXX') {
      
        stage 'Stage 1'
          println 'Deploying to Stage 1...'
      
        stage 'Stage 2'
          println 'Running Tests in Stage 2'
          sleep 120
          println 'Tests passed!'
      
        stage 'Stage 3'
          println 'Deploying to Stage 3...'
      
      }
      

      ...Restart Jenkins as soon as it enters Stage 2, to replicate such behaviour.

      Currently I am using version 2.3, but I believe this issue was replicated in previous versions.

      Could you please help explain why this behaviour only occurs in pipeline jobs?

      Kind Regards,
      Tuan

          [JENKINS-40771] Race condition in FlowExecutionList

          Jesse Glick added a comment -

          Each case is potentially a distinct bug, and details matter a lot in terms of producing complete steps to reproduce from scratch.


          Matthew Hall added a comment -

          This is another case; I've copied the following from JENKINS-33761:

          Hello, I have also recently come across the bug of jobs not restarting, and I can provide a test case to help with the investigation. Three jobs are required.

          Job 1 triggers job_40_sec and job_50_sec in parallel.

          If Jenkins restarts or is killed while job_40_sec and job_50_sec are both running, then when Jenkins comes back online only one of the jobs is restarted, while the other hangs indefinitely.

          Please let me know if you need any more information, or if this is the wrong place for it.

          Pipeline scripts:

          Job 1

          Map parallel_jobs = ['branch_1': {build job: 'job_50_sec'},
                               'branch_2': {build job: 'job_40_sec'}]
          parallel parallel_jobs

          job_40_sec

          node { sleep(40) }

          job_50_sec

          node { sleep(50) }


          Tuan Nguyen added a comment - edited

          jglick, I have a step-by-step reproducible case:

          node('XXXXX') {
          
            stage 'Stage 1'
              println 'Deploying to Stage 1...'
          
            stage 'Stage 2'
              println 'Running Tests in Stage 2'
              sleep 120
              println 'Tests passed!'
          
            stage 'Stage 3'
              println 'Deploying to Stage 3...'
          
          }
          

          ...Restart Jenkins as soon as it enters Stage 2, to replicate such behaviour.

          I'm not sure I follow about each case being a distinct bug. Can you describe what information you require, rather than resolving this as incomplete?

          Thanks,
          Tuan


          Russell Gallop added a comment -

          > ...Restart Jenkins as soon as it enters Stage 2, to replicate such behaviour.

          I'm not sure what Jesse is referring to with "details matter", but could you please say whether you are restarting the master, the slave, or both? Are the executors on the master node or elsewhere?

          Tuan Nguyen added a comment -

          My apologies. In my case, we are seeing this on both the Jenkins master and slave. This is the same with the executors (master & slave).


          Jesse Glick added a comment -

          derng I ran 2.32.2 on a fresh home dir (Linux / Java 8), installed Pipeline incl. workflow-basic-steps 2.3, workflow-cps 2.26, pipeline-stage-step 2.2, workflow-job 2.9, created a Pipeline job with the script

          node {
          
            stage 'Stage 1'
              println 'Deploying to Stage 1...'
          
            stage 'Stage 2'
              println 'Running Tests in Stage 2'
              sleep 120
              println 'Tests passed!'
          
            stage 'Stage 3'
              println 'Deploying to Stage 3...'
          
          }
          

          ran *Build Now*, waited for the stage view to show the second stage in progress, used /restart to restart Jenkins, and after the restart it completed as expected:

          Started by user admin
          [Pipeline] node
          Running on master in …/workspace/derng
          [Pipeline] {
          [Pipeline] stage (Stage 1)
          Using the ‘stage’ step without a block argument is deprecated
          Entering stage Stage 1
          Proceeding
          [Pipeline] echo
          Deploying to Stage 1...
          [Pipeline] stage (Stage 2)
          Using the ‘stage’ step without a block argument is deprecated
          Entering stage Stage 2
          Proceeding
          [Pipeline] echo
          Running Tests in Stage 2
          [Pipeline] sleep
          Sleeping for 2 min 0 sec
          Resuming build at Fri Feb 10 12:57:13 EST 2017 after Jenkins restart
          Ready to run at Fri Feb 10 12:57:14 EST 2017
          Sleeping for 1 min 34 sec
          [Pipeline] echo
          Tests passed!
          [Pipeline] stage (Stage 3)
          Using the ‘stage’ step without a block argument is deprecated
          Entering stage Stage 3
          Proceeding
          [Pipeline] echo
          Deploying to Stage 3...
          [Pipeline] }
          [Pipeline] // node
          [Pipeline] End of Pipeline
          Finished: SUCCESS
          


          Jesse Glick added a comment -

          matthall on the same configuration (incl. pipeline-build-step 2.4), created job matthall-main:

          Map parallel_jobs = ['branch_1': {build job: 'matthall-50'},
                               'branch_2': {build job: 'matthall-40'}]
          parallel parallel_jobs
          

          and matthall-50:

          node { sleep(50) }
          

          and matthall-40:

          node { sleep(40) }
          

          Clicked Build Now on matthall-main; went to dashboard to see that matthall-50 and matthall-40 were both building on heavyweight executors; /restart. matthall-40 resumed:

          Started by upstream project "matthall-main" build number 1
          originally caused by:
           Started by user admin
          [Pipeline] node
          Running on master in …/workspace/matthall-40
          [Pipeline] {
          [Pipeline] sleep
          Sleeping for 40 sec
          Resuming build at Fri Feb 10 13:07:23 EST 2017 after Jenkins restart
          Ready to run at Fri Feb 10 13:07:24 EST 2017
          Sleeping for 7.1 sec
          [Pipeline] }
          [Pipeline] // node
          [Pipeline] End of Pipeline
          Finished: SUCCESS
          

          matthall-50 did not:

          Started by upstream project "matthall-main" build number 1
          originally caused by:
           Started by user admin
          [Pipeline] node
          Running on master in …/workspace/matthall-50
          [Pipeline] {
          [Pipeline] sleep
          Sleeping for 50 sec
          Resuming build at Fri Feb 10 13:07:26 EST 2017 after Jenkins restart
          Waiting to resume Unknown Pipeline node step: ???
          Ready to run at Fri Feb 10 13:07:27 EST 2017
          

          with thread dump

          Thread #2
          	at DSL.sleep(should have stopped sleeping 1 min 52 sec)
          	at WorkflowScript.run(WorkflowScript:1)
          	at DSL.node(running on )
          	at WorkflowScript.run(WorkflowScript:1)
          

          I updated to a release candidate of workflow-basic-steps 2.4 and tried again, but still it fails (this time in matthall-40 #2). Looking into why…


          Jesse Glick added a comment -

          Well diagnosed that one anyway—when two Pipeline builds are started at essentially the same moment, the registry of running builds can lose one of them, apparently due to a flaw in FlowExecutionList.saveLater, causing it to not resume after Jenkins restart.
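
          To illustrate the kind of race described here, the following is a minimal, single-threaded Java sketch. It is not the FlowExecutionList code: the class names, the `disk` field and the `pendingSaves` queue are hypothetical stand-ins, and the asynchronous save is modelled as a queue that is flushed by hand so the outcome is deterministic. It assumes a registry that reloads the persisted list on every registration and writes it back later, and compares that with a registry that treats its in-memory list as the source of truth.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class RegistryRaceSketch {
    /** Hypothetical stand-in for the persisted list of running builds. */
    static List<String> disk = new ArrayList<>();
    /** Saves that were requested but not yet written (models an asynchronous save). */
    static final Queue<Runnable> pendingSaves = new ArrayDeque<>();

    /** Flawed variant: each registration re-reads the persisted state first. */
    static class ReloadingRegistry {
        void register(String build) {
            List<String> snapshot = new ArrayList<>(disk); // "load from disk"
            snapshot.add(build);
            pendingSaves.add(() -> disk = snapshot);       // "save later"
        }
    }

    /** Safer variant: memory is authoritative, saves only write it out. */
    static class InMemoryRegistry {
        private final List<String> running = new ArrayList<>();

        void register(String build) {
            running.add(build);
            List<String> snapshot = new ArrayList<>(running);
            pendingSaves.add(() -> disk = snapshot);
        }
    }

    /** Runs every deferred save in the order it was requested. */
    static void flush() {
        Runnable save;
        while ((save = pendingSaves.poll()) != null) {
            save.run();
        }
    }

    /** Registers two builds back to back, then lets the deferred saves run. */
    static List<String> persistedAfterTwoRegistrations(boolean reloadFromDisk) {
        disk = new ArrayList<>();
        pendingSaves.clear();
        if (reloadFromDisk) {
            ReloadingRegistry registry = new ReloadingRegistry();
            registry.register("matthall-50 #1");
            registry.register("matthall-40 #1"); // reads stale state: first save not written yet
        } else {
            InMemoryRegistry registry = new InMemoryRegistry();
            registry.register("matthall-50 #1");
            registry.register("matthall-40 #1");
        }
        flush();
        return disk;
    }

    public static void main(String[] args) {
        System.out.println(persistedAfterTwoRegistrations(true));  // [matthall-40 #1]
        System.out.println(persistedAfterTwoRegistrations(false)); // [matthall-50 #1, matthall-40 #1]
    }
}
```

          In the reloading variant the second registration starts from a stale copy, so its save overwrites the first one and only one build remains on record. A build missing from the persisted list is one that nothing would know to resume after a restart.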


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          pom.xml
          src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
          src/test/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionListTest.java
          http://jenkins-ci.org/commit/workflow-api-plugin/643dc718a18858e7c65f227daa18c998820dafc6
          Log:
          [FIXED JENKINS-40771] FlowExecutionList.register (and .unregister) was incorrectly loading from disk, causing a race condition with asynchronous saves.

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
          src/test/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionListTest.java
          http://jenkins-ci.org/commit/workflow-api-plugin/d7575cae43019af2e3f80bfc7248688ff6393a46
          Log:
          Merge pull request #31 from jglick/FlowExecutionList-JENKINS-40771

          JENKINS-40771 FlowExecutionList race condition

            jglick Jesse Glick
            derng Tuan Nguyen