JENKINS-40771

Race condition in FlowExecutionList


Details

    Description

      Hi,

      We found a potential bug that can only be reproduced in Pipeline jobs. Essentially, when a job is running and a Jenkins restart occurs, the job is left hanging indefinitely:

      Resuming build at Tue Jan 03 10:37:18 UTC 2017 after Jenkins restart
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      Waiting to resume part of TestRun2 #2: Waiting for next available executor
      ...
      

      I noticed that other job types (e.g. freestyle) do not exhibit this behaviour.

      Here is a simple test pipeline script:

      node('XXXXX') {
      
        stage 'Stage 1'
          println 'Deploying to Stage 1...'
      
        stage 'Stage 2'
          println 'Running Tests in Stage 2'
          sleep 120
          println 'Tests passed!'
      
        stage 'Stage 3'
          println 'Deploying to Stage 3...'
      
      }
      

      ...Restart Jenkins as soon as it enters Stage 2, to replicate such behaviour.

      Currently I am using version 2.3, but I believe this issue also occurred in previous versions.

      Could you please help explain why this behaviour only occurs in Pipeline jobs?

      Kind Regards,
      Tuan

          Activity

            mcating Mike Cating added a comment -

            Seeing very similar behavior on JENKINS_VERSION = 2.32.1, except that the message is slightly different:

            Resuming build at Sat Jan 28 18:39:23 UTC 2017 after Jenkins restart
            Waiting to resume Unknown Pipeline node step: <AWS instance id> is offline

            thehosh Hosh added a comment -

            Having a similar issue. In my case, I'm backing up the jobs directory and restoring it before starting Jenkins:

            [Pipeline] {
            [Pipeline] sh
            [jenkins-backup] Running shell script
            + mktemp jenkins-jobs-XXXXXXX.tar.gz
            [Pipeline] stage
            [Pipeline] { (Backup build history)
            [Pipeline] sh
            Resuming build at Tue Jan 31 16:40:33 GMT 2017 after Jenkins restart
            Waiting to resume Unknown Pipeline node step: ???
            [jenkins-backup] Running shell script
            Ready to run at Tue Jan 31 16:40:36 GMT 2017
            
            jglick Jesse Glick added a comment -

            Each case is potentially a distinct bug, and details matter a lot in terms of producing complete steps to reproduce from scratch.

            matthall Matthew Hall added a comment -

            This is another case; I've copied the following from JENKINS-33761:

            Hello, I have recently also come across the bug of jobs not restarting. I can provide a test case to help with the investigation. Three jobs are required:

            Job 1 will trigger job_40_sec and job_50_sec in parallel.

            If Jenkins restarts or is killed while job_40_sec and job_50_sec are both running, then when Jenkins comes back online only one of the jobs is resumed, whilst the other hangs indefinitely.

            Please let me know if you need any more information or if this is the wrong place for this information.

            Pipeline scripts:

            Job 1

            Map parallel_jobs = ['branch_1': {build job: 'job_50_sec'},
                                 'branch_2': {build job: 'job_40_sec'}]
            parallel parallel_jobs

            job_40_sec

            node { sleep(40) }

            job_50_sec

            node { sleep(50) }
            derng Tuan Nguyen added a comment - - edited

            jglick, I have a step-by-step reproduction of the bug:

            node('XXXXX') {
            
              stage 'Stage 1'
                println 'Deploying to Stage 1...'
            
              stage 'Stage 2'
                println 'Running Tests in Stage 2'
                sleep 120
                println 'Tests passed!'
            
              stage 'Stage 3'
                println 'Deploying to Stage 3...'
            
            }
            

            ...Restart Jenkins as soon as it enters Stage 2, to replicate such behaviour.

            I'm not sure I follow about each case being a distinct bug. Can you describe what information you require, rather than resolving this as incomplete?

            Thanks,
            Tuan


            rg Russell Gallop added a comment -

            > ...Restart Jenkins as soon as it enters Stage 2, to replicate such behaviour.

            I'm not sure what Jesse is referring to with "details matter", but could you please say whether you are restarting the master, the slave, or both? Are the executors on the master node or elsewhere?
            derng Tuan Nguyen added a comment -

            My apologies. In my case, we are seeing this on both the Jenkins master and slave. This is the same with the executors (master & slave).

            jglick Jesse Glick added a comment -

            derng I ran 2.32.2 on a fresh home dir (Linux / Java 8), installed Pipeline incl. workflow-basic-steps 2.3, workflow-cps 2.26, pipeline-stage-step 2.2, workflow-job 2.9, created a Pipeline job with the script

            node {
            
              stage 'Stage 1'
                println 'Deploying to Stage 1...'
            
              stage 'Stage 2'
                println 'Running Tests in Stage 2'
                sleep 120
                println 'Tests passed!'
            
              stage 'Stage 3'
                println 'Deploying to Stage 3...'
            
            }
            

            ran Build Now, waited for the stage view to show the second stage in progress, used /restart to restart Jenkins, and after the restart it completed as expected:

            Started by user admin
            [Pipeline] node
            Running on master in …/workspace/derng
            [Pipeline] {
            [Pipeline] stage (Stage 1)
            Using the ‘stage’ step without a block argument is deprecated
            Entering stage Stage 1
            Proceeding
            [Pipeline] echo
            Deploying to Stage 1...
            [Pipeline] stage (Stage 2)
            Using the ‘stage’ step without a block argument is deprecated
            Entering stage Stage 2
            Proceeding
            [Pipeline] echo
            Running Tests in Stage 2
            [Pipeline] sleep
            Sleeping for 2 min 0 sec
            Resuming build at Fri Feb 10 12:57:13 EST 2017 after Jenkins restart
            Ready to run at Fri Feb 10 12:57:14 EST 2017
            Sleeping for 1 min 34 sec
            [Pipeline] echo
            Tests passed!
            [Pipeline] stage (Stage 3)
            Using the ‘stage’ step without a block argument is deprecated
            Entering stage Stage 3
            Proceeding
            [Pipeline] echo
            Deploying to Stage 3...
            [Pipeline] }
            [Pipeline] // node
            [Pipeline] End of Pipeline
            Finished: SUCCESS
            
            jglick Jesse Glick added a comment -

            matthall on the same configuration (incl. pipeline-build-step 2.4), created job matthall-main:

            Map parallel_jobs = ['branch_1': {build job: 'matthall-50'},
                                 'branch_2': {build job: 'matthall-40'}]
            parallel parallel_jobs
            

            and matthall-50:

            node { sleep(50) }
            

            and matthall-40:

            node { sleep(40) }
            

            Clicked Build Now on matthall-main; went to dashboard to see that matthall-50 and matthall-40 were both building on heavyweight executors; /restart. matthall-40 resumed:

            Started by upstream project "matthall-main" build number 1
            originally caused by:
             Started by user admin
            [Pipeline] node
            Running on master in …/workspace/matthall-40
            [Pipeline] {
            [Pipeline] sleep
            Sleeping for 40 sec
            Resuming build at Fri Feb 10 13:07:23 EST 2017 after Jenkins restart
            Ready to run at Fri Feb 10 13:07:24 EST 2017
            Sleeping for 7.1 sec
            [Pipeline] }
            [Pipeline] // node
            [Pipeline] End of Pipeline
            Finished: SUCCESS
            

            matthall-50 did not:

            Started by upstream project "matthall-main" build number 1
            originally caused by:
             Started by user admin
            [Pipeline] node
            Running on master in …/workspace/matthall-50
            [Pipeline] {
            [Pipeline] sleep
            Sleeping for 50 sec
            Resuming build at Fri Feb 10 13:07:26 EST 2017 after Jenkins restart
            Waiting to resume Unknown Pipeline node step: ???
            Ready to run at Fri Feb 10 13:07:27 EST 2017
            

            with thread dump

            Thread #2
            	at DSL.sleep(should have stopped sleeping 1 min 52 sec)
            	at WorkflowScript.run(WorkflowScript:1)
            	at DSL.node(running on )
            	at WorkflowScript.run(WorkflowScript:1)
            

            I updated to a release candidate of workflow-basic-steps 2.4 and tried again, but it still fails (this time in matthall-40 #2). Looking into why…

            jglick Jesse Glick added a comment -

            Well, diagnosed that one anyway: when two Pipeline builds are started at essentially the same moment, the registry of running builds can lose one of them, apparently due to a flaw in FlowExecutionList.saveLater, so the lost build does not resume after a Jenkins restart.
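
The lost-registration scenario described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual workflow-api plugin code: the class `RegistryRaceSketch`, the `Registry` type, the `reloadOnRegister` flag, the `flush()` method, and the in-memory `disk` list are all hypothetical names invented for this example. It assumes a registry whose saves are deferred and whose flawed variant re-reads the stored copy on every registration, so that a second registration arriving before the first save is written overwrites the first.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a build registry with deferred saves; NOT the real plugin code. */
class RegistryRaceSketch {
    /** Hypothetical stand-in for the registry file on disk. */
    static List<String> disk = new ArrayList<>();

    static class Registry {
        private List<String> inMemory = new ArrayList<>();
        private final boolean reloadOnRegister;

        Registry(boolean reloadOnRegister) {
            this.reloadOnRegister = reloadOnRegister;
        }

        void register(String build) {
            if (reloadOnRegister) {
                // Flawed variant: drop the in-memory state and re-read the
                // stored copy, which may not yet contain earlier registrations.
                inMemory = new ArrayList<>(disk);
            }
            inMemory.add(build);
            // A deferred save would be scheduled here; the caller models the
            // delay by invoking flush() later.
        }

        /** Models the deferred save finally being written. */
        void flush() {
            disk = new ArrayList<>(inMemory);
        }
    }

    /** Registers two builds back to back, then lets the deferred save run. */
    static List<String> run(boolean reloadOnRegister) {
        disk = new ArrayList<>();
        Registry registry = new Registry(reloadOnRegister);
        registry.register("job_40_sec #1");
        registry.register("job_50_sec #1"); // arrives before any save is written
        registry.flush();
        return new ArrayList<>(disk);
    }

    public static void main(String[] args) {
        System.out.println("reload on register: " + run(true));
        System.out.println("memory only:        " + run(false));
    }
}
```

In this model the reloading variant ends up with only the second build stored, while the variant that trusts its in-memory list keeps both. The real race involves concurrent threads and timing, which this sequential sketch deliberately leaves out.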


            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            pom.xml
            src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
            src/test/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionListTest.java
            http://jenkins-ci.org/commit/workflow-api-plugin/643dc718a18858e7c65f227daa18c998820dafc6
            Log:
            [FIXED JENKINS-40771] FlowExecutionList.register (and .unregister) was incorrectly loading from disk, causing a race condition with asynchronous saves.

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionList.java
            src/test/java/org/jenkinsci/plugins/workflow/flow/FlowExecutionListTest.java
            http://jenkins-ci.org/commit/workflow-api-plugin/d7575cae43019af2e3f80bfc7248688ff6393a46
            Log:
            Merge pull request #31 from jglick/FlowExecutionList-JENKINS-40771

            JENKINS-40771 FlowExecutionList race condition

            People

              jglick Jesse Glick
              derng Tuan Nguyen
              Votes: 3
              Watchers: 8