Jenkins / JENKINS-43587

Pipeline fails to resume after master restart/plugin upgrade

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: pipeline
    • Environment: Jenkins 2.46.1, latest versions of the pipeline plugins (pipeline-build-step: 2.5, pipeline-rest-api: 2.6, pipeline-stage-step: 2.2, etc), durable-task 1.18

      During a recent Jenkins plugin upgrade and master restart, it seems that Jenkins failed to resume at least two Pipeline jobs. The pipeline was in the middle of a sh() step when the master was restarted. Both jobs have output similar to the following in the console:

      Resuming build at Thu Apr 13 15:01:50 EDT 2017 after Jenkins restart
      Waiting to resume part of <job name...>: ???
      Ready to run at Thu Apr 13 15:01:51 EDT 2017


      However, this text has been displayed for several minutes now with no obvious indication of what the job is waiting for. We can see that the pipeline is still running on the same executor it was running on before the restart. However, if we log into the server, there is no durable task or process for the script that the sh() step was running. From the logging of the script we were running, we can tell that the command did finish successfully, but we can't understand how Jenkins lost track of it. Based on the logging, the command finished around the same time the master was restarting (it is difficult to pinpoint exactly).
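
      The failure mode above can be exercised with a minimal sketch (the agent label and sleep duration are hypothetical, for illustration only):

      ```groovy
      // Hypothetical minimal reproduction: a sh() step that is in flight
      // when the master restarts. On resume, the durable-task machinery is
      // expected to reattach to the wrapper process, or to pick up the exit
      // code it recorded if the process finished during the restart window.
      node('linux') {
          timeout(time: 2, unit: 'HOURS') {
              sh 'sleep 600'   // restart the Jenkins master while this runs
          }
      }
      ```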


          Erik Lattimore created issue -
          Erik Lattimore made changes -
          Link New: This issue relates to JENKINS-39552 [ JENKINS-39552 ]

          These were the plugins that were being upgraded at the time:

          • blueocean-commons.jpi
          • blueocean-jwt.jpi
          • blueocean-web.jpi
          • blueocean-rest.jpi
          • blueocean-rest-impl.jpi
          • blueocean-pipeline-api-impl.jpi
          • blueocean-github-pipeline.jpi
          • blueocean-git-pipeline.jpi
          • blueocean-config.jpi
          • blueocean-events.jpi
          • blueocean-personalization.jpi
          • blueocean-i18n.jpi
          • blueocean-dashboard.jpi
          • blueocean.jpi
          • hashicorp-vault-plugin.jpi
          • analysis-core.jpi
          • pipeline-maven.jpi
          • workflow-api.jpi
          • warnings.jpi
          • ssh-slaves.jpi
          • mask-passwords.jpi
          • violation-comments-to-stash.jpi


          And here is the thread dump from the job:

          Thread #34
          at DSL.sh(completed process (code 0) in /home/jenkins/workspace/<jobname>@2@tmp/durable-039a0a47 on <hostname> (pid: 6808); recurrence period: 0ms)
          at WorkflowScript.deployStep(WorkflowScript:401)
          at DSL.timeout(killer task nowhere to be found)
          at WorkflowScript.deployStep(WorkflowScript:400)
          at DSL.ws(Native Method)
          at WorkflowScript.deployStep(WorkflowScript:366)
          at DSL.sshagent(Native Method)
          at WorkflowScript.deployStep(WorkflowScript:308)
          at DSL.lock(Native Method)
          at WorkflowScript.deployStep(WorkflowScript:307)
          at DSL.node(running on voltron.coalition.local)
          at WorkflowScript.deployStep(WorkflowScript:306)
          at DSL.stage(Native Method)
          at WorkflowScript.deployStep(WorkflowScript:305)
          at WorkflowScript.run(WorkflowScript:419)
          at DSL.timestamps(Native Method)
          at WorkflowScript.run(WorkflowScript:416)
          


          The pipeline is roughly:

          stage('Deploy') {
            node(getNode(tenant, vpc)) {
              lock(getLockableResource(tenant, vpc)) {
                sshagent([GIT_AUTH]) {
                  ws {
                    try {
                      for (int i = 0; i < products.size(); i++) {
                        timeout(time: 2, unit: 'HOURS') {
                          sh("deploy.py ${products[i]}")
                        }
                      }
                    } finally {
                      deleteDir()
                    }
                  }
                }
              }
            }
          }



          Finally, the node that this was running on has 5 executors.

          Erik Lattimore made changes -
          Description updated: "failed to resume at least one Pipeline job" → "failed to resume at least two Pipeline jobs"; "In the console output after the restart we see the following" → "Both jobs have output similar to the following in the console". The rest of the description is unchanged.

          Hmm, in the second case it seems the process actually died when the master was restarted: based on the logs, this one did not run to completion but terminated abruptly.


          Jon B added a comment -

          I am also getting stranded at "Ready to run at".

          In my case, I run Jenkins within a Docker container. If one of my pipelines is running and someone does a docker restart on the container, it strands at "Ready to run at".

          Side note - it seems like the Jenkins pipeline features are really clunky. If a slave server goes away, I'm seeing similar hanging problems. Any advice/guidance would be appreciated.


          Jon B added a comment - edited

          It may be worth noting that the pipeline I'm getting stranded on calls another pipeline with:

          build job: 'mysubpipeline', parameters: [
          [$class: 'StringParameterValue', name: 'BRANCH_NAME', value: "$BRANCH_NAME"]
          ]
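
          For reference, the same call can be written with the string() parameter shorthand (this assumes 'mysubpipeline' is a parameterized downstream job; the shorthand is equivalent to the explicit $class form):

          ```groovy
          // Equivalent invocation of the downstream job using the string()
          // parameter shorthand instead of the $class map form.
          build job: 'mysubpipeline', parameters: [
              string(name: 'BRANCH_NAME', value: "$BRANCH_NAME")
          ]
          ```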



            Assignee: Alex Taylor (ataylor)
            Reporter: Erik Lattimore (elatt)
            Votes: 13
            Watchers: 29