
Pipeline fails to resume after master restart/plugin upgrade

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: pipeline
    • Environment: Jenkins 2.46.1, latest versions of the pipeline plugins (pipeline-build-step: 2.5, pipeline-rest-api: 2.6, pipeline-stage-step: 2.2, etc.)
    • Released as: durable-task 1.18

      During a recent Jenkins plugin upgrade and master restart, it seems that Jenkins failed to resume at least two Pipeline jobs. The pipeline was in the middle of a sh() step when the master was restarted. Both jobs have output similar to the following in the console:

      Resuming build at Thu Apr 13 15:01:50 EDT 2017 after Jenkins restart
      Waiting to resume part of <job name...>: ???
      Ready to run at Thu Apr 13 15:01:51 EDT 2017

       

      However, this text has been displayed for several minutes now with no obvious indication of what the job is waiting for. We can see that the pipeline is still shown as running on the same executor it was running on pre-restart; however, if we log into the server, there is no durable task or process of the script that the sh() step was running. From the logging of the script we were running, we can tell that the command did finish successfully, but we can't understand how Jenkins lost track of it. From the logging, the command finished around the same time the master was restarting (it is difficult to pinpoint exactly).

          [JENKINS-43587] Pipeline fails to resume after master restart/plugin upgrade

          Alex Taylor added a comment -

          piratejohnny

          For that issue, the pipeline does not have the ability to resume on an agent with the same label, because it needs access to the same workspace it was building in before the restart in order to resume properly. In this case (if it did not have the same workspace) it would try to reconnect to the same agent (which is destroyed), and it would eventually time out once it cannot find the workspace. In your case, if you want it to resume on an agent with the same label, you would need to persist that workspace somehow.

           

          Either way, this is not related to this ticket in particular (also, I assume you meant durable-task 1.17 rather than 1.7).
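
          As a hedged illustration of the workspace point above: pinning the pipeline to one specific long-lived agent (rather than a label shared by ephemeral agents) keeps the workspace available across a restart. The agent name and build script below are hypothetical, not from this ticket:

```groovy
// Jenkinsfile sketch. 'linux-persistent-01' and './build.sh' are
// placeholders. Because the agent is a single long-lived node, its
// workspace survives a master restart, which is what a durable sh()
// step needs in order to resume.
pipeline {
    agent { node { label 'linux-persistent-01' } }
    options {
        // MAX_SURVIVABILITY is already the default; stated for clarity
        durabilityHint('MAX_SURVIVABILITY')
    }
    stages {
        stage('Build') {
            steps {
                sh './build.sh'
            }
        }
    }
}
```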


          Mircea-Andrei Albu added a comment -

          +1
          Same behaviour also on Jenkins 2.164.2.
          Our master doesn't have any executors, and after a master restart all the agents stay like this for a while without resuming:

          Alex Taylor added a comment -

          mirceaalbu I think this may be a different issue, since this is a much later version of Jenkins you are updating to. I would open a new JIRA with the Jenkins logs included, since there may be an error there about why the build did not resume.


          papanito added a comment - edited

          I face the same (or similar) issue. I actually get this:

          Resuming build at Mon May 11 01:26:46 CEST 2020 after Jenkins restart
          Waiting to resume part of Delivery Pipelines » mdp-delivery-pipeline » master mdp-release-1.5.52#23: In the quiet period. Expires in 0 ms
          [Pipeline] End of Pipeline
          [Bitbucket] Notifying commit build result
          [Bitbucket] Build result notified
          

          jenkins.log

          We are using Jenkins ver. 2.222.1


          Alex Taylor added a comment -

          papanito This issue is about a pipeline that hangs while waiting to resume after a restart, not a build that failed immediately after the restart. If you feel there is an error after the build resumed, then please create a new issue, as your listed problem has nothing to do with the current JIRA.

          Additionally, if you want help diagnosing the problem, you will need to attach a full build folder to that new JIRA case, as that is where the information about why the build stopped will be located. But just based on that very short log, it seems to be operating correctly, so I am not clear on why you believe it to be a failure.


          Alex Taylor added a comment -

          This issue is being marked fixed, as it was originally reported for a durable-task plugin issue which has since been fixed and released.

          If people are seeing similar issues in later versions of Jenkins, please open a new case and mention that it is similar to this one.

          Additionally, if you are experiencing this issue on a particular build, please attach the full build folder zipped up, as that will contain all the relevant data.


          papanito added a comment -

          JENKINS-62248

          Frédéric Meyrou added a comment - edited

          Dear,

          I have a very similar issue, but my Jenkins LTS version and plugins are now all up-to-date.

          After a difficult restart, I have many jobs pending with the following kind of message:

          00:00:00.008 Started by timer
          00:00:00.219 Opening connection to http://jirasvnprod.agfahealthcare.com/svn/idrg/diagnosis-coding/
          00:00:37.968 Obtained Jenkinsfile_PROPERTIES from 119148
          00:00:37.968 Running in Durability level: MAX_SURVIVABILITY
          00:00:47.292 [Pipeline] Start of Pipeline
          00:01:53.178 [Pipeline] node
          00:02:08.424 Still waiting to schedule task
          00:02:08.425 All nodes of label ‘SHARED&&BORDEAUX&&WINDOWS64’ are offline (>>> ACTUALLY they are online!)
          00:52:09.681 Ready to run at Sun Nov 15 17:54:08 CET 2020
          00:52:09.681 Resuming build at Sun Nov 15 17:54:08 CET 2020 after Jenkins restart
          18:54:07.898 Ready to run at Mon Nov 16 11:56:06 CET 2020
          18:54:07.898 Resuming build at Mon Nov 16 11:56:06 CET 2020 after Jenkins restart

          >>> We are now the 18th! 

          Do you guys have a console Groovy script to end all those jobs? (I have more than 500 of them on a platform with 10K jobs.)
          I need to scan all jobs in this situation and kill them.

          Any help appreciated.

          ./Fred

           


          Tomas Hartmann added a comment -

          If anyone else has a lot of zombie jobs, this is a script that I came up with to kill them without killing any non-zombie job:

          def x = 0
          for (job in Hudson.instance.getAllItems(org.jenkinsci.plugins.workflow.job.WorkflowJob)) {
            try {
              def build = job.getLastBuild()
              if (build == null) continue
              // A zombie build still shows a null execution state in its dump
              def isZombie = build.dump() ==~ /.*state=null.*/
              def isCompleted = build.completed
              if (!isCompleted && isZombie) {
                x = x + 1
                println "Candidate for Zombie: ${job}"
                build.doKill()
              }
            } catch (e) {
              // ignore jobs whose last build cannot be inspected
            }
          }
          println "Number of zombies killed: ${x}"

          It times out in jenkinsurl/script at around ~600 zombies, so it's possible you'll have to run it several times.
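
          If the script above hits that script-console timeout, a hypothetical batched variant (same state=null heuristic; batchLimit is an assumption to tune per instance) can cap the work done per run:

```groovy
// Sketch only: kills at most batchLimit zombies per run so a single
// invocation finishes before the script console times out; re-run it
// until it reports 0.
int batchLimit = 200
int killed = 0
for (job in Jenkins.instance.getAllItems(org.jenkinsci.plugins.workflow.job.WorkflowJob)) {
    if (killed >= batchLimit) break
    def build = job.getLastBuild()
    if (build == null || build.completed) continue
    if (build.dump() ==~ /.*state=null.*/) {
        killed++
        println "Killing zombie: ${build}"
        build.doKill()
    }
}
println "Zombies killed this run: ${killed}"
```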

           


          Patrick Riegler added a comment - edited

          We ran into a similar problem, and it took us a while to figure out the solution.

          In our case the issue was that the name of the package in the included library wasn't properly defined.
          It is essential that the package name matches the folder path and filename, as described in this example:
          https://www.jenkins.io/doc/book/pipeline/shared-libraries/#writing-libraries

          // src/org/foo/Zot.groovy 
          package org.foo 
          
          def checkOutFrom(repo) { 
            git url: "git@github.com:jenkinsci/${repo}" 
          } 
          
          return this 

          and in the pipeline script use:

          def z = new org.foo.Zot()
          z.checkOutFrom(repo) 

          The pipeline would initially still run if the package were called:

          package org.something.foo

          but then the class cannot be found after serialization, i.e. the build fails to resume after a restart.

          I hope this is of help 


            Assignee: ataylor Alex Taylor
            Reporter: elatt Erik Lattimore
            Votes: 13
            Watchers: 29

              Created:
              Updated:
              Resolved: