Project: Jenkins
Issue: JENKINS-60434

"Prepare for shutdown" should continue executing already running pipelines to completion

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Component: workflow-cps-plugin

      Based on dnusbaum's comment from JENKINS-34256:

      A fix for this issue was just released in Pipeline: Groovy Plugin version 2.78. I think there is/was some confusion as to the expected behavior (myself included!), so let me try to clarify: When Jenkins prepares for shutdown, all running Pipelines are paused, and this is the intended behavior. The unintended behavior was that if you canceled shutdown, Pipelines remained paused. This has been fixed in 2.78; Pipelines will now resume execution if shutdown is canceled. Before 2.78, you had to manually pause and unpause each Pipeline to get it to resume execution, or restart Jenkins. Additionally, preparing Jenkins for shutdown and canceling shutdown now each cause a message to be printed to Pipeline build logs indicating that the Pipeline is being paused or resumed due to shutdown so that it is easier to understand what is happening.

      Based on comments here and elsewhere, I think some users would prefer a variant of "Prepare for shutdown" in which Pipelines continue executing to completion, the same as other types of jobs like Freestyle. If that is something you want, please open a new ticket, describing your use case and the desired behavior.

      [...]

      If there is some other aspect of this issue that you would like to see addressed, or a different behavior you would prefer, please open a new ticket describing your particular use case.

      My use case is to make it easier to restart the Jenkins master for upgrading Jenkins core or updating Jenkins plugins, because right now I need to do the following:

      1. wait until no pipelines are running anymore
        • which can be difficult in larger Jenkins environments during normal working hours (steady commits keep triggering pipelines), and also when long-running test suites are triggered around the clock
      2. click "Prepare for shutdown" (see the Script Console sketch after this list)
      3. ... (continue normal work like upgrading/updating)
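
      The following Script Console sketch approximates steps 1-2 above. It is only an illustration of the manual routine (not an existing feature), built on the public Jenkins API: it quiets Jenkins down only if no executor is busy and the queue is empty.

      import jenkins.model.Jenkins

      def jenkins = Jenkins.get()
      def busy = jenkins.computers.sum { it.countBusy() } ?: 0
      if (busy == 0 && jenkins.queue.isEmpty()) {
          jenkins.doQuietDown()   // same effect as clicking "Prepare for shutdown"
          println 'Nothing running; quieting down so the upgrade can start.'
      } else {
          println "Still ${busy} busy executor(s) and/or queued items; try again later."
      }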


          Ivan Rossier added a comment -

          Totally agree with the last remark: "Prepare for shutdown" was really useful for maintenance purposes.

          Jen added a comment -

          Yes, 100% agree! Please make fixing "Prepare for shutdown" a priority!

          Kari Niemi added a comment -

          The official KB from Cloudbees also states it incorrectly: https://docs.cloudbees.com/docs/cloudbees-ci-kb/latest/client-and-managed-masters/how-do-i-stop-builds-on-slaves-to-prepare-for-routine-jenkins-maintenance

          Kari Niemi added a comment - edited

          Another possible workaround is described here: JENKINS-72097 ("Run Exclusive" does not work after Jenkins restart/prepareForShutdown).

          I've been looking at pretty much all the alternatives to tackle this problem, but all have some shortcomings. That is the closest I've found so far, and it does not require adding any new logic or plugins to all the Jenkins jobs.

          Edit: that solution does not work either. Despite the plugin's docs, Jenkins pipelines get paused at the next node() section if a "Run Exclusive" job is running. I'm considering abruptly aborting all jobs and rebooting all Jenkins nodes when the maintenance break starts, never mind the devs and the running builds.

          Alan Kyffin added a comment -

          I've found this frustrating because jobs often don't survive a restart.

          I have a solution which allows you to disable the pausing of pipelines by setting a system property: https://github.com/jenkinsci/workflow-cps-plugin/pull/846.

          Jesse Glick added a comment -

          The only case to pause running jobs would be if you had one of those 18 hour jobs running and had to restart immediately.

          Not really. The behavior is designed to allow the controller to be restarted nearly immediately, even when there are Pipeline builds running. Any builds running inside a sh step (for example) may be “paused” at the Groovy level but the actual task running on the agent continues without interruption and may complete during the quiet period, while the controller is restarting, or after restart, without affecting ultimate build status. Once all builds get to a safe spot (which should normally be in a matter of seconds, assuming there are not any freestyle builds running) the restart can proceed.

          There probably needs to be a distinct admin gesture to “prepare for eventual shutdown” to handle special circumstances, such as:

          • there are freestyle (or other non-Pipeline) builds running, which cannot tolerate a controller restart
          • there are some Pipeline builds running which are marked with the option to not permit resumption across controller restarts

          This would need to suppress the behavior of pausing the CPS VM portion of running Pipeline builds and force CpsFlowExecution.blocksRestart on so that the restart would wait until all builds of all types have completed naturally. As I recall there is also logic to suppress scheduling of new queue items, which would need to exempt new node blocks from running Pipeline builds (or else the system would livelock).
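
          To make the point about the sh step concrete, here is a minimal scripted Pipeline sketch (the agent label and script name are hypothetical): if "Prepare for shutdown" is triggered while the sh step is running, the Groovy side of the build is paused on the controller, but the shell process keeps running on the agent and its exit status is picked up once the build resumes.

          node('linux') {                      // assumes an agent labeled 'linux' exists
              sh './run-long-test-suite.sh'    // keeps running on the agent even while the
                                               // controller-side Groovy execution is paused
              echo 'This line runs only after the build resumes.'
          }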

          Roman Zwi added a comment -

          This issue is quite complicated in our setup because we have many cascaded jobs, and lots of them don't survive a restart (for one reason or another).
          So I can think of two possibilities to solve this:

          • inhibit "external" triggers: don't allow things like manually (re)starting a build, SCM trigger, time trigger,... but still allow subjobs to be executed.
            This would allow cascaded jobs to finish (as they need to execute subjobs to get finished).
            OR
          • don't start any new jobs AND wait until all running jobs are in a state where they are waiting for a subjob to be finished - presuming that this is a good state for a safe restart in any case.

          I don't know if any of this would be easy (or even possible) to implement.
          And of course it still leaves the inconvenience that you have to wait until long-running jobs (if any) finish, but it would help in our case.
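
          For context on the "cascaded jobs" case: the parent/subjob relationship described above typically looks like the Pipeline fragment below (the job name is hypothetical). With the default wait: true the parent Pipeline sits waiting for the subjob, which is the state suggested as safe for a restart.

          // Parent Pipeline: trigger a subjob and block until it finishes
          def child = build(job: 'integration-tests', wait: true)   // wait: true is the default
          echo "Subjob finished with result: ${child.result}"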

          Alan Kyffin added a comment -

          To allow pipelines to run to completion, including new node steps, ContinuedTask would have to extend Task.NonBlockingTask to allow them to be scheduled. CpsFlowExecution.blocksRestart() could simply return true. This failed for me with cloud executors because no new agents were provisioned during quietDown.

          Allowing pipelines to run until the next node step requires not pausing the pipeline. CpsFlowExecution.blocksRestart() already checks with each StepExecution so ExecutorStepExecution.blocksRestart() could return true unless it itself is blocked. However, the pipeline then has to be paused to save its state before Jenkins can be restarted.

          I think both approaches would fail in the case of nested jobs.
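
          As a way to observe the behaviour discussed here, the following Script Console sketch (an illustration using the public FlowExecution API, not part of any proposed change) lists running Pipeline builds and whether their execution currently blocks a restart:

          import jenkins.model.Jenkins
          import org.jenkinsci.plugins.workflow.job.WorkflowJob

          Jenkins.get().getAllItems(WorkflowJob).each { job ->
              job.builds.findAll { it.isBuilding() }.each { run ->
                  def exec = run.getExecution()
                  println "${run.fullDisplayName}: blocksRestart=${exec?.blocksRestart()}"
              }
          }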

          Kevin added a comment -

          I would also greatly appreciate this enhancement. Currently our Jenkins maintenance is very archaic: we need to manually wait for an idle time to run upgrades so we avoid interrupting running pipelines. We mostly have complex pipelines that are long-running, multi-node, and with nested pipeline calls; they are not designed to survive restarts.

          It would be great if I could just initiate a safeRestart and it would only proceed once all running jobs/pipelines have completed, while preventing new jobs from starting.

          Thank you.
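
          A rough Script Console approximation of the behaviour described above (an assumption, not an existing feature) is sketched below: it polls until no executor is busy and the queue is empty, then triggers a safe restart. Note that it does not stop new builds from starting while it waits, and quieting down earlier would pause Pipelines, which is exactly the limitation this ticket asks to change.

          import jenkins.model.Jenkins

          def jenkins = Jenkins.get()
          while ((jenkins.computers.sum { it.countBusy() } ?: 0) > 0 || !jenkins.queue.isEmpty()) {
              sleep 30000   // poll every 30 seconds until executors and the queue are idle
          }
          jenkins.safeRestart()   // quiets down and restarts now that Jenkins is idle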

          Jesse Glick added a comment -

          a distinct admin gesture

          Unnecessary I guess, if there are any running non-Pipeline builds or Pipeline builds marked with "Do not allow the pipeline to resume if the controller restarts": this should be a sufficient signal.

          As mentioned in recent comments, there would be some work to do to ensure that new node blocks could be scheduled but not new top-level builds…except perhaps new downstream builds triggered via the build step (with the default wait: true), since otherwise you would again livelock.

          The trickier question is the case that there is a mixture of resumable and non-resumable builds. The current behavior optimizes for a quicker restart, by pausing new activity in the resumable Pipeline builds. But if the non-resumable builds, currently freestyle, would still be running for a long time anyway then you may as well get more work done in the resumable builds while you wait. You just do not want to be initiating new agent connections and the like right before the controller is about to shut down. Perhaps it would make sense to wait for a few minutes to see if the safe restart proceeds in a timely fashion, before giving up and unpausing any resumable builds.

          Of course it would also be valuable to track down cases of Pipeline builds which ought to survive restarts (i.e., do not involve weird Groovy logic with non-Serializable local variables!) but sometimes do not, come up with reproducible test cases, and get those fixed.
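
          For reference, the option mentioned above ("Do not allow the pipeline to resume if the controller restarts") is the disableResume option. A minimal Declarative sketch (the stage contents are hypothetical):

          pipeline {
              agent any
              options {
                  disableResume()   // marks the build as non-resumable across controller restarts
              }
              stages {
                  stage('Build') {
                      steps {
                          sh 'make'   // hypothetical build command
                      }
                  }
              }
          }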

            Assignee: Unassigned
            Reporter: Reinhold Füreder
            Votes: 40
            Watchers: 48
