Jenkins / JENKINS-60434

"Prepare for shutdown" should continue executing already running pipelines to completion

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Component: workflow-cps-plugin

      Based on dnusbaum's comment from JENKINS-34256:

      A fix for this issue was just released in Pipeline: Groovy Plugin version 2.78. I think there is/was some confusion as to the expected behavior (myself included!), so let me try to clarify: When Jenkins prepares for shutdown, all running Pipelines are paused, and this is the intended behavior. The unintended behavior was that if you canceled shutdown, Pipelines remained paused. This has been fixed in 2.78; Pipelines will now resume execution if shutdown is canceled. Before 2.78, you had to manually pause and unpause each Pipeline to get it to resume execution, or restart Jenkins. Additionally, preparing Jenkins for shutdown and canceling shutdown now each cause a message to be printed to Pipeline build logs indicating that the Pipeline is being paused or resumed due to shutdown so that it is easier to understand what is happening.

      Based on comments here and elsewhere, I think some users would prefer a variant of "Prepare for shutdown" in which Pipelines continue executing to completion, the same as other types of jobs like Freestyle. If that is something you want, please open a new ticket, describing your use case and the desired behavior.

      [...]

      If there is some other aspect of this issue that you would like to see addressed, or a different behavior you would prefer, please open a new ticket describing your particular use case.

      My use case is making it easier to restart the Jenkins master for upgrading Jenkins core or updating Jenkins plugins, because currently I need to do the following:

      1. wait until no pipelines are running anymore
        • which can be difficult in bigger Jenkins environments during normal working hours (steady commits keep triggering pipelines), and also when long-running test suites are triggered around the clock (see the script-console sketch after this list for one way to check)
      2. click "prepare for shutdown"
      3. ... (continue normal work like upgrading/updating)
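
      One way to script that "wait until idle" check is via the script console. This is a minimal sketch; the Job/Run APIs used are standard Jenkins core:

          // Script console sketch: list builds that are still running.
          import hudson.model.Job

          def running = Jenkins.instance.allItems(Job).collectMany { job ->
              job.builds.findAll { it.isBuilding() }
          }
          println(running ? "BUSY: ${running*.fullDisplayName}" : 'IDLE')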


          Ulrich Köhler added a comment -

          Another use case: the ThinBackup plugin puts Jenkins into shutdown mode and waits for all jobs to finish. But Pipeline jobs never finish: deadlock.

          Thomas de Grenier de Latour added a comment -

          In case it can be useful to anyone, here is the "planned upgrade" process we have for Jenkins in my company.
          It relies on a custom quiet-mode implemented in an internal plugin, which basically allows already running builds to terminate (including Pipelines), but forbids starting execution of new builds (except if they are necessary for termination of the already running builds).

          The overall process is automated (we have many Jenkins instances), and it goes like this:

          • activate the custom quiet-down mode (forbid starting new builds)
          • poll Jenkins until it's idle, for up to X minutes, and then do the upgrade (including an actual restart)
          • on time-out of this polling, cancel the planned upgrade (cancel the custom quiet-mode) and retry it all later (sometimes we have to find arrangements with users, so that they don't launch their freaking 18-hour test suite on the day we are planning an upgrade)

          We don't have plans/time to publish and maintain this as a community plugin, but if someone wants to do something similar, I will dump the code below, feel free to reuse what you want.

          Note that we would probably never have written this code if we had not been bitten many times by JENKINS-34256. A few years ago, we were simply using the standard Jenkins quiet-mode, but then stuck Pipelines (when the upgrade was cancelled) really became an issue...
          Now that JENKINS-34256 is fixed, I don't know, we might consider going back to this standard solution. But I think our users prefer having their Pipelines finish before the upgrade, rather than be paused/resumed (mainly because the "resume" part is not always smooth: some plugin upgrades might break compatibility of the serialized data, etc.).

          Anyway, this is the "interesting" part of the code, the QuietDownQueueTaskDispatcher, which filters which new Queue.Item can actually be started when in (custom) quiet-mode.

          @Extension
          public class QuietDownQueueTaskDispatcher extends QueueTaskDispatcher {
          
          	@Inject
          	QuietDownStateManager quietDownStateManager;
          
          	// key: upstreamProject+upstreamBuild from an UpstreamCause
          	// value: true if children builds should be allowed to run
          	private ConcurrentHashMap<String, Boolean> knownUpstreamCauses = new ConcurrentHashMap<>();
          
          	// used to decide when cache should be flushed
          	private AtomicLong quietDownTimestamp = new AtomicLong(0L);
          
          	@Override
          	public @CheckForNull CauseOfBlockage canRun(Queue.Item item) {
          		QuietDownState currentState = quietDownStateManager.getState();
          		if (!currentState.isDown()) {
          			return null;
          		}
          
          		// flush cache if quietDown state has changed
          		if (quietDownTimestamp.getAndSet(currentState.since()) != currentState.since()) {
          			knownUpstreamCauses.clear();
          		}
          
          		Queue.Task task = item.task;
          		// always allow some kind of tasks
          		if (task instanceof NonBlockingTask || task instanceof ContinuedTask) {
          			return null;
          		}
          		// allow build task because of its upstream cause
          		if (hasAllowingCause(item.getCauses())) {
          			return null;
          		}
          		// not allowed, let's explain why
          		return new QuietDownBlockageCause(currentState);
          	}
          
          	private boolean hasAllowingCause(@Nonnull List<Cause> causes) {
          		boolean result = false;
          		for (Cause parentCause: causes) {
          			if (!(parentCause instanceof UpstreamCause)) {
          				continue;
          			}
          			result = result || isAllowingUpstreamCause((UpstreamCause) parentCause);
          		}
          		return result;
          	}
          
          	private boolean isAllowingUpstreamCause(@Nonnull UpstreamCause cause) {
          		String runKey = cause.getUpstreamProject() + ':' + cause.getUpstreamBuild();
          		Boolean decisionFromCache = knownUpstreamCauses.get(runKey);
          		if (decisionFromCache != null) {
          			return decisionFromCache;
          		}
          		boolean newDecision = hasAllowingCause(cause.getUpstreamCauses())
          				|| isRunAllowingDownstreamBuilds(cause.getUpstreamRun());
          		knownUpstreamCauses.put(runKey, newDecision);
          		return newDecision;
          	}
          
          	private boolean isRunAllowingDownstreamBuilds(@CheckForNull Run<?, ?> run) {
          		if (run == null || !run.isBuilding()) {
          			return false;
          		}
          		// a running WorkflowRun or MatrixBuild may wait for its children to complete
          		// Note: assume there exists no MatrixBuild subclass, it saves an optional plugin dependency
          		return (run instanceof WorkflowRun || "hudson.matrix.MatrixBuild".equals(run.getClass().getName()));
          	}
          
          	public static class QuietDownBlockageCause extends CauseOfBlockage {
          
          		private final @Nonnull QuietDownState quietDownState;
          
          		private QuietDownBlockageCause(QuietDownState quietDownState) {
          			this.quietDownState = quietDownState;
          		}
          
          		public static @CheckForNull QuietDownBlockageCause from(QuietDownState quietDownState) {
          			if (!quietDownState.isDown()) {
          				return null;
          			}
          			return new QuietDownBlockageCause(quietDownState);
          		}
          
          		@Override
          		public String getShortDescription() {
          			return quietDownState.toShortDescriptionString();
          		}
          
          	}
          }
          

          The currently implemented policy is to only allow tasks which are:

          • NonBlockingTask, or Pipeline ContinuedTask (I can't remember the specific details; I wrote that a long time ago)
          • children of an already running Pipeline or Matrix build (that's necessary to let these builds terminate, because they can wait for their children's termination; it could be refined, though: for instance we don't really need to allow builds launched by a Pipeline build step with the wait=false parameter)

          Other than these, new builds will be declined, and stay in the queue.

          To avoid spending too much time walking the UpstreamCause of the candidate tasks, we keep a cache of already made decisions (whether a specific build is a legitimate cause for allowing children builds, or not).

          A QuietDownState has a State (AVAILABLE or QUIET_DOWN enumeration), a starting timestamp, and a cause message.

          public class QuietDownState {
          
          	private final String cause;
          	private final State state;
          	private final long timestamp;
          
          	private QuietDownState(@Nonnull State state) {
          		this(state, null);
          	}
          
          	private QuietDownState(@Nonnull State state, String cause) {
          		this.cause = cause;
          		this.state = state;
          		this.timestamp = System.currentTimeMillis();
          	}
          
          	public static @Nonnull QuietDownState available() {
          		return new QuietDownState(State.AVAILABLE);
          	}
          
          	public static @Nonnull QuietDownState quietDown(@Nonnull String cause) {
          		return new QuietDownState(State.QUIET_DOWN, cause);
          	}
          
          	public boolean is(State state) {
          		return this.state == state;
          	}
          
          	public boolean isDown() {
          		return state.down;
          	}
          
          	public @CheckForNull String why() {
          		return cause;
          	}
          
          	public long since() {
          		return timestamp;
          	}
          
          	public @Nonnull String toApiString() {
          		StringBuilder sb = new StringBuilder();
          		sb.append(state);
          		sb.append(" since ");
          		sb.append(Util.XS_DATETIME_FORMATTER.format(timestamp));
          		if (StringUtils.isNotEmpty(cause)) {
          			sb.append(" - ").append(cause);
          		}
          		return sb.toString();
          	}
          
          	// FIXME: better message/formatting
          	public @Nonnull String toUserString() {
          		StringBuilder sb = new StringBuilder();
          		sb.append("Jenkins has been ");
          		sb.append(state.label);
          		sb.append(" for ");
          		sb.append(Util.getTimeSpanString(System.currentTimeMillis() - timestamp));
          		if (StringUtils.isNotEmpty(cause)) {
          			sb.append(" - ").append(cause);
          		}
          		return sb.toString();
          	}
          
          	// FIXME: make it shorter?
          	public @Nonnull String toShortDescriptionString() {
          		return toUserString();
          	}
          
          	public @Nonnull String toString() {
          		return toApiString();
          	}
          
          	@Override
          	public int hashCode() {
          		// <snip>
          	}
          
          	@Override
          	public boolean equals(Object obj) {
          		// <snip>
          	}
          
          	public enum State {
          		AVAILABLE(false, "available"), QUIET_DOWN(true, "sleeping");
          		private boolean down;
          		private String label;
          
          		private State(boolean down, String label) {
          			this.down = down;
          			this.label = label;
          		}
          	}
          }
          

          The (global) current state can be changed via a QuietDownStateManager, which is a Guice singleton:

          public class QuietDownStateManager {
          
          	private AtomicReference<QuietDownState> currentState = new AtomicReference<>(QuietDownState.available());
          
          	public QuietDownState getState() {
          		return currentState.get();
          	}
          
          	public QuietDownState quietDown(String cause) {
          		final QuietDownState newState = QuietDownState.quietDown(cause);
          		return currentState.updateAndGet(
          				state -> state.is(QUIET_DOWN) ? state : newState);
          		// TODO: updating the cause (when already down) could be nice (while still preserving the initial timestamp)
          	}
          
          	public QuietDownState cancelQuietDown() {
          		final QuietDownState newState = QuietDownState.available();
          		return currentState.updateAndGet(
          				state -> state.is(AVAILABLE) ? state : newState);
          	}
          
          }
          
          @Extension
          public class GuiceBindings extends AbstractModule {
          
          	@Override
          	protected void configure() {
          		//...
          		bind(QuietDownStateManager.class).in(Singleton.class);
          	}
          
          }
          

          We control the QuietDownStateManager through a few simple HTTP methods (a usage sketch follows the list):

          • doQuietDown(): enable quiet-down mode (with a cause message)
          • doCancelQuietDown(): disable quiet-down mode
          • doGetQuietDownStatus(): get current quiet-down status
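
          For example, enabling the mode remotely might look like this; a minimal sketch, where the host name and credentials are illustrative assumptions (the endpoint name and the cause parameter come from the action code below):

              // Hedged sketch: enable the custom quiet-down mode over HTTP.
              // Host and credentials are assumptions, not part of the original code.
              def url = new URL('https://jenkins.example.com/somethingAPI/quietDown?cause=planned+upgrade')
              def conn = (HttpURLConnection) url.openConnection()
              conn.requestMethod = 'POST'
              conn.setRequestProperty('Authorization', 'Basic ' + 'admin:api-token'.bytes.encodeBase64().toString())
              println conn.inputStream.text   // e.g. "QUIET_DOWN since <timestamp> - planned upgrade"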

          We also have a method (doActivity() below) which we can poll to know whether Jenkins is BUSY or IDLE (that's what we use to wait for it to become idle before triggering the actual restart). This too could be refined: for instance, we could consider Jenkins idle when the only running Pipelines left are actually blocked on input steps.

          @Extension
          public class SomethingRemoteAPI extends AbstractModelObject implements UnprotectedRootAction {
          	@Inject
          	QuietDownStateManager quietDownStateManager;
          
          	public String getDisplayName() {
          		return "SomethingAPI";
          	}
          
          	public String getSearchUrl() {
          		return getUrlName();
          	}
          
          	public String getIconFileName() {
          		return null;
          	}
          
          	public String getUrlName() {
          		return "somethingAPI";
          	}
          
          	// <snip> other unrelated methods
          
          	@RequirePOST
          	public HttpResponse doQuietDown() {
          		Jenkins.getInstance().checkPermission(Jenkins.ADMINISTER);
          		return (req, rsp, node) -> {
          			final QuietDownState state = quietDownStateManager.quietDown(defaultString(req.getParameter("cause")));
          			rsp.setStatus(HttpServletResponse.SC_OK);
          			rsp.setContentType("text/plain");
          			PrintWriter w = rsp.getWriter();
          			w.println(state.toApiString());
          		};
          	}
          
          	@RequirePOST
          	public HttpResponse doCancelQuietDown() {
          		Jenkins.getInstance().checkPermission(Jenkins.ADMINISTER);
          		return (req, rsp, node) -> {
          			final QuietDownState state = quietDownStateManager.cancelQuietDown();
          			rsp.setStatus(HttpServletResponse.SC_OK);
          			rsp.setContentType("text/plain");
          			PrintWriter w = rsp.getWriter();
          			w.println(state.toApiString());
          		};
          	}
          
          	public HttpResponse doGetQuietDownStatus() {
          		return (req, rsp, node) -> {
          			final QuietDownState state = quietDownStateManager.getState();
          			rsp.setStatus(HttpServletResponse.SC_OK);
          			rsp.setContentType("text/plain");
          			PrintWriter w = rsp.getWriter();
          			w.println(state.toApiString());
          		};
          	}
          
          	public HttpResponse doActivity() {
          		final int httpStatus;
          		final String body;
          		try {
          			body = countBusyExecutors() > 0 ? "BUSY" : "IDLE" ;
          			httpStatus = HttpServletResponse.SC_OK;
          		} catch (RuntimeException e) {
          			LOGGER.log(Level.WARNING, "failed to count busy executors: " + e.getMessage(), e);
          			body = "UNKOWN" ;
          			httpStatus = HttpServletResponse.SC_INTERNAL_SERVER_ERROR;
          		}
          		return (req, rsp, node) -> {
          			rsp.setStatus(httpStatus);
          			rsp.setContentType("text/plain");
          			PrintWriter w = rsp.getWriter();
          			w.println(body);
          		};
          	}
          
          	private int countBusyExecutors() {
          		// see hudson.model.ComputerSet.getBusyExecutors()
          		int r = 0;
          		for (Computer c : Jenkins.get().getComputers()) {
          			if (c.isOnline()) {
          				r += c.countBusy();
          			}
          		}
          		return r;
          	}
          }
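
          The polling step of the upgrade process could then be scripted against doActivity() from an external Groovy script, roughly like this; a sketch, where the host name and the 30-minute budget are assumptions ('activity' is the Stapler URL for doActivity()):

              // Hedged sketch: wait until doActivity() reports IDLE, with a timeout.
              def deadline = System.currentTimeMillis() + 30 * 60 * 1000
              def status = 'BUSY'
              while (status != 'IDLE' && System.currentTimeMillis() < deadline) {
                  status = new URL('https://jenkins.example.com/somethingAPI/activity').text.trim()
                  if (status != 'IDLE') { sleep 15000 }   // retry every 15 seconds
              }
              println(status == 'IDLE' ? 'idle: proceed with restart' : 'timed out: cancel quiet-down and retry later')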
          

          Finally, we also have some bits of code to display a message in the Jenkins GUI when our quiet-mode is enabled (that's part of a more general-purpose system we have for pushing notification messages to our Jenkins users, but it could of course be implemented differently in the context of a dedicated plugin).


          Reinhold Füreder added a comment -

          tom_gl Thanks for the insight! And wow, that is impressive; I am not sure you got that right on the first attempt

          Jason Antman added a comment -

          We could really use this as well; our use case is similar to the above, largely around upgrades to Jenkins or the infrastructure that it runs on. To put it simply:

          1. There's a button on the Manage Jenkins page that says, "Prepare for Shutdown: Stops executing new builds, so that the system can be eventually shut down safely." I'd say that this is no longer correct, since it actually does more than that: it now also pauses running builds.
          2. In our case at least, pausing a pipeline is almost never the right thing to do. This has negative impacts for both cost (if we spin up a bunch of billed-by-the-minute EC2 instances for a test environment, we don't want to pause after doing that and before tearing it down) and user experience (when a pipeline kicks off, the people who are watching it expect it to run to completion, not get paused). We also occasionally have issues around timeouts, due to pausing between time-dependent stages.
          3. There's no clear visual indication of this state. If you look at "Build Executor Status" on the main page, it looks like the builds are running. There doesn't appear to be anything clearly indicating, "HEY, THIS BUILD IS PAUSED!"
          4. This is, in my opinion, a really major and unintuitive change from previous behavior. I've been using "Prepare for Shutdown" to upgrade Jenkins for years. The first time I found out that it's now pausing jobs, I spent an hour waiting for the currently-running jobs to complete (with no indication they were paused, see above) before I finally looked at the console output of one and found out that it was paused.


          Tim Black added a comment -

          I agree completely with all 4 of @Jason Antman's points, and share the same negative experience with this misbehavior. This is a major problem with companies using pipelines and performing upgrades.


          Tim Brown added a comment - - edited

          Have you tried using <jenkins_url>/safeRestart?
          It seems like it restarts once Pipelines are paused, and resumes Pipelines after the restart. That said, it seems to claim it waits until they are finished (which I am hoping just needs updating).
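
          For reference, the same gesture is available from the script console; a one-line sketch using the core Jenkins API:

              // Equivalent of POSTing to <jenkins_url>/safeRestart
              Jenkins.instance.safeRestart()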


          Jonathan Delizy added a comment -

          I tried the Lenient Shutdown plugin, but it pauses running pipelines and doesn't prevent new pipelines from starting.
          This plugin is pretty old, so it is probably not compatible with pipelines "by design".

          Brett Alex added a comment -

          @Tim Brown, I think this whole ticket is based on the fact that <jenkins_url>/safeRestart doesn't actually work. I believe it does work in some cases but not in others.

          For example, I hit this bug at least once a week when applying patches. I think our cause may be that we have some freestyle jobs that trigger pipeline jobs (the horror!). Since the pipeline jobs pause indefinitely, the freestyle jobs never complete and the restart will never happen.

          I think in the more common use case /safeRestart should just let all running jobs finish, as it has done since the beginning. The only case for pausing running jobs would be if you had one of those 18-hour jobs running and had to restart immediately.

          I agree completely with all 4 of @Jason Antman's points as well.


          Tim Brown added a comment - - edited

          Hi,

          This is not a direct solution, but hopefully it will help someone. It appears that as well as Online/Offline and Disconnected, Jenkins nodes also have a Suspended state. This seems to mean that they will finish their current tasks but not take on new ones. So I wrote two script console snippets to suspend and resume all nodes (easily extended to work on a subset by using `findAll` with a predicate instead of `each`):

          Suspend (Pause) all nodes

          Jenkins.instance.getNodes().each{ node ->
              def computer = node.toComputer()
              computer.setAcceptingTasks(false)
              println("${computer.getName()} accepting tasks: ${computer.isAcceptingTasks()}")
          }
          // Prevent this from dumping the list of nodes
          return null
          

          Resume (unpause) all nodes

          Jenkins.instance.getNodes().each{ node ->
              def computer = node.toComputer()
              computer.setAcceptingTasks(true)
              println("${computer.getName()} accepting tasks: ${computer.isAcceptingTasks()}")
          }
          // Prevent this from dumping the list of nodes
          return null
          

          Note: if this works for this use-case, it should be trivial to add as an option on the Management page (e.g. a toggle button to suspend/resume nodes).

          I think this will be useful (for us at least) in cases where we want to pause all (or a subset) of Physical nodes, e.g. for a driver update, but leave Jenkins itself running.

          We have a problem with the restart functionality: we want to prepare for shutdown, but then shut down when we are ready (not when safeRestart is ready).
          I have a snippet which (I think) returns the list of Pipeline builds that are blocking the shutdown (not Paused/SUSPENDED; I am not sure what the difference is, but `.isPaused()` seemed to return false despite the Pipeline console stating `Pausing (Preparing for shutdown)`).

          import org.jenkinsci.plugins.workflow.job.WorkflowRun
          
          Jenkins.instance.getView('All').getBuilds().findAll {
            // Get the list of builds that have started (not in queue) but have not finished.
            // (Note: getResult() is null while a build is running, so compare with ==,
            // not .equals(null), which would throw a NullPointerException.)
            it.getResult() == null
          }.findAll {
            // Of those, return the workflow runs that block the (safe) restart.
            it instanceof WorkflowRun && it.getExecution().blocksRestart()
          }
          

          (feel free to extend for non-Pipeline jobs).
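
          Taking up that suggestion, a hedged sketch that also counts non-Pipeline builds, on the assumption that any running non-Pipeline build (e.g. freestyle, which cannot resume across a restart) blocks a truly safe restart:

              import org.jenkinsci.plugins.workflow.job.WorkflowRun

              Jenkins.instance.getView('All').getBuilds().findAll {
                  it.getResult() == null   // started but not finished
              }.findAll {
                  // Non-Pipeline builds always block; Pipeline builds block only if their execution says so.
                  !(it instanceof WorkflowRun) || it.getExecution().blocksRestart()
              }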


          David Taylor added a comment -

          Any updates on this? /safeRestart does not work correctly, and it hasn't for at least 3-4 years.

          I thought it was supposed to prevent any new jobs from starting but allow currently running pipelines to complete. Instead it pauses currently running pipelines, which prevents them from ever finishing. I think it allows freestyle jobs to complete, but pipelines get paused when they move to the next stage.


          Pay Bas added a comment -

          Yeah this has been plaguing me forever. The "Prepare for Shutdown" feature not allowing already running pipelines to continue/finish is a major p.i.t.a.

          I need a way to put Jenkins in maintenance mode at 17:00 to prevent new builds from being started, but let current ones finish so I have a clean Jenkins at 18:00 when I actually start the maintenance.


          Ivan Rossier added a comment -

          Totally agree with the last remark: "Prepare for shutdown" was really useful for maintenance purposes.


          Jen added a comment -

          Yes, 100% agree! Please make fixing the "Prepare for shutdown" a priority!


          Kari Niemi added a comment -

          The official KB from CloudBees also states it incorrectly: https://docs.cloudbees.com/docs/cloudbees-ci-kb/latest/client-and-managed-masters/how-do-i-stop-builds-on-slaves-to-prepare-for-routine-jenkins-maintenance

          Kari Niemi added a comment - - edited

          Another possible work around described here: JENKINS-72097 "Run Exclusive" does not work after Jenkins restart/prepareForShutdown - Jenkins Jira

          I've been looking at pretty much all the alternatives to tackle this problem... but all have shortcomings. That is now the closest I've found, and it does not require adding any new logic or plugins to all the Jenkins jobs.

          Edit: It's BS. That solution does not work either. Despite the plugin's docs, Jenkins pipelines get paused at the next node() section if a "Run Exclusive" job is running. I'm considering abruptly aborting all jobs and rebooting all Jenkins nodes when the maintenance breaks start - nevermind the devs and the running builds.


          Alan Kyffin added a comment -

          I've found this frustrating because jobs often don't survive a restart.

          I have a solution which allows you to disable the pausing of pipelines by setting a system property: https://github.com/jenkinsci/workflow-cps-plugin/pull/846.


          Jesse Glick added a comment -

          The only case to pause running jobs would be if you had one of those 18 hour jobs running and had to restart immediately.

          Not really. The behavior is designed to allow the controller to be restarted nearly immediately, even when there are Pipeline builds running. Any builds running inside a sh step (for example) may be “paused” at the Groovy level but the actual task running on the agent continues without interruption and may complete during the quiet period, while the controller is restarting, or after restart, without affecting ultimate build status. Once all builds get to a safe spot (which should normally be in a matter of seconds, assuming there are not any freestyle builds running) the restart can proceed.

          There probably needs to be a distinct admin gesture to “prepare for eventual shutdown” to handle special circumstances, such as:

          • there are freestyle (or other non-Pipeline) builds running, which cannot tolerate a controller restart
          • there are some Pipeline builds running which are marked with the option to not permit resumption across controller restarts

          This would need to suppress the behavior of pausing the CPS VM portion of running Pipeline builds and force CpsFlowExecution.blocksRestart on so that the restart would wait until all builds of all types have completed naturally. As I recall there is also logic to suppress scheduling of new queue items, which would need to exempt new node blocks from running Pipeline builds (or else the system would livelock).


          Roman Zwi added a comment -

          This issue is quite complicated in our setup because we have many cascaded jobs, and lots of them don't survive a restart (for one reason or another).
          So I can think of two possibilities to solve this:

          • inhibit "external" triggers: don't allow things like manually (re)starting a build, SCM trigger, time trigger,... but still allow subjobs to be executed.
            This would allow cascaded jobs to finish (as they need to execute subjobs to get finished).
            OR
          • don't start any new jobs AND wait until all running jobs are in a state where they are waiting for a subjob to be finished - presuming that this is a good state for a safe restart in any case.

          I don't know if any of this would be easy (or even possible) to implement.
          And of course it still leaves the inconvenience that you have to wait until long running jobs get finished (if any) but it would help in our case.


          Alan Kyffin added a comment -

          To allow pipelines to run to completion, including new node steps, ContinuedTask would have to extend Task.NonBlockingTask to allow them to be scheduled. CpsFlowExecution.blocksRestart() could simply return true. This failed for me with cloud executors because no new agents were provisioned during quietDown.

          Allowing pipelines to run until the next node step requires not pausing the pipeline. CpsFlowExecution.blocksRestart() already checks with each StepExecution so ExecutorStepExecution.blocksRestart() could return true unless it itself is blocked. However, the pipeline then has to be paused to save its state before Jenkins can be restarted.

          I think both approaches would fail in the case of nested jobs.


          Kevin added a comment -

          I would also greatly appreciate this enhancement. Currently our Jenkins maintenance process is very archaic. We need to wait manually for an idle time to run upgrades without interrupting running pipelines. We mostly have complex pipelines that are long-running, multi-node, and full of nested pipeline calls. They are not designed to survive restarts.

          It would be great if I could just initiate a safeRestart and it would proceed only once all running jobs/pipelines have completed, while preventing new jobs from starting.

          Thank you.


          Jesse Glick added a comment -

          a distinct admin gesture

          Unnecessary I guess, if there are any running non-Pipeline builds or Pipeline builds marked with Do not allow the pipeline to resume if the controller restarts: this should be a sufficient signal.

          As mentioned in recent comments, there would be some work to do to ensure that new node blocks could be scheduled but not new top-level builds…except perhaps new downstream builds triggered via the build step (with the default wait: true), since otherwise you would again livelock.

          The trickier question is the case that there is a mixture of resumable and non-resumable builds. The current behavior optimizes for a quicker restart, by pausing new activity in the resumable Pipeline builds. But if the non-resumable builds, currently freestyle, would still be running for a long time anyway then you may as well get more work done in the resumable builds while you wait. You just do not want to be initiating new agent connections and the like right before the controller is about to shut down. Perhaps it would make sense to wait for a few minutes to see if the safe restart proceeds in a timely fashion, before giving up and unpausing any resumable builds.

          Of course it would also be valuable to track down cases of Pipeline builds which ought to survive restarts (i.e., do not involve weird Groovy logic with non-Serializable local variables!) but sometimes do not, come up with reproducible test cases, and get those fixed.
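
          For context, a classic sketch of the kind of Groovy logic that breaks resumption (a minimal scripted Pipeline; the build-number regex is purely illustrative):

              // java.util.regex.Matcher is not Serializable, so holding one in a live local
              // variable means the Pipeline's program state cannot be persisted across a restart.
              def m = ('build-42' =~ /build-(\d+)/)
              node {
                  echo "build number: ${m.find() ? m.group(1) : '?'}"
              }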


            Assignee: Unassigned
            Reporter: Reinhold Füreder
            Votes: 40
            Watchers: 48