Type: Bug
Resolution: Unresolved
Priority: Minor
workflow-durable-task-step-plugin:1317.v5337e0c1fe28
durable-task-plugin:543.v262f6a_803410
Steps to reproduce
- Create a pipeline job which calls sh with a script (I don't think it matters what's in there, as long as it succeeds), plus other steps after that (like another sh step)
- Queue a run
- Abort the run near the end of the script
Expected behavior
- The run is recorded as ABORTED
- The run stops executing at the first sh step
Actual behavior
- The run is recorded as ABORTED
- The remaining steps execute anyway
Example snippet of run's console output where this happened
[2024-01-30T16:42:22.256Z] [Pipeline] sh
[2024-01-30T16:42:22.364Z] Aborted by someUser
[2024-01-30T16:42:22.368Z] Sending interrupt signal to process
[2024-01-30T16:42:22.580Z] [Pipeline] sh
Analysis
Punchline
If you time it just right, the run continues executing because the sub-process was successful, as if no interruption was requested!
The details
Thankfully, we had logging enabled for org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep on that instance (among other things) and were able to capture the following (slightly edited) lines:
2024-01-30 16:42:22.240+0000 [id=62552] FINE o.j.p.w.s.d.DurableTaskStep$Execution$NewlineSafeTaskListener$1#close: calling close with nl=true
2024-01-30 16:42:22.263+0000 [id=62938] FINEST o.j.p.w.s.d.DurableTaskStep$Execution#_listener: JENKINS-34021: DurableTaskStep.Execution.listener present in CpsStepContext[41:sh]:Owner[path/to/job/1217:path/to/job #1217]
2024-01-30 16:42:22.263+0000 [id=62938] FINE o.j.p.w.s.d.DurableTaskStep$Execution#start: launching task against hudson.remoting.Channel@395408fe:remoteAgent using RemoteLauncher[hudson.remoting.Channel@395408fe:remoteAgent]
2024-01-30 16:42:22.277+0000 [id=62938] FINE o.j.p.d.BourneShellScript#launchWithCookie: launching [/path/to/agent/caches/durable-task/durable_task_monitor_543.v262f6a_803410_linux_64, -controldir=/path/to/agent/workspace/path/to/job_tmp/durable-ec948452, -result=/path/to/agent/workspace/path/to/job_tmp/durable-ec948452/jenkins-result.txt, -log=/path/to/agent/workspace/path/to/job_tmp/durable-ec948452/jenkins-log.txt, -cookiename=JENKINS_SERVER_COOKIE, -cookieval=durable-32ae10b6354f5c7ed6687931b481399dd6f54c7f5ea557d68f98bc47c1a40b06, -script=/path/to/agent/workspace/path/to/job_tmp/durable-ec948452/script.sh, -output=/path/to/agent/workspace/path/to/job_tmp/durable-ec948452/output.txt, -daemon]
2024-01-30 16:42:22.288+0000 [id=62938] FINE o.j.p.w.s.d.DurableTaskStep$Execution#start: launched task
2024-01-30 16:42:22.367+0000 [id=62938] FINER o.j.p.w.s.d.DurableTaskStep$Execution#getWorkspace: remoteAgent seems to be online so using /path/to/agent/workspace/path/to/job
2024-01-30 16:42:22.368+0000 [id=62938] FINE o.j.p.w.s.d.DurableTaskStep$Execution#stop: stopping process
org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.interrupt(CpsFlowExecution.java:1210)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$2.lambda$interrupt$0(WorkflowRun.java:397)
    at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
2024-01-30 16:42:22.542+0000 [id=62832] FINER o.j.p.w.s.d.DurableTaskStep$Execution#getWorkspace: remoteAgent seems to be online so using /path/to/agent/workspace/path/to/job
2024-01-30 16:42:22.548+0000 [id=62832] FINER o.j.p.d.FileMonitoringTask$FileMonitoringController#writeLog: remote transcoding charset: US-ASCII
2024-01-30 16:42:22.548+0000 [id=62832] FINER o.j.p.d.FileMonitoringTask$FileMonitoringController#writeLog: remote transcoding charset: US-ASCII
2024-01-30 16:42:22.554+0000 [id=62832] FINE o.j.p.d.BourneShellScript$ShellController#exitStatus: found exit code 0 in /path/to/agent/workspace/path/to/job_tmp/durable-ec948452
2024-01-30 16:42:22.560+0000 [id=62832] FINE o.j.p.w.s.d.DurableTaskStep$Execution$NewlineSafeTaskListener$1#close: calling close with nl=true
2024-01-30 16:42:22.587+0000 [id=63209] FINEST o.j.p.w.s.d.DurableTaskStep$Execution#_listener: JENKINS-34021: DurableTaskStep.Execution.listener present in CpsStepContext[42:sh]:Owner[path/to/job/1217:path/to/job #1217]
2024-01-30 16:42:22.587+0000 [id=63209] FINE o.j.p.w.s.d.DurableTaskStep$Execution#start: launching task against hudson.remoting.Channel@395408fe:remoteAgent using RemoteLauncher[hudson.remoting.Channel@395408fe:remoteAgent]
...which allowed me to trace to the code.
We can see that line 502 of src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java at 1317.v5337e0c1fe28 in jenkinsci/workflow-durable-task-step-plugin reads:
LOGGER.log(Level.FINE, "stopping process", cause);
...which only executes after line 498 of src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java at 1317.v5337e0c1fe28 in jenkinsci/workflow-durable-task-step-plugin has executed:
causeOfStoppage = cause;
...so we know that the "abort" was received and recorded. Unfortunately, the script that sh was executing completed successfully, and quickly enough to write its result (0) to the result file before the interrupt could take effect.
We know that line 309 of src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java at 543.v262f6a_803410 in jenkinsci/durable-task-plugin is called, because it reports found exit code 0 in the log:
LOGGER.log(Level.FINE, "found exit code {0} in {1}", new Object[] {status, controlDir});
...unfortunately, I think the defect is a bit later, on line 657 of src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java at 1317.v5337e0c1fe28 in jenkinsci/workflow-durable-task-step-plugin where we decide to call the step a success based on:
if ((returnStatus && originalCause == null) || exitCode == 0) {
...where we're ignoring originalCause (previously initialized from causeOfStoppage on the previous line) just because the sub-process completed successfully.
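To make the effect of that condition concrete, here is a minimal sketch that evaluates the same boolean expression outside Jenkins. The class and method names (CurrentExitDecision, treatedAsSuccess) are hypothetical, the boolean parameter aborted stands in for originalCause != null, and the exit code 143 is only an illustrative non-zero value; only the expression itself comes from the line quoted above.

```java
public class CurrentExitDecision {

    // Mirrors the quoted condition from handleExit.
    // "aborted" is a stand-in for (originalCause != null).
    public static boolean treatedAsSuccess(boolean returnStatus, boolean aborted, int exitCode) {
        return (returnStatus && !aborted) || exitCode == 0;
    }

    public static void main(String[] args) {
        // The case from the logs: abort recorded, plain sh (no returnStatus), script exited 0.
        System.out.println("aborted, exit 0   -> success? "
                + treatedAsSuccess(false, true, 0));

        // For comparison: abort recorded, and the script ended with a non-zero exit code.
        System.out.println("aborted, exit 143 -> success? "
                + treatedAsSuccess(false, true, 143));
    }
}
```

In the first case the expression is true because exitCode == 0 is checked on its own, without regard to whether an abort was recorded, which matches the behavior seen in the console output.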
Resolution suggestion
If originalCause is not null, fail the DurableTaskStep (such as sh) with that cause. Otherwise, fall back to the original logic: succeed if the exit code is 0 or if returnStatus was requested.
Suggested implementation
private void handleExit(int exitCode, OutputSupplier output) throws IOException, InterruptedException {
    Throwable originalCause = causeOfStoppage;
    // Only report success when no stop was requested; an exit code of 0 must not mask an abort.
    if (originalCause == null && (returnStatus || exitCode == 0)) {
        getContext().onSuccess(returnStatus ? exitCode : returnStdout ? new String(output.produce(), StandardCharsets.UTF_8) : null);
    } else {
        if (returnStdout) {
            _listener().getLogger().write(output.produce()); // diagnostic
        }
        if (originalCause != null) {
            // JENKINS-28822: Use the previous cause instead of throwing a new AbortException
            _listener().getLogger().println("script returned exit code " + exitCode);
            getContext().onFailure(originalCause);
        } else {
            getContext().onFailure(new AbortException("script returned exit code " + exitCode));
        }
    }
    listener().getLogger().close();
}
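As a rough check of what the resolution suggestion would change, the sketch below compares the existing condition with one reading of the suggestion (never succeed once a stop cause is recorded, otherwise keep the original logic) over every combination of inputs. Everything here is hypothetical scaffolding: the class ExitDecisionDiff, its methods, and the boolean aborted standing in for originalCause != null are not part of the plugin, and exit code 1 simply represents any non-zero value.

```java
import java.util.ArrayList;
import java.util.List;

public class ExitDecisionDiff {

    // Existing condition, as quoted from DurableTaskStep.
    public static boolean current(boolean returnStatus, boolean aborted, int exitCode) {
        return (returnStatus && !aborted) || exitCode == 0;
    }

    // One reading of the suggestion: a recorded stop cause always wins.
    public static boolean proposed(boolean returnStatus, boolean aborted, int exitCode) {
        return !aborted && (returnStatus || exitCode == 0);
    }

    // Lists every input combination where the two conditions disagree.
    public static List<String> differences() {
        List<String> result = new ArrayList<>();
        boolean[] flags = {false, true};
        int[] exitCodes = {0, 1};
        for (boolean returnStatus : flags) {
            for (boolean aborted : flags) {
                for (int exitCode : exitCodes) {
                    boolean before = current(returnStatus, aborted, exitCode);
                    boolean after = proposed(returnStatus, aborted, exitCode);
                    if (before != after) {
                        result.add("returnStatus=" + returnStatus + " aborted=" + aborted
                                + " exitCode=" + exitCode + ": " + before + " -> " + after);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        differences().forEach(System.out::println);
    }
}
```

Under these assumptions the two conditions disagree only when a stop was recorded and the script still exited with 0, with or without returnStatus; every other combination keeps its current outcome.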