JENKINS-72633

originalCause ignored if sub-process completed successfully

      Steps to reproduce

      1. Create a pipeline job which calls sh with a script (I don't think it matters what's in there, as long as it succeeds), plus other steps after that (like another sh step)
      2. Queue a run
      3. Abort the run near the end of the script

      Expected behavior

      1. The run is recorded as ABORTED
      2. The run stops executing at the first sh step

      Actual behavior

      1. The run is recorded as ABORTED
      2. The remaining steps execute anyway.

      Example snippet of run's console output where this happened

      [2024-01-30T16:42:22.256Z] [Pipeline] sh
      [2024-01-30T16:42:22.364Z] Aborted by someUser
      [2024-01-30T16:42:22.368Z] Sending interrupt signal to process
      [2024-01-30T16:42:22.580Z] [Pipeline] sh
      

      Analysis

      Punchline

      If you time it just right, the run continues executing because the sub-process was successful, as if no interruption was requested!

      The details

      Thankfully, we had logging enabled for org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep on that instance (among other things) and were able to capture the following (slightly edited) lines:

      2024-01-30 16:42:22.240+0000 [id=62552]	FINE	o.j.p.w.s.d.DurableTaskStep$Execution$NewlineSafeTaskListener$1#close: calling close with nl=true
      2024-01-30 16:42:22.263+0000 [id=62938]	FINEST	o.j.p.w.s.d.DurableTaskStep$Execution#_listener: JENKINS-34021: DurableTaskStep.Execution.listener present in CpsStepContext[41:sh]:Owner[path/to/job/1217:path/to/job #1217]
      2024-01-30 16:42:22.263+0000 [id=62938]	FINE	o.j.p.w.s.d.DurableTaskStep$Execution#start: launching task against hudson.remoting.Channel@395408fe:remoteAgent using RemoteLauncher[hudson.remoting.Channel@395408fe:remoteAgent]
      2024-01-30 16:42:22.277+0000 [id=62938]	FINE	o.j.p.d.BourneShellScript#launchWithCookie: launching [/path/to/agent/caches/durable-task/durable_task_monitor_543.v262f6a_803410_linux_64, -controldir=/path/to/agent/workspace/path/to/job_tmp/durable-ec948452, -result=/path/to/agent/workspace/path/to/job_tmp/durable-ec948452/jenkins-result.txt, -log=/path/to/agent/workspace/path/to/job_tmp/durable-ec948452/jenkins-log.txt, -cookiename=JENKINS_SERVER_COOKIE, -cookieval=durable-32ae10b6354f5c7ed6687931b481399dd6f54c7f5ea557d68f98bc47c1a40b06, -script=/path/to/agent/workspace/path/to/job_tmp/durable-ec948452/script.sh, -output=/path/to/agent/workspace/path/to/job_tmp/durable-ec948452/output.txt, -daemon]
      2024-01-30 16:42:22.288+0000 [id=62938]	FINE	o.j.p.w.s.d.DurableTaskStep$Execution#start: launched task
      2024-01-30 16:42:22.367+0000 [id=62938]	FINER	o.j.p.w.s.d.DurableTaskStep$Execution#getWorkspace: remoteAgent seems to be online so using /path/to/agent/workspace/path/to/job
      2024-01-30 16:42:22.368+0000 [id=62938]	FINE	o.j.p.w.s.d.DurableTaskStep$Execution#stop: stopping process
      org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.interrupt(CpsFlowExecution.java:1210)
      	at org.jenkinsci.plugins.workflow.job.WorkflowRun$2.lambda$interrupt$0(WorkflowRun.java:397)
      	at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
      	at java.base/java.lang.Thread.run(Thread.java:840)
      2024-01-30 16:42:22.542+0000 [id=62832]	FINER	o.j.p.w.s.d.DurableTaskStep$Execution#getWorkspace: remoteAgent seems to be online so using /path/to/agent/workspace/path/to/job
      2024-01-30 16:42:22.548+0000 [id=62832]	FINER	o.j.p.d.FileMonitoringTask$FileMonitoringController#writeLog: remote transcoding charset: US-ASCII
      2024-01-30 16:42:22.548+0000 [id=62832]	FINER	o.j.p.d.FileMonitoringTask$FileMonitoringController#writeLog: remote transcoding charset: US-ASCII
      2024-01-30 16:42:22.554+0000 [id=62832]	FINE	o.j.p.d.BourneShellScript$ShellController#exitStatus: found exit code 0 in /path/to/agent/workspace/path/to/job_tmp/durable-ec948452
      2024-01-30 16:42:22.560+0000 [id=62832]	FINE	o.j.p.w.s.d.DurableTaskStep$Execution$NewlineSafeTaskListener$1#close: calling close with nl=true
      2024-01-30 16:42:22.587+0000 [id=63209]	FINEST	o.j.p.w.s.d.DurableTaskStep$Execution#_listener: JENKINS-34021: DurableTaskStep.Execution.listener present in CpsStepContext[42:sh]:Owner[path/to/job/1217:path/to/job #1217]
      2024-01-30 16:42:22.587+0000 [id=63209]	FINE	o.j.p.w.s.d.DurableTaskStep$Execution#start: launching task against hudson.remoting.Channel@395408fe:remoteAgent using RemoteLauncher[hudson.remoting.Channel@395408fe:remoteAgent]
      

      ...which allowed me to trace the behavior to the code.

      We can see that line 502 of src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java at 1317.v5337e0c1fe28 in jenkinsci/workflow-durable-task-step-plugin reads:

                      LOGGER.log(Level.FINE, "stopping process", cause);
      

      ...which only executes after line 498 of src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java at 1317.v5337e0c1fe28 in jenkinsci/workflow-durable-task-step-plugin has executed:

                  causeOfStoppage = cause;
      

      ...so we know that the "abort" was received and recorded. Unfortunately, the script that sh was executing completed successfully rather quickly and had time to write its result (0) to the result file.

      We know that line 309 of src/main/java/org/jenkinsci/plugins/durabletask/BourneShellScript.java at 543.v262f6a_803410 in jenkinsci/durable-task-plugin is called, because it reports found exit code 0 in the log:

                      LOGGER.log(Level.FINE, "found exit code {0} in {1}", new Object[] {status, controlDir});
      

      ...unfortunately, I think the defect is a bit later, on line 657 of src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java at 1317.v5337e0c1fe28 in jenkinsci/workflow-durable-task-step-plugin where we decide to call the step a success based on:

                  if ((returnStatus && originalCause == null) || exitCode == 0) {
      

      ...where we're ignoring originalCause (initialized from causeOfStoppage on the preceding line) just because the sub-process completed successfully.
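
      To make the effect of that condition concrete, here is a minimal standalone sketch that evaluates the same boolean expression outside of Jenkins. The class and method names are hypothetical, and a plain InterruptedException stands in for the FlowInterruptedException recorded by stop(); only the expression itself is taken from the plugin source quoted above.

```java
public class CurrentExitDecision {
    // Same expression as the condition quoted from DurableTaskStep.java;
    // the method name and signature are made up for this illustration.
    static boolean treatedAsSuccess(boolean returnStatus, Throwable originalCause, int exitCode) {
        return (returnStatus && originalCause == null) || exitCode == 0;
    }

    public static void main(String[] args) {
        // Stand-in for the recorded cause of stoppage (assumption: any non-null Throwable behaves the same here).
        Throwable abort = new InterruptedException("Aborted by someUser");

        // The race from this report: abort recorded, but the script already exited with 0.
        System.out.println(treatedAsSuccess(false, abort, 0));   // prints "true"

        // The usual abort: the script was killed and exited with a non-zero code.
        System.out.println(treatedAsSuccess(false, abort, 143)); // prints "false"
    }
}
```

      With an exit code of 0 the right-hand side of the || is true on its own, so the recorded cause never gets a say.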

      Resolution suggestion

      If originalCause is not null, fail the DurableTaskStep (such as sh); otherwise fall back to the original logic, i.e. don't fail if returnStatus was requested.

      Suggested implementation

              private void handleExit(int exitCode, OutputSupplier output) throws IOException, InterruptedException {
                  Throwable originalCause = causeOfStoppage;
                  if (originalCause == null && (returnStatus || exitCode == 0)) { // succeed only if no stop was requested
                      getContext().onSuccess(returnStatus ? exitCode : returnStdout ? new String(output.produce(), StandardCharsets.UTF_8) : null);
                  } else {
                      if (returnStdout) {
                          _listener().getLogger().write(output.produce()); // diagnostic
                      }
                      if (originalCause != null) {
                          // JENKINS-28822: Use the previous cause instead of throwing a new AbortException
                          _listener().getLogger().println("script returned exit code " + exitCode);
                          getContext().onFailure(originalCause);
                      } else {
                          getContext().onFailure(new AbortException("script returned exit code " + exitCode));
                      }
                  }
                  listener().getLogger().close();
              }
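
      As a sanity check on the resolution suggested above (fail whenever originalCause is set, otherwise keep the original logic), the following sketch enumerates the input combinations and lists where that rule would differ from the current condition. It is an illustration only: the class and method names are hypothetical, "interrupted" stands for originalCause != null, and exit code 1 stands for any non-zero exit code.

```java
import java.util.ArrayList;
import java.util.List;

public class ExitDecisionComparison {
    // Current condition from DurableTaskStep.handleExit, with originalCause reduced to a boolean.
    static boolean current(boolean returnStatus, boolean interrupted, int exitCode) {
        return (returnStatus && !interrupted) || exitCode == 0;
    }

    // Rule described in the resolution suggestion: an interruption always wins.
    static boolean proposed(boolean returnStatus, boolean interrupted, int exitCode) {
        return !interrupted && (returnStatus || exitCode == 0);
    }

    // Collects every input combination where the two rules disagree.
    static List<String> differences() {
        List<String> rows = new ArrayList<>();
        boolean[] flags = {false, true};
        int[] exitCodes = {0, 1};
        for (boolean returnStatus : flags) {
            for (boolean interrupted : flags) {
                for (int exitCode : exitCodes) {
                    if (current(returnStatus, interrupted, exitCode)
                            != proposed(returnStatus, interrupted, exitCode)) {
                        rows.add("returnStatus=" + returnStatus
                                + " interrupted=" + interrupted
                                + " exitCode=" + exitCode);
                    }
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        differences().forEach(System.out::println);
    }
}
```

      The two rules disagree only when an interruption was recorded and the exit code is 0, which is the case described in this report; every other combination keeps its current outcome.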
      

      Assignee: Unassigned
      Reporter: Olivier (oli)