Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-39072

timeout step should include more logging/diagnostics information

      While trying to diagnose the failure of the timeout step to terminate a process (see process tree, the Pipeline was awaiting completion of pid 1031) with jglick we determined that the timeout step doesn't include sufficient information in the logs of the Pipeline Thread Dump to be terrifically useful when things go awry.

      See the following chat log:

      16:39 < jglick> rtyler: so that was a normal termination via `SIGTERM` (143 = 128 + 15; cf. `kill -l`). Not sure why `timeout` did not do it. No Jenkins restart that I can see. Could try to add more status information to `timeout` indicating in virtual thread dump (a) whether it ever delivered a cancellation, (b) whether its scheduled task is still there.
      16:40 <@rtyler> jglick: should adding more verbiage to timeout be among the tickets I should file/
      16:40 < jglick> rtyler: that would be in `workflow-basic-steps-plugin`; `sh` is in `workflow-durable-task-step-plugin`
      16:41 < jglick> rtyler: verbiage in the log, but also information in virtual thread dump (currently it does not override `getStatus`)
      

      The more information the timeout step can include in both the Console Output and the Thread Dump, the better.

      Potentially related to JENKINS-34637

          [JENKINS-39072] timeout step should include more logging/diagnostics information

          Jesse Glick added a comment -

          As part of JENKINS-34637 I am adding a message to the log, but something in getStatus is also in order.

          And I suppose if canceling the body does not work for whatever reason, it should set a (say) 1m timer and then declare itself dead so the rest of the build will abort.

          Independently, it might be appropriate for DurableTaskStep.Execution.stop to forcibly end after a (say) 10s grace period.

          Jesse Glick added a comment - As part of JENKINS-34637 I am adding a message to the log, but something in getStatus is also in order. And I suppose if canceling the body does not work for whatever reason, it should set a (say) 1m timer and then declare itself dead so the rest of the build will abort. Independently, it might be appropriate for DurableTaskStep.Execution.stop to forcibly end after a (say) 10s grace period.

          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java
          http://jenkins-ci.org/commit/workflow-basic-steps-plugin/f36a5c382fbdfc7bc2b0d98430f34ace03375b40
          Log:
          JENKINS-39072 Print log messages about what is happening with timeouts.

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java http://jenkins-ci.org/commit/workflow-basic-steps-plugin/f36a5c382fbdfc7bc2b0d98430f34ace03375b40 Log: JENKINS-39072 Print log messages about what is happening with timeouts.

          Jesse Glick added a comment -

          if canceling the body does not work for whatever reason, it should set a (say) 1m timer and then declare itself dead so the rest of the build will abort.

          Unfortunately this does not seem to be possible in general. The fix of JENKINS-25504 means that calling getContext().onFailure(…) from a block-scoped StepExecution does not immediately terminate the block; it only sends another interruption signal to the innermost step, which is pointless in this case since the innermost step already ignored an earlier interrupt and will probably ignore this one too. Experimentally commenting out some part of that fix in CpsStepContext.scheduleNextRun does not help: we just get out-of-order nodes again, and anyway the body keeps running.

          In case the problem is just an unresponsive StepExecution.stop, I think we can kill it. There will be other cases that are not handled.

          Jesse Glick added a comment - if canceling the body does not work for whatever reason, it should set a (say) 1m timer and then declare itself dead so the rest of the build will abort. Unfortunately this does not seem to be possible in general. The fix of JENKINS-25504 means that calling getContext().onFailure(…) from a block-scoped StepExecution does not immediately terminate the block; it only sends another interruption signal to the innermost step, which is pointless in this case since the innermost step already ignored an earlier interrupt and will probably ignore this one too. Experimentally commenting out some part of that fix in CpsStepContext.scheduleNextRun does not help: we just get out-of-order nodes again, and anyway the body keeps running. In case the problem is just an unresponsive StepExecution.stop , I think we can kill it. There will be other cases that are not handled.

          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java
          src/test/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepTest.java
          http://jenkins-ci.org/commit/workflow-basic-steps-plugin/040ed79ce10f7fc7b1c93562d4211db7e470486f
          Log:
          JENKINS-39072 In case a timeout fails to kill its block normally, try forcibly killing the innermost step, akin to doTerm.
          Does not suffice to kill the body in all cases.

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java src/test/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepTest.java http://jenkins-ci.org/commit/workflow-basic-steps-plugin/040ed79ce10f7fc7b1c93562d4211db7e470486f Log: JENKINS-39072 In case a timeout fails to kill its block normally, try forcibly killing the innermost step, akin to doTerm. Does not suffice to kill the body in all cases.

          Code changed in jenkins
          User: Jesse Glick
          Path:
          src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java
          http://jenkins-ci.org/commit/workflow-basic-steps-plugin/83dfe040379f49806b6adc19b2bd11ad396ce1c2
          Log:
          [FIXED JENKINS-39072] Include status information for a running timeout step.

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java http://jenkins-ci.org/commit/workflow-basic-steps-plugin/83dfe040379f49806b6adc19b2bd11ad396ce1c2 Log: [FIXED JENKINS-39072] Include status information for a running timeout step.

            jglick Jesse Glick
            rtyler R. Tyler Croy
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: