Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-39072

timeout step should include more logging/diagnostics information

    XMLWordPrintable

Details

    Description

      While trying to diagnose the failure of the timeout step to terminate a process (see process tree, the Pipeline was awaiting completion of pid 1031) with jglick we determined that the timeout step doesn't include sufficient information in the logs of the Pipeline Thread Dump to be terrifically useful when things go awry.

      See the following chat log:

      16:39 < jglick> rtyler: so that was a normal termination via `SIGTERM` (143 = 128 + 15; cf. `kill -l`). Not sure why `timeout` did not do it. No Jenkins restart that I can see. Could try to add more status information to `timeout` indicating in virtual thread dump (a) whether it ever delivered a cancellation, (b) whether its scheduled task is still there.
      16:40 <@rtyler> jglick: should adding more verbiage to timeout be among the tickets I should file/
      16:40 < jglick> rtyler: that would be in `workflow-basic-steps-plugin`; `sh` is in `workflow-durable-task-step-plugin`
      16:41 < jglick> rtyler: verbiage in the log, but also information in virtual thread dump (currently it does not override `getStatus`)
      

      The more information the timeout step can include in both the Console Output and the Thread Dump, the better.

      Potentially related to JENKINS-34637

      Attachments

        Issue Links

          Activity

            jglick Jesse Glick added a comment -

            As part of JENKINS-34637 I am adding a message to the log, but something in getStatus is also in order.

            And I suppose if canceling the body does not work for whatever reason, it should set a (say) 1m timer and then declare itself dead so the rest of the build will abort.

            Independently, it might be appropriate for DurableTaskStep.Execution.stop to forcibly end after a (say) 10s grace period.

            jglick Jesse Glick added a comment - As part of JENKINS-34637 I am adding a message to the log, but something in getStatus is also in order. And I suppose if canceling the body does not work for whatever reason, it should set a (say) 1m timer and then declare itself dead so the rest of the build will abort. Independently, it might be appropriate for DurableTaskStep.Execution.stop to forcibly end after a (say) 10s grace period.

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java
            http://jenkins-ci.org/commit/workflow-basic-steps-plugin/f36a5c382fbdfc7bc2b0d98430f34ace03375b40
            Log:
            JENKINS-39072 Print log messages about what is happening with timeouts.

            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java http://jenkins-ci.org/commit/workflow-basic-steps-plugin/f36a5c382fbdfc7bc2b0d98430f34ace03375b40 Log: JENKINS-39072 Print log messages about what is happening with timeouts.
            jglick Jesse Glick added a comment -

            if canceling the body does not work for whatever reason, it should set a (say) 1m timer and then declare itself dead so the rest of the build will abort.

            Unfortunately this does not seem to be possible in general. The fix of JENKINS-25504 means that calling getContext().onFailure(…) from a block-scoped StepExecution does not immediately terminate the block; it only sends another interruption signal to the innermost step, which is pointless in this case since the innermost step already ignored an earlier interrupt and will probably ignore this one too. Experimentally commenting out some part of that fix in CpsStepContext.scheduleNextRun does not help: we just get out-of-order nodes again, and anyway the body keeps running.

            In case the problem is just an unresponsive StepExecution.stop, I think we can kill it. There will be other cases that are not handled.

            jglick Jesse Glick added a comment - if canceling the body does not work for whatever reason, it should set a (say) 1m timer and then declare itself dead so the rest of the build will abort. Unfortunately this does not seem to be possible in general. The fix of JENKINS-25504 means that calling getContext().onFailure(…) from a block-scoped StepExecution does not immediately terminate the block; it only sends another interruption signal to the innermost step, which is pointless in this case since the innermost step already ignored an earlier interrupt and will probably ignore this one too. Experimentally commenting out some part of that fix in CpsStepContext.scheduleNextRun does not help: we just get out-of-order nodes again, and anyway the body keeps running. In case the problem is just an unresponsive StepExecution.stop , I think we can kill it. There will be other cases that are not handled.

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java
            src/test/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepTest.java
            http://jenkins-ci.org/commit/workflow-basic-steps-plugin/040ed79ce10f7fc7b1c93562d4211db7e470486f
            Log:
            JENKINS-39072 In case a timeout fails to kill its block normally, try forcibly killing the innermost step, akin to doTerm.
            Does not suffice to kill the body in all cases.

            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java src/test/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepTest.java http://jenkins-ci.org/commit/workflow-basic-steps-plugin/040ed79ce10f7fc7b1c93562d4211db7e470486f Log: JENKINS-39072 In case a timeout fails to kill its block normally, try forcibly killing the innermost step, akin to doTerm. Does not suffice to kill the body in all cases.

            Code changed in jenkins
            User: Jesse Glick
            Path:
            src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java
            http://jenkins-ci.org/commit/workflow-basic-steps-plugin/83dfe040379f49806b6adc19b2bd11ad396ce1c2
            Log:
            [FIXED JENKINS-39072] Include status information for a running timeout step.

            scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Jesse Glick Path: src/main/java/org/jenkinsci/plugins/workflow/steps/TimeoutStepExecution.java http://jenkins-ci.org/commit/workflow-basic-steps-plugin/83dfe040379f49806b6adc19b2bd11ad396ce1c2 Log: [FIXED JENKINS-39072] Include status information for a running timeout step.

            People

              jglick Jesse Glick
              rtyler R. Tyler Croy
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: