Details
-
Bug
-
Status: Resolved (View Workflow)
-
Minor
-
Resolution: Fixed
-
None
-
workflow-durable-task-step 2.24
Description
When a workflow is manually aborted in Jenkins it fires a hudson.AbortException. This is the same thing that happens when a step fails. Thus, it's impossible to properly set the build status to ABORTED on a manual abort and to FAILED on a failed step because you can't programmatically tell the difference.
jglick’s recommended fix: see comment as of 2017-06-28
Attachments
Issue Links
- is duplicated by
-
JENKINS-41944 Unable to understand when AbortError is created by user`s action like pressing button Stop
-
- Resolved
-
-
JENKINS-34376 Cannot easily/safely catch exceptions without disrupting job abortion
-
- Resolved
-
-
JENKINS-37724 Record cause of interruption in DurableTaskStep
-
- Resolved
-
- relates to
-
JENKINS-51928 Timeout is caught by the most internal try-catch, but should be caught by the try-catch that surrounds the timeout step
-
- Resolved
-
- links to
Activity
What do you mean by manually stopping?
If you abort the build by clicking "Abort" when paused for input, a FlowInterrupedException is fired. However, if you click the Jenkins abort button, an abort exception with no cause is fired.
Started by user Joe Mifsud Running: Build Stage Entering stage Build Stage Proceeding Running: Allocate node : Start Running on master in /var/lib/jenkins/jobs/REDACTED/workspace Running: Allocate node : Body : Start Running: Git > git rev-parse --is-inside-work-tree # timeout=10 Fetching changes from the remote Git repository > git config remote.origin.url git@REDACTED Fetching upstream changes from git@REDACTED > git --version # timeout=10 using GIT_SSH to set credentials Credentials for git - created by Chef > git -c core.askpass=true fetch --tags --progress REDACTED +refs/heads/*:refs/remotes/origin/* > git rev-parse refs/remotes/origin/master^{commit} # timeout=10 > git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10 Checking out Revision REDACTED (refs/remotes/origin/master) > git config core.sparsecheckout # timeout=10 > git checkout -f REDACTED > git rev-list REDACTED # timeout=10 Running: Archive Artifacts Running: Allocate node : Body : End Running: Allocate node : End Running: Allocate node : Start Running on master in /var/lib/jenkins/jobs/REDACTED/workspace Running: Allocate node : Body : Start Aborted by Joe Mifsud Running: Write file to workspace Running: Shell Script [workspace] Running shell script + flock -w 600 /var/lock/sp_js_build_npm_install.lock -c npm install Terminated Running: Allocate node : Body : End Running: Allocate node : End Running: Print Message Received hudson.AbortException with message script returned exit code 143 <--- e.getClass().name Running: Print Message Received org.codehaus.groovy.runtime.NullObject with message script returned exit code 143 <--- e.getCause().getClass().name Running: Print Message Build failed due to class hudson.AbortException script returned exit code 143
Reopening to ensure that our definition of "manual abort" is the same. Was just able to reproduce by clicking the jenkins abort button (red x).
An AbortException is also fired when the job is paused for input and the Jenkins abort button is clicked.
If you abort inside a sh step, Jenkins sends SIGTERM to the process. That will normally cause it to exit with code 143 (128 + the SIGTERM signal code IIRC), which is treated as a failed script by sh, thus throwing an error (and causing the build to fail, if you do not catch it). The process could however be trapping the signal and behaving differently.
If the process does not respond to SIGTERM, pressing the abort button a second time forcibly aborts the build (FlowInterruptedException).
Canceling an input step uses FlowInterruptedException and sets the build status to aborted. This is unrelated to handling of durable tasks.
I am also confused by this handling:
In https://github.com/jenkinsci/workflow-durable-task-step-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/steps/durable_task/DurableTaskStep.java#L319 we send AbortException. If I surround a "sh" step with try/catch, I cannot tell if Abort was clicked or if the shell script failed. Is there an easy way to distinguish this?
On Windows (bat step) is even harder to determine the difference between:
1) Aborted build by a user
2) Aborted build by a timeout
Both return a hudson.AbortException, with an empty cause. An in both cases the exception says: "hudson.AbortException: script returned exit code -1". I would have expected a non-NULL cause in the case of a user aborted AbortException....or am I missing an easier way to determine the difference?
What I did in the meantime is the following "hackish" solution:
class DoneException extends Exception boolean wasAborted = false try { parallel { "main": { sh "myCommand" throw DoneException }, "abortMonitor": { try { sleep(10, java.util.concurrent.TimeUnit.YEARS) } catch (AbortException) { wasAborted = true } }, failFast: true } } catch (DoneException e) { // Do nothing. This is expected. } catch (Exception e) { if (wasAborted) { throw e // Effectively do nothing } else { // Handle your failure of the sh command here } }
This is the rough, untested sketch of how I worked around this limitation. There are other, less convoluted approaches if you are willing to grab into the internals via means of getContext. I decided to stick with the public API instead to spare myself some headaches.
Why this works (in theory):
The main insight for me was that the sleep command actually tells you if it was aborted (throws an AbortException) versus if it failed because of other reasons (throws a FlowInterruptedException). The "other reasons" are in this case a failure in a parallel branch as failFast is specified. This means that aborting a build will abort both the sh and the sleep step as they are outer leaves of the execution graph. If instead the sh step fails, it will cause the sleep inside the abortMonitor branch to fail as well. Why sleep was cancelled can be determined easily then by the type of exception that is thrown.
There is one caveat to this approach:
The "eternal" sleep has to be cancelled once the "main" branch was successful. This is done by throwing an exception of the custom type DoneException. This is abusing exceptions for non-exceptional control flow. Apart from being a bad pattern in general, it has the concrete downside of making the "bubble" for the sleep step in the workflow steps overview of the build appear red. I.e., you will have failing sleep steps in a successful build.
Lately, on a project I work with, we replaced try/catch/catch/... blocks that was checking exception "class" type and return code to guess if the build was aborted by Jenkins user.
Such approach has its flaws as you can not be sure about consistency of exit code if your pipeline depends on many `sh` step calls, unless those are all your processes (you control them) which I believe is not a very often case.
jmif - I believe you can get information you are after from api/json?depth=2 endpoint.
In order to improve how we do it, I recently switched to approach based on asking jenkins API for information and parsing it.
Here is a snippet that might become useful. It is not state of the art groovy programming but it works
I also left the part that shows how to get failed stage name from wfapi/describe API endpoint.
Hopefully someone will find my post useful.
def fetchBuildDetails() { url1 = env.BUILD_URL + 'wfapi/describe' url2 = env.BUILD_URL + 'api/json?depth=2' jsonResponse = sh(script: 'echo "{\\"wfapi\\":`curl --silent --netrc-file /home/ubuntu/.curl-netrc '+url1+'`,\\"api\\":`curl --silent --netrc-file /home/ubuntu/.curl-netrc '+url2+'`}"', returnStdout: true).trim() echo "Fetched build details:" echo prettyPrint(jsonResponse) return new JsonSlurperClassic().parseText(jsonResponse) } def getBuildDetails() { def buildDetails = fetchBuildDetails() buildDetails.shouldNotificationGetSentToInfrastructureTeam = false buildDetails.abortedByUser = false // checking if user aborted the build for(int i=0; i < buildDetails.api.actions.size(); i++) { if (buildDetails.api.actions[i]._class == 'jenkins.model.InterruptedBuildAction') { if (buildDetails.api.actions[i].causes[0]._class == 'jenkins.model.CauseOfInterruption$UserInterruption') { buildDetails.abortedByUser = true } } } // checking name of failed stage and deciding if in-team should get mentioned in slack message for(int i=0; i < buildDetails.wfapi.stages.size(); i++) { if (buildDetails.wfapi.stages[i].status == 'FAILED') { def failedStageName = buildDetails.wfapi.stages[i].name.toLowerCase() if ( failedStageName.indexOf("checkout") >= 0 || failedStageName.indexOf("build") >= 0 || failedStageName.indexOf("deploy") >= 0 ) { buildDetails.shouldNotificationGetSentToInfrastructureTeam = true } } } echo "Build details enchanced:" println prettyPrint(toJson(buildDetails)) return buildDetails }
Thanks!!!! this indeed workarounds the current limitation. I'll try your code in our pipeline
Probably if DurableTaskStep.Execution.stop is called, the exception should be saved, and if the process subsequently exits with a nonzero code (particularly 143, in the case of sh, though perhaps it does not matter), the status code should be printed and the original exception should be used as the failure rather than a fresh AbortException.
Not sure about returnStatus; would be a bigger behavioral change to make the step fail with the original FlowInterruptedException rather than returning 143 or whatever, but this is probably what a user would expect to happen.
I stumbled upon this issue while working on scripted pipeline jobs. My problem was how to reliably identify that pipeline job was aborted by user from job it self for correct email template selection. My workaround was to look for InterruptedBuildAction within running builds actions if it's present then build was aborted. I'm not part of Jenkins development team so I'm probably wrong but my guess is that internally Jenkins uses similar concept for aborted build identification.
Scripted pipeline code example to be placed in exception handling code
if (manager.build.getAction(InterruptedBuildAction.class) ){ currentBuild.result = 'ABORTED' }
Logical chain: hudson.model.AbstractBuild -> (hudson.model.Actionable) getAction if something is returned then build was aborted. If needed you can look more deeply in getAction result vai getCauses() method and afterwards vai getShortDescription() you even can get userid who aborted the build.
I guess this is also the reason for JENKINS-43339, isn't it?
This is IMO a big issue as it makes the post { aborted {} } section in declarative pipelines largely unusable. In fact, I've not seen a single occasion when aborted worked as expected - when manually aborting a build. Then the failure section is triggered instead.
Piecing together clues from comments here and in similar issues (such as JENKINS-41604), I've added this to our shared lib with pretty good results, so far.
This version has some debug output and logic which isn't needed for our purposes (such as distinguishing between user and system rejections for inputs), but I left that logic in for the purposes of this comment in case it is useful for others.
Pipeline
try { // work } catch(err) { (new ExecutionHelper()).fixBuildResult(error) // Don't hide the error from jenkins throw err } finally { // React to currentBuild.result }
ExcecutionHelper.groovy
import jenkins.model.CauseOfInterruption import jenkins.model.InterruptedBuildAction import org.jenkinsci.plugins.workflow.steps.FlowInterruptedException import org.jenkinsci.plugins.workflow.support.steps.input.Rejection def fixBuildResult(error) { def iba = currentBuild.rawBuild.getAction(InterruptedBuildAction.class) if (iba && iba.causes.size() && iba.causes.any{ it instanceof CauseOfInterruption.UserInterruption }) { // probably aborted the execution from outside pipeline echo "run cancelled" currentBuild.result = "ABORTED" } else if (error instanceof FlowInterruptedException && error.getCauses().any { it instanceof Rejection }) { // Dig out the rejection and see if it is associated to a user. def user = null error.getCauses().each { if (it instanceof Rejection) { user = it.user } } // System user probably means rejected because of timeout echo (user && "${user}" != "SYSTEM" ? "rejected by user: ${user}" : "rejected by... not user") currentBuild.result = "ABORTED" } else { currentBuild.result = "FAILURE" } }
I wonder whether my issue is related to this one:
I get a hudson.AbortException() sometimes if some parallel steps are waiting for some free node when a timeout aborts the build. I would expect to get a FlowInterruptedException, though.
I get Encountered exception: hudson.AbortException: Queue task was cancelled.
Thanks for the info. After updating to 2.21 I can confirm that the issue we faced is fixed.
I am working on this in https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/75. My main concern is whether fixing it now will break all of the workarounds the people are currently using, so I want to do a few tests with some of the workarounds posted here to get an idea of the impact.
There are 2 main differences that my change would cause:
- If a sh step is manually aborted, the exception thrown will be a FlowInterruptedException rather than an AbortException (and FlowInterruptedException will be thrown even if returnStatus is true). The script will exit with RESULT.ABORTED both before and after my change. The behavior of crussell52's workaround is unaffected by my change in this case.
- If a sh step is automatically aborted by a timeout step, the exception thrown will be a FlowInterruptedException rather than an AbortException (and FlowInterruptedException will be thrown even if returnStatus is true), and as a result the script will exit with Result.ABORTED instead of RESULT.FAILURE. crussell52's workaround does not handle this case, so the behavior would change in the same way that it would for people not using a workaround. This change also makes timeout}}s of {{sh steps consistent with other steps such as sleep. (ABORTED instead of FAILURE)
It looks like the workaround from kaihowl does not work, because FlowInterruptedException is thrown in the sleeping parallel branch when the pipeline is manually aborted, when the sh step is aborted by a timeout step, or when the sh step fails because of an issue in the script itself, so wasAborted is set to true in all cases and we always rethrow an AbortedException and never execute the block where the sh step's script failed.
Based on that info, I think it is ok to move forward with the change, but if anyone knows of other common workarounds that may be broken by this change, feel free leave a comment.
crussell52: I see that your workaround uses instanceof checks, but these are not permitted in pipeline sandbox. Using instanceof generates the below sort of error:
org.jenkinsci.plugins.scriptsecurity.sandbox.RejectedAccessException: Scripts not permitted to use method java.lang.Class isInstance java.lang.Object
I just submitted this PR that whitelists isinstance check: https://github.com/jenkinsci/script-security-plugin/pull/226
Fixed in Pipeline Nodes and Processes 2.24. See my comment here for an overview of the changes, but in short, externally aborted tasks should throw FlowInterruptedException, while tasks that fail in the script itself should throw AbortException.
Manually stopping a build should set the result to ABORTED. This uses FlowInterruptedException, not AbortException.