New Feature
Resolution: Fixed
Blue Ocean 1.4 - beta 2, Pipeline - December
Because some state is generated on each node during execution, resuming builds after Jenkins restarts or node reboots is sometimes just not feasible, and can result in infinite hangs in some cases. Also, providing durability results in extensive writes to disk that can bring performance crashing down.
It would be great to be able to specify that jobs do not resume upon interruption, but rather just fail. Ideally this would increase the robustness of the system, since when nodes restart they quickly pick up jobs that try to resume and then hang, rapidly exhausting all available executors.
Implementation notes:
- Requires a new OptionalJobProperty on the job, optionally a new BranchProperty in workflow-multibranch-plugin that echoes that same property
- Needs some way to signal to storage (workflow-support) and execution (workflow-cps) that the pipeline is running with resume OFF to hint that they can use faster nondurable execution.
- is blocked by
JENKINS-49961 NPE from CpsFlowExecution.saveOwner
- Resolved
- is duplicated by
JENKINS-37475 Add a configurable switch on job that causes CPS to silently not continue on non-serializable code
- Resolved
- is related to
JENKINS-28183 Hard killed job's stage blocks stage in following jobs
- Resolved
JENKINS-45917 [Jenkins v2.63] Build queue deadlocks
- Closed
JENKINS-47173 Offer a high-performance storage engine for pipeline at some cost to resumability
- Closed
JENKINS-47390 Allow durable task step to, uh, not be. Durable, that is.
- Open
- relates to
JENKINS-36013 Automatically abort ExecutorPickle rehydration from an ephemeral node
- Closed
JENKINS-49079 Copy job properties for new branch project from base/default branch
- Open
- links to
[JENKINS-33761] Ability to disable Pipeline durability and "resume" build.
Well, in my case I have 7 parallel tasks running heavy shell scripts (the scripts run cmake builds and ctest tests). Resuming the build after a restart starts the scripts from the beginning (which is basically the same as starting the whole job anew), and the worst part is that although 7 parallel branches are being executed, all executors are free, which makes it possible for Jenkins to trigger another build and hog the server's resources.
> Resuming build after restart starts the scripts from the beginning
Never heard of such a bug and cannot even imagine how it could occur. If you have steps to reproduce from scratch, please file separately.
Got same problem.
Resuming build at Fri Sep 30 16:18:26 CEST 2016 after Jenkins restart
Ready to run at Fri Sep 30 16:18:28 CEST 2016
and then it just sits there.
Thread #6
    at DSL.emailext(not yet scheduled)
    at standardBuild.call(/data/jenkins/workflow-libs/vars/standardBuild.groovy:101)
    at DSL.sshagent(Native Method)
    at standardBuild.call(/data/jenkins/workflow-libs/vars/standardBuild.groovy:28)
    at DSL.ws(Native Method)
    at standardBuild.call(/data/jenkins/workflow-libs/vars/standardBuild.groovy:27)
    at DSL.node(running on slave24-rhel)
    at standardBuild.call(/data/jenkins/workflow-libs/vars/standardBuild.groovy:25)
    at WorkflowScript.run(WorkflowScript:15)
seems to be stuck in emailext
I would also really appreciate the ability to disable resumption. We have a few builds where it doesn't make sense to resume them, so it'd be better to have it off completely.
> a few builds where it doesn't make sense to resume them
Because they intrinsically could not be resumed? Or you just do not really care about loss of a build or two?
There are a couple cases:
- We run builds on ephemeral EC2 agents. If Jenkins is restarted, the agents are often dead by the time Jenkins is back. Those builds just hang looking for the agent.
- We run deploys that require someone to monitor them. We'd prefer that they not be restarted automatically, that instead someone be there to start and watch them.
> We run builds on ephemeral EC2 agents. If Jenkins is restarted, the agents are often dead by the time Jenkins is back.
If true, that is a bug in the EC2 plugin. It is supposed to keep the agent connected for the entire duration of the build.
> they not be restarted automatically
Pipeline does not restart any build steps when Jenkins is restarted. It simply lets the existing process continue running and displaying output (or it might have ended on its own during the Jenkins restart).
I'd be happy with resume either working or there being an option to disable it.
> I've also never ever seen it work correctly...
Likewise but maybe it happens without my noticing
> Pipeline does not restart any build steps when Jenkins is restarted. It simply lets the existing process continue running and displaying output (or it might have ended on its own during the Jenkins restart).
In the cases where we see resume hanging there is no process running so maybe the problem is with finding the process again, or handling it not being there.
Should it handle bat() and sh()? Is the assumption that the slave keeps track of the process output, return value etc.? Will a job always resume to the same slave?
> If true, that is a bug in the EC2 plugin. It is supposed to keep the agent connected for the entire duration of the build.
Maybe the build finished before Jenkins came back? It doesn't really change that we'd prefer not to resume these builds.
> Pipeline does not restart any build steps when Jenkins is restarted. It simply lets the existing process continue running and displaying output (or it might have ended on its own during the Jenkins restart).
Oh, it was my understanding that it'd continue to the next step after that? Does it only complete the current step and stop?
> maybe the problem is with finding the process again, or handling it not being there
> Should it handle bat() and sh()?
Yes, this is the principal use case.
> Is the assumption that the slave keeps track of the process output, return value etc.?
> Will a job always resume to the same slave?
> Maybe the build finished before Jenkins came back?
The external process I suppose you mean. Should be fine, the sh/bat step should then simply print any final output, and exit according to the process’ exit code.
> it'd continue to the next step after that?
Ahh, well what we want is for it not to continue without someone being there to monitor it. Automatically continuing to the next step means that if the person previously monitoring it left after Jenkins went down, they wouldn't be there when it comes back. We'd like to not have it resume for those builds.
I think what you are looking for is an input step between stages. Whether Jenkins happened to be restarted in the middle of the build is not relevant to that.
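A minimal scripted Pipeline sketch of the input-step suggestion, for readers unfamiliar with it. The stage names and the `make` and `./deploy.sh` commands are hypothetical placeholders; only the placement of the standard `input` step between stages is the point.

```groovy
// Hypothetical stages; only the placement of `input` matters here.
node {
    stage('Build') {
        sh 'make'            // assumed build command
    }
}

// Pause for a human outside of any node block, so no executor is held
// while the build waits for approval.
stage('Approve') {
    input message: 'Proceed with deploy?'
}

node {
    stage('Deploy') {
        sh './deploy.sh'     // assumed deploy script
    }
}
```

Note that a gate like this prompts on every run, not only on runs that were interrupted by a restart.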
Not really. While the build is running normally, we don't want anyone to have to confirm proceeding to the next step.
It's only when the build is interrupted by a restart that we don't want it to automatically continue. Whenever Jenkins goes down, it's generally a larger failure with a variable amount of time before service is restored. We can't guarantee that the person watching it will be there when Jenkins comes back without making the normal case worse.
jglick, I am aware that resuming builds is one of the core features of Pipeline and that you would very much like it to work by default. However, in my experience most plugins do not properly support resuming a build (i.e. they have bugs, not that they deliberately do not support it). After restarting Jenkins (mostly due to plugin updates), I've seen jobs resuming without taking any executors; jobs taking an executor and just waiting for something to happen (which never happens, i.e. the wait is infinite); jobs which repeat work that was already done after being resumed; plugins failing to parse XML test reports after being "resumed" in the middle of the process; jobs that were restarted while performing network operations inside a shell script, which were frozen after resume and could not be stopped in any way (neither with stop nor with kill - we had to manually remove the job from the database while Jenkins was offline to stop this job from resuming), ...
Please be aware that Jenkins is also used by software developers who do not develop in Java (which is natively supported by Jenkins) and that we do some very weird things in our build scripts to support the behaviour and flexibility we need. For example, in my case I need to clone a repository on an SSD, while one of its submodules must be cloned on a rotational disk. Such a use case will never be supported by the default Jenkins SCM plugins, so I must write my own build script which does that, and improper/buggy resuming of such a script usually makes the executor wait indefinitely for something which never happens, so I must manually kill the build (yes, kill, because stop is also ignored).
To properly support such use cases, there are two ways: either add support in every plugin for every use case (no matter how weird it is) and make it work correctly in all cases, including bug-free resuming of builds on a multi-node heterogeneous system, which is nearly impossible; or simply add a checkbox saying "disable resuming of build", which will either prevent Jenkins from being restarted while a build is ongoing (the behaviour of freestyle jobs) or simply fail the build. Yes, failing the build is not technically correct, but it is exactly what is currently happening for us, except that after resuming, an engineer needs to log into Jenkins and manually kill the zombie build, which only waits and never properly resumes.
> jobs that were restarted during performing network operations inside shell script which were frozen after resume and could not be stopped in any way
If this is reproducible somehow I would consider it a high-priority bug to be fixed. Ditto for the other scenarios, unless they are limited to usage of some particular plugin.
> simply add this simple checkbox
Adding a checkbox is of course simple. Making it do what you request is not necessarily simple and would require significant study. CpsThreadGroup.saveProgram can easily be suppressed (possibly the CPS transform could even be disabled), and WorkflowRun.onLoad could easily be made to fail a build which had not terminated cleanly, and CpsFlowExecution.blocksRestart could be made to unconditionally return true. But then you will still get unreleased workspace locks and the like. Possibly there is some way a Terminator could throw ThreadDeath into the Groovy call stack to try to unwind blocks cleanly.
jglick, unfortunately the bug is not deterministic - usually after a restart jobs just hang, but can be killed (with kill of course; stop rarely works).
The shell script that hung in such a way that the job could not be killed at all after resume is this:
sh "curl -s -X POST https://bitbucket.org/site/oauth2/access_token -u \"${getBitbucketOAuthKey()}:${getBitbucketOAuthSecret()}\" -d grant_type=client_credentials | jsawk 'return this.access_token' | tr -d \"\\n\" > accessToken.txt"
Script obtains access token for BitBucket API so it can be used later for notifying commit statuses and approving pull request - something that bitbucket branch source plugin does not support (later they added support for that, but it is not configurable to give flexibility we need). I cannot guarantee you that this will trigger the bug, since this shell script is executed for every build we have and only 3 jobs (out of dozens daily) have been executing this script at the time of jenkins restart, which caused them to lock in a way that even kill didn't work.
However, it would be better to just fix resuming of jobs. Since the original bug report, I have seen many improvements in this area (unix shell scripts now rarely hang after resume, but windows batch scripts almost always do). As I said, the most problematic are windows batch scripts running cmake-based builds of visual studio c++ projects (cmake is used to create the visual studio solution and then 'cmake --build . --config Release .' is used to invoke the MSBuild builder to build the project). When a restart is triggered (on the master node, which is a linux box) while this build is executing on a windows slave, first the batch script is terminated (I guess with some kind of interrupt signal), which causes MSVC to report the build as failed (MSVC reports cancelled builds as failures). After the restart the batch script is resumed, but the job does none of the things one might expect: it does not start a new MSBuild build, it does not re-run the entire batch script (which should build the project correctly), and it does not continue with the next batch script (which collects test results and stashes them so the master node can later use the XUnit publisher plugin to publish them). Instead the job simply hangs and does nothing indefinitely (until someone logs in and kills it, because the stop command is also ignored).
Hello, I have also recently come across the bug of jobs not restarting. I can provide a test case to help with investigation; three jobs are required:
Job 1 will trigger job_40_sec and job_50_sec in parallel
If jenkins restarts or is killed when job_40_sec and job_50_sec are both running, then, when Jenkins comes back online only one of the jobs is restarted whilst the other hangs indefinitely
Please let me know if you need any more information or if this is the wrong place for this information
Pipeline scripts:
Job 1:
Map parallel_jobs = [
    'branch_1': { build job: 'job_50_sec' },
    'branch_2': { build job: 'job_40_sec' }
]
parallel parallel_jobs
job_40_sec:
node { sleep(40) }
job_50_sec:
node { sleep(50) }
WRT JENKINS-41916, It would be good if the Resume build option is disabled as it doesn't respect Security.
> I have recently also come across the bug of jobs not restarting
This was filed separately and a fix released.
We use the jenkins-kubernetes plugin and it also does not resume as expected.
Even if it did, we strive to have jobs take less than a couple of minutes, so we don't care if jobs don't resume after a restart. Having the ability to choose between "resume job", "restart job" or "discard job" would be a nice feature. We'd probably use the "restart" functionality, although I can also see use-cases for discarding the job in its entirety.
We should be able to disable automatic resumption of jobs on Jenkins; it causes lots of jobs to hang for hours before we have to kill the builds manually. Also, sometimes we don't get any notification about a job being stuck for days because it could not resume.
Definitely need a resume disable capability. I'd love a global option, but a per job option is also not a crime. There are times when you know your workspace and what your job is doing will not be viable for resumption even if Jenkins can, in theory resume your build. The resumes will just fail or worse lead to red-herring issues. Yes, people should create stateless build steps with always clean thinking but that's just not reality at scale. Having a disable resume option would be handy.
Second, if you are working in a situation where your workspace or build nodes are ephemeral, resume literally breaks Jenkins. The resume feature locks to the precise name of the compute/build slave. In an ephemeral setup (think mesos/kubernetes/docker plugins, or jclouds and dynamically named slaves), that build slave no longer exists when Jenkins is restarted. The job sits in the queue waiting for the slave to come online, but it never does. Because its resume uses the slave name, not the "label" tied to the actual job, being in the queue never triggers the dynamic provisioner (jclouds) to create a new slave. And the job hangs indefinitely.
We see evidence that these resume states can also cause thread locking on the build queue itself, which then prevents any jobs from queuing at all. We have to go through quite an arduous process to manually clean the Jenkins file system to prevent builds from requeuing.
If you won't provide a disable-resume feature, then at least tell us the logic for how Jenkins decides which jobs need to be resumed, so we can properly clean up the markers Jenkins looks for on the file system to tell it to requeue.
It seems to be a combination of jenkins home xml files as well as some file state inside the Jobs folder, (jobs/JobName/builds/#/workflow/....). But I don't know exactly what.
Another option could be to wrap "node" blocks with "Resume from here" rather than resume from inside a node block. I suppose we could try putting the "node" call inside a function marked with @NonCPS but that seems extreme and may have unexpected results and I doubt I could get all my users to follow that convention anyway.
Please stop adding +1 comments. You may use the voting feature in JIRA.
> build nodes are ephemeral
Already tracked as JENKINS-36013.
maxfields2000 The resume issue with missing build agents was fixed as of workflow-durable-task-step 2.15 – patched a week or so ago. The request for NON-durable pipeline came up at Jenkins World too, although with a different motivation (performance).
I think this ties into work happening now on how we store data for pipelines and their logs that I'm launching into shortly (larger project though, will take a while to land).
As Jesse says, the failure to resume is generally a specific bug of some sort and needs to be addressed
I'm attaching this to the storage epic because what I have in mind will also let you use this "unsafe" mode for gigantic performance gains.
In such a mode you might as well also switch DurableTaskStep to be, well, nondurable, and thus to just use Launcher synchronously to run processes. Would probably require some API massaging. Would lose durability across transient agent disconnections as well as Jenkins restarts—i.e., same as traditional builds.
jglick Thanks for mentioning that. To capture an out-of-band discussion, this has value to improve reliability. I've forked that off into JENKINS-47390 because it is a nice work item on its own with clearly defined boundaries, and to avoid expanding the scope of this item too much.
I for one would very much like the option to disable durability in pipelines. I prefer to fail early and fail fast in our environment. Pipelines with the ability to resume are fantastic in many situations but resuming in our environment continues to cause issues and headaches. At this point I am looking for ways to stop using pipelines yet have the ability to dynamically choose nodes (the only reason we use pipelines at this point is for the ability to programmatically choose nodes).
It is far better for our automation to fail and retry again from the start or generate a report, therefore I heartily support the notion of making durability an option in pipelines rather than a requirement.
sbeckwithiii I think you'll be happy with this when it lands - I'm hoping within the next few weeks, but take that with a grain of salt because it's part of a larger effort that brings a lot of useful features to permit pipeline to run faster and reduce the load it puts on masters.
I'm sorry you've had so many issues with resume though – if you wouldn't mind, could you tell us what problems you've had? We've done a fair bit of work recently to resolve resume-time issues so it's quite likely some of these have been fixed, and if not we'd like to ensure that is robust. Thanks!
My previous comment could be taken as a complaint or a "I expect this and deserve this" which was not my intention. We are very pleased with Jenkins as a whole. I'm surprised by how much we're able to do with it.
svanoort, thank you for getting back. We have tasted some of the issues listed above including the zombie process that just would not die, as well as plugins that could use some improvement resulting in interesting indefinite wait situations like that above mentioned kubernetes plugin. My words were not meant to convey that the resumability feature of pipelines is useless or that y'all have been wasting time, but rather that I agree with others here that not every team or circumstance benefits from the resume feature and in some situations is actually hindered by it.
If it is useful, I can clarify why I said, "the only reason we use pipelines at this point". Why wouldn't we joyfully jump on board with pipelines and be all in? It doesn't align with the framework we've built nor does it fit the direction we've been going (wrap up automation into nice, neat little "modules" or function calls that do not require the team to learn Jenkins, Job DSL, Pipelines or Groovy), however dynamically choosing nodes based on user input is a massive boon to us in some circumstances. Where we use pipelines we must do more work to accomplish our goals and lose almost all the features of our framework but that's a choice we make. The loss is far greater to us whenever we run into the above mentioned issues with pipelines.
Not sure where to put this but it is worth noting that we use scripted pipelines, and do so out of necessity and design even though declarative has desirable niceties like post build steps. I didn't realize how nice it was for Jenkins to determine if a build was unstable for me until I had to write the code myself.
sbeckwithiii No worries, no offense taken at what you said – we know that pipeline isn't perfect and just want to improve it over time.
> nor does it fit the direction we've been going (wrap up automation into nice, neat little "modules" or function calls that do not require the team to learn Jenkins, Job DSL, Pipelines or Groovy),
This is what Pipeline Shared Libraries are intended to offer – in ci.jenkins.io, building a plugin is as simple as a JenkinsFile containing nothing but "buildPlugin()" in the repo. But I'm guessing you've invested in building a framework around specific business needs and moving over to pipeline represents a loss of that invested effort + something not as closely aligned to your specific needs?
> We have tasted some of the issues listed above including the zombie process that just would not die
Three key causes of this are resolved in the most recent round of pipeline plugin updates (specifically: waiting for a throwaway node that will never reappear, waiting for a disconnected node to respond, and issues with stop operations on steps).
So, you might find that an update to the plugins resolves the issue (if not, we'd really love to see an issue filed for it so we can put it to rest for good, because that represents a clear bug).
But anyway, even aside from specific bugs, I think there's a recognition that automatic resume just plain may not make sense for every case... and softening that requirement for pipelines opens up a ton of opportunities.
You are very encouraging, svanoort.
I'm looking at the following change in an effort to work around the issue we currently are facing. Thank you.
2.14 (Aug 23, 2017)
JENKINS-36013- Prevent Jenkins from spinning indefinitely trying to resume a build where the Agent is an EphemeralNode and will never come back
Also covers cases where the node was removed by RetentionPolicy because it is destroyed, by aborting after a timeout (5 minutes by default)
This ONLY happens if the Node is removed, not for simply disconnected nodes, and only is triggered upon restart of the master
Added System property 'org.jenkinsci.plugins.workflow.support.pickles.ExecutorPickle.timeOutForNodeMillis' for how long to wait before aborting builds
I would also like to see the ability to restart a Jenkins master without it restarting or resuming any pipeline builds. Before restarting Jenkins to change a startup parameter, I verified my Jenkins master server was idle with no jobs running on master or slave executors. After the restart I saw the following errors in the log file, and an attempt was made to resume builds 38, 107, and 108, all of which are weeks old. It appears one of these builds was originally hung on "java.lang.InterruptedException" and the other two were hung on "org.jenkinsci.plugins.workflow.steps.FlowInterruptedException". The information used in these builds is obsolete and I would prefer that a Jenkins master restart not resume any builds. These builds did not resume properly and I had to force stop them anyway.
Oct 31, 2017 10:17:51 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[Owner[JOBNAME/108:JOBNAME #108]]
java.util.EmptyStackException
    at java.util.Stack.peek(Stack.java:102)
    at java.util.Stack.pop(Stack.java:84)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onProgramEnd(CpsFlowExecution.java:1026)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:350)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$100(CpsThreadGroup.java:82)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:242)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:230)
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Oct 31, 2017 10:17:55 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[Owner[JOBNAME/107:JOBNAME #107]]
java.util.EmptyStackException
    at java.util.Stack.peek(Stack.java:102)
    at java.util.Stack.pop(Stack.java:84)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onProgramEnd(CpsFlowExecution.java:1026)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:350)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$100(CpsThreadGroup.java:82)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:242)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:230)
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Oct 31, 2017 10:18:02 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[Owner[JOBNAME/38:JOBNAME #38]]
java.util.EmptyStackException
    at java.util.Stack.peek(Stack.java:102)
    at java.util.Stack.pop(Stack.java:84)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onProgramEnd(CpsFlowExecution.java:1026)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:350)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$100(CpsThreadGroup.java:82)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:242)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:230)
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
mkozell Have you tried the betas for the pipeline plugins that are currently in the experimental update center? I'm fairly sure I fixed an error of this category when hardening the work in workflow-cps – this is also the same beta that provides the ability to prevent individual flows from resuming. With a little work in the script console it should be possible to write a quick script to invoke that on all currently running builds.
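A rough sketch of what such a script console snippet might look like. It is an assumption, not something posted in this thread: it presumes plugin versions in which the flow execution implements the `BlockableResume` interface (with `setResumeBlocked`), and it should be tried on a test instance first.

```groovy
// Script console sketch (assumption: workflow-api / workflow-cps versions
// that ship BlockableResume; not verified against any specific release).
import jenkins.model.Jenkins
import org.jenkinsci.plugins.workflow.flow.BlockableResume
import org.jenkinsci.plugins.workflow.job.WorkflowJob

Jenkins.instance.getAllItems(WorkflowJob).each { job ->
    // Only look at builds that are still running.
    job.builds.findAll { it.isBuilding() }.each { run ->
        def execution = run.execution   // may be null if it failed to load
        if (execution instanceof BlockableResume) {
            execution.setResumeBlocked(true)
            println "Resume blocked for ${run.fullDisplayName}"
        }
    }
}
```

This only marks builds that are running at the time the script is executed; new builds would still need the job-level setting.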
Code changed in jenkins
User: Sam Van Oort
Merge pull request #75 from svanoort/disable-pipeline-resume-JENKINS-33761
Provide job property for durability hints & add ability to disable pipeline resume JENKINS-33761
Compare: https://github.com/jenkinsci/workflow-job-plugin/compare/2dfc94ac80bc...5d3b91a68514
Released with... uh, well take a look at the Jenkins Pipeline Handbook entry on scaling pipeline for versions.
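For orientation, a sketch of how the job property is typically set from a Jenkinsfile, assuming plugin versions that include this change. The `make` command is a placeholder; `disableResume` and `durabilityHint` are the symbols described in the Pipeline documentation, but check the handbook entry for the versions that provide them.

```groovy
// Scripted Pipeline: set the job properties at the top of the Jenkinsfile.
properties([
    disableResume(),                          // fail instead of resuming after a restart
    durabilityHint('PERFORMANCE_OPTIMIZED')   // fewer disk writes, less durability
])

node {
    sh 'make'   // placeholder build step
}
```

In a Declarative Pipeline the equivalent setting goes in the `options` block:

```groovy
pipeline {
    agent any
    options {
        disableResume()
    }
    stages {
        stage('Build') {
            steps {
                sh 'make'   // placeholder build step
            }
        }
    }
}
```

For multibranch projects the property applies to the branch job once that branch has been built with the updated Jenkinsfile.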
After upgrading Jenkins with the following, I was not able to reproduce the issue after a build timeout, cancelling a build, and restarting Jenkins in the middle of a build.
Jenkins 2.89.4
Pipeline 2.5
Pipeline API 2.26
Pipeline Nodes and Processes 2.19
Pipeline Step API 2.14
Scripts Security 1.41
Groovy Sandbox = disabled
Java = 1.8.0_162
Although my jobs correctly didn't resume after Jenkins restart, I did see the message below in the build logs.
Resuming build at Sat Feb 24 06:38:10 UTC 2018 after Jenkins restart
[Pipeline] End of Pipeline
java.io.IOException: Cannot resume build – was not cleanly saved when Jenkins shut down.
I am not sure this is related to this issue, but in our pipeline build job we recently added the disableResume step and it does not seem to work correctly:
Jenkins 2.89.3
Pipeline 2.5
Pipeline API 2.27
Pipeline Nodes and Processes 2.20
Pipeline Step API 2.16
Scripts Security 1.44
Groovy Sandbox = disabled
Creating placeholder flownodes because failed loading originals.
Resuming build at Thu Aug 30 12:42:45 UTC 2018 after Jenkins restart
[Bitbucket] Notifying pull request build result
[Bitbucket] Build result notified
[lockable-resources] released lock on [UNIT_TEST_RESOURCE_3]
java.io.IOException: Tried to load head FlowNodes for execution Owner[Products.Pipeline/PR-5615/7:Products.Pipeline/PR-5615 #7] but FlowNode was not found in storage for head id:FlowNodeId 1:586
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.initializeStorage(CpsFlowExecution.java:678)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:715)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:875)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:745)
    at hudson.model.RunMap.retrieve(RunMap.java:225)
    at hudson.model.RunMap.retrieve(RunMap.java:57)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:500)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:482)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:380)
    at hudson.model.RunMap.getById(RunMap.java:205)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.run(WorkflowRun.java:1098)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:1109)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:65)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:57)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:178)
    at jenkins.model.Jenkins.<init>(Jenkins.java:974)
    at hudson.model.Hudson.<init>(Hudson.java:86)
    at hudson.model.Hudson.<init>(Hudson.java:82)
    at hudson.WebAppMain$3.run(WebAppMain.java:233)
Finished: SUCCESS
This is an issue for us as the build was marked as SUCCESS in bitbucket, which allowed a user to merge a failing test into our release branch.
The job was definitely running with resume disabled, as this was printed at start of job:
Resume disabled by user, switching to high-performance, low-durability mode.
Any ideas?
We have seen the same thing. Resume definitely disabled and still causing hangs.
Thread dump from a node this is happening on attached.
Jenkins 2.107.3
Pipeline 2.5
Pipeline API 2.27
Pipeline Nodes and Processes 2.19
Pipeline Step API 2.15
Oddly, having killed the job from "Build Executor Status" the node is freed up but the job seems to still think it is running:
[Pipeline] {
Creating placeholder flownodes because failed loading originals.
Resuming build at Thu Sep 20 11:47:46 BST 2018 after Jenkins restart
[Pipeline] End of Pipeline
Finished: FAILURE
<spinning indicator>
The next thing that this would be doing would be retry { checkout git ... }
I've just experienced the "Creating placeholder flownodes because failed loading originals." error with this stack trace on a Jenkins system running workflow-job 2.25 and workflow-cps 2.64:
java.io.IOException: Tried to load head FlowNodes for execution Owner[Redacted/dev/3:Redacted/dev #3] but FlowNode was not found in storage for head id:FlowNodeId 1:59
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.initializeStorage(CpsFlowExecution.java:678)
    at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:715)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:875)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:745)
    at hudson.model.RunMap.retrieve(RunMap.java:225)
    at hudson.model.RunMap.retrieve(RunMap.java:57)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:501)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:483)
    at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:381)
    at hudson.model.RunMap.getById(RunMap.java:205)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.run(WorkflowRun.java:1112)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$Owner.get(WorkflowRun.java:1123)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:65)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$1.computeNext(FlowExecutionList.java:57)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
    at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:178)
    at jenkins.model.Jenkins.<init>(Jenkins.java:989)
    at hudson.model.Hudson.<init>(Hudson.java:85)
    at hudson.model.Hudson.<init>(Hudson.java:81)
    at hudson.WebAppMain$3.run(WebAppMain.java:233)
Finished: FAILURE
Restarting the job manually appears to have resolved it, but is there additional information I can provide to troubleshoot what might have caused this? Or is this a different issue than what's discussed here?
Edit: I should add that workflow-cps was upgraded from 2.63 to 2.64 between the last successful job and the one that failed with the stack trace above. Workflow-job was not changed.
If you can describe how to reproduce from scratch, I will try to fix it. Surviving Jenkins restarts is a key feature of Pipeline.