Type: Bug
Priority: Minor
Resolution: Fixed
Environment: all-latest
I've been using an infinite Pipeline for a demo. The job was running for a couple of hours, and then the Jenkins master failed to load the build due to a StackOverflowError.
node {
    int i = 0
    while (true) {
        sh "echo 'Hello, world ${i}!'"
        sh "sleep 5"
        i = i + 1
    }
}
Log: See attachment
Attachments:
- InfinitePipeline.zip (1.45 MB)
- log.txt (7.50 MB)
- workflow-job.hpi (109 kB)
Issue links:
- relates to: JENKINS-50350 JEP-200: ConversionException: Refusing to unmarshal logPrefixCache (Closed)
- links to
[JENKINS-38383] StackOverflow when loading a large pipeline job
Downgrading priority because it should only happen if you have to search many nodes to find one with a log.
That said, there's some weirdness with the logic here.
I get a similar symptom on startup with Jenkins version 2.102. The stacktrace is now:
[...]
com.google.common.util.concurrent.ExecutionError: java.lang.StackOverflowError
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2232)
    at com.google.common.cache.LocalCache.get(LocalCache.java:3965)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969)
    at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829)
    at com.google.common.cache.LocalCache$LocalManualCache.getUnchecked(LocalCache.java:4834)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.getLogPrefix(WorkflowRun.java:556)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$900(WorkflowRun.java:134)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$6.load(WorkflowRun.java:547)
    at org.jenkinsci.plugins.workflow.job.WorkflowRun$6.load(WorkflowRun.java:537)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
    ... 222 more
Caused by: com.google.common.util.concurrent.ExecutionError
[...]
Jenkins is then unresponsive for some minutes, and it also seems like jobs that were resumed while in the loop are sometimes stuck after restart.
Is this because I have Pipeline jobs that use while loops?
Would be resolved by JENKINS-38381 (log handling rewrite). This is the third bug I've seen tied to the logPrefix cache and similar logic.
This officially means it's time for at least a fast-patch solution to tide us over until the comprehensive solution lands with the rewrite (my next Big Work Item).
thxmasj the issue is that your pipeline is running a very large number of steps (several thousand, generally) and under the wrong circumstances this particular operation turns into a rather nasty recursive call.
I'm going to take this on for a fix in the next week or so.
svanoort: If I understand you correctly, JENKINS-38381 will resolve my issue, and you aim to fix it in the near future? I am still considering rewriting my pipelines to avoid while loops (which result in thousands of steps), as it seems the Jenkins code is not designed for this. Btw, I'm using the loops for polling Jira and Git. My rewrite would move the loop inside a shell step.
thxmasj I'd definitely suggest using a sh loop or timeout/retry pipeline loops instead (with fewer loop cycles involved). There are negative implications to using a ton of Pipeline steps in general – the Project Cheetah work I launched a little while ago makes it less harmful if you're using the new performance-optimized mode, but it's still not free. In general, if your Pipeline is over 1000 FlowNodes (500-700+ steps), that's a sign to reconsider the approach.
I aim to fix this with JENKINS-38381, but that isn't trivial, so I'm not comfortable giving an explicit date yet and will look to make tactical fixes initially.
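For example, the polling loop from the original report can be collapsed into a single sh step, so only one step (and a fixed handful of flow nodes) is recorded no matter how many times the shell iterates. A minimal sketch; the Jira URL, the grep condition, and the 30-second interval are placeholders, not a recommendation:

node {
    // One sh step = a fixed number of flow nodes, regardless of how many loop iterations run.
    // Triple single quotes keep ${...} out of Groovy interpolation; $i is a shell variable here.
    sh '''
        i=0
        until curl -sf https://jira.example.com/rest/api/2/issue/ABC-123 | grep -q '"name":"Done"'; do
            i=$((i + 1))
            echo "Polling attempt $i, not done yet"
            sleep 30
        done
    '''
}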
svanoort: I already tested a retry step and a retry option on the stages where I poll Jira, but this also seems to produce tons of pipeline steps (I can see a lot of XML files under the workflow directory for the build).
Unfortunately I found that my strategy with sh loops is a dead end, as the sh step requires an agent (with node, workspace and executor). As we might have tens or even hundreds of concurrent builds sitting in a Jira-polling stage for several days, those stages should not use agents.
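For comparison, a handful of steps can wait without any node block at all – sleep and waitUntil run on the controller and hold no executor – though every retry of the waitUntil body still records flow nodes, so this sidesteps the agent problem rather than the step-count problem. A rough sketch; checkJiraStatus is a hypothetical placeholder for the real readiness check:

// Hypothetical placeholder for the real check (e.g. a Jira REST call made by some plugin step).
def checkJiraStatus = { -> currentBuild.number > 0 }

// No node block is held open while waiting, so no agent, workspace or executor is consumed.
// Each retry of the waitUntil body still adds flow nodes to the build, however.
waitUntil {
    sleep time: 30, unit: 'SECONDS'   // pause between checks
    return checkJiraStatus()
}
echo 'Condition met, continuing'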
I had a look at the Cheetah documentation you linked to, and it looks really interesting. I think it might solve my problem IF the durability option could be set at a per-stage level – then I could make my Jira-polling stages PERFORMANCE_OPTIMIZED, which I assume would avoid the StackOverflow problem for while/retry loops. I will test the option set at pipeline level now (although we need durability for some of the stages).
Btw I noticed the line "You can force a Pipeline to persist data by pausing it." on the Scaling pipelines page. Can the pipeline be paused/resumed from within itself (using steps?), sort of like a transaction commit?
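For reference, the pipeline-level durability setting mentioned above can also be set from the Jenkinsfile itself, assuming a workflow-job plugin new enough to expose the durabilityHint job property (a sketch; per-stage durability is not something this property provides):

// Applies to the whole pipeline, not per stage; assumes the durabilityHint job property is available.
properties([
    durabilityHint('PERFORMANCE_OPTIMIZED')
])

node {
    echo 'This build runs with reduced persistence of intermediate state.'
}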
Resuming a build after restart with a PERFORMANCE_OPTIMIZED pipeline with 6000+ flow nodes did not give me a hang, but rather an NPE and a failing build:
[...]
18:00:36 Iteration 6188
[Pipeline] echo
Resuming build at Tue Feb 27 18:01:17 GMT 2018 after Jenkins restart
[Pipeline] End of Pipeline
java.lang.NullPointerException
    at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:43)
    at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:331)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:82)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:243)
    at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:231)
    at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
    at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Finished: FAILURE
Solution: removed the explicit recursion in favor of an iteration-based approach, plus changes to caching.
thxmasj I'm not sure exactly what's triggering an NPE there (it looks like it somehow got caught in an invalid state) – I tried to reproduce it but can't.
But I've got a fix that's verified to remove the StackOverflowError and allow a build to cleanly resume in https://github.com/jenkinsci/workflow-job-plugin/pull/91 – please try the attached snapshot built from it, which should resolve your issue.
In general I'd advise finding alternatives in Pipeline to explicit polling loops – in this case an enhancement to the JIRA plugin, or simply accepting the use of an agent to do polling while needed (perhaps with exponential backoff). workflow-job.hpi
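If holding an agent is acceptable, the backoff idea could look roughly like this, again as a single sh step so the iteration count never shows up as pipeline steps. check-jira-done.sh is a hypothetical script standing in for the real readiness check:

node {
    // Poll inside one shell step, doubling the wait up to a 10-minute cap.
    sh '''
        delay=15
        until ./check-jira-done.sh; do
            echo "Not ready, sleeping ${delay}s"
            sleep "$delay"
            delay=$((delay * 2))
            if [ "$delay" -gt 600 ]; then delay=600; fi
        done
    '''
}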
svanoort: I tried the snapshot. No StackOverflowError in the log this time, but the build did not resume either (I waited almost half an hour). I'm considering using a polling agent in combination with the input construct to completely avoid these damned loops.
[Pipeline] echo
19:41:58 Iteration 6143
[Pipeline] echo
Resuming build at Tue Feb 27 19:42:36 GMT 2018 after Jenkins restart
Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
19:41:58 Iteration 6143
Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: Jenkins doesn’t have label jenkins-slave-2; jenkins-slave-1 doesn’t have label jenkins-slave-2; jenkins-slave-2 is offline
Ready to run at Tue Feb 27 19:43:10 GMT 2018
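A minimal sketch of the input-based idea mentioned above: the build parks on an input step (no agent is held and no flow nodes accumulate while it waits), and an external poller – a separate job or script watching Jira – approves it when the condition is met. The stage name and input id are illustrative:

// The build waits here without consuming an executor or growing the flow graph.
stage('Wait for Jira') {
    input id: 'JiraReady', message: 'Proceed once the Jira issue is resolved'
}

node {
    // Work that actually needs an agent continues only after the approval.
    sh 'echo "Jira condition met, continuing"'
}

The external poller would then approve the waiting input through the build's UI, or programmatically if the input step's HTTP endpoints are used.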
thxmasj That sounds like an issue I'm tracking that sometimes happens with performance-optimized builds and specific combinations of build agents (noted in a couple of quirks under https://issues.jenkins-ci.org/browse/JENKINS-47173) – do you have a sample of your pipeline here to reproduce with?
Code changed in jenkins
User: Sam Van Oort
Path:
src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java
http://jenkins-ci.org/commit/workflow-job-plugin/896703ee33d7c7678272a1cecb7cba1a4c3e1f63
Log:
Merge pull request #91 from svanoort/fix-prefix-bugs
JENKINS-38383 Avoid Stackoverflow and full-flowgraph scan when trying to get branch names for log prefixes
Compare: https://github.com/jenkinsci/workflow-job-plugin/compare/d23e989dc6d5...896703ee33d7
The problem is here: we're making a recursive call to get the logs - https://github.com/jenkinsci/workflow-job-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java#L478
It needs to be rewritten as a while(...) loop with the current search candidate tracked inside the loop. However, I understand that jglick is doing some significant rewrites of log storage at the moment, so I'll hold off on touching the code to avoid collisions.
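Roughly, the rewrite would replace the self-call with a loop that walks candidates until one with a log is found, so stack depth stays constant however long the chain is. A hand-rolled sketch with a fake node type – not the actual WorkflowRun/FlowNode code:

// Hypothetical stand-in for a flow node; the real code operates on FlowNode/flow-graph types.
class FakeNode {
    String logPrefix    // non-null if this node has log output with a branch prefix
    FakeNode parent
}

// Iterative search: walk back through parents instead of recursing.
String findLogPrefix(FakeNode node) {
    FakeNode current = node
    while (current != null) {
        if (current.logPrefix != null) {
            return current.logPrefix
        }
        current = current.parent
    }
    return null
}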
Side note, this is explicitly a "don't do that" in my recent Jenkins World talk :-P