
JENKINS-38383: StackOverflow when loading a large pipeline job

I've been using an infinite Pipeline for a demo. The job had been running for a couple of hours, and then the Jenkins master failed to load the build due to a StackOverflowError.

      node {
          int i = 0;
          while(true) {
              sh "echo 'Hello, world ${i}!'"
              sh "sleep 5"
              i = i + 1;
          }
      }
      

      Log: See attachment

Attachments:
  1. InfinitePipeline.zip (1.45 MB)
  2. log.txt (7.50 MB)
  3. workflow-job.hpi (109 kB)


          Sam Van Oort added a comment -

The problem is here: we're making a recursive call to get the logs - https://github.com/jenkinsci/workflow-job-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java#L478

It needs to be rewritten as a while(...) loop with the current search candidate tracked inside the loop. However, I understand that jglick is doing some significant rewrites of log storage at the moment, so I'll hold off touching the code to avoid collisions.
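For illustration, a minimal sketch of that shape of fix. FlowNode and LogAction are real workflow-api types, but the method name and search logic here are hypothetical, not the actual WorkflowRun code:

```groovy
import org.jenkinsci.plugins.workflow.actions.LogAction
import org.jenkinsci.plugins.workflow.graph.FlowNode

// Illustrative only: walk ancestors with a loop instead of recursing,
// so stack depth stays constant no matter how many nodes separate us
// from one that actually has a log.
String findLogPrefix(FlowNode start) {
    FlowNode candidate = start
    while (candidate != null) {
        if (candidate.getAction(LogAction.class) != null) {
            return candidate.getDisplayName()
        }
        List<FlowNode> parents = candidate.getParents()
        candidate = parents.isEmpty() ? null : parents.get(0)
    }
    return null
}
```

Because the candidate is just a local variable updated per iteration, a pipeline with thousands of nodes between the current point and the nearest log no longer grows the call stack.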

          Side note, this is explicitly a "don't do that" in my recent Jenkins World talk :-P


          Sam Van Oort added a comment -

          Downgrading priority because it should only happen if you have to search many nodes to find one with a log.

          That said, there's some weirdness with the logic here.


Thomas Johansen added a comment - edited

I get a similar symptom on startup with Jenkins version 2.102. The stack trace is now:

          [...] com.google.common.util.concurrent.ExecutionError: java.lang.StackOverflowError
          2018-02-22T11:17:37.589287583Z  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2232)
          2018-02-22T11:17:37.589290948Z  at com.google.common.cache.LocalCache.get(LocalCache.java:3965)
          2018-02-22T11:17:37.589293750Z  at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969)
          2018-02-22T11:17:37.589296421Z  at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829)
          2018-02-22T11:17:37.589299106Z  at com.google.common.cache.LocalCache$LocalManualCache.getUnchecked(LocalCache.java:4834)
          2018-02-22T11:17:37.589301839Z  at org.jenkinsci.plugins.workflow.job.WorkflowRun.getLogPrefix(WorkflowRun.java:556)
          2018-02-22T11:17:37.589304561Z  at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$900(WorkflowRun.java:134)
          2018-02-22T11:17:37.589308088Z  at org.jenkinsci.plugins.workflow.job.WorkflowRun$6.load(WorkflowRun.java:547)
          2018-02-22T11:17:37.589310809Z  at org.jenkinsci.plugins.workflow.job.WorkflowRun$6.load(WorkflowRun.java:537)
          2018-02-22T11:17:37.589313487Z  at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
          2018-02-22T11:17:37.589316198Z  at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
          2018-02-22T11:17:37.589318881Z  at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
          2018-02-22T11:17:37.589323997Z  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
          2018-02-22T11:17:37.589326739Z  ... 222 more
          2018-02-22T11:17:37.589333918Z Caused by: com.google.common.util.concurrent.ExecutionError [...]

Jenkins is then unresponsive for some minutes, and it also seems like jobs that resume while in the loop are sometimes stuck after the restart.

Is this because my Pipeline jobs use while loops?


          Sam Van Oort added a comment -

Would be resolved by JENKINS-38381 (log handling rewrite). This is the third bug I've seen tied to the logPrefix cache and similar logic.

          This officially means it's time for at least a fast-patch solution to tide us over until the comprehensive solution lands with the rewrite (my next Big Work Item).

thxmasj the issue is that your pipeline is running a very large number of steps (several thousand, generally), and under the wrong circumstances this particular operation turns into a rather nasty recursive call.

I'm going to take this on for a fix in the next week or so.


Thomas Johansen added a comment -

svanoort: If I understand you correctly, JENKINS-38381 will resolve my issue, and you aim to fix it in the near future? I am still considering rewriting my pipelines to avoid while loops (which result in thousands of steps), as it seems like the Jenkins code is not designed for this. Btw, I'm using loops for polling Jira and Git. My rewrite would then be to move the loop inside a shell step.


          Sam Van Oort added a comment -

thxmasj I'd definitely suggest using a sh loop or timeout/retry pipeline loops instead, with fewer loop cycles involved (see the sketch below). There are negative implications to using a ton of Pipeline steps in general – the Project Cheetah work I launched a little while ago makes it less harmful if you're using the new performance-optimized mode, but it's still not free. In general, if your Pipeline is over 1000 FlowNodes (500-700+ steps), that's a sign one should reconsider the approach.
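A hedged sketch of that bounded-loop alternative; `./poll.sh` is a placeholder for whatever check is being polled, and the limits are arbitrary:

```groovy
// Hypothetical sketch: bound the polling with timeout/retry so the
// flow graph gains a handful of steps rather than thousands.
node {
    timeout(time: 2, unit: 'HOURS') {
        retry(5) {
            // Fails (and is retried) until the condition holds;
            // any wait between checks should happen inside the script.
            sh './poll.sh'
        }
    }
}
```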

I aim to fix this with JENKINS-38381, but that isn't trivial, so I'm not comfortable giving an explicit date yet and will look to make tactical fixes initially.


Thomas Johansen added a comment -

svanoort: I already tested a retry step and a retry option on the stages where I poll Jira, but this also seems to produce tons of pipeline steps (I can see a lot of XML files under the workflow directory for the build).

Unfortunately I found that my strategy with sh loops is a dead end, as the sh step requires an agent (with a node, workspace, and executor). As we might have tens or even hundreds of concurrent builds sitting in a Jira-polling stage for several days, those stages should not use agents.

I had a look at the Cheetah documentation you linked to, and it looks really interesting. I think it might solve my problem IF the durability option could be set at a per-stage level – then I could make my Jira-polling stages PERFORMANCE_OPTIMIZED, which I assume would avoid the StackOverflow problem for while/retry loops. I will test the option set at the pipeline level now (although we need durability for some of the stages).
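Presumably the pipeline-level setting being tested looks something like this, assuming the `durabilityHint` job property exposed by the workflow-job plugin (Scripted Pipeline shown):

```groovy
// Set the durability hint for the whole job; per-stage hints are not
// supported, so every stage shares this setting.
properties([durabilityHint('PERFORMANCE_OPTIMIZED')])

node {
    // ... all stages, including the polling loops, now run with
    // reduced persistence of flow-node state
}
```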

          Btw I noticed the line "You can force a Pipeline to persist data by pausing it." on the Scaling pipelines page. Can the pipeline be paused/resumed from within itself (using steps?), sort of like a transaction commit?


Thomas Johansen added a comment -

Resuming a build after restart with a PERFORMANCE_OPTIMIZED pipeline with 6000+ flow nodes did not give me a hang, but rather an NPE and a failing build instead:

          [...]
          18:00:36 Iteration 6188
          [Pipeline] echo
          Resuming build at Tue Feb 27 18:01:17 GMT 2018 after Jenkins restart
          [Pipeline] End of Pipeline
          java.lang.NullPointerException
          	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:43)
          	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:331)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:82)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:243)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:231)
          	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
          	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:748)
          Finished: FAILURE
          


          Sam Van Oort added a comment -

Solution provided by removing the explicit recursion in favor of an iteration-based approach, plus changes to caching.


          Sam Van Oort added a comment -

thxmasj I'm not sure exactly what's triggering an NPE there (it looks like it somehow got caught in an invalid state) – I tried to reproduce it but can't.

But I've got a fix that's verified to remove the StackOverflowError and allow a build to cleanly resume in https://github.com/jenkinsci/workflow-job-plugin/pull/91 – please try the attached snapshot built from it, which should resolve your issue.

In general I'd advise finding alternatives in Pipeline to explicit polling loops – in this case an enhancement to the JIRA plugin, or simply accepting the use of an agent to do the polling while needed (perhaps with exponential backoff; see the sketch below). Attachment: workflow-job.hpi
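A sketch of that last suggestion, holding an agent only while polling and backing off exponentially inside a single sh step; the agent label and `./poll.sh` are hypothetical:

```groovy
// Hypothetical sketch: one sh step (one flow node) does all the polling,
// doubling the wait between checks up to a five-minute cap.
node('polling') {  // 'polling' is a made-up agent label
    sh '''
        delay=5
        until ./poll.sh; do          # placeholder readiness check
            sleep "$delay"
            delay=$(( delay * 2 ))
            [ "$delay" -gt 300 ] && delay=300
        done
    '''
}
```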


Thomas Johansen added a comment -

svanoort: I tried the snapshot. No StackOverflowError in the log this time, but the build did not resume either (I waited almost half an hour). I'm considering using a polling agent in combination with the input construct to completely avoid these damned loops.

          [Pipeline] echo
          19:41:58 Iteration 6143
          [Pipeline] echo
          Resuming build at Tue Feb 27 19:42:36 GMT 2018 after Jenkins restart
          Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
          19:41:58 Iteration 6143
          Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
          Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
          Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: Jenkins doesn’t have label jenkins-slave-2
          jenkins-slave-1 doesn’t have label jenkins-slave-2
          jenkins-slave-2 is offline
          Ready to run at Tue Feb 27 19:43:10 GMT 2018


          Sam Van Oort added a comment -

thxmasj That sounds like an issue I'm tracking that sometimes happens with performance-optimized builds and specific combinations of build agents (noted in a couple of quirks in https://issues.jenkins-ci.org/browse/JENKINS-47173) – do you have a sample of your pipeline here to reproduce with?


SCM/JIRA link daemon added a comment -

Code changed in jenkins
          User: Sam Van Oort
          Path:
          src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java
          http://jenkins-ci.org/commit/workflow-job-plugin/896703ee33d7c7678272a1cecb7cba1a4c3e1f63
          Log:
          Merge pull request #91 from svanoort/fix-prefix-bugs

          JENKINS-38383 Avoid Stackoverflow and full-flowgraph scan when trying to get branch names for log prefixes

          Compare: https://github.com/jenkinsci/workflow-job-plugin/compare/d23e989dc6d5...896703ee33d7


          Sam Van Oort added a comment -

          Released with v2.19


Assignee: svanoort Sam Van Oort
Reporter: oleg_nenashev Oleg Nenashev
Votes: 0
Watchers: 4