
JENKINS-38383: StackOverflow when loading a large pipeline job

I've been using an infinite Pipeline for a demo. The job had been running for a couple of hours, and then the Jenkins master failed to load the build due to a StackOverflowError.

      node {
          int i = 0;
          while(true) {
              sh "echo 'Hello, world ${i}!'"
              sh "sleep 5"
              i = i + 1;
          }
      }
      

      Log: See attachment

Attachments:
  1. InfinitePipeline.zip (1.45 MB)
  2. log.txt (7.50 MB)
  3. workflow-job.hpi (109 kB)


          Sam Van Oort added a comment -

The problem is here: we're making a recursive call to get the logs - https://github.com/jenkinsci/workflow-job-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java#L478

It needs to be rewritten as a while(...) loop with the current search candidate tracked inside the loop. However, I understand that jglick is doing some significant rewrites of log storage at the moment, so I'll hold off touching the code to avoid collisions.
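For illustration, a minimal sketch of that shape of fix. FlowNode and LogAction are real workflow-api types, but the method name and search logic here are hypothetical, not the actual WorkflowRun code:

```groovy
import org.jenkinsci.plugins.workflow.actions.LogAction
import org.jenkinsci.plugins.workflow.graph.FlowNode

// Illustrative only: walk ancestors with a loop instead of recursing,
// so stack depth stays constant no matter how many nodes separate us
// from one that actually has a log.
String findLogPrefix(FlowNode start) {
    FlowNode candidate = start
    while (candidate != null) {
        if (candidate.getAction(LogAction.class) != null) {
            return candidate.getDisplayName()
        }
        List<FlowNode> parents = candidate.getParents()
        candidate = parents.isEmpty() ? null : parents.get(0)
    }
    return null
}
```

Because the candidate is just a local variable updated per iteration, a pipeline with thousands of nodes between the current point and the nearest log no longer grows the call stack.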

          Side note, this is explicitly a "don't do that" in my recent Jenkins World talk :-P


          Sam Van Oort added a comment -

          Downgrading priority because it should only happen if you have to search many nodes to find one with a log.

          That said, there's some weirdness with the logic here.


Thomas Johansen added a comment - edited

I get a similar symptom on startup with Jenkins version 2.102. The stack trace is now:

          [...] com.google.common.util.concurrent.ExecutionError: java.lang.StackOverflowError
          2018-02-22T11:17:37.589287583Z  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2232)
          2018-02-22T11:17:37.589290948Z  at com.google.common.cache.LocalCache.get(LocalCache.java:3965)
          2018-02-22T11:17:37.589293750Z  at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3969)
          2018-02-22T11:17:37.589296421Z  at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4829)
          2018-02-22T11:17:37.589299106Z  at com.google.common.cache.LocalCache$LocalManualCache.getUnchecked(LocalCache.java:4834)
          2018-02-22T11:17:37.589301839Z  at org.jenkinsci.plugins.workflow.job.WorkflowRun.getLogPrefix(WorkflowRun.java:556)
          2018-02-22T11:17:37.589304561Z  at org.jenkinsci.plugins.workflow.job.WorkflowRun.access$900(WorkflowRun.java:134)
          2018-02-22T11:17:37.589308088Z  at org.jenkinsci.plugins.workflow.job.WorkflowRun$6.load(WorkflowRun.java:547)
          2018-02-22T11:17:37.589310809Z  at org.jenkinsci.plugins.workflow.job.WorkflowRun$6.load(WorkflowRun.java:537)
          2018-02-22T11:17:37.589313487Z  at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
          2018-02-22T11:17:37.589316198Z  at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
          2018-02-22T11:17:37.589318881Z  at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
          2018-02-22T11:17:37.589323997Z  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
          2018-02-22T11:17:37.589326739Z  ... 222 more
          2018-02-22T11:17:37.589333918Z Caused by: com.google.common.util.concurrent.ExecutionError [...]

Jenkins is then unresponsive for some minutes, and it also seems like jobs that resume while in the loop are sometimes stuck after the restart.

Is this because my Pipeline jobs use while loops?


          Sam Van Oort added a comment -

Would be resolved by JENKINS-38381 (log handling rewrite). This is the third bug I've seen tied to the logPrefix cache and similar logic.

          This officially means it's time for at least a fast-patch solution to tide us over until the comprehensive solution lands with the rewrite (my next Big Work Item).

thxmasj the issue is that your pipeline is running a very large number of steps (several thousand, generally), and under the wrong circumstances this particular operation turns into a rather nasty recursive call.

I'm going to take this on for a fix in the next week or so.


Thomas Johansen added a comment -

svanoort: If I understand you correctly, JENKINS-38381 will resolve my issue, and you aim to fix it in the near future? I am still considering rewriting my pipelines to avoid while loops (which result in thousands of steps), as it seems like the Jenkins code is not designed for this. Btw, I'm using loops for polling Jira and Git. My rewrite would then be to move the loop inside a shell step.


          Sam Van Oort added a comment -

thxmasj I'd definitely suggest using a sh loop or timeout/retry pipeline loops instead, with fewer loop cycles involved (see the sketch below). There are negative implications to using a ton of Pipeline steps in general – the Project Cheetah work I launched a little while ago makes it less harmful if you're using the new performance-optimized mode, but it's still not free. In general, if your Pipeline is over 1000 FlowNodes (500-700+ steps), that's a sign one should reconsider the approach.
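A hedged sketch of that bounded-loop alternative; `./poll.sh` is a placeholder for whatever check is being polled, and the limits are arbitrary:

```groovy
// Hypothetical sketch: bound the polling with timeout/retry so the
// flow graph gains a handful of steps rather than thousands.
node {
    timeout(time: 2, unit: 'HOURS') {
        retry(5) {
            // Fails (and is retried) until the condition holds;
            // any wait between checks should happen inside the script.
            sh './poll.sh'
        }
    }
}
```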

I aim to fix this with JENKINS-38381, but that isn't trivial, so I'm not comfortable giving an explicit date yet and will look to make tactical fixes initially.


Thomas Johansen added a comment -

svanoort: I already tested a retry step and a retry option on the stages where I poll Jira, but this also seems to produce tons of pipeline steps (I can see a lot of XML files under the workflow directory for the build).

Unfortunately I found that my strategy with sh loops is a dead end, as the sh step requires an agent (with a node, workspace, and executor). As we might have tens or even hundreds of concurrent builds sitting in a Jira-polling stage for several days, those stages should not use agents.

I had a look at the Cheetah documentation you linked to, and it looks really interesting. I think it might solve my problem IF the durability option could be set at a per-stage level – then I could make my Jira-polling stages PERFORMANCE_OPTIMIZED, which I assume would avoid the StackOverflow problem for while/retry loops. I will test the option set at the pipeline level now (although we need durability for some of the stages).
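Presumably the pipeline-level setting being tested looks something like this, assuming the `durabilityHint` job property exposed by the workflow-job plugin (Scripted Pipeline shown):

```groovy
// Set the durability hint for the whole job; per-stage hints are not
// supported, so every stage shares this setting.
properties([durabilityHint('PERFORMANCE_OPTIMIZED')])

node {
    // ... all stages, including the polling loops, now run with
    // reduced persistence of flow-node state
}
```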

          Btw I noticed the line "You can force a Pipeline to persist data by pausing it." on the Scaling pipelines page. Can the pipeline be paused/resumed from within itself (using steps?), sort of like a transaction commit?


Thomas Johansen added a comment -

Resuming a build after restart with a PERFORMANCE_OPTIMIZED pipeline with 6000+ flow nodes did not give me a hang, but rather an NPE and a failing build instead:

          [...]
          18:00:36 Iteration 6188
          [Pipeline] echo
          Resuming build at Tue Feb 27 18:01:17 GMT 2018 after Jenkins restart
          [Pipeline] End of Pipeline
          java.lang.NullPointerException
          	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:43)
          	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:331)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:82)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:243)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:231)
          	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
          	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:748)
          Finished: FAILURE
          


          Sam Van Oort added a comment -

Solution provided by removing the explicit recursion in favor of an iteration-based approach, plus changes to caching.


          Sam Van Oort added a comment -

thxmasj I'm not sure exactly what's triggering an NPE there (it looks like it somehow got caught in an invalid state) – I tried to reproduce it but can't.

But I've got a fix that's verified to remove the StackOverflowError and allow a build to cleanly resume in https://github.com/jenkinsci/workflow-job-plugin/pull/91 – please try the attached snapshot built from it, which should resolve your issue.

In general I'd advise finding alternatives in Pipeline to explicit polling loops – in this case an enhancement to the JIRA plugin, or simply accepting the use of an agent to do the polling while needed (perhaps with exponential backoff; see the sketch below). Attachment: workflow-job.hpi
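A sketch of that last suggestion, holding an agent only while polling and backing off exponentially inside a single sh step; the agent label and `./poll.sh` are hypothetical:

```groovy
// Hypothetical sketch: one sh step (one flow node) does all the polling,
// doubling the wait between checks up to a five-minute cap.
node('polling') {  // 'polling' is a made-up agent label
    sh '''
        delay=5
        until ./poll.sh; do          # placeholder readiness check
            sleep "$delay"
            delay=$(( delay * 2 ))
            [ "$delay" -gt 300 ] && delay=300
        done
    '''
}
```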


Thomas Johansen added a comment -

svanoort: I tried the snapshot. No StackOverflowError in the log this time, but the build did not resume either (I waited almost half an hour). I'm considering using a polling agent in combination with the input construct to completely avoid these damned loops.

          [Pipeline] echo
          19:41:58 Iteration 6143
          [Pipeline] echo
          Resuming build at Tue Feb 27 19:42:36 GMT 2018 after Jenkins restart
          Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
          19:41:58 Iteration 6143
          Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
          Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
          Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: Jenkins doesn’t have label jenkins-slave-2
          jenkins-slave-1 doesn’t have label jenkins-slave-2
          jenkins-slave-2 is offline
          Ready to run at Tue Feb 27 19:43:10 GMT 2018


          Sam Van Oort added a comment -

thxmasj That sounds like an issue I'm tracking that sometimes happens with performance-optimized builds and specific combinations of build agents (noted in a couple of quirks in https://issues.jenkins-ci.org/browse/JENKINS-47173) – do you have a sample of your pipeline here to reproduce with?


SCM/JIRA link daemon added a comment -

Code changed in jenkins
          User: Sam Van Oort
          Path:
          src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java
          http://jenkins-ci.org/commit/workflow-job-plugin/896703ee33d7c7678272a1cecb7cba1a4c3e1f63
          Log:
          Merge pull request #91 from svanoort/fix-prefix-bugs

          JENKINS-38383 Avoid Stackoverflow and full-flowgraph scan when trying to get branch names for log prefixes

          Compare: https://github.com/jenkinsci/workflow-job-plugin/compare/d23e989dc6d5...896703ee33d7


          Sam Van Oort added a comment -

          Released with v2.19


Assignee: svanoort Sam Van Oort
Reporter: oleg_nenashev Oleg Nenashev
Votes: 0
Watchers: 4