
StackOverflow when loading a large pipeline job

      I've been using an infinite Pipeline for a demo. The job had been running for a couple of hours when the Jenkins master failed to load the build due to a StackOverflowError.

      node {
          int i = 0;
          while(true) {
              sh "echo 'Hello, world ${i}!'"
              sh "sleep 5"
              i = i + 1;
          }
      }
      

      Log: See attachment

        Attachments:
          1. InfinitePipeline.zip (1.45 MB)
          2. log.txt (7.50 MB)
          3. workflow-job.hpi (109 kB)

          [JENKINS-38383] StackOverflow when loading a large pipeline job

          Thomas Johansen added a comment -

          svanoort: If I understand you correctly, JENKINS-38381 will resolve my issue, and you aim to fix it in the near future? I am still considering rewriting my pipelines to avoid while loops (which result in thousands of steps), as it seems the Jenkins code is not designed for this. Btw I'm using the loops for polling Jira and Git; my rewrite would be to move the loop inside a shell step.

          Sam Van Oort added a comment -

          thxmasj I'd suggest definitely using a sh loop or timeout/retry pipeline loops instead (with fewer loop cycles involved). There are negative implications to using a ton of Pipeline steps in general – the Project Cheetah work I launched a little while ago makes it less harmful if you're using the new performance-optimized mode, but it's still not free. In general, if your Pipeline is over 1000 FlowNodes (500-700+ steps), that's a sign one should reconsider the approach used.
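
          For example, the demo Pipeline above could push the loop down into a single sh step so that the whole wait costs a handful of FlowNodes instead of thousands – a minimal sketch (the iteration bound and the 5-second sleep are placeholders):

              node {
                  // The shell owns the loop, so the flow graph stays small
                  // no matter how long the job keeps polling.
                  sh '''
                      i=0
                      while [ "$i" -lt 1000 ]; do
                          echo "Hello, world $i!"
                          sleep 5
                          i=$((i+1))
                      done
                  '''
              }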

          I aim to fix this with JENKINS-38381, but that isn't trivial, so I'm not comfortable giving an explicit date yet and will look to make tactical fixes initially.


          Thomas Johansen added a comment -

          svanoort: I already tested a retry step and a retry option on the stages where I poll Jira, but this also seems to produce tons of pipeline steps (I can see a lot of XML files under the workflow directory for the build).

          Unfortunately, I found that my strategy with sh loops is a dead end, as the sh step requires an agent (with a node, workspace and executor). As we might have tens or even hundreds of concurrent builds sitting in a Jira-polling stage for several days, those stages should not use agents.

          I had a look at the Cheetah documentation you linked to, and it looks really interesting. I think it might solve my problem IF the durability option could be set per stage – then I could make my Jira-polling stages PERFORMANCE_OPTIMIZED, which I assume would avoid the StackOverflow problem for while/retry loops. I will test the option at the pipeline level now (although we need durability for some of the stages).
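
          For reference, setting the hint at the pipeline level in a scripted Pipeline looks roughly like this – a sketch assuming a workflow-job version that exposes the durabilityHint job property (the stage name and the polling script are placeholders):

              // The hint applies to the whole run; there is no per-stage granularity.
              properties([durabilityHint('PERFORMANCE_OPTIMIZED')])

              node {
                  stage('Poll Jira') {
                      // Hypothetical polling script standing in for the real Jira check.
                      sh './poll-jira.sh'
                  }
              }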

          Btw I noticed the line "You can force a Pipeline to persist data by pausing it." on the Scaling pipelines page. Can the pipeline be paused/resumed from within itself (using steps?), sort of like a transaction commit?


          Thomas Johansen added a comment -

          Resuming a build after restart with a PERFORMANCE_OPTIMIZED pipeline with 6000+ flow nodes did not give me a hang, but rather an NPE and a failing build:

          [...]
          18:00:36 Iteration 6188
          [Pipeline] echo
          Resuming build at Tue Feb 27 18:01:17 GMT 2018 after Jenkins restart
          [Pipeline] End of Pipeline
          java.lang.NullPointerException
          	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:43)
          	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:174)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:331)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:82)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:243)
          	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:231)
          	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
          	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:748)
          Finished: FAILURE
          


          Sam Van Oort added a comment -

          Solution provided by removing explicit recursion and instead using an iteration-based approach + changes to caching.
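
          As an illustration only of the shape of that change (hypothetical names, not the plugin's actual code): a recursive walk up the flow graph to collect enclosing branch names can be replaced by a loop with an explicit cursor, so graph depth no longer consumes Java stack frames:

              // Sketch: node, isBranchStart(), branchName and parent are invented
              // stand-ins for whatever the real flow-graph API provides.
              List<String> branchNames(def node) {
                  def names = []
                  def current = node
                  while (current != null) {          // iterate instead of recursing
                      if (current.isBranchStart()) {
                          names << current.branchName
                      }
                      current = current.parent       // walk one level up the graph
                  }
                  return names.reverse()
              }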


          Sam Van Oort added a comment -

          thxmasj I'm not sure exactly what's triggering the NPE there (it looks like the build somehow got caught in an invalid state) – I tried to reproduce it but can't.

          But I've got a fix that's verified to remove the StackOverflowError and allow a build to cleanly resume in https://github.com/jenkinsci/workflow-job-plugin/pull/91 – please try the attached snapshot built from it, which should remove your issue.

          In general I'd advise finding alternatives in Pipeline to explicit polling loops – in this case an enhancement to the JIRA plugin, or simply accepting the use of an agent to do the polling when needed (perhaps with exponential backoff). workflow-job.hpi
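
          A minimal sketch of that agent-based polling with backoff, again with the loop inside one sh step so it adds only a couple of FlowNodes (the check script and the delay bounds are placeholders):

              node {
                  sh '''
                      delay=5
                      # ./check-jira-done.sh is a hypothetical script that exits 0
                      # once the Jira issue reaches the desired state.
                      until ./check-jira-done.sh; do
                          sleep "$delay"
                          delay=$((delay * 2))
                          if [ "$delay" -gt 300 ]; then delay=300; fi
                      done
                  '''
              }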


          Thomas Johansen added a comment -

          svanoort: I tried the snapshot. No StackOverflowError in the log this time, but the build did not resume either (I waited almost half an hour). I'm considering using a polling agent in combination with the input construct to completely avoid these damned loops.

          [Pipeline] echo
          19:41:58 Iteration 6143
          [Pipeline] echo
          Resuming build at Tue Feb 27 19:42:36 GMT 2018 after Jenkins restart
          Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
          19:41:58 Iteration 6143
          Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
          Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: jenkins-slave-2 is offline
          Waiting to resume part of jenkins-pipeline » work/PBLEID-15078 #1: Jenkins doesn’t have label jenkins-slave-2
          jenkins-slave-1 doesn’t have label jenkins-slave-2
          jenkins-slave-2 is offline
          Ready to run at Tue Feb 27 19:43:10 GMT 2018
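
          A rough sketch of that input-based shape, where the build parks without holding an executor and an external poller signals it when Jira is ready (the step id, the helper script, and any REST endpoint used for programmatic approval are assumptions to verify against the input step documentation):

              // No agent is held while waiting: input pauses the build off-executor.
              stage('Wait for Jira') {
                  // Approved by a person or by an external poller (cron job, webhook, ...)
                  // rather than by a loop inside the Pipeline itself.
                  input id: 'JiraDone', message: 'Jira issue resolved?'
              }

              node {
                  stage('Continue') {
                      sh './do-the-work.sh'   // hypothetical follow-up work
                  }
              }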


          Sam Van Oort added a comment -

          thxmasj That sounds like an issue I'm tracking that sometimes happens with performance-optimized builds and specific combinations of build agents (noted among a couple of quirks in https://issues.jenkins-ci.org/browse/JENKINS-47173) – do you have a sample of your pipeline here to reproduce with?


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Sam Van Oort
          Path:
          src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java
          http://jenkins-ci.org/commit/workflow-job-plugin/896703ee33d7c7678272a1cecb7cba1a4c3e1f63
          Log:
          Merge pull request #91 from svanoort/fix-prefix-bugs

          JENKINS-38383 Avoid Stackoverflow and full-flowgraph scan when trying to get branch names for log prefixes

          Compare: https://github.com/jenkinsci/workflow-job-plugin/compare/d23e989dc6d5...896703ee33d7


          Sam Van Oort added a comment -

          Released with v2.19


            Assignee: svanoort (Sam Van Oort)
            Reporter: oleg_nenashev (Oleg Nenashev)