[JENKINS-59465] StackOverflowError When Resuming Build After Restart

      On a Windows system, I get a StackOverflowError when safeRestart is triggered while this pipeline is waiting at the input step.

      pipeline {
          agent none
      
          environment {
              submitters = 'charles'
          }
      
          stages {
              stage('Stage1') {
                  agent any
                  steps {
                      echo 'Stage1'
                  }
              }
              stage('Stage2') {
                  agent any
                  steps {
                      echo 'Stage2'
                  }
              }
              stage('Stage3') {
                  agent any
                  steps {
                      echo 'Stage3'
                  }
              }
              stage('Input') {
                  agent none
                  options {
                      timeout time: 15, unit: 'MINUTES'
                  }
                  steps {
                      input message: 'Select Proceed to continue.', submitter: "${submitters}"
                  }
              }
          }
      }
      

      Here is a portion of the console log. The full console log is attached.

      [Pipeline] {
      [Pipeline] input
      Select Proceed to continue.
      Proceed or Abort
      Resuming build at Tue Sep 17 13:50:18 EDT 2019 after Jenkins restart
      [Pipeline] End of Pipeline
      java.lang.StackOverflowError
      	at java.lang.Class.forName0(Native Method)
      	at java.lang.Class.forName(Unknown Source)
      	at org.jboss.marshalling.AbstractClassResolver.loadClass(AbstractClassResolver.java:123)
      	at org.jboss.marshalling.AbstractClassResolver.resolveClass(AbstractClassResolver.java:104)
      	at org.jboss.marshalling.river.RiverUnmarshaller.doReadClassDescriptor(RiverUnmarshaller.java:998)
      ...
      Caused: java.io.IOException: Failed to load build state
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:855)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$3.onSuccess(CpsFlowExecution.java:853)
      	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4$1.run(CpsFlowExecution.java:907)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:37)
      

      While experimenting with this, I found four changes that each stop the failure. Applying any one of the four on its own lets the pipeline resume without an error. It seems odd that the fourth one has any effect on the issue.

      • Change input stage to "agent any".
      • Comment out timeout option.
      • Comment out environment section and hardcode submitters value.
      • Comment out one of the three numbered stages.
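
As an illustration of the third change above, the Input stage could look roughly like the fragment below once the environment section is commented out. This is only a sketch of one workaround. It avoids the symptom and does not explain the cause, and the rest of the pipeline is assumed to be unchanged.

```groovy
// Sketch of workaround 3: the top-level environment { } block is removed
// and the submitter is hardcoded in the input step instead.
stage('Input') {
    agent none
    options {
        timeout time: 15, unit: 'MINUTES'
    }
    steps {
        // 'charles' was previously supplied through the submitters variable
        input message: 'Select Proceed to continue.', submitter: 'charles'
    }
}
```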

      This error does not occur on the jenkins/jenkins:lts Docker image.

      The "Pipeline Default Speed/Durability Level" is set to "None: use pipeline default (MAX_SURVIVABILITY)".

      The pipeline is running on the master (Windows) system with no slaves defined.


          Charles Bozarth added a comment -

          I'm still experiencing this issue with Jenkins v2.222.3. Plugin workflow-durable-task-step is v2.35 and workflow-cps is v2.80.

          I've been trying to find a log that provides more information, but no luck.

          I did learn another way to modify the example pipeline to survive a restart. For any of the numbered stages change "agent any" to "agent none".

          I also learned that the example pipeline does not fail on a system running CloudBees Core v2.176.2.3 with a Linux master and the build running on a Windows Server slave.

          Charles Bozarth added a comment -

          I found JENKINS-52966 because it had a similar error message. A comment there suggested changing to a 64-bit JRE. After doing that, I cannot recreate the issue. I did not change the heap size or other options at first. I then changed the heap size, and it continues to work.

          I will monitor this for a while before considering it resolved.
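
To confirm which kind of JRE the controller is actually running on, something like the following could be pasted into the Script Console (Manage Jenkins). It is a minimal sketch. The sun.arch.data.model property is specific to HotSpot-based JVMs and may be absent elsewhere, so os.arch is printed as well.

```groovy
// Report whether the running JVM is 32-bit or 64-bit.
// sun.arch.data.model is HotSpot-specific and may be null on other JVMs.
def bits = System.getProperty('sun.arch.data.model')
def arch = System.getProperty('os.arch')
def javaHome = System.getProperty('java.home')

println "JVM data model: ${bits ?: 'unknown'}"
println "os.arch:        ${arch}"
println "java.home:      ${javaHome}"
```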

          Kalle Niemitalo added a comment -

          I likewise hit a StackOverflowError when resuming after plugin upgrades (analysis-model-api, azure-ad, bootstrap5-api, popper2-api, atlassian-bitbucket-server-integration) and a Jenkins controller restart. This is Jenkins 2.346.3 on a 64-bit Eclipse Adoptium JRE 11 on Windows. The pipeline does not define any timeouts and does not use input steps. It has agent none at top level. The stage that was paused is nested within a stage that specifies an agent.

          If I understand correctly, the "Failed to load build state" message comes from CpsFlowExecution.loadProgramFailed, which is called after CpsFlowExecution.onLoad or CpsFlowExecution.loadProgramAsync has caught an exception. These methods would apparently load the file that CpsFlowExecution.getProgramDataFile returns, i.e. "program.dat", but there is no such file in the directory of the build in the Jenkins controller, so I cannot check whether the file contains some kind of recursive reference.

          When Jenkins detects this error, should it stash the "program.dat" file somewhere for later analysis?
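
Until something like that exists, one way to check for "program.dat" by hand, and to copy it aside while a build is still paused and before restarting, is a Script Console snippet along these lines. This is a sketch only. The job name, build number and backup file name are hypothetical placeholders, and whether the file exists at a given moment depends on the durability setting and on when the build state was last persisted.

```groovy
import jenkins.model.Jenkins
import java.nio.file.Files
import java.nio.file.StandardCopyOption

// Hypothetical values: replace with the affected job and build.
def jobName = 'my-folder/my-pipeline'
def buildNumber = 42

def job = Jenkins.get().getItemByFullName(jobName)
def build = job?.getBuildByNumber(buildNumber)

if (build == null) {
    println "Build not found: ${jobName} #${buildNumber}"
} else {
    def programDat = new File(build.rootDir, 'program.dat')
    println "${programDat} exists: ${programDat.exists()}"
    if (programDat.exists()) {
        // Keep a copy next to the original for later analysis
        // (backup name is an arbitrary choice for this sketch).
        def backup = new File(build.rootDir, 'program.dat.bak')
        Files.copy(programDat.toPath(), backup.toPath(),
                   StandardCopyOption.REPLACE_EXISTING)
        println "Copied to ${backup}"
    }
}
```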

          Jesse Glick added a comment -

          Presuming duplicate, unless this is reproduced in Scripted syntax.
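
For anyone who wants to try reproducing this in Scripted syntax, a rough translation of the Declarative example from the description might look like the sketch below. It is an approximation, not an exact equivalent: node stands in for "agent any", withEnv stands in for the environment section, and Declarative adds its own wrapper objects that this version does not have.

```groovy
// Approximate Scripted counterpart of the Declarative reproducer.
withEnv(['submitters=charles']) {
    stage('Stage1') {
        node { echo 'Stage1' }
    }
    stage('Stage2') {
        node { echo 'Stage2' }
    }
    stage('Stage3') {
        node { echo 'Stage3' }
    }
    stage('Input') {
        // No node here, matching "agent none" on the Input stage
        timeout(time: 15, unit: 'MINUTES') {
            input message: 'Select Proceed to continue.', submitter: env.submitters
        }
    }
}
```

Triggering safeRestart while this version waits at the input step would show whether the error is specific to Declarative.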


            Assignee: Unassigned
            Reporter: Charles Bozarth (charleswb)
            Votes: 1
            Watchers: 4