Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-50199

Failed pipeline jobs stuck running after incorrect resume

    • Pipeline - April 2018
    • Pipeline Job 2.25

      Setup

      Jenkins v2.89.4 LTS
      Pipeline API Plugin: 2.26
      Pipeline Nodes and Processes Plugin: 2.19
      Durable Task Plugin: 1.18
      Pipeline Job Plugin: 2.17
      Pipeline Shared Groovy Libraries Plugin: 2.9
      Pipeline Supporting APIs Plugin: 2.18
      Pipeline Groovy Plugin: 2.45
      Script Security Plugin: 1.41

      Pipeline Default Speed/Durability Level: Performance-Optimized
      "Do not allow the pipeline to resume if the master restarts": Enabled on all jobs

      Problem

      I logged into a Jenkins master and saw no builds running but there was a queue of about 10 jobs. When mousing over the queued jobs, I saw "pending - Already running 2 builds across all nodes". This is strange because no jobs were showing as running and no Jenkins agents or executors were showing any running builds.

      I then ran "http://xx.xxx.xxx.xxx:8080/computer/api/xml?tree=computer[executors[currentExecutable[url]],oneOffExecutors[currentExecutable[url]]]&xpath=//url&wrapper=builds" which did show 5 builds were running. I checked these builds and they were red (failure) and were not running.

      Research

      I checked the console log of a build that showed as running but isn't and saw the line below near the top of the log.

      Resume disabled by user, switching to high-performance, low-durability mode.

      At the end of the of the log I saw the following:

      Finished: FAILURE
      Resuming build at Tue Mar 13 23:04:52 UTC 2018 after Jenkins restart

      Why Resume Build?

      The build failed on Mar 12, 2018 6:02:37 PM.
      Why did the build try to resume almost a day later? The job and system durability are configured to not resume builds. Below are some details taken from the API for the build.

      _class "hudson.model.OneOffExecutor"
      id "41"
      keepLog false
      number 41
      queueId 7178
      result "FAILURE"
      timestamp 1520877757466

      I checked the Java process on the server and it was last restarted on March 02 2018.

      What triggered the "Jenkins restart" identified on Mar 13 23:04:52 UTC 2018 since the Java process was not restarted?

      Why does this get the build stuck in a "running" state when it's not running?

      Scope

      This issue can be seen across many of our Jenkins masters. In each case we see "Resuming build at xxxxx after Jenkins restart" occur a few days after the build failure or abort even though Java was not restarted. This issue didn't occur on Jenkins 2.60.3 running the older (pre-durability configurable) Pipeline plugins.

      Logs

      I checked the jenkins.log file and saw the following when the build was attempting to be resumed.

       

      Mar 13, 2018 9:29:56 PM org.jenkinsci.plugins.workflow.cps.CpsFlowExecution onLoad
       WARNING: Pipeline state not properly persisted, cannot resume job/JENKINS-JOB-NAME1/42/
      
       Mar 13, 2018 9:29:56 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
       WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[OwnerJENKINS-JOB-NAME1/42:JENKINS-JOB-NAME1 #42]
       java.lang.NullPointerException
       at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:997)
       at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1368)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:412)
       at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
       at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
       at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
      
      Mar 13, 2018 9:29:56 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
       WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[OwnerJENKINS-JOB-NAME1/42:JENKINS-JOB-NAME1 #42]
       java.lang.NullPointerException
       at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:997)
       at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1368)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:412)
       at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
       at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
       at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
      
      Mar 13, 2018 9:29:56 PM org.jenkinsci.plugins.workflow.cps.CpsFlowExecution onLoad
       WARNING: Pipeline state not properly persisted, cannot resume job/JENKINS-JOB-NAME1/40/
      
       Mar 13, 2018 9:29:56 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
       WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[OwnerJENKINS-JOB-NAME1/40:JENKINS-JOB-NAME1 #40]
       java.lang.NullPointerException
       at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:997)
       at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1368)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:412)
       at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
       at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
       at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
      
      Mar 13, 2018 9:29:56 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
       WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[OwnerJENKINS-JOB-NAME1/40:JENKINS-JOB-NAME1 #40]
       java.lang.NullPointerException
       at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:997)
       at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1368)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:412)
       at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
       at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
       at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
      
      Mar 13, 2018 11:04:52 PM org.jenkinsci.plugins.workflow.cps.CpsFlowExecution onLoad
       WARNING: Pipeline state not properly persisted, cannot resume job/JENKINS-JOB-NAME2/41/
      
       Mar 13, 2018 11:04:52 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
       WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[OwnerJENKINS-JOB-NAME2/41:JENKINS-JOB-NAME2 #41]
       java.lang.NullPointerException
       at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:997)
       at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1368)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:412)
       at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
       at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
       at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
      
      Mar 13, 2018 11:04:52 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
       WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[OwnerJENKINS-JOB-NAME2/41:JENKINS-JOB-NAME2 #41]
       java.lang.NullPointerException
       at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:997)
       at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1368)
       at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:412)
       at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
       at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
       at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
      

       

      Thread Dump

      I only saw WorkflowRun.copyLogs with workflow in the name in a thread dump so I'm not sure if it is related. I didn't see anything BLOCKING. A lot of different items were WAITING on "java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject".

      "WorkflowRun.copyLogs [#3]" Id=8829 Group=main TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@477b1566
       at sun.misc.Unsafe.park(Native Method)
       - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@477b1566
       at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
       at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
       at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
       at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
      
      "WorkflowRun.copyLogs [#4]" Id=8830 Group=main WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@477b1566
       at sun.misc.Unsafe.park(Native Method)
       - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@477b1566
       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
       at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1088)
       at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
       at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
      

      Workaround

      Running the Jenkins script below will set the running build to aborted to work around the issue.

      Jenkins.instance.getItemByFullName("JOBNAME").getBuildByNumber(JOB#).finish(hudson.model.Result.ABORTED, new java.io.IOException("Aborting build"));

        1. 867.tar.gz
          4 kB
        2. flownode-84.xml
          21 kB
        3. workflow-cps.hpi
          548 kB
        4. workflow-job.hpi
          112 kB
        5. workflows.xml
          5 kB
        6. build.xml
          10 kB
        7. workflow-cps.hpi
          540 kB
        8. workflow-job.hpi
          110 kB
        9. workflow-job.hpi
          110 kB
        10. workflow-cps.hpi
          540 kB
        11. workflow-fallback-flowNodeStore.xml
          7 kB
        12. flowNodeStore.xml
          4 kB
        13. build.xml
          1 kB
        14. build.log
          0.4 kB

          [JENKINS-50199] Failed pipeline jobs stuck running after incorrect resume

          Sam Van Oort added a comment -

          mkozell The NPEs you see look like a duplicate of https://issues.jenkins-ci.org/browse/JENKINS-49686 which I am investigating right now.

          May I please request the following diagnostic information:

          1. Please can you attach both the config.xml for the job and a build.xml for one or more Pipelines that experienced this issue.
          2. In the build's 'workflow' directory, there's a flownodeStore.xml file - please can you GZIP it and attach it.
          3. Can I request the last part of the build log for these builds (the final parts indicating build completed and anything after that).
          4. Are all the Pipelines experiencing issues making use of parallels?
          5. Do you have any fairly simple pipelines that will reproduce the issues, which you can provide code for?

          There's also fix in workflow-cps plugin 2.46 for some issues with Resume Disabled - without that fix there may be odd behaviors if the Resume Disable flag is modified after the build begins. If this has happened, the fix there will help and after gathering the above information, I would suggest installing this update and seeing if issues recur when Resume Disabled in on.

          If that does not resolve the resume issue, it may be necessary to toggle this option OFF for the meantime:

          Do not allow the pipeline to resume if the master restarts": Enabled on all jobs

          We hope to have a hotfix available soon to at least try out.

          Thank you

          Sam Van Oort added a comment - mkozell The NPEs you see look like a duplicate of https://issues.jenkins-ci.org/browse/JENKINS-49686 which I am investigating right now. May I please request the following diagnostic information: 1. Please can you attach both the config.xml for the job and a build.xml for one or more Pipelines that experienced this issue. 2. In the build's 'workflow' directory, there's a flownodeStore.xml file - please can you GZIP it and attach it. 3. Can I request the last part of the build log for these builds (the final parts indicating build completed and anything after that). 4. Are all the Pipelines experiencing issues making use of parallels? 5. Do you have any fairly simple pipelines that will reproduce the issues, which you can provide code for? There's also fix in workflow-cps plugin 2.46 for some issues with Resume Disabled - without that fix there may be odd behaviors if the Resume Disable flag is modified after the build begins. If this has happened, the fix there will help and after gathering the above information, I would suggest installing this update and seeing if issues recur when Resume Disabled in on. If that does not resolve the resume issue, it may be necessary to toggle this option OFF for the meantime: Do not allow the pipeline to resume if the master restarts": Enabled on all jobs We hope to have a hotfix available soon to at least try out. Thank you

          Sam Van Oort added a comment -

          Thanks mkozell that data is indeed very useful and solidly confirms my working theory (which I will explain in a moment). Before I explain (since it'll take a bit to write up and I'm hoping you're still around), please could you do one more analysis:

          1. Do you see any IOExceptions in the jenkins log or build log for these builds
          2. Please can you do a search for "FlowStartNode" and "FlowEndNode" in both flowNodeStore.xml files and report the id and any error shown for that element?

          Thanks!

          Sam Van Oort added a comment - Thanks mkozell that data is indeed very useful and solidly confirms my working theory (which I will explain in a moment). Before I explain (since it'll take a bit to write up and I'm hoping you're still around), please could you do one more analysis: 1. Do you see any IOExceptions in the jenkins log or build log for these builds 2. Please can you do a search for "FlowStartNode" and "FlowEndNode" in both flowNodeStore.xml files and report the id and any error shown for that element? Thanks!

          Sam Van Oort added a comment -

          mkozell Also, do you see anything anything in a directory "workflow-fallback" (perhaps with a flowNodeStore.xml)?

          Sam Van Oort added a comment - mkozell Also, do you see anything anything in a directory "workflow-fallback" (perhaps with a flowNodeStore.xml)?

          Sam Van Oort added a comment -

          Ahh, nevermind, I missed that in the listing. That confirms another part of the theory.

          Sam Van Oort added a comment - Ahh, nevermind, I missed that in the listing. That confirms another part of the theory.

          Mike Kozell added a comment -

          1. I didn't see any IOExceptions in the jenkins.log file. There is one is the workflow-fallback/flowNodeStore.xml file which is included in this ticket. I did get an IOException after I aborted the build with the work around.

          java.io.IOException: Aborting build
                  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
                  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
                  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
                  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
                  at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:83)
                  at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrapNoCoerce.callConstructor(ConstructorSite.java:105)
                  at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallConstructor(CallSiteArray.java:60)
                  at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:235)
                  at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:247)
                  at Script1.run(Script1.groovy:1)
                  at groovy.lang.GroovyShell.evaluate(GroovyShell.java:585)
                  at groovy.lang.GroovyShell.evaluate(GroovyShell.java:623)
                  at groovy.lang.GroovyShell.evaluate(GroovyShell.java:594)
                  at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:142)
                  at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:114)
                  at hudson.remoting.LocalChannel.call(LocalChannel.java:45)
                  at hudson.util.RemotingDiagnostics.executeGroovy(RemotingDiagnostics.java:111)
                  at jenkins.model.Jenkins._doScript(Jenkins.java:4360)
                  at jenkins.model.Jenkins.doScript(Jenkins.java:4331)
                  at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627)
                  at org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:343)
                  at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:184)
                  at org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:117)
                  at org.kohsuke.stapler.MetaClass$1.doDispatch(MetaClass.java:129)
                  at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58)
                  at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:715)
                  at org.kohsuke.stapler.Stapler.invoke(Stapler.java:845)
                  at org.kohsuke.stapler.Stapler.invoke(Stapler.java:649)
                  at org.kohsuke.stapler.Stapler.service(Stapler.java:238)
                  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
                  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:841)
                  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1650)
                  at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154)
                  at com.smartcodeltd.jenkinsci.plugin.assetbundler.filters.LessCSS.doFilter(LessCSS.java:47)
                  at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
                  at hudson.plugins.greenballs.GreenBallFilter.doFilter(GreenBallFilter.java:59)
                  at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151)
                  at hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:157)
                  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
                  at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:64)
                  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
                  at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:84)
                  at hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51)
                  at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
                  at jenkins.security.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:117)
                  at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
                  at org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125)
                  at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
                  at org.acegisecurity.ui.rememberme.RememberMeProcessingFilter.doFilter(RememberMeProcessingFilter.java:142)
                  at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
                  at org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271)
                  at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
                  at jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:93)
                  at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
                  at org.acegisecurity.context.HttpSessionContextIntegrationFilter.doFilter(HttpSessionContextIntegrationFilter.java:249)
                  at hudson.security.HttpSessionContextIntegrationFilter2.doFilter(HttpSessionContextIntegrationFilter2.java:67)
                  at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87)
                  at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:90)
                  at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:171)
                  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
                  at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:49)
                  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
                  at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:82)
                  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
                  at org.kohsuke.stapler.DiagnosticThreadNameFilter.doFilter(DiagnosticThreadNameFilter.java:30)
                  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
                  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
                  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
                  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
                  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
                  at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
                  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
                  at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
                  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
                  at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
                  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
                  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
                  at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
                  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
                  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
                  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
                  at org.eclipse.jetty.server.Server.handle(Server.java:564)
                  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:317)
                  at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
                  at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
                  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:110)
                  at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
                  at org.eclipse.jetty.util.thread.Invocable.invokePreferred(Invocable.java:128)
                  at org.eclipse.jetty.util.thread.Invocable$InvocableExecutor.invoke(Invocable.java:222)
                  at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:294)
                  at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:199)
                  at winstone.BoundedExecutorService$1.run(BoundedExecutorService.java:77)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
                  at java.lang.Thread.run(Thread.java:748)
          Finished: ABORTED
          
          

          Mike Kozell added a comment - 1. I didn't see any IOExceptions in the jenkins.log file. There is one is the workflow-fallback/flowNodeStore.xml file which is included in this ticket. I did get an IOException after I aborted the build with the work around. java.io.IOException: Aborting build at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:83) at org.codehaus.groovy.runtime.callsite.ConstructorSite$ConstructorSiteNoUnwrapNoCoerce.callConstructor(ConstructorSite.java:105) at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallConstructor(CallSiteArray.java:60) at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:235) at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callConstructor(AbstractCallSite.java:247) at Script1.run(Script1.groovy:1) at groovy.lang.GroovyShell.evaluate(GroovyShell.java:585) at groovy.lang.GroovyShell.evaluate(GroovyShell.java:623) at groovy.lang.GroovyShell.evaluate(GroovyShell.java:594) at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:142) at hudson.util.RemotingDiagnostics$Script.call(RemotingDiagnostics.java:114) at hudson.remoting.LocalChannel.call(LocalChannel.java:45) at hudson.util.RemotingDiagnostics.executeGroovy(RemotingDiagnostics.java:111) at jenkins.model.Jenkins._doScript(Jenkins.java:4360) at jenkins.model.Jenkins.doScript(Jenkins.java:4331) at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:627) at org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:343) at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:184) at org.kohsuke.stapler.Function.bindAndInvokeAndServeResponse(Function.java:117) at org.kohsuke.stapler.MetaClass$1.doDispatch(MetaClass.java:129) at org.kohsuke.stapler.NameBasedDispatcher.dispatch(NameBasedDispatcher.java:58) at org.kohsuke.stapler.Stapler.tryInvoke(Stapler.java:715) at org.kohsuke.stapler.Stapler.invoke(Stapler.java:845) at org.kohsuke.stapler.Stapler.invoke(Stapler.java:649) at org.kohsuke.stapler.Stapler.service(Stapler.java:238) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:841) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1650) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:154) at com.smartcodeltd.jenkinsci.plugin.assetbundler.filters.LessCSS.doFilter(LessCSS.java:47) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at hudson.plugins.greenballs.GreenBallFilter.doFilter(GreenBallFilter.java:59) at hudson.util.PluginServletFilter$1.doFilter(PluginServletFilter.java:151) at hudson.util.PluginServletFilter.doFilter(PluginServletFilter.java:157) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637) at hudson.security.csrf.CrumbFilter.doFilter(CrumbFilter.java:64) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:84) at hudson.security.UnwrapSecurityExceptionFilter.doFilter(UnwrapSecurityExceptionFilter.java:51) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at jenkins.security.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:117) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at org.acegisecurity.providers.anonymous.AnonymousProcessingFilter.doFilter(AnonymousProcessingFilter.java:125) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at org.acegisecurity.ui.rememberme.RememberMeProcessingFilter.doFilter(RememberMeProcessingFilter.java:142) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at org.acegisecurity.ui.AbstractProcessingFilter.doFilter(AbstractProcessingFilter.java:271) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at jenkins.security.BasicHeaderProcessor.doFilter(BasicHeaderProcessor.java:93) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at org.acegisecurity.context.HttpSessionContextIntegrationFilter.doFilter(HttpSessionContextIntegrationFilter.java:249) at hudson.security.HttpSessionContextIntegrationFilter2.doFilter(HttpSessionContextIntegrationFilter2.java:67) at hudson.security.ChainedServletFilter$1.doFilter(ChainedServletFilter.java:87) at hudson.security.ChainedServletFilter.doFilter(ChainedServletFilter.java:90) at hudson.security.HudsonFilter.doFilter(HudsonFilter.java:171) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637) at org.kohsuke.stapler.compression.CompressionFilter.doFilter(CompressionFilter.java:49) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637) at hudson.util.CharacterEncodingFilter.doFilter(CharacterEncodingFilter.java:82) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637) at org.kohsuke.stapler.DiagnosticThreadNameFilter.doFilter(DiagnosticThreadNameFilter.java:30) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) at org.eclipse.jetty.server.Server.handle(Server.java:564) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:317) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:110) at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124) at org.eclipse.jetty.util.thread.Invocable.invokePreferred(Invocable.java:128) at org.eclipse.jetty.util.thread.Invocable$InvocableExecutor.invoke(Invocable.java:222) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:294) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:199) at winstone.BoundedExecutorService$1.run(BoundedExecutorService.java:77) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang. Thread .run( Thread .java:748) Finished: ABORTED

          Sam Van Oort added a comment -

          So, here's the things I can put together:

          1. Flow completed successfully initially. Speculation: somehow there is invalid state in the list of persisted 'heads' (final FlowNodes) for the build with a null head stored, or that FlowNode wasn't persisted into storage for $reasons and cannot be loaded.
          2. Because the flow ended in a way that does not evaluate as 'isComplete' or state was not persisted such that it would show as complete – this would happen if either the number of 'head' (most recent) nodes for the flow is >1 or the final 'head' FlowNode is NOT the FlowEndNode as expected (could be null) – this means the Pipeline will show as incomplete and eligible to try to resume (may be blocked if resume is disabled explicitly or fails).
          3. Eventually the build is cleared from the SoftReference cache of builds... and one of the processes that iterates through builds is triggered a couple days later (there's a variety of things that can do this, but it's normal)
          4. FlowNode Storage is initialized – if it can't load one of the heads OR can't load one of the startNodes (starts of a block), then it will see the null is a problem, and switch to fallback storage (to avoid overwriting original data which might be recoverable), and try to fake up a dummy set of startNodes/heads/etc.
          5. This dummy storage will NOT show as complete, so it'll try to resume, sees that the build was marked "can't resume" and fails with the IOException observed.
          6. This failure will invoke onProgramEnd and somehow onNewHead runs with the residual FlowNode that isn't in this storage, triggering NPE – also this triggers creation of the secondary fallbock storage ("workflow-fallback/flownodeStore.xml"). Conclusion: something must be wrong internally here, but it's not clear where precisely yet.
          7. The Process from 5 triggers creation of a new FlowEndNode written to the FlowNodeStorage (observed in your workflow-fallback flowNodeStore.xml x2). But yet somehow this isn't persisted as the head for the Execution and is once again marked as incomplete and eligible to be reloaded.
          8. Because the previous does not work correctly, it can happen again (seen 2x in this cycle)

          Sam Van Oort added a comment - So, here's the things I can put together: 1. Flow completed successfully initially. Speculation: somehow there is invalid state in the list of persisted 'heads' (final FlowNodes) for the build with a null head stored, or that FlowNode wasn't persisted into storage for $reasons and cannot be loaded. 2. Because the flow ended in a way that does not evaluate as 'isComplete' or state was not persisted such that it would show as complete – this would happen if either the number of 'head' (most recent) nodes for the flow is >1 or the final 'head' FlowNode is NOT the FlowEndNode as expected (could be null) – this means the Pipeline will show as incomplete and eligible to try to resume (may be blocked if resume is disabled explicitly or fails). 3. Eventually the build is cleared from the SoftReference cache of builds... and one of the processes that iterates through builds is triggered a couple days later (there's a variety of things that can do this, but it's normal) 4. FlowNode Storage is initialized – if it can't load one of the heads OR can't load one of the startNodes (starts of a block), then it will see the null is a problem, and switch to fallback storage (to avoid overwriting original data which might be recoverable), and try to fake up a dummy set of startNodes/heads/etc. 5. This dummy storage will NOT show as complete, so it'll try to resume, sees that the build was marked "can't resume" and fails with the IOException observed. 6. This failure will invoke onProgramEnd and somehow onNewHead runs with the residual FlowNode that isn't in this storage, triggering NPE – also this triggers creation of the secondary fallbock storage ("workflow-fallback/flownodeStore.xml"). Conclusion: something must be wrong internally here, but it's not clear where precisely yet. 7. The Process from 5 triggers creation of a new FlowEndNode written to the FlowNodeStorage (observed in your workflow-fallback flowNodeStore.xml x2). But yet somehow this isn't persisted as the head for the Execution and is once again marked as incomplete and eligible to be reloaded. 8. Because the previous does not work correctly, it can happen again (seen 2x in this cycle)

          Sam Van Oort added a comment -

          mkozell No, I think we have enough data, thank you – I've got a potential fix that also adds additional diagnostic information almost ready to go. Just working through some test failures and then I can hand over a binary and PR to try out on a test instance.

          Sam Van Oort added a comment - mkozell No, I think we have enough data, thank you – I've got a potential fix that also adds additional diagnostic information almost ready to go. Just working through some test failures and then I can hand over a binary and PR to try out on a test instance.

          Sam Van Oort added a comment -

          mkozell We now have the root cause for https://issues.jenkins-ci.org/browse/JENKINS-49686 identified, which accounts for the final missing piece, the NPE itself. The root cause is described in the comments there, and the fix should resolve most of the remaining issues with the partial fix preceding it and create a comprehensive solution to both parts of the problem.

          Sam Van Oort added a comment - mkozell We now have the root cause for https://issues.jenkins-ci.org/browse/JENKINS-49686 identified, which accounts for the final missing piece, the NPE itself. The root cause is described in the comments there, and the fix should resolve most of the remaining issues with the partial fix preceding it and create a comprehensive solution to both parts of the problem.

          We're also affected by this issue and seeing similar NPE errors. I can provide more details if needed.

          Piotr Paczyński added a comment - We're also affected by this issue and seeing similar NPE errors. I can provide more details if needed.

          Sam Van Oort added a comment -

          ppaczyn mkozell Would you by any chance be set up with a non-critical / test master and interested in trying out some SNAPSHOT builds of the proposed fixes? I think what we have now is more or less ready to go and would love to see how it plays for you guys in the wild.

          Sam Van Oort added a comment - ppaczyn mkozell Would you by any chance be set up with a non-critical / test master and interested in trying out some SNAPSHOT builds of the proposed fixes? I think what we have now is more or less ready to go and would love to see how it plays for you guys in the wild.

          Mike Kozell added a comment -

          I have a dedicated test Jenkins master I can use for testing.  I will just need to know where I can get the build and how to enable it.

          Mike Kozell added a comment - I have a dedicated test Jenkins master I can use for testing.  I will just need to know where I can get the build and how to enable it.

          Sam Van Oort added a comment -

          mkozell I've attached plugin builds for workflow-cps and workflow-job plugins that should resolve the issues for you – and fixes a whole bunch of other related problems around the same logic (plus adds an optimization and some serious extra bulletproofing)-- please could you give them a try and let me know?

          workflow-cps.hpi
          workflow-job.hpi

          Or if you prefer to build them yourself, the PRs are here:
          https://github.com/jenkinsci/workflow-cps-plugin/pull/216
          https://github.com/jenkinsci/workflow-job-plugin/pull/93

          Now, there's one remaining test failure I can sometimes trigger with my homemade durability-fuzzing tool with Maximum Durability mode – I'm not clear if it's an issue with the test hardness or not, but if you do see a case where that results in zero flow nodes in the build directory, please let me know. I'm not comfortable releasing until I have it addressed + reviewed though (hopefully in the next day or two).

          Logs will show something like this (which is OK and sometimes expected in performance-optimized mode but NOT maximum durability):

          0.150 [id=1636] WARNING o.j.p.w.cps.CpsFlowExecution#initializeStorage: Tried to load head FlowNodes for execution OwnerNestedParallelDurableJob/6:NestedParallelDurableJob #6 but FlowNode was not found in storage for head id:FlowNodeId 1:25
          0.151 [id=1636] WARNING o.j.p.w.cps.CpsFlowExecution#rebuildEmptyGraph: Failed to load pipeline heads, so faking some up for execution CpsFlowExecutionOwner[NestedParallelDurableJob/6:NestedParallelDurableJob #6]
          0.161 [id=1636] WARNING o.j.p.w.cps.CpsFlowExecution#onLoad: Completed flow without FlowEndNode: CpsFlowExecutionOwner[NestedParallelDurableJob/6:NestedParallelDurableJob #6] heads:26::26:org.jenkinsci.plugins.workflow.graph.FlowStartNode[id=27]

          Thank you!

          Sam Van Oort added a comment - mkozell I've attached plugin builds for workflow-cps and workflow-job plugins that should resolve the issues for you – and fixes a whole bunch of other related problems around the same logic (plus adds an optimization and some serious extra bulletproofing)-- please could you give them a try and let me know? workflow-cps.hpi workflow-job.hpi Or if you prefer to build them yourself, the PRs are here: https://github.com/jenkinsci/workflow-cps-plugin/pull/216 https://github.com/jenkinsci/workflow-job-plugin/pull/93 Now, there's one remaining test failure I can sometimes trigger with my homemade durability-fuzzing tool with Maximum Durability mode – I'm not clear if it's an issue with the test hardness or not, but if you do see a case where that results in zero flow nodes in the build directory, please let me know. I'm not comfortable releasing until I have it addressed + reviewed though (hopefully in the next day or two). Logs will show something like this (which is OK and sometimes expected in performance-optimized mode but NOT maximum durability): 0.150 [id=1636] WARNING o.j.p.w.cps.CpsFlowExecution#initializeStorage: Tried to load head FlowNodes for execution Owner NestedParallelDurableJob/6:NestedParallelDurableJob #6 but FlowNode was not found in storage for head id:FlowNodeId 1:25 0.151 [id=1636] WARNING o.j.p.w.cps.CpsFlowExecution#rebuildEmptyGraph: Failed to load pipeline heads, so faking some up for execution CpsFlowExecution Owner[NestedParallelDurableJob/6:NestedParallelDurableJob #6] 0.161 [id=1636] WARNING o.j.p.w.cps.CpsFlowExecution#onLoad: Completed flow without FlowEndNode: CpsFlowExecution Owner[NestedParallelDurableJob/6:NestedParallelDurableJob #6] heads:26::26:org.jenkinsci.plugins.workflow.graph.FlowStartNode [id=27] Thank you!

          Sam Van Oort added a comment -

          mkozell I'm attaching updated SNAPSHOTs - these I consider finalized and release-ready, and include a resolution of the issue above.

          workflow-job.hpi
          workflow-cps.hpi

          Please can you give these a try?

          Sam Van Oort added a comment - mkozell I'm attaching updated SNAPSHOTs - these I consider finalized and release-ready, and include a resolution of the issue above. workflow-job.hpi workflow-cps.hpi Please can you give these a try?

          Code changed in jenkins
          User: Sam Van Oort
          Path:
          pom.xml
          src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java
          src/test/java/org/jenkinsci/plugins/workflow/job/WorkflowRunRestartTest.java
          src/test/java/org/jenkinsci/plugins/workflow/job/WorkflowRunTest.java
          http://jenkins-ci.org/commit/workflow-job-plugin/f0c26058f31d4f159a82a3cace52935e93f20701
          Log:
          Merge pull request #93 from svanoort/fix-resume-issues

          Fix resume issues JENKINS-49686 and JENKINS-50199 and JENKINS-50407

          Compare: https://github.com/jenkinsci/workflow-job-plugin/compare/e11cea623f61...f0c26058f31d

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Sam Van Oort Path: pom.xml src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java src/test/java/org/jenkinsci/plugins/workflow/job/WorkflowRunRestartTest.java src/test/java/org/jenkinsci/plugins/workflow/job/WorkflowRunTest.java http://jenkins-ci.org/commit/workflow-job-plugin/f0c26058f31d4f159a82a3cace52935e93f20701 Log: Merge pull request #93 from svanoort/fix-resume-issues Fix resume issues JENKINS-49686 and JENKINS-50199 and JENKINS-50407 Compare: https://github.com/jenkinsci/workflow-job-plugin/compare/e11cea623f61...f0c26058f31d

          Sam Van Oort added a comment -

          Resolved as of workflow-cps 2.47 and workflow-job 2.18

          Sam Van Oort added a comment - Resolved as of workflow-cps 2.47 and workflow-job 2.18

          Sam Van Oort added a comment -

          mkozell Thanks for providing a reproduction. To update, I'm working on this still – a bit of a tricky one, but having a clear case for what triggers this is very helpful.

          Sam Van Oort added a comment - mkozell Thanks for providing a reproduction. To update, I'm working on this still – a bit of a tricky one, but having a clear case for what triggers this is very helpful.

          Matt Gaspar added a comment -

          I'm seeing some similar behavior on a few masters running 2.89.3

           

          workflow-cps: 2.48

          workflow-job: 2.19

           

          It appears to even try and resume on a build that completed successfully previously after any restart of jenkins:

          Job Console snippet

          GitHub has been notified of this commit’s build result
          
          Finished: SUCCESS
          Resuming build at Wed Apr 18 03:45:16 MST 2018 after Jenkins restart
          Resuming build at Fri Apr 20 01:21:23 MST 2018 after Jenkins restart
          Resuming build at Fri Apr 20 01:52:35 MST 2018 after Jenkins restart
          Resuming build at Fri Apr 20 11:53:07 MST 2018 after Jenkins restart
          Resuming build at Fri Apr 20 12:13:33 MST 2018 after Jenkins restart

          jenkins.log

          Apr 20, 2018 12:18:52 PM org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6 onFailure
          WARNING: Failed to interrupt steps in Owner[JOB_NAME1:JOB_NAME #1]
          java.lang.IllegalStateException: completed or broken execution
          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:741)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:821)
          at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:694)
          at hudson.model.RunMap.retrieve(RunMap.java:225)
          at hudson.model.RunMap.retrieve(RunMap.java:57)
          at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:500)
          at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:482)
          at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:380)
          at jenkins.model.lazy.AbstractLazyLoadRunMap.search(AbstractLazyLoadRunMap.java:345)
          at jenkins.model.lazy.AbstractLazyLoadRunMap.newestBuild(AbstractLazyLoadRunMap.java:275)
          at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.<init>(LazyLoadRunMapEntrySet.java:65)
          at jenkins.model.lazy.LazyLoadRunMapEntrySet.iterator(LazyLoadRunMapEntrySet.java:63)
          at java.util.AbstractMap$2$1.<init>(AbstractMap.java:411)
          at java.util.AbstractMap$2.iterator(AbstractMap.java:410)
          at hudson.util.RunList.iterator(RunList.java:113)
          at com.google.common.collect.Iterables$15.apply(Iterables.java:1128)
          at com.google.common.collect.Iterables$15.apply(Iterables.java:1125)
          at com.google.common.collect.Iterators$8.next(Iterators.java:812)
          at com.google.common.collect.Iterators$MergingIterator.<init>(Iterators.java:1306)
          at com.google.common.collect.Iterators.mergeSorted(Iterators.java:1274)
          at com.google.common.collect.Iterables$14.iterator(Iterables.java:1113)
          at com.google.common.collect.Iterables$UnmodifiableIterable.iterator(Iterables.java:94)
          at hudson.util.RunList.iterator(RunList.java:113)
          at jenkins.widgets.RunListProgressiveRendering.compute(RunListProgressiveRendering.java:60)
          at jenkins.util.ProgressiveRendering$1.run(ProgressiveRendering.java:122)
          at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)

          Hoping this information might help. If you need more examples or anything let me know.

          Matt Gaspar added a comment - I'm seeing some similar behavior on a few masters running 2.89.3   workflow-cps: 2.48 workflow-job: 2.19   It appears to even try and resume on a build that completed successfully previously after any restart of jenkins: Job Console snippet GitHub has been notified of this commit’s build result Finished: SUCCESS Resuming build at Wed Apr 18 03:45:16 MST 2018 after Jenkins restart Resuming build at Fri Apr 20 01:21:23 MST 2018 after Jenkins restart Resuming build at Fri Apr 20 01:52:35 MST 2018 after Jenkins restart Resuming build at Fri Apr 20 11:53:07 MST 2018 after Jenkins restart Resuming build at Fri Apr 20 12:13:33 MST 2018 after Jenkins restart jenkins.log Apr 20, 2018 12:18:52 PM org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$6 onFailure WARNING: Failed to interrupt steps in Owner[JOB_NAME1:JOB_NAME #1] java.lang.IllegalStateException: completed or broken execution at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.onLoad(CpsFlowExecution.java:741) at org.jenkinsci.plugins.workflow.job.WorkflowRun.getExecution(WorkflowRun.java:821) at org.jenkinsci.plugins.workflow.job.WorkflowRun.onLoad(WorkflowRun.java:694) at hudson.model.RunMap.retrieve(RunMap.java:225) at hudson.model.RunMap.retrieve(RunMap.java:57) at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:500) at jenkins.model.lazy.AbstractLazyLoadRunMap.load(AbstractLazyLoadRunMap.java:482) at jenkins.model.lazy.AbstractLazyLoadRunMap.getByNumber(AbstractLazyLoadRunMap.java:380) at jenkins.model.lazy.AbstractLazyLoadRunMap.search(AbstractLazyLoadRunMap.java:345) at jenkins.model.lazy.AbstractLazyLoadRunMap.newestBuild(AbstractLazyLoadRunMap.java:275) at jenkins.model.lazy.LazyLoadRunMapEntrySet$1.<init>(LazyLoadRunMapEntrySet.java:65) at jenkins.model.lazy.LazyLoadRunMapEntrySet.iterator(LazyLoadRunMapEntrySet.java:63) at java.util.AbstractMap$2$1.<init>(AbstractMap.java:411) at java.util.AbstractMap$2.iterator(AbstractMap.java:410) at hudson.util.RunList.iterator(RunList.java:113) at com.google.common.collect.Iterables$15.apply(Iterables.java:1128) at com.google.common.collect.Iterables$15.apply(Iterables.java:1125) at com.google.common.collect.Iterators$8.next(Iterators.java:812) at com.google.common.collect.Iterators$MergingIterator.<init>(Iterators.java:1306) at com.google.common.collect.Iterators.mergeSorted(Iterators.java:1274) at com.google.common.collect.Iterables$14.iterator(Iterables.java:1113) at com.google.common.collect.Iterables$UnmodifiableIterable.iterator(Iterables.java:94) at hudson.util.RunList.iterator(RunList.java:113) at jenkins.widgets.RunListProgressiveRendering.compute(RunListProgressiveRendering.java:60) at jenkins.util.ProgressiveRendering$1.run(ProgressiveRendering.java:122) at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:58) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Hoping this information might help. If you need more examples or anything let me know.

          Sam Van Oort added a comment -

          gattie Thanks, do you have the build.xml and the flowNodeStorage.xml or the XML files in the workflow subdirectory of the build?

          I have something that feels like it's 80% of the way there to resolve this cluster of issues – it addresses most of the things noted by mkozell – but it has one known issue still to resolve with shutdown-time persistence. I would like to identify why YOUR build is not showing as already-complete though – all signs point to something going wrong with loading the FlowExecution information, pointing to it looking as if it is incomplete.

          Do you have any other log messages associated with this build, also?

          Sam Van Oort added a comment - gattie Thanks, do you have the build.xml and the flowNodeStorage.xml or the XML files in the workflow subdirectory of the build? I have something that feels like it's 80% of the way there to resolve this cluster of issues – it addresses most of the things noted by mkozell – but it has one known issue still to resolve with shutdown-time persistence. I would like to identify why YOUR build is not showing as already-complete though – all signs point to something going wrong with loading the FlowExecution information, pointing to it looking as if it is incomplete. Do you have any other log messages associated with this build, also?

          Sam Van Oort added a comment -

          gattie Your data shows that we did indeed fail at some point in the loading of the heads/start nodes/flownodes, and switched to the fallback directory. Under your build workflow folder, please could you upload the file 40.xml through 47.xml – or any that are present, in any case? Perhaps as a single matt-flownodes.xml?

          I also see that, paradoxically, we show persistedClean = false, for the execution (it thinks that we did not cleanly save data), and completed is false too (may be a result of preceding issues and showing as the build being incomplete).

          Thanks! I think both of those issues should be covered by fixes I've done now.

          Sam Van Oort added a comment - gattie Your data shows that we did indeed fail at some point in the loading of the heads/start nodes/flownodes, and switched to the fallback directory. Under your build workflow folder, please could you upload the file 40.xml through 47.xml – or any that are present, in any case? Perhaps as a single matt-flownodes.xml? I also see that, paradoxically, we show persistedClean = false, for the execution (it thinks that we did not cleanly save data), and completed is false too (may be a result of preceding issues and showing as the build being incomplete). Thanks! I think both of those issues should be covered by fixes I've done now.

          Matt Gaspar added a comment -

          svanoort I've attached the build.xml for one of the builds that shows SUCCESS (at least in the console) but appears to want to replay after a Jenkins restart. I also attached workflows.xml which is the concatenated files from the workflow directory of the same build.

          I haven't found any other useful logs associated with this build beyond the exception in my last comment.

          build.xml

          workflows.xml

          Matt Gaspar added a comment - svanoort I've attached the build.xml for one of the builds that shows SUCCESS (at least in the console) but appears to want to replay after a Jenkins restart. I also attached workflows.xml which is the concatenated files from the workflow directory of the same build. I haven't found any other useful logs associated with this build beyond the exception in my last comment. build.xml workflows.xml

          Matt Gaspar added a comment -

          I didn't see your comment before I posted, but yeah, the workflows.xml are the files from the workflow folder for the build. There was only 45-47 available in there.

          I did see that this particular one failed despite the console output saying SUCCESS near the bottom. Regardless, this was a build that already ran and completed and shouldn't have reran after a restart of Jenkins.

          Matt Gaspar added a comment - I didn't see your comment before I posted, but yeah, the workflows.xml are the files from the workflow folder for the build. There was only 45-47 available in there. I did see that this particular one failed despite the console output saying SUCCESS near the bottom. Regardless, this was a build that already ran and completed and shouldn't have reran after a restart of Jenkins.

          Sam Van Oort added a comment -

          gattie mkozell I appreciate your patience and assistance with diagnosing this. Please could you try the attached SNAPSHOTs and see if the issue is fully resolved? workflow-job.hpi workflow-cps.hpi

          I've done several additional rounds of testing (including unit tests based on your scenarios) and fixes to try to ensure this is as robust and the fixes are as comprehensive as possible. These changes have received initial code review, and if all looks good they will be ready for full release soon – but obviously it would be helpful to have you try them out with your situations.

          Sam Van Oort added a comment - gattie mkozell I appreciate your patience and assistance with diagnosing this. Please could you try the attached SNAPSHOTs and see if the issue is fully resolved? workflow-job.hpi workflow-cps.hpi I've done several additional rounds of testing (including unit tests based on your scenarios) and fixes to try to ensure this is as robust and the fixes are as comprehensive as possible. These changes have received initial code review, and if all looks good they will be ready for full release soon – but obviously it would be helpful to have you try them out with your situations.

          Sam Van Oort added a comment -

          mkozell Sure! The code is here, modulo an extra line or two of logging and a couple of small tweaks since then (nothing substantial):

          https://github.com/jenkinsci/workflow-job-plugin/pull/96

          https://github.com/jenkinsci/workflow-cps-plugin/pull/223

          Feel free to add code review too if you like, though it's just gotten a final pass by the other two people that do a lot of Pipeline core development.

          Sam Van Oort added a comment - mkozell Sure! The code is here, modulo an extra line or two of logging and a couple of small tweaks since then (nothing substantial): https://github.com/jenkinsci/workflow-job-plugin/pull/96 https://github.com/jenkinsci/workflow-cps-plugin/pull/223 Feel free to add code review too if you like, though it's just gotten a final pass by the other two people that do a lot of Pipeline core development.

          Sam Van Oort added a comment - - edited

          mkozell Also I think we're clear for release of this now – but I'm aiming to delay until at least tonight EST if that'll permit enough time to land it on some real-world Jenkins masters as a sanity check.

          Sam Van Oort added a comment - - edited mkozell Also I think we're clear for release of this now – but I'm aiming to delay until at least tonight EST if that'll permit enough time to land it on some real-world Jenkins masters as a sanity check.

          Sam Van Oort added a comment -

          mkozell Are you building on Windows, by any chance? I'd try to set the platform encoding to UTF-8 – that error is safe to ignore though. You can comment out the whole test by adding "@Ignore" to src/test/java/org/jenkinsci/plugins/workflow/job/CLITest.java on line 85.

          Sam Van Oort added a comment - mkozell Are you building on Windows, by any chance? I'd try to set the platform encoding to UTF-8 – that error is safe to ignore though. You can comment out the whole test by adding "@Ignore" to src/test/java/org/jenkinsci/plugins/workflow/job/CLITest.java on line 85.

          Sam Van Oort added a comment -

          Fixes released with workflow-cps 2.50 and workflow-job 2.21 – would really appreciate if you guys could install the latest and confirm the issue is resolved.

          CCmkozell gattie

          Sam Van Oort added a comment - Fixes released with workflow-cps 2.50 and workflow-job 2.21 – would really appreciate if you guys could install the latest and confirm the issue is resolved. CC mkozell gattie

          Sam Van Oort added a comment -

          mkozell That is... the exact opposite of what I would expect. Not sure what's going on there, since the ci.jenkins.io builder does both Windows and Linux builds and they succeed. Weird.

          Sam Van Oort added a comment - mkozell That is... the exact opposite of what I would expect. Not sure what's going on there, since the ci.jenkins.io builder does both Windows and Linux builds and they succeed. Weird.

          Christian V added a comment -

          I'm also seeing this:

          org.jenkinsci.remoting.util.ExecutorServiceUtils$FatalRejectedExecutionException: Cannot execute the command java.util.concurrent.FutureTask@134626b. The executor service is shutting down
          
          at hudson.remoting.SingleLaneExecutorService.execute(SingleLaneExecutorService.java:108)
          
          at java.util.concurrent.AbstractExecutorService.submit(Unknown Source)
          
          at com.google.common.util.concurrent.ForwardingExecutorService.submit(ForwardingExecutorService.java:110)
          
          at jenkins.util.InterceptingExecutorService.submit(InterceptingExecutorService.java:49)
          
          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4.onSuccess(CpsFlowExecution.java:903)
          
          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4.onSuccess(CpsFlowExecution.java:899)
          
          at org.jenkinsci.plugins.workflow.support.concurrent.Futures$1.run(Futures.java:150)
          
          at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253)
          
          at com.google.common.util.concurrent.ExecutionList$RunnableExecutorPair.execute(ExecutionList.java:149)
          
          at com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:105)
          
          at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:155)
          
          at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:160)
          
          at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:90)
          
          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.runInCpsVmThread(CpsFlowExecution.java:899)
          
          at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getCurrentExecutions(CpsFlowExecution.java:977)
          
          at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:180)
          
          at jenkins.model.Jenkins.<init>(Jenkins.java:972)
          
          at hudson.model.Hudson.<init>(Hudson.java:85)
          
          at hudson.model.Hudson.<init>(Hudson.java:81)
          
          at hudson.WebAppMain$3.run(WebAppMain.java:233)

           

          I have updated to the latest Jenkins version and all my Pipeline plugins are up to date. This happened before the update, after the plugin updates, and after updating Jenkins. It's always the same exception.

          Christian V added a comment - I'm also seeing this: org.jenkinsci.remoting.util.ExecutorServiceUtils$FatalRejectedExecutionException: Cannot execute the command java.util.concurrent.FutureTask@134626b. The executor service is shutting down at hudson.remoting.SingleLaneExecutorService.execute(SingleLaneExecutorService.java:108) at java.util.concurrent.AbstractExecutorService.submit(Unknown Source) at com.google.common.util.concurrent.ForwardingExecutorService.submit(ForwardingExecutorService.java:110) at jenkins.util.InterceptingExecutorService.submit(InterceptingExecutorService.java:49) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4.onSuccess(CpsFlowExecution.java:903) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution$4.onSuccess(CpsFlowExecution.java:899) at org.jenkinsci.plugins.workflow.support.concurrent.Futures$1.run(Futures.java:150) at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:253) at com.google.common.util.concurrent.ExecutionList$RunnableExecutorPair.execute(ExecutionList.java:149) at com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:105) at com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:155) at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:160) at org.jenkinsci.plugins.workflow.support.concurrent.Futures.addCallback(Futures.java:90) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.runInCpsVmThread(CpsFlowExecution.java:899) at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.getCurrentExecutions(CpsFlowExecution.java:977) at org.jenkinsci.plugins.workflow.flow.FlowExecutionList$ItemListenerImpl.onLoaded(FlowExecutionList.java:180) at jenkins.model.Jenkins.<init>(Jenkins.java:972) at hudson.model.Hudson.<init>(Hudson.java:85) at hudson.model.Hudson.<init>(Hudson.java:81) at hudson.WebAppMain$3.run(WebAppMain.java:233)   I have updated to the latest Jenkins version and all my Pipeline plugins are up to date. This happened before the update, after the plugin updates, and after updating Jenkins. It's always the same exception.

          Sam Van Oort added a comment -

          Adding a note before I forget that this likely relates to JENKINS-45571.

          Based on some test scenarios that are generating similar symptoms, I suspect both are potentially related to the the CpsContext.isReady call hanging indefinitely – which seems to be a reproducible but obscure threading bug which I have yet to solve.

          Symptoms if that's the same culprit:

          1. Build never gets marked as complete, even though the execution may - it appears the WorkflowRun never gets its "finish" method called and may not have a result set as a result
          2. Pipeline still shows as having a running OneOffExecutor running its WorkflowJob as a Flyweight task
          3. I generally see the CpsContext.isReady hang running, and this is triggered when something fails in early stages of loading & resuming the Pipeline program for an incomplete Pipeline after a restart
          4. Investigation showed that the AsynchronousExecution created by WorkflowRun#sleep never finished (probably due to the CpsStepContext hang).

          Sam Van Oort added a comment - Adding a note before I forget that this likely relates to JENKINS-45571 . Based on some test scenarios that are generating similar symptoms, I suspect both are potentially related to the the CpsContext.isReady call hanging indefinitely – which seems to be a reproducible but obscure threading bug which I have yet to solve. Symptoms if that's the same culprit: 1. Build never gets marked as complete, even though the execution may - it appears the WorkflowRun never gets its "finish" method called and may not have a result set as a result 2. Pipeline still shows as having a running OneOffExecutor running its WorkflowJob as a Flyweight task 3. I generally see the CpsContext.isReady hang running, and this is triggered when something fails in early stages of loading & resuming the Pipeline program for an incomplete Pipeline after a restart 4. Investigation showed that the AsynchronousExecution created by WorkflowRun#sleep never finished (probably due to the CpsStepContext hang).

          John Arnold added a comment -

          I'm definitely still seeing this "stuck on resume" problem for completed builds AND aborted builds, after a Jenkins reboot.  I have Jenkins set for max perf, low durability mode, to disable this Resume functionality, and it's still trying to resume.

          I can't Stop the jobs that are stuck on resume, I have to do a hard-kill via the /kill uri.  Even trying to kill .finish() them with Groovy script doesn't seem to work reliably.

          John Arnold added a comment - I'm definitely still seeing this "stuck on resume" problem for completed builds AND aborted builds, after a Jenkins reboot.  I have Jenkins set for max perf, low durability mode, to disable this Resume functionality, and it's still trying to resume. I can't Stop the jobs that are stuck on resume, I have to do a hard-kill via the /kill uri.  Even trying to kill .finish() them with Groovy script doesn't seem to work reliably.

          John Arnold added a comment - - edited

          Example job log:
          https://pastebin.com/LbnRRTw4

          Jenkins 2.135

          John Arnold added a comment - - edited Example job log: https://pastebin.com/LbnRRTw4 Jenkins 2.135

          Devin Nusbaum added a comment - - edited

          johnar Are you able to upload the build.xml for one of the stuck builds so that we can see all of its persisted state? (Please redact any sensitive information if you do upload it.)

          Also, can you please confirm what versions of the Pipeline Groovy Plugin and the Pipeline Job Plugin plugin you are using? Thanks!

          Devin Nusbaum added a comment - - edited johnar Are you able to upload the build.xml for one of the stuck builds so that we can see all of its persisted state? (Please redact any sensitive information if you do upload it.) Also, can you please confirm what versions of the Pipeline Groovy Plugin and the Pipeline Job Plugin plugin you are using? Thanks!

          John Arnold added a comment - - edited

          dnusbaum

          Pipeline Groovy: 2.54
          Pipeline Job: 2.24

          build.xml (redacted internal strings)
          https://pastebin.com/rqZ7EwcG

          John Arnold added a comment - - edited dnusbaum Pipeline Groovy: 2.54 Pipeline Job: 2.24 build.xml (redacted internal strings) https://pastebin.com/rqZ7EwcG

          Sam Van Oort added a comment -

          dnusbaum I took a quick pass through the build.xml and it fits my comment in https://issues.jenkins-ci.org/browse/JENKINS-50199?focusedCommentId=345572&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-345572 to the T AFAICT.

          Please, would you be able to take a peek? It should tie to the issues with CpsContext.isReady and the reproduction case in https://github.com/svanoort/workflow-cps-plugin/tree/reproduce-hung-pipelines – and at this point a fresh perspective might be helpful.

          Sam Van Oort added a comment - dnusbaum I took a quick pass through the build.xml and it fits my comment in https://issues.jenkins-ci.org/browse/JENKINS-50199?focusedCommentId=345572&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-345572 to the T AFAICT. Please, would you be able to take a peek? It should tie to the issues with CpsContext.isReady and the reproduction case in https://github.com/svanoort/workflow-cps-plugin/tree/reproduce-hung-pipelines – and at this point a fresh perspective might be helpful.

          Devin Nusbaum added a comment -

          svanoort I have been looking at the issue. Like I commented here I don't think that case is actually a reproduction, because it does complete if you increase the timeout, although I might be misunderstanding the test case. I have been trying to find a way to reproduce the problem, but have been unsuccessful so far.

          Devin Nusbaum added a comment - svanoort I have been looking at the issue. Like I commented here I don't think that case is actually a reproduction, because it does complete if you increase the timeout, although I might be misunderstanding the test case. I have been trying to find a way to reproduce the problem, but have been unsuccessful so far.

          John Arnold added a comment -

          dnusbaum svanoort  Is there any additional data I can gather for you. I have lots of repros of this problem.

          John Arnold added a comment - dnusbaum svanoort   Is there any additional data I can gather for you. I have lots of repros of  this problem.

          Devin Nusbaum added a comment -

          johnar Could you upload the whole build folder for one of the zombie builds? I'm most interested in the workflow directory adjacent to build.xml to see if the serialized flow nodes tell us anything interesting, but any exceptions/warnings in the various log files would be interesting as well.

          Devin Nusbaum added a comment - johnar Could you upload the whole build folder for one of the zombie builds? I'm most interested in the workflow directory adjacent to build.xml to see if the serialized flow nodes tell us anything interesting, but any exceptions/warnings in the various log files would be interesting as well.

          Devin Nusbaum added a comment -

          svanoort Any thoughts on adding something like the following patch as a temporary fix in workflow-job to prevent these pipelines from resuming while we continue investigating?

          diff --git a/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java b/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java
          index 38e6ac1..70d9474 100644
          --- a/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java
          +++ b/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java
          @@ -752,2 +752,8 @@ public final class WorkflowRun extends Run<WorkflowJob,WorkflowRun> implements F
                                   if (!completed == Boolean.TRUE) {
          +                            if (execution.isComplete()) {
          +                                LOGGER.log(Level.WARNING, "JENKINS-50199: Run completion inconsistent with execution, defaulting to failure for {0}", this);
          +                                setResult(Result.FAILURE);
          +                                completed = Boolean.TRUE;
          +                                needsToPersist = true;
          +                            } else {
                                       // we've been restarted while we were running. let's get the execution going again.
          @@ -757,2 +763,3 @@ public final class WorkflowRun extends Run<WorkflowJob,WorkflowRun> implements F
                                       Timer.get().submit(() -> Queue.getInstance().schedule(new AfterRestartTask(WorkflowRun.this), 0)); // JENKINS-31614
          +                            }
                                   }
          

          Devin Nusbaum added a comment - svanoort Any thoughts on adding something like the following patch as a temporary fix in workflow-job to prevent these pipelines from resuming while we continue investigating? diff --git a/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java b/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java index 38e6ac1..70d9474 100644 --- a/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java +++ b/src/main/java/org/jenkinsci/plugins/workflow/job/WorkflowRun.java @@ -752,2 +752,8 @@ public final class WorkflowRun extends Run<WorkflowJob,WorkflowRun> implements F if (!completed == Boolean.TRUE) { + if (execution.isComplete()) { + LOGGER.log(Level.WARNING, "JENKINS-50199: Run completion inconsistent with execution, defaulting to failure for {0}", this); + setResult(Result.FAILURE); + completed = Boolean.TRUE; + needsToPersist = true; + } else { // we've been restarted while we were running. let's get the execution going again. @@ -757,2 +763,3 @@ public final class WorkflowRun extends Run<WorkflowJob,WorkflowRun> implements F Timer.get().submit(() -> Queue.getInstance().schedule(new AfterRestartTask(WorkflowRun.this), 0)); // JENKINS-31614 + } }

          Sam Van Oort added a comment -

          Observations from digging in together:

          1. We can reproduce the restarted builds simply by setting the FlowExecution to completed for an incomplete build and dirty-restarting.
          2. On restart that will trigger the build to be able to resume even when it shouldn't – we can avoid this by marking the build completed upon loading if the execution is done
          3. The only place where the message about the build being done can be printed is WorkflowRun line 815 where it gets the StreamBuildListener and calls #finished on it (that prints the build completed message)
          4. The build record has non-null logsToCopy - since that's nulled a few lines down from where the "build completed" message prints, just before saving the build, something prevented saving the build after that mutation - dnusbaum speculates that it might be due to the BulkChange logic in WorkflowRun.save, which early-exits if there's a BulkChange containing the WorkflowRun – if the BulkChange aborts, the WorkflowRun's save operation will be a no-op (and it may not ever be saved).

          I have a small patch which closes the loophole around the completed execution but incomplete build (and passes a basic testcase), but point 4 still needs some thought.

          Sam Van Oort added a comment - Observations from digging in together: 1. We can reproduce the restarted builds simply by setting the FlowExecution to completed for an incomplete build and dirty-restarting. 2. On restart that will trigger the build to be able to resume even when it shouldn't – we can avoid this by marking the build completed upon loading if the execution is done 3. The only place where the message about the build being done can be printed is WorkflowRun line 815 where it gets the StreamBuildListener and calls #finished on it (that prints the build completed message) 4. The build record has non-null logsToCopy - since that's nulled a few lines down from where the "build completed" message prints, just before saving the build, something prevented saving the build after that mutation - dnusbaum speculates that it might be due to the BulkChange logic in WorkflowRun.save, which early-exits if there's a BulkChange containing the WorkflowRun – if the BulkChange aborts, the WorkflowRun's save operation will be a no-op (and it may not ever be saved). I have a small patch which closes the loophole around the completed execution but incomplete build (and passes a basic testcase), but point 4 still needs some thought.

          Roy Arnon added a comment - - edited

          svanoort we have also recently encountered this issue, or something similar as in our case the resuming jobs completed with SUCCESS.

          I commented here

          But it seems this issue is more relevant to this ticket.

          In our case the resuming build was marked as SUCCESS which marked at as success in Bitbucket (with bitbucket-branch-source plugin). The build itself never completed and in fact had a failing test. This allowed one of our developers to merge a failing build into release unfortunately, 

          If there is anything you need from our jenkins instance I am happy to provide it.

          Roy Arnon added a comment - - edited svanoort we have also recently encountered this issue, or something similar as in our case the resuming jobs completed with SUCCESS. I commented here But it seems this issue is more relevant to this ticket. In our case the resuming build was marked as SUCCESS which marked at as success in Bitbucket (with bitbucket-branch-source plugin). The build itself never completed and in fact had a failing test. This allowed one of our developers to merge a failing build into release unfortunately,  If there is anything you need from our jenkins instance I am happy to provide it.

          John Arnold added a comment -

          John Arnold added a comment - dnusbaum See  https://issues.jenkins-ci.org/secure/attachment/43981/867.tar.gz   for a build folder.

          Devin Nusbaum added a comment -

          I just released version 2.25 of the Pipeline Job plugin which should fix the issue where completed builds are resuming after restarting Jenkins. If anyone is still seeing the issue after upgrading the plugin, please comment with any additional information you can provide, such as the build folder of the build in question.

          Note that there are other issues that may appear similar but seem to have distinct causes. For anyone newly commenting on this issue, if you are seeing stuck executors without ever restarting Jenkins, please check if JENKINS-45571JENKINS-53223, or JENKINS-51568 are closer to your situation and if so, comment there instead. Thanks!

          Devin Nusbaum added a comment - I just released version 2.25 of the Pipeline Job plugin which should fix the issue where completed builds are resuming after restarting Jenkins. If anyone is still seeing the issue after upgrading the plugin, please comment with any additional information you can provide, such as the build folder of the build in question. Note that there are other issues that may appear similar but seem to have distinct causes. For anyone newly commenting on this issue, if you are seeing stuck executors without ever restarting Jenkins, please check if  JENKINS-45571 ,  JENKINS-53223 , or  JENKINS-51568 are closer to your situation and if so, comment there instead. Thanks!

          Keith Ball added a comment -

          I just tried to update the plugin but received a FileNotFoundException: http://archives.jenkins-ci.org/plugins/workflow-job/2.25/workflow-job.hpi

          Failed to download from http://updates.jenkins-ci.org/download/plugins/workflow-job/2.25/workflow-job.hpi (redirected to: http://archives.jenkins-ci.org/plugins/workflow-job/2.25/workflow-job.hpi)

          Did someone forget to publish the latest version?

          Keith Ball added a comment - I just tried to update the plugin but received a FileNotFoundException: http://archives.jenkins-ci.org/plugins/workflow-job/2.25/workflow-job.hpi Failed to download from http://updates.jenkins-ci.org/download/plugins/workflow-job/2.25/workflow-job.hpi (redirected to: http://archives.jenkins-ci.org/plugins/workflow-job/2.25/workflow-job.hpi) Did someone forget to publish the latest version?

          Keith Ball added a comment -

          Forget it.  it's there now

          Keith Ball added a comment - Forget it.  it's there now

          Devin Nusbaum added a comment -

          keith_ball There is often a slight time delay between when a new release is visible on the update center and when it is available on all mirrors, my guess is that was what happened for you.

          Devin Nusbaum added a comment - keith_ball There is often a slight time delay between when a new release is visible on the update center and when it is available on all mirrors, my guess is that was what happened for you.

            dnusbaum Devin Nusbaum
            mkozell Mike Kozell
            Votes:
            5 Vote for this issue
            Watchers:
            17 Start watching this issue

              Created:
              Updated:
              Resolved: