
NPE in CPS VM thread at WorkflowRun$GraphL.onNewHead

    • Pipeline - April 2018

      I have 2 jobs stuck in the build queue; they are apparently waiting for 2 other jobs to complete, but the nodes' executors are free. I don't know if these NPEs can cause this behavior, but they don't look right anyway.

      Feb 21, 2018 8:38:55 PM org.jenkinsci.plugins.workflow.cps.CpsFlowExecution onLoad
      WARNING: Pipeline state not properly persisted, cannot resume job/ice/job/3.7/221/
      Feb 21, 2018 8:38:55 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
      WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[Owner[ice/3.7/221:ice/3.7 #221]]
      java.lang.NullPointerException
      at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:997)
      at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1368)
      at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:412)
      at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
      at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      
      Feb 21, 2018 8:38:55 PM org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService reportProblem
      WARNING: Unexpected exception in CPS VM thread: CpsFlowExecution[Owner[ice/3.7/221:ice/3.7 #221]]
      java.lang.NullPointerException
      at org.jenkinsci.plugins.workflow.job.WorkflowRun$GraphL.onNewHead(WorkflowRun.java:997)
      at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.notifyListeners(CpsFlowExecution.java:1368)
      at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$3.run(CpsThreadGroup.java:412)
      at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$1.run(CpsVmExecutorService.java:35)
      at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
      
      

        1. workflow-job.hpi
          110 kB
        2. workflow-cps.hpi
          540 kB
        3. jenkins.log.gz
          257 kB
        4. flowNodeStore.xml.gz
          53 kB
        5. build.xml.gz
          3 kB
        6. jenkins.log.gz
          257 kB
        7. flowNodeStore.xml.gz
          53 kB
        8. build.xml.gz
          3 kB

          [JENKINS-49686] NPE in CPS VM thread at WorkflowRun$GraphL.onNewHead

          bentoi created issue -

          Andrew Bayer added a comment -

          Hmm, I'm pretty sure this would be blocking the builds, yes. Not sure how we could end up with a null FlowNode or a null logsToCopy in GraphL#onNewHead, though. svanoort - any thoughts?
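
          For context on where the NPE lands: GraphListener implementations receive each new FlowNode as it is appended to the flow graph. A minimal defensive listener along these lines (a rough sketch only, not the actual WorkflowRun$GraphL code) shows the shape of the callback and the kind of null guard the stack trace suggests is missing somewhere in the run state:

            import org.jenkinsci.plugins.workflow.flow.GraphListener;
            import org.jenkinsci.plugins.workflow.graph.FlowEndNode;
            import org.jenkinsci.plugins.workflow.graph.FlowNode;

            // Rough sketch of a defensive GraphListener, NOT the actual WorkflowRun$GraphL code:
            // it guards the incoming node instead of assuming the run state is fully initialized.
            class DefensiveGraphListener implements GraphListener.Synchronous {
                @Override
                public void onNewHead(FlowNode node) {
                    if (node == null) {
                        // Should never happen, but this is the kind of condition the NPE points at.
                        return;
                    }
                    if (node instanceof FlowEndNode) {
                        // End of the flow: a real listener would finalize the run here.
                    }
                    // ... copy step logs / update the run for the new node ...
                }
            }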

          Sam Van Oort made changes -
          Assignee New: Sam Van Oort [ svanoort ]

          Sam Van Oort added a comment -

          bentoi This should be possible to solve, but I'm going to need a fair bit of info to diagnose this one, I'm afraid, because the cause is complex:

          1. Which versions of Workflow-CPS and Workflow-Job plugins are you running?
          2. What Durability Setting is your pipeline running in? (Should be at the top of your build log, something like MAX_SURVIVABILITY or PERFORMANCE_OPTIMIZED).
          3. Did you recently restart the master while the pipelines were running, and if so, by which method did you restart it? (visiting the /restart url, service jenkins restart, pkill java, etc).
          4. Please can you attach a copy of the Jenkins.log covering the time the build started until these errors appeared?
          5. Please can you attach the build.xml for the build?
          6. In the build's workflow directory, there's either a flownodeStore.xml file or a bunch of tiny XML files. Please can you upload the entry with the highest ID (they'll be numbered 2, 3, 4, etc.) – if it's the flownodeStore.xml, it's fine to just GZIP it and attach it.


          Sam Van Oort added a comment -

          Analysis to continue once we have more info:

          I've fixed a bunch of cases that result in similar issues – the root cause is a mismatch between the FlowNode storage and either the Pipeline program or the build.xml. This usually happens when the master restarts and for whatever reason is unable to load the FlowNode matching a FlowHead – this is why I ask about the restart and restart method, the build.xml, and the last Pipeline FlowNode from the storage.

          A "graceful shutdown" (https://jenkins.io/doc/book/pipeline/scaling-pipeline/#what-am-i-giving-up-with-this-durability-setting-trade-off) should force persistence of any unsaved data, but it's possible for there to be subtle ordering issues if somehow execution continued after that (it shouldn't, but there could be bugs).

          I'm looking for the Jenkins log to see if there were other issues associated with this – I'd expect this has potentially generated other stack traces, especially around restart. An error like 'List of Flow Heads Unset for...' would point to the build.xml (containing the CpsFlowExecution info) being behind the program state (which might need another persistence call somewhere).
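
          To make the "unable to load the FlowNode matching a FlowHead" failure concrete: on resume, every persisted head id has to be resolvable in the FlowNode storage. A simplified sketch of that check (assuming the FlowNodeStorage API from workflow-support; not the actual CpsFlowExecution code, and `headNodeIds` is just a stand-in for the ids recorded in build.xml):

            import java.io.IOException;
            import java.util.List;

            import org.jenkinsci.plugins.workflow.graph.FlowNode;
            import org.jenkinsci.plugins.workflow.support.storage.FlowNodeStorage;

            // Simplified sketch of the resume-time consistency check described above;
            // not the actual plugin code. 'headNodeIds' stands in for the head FlowNode
            // ids persisted in build.xml.
            class ResumeCheckSketch {
                static boolean headsAreLoadable(FlowNodeStorage storage, List<String> headNodeIds) throws IOException {
                    for (String id : headNodeIds) {
                        FlowNode head = storage.getNode(id); // reads the flowNodeStore / per-node XML files
                        if (head == null) {
                            // build.xml points at a node the storage never persisted:
                            // "Pipeline state not properly persisted, cannot resume ..."
                            return false;
                        }
                    }
                    return true;
                }
            }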


          Sam Van Oort added a comment -

          This line indicates we did detect incomplete/inconsistent persistence of data – the build should have failed promptly and cleanly at this point:

          > WARNING: Pipeline state not properly persisted, cannot resume job/ice/job/3.7/221/

          Somehow the CpsVmExecutorService is still continuing to run threads for the Pipeline even after it should have failed cleanly. Possible candidate: `FlowExecutionList.ItemListenerImpl.onLoaded()`
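
          For illustration, the resume pattern being suspected here looks roughly like the following (a hypothetical sketch, not the real FlowExecutionList.ItemListenerImpl; `registeredExecutions` is an invented stand-in for the persisted list of running Pipelines):

            import hudson.model.listeners.ItemListener;
            import org.jenkinsci.plugins.workflow.flow.FlowExecution;

            // Hypothetical sketch, NOT the real FlowExecutionList.ItemListenerImpl:
            // after Jenkins finishes loading, every registered execution gets kicked,
            // so a Pipeline that failed its persistence check could still end up with
            // work scheduled on the CPS VM thread.
            class ResumeAllSketch extends ItemListener {
                private final Iterable<FlowExecution> registeredExecutions; // invented stand-in

                ResumeAllSketch(Iterable<FlowExecution> registeredExecutions) {
                    this.registeredExecutions = registeredExecutions;
                }

                @Override
                public void onLoaded() {
                    for (FlowExecution e : registeredExecutions) {
                        if (!e.isComplete()) {
                            // A real implementation would resume the execution here; if the
                            // completion check is wrong, a dead build gets resumed anyway and
                            // its listeners NPE as seen in the report.
                        }
                    }
                }
            }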

          bentoi made changes -
          Attachment New: jenkins.log.gz [ 41700 ]
          Attachment New: flowNodeStore.xml.gz [ 41701 ]
          Attachment New: build.xml.gz [ 41702 ]

          bentoi added a comment -

          Thanks for your response; here are my answers to your questions.

          We use:

          • org.jenkins-ci.plugins.workflow:workflow-job:2.17
          • org.jenkins-ci.plugins.workflow:workflow-cps:2.45

          We started using PERFORMANCE_OPTIMIZED recently to see if it helped work around JENKINS-49646. That's also when we started seeing this problem, which now occurs quite frequently.

          We usually don't restart Jenkins while jobs are running. We either use "systemctl restart jenkins" or a "plugin update + restart". This problem causes our job queue to hang frequently because some invalid jobs appear to consume executors. When this occurs, we restart Jenkins, which usually allows the pending jobs to start, but the problem eventually shows up again, requiring another restart. The NPE does indeed appear to occur after the restart.

          I've attached build.xml / flownodeStore.xml  for the latest NPE from jenkins.log (build 233).

          bentoi made changes -
          Attachment New: build.xml.gz [ 41704 ]
          Attachment New: flowNodeStore.xml.gz [ 41705 ]
          Attachment New: jenkins.log.gz [ 41706 ]

          Sam Van Oort added a comment -

          So here's what I've got from digging (mostly to save my notes and provide transparency in case someone else sees something):

          The build nominally completed with a failure (tests failed, and you ran an Error step which was not caught) in FlowEndNode ID 1692.
          But somehow the build.xml shows:

          <iota>1696</iota>
          <head>1693:1696</head> (flowHead id : flowNodeId for that head)

          And there are no flowNodes after 1692 in the storage.
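
          Spelling out the notation above: each persisted head is a "flowHeadId:flowNodeId" pair, so the inconsistency amounts to something like this tiny sketch (a hypothetical helper, using an int comparison for readability even though FlowNode ids are strings in the API):

            // Tiny sketch of the mismatch described above, assuming the
            // "flowHeadId:flowNodeId" format of the persisted <head> field.
            class HeadMismatchSketch {
                /** True when the persisted head points past anything actually stored. */
                static boolean headIsDangling(String persistedHead, int lastStoredNodeId) {
                    String[] parts = persistedHead.split(":", 2); // e.g. "1693:1696"
                    int headNodeId = Integer.parseInt(parts[1]);  // 1696
                    return headNodeId > lastStoredNodeId;         // 1696 > 1692 -> dangling head
                }
            }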

          So what I'm trying to figure out is:

          • How did the iota, head, and finalFlowNode ID get bumped above 1692?
          • Does this occur with simpler builds? Or does it only happen when there's some specific structure including parallels (as with this one)?
          • Why are we even trying to resume the build? We're done, kaput, finished, finito! The FlowEndNode has been written, and the build ended with a Failure result (as it should).

          This points to an issue specifically with the closedown of the build, or with the checks on whether we need to resume the build (i.e. a finished build showing as incomplete).

          The second part first: there's a 'done' flag on the CpsFlowExecution which, if true, marks the execution as complete. Oddly, this is absent in the persisted build record, and that appears to be normal somehow (going back to v2.0 of this plugin, it is not persisted by the ConverterImpl that does the marshal/unmarshal).

          Failing that, we look for having just a single head FlowNode (check!) and that head being a FlowEndNode (should have been the case?)
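
          Put together, the completion test being described is roughly the following (a simplified sketch, not the actual CpsFlowExecution code):

            import java.util.List;

            import org.jenkinsci.plugins.workflow.graph.FlowEndNode;
            import org.jenkinsci.plugins.workflow.graph.FlowNode;

            // Simplified sketch of the completion test described above, not the actual
            // CpsFlowExecution code: 'done' is not written out by the ConverterImpl, so
            // after a restart completion has to be inferred from the heads alone.
            class CompletionCheckSketch {
                static boolean looksComplete(boolean doneFlag /* lost across restarts */, List<FlowNode> heads) {
                    if (doneFlag) {
                        return true;
                    }
                    // Fallback after a restart: exactly one head, and that head is the FlowEndNode.
                    // In this build the persisted head is node 1696, not the FlowEndNode (1692),
                    // so the finished build looks resumable.
                    return heads.size() == 1 && heads.get(0) instanceof FlowEndNode;
                }
            }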

          Now for the end-of-build logic: the execution sets itself to 'done' (which is not persisted, since that logic was not changed), sets the stored heads to just the first head (which should be the FlowEndNode), and then flushes the FlowNode storage. However, the WorkflowRun saves itself when done.

          Possible solutions:

          1. Explicitly save 'done' in the flowExecution (see the sketch after this list).
          2. Figure out why the FlowEndNode isn't ending up as the final head for the CpsFlowExecution, thus signalling we're done done done.
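
          For solution 1, the change would be on the order of writing and reading one extra element in the XStream converter, along these lines (a hypothetical illustration, not the real CpsFlowExecution.ConverterImpl):

            import com.thoughtworks.xstream.io.HierarchicalStreamReader;
            import com.thoughtworks.xstream.io.HierarchicalStreamWriter;

            // Hypothetical illustration of option 1, not the real ConverterImpl: persist the
            // 'done' flag next to <iota> and <head> so a finished execution is still
            // recognized as finished after a restart.
            class DoneFlagSketch {
                static void writeDone(HierarchicalStreamWriter w, boolean done) {
                    w.startNode("done");                // e.g. <done>true</done>
                    w.setValue(Boolean.toString(done));
                    w.endNode();
                }

                static boolean readDone(HierarchicalStreamReader r) {
                    boolean done = false;
                    // Scans the children of the current element for <done>;
                    // absent (old records) -> false, matching today's behavior.
                    while (r.hasMoreChildren()) {
                        r.moveDown();
                        if ("done".equals(r.getNodeName())) {
                            done = Boolean.parseBoolean(r.getValue());
                        }
                        r.moveUp();
                    }
                    return done;
                }
            }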


            Assignee: Sam Van Oort (svanoort)
            Reporter: bentoi
            Votes: 1
            Watchers: 7

              Created:
              Updated:
              Resolved: