[JENKINS-71692] Pipeline sometimes leaks Execution on heavyweight executors

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Component/s: workflow-cps-plugin
    • Labels: None
    • Environment: core:2.401
      workflow-durable-task:1247.v7f9dfea_b_4fd0
    • Released As: workflow-cps 3785.vee73da_b_9544e

      I have seen instances where Pipeline asynchronous executions still exist on heavyweight executors even though the build has completed. It does not matter whether the build ended in SUCCESS or FAILURE. Intermittently, the execution is left orphaned and keeps holding a heavyweight executor.

      When running a Groovy script, dumpExecutors.groovy, the output shows the following:

      LABEL
       Executor #0(0)
        OneOff? false
        Active? true
        Likely Stuck? false
        Progress: 99
        Interrupted? false
        Busy? true
        Elasped Time? 14752068
        Current Work Unit? hudson.model.queue.WorkUnit@1a477ff1[work=part of FOLDER » JOB #11 release 3ce8636615 O20010]
        Causes Of Interruption? []
        Idle Start Milliseconds? 1689274902515
        Asynchronous Execution: org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask$PlaceholderExecutable$1
        Executable: PlaceholderExecutable:ExecutorStepExecution.PlaceholderTask{runId=FOLDER/JOB#11,label=LABEL,context=CpsStepContext[4:node]:Owner[FOLDER/JOB/11:FOLDER/JOB #11],cookie=e5c90ade-8077-4668-8d86-c1a612d0a8b7,auth=null}
         Executable: PlaceholderExecutable:ExecutorStepExecution.PlaceholderTask{runId=FOLDER/JOB#11,label=LABEL,context=CpsStepContext[4:node]:Owner[FOLDER/JOB/11:FOLDER/JOB #11],cookie=e5c90ade-8077-4668-8d86-c1a612d0a8b7,auth=null}
         Executable (class): class org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask$PlaceholderExecutable
         Executable (url): job/FOLDER/job/JOB/11/
         Executable (parent): ExecutorStepExecution.PlaceholderTask{runId=FOLDER/JOB#11,label=LABEL,context=CpsStepContext[4:node]:Owner[FOLDER/JOB/11:FOLDER/JOB #11],cookie=e5c90ade-8077-4668-8d86-c1a612d0a8b7,auth=null}
         Executable (parent class): class org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution$PlaceholderTask
         Executable (parent ownerTask): org.jenkinsci.plugins.workflow.job.WorkflowJob@60e140cb[FOLDER/JOB]
         Executable (parent url): job/FOLDER/job/JOB/11/
         Executable (parent runId): FOLDER/JOB#11
         Run Id: FOLDER/JOB #11
         Run URL: job/FOLDER/job/JOB/11/
         Run Result: FAILURE
      

      I was recently able to reproduce the same result (an orphaned execution that holds a heavyweight executor) with a pipeline like the following:

      node {
          stage('Main') {
              outerloop: {
                  for (int i = 0; i < 2; i++) {
                      if (i > 0) {
                          echo "${i}"
                      } else {
                          break outerloop
                      }
                  }
              }
          }
      }
      

      It is unclear whether this captures the main cause of the problem or just one of several.

      In that particular case, the exception shown in Jenkins logs is:

      2023-08-29 11:07:53.285+0000 [id=127098]	WARNING	o.j.p.w.cps.CpsVmExecutorService#reportProblem: Unexpected exception in CPS VM thread: CpsFlowExecution[Owner[my-job/1:my-job #1]]
      java.lang.IllegalStateException: unexpected break statement
      	at com.cloudbees.groovy.cps.impl.CallEnv.getBreakAddress(CallEnv.java:101)
      	at com.cloudbees.groovy.cps.impl.ProxyEnv.getBreakAddress(ProxyEnv.java:52)
      	at com.cloudbees.groovy.cps.impl.ProxyEnv.getBreakAddress(ProxyEnv.java:52)
      	at com.cloudbees.groovy.cps.impl.ProxyEnv.getBreakAddress(ProxyEnv.java:52)
      	at com.cloudbees.groovy.cps.impl.ProxyEnv.getBreakAddress(ProxyEnv.java:52)
      	at com.cloudbees.groovy.cps.impl.LoopBlockScopeEnv.getBreakAddress(LoopBlockScopeEnv.java:29)
      	at com.cloudbees.groovy.cps.impl.ProxyEnv.getBreakAddress(ProxyEnv.java:52)
      	at com.cloudbees.groovy.cps.impl.ProxyEnv.getBreakAddress(ProxyEnv.java:52)
      	at com.cloudbees.groovy.cps.impl.ProxyEnv.getBreakAddress(ProxyEnv.java:52)
      	at com.cloudbees.groovy.cps.impl.BreakBlock.eval(BreakBlock.java:21)
      	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
      	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:152)
      	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:146)
      	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:136)
      	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:275)
      	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:146)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
      	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
      	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:187)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:422)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:330)
      	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:294)
      	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
      	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
      	at jenkins.util.ErrorLoggingExecutorService.lambda$wrap$0(ErrorLoggingExecutorService.java:51)
      	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      	at java.base/java.lang.Thread.run(Thread.java:829)
      


          Jesse Glick added a comment -

          (reproducer based on JENKINS-71617)


          Jesse Glick added a comment -

          JENKINS-60507 is a bit related, as are changes like https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/259, though that applies to queue items rather than executor slots.


          Allan BURDAJEWICZ added a comment -

          By the way, I am attaching a script used to clean up those executors:

          import jenkins.model.Jenkins
          import org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution

          // Release executors still held by a node step whose build has already finished.
          Jenkins.get().computers.each { computer ->
              computer.executors
                  // Only consider executors that are running something asynchronously.
                  .findAll { it.asynchronousExecution != null && it.currentExecutable != null }
                  .each { executor ->
                      def parent = executor.currentExecutable.parent
                      if (parent instanceof ExecutorStepExecution.PlaceholderTask) {
                          def run = parent.runForDisplay()
                          // Skip builds whose log is still being updated.
                          if (run != null && !run.isLogUpdated()) {
                              println "Completing asynchronous execution of ${executor}"
                              executor.completedAsynchronous(null)
                          }
                      }
                  }
          }
          return

          Devin Nusbaum added a comment - - edited

          Whether the reproducer is related to the real problem observed by Jenkins users, I have no idea. I think there are multiple levels of problems though:

          1. The CPS compilation only looks at labels directly attached to loop statements, but syntactically they are valid on all statements, so when we run LabelVerifier here it is happy, but post-CPS transformation the relevant label has been completely lost. We should modify the CPS transformer to either support arbitrary label placement or throw a compilation error when there is a label on a non-loop statement.
          2. Any runtime groovy-cps error should be thrown into the CPS execution if at all possible, otherwise all flow control in the script is lost, blocks never exit, etc. I think we can update BreakBlock and ContinueBlock to address at least the specific problem in this ticket.
            • (If we do see an exception in CpsVmExecutorService, I don't think we ever shut down the executor, although IDK if that really matters)
          3. There should be some mechanism whereby Pipeline step executions are notified if their corresponding program is fully dead and they should attempt to clean up whatever they can without accessing their StepContext or FlowExecution.
            • At the very least, steps like node should try to detect this case out-of-band and clean up their resources.

          I will take a brief look at the sub-bullet under point 3, because fixing 1 and 2 may or may not affect point 3 at all if there are other unknown causes of this class of issue, and we already have code for the node step that is supposed to handle this case. In fact, the reproducer in the description hits this case, which is significant because we have never been able to debug things in that state. I will check why Queue.cancel is failing and see what we can do about it.

          EDIT: Actually, as Jesse noted, that code in ExecutorStepExecution is about queue entries, not executors. From what I can tell, in this case the executors are released after a restart without any issues.

          In that case, I am not sure about a generic fix. I will check whether the step execution is still around before the restart as well as a few other similar things.
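
          As an illustration of point 1 above, here is a minimal plain-Groovy sketch in which the label sits directly on the for loop instead of on an enclosing block. Going by the analysis in this comment, that is the placement the CPS transformer looks at, so the same loop inside a node/stage block should avoid the "unexpected break statement" failure. That has not been verified here, and the helper name runLoop is hypothetical and exists only to make the snippet self-contained.

```groovy
// Hypothetical helper, not part of the ticket: collects the values the
// reproducer would have echoed, so the loop can be run outside Jenkins.
List<Integer> runLoop() {
    def echoed = []
    // The label is attached to the loop statement itself, not to a bare block.
    outerloop: for (int i = 0; i < 2; i++) {
        if (i > 0) {
            echoed << i
        } else {
            break outerloop
        }
    }
    return echoed
}

println runLoop()
```

          The loop exits on its first iteration, exactly as the reproducer intends, so nothing is collected.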


          Devin Nusbaum added a comment - - edited

          Well, IDK if it will help in practical cases, but I think something along the lines of https://github.com/jenkinsci/workflow-cps-plugin/pull/780 makes sense and addresses point 3 in my above comment.


          Devin Nusbaum added a comment -

          I confirmed that my fix also helps with two distinct cases described in JENKINS-70267, so it seems somewhat general.


          Devin Nusbaum added a comment -

          Pipeline: Groovy plugin version 3785.vee73da_b_9544e fixes the effects of the bug in the description of this ticket by cleaning up steps when the CPS VM thread dies, which can happen in various ways as noted here and in JENKINS-70267. The proximate cause, related to labels on non-loop statements combined with break and continue, remains unfixed.
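
          The general idea of cleaning up steps when the program dies can be sketched with a toy model in plain Groovy. This is not the plugin's implementation: ExecutorSlot, ToyNodeStep, programDied and runProgram are all hypothetical names invented for this illustration. It only shows the shape of the approach, in which a step releases resources it owns without touching the dead program.

```groovy
// Toy model only: none of these classes or methods exist in Jenkins.
class ExecutorSlot {
    boolean busy = false
}

class ToyNodeStep {
    ExecutorSlot slot

    void start() { slot.busy = true }

    // Out-of-band cleanup: touches only state the step owns,
    // never the (dead) program that started it.
    void programDied() { slot.busy = false }
}

// Runs the program body. If the interpreter itself fails, every step that
// is still active gets a chance to release its resources.
void runProgram(List<ToyNodeStep> activeSteps, Closure body) {
    try {
        body()
    } catch (Throwable t) {
        activeSteps.each { it.programDied() }
    }
}

def slot = new ExecutorSlot()
def step = new ToyNodeStep(slot: slot)
step.start()

// Simulate the failure from the description killing the program.
runProgram([step]) { throw new IllegalStateException('unexpected break statement') }
```

          After the simulated failure the slot is no longer marked busy, which is the toy equivalent of the heavyweight executor being released.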


            Assignee: Devin Nusbaum (dnusbaum)
            Reporter: Allan BURDAJEWICZ (allan_burdajewicz)