  Jenkins / JENKINS-53223

Finished pipeline jobs appear to occupy executor slots long after completion

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Minor
    • Component/s: core, pipeline

      We have been observing an issue where completed jobs continue to occupy executor slots on our Jenkins slaves (AWS EC2 instances), and this seems to be causing a backup in our build queue, which is usually managed by the EC2 cloud plugin spinning nodes up and down as needed. When this problem manifests, we usually see it coincide with the EC2 cloud plugin failing to autoscale new nodes, followed by a massive buildup in our build queue until we have to restart the master and kill all jobs to recover.

      These "zombie executor slots" do clear themselves up after 5-60+ minutes pass it seems, and often they are downstream jobs of still-ongoing parent jobs, but not always (sometimes the parent jobs are also completed but the executor still remains occupied). CPU and memory don't seem too strained when this problem manifests. 
       
      The general job hierarchy where this manifests looks like {1 root job} -> {produces 1-6 child "target building" jobs in parallel} -> {each produces 5-80 "unit testing" jobs in parallel}. We usually see the issue on this group of jobs (the only ones really running on this cluster) when it is under medium-to-high load, running 100+ jobs simultaneously across tens of nodes.
       
      I'm attaching a thread dump I downloaded from a slave exhibiting this behavior of having all 4/4 of its executors occupied by jobs that have finished running. I'm actually attaching two dumps, the second taken a few minutes after the first on the same slave, because there seems to be some activity with new threads spinning up, although I'm not sure what their purpose is. I will try to generate and submit the zip from the core support plugin the next time I see the problem manifesting.
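
      For reference, here is roughly how we spot the occupied slots from the Script Console. This is just a minimal sketch using standard core APIs; the "finished but still busy" check is our own heuristic, not anything official:

          // List executors that are still marked busy even though their build has finished.
          import hudson.model.Run
          import jenkins.model.Jenkins

          Jenkins.instance.computers.each { computer ->
              computer.executors.each { executor ->
                  def executable = executor.currentExecutable
                  if (executor.busy && executable instanceof Run && !executable.isBuilding()) {
                      println "${computer.name} slot #${executor.number}: ${executable} " +
                              "(slot held for ${executor.elapsedTime.intdiv(1000)}s)"
                  }
              }
          }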

        1. zombie-executor-slots-threadDump-2-min-later.rtf
          17 kB
          Elliot Babchick
        2. zombie-executor-slots-threadDump.rtf
          39 kB
          Elliot Babchick
        3. build.xml
          21 kB
          Basil Crow
        4. before.tar.gz
          3 kB
          Basil Crow
        5. after.tar.gz
          3 kB
          Basil Crow

          [JENKINS-53223] Finished pipeline jobs appear to occupy executor slots long after completion

          Sam Van Oort added a comment -

          dnusbaum Did your investigation here turn anything up?

          Devin Nusbaum added a comment -

          svanoort Still in my queue, have not had time to investigate.

          Devin Nusbaum added a comment - - edited

          I just took a quick look. The thread dumps from the agent show a thread pool waiting for a task to execute, and there doesn't seem to be anything else of interest, so any problems here are likely on the master side. If you see the issue again, could you try to get thread dumps from the master instead? EDIT: One other piece of info that would be helpful is the contents of the build directory of one of the builds that appears to be hanging, especially if you can capture it both while the build is holding onto the executor and after it has released it.
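
          If running jstack against the master process isn't convenient, a thread dump captured from the master's Script Console carries the same information. A minimal sketch (the formatting here is arbitrary):

          // Print all JVM thread stacks on the master.
          Thread.getAllStackTraces().each { thread, frames ->
              println "\"${thread.name}\" ${thread.state}"
              frames.each { println "    at ${it}" }
              println ""
          }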

          Vivek Pandey added a comment -

          We need more info to investigate it further.

          Basil Crow added a comment -

          I ran into a similar issue after upgrading Jenkins from 2.138.1 LTS (with workflow-job 2.25, workflow-cps 2.54, and workflow-durable-task-step 2.27) to 2.150.1 LTS (with workflow-job 2.31, workflow-cps 2.61, and workflow-durable-task-step 2.27). There were over 1,000 flyweight executors running on the master after the upgrade, for jobs that had already completed. The flyweight executors persisted for weeks after the upgrade and we didn't notice them until they eventually caused problems with the Throttle Concurrent Builds plugin.

          Here's an example of one of the more than 1,000 flyweight executors that had been leaked:

          Display name: Executor #-1
          Is idle? false
          Is busy? true
          Is active? true
          Is display cell? false
          Is parking? false
          Progress: 99
          Elapsed time: 18 days
          Asynchronous execution: org.jenkinsci.plugins.workflow.job.WorkflowRun$2
          Current workspace: null
          Current work unit class: class hudson.model.queue.WorkUnit
          Current work unit: hudson.model.queue.WorkUnit@49dc3e01[work=dlpx-app-gate » master » integration-tests » on-demand-jobs » split-precommit-dxos]
          Current work unit is main work? true
          Current work unit context: hudson.model.queue.WorkUnitContext@542d694b
          Current work unit executable: dlpx-app-gate/master/integration-tests/on-demand-jobs/split-precommit-dxos #27775
          Current work unit work: org.jenkinsci.plugins.workflow.job.WorkflowJob@73dbb718[dlpx-app-gate/master/integration-tests/on-demand-jobs/split-precommit-dxos]
          Current work unit context primary work unit: hudson.model.queue.WorkUnit@49dc3e01[work=dlpx-app-gate » master » integration-tests » on-demand-jobs » split-precommit-dxos]
          

          Note that the executor had been active for 18 days. We performed the upgrade 20 days ago, and prior to that had no issues with the old version since September. The job itself started 18 days ago and took 1.6 seconds (it was started by an upstream job and finished quickly with a status of NOT_BUILT). The work unit is of type hudson.model.queue.WorkUnit and not org.jenkinsci.plugins.workflow.job.AfterRestartTask (which is the type for jobs that have resumed), so I know this flyweight executor was launched after the upgrade. The question is: why was it leaked? This happened to this job and over 1,000 other jobs.

          One other piece of info that would be helpful would be the contents of the build directory of one of the builds that appears to be hanging, especially if you can obtain the directory both while the build is holding onto the executor and once it has released the executor.

          After reading this comment, I saved the contents of the build directory. Then I tried calling `.interrupt()` on the hung flyweight executor using the script console. That didn't appear to do anything (it was still active afterwards), so I then restarted the Jenkins master. After it restarted the number of flyweight executors was down from over 1,000 back to 70 (which matched the number of jobs that had resumed). Things seem to be stable since then.

          Let me know how else I can help with debugging this.
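
          For anyone who wants to check their own master, something along these lines in the Script Console will enumerate busy one-off (flyweight) executors whose builds have already finished. This is a rough sketch of the idea, not exactly the script I ran:

          import hudson.model.Run
          import jenkins.model.Jenkins

          def leaked = 0
          Jenkins.instance.computers.each { computer ->
              computer.oneOffExecutors.each { executor ->
                  def executable = executor.currentExecutable
                  // A busy flyweight executor whose run is no longer building has been leaked.
                  if (executor.busy && executable instanceof Run && !executable.isBuilding()) {
                      leaked++
                      println "${computer.name}: ${executable} (held for ${executor.elapsedTime.intdiv(3600 * 1000)} hours)"
                  }
              }
          }
          println "Leaked flyweight executors: ${leaked}"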

          Basil Crow added a comment -

          Attached the serialized pipeline and console log before the last restart (when the leaked flyweight executor shown above was present) and after the last restart (when it was absent).

          Basil Crow added a comment -

          After the last restart on January 22, one of my Jenkins masters is still leaking flyweight executors. It hasn't quite gotten up to the 1,000s yet as it did last time. There are 420 flyweight executors right now (and this number is increasing), but there are only about 30 running builds that are visible in the UI. This means hundreds of flyweight executors have been leaked. Last night, this resulted in a huge burst in the # of threads and CPU usage with dozens of stacks like this:

          "jenkins.util.Timer [#4]" #77 daemon prio=5 os_prio=0 tid=0x00007f50a800e800 nid=0x80d runnable [0x00007f504788a000]
             java.lang.Thread.State: RUNNABLE
                  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
                  at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
                  at hudson.model.Executor.getCurrentExecutable(Executor.java:514)
                  at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.buildsOnExecutor(ThrottleQueueTaskDispatcher.java:511)
                  at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.buildsOfProjectOnNode(ThrottleQueueTaskDispatcher.java:488)
                  at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.buildsOfProjectOnAllNodes(ThrottleQueueTaskDispatcher.java:501)
                  at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.throttleCheckForCategoriesAllNodes(ThrottleQueueTaskDispatcher.java:281)
                  at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRunImpl(ThrottleQueueTaskDispatcher.java:253)
                  at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRun(ThrottleQueueTaskDispatcher.java:218)
                  at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRun(ThrottleQueueTaskDispatcher.java:176)
                  at hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
                  at hudson.model.Queue.maintain(Queue.java:1522)
          

          The Throttle Concurrent Builds plugin was burning CPU iterating over these leaked flyweight executors; the problem would go away if the flyweight executors weren't leaked. After restarting the master, things are back to normal, but the leak grows again. It seems to take about 20 days for the leaked executors to start causing serious problems in my environment.

          dnusbaum, what do you suggest as the next steps here? I see this bug has been resolved as "incomplete", but this issue occurred on January 22 and February 12, and I'm sure it will occur again in 20 days or so after I restart this master. While I don't have a simple reproducer, I do have an environment on which this issue occurs regularly. I can help collect any debugging state that is needed. Please let me know if I can add any additional information to this bug (or a new bug).
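
          For the record, the flyweight-executor vs. running-build comparison can be reproduced with something like this in the Script Console (a rough sketch; the UI and the API can disagree slightly, but the gap is what matters):

          import hudson.model.Job
          import jenkins.model.Jenkins

          // A large, growing gap between these two numbers points at leaked flyweight executors.
          int flyweights = 0
          Jenkins.instance.computers.each { flyweights += it.oneOffExecutors.size() }
          def building = Jenkins.instance.getAllItems(Job).count { it.isBuilding() }
          println "One-off (flyweight) executors: ${flyweights}"
          println "Jobs currently building:       ${building}"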

          Devin Nusbaum added a comment - - edited

          basil I think at this point the best path forward in the short term would be to modify the Throttle Concurrent Builds plugin to directly examine running builds instead of using executors as a proxy, as jglick mentioned in JENKINS-45571. Without a consistent and simple reproduction case, I don't think we are going to be able to make progress on fixing the root cause of the leaked flyweight executors anytime soon. Since flyweight executors do not use a thread or other resources, it doesn't really matter that a bunch of them accumulate in the system, except that Throttle Concurrent Builds uses them to determine what is currently running.

          Also, thanks for uploading the build directories of a job with the issue before and after the restart. If you see the issue again, could you do the same, but make sure to include the build.xml file (redacted as necessary)? The data in build.xml will tell us exactly what has and hasn't been persisted, so that's a key piece of the puzzle. The logs and flow node XML files are helpful, but in this case they don't seem to contain any particularly interesting info.
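
          To illustrate what I mean by examining running builds directly, here is the general shape of the idea in Script Console form (the job name is a placeholder, and this is not the actual plugin change):

          import hudson.model.Job
          import hudson.model.Run
          import jenkins.model.Jenkins

          // Count running builds of a job by inspecting the builds themselves, rather than
          // walking executors (which over-counts when flyweight executors leak).
          def job = Jenkins.instance.getItemByFullName('some-folder/some-job', Job)
          if (job != null) {
              def running = job.builds.count { Run r -> r.isBuilding() }
              println "Currently running builds of ${job.fullName}: ${running}"
          }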

          Basil Crow added a comment -

          Thanks for the suggestions, dnusbaum! I am attempting to implement the Throttle Concurrent Builds change in jenkinsci/throttle-concurrent-builds-plugin#57.

          Regarding debugging state, I uploaded a build.xml from a job with a leaked flyweight executor. The job started and was still running at the time of the last restart. Then I restarted Jenkins, the job resumed, and the job completed. The flyweight executor was leaked. I saved the build.xml at this point, then restarted Jenkins again. I then saved build.xml but noticed it was no different than the first version I saved. I then redacted it and attached it to this bug. I'm not sure if I did this right or if this build.xml will be helpful.

          Devin Nusbaum added a comment -

          basil Thanks for uploading the build.xml. If that file had <completed>true</completed> both before and after the restart then I don't really understand what is happening. As of workflow-job 2.26, one of the hypotheses as to how these executors were leaking was addressed by commit 18d78f30. If the TODO related to bulk changes were the problem, I'd expect build.xml to have <completed>false</completed> before the restart. Perhaps the bug is somewhere else, maybe in workflow-cps.
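
          If it helps for next time, the persisted flag can be checked directly against a build directory on disk with something like this (the path is a placeholder for the build in question):

          // Read the <completed> element from a build.xml on disk.
          def buildXml = new File('/var/lib/jenkins/jobs/example-job/builds/123/build.xml')
          def run = new XmlSlurper().parse(buildXml)
          println "completed: ${run.completed.text()}"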

            Assignee: dnusbaum (Devin Nusbaum)
            Reporter: elliotb (Elliot Babchick)
            Votes: 0
            Watchers: 5
