JENKINS-53223

Finished pipeline jobs appear to occupy executor slots long after completion


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Component/s: core, pipeline

    Description

      We have been observing an issue where completed jobs occupy executor slots on our Jenkins slaves (AWS EC2 instances), and this seems to be causing a backup in our build queue, which is normally managed by the EC2 cloud plugin spinning nodes up and down as needed. When the problem manifests, it usually coincides with the EC2 cloud plugin failing to autoscale new nodes and a subsequent massive buildup in our build queue, until we have to restart the master and kill all jobs to recover.

      These "zombie executor slots" do seem to clear themselves up after 5-60+ minutes, and often they belong to downstream jobs of still-running parent jobs, but not always (sometimes the parent jobs are also completed yet the executor remains occupied). CPU and memory don't seem particularly strained when the problem manifests.

      The general job hierarchy where this manifests looks like {1 root job} -> {1-6 child "target building" jobs in parallel} -> {each produces 5-80 "unit testing" jobs in parallel}. We usually see the issue on this group of jobs (the only ones really running on this cluster) when it is under medium-to-high load, running 100+ jobs simultaneously across tens of nodes.

      I'm attaching a thread dump I downloaded from a slave exhibiting this behavior of having all 4/4 of its executors occupied by jobs that have finished running. I'm actually attaching two dumps, the second taken a few minutes after the first on the same slave, because there seems to be some activity with new threads spinning up, although I'm not sure what their purpose is. I will try to generate and submit the zip from the core support plugin the next time I see the problem manifesting.
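
      For illustration, here is a minimal diagnostic sketch of the kind of check involved, using only standard Jenkins core APIs (Computer.getExecutors(), Executor.getCurrentExecutable(), Run.isBuilding()). The class name is invented for this example, and it only recognizes executables that are Runs, so Pipeline node-step placeholder executables would need a plugin-specific check:

      import hudson.model.Computer;
      import hudson.model.Executor;
      import hudson.model.Queue;
      import hudson.model.Run;
      import jenkins.model.Jenkins;

      // Hypothetical diagnostic helper, not part of Jenkins core.
      public class ZombieExecutorReport {
          public static void print() {
              for (Computer c : Jenkins.get().getComputers()) {
                  for (Executor e : c.getExecutors()) {
                      Queue.Executable exe = e.getCurrentExecutable();
                      // An executor whose current executable is a Run that reports
                      // isBuilding() == false matches the "zombie slot" symptom above.
                      if (exe instanceof Run && !((Run<?, ?>) exe).isBuilding()) {
                          System.out.println(c.getDisplayName() + " executor #" + e.getNumber()
                                  + " still holds finished build " + exe);
                      }
                  }
              }
          }
      }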

      Attachments

        Issue Links

          Activity

            basil Basil Crow added a comment -

            Attached the serialized pipeline and console log before the last restart (when the leaked flyweight executor shown above was present) and after the last restart (when it was absent).

            basil Basil Crow added a comment -

            After the last restart on January 22, one of my Jenkins masters is still leaking flyweight executors. It hasn't quite gotten into the thousands yet as it did last time: there are 420 flyweight executors right now (and the number is increasing), but only about 30 running builds are visible in the UI, which means hundreds of flyweight executors have been leaked. Last night, this resulted in a huge burst in the number of threads and CPU usage, with dozens of stacks like this:

            "jenkins.util.Timer [#4]" #77 daemon prio=5 os_prio=0 tid=0x00007f50a800e800 nid=0x80d runnable [0x00007f504788a000]
               java.lang.Thread.State: RUNNABLE
                    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
                    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
                    at hudson.model.Executor.getCurrentExecutable(Executor.java:514)
                    at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.buildsOnExecutor(ThrottleQueueTaskDispatcher.java:511)
                    at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.buildsOfProjectOnNode(ThrottleQueueTaskDispatcher.java:488)
                    at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.buildsOfProjectOnAllNodes(ThrottleQueueTaskDispatcher.java:501)
                    at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.throttleCheckForCategoriesAllNodes(ThrottleQueueTaskDispatcher.java:281)
                    at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRunImpl(ThrottleQueueTaskDispatcher.java:253)
                    at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRun(ThrottleQueueTaskDispatcher.java:218)
                    at hudson.plugins.throttleconcurrents.ThrottleQueueTaskDispatcher.canRun(ThrottleQueueTaskDispatcher.java:176)
                    at hudson.model.Queue.getCauseOfBlockageForItem(Queue.java:1197)
                    at hudson.model.Queue.maintain(Queue.java:1522)
            

            We were burning CPU in the Throttle Concurrent Builds plugin iterating over these leaked flyweight executors. The issue would go away if the flyweight executors weren't leaked. After restarting the master, things are back to normal, but the leak grows again; it seems to take about 20 days for the leaked executors to start causing serious problems in my environment.

            dnusbaum, what do you suggest as the next steps here? I see this bug has been resolved as "incomplete", but this issue occurred on January 22 and February 12, and I'm sure it will occur again in 20 days or so after I restart this master. While I don't have a simple reproducer, I do have an environment on which this issue occurs regularly. I can help collect any debugging state that is needed. Please let me know if I can add any additional information to this bug (or a new bug).
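
            For context, here is a rough sketch of how this comparison (one-off/flyweight executors vs. builds in progress) could be made from a script or plugin. The class and method names are invented; it relies only on Computer.getOneOffExecutors(), Job.isBuilding(), and Jenkins.getAllItems(), and counting jobs rather than builds undercounts projects with concurrent builds:

            import hudson.model.Computer;
            import hudson.model.Job;
            import jenkins.model.Jenkins;

            // Hypothetical helper mirroring the "420 flyweight executors vs. ~30
            // running builds" comparison above; not part of Jenkins core or any plugin.
            public class FlyweightLeakCheck {
                public static void report() {
                    int flyweight = 0;
                    for (Computer c : Jenkins.get().getComputers()) {
                        // Flyweight tasks run on one-off executors rather than regular slots.
                        flyweight += c.getOneOffExecutors().size();
                    }
                    int buildingJobs = 0;
                    for (Job<?, ?> job : Jenkins.get().getAllItems(Job.class)) {
                        if (job.isBuilding()) {
                            buildingJobs++; // undercounts jobs running concurrent builds
                        }
                    }
                    System.out.println(flyweight + " one-off (flyweight) executors vs. "
                            + buildingJobs + " jobs with a build in progress");
                }
            }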

            dnusbaum Devin Nusbaum added a comment - - edited

            basil I think at this point the best path forward in the short term is to modify the Throttle Concurrent Builds plugin to directly examine running builds instead of using executors as a proxy, as jglick mentioned in JENKINS-45571. Without a consistent and simple reproduction case, I don't think we are going to make any progress on fixing the root cause of the flyweight executor leak anytime soon, and since flyweight executors do not use a thread or other resources, it doesn't really matter if there are a bunch of them in the system, except that Throttle Concurrent Builds uses them to determine what is currently running.

            Also, thanks for uploading the build directories of a job with the issue before and after the restart. If you see the issue again, could you do the same, but make sure to include the build.xml file (redacted as necessary)? The data in build.xml will tell us exactly what has and hasn't been persisted, so that's a key piece of the puzzle. The logs and flow node XML files are helpful, but in this case they don't seem to contain any particularly interesting information.
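
            For illustration only (this is not the actual change made in jenkinsci/throttle-concurrent-builds-plugin#57), here is a sketch of counting a project's in-progress builds directly, so that a leaked flyweight executor no longer inflates the count. The class name is invented, and a real implementation would bound the walk over builds to keep lazy loading cheap:

            import hudson.model.Job;
            import hudson.model.Run;

            // Hypothetical sketch of "examine running builds instead of executors".
            public class DirectBuildCounter {
                public static int countRunningBuilds(Job<?, ?> job) {
                    int running = 0;
                    for (Run<?, ?> build : job.getBuilds()) {
                        // A completed build reports isBuilding() == false even if a stale
                        // flyweight executor still references it.
                        if (build.isBuilding()) {
                            running++;
                        }
                    }
                    return running;
                }
            }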

            basil Basil Crow added a comment -

            Thanks for the suggestions, dnusbaum! I am attempting to implement the Throttle Concurrent Builds change in jenkinsci/throttle-concurrent-builds-plugin#57.

            Regarding debugging state, I uploaded a build.xml from a job with a leaked flyweight executor. The job started and was still running at the time of the last restart. I restarted Jenkins, the job resumed, and the job completed, but the flyweight executor was leaked. I saved build.xml at that point, then restarted Jenkins again and saved build.xml once more, only to find it was no different from the first version. I then redacted it and attached it to this bug. I'm not sure whether I did this right or whether this build.xml will be helpful.

            dnusbaum Devin Nusbaum added a comment -

            basil Thanks for uploading the build.xml. If that file had <completed>true</completed> both before and after the restart, then I don't really understand what is happening. As of workflow-job 2.26, one of the hypotheses as to how these executors were leaking was addressed by commit 18d78f30. If the TODO related to bulk changes were the problem, I'd expect build.xml to have <completed>false</completed> before the restart. Perhaps the bug is somewhere else, maybe in workflow-cps.


            People

              Assignee: dnusbaum Devin Nusbaum
              Reporter: elliotb Elliot Babchick
              Votes: 0
              Watchers: 5
