[JENKINS-53223] Finished pipeline jobs appear to occupy executor slots long after completion

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Minor
    • core, pipeline

      We have been observing an issue where completed jobs continue to occupy executor slots on our Jenkins slaves (AWS EC2 instances). This seems to cause a backup in our build queue, which is usually managed by the EC2 cloud plugin spinning nodes up and down as needed. When the problem manifests, it usually coincides with the EC2 cloud plugin failing to autoscale new nodes, followed by a massive buildup in our build queue until we have to restart the master and kill all jobs to recover.

      These "zombie executor slots" do seem to clear themselves up after 5-60+ minutes. Often they belong to downstream jobs of still-ongoing parent jobs, but not always (sometimes the parent jobs have also completed but the executor remains occupied). CPU and memory don't seem too strained when this problem manifests.
       
      The general job hierarchy where this manifests looks like {1 root job} -> {produces 1-6 child "target building" jobs in parallel} -> {each produces 5-80 "unit testing" jobs in parallel}. We usually see the issue on this group of jobs (the only ones really running on this cluster) when the cluster is under medium-high load, running 100+ jobs simultaneously across tens of nodes.
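To put rough numbers on that fan-out, here is a small sketch of the arithmetic. It assumes, as a simplification not stated in the report, that every job in the tree holds exactly one executor at peak and that each node has 4 executors (the 4/4 figure mentioned for the affected slave). The function names are made up for illustration.

```python
import math


def concurrent_jobs(children, tests_per_child):
    """Jobs alive at peak for one root job: the root, its children, and all unit-test jobs."""
    return 1 + children + children * tests_per_child


def nodes_needed(jobs, executors_per_node=4):
    """Nodes required if each job holds one executor (simplifying assumption)."""
    return math.ceil(jobs / executors_per_node)


# Smallest and largest trees described in the report: 1-6 children, 5-80 tests each.
smallest = concurrent_jobs(children=1, tests_per_child=5)
largest = concurrent_jobs(children=6, tests_per_child=80)

print(smallest, nodes_needed(smallest))
print(largest, nodes_needed(largest))
```

Under these assumptions a single large root job can by itself account for several hundred executor slots, so a modest fraction of slots held by finished jobs could plausibly be enough to stall the queue.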
       
      I'm attaching a thread dump I downloaded from a slave exhibiting this behavior, with all of its executors (4/4) occupied by jobs that have finished running. I'm actually attaching two dumps, the second taken a few minutes after the first on the same slave, because there seems to be some activity with new threads spinning up, although I'm not sure what exactly their purpose is. I will try to generate and submit the zip from the support core plugin the next time I see the problem manifesting.
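For anyone comparing two dumps like these, a minimal sketch of listing which threads appeared or disappeared between them follows. It assumes each thread's stanza begins with the thread name in double quotes at the start of a line, which is how jstack-style dumps are usually laid out; the thread names in the usage example are invented and are not taken from the attached dumps.

```python
import re

# Assumption: a thread stanza starts with the quoted thread name at column 0.
THREAD_HEADER = re.compile(r'^"([^"]+)"')


def thread_names(dump_text):
    """Collect the set of thread names found in one dump."""
    names = set()
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            names.add(match.group(1))
    return names


def compare_dumps(first, second):
    """Report threads present only in the second dump and only in the first."""
    before, after = thread_names(first), thread_names(second)
    return {"started": sorted(after - before), "ended": sorted(before - after)}


if __name__ == "__main__":
    # Invented sample data for illustration only.
    first = '"Executor #0" Id=1 RUNNABLE\n  at example.Foo.run\n"pool-1-thread-1" Id=2 WAITING\n'
    second = '"Executor #0" Id=1 RUNNABLE\n"pool-1-thread-2" Id=3 WAITING\n'
    print(compare_dumps(first, second))
```

Comparing names only shows churn; it says nothing about what a thread is doing, so the stack frames of anything in the "started" list still need to be read by hand.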


          Elliot Babchick created issue -
          Elliot Babchick made changes -
          Environment New: System Properties
          awt.toolkit sun.awt.X11.XToolkit
          com.sun.org.apache.xml.internal.dtm.DTMManager com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault
          executable-war /usr/share/jenkins/jenkins.war
          file.encoding utf8
          file.encoding.pkg sun.io
          file.separator /
          hudson.model.LoadStatistics.decay 0.7
          hudson.model.ParametersAction.keepUndefinedParameters false
          hudson.plugins.ec2.SlaveTemplate.skipCheckInstance true
          hudson.slaves.NodeProvisioner.MARGIN 30
          hudson.slaves.NodeProvisioner.MARGIN0 0.6
          java.awt.graphicsenv sun.awt.X11GraphicsEnvironment
          java.awt.headless true
          java.awt.printerjob sun.print.PSPrinterJob
          java.class.path /usr/share/jenkins/jenkins.war
          java.class.version 52.0
          java.endorsed.dirs /usr/lib/jvm/java-8-oracle/jre/lib/endorsed
          java.ext.dirs /usr/lib/jvm/java-8-oracle/jre/lib/ext:/usr/java/packages/lib/ext
          java.home /usr/lib/jvm/java-8-oracle/jre
          java.io.tmpdir /tmp
          java.library.path /usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
          java.runtime.name Java(TM) SE Runtime Environment
          java.runtime.version 1.8.0_162-b12
          java.specification.name Java Platform API Specification
          java.specification.vendor Oracle Corporation
          java.specification.version 1.8
          java.vendor Oracle Corporation
          java.vendor.url http://java.oracle.com/
          java.vendor.url.bug http://bugreport.sun.com/bugreport/
          java.version 1.8.0_162
          java.vm.info mixed mode
          java.vm.name Java HotSpot(TM) 64-Bit Server VM
          java.vm.specification.name Java Virtual Machine Specification
          java.vm.specification.vendor Oracle Corporation
          java.vm.specification.version 1.8
          java.vm.vendor Oracle Corporation
          java.vm.version 25.162-b12
          javax.xml.parsers.DocumentBuilderFactory com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
          jetty.git.hash d5fc0523cfa96bfebfbda19606cad384d772f04c
          jna.loaded true
          jna.platform.library.path /usr/lib/x86_64-linux-gnu:/lib/x86_64-linux-gnu:/lib64:/usr/lib:/lib:/usr/lib/x86_64-linux-gnu/libfakeroot:/usr/lib/x86_64-linux-gnu/mesa-egl
          jnidispatch.path /tmp/jna--1712433994/jna898753681106629507.tmp
          line.separator
          mail.smtp.sendpartial true
          mail.smtps.sendpartial true
          os.arch amd64
          os.name Linux
          os.version 4.4.0-130-generic
          path.separator :
          sun.arch.data.model 64
          sun.boot.class.path /usr/lib/jvm/java-8-oracle/jre/lib/resources.jar:/usr/lib/jvm/java-8-oracle/jre/lib/rt.jar:/usr/lib/jvm/java-8-oracle/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jsse.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jce.jar:/usr/lib/jvm/java-8-oracle/jre/lib/charsets.jar:/usr/lib/jvm/java-8-oracle/jre/lib/jfr.jar:/usr/lib/jvm/java-8-oracle/jre/classes
          sun.boot.library.path /usr/lib/jvm/java-8-oracle/jre/lib/amd64
          sun.cpu.endian little
          sun.cpu.isalist
          sun.font.fontmanager sun.awt.X11FontManager
          sun.io.unicode.encoding UnicodeLittle
          sun.java.command /usr/share/jenkins/jenkins.war --webroot=/var/cache/jenkins/war --httpPort=8082 --ajp13Port=-1 --httpsPort=-1 --sessionTimeout=1440
          sun.java.launcher SUN_STANDARD
          sun.jnu.encoding UTF-8
          sun.management.compiler HotSpot 64-Bit Tiered Compilers
          sun.os.patch.level unknown
          svnkit.http.methods Digest,Basic,NTLM,Negotiate
          svnkit.ssh2.persistent false
          user.country US
          user.dir /
          user.home /home/jenkins
          user.language en
          user.name jenkins
          user.timezone Etc/UTC

          Environment Variables
          Name ↓
          Value
          _ /usr/bin/daemon
          HOME /home/jenkins
          JAVA_TOOL_OPTIONS -Dfile.encoding=UTF8
          JENKINS_HOME /mnt/jenkins
          LANG en_US.UTF-8
          LOGNAME jenkins
          MAIL /var/mail/jenkins
          PATH /usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/snap/bin
          PGSSLROOTCERT /usr/local/etc/ssl/certs/aws-combined.pem
          PWD /home/jenkins
          SHELL /bin/bash
          SHLVL 1
          USER jenkins
          XDG_RUNTIME_DIR /run/user/1002
          XDG_SESSION_ID c4
          Plugins
          Name ↓
          Version
          Enabled
          ace-editor 1.1 true
          analysis-core 1.95 true
          ansicolor 0.5.2 true
          ant 1.8 true
          antisamy-markup-formatter 1.5 true
          apache-httpcomponents-client-4-api 4.5.5-3.0 true
          authentication-tokens 1.3 true
          aws-credentials 1.23 true
          aws-java-sdk 1.11.341 true
          basic-branch-build-strategies 1.0.1 true
          blueocean 1.7.1 true
          blueocean-autofavorite 1.2.2 true
          blueocean-bitbucket-pipeline 1.7.1 true
          blueocean-commons 1.7.1 true
          blueocean-config 1.7.1 true
          blueocean-core-js 1.7.1 true
          blueocean-dashboard 1.7.1 true
          blueocean-display-url 2.2.0 true
          blueocean-events 1.7.1 true
          blueocean-git-pipeline 1.7.1 true
          blueocean-github-pipeline 1.7.1 true
          blueocean-i18n 1.7.1 true
          blueocean-jira 1.7.1 true
          blueocean-jwt 1.7.1 true
          blueocean-personalization 1.7.1 true
          blueocean-pipeline-api-impl 1.7.1 true
          blueocean-pipeline-editor 1.7.1 true
          blueocean-pipeline-scm-api 1.7.1 true
          blueocean-rest 1.7.1 true
          blueocean-rest-impl 1.7.1 true
          blueocean-web 1.7.1 true
          bouncycastle-api 2.16.3 true
          branch-api 2.0.20 true
          build-monitor-plugin 1.12+build.201805070054 true
          build-name-setter 1.6.9 true
          build-pipeline-plugin 1.5.8 true
          build-timeout 1.19 true
          build-user-vars-plugin 1.5 true
          built-on-column 1.1 true
          cloudbees-bitbucket-branch-source 2.2.12 true
          cloudbees-folder 6.5.1 true
          command-launcher 1.2 true
          conditional-buildstep 1.3.6 true
          copyartifact 1.41 true
          credentials 2.1.18 true
          credentials-binding 1.16 true
          display-url-api 2.2.0 true
          docker-commons 1.13 true
          docker-workflow 1.17 true
          durable-task 1.22 true
          ec2 1.40-SNAPSHOT (private-b9392270-elliotbabchick) true
          email-ext 2.62 true
          envinject 2.1.6 true
          envinject-api 1.5 true
          external-monitor-job 1.7 true
          favorite 2.3.2 true
          git 3.9.1 true
          git-client 3.0.0-beta4 true
          git-parameter 0.9.3 true
          git-server 1.7 true
          github 1.29.2 true
          github-api 1.92 true
          github-branch-source 2.3.6 true
          github-tag-trigger 1.0-SNAPSHOT (private-2f26a491-elliotbabchick) true
          google-login 1.4 true
          gradle 1.29 true
          groovy 2.0 true
          handlebars 1.1.1 true
          handy-uri-templates-2-api 2.1.6-1.0 true
          heavy-job 1.1 true
          htmlpublisher 1.16 true
          jackson2-api 2.8.11.3 true
          javadoc 1.4 true
          jdk-tool 1.1 true
          jenkins-design-language 1.7.1 true
          jenkins-multijob-plugin 1.30 true
          jira 3.0.0 true
          job-dsl 1.70 true
          jobConfigHistory 2.18 true
          jquery 1.12.4-0 true
          jquery-detached 1.2.1 true
          jsch 0.1.54.2 true
          junit 1.24 true
          ldap 1.20 true
          mailer 1.21 true
          mapdb-api 1.0.9.0 true
          mask-passwords 2.12.0 true
          matrix-auth 2.3 true
          matrix-project 1.13 true
          maven-plugin 3.1.2 true
          mercurial 2.4 true
          metrics 4.0.2.2 true
          momentjs 1.1.1 true
          node-iterator-api 1.5.0 true
          pam-auth 1.3 true
          parameterized-trigger 2.35.2 true
          phabricator-plugin-affirm-fork 1.9.8-SNAPSHOT-AFFIRM-JENKINS2 true
          pipeline-build-step 2.7 true
          pipeline-github-lib 1.0 true
          pipeline-graph-analysis 1.7 true
          pipeline-input-step 2.8 true
          pipeline-milestone-step 1.3.1 true
          pipeline-model-api 1.3.1 true
          pipeline-model-declarative-agent 1.1.1 true
          pipeline-model-definition 1.3.1 true
          pipeline-model-extensions 1.3.1 true
          pipeline-rest-api 2.10 true
          pipeline-stage-step 2.3 true
          pipeline-stage-tags-metadata 1.3.1 true
          pipeline-stage-view 2.10 true
          pipeline-utility-steps 2.1.0 true
          plain-credentials 1.4 true
          postbuildscript 2.7.0 true
          pubsub-light 1.12 true
          rebuild 1.28 true
          resource-disposer 0.11 true
          role-strategy 2.8.1 true
          run-condition 1.0 true
          s3 0.11.2 true
          scm-api 2.2.7 true
          script-security 1.44 true
          shiningpanda 0.24 true
          simple-theme-plugin 0.4 true
          sse-gateway 1.15 true
          ssh-agent 1.15 true
          ssh-credentials 1.14 true
          ssh-slaves 1.26 true
          structs 1.14 true
          subversion 2.11.1 true
          support-core 2.49 true
          timestamper 1.8.10 true
          token-macro 2.5 true
          variant 1.1 true
          violations 0.7.11 true
          warnings 4.68 true
          windows-slaves 1.3.1 true
          workflow-aggregator 2.5 true
          workflow-api 2.28 true
          workflow-basic-steps 2.9 true
          workflow-cps 2.54 true
          workflow-cps-global-lib 2.9 true
          workflow-durable-task-step 2.19 true
          workflow-job 2.23 true
          workflow-multibranch 2.20 true
          workflow-scm-step 2.6 true
          workflow-step-api 2.16 true
          workflow-support 2.19 true
          ws-cleanup 0.34 true

          Devin Nusbaum added a comment -

          Maybe a dupe of JENKINS-45571 and/or JENKINS-51568? Thanks for including the thread dumps elliotb, I will take a look and see if anything gives us an idea of the cause.

          Devin Nusbaum made changes -
          Assignee New: Devin Nusbaum [ dnusbaum ]

          Elliot Babchick added a comment - - edited

          Thanks! I'm also working on getting a support bundle while the issue is reproduced. Unfortunately, while the issue is occurring Jenkins tends to be heavily backlogged w.r.t. the build queue, and attempting to generate the bundle crashes Jenkins with an OOM. We'll keep trying ...


          Sam Van Oort added a comment -

          dnusbaum Did your investigation here turn anything up?


          Devin Nusbaum added a comment -

          svanoort Still in my queue, have not had time to investigate.


          Devin Nusbaum added a comment - - edited

          I just took a quick look, and the thread dumps on the agent show that there is a thread pool on the agent waiting for a task to execute and there doesn't seem to be anything else of interest, so it seems that any problems here are likely on the master side. If you see the issue again, could you try to get thread dumps from the master instead? EDIT: One other piece of info that would be helpful would be the contents of the build directory of one of the builds that appears to be hanging, especially if you can obtain the directory both while the build is holding onto the executor and once it has released the executor.
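One way to capture the build directory at both points in time is to record a listing of it and compare the two listings afterwards. The sketch below is only an illustration of that idea, not something the Jenkins developers asked for specifically: it records each file's size and modification time, then reports what was added, removed, or changed. The directory passed in is whatever the build's directory is on the master; the function names are hypothetical.

```python
import os


def snapshot(build_dir):
    """Map each file's path (relative to build_dir) to its (size, mtime_ns)."""
    listing = {}
    for root, _dirs, files in os.walk(build_dir):
        for name in files:
            path = os.path.join(root, name)
            info = os.lstat(path)  # lstat so a dangling symlink does not raise
            listing[os.path.relpath(path, build_dir)] = (info.st_size, info.st_mtime_ns)
    return listing


def diff_snapshots(before, after):
    """Compare two snapshots taken at different times."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(p for p in before.keys() & after.keys() if before[p] != after[p]),
    }
```

Take one snapshot while the build still holds the executor and another after it has been released, then pass both to `diff_snapshots`. This only shows which files differ; the files themselves would still need to be attached for anyone to inspect their contents.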


          Vivek Pandey added a comment -

          We need more info to investigate it further.

          Vivek Pandey made changes -
          Resolution New: Incomplete [ 4 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]

            Assignee: Devin Nusbaum (dnusbaum)
            Reporter: Elliot Babchick (elliotb)