From time to time, roughly every 2 or 3 weeks, we run into a strange issue.
      In our Jenkins instance with about 100 agents, all executors start to hang.
      The GUI shows an executor as busy with a build that has already finished.
      If we try to stop this build, Jenkins asks about a "null" build.

      In the script console, something like this returns "null" for the hanging executors, while normally it returns the build name.

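      Jenkins.get().getComputers().each { computer ->
         // Skip idle executors: they have no current executable
         computer.getAllExecutors().findAll { it.getCurrentExecutable() }.each { e ->
            println(e.getCurrentExecutable().getParent().run())
         }
      }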

      There is no useful information in the logs.
      Only restarting the Jenkins instance helps.

       

      Most of our agents are permanent, launched via SSH, but we are also using Docker Swarm (via docker-swarm:1.11 plugin).

      The most recent version on which the problem occurred was 2.414.2, after 2 weeks of use.

      We first observed it around 2.440, but it is probably not version-related.

      We're using only LTS releases.

      We start Jenkins in a Docker container from the official LTS image (currently jenkins/jenkins:2.414.2), but with a modified entrypoint like this:

      # Additional Jenkins options
      # -XX:+AlwaysPreTouch - pre-zeroes memory mapped pages on JVM startup
      # -XX:+UseStringDeduplication - looks for the strings with the same contents and deduplicates them
      # -XX:+ParallelRefProcEnabled - enables parallel reference processing, reducing young and old GC times.
      # -XX:+DisableExplicitGC - disables explicit System.gc() calls, which third-party plugins often use to invoke the garbage collector.
      # -Xms - initial (minimum) heap size.
      # -Xmx - maximum heap size.
      # -Djenkins.install.runSetupWizard=false - Skip initial setup wizard. Do not
      #       ask for plugin selection, do not create admin user, do not setup proxy.
      # -Dhudson.slaves.WorkspaceList=_ - use an underscore instead of @ when naming concurrent workspaces
      export JAVA_OPTS="-Xms4096m -Xmx16384m -XX:+AlwaysPreTouch -XX:+UseStringDeduplication -XX:+ParallelRefProcEnabled -XX:+DisableExplicitGC -Djenkins.install.runSetupWizard=false -Dhudson.slaves.WorkspaceList=_"
      export JENKINS_OPTS="--logfile=/var/log/jenkins/jenkins.log --httpPort=-1 --httpsPort=8080 --http2Port=8443 --httpsKeyStore=${JENKINS_HOME}/https/jenkins_ssl.jks --httpsKeyStorePassword=${SSL_CRT_PWD}"
      # Call the Jenkins entrypoint
      exec /usr/bin/tini -- /usr/local/bin/jenkins.sh

      Attachments: jenkins.fail.1.PNG (6 kB), jenkins.fail.2.PNG (11 kB), plugins.txt (6 kB)

          [JENKINS-72087] Hangs of executors

          Mark Waite added a comment -

          xjjx thanks for reporting an issue. It may help others when they see your description of the issue. Unfortunately, your description is not detailed enough to persuade others to investigate the issue. Please provide the information described in "How to report an issue" in hopes that others will try to duplicate the issue.

          Pawel Xj added a comment -

          Thanks for the answer. I have tried to extend the description. The biggest problem is that I can't provide steps to reproduce, because the problem occurs suddenly. It is most probably related to the number of builds or to how long the instance has been running without a restart (last time it was 2 weeks). There is nothing useful in the logs, so the question is: what kind of data can I dump when the issue occurs again?

          sabarinath added a comment -

          We are seeing a similar issue: most of the executors are stuck on zombie jobs, which go away only after a master restart. We have to do this exercise every other week.

          Pawel Xj added a comment -

          I've written a simple Groovy script to detect this issue.

          Jenkins.get().getComputers().findAll { it.getName() }.each { c ->
                  c.getAllExecutors().each { e ->
                          def ce = e.getCurrentExecutable()
                          if (ce && ! ce.getParent().run()) {
                                  println(c.getDisplayName() + '\tHere is bad (null)')
                          }
                  }
          }

          What is strange is that in the normal situation ce.getParent().run() returns the same value as ce.getOwnerExecutable(), but for hanging executors getOwnerExecutable() returns the name of a build that has already finished, while run() returns null.

           

          Lionel added a comment -

          I have the same issue on Windows nodes that run msbuild jobs exclusively.

          Other Windows nodes connected to the same Jenkins instance, running simpler jobs, work fine without any zombie jobs blocking the executors.

          Sadly, I have seen nothing interesting in the node logs.

          Pawel Xj added a comment -

          I have finally found the cause of this issue. Because we need to reboot our nodes from time to time, we have a Groovy script that first marks the node temporarily offline, then disconnects it, reboots it, and connects it again. The problem was that for the connect step we were using this method:

          https://javadoc.jenkins-ci.org/hudson/model/Computer.html#connect(boolean)
          with forceReconnect set to false, like this:

           

          Jenkins.get().getComputer(nodeName).connect(false) 

          It seems that some threads weren't cleaned up, and after a few weeks Jenkins became unstable.

           

          Once we changed forceReconnect to "true", everything worked fine.
          I'll keep this issue open because there is probably some underlying issue to solve.
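
          For readers with a similar maintenance script, the sequence described above might look roughly like the following script-console sketch. It is only an illustration under assumptions: the node name 'agent-01', the offline-cause message, and the placement of the reboot step are hypothetical, and the actual script used in this report was not posted. Per the Computer.connect(boolean) Javadoc, forceReconnect = true cancels a connect activity that is already in progress and starts a new one, while false just returns the pending one.

```groovy
import hudson.slaves.OfflineCause
import jenkins.model.Jenkins

// Hypothetical node name; replace with a real agent name.
def nodeName = 'agent-01'
def computer = Jenkins.get().getComputer(nodeName)
if (computer == null) {
    println("No such node: ${nodeName}")
    return
}

// Hypothetical cause message, shown in the node's offline status.
def cause = new OfflineCause.ByCLI('maintenance reboot')

// 1. Stop scheduling new builds on the node.
computer.setTemporarilyOffline(true, cause)

// 2. Disconnect the agent and wait for the disconnect to complete.
computer.disconnect(cause).get()

// 3. Reboot the machine here (site-specific step, not shown in this thread).

// 4. Reconnect, cancelling any connect activity that is still pending.
computer.connect(true).get()

// 5. Accept builds again.
computer.setTemporarilyOffline(false, null)
```

          The blocking get() calls keep the steps strictly ordered; a real script would probably add timeouts and wait for running builds to finish before disconnecting.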

            Unassigned Unassigned
            xjjx Pawel Xj