JENKINS-27514: Core - Thread spikes in Computer.threadPoolForRemoting leading to eventual server OOM

    • Type: Epic
    • Resolution: Unresolved
    • Priority: Major
    • Epic Name: Core - Thread spikes in Computer.threadPoolForRemoting

      This issue has been converted to an EPIC because there are reports of various independent issues inside it.

      Issue:

      • Remoting threadPool is being widely used in Jenkins: https://github.com/search?q=org%3Ajenkinsci+threadPoolForRemoting&type=Code
      • For starters, not all usages of Computer.threadPoolForRemoting are valid
      • Computer.threadPoolForRemoting has downscaling logic: idle threads are killed after a 60-second timeout
      • The pool has no thread limit by default, so it may grow without bound until the number of threads kills the JVM or causes an OOM
      • Some Jenkins use-cases cause burst Computer.threadPoolForRemoting load by design (e.g. Jenkins startup or agent reconnection after a failure)
      • Deadlocks or long waits in the thread pool may also make it grow without bound

      Proposed fixes:

      • Define usage policy for this thread pool in the documentation
      • Limit the number of threads created depending on the system scale, and make the limit configurable (256 by default?); a rough sketch follows this list
      • Fix the most significant issues where the thread pool gets misused or blocked
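
        A minimal sketch of what a bounded but still downscaling replacement could look like. The property name and the 256 default are illustrative only; nothing here is decided in the Jenkins core:

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.LinkedBlockingQueue;
        import java.util.concurrent.ThreadFactory;
        import java.util.concurrent.ThreadPoolExecutor;
        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.atomic.AtomicInteger;

        public class BoundedRemotingPool {
            // Hypothetical property name, used here only for illustration
            private static final int MAX_THREADS =
                    Integer.getInteger("hudson.model.Computer.threadPoolMaxSize", 256);

            public static ExecutorService create() {
                ThreadPoolExecutor executor = new ThreadPoolExecutor(
                        MAX_THREADS, MAX_THREADS,            // hard cap on the number of threads
                        60L, TimeUnit.SECONDS,               // idle threads still time out after 60s...
                        new LinkedBlockingQueue<Runnable>(), // ...and excess tasks queue up instead of spawning threads
                        new NamingThreadFactory());
                executor.allowCoreThreadTimeOut(true);       // keep the downscaling behavior of the cached pool
                return executor;
            }

            // Minimal daemon-thread factory mimicking the "Computer.threadPoolForRemoting" naming
            private static class NamingThreadFactory implements ThreadFactory {
                private final AtomicInteger counter = new AtomicInteger();
                @Override
                public Thread newThread(Runnable r) {
                    Thread t = new Thread(r, "Computer.threadPoolForRemoting [#" + counter.incrementAndGet() + "]");
                    t.setDaemon(true);
                    return t;
                }
            }
        }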
         
        Original report (tracked as JENKINS-47012):

      > After some period of time the Jenkins master will have up to ten thousand or so threads, most of which are Computer.threadPoolForRemoting threads that have leaked. This forces us to restart the Jenkins master.

      > We do add and delete slave nodes frequently (thousands per day per master) which I think may be part of the problem.

      > I thought https://github.com/jenkinsci/ssh-slaves-plugin/commit/b5f26ae3c685496ba942a7c18fc9659167293e43 might be the fix because stack traces indicated threads are hanging in the plugin's afterDisconnect() method. I have updated half of our Jenkins masters to ssh-slaves plugin version 1.9, which includes that change, but earlier today we had a master with the ssh-slaves plugin fall over from this issue.

      > Unfortunately I don't have any stacktraces handy (we had to force reboot the master today), but will update this bug if we get another case of this problem. Hoping that by filing it with as much info as I can we can at least start to diagnose the problem.

        1. 20150904-jenkins03.txt
          2.08 MB
        2. file-leak-detector.log
          41 kB
        3. Jenkins_Dump_2017-06-12-10-52.zip
          1.58 MB
        4. jenkins_watchdog_report.txt
          267 kB
        5. jenkins_watchdog.sh
          2 kB
        6. jenkins02-thread-dump.txt
          1.49 MB
        7. support_2015-08-04_14.10.32.zip
          2.17 MB
        8. support_2016-06-29_13.17.36 (2).zip
          3.90 MB
        9. thread-dump.txt
          5.48 MB

          [JENKINS-27514] Core - Thread spikes in Computer.threadPoolForRemoting leading to eventual server OOM

          Oleg Nenashev added a comment -

          nurupo Yes, the Digital Ocean issue seems to be unrelated to the one originally reported here. Please create a separate ticket for that plugin.


          nurupo added a comment -

          Are you sure that the original author and I are not experiencing the same issue? It might be an issue in Jenkins itself that we are experiencing, not an issue in the SSH Launcher plugin. When I googled the issue of too many "Computer.threadPoolForRemoting" threads, this is the bug report that came up, and it matches my circumstances quite closely: lots of slaves being frequently created and destroyed results in many "Computer.threadPoolForRemoting" threads being created and locked, which eventually exhausts the machine's/JVM's resources and kills Jenkins.


          Oleg Nenashev added a comment -

          > Are you sure that both the original author and me are not experiencing the same issue?

          Well, the Digital Ocean issue requires extra investigation by the plugin maintainer. This one was originally reported against the SSH Slaves Plugin, and the deadlock cause definitely comes from the SSH Launcher implementation.

          Even if the issues share a common part (Core API flaws), the fixes will be different.

           


          Micheal Waltz added a comment - edited

          oleg_nenashev do you think that this issue may have been addressed in any of the recent Remoting Updates?

          https://github.com/jenkinsci/remoting/blob/master/CHANGELOG.md#311

          We're running Jenkins 2.74 with 8 agents using the Swarm agent v3.4, and have to reboot our Jenkins master once a week when CPU gets too high.

          It seems to be related to the hundreds of Computer.threadPoolForRemoting messages that pile up when viewing the Jenkins monitor. Here's a screenshot of what the Jenkins Monitoring Threads page shows:

          https://keybase.pub/ecliptik/jenkins/Screen%20Shot%202017-08-28%20at%2012.58.54%20PM.png


          Oleg Nenashev added a comment -

          ecliptik I checked the implementation in the Jenkins core, and it appears that the thread pool may still be growing without a limit on the master side. Implementation in Executors:

           

          /**
           * Creates a thread pool that creates new threads as needed, but
           * will reuse previously constructed threads when they are
           * available, and uses the provided
           * ThreadFactory to create new threads when needed.
           * @param threadFactory the factory to use when creating new threads
           * @return the newly created thread pool
           * @throws NullPointerException if threadFactory is null
           */
          public static ExecutorService newCachedThreadPool(ThreadFactory threadFactory) {
              return new ThreadPoolExecutor(0, Integer.MAX_VALUE,
                                            60L, TimeUnit.SECONDS,
                                            new SynchronousQueue<Runnable>(),
                                            threadFactory);
          }

           

          Computer.threadPoolForRemoting is being used in multiple places, and IMHO we should at least make the limit configurable. Maybe we should also set a reasonable default limit. I will check if there is a ThreadPoolExecutor variant with downscale support.
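
          For illustration, a minimal standalone demo (not Jenkins code; the class name and numbers are made up) of why this implementation grows by one thread per task during a burst of blocked tasks: the SynchronousQueue never holds work, so any task that cannot be handed off to an idle thread forces a new one.

          import java.util.concurrent.CountDownLatch;
          import java.util.concurrent.Executors;
          import java.util.concurrent.ThreadPoolExecutor;
          import java.util.concurrent.TimeUnit;

          public class CachedPoolBurstDemo {
              public static void main(String[] args) throws Exception {
                  // newCachedThreadPool() returns a ThreadPoolExecutor, so the cast is safe here
                  ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newCachedThreadPool();
                  final CountDownLatch release = new CountDownLatch(1);

                  // Simulate a burst of tasks that all block (e.g. hung afterDisconnect() calls)
                  int burst = 500;
                  for (int i = 0; i < burst; i++) {
                      pool.submit(new Runnable() {
                          public void run() {
                              try {
                                  release.await(); // every task blocks, so no thread becomes reusable
                              } catch (InterruptedException e) {
                                  Thread.currentThread().interrupt();
                              }
                          }
                      });
                  }
                  System.out.println("Pool size after burst: " + pool.getPoolSize()); // 500 threads, one per task
                  release.countDown();
                  pool.shutdown();
                  pool.awaitTermination(1, TimeUnit.MINUTES);
              }
          }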


          Oleg Nenashev added a comment -

          Actually I am wrong. The cached thread pool implementation should be able to terminate threads if they are unused for more than 60 seconds. Probably the threads are being created so intensively that the executor service goes crazy. I will see if I can create logging for that, at least.
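
          One way such logging could look, sketched as a wrapper around the existing thread factory that warns when the number of live worker threads crosses a threshold (the threshold and logger are arbitrary, purely for illustration):

          import java.util.concurrent.ThreadFactory;
          import java.util.concurrent.atomic.AtomicInteger;
          import java.util.logging.Level;
          import java.util.logging.Logger;

          public class LoggingThreadFactory implements ThreadFactory {
              private static final Logger LOGGER = Logger.getLogger(LoggingThreadFactory.class.getName());
              private static final int WARN_THRESHOLD = 100; // arbitrary threshold for illustration

              private final ThreadFactory delegate;
              private final AtomicInteger liveThreads = new AtomicInteger();

              public LoggingThreadFactory(ThreadFactory delegate) {
                  this.delegate = delegate;
              }

              @Override
              public Thread newThread(final Runnable worker) {
                  return delegate.newThread(new Runnable() {
                      @Override
                      public void run() {
                          int count = liveThreads.incrementAndGet();
                          if (count > WARN_THRESHOLD) {
                              LOGGER.log(Level.WARNING, "Remoting pool has {0} live threads", count);
                          }
                          try {
                              worker.run(); // the pool's worker loop; exits when the thread is reclaimed
                          } finally {
                              liveThreads.decrementAndGet();
                          }
                      }
                  });
              }
          }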


          Kanstantsin Shautsou added a comment - edited

          stephenconnolly https://issues.jenkins-ci.org/browse/JENKINS-27514?focusedCommentId=306267&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-306267 ? I remember that connect/disconnect has always had issues in the UI, i.e. you try to disconnect, but it keeps trying to reconnect indefinitely. Maybe related?


          Micheal Waltz added a comment -

          This still appears to be an issue, but I figured out what was triggering high CPU and thousands of Computer.threadPoolForRemoting threads with our setup.

          Architecture:

          • 1 Ubuntu 16.04 Master running Jenkins v2.76
          • 6 Ubuntu 16.04 Agents via swarm plugin v3.4
          • Agents are connected to the master via an ELB since they are in multiple AWS regions

          There were a few jobs that used containers to mount a reports volume within the job workspace. The container would generate reports as root:root and they would appear within $WORKSPACE with these same permissions. The Jenkins agent runs as user jenkins and couldn't remove these files when it tried to clean up $WORKSPACE after each run.

          jenkins@ip-10-0-0-5:~/workspace/automated-matador-pull_requests_ws-cleanup_1504807099908$ ls -l coverage/
          total 6352
          drwxr-xr-x 3 root root    4096 Sep  7 15:51 assets
          -rw-r--r-- 1 root root 6498213 Sep  7 15:51 index.htm
          

          The jobs that wrote these reports ran regularly, on every push and pull request to a repository, causing the files to build up quickly. On the master, thousands of files named atomic*.tmp would start to appear in /var/lib/jenkins

          ubuntu@jenkins:/var/lib/jenkins$ ls atomic*.tmp | wc -l
          6521
          

          and each file would contain hundreds of lines like,

          <detailMessage>Unable to delete &apos;/var/lib/jenkins/workspace/automated-matador-develop-build-on-push_ws-cleanup_1504728487305/coverage/.resultset.json.lock&apos;. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.</detailMessage>
          

          Eventually the Computer.threadPoolForRemoting errors would reach into the thousands and the master CPU would hit 100%. A reboot would temporarily fix it, but CPU would jump again until all the /var/lib/jenkins/atomic*.tmp files were removed.

          We resolved the issue by doing a chown jenkins:jenkins on the report directories created by the containers in a job, so there are no longer "Unable to delete" errors or atomic*.tmp files created. We haven't seen a CPU or Computer.threadPoolForRemoting spike in the two weeks since doing this.

          Hopefully this helps anyone else who may be experiencing this issue, and maybe provides some guidance on its root cause.


          Oleg Nenashev added a comment -

          ecliptik Ideally I need a stack trace to confirm what causes it, but I am pretty sure it happens due to the workspace cleanup. Jenkins-initiated workspace cleanup happens in bursts and it definitely uses the Remoting thread pool, so it may cause such behavior.

          Regarding this ticket, I am going to convert it to an EPIC. The Remoting thread pool is a shared resource in the system, and it may be consumed by various things. I ask everybody to re-report their cases under the EPIC.


          Oleg Nenashev added a comment -

          The original ticket has been cross-posted as JENKINS-47012. In this EPIC I will be handling only issues related to the Jenkins core, the SSH Slaves Plugin, and Remoting. Issues related to Remoting thread pool usage and misuse by other plugins are separate.


            Assignee: Unassigned
            Reporter: Clark Boylan (cboylan)
            Votes: 13
            Watchers: 29