Core - Thread spikes in Computer.threadPoolForRemoting leading to eventual server OOM

    • Type: Epic
    • Resolution: Unresolved
    • Priority: Major

      This issue has been converted to EPIC, because there are reports of various independent issues inside.

      Issue:

      • Remoting threadPool is being widely used in Jenkins: https://github.com/search?q=org%3Ajenkinsci+threadPoolForRemoting&type=Code
      • Not all usages of Computer.threadPoolForRemoting are valid to begin with
      • Computer.threadPoolForRemoting has downscaling logic: threads get killed after a 60-second timeout
      • The pool has no thread limit by default, so it may grow without bound until the number of threads kills the JVM or causes an OOM
      • Some Jenkins use cases cause bursts of Computer.threadPoolForRemoting load by design (e.g. Jenkins startup, or agent reconnection after an issue)
      • Deadlocks or waits in the thread pool may also make it grow without bound
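
The growth pattern described in the bullets above can be reproduced in isolation. The following is a minimal sketch, assuming the pool behaves like a standard Java cached thread pool (no maximum size, 60-second keep-alive); the class and method names are hypothetical and this is not Jenkins code. Every task blocks, so no thread ever becomes idle and each submission creates a new thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; it only illustrates the failure mode, not Jenkins internals.
final class UnboundedPoolGrowthDemo {

    /** Submits blocking tasks to an unbounded pool and returns the peak thread count. */
    static int peakThreadsForBlockedTasks(int tasks) {
        // Same shape as Executors.newCachedThreadPool(): no core threads,
        // no maximum, idle threads retire after 60 seconds.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS, new SynchronousQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // stands in for a handler that never returns
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // No task can finish, so no thread is idle and every submission added a thread.
            return pool.getLargestPoolSize();
        } finally {
            release.countDown(); // let the demo threads exit
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("peak threads: " + peakThreadsForBlockedTasks(50));
    }
}
```

With 50 blocked tasks the pool ends up holding 50 threads. The 60-second downscaling never helps, because it only applies to threads that have gone idle.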

      Proposed fixes:

      • Define usage policy for this thread pool in the documentation
      • Limit the number of threads created according to the system scale, and make the limit configurable (256 by default?)
      • Fix the most significant issues where the thread pool gets misused or blocked
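
One possible shape for the proposed thread limit is sketched below. It is an illustration under stated assumptions, not the fix adopted in Jenkins: the system property name is hypothetical, the default of 256 is taken from the proposal above, and queueing excess tasks is only one of several possible overflow policies.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a bounded replacement pool; not an existing Jenkins class.
final class BoundedRemotingPoolSketch {
    // Hypothetical property name, not an existing Jenkins setting.
    static final String MAX_THREADS_PROPERTY = "example.threadPoolForRemoting.maxThreads";
    static final int DEFAULT_MAX_THREADS = 256; // default suggested in the proposal

    /** Reads the configured limit, falling back to the default and never going below 1. */
    static int configuredMaxThreads() {
        return Math.max(1, Integer.getInteger(MAX_THREADS_PROPERTY, DEFAULT_MAX_THREADS));
    }

    /** Creates a pool capped at maxThreads; extra tasks wait in an unbounded queue. */
    static ThreadPoolExecutor newBoundedPool(int maxThreads) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        pool.allowCoreThreadTimeOut(true); // keep the 60-second downscaling behaviour
        return pool;
    }

    /** Submits blocking tasks and returns the peak thread count, which stays at the cap. */
    static int peakThreadsForBlockedTasks(int maxThreads, int tasks) {
        ThreadPoolExecutor pool = newBoundedPool(maxThreads);
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // simulated hung task
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getLargestPoolSize();
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("limit: " + configuredMaxThreads());
        System.out.println("peak threads: " + peakThreadsForBlockedTasks(8, 50));
    }
}
```

A cap trades one failure mode for another: if queued tasks are needed to unblock running ones, a bounded pool can stall instead of running out of memory. That is why the usage policy and the fixes for blocked callers are listed alongside the limit.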
         
        Original report (tracked as JENKINS-47012):

      > After some period of time the Jenkins master will have up to ten thousand or so threads, most of which are Computer.threadPoolForRemoting threads that have leaked. This forces us to restart the Jenkins master.

      > We do add and delete slave nodes frequently (thousands per day per master) which I think may be part of the problem.

      > I thought https://github.com/jenkinsci/ssh-slaves-plugin/commit/b5f26ae3c685496ba942a7c18fc9659167293e43 may be the fix because stacktraces indicated threads are hanging in the plugin's afterDisconnect() method. I have updated half of our Jenkins masters to ssh-slaves plugin version 1.9, which includes that change, but earlier today we had a master with the ssh-slaves plugin fall over from this issue.

      > Unfortunately I don't have any stacktraces handy (we had to force reboot the master today), but will update this bug if we get another case of this problem. Hoping that by filing it with as much info as I can, we can at least start to diagnose the problem.

        Attachments:

        1. jenkins_watchdog_report.txt (267 kB)
        2. jenkins_watchdog.sh (2 kB)
        3. Jenkins_Dump_2017-06-12-10-52.zip (1.58 MB)
        4. support_2016-06-29_13.17.36 (2).zip (3.90 MB)
        5. thread-dump.txt (5.48 MB)
        6. file-leak-detector.log (41 kB)
        7. 20150904-jenkins03.txt (2.08 MB)
        8. support_2015-08-04_14.10.32.zip (2.17 MB)
        9. jenkins02-thread-dump.txt (1.49 MB)

          [JENKINS-27514] Core - Thread spikes in Computer.threadPoolForRemoting leading to eventual server OOM

          Clark Boylan created issue -

          Daniel Beck added a comment -

          > We do add and delete slave nodes frequently (thousands per day per master) which I think may be part of the problem.

          Real slaves, or cloud slaves?

          > Hoping that by filing it with as much info as I can

          Install the Support Core Plugin and attach a support bundle to this issue.

          Clark Boylan added a comment -

          What is the difference between real slaves and cloud slaves? These are cloud VMs added to and removed from the Jenkins master using the Jenkins API via python-jenkins, http://python-jenkins.readthedocs.org/en/latest/

          If the opportunity arises I can try to get the support plugin installed. We do have the melody plugin installed, though, so we can get a thread dump from that if/when this happens again.

          Daniel Beck added a comment -

          > What is the difference between real slaves and cloud slaves?

          Cloud slaves are designed to be added and removed all the time, and managed by Jenkins. See some of the plugins on https://wiki.jenkins-ci.org/display/JENKINS/Plugins#Plugins-Slavelaunchersandcontrollers

          Clark Boylan added a comment -

          This is a thread dump via melody and the monitoring plugin for a master that has entered this state. This master is using ssh-slaves plugin version 1.9.

          Clark Boylan made changes -
          Attachment New: jenkins02-thread-dump.txt [ 28784 ]

          Clark Boylan added a comment -

          From the attached thread dump you can see that there are more than 2000 Computer.threadPoolForRemoting threads. All but 3 are stuck at hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1187), which should be https://github.com/jenkinsci/ssh-slaves-plugin/blob/ssh-slaves-1.9/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L1187. I have no idea why they would block on that line, but it probably has something to do with the implementation of connection?
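
A count like the one above can also be taken from inside the running JVM rather than from a saved dump. The sketch below is a hypothetical helper, not part of Jenkins or any plugin; it assumes the pool's threads can be recognised by a name prefix such as "Computer.threadPoolForRemoting" and groups them by thread state.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical diagnostic helper; it would have to run inside the JVM being inspected.
final class ThreadPrefixCounter {

    /** Counts live threads whose name starts with the given prefix, grouped by state. */
    static Map<Thread.State, Integer> countByState(String namePrefix) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // getAllStackTraces() returns a snapshot keyed by every live thread.
        for (Thread thread : Thread.getAllStackTraces().keySet()) {
            if (thread.getName().startsWith(namePrefix)) {
                counts.merge(thread.getState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The prefix is an assumption based on the thread names quoted in this issue.
        System.out.println(countByState("Computer.threadPoolForRemoting"));
    }
}
```

A large BLOCKED count next to a single WAITING or RUNNABLE thread is the pattern that points at one stuck lock holder.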

          Clark Boylan added a comment -

          Actually, now that I think about it more, afterDisconnect is synchronized, which means they are all likely blocked because a single thread is stuck somewhere else in that method. I find this thread to be the only one in afterDisconnect that is not BLOCKED on line 1187:

          "Channel reader thread: bare-trusty-rax-iad-1330909" prio=5 WAITING
          java.lang.Object.wait(Native Method)
          java.lang.Object.wait(Object.java:503)
          com.trilead.ssh2.channel.FifoBuffer.read(FifoBuffer.java:212)
          com.trilead.ssh2.channel.Channel$Output.read(Channel.java:127)
          com.trilead.ssh2.channel.ChannelManager.getChannelData(ChannelManager.java:946)
          com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:58)
          com.trilead.ssh2.SFTPv3Client.readBytes(SFTPv3Client.java:215)
          com.trilead.ssh2.SFTPv3Client.receiveMessage(SFTPv3Client.java:240)
          com.trilead.ssh2.SFTPv3Client.init(SFTPv3Client.java:864)
          com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:108)
          com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
          hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1213)
          hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:443)
          hudson.remoting.Channel.terminate(Channel.java:822)
          hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)

          Which leads me to two questions: 1. Why would the above hang indefinitely? 2. Does this method really need to be synchronized? Could we instead synchronize only around access to the data that actually needs it?
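
The second question can be illustrated with a small self-contained sketch. This is not the ssh-slaves plugin's code: the class, methods, and the simulated hanging cleanup are all hypothetical, and whether the real SFTP cleanup can safely run outside the lock is an open assumption. It only shows that when a slow call runs inside a coarse lock, one hung caller blocks every other caller, while guarding just the shared state lets the others proceed.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch contrasting a coarse lock with a narrow one.
final class NarrowLockSketch {
    private final Object lock = new Object();
    private int completed; // shared state that actually needs the lock

    /** Coarse variant: the slow cleanup runs while the lock is held. */
    void coarseCleanup(Runnable slowCleanup) {
        synchronized (lock) {
            slowCleanup.run();
            completed++;
        }
    }

    /** Narrow variant: only the shared-state update is guarded. */
    void narrowCleanup(Runnable slowCleanup) {
        slowCleanup.run(); // may hang, but holds no lock while doing so
        synchronized (lock) {
            completed++;
        }
    }

    /** Returns true if a second caller finishes while the first caller's cleanup hangs. */
    static boolean secondCallerFinishes(boolean narrow) {
        NarrowLockSketch sketch = new NarrowLockSketch();
        CountDownLatch firstIsInside = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Runnable hang = () -> {
            firstIsInside.countDown();
            try {
                release.await(); // simulated cleanup that never completes on its own
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Runnable quick = () -> { };
        Thread first = new Thread(() -> run(sketch, narrow, hang), "first-caller");
        Thread second = new Thread(() -> run(sketch, narrow, quick), "second-caller");
        try {
            first.start();
            firstIsInside.await(); // the first caller is now stuck in its cleanup
            second.start();
            second.join(1000);     // give the second caller up to one second
            return !second.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown();   // unblock both threads so the demo can exit
        }
    }

    private static void run(NarrowLockSketch sketch, boolean narrow, Runnable cleanup) {
        if (narrow) {
            sketch.narrowCleanup(cleanup);
        } else {
            sketch.coarseCleanup(cleanup);
        }
    }

    public static void main(String[] args) {
        System.out.println("coarse lock, second caller finishes: " + secondCallerFinishes(false));
        System.out.println("narrow lock, second caller finishes: " + secondCallerFinishes(true));
    }
}
```

With the coarse lock the second caller stays blocked for as long as the first one hangs, which matches the pile-up of threads on line 1187. Narrowing the lock does not answer the first question, though: the hung caller itself would still need a timeout or a fix for whatever it is waiting on.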

          Sagi Sinai-Glazer added a comment -

          Same here... see attached Support Bundle (support_2015-08-04_14.10.32.zip) for more details.
          Also seems to be related to JENKINS-23560 and JENKINS-26769.

          Sagi Sinai-Glazer made changes -
          Attachment New: support_2015-08-04_14.10.32.zip [ 30422 ]

            Assignee: Unassigned
            Reporter: Clark Boylan (cboylan)
            Votes: 13
            Watchers: 29