JENKINS-27514

Core - Thread spikes in Computer.threadPoolForRemoting leading to eventual server OOM


    Details

    • Type: Epic
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Epic Name:
      Core - Thread spikes in Computer.threadPoolForRemoting

      Description

      This issue has been converted to an Epic, because it contains reports of several independent issues.

      Issue:

      • Remoting threadPool is being widely used in Jenkins: https://github.com/search?q=org%3Ajenkinsci+threadPoolForRemoting&type=Code
      • For starters, not all usages of Computer.threadPoolForRemoting are valid
      • Computer.threadPoolForRemoting has downscaling logic: idle threads are terminated after a 60-second timeout
      • The pool has no thread limit by default, so it may grow without bound until the number of threads kills the JVM or causes an OOM
      • Some Jenkins use-cases cause burst load on Computer.threadPoolForRemoting by design (e.g. Jenkins startup or mass agent reconnection after a failure)
      • Deadlocks or long waits in the thread pool may also make it grow without bound

      Proposed fixes:

      • Define usage policy for this thread pool in the documentation
      • Limit the number of threads created depending on system scale and make the limit configurable (256 by default?); a minimal sketch follows this list
      • Fix the most significant issues where the thread pool gets misused or blocked
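
      A minimal sketch of the bounded-pool idea above, assuming a hypothetical system property name (jenkins.remoting.maxPoolSize) and the 256-thread default suggested here; this is not the actual Jenkins implementation:

          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.LinkedBlockingQueue;
          import java.util.concurrent.ThreadFactory;
          import java.util.concurrent.ThreadPoolExecutor;
          import java.util.concurrent.TimeUnit;

          class BoundedRemotingPoolSketch {
              static ExecutorService create(ThreadFactory threadFactory) {
                  // Hypothetical property name; the 256 default mirrors the proposal above.
                  int maxThreads = Integer.getInteger("jenkins.remoting.maxPoolSize", 256);
                  ThreadPoolExecutor executor = new ThreadPoolExecutor(
                          maxThreads, maxThreads,              // never more than maxThreads workers
                          60L, TimeUnit.SECONDS,               // keep the existing 60-second idle timeout
                          new LinkedBlockingQueue<Runnable>(), // queue excess work instead of spawning threads
                          threadFactory);
                  executor.allowCoreThreadTimeOut(true);       // let idle threads terminate so the pool can shrink
                  return executor;
              }
          }

      With a bounded pool, bursts of connect/disconnect work queue up instead of creating thousands of threads, at the cost of some added latency during the burst.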
         
        Original report (tracked as JENKINS-47012):

      > After some period of time the Jenkins master will have up to ten thousand or so threads, most of which are Computer.threadPoolForRemoting threads that have leaked. This forces us to restart the Jenkins master.

      > We do add and delete slave nodes frequently (thousands per day per master) which I think may be part of the problem.

      > I thought https://github.com/jenkinsci/ssh-slaves-plugin/commit/b5f26ae3c685496ba942a7c18fc9659167293e43 may be the fix because stacktraces indicated threads are hanging in the plugin's afterDisconnect() method. I have updated half of our Jenkins masters to ssh-slaves plugin version 1.9 which includes that change, but early today we had a master with ssh-slaves plugin fall over from this issue.

      > Unfortunately I don't have any stacktraces handy (we had to force reboot the master today), but will update this bug if we get another case of this problem. Hoping that by filing it with as much info as I can we can at least start to diagnose the problem.

        Attachments

        1. 20150904-jenkins03.txt
          2.08 MB
        2. file-leak-detector.log
          41 kB
        3. Jenkins_Dump_2017-06-12-10-52.zip
          1.58 MB
        4. jenkins_watchdog_report.txt
          267 kB
        5. jenkins_watchdog.sh
          2 kB
        6. jenkins02-thread-dump.txt
          1.49 MB
        7. support_2015-08-04_14.10.32.zip
          2.17 MB
        8. support_2016-06-29_13.17.36 (2).zip
          3.90 MB
        9. thread-dump.txt
          5.48 MB

          Issue Links

            Activity

            danielbeck Daniel Beck added a comment -

            > We do add and delete slave nodes frequently (thousands per day per master) which I think may be part of the problem.

            Real slaves, or cloud slaves?

            > Hoping that by filing it with as much info as I can

            Install the Support Core Plugin and attach a support bundle to this issue.

            cboylan Clark Boylan added a comment -

            What is the difference between real slaves and cloud slaves? These are cloud VMs added to and removed from the Jenkins master using the Jenkins API via python-jenkins, http://python-jenkins.readthedocs.org/en/latest/

            If the opportunity arises I can try to get the support plugin installed. We do have the melody plugin installed though, so we can get a thread dump from that if/when this happens again.

            danielbeck Daniel Beck added a comment -

            > What is the difference between real slaves and cloud slaves?

            Cloud slaves are designed to be added and removed all the time, and managed by Jenkins. See some of the plugins on https://wiki.jenkins-ci.org/display/JENKINS/Plugins#Plugins-Slavelaunchersandcontrollers

            cboylan Clark Boylan added a comment -

            This is a thread dump via melody and the monitoring plugin for a master that has entered this state. This master is using ssh-slaves plugin version 1.9.

            cboylan Clark Boylan added a comment -

            From the attached thread dump you can see that there are >2000 Computer.threadPoolForRemoting threads. All but 3 are stuck on hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1187), which should be https://github.com/jenkinsci/ssh-slaves-plugin/blob/ssh-slaves-1.9/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L1187. I have no idea why they would block on that line, but it's probably something to do with the implementation of the connection?

            cboylan Clark Boylan added a comment -

            Actually, now that I think about it more: afterDisconnect is synchronized, which means they are all likely blocking because a single thread is blocked somewhere else in that method. This is the only thread I can find in afterDisconnect that is not BLOCKED on line 1187:

            "Channel reader thread: bare-trusty-rax-iad-1330909" prio=5 WAITING
            java.lang.Object.wait(Native Method)
            java.lang.Object.wait(Object.java:503)
            com.trilead.ssh2.channel.FifoBuffer.read(FifoBuffer.java:212)
            com.trilead.ssh2.channel.Channel$Output.read(Channel.java:127)
            com.trilead.ssh2.channel.ChannelManager.getChannelData(ChannelManager.java:946)
            com.trilead.ssh2.channel.ChannelInputStream.read(ChannelInputStream.java:58)
            com.trilead.ssh2.SFTPv3Client.readBytes(SFTPv3Client.java:215)
            com.trilead.ssh2.SFTPv3Client.receiveMessage(SFTPv3Client.java:240)
            com.trilead.ssh2.SFTPv3Client.init(SFTPv3Client.java:864)
            com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:108)
            com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
            hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1213)
            hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:443)
            hudson.remoting.Channel.terminate(Channel.java:822)
            hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)

            This leads me to two questions: 1. Why would the above hang indefinitely? 2. Does this method really need to be synchronized, or could we synchronize only around access to the data that actually needs it?
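
            A minimal sketch of the narrower-locking idea raised in the question above, assuming an illustrative stand-in for the SSH connection object; this is not the actual SSHLauncher code:

                import java.io.Closeable;
                import java.io.IOException;

                class AfterDisconnectSketch {
                    private final Object connectionLock = new Object();
                    private Closeable connection; // stands in for the SSH connection held by the launcher

                    void afterDisconnect() {
                        Closeable conn;
                        synchronized (connectionLock) { // hold the lock only long enough to swap the field
                            conn = connection;
                            connection = null;
                        }
                        if (conn != null) {
                            try {
                                conn.close();           // potentially slow teardown I/O runs outside the lock
                            } catch (IOException ignored) {
                                // best-effort cleanup during disconnect
                            }
                        }
                    }
                }

            With this shape, one slow teardown no longer blocks every other thread entering afterDisconnect.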

            esinsag Sagi Sinai-Glazer added a comment -

            Same here... see attached Support Bundle (support_2015-08-04_14.10.32.zip) for more details.
            Also seems to be related to JENKINS-23560 and JENKINS-26769.

            stephenconnolly Stephen Connolly added a comment -

            SSHLauncher is supposed to have a separate instance for each and every Node. The Cloud implementations are supposed to be using SSHConnector.launch() to produce the concrete instance for connecting to the specific slave. If Cloud implementations are reusing a single shared instance then those Cloud implementations are broken (aside: all Cloud implementations are broken for other reasons, so there is nothing new in asserting that Cloud implementations are broken)

            cboylan Clark Boylan added a comment - - edited

            To clarify, we are not using any cloud plugins, so there is no single shared instance being broken here. We run an external process that boots nodes in clouds and then adds them as Jenkins slaves via the Jenkins API. When the job is done running on a slave the external process is notified and removes the slave from Jenkins. As a result we add and remove a large number of slaves, but all of this happens via the API talking to the ssh-slaves plugin. We do this because the cloud plugins do indeed have problems. But this means that the bug has to be somewhere in the ssh-slaves plugin.

            cboylan Clark Boylan added a comment - - edited

            This is a thread dump from a Jenkins master upgraded to ssh-slaves plugin 1.10 but otherwise the same as the previous thread dump example. We thought that 1.10 might have fixed the bug, but it doesn't appear to have done so. Attaching this thread dump in case the information is sufficiently new and different to be helpful.

            kravenscroft Kris Ravenscroft added a comment -

            Has any silent progress been made on this issue? I currently have to reboot weekly due to this issue. Currently running Jenkins 1.645 and SSH Plugin 1.10.

            stephenconnolly Stephen Connolly added a comment -

            Could you share the code that adds and removes slaves?

            cboylan Clark Boylan added a comment -

            It runs as an external process that talks to the Jenkins REST API using the python-jenkins library. The code for adds is at https://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/myjenkins.py#n79 and the code for removes is at https://git.openstack.org/cgit/openstack/python-jenkins/tree/jenkins/__init__.py#n1120

            kravenscroft Kris Ravenscroft added a comment -

            My setup is slightly different from the original description. I do not do any substantial adding/deleting of slaves. I have narrowed my cause down to a substantial number of slaves (30+) being power cycled at the same time. I am using Jenkins to perform hardware tests, and one set of tests requires that each slave be rebooted ~200 times. I performed an experiment last night that seems to prevent my issue. My preventative measures are:

            1) Switch all slaves to use the Availability setting of "Keep this slave on-line as much as possible, but don't reconnect if temporarily marked offline by the user." This option is added with this plugin: https://github.com/daniel-beck/jenkins-keep-slave-disconnected-plugin

            2) Set the script that is power-cycling the slaves to mark the slave as offline using the Jenkins CLI jar file.

            For me these steps prevent Jenkins from trying to reconnect to each slave whenever it comes online again, so they stop the SSH plugin from connecting/disconnecting. I'm not sure this will help in your case of adding/deleting slaves, but I figured I'd throw it out there.

            casey1911 Stanislav Jursky added a comment - - edited

            Our setup is also different. No slave restarts, no dynamic adding of slaves. Just one slave running a Maven Sonar analysis (clean install sonar:sonar).
            The file leak detector plugin reports many maven-plugins with nearly the same stack trace.

            hudson.remoting.ResourceImageDirect.<init>(ResourceImageDirect.java:29)
            or
            hudson.remoting.RemoteClassLoader$ClassLoaderProxy.fetch3(RemoteClassLoader.java:810)

            see attached file-leak-detector.log

            dilipm79 Dilip Mahadevappa added a comment - - edited

            Jenkins ver. 1.625.2
            SSH Slaves plugin: 1.10
            Thread Dump attached: thread-dump.txt
            Support bundle attached: support_2016-06-29_13.17.36 (2).zip
            We have hit a similar issue. We had some 5k+ blocked Computer.threadPoolForRemoting threads.

            Computer.threadPoolForRemoting [#175318] - threadId:2655322 (0x28845a) - state:BLOCKED
            stackTrace:
            - waiting to lock <0x6e0864fe> (a hudson.plugins.sshslaves.SSHLauncher)
            owned by Computer.threadPoolForRemoting [#175203] id=2653481
            at hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1226)
            at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:603)
            at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
            at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
            at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
            at java.util.concurrent.FutureTask.run(FutureTask.java:166)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            at java.lang.Thread.run(Thread.java:724)
            Locked synchronizers: count = 1
            - java.util.concurrent.ThreadPoolExecutor$Worker@729cb958
            
            .
            .
            .
            
            makewhatis David Johansen added a comment -

            So, we were hitting this issue a while back, threads stacking until we had to restart. I tracked it down to some automation that I had written, using http://javadoc.jenkins-ci.org/hudson/model/Computer.html#connect(boolean). I was using connect(true) which is supposed to spawn a new thread if the agent is "disconnected", and cancel the current thread that was disconnected. It doesn't cancel it right away though.

            Anyway, when a host was not in a state where it could be connected, my automation was spawning those threads much more frequently than Jenkins was killing the old ones.

            I changed my call to connect(false) to not force reconnect if the agent was unavailable, and we stopped seeing the crazy thread stacking. Kinda sounds like what you are seeing, but either way figured I'd share what we did.
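
            For reference, a minimal sketch of the change described above, using the Computer.connect(boolean forceReconnect) API linked in this comment; the node name, the helper class, and the use of the current-core Jenkins.get() accessor are illustrative assumptions, not the commenter's actual automation:

                import hudson.model.Computer;
                import jenkins.model.Jenkins;

                class ReconnectSketch {
                    static void reconnectIfNeeded(String nodeName) {
                        Computer computer = Jenkins.get().getComputer(nodeName);
                        if (computer != null && computer.isOffline()) {
                            // forceReconnect=false: reuse any in-progress connection attempt
                            // instead of always spawning a new connect thread
                            computer.connect(false);
                        }
                    }
                }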

            oleg_nenashev Oleg Nenashev added a comment - - edited

            From what I see in the file leak detector output, it requires a fix on the Remoting side. The fix needs to be similar to JENKINS-37332. BTW, it seems to be a standalone issue.

            oleg_nenashev Oleg Nenashev added a comment -

            Created JENKINS-43142

            malsch Malte Schoepski added a comment -

            Hi there,

            We also have issues with our Jenkins and a lot of threadPoolForRemoting threads.

            Attached, you will find our diagnostics information:

            https://issues.jenkins-ci.org/secure/attachment/38439/Jenkins_Dump_2017-06-12-10-52.zip

            We use the following Jenkins release: 2.46.3

            Slaves are running with the following clouds:

            • Yet Another Docker Plugin [0.1.0-rc37] & SSH Slaves Plugin [1.17]
            • vSphere Cloud Plugin [2.15] & SSH Slaves Plugin [1.17] (Linux and Windows 2016)
            • vSphere Cloud Plugin [2.15] & Windows Slave plugin [1.3.1]

            Docker slaves are always freshly created; vSphere slaves just do a reboot and optionally a snapshot revert through vSphere.

            nurupo nurupo nurupo added a comment - - edited

            I have a similar issue, but I'm not sure it's the ssh-slaves-plugin that is at fault; it sounds more like something in Jenkins itself, since I use a different plugin. I use the Digital Ocean plugin, which creates and destroys slaves automatically whenever there are jobs in the queue, so a large number of slaves are created and destroyed, similar to what other people described. After starting Jenkins I noticed that it goes HTTP 504 after running for several weeks. When it does, there is very little free RAM left (but there is a lot of free swap space) and there are a lot of threads like this one:

            "Computer.threadPoolForRemoting [#971]" #8350 daemon prio=5 os_prio=0 tid=0x00007fb7f80dc800 nid=0x1a01 waiting on condition [0x00007fb7dacc6000]
               java.lang.Thread.State: TIMED_WAITING (parking)
            

            I have had this issue since September 2016 or so, shortly after I have started using the Digital Ocean plugin.

            As a workaround, I use a shell script run from cron every minute that restarts Jenkins when it's down (jenkins_watchdog.sh). The script restarts Jenkins if it returns HTTP 50X and the last restart was at least 15 minutes ago; the latter condition prevents continuous restarts, since large Jenkins setups might still return HTTP 50X for minutes after starting.

            The script also reports free memory, 500 lines of Jenkins logs, and a thread dump. Here is a report from that script when my Jenkins goes down (jenkins_watchdog_report.txt); notice the large number of Computer.threadPoolForRemoting threads. I'm running the stable LTS release, which is Jenkins ver. 2.60.1 at the moment. As I said before, I have been having this issue for almost a year now.

            oleg_nenashev Oleg Nenashev added a comment -

            So, this is the most concerning part from the original reports:

            Computer.threadPoolForRemoting [#175216] id=2653706 (0x287e0a) state=WAITING cpu=0%
                - waiting on <0x2d68e7af> (a java.util.concurrent.FutureTask$Sync)
                - locked <0x2d68e7af> (a java.util.concurrent.FutureTask$Sync)
                at sun.misc.Unsafe.park(Native Method)
                at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
                at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
                at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:248)
                at java.util.concurrent.FutureTask.get(FutureTask.java:111)
                at java.util.concurrent.AbstractExecutorService.invokeAll(AbstractExecutorService.java:243)
                at java.util.concurrent.Executors$DelegatedExecutorService.invokeAll(Executors.java:648)
                at hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:749)
                at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:238)
                at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
                at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
                at java.util.concurrent.FutureTask.run(FutureTask.java:166)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                at java.lang.Thread.run(Thread.java:724)
            
            Computer.threadPoolForRemoting [#250455] id=3842231 (0x3aa0b7) state=BLOCKED cpu=69%
                - waiting to lock <0x457d3d64> (a hudson.plugins.sshslaves.SSHLauncher)
                 owned by Computer.threadPoolForRemoting [#175224] id=2653714
                at hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1226)
                at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:603)
                at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
                at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
                at java.util.concurrent.FutureTask.run(FutureTask.java:166)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                at java.lang.Thread.run(Thread.java:724)
            
                Locked synchronizers: count = 1
                  - java.util.concurrent.ThreadPoolExecutor$Worker@5781e2cc
            

            The launch operation hangs indefinitely in results = executorService.invokeAll(callables). Hitting this code path means that "launchTimeout" is not specified in the launcher settings. I am not sure what causes the afterDisconnect calls; it seems Jenkins calls https://github.com/jenkinsci/jenkins/blob/23cb13aad0f5f5015ca68ca00ca57ba093fa57ec/core/src/main/java/hudson/slaves/SlaveComputer.java#L618-L630 to shut down the agent connection.

            So, there are several suggestions:

            • SSH Launcher should run with a timeout by default (see the sketch after this list)
            • afterDisconnect() handlers in SSHLauncher should not be synchronized
            • Jenkins core should invoke SlaveComputer#disconnect() with a timeout or to at least check if there are parallel disconnect commands
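
            A minimal sketch of the first suggestion, bounding the launch call that hangs in the trace above; the 210-second value and the helper class are illustrative assumptions, not an actual SSHLauncher default:

                import java.util.List;
                import java.util.concurrent.Callable;
                import java.util.concurrent.ExecutorService;
                import java.util.concurrent.Future;
                import java.util.concurrent.TimeUnit;

                class LaunchTimeoutSketch {
                    static <T> List<Future<T>> launchWithTimeout(ExecutorService executorService,
                                                                 List<Callable<T>> callables)
                            throws InterruptedException {
                        // invokeAll with a deadline: tasks still running when it expires are cancelled,
                        // so a hung connection attempt cannot pin a pool thread forever.
                        return executorService.invokeAll(callables, 210, TimeUnit.SECONDS);
                    }
                }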

             

            nurupo nurupo Do you also use SSHLauncher in the Digital Ocean plugin?

             

            nurupo nurupo nurupo added a comment - - edited

            Sorry, I'm not sure what the SSH Launcher plugin is. The Digital Ocean plugin does connect using SSH and uses the Jenkins SSH agent jar, but I'm not really sure whether it uses the SSH Launcher plugin for this or not. Here is the ssh connection part of the plugin. I see it uses `com.trilead.ssh2`, which seems to be an SSH library included with Jenkins.

            oleg_nenashev Oleg Nenashev added a comment -

            nurupo nurupo Yes, the Digital Ocean issue seems to be unrelated to the one originally reported here. Please create a separate ticket for that plugin.

            nurupo nurupo nurupo added a comment -

            Are you sure that the original author and I are not experiencing the same issue? It might be an issue in Jenkins itself that we are both hitting, not an issue in the SSH Launcher plugin. When I googled the issue of too many "Computer.threadPoolForRemoting" threads, this is the bug report that came up, and it matches my circumstances quite closely: lots of slaves being frequently created and destroyed result in many "Computer.threadPoolForRemoting" threads being created and locked, which eventually exhausts the machine's/JVM's resources and kills Jenkins.

            oleg_nenashev Oleg Nenashev added a comment -

            > Are you sure that both the original author and me are not experiencing the same issue?

            Well, the digital ocean issue requires extra investigation by the plugin maintainer. This one has been originally reported to SSH Slaves Plugin, and the deadlock cause definitely comes from the SSH Launcher implementation.

            Even if there is a shared part of the issue (Core API flaws), the issue fixes will be different.

             

            ecliptik Micheal Waltz added a comment - - edited

            Oleg Nenashev do you think that this issue may have been addressed in any of the recent Remoting Updates?

            https://github.com/jenkinsci/remoting/blob/master/CHANGELOG.md#311

            We're running Jenkins 2.74 with 8 agents using the Swarm v3.4 agent and have to reboot our Jenkins master once a week when CPU gets too high.

            It seems to be related to the hundreds of Computer.threadPoolForRemoting threads that pile up when viewing the Jenkins monitor. Here's a screenshot of what the Jenkins Monitoring Threads page shows:

            https://keybase.pub/ecliptik/jenkins/Screen%20Shot%202017-08-28%20at%2012.58.54%20PM.png

            oleg_nenashev Oleg Nenashev added a comment -

            Micheal Waltz I checked the implementation in the Jenkins core, and it appears that the thread pool may still grow without a limit on the master side. Implementation in Executors:

             

            /**
             * Creates a thread pool that creates new threads as needed, but
             * will reuse previously constructed threads when they are
             * available, and uses the provided
             * ThreadFactory to create new threads when needed.
             * @param threadFactory the factory to use when creating new threads
             * @return the newly created thread pool
             * @throws NullPointerException if threadFactory is null
             */
            public static ExecutorService newCachedThreadPool(ThreadFactory threadFactory) {
                return new ThreadPoolExecutor(0, Integer.MAX_VALUE,
                                              60L, TimeUnit.SECONDS,
                                              new SynchronousQueue<Runnable>(),
                                              threadFactory);
            }

             

            Computer.threadPoolForRemoting is used in multiple places, and IMHO we should at least make the limit configurable. Maybe we should also set a reasonable default limit. I will check if there is a thread pool executor service with downscaling support.
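
            As a quick way to watch for this growth, a small diagnostic sketch (not part of Jenkins; the class name is illustrative) that counts live Computer.threadPoolForRemoting threads, e.g. from the Script Console or a monitoring hook:

                class RemotingThreadCounterSketch {
                    static long count() {
                        // Thread names in the dumps above all start with this prefix.
                        return Thread.getAllStackTraces().keySet().stream()
                                .filter(t -> t.getName().startsWith("Computer.threadPoolForRemoting"))
                                .count();
                    }

                    public static void main(String[] args) {
                        System.out.println("threadPoolForRemoting threads: " + count());
                    }
                }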

            oleg_nenashev Oleg Nenashev added a comment -

            Actually I am wrong. The cached thread pool implementation should be able to terminate threads if they are unused for more than 60 seconds. Probably the threads are being created so intensively that the executor service goes crazy. I will see if I can at least add logging for that.

            integer Kanstantsin Shautsou added a comment - - edited

            Stephen Connolly https://issues.jenkins-ci.org/browse/JENKINS-27514?focusedCommentId=306267&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-306267 ? I remember that connect/disconnect always had issues in the UI, i.e. you try to disconnect but it still tries to reconnect indefinitely. Maybe related?

            ecliptik Micheal Waltz added a comment -

            This still appears to be an issue, but I figured out what was triggering high CPU and thousands of Computer.threadPoolForRemoting threads with our setup.

            Architecture:

            • 1 Ubuntu 16.04 Master running Jenkins v2.76
            • 6 Ubuntu 16.04 Agents via swarm plugin v3.4
            • Agents are connected to the master via an ELB since they are in multiple AWS regions

            There were a few jobs that used containers to mount a reports volume within the job workspace. The container would generate reports as root:root and they would appear within $WORKSPACE with those permissions. The Jenkins agent runs as user jenkins and couldn't remove these files when it tried to clean up $WORKSPACE after each run.

            jenkins@ip-10-0-0-5:~/workspace/automated-matador-pull_requests_ws-cleanup_1504807099908$ ls -l coverage/
            total 6352
            drwxr-xr-x 3 root root    4096 Sep  7 15:51 assets
            -rw-r--r-- 1 root root 6498213 Sep  7 15:51 index.htm
            

            The jobs that wrote these reports ran regularly, on every push and pull request to a repository, causing them to build up quickly. On the master, thousands of files named atomic*.tmp would start to appear in /var/lib/jenkins:

            ubuntu@jenkins:/var/lib/jenkins$ ls atomic*.tmp | wc -l
            6521
            

            and each file would contain hundreds of lines like,

            <detailMessage>Unable to delete &apos;/var/lib/jenkins/workspace/automated-matador-develop-build-on-push_ws-cleanup_1504728487305/coverage/.resultset.json.lock&apos;. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.</detailMessage>
            

            Eventually the Computer.threadPoolForRemoting threads would reach into the thousands and the master CPU would hit 100%. A reboot would temporarily fix it, but CPU would jump again until all the /var/lib/jenkins/atomic*.tmp files were removed.

            We resolved the issue by doing a chown jenkins:jenkins on the report directories created by the containers in a job, so there are no longer "Unable to delete" errors or atomic*.tmp files created. We haven't seen a CPU or Computer.threadPoolForRemoting spike in the two weeks since doing this.

            Hopefully this helps anyone else who may be experiencing this issue and provides some guidance on its root cause.

            oleg_nenashev Oleg Nenashev added a comment -

            Micheal Waltz Ideally I need a stack trace to confirm what causes it, but I am pretty sure it happens due to the workspace cleanup. Jenkins-initiated workspace cleanup happens in bursts and it uses the Remoting thread pool for sure, so it may cause such behavior.

            Regarding this ticket, I am going to convert it to an Epic. The Remoting thread pool is a shared resource in the system, and it may be consumed by various things. I ask everybody to re-report their cases under the Epic.

            oleg_nenashev Oleg Nenashev added a comment -

            The original ticket has been cross-posted as JENKINS-47012. In this Epic I will handle only issues related to the Jenkins core, the SSH Slaves Plugin, and Remoting. Issues related to Remoting thread pool usage and misuse by other plugins are tracked separately.


              People

              Assignee:
              Unassigned
              Reporter:
              cboylan Clark Boylan
              Votes:
              13
              Watchers:
              28
