Jenkins / JENKINS-27514

Core - Thread spikes in Computer.threadPoolForRemoting leading to eventual server OOM


    Details

    • Type: Epic
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Labels:
    • Environment:
    • Epic Name:
      Core - Thread spikes in Computer.threadPoolForRemoting

      Description

      This issue has been converted to an Epic, because it contains reports of several independent issues.

      Issue:

      • The Remoting thread pool is widely used across Jenkins: https://github.com/search?q=org%3Ajenkinsci+threadPoolForRemoting&type=Code
      • To begin with, not all usages of Computer.threadPoolForRemoting are valid
      • Computer.threadPoolForRemoting has downscaling logic: idle threads are killed after a 60-second timeout
      • The pool has no thread limit by default, so it may grow without bound until the number of threads kills the JVM or causes an OOM
      • Some Jenkins use-cases cause bursts of Computer.threadPoolForRemoting load by design (e.g. Jenkins startup, or agent reconnection after a failure)
      • Deadlocks or waits in the thread pool may also make it grow without bound
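      The growth mechanism described above can be sketched with a minimal, self-contained Java demo (illustrative code only, not Jenkins internals): an executor with the same shape as the pool above (no upper bound, 60-second idle timeout, hand-off queue) creates one new thread for every task submitted while all existing workers are blocked.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Minimal demo (not Jenkins code) of how an unbounded cached-style pool grows. */
public class UnboundedGrowthDemo {
    /** Submits `tasks` blocked tasks and reports how many threads were created. */
    static int threadsAfterBurst(int tasks) {
        // Same shape as the pool described above: no upper bound,
        // 60-second idle timeout, hand-off queue with no capacity.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS,
                new SynchronousQueue<>());
        CountDownLatch gate = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            // Each task blocks, simulating e.g. a hung afterDisconnect() call.
            pool.execute(() -> {
                try { gate.await(); } catch (InterruptedException ignored) { }
            });
        }
        int created = pool.getPoolSize();  // one thread per blocked task
        gate.countDown();
        pool.shutdown();
        return created;
    }

    public static void main(String[] args) {
        // Every blocked task got its own thread; nothing caps the growth.
        System.out.println(threadsAfterBurst(100));  // prints 100
    }
}
```

      With a burst of hung tasks (agent reconnection storms, blocked disconnect handlers), the thread count simply tracks the number of outstanding tasks, which matches the runaway thread counts reported below.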

      Proposed fixes:

      • Define a usage policy for this thread pool in the documentation
      • Limit the number of threads created depending on the system scale, and make the limit configurable (256 by default?)
      • Fix the most significant issues where the thread pool gets misused or blocked
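      A bounded pool that keeps the 60-second idle timeout could be configured as sketched below. This is only a sketch of the proposed limit (the 256 default above is still an open question), and BoundedRemotingPool is a hypothetical name, not an existing Jenkins class:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Sketch only: a bounded replacement for an unbounded cached-style pool. */
public class BoundedRemotingPool {
    static ThreadPoolExecutor create(int maxThreads) {
        // core == max gives a hard upper bound on thread count;
        // allowCoreThreadTimeOut(true) still lets idle threads exit after 60 s.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads,
                60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());  // excess tasks queue instead of spawning threads
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = create(256);
        System.out.println(pool.getMaximumPoolSize());  // prints 256
        pool.shutdown();
    }
}
```

      The trade-off is that a blocked pool then shows up as a growing task queue rather than a growing thread count, which is easier on the JVM but still requires fixing the misuse and blocking issues listed above.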
         
        Original report (tracked as JENKINS-47012):

      > After some period of time the Jenkins master will have up to ten thousand or so threads, most of which are leaked Computer.threadPoolForRemoting threads. This forces us to restart the Jenkins master.

      > We do add and delete slave nodes frequently (thousands per day per master), which I think may be part of the problem.

      > I thought https://github.com/jenkinsci/ssh-slaves-plugin/commit/b5f26ae3c685496ba942a7c18fc9659167293e43 might be the fix, because stacktraces indicated threads were hanging in the plugin's afterDisconnect() method. I have updated half of our Jenkins masters to ssh-slaves plugin version 1.9, which includes that change, but earlier today we had a master with the ssh-slaves plugin fall over from this issue.

      > Unfortunately I don't have any stacktraces handy (we had to force-reboot the master today), but I will update this bug if we hit another case of this problem. I'm hoping that by filing it with as much info as I can, we can at least start to diagnose the problem.

        Attachments

        1. 20150904-jenkins03.txt
          2.08 MB
        2. file-leak-detector.log
          41 kB
        3. Jenkins_Dump_2017-06-12-10-52.zip
          1.58 MB
        4. jenkins_watchdog_report.txt
          267 kB
        5. jenkins_watchdog.sh
          2 kB
        6. jenkins02-thread-dump.txt
          1.49 MB
        7. support_2015-08-04_14.10.32.zip
          2.17 MB
        8. support_2016-06-29_13.17.36 (2).zip
          3.90 MB
        9. thread-dump.txt
          5.48 MB

          Issue Links

            Activity

            cboylan Clark Boylan created issue -
            cboylan Clark Boylan made changes -
            Field Original Value New Value
            Attachment jenkins02-thread-dump.txt [ 28784 ]
            esinsag Sagi Sinai-Glazer made changes -
            Attachment support_2015-08-04_14.10.32.zip [ 30422 ]
            cboylan Clark Boylan made changes -
            Attachment 20150904-jenkins03.txt [ 30615 ]
            casey1911 Stanislav Jursky made changes -
            Attachment file-leak-detector.log [ 33011 ]
            dilipm79 Dilip Mahadevappa made changes -
            Attachment thread-dump.txt [ 33180 ]
            dilipm79 Dilip Mahadevappa made changes -
            rtyler R. Tyler Croy made changes -
            Workflow JNJira [ 161723 ] JNJira + In-Review [ 180812 ]
            oleg_nenashev Oleg Nenashev made changes -
            Component/s remoting [ 15489 ]
            oleg_nenashev Oleg Nenashev made changes -
            Link This issue is related to JENKINS-43142 [ JENKINS-43142 ]
            oleg_nenashev Oleg Nenashev made changes -
            Component/s remoting [ 15489 ]
            malsch Malte Schoepski made changes -
            Attachment Jenkins_Dump_2017-06-12-10-52.zip [ 38439 ]
            nurupo nurupo nurupo made changes -
            Attachment jenkins_watchdog.sh [ 38803 ]
            nurupo nurupo nurupo made changes -
            Attachment jenkins_watchdog_report.txt [ 38804 ]
            oleg_nenashev Oleg Nenashev made changes -
            Assignee Kohsuke Kawaguchi [ kohsuke ] Oleg Nenashev [ oleg_nenashev ]
            oleg_nenashev Oleg Nenashev made changes -
            Issue Type Bug [ 1 ] Epic [ 10001 ]
            oleg_nenashev Oleg Nenashev made changes -
            Epic Child JENKINS-47012 [ 185392 ]
            oleg_nenashev Oleg Nenashev made changes -
            Epic Child JENKINS-47010 [ 185390 ]
            oleg_nenashev Oleg Nenashev made changes -
            Component/s remoting [ 15489 ]
            Epic Name Core - Thread spikes in Computer.threadPoolForRemoting
            oleg_nenashev Oleg Nenashev made changes -
            Summary Jenkins leaks thousands of Computer.threadPoolForRemoting threads leading to eventual server OOM  Core - Thread spikes in Computer.threadPoolForRemoting leading to eventual server OOM
            oleg_nenashev Oleg Nenashev made changes -
            Epic Child JENKINS-47015 [ 185395 ]
            oleg_nenashev Oleg Nenashev made changes -
            Description revised (several iterations; the final text appears in the Description section above)
            oleg_nenashev Oleg Nenashev made changes -
            Epic Child JENKINS-47257 [ 185672 ]
            oleg_nenashev Oleg Nenashev made changes -
            Epic Child JENKINS-47258 [ 185673 ]
            oleg_nenashev Oleg Nenashev made changes -
            Epic Child JENKINS-19465 [ 150915 ]
            oleg_nenashev Oleg Nenashev made changes -
            Epic Child JENKINS-48389 [ 186985 ]
            oleg_nenashev Oleg Nenashev made changes -
            Epic Child JENKINS-48574 [ 187214 ]
            oleg_nenashev Oleg Nenashev made changes -
            Epic Child JENKINS-48613 [ 187262 ]
            oleg_nenashev Oleg Nenashev made changes -
            Assignee Oleg Nenashev [ oleg_nenashev ]

              People

              Assignee:
              Unassigned
              Reporter:
              cboylan Clark Boylan
              Votes:
              13
              Watchers:
              28
