  Jenkins / JENKINS-27922

Jenkins job execution becomes unstable - jobs fail with OOM: unable to create new native thread

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Component/s: ssh-agent-plugin

      Description

      After running for 2-3 days, jenkins jobs no longer launch.

      The console outputs usually just say that fetching from git failed, but sometimes contain other unusual errors.

      The system log for jenkins reports

      java.lang.OutOfMemoryError: unable to create new native thread

      I was able to get a heap dump but due to the potential inclusion of sensitive data cannot post it.

      In VisualVM analysis of the heap dump, I noticed that there are almost 1000 instances of AgentServer and AgentServer$1. The threads don't show up in the thread monitor, but are still referenced somehow.

      Unfortunately the parent references are numerous and hard to decipher. The proximate parent is the ThreadGroup.threads array in the main ThreadGroup instance. This seems unlikely to be the true root cause.

      I also noticed about the same number of ThreadLocalMap instances, so the leak may be related to incorrect use of ThreadLocal.

      Attached a screenshot of the AgentServer$1 instances in VisualVM, and the jenkins system log.

      Please let me know if there is any other analysis I can provide.

      I am entering this bug as blocker because I don't currently have a workaround. I am using jenkins in conjunction with an external php application that needs to post jobs to the jenkins build queue. Therefore, to work around the problem, I need to implement a controlled shutdown process and restart jenkins at a daily or semi-daily interval. This will ultimately require the calling application to retry, which is probably a good idea anyway, but is not yet implemented.
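
      This OutOfMemoryError is raised when the JVM cannot start another native thread, not when the heap is full, so the thread count of the Jenkins process is the number to watch between restarts. A minimal sketch, assuming a Linux host with /proc mounted; the function name is illustrative and not part of the original report:

```shell
#!/bin/sh
# Print the number of native threads of a process, read from /proc (Linux only).
# Usage: thread_count PID
thread_count() {
    # The "Threads:" field of /proc/PID/status holds the current thread count.
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}
```

      For example, `thread_count "$(pgrep -f jenkins.war | head -n 1)"` would print the current count, assuming Jenkins was started from a file named jenkins.war (a hypothetical detail of the setup). The result can be compared with the per-user process limit, which bash reports with `ulimit -u`.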

        Attachments

        1. jenkins.log
          146 kB
        2. plugins.xml
          3 kB
        3. Screen Shot 2015-04-13 at 11.02.27 AM.png
          314 kB
        4. Thread dump [Jenkins].html
          485 kB
        5. Thread dump [Jenkins].html
          348 kB

            Activity

            jamie Jamie Doornbos created issue -
            jamie Jamie Doornbos made changes -
            Field: Description
            Change: corrected "it the" to "is the"; changed "critical" to "blocker" in the final paragraph.

            danielbeck Daniel Beck made changes -
            Assignee Daniel Beck [ danielbeck ]
            danielbeck Daniel Beck made changes -
            Assignee Daniel Beck [ danielbeck ]
            danielbeck Daniel Beck added a comment -

            A thread dump may be helpful. https://wiki.jenkins-ci.org/display/JENKINS/Obtaining+a+thread+dump

            jamie Jamie Doornbos added a comment -

            I had to restart my server yesterday, so this thread dump represents about a day of build up. I can see the trajectory of doom here: there are 658 AgentServer threads. I can post another dump tomorrow or so when the server falls over.
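
            A count like the one above can be taken from a plain-text thread dump. A minimal sketch, assuming the dump is in the jstack text format (one blank-line-separated block per thread) and that each leaked thread mentions AgentServer somewhere in its block; the function name is illustrative:

```shell
#!/bin/sh
# Count thread blocks in a text thread dump that mention a given pattern.
# Usage: count_threads PATTERN DUMP_FILE
count_threads() {
    # RS="" puts awk in paragraph mode: each blank-line-separated block is one record,
    # so a thread whose stack mentions the pattern several times is counted once.
    awk -v pat="$1" 'BEGIN { RS = "" } $0 ~ pat { n++ } END { print n + 0 }' "$2"
}
```

            With a dump saved by `jstack PID > dump.txt`, `count_threads AgentServer dump.txt` prints the number of matching threads. Counting blocks instead of lines avoids inflating the figure when one stack trace contains several AgentServer frames.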

            jamie Jamie Doornbos made changes -
            Attachment Thread dump [Jenkins].html [ 29450 ]
            jamie Jamie Doornbos added a comment -

            Server failed sooner than expected, probably due to a large increase in job load for our current testing cycle. Attached a new thread dump, a grand total of 937 AgentServer threads. The system log shows the same java.lang.OutOfMemoryError: unable to create new native thread. I have to restart the service again to move forward with my development.

            jamie Jamie Doornbos made changes -
            Attachment Thread dump [Jenkins].html [ 29455 ]
            jamie Jamie Doornbos added a comment -

            I have a workaround now. I get jenkins to restart itself once per day using a new restart-jenkins job. It uses the Credentials Binding Plugin to run this:

            curl -XPOST https://SERVER_URL/safeRestart --user "rebooter:$JENKINS_PASSWORD"

            Since this workaround is readily available and should work for even fairly busy systems, I downgraded severity of the issue.
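
            While Jenkins restarts, requests from a calling application will fail for a short time, so the caller needs to retry. A minimal sketch of a generic retry helper; the function name, the attempt count, and the delay are assumptions chosen for illustration:

```shell
#!/bin/sh
# Run a command until it succeeds, up to MAX attempts, sleeping DELAY seconds between tries.
# Usage: retry MAX DELAY COMMAND [ARGS...]
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            return 1    # give up after MAX failed attempts
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}
```

            A caller could then wrap its request, for example `retry 5 60 curl --fail -X POST "https://SERVER_URL/job/JOB_NAME/build" --user "USER:TOKEN"`, where JOB_NAME, USER, and TOKEN are placeholders. The `--fail` option makes curl exit non-zero on HTTP errors, which is what lets the helper detect a failed attempt.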

            jamie Jamie Doornbos made changes -
            Priority Blocker [ 1 ] Critical [ 2 ]
            danielbeck Daniel Beck added a comment -

            In jobs using SSH Agent, do you see the following build log message near the end?

            [ssh-agent] Stopped.

            danielbeck Daniel Beck added a comment -

            Are you setting up SSH Agents for builds that do not use them?

            jamie Jamie Doornbos added a comment -

            I do generally see the Stopped line on builds, but I don't watch every build. I checked using a grep on files no more than 3 days old and found a small discrepancy of 20 stray "Started" lines:

            [/var/lib/jenkins/jobs]$ find . -mtime -3 -type f > /tmp/recent_logs
            [/var/lib/jenkins/jobs]$ grep -lF '[ssh-agent] Started.' `cat /tmp/recent_logs` > /tmp/agent-started-logs
            [/var/lib/jenkins/jobs]$ grep -lF '[ssh-agent] Stopped.' `cat /tmp/recent_logs` > /tmp/agent-stopped-logs
            [/var/lib/jenkins/jobs]$ wc -l /tmp/agent-st*
            2578 /tmp/agent-started-logs
            2558 /tmp/agent-stopped-logs
            5136 total

            Regarding use of SSH Agent, it is configured for all builds, since the git plugin fails to work in my environment if SSH Agent is not running. (I spent a few hours trying to debug this months ago, but don't really remember the details.) Most of the builds don't require an agent other than for the git plugin.
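
            Comparing the two counts shows how many logs are unbalanced, but not which ones. A minimal sketch that lists them, assuming the two list files written by the commands above; comm needs sorted input, so both lists are sorted first. The function name is illustrative:

```shell
#!/bin/sh
# Print entries that appear in the "started" list but not in the "stopped" list.
# Usage: unmatched_starts STARTED_LIST STOPPED_LIST
unmatched_starts() {
    started_sorted=$(mktemp)
    stopped_sorted=$(mktemp)
    sort "$1" > "$started_sorted"
    sort "$2" > "$stopped_sorted"
    # comm -23 suppresses lines unique to the second file and lines common to both,
    # leaving only the lines unique to the first file.
    comm -23 "$started_sorted" "$stopped_sorted"
    rm -f "$started_sorted" "$stopped_sorted"
}
```

            Running `unmatched_starts /tmp/agent-started-logs /tmp/agent-stopped-logs` would print the logs with a "Started" line and no "Stopped" line. Builds that were still running when the lists were made would show up here as well, so the output is a starting point for inspection and not proof of a leak.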

            jamie Jamie Doornbos added a comment - edited

            This is becoming more serious for us as we approach full rollout. Currently, jenkins needs to be restarted every 6 hours, and sometimes even this is not enough. It also means builds that take longer than 6 hours currently have to be run out of band (from a separate command shell). This means we may be forced to replace jenkins at the last minute, which would make me sad.

            BUT... my employer wants to sponsor this issue. How much money would be a good enough incentive? I suggested $500. Do you have a reproducible case or some idea of how to fix it? Would it be okay to state the terms as something like "my jenkins instance does not fail due to SSH Agent after 2 days"?

            danielbeck Daniel Beck added a comment -

            I don't have the time right now to work on any bounties (plus I got burned in the past). I'm doing some issue triaging for the Jenkins project, which is my only interest in this specific issue. I am not a developer of the SSH Agent Plugin, nor do I use it myself.

            Maybe try the jenkinsci-users mailing list about your Git Plugin issue.

            danielbeck Daniel Beck added a comment -

            Seems to be the plugin and not an issue in core.

            danielbeck Daniel Beck made changes -
            Component/s core [ 15593 ]
            danielbeck Daniel Beck added a comment -

            Looks like a duplicate of JENKINS-27555 that was fixed in SSH Agent 1.7.

            danielbeck Daniel Beck made changes -
            Resolution Duplicate [ 3 ]
            Status Open [ 1 ] Resolved [ 5 ]
            danielbeck Daniel Beck made changes -
            Link This issue duplicates JENKINS-27555 [ JENKINS-27555 ]
            rtyler R. Tyler Croy made changes -
            Workflow JNJira [ 162519 ] JNJira + In-Review [ 196971 ]

              People

              Assignee:
              Unassigned
              Reporter:
              jamie Jamie Doornbos
              Votes:
              0
              Watchers:
              1
