
[JENKINS-24155] Jenkins Slaves Go Offline In Large Quantities and Don't Reconnect Until Reboot

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Components: core, remoting
    • Environment: Windows 7, Windows Server 2008

      I am running Jenkins 1.570.

      Occasionally, out of the blue, a large chunk of my Jenkins slaves will go offline and, most importantly, stay offline until Jenkins is rebooted. All of the slaves that go offline this way report the following as the reason:

      The current peer is reconnecting.

      If I look in my Jenkins logs, I see this for some of my slaves that remain online:

      Aug 07, 2014 11:13:07 AM INFO hudson.TcpSlaveAgentListener$ConnectionHandler run
      Accepted connection #2018 from /172.16.100.79:51299
      Aug 07, 2014 11:13:07 AM WARNING jenkins.slaves.JnlpSlaveHandshake error
      TCP slave agent connection handler #2018 with /172.16.100.79:51299 is aborted: dev-build-03 is already connected to this master. Rejecting this connection.
      Aug 07, 2014 11:13:07 AM WARNING jenkins.slaves.JnlpSlaveHandshake error
      TCP slave agent connection handler #2018 with /172.16.100.79:51299 is aborted: Unrecognized name: dev-build-03

      The logs are flooded with messages like these, with more arriving every second.

      Lastly, there is one slave still shown as online that should be offline: that machine is fully shut down, yet Jenkins sees it as fully online. Conversely, all of the offline slaves are running Jenkins' slave.jar in headless mode, so I can see their console output. On their end they all think they are "Online", but Jenkins itself shows them all as disconnected.
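      For reference, the slaves are launched headless with a command along these lines (the master URL and secret here are placeholders; dev-build-03 is one of the affected node names from the logs, and -secret may or may not be required depending on security settings):

      $ java -jar slave.jar -jnlpUrl http://<master>:8080/computer/dev-build-03/slave-agent.jnlp -secret <secret>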

      This bug has been haunting me for quite a while now, and it is killing production for me. I really need to know if there's a fix for this or, at the very least, a version of Jenkins I can downgrade to that doesn't have this issue.

      Thank you!

        Attachments:
        1. jenkins-slave.0.err.log (427 kB)
        2. log.txt (220 kB)
        3. masterJenkins.log (370 kB)


          Shane Gannon added a comment -

          I've been told that this issue is the same as JENKINS-28844 and has been resolved in the 1.609.3 LTS.


          mohit tater added a comment -

          We are facing this issue on Jenkins ver. 1.605.

          On most of the offline slaves I am seeing:
          "JNLP agent connected from /x.y.z.a" in the node log.

          Here is the threadDump link of the affected Jenkins instance.
          http://pastebin.com/9hUR1Awf

          Please provide a temporary workaround so that this can be avoided in the future.

          Note:
          We are using 50+ nodes on a single master.


          Alexandre Aubert added a comment (edited)

          Same problem for several days with Jenkins 2.23. Here is an extract of the log showing:

          • first an 'outofmemory' error,
          • then a flood of 'java.lang.OutOfMemoryError: unable to create new native thread',
          • then the disconnection of all slaves.

          log.txt

          Two slaves were not disconnected; slave.jar is more recent on those. I will update slave.jar on all of them and check whether it happens again (also waiting on the auto-update of slave.jar files, which is pending in another ticket).

          Hope this helps.
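          Note: 'java.lang.OutOfMemoryError: unable to create new native thread' usually points to an OS-level thread/process limit rather than heap exhaustion. On a Linux master this can be checked quickly (the pid below is a placeholder):

          $ grep Threads /proc/<jenkins-pid>/status   # current thread count of the master JVM
          $ ulimit -u                                 # per-user process/thread limit for the Jenkins user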


          Alexandre Aubert added a comment -

          In my case this was an OutOfMemory problem: I fixed it by increasing -Xmx in the Jenkins arguments, and everything seems to be OK since.
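          Note: where to set -Xmx depends on how the master is installed. On a Debian/Ubuntu package install, for example, the master's JVM options typically live in /etc/default/jenkins (the 4g value below is illustrative; size the heap for your own master):

          # /etc/default/jenkins (illustrative)
          JAVA_ARGS="-Xmx4g"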

          Trushar Patel added a comment -

          We are also facing the same issue on Jenkins 1.624. I had to reboot it. Please, someone help; this looks like it's been going on for a while.


          Nelu Vasilica added a comment -

          Just saw the same issue on a Jenkins 1.642.1 Linux master. The fix was to restart Tomcat, and the Windows slaves reconnected automatically.
          Found several instances of "Ping started at xxxxxx hasn't completed by xxxxxxx" in the logs.
          Is setting the jenkins.slaves.NioChannelSelector.disabled property to true a viable workaround?
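          Note: that flag is a JVM system property, so trying it means passing it to the master JVM at startup; whether it actually helps with this issue is unconfirmed. For a WAR-based launch that would look like the line below; under Tomcat the same -D flag would go in CATALINA_OPTS instead.

          $ java -Djenkins.slaves.NioChannelSelector.disabled=true -jar jenkins.war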


          Cesos Barbarino added a comment -

          Same issue here. Only this time, my tests never get done; the slaves are always dropping during the tests. Please help!

          Oleg Nenashev added a comment -

          I am not sure we can proceed much on this issue. Just to summarize changes related to several reports above...

          • Jenkins 2.50+ introduced runaway-process termination in the new Windows service logic. It should help with the "is already connected to this master" issues reported for Windows service agents. See JENKINS-39231.
          • Whatever happens in Jenkins after an "OutOfMemory" exception belongs to the "undefined behavior" area. Jenkins should ideally switch to a disabled state after it, since the impact is not predictable.
          • JENKINS-25218 introduced fixes to the FifoBuffer handling logic; all fixes are available in 2.60.1.

          In order to proceed with this issue, I need somebody to confirm it still happens on 2.60.1 and to provide new diagnostics info.


          Louis Heche added a comment -

          I'm having what seems to be this issue with Jenkins 2.138.3.

          Every 3-4 days all the slave nodes go offline, although there seems to be no network problem. They come back online once the master has been restarted.

          Attached you'll find the logs: jenkins-slave.0.err.log and masterJenkins.log


          Jeremy Whiting added a comment -

          hechel oleg_nenashev cesos

          Can one of you do the following? To help narrow down the possible leak areas it would be useful to capture process memory usage and JVM heap usage. Start your master process as normal, then start two tools on the system and redirect their output to separate files. Both tools have low system resource usage.

          Memory stats can be captured using pidstat, specifically the resident set size (the trailing 8 is the sampling interval in seconds):

          $ pidstat -r -p <pid> 8 > /tmp/pidstat-capture.txt

          JVM heap size and GC behavior, specifically the percentage of heap space reclaimed after a full collection (8-second samples, header repeated every 12 rows):

          $ jstat -gcutil -t -h12 <pid> 8s > /tmp/jstat-capture.txt

          Please attach the generated files to this issue.


            Assignee: Unassigned
            Reporter: Kevin Randino (krandino)
            Votes: 34
            Watchers: 48