
Jenkins Slaves Go Offline In Large Quantities and Don't Reconnect Until Reboot

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component(s): core, remoting
    • Environment: Windows 7, Windows Server 2008

      I am running Jenkins 1.570.

      Occasionally, out of the blue, a large chunk of my Jenkins slaves will go offline and, most importantly, stay offline until Jenkins is rebooted. All of the slaves that go offline this way show the following as the reason:

      The current peer is reconnecting.

      If I look in my Jenkins logs, I see this for some of my slaves that remain online:

      Aug 07, 2014 11:13:07 AM hudson.TcpSlaveAgentListener$ConnectionHandler run
      INFO: Accepted connection #2018 from /172.16.100.79:51299
      Aug 07, 2014 11:13:07 AM jenkins.slaves.JnlpSlaveHandshake error
      WARNING: TCP slave agent connection handler #2018 with /172.16.100.79:51299 is aborted: dev-build-03 is already connected to this master. Rejecting this connection.
      Aug 07, 2014 11:13:07 AM jenkins.slaves.JnlpSlaveHandshake error
      WARNING: TCP slave agent connection handler #2018 with /172.16.100.79:51299 is aborted: Unrecognized name: dev-build-03

      The logs are flooded with all of that, with another one coming in every second.

      Lastly, there is one slave that is still online that should be offline. That slave is fully shut down, yet Jenkins sees it as still fully online. All of the offline slaves are running Jenkins' slave.jar file in headless mode, so I can see the console output. All of them think that on their end they are "Online", but Jenkins itself shows them all as disconnected.

      This bug has been haunting me for quite a while now, and it is killing production for me. I really need to know if there's a fix for this or, at the very least, a version of Jenkins I can downgrade to that doesn't have this issue.

      Thank you!

        Attachments:
          1. jenkins-slave.0.err.log (427 kB)
          2. masterJenkins.log (370 kB)
          3. log.txt (220 kB)

          [JENKINS-24155] Jenkins Slaves Go Offline In Large Quantities and Don't Reconnect Until Reboot

          Daniel Beck added a comment -

          Why would Jenkins think the nodes are already connected? Does it show the computers as still online? Is there anything in a specific node's log (assuming the excerpt is from main jenkins log)?


          Daniel Beck added a comment -

          What's the last Jenkins version that did not have this problem? When you downgrade, does it go away?


          Kevin Randino added a comment -

          I'm afraid I am infrequent with updates and have always had issues with my nodes in one way or another, so it's hard to pinpoint exactly when this started. I would say at least since 1.565, but probably before then too.

          When I say that the node still claims it is connected, I am referring to the console log that is displayed on the node itself. Jenkins still sees the node as offline and says "The current peer is reconnecting." in the node status.


          Richard Mortimer added a comment -

          I've seen the same issue on a Windows slave running a self-built version of 1.577-SNAPSHOT. The slave error log suggests that the slave saw a connection reset, but when it reconnected the master thought the slave was still connected and the connection retries failed.

          Aug 17, 2014 11:05:02 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
          SEVERE: I/O error in channel channel
          java.net.SocketException: Connection reset
          	at java.net.SocketInputStream.read(Unknown Source)
          	at java.net.SocketInputStream.read(Unknown Source)
          	at java.io.BufferedInputStream.fill(Unknown Source)
          	at java.io.BufferedInputStream.read(Unknown Source)
          	at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
          	at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
          	at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
          	at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
          	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
          
          Aug 17, 2014 11:05:02 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Terminated
          Aug 17, 2014 11:05:12 PM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onReconnect
          INFO: Restarting slave via jenkins.slaves.restarter.WinswSlaveRestarter@5f849b
          Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main createEngine
          INFO: Setting up slave: Cygnet
          Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener <init>
          INFO: Jenkins agent is running in headless mode.
          Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Locating server among [http://jenkins.example/]
          Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connecting to jenkins.example:42715
          Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Handshaking
          Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener error
          SEVERE: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
          java.lang.Exception: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
          	at hudson.remoting.Engine.onConnectionRejected(Engine.java:306)
          	at hudson.remoting.Engine.run(Engine.java:276)
          
          Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main createEngine
          INFO: Setting up slave: Cygnet
          Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener <init>
          INFO: Jenkins agent is running in headless mode.
          Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Locating server among [http://jenkins.example/]
          Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connecting to jenkins.example:42715
          Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Handshaking
          Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener error
          SEVERE: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
          java.lang.Exception: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
          	at hudson.remoting.Engine.onConnectionRejected(Engine.java:306)
          	at hudson.remoting.Engine.run(Engine.java:276)
          
          Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main createEngine
          INFO: Setting up slave: Cygnet
          Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener <init>
          INFO: Jenkins agent is running in headless mode.
          Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Locating server among [http://jenkins.example/]
          Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connecting to jenkins.example:42715
          Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Handshaking
          Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener error
          SEVERE: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
          java.lang.Exception: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
          	at hudson.remoting.Engine.onConnectionRejected(Engine.java:306)
          	at hudson.remoting.Engine.run(Engine.java:276)
          

          After 3 restart retries the Windows service restarter gave up, and unfortunately I didn't attempt to reconnect until after I had restarted the master over 12 hours later.

          The equivalent part of the master's log is as follows (only the first restart is included here, but the others are equivalent).

          Aug 17, 2014 11:05:18 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
          INFO: Accepted connection #7 from /192.168.1.115:60293
          Aug 17, 2014 11:05:18 PM jenkins.slaves.JnlpSlaveHandshake error
          WARNING: TCP slave agent connection handler #7 with /192.168.1.115:60293 is aborted: Cygnet is already connected to this master. Rejecting this connection.
          Aug 17, 2014 11:05:18 PM jenkins.slaves.JnlpSlaveHandshake error
          WARNING: TCP slave agent connection handler #7 with /192.168.1.115:60293 is aborted: Unrecognized name: Cygnet
          Aug 17, 2014 11:06:19 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
          INFO: Accepted connection #8 from /192.168.1.115:60308
          

          The master log does not show any evidence of the slave connection being broken in the 12 hours before the master restart.

          If the master is not detecting that the connection has closed then it will still think that the slave is connected and will refuse re-connections as seen.

          Sadly I didn't get a stacktrace of the master's threads to see if any threads were blocked anywhere.


          Richard Mortimer added a comment -

          I wonder whether this issue is related to the changes made circa 4th April 2014 to change JNLP slaves to use NIO.

          See commit d4c74bf35d4 in 1.599 and also the corresponding changes made in remoting 2.38.

          May also be related to JENKINS-23248, but my build was definitely using the integrated fixes for that, so it cannot be the whole solution.


          Richard Mortimer added a comment -

          After a bit of experimenting I can reproduce this scenario fairly easily using an Ubuntu 12.04 master and a Windows 7 slave (both using Java 7). Once the slave is connected I forcibly suspend the Windows 7 computer and monitor the TCP connection between the master and slave (on the master) using netstat.

          netstat -na | grep 42715
          tcp6       0      0 :::42715                :::*                    LISTEN
          tcp6       0      0 192.168.1.23:42715      192.168.1.24:58905      ESTABLISHED
          tcp6       0   2479 192.168.1.23:42715      192.168.1.115:61283     ESTABLISHED
          

          When the master attempts to "ping" the slave the slave does not respond and the TCP send queue builds up (2479 in the example above).
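          (For continuous monitoring during this repro, a simple loop such as the one below can be used. This is purely illustrative; 42715 is just the JNLP port from the example above.)

          # Illustrative: refresh the netstat view every 5 seconds and watch the Send-Q column grow
          watch -n 5 'netstat -na | grep 42715'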

          Once the queue has been building for a few minutes bring the Windows 7 machine back to life and let things recover naturally.

          I observe that the Windows 7 machine issues a TCP RST on the connection. But the Linux master does not seem to react to the RST and continues to add data into the send queue.

          During this time the slave has attempted to restart the connection and failed because the master thinks that the slave is still connected. The windows slave service stops attempting a restart after a couple of failures.

          After a few minutes the channel pinger on the master declares that the slave is dead

          Aug 19, 2014 9:34:24 PM hudson.slaves.ChannelPinger$1 onDead
          INFO: Ping failed. Terminating the channel.
          java.util.concurrent.TimeoutException: Ping started on 1408480224640 hasn't completed at 1408480464640
                  at hudson.remoting.PingThread.ping(PingThread.java:120)
                  at hudson.remoting.PingThread.run(PingThread.java:81)
          

          But even at this time the TCP stream stays open and the master still considers the slave connection to be operating.

          After a further 10 minutes the connection does close. It seems like this is a standard TCP timeout.

          WARNING: Communication problem
          java.io.IOException: Connection timed out
                  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
                  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
                  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
                  at sun.nio.ch.IOUtil.read(IOUtil.java:197)
                  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
                  at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
                  at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
                  at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
                  at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:744)
          
          Aug 19, 2014 9:44:13 PM jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed
          WARNING: NioChannelHub keys=2 gen=2823: Computer.threadPoolForRemoting [#2] for + Cygnet terminated
          java.io.IOException: Failed to abort
                  at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:195)
                  at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:581)
                  at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:744)
          Caused by: java.io.IOException: Connection timed out
                  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
                  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
                  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
                  at sun.nio.ch.IOUtil.read(IOUtil.java:197)
                  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
                  at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
                  at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
                  at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
                  ... 6 more
          

          There does not seem to be a "fail fast" mechanism in operation. It isn't clear whether this is due to the Linux networking stack or whether Java could have failed a lot quicker once it determined that the connection ping had timed out.

          Sadly it is not immediately obvious that the disconnect could be done instantly because it all seems to be tied up with standard TCP retries.


          Richard Mortimer added a comment -

          One potential workaround to try is adding

          -Djenkins.slaves.NioChannelSelector.disabled=true
          

          onto the Jenkins master's launcher command line arguments. On Debian/Ubuntu that is as simple as adding the above to JAVA_ARGS in /etc/default/jenkins.
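          For example, the JAVA_ARGS line in /etc/default/jenkins might then look something like the following (illustrative only; the headless flag is the usual packaged default, so keep whatever arguments your file already contains):

          # /etc/default/jenkins (illustrative; preserve any existing arguments)
          JAVA_ARGS="-Djava.awt.headless=true -Djenkins.slaves.NioChannelSelector.disabled=true"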

          If launching Jenkins from the command line it would be

          java -Djenkins.slaves.NioChannelSelector.disabled=true -jar jenkins.war
          

          I just tested this on my system and it does seem to change the behaviour when I run my test case. In 3 tests the slave continued working correctly on all 3 occasions. 2 of these saw the queued traffic simply delivered and things continued as before. In the other case the original TCP connection entered the TIME_WAIT state and a new connection was started successfully by the recently suspended slave.

          Wed Aug 20 11:18:23 BST 2014
          tcp6       0      0 :::42715                :::*                    LISTEN     
          tcp6       0      0 192.168.1.23:42715      192.168.1.115:50570     TIME_WAIT  
          tcp6       0      0 192.168.1.23:42715      192.168.1.24:47835      ESTABLISHED
          tcp6       0      0 192.168.1.23:42715      192.168.1.115:50619     ESTABLISHED
          

          From this I suspect that the new NIO-based method of communicating with slaves does not cause the TCP socket to be fully closed until the TCP timers time out, whereas in the thread-per-slave method the connection is torn down almost immediately.

          It would be good to know if the NioChannelSelector workaround outlined above helps others.


          Kohsuke Kawaguchi added a comment -

          What's likely happening is that the NIO selector thread has died on the master, which uses NioChannelHub.abortAll() to terminate all the JNLP connections. It's consistent with the observation that the problem persists until Jenkins restarts. If anyone can take a thread dump on the master when Jenkins is in this state, we can verify this by checking whether the said thread is listed or not.

          When the NIO selector thread is killed, it leaves a message in the Jenkins log. I'd like you to look for it, so that we can see why the NIO selector thread was killed. The reconnecting slaves failing to get through is a red herring.
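          (A rough way to check for the selector thread from a shell on the master, assuming the JDK tools are installed, would be something like the following; the thread name prefix is taken from the thread dumps later in this issue.)

          # Illustrative: dump all master JVM threads and look for the NIO selector thread by name
          jstack <jenkins-master-pid> > /tmp/master-threads.txt
          grep -c "NioChannelHub" /tmp/master-threads.txt   # 0 matches suggests the selector thread is gone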


          marlene cote added a comment -

          my slave and server are in this state now. How do I take a thread dump?


          Daniel Beck added a comment -

          funeeldy: Go to the /threadDump URL, or install the Support Core Plugin, go to /support, and generate a bundle that contains at least the thread dumps.
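          (If you prefer to capture it from the command line, something along these lines should work; jenkins.example, USER and API_TOKEN are placeholders, and the URL needs an account with sufficient permissions.)

          # Illustrative: save the /threadDump page for later analysis
          curl -u USER:API_TOKEN http://jenkins.example/threadDump > /tmp/jenkins-threaddump.txt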


          Christian Bremer added a comment -

          $200 is up for grabs for this issue on freedomsponsors.org: https://freedomsponsors.org/issue/591/jenkins-slaves-go-offline-in-large-quantities-and-dont-reconnect-until-reboot

          Bowse Cardoc added a comment -

          We have this problem; I have a thread dump here:
          http://pastebin.com/chMPus8C


          Richard Mortimer added a comment -

          Relevant bits from the thread dump.

          NioChannelHub (and a few other things not included for compactness) are waiting on Channel@2d23a3cd:

          NioChannelHub keys=4 gen=5352736: Computer.threadPoolForRemoting [#2]
           
          "NioChannelHub keys=4 gen=5352736: Computer.threadPoolForRemoting [#2]" Id=144 Group=main BLOCKED on hudson.remoting.Channel@2d23a3cd owned by "Finalizer" Id=3
                  at hudson.remoting.Channel.terminate(Channel.java:804)
                  -  blocked on hudson.remoting.Channel@2d23a3cd
                  at hudson.remoting.Channel$2.terminate(Channel.java:491)
                  at hudson.remoting.AbstractByteArrayCommandTransport$1.terminate(AbstractByteArrayCommandTransport.java:72)
                  at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:211)
                  at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:631)
                  at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
                  at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
                  at java.util.concurrent.FutureTask.run(Unknown Source)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
                  at java.lang.Thread.run(Unknown Source)
           
                  Number of locked synchronizers = 1
                  - java.util.concurrent.ThreadPoolExecutor$Worker@1b9eef5
          

          This is held by the Finalizer, which is waiting on FifoBuffer@782372ec:

          Finalizer
           
          "Finalizer" Id=3 Group=system WAITING on org.jenkinsci.remoting.nio.FifoBuffer@782372ec
                  at java.lang.Object.wait(Native Method)
                  -  waiting on org.jenkinsci.remoting.nio.FifoBuffer@782372ec
                  at java.lang.Object.wait(Object.java:503)
                  at org.jenkinsci.remoting.nio.FifoBuffer.write(FifoBuffer.java:336)
                  at org.jenkinsci.remoting.nio.FifoBuffer.write(FifoBuffer.java:324)
                  at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.writeBlock(NioChannelHub.java:222)
                  at hudson.remoting.AbstractByteArrayCommandTransport.write(AbstractByteArrayCommandTransport.java:83)
                  at hudson.remoting.Channel.send(Channel.java:553)
                  -  locked hudson.remoting.Channel@2d23a3cd
                  at hudson.remoting.RemoteInvocationHandler.finalize(RemoteInvocationHandler.java:240)
                  at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
                  at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
                  at java.lang.ref.Finalizer.access$100(Unknown Source)
                  at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
          

          It looks like the Finalizer is trying to write to a channel to clean up an object with remote state and has effectively locked things up, because nothing can come along and force the FifoBuffer write to terminate.


          Patricia Wright added a comment -

          I'm still having this issue on 1.580.2.

          I have a large number of slaves that reboot as part of our testing process (tests complete, systems shut down).
          These slaves go offline in this fashion.


          Patricia Wright added a comment -

          Still happens, daily.

          Jenkins 1.596.1 with all plugins up to date.


          Patricia Wright added a comment -

          I will grab a thread dump the next time I see this.


          amal jerbi added a comment -

          I encountered the same problem with my Jenkins master (version 1.565) installed on a Debian machine and slaves installed on Windows 7 machines.
          When a slave disconnects and reconnects, the Jenkins master does not detect the reconnection of the slave.
          I followed the proposed solution of adding the following to the JAVA_ARGS variable in /etc/default/jenkins:

          -Djenkins.slaves.NioChannelSelector.disabled=true

          This has solved the problem, but since this is a workaround, is there a permanent solution?


          Vladimir Lazarev added a comment (edited) -

          We have the "peer reconnecting" issue once every 2 weeks. After applying the workaround proposed by Amal, connections crash on a regular basis during heavy jobs (high CPU load and long duration).

          So be aware...


          Stephen Connolly added a comment -

          This is almost certainly the same issue as JENKINS-25218.


          Shane Gannon added a comment -

          I've been told that this issue is the same as JENKINS-28844 and has been resolved in the 1.609.3 LTS.


          mohit tater added a comment -

          We are facing this issue on Jenkins ver. 1.605.

          On most of the offline slaves I am seeing:
          "JNLP agent connected from /x.y.z.a" in the node log.

          Here is the threadDump link of the affected Jenkins instance.
          http://pastebin.com/9hUR1Awf

          Please provide a temporary workaround for this so that it can be avoided in future.

          Note:
          We are using 50+ nodes on a single master.


          Alexandre Aubert added a comment (edited) -

          Same problem for several days with Jenkins 2.23; here is an extract of the log showing:

          • first the 'outofmemory' error
          • then all the 'java.lang.OutOfMemoryError: unable to create new native thread' errors
          • then the disconnection of all slaves

          log.txt

          2 slaves are not disconnected: slave.jar is more recent on those. I will update slave.jar on all of them and check if it happens again... (also waiting for the auto-update of slave.jar files, which is pending in another ticket...)

          Hope this could help.


          Alexandre Aubert added a comment -

          In my case this was an OutOfMemory problem: I fixed it by increasing -Xmx in the Jenkins launch arguments and all seems to be OK since.
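          (As a sketch only, with an illustrative heap size rather than a recommendation, that amounts to something like the line below when launching the WAR directly; on Debian/Ubuntu the same flag can instead go into JAVA_ARGS in /etc/default/jenkins as described above.)

          # Illustrative: give the master JVM a larger heap
          java -Xmx4g -jar jenkins.war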


          Trushar Patel added a comment -

          We are also facing the same issue on Jenkins 1.624. I had to reboot it. Please, someone help. This looks like it's been going on for a while.


          Nelu Vasilica added a comment -

          Just seen the same issue on a Jenkins 1.642.1 Linux master. The fix was to restart Tomcat, and the Windows slaves reconnected automatically.
          Found several instances of "Ping started at xxxxxx hasn't completed by xxxxxxx" in the logs.
          Is setting the jenkins.slaves.NioChannelSelector.disabled property to true a viable workaround?


          Cesos Barbarino added a comment -

          Same issue here. Only this time, my tests never get done. The slaves are always dropping during the tests. Please help!


          Oleg Nenashev added a comment -

          I am not sure we can proceed much on this issue. Just to summarize changes related to several reports above...

          • Jenkins 2.50+ introduced runaway process termination in the new Windows service management. It should help with the "is already connected to this master" issues reported for Windows service agents. See JENKINS-39231.
          • Whatever happens in Jenkins after an "OutOfMemory" error falls into the "undefined behavior" area. Jenkins should ideally switch to a disabled state after it, since the impact is not predictable.
          • JENKINS-25218 introduced fixes to the FifoBuffer handling logic; all fixes are available in 2.60.1.

          In order to proceed with this issue, I need somebody to confirm it still happens on 2.60.1 and to provide new diagnostics info.


          Louis Heche added a comment -

          I'm having what seems to be this issue with Jenkins 2.138.3.

          Every 3-4 days all the slave nodes go offline, although there seems to be no network problem. They come back online once the master has been restarted.

          In the attachments you'll find the logs jenkins-slave.0.err.log and masterJenkins.log.


          Jeremy Whiting added a comment -

          hechel oleg_nenashev cesos

          Can one of you do the following? To help narrow down the possible leak areas it will be useful to capture process memory usage and JVM heap usage. Start your master process as normal, then start 2 tools on the system and redirect the output to separate files. Both tools have low system resource usage.

          Memory stats can be captured using pidstat, specifically the resident set size.

          $ pidstat -r -p <pid> 8 > /tmp/pidstat-capture.txt

          JVM heap size and GC behavior can be captured using jstat, specifically the percentage of heap space reclaimed after a full collection.

          $ jstat -gcutil -t -h12 <pid> 8s > /tmp/jstat-capture.txt

          Please attach the generated files to this issue.


            Assignee: Unassigned
            Reporter: Kevin Randino (krandino)
            Votes: 34
            Watchers: 48
