
Jenkins Slaves Go Offline In Large Quantities and Don't Reconnect Until Reboot

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component/s: core, remoting
    • Environment: Windows 7, Windows Server 2008

      I am running Jenkins 1.570.

      Occasionally, out of the blue, a large chunk of my Jenkins slaves will go offline and, most importantly, stay offline until Jenkins is rebooted. All of the slaves that go offline this way show the following as the reason:

      The current peer is reconnecting.

      If I look in my Jenkins logs, I see this for some of my slaves that remain online:

      Aug 07, 2014 11:13:07 AM hudson.TcpSlaveAgentListener$ConnectionHandler run
      INFO: Accepted connection #2018 from /172.16.100.79:51299
      Aug 07, 2014 11:13:07 AM jenkins.slaves.JnlpSlaveHandshake error
      WARNING: TCP slave agent connection handler #2018 with /172.16.100.79:51299 is aborted: dev-build-03 is already connected to this master. Rejecting this connection.
      Aug 07, 2014 11:13:07 AM jenkins.slaves.JnlpSlaveHandshake error
      WARNING: TCP slave agent connection handler #2018 with /172.16.100.79:51299 is aborted: Unrecognized name: dev-build-03

      The logs are flooded with entries like that, with another one coming in every second.

      Lastly, there is one slave that still shows as online but should be offline. That slave is fully shut down, yet Jenkins sees it as still fully online. All of the offline slaves are running Jenkins' slave.jar file in headless mode, so I can see the console output. On their end they all think they are "Online", but Jenkins itself shows them as offline.
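
      For reference, every slave is launched with the usual headless slave.jar invocation, roughly like the following (the master URL and secret here are placeholders, not my real values):

          java -jar slave.jar -jnlpUrl http://<master>/computer/dev-build-03/slave-agent.jnlp -secret <secret>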

      This bug has been haunting me for quite a while now, and it is killing production for me. I really need to know if there is a fix for this, or at the very least, a version of Jenkins I can downgrade to that doesn't have this issue.

      Thank you!

        1. jenkins-slave.0.err.log
          427 kB
        2. masterJenkins.log
          370 kB
        3. log.txt
          220 kB

          [JENKINS-24155] Jenkins Slaves Go Offline In Large Quantities and Don't Reconnect Until Reboot

          Kevin Randino created issue -
          Daniel Beck made changes -
          Labels Original: disconnect exception jenkins mass-death multiple offline reboot slave slaves New: disconnect exception jenkins offline slave slaves

          Daniel Beck added a comment -

          Why would Jenkins think the nodes are already connected? Does it show the computers as still online? Is there anything in a specific node's log (assuming the excerpt is from the main Jenkins log)?

          Daniel Beck made changes -
          Component/s New: core [ 15593 ]
          Component/s Original: slave-status [ 15981 ]
          Labels Original: disconnect exception jenkins offline slave slaves New: remoting
          Priority Original: Critical [ 2 ] New: Major [ 3 ]

          Daniel Beck added a comment -

          What's the last Jenkins version that did not have this problem? When you downgrade, does it go away?


          Kevin Randino added a comment -

          I'm afraid I am infrequent on updates, and have always had issues with my nodes in one way or another, so it's hard to pinpoint exactly when this started. I would say at least since 1.565, but probably before then too.

          When I say that the node still claims it is connected, I am referring to the console log that is displayed on the node itself. Jenkins still sees the node as offline and says "The current peer is reconnecting." in the node status.


          Richard Mortimer added a comment -

          I've seen the same issue on a Windows slave running a self-built version of 1.577-SNAPSHOT. The slave error log suggests that the slave saw a connection reset, but when it reconnected the master thought the slave was still connected and the connection retries failed.

          Aug 17, 2014 11:05:02 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
          SEVERE: I/O error in channel channel
          java.net.SocketException: Connection reset
          	at java.net.SocketInputStream.read(Unknown Source)
          	at java.net.SocketInputStream.read(Unknown Source)
          	at java.io.BufferedInputStream.fill(Unknown Source)
          	at java.io.BufferedInputStream.read(Unknown Source)
          	at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
          	at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
          	at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
          	at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
          	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
          
          Aug 17, 2014 11:05:02 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Terminated
          Aug 17, 2014 11:05:12 PM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onReconnect
          INFO: Restarting slave via jenkins.slaves.restarter.WinswSlaveRestarter@5f849b
          Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main createEngine
          INFO: Setting up slave: Cygnet
          Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener <init>
          INFO: Jenkins agent is running in headless mode.
          Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Locating server among [http://jenkins.example/]
          Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connecting to jenkins.example:42715
          Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Handshaking
          Aug 17, 2014 11:05:17 PM hudson.remoting.jnlp.Main$CuiListener error
          SEVERE: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
          java.lang.Exception: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
          	at hudson.remoting.Engine.onConnectionRejected(Engine.java:306)
          	at hudson.remoting.Engine.run(Engine.java:276)
          
          Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main createEngine
          INFO: Setting up slave: Cygnet
          Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener <init>
          INFO: Jenkins agent is running in headless mode.
          Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Locating server among [http://jenkins.example/]
          Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connecting to jenkins.example:42715
          Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Handshaking
          Aug 17, 2014 11:06:17 PM hudson.remoting.jnlp.Main$CuiListener error
          SEVERE: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
          java.lang.Exception: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
          	at hudson.remoting.Engine.onConnectionRejected(Engine.java:306)
          	at hudson.remoting.Engine.run(Engine.java:276)
          
          Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main createEngine
          INFO: Setting up slave: Cygnet
          Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener <init>
          INFO: Jenkins agent is running in headless mode.
          Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Locating server among [http://jenkins.example/]
          Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connecting to jenkins.example:42715
          Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Handshaking
          Aug 17, 2014 11:07:18 PM hudson.remoting.jnlp.Main$CuiListener error
          SEVERE: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
          java.lang.Exception: The server rejected the connection: Cygnet is already connected to this master. Rejecting this connection.
          	at hudson.remoting.Engine.onConnectionRejected(Engine.java:306)
          	at hudson.remoting.Engine.run(Engine.java:276)
          

          After 3 retries at restarting, the Windows service restarter gave up, and unfortunately I didn't attempt to reconnect until after I had restarted the master over 12 hours later.

          The equivalent part of the master's log is as follows (only the first restart is included here, but the others are equivalent).

          Aug 17, 2014 11:05:18 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
          INFO: Accepted connection #7 from /192.168.1.115:60293
          Aug 17, 2014 11:05:18 PM jenkins.slaves.JnlpSlaveHandshake error
          WARNING: TCP slave agent connection handler #7 with /192.168.1.115:60293 is aborted: Cygnet is already connected to this master. Rejecting this connection.
          Aug 17, 2014 11:05:18 PM jenkins.slaves.JnlpSlaveHandshake error
          WARNING: TCP slave agent connection handler #7 with /192.168.1.115:60293 is aborted: Unrecognized name: Cygnet
          Aug 17, 2014 11:06:19 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
          INFO: Accepted connection #8 from /192.168.1.115:60308
          

          The master log does not show any evidence of the slave connection being broken in the 12 hours before the master restart.

          If the master is not detecting that the connection has closed then it will still think that the slave is connected and will refuse re-connections as seen.

          Sadly I didn't get a stack trace of the master's threads to see whether any threads were blocked anywhere.
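
          For the next occurrence it would be worth grabbing a thread dump of the master to see whether anything is blocked; something along these lines should do (with <jenkins-pid> being the master's Java process id):

          # jstack ships with the JDK; the dump goes to the named file
          jstack -l <jenkins-pid> > jenkins-master-threads.txt
          # alternatively, kill -3 <jenkins-pid> makes the JVM print the dump to its console log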


          Richard Mortimer added a comment -

          I wonder whether this issue is related to the changes made circa 4th April 2014 to change JNLP slaves to use NIO.

          See commit d4c74bf35d4 in 1.599 and also the corresponding changes made in remoting 2.38.

          This may also be related to JENKINS-23248, but my build was definitely using the integrated fixes for that, so that cannot be the whole story.


          Richard Mortimer added a comment -

          After a bit of experimenting I can reproduce this scenario fairly easily using an Ubuntu 12.04 master and a Windows 7 slave (both running Java 7). Once the slave is connected, I forcibly suspend the Windows 7 computer and monitor the TCP connection between the master and the slave (on the master) using netstat.

          netstat -na | grep 42715
          tcp6       0      0 :::42715                :::*                    LISTEN
          tcp6       0      0 192.168.1.23:42715      192.168.1.24:58905      ESTABLISHED
          tcp6       0   2479 192.168.1.23:42715      192.168.1.115:61283     ESTABLISHED
          

          When the master attempts to "ping" the slave, the slave does not respond and the TCP send queue builds up (2479 bytes in the example above).
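
          The queue growth is easy to watch with something like the following on the master (42715 is the JNLP agent port in my setup; substitute whatever port your master listens on):

          # refresh the connection list every 5 seconds
          watch -n 5 'netstat -tan | grep 42715'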

          Once the queue has been building for a few minutes, I bring the Windows 7 machine back to life and let things recover naturally.

          I observe that the Windows 7 machine issues a TCP RST on the connection. But the Linux master does not seem to react to the RST and continues to add data into the send queue.

          During this time the slave has attempted to restart the connection and failed, because the master thinks that the slave is still connected. The Windows slave service stops attempting to restart after a couple of failures.

          After a few minutes the channel pinger on the master declares that the slave is dead:

          Aug 19, 2014 9:34:24 PM hudson.slaves.ChannelPinger$1 onDead
          INFO: Ping failed. Terminating the channel.
          java.util.concurrent.TimeoutException: Ping started on 1408480224640 hasn't completed at 1408480464640
                  at hudson.remoting.PingThread.ping(PingThread.java:120)
                  at hudson.remoting.PingThread.run(PingThread.java:81)
          
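
          (Incidentally, the two timestamps in that TimeoutException are exactly 240000 ms apart, i.e. four minutes, which is presumably the ping timeout in use. If I remember the property name correctly the ping interval can be tuned on the master with something like the line below, though shortening it would only make the master notice the dead channel sooner, not release the socket any faster.)

          # assumption: hudson.slaves.ChannelPinger.pingInterval is the relevant property (value in minutes)
          java -Dhudson.slaves.ChannelPinger.pingInterval=2 -jar jenkins.war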

          But even at this point the TCP stream stays open and the master still treats the slave connection as if it were operating.

          After a further 10 minutes the connection does close. It seems like this is a standard TCP timeout.

          WARNING: Communication problem
          java.io.IOException: Connection timed out
                  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
                  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
                  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
                  at sun.nio.ch.IOUtil.read(IOUtil.java:197)
                  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
                  at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
                  at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
                  at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
                  at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:744)
          
          Aug 19, 2014 9:44:13 PM jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed
          WARNING: NioChannelHub keys=2 gen=2823: Computer.threadPoolForRemoting [#2] for + Cygnet terminated
          java.io.IOException: Failed to abort
                  at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:195)
                  at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:581)
                  at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:262)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
                  at java.lang.Thread.run(Thread.java:744)
          Caused by: java.io.IOException: Connection timed out
                  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
                  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
                  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
                  at sun.nio.ch.IOUtil.read(IOUtil.java:197)
                  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
                  at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
                  at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
                  at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
                  ... 6 more
          

          There does not seem to be a "fail fast" mechanism in operation. It isn't clear whether this is due to the Linux networking stack or whether Java could have failed a lot quicker once it determined that the connection ping had timed out.

          Sadly it is not immediately obvious that the disconnect could be done instantly, because it all seems to be tied up with the standard TCP retries.
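
          If the delay really is just the kernel's retransmission back-off, one thing that might be worth experimenting with (I have not tried it) is lowering net.ipv4.tcp_retries2 on the master so that an unacknowledged connection is declared dead sooner:

          # untested sketch: the default of 15 retries works out to roughly a quarter of an hour
          # or more before the kernel gives up; 8 retries gives up after a minute or two
          sysctl -w net.ipv4.tcp_retries2=8
          # add "net.ipv4.tcp_retries2 = 8" to /etc/sysctl.conf to make it persistent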


          Richard Mortimer added a comment -

          One potential workaround to try is adding

          -Djenkins.slaves.NioChannelSelector.disabled=true
          

          onto the Jenkins master's launcher command-line arguments. On Debian/Ubuntu that is as simple as adding the above to JAVA_ARGS in /etc/default/jenkins.
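
          For example, assuming the stock Debian defaults, the resulting line looks roughly like this:

          # /etc/default/jenkins
          JAVA_ARGS="-Djava.awt.headless=true -Djenkins.slaves.NioChannelSelector.disabled=true"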

          If launching Jenkins from the command line it would be:

          java -Djenkins.slaves.NioChannelSelector.disabled=true -jar jenkins.war
          

          I just tested this on my system and it does seem to change the behaviour when I run my test case. In 3 tests the slave continued working correctly each time. In 2 of these the queued traffic was simply delivered and things continued as before. In the other case the original TCP connection entered the TIME_WAIT state and a new connection was started successfully by the recently suspended slave.

          Wed Aug 20 11:18:23 BST 2014
          tcp6       0      0 :::42715                :::*                    LISTEN     
          tcp6       0      0 192.168.1.23:42715      192.168.1.115:50570     TIME_WAIT  
          tcp6       0      0 192.168.1.23:42715      192.168.1.24:47835      ESTABLISHED
          tcp6       0      0 192.168.1.23:42715      192.168.1.115:50619     ESTABLISHED
          

          From this I suspect that the new NIO-based method of communicating with slaves does not cause the underlying TCP socket to be closed until the TCP timers time out, whereas with the thread-per-slave method the connection is torn down almost immediately.

          It would be good to know if the NioChannelSelector workaround outlined above helps others.


            Assignee: Unassigned
            Reporter: Kevin Randino (krandino)
            Votes: 34
            Watchers: 48
