[JENKINS-22932] Jenkins slave cannot reconnect to Master once it has been disconnected unless Jenkins is restarted

      When using a Windows Jenkins slave with an OS X master (with the slave set up according to https://wiki.jenkins-ci.org/display/JENKINS/Step+by+step+guide+to+set+up+master+and+slave+machines), disconnecting from either the slave side or the master side (by selecting 'disconnect' from Nodes > NodeName) leaves the slave unable to reconnect until the master Jenkins is restarted, and an error is shown in the node information. This is extremely inconvenient, because the slave machine must be accessed every time the connection is interrupted (e.g. by a restart of Jenkins or of the master machine). The following stack trace is seen on disconnect:

      Connection was broken

      java.io.IOException: Failed to abort
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:184)
      at org.jenkinsci.remoting.nio.NioChannelHub.abortAll(NioChannelHub.java:599)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:481)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      at java.lang.Thread.run(Thread.java:695)
      Caused by: java.nio.channels.ClosedChannelException
      at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:663)
      at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:430)
      at org.jenkinsci.remoting.nio.Closeables$1.close(Closeables.java:20)
      at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:289)
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:226)
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:224)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:474)
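
The `Caused by` in the trace above is a `ClosedChannelException` raised from `shutdownInput`. As a minimal sketch of the underlying JDK behaviour only, the snippet below shows that calling `SocketChannel.shutdownInput()` on a channel that is already closed throws that exception. The class name `ClosedChannelDemo` and the helper `shutdownInputQuietly` are hypothetical names chosen for illustration. They are not part of Jenkins or the remoting library, and this is not presented as the fix for this issue.

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.SocketChannel;

final class ClosedChannelDemo {

    /** Returns true if shutdownInput() on an already-closed channel throws ClosedChannelException. */
    static boolean shutdownAfterCloseThrows() throws IOException {
        // An unconnected channel is enough here: the JDK checks "closed" before "connected".
        SocketChannel ch = SocketChannel.open();
        ch.close();
        try {
            ch.shutdownInput();
            return false;
        } catch (ClosedChannelException e) {
            return true;
        }
    }

    /** Hypothetical tolerant variant: treats "already closed" as nothing left to do. */
    static void shutdownInputQuietly(SocketChannel ch) throws IOException {
        try {
            ch.shutdownInput();
        } catch (ClosedChannelException e) {
            // The channel is already closed, so its input side is already shut down.
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("shutdownInput after close throws: " + shutdownAfterCloseThrows());
    }
}
```

Whether tolerating the exception like this would be appropriate inside the remoting code is an open question; the sketch only reproduces the exception type seen in the trace.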

          dc r created issue -

          Daniel Beck added a comment -

          How are the slaves installed and started? JNLP, Windows Service? Anything interesting in the log files in the slave's jenkins home dir?

          dc r added a comment - - edited

          Thanks for the reply, Daniel. Sorry, I should have said that I'm using JNLP for the connection. I browse to the master Jenkins in a browser on the slave machine, find the slave in the nodes list and click 'launch' to start Java Web Start, which gives me a window that says connected. Even when the error occurs and the master does not show the slave as connected, that window on the slave still says connected. There's nothing interesting in the slave's Jenkins logs, and the slave log on the master gives the following, with the same error as in my description:

          JNLP agent connected from /xxx.xxx.xx.xx
          <===[JENKINS REMOTING CAPACITY]===>@@^@Slave.jar version: 2.40
          This is a Windows slave
          Slave successfully connected and online
          Effective SlaveRestarter on XXXXXXXXX: [jenkins.slaves.restarter.WinswSlaveRestarter@afe676b]
          Connection terminated
          ERROR: Connection terminated
          java.io.IOException: Failed to abort
          at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:184)
          at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:563)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
          at java.lang.Thread.run(Thread.java:695)
          Caused by: java.net.SocketException: Socket is not connected
          at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
          at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:665)
          at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:430)
          at org.jenkinsci.remoting.nio.Closeables$1.close(Closeables.java:20)
          at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:289)
          at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:497)
          ... 7 more
          ERROR: Connection terminated
          java.io.IOException: Failed to abort
          at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:184)
          at org.jenkinsci.remoting.nio.NioChannelHub.abortAll(NioChannelHub.java:599)
          at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:481)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
          at java.lang.Thread.run(Thread.java:695)
          Caused by: java.nio.channels.ClosedChannelException
          at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:663)
          at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:430)
          at org.jenkinsci.remoting.nio.Closeables$1.close(Closeables.java:20)
          at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:289)
          at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:226)
          at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:224)
          at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:474)
          ... 7 more

          Brian Prodoehl added a comment - - edited

          I am seeing the same thing.

          Master - Jenkins 1.563, Fedora 14, Java 1.7
          Slave - Windows Server 2008 R2, Java JRE 1.8.0_05

          The slave won't connect through the Windows service anymore, even after uninstalling and reinstalling the service, so I've been launching it via JNLP and encountering this error. The last time, it stayed online for maybe a minute before hitting this problem.

          java.io.IOException: Failed to abort
          at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:184)
          at org.jenkinsci.remoting.nio.NioChannelHub.abortAll(NioChannelHub.java:599)
          at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:481)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
          at java.util.concurrent.FutureTask.run(FutureTask.java:166)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:724)
          Caused by: java.nio.channels.ClosedChannelException
          at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:771)
          at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:421)
          at org.jenkinsci.remoting.nio.Closeables$1.close(Closeables.java:20)
          at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:289)
          at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:226)
          at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:224)
          at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:474)
          ... 7 more

          EDIT: sorry, I mistakenly said 1.562 initially. I should have said 1.563.

          dc r added a comment -

          I noticed that since upgrading to Jenkins 1.563 this week, the JNLP connection seemed to persist even when the slave server was rebooted. I initially thought this was fixed in the latest release, but perhaps it is still an outstanding issue on some platforms. I found that once this error was encountered, the slave could never recover and reconnect until Jenkins was restarted. Could you try stopping the slave service, restarting Jenkins, then starting the service again, and see if you still get the error?

          Greg Tangey added a comment - - edited

          Same problem, Jenkins 1.563

          Machines (all on VMWare):
          MASTER: Ubuntu 12.04
          SLAVE: Windows Server 2012 (JNLP or Windows service) (exhibits issue)
          SLAVE: Ubuntu 12.04 (via SSH, works fine)

          Steps to reproduce:

          1. Connect the Windows slave.
          2. Disconnect the Windows slave from either side (disconnect in the Jenkins UI, stop the service, or close the JNLP window).
          3. jenkins.log will contain the error from the description above.
          4. Further connections from the slave side appear to work, but...
          5. The Jenkins UI for that slave node displays http://puu.sh/8Se22.png and the node is offline.

          When a slave connects while in this broken state, the slave's log shows only:

          JNLP agent connected from /10.0.0.248
          <===[JENKINS REMOTING CAPACITY]===>

          However, when it works (after a fresh restart of the Jenkins instance), the output gets a lot further:
          JNLP agent connected from /10.0.0.248
          <===[JENKINS REMOTING CAPACITY]===>Slave.jar version: 2.41
          This is a Windows slave
          Effective SlaveRestarter on MSBuild: []
          Slave successfully connected and online
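
The difference between the two logs above can be checked mechanically. The following is a rough sketch, assuming the two marker strings quoted in this thread are stable across versions (which is not guaranteed); the class `AgentLogCheck` and its `State` enum are hypothetical names, not part of Jenkins.

```java
import java.util.Arrays;
import java.util.List;

final class AgentLogCheck {

    // Marker strings copied from the logs quoted in this thread.
    static final String CONNECT_MARKER = "JNLP agent connected from";
    static final String ONLINE_MARKER = "Slave successfully connected and online";

    enum State { NEVER_CONNECTED, STALLED_HANDSHAKE, ONLINE }

    /** Classifies the most recent connection attempt found in the given log lines. */
    static State classify(List<String> lines) {
        State state = State.NEVER_CONNECTED;
        for (String line : lines) {
            if (line.contains(CONNECT_MARKER)) {
                // A new attempt starts; until the online marker appears it counts as stalled.
                state = State.STALLED_HANDSHAKE;
            } else if (line.contains(ONLINE_MARKER) && state == State.STALLED_HANDSHAKE) {
                state = State.ONLINE;
            }
        }
        return state;
    }

    public static void main(String[] args) {
        List<String> broken = Arrays.asList(
                "JNLP agent connected from /10.0.0.248",
                "<===[JENKINS REMOTING CAPACITY]===>");
        System.out.println(classify(broken));
    }
}
```

A log that stops after the capacity preamble is reported as `STALLED_HANDSHAKE`, which matches the broken state described above; it says nothing about why the handshake stalled.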

          Derek Eclavea added a comment -

          Running into the same issue myself, and I was able to track it back to release 1.560.

          The changelog shows the following update, which I suspect is where it was introduced:

          JNLP slaves are now handled through NIO-based remoting channels for better scalability.

          Jesse Glick made changes -
          Labels
          Original: JNLP Master Slave disconnect java start web
          New: JNLP disconnect regression slaves

          Quentin Hartman added a comment -

          I am seeing this problem as well on Jenkins 1.564; however, it seems to affect only my Windows 7 slave. The Windows Server 2008 slave I have seems to be able to reconnect just fine. I haven't explicitly tested it for fear of interrupting work, but both machines were affected by some internet failures we had the other day, and only the Windows 7 box was unable to reconnect. The Windows Server 2008 box seems to have reconnected on its own once the connection to the Jenkins master returned.

          Clinton Barr added a comment -

          I am also seeing this issue with 1.561 and 1.565 (I updated recently), but after several reconnections on Windows Server 2008. I wrote a tool that takes the slave offline, shuts down and restarts slave-agent.jnlp, and brings the slave online again. After this project runs 6-7 times on all of my Windows Server 2008 nodes, they refuse to connect, even after system reboots. As in previous comments, the console shows that the slave is connected and online, but the slave is not marked as online, and no new builds are accepted by those nodes.

          I'm running as many as 75 slaves at a time and have previously been able to perform these slave-agent.jnlp restarts.
