
Jenkins slave cannot reconnect to Master once it has been disconnected unless Jenkins is restarted

      When using a Windows Jenkins slave with an OS X master (with the slave set up according to https://wiki.jenkins-ci.org/display/JENKINS/Step+by+step+guide+to+set+up+master+and+slave+machines), disconnecting either from the slave side or from the master (by selecting 'disconnect' from Nodes > NodeName) leaves the slave unable to reconnect until the master Jenkins is restarted, and an error is shown in the node information. This is extremely inconvenient, as it means that the slave machine must be accessed every time the connection is interrupted (e.g. by a restart of Jenkins or of the master machine). The following stack trace is seen on disconnect:

      Connection was broken

      java.io.IOException: Failed to abort
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:184)
      at org.jenkinsci.remoting.nio.NioChannelHub.abortAll(NioChannelHub.java:599)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:481)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      at java.lang.Thread.run(Thread.java:695)
      Caused by: java.nio.channels.ClosedChannelException
      at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:663)
      at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:430)
      at org.jenkinsci.remoting.nio.Closeables$1.close(Closeables.java:20)
      at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:289)
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:226)
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:224)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:474)
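
The "Caused by" section of the trace above ends in `SocketChannelImpl.shutdownInput` throwing `ClosedChannelException`. As a minimal sketch of that one JDK behaviour (not of the Jenkins remoting code itself), the snippet below closes a `SocketChannel` and then tries to shut down its input. The class and method names are hypothetical and chosen only for illustration, and the channel is never connected, so this does not reproduce the reconnect failure described in the report.

```java
import java.io.IOException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.SocketChannel;

// Hypothetical illustration class; not part of Jenkins or the remoting library.
public class ShutdownAfterClose {

    /**
     * Closes a channel and then tries to shut down its input side.
     * Returns true if the JDK reports this with ClosedChannelException.
     */
    static boolean shutdownAfterCloseFails() throws IOException {
        SocketChannel channel = SocketChannel.open();
        channel.close(); // the channel is now closed
        try {
            channel.shutdownInput(); // shutting down input on a closed channel
            return false;
        } catch (ClosedChannelException e) {
            return true; // same exception type as in the trace's "Caused by"
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("ClosedChannelException thrown: " + shutdownAfterCloseFails());
    }
}
```

This only shows where the exception type comes from. Why the channel was already closed at that point in `NioChannelHub` is not something the trace alone answers.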

          [JENKINS-22932] Jenkins slave cannot reconnect to Master once it has been disconnected unless Jenkins is restarted

          bcygan added a comment -

          Server: 1.590 with swarm client plugin 1.15

          Client: described problem occurs with swarm-client-1.20-jar-with-dependencies.jar, but not with swarm-client-1.15-jar-with-dependencies.jar


          Shannon Kerr added a comment -

          Same issue. Jenkins 1.574. Server Host is Ubuntu 12.04. Slave is Windows 7 x64 VM.


          Marcus Jacobsson added a comment -

          Have the same problem with Jenkins LTS 1.580.3. In our case the nodes go offline a few hours after restarting the master server, and it's not all nodes, just a few each time (different nodes each time).

          The server is running on Ubuntu 14.04 and the slaves are running Windows 7 x64.

          Connection was broken
          java.io.EOFException
          at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:616)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
          at java.util.concurrent.FutureTask.run(FutureTask.java:262)
          at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:111)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
          at java.util.concurrent.FutureTask.run(FutureTask.java:262)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:745)

          Kohsuke Kawaguchi added a comment -

          For exceptions that say "NioChannelHub is not currently running", we are expecting a nested exception. Please attach the full stack trace including all the "Caused by ..." sections, not just the top-most part of it.

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Kohsuke Kawaguchi
          Path:
          src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java
          http://jenkins-ci.org/commit/remoting/281ee8e02c0d81d46ed612b5dc8c4e41db940d0b
          Log:
          Merge pull request #38 from jenkinsci/JENKINS-22932

          JENKINS-22932

          Compare: https://github.com/jenkinsci/remoting/compare/ba844a624235...281ee8e02c0d

          Hang Dong added a comment -

          Seeing this on a Windows master with 1.620. When adding a new node, we typically connect via the JNLP link, then install it as a service. We hit the issue when the service client reconnects. Perhaps this helps: because the master is secured with HTTPS, the first service connect won't have valid cert info (and we suspect this triggers the issue on the master side). We update the XML with the certificate info, then stop and restart the service, but at this stage the master is already in a bad state: not only can the new slave not reconnect, the master actually loses its connection to all other slaves as well. Our workaround so far is restarting the master...

          10:17:07 java.io.IOException: remote file operation failed: C:\JSBuilds\workspace****************** at hudson.remoting.Channel@1530a3e:********: hudson.remoting.ChannelClosedException: channel is already closed
          10:17:07 at hudson.FilePath.act(FilePath.java:987)
          10:17:07 at hudson.FilePath.act(FilePath.java:969)
          10:17:07 at hudson.FilePath.mkdirs(FilePath.java:1152)
          10:17:07 at hudson.model.AbstractProject.checkout(AbstractProject.java:1275)
          10:17:07 at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:610)
          10:17:07 at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
          10:17:07 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:532)
          10:17:07 at hudson.model.Run.execute(Run.java:1741)
          10:17:07 at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
          10:17:07 at hudson.model.ResourceController.execute(ResourceController.java:98)
          10:17:07 at hudson.model.Executor.run(Executor.java:381)
          10:17:07 Caused by: hudson.remoting.ChannelClosedException: channel is already closed
          10:17:07 at hudson.remoting.Channel.send(Channel.java:550)
          10:17:07 at hudson.remoting.Request.call(Request.java:129)
          10:17:07 at hudson.remoting.Channel.call(Channel.java:752)
          10:17:07 at hudson.FilePath.act(FilePath.java:980)
          10:17:07 ... 10 more
          10:17:07 Caused by: java.io.IOException
          10:17:07 at hudson.remoting.Channel.close(Channel.java:1110)
          10:17:07 at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:118)
          10:17:07 at hudson.remoting.PingThread.ping(PingThread.java:126)
          10:17:07 at hudson.remoting.PingThread.run(PingThread.java:85)
          10:17:07 Caused by: java.util.concurrent.TimeoutException: Ping started at 1441990735275 hasn't completed by 1441990975286
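
The two numbers in the `TimeoutException` message above look like epoch timestamps in milliseconds. Assuming that reading is right, a small sketch of the arithmetic shows how long the ping had been outstanding when the channel was closed. The class and method names are hypothetical and used only for this illustration.

```java
import java.time.Duration;

// Hypothetical helper for the arithmetic only; not part of Jenkins.
public class PingTimeoutGap {

    /** Elapsed time between the two epoch-millisecond values quoted in the exception message. */
    static Duration elapsed(long startedAtMillis, long notCompletedByMillis) {
        return Duration.ofMillis(notCompletedByMillis - startedAtMillis);
    }

    public static void main(String[] args) {
        // Values copied from "Ping started at 1441990735275 hasn't completed by 1441990975286".
        Duration gap = elapsed(1441990735275L, 1441990975286L);
        System.out.println(gap.toMillis() + " ms");   // 240011 ms
        System.out.println(gap.toMinutes() + " min"); // 4 min
    }
}
```

The gap is about 240 seconds, so the ping went unanswered for roughly four minutes before the channel was closed.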


          Shesh Patel added a comment -

          Encountered this issue after upgrading Jenkins to version 1.622. I am getting the following error while connecting to a Windows slave. I am using the "launch slave agents via Java Web Start" option to launch the slave. It used to work fine in the previous version, 1.597. The bug seems to have been reintroduced; please follow up with the suggested fix.

          java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@7029f3e3[name=windows_02]
          	at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:208)
          	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:628)
          	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:745)
          Caused by: java.io.IOException: Connection reset by peer
          	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
          	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
          	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
          	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
          	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
          	at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
          	at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
          	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:561)
          


          Brian L added a comment - - edited

          This is affecting me as well.

          Master: Jenkins ver. 1.638, Ubuntu 14.04.3 LTS, running JRE 1.8.0_65-b17
          Slave: Windows Server 2008, connected via JNLP :

              Microsoft Windows [Version 6.1.7601]
              Copyright (c) 2009 Microsoft Corporation.  All rights reserved.
              
              C:\Users\Administrator>java -version
              java version "1.8.0_31"
              Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
              Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
          
          
          

          Do we have a workaround? I wonder if adding some Job configuration to programmatically kill the process running java ... -jar "...\slave.jar" might work?


          Brian L added a comment -

          I didn't have much luck with an actual patch, but in the meantime, here's the workaround I'm attempting to implement:

          1. Install the Groovy plugin
          2. Use this code as its own job:

          import jenkins.model.*
          
          println "The system is now going down for restart."
          println "Once the bug 'https://issues.jenkins-ci.org/browse/JENKINS-22932' is resolved, this job should be removed."
            
          Jenkins.instance.doSafeRestart(null);
          

          3. Have the job triggered after any of your Windows slaves finish doing work


          Oleg Nenashev added a comment -

          Unfortunately I have no capacity to work on Remoting in the medium term, so I will unassign it and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.


            Assignee: Unassigned
            Reporter: dcr
            Votes: 37
            Watchers: 59