[JENKINS-22932] Jenkins slave cannot reconnect to Master once it has been disconnected unless Jenkins is restarted

Daniel Beck added a comment - 2014-05-08 20:06

How are the slaves installed and started? JNLP, Windows Service? Anything interesting in the log files in the slave's jenkins home dir?

dc r added a comment - 2014-05-09 14:53 - edited

Thanks for the reply, Daniel. Sorry, I should have said that I'm using JNLP for the connection. On the slave machine I browse to the Jenkins master, find the slave in the nodes list, and click 'launch' to start Java Web Start, which gives me a window that says connected. Even when the error occurs and the master no longer shows the slave as connected, that window on the slave still says connected. There's nothing interesting in the slave's Jenkins logs; the slave log on the master shows the following, with the same error as in my description:

JNLP agent connected from /xxx.xxx.xx.xx
<===[JENKINS REMOTING CAPACITY]===>Slave.jar version: 2.40
This is a Windows slave
Slave successfully connected and online
Effective SlaveRestarter on XXXXXXXXX: [jenkins.slaves.restarter.WinswSlaveRestarter@afe676b]
Connection terminated
ERROR: Connection terminated
java.io.IOException: Failed to abort
at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:184)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:563)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
Caused by: java.net.SocketException: Socket is not connected
at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:665)
at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:430)
at org.jenkinsci.remoting.nio.Closeables$1.close(Closeables.java:20)
at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:289)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:497)
... 7 more
ERROR: Connection terminated
java.io.IOException: Failed to abort
at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:184)
at org.jenkinsci.remoting.nio.NioChannelHub.abortAll(NioChannelHub.java:599)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:481)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
Caused by: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:663)
at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:430)
at org.jenkinsci.remoting.nio.Closeables$1.close(Closeables.java:20)
at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:289)
at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:226)
at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:224)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:474)
... 7 more

Brian Prodoehl added a comment - 2014-05-16 18:14 - edited

I am seeing the same thing.

Master - Jenkins 1.563, Fedora 14, Java 1.7
Slave - Windows Server 2008 R2, Java JRE 1.8.0_05

The slave won't connect through the Windows service anymore, even after uninstalling and reinstalling the service, so I've been launching it via JNLP and encountering this error. This last time it stayed online for maybe a minute before hitting the problem.

java.io.IOException: Failed to abort
at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:184)
at org.jenkinsci.remoting.nio.NioChannelHub.abortAll(NioChannelHub.java:599)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:481)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:771)
at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:421)
at org.jenkinsci.remoting.nio.Closeables$1.close(Closeables.java:20)
at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:289)
at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:226)
at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport$1.call(NioChannelHub.java:224)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:474)
... 7 more

EDIT: sorry, I mistakenly said 1.562 initially. I should have said 1.563.

dc r added a comment - 2014-05-17 12:55

I noticed that since upgrading to Jenkins 1.563 this week, the JNLP connection seemed to persist even when the slave server was rebooted. I initially thought this was fixed in the latest release, but perhaps it is still an outstanding issue on some platforms. I found that once this error was encountered, you could never recover and reconnect until Jenkins was restarted. Could you try stopping the slave service, restarting Jenkins, and then starting the service again to see whether you still get the error?

Greg Tangey added a comment - 2014-05-19 06:39 - edited

Same problem, Jenkins 1.563

Machines (all on VMWare):
MASTER: Ubuntu 12.04
SLAVE: Windows Server 2012 (JNLP or Windows service) (exhibits issue)
SLAVE: Ubuntu 12.04 (via SSH, works fine)

Steps to reproduce

1. Connect the Windows slave.
2. Disconnect the Windows slave from either side (disconnect in the Jenkins UI, stop the service, or close the JNLP window).
3. jenkins.log will output the error in the description above.
4. Further connections from the slave side will appear to work, but...
5. The Jenkins UI for that slave node displays http://puu.sh/8Se22.png and the node is offline.

When a slave connects in this broken state, the slave's log shows:

JNLP agent connected from /10.0.0.248
<===[JENKINS REMOTING CAPACITY]===>

However, when it works (after a fresh restart of the jenkins instance) the output gets a lot further:
JNLP agent connected from /10.0.0.248
<===[JENKINS REMOTING CAPACITY]===>Slave.jar version: 2.41
This is a Windows slave
Effective SlaveRestarter on MSBuild: []
Slave successfully connected and online

Derek Eclavea added a comment - 2014-05-19 16:37

Running into the same issue myself, and I was able to track it back to release 1.560.

The changelog shows the following update, which I suspect is where it was introduced:

JNLP slaves are now handled through NIO-based remoting channels for better scalability.

Quentin Hartman added a comment - 2014-05-28 17:03

I am seeing this problem as well on Jenkins 1.564, but it seems to affect only my Windows 7 slave. My Windows Server 2008 slave seems able to reconnect just fine. I haven't explicitly tested it for fear of interrupting work, but both machines were affected by some internet failures we had the other day, and only the Windows 7 box was unable to reconnect. The Windows Server 2008 box seems to have reconnected on its own once the connection to the Jenkins master returned.

Clinton Barr added a comment - 2014-05-30 22:57

I am also seeing this issue with 1.561 and 1.565 (I updated recently), but after several reconnections on Windows Server 2008. I wrote a tool that takes the slave offline, shuts down and restarts slave-agent.jnlp, and brings the slave online again. After this project runs 6-7 times on all of my Windows Server 2008 nodes, they refuse to connect, even after system reboots. Just as in previous comments, the console shows that the slave is connected and online, but the slave is not marked as online. Those nodes accept no new builds.

I'm running as many as 75 slaves at a time and have previously been able to perform these slave-agent.jnlp restarts.

Allister MacLeod added a comment - 2014-06-04 19:46

I am also getting this on 1.566 running Java 6 on the server (Linux) and Java 7 on the slave nodes (Windows 8 and MacOS). For a while I did not notice it because I was using older copies of slave.jar on existing nodes. When I copied the same older slave.jar to one of the new nodes that was misbehaving, it started working again. The old one I have was from mid-March of this year, so probably about 1.555 or so.

SCM/JIRA link daemon added a comment - 2014-06-09 19:25

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
src/main/java/org/jenkinsci/remoting/nio/Closeables.java
http://jenkins-ci.org/commit/remoting/4bb086e15c88e2756e6c90987466a8af8c593b75
Log:
JENKINS-22932

shutdownInput/Output is not idempotent, so attempting to reclose a closed socket fails.
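
For readers following along, the idea in this commit message can be sketched roughly as below. This is not the code from the linked commit; it is a minimal illustration, assuming a hypothetical helper named `shutdownInputQuietly`, of how a close routine can tolerate being called on a channel that is already closed or was never connected instead of failing as in the traces above.

```java
import java.io.IOException;
import java.nio.channels.NotYetConnectedException;
import java.nio.channels.SocketChannel;

// Hypothetical helper for illustration only; not the Jenkins remoting Closeables class.
final class IdempotentShutdown {

    /**
     * Tries to shut down the read side of the channel.
     * Returns true if the shutdown call completed without error,
     * false if the channel was already closed or not connected.
     */
    static boolean shutdownInputQuietly(SocketChannel ch) {
        if (!ch.isOpen()) {
            return false; // already closed: nothing left to do
        }
        try {
            ch.shutdownInput();
            return true;
        } catch (NotYetConnectedException | IOException e) {
            // The channel was never connected, or another closer won the race.
            return false;
        }
    }

    /** Calls the helper repeatedly on a closed channel and counts escaped exceptions. */
    static int demo() {
        int escaped = 0;
        try {
            SocketChannel ch = SocketChannel.open(); // never connected
            ch.close();
            for (int i = 0; i < 3; i++) {
                try {
                    shutdownInputQuietly(ch);
                } catch (RuntimeException e) {
                    escaped++;
                }
            }
        } catch (IOException e) {
            escaped++;
        }
        return escaped;
    }

    public static void main(String[] args) {
        System.out.println("escaped exceptions: " + demo());
    }
}
```

The point of the guard is that a second (or third) close attempt becomes a no-op rather than an IOException that propagates into the caller.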

SCM/JIRA link daemon added a comment - 2014-06-09 19:25

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java
http://jenkins-ci.org/commit/remoting/4228cf8ad89faba8716b10f381adcdeb1594bf0d
Log:
JENKINS-22932

Don't let a failed SelectorTask kill the selector thread.
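
As a rough illustration of what this commit message describes (and not the actual NioChannelHub code), the sketch below shows a loop that drains queued tasks and catches each task's failure individually, so one bad task cannot terminate the thread running the loop. The class name `ResilientTaskLoop` and its methods are assumptions made up for this example.

```java
import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for a selector thread's task queue; illustration only.
final class ResilientTaskLoop {
    private static final Logger LOGGER = Logger.getLogger(ResilientTaskLoop.class.getName());
    private final Queue<Callable<Void>> tasks = new ConcurrentLinkedQueue<>();

    void submit(Callable<Void> task) {
        tasks.add(task);
    }

    /** Runs every queued task and returns how many of them failed. */
    int drain() {
        int failures = 0;
        Callable<Void> task;
        while ((task = tasks.poll()) != null) {
            try {
                task.call();
            } catch (Exception e) {
                // Log and keep going: the loop must survive a failed task.
                failures++;
                LOGGER.log(Level.WARNING, "Queued task failed; continuing", e);
            }
        }
        return failures;
    }

    /** A failing task followed by a healthy one: the healthy one must still run. */
    static boolean demo() {
        ResilientTaskLoop loop = new ResilientTaskLoop();
        AtomicBoolean laterTaskRan = new AtomicBoolean(false);
        loop.submit(() -> { throw new IOException("Failed to abort"); });
        loop.submit(() -> { laterTaskRan.set(true); return null; });
        return loop.drain() == 1 && laterTaskRan.get();
    }

    public static void main(String[] args) {
        System.out.println("later task still ran: " + demo());
    }
}
```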

SCM/JIRA link daemon added a comment - 2014-06-09 19:25

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java
http://jenkins-ci.org/commit/remoting/23f817832c18cec9abc65363a0261eab3958adaf
Log:
[FIXED JENKINS-22932]

If the thread that serves NioChannelHub.run() leaves for any reason, stop accepting the new connection as the channel will never be serviced.
It is indicative of a problem in the code.

This is the 3rd and the final part of the fix to the problem.

Compare: https://github.com/jenkinsci/remoting/compare/546728f16212...23f817832c18
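
The fail-fast behaviour described in this commit message might look something like the sketch below. It is an assumption-laden illustration, not the real NioChannelHub: the class `GuardedHub`, its `register` method, and the exception text are invented for the example. The hub remembers which thread is serving it, clears that marker when the thread leaves for any reason, and refuses new registrations while no thread is serving.

```java
import java.io.IOException;

// Hypothetical hub for illustration; names and messages are not from Jenkins remoting.
final class GuardedHub {
    // Non-null only while some thread is inside run().
    private volatile Thread servingThread;

    /** Runs the given loop body on the calling thread and tracks liveness. */
    void run(Runnable loopBody) {
        servingThread = Thread.currentThread();
        try {
            loopBody.run();
        } finally {
            // Reached on normal exit and on any exception.
            servingThread = null;
        }
    }

    /** Refuses new channels when nothing would ever service them. */
    void register(String channelName) throws IOException {
        if (servingThread == null) {
            throw new IOException("hub is not currently running; refusing " + channelName);
        }
        // A real hub would hand the channel to the selector here.
    }

    /** Registration works while serving and fails once the serving thread has died. */
    static boolean demo() {
        GuardedHub hub = new GuardedHub();
        boolean[] acceptedWhileRunning = {false};
        try {
            hub.run(() -> {
                try {
                    hub.register("slave-1");
                    acceptedWhileRunning[0] = true;
                } catch (IOException e) {
                    acceptedWhileRunning[0] = false;
                }
                throw new IllegalStateException("simulated selector failure");
            });
        } catch (IllegalStateException expected) {
            // The serving thread "left" abnormally.
        }
        try {
            hub.register("slave-2");
            return false; // should have been refused
        } catch (IOException refused) {
            return acceptedWhileRunning[0];
        }
    }

    public static void main(String[] args) {
        System.out.println("guard behaved as expected: " + demo());
    }
}
```

Failing fast this way turns a silent hang (a slave that looks connected but is never serviced) into an explicit error on the connecting side.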

Kohsuke Kawaguchi added a comment - 2014-06-11 18:31

This is a regression in 1.560. Fix will be in 1.568.

David Riggleman added a comment - 2014-06-23 12:48

I'm still seeing this problem in 1.568. In my case, the slave nodes are being disconnected due to a ping timeout. Up until recently (not sure exact version but around version 1.560 sounds right), I never had any issues with the slave nodes not connecting. Here's a snippet of the logs. I can provide more info if needed.

Connection #19 failed
java.io.IOException: NioChannelHub is not currently running
at org.jenkinsci.remoting.nio.NioChannelHub$1.makeTransport(NioChannelHub.java:446)
at hudson.remoting.ChannelBuilder.negotiate(ChannelBuilder.java:220)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:149)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:159)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:36)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:52)
at jenkins.slaves.JnlpSlaveAgentProtocol$Handler.jnlpConnect(JnlpSlaveAgentProtocol.java:120)
at jenkins.slaves.DefaultJnlpSlaveReceiver.handle(DefaultJnlpSlaveReceiver.java:63)
at jenkins.slaves.JnlpSlaveAgentProtocol2$Handler2.run(JnlpSlaveAgentProtocol2.java:57)
at jenkins.slaves.JnlpSlaveAgentProtocol2.handle(JnlpSlaveAgentProtocol2.java:31)
at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:157)

Mike Kobler added a comment - 2014-06-23 18:38

David,

You might try upgrading the slaves with the new slave.jar that comes with 1.568.

(I was running 1.561 and seeing the issue, but only with a newer version of the slave.jar. Slaves running an older version did not show the issue).

David Riggleman added a comment - 2014-06-24 11:48

Thanks, Mike! That apparently was my problem. I updated the slave-agent.jnlp file yesterday and haven't had any issues since. I didn't realize I needed to update the slaves as well, because I thought the bug was primarily a server issue.

Patricia Wright added a comment - 2014-07-17 19:33

Looks like a regression, or sporadic issue. I'm experiencing this now.

Our environment: Ubuntu master, Windows slaves.
Jenkins: 1.572, slave.jar version: 2.43 (the version on the master)

java.io.IOException: NioChannelHub is not currently running
at org.jenkinsci.remoting.nio.NioChannelHub$1.makeTransport(NioChannelHub.java:446)
at hudson.remoting.ChannelBuilder.negotiate(ChannelBuilder.java:220)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:149)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:159)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:36)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:52)
at jenkins.slaves.JnlpSlaveAgentProtocol$Handler.jnlpConnect(JnlpSlaveAgentProtocol.java:120)
at jenkins.slaves.DefaultJnlpSlaveReceiver.handle(DefaultJnlpSlaveReceiver.java:63)
at jenkins.slaves.JnlpSlaveAgentProtocol2$Handler2.run(JnlpSlaveAgentProtocol2.java:57)
at jenkins.slaves.JnlpSlaveAgentProtocol2.handle(JnlpSlaveAgentProtocol2.java:31)
at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:157)

In our environment slaves get disconnected after a suite of tests complete (and revert to a clean vSphere snapshot) and then reconnect.

It runs fine for a while, with many disconnects/reconnects.
Then it starts throwing these exceptions and nothing can connect until a restart.

Kevin Browder added a comment - 2014-07-23 17:28

I also have the same problem as Patricia with Jenkins 1.571. Slave.jar 2.43. Very frustrating.

Patricia Wright added a comment - 2014-07-23 18:03

It ran for four days, slaves successfully disconnecting and reconnecting, then the problem surfaced again at about 5am this morning.

Many jobs were running, and all slaves disconnected at once.
Restarting Jenkins brought everything back online.

Kevin Browder added a comment - 2014-07-23 19:55

Wondering if we should reopen? (Not sure what the Jenkins JIRA process is.)
Anyway, I've restarted my Jenkins too and am hoping for the best. I did (very briefly) look at the source (http://git.io/a4WelA) and am wondering why it bothers to throw an exception there instead of just making a new selector (presumably with an atomic get-or-create or something). However, I'd probably need to look at a lot more code before I can say I know what's going on with this.

Kevin Browder added a comment - 2014-07-25 13:09

A few of us still have this issue with very recent Jenkins versions.

Kevin Browder added a comment - 2014-07-28 19:09

My feeling is that this happens for us if we power off a JNLP slave in an ungraceful way (e.g. pull the virtual power plug). However, it doesn't seem to happen every time; again, this is similar to Patricia's case. Actually, I'm a bit surprised this doesn't happen at CloudBees since, from what I understand, they've got dynamically provisioned VMs too; or maybe they use the LTS?
Anyway, I'll try to grab the Jenkins logs the next time this happens.

Andy Pham added a comment - 2014-07-29 15:40

I also hit the same problem as Patricia and Kevin with Jenkins 1.571. Slave.jar 2.37. Master on Red Hat Enterprise Linux Server release 6.5 and slave on a Windows 7 VM. Will try to restart Jenkins and see if that helps.

Jesse Glick added a comment - 2014-07-30 19:19

Are you actually seeing the same bug introduced in 1.560 and purportedly fixed in 1.568, or some other bug with related symptoms that should be filed separately?

Kevin Browder added a comment - 2014-07-30 20:03

At first blush the error appeared to be the same because the trace is very similar, the nodes show the same thing and the symptoms are the same; but on closer investigation the root cause looks like it's somewhat different than this.

Specifically:
...omitted for brevity...
Caused by: java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
at sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:87)
at java.nio.channels.SelectionKey.isReadable(SelectionKey.java:289)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:513)
... 6 more
As opposed to the ClosedChannelException.

I'll file another issue.
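
To make the trace above easier to read: the frames show `SelectionKey.isReadable()` ending in `CancelledKeyException`, which the JDK documents as thrown when the key has already been cancelled. The sketch below reproduces just that JDK behaviour with a `Pipe`, and shows that checking `isValid()` first avoids the exception. `CancelledKeyDemo` is a made-up name, and this does not claim to show what actually happens inside `NioChannelHub.run`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Illustration only (not Jenkins remoting code): querying the readiness of a
// cancelled SelectionKey throws, while checking isValid() first does not.
class CancelledKeyDemo {

    // Returns "unguarded=<outcome>, guarded=<outcome>".
    static String run() {
        try (Selector selector = Selector.open()) {
            Pipe pipe = Pipe.open();
            try {
                pipe.source().configureBlocking(false);
                SelectionKey key = pipe.source().register(selector, SelectionKey.OP_READ);
                key.cancel(); // stands in for a channel that went away mid-loop

                String unguarded;
                try {
                    key.isReadable();
                    unguarded = "ok";
                } catch (CancelledKeyException e) {
                    unguarded = "CancelledKeyException";
                }

                // Short-circuit: isReadable() is never reached for an invalid key.
                String guarded = (key.isValid() && key.isReadable()) ? "readable" : "skipped";
                return "unguarded=" + unguarded + ", guarded=" + guarded;
            } finally {
                pipe.source().close();
                pipe.sink().close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```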

Kevin Browder added a comment - 2014-07-31 12:20

Actually I think what Andy, Patricia and myself are seeing has a separate root cause and is not a regression of this issue per se. See https://issues.jenkins-ci.org/browse/JENKINS-24050

Andy Pham added a comment - 2014-08-06 03:05

Here is the stack I currently get.

<===[JENKINS REMOTING CAPACITY]===>Failed to establish the connection with the slave wdctp707
java.io.IOException: NioChannelHub is not currently running
at org.jenkinsci.remoting.nio.NioChannelHub$1.makeTransport(NioChannelHub.java:446)
at hudson.remoting.ChannelBuilder.negotiate(ChannelBuilder.java:220)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:149)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:159)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:36)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:52)
at jenkins.slaves.JnlpSlaveAgentProtocol$Handler.jnlpConnect(JnlpSlaveAgentProtocol.java:120)
at jenkins.slaves.DefaultJnlpSlaveReceiver.handle(DefaultJnlpSlaveReceiver.java:63)
at jenkins.slaves.JnlpSlaveAgentProtocol2$Handler2.run(JnlpSlaveAgentProtocol2.java:57)
at jenkins.slaves.JnlpSlaveAgentProtocol2.handle(JnlpSlaveAgentProtocol2.java:31)
at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:156)

I was trying to connect a Windows 7 VM slave for the first time. This also appears on the console:

"Ping response time is too long or timed out."

Kevin Browder added a comment - 2014-08-06 12:16

@Andy did you happen to see if there were any jobs running at the time that died? If so, did the jobs have a line containing "Caused by: java.nio.channels.ClosedChannelException" (what this issue looks to have fixed) or "Caused by: java.nio.channels.CancelledKeyException" (the issue I raised in JENKINS-24050)? The slave log alone doesn't really give you enough information to tell these two issues apart.

Additionally did your other JNLP slaves disconnect?

If none of the above are true then I think what you've got might be a different issue than this or JENKINS-24050 (and I guess it should be filed separately).
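
The triage suggested here can be sketched as a small string check over a job's console log. This is a hypothetical helper (`CauseTriage` is not an existing tool), and the mapping assumes what the thread reports: `ClosedChannelException` for this issue, and `CancelledKeyException` (the cause line in the trace quoted earlier) for JENKINS-24050.

```java
// Hypothetical triage helper: given the console log of a job that died,
// report which of the two known "Caused by:" lines it contains.
class CauseTriage {
    static String classify(String jobLog) {
        if (jobLog.contains("Caused by: java.nio.channels.ClosedChannelException")) {
            return "JENKINS-22932";
        }
        if (jobLog.contains("Caused by: java.nio.channels.CancelledKeyException")) {
            return "JENKINS-24050";
        }
        return "neither"; // possibly a different issue, to be filed separately
    }

    public static void main(String[] args) {
        String sample = "...omitted...\nCaused by: java.nio.channels.CancelledKeyException\n";
        System.out.println(classify(sample));
    }
}
```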

Andy Pham added a comment - 2014-08-06 13:49

Kevin, I haven't had a chance to catch the issue when a job is running yet, and things still seem to be working since my last Jenkins restart. I'll keep an eye out for it. It could be a different problem, and once I'm able to confirm that I'll log a different defect.

Philip Cheong added a comment - 2014-10-30 14:39 - edited

I have hit this issue multiple times in the last few days using the swarm plugin on Jenkins ver 1.586.

The jenkins master is a brand new server running RHEL 6.5. The slave is also RHEL 6.5. No jobs have previously run on it. The only thing I'm testing is the connection of new slaves. This is a stochastic issue, however: sometimes it works fine, and sometimes it results in this error (I guess maybe 30% of the time).

JNLP agent connected from /172.31.8.131
<===[JENKINS REMOTING CAPACITY]===>Failed to establish the connection with the slave dev-master.phil
java.io.IOException: NioChannelHub is not currently running
at org.jenkinsci.remoting.nio.NioChannelHub$1.makeTransport(NioChannelHub.java:479)
at hudson.remoting.ChannelBuilder.negotiate(ChannelBuilder.java:220)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:149)
at hudson.remoting.ChannelBuilder.build(ChannelBuilder.java:159)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:36)
at org.jenkinsci.remoting.nio.NioChannelBuilder.build(NioChannelBuilder.java:52)
at jenkins.slaves.JnlpSlaveAgentProtocol$Handler.jnlpConnect(JnlpSlaveAgentProtocol.java:120)
at jenkins.slaves.DefaultJnlpSlaveReceiver.handle(DefaultJnlpSlaveReceiver.java:63)
at jenkins.slaves.JnlpSlaveAgentProtocol2$Handler2.run(JnlpSlaveAgentProtocol2.java:57)
at jenkins.slaves.JnlpSlaveAgentProtocol2.handle(JnlpSlaveAgentProtocol2.java:31)
at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:156)

Patricia Wright added a comment - 2014-10-30 15:41 - edited

I've seen this happen even after recent fixes.

Jobs were running on the slaves at the time.
One thing I noticed: Linux slaves didn't disconnect, only Windows slaves.

This was too big of a problem, I went back to LTS.

Philip Cheong added a comment - 2014-10-31 14:31

I downgraded to the LTS Jenkins ver. 1.580.1 today and just hit the same problem.

Guillaume Agile added a comment - 2014-12-16 14:50

Same behavior between slaves and master, both running on Windows 7sp1 / 2008R2 with Jenkins 1.592

Serge Dorna added a comment - 2015-01-07 21:41 - edited

I'm seeing the same, master is 2008R2, slave is 7sp1
Jenkins ver. 1.580.2

Patricia Wright added a comment - 2015-01-07 23:09

Even after running on LTS I still see this every week.

bcygan added a comment - 2015-01-08 15:09

Server: 1.590 with swarm client plugin 1.15

Client: described problem occurs with swarm-client-1.20-jar-with-dependencies.jar, but not with swarm-client-1.15-jar-with-dependencies.jar

Shannon Kerr added a comment - 2015-01-26 17:55

Same issue. Jenkins 1.574. Server Host is Ubuntu 12.04. Slave is Windows 7 x64 VM.

Marcus Jacobsson added a comment - 2015-02-25 14:52

Have the same problem with Jenkins LTS 1.580.3. In our case the nodes go offline a few hours after restarting the master server, and it's not all nodes, just a few each time (different nodes each time).

The server is running on Ubuntu 14.04 and the slaves are running Windows 7 x64

Connection was broken
java.io.EOFException
at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:616)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:111)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Kohsuke Kawaguchi added a comment - 2015-04-15 23:03

For exceptions that say "NioChannelHub is not currently running", we are expecting a nested exception. Please attach the full stack trace including all the "Caused by ..." sections, not just the top-most part of it.
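
As background on what a nested exception looks like: in Java, each "Caused by ..." section of a stack trace corresponds to one step along `Throwable.getCause()`. The sketch below walks that chain. `CauseChain` is a made-up name and the inner exception in `main` is invented purely for illustration; the real root cause is exactly what is being asked for here.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class CauseChain {
    // Collects "class: message" for the throwable and every nested cause,
    // outermost first. Each entry after the first is a "Caused by" section.
    static List<String> describe(Throwable t) {
        List<String> chain = new ArrayList<>();
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            chain.add(cur.getClass().getName() + ": " + cur.getMessage());
        }
        return chain;
    }

    public static void main(String[] args) {
        // Invented nesting, for illustration only.
        IOException outer = new IOException("NioChannelHub is not currently running",
                new IllegalStateException("hypothetical root cause"));
        describe(outer).forEach(System.out::println);
        outer.printStackTrace(); // the standard trace includes the "Caused by: ..." part
    }
}
```

A trace that stops at the first block of `at ...` frames has lost everything after the first entry of this list.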

SCM/JIRA link daemon added a comment - 2015-04-16 14:38

Code changed in jenkins
User: Kohsuke Kawaguchi
Path:
src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java
http://jenkins-ci.org/commit/remoting/281ee8e02c0d81d46ed612b5dc8c4e41db940d0b
Log:
Merge pull request #38 from jenkinsci/JENKINS-22932

JENKINS-22932

Compare: https://github.com/jenkinsci/remoting/compare/ba844a624235...281ee8e02c0d

Hang Dong added a comment - 2015-09-11 19:13

Seeing this on a Windows master with 1.620. When adding a new node, we typically connect via the JNLP link, then install as a service. We hit the issue on the service client re-connect. Perhaps this helps: because the master is secured with HTTPS, the first service connect won't have valid cert info (and we suspect this triggers the issue on the master side). We update the xml with certificate info, then stop/restart the service, but at this stage the master is already in a bad state: not only can the new slave not reconnect, the master actually loses connection to all other slaves as well. Our workaround so far is restarting the master...

10:17:07 java.io.IOException: remote file operation failed: C:\JSBuilds\workspace****************** at hudson.remoting.Channel@1530a3e:********: hudson.remoting.ChannelClosedException: channel is already closed
10:17:07 at hudson.FilePath.act(FilePath.java:987)
10:17:07 at hudson.FilePath.act(FilePath.java:969)
10:17:07 at hudson.FilePath.mkdirs(FilePath.java:1152)
10:17:07 at hudson.model.AbstractProject.checkout(AbstractProject.java:1275)
10:17:07 at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:610)
10:17:07 at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
10:17:07 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:532)
10:17:07 at hudson.model.Run.execute(Run.java:1741)
10:17:07 at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
10:17:07 at hudson.model.ResourceController.execute(ResourceController.java:98)
10:17:07 at hudson.model.Executor.run(Executor.java:381)
10:17:07 Caused by: hudson.remoting.ChannelClosedException: channel is already closed
10:17:07 at hudson.remoting.Channel.send(Channel.java:550)
10:17:07 at hudson.remoting.Request.call(Request.java:129)
10:17:07 at hudson.remoting.Channel.call(Channel.java:752)
10:17:07 at hudson.FilePath.act(FilePath.java:980)
10:17:07 ... 10 more
10:17:07 Caused by: java.io.IOException
10:17:07 at hudson.remoting.Channel.close(Channel.java:1110)
10:17:07 at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:118)
10:17:07 at hudson.remoting.PingThread.ping(PingThread.java:126)
10:17:07 at hudson.remoting.PingThread.run(PingThread.java:85)
10:17:07 Caused by: java.util.concurrent.TimeoutException: Ping started at 1441990735275 hasn't completed by 1441990975286
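
The last line of this trace carries two large numbers. Assuming they are epoch milliseconds (an assumption; the log does not say so), they are 240011 ms apart, i.e. about four minutes between the ping starting and the deadline. A small worked check, with `PingTimeoutMath` as a made-up name:

```java
import java.time.Duration;

class PingTimeoutMath {
    // Time between the ping start and the deadline, both in milliseconds.
    static Duration elapsed(long startedMillis, long deadlineMillis) {
        return Duration.ofMillis(deadlineMillis - startedMillis);
    }

    public static void main(String[] args) {
        // Values copied from the TimeoutException message above.
        Duration d = elapsed(1441990735275L, 1441990975286L);
        System.out.println(d.toMillis() + " ms, about " + d.toMinutes() + " minutes");
    }
}
```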

Shesh Patel added a comment - 2015-09-14 18:28

Encountered this issue after upgrading Jenkins to version 1.622. I am getting the following error while connecting to a Windows slave. I am using the "launch slave agents via Java Web Start" option to launch the slave. It used to work fine in the previous version, 1.597. It seems to have been re-introduced; please follow up with a suggested fix.

java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@7029f3e3[name=windows_02]
	at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:208)
	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:628)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
	at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:561)

Brian L added a comment - 2015-11-25 21:11 - edited

This is affecting me as well.

Master: Jenkins ver. 1.638, Ubuntu 14.04.3 LTS, running JRE 1.8.0_65-b17
Slave: Windows Server 2008, connected via JNLP :

    Microsoft Windows [Version 6.1.7601]
    Copyright (c) 2009 Microsoft Corporation.  All rights reserved.
    
    C:\Users\Administrator>java -version
    java version "1.8.0_31"
    Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
    Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)

Do we have a workaround? I wonder if adding some Job configuration to programmatically kill the process running java ... -jar "...\slave.jar" might work?

Brian L added a comment - 2015-11-25 22:32

I didn't have much luck with an actual patch, but in the meantime, here's the workaround I'm attempting to implement:

1. Install the Groovy plugin
2. Use this code as its own job:

import jenkins.model.*

println "The system is now going down for restart."
println "Once the bug 'https://issues.jenkins-ci.org/browse/JENKINS-22932' is resolved, this job should be removed."
  
Jenkins.instance.doSafeRestart(null);

3. Have the job triggered after any of your Windows slaves finish doing work

Oleg Nenashev added a comment - 2018-03-14 02:32

Unfortunately I have no capacity to work on Remoting in the medium term, so I will unassign it and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.
