JENKINS-23852

slave builds repeatedly broken by "hudson.remoting.ChannelClosedException: channel is already closed"

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: remoting
    • Master: Jenkins 1.572 on OS X 10.7.5, Java: 1.6.0_65-b14-462-11M4609
      Slave: Slave version 2.43 (as reported by the versioncolumn plugin) on Windows 2008 R2 Standard (64bit), Java 7 update 55 (32 bit)

      I have a similar issue to JENKINS-14332, although that says there is no problem on a windows slave (because it eventually reconnects?), so I'm logging this separately.

      The message "hudson.remoting.ChannelClosedException: channel is already closed" breaks my builds on a windows slave. It doesn't happen on every build, but a nightly run of 7 builds seems to have at least one each time.

      Because these seem to be connection issues, and so might be timeout related, here's some setup info, even though I doubt it will be relevant: all those builds are matrix builds. My setup consists of an OSX master YYY that does the build for the 'mac' axis, and a slave XXX that does the build of the 'windows' axis. The windows machine is significantly slower than the mac, so some of the builds spend quite some time in a state of '1 axis done, other axis queued'. I don't really see issues here; it seems to work fine for most of the builds most of the time: e.g. these 7 builds are all queued at 20:00, and the example below fails shortly after 21:00, but other builds finish fine later on (so after a longer wait). The failing build on XXX started at 20:50, so it fails after 17 minutes. Other, later and earlier builds run for an hour without problems.

      In the middle of a build, the build log (as seen on the master) suddenly says:

      ==========================================
      21:07:10 FATAL: channel is already closed
      21:07:10 hudson.remoting.ChannelClosedException: channel is already closed
      21:07:10 at hudson.remoting.Channel.send(Channel.java:541)
      21:07:10 at hudson.remoting.Request.call(Request.java:129)
      21:07:10 at hudson.remoting.Channel.call(Channel.java:739)
      21:07:10 at hudson.EnvVars.getRemote(EnvVars.java:404)
      21:07:10 at hudson.model.Computer.getEnvironment(Computer.java:912)
      21:07:10 at jenkins.model.CoreEnvironmentContributor.buildEnvironmentFor(CoreEnvironmentContributor.java:29)
      21:07:10 at hudson.model.Run.getEnvironment(Run.java:2250)
      21:07:10 at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:907)
      21:07:10 at org.jenkinsci.plugins.tokenmacro.impl.EnvironmentVariableMacro.evaluate(EnvironmentVariableMacro.java:23)
      21:07:10 at org.jenkinsci.plugins.tokenmacro.DataBoundTokenMacro.evaluate(DataBoundTokenMacro.java:189)
      21:07:10 at org.jenkinsci.plugins.tokenmacro.TokenMacro.expand(TokenMacro.java:182)
      21:07:10 at org.jenkinsci.plugins.tokenmacro.TokenMacro.expand(TokenMacro.java:154)
      21:07:10 at org.jenkinsci.plugins.buildnamesetter.BuildNameSetter.setDisplayName(BuildNameSetter.java:50)
      21:07:10 at org.jenkinsci.plugins.buildnamesetter.BuildNameSetter.access$000(BuildNameSetter.java:26)
      21:07:10 at org.jenkinsci.plugins.buildnamesetter.BuildNameSetter$1.tearDown(BuildNameSetter.java:42)
      21:07:10 at hudson.model.Build$BuildExecution.doRun(Build.java:171)
      21:07:10 at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:535)
      21:07:10 at hudson.model.Run.execute(Run.java:1732)
      21:07:10 at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
      21:07:10 at hudson.model.ResourceController.execute(ResourceController.java:88)
      21:07:10 at hudson.model.Executor.run(Executor.java:234)
      21:07:10 Caused by: java.io.IOException: Failed to abort
      21:07:10 at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:195)
      21:07:10 at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:581)
      21:07:10 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      21:07:10 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
      21:07:10 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      21:07:10 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      21:07:10 at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      21:07:10 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      21:07:10 at java.lang.Thread.run(Thread.java:695)
      21:07:10 Caused by: java.io.IOException: Operation timed out
      21:07:10 at sun.nio.ch.FileDispatcher.read0(Native Method)
      21:07:10 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
      21:07:10 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
      21:07:10 at sun.nio.ch.IOUtil.read(IOUtil.java:171)
      21:07:10 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
      21:07:10 at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
      21:07:10 at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
      21:07:10 at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
      21:07:10 ... 7 more
      ==========================================

      This coincides with the following in the slave error log:

      ==========================================
      Jul 16, 2014 9:01:57 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
      SEVERE: I/O error in channel channel
      java.net.SocketException: Software caused connection abort: recv failed
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.read(Unknown Source)
      at java.net.SocketInputStream.read(Unknown Source)
      at java.io.BufferedInputStream.fill(Unknown Source)
      at java.io.BufferedInputStream.read(Unknown Source)
      at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
      at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
      at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
      at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
      at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

      Jul 16, 2014 9:01:57 PM hudson.util.ProcessTree getKillers
      WARNING: Failed to obtain killers
      hudson.remoting.ChannelClosedException: channel is already closed
      at hudson.remoting.Channel.send(Channel.java:541)
      at hudson.remoting.Request.call(Request.java:129)
      at hudson.remoting.Channel.call(Channel.java:739)
      at hudson.util.ProcessTree.getKillers(ProcessTree.java:162)
      at hudson.util.ProcessTree$OSProcess.killByKiller(ProcessTree.java:221)
      at hudson.util.ProcessTree$Windows$1.killRecursively(ProcessTree.java:413)
      at hudson.util.ProcessTree.killAll(ProcessTree.java:149)
      at hudson.Proc$LocalProc.destroy(Proc.java:379)
      at hudson.Proc$LocalProc.join(Proc.java:352)
      at hudson.Launcher$RemoteLaunchCallable$1.join(Launcher.java:1116)
      at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
      at java.lang.reflect.Method.invoke(Unknown Source)
      at hudson.remoting.RemoteInvocationHandler$RPCRequest.perform(RemoteInvocationHandler.java:309)
      at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:290)
      at hudson.remoting.RemoteInvocationHandler$RPCRequest.call(RemoteInvocationHandler.java:249)
      at hudson.remoting.UserRequest.perform(UserRequest.java:118)
      at hudson.remoting.UserRequest.perform(UserRequest.java:48)
      at hudson.remoting.Request$2.run(Request.java:328)
      at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
      at java.util.concurrent.FutureTask.run(Unknown Source)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      at hudson.remoting.Engine$1$1.run(Engine.java:63)
      at java.lang.Thread.run(Unknown Source)
      Caused by: java.net.SocketException: Software caused connection abort: recv failed
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.read(Unknown Source)
      at java.net.SocketInputStream.read(Unknown Source)
      at java.io.BufferedInputStream.fill(Unknown Source)
      at java.io.BufferedInputStream.read(Unknown Source)
      at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
      at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:67)
      at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:93)
      at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:33)
      at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
      at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

      Jul 16, 2014 9:01:57 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Terminated
      Jul 16, 2014 9:01:57 PM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onDisconnect
      INFO: Restarting slave via jenkins.slaves.restarter.WinswSlaveRestarter@18f3cc5
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: XXX
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among http://YYY:8080/
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to YYY:62768
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Jul 16, 2014 9:02:05 PM hudson.remoting.jnlp.Main$CuiListener error
      SEVERE: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
      java.lang.Exception: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
      at hudson.remoting.Engine.onConnectionRejected(Engine.java:304)
      at hudson.remoting.Engine.run(Engine.java:276)

      [repeated reconnection attempts, +- 2 times/sec]

      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: XXX
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among http://YYY:8080/
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to YYY:62768
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener error
      SEVERE: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
      java.lang.Exception: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
      at hudson.remoting.Engine.onConnectionRejected(Engine.java:304)
      at hudson.remoting.Engine.run(Engine.java:276)

      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: XXX
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among http://YYY:8080/
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to YYY:62768
      Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Jul 16, 2014 9:07:07 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connected
      ==========================================

      and the wrapper log says:

      ==========================================
      2014-07-16 21:01:59 - Stopping jenkinsslave-e__JenkinsHome
      2014-07-16 21:01:59 - ProcessKill 3664
      2014-07-16 21:01:59 - Send SIGINT 3664
      2014-07-16 21:01:59 - SIGINT to3664 successful
      2014-07-16 21:01:59 - Finished jenkinsslave-e__JenkinsHome
      2014-07-16 21:02:01 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:02:01 - Started 3264
      2014-07-16 21:02:06 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:02:06 - Started 1364

      [many more of these reconnection attempts]

      2014-07-16 21:07:04 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:07:04 - Started 3448
      2014-07-16 21:07:05 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:07:05 - Started 1384
      2014-07-16 21:07:06 - Starting C:\Program Files (x86)\Java\jre7\bin\java.exe -Xrs -jar "E:\JenkinsHome\slave.jar" -jnlpUrl http://YYY:8080/computer/XXX/slave-agent.jnlp -secret 69619b5aede9e043c5697ef11f4a51e679140506c282aefdd7f728b8a7ac7d2f
      2014-07-16 21:07:06 - Started 3140
      [nothing after that, so this is the reconnection attempt that worked, I think.]
      ==========================================

      Finally in the server log:

      ==========================================
      TCP slave agent connection handler #1153 with /IPofXXX:52399 is aborted: Unrecognized name: XXX

      Jul 16, 2014 9:07:10 PM INFO hudson.TcpSlaveAgentListener$ConnectionHandler run

      Accepted connection #1154 from /IPofXXX:52406

      Jul 16, 2014 9:07:10 PM WARNING jenkins.slaves.JnlpSlaveHandshake error

      TCP slave agent connection handler #1154 with /IPofXXX:52406 is aborted: XXX is already connected to this master. Rejecting this connection.

      Jul 16, 2014 9:07:10 PM WARNING jenkins.slaves.JnlpSlaveHandshake error

      TCP slave agent connection handler #1154 with /IPofXXX:52406 is aborted: Unrecognized name: XXX

      Jul 16, 2014 9:07:10 PM WARNING org.jenkinsci.remoting.nio.NioChannelHub run

      Communication problem
      java.io.IOException: Operation timed out
      at sun.nio.ch.FileDispatcher.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
      at sun.nio.ch.IOUtil.read(IOUtil.java:171)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
      at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
      at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      at java.lang.Thread.run(Thread.java:695)

      Jul 16, 2014 9:07:11 PM WARNING jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed

      NioChannelHub keys=1 gen=1952690: Computer.threadPoolForRemoting 1 for + XXX terminated
      java.io.IOException: Failed to abort
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:195)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:581)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
      at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      at java.lang.Thread.run(Thread.java:695)
      Caused by: java.io.IOException: Operation timed out
      at sun.nio.ch.FileDispatcher.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
      at sun.nio.ch.IOUtil.read(IOUtil.java:171)
      at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
      at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
      at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
      ... 7 more
      ==========================================

      Note that the clocks on both machines are within a few seconds of each other, so it seems that the order of events is:

      • slave detects a failed recv, which closes the channel
      • next communication attempt detects the closed channel
      • slave tries to reconnect, +- 2 times/sec
      • server refuses this because it thinks it is still connected
      • after +-5 minutes, the server pings the slave (some sort of keepalive?), which finds out the connection is already dead
      • at that point, the server accepts the next connection attempt from the slave

      If the cause of the first failed recv can't be found or prevented, it seems the communication between client and server needs some sort of 'reconnection' mechanism (so the client can reconnect right away, instead of having to abort the build and wait until the server times out the connection). Then, the client could reconnect and retry what it was doing, instead of aborting the build on a failed connection?

          [JENKINS-23852] slave builds repeatedly broken by "hudson.remoting.ChannelClosedException: channel is already closed"

          Arnt Witteveen added a comment -

          JENKINS-19513 may also be related.

          Kohsuke Kawaguchi added a comment -

          Thanks for the detailed log analysis. I agree with your order of events list, although there are some details that are still suspicious. For example, if the client has decided to shut down the connection and that properly arrived at the server, we should see something other than "java.io.IOException: Operation timed out" in the read. And we don't set any read timeout, so I'm puzzled as to why we see this exception during the read.

          In any case, according to that theory, the first event that caused this is the following:

          Jul 16, 2014 9:01:57 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
          SEVERE: I/O error in channel channel
          java.net.SocketException: Software caused connection abort: recv failed
          at java.net.SocketInputStream.socketRead0(Native Method)
          

          This appears to be an error unique to Windows, and I found this stackoverflow conversation helpful to understand what it means, as well as this KB article.

          If I understand this error correctly, it means that Jenkins slave has given WinSock some data to send, and WinSock has sent some packets over the network, but it's not getting the data through to the other side, and so it has given up and declared the connection lost. That happens much later after the write is called, so the next read that came in gets hit by this problem.

          The links I mentioned above do not really discuss what that "data transmission time-out" is about. It could be anything from WinSock not getting a TCP ACK from the other side, to TCP flow control, to a self-imposed timeout unique to WinSock.

          I'd normally have suspected a network issue, but the successive reconnection attempts seem to exclude this possibility.

          Another possibility could be that this is related to issues like JENKINS-24050, where an unrelated problem kills the NIO selector thread. When that happens, all the sockets stop getting serviced, so data that the client sent will not be picked up by the application on the server side. TCP ACKs should still come in this case, since the kernel is getting the packets all right, but perhaps in this situation WinSock might decide to time out anyway, due to a self-imposed timeout. This would need some experiments.

          I've just fixed JENKINS-24050, so if possible I'd love to have you try this fix, which should make it into 1.580. Another useful test is to disable the use of NIO on the server side for managing JNLP slaves with the system property '-Djenkins.slaves.NioChannelSelector.disabled=true' on the master. If the problem goes away with it, that'd be a useful input.

          For anyone else seeing problems, please check relevant logs on both the server and the slave and share it with us, so that we can resolve this problem efficiently.
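
          (For reference, how that property is passed depends on how the master is launched; the line below is only a minimal sketch assuming the master is started directly from the .war on the command line. A Windows-service or servlet-container installation would take the same -D flag in its own JVM arguments instead, and the port is just an example.)

          REM Hedged example: the -D option must come before -jar so it reaches the JVM, not the application.
          java -Djenkins.slaves.NioChannelSelector.disabled=true -jar jenkins.war --httpPort=8080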


          Arnt Witteveen added a comment -

          You are correct that this was a network issue: after some time I noticed this always happened around 9PM and mentioned this to a network admin. He said that something similar had happened once, and we let a ping run for a whole day. It seems that there were some network interruptions (of up to a minute IIRC) at the specific time this happened. The machine got switched to another port on the router (or another router), and the issue has not reoccurred since.

          Sorry for not mentioning this earlier, I forgot about having logged this.


          Richard Mortimer added a comment -

          I have been seeing similar problems to this. In my situation one of my slaves is a Windows 7 desktop computer that is set to suspend after a period of inactivity. I've had this running for a few years without problem and the slave has always reconnected successfully when the desktop is woken up.

          However since the NIO selector thread has been introduced the slaves have ended up being offline for prolonged periods of time and require a manual restart of the Jenkins slave service to bring them back online (see JENKINS-24272). One other symptom that I have observed during testing of this is the server rejecting connections similar to that reported here namely:

          Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener error
          SEVERE: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
          java.lang.Exception: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
          at hudson.remoting.Engine.onConnectionRejected(Engine.java:304)
          at hudson.remoting.Engine.run(Engine.java:276)
          

          Having investigated, I'm seeing that the slave has decided that the connection is broken and has torn down its TCP connection to the master, but the master never sees the TCP RST for this, so it keeps retransmitting TCP packets for 10 or so minutes until the TCP connection times out and then the master tears down the connection.

          The really strange thing is that this behaviour seems to be different between the NIO and threaded methods on the master. With threaded, the TCP RST is sent by W7 and kills the connection before the slave reconnects, but with NIO no RST is ever sent. That happens down at the network packet level so it shouldn't really be affected by NIO, but I wonder if a slightly different configuration of the network stack (non-blocking vs blocking) causes the different behaviour.

          I also suspect that this may be related to the change to re-start JNLP slaves using a new JVM rather than just restarting in the same process. It may be that the Windows TCP stack just immediately forgets the old TCP socket when the old JVM process exits and hence ignores retries rather than RSTing the connection.

          I'm planning on waiting for the current crop of JNLP slave fixes to get integrated and will then experiment a bit further to see if I can work out what is stopping the NIO selector getting notified that the slave has closed the TCP connection.


          Arnt Witteveen added a comment -

          Thinking about this some more, I still think it would be nice that a temporary network outage of half a minute or so:

          • does not abort a build (that may have been halfway through a build process of several hours). I wouldn't mind seeing, in the log on the server, a message saying [network was out, log may be missing a part here] in the middle of the log, but why stop the build completely right away? Even in cases where the slave needs the server, there's little point in aborting the build, because the slave is not going to get new work from the server if it can't connect to it anyway?
          • does not cause the client to be offline for that long. If the server receives a new connection attempt from a slave node, and there already is a connection for the same slave node, why not just disconnect the old connection and continue with the new one?


          Richard Mortimer added a comment -

          @Arnt Witteveen I agree it would be nice if it could withstand a network outage of a minute or so but there are a few technical challenges to overcome which make this non-trivial.

          In theory things should survive for a brief network outage and indeed they do on a linux-to-linux master/slave arrangement but windows seems to be much more aggressive in closing down TCP connections and hence the windows side of things has gone before the network connection has been restored.

          I believe that making the low level protocol cope with a break of the TCP connection is non-trivial.


          Daniel Beck added a comment -

          temporary network outage of half a minute

          What kind of real network has regular outages like that? I'd try to discuss an SLA with the network admins.

          A few workarounds to this issue:

          Jenkins Enterprise by Cloudbees (Jenkins + commercial plugins and support) offers a "long running builds plugin" that can survive disconnects, master restarts, etc:
          http://release-notes.cloudbees.com/product/CloudBees+Long-Running+Build+Plugin

          Something similar will be made available for free as part of the workflow plugin:
          https://github.com/jenkinsci/workflow-plugin


          Michael M added a comment -

          Just wanted to follow up. We see this issue a few times a week, and are confident that there isn't technically a network outage at all, as other nodes and jobs stay connected through this error. It most frequently happens for us at the end of a job, once, as far as I'm concerned, everything is complete. Can attach more logs of the error if that is helpful.

          Running Jenkins 1.583 on Windows 7.

          Thanks!


          Arnt Witteveen added a comment -

          Meanwhile, we have found another cause of these disconnects, which may help in reproducing it (if you happen to have this setup available): this happens when the slave is on a VMWare ESXi virtual machine, and you have multiple hosts (physical machines running VMWare ESX) available, and you have set up VMotion with DRS: at the moment the current slave VM host gets overloaded (e.g. because it starts a build!) the load is rebalanced over the hosts, and the slave VM may move to another host. This movement should be transparent but seems to cause this issue (I'm not sure why, it might be a timeout, it might be the fact that the physical network card that receives the traffic changes, or something else still).


          Daniel Serodio added a comment -

          Thanks legolas, this is exactly our setup. As a workaround, I've asked our ops team to disable VMotion for these VMs.


          Stephen Connolly added a comment -

          By any chance are you provisioning and de-provisioning other unrelated slaves? I ask as a slave disconnect, if incorrectly timed and unlucky enough, can cause NioChannelHub to livelock (JENKINS-25218) trying to process the channel termination for the disconnected slave until the ping thread kills that channel and breaks the livelock.

          Disabling the ping threads on the slave and the master should either provide an alternative workaround or confirm that this issue is JENKINS-25218
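
          (As a rough illustration of the ping-thread experiment suggested above: in this era of Jenkins the master-side ping interval was controlled by a ChannelPinger system property, and a value below 1 is believed to disable the ping. The exact property name and semantics are assumptions and should be verified against the remoting/core version actually deployed before relying on this; the launch command is likewise only an example.)

          REM Assumed property name and behaviour; verify against hudson.slaves.ChannelPinger for your version.
          java -Dhudson.slaves.ChannelPinger.pingInterval=0 -jar jenkins.war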


          Arnt Witteveen added a comment -

          At the time this first occurred, in my case we had only 1 slave, so that means that at least for me that was probably not the cause if I understand correctly?


          Stephen Connolly added a comment -

          legolas correct: if you have one and only one slave, and you are not using one of the cloud providers, and you are not using a retention strategy that disconnects when idle, then this issue is not JENKINS-25218.


          Ed Mangual added a comment -

          I have the same issues with frequent disconnects with Jenkins Master/Slaves VMs hosted on ESX with VMotion enabled. This is common in IT organizations nowadays. What are we doing to get a permanent solution to this issue?

          dserodio Can you tell us if disabling VMotion helped?

          In the meantime I will have my ops team disable vmotion as well.


          Daniel Serodio added a comment -

          emangual unfortunately, I found out that VMotion was already disabled. I did find out that our corporate firewall closes connections if the connection is idle, but from what I understand of Jenkins remoting I think the "ping thread" should prevent the connection from idling. Still investigating.


          Jay Kah added a comment -

          We are having the same issue on Jenkins ver. 1.634 on Ubuntu Trusty.


          Florian Doersch added a comment -

          We're having the same issue with 1.638 and "real" (not virtualized) hosts.


          Moshe Zvi added a comment -

          +1 Florian


          Dario Griffo added a comment -

          Same issue for me. Below is some log from the slave.
          I can get small builds to finish fine (less than 2 minutes), but we are running automation tests in this box (fixed job) that take ~30 minutes.
          This started last week, without any changes in the boxes, no updates, no infrastructure changes.
          The build has been working fine for 2 months.
          Master: Windows 2012 64-bit R2 Datacenter
          Slave: Windows 7 64-bit Enterprise

          Slave log (all look the same)
          INFO: Trying protocol: JNLP2-connect
          Jan 18, 2016 12:49:34 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connected
          Jan 18, 2016 12:59:54 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
          SEVERE: I/O error in channel channel
          java.net.SocketException: Software caused connection abort: recv failed
          at java.net.SocketInputStream.socketRead0(Native Method)
          at java.net.SocketInputStream.socketRead(Unknown Source)
          at java.net.SocketInputStream.read(Unknown Source)
          at java.net.SocketInputStream.read(Unknown Source)
          at java.io.BufferedInputStream.fill(Unknown Source)
          at java.io.BufferedInputStream.read(Unknown Source)
          at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:82)
          at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
          at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
          at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
          at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

          Jan 18, 2016 12:59:54 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Terminated
          Jan 18, 2016 1:00:04 PM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onReconnect
          INFO: Restarting slave via jenkins.slaves.restarter.WinswSlaveRestarter@c2400d
          Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main createEngine
          INFO: Setting up slave: vostro-470-002
          Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener <init>
          INFO: Jenkins agent is running in headless mode.
          Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Locating server among http://168.0.0.228:7542/
          Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Handshaking
          Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connecting to ci.teamhours.com:50081
          Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Trying protocol: JNLP2-connect
          Jan 18, 2016 1:00:06 PM hudson.remoting.jnlp.Main$CuiListener status
          INFO: Connected


          Dario Griffo added a comment -

          flow86 legolas
          I have been running again builds for 4 hours without any disconnection.
          My "fix": stop using the slave windows service and run this as a scheduled task running a bat file.
          First I downloaded the slave.jar and slave-agent.jnlp into an empty folder (C:\Jenkins).
          Created the following .bat in that folder

          cd C:\Jenkins
          java -jar slave.jar -jnlpUrl http://MY_JENKINS_URL/computer/MY_SLAVE_NAME/slave-agent.jnlp -secret MYSECRET

          That last line is the one shown on the agent page.
          Then create a scheduled task to run at startup with the same credentials as the uninstalled Windows slave service.
          So far, no disconnections, when I used to have them every 4~5 minutes.
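
          (For anyone wanting to script that last step: a rough sketch of creating the startup task from an elevated command prompt. The task name, account and .bat path are placeholders for whatever was used above, and /RP * makes schtasks prompt for the password instead of putting it on the command line.)

          REM Run the slave-launch .bat at system startup under the account the old service ran as.
          schtasks /Create /TN "JenkinsSlave" /TR "C:\Jenkins\start-slave.bat" /SC ONSTART /RU MYDOMAIN\jenkinsuser /RP *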


          James Femia added a comment - edited

          We are having the same issue with slaves running on Win10 in VMWare, and we are already running slave.jar as a scheduled task on login, so the suggested workaround does not help us.


          Kevin R. added a comment -

          bump


          Thorsten Meinl added a comment -

          Since we switched from freestyle jobs to pipelines (a very good decision if you ask me), about every second job fails with this error, which makes Windows slaves pretty much unusable. It has never been a problem before, therefore my guess is that it has something to do with pipelines. Here's a stack trace, which always looks the same:

          java.io.IOException: remote file operation failed: C:\Users\jenkins\slave\workspace\g.knime.product.full_master-6YACVMZPMCFFZKR6NNQZEQAJGOVHWVVS2MSXJGRRFPLNAY2WNRWQ at hudson.remoting.Channel@2eed4a92:Channel to /xxx.xxx.xxx.xxx hudson.remoting.ChannelClosedException: channel is already closed
          	at hudson.FilePath.act(FilePath.java:992)
          	at hudson.FilePath.act(FilePath.java:974)
          	at hudson.FilePath.mkdirs(FilePath.java:1157)
          	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:77)
          	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:65)
          	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:49)
          	at hudson.security.ACL.impersonate(ACL.java:221)
          	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:46)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:745)
          Caused by: hudson.remoting.ChannelClosedException: channel is already closed
          	at hudson.remoting.Channel.send(Channel.java:604)
          	at hudson.remoting.Request.call(Request.java:130)
          	at hudson.remoting.Channel.call(Channel.java:821)
          	at hudson.FilePath.act(FilePath.java:985)
          	... 12 more
          Caused by: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed
          	at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:617)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
          	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          	... 5 more
          Caused by: org.jenkinsci.remoting.nio.FifoBuffer$CloseCause: Buffer close has been requested
          	at org.jenkinsci.remoting.nio.FifoBuffer.close(FifoBuffer.java:426)
          	at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:332)
          	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:565)
          	... 6 more

          I'm happy to debug this further if someone tells me what to do.


          Daniel Sobral added a comment -

          My maven builds were plagued by this, so it's not pipeline-related. I don't see it much with pipelines, though I wrote my pipelines with retry from the start, because I expected that very problem to persist.


          Thorsten Meinl added a comment -

          I have a vague feeling that this problem occurs when there is more than one job running on the slave. If I have time, I will try to restrict the Windows slaves to one slot and see if this makes the problem go away. Maybe some of you can try this, too.


          Mihai Stoichitescu added a comment -

          sithmein we encounter this error on machines that only have 1 executor configured.

          Maybe this happens if there are flyweight executors running on the node?


          Stuart Smith added a comment -

          We have frequent disconnects on two of our slave machines; both are in the same geographic location, and no other machines in any other location suffer from this, so we suspect it is maybe something firewall-related in that location. The infra guys initially suggested that the firewall is killing idle connections, and have increased the timeout, but we are still seeing frequent disconnections. Is the JNLP connection really idle between slave and master if no builds are running? Our call stack is as follows (Jenkins Master 2.46.1):

          JNLP agent connected from xx
          Slave.jar version: 3.4.1
          This is a Windows agent
          Agent successfully connected and online
          ERROR: Connection terminated
          java.nio.channels.ClosedChannelException
          at org.jenkinsci.remoting.protocol.NetworkLayer.onRecvClosed(NetworkLayer.java:154)
          at org.jenkinsci.remoting.protocol.impl.NIONetworkLayer.ready(NIONetworkLayer.java:179)
          at org.jenkinsci.remoting.protocol.IOHub$OnReady.run(IOHub.java:721)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          at java.lang.Thread.run(Unknown Source)


          Stephen Connolly added a comment -

          If no builds are running then the connection can be idle, though given you are running JNLP4, the TLS transport layer should be trickling some data periodically... OTOH perhaps that periodic trickle is actually detecting a network issue.


          Oleg Nenashev added a comment -

          stuartjsmith "RecvClosed()" means that the Tcp session is interrupted by the operating system or by any router between Jenkins master and agent. My first recommendation would be to increase Tcp retransmission timeout in the operating system. If it does not help, then you need to identify which node in your network interrupts the session. It can be done by analyzing Tcp dumps 


          Doesn't the "ping thread" prevent the connection from becoming idle?


          Heiko Nardmann added a comment -

          I also get this problem. Just one slave is affected (Windows 10) and this slave is running two agents based on separate folders. One agent is intended to provide a build slave, the other one provides a smoke test slave (which does not need high performance).


            Assignee: Unassigned
            Reporter: Arnt Witteveen (legolas)
            Votes: 41
            Watchers: 52