Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Won't Fix
None
Slave Windows Server 2012 R
Master Windows Server 2012 R
Jenkins version 1.611
Description
Our Windows slaves connect via JNLP and are launched by scheduled tasks (like this: https://wiki.jenkins-ci.org/display/JENKINS/Launch+Java+Web+Start+slave+agent+via+Windows+Scheduler).
When a job starts running on a slave, it suddenly fails because the connection was lost. The error says that something on the slave side killed the connection. A few minutes afterwards the slave is back online. I am sure it is not a network issue, as I kept pinging the network from system start-up until the problem occurred. I could not find what kills the connection, and disabling TCP Chimney offload (advice found via Google) did not help.
The exception is below.
I think the proper behaviour would be not to fail the job but to attempt to reconnect. Any hints on additional configuration that would help me resolve this issue are appreciated.
Building remotely on vmbam32.eur.ad.sag (R2 Server 2012 Enterprise Windows) in workspace c:\jenkins\workspace\optimize-install
FATAL: java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@b7add1[name=vmbam32.eur.ad.sag]
hudson.remoting.RequestAbortedException: java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@b7add1[name=vmbam32.eur.ad.sag]
at hudson.remoting.Request.abort(Request.java:296)
at hudson.remoting.Channel.terminate(Channel.java:815)
at hudson.remoting.Channel$2.terminate(Channel.java:492)
at hudson.remoting.AbstractByteArrayCommandTransport$1.terminate(AbstractByteArrayCommandTransport.java:72)
at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:208)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:628)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
at ......remote call to vmbam32.eur.ad.sag(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1361)
at hudson.remoting.Request.call(Request.java:171)
at hudson.remoting.Channel.call(Channel.java:752)
at hudson.FilePath.act(FilePath.java:980)
at hudson.FilePath.act(FilePath.java:969)
at hudson.FilePath.mkdirs(FilePath.java:1152)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1269)
at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:610)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:532)
at hudson.model.Run.execute(Run.java:1744)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:98)
at hudson.model.Executor.run(Executor.java:374)
Caused by: java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@b7add1[name=vmbam32.eur.ad.sag]
at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:208)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:628)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(Unknown Source)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
at sun.nio.ch.IOUtil.read(Unknown Source)
at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:561)
... 6 more
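For illustration only, the "attempt to reconnect instead of failing" behaviour requested in the description could look roughly like the generic retry helper below. This is a sketch, not how Jenkins remoting works: the class `RetryOnIOException` and its method `callWithRetry` are hypothetical names. It also assumes the failed operation is safe to repeat, which is not generally true for a remote call whose channel (and agent-side state) has been lost.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical helper: retries an operation that fails with IOException, backing off between attempts. */
final class RetryOnIOException {

    /**
     * Calls op up to maxAttempts times. Only IOException triggers a retry;
     * any other exception propagates immediately. The delay doubles after each failure.
     */
    public static <T> T callWithRetry(Callable<T> op, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay); // wait before the next attempt
                    delay *= 2;          // simple exponential backoff
                }
            }
        }
        throw last; // every attempt failed with IOException
    }
}
```

A caller would wrap only operations known to be idempotent, for example `RetryOnIOException.callWithRetry(() -> someIdempotentCall(), 5, 1000L)`, where `someIdempotentCall` stands in for whatever operation is being protected.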
We're also seeing much the same thing - the Jenkins server is saying that the (JNLP) slave dropped the connection.
We've been looking into this and we've found that (at least in our system) the problem is coming from the Slave's Windows OS itself. On the slave, we are seeing the following error logged:
java.io.IOException: An established connection was aborted by the software in your host machine.
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:55)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:235)
at sun.nio.ch.IOUtil.read(IOUtil.java:209)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:409)
at hudson.remoting.SocketChannelStream$1.read(SocketChannelStream.java:35)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:77)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:121)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:115)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:86)
at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)
Some searching later, we identified that "An established connection was aborted by the software in your host machine" is the Windows socket error WSAECONNABORTED (10053).
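For comparison, the master-side message in the original report ("An existing connection was forcibly closed by the remote host") is the text for a different Winsock error, WSAECONNRESET (10054). When triaging logs from both ends it can help to tell the two apart automatically. The sketch below is one way to do that, assuming English-locale Windows message text. The class `WinsockErrorClassifier` is a hypothetical helper, not part of Jenkins or the JDK. Java does not expose the numeric Winsock code on the exception, so this matches on the message string and walks the cause chain.

```java
import java.io.IOException;

/** Hypothetical helper: maps the English Windows socket messages seen in this issue to Winsock codes. */
final class WinsockErrorClassifier {

    public enum WinsockError {
        WSAECONNABORTED(10053),
        WSAECONNRESET(10054),
        UNKNOWN(-1);

        public final int code;

        WinsockError(int code) {
            this.code = code;
        }
    }

    // English-locale message fragments; localized Windows installs will differ.
    static final String ABORTED_TEXT =
            "An established connection was aborted by the software in your host machine";
    static final String RESET_TEXT =
            "An existing connection was forcibly closed by the remote host";

    /** Walks the cause chain (bounded depth) and classifies the first recognizable IOException. */
    public static WinsockError classify(Throwable t) {
        int depth = 0;
        for (Throwable cur = t; cur != null && depth < 32; cur = cur.getCause(), depth++) {
            String msg = cur.getMessage();
            if (!(cur instanceof IOException) || msg == null) {
                continue;
            }
            if (msg.contains(ABORTED_TEXT)) {
                return WinsockError.WSAECONNABORTED;
            }
            if (msg.contains(RESET_TEXT)) {
                return WinsockError.WSAECONNRESET;
            }
        }
        return WinsockError.UNKNOWN;
    }
}
```

Applied to the two traces in this issue, the slave-side exception would classify as WSAECONNABORTED and the innermost cause on the master side as WSAECONNRESET, which is consistent with the slave's own OS tearing down the connection and the master merely observing the reset.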
Much investigation later, we've found that various Windows services running on the Slave's OS are logging (in the Windows event log) that they're restarting at the exact same time, and (more interestingly!) we've also seen that the DHCP lease was renewed at the exact time that the slave reported the connection had died.
So I think that something, somewhere deep within Windows, is making Windows believe that it has lost the network layer. Problem is, I don't (yet) know what's doing it - all I see is a lot of symptoms of it doing it, not the root cause.
"ipconfig /release && ipconfig /renew" will cause this (even if you immediately get the same IP address back), as will unplugging/replugging your Cat5 cable or disconnecting/reconnecting your WiFi connection, or power-saving on your NIC, or reconfiguring your NIC and thus forcing a reload of the driver, or...
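To correlate disconnects with events like these, one option is to run a small watcher on the slave that timestamps changes to the network interfaces, and then compare its output with the slave log and the Windows event log. The sketch below is an assumption-laden example, not something we have deployed: `NicWatcher` is a hypothetical class, and the one-second polling interval is arbitrary. Because it polls, a release/renew that returns the same address between two polls may go unnoticed, so treat a quiet log as inconclusive.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper: logs when an interface appears, disappears, or changes its addresses. */
final class NicWatcher {

    /** Returns interface name -> addresses, for interfaces that are currently up. */
    public static Map<String, Set<String>> snapshot() throws SocketException {
        Map<String, Set<String>> result = new TreeMap<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return result; // no interfaces reported
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            if (!nic.isUp()) {
                continue;
            }
            Set<String> addrs = new TreeSet<>();
            for (InetAddress a : Collections.list(nic.getInetAddresses())) {
                addrs.add(a.getHostAddress());
            }
            result.put(nic.getName(), addrs);
        }
        return result;
    }

    /** Describes the differences between two snapshots. */
    public static List<String> diff(Map<String, Set<String>> before, Map<String, Set<String>> after) {
        List<String> changes = new ArrayList<>();
        for (String name : before.keySet()) {
            if (!after.containsKey(name)) {
                changes.add("DOWN " + name);
            }
        }
        for (Map.Entry<String, Set<String>> e : after.entrySet()) {
            Set<String> old = before.get(e.getKey());
            if (old == null) {
                changes.add("UP " + e.getKey() + " " + e.getValue());
            } else if (!old.equals(e.getValue())) {
                changes.add("CHANGED " + e.getKey() + " " + old + " -> " + e.getValue());
            }
        }
        return changes;
    }

    public static void main(String[] args) throws Exception {
        Map<String, Set<String>> previous = snapshot();
        while (true) {
            Thread.sleep(1000); // polling interval, chosen arbitrarily
            Map<String, Set<String>> current = snapshot();
            for (String change : diff(previous, current)) {
                System.out.println(Instant.now() + " " + change);
            }
            previous = current;
        }
    }
}
```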
We've yet to find the root cause in our setup, but investigations are ongoing.