JENKINS-28155

Job fails with [An existing connection was forcibly closed by the remote host]


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Won't Fix
    • Labels: None
    • Environment: Slave: Windows Server 2012 R2
      Master: Windows Server 2012 R2
      Jenkins version 1.611

    Description

      Our Windows slaves are connected via JNLP using scheduled tasks (as described at https://wiki.jenkins-ci.org/display/JENKINS/Launch+Java+Web+Start+slave+agent+via+Windows+Scheduler).
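
      (For context: the scheduled-task approach on that wiki page typically boils down to a small batch wrapper that is started at boot and runs the JNLP agent. The sketch below is illustrative only; the agent JAR location, master URL, node name and secret are placeholders, not values taken from this report.)

      rem start-jnlp-agent.bat - registered with the Windows scheduler, e.g.:
      rem   schtasks /Create /TN "Jenkins JNLP agent" /SC ONSTART /RU SYSTEM /TR "c:\jenkins\start-jnlp-agent.bat"
      cd /d c:\jenkins
      java -jar slave.jar -jnlpUrl https://<your-jenkins-master>/computer/<node-name>/slave-agent.jnlp -secret <node-secret>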

      When a job starts working with a slave, it suddenly fails because the connection was lost. The error suggests that something on the slave side killed the connection. A few minutes afterwards the slave is back on-line. I am sure it is not a network issue, as I kept pinging the network from system start-up until the problem occurred. I could not find what kills the connection, and disabling TCP chimney offload (advice found via Google) did not help.

      The exception is below.

      I think the proper behaviour would be not to fail the job but to attempt to reconnect. Any hints on additional configuration that would help me resolve this issue are appreciated.

      Building remotely on vmbam32.eur.ad.sag (R2 Server 2012 Enterprise Windows) in workspace c:\jenkins\workspace\optimize-install
      FATAL: java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@b7add1[name=vmbam32.eur.ad.sag]
      hudson.remoting.RequestAbortedException: java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@b7add1[name=vmbam32.eur.ad.sag]
      at hudson.remoting.Request.abort(Request.java:296)
      at hudson.remoting.Channel.terminate(Channel.java:815)
      at hudson.remoting.Channel$2.terminate(Channel.java:492)
      at hudson.remoting.AbstractByteArrayCommandTransport$1.terminate(AbstractByteArrayCommandTransport.java:72)
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:208)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:628)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
      at java.util.concurrent.FutureTask.run(Unknown Source)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      at java.lang.Thread.run(Unknown Source)
      at ......remote call to vmbam32.eur.ad.sag(Native Method)
      at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1361)
      at hudson.remoting.Request.call(Request.java:171)
      at hudson.remoting.Channel.call(Channel.java:752)
      at hudson.FilePath.act(FilePath.java:980)
      at hudson.FilePath.act(FilePath.java:969)
      at hudson.FilePath.mkdirs(FilePath.java:1152)
      at hudson.model.AbstractProject.checkout(AbstractProject.java:1269)
      at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:610)
      at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
      at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:532)
      at hudson.model.Run.execute(Run.java:1744)
      at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
      at hudson.model.ResourceController.execute(ResourceController.java:98)
      at hudson.model.Executor.run(Executor.java:374)
      Caused by: java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@b7add1[name=vmbam32.eur.ad.sag]
      at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:208)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:628)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
      at java.util.concurrent.FutureTask.run(Unknown Source)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      at java.lang.Thread.run(Unknown Source)
      Caused by: java.io.IOException: An existing connection was forcibly closed by the remote host
      at sun.nio.ch.SocketDispatcher.read0(Native Method)
      at sun.nio.ch.SocketDispatcher.read(Unknown Source)
      at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
      at sun.nio.ch.IOUtil.read(Unknown Source)
      at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
      at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
      at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
      at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:561)
      ... 6 more


        Activity

          pjdarton added a comment (edited):

          We're also seeing much the same thing - the Jenkins server is saying that the (JNLP) slave dropped the connection.

          We've been looking into this and we've found that (at least in our system) the problem is coming from the Slave's Windows OS itself.  On the slave, we are seeing the following error logged:

          java.io.IOException: An established connection was aborted by the software in your host machine.
              at sun.nio.ch.SocketDispatcher.read0(Native Method)
              at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:55)
              at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:235)
              at sun.nio.ch.IOUtil.read(IOUtil.java:209)
              at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:409)
              at hudson.remoting.SocketChannelStream$1.read(SocketChannelStream.java:35)
              at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:77)
              at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:121)
              at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:115)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
              at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:86)
              at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:72)
              at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
              at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
              at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
              at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:59)

          Some searching later, we identified that "An established connection was aborted by the software in your host machine" is the Windows socket error WSAECONNABORTED (10053).

          Much investigation later, we've found that various Windows services running on the Slave's OS are logging (in the Windows event log) that they're restarting at the exact same time, and (more interestingly!) we've also seen that the DHCP lease was renewed at the exact time that the slave reported the connection had died.
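
          (One way to check for that correlation is to query the slave's System event log around the time of a disconnect. The wevtutil commands below are a diagnostic sketch; the provider names are the usual ones on recent Windows versions but may differ on other releases.)

          rem Most recent System-log events from the DHCP client, newest first:
          wevtutil qe System /c:50 /rd:true /f:text /q:"*[System[Provider[@Name='Microsoft-Windows-Dhcp-Client']]]"

          rem Service restarts are recorded by the Service Control Manager:
          wevtutil qe System /c:50 /rd:true /f:text /q:"*[System[Provider[@Name='Service Control Manager']]]"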

          • Note: Windows handles TCP connections differently to other operating systems.  If Windows decides that the physical network layer has gone down (however briefly), it actively kills (with a WSACONNABORTED) all TCP connections that were being routed over that network, thus turning a transient outage (that normal TCP retransmissions would handle so that the user doesn't even see the problem) into an application-level outage (as the TCP connection closes, forcing the application to deal with it, usually by reporting that the connection has failed and it's "game over").  This is why a brief network outage that should cause no operational impact will result in a flurry of service restarts as they all try to handle the connection losses.  Windows' (mis?)handling of this scenario has been like this for so long that I doubt Microsoft would be willing to change it now.

          So I think that something, somewhere deep within Windows, is making Windows believe that it has lost the network layer.  Problem is, I don't (yet) know what's doing it - all I see is a lot of symptoms of it doing it, not the root cause.

          "ipconfig /release && ipconfig /renew" will cause this (even if you immediately get the same IP address back), as will unplugging/replugging your Cat5 cable or disconnecting/reconnecting your WiFi connection, or power-saving on your NIC, or reconfiguring your NIC and thus forcing a reload of the driver, or...

          We've yet to find the root cause in our setup, but investigations are ongoing.

          Oleg Nenashev (oleg_nenashev) added a comment:

          This still seems like a generic infrastructure-issue symptom, which may be caused by Remoting defects in some cases. The investigation from pjdarton is definitely correct, but I'd guess that most of the reporters/voters here are seeing a different issue. Maybe it should be reported as a separate ticket, since it may be possible to work around it.

          I do not anticipate having any time for Remoting maintenance anytime soon, so I am going to unassign this issue.

          pjdarton added a comment:

          FYI we got to the bottom of why our Windows slaves were disconnecting - it would appear that the Windows DHCP Client is incompatible with the Windows Time Service.

          Our slaves were VMs created within OpenStack, and what we were seeing was a failure to renew the DHCP lease correctly. When OpenStack detects that the guest OS has failed to renew the DHCP lease on time, it (briefly) drops the network link in order to prompt a lease renewal. However, this causes Windows to panic and kill all TCP connections (due to the way Windows mishandles network layers).
          It seems that the DHCP client is not calculating the renewal time in a manner that's independent of the system's idea of "real time", and so it all goes wrong when the date/time gets changed (by the Windows Time Service). That triggers OpenStack to bounce the physical layer, which Windows cascades into an application-layer network outage, killing the TCP connection that the slave relies on.

          We "fixed" this by forcing our slaves to:

          1. run "w32tm /resync" until the clock is synchronized,
          2. turn off the Windows Time Service entirely,
          3. run "ipconfig /release" followed by "ipconfig /renew" to refresh the DHCP lease, and then
          4. start the Jenkins JNLP slave process (a rough batch sketch of these steps follows this list).
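
          (Rendered as a start-up batch script, those four steps look roughly like the sketch below. The retry loop and exact command lines are illustrative; the master URL, node name and secret are placeholders.)

          rem 1. Keep resyncing until the clock is in sync.
          :resync
          w32tm /resync
          if errorlevel 1 (
              timeout /t 10
              goto resync
          )

          rem 2. Turn off the Windows Time Service entirely.
          net stop w32time
          sc config w32time start= disabled

          rem 3. Refresh the DHCP lease so the renewal time is calculated against the corrected clock.
          ipconfig /release
          ipconfig /renew

          rem 4. Start the Jenkins JNLP slave process.
          java -jar c:\jenkins\slave.jar -jnlpUrl https://<your-jenkins-master>/computer/<node-name>/slave-agent.jnlp -secret <node-secret>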

          This ensured that Windows would not update its clock while the slave's TCP connection was live, meaning that we weren't affected by the DHCP client's inability to keep the network alive after clock changes.
          Since doing that we've not had any further problems of this nature (and we're quite pleased with that!)

          Note: I've also seen Windows 10 report unpredictable (and incorrect) DHCP lease renewal times on other (non-OpenStack) machines - it lies.

          Dan Albu (franky4ro) added a comment:

          After I did a mass refactoring of my infrastructure, I no longer see this type of error, except when the network in the lab goes haywire.

          Mark Waite (markewaite) added a comment:

          Won't be fixed. See JENKINS-67604 for the details of the deprecation of agents started by WMI calls using DCOM.


          People

            Assignee: Unassigned
            Reporter: Vassilena Treneva (vassilena)
            Votes: 27
            Watchers: 31
