• Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Component: core

      This issue is related to JENKINS-6817.

      I am running Jenkins slaves inside virtual machines. Sometimes these machines are overloaded and I get the following exception:

      FATAL: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      	at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:41)
      	at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:34)
      	at hudson.remoting.Request.call(Request.java:174)
      	at hudson.remoting.Channel.call(Channel.java:713)
      	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:167)
      	at $Proxy38.join(Unknown Source)
      	at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:925)
      	at hudson.Launcher$ProcStarter.join(Launcher.java:360)
      	at hudson.tasks.Maven.perform(Maven.java:327)
      	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:804)
      	at hudson.model.Build$BuildExecution.build(Build.java:199)
      	at hudson.model.Build$BuildExecution.doRun(Build.java:160)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:586)
      	at hudson.model.Run.execute(Run.java:1593)
      	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      	at hudson.model.ResourceController.execute(ResourceController.java:88)
      	at hudson.model.Executor.run(Executor.java:247)
      Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      	at hudson.remoting.Request.abort(Request.java:299)
      	at hudson.remoting.Channel.terminate(Channel.java:773)
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
      Caused by: java.io.IOException: Unexpected termination of the channel
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
      Caused by: java.io.EOFException
      	at java.io.ObjectInputStream$BlockDataInputStream.peekByte(Unknown Source)
      	at java.io.ObjectInputStream.readObject0(Unknown Source)
      	at java.io.ObjectInputStream.readObject(Unknown Source)
      	at hudson.remoting.Command.readFrom(Command.java:92)
      	at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:72)
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      

      Is it possible to make the channel timeout configurable? I'd like to increase the value from, say 5 seconds, to 30 seconds or a minute.
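As a rough illustration of what "configurable" could mean here, the sketch below reads a timeout from a JVM system property and falls back to a default when the property is missing or invalid. This is not Jenkins or remoting code: the class name `ChannelTimeoutConfig` and the property name `example.channel.timeoutSeconds` are made up for this example, and the 5-second default only mirrors the guess in the request above.

```java
import java.time.Duration;

/** Hypothetical helper for illustration; not part of Jenkins or the remoting library. */
final class ChannelTimeoutConfig {
    /** Hypothetical property name, chosen only for this sketch. */
    static final String PROPERTY = "example.channel.timeoutSeconds";

    /** Placeholder default, mirroring the "say 5 seconds" guess in the request. */
    static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(5);

    private ChannelTimeoutConfig() {}

    /** Parses a raw value in seconds; falls back to the default if it is absent, non-numeric, or not positive. */
    static Duration parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_TIMEOUT;
        }
        try {
            long seconds = Long.parseLong(raw.trim());
            return seconds > 0 ? Duration.ofSeconds(seconds) : DEFAULT_TIMEOUT;
        } catch (NumberFormatException e) {
            return DEFAULT_TIMEOUT;
        }
    }

    /** Reads the timeout from the (hypothetical) system property. */
    static Duration fromSystemProperty() {
        return parse(System.getProperty(PROPERTY));
    }
}
```

With such a helper, starting the JVM with `-Dexample.channel.timeoutSeconds=30` would raise the timeout to 30 seconds without a code change.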

          [JENKINS-18781] Configurable channel timeout for slaves

          John McCullough added a comment -

          We also see this behavior periodically in our system. Unfortunately, we lose a lot of time and it is very disruptive to our release process when it occurs, since the Windows slave nodes that show this problem are used to execute long-running tests. It would not be such a problem if it were just a short-running module build job that could readily be retried. It seems like it would be really straightforward to add configurable values for this behavior, and it would increase the value we get from Jenkins quite a lot. Please consider addressing this issue.
          Thanks,
          John

          Rick Liu added a comment - - edited

          Ubuntu 14.04 server 64-bit
          oracle-java7: 1.7.0_80
          Jenkins: 1.651.1 LTS

          The build sometimes fails randomly with this kind of error.

          This time it happened in the post-build actions:
          FATAL: channel is already closed
          hudson.remoting.ChannelClosedException: channel is already closed
          at hudson.remoting.Channel.send(Channel.java:578)
          at hudson.remoting.Request.call(Request.java:130)
          at hudson.remoting.Channel.call(Channel.java:780)
          at hudson.Launcher$RemoteLauncher.kill(Launcher.java:953)
          at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:540)
          at hudson.model.Run.execute(Run.java:1738)
          at hudson.matrix.MatrixBuild.run(MatrixBuild.java:313)
          at hudson.model.ResourceController.execute(ResourceController.java:98)
          at hudson.model.Executor.run(Executor.java:410)
          Caused by: java.io.IOException: Unexpected termination of the channel
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
          Caused by: java.io.EOFException
          at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2325)
          at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
          at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
          at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
          at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
          at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)


          Oleg Nenashev added a comment -

          totoroliu this issue has been solved in remoting-2.62 (JENKINS-22853)
          There was also a fix of SocketTimeoutException in remoting-2.62 (JENKINS-22722), which makes remoting tolerant against SocketTimeout exceptions.

          So the remoting layer should be more stable now


          Stefan Möbius added a comment -

          oleg_nenashev: The reference to JENKINS-22853 seems to be unrelated. Did you type the wrong number by any chance?
          Also, JENKINS-22722 states it was fixed in remoting-2.60 (although we still have pretty bad problems with broken connections).

          All: Are you running Jenkins on VMs? We noticed that VMware moving VMs between hosts can cause brief packet loss, which can cause Jenkins to lose the connection.
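One generic way to ride out a brief packet loss like the one described above is to declare a connection dead only after several consecutive failed liveness checks, rather than after the first one. The sketch below shows that idea in isolation. It is an assumption-laden illustration, not how Jenkins remoting tracks liveness; the class `MissedPingTracker` and its threshold parameter are hypothetical.

```java
/** Hypothetical sketch for illustration; not taken from Jenkins remoting. */
final class MissedPingTracker {
    private final int maxConsecutiveMisses;
    private int consecutiveMisses;

    MissedPingTracker(int maxConsecutiveMisses) {
        if (maxConsecutiveMisses < 1) {
            throw new IllegalArgumentException("maxConsecutiveMisses must be at least 1");
        }
        this.maxConsecutiveMisses = maxConsecutiveMisses;
    }

    /**
     * Records the outcome of one liveness check.
     * Returns true once the number of consecutive failures reaches the threshold.
     * Any success resets the failure count.
     */
    boolean record(boolean pingSucceeded) {
        if (pingSucceeded) {
            consecutiveMisses = 0;
            return false;
        }
        consecutiveMisses++;
        return consecutiveMisses >= maxConsecutiveMisses;
    }
}
```

With a threshold of 3, one or two lost pings during a VM migration would not by themselves mark the connection as dead; only a sustained outage would.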

          Elliott Jones added a comment -

          We have slave disconnect issues and are running on VMware (both master and slave). From the recent available data, the 'Tasks & Events' history does NOT show a 'Migrate virtual machine' entry at the time of the disconnect (for either the master or the slave involved).

          We'll continue to monitor, though we've not had any disconnects since our upgrade to Jenkins 2.7.2 and we used to get 1 or 2 a week.


          Maciej Kusz added a comment -

          We've had a similar problem when our master was on VMware. After migrating to Microsoft Hyper-V, the problem was solved. I think this is a problem with the VMware configuration or its virtual network switch.


          Markus Niklasson added a comment - - edited

          Hi,

          We have also recently encountered disconnection issues. The slave is a Windows 7 (x64) PC with enough RAM and CPU to run heavy applications. The Jenkins master is a Red Hat Enterprise Linux 7 machine (3.10.0-327.18.2.el7.x86_64) running Jenkins 2.23, also with enough memory to run Jenkins. Both run Java 8 update 102. The slave is connected through JNLP. The network can be a bit unstable at times.

          The following intermittent error occurs very frequently during builds:

          Agent went offline during the build
          ERROR: Connection was broken: java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@69c08f2a[name=Buildserver]
          at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:208)
          at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:629)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
          at java.util.concurrent.FutureTask.run(Unknown Source)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          at java.lang.Thread.run(Unknown Source)
          Caused by: java.io.IOException: Connection timed out
          at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
          at sun.nio.ch.SocketDispatcher.read(Unknown Source)
          at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
          at sun.nio.ch.IOUtil.read(Unknown Source)
          at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
          at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:137)
          at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:310)
          at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:561)
          ... 6 more

          I have unticked "Response Time" under "Preventive Node Monitoring", and the slave has -Dhudson.slaves.ChannelPinger.pingInterval=1 set.

          Are any other workarounds available?


          Joe George added a comment - - edited

          According to the documentation (https://wiki.jenkins-ci.org/display/JENKINS/Ping+Thread), -Dhudson.slaves.ChannelPinger.pingInterval=1 should be set on the master. You should also try setting -Dhudson.remoting.Launcher.pingIntervalSec=-1 on the slave.

          I haven't experienced any issues since disabling pinging this way. The next step is to start testing different timeout values.

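The comment above uses a value of -1 to switch pinging off. A minimal sketch of that convention follows, assuming that any non-positive interval means "disabled". It is illustrative only and not the actual Jenkins or remoting implementation; the class `PingIntervalSetting` is hypothetical, and any default passed in by the caller is a placeholder rather than a documented Jenkins value.

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical sketch for illustration; not taken from Jenkins or remoting. */
final class PingIntervalSetting {
    private PingIntervalSetting() {}

    /** A positive value enables pinging at that interval; zero or a negative value disables it (assumption of this sketch). */
    static Optional<Duration> fromSeconds(long seconds) {
        return seconds > 0 ? Optional.of(Duration.ofSeconds(seconds)) : Optional.empty();
    }

    /** Reads the interval from a JVM system property, using defaultSeconds when the property is unset or not a number. */
    static Optional<Duration> fromProperty(String propertyName, long defaultSeconds) {
        return fromSeconds(Long.getLong(propertyName, defaultSeconds));
    }
}
```

A caller would schedule the ping task only when the returned `Optional` is present, so passing `-D<property>=-1` on the command line skips pinging entirely.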

          Markus Niklasson added a comment -

          Thanks for the tip!

          Disabling the ping completely made it more stable. However, I still experience intermittent connectivity problems. During an execution, the slave computer went offline for a couple of seconds and then reconnected to the Jenkins master, as seen in the system log:

          Accepted connection #7 from /10.31.43.49:52692

          Sep 21, 2016 8:14:21 AM INFO jenkins.slaves.DefaultJnlpSlaveReceiver handle

          Disconnecting Buildserver as we are reconnected from the current peer

          Sep 21, 2016 8:29:49 AM WARNING org.jenkinsci.remoting.nio.NioChannelHub run

          Communication problem
          java.io.IOException: Connection timed out
          at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
          at sun.nio.ch.SocketDispatcher.read(Unknown Source)
          at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
          at sun.nio.ch.IOUtil.read(Unknown Source)
          at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
          at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:137)
          at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:310)
          at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:561)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
          at java.util.concurrent.FutureTask.run(Unknown Source)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          at java.lang.Thread.run(Unknown Source)

          Sep 21, 2016 8:29:49 AM WARNING jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed

          NioChannelHub keys=3 gen=842933: Computer.threadPoolForRemoting 2 for Buildserver terminated
          java.io.IOException: Connection aborted: org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport@28b1969e[name=Buildserver]
          at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:208)
          at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:629)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
          at java.util.concurrent.FutureTask.run(Unknown Source)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          at java.lang.Thread.run(Unknown Source)
          Caused by: java.io.IOException: Connection timed out
          at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
          at sun.nio.ch.SocketDispatcher.read(Unknown Source)
          at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
          at sun.nio.ch.IOUtil.read(Unknown Source)
          at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
          at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:137)
          at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:310)
          at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:561)
          ... 6 more

          Any ideas how I can prevent the master from disconnecting the slave (and use the reconnected session instead)?

          Oleg Nenashev added a comment -

          JENKINS-44785 likely addresses this issue in general. There is a pull request to remoting: https://github.com/jenkinsci/remoting/pull/174 , but I have never finished it due to the review feedback.

           

          I will remove the assignee from the ticket for now, see https://groups.google.com/d/msg/jenkinsci-dev/uc6NsMoCFQI/AIO4WG1UCwAJ for the context


            Assignee: Unassigned
            Reporter: cowwoc
            Votes: 58
            Watchers: 69