• Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Component: core

      This issue is related to JENKINS-6817.

      I am running Jenkins slaves inside virtual machines. Sometimes these machines are overloaded and I get the following exception:

      FATAL: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      	at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:41)
      	at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:34)
      	at hudson.remoting.Request.call(Request.java:174)
      	at hudson.remoting.Channel.call(Channel.java:713)
      	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:167)
      	at $Proxy38.join(Unknown Source)
      	at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:925)
      	at hudson.Launcher$ProcStarter.join(Launcher.java:360)
      	at hudson.tasks.Maven.perform(Maven.java:327)
      	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:804)
      	at hudson.model.Build$BuildExecution.build(Build.java:199)
      	at hudson.model.Build$BuildExecution.doRun(Build.java:160)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:586)
      	at hudson.model.Run.execute(Run.java:1593)
      	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      	at hudson.model.ResourceController.execute(ResourceController.java:88)
      	at hudson.model.Executor.run(Executor.java:247)
      Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      	at hudson.remoting.Request.abort(Request.java:299)
      	at hudson.remoting.Channel.terminate(Channel.java:773)
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
      Caused by: java.io.IOException: Unexpected termination of the channel
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
      Caused by: java.io.EOFException
      	at java.io.ObjectInputStream$BlockDataInputStream.peekByte(Unknown Source)
      	at java.io.ObjectInputStream.readObject0(Unknown Source)
      	at java.io.ObjectInputStream.readObject(Unknown Source)
      	at hudson.remoting.Command.readFrom(Command.java:92)
      	at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:72)
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      

      Is it possible to make the channel timeout configurable? I'd like to increase the value from, say, 5 seconds to 30 seconds or a minute.
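
      To make the request concrete, here is a rough sketch of the kind of thing I have in mind, assuming the value could come from a system property. The property name below is invented for illustration and is not an existing option:

      // Sketch only: read the channel ping timeout from a system property instead of a
      // hard-coded constant. "hudson.remoting.channelTimeoutSeconds" is a hypothetical
      // property name used for illustration; it is not an existing Jenkins option.
      public class ChannelTimeoutSettings {
          /**
           * Defaults to the current behaviour; could be overridden with
           * -Dhudson.remoting.channelTimeoutSeconds=30 on the JVM that hosts the channel.
           */
          public static final int TIMEOUT_SECONDS =
                  Integer.getInteger("hudson.remoting.channelTimeoutSeconds", 5);

          private ChannelTimeoutSettings() {
          }
      }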

          [JENKINS-18781] Configurable channel timeout for slaves

          cowwoc created issue -
          cowwoc made changes -
          Link New: This issue is related to JENKINS-6817 [ JENKINS-6817 ]
          Ulli Hafner made changes -
          Link New: This issue is related to JENKINS-18879 [ JENKINS-18879 ]

          Ramin Baradari added a comment -

          I wonder if those new slave timeouts might be related to this change: https://github.com/jenkinsci/remoting/commit/28830e37b94387d0c6f9927ad897f4010e6c1bda
          Maybe Kohsuke knows and can add some logging in case of connection timeouts. Currently everything happens silently and there is no clue why the connections die or which timeout is actually responsible.

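          In the meantime, one way to try to get more detail out of the remoting layer (assumption on my part: the relevant code logs under the hudson.remoting package) is to raise that logger's level, either with a log recorder under Manage Jenkins > System Log or from the script console, for example:

          // For the Jenkins script console (Manage Jenkins > Script Console).
          // Whether this surfaces the particular timeout that terminates the channel is
          // unverified; a log recorder for "hudson.remoting" at level FINE may also be
          // needed so the messages are captured somewhere visible.
          import java.util.logging.Level;
          import java.util.logging.Logger;

          Logger remoting = Logger.getLogger("hudson.remoting");
          remoting.setLevel(Level.FINE);
          System.out.println("hudson.remoting logger level: " + remoting.getLevel());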

          cowwoc added a comment -

          I agree, we need more logging. On a side note, a five-second timeout is very low in my case. It is quite likely that overloaded VMs will fail to respond for longer (especially if swapping to disk occurs).


          Henri Gomez added a comment -

          +1, I get these failures more and more often.

          It would be great to have the slave connection and read timeouts configurable per node, so we could set them independently.
          For example, remote slaves connected over a WAN could require more time to connect and respond.

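          To make the per-node idea concrete, here is a rough sketch assuming the values could be carried by the existing NodeProperty extension point. The class and field names are invented for illustration, and a Descriptor plus configuration form would still be needed before anything reads them:

          // Hypothetical sketch only, not an existing Jenkins API: per-node connect/read
          // timeouts carried as a NodeProperty, so WAN slaves can be given more generous
          // values than local ones.
          import hudson.model.Node;
          import hudson.slaves.NodeProperty;

          public class ChannelTimeoutNodeProperty extends NodeProperty<Node> {
              private final int connectTimeoutSeconds;
              private final int readTimeoutSeconds;

              public ChannelTimeoutNodeProperty(int connectTimeoutSeconds, int readTimeoutSeconds) {
                  this.connectTimeoutSeconds = connectTimeoutSeconds;
                  this.readTimeoutSeconds = readTimeoutSeconds;
              }

              public int getConnectTimeoutSeconds() {
                  return connectTimeoutSeconds;
              }

              public int getReadTimeoutSeconds() {
                  return readTimeoutSeconds;
              }
          }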

          Tony Greway added a comment -

          I've recently upgraded the SSH plugin from 0.27 to 1.4, and I see this exception thrown almost every day on our nightly builds. We use a multi-configuration project to install software on our distributed cluster every night, and I see our builds fail randomly on one of the nodes with high regularity. All of our nodes are VMs that typically have no traffic at the time of the build, and I have them configured to stay offline until a build is requested. Is there a way to revert to 0.27 until this is fixed? I've tried without success.


          Stoil Valchkov added a comment -

          Hi guys,

          Are there any plans to fix this soon? It is crucial in order to have a stable environment.

          Thanks


          Ian Norton added a comment -

          I hate to make a "me too" post, but this is getting rather annoying here. My Windows (VM) slaves tend to fail one in every three builds because of this, and as we have a matrix job it usually means two out of three builds fail.


          Sajajd Rehman added a comment -

          Ping, any fixers out there?


            Assignee: Unassigned
            Reporter: cowwoc
            Votes: 58
            Watchers: 69
