• Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Component: remoting

      The slave goes offline during job execution and throws the error shown below.

      Slave went offline during the build
      01:20:15 ERROR: Connection was broken: java.io.EOFException
      01:20:15 at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:613)
      01:20:15 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      01:20:15 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
      01:20:15 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
      01:20:15 at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
      01:20:15 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      01:20:15 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      01:20:15 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
      01:20:15 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
      01:20:15 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      01:20:15 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      01:20:15 at java.lang.Thread.run(Thread.java:724)
      01:20:15

          [JENKINS-31050] Slave goes offline during the build

          Sujith Dinakar added a comment -

          I have found a few Jira issues related to this, but I do not see a fix or a workaround. Please let me know if you require more information.

          Sujith Dinakar added a comment -

          I also see this issue on multiple slaves; it is currently blocking us.

          Sujith Dinakar added a comment -

          Does anyone even look at these defects? May I have an update, please?

          Kevin Navero added a comment -

          I'm getting the same problem on Jenkins 1.625.1 LTS. My setup is a slave node running Windows 7 natively, which hosts a Windows Server 2003 virtual machine. The slave-agent client runs on the virtual machine, which has Java 1.7.0_80. Let me know if I can supply more information.

          charles s added a comment -

          Same issue here for a couple of months. Our Jenkins script launches a Java ProcessBuilder and redirects its IO, and then the issue appears.

          Fernando Abad added a comment -

          I had the same issue; another process (automated testing) was killing the Java process on the slave machine.

          Roberto Flores added a comment -

          Hi Fernando, it seems I'm having the same problem. How did you get around it? I would really appreciate your help with this.

          Fernando Abad added a comment -

          We use a Jenkins slave to run UFT automated tests (500 test cases); a few of them had a "TSKILL java" in the code. Check that none of the processes you run on the slave machine kills the Java process.

          I think this error is displayed when the Java process on the slave machine is closed suddenly.

          Todd B added a comment - edited

          I have been seeing this too on Windows-based VMs. The node VM is not being reset, so it must be just the Jenkins service that is crashing and restarting. I am seeing this as often as twice a day, since some of the jobs run at node startup. It is really bad when it happens mid-job, logging the message "Slave goes offline during the build".

          Ricardo Moreira added a comment -

          I'm getting this on an Ubuntu machine. It takes about one minute from the time the job enters a Behat step to the time the job fails.
          The stack trace is just slightly different:

          Agent went offline during the build
          ERROR: Connection was broken: java.io.EOFException
          at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:745)

          varun shrivastava added a comment -

          I have the same problem. My version of Jenkins is 2.7.2-1.1 with JDK 1.8.0_51.

          WARNING: Computer.threadPoolForRemoting 10973 for VM06-OASTEST terminated
          java.io.EOFException
          at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:745)
          SEVERE: A thread (TCP agent connection handler #12285 with /10.254.1.94:62697/223645) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.
          hudson.remoting.RequestAbortedException: java.io.EOFException
          at hudson.remoting.Request.abort(Request.java:303)
          at hudson.remoting.Channel.terminate(Channel.java:847)
          at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:745)
          at ......remote call to VM06-OASTEST(Native Method)
          at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1416)
          at hudson.remoting.Request.call(Request.java:172)
          at hudson.remoting.Channel.call(Channel.java:780)
          at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:508)
          at jenkins.slaves.JnlpSlaveAgentProtocol$Handler.jnlpConnect(JnlpSlaveAgentProtocol.java:127)
          at jenkins.slaves.DefaultJnlpSlaveReceiver.handle(DefaultJnlpSlaveReceiver.java:69)
          at jenkins.slaves.JnlpSlaveAgentProtocol2$Handler2.run(JnlpSlaveAgentProtocol2.java:60)
          at jenkins.slaves.JnlpSlaveAgentProtocol2.handle(JnlpSlaveAgentProtocol2.java:32)
          at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:182)
          Caused by: java.io.EOFException
          at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
          at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:745)


          Oleg Nenashev added a comment -

          Fixed the component


          Hariharan Ragothaman added a comment -

          oleg_nenashev, I am still having this issue on Ubuntu nodes; I have been following this story. Is there something else to be done on the user's end?

          Oleg Nenashev added a comment -

          Which remoting version do you use on the nodes and on the master?

          Oleg Nenashev added a comment -

          I am pretty sure the changes in 3.3 for JENKINS-25218 will influence this behavior (and may have fixed it).
          Created JENKINS-40491 for the diagnostic improvements.


          Oleg Nenashev added a comment -

          I have created https://github.com/jenkinsci/remoting/pull/138 with additional diagnostics.


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Oleg Nenashev
          Path:
          src/main/java/org/jenkinsci/remoting/nio/FifoBuffer.java
          src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java
          http://jenkins-ci.org/commit/remoting/2f81d4c9604dfe490b8474b0c44c1ef90f4cbeca
          Log:
          JENKINS-40491 - Improve diagnostincs of the preliminary FifoBuffer termination.

          When NioChannelHub suffers from the preliminary buffer closure, it will print a SEVERE log to the Agent log.
          This change should improve diagnostics of issues like JENKINS-31050


          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Oleg Nenashev
          Path:
          src/main/java/org/jenkinsci/remoting/nio/FifoBuffer.java
          src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java
          http://jenkins-ci.org/commit/remoting/e500853bc8b50c12761ad63739fd27fd40183b3c
          Log:
          Merge pull request #138 from oleg-nenashev/bug/JENKINS-31050

          JENKINS-40491 - Improve diagnostincs of the preliminary FifoBuffer termination

          Compare: https://github.com/jenkinsci/remoting/compare/cdd5bce5725d...e500853bc8b5


          Oleg Nenashev added a comment -

          Jenkins 2.37 offers better diagnostics for such cases. I would appreciate it if somebody could reproduce the behavior on this version and provide new logs.


          Raghu Pallikonda added a comment -

          Hi oleg_nenashev,

          I was getting the 'Agent offline during the build' error when I was using Jenkins v2.19.1 for the master and jenkins-slave v2.62 for the slave pod.
          After reading up on your fix, I upgraded Jenkins to v2.37 and the slave to jenkins-slave 3.4 (remoting 3.4). Now I am getting the error below:

          Caused by: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed
          	at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:617)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
          	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:745)
          Caused by: org.jenkinsci.remoting.nio.FifoBuffer$CloseCause: Buffer close has been requested
          	at org.jenkinsci.remoting.nio.FifoBuffer.close(FifoBuffer.java:426)
          	at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:332)
          	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:565)
          	... 6 more
          

          Let me know if I need to provide more details.


          Orgad Shaneh added a comment -

          Looks similar to JENKINS-25858. Two solutions were proposed there:

          1. Upgrade the kernel to >=3.16.1
          2. On the slave, as root, execute ethtool -K eth0 sg off

          This worked for us.
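          As a quick sanity check before applying the workaround, the kernel-version condition above can be scripted. This is a hedged sketch, not from the thread: `kernel_at_least` and its parsing are made-up names and logic, and the `ethtool` hint in the message assumes the NIC is `eth0`.

```python
import platform
import re

def kernel_at_least(required: str, release: str) -> bool:
    """Compare dotted kernel versions numerically, ignoring any '-suffix'."""
    parse = lambda v: [int(x) for x in re.split(r"[.-]", v)[:3] if x.isdigit()]
    return parse(release) >= parse(required)

release = platform.release()  # e.g. '4.8.0-58-generic'
if kernel_at_least("3.16.1", release):
    print(f"kernel {release} is new enough")
else:
    print(f"kernel {release} is older than 3.16.1; try: ethtool -K eth0 sg off")
```

          On a kernel older than 3.16.1 this suggests the scatter-gather workaround; on a newer kernel the offload fix should already be in place.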


          Raghu Pallikonda added a comment -

          Hi orgads,
          Thank you for providing the solutions that resolved the issue for some users. My slaves are Docker containers, and when I tried

          ethtool -K eth0 sg off
          

          The command failed with

          Cannot set device feature settings: Operation not permitted
          

          The above command requires that my Docker containers run in privileged mode, which is not acceptable from a security standpoint.

          My slave Docker image is derived from Ubuntu 16.10 (Linux kernel 4.8). Based on the above solutions, a kernel version higher than 3.16.1 should also fix the issue, but that doesn't seem to work (unless someone has gotten it to work).

          Could you let me know how I can triage the issue further?

          Thanks,
          Raghu


          Orgad Shaneh added a comment -

          Actually, I got it wrong. Our slaves are AWS machines. We just checked "Connect by SSH Process" in the system configuration, and it solved the issue.


          Raghu Pallikonda added a comment -

          orgads, hmm, I am leveraging the Jenkins Kubernetes plugin (https://wiki.jenkins-ci.org/display/JENKINS/Kubernetes+Plugin); it only launches JNLP slave workers under the hood. "SSH process" is not available in my setup.

          Thank you for the quick clarification.


          Luke Richardson added a comment -

          In our configuration on AWS, I found that the connection to slaves was being terminated after around one minute for the particular pipeline stage that was running. The stage was a long-running git checkout that succeeded only intermittently.

          The solution for me was to increase the ELB idle timeout property on the load balancer between the slave and the master (http://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-idle-timeout.html). By default this property is set to 60 seconds, whereas the Jenkins default for 'hudson.remoting.Launcher.pingTimeoutSec' is 240.

          During the one-minute period when the slave was executing the long-running git checkout, it must have been transferring almost no data, so the ELB was dropping the idle TCP connection.
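          The timing mismatch described above can be sketched as a toy check. This is a hedged illustration: `connection_survives`, `idle_timeout_s`, and `ping_interval_s` are made-up names, and 240 s is simply the value quoted in the comment, treated as the longest possible quiet period between keep-alive pings.

```python
def connection_survives(idle_timeout_s: int, ping_interval_s: int) -> bool:
    """A load balancer drops a TCP connection after idle_timeout_s seconds
    with no traffic; during a quiet build step, the agent's keep-alive ping
    every ping_interval_s seconds is the only guaranteed traffic."""
    return ping_interval_s < idle_timeout_s

# Default classic ELB idle timeout (60 s) vs pings up to 240 s apart:
print(connection_survives(60, 240))    # False: the ELB drops the link
# After raising the ELB idle timeout above the ping interval:
print(connection_survives(300, 240))   # True: pings keep the link alive
```

          The fix amounts to making the left operand larger than the right one, either by raising the ELB idle timeout or by pinging more often.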


          Oleg Nenashev added a comment -

          Unfortunately, I have no capacity to work on Remoting in the medium term, so I will unassign this and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.


          shraddha Magar added a comment - edited

          I am also facing the same issue of the agent going offline during the build.

          I am using Jenkins v2.105 and JRE 1.8.

          I am using Linux as the master, and IBM AIX and Windows Server 2K12 as slaves. We execute nightly builds on the slaves, but sometimes a build does not complete because the agent goes offline. If anybody has a workaround for this issue, please let me know.

          Thanks in advance.


          Prudhvi Godithi added a comment - edited

          Hey, I am having the same issue with the Kubernetes plugin, where slaves connect to the master over JNLP on a particular port. We have even increased the ELB connection timeout but still face the same issue: slaves go offline in the middle of builds, yet the job works fine when rebuilt. This is causing a huge impact on our pipeline builds. Our issue is very close to what rpallikonda has mentioned above; if there is any solution, please let me know.
          Thank you
          Slave version:

          remoting-3.20.jar

          Error:

          hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed.

           

          Should I upgrade the remoting to latest version?


          The th3mis added a comment -

          Hello everyone. I faced the same problem, where the slave goes offline during the build, using both SSH and JNLP agents.

          TLDR: The Jenkins agent and the build shell share the same PGID, so a kill(pid = 0, signal = SIGTERM) from the build will crash the Jenkins agent too.

           

          PID   PGID  SID   TPGID COMMAND
          13691 13691 49864 13691 java -jar agent.jar
          13818 13691 49864 13691  \_ /bin/sh -xe /tmp/jenkins4748921288996267614.sh
          13820 13691 49864 13691    \_ kill(0, SIGTERM)

           

          I propose daemonizing the agent to guard against such bugs (call setsid() in the thread pool?).

          Description:

          For example, we build many projects using make, so builds are started and aborted many times. GNU make holds pid = 0 in an internal structure, so when we click "abort build" in Jenkins, it sends SIGTERM to the child processes; make then sends SIGTERM to its children, and sometimes GNU make (since fixed) calls `kill(0, SIGTERM)`, which on Linux terminates the entire process group, including the Jenkins agent. So we get a dead agent during the build.
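          As a minimal, hypothetical illustration of the proposed setsid() mitigation (POSIX-only, and not Jenkins agent code), Python's subprocess module exposes setsid() via start_new_session=True; a child launched that way is no longer in the group that kill(0, ...) targets:

```python
import os
import subprocess

# A normally launched child inherits the parent's process group, so a
# kill(0, SIGTERM) issued inside it would also hit the parent -- the
# situation in the ps output above.
inherited = subprocess.run(
    ["python3", "-c", "import os; print(os.getpgrp())"],
    capture_output=True, text=True)

# start_new_session=True makes the child call setsid() after fork(),
# giving it a fresh session and process group: the proposed mitigation.
detached = subprocess.run(
    ["python3", "-c", "import os; print(os.getpgrp())"],
    capture_output=True, text=True, start_new_session=True)

parent_pgid = os.getpgrp()
print(int(inherited.stdout) == parent_pgid)  # True: shares our group
print(int(detached.stdout) == parent_pgid)   # False: isolated group
```

          An agent detached this way would survive a group-wide SIGTERM from the build, at the cost of also escaping signals the launcher intends for it.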


            Assignee: Unassigned
            Reporter: Sujith Dinakar (nutcracker66)
            Votes: 24
            Watchers: 31