• Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Component: remoting
    • Labels: None

      The slave goes offline during job execution and throws the error below:

      Slave went offline during the build
      01:20:15 ERROR: Connection was broken: java.io.EOFException
      01:20:15 at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:613)
      01:20:15 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      01:20:15 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
      01:20:15 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
      01:20:15 at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
      01:20:15 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      01:20:15 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
      01:20:15 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
      01:20:15 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
      01:20:15 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
      01:20:15 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
      01:20:15 at java.lang.Thread.run(Thread.java:724)
      01:20:15

          [JENKINS-31050] Slave goes offline during the build

          Raghu Pallikonda added a comment -

          Hi oleg_nenashev

          I was getting the 'Agent offline during the build' error when I was using Jenkins v2.19.1 for the Jenkins master and jenkins-slave v2.62 for the slave pod.
          After reading up on your fix, I upgraded Jenkins to v2.37 and the slave to jenkins-slave 3.4 (remoting 3.4). Now I am getting the error below:

          Caused by: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed
          	at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:617)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
          	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          	at java.lang.Thread.run(Thread.java:745)
          Caused by: org.jenkinsci.remoting.nio.FifoBuffer$CloseCause: Buffer close has been requested
          	at org.jenkinsci.remoting.nio.FifoBuffer.close(FifoBuffer.java:426)
          	at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:332)
          	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:565)
          	... 6 more
          

          Let me know if I need to provide more details.


          Orgad Shaneh added a comment -

          Looks similar to JENKINS-25858. There are 2 solutions that were proposed there:

          1. Upgrade the kernel to >=3.16.1
          2. Execute on the slave as root ethtool -K eth0 sg off

          This worked for us.
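
          For reference, a minimal sketch of how those two workarounds might be checked and applied on a Linux slave. Assumptions: the interface is eth0 (substitute the real one), and disabling scatter-gather offload can reduce network throughput, so treat this as a diagnostic step rather than a definitive fix.

              # Check the running kernel version (the suggestion above is >= 3.16.1)
              uname -r

              # Inspect the current offload settings on the interface
              sudo ethtool -k eth0 | grep scatter-gather

              # Disable scatter-gather offload, as proposed above (requires root)
              sudo ethtool -K eth0 sg off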


          Raghu Pallikonda added a comment -

          Hi orgads
          Thank you for providing the solutions that have worked for other users facing similar issues. My slaves are Docker containers, and when I tried the command below

          ethtool -K eth0 sg off
          

          it failed with

          Cannot set device feature settings: Operation not permitted
          

          The above command requires that my Docker containers run in privileged mode, which is not acceptable from a security standpoint.

          My slave Docker image is derived from Ubuntu 16.10 (Linux kernel 4.8). Based on the above solutions, a kernel version higher than 3.16.1 should also fix the issue, but that doesn't seem to work here (unless someone has got it to work that way).

          Could you let me know how I can triage the issue any further?

          Thanks,
          Raghu
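
          A narrower option than full privileged mode, if it can pass a security review, would be granting the slave container only the NET_ADMIN capability, which is what ethtool needs to change device features. This is a hedged sketch, not from the thread: the image name is a placeholder, and whether toggling offloads on the container's veth interface has the same effect as on the host NIC depends on the environment.

              # Grant only CAP_NET_ADMIN instead of --privileged
              # (my-jenkins-slave is a placeholder for the actual slave image)
              docker run --cap-add=NET_ADMIN my-jenkins-slave

              # Then, inside the container (or in its entrypoint script):
              ethtool -K eth0 sg off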


          Orgad Shaneh added a comment -

          Actually I got it wrong. Our slaves are AWS machines. We just checked "Connect by SSH Process" in System configuration, and it solved the issue.


          Raghu Pallikonda added a comment -

          orgads Hmm, I am leveraging the Jenkins Kubernetes plugin (https://wiki.jenkins-ci.org/display/JENKINS/Kubernetes+Plugin); it only launches JNLP slave workers under the hood, so "SSH process" is not available in my setup.

          Thank you for the quick clarification.


          Luke Richardson added a comment -

          In our configuration on AWS, I found that the connection to slaves was being terminated at around 1 minute for the particular pipeline stage that was running. The stage was a long-running git checkout that intermittently succeeded.

          The solution for me was to increase the ELB idle timeout property on the load balancer in between the slave and master (http://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-idle-timeout.html). By default this property is set to 60 seconds, whereas the Jenkins default for 'hudson.remoting.Launcher.pingTimeoutSec' is 240.

          During the 1 minute period while the slave was executing the long-running git checkout, it must have been transferring less than 1 byte of data, and therefore the ELB was dropping the idle TCP connection.
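
          For anyone else hitting the ELB idle timeout, a rough sketch of raising it on a classic ELB with the AWS CLI. The load balancer name and the 300-second value are placeholders; the idea is simply to set the timeout comfortably above the default 60 seconds so the remoting channel is not cut while a long quiet step runs.

              # Raise the idle timeout on a classic ELB (name and value are placeholders)
              aws elb modify-load-balancer-attributes \
                  --load-balancer-name my-jenkins-elb \
                  --load-balancer-attributes "{\"ConnectionSettings\":{\"IdleTimeout\":300}}"

              # Verify the new setting
              aws elb describe-load-balancer-attributes --load-balancer-name my-jenkins-elb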


          Oleg Nenashev added a comment -

          Unfortunately I have no capacity to work on Remoting in the medium term, so I will unassign it and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.


          shraddha Magar added a comment - edited

          I am also facing the same issue of the agent going offline during the build.

          I am using Jenkins v2.105 and JRE 1.8.

          I am using Linux as the master and IBM AIX and Windows Server 2012 as slaves. We execute nightly builds on the slaves, but sometimes the agent goes offline and the build does not complete, so if anybody has a workaround for this issue, please let me know.

          Thanks in advance.


          Prudhvi Godithi added a comment - edited

          Hey, I am having the same issue with the Kubernetes plugin, where slaves connect to the master via JNLP on a particular port. We have even increased the ELB connection timeout but are still facing the same issue: slaves go offline in the middle of builds, and the job works fine when rebuilt. This is causing a huge impact on our pipeline builds. Our issue is very close to what rpallikonda has mentioned above; if there is any solution, please let me know.
          Thank you
          Slave version:

          remoting-3.20.jar

          Error:

          hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed.

           

          Should I upgrade remoting to the latest version?
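
          Before upgrading remoting, it may also be worth checking the 'hudson.remoting.Launcher.pingTimeoutSec' property mentioned earlier in this thread. It is a JVM system property, so a sketch of passing it when launching a JNLP agent by hand looks like the lines below; the URL, agent name and secret are placeholders, the 300-second value is only an example, and with the Kubernetes plugin the property would have to be injected into the JNLP container's JVM options, which depends on the plugin version.

              # Launch the JNLP agent with an explicit remoting ping timeout (placeholders throughout)
              java -Dhudson.remoting.Launcher.pingTimeoutSec=300 \
                   -jar agent.jar \
                   -jnlpUrl https://jenkins.example.com/computer/my-agent/slave-agent.jnlp \
                   -secret <agent-secret>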


          The th3mis added a comment -

          Hello everyone, I faced the same problem where the slave goes offline during the build when using an SSH or JNLP agent.

          TL;DR: The Jenkins agent and the build shell are in the same process group (same PGID), so a kill(pid = 0, signal = SIGTERM) issued from the build will crash the Jenkins agent too.

           

          PID   PGID  SID   TPGID COMMAND
          13691 13691 49864 13691 java -jar agent.jar
          13818 13691 49864 13691  \_ /bin/sh -xe /tmp/jenkins4748921288996267614.sh
          13820 13691 49864 13691    \_ kill(0, SIGTERM)

           

          I propose some form of agent daemonization to protect against this kind of bug (call setsid() in the thread pool?).

          Description:

          In our case we build many projects using make, so builds are started and aborted many times. GNU make has pid = 0 in an internal structure, so when we click 'abort build' in Jenkins, it sends SIGTERM to the child processes -> make sends SIGTERM to its children, and sometimes GNU make (fixed in later versions) calls `kill(0, SIGTERM)`, which on a Linux agent means the whole process group is terminated, including the Jenkins agent -> so the agent dies during the build.
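
          A quick way to see the effect of the proposed setsid() fix without patching the agent is to wrap the build shell command with the setsid(1) utility from util-linux, which puts it in its own session and process group, so a kill(0, SIGTERM) inside the build no longer reaches the agent. This is only a sketch: the script name is a placeholder, and wrapping a step this way may also change how Jenkins terminates it on abort.

              # Inside an "Execute shell" build step (my-build-step.sh is hypothetical):
              setsid -w /bin/sh -xe ./my-build-step.sh

              # Verify: the wrapped shell now shows a PGID/SID different from the agent's
              ps -e -o pid,pgid,sid,comm | grep -E 'java|sh'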


            Assignee: Unassigned
            Reporter: Sujith Dinakar (nutcracker66)
            Votes: 24
            Watchers: 31
