Type: Bug
Resolution: Unresolved
Priority: Blocker
None
The slave goes offline during job execution and throws the error below:
Slave went offline during the build
01:20:15 ERROR: Connection was broken: java.io.EOFException
01:20:15 at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:613)
01:20:15 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
01:20:15 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
01:20:15 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
01:20:15 at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
01:20:15 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
01:20:15 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
01:20:15 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
01:20:15 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
01:20:15 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
01:20:15 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
01:20:15 at java.lang.Thread.run(Thread.java:724)
is duplicated by
- JENKINS-36944 Agent goes offline during the build (Resolved)
is related to
- JENKINS-40491 Preliminary FifoBuffer termination can cause outage of all JNLP1/2 agents (Resolved)
relates to
- JENKINS-25858 java.io.IOException: Unexpected termination of the channel (Resolved)
- JENKINS-23419 FATAL: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected EOF (Closed)
[JENKINS-31050] Slave goes offline during the build
Also, I see this issue on multiple slaves; it is currently blocking us.
Does anyone even look at these defects? May I have an update please?
I'm getting the same problem on Jenkins 1.625.1 LTS. My configuration: a slave node running Windows 7 natively, which hosts a Windows Server 2003 virtual machine. The slave agent client runs on the virtual machine, and the Windows Server 2003 VM is running Java 1.7.0_80. Let me know if I can supply more information.
Same issue here for a couple of months. Our Jenkins script launches a process via Java's ProcessBuilder and redirects its IO; then the issue appears.
I had the same issue; it was because another process (automated testing) was killing the Java process on the slave machine.
Hi Fernando, it seems I'm having the same problem. How did you get around it? I would really appreciate your help on this.
We use a Jenkins slave to run UFT automated tests (500 test cases); a few of them had a "TSKILL java" in the code. Check that none of the processes you are running on the slave machine kills the Java process.
I think this error is displayed when the Java process on the slave machine closes suddenly.
I have been seeing this too on Windows-based VMs. The node VM is not being reset, so it must be just the Jenkins service that is crashing and restarting. I am seeing this as often as twice a day, since some of the jobs run at node startup. It is really bad when it happens mid-job, logging the message "Slave goes offline during the build".
I'm getting this on an Ubuntu machine. It takes about one minute from the time the job enters a Behat step to the time the job fails.
The stack trace is just slightly different:
Agent went offline during the build
ERROR: Connection was broken: java.io.EOFException
at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I have the same problem. My Jenkins version is 2.7.2-1.1 with JDK 1.8.0_51.
WARNING: Computer.threadPoolForRemoting 10973 for VM06-OASTEST terminated
java.io.EOFException
at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
SEVERE: A thread (TCP agent connection handler #12285 with /10.254.1.94:62697/223645) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.
hudson.remoting.RequestAbortedException: java.io.EOFException
at hudson.remoting.Request.abort(Request.java:303)
at hudson.remoting.Channel.terminate(Channel.java:847)
at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at ......remote call to VM06-OASTEST(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1416)
at hudson.remoting.Request.call(Request.java:172)
at hudson.remoting.Channel.call(Channel.java:780)
at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:508)
at jenkins.slaves.JnlpSlaveAgentProtocol$Handler.jnlpConnect(JnlpSlaveAgentProtocol.java:127)
at jenkins.slaves.DefaultJnlpSlaveReceiver.handle(DefaultJnlpSlaveReceiver.java:69)
at jenkins.slaves.JnlpSlaveAgentProtocol2$Handler2.run(JnlpSlaveAgentProtocol2.java:60)
at jenkins.slaves.JnlpSlaveAgentProtocol2.handle(JnlpSlaveAgentProtocol2.java:32)
at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:182)
Caused by: java.io.EOFException
at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
oleg_nenashev I'm still having this issue on Ubuntu nodes; I've been following this story. Is there something else to be done on the user's end?
I am pretty sure the changes in 3.3 for JENKINS-25218 will influence the behavior somehow (and may have fixed it).
Created JENKINS-40491 for the diagnostic improvements.
So I have created https://github.com/jenkinsci/remoting/pull/138 with additional diagnostics
Code changed in jenkins
User: Oleg Nenashev
Path:
src/main/java/org/jenkinsci/remoting/nio/FifoBuffer.java
src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java
http://jenkins-ci.org/commit/remoting/2f81d4c9604dfe490b8474b0c44c1ef90f4cbeca
Log:
JENKINS-40491 - Improve diagnostincs of the preliminary FifoBuffer termination.
When NioChannelHub suffers from the preliminary buffer closure, it will print a SEVERE log to the Agent log.
This change should improve diagnostics of issues like JENKINS-31050
Code changed in jenkins
User: Oleg Nenashev
Path:
src/main/java/org/jenkinsci/remoting/nio/FifoBuffer.java
src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java
http://jenkins-ci.org/commit/remoting/e500853bc8b50c12761ad63739fd27fd40183b3c
Log:
Merge pull request #138 from oleg-nenashev/bug/JENKINS-31050
JENKINS-40491 - Improve diagnostincs of the preliminary FifoBuffer termination
Compare: https://github.com/jenkinsci/remoting/compare/cdd5bce5725d...e500853bc8b5
Jenkins 2.37 offers better diagnostics for such cases. I would appreciate it if somebody could reproduce the behavior on this version and provide new logs.
I was getting the 'Agent offline during the build' error when using Jenkins v2.19.1 for the master and jenkins-slave v2.62 for the slave pod.
After reading up on your fix, I upgraded Jenkins to v2.37 and the slave to jenkins-slave 3.4 (Remoting 3.4). Now I am getting the error below:
Caused by: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed
	at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:617)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.jenkinsci.remoting.nio.FifoBuffer$CloseCause: Buffer close has been requested
	at org.jenkinsci.remoting.nio.FifoBuffer.close(FifoBuffer.java:426)
	at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:332)
	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:565)
	... 6 more
Let me know if I need to provide more details.
Looks similar to JENKINS-25858. Two solutions were proposed there:
- Upgrade the kernel to >= 3.16.1
- Execute on the slave, as root: `ethtool -K eth0 sg off`
This worked for us.
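For anyone checking the kernel-version condition above, here is a small Python sketch that compares a kernel release string against the 3.16.1 threshold. It only parses version numbers; it does not verify that the underlying network-driver issue is actually fixed in a given kernel build:

```python
import platform
import re

def kernel_at_least(minimum, release=None):
    """Return True if the kernel release meets the given (major, minor, patch) minimum."""
    release = release or platform.release()  # e.g. "4.8.0-59-generic"
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        return False
    parts = tuple(int(x) for x in m.groups(default="0"))
    return parts >= minimum

# Check the running kernel against the threshold mentioned in this thread.
print(kernel_at_least((3, 16, 1)))
```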
Hi orgads
Thank you for providing the solutions that resolved the issue for some users facing similar problems. My slaves are Docker containers, and when I tried
ethtool -K eth0 sg off
the command failed with
Cannot set device feature settings: Operation not permitted
The command requires my Docker containers to run in privileged mode, which is not acceptable from a security standpoint.
My slave Docker image is derived from Ubuntu 16.10 (Linux kernel 4.8). Based on the above solutions, a kernel version higher than 3.16.1 should also fix the issue, but that does not seem to work (unless someone has got it working on that, too).
Could you let me know how I can triage the issue further?
Thanks,
Raghu
Actually I got it wrong. Our slaves are AWS machines. We just checked "Connect by SSH Process" in System configuration, and it solved the issue.
orgads Hmm, I am using the Jenkins Kubernetes plugin (https://wiki.jenkins-ci.org/display/JENKINS/Kubernetes+Plugin); it only launches JNLP slave workers under the hood, so "SSH process" is not available in my setup.
Thank you for the quick clarification.
In our AWS configuration, I found that the connection to slaves was being terminated after about one minute during a particular pipeline stage. The stage was a long-running git checkout that succeeded only intermittently.
The solution for me was to increase the ELB idle timeout property on the load balancer between the slave and master (http://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-idle-timeout.html). By default this property is set to 60 seconds, whereas the Jenkins default for 'hudson.remoting.Launcher.pingTimeoutSec' is 240.
During the one-minute period when the slave was executing the long-running git checkout, it must have been transferring almost no data, so the ELB was dropping the idle TCP connection.
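The timeout mismatch described above can be illustrated with a toy model in Python. The 60 s and 240 s figures are the defaults quoted in this thread (classic ELB idle timeout and `hudson.remoting.Launcher.pingTimeoutSec`), not values verified here:

```python
# Toy model: the channel survives a silent period only if some traffic
# (a remoting ping, build output) arrives before the load balancer's
# idle timer expires.
ELB_IDLE_TIMEOUT_SEC = 60   # AWS classic ELB default, per this thread
PING_INTERVAL_SEC = 240     # Jenkins remoting default, per this thread

def connection_survives_silence(silence_sec, idle_timeout_sec):
    # True if traffic resumes before the idle timer fires.
    return silence_sec < idle_timeout_sec

# A checkout silent until the next ping outlives the default idle timeout:
print(connection_survives_silence(PING_INTERVAL_SEC, ELB_IDLE_TIMEOUT_SEC))
# Raising the idle timeout above the ping interval keeps the channel up:
print(connection_survives_silence(PING_INTERVAL_SEC, 300))
```

This is why raising the ELB idle timeout above the ping interval (or lowering the ping interval below the idle timeout) resolves the drops.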
Unfortunately I have no capacity to work on Remoting in the medium term, so I will unassign the issue and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.
I am also facing the same issue of the agent going offline during the build.
I am using Jenkins v2.105 and JRE 1.8.
I am using Linux as the master, with IBM AIX and Windows Server 2012 as slaves. We execute nightly builds on the slaves, but sometimes a build does not complete because the agent goes offline. If anybody has a workaround for this issue, please let me know.
Thanks in advance.
I am having the same issue with the Kubernetes plugin, where slaves connect to the master via JNLP on a particular port. We have even increased the ELB connection timeout, but we still face the same issue: slaves go offline in the middle of builds, and the job works fine when rebuilt. This is having a huge impact on our pipeline builds. Our issue is very close to what rpallikonda has mentioned above; if there is any solution, please let me know.
Thank you
Slave version:
remoting-3.20.jar
Error:
hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed.
Should I upgrade Remoting to the latest version?
Hello everyone, I ran into the same problem where the slave goes offline during the build, using either an SSH or a JNLP agent.
TL;DR: The Jenkins agent and the build shell share the same PGID, so kill(pid = 0, signal = SIGTERM) will crash the Jenkins agent too.
  PID  PGID   SID TPGID COMMAND
13691 13691 49864 13691 java -jar agent.jar
13818 13691 49864 13691  \_ /bin/sh -xe /tmp/jenkins4748921288996267614.sh
13820 13691 49864 13691      \_ kill(0, SIGTERM)
I propose daemonizing the agent to guard against such bugs (for example, calling setsid() when spawning build processes).
Description:
In our setup we build many projects using make, so builds are started and aborted many times. GNU make keeps pid = 0 in an internal structure, so when we click "abort build" in Jenkins it sends SIGTERM to the child processes -> make sends SIGTERM to its children, and sometimes GNU make (fixed after ) calls `kill(0, SIGTERM)`, which on Linux terminates the entire process group, including the Jenkins agent -> so the agent dies during the build.
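The setsid() mitigation proposed above can be sketched in Python on Linux. This illustrates the mechanism only; it is not how the Jenkins agent is actually implemented. `start_new_session=True` makes the child call setsid() before exec, so a step that runs `kill -TERM 0` only terminates its own process group:

```python
import signal
import subprocess

# A misbehaving "build step" that signals its whole process group, as GNU
# make did in the bug described above. Launched in its own session so the
# parent (standing in for the Jenkins agent) is not in that group.
proc = subprocess.Popen(
    ["sh", "-c", "kill -TERM 0"],
    start_new_session=True,  # child calls setsid(): new session, new PGID
)
proc.wait()

# The child killed itself with SIGTERM, but the parent survives.
print("agent still running, step exit:", proc.returncode)
```

Without `start_new_session=True`, the child would inherit the parent's process group, and the same `kill -TERM 0` would terminate the parent as well.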
I have found a few JIRA issues related to this, but I do not see a fix or a workaround for them. Please let me know if you require more information on this.