Type: Bug
Resolution: Unresolved
Priority: Blocker
None
The slave goes offline during job execution and throws the error below:
Slave went offline during the build
01:20:15 ERROR: Connection was broken: java.io.EOFException
01:20:15 at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:613)
01:20:15 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
01:20:15 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
01:20:15 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
01:20:15 at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
01:20:15 at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
01:20:15 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
01:20:15 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
01:20:15 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
01:20:15 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
01:20:15 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
01:20:15 at java.lang.Thread.run(Thread.java:724)
is duplicated by
- JENKINS-36944 Agent goes offline during the build (Resolved)
is related to
- JENKINS-40491 Preliminary FifoBuffer termination can cause outage of all JNLP1/2 agents (Resolved)
relates to
- JENKINS-25858 java.io.IOException: Unexpected termination of the channel (Resolved)
- JENKINS-23419 FATAL: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected EOF (Closed)
[JENKINS-31050] Slave goes offline during the build
Also, I see this issue on multiple slaves; it is currently blocking us.
Does anyone even look at these defects? May I have an update please?
I'm getting the same problem on Jenkins 1.625.1 LTS. My configuration: a slave node running Windows 7 natively, which hosts a Windows Server 2003 virtual machine. The slave agent client runs on the virtual machine, and the Windows Server 2003 VM is running Java 1.7.0_80. Let me know if I can supply more information.
Same issue here for a couple of months. Our Jenkins script launches a process via Java's ProcessBuilder and redirects its IO; then the issue appears.
I had the same issue; it was because another process (automated testing) was killing the Java process on the slave machine.
Hi Fernando, it seems I'm having the same problem. How did you get around it? I would really appreciate your help on this.
We use a Jenkins slave to run UFT automated tests (500 test cases); a few of them had a "TSKILL java" in the code. Check that none of the processes you are running on the slave machine kills the Java process.
I think this error is displayed when the Java process on the slave machine closes suddenly.
I have been seeing this too on Windows-based VMs. The node VM is not being reset, so it must be just the Jenkins service that is crashing and restarting. I am seeing this as often as twice a day, since some of the jobs run at node startup. It is really bad when it happens mid-job, logging the message "Slave goes offline during the build".
I'm getting this on an Ubuntu machine. It takes about one minute from the time the job enters a Behat step to the time the job fails.
The stack trace is just slightly different:
Agent went offline during the build
ERROR: Connection was broken: java.io.EOFException
at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I have the same problem. My Jenkins version is 2.7.2-1.1 with JDK 1.8.0_51.
WARNING: Computer.threadPoolForRemoting 10973 for VM06-OASTEST terminated
java.io.EOFException
at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
SEVERE: A thread (TCP agent connection handler #12285 with /10.254.1.94:62697/223645) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.
hudson.remoting.RequestAbortedException: java.io.EOFException
at hudson.remoting.Request.abort(Request.java:303)
at hudson.remoting.Channel.terminate(Channel.java:847)
at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at ......remote call to VM06-OASTEST(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1416)
at hudson.remoting.Request.call(Request.java:172)
at hudson.remoting.Channel.call(Channel.java:780)
at hudson.slaves.SlaveComputer.setChannel(SlaveComputer.java:508)
at jenkins.slaves.JnlpSlaveAgentProtocol$Handler.jnlpConnect(JnlpSlaveAgentProtocol.java:127)
at jenkins.slaves.DefaultJnlpSlaveReceiver.handle(DefaultJnlpSlaveReceiver.java:69)
at jenkins.slaves.JnlpSlaveAgentProtocol2$Handler2.run(JnlpSlaveAgentProtocol2.java:60)
at jenkins.slaves.JnlpSlaveAgentProtocol2.handle(JnlpSlaveAgentProtocol2.java:32)
at hudson.TcpSlaveAgentListener$ConnectionHandler.run(TcpSlaveAgentListener.java:182)
Caused by: java.io.EOFException
at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:614)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
oleg_nenashev I'm still having this issue on Ubuntu nodes; I've been following this story. Is there something else to be done on the user's end?
I am pretty sure the changes in 3.3 for JENKINS-25218 will influence the behavior somehow (and may have fixed it).
Created JENKINS-40491 for the diagnostic improvements.
So I have created https://github.com/jenkinsci/remoting/pull/138 with additional diagnostics
Code changed in jenkins
User: Oleg Nenashev
Path:
src/main/java/org/jenkinsci/remoting/nio/FifoBuffer.java
src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java
http://jenkins-ci.org/commit/remoting/2f81d4c9604dfe490b8474b0c44c1ef90f4cbeca
Log:
JENKINS-40491 - Improve diagnostincs of the preliminary FifoBuffer termination.
When NioChannelHub suffers from the preliminary buffer closure, it will print a SEVERE log to the Agent log.
This change should improve diagnostics of issues like JENKINS-31050
Code changed in jenkins
User: Oleg Nenashev
Path:
src/main/java/org/jenkinsci/remoting/nio/FifoBuffer.java
src/main/java/org/jenkinsci/remoting/nio/NioChannelHub.java
http://jenkins-ci.org/commit/remoting/e500853bc8b50c12761ad63739fd27fd40183b3c
Log:
Merge pull request #138 from oleg-nenashev/bug/JENKINS-31050
JENKINS-40491 - Improve diagnostincs of the preliminary FifoBuffer termination
Compare: https://github.com/jenkinsci/remoting/compare/cdd5bce5725d...e500853bc8b5
Jenkins 2.37 offers better diagnostics for such cases. I would appreciate it if somebody could reproduce the behavior on this version and provide new logs.
I was getting the 'Agent offline during the build' error when using Jenkins v2.19.1 for the master and jenkins-slave v2.62 for the slave pod.
After reading up on your fix, I upgraded Jenkins to v2.37 and the slave to jenkins-slave 3.4 (Remoting 3.4). Now I am getting the error below:
Caused by: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed
	at org.jenkinsci.remoting.nio.NioChannelHub$3.run(NioChannelHub.java:617)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:112)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.jenkinsci.remoting.nio.FifoBuffer$CloseCause: Buffer close has been requested
	at org.jenkinsci.remoting.nio.FifoBuffer.close(FifoBuffer.java:426)
	at org.jenkinsci.remoting.nio.NioChannelHub$MonoNioTransport.closeR(NioChannelHub.java:332)
	at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:565)
	... 6 more
Let me know if I need to provide more details.
Looks similar to JENKINS-25858. Two solutions were proposed there:
- Upgrade the kernel to >= 3.16.1
- Execute on the slave, as root: `ethtool -K eth0 sg off`
This worked for us.
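For anyone checking the kernel-version condition above, here is a small Python sketch that compares a kernel release string against the 3.16.1 threshold. It only parses version numbers; it does not verify that the underlying network-driver issue is actually fixed in a given kernel build:

```python
import platform
import re

def kernel_at_least(minimum, release=None):
    """Return True if the kernel release meets the given (major, minor, patch) minimum."""
    release = release or platform.release()  # e.g. "4.8.0-59-generic"
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        return False
    parts = tuple(int(x) for x in m.groups(default="0"))
    return parts >= minimum

# Check the running kernel against the threshold mentioned in this thread.
print(kernel_at_least((3, 16, 1)))
```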
Hi orgads
Thank you for providing the solutions that resolved the issue for some users facing similar problems. My slaves are Docker containers, and when I tried
ethtool -K eth0 sg off
the command failed with
Cannot set device feature settings: Operation not permitted
The command requires my Docker containers to run in privileged mode, which is not acceptable from a security standpoint.
My slave Docker image is derived from Ubuntu 16.10 (Linux kernel 4.8). Based on the above solutions, a kernel version higher than 3.16.1 should also fix the issue, but that does not seem to work (unless someone has got it working on that, too).
Could you let me know how I can triage the issue further?
Thanks,
Raghu
Actually I got it wrong. Our slaves are AWS machines. We just checked "Connect by SSH Process" in System configuration, and it solved the issue.
orgads Hmm, I am using the Jenkins Kubernetes plugin (https://wiki.jenkins-ci.org/display/JENKINS/Kubernetes+Plugin); it only launches JNLP slave workers under the hood, so "SSH process" is not available in my setup.
Thank you for the quick clarification.
In our AWS configuration, I found that the connection to slaves was being terminated after about one minute during a particular pipeline stage. The stage was a long-running git checkout that succeeded only intermittently.
The solution for me was to increase the ELB idle timeout property on the load balancer between the slave and master (http://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-idle-timeout.html). By default this property is set to 60 seconds, whereas the Jenkins default for 'hudson.remoting.Launcher.pingTimeoutSec' is 240.
During the one-minute period when the slave was executing the long-running git checkout, it must have been transferring almost no data, so the ELB was dropping the idle TCP connection.
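The timeout mismatch described above can be illustrated with a toy model in Python. The 60 s and 240 s figures are the defaults quoted in this thread (classic ELB idle timeout and `hudson.remoting.Launcher.pingTimeoutSec`), not values verified here:

```python
# Toy model: the channel survives a silent period only if some traffic
# (a remoting ping, build output) arrives before the load balancer's
# idle timer expires.
ELB_IDLE_TIMEOUT_SEC = 60   # AWS classic ELB default, per this thread
PING_INTERVAL_SEC = 240     # Jenkins remoting default, per this thread

def connection_survives_silence(silence_sec, idle_timeout_sec):
    # True if traffic resumes before the idle timer fires.
    return silence_sec < idle_timeout_sec

# A checkout silent until the next ping outlives the default idle timeout:
print(connection_survives_silence(PING_INTERVAL_SEC, ELB_IDLE_TIMEOUT_SEC))
# Raising the idle timeout above the ping interval keeps the channel up:
print(connection_survives_silence(PING_INTERVAL_SEC, 300))
```

This is why raising the ELB idle timeout above the ping interval (or lowering the ping interval below the idle timeout) resolves the drops.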
Unfortunately I have no capacity to work on Remoting in the medium term, so I will unassign the issue and let others take it. If somebody is interested in submitting a pull request, I will be happy to help get it reviewed and released.
I am also facing the same issue of the agent going offline during the build.
I am using Jenkins v2.105 and JRE 1.8.
I am using Linux as the master, with IBM AIX and Windows Server 2012 as slaves. We execute nightly builds on the slaves, but sometimes a build does not complete because the agent goes offline. If anybody has a workaround for this issue, please let me know.
Thanks in advance.
I am having the same issue with the Kubernetes plugin, where slaves connect to the master via JNLP on a particular port. We have even increased the ELB connection timeout, but we still face the same issue: slaves go offline in the middle of builds, and the job works fine when rebuilt. This is having a huge impact on our pipeline builds. Our issue is very close to what rpallikonda has mentioned above; if there is any solution, please let me know.
Thank you
Slave version:
remoting-3.20.jar
Error:
hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected EOF while receiving the data from the channel. FIFO buffer has been already closed.
Should I upgrade Remoting to the latest version?
Hello everyone, I ran into the same problem where the slave goes offline during the build, using either an SSH or a JNLP agent.
TL;DR: The Jenkins agent and the build shell share the same PGID, so kill(pid = 0, signal = SIGTERM) will crash the Jenkins agent too.
  PID  PGID   SID TPGID COMMAND
13691 13691 49864 13691 java -jar agent.jar
13818 13691 49864 13691  \_ /bin/sh -xe /tmp/jenkins4748921288996267614.sh
13820 13691 49864 13691      \_ kill(0, SIGTERM)
I propose daemonizing the agent to guard against such bugs (for example, calling setsid() when spawning build processes).
Description:
In our setup we build many projects using make, so builds are started and aborted many times. GNU make keeps pid = 0 in an internal structure, so when we click "abort build" in Jenkins it sends SIGTERM to the child processes -> make sends SIGTERM to its children, and sometimes GNU make (fixed after ) calls `kill(0, SIGTERM)`, which on Linux terminates the entire process group, including the Jenkins agent -> so the agent dies during the build.
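The setsid() mitigation proposed above can be sketched in Python on Linux. This illustrates the mechanism only; it is not how the Jenkins agent is actually implemented. `start_new_session=True` makes the child call setsid() before exec, so a step that runs `kill -TERM 0` only terminates its own process group:

```python
import signal
import subprocess

# A misbehaving "build step" that signals its whole process group, as GNU
# make did in the bug described above. Launched in its own session so the
# parent (standing in for the Jenkins agent) is not in that group.
proc = subprocess.Popen(
    ["sh", "-c", "kill -TERM 0"],
    start_new_session=True,  # child calls setsid(): new session, new PGID
)
proc.wait()

# The child killed itself with SIGTERM, but the parent survives.
print("agent still running, step exit:", proc.returncode)
```

Without `start_new_session=True`, the child would inherit the parent's process group, and the same `kill -TERM 0` would terminate the parent as well.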
I have found a few JIRA issues related to this, but I do not see a fix or a workaround for them. Please let me know if you require more information on this.