• Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Component: core

      This issue is related to JENKINS-6817.

      I am running Jenkins slaves inside virtual machines. Sometimes these machines are overloaded and I get the following exception:

      FATAL: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      	at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:41)
      	at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:34)
      	at hudson.remoting.Request.call(Request.java:174)
      	at hudson.remoting.Channel.call(Channel.java:713)
      	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:167)
      	at $Proxy38.join(Unknown Source)
      	at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:925)
      	at hudson.Launcher$ProcStarter.join(Launcher.java:360)
      	at hudson.tasks.Maven.perform(Maven.java:327)
      	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:804)
      	at hudson.model.Build$BuildExecution.build(Build.java:199)
      	at hudson.model.Build$BuildExecution.doRun(Build.java:160)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:586)
      	at hudson.model.Run.execute(Run.java:1593)
      	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      	at hudson.model.ResourceController.execute(ResourceController.java:88)
      	at hudson.model.Executor.run(Executor.java:247)
      Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      	at hudson.remoting.Request.abort(Request.java:299)
      	at hudson.remoting.Channel.terminate(Channel.java:773)
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
      Caused by: java.io.IOException: Unexpected termination of the channel
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
      Caused by: java.io.EOFException
      	at java.io.ObjectInputStream$BlockDataInputStream.peekByte(Unknown Source)
      	at java.io.ObjectInputStream.readObject0(Unknown Source)
      	at java.io.ObjectInputStream.readObject(Unknown Source)
      	at hudson.remoting.Command.readFrom(Command.java:92)
      	at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:72)
      	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
      

      Is it possible to make the channel timeout configurable? I'd like to increase the value from, say, 5 seconds to 30 seconds or a minute.
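
      To make the request concrete, here is a rough sketch of the kind of thing I have in mind, assuming the value could come from a system property. The property name below is invented for illustration and is not an existing option:

      // Sketch only: read the channel ping timeout from a system property instead of a
      // hard-coded constant. "hudson.remoting.channelTimeoutSeconds" is a hypothetical
      // property name used for illustration; it is not an existing Jenkins option.
      public class ChannelTimeoutSettings {
          /**
           * Defaults to the current behaviour; could be overridden with
           * -Dhudson.remoting.channelTimeoutSeconds=30 on the JVM that hosts the channel.
           */
          public static final int TIMEOUT_SECONDS =
                  Integer.getInteger("hudson.remoting.channelTimeoutSeconds", 5);

          private ChannelTimeoutSettings() {
          }
      }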

          [JENKINS-18781] Configurable channel timeout for slaves

          cowwoc created issue -
          cowwoc made changes -
          Link New: This issue is related to JENKINS-6817 [ JENKINS-6817 ]
          Ulli Hafner made changes -
          Link New: This issue is related to JENKINS-18879 [ JENKINS-18879 ]

          Ramin Baradari added a comment -

          I wonder if those new slave timeouts might be related to this change: https://github.com/jenkinsci/remoting/commit/28830e37b94387d0c6f9927ad897f4010e6c1bda
          Maybe Kohsuke knows and can add some logging in case of connection timeouts. Currently everything happens silently and there is no clue why the connections die or which timeout is actually responsible.

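          In the meantime, one way to try to get more detail out of the remoting layer (assumption on my part: the relevant code logs under the hudson.remoting package) is to raise that logger's level, either with a log recorder under Manage Jenkins > System Log or from the script console, for example:

          // For the Jenkins script console (Manage Jenkins > Script Console).
          // Whether this surfaces the particular timeout that terminates the channel is
          // unverified; a log recorder for "hudson.remoting" at level FINE may also be
          // needed so the messages are captured somewhere visible.
          import java.util.logging.Level;
          import java.util.logging.Logger;

          Logger remoting = Logger.getLogger("hudson.remoting");
          remoting.setLevel(Level.FINE);
          System.out.println("hudson.remoting logger level: " + remoting.getLevel());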

          cowwoc added a comment -

          I agree, we need more logging. On a side note, a five-second timeout is very low in my case. It is quite likely that overloaded VMs will fail to respond for longer (especially if swapping to disk occurs).


          Henri Gomez added a comment -

          +1, I get these failures more and more often.

          It would be great to have the slave connection and read timeouts configurable per node, so we could set them independently.
          For example, remote slaves connected over a WAN could require more time to connect and respond.

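          To make the per-node idea concrete, here is a rough sketch assuming the values could be carried by the existing NodeProperty extension point. The class and field names are invented for illustration, and a Descriptor plus configuration form would still be needed before anything reads them:

          // Hypothetical sketch only, not an existing Jenkins API: per-node connect/read
          // timeouts carried as a NodeProperty, so WAN slaves can be given more generous
          // values than local ones.
          import hudson.model.Node;
          import hudson.slaves.NodeProperty;

          public class ChannelTimeoutNodeProperty extends NodeProperty<Node> {
              private final int connectTimeoutSeconds;
              private final int readTimeoutSeconds;

              public ChannelTimeoutNodeProperty(int connectTimeoutSeconds, int readTimeoutSeconds) {
                  this.connectTimeoutSeconds = connectTimeoutSeconds;
                  this.readTimeoutSeconds = readTimeoutSeconds;
              }

              public int getConnectTimeoutSeconds() {
                  return connectTimeoutSeconds;
              }

              public int getReadTimeoutSeconds() {
                  return readTimeoutSeconds;
              }
          }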

          Tony Greway added a comment -

          I've recently upgraded the SSH plugin from 0.27 to 1.4, and I see this exception thrown almost every day on our nightly builds. We use a multi-configuration project to install software on our distributed cluster every night, and I see our builds fail randomly on one of the nodes with high regularity. All of our nodes are VMs that typically have no traffic at the time of the build, and I have them configured to stay offline until a build is requested. Is there a way to revert to 0.27 until this is fixed? I've tried without success.


          Stoil Valchkov added a comment -

          Hi guys,

          Are there any plans to fix this soon? It is crucial in order to have a stable environment.

          Thanks


          Ian Norton added a comment -

          I hate to make a "me too" post, but this is getting rather annoying here. My Windows (VM) slaves tend to fail one in every three builds because of this, and as we have a matrix job it usually means two out of three builds fail.


          Sajajd Rehman added a comment -

          Ping, any fixers out there?


            Assignee: Unassigned
            Reporter: cowwoc
            Votes: 58
            Watchers: 69
