Jenkins / JENKINS-19513

Jenkins should reconnect to a node after ChannelClosedException

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Components: core, remoting
    • Environment: The slave node is using Webstart to connect to the master.

      I tried running a job in Jenkins and got the following exception:

      [EnvInject] - Loading node environment variables.
      [EnvInject] - [ERROR] - SEVERE ERROR occurs: hudson.remoting.ChannelClosedException: channel is already closed
      FATAL: null
      java.lang.NullPointerException
      	at hudson.tasks.MailSender.createFailureMail(MailSender.java:279)
      	at hudson.tasks.MailSender.getMail(MailSender.java:154)
      	at hudson.tasks.MailSender.execute(MailSender.java:100)
      	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.cleanUp(MavenModuleSetBuild.java:1025)
      	at hudson.model.Run.execute(Run.java:1648)
      	at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:506)
      	at hudson.model.ResourceController.execute(ResourceController.java:88)
      	at hudson.model.Executor.run(Executor.java:247)
      

      Hitting "Build Now" multiple times results in the same exception over and over again. I expect Jenkins to automatically reconnect to the node after this exception occurs, but it does not.

      It looks like the EnvInject plugin runs into this exception and passes the failure on to Jenkins, which then runs into an NPE. Ideally this bug should be solved so that, regardless of how plugins behave, Jenkins is smart enough to reconnect (i.e. faulty plugins shouldn't block a reconnect).
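
      A minimal sketch of the behavior requested here, written as a hypothetical core extension. This is not existing Jenkins behavior: the ComputerListener hook and Computer.connect(boolean) are real core APIs, but the reconnect policy itself is illustrative (and the two-argument onOffline overload only exists on newer cores).

      import hudson.Extension;
      import hudson.model.Computer;
      import hudson.slaves.ComputerListener;
      import hudson.slaves.OfflineCause;

      @Extension
      public class ReconnectOnChannelLoss extends ComputerListener {
          @Override
          public void onOffline(Computer c, OfflineCause cause) {
              // A dead remoting channel is reported as OfflineCause.ChannelTermination.
              if (cause instanceof OfflineCause.ChannelTermination) {
                  // connect(true) forces a fresh launch of the agent. This is
                  // fire-and-forget; a real implementation would need retry
                  // limits and backoff to avoid reconnect loops.
                  c.connect(true);
              }
          }
      }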

          [JENKINS-19513] Jenkins should reconnect to a node after ChannelClosedException

          cowwoc added a comment -

          On a related note, if you read the above log you will notice that Jenkins doesn't say which node is disconnected, which makes it very difficult to know which node needs to be reconnected. I ended up restarting nodes one by one until I found the right one. Very annoying.

          Please correct this bug as well.
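
          A stopgap for the "which node is it?" problem: run something like the following as Java-style Groovy in the master's script console to print every offline computer and its recorded cause. The Computer API calls used here are standard core methods, but treat the snippet as a sketch, not a supported feature.

          import hudson.model.Computer;
          import jenkins.model.Jenkins;

          // Print each offline node and why it went offline, so the disconnected
          // node can be identified without restarting nodes one by one.
          for (Computer c : Jenkins.getInstance().getComputers()) {
              if (c.isOffline()) {
                  System.out.println(c.getName() + " offline, cause: " + c.getOfflineCause());
              }
          }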


          David Cramer added a comment - edited

          This error has become pretty problematic for us. We're consistently hitting it on Ubuntu 10.04 slaves.

          It also seems to happen around the same place in the console log stream, which could suggest it is duration- or stream-length-based (duration seems unlikely, since provisioning times tend to vary):

          est_yahoo_request_token_delete_token_by_user_id (metaserver.tests.model_tests.user_modelFATAL: hudson.remoting.ChannelClosedException: channel is already closed
          hudson.remoting.RemotingSystemException: hudson.remoting.ChannelClosedException: channel is already closed
          at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:183)
          at sun.proxy.$Proxy49.stop(Unknown Source)
          at com.cloudbees.jenkins.plugins.sshagent.SSHAgentBuildWrapper$SSHAgentEnvironment.tearDown(SSHAgentBuildWrapper.java:255)
          at hudson.model.Build$BuildExecution.doRun(Build.java:171)
          at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:562)
          at hudson.model.Run.execute(Run.java:1665)
          at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
          at hudson.model.ResourceController.execute(ResourceController.java:88)
          at hudson.model.Executor.run(Executor.java:246)
          Caused by: hudson.remoting.ChannelClosedException: channel is already closed
          at hudson.remoting.Channel.send(Channel.java:516)
          at hudson.remoting.Request.call(Request.java:129)
          at hudson.remoting.Channel.call(Channel.java:714)
          at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:167)
          ... 8 more
          Caused by: java.io.IOException: Unexpected termination of the channel
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
          Caused by: java.io.EOFException
          at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2595)
          at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1315)
          at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
          at hudson.remoting.Command.readFrom(Command.java:92)
          at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

          As you can see, it's aborting midway through this test.

          Of note, these jobs have generally been running for ~40-50 minutes when they fail.


          Hannes Kogler added a comment - edited

          Same problem here.

          We recently updated to Jenkins 1.531. Since that upgrade we get random ChannelClosedExceptions, which result in follow-on stream exceptions with verbose error logs like this one:

          ...'8' '.' '1' ' ' 'K' 'B' '/' 's' 'e' 'c' ')' 0x0d 0x0a
          at hudson.remoting.FlightRecorderInputStream.analyzeCrash(FlightRecorderInputStream.java:71)
          at hudson.remoting.ClassicCommandTransport.diagnoseStreamCorruption(ClassicCommandTransport.java:94)
          at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:78)
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
          Caused by: java.io.StreamCorruptedException: invalid type code: 14
          at java.io.ObjectInputStream.readObject0(Unknown Source)
          at java.io.ObjectInputStream.readObject(Unknown Source)
          at hudson.remoting.Command.readFrom(Command.java:92)
          at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
          ... 1 more

          Looks like the node went offline during the build. Check the slave log for the details.
          FATAL: channel is already closed
          hudson.remoting.ChannelClosedException: channel is already closed
          at hudson.remoting.Channel.send(Channel.java:516)
          at hudson.remoting.Request.call(Request.java:129)
          at hudson.remoting.Channel.call(Channel.java:714)
          at hudson.Launcher$RemoteLauncher.kill(Launcher.java:887)
          at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:590)
          at hudson.model.Run.execute(Run.java:1604)
          at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:506)
          at hudson.model.ResourceController.execute(ResourceController.java:88)
          at hudson.model.Executor.run(Executor.java:246)
          Caused by: hudson.remoting.DiagnosedStreamCorruptionException
          Read back: 0x14

          But we certainly didn't change our slave configuration or anything like that, and there's no reason the slaves should suddenly be less stable than before.
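
          Worth noting: the bytes FlightRecorderInputStream read back decode to ordinary console text, which hints (though does not prove) that non-remoting output leaked into the channel's stream. A quick check, using nothing beyond the dump above:

          import java.nio.charset.StandardCharsets;

          public class ReadBackDecode {
              public static void main(String[] args) {
                  // The bytes exactly as dumped by FlightRecorderInputStream above.
                  byte[] readBack = {'8', '.', '1', ' ', 'K', 'B', '/', 's', 'e', 'c', ')', 0x0d, 0x0a};
                  // Prints "8.1 KB/sec)" followed by CR LF -- it reads like
                  // transfer-progress console output, not remoting protocol data.
                  System.out.print(new String(readBack, StandardCharsets.US_ASCII));
              }
          }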


          Ryang Woo Park added a comment -

          Same problem.

          Jenkins ver. 1.543
          EnvInject 1.88

          13:57:23 Started by an SCM change
          13:57:23 [EnvInject] - Loading node environment variables.
          13:57:23 [EnvInject] - [ERROR] - SEVERE ERROR occurs: hudson.remoting.ChannelClosedException: channel is already closed
          13:57:23 ERROR: Publisher hudson.tasks.Mailer aborted due to exception
          13:57:23 hudson.remoting.ChannelClosedException: channel is already closed
          13:57:23 	at hudson.remoting.Channel.send(Channel.java:524)
          13:57:23 	at hudson.remoting.Request.call(Request.java:129)
          13:57:23 	at hudson.remoting.Channel.call(Channel.java:722)
          13:57:23 	at hudson.EnvVars.getRemote(EnvVars.java:396)
          13:57:23 	at hudson.model.Computer.getEnvironment(Computer.java:908)
          13:57:23 	at jenkins.model.CoreEnvironmentContributor.buildEnvironmentFor(CoreEnvironmentContributor.java:29)
          13:57:23 	at hudson.model.Run.getEnvironment(Run.java:2191)
          13:57:23 	at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:914)
          13:57:23 	at hudson.tasks.Mailer.perform(Mailer.java:114)
          13:57:23 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
          13:57:23 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:785)
          13:57:23 	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:757)
          13:57:23 	at hudson.model.Build$BuildExecution.post2(Build.java:183)
          13:57:23 	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:706)
          13:57:23 	at hudson.model.Run.execute(Run.java:1703)
          13:57:23 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
          13:57:23 	at hudson.model.ResourceController.execute(ResourceController.java:88)
          13:57:23 	at hudson.model.Executor.run(Executor.java:231)
          13:57:23 Caused by: java.io.IOException
          13:57:23 	at hudson.remoting.Channel.close(Channel.java:1003)
          13:57:23 	at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:110)
          13:57:23 	at hudson.remoting.PingThread.ping(PingThread.java:120)
          13:57:23 	at hudson.remoting.PingThread.run(PingThread.java:81)
          13:57:23 Caused by: java.util.concurrent.TimeoutException: Ping started on 1393551418747 hasn't completed at 1393551658747
          13:57:23 	... 2 more
          13:57:23 Email was triggered for: Failure
          13:57:23 Sending email for trigger: Failure
          13:57:27 ERROR: Error: No workspace found!
          13:57:27 Error retrieving environment vars: channel is already closed
          13:57:31 Sending email to: cass.park@chipsnmedia.com aiden.lee@chipsnmedia.com cnmqc@chipsnmedia.com
          13:57:32 Loading slave statistic
          13:57:32 Loading slave statisticCODA980_refsw_commit_linuxjob/CODA980_refsw_commit_linux/132/
          13:57:32 Slave statistic loaded
          13:57:32 [EnvInject] - [ERROR] - SEVERE ERROR occurs: channel is already closed
          13:57:32 Finished: FAILURE
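
          The two timestamps in the TimeoutException above are exactly 240,000 ms apart: the master's ChannelPinger (visible in the stack trace at ChannelPinger$1.onDead) declared the channel dead after a ping went unanswered for four minutes and closed it. The arithmetic:

          public class PingGap {
              public static void main(String[] args) {
                  // Epoch-millisecond timestamps copied from the TimeoutException above.
                  long started = 1393551418747L;
                  long failedAt = 1393551658747L;
                  System.out.println((failedAt - started) / 1000 + " s"); // prints "240 s", i.e. 4 minutes
              }
          }

          Depending on the core version, the ping interval is reportedly tunable with a master-side system property such as hudson.slaves.ChannelPinger.pingInterval (in minutes); that only masks whatever is stalling the channel, but it can help distinguish a genuinely dead agent from a slow one.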
          


          Jonathan Langevin added a comment -

          Same problem for me, using the EC2 slaves plugin.

          Mic Le added a comment -

          I am having the same issue when using VMWare Dynamic Slaves.
          Does anyone have a solution or workaround?
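
          No real fix, but as a stopgap a node stuck with a closed channel can be force-reconnected from the script console (or a scheduled Groovy script). Below is a sketch using the stock Computer API; it papers over the disconnect rather than fixing it:

          import hudson.model.Computer;
          import jenkins.model.Jenkins;

          // Force-reconnect every computer whose channel is gone.
          for (Computer c : Jenkins.getInstance().getComputers()) {
              if (c.isOffline() && c.getChannel() == null) {
                  c.connect(true); // true = force a fresh launch
              }
          }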


          Jacobo Jimenez added a comment -

          Same problem here using real "SUSE Linux Enterprise Server 11.3 (x86_64)" slaves.

          Katrin Nilsson added a comment -

          Are there any updates on this?

          Chase Adams added a comment - edited

          We're experiencing the same problem (CentOS) and it impacts our internal users. Is there any update on this? This seems to be a significant problem for a lot of users (lots of Google, Stack Overflow, and Jenkins issue hits), and it's three years old.


          Oleg Nenashev added a comment -

          I had a prototype of a new protocol implementation that partially solves this problem, but the change requires some architectural redesign, e.g. switching to fault-tolerant message buses.


            Assignee: Unassigned
            Reporter: cowwoc
            Votes: 29
            Watchers: 28