Jenkins / JENKINS-19513

Jenkins should reconnect to a node after ChannelClosedException

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major
    • Components: core, remoting
    • Environment: The slave node is using Webstart to connect to the master.

      I tried running a job in Jenkins and got the following exception:

      [EnvInject] - Loading node environment variables.
      [EnvInject] - [ERROR] - SEVERE ERROR occurs: hudson.remoting.ChannelClosedException: channel is already closed
      FATAL: null
      java.lang.NullPointerException
      	at hudson.tasks.MailSender.createFailureMail(MailSender.java:279)
      	at hudson.tasks.MailSender.getMail(MailSender.java:154)
      	at hudson.tasks.MailSender.execute(MailSender.java:100)
      	at hudson.maven.MavenModuleSetBuild$MavenModuleSetBuildExecution.cleanUp(MavenModuleSetBuild.java:1025)
      	at hudson.model.Run.execute(Run.java:1648)
      	at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:506)
      	at hudson.model.ResourceController.execute(ResourceController.java:88)
      	at hudson.model.Executor.run(Executor.java:247)
      

      Hitting "Build Now" multiple times results in the same exception over and over again. I expect Jenkins to automatically reconnect to the node after this exception occurs, but it does not.

      It looks like the EnvInject plugin runs into this exception and passes the failure on to Jenkins, which then runs into an NPE. Ideally this bug should be solved so that, regardless of how plugins behave, Jenkins is smart enough to reconnect (i.e. faulty plugins shouldn't block a reconnect).
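
      A minimal sketch of the behavior requested here, written as a hypothetical core extension. This is not existing Jenkins behavior: the ComputerListener hook and Computer.connect(boolean) are real core APIs, but the reconnect policy itself is illustrative (and the two-argument onOffline overload only exists on newer cores).

      import hudson.Extension;
      import hudson.model.Computer;
      import hudson.slaves.ComputerListener;
      import hudson.slaves.OfflineCause;

      @Extension
      public class ReconnectOnChannelLoss extends ComputerListener {
          @Override
          public void onOffline(Computer c, OfflineCause cause) {
              // A dead remoting channel is reported as OfflineCause.ChannelTermination.
              if (cause instanceof OfflineCause.ChannelTermination) {
                  // connect(true) forces a fresh launch of the agent. This is
                  // fire-and-forget; a real implementation would need retry
                  // limits and backoff to avoid reconnect loops.
                  c.connect(true);
              }
          }
      }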

          [JENKINS-19513] Jenkins should reconnect to a node after ChannelClosedException

          cowwoc added a comment -

          On a related note, if you read the above log you will notice that Jenkins doesn't say which node is disconnected, which makes it very difficult to know which node needs to be reconnected. I ended up restarting nodes one by one until I found the right one. Very annoying.

          Please correct this bug as well.
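
          A stopgap for the "which node is it?" problem: run something like the following as Java-style Groovy in the master's script console to print every offline computer and its recorded cause. The Computer API calls used here are standard core methods, but treat the snippet as a sketch, not a supported feature.

          import hudson.model.Computer;
          import jenkins.model.Jenkins;

          // Print each offline node and why it went offline, so the disconnected
          // node can be identified without restarting nodes one by one.
          for (Computer c : Jenkins.getInstance().getComputers()) {
              if (c.isOffline()) {
                  System.out.println(c.getName() + " offline, cause: " + c.getOfflineCause());
              }
          }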


          David Cramer added a comment - edited

          This error has become pretty problematic for us. We're consistently hitting it on Ubuntu 10.04 slaves.

          It also seems to happen around the same place in the console log stream, which could suggest it is duration- or stream-length-based (duration seems unlikely, since provisioning times tend to vary):

          est_yahoo_request_token_delete_token_by_user_id (metaserver.tests.model_tests.user_modelFATAL: hudson.remoting.ChannelClosedException: channel is already closed
          hudson.remoting.RemotingSystemException: hudson.remoting.ChannelClosedException: channel is already closed
          at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:183)
          at sun.proxy.$Proxy49.stop(Unknown Source)
          at com.cloudbees.jenkins.plugins.sshagent.SSHAgentBuildWrapper$SSHAgentEnvironment.tearDown(SSHAgentBuildWrapper.java:255)
          at hudson.model.Build$BuildExecution.doRun(Build.java:171)
          at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:562)
          at hudson.model.Run.execute(Run.java:1665)
          at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
          at hudson.model.ResourceController.execute(ResourceController.java:88)
          at hudson.model.Executor.run(Executor.java:246)
          Caused by: hudson.remoting.ChannelClosedException: channel is already closed
          at hudson.remoting.Channel.send(Channel.java:516)
          at hudson.remoting.Request.call(Request.java:129)
          at hudson.remoting.Channel.call(Channel.java:714)
          at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:167)
          ... 8 more
          Caused by: java.io.IOException: Unexpected termination of the channel
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
          Caused by: java.io.EOFException
          at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2595)
          at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1315)
          at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
          at hudson.remoting.Command.readFrom(Command.java:92)
          at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)

          As you can see, it's aborting midway through this test.

          Of note, these jobs have generally been running for ~40-50 minutes when they fail.


          Hannes Kogler added a comment - edited

          Same problem here.

          We recently updated to Jenkins 1.531. Since that upgrade we get random ChannelClosedExceptions, which result in follow-on stream exceptions with verbose error logs like this one:

          ...'8' '.' '1' ' ' 'K' 'B' '/' 's' 'e' 'c' ')' 0x0d 0x0a
          at hudson.remoting.FlightRecorderInputStream.analyzeCrash(FlightRecorderInputStream.java:71)
          at hudson.remoting.ClassicCommandTransport.diagnoseStreamCorruption(ClassicCommandTransport.java:94)
          at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:78)
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
          Caused by: java.io.StreamCorruptedException: invalid type code: 14
          at java.io.ObjectInputStream.readObject0(Unknown Source)
          at java.io.ObjectInputStream.readObject(Unknown Source)
          at hudson.remoting.Command.readFrom(Command.java:92)
          at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
          ... 1 more

          Looks like the node went offline during the build. Check the slave log for the details.
          FATAL: channel is already closed
          hudson.remoting.ChannelClosedException: channel is already closed
          at hudson.remoting.Channel.send(Channel.java:516)
          at hudson.remoting.Request.call(Request.java:129)
          at hudson.remoting.Channel.call(Channel.java:714)
          at hudson.Launcher$RemoteLauncher.kill(Launcher.java:887)
          at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:590)
          at hudson.model.Run.execute(Run.java:1604)
          at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:506)
          at hudson.model.ResourceController.execute(ResourceController.java:88)
          at hudson.model.Executor.run(Executor.java:246)
          Caused by: hudson.remoting.DiagnosedStreamCorruptionException
          Read back: 0x14

          But we certainly didn't change our slave configuration or anything like that, and there's no reason the slaves should suddenly be less stable than before.
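
          Worth noting: the bytes FlightRecorderInputStream read back decode to ordinary console text, which hints (though does not prove) that non-remoting output leaked into the channel's stream. A quick check, using nothing beyond the dump above:

          import java.nio.charset.StandardCharsets;

          public class ReadBackDecode {
              public static void main(String[] args) {
                  // The bytes exactly as dumped by FlightRecorderInputStream above.
                  byte[] readBack = {'8', '.', '1', ' ', 'K', 'B', '/', 's', 'e', 'c', ')', 0x0d, 0x0a};
                  // Prints "8.1 KB/sec)" followed by CR LF -- it reads like
                  // transfer-progress console output, not remoting protocol data.
                  System.out.print(new String(readBack, StandardCharsets.US_ASCII));
              }
          }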


          Ryang Woo Park added a comment -

          Same problem.

          Jenkins ver. 1.543
          EnvInject 1.88

          13:57:23 Started by an SCM change
          13:57:23 [EnvInject] - Loading node environment variables.
          13:57:23 [EnvInject] - [ERROR] - SEVERE ERROR occurs: hudson.remoting.ChannelClosedException: channel is already closed
          13:57:23 ERROR: Publisher hudson.tasks.Mailer aborted due to exception
          13:57:23 hudson.remoting.ChannelClosedException: channel is already closed
          13:57:23 	at hudson.remoting.Channel.send(Channel.java:524)
          13:57:23 	at hudson.remoting.Request.call(Request.java:129)
          13:57:23 	at hudson.remoting.Channel.call(Channel.java:722)
          13:57:23 	at hudson.EnvVars.getRemote(EnvVars.java:396)
          13:57:23 	at hudson.model.Computer.getEnvironment(Computer.java:908)
          13:57:23 	at jenkins.model.CoreEnvironmentContributor.buildEnvironmentFor(CoreEnvironmentContributor.java:29)
          13:57:23 	at hudson.model.Run.getEnvironment(Run.java:2191)
          13:57:23 	at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:914)
          13:57:23 	at hudson.tasks.Mailer.perform(Mailer.java:114)
          13:57:23 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
          13:57:23 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:785)
          13:57:23 	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:757)
          13:57:23 	at hudson.model.Build$BuildExecution.post2(Build.java:183)
          13:57:23 	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:706)
          13:57:23 	at hudson.model.Run.execute(Run.java:1703)
          13:57:23 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
          13:57:23 	at hudson.model.ResourceController.execute(ResourceController.java:88)
          13:57:23 	at hudson.model.Executor.run(Executor.java:231)
          13:57:23 Caused by: java.io.IOException
          13:57:23 	at hudson.remoting.Channel.close(Channel.java:1003)
          13:57:23 	at hudson.slaves.ChannelPinger$1.onDead(ChannelPinger.java:110)
          13:57:23 	at hudson.remoting.PingThread.ping(PingThread.java:120)
          13:57:23 	at hudson.remoting.PingThread.run(PingThread.java:81)
          13:57:23 Caused by: java.util.concurrent.TimeoutException: Ping started on 1393551418747 hasn't completed at 1393551658747
          13:57:23 	... 2 more
          13:57:23 Email was triggered for: Failure
          13:57:23 Sending email for trigger: Failure
          13:57:27 ERROR: Error: No workspace found!
          13:57:27 Error retrieving environment vars: channel is already closed
          13:57:31 Sending email to: cass.park@chipsnmedia.com aiden.lee@chipsnmedia.com cnmqc@chipsnmedia.com
          13:57:32 Loading slave statistic
          13:57:32 Loading slave statisticCODA980_refsw_commit_linuxjob/CODA980_refsw_commit_linux/132/
          13:57:32 Slave statistic loaded
          13:57:32 [EnvInject] - [ERROR] - SEVERE ERROR occurs: channel is already closed
          13:57:32 Finished: FAILURE
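
          The two timestamps in the TimeoutException above are exactly 240,000 ms apart: the master's ChannelPinger (visible in the stack trace at ChannelPinger$1.onDead) declared the channel dead after a ping went unanswered for four minutes and closed it. The arithmetic:

          public class PingGap {
              public static void main(String[] args) {
                  // Epoch-millisecond timestamps copied from the TimeoutException above.
                  long started = 1393551418747L;
                  long failedAt = 1393551658747L;
                  System.out.println((failedAt - started) / 1000 + " s"); // prints "240 s", i.e. 4 minutes
              }
          }

          Depending on the core version, the ping interval is reportedly tunable with a master-side system property such as hudson.slaves.ChannelPinger.pingInterval (in minutes); that only masks whatever is stalling the channel, but it can help distinguish a genuinely dead agent from a slow one.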
          


          Jonathan Langevin added a comment -

          Same problem for me, using the EC2 slaves plugin.

          Mic Le added a comment -

          I am having the same issue when using VMWare Dynamic Slaves.
          Does anyone have a solution or workaround?
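
          No real fix, but as a stopgap a node stuck with a closed channel can be force-reconnected from the script console (or a scheduled Groovy script). Below is a sketch using the stock Computer API; it papers over the disconnect rather than fixing it:

          import hudson.model.Computer;
          import jenkins.model.Jenkins;

          // Force-reconnect every computer whose channel is gone.
          for (Computer c : Jenkins.getInstance().getComputers()) {
              if (c.isOffline() && c.getChannel() == null) {
                  c.connect(true); // true = force a fresh launch
              }
          }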


          Jacobo Jimenez added a comment -

          Same problem here using real "SUSE Linux Enterprise Server 11.3 (x86_64)" slaves.

          Katrin Nilsson added a comment -

          Are there any updates on this?

          Chase Adams added a comment - edited

          We're experiencing the same problem (CentOS) and it impacts our internal users. Is there any update on this? This seems to be a significant problem for a lot of users (lots of Google, Stack Overflow, and Jenkins issue hits), and it's three years old.


          Oleg Nenashev added a comment -

          I had a prototype of a new protocol implementation that partially solves this problem, but the change requires some architectural redesign, e.g. switching to fault-tolerant message buses.


            Assignee: Unassigned
            Reporter: cowwoc
            Votes: 29
            Watchers: 28