For long running jobs (>2 hours) job failing with hudson.util.IOException2: Failed to join the process

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • core
    • None
    • Platform: PC, OS: Linux

      We have a somewhat unusual CI environment: after a project builds, we execute
      it remotely and use Hudson to monitor its progress. The remote execution of
      these programs takes a while, and at certain points no output is sent back to the
      master for long periods of time. During these long intervals with no output
      (just over 2 hours), I occasionally see the job fail with the
      following:

      FATAL: command execution failed
      hudson.util.IOException2: Failed to join the process
      at hudson.Proc$RemoteProc.join(Proc.java:269)
      at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:84)
      at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
      at hudson.model.Build$RunnerImpl.build(Build.java:195)
      at hudson.model.Build$RunnerImpl.doRun(Build.java:151)
      at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:272)
      at hudson.model.Run.run(Run.java:895)
      at hudson.model.Build.run(Build.java:112)
      at hudson.model.ResourceController.execute(ResourceController.java:93)
      at hudson.model.Executor.run(Executor.java:119)
      Caused by: java.util.concurrent.ExecutionException: hudson.remoting.RequestAbortedException: java.io.EOFException
      at hudson.remoting.Request$1.get(Request.java:188)
      at hudson.remoting.Request$1.get(Request.java:157)
      at hudson.remoting.FutureAdapter.get(FutureAdapter.java:55)
      at hudson.Proc$RemoteProc.join(Proc.java:261)
      ... 9 more
      Caused by: hudson.remoting.RequestAbortedException: java.io.EOFException
      at hudson.remoting.Request.abort(Request.java:223)
      at hudson.remoting.Channel.terminate(Channel.java:528)
      at hudson.remoting.Channel$ReaderThread.run(Channel.java:684)
      Caused by: java.io.EOFException
      at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2554)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
      at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
      at hudson.remoting.Channel$ReaderThread.run(Channel.java:665)
      FATAL: Unable to delete script file /tmp/hudson24564.sh
      hudson.util.IOException2: remote file operation failed
      at hudson.FilePath.act(FilePath.java:544)
      at hudson.FilePath.delete(FilePath.java:741)
      at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:94)
      at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
      at hudson.model.Build$RunnerImpl.build(Build.java:195)
      at hudson.model.Build$RunnerImpl.doRun(Build.java:151)
      at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:272)
      at hudson.model.Run.run(Run.java:895)
      at hudson.model.Build.run(Build.java:112)
      at hudson.model.ResourceController.execute(ResourceController.java:93)
      at hudson.model.Executor.run(Executor.java:119)
      Caused by: java.io.IOException: already closed
      at hudson.remoting.Channel.send(Channel.java:342)
      at hudson.remoting.Request.call(Request.java:104)
      at hudson.remoting.Channel.call(Channel.java:481)
      at hudson.FilePath.act(FilePath.java:541)
      ... 10 more
      FATAL: already closed
      java.io.IOException: already closed
      at hudson.remoting.Channel.send(Channel.java:342)
      at hudson.remoting.Request.call(Request.java:104)
      at hudson.remoting.Channel.call(Channel.java:481)
      at hudson.Launcher$RemoteLauncher.kill(Launcher.java:466)
      at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:277)
      at hudson.model.Run.run(Run.java:895)
      at hudson.model.Build.run(Build.java:112)
      at hudson.model.ResourceController.execute(ResourceController.java:93)
      at hudson.model.Executor.run(Executor.java:119)

      However, this is neither predictable nor reproducible, which makes me think it
      corresponds to an external event such as GC, or even a network or OS event (e.g.
      a TCP error or socket timeout). Anyway, I thought I would put it up here and see if
      anyone else is getting this too.

      I am using Hudson ver. 1.293. The master and slave are both RHEL 4.

      An interesting development occurred when I upgraded recently and then set
      hudson.util.ProcessTreeKiller.disable=true. The jobs were still failing, but the
      underlying process eventually completed its work successfully (copying a
      large MySQL DB, if you must know). This is the reason I reported this: it hints
      at a bug in Hudson's remoting code.

      --Chad
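For readers unfamiliar with the flag mentioned above: hudson.util.ProcessTreeKiller.disable is a JVM system property, typically supplied as -Dhudson.util.ProcessTreeKiller.disable=true on the Java command line. The following is only a minimal sketch of how a boolean system property of this kind can be read in Java. The class KillerFlagSketch and its method are hypothetical names for illustration and are not Hudson's actual code.

```java
public class KillerFlagSketch {
    // Property name as quoted in the report above.
    static final String PROP = "hudson.util.ProcessTreeKiller.disable";

    /** Returns true only if the system property is set to "true" (case-insensitive). */
    static boolean killerDisabled() {
        // Boolean.getBoolean reads a system property, not an environment variable.
        return Boolean.getBoolean(PROP);
    }

    public static void main(String[] args) {
        System.out.println("process tree killer disabled: " + killerDisabled());
    }
}
```

With the killer disabled, the forked process is left alone when the build is marked failed, which would be consistent with the observation that the underlying copy still finished.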

          [JENKINS-3412] For long running jobs (>2 hours) job failing with hudson.util.IOException2: Failed to join the process

          njancesk added a comment -

          I have this same issue running Hudson ver. 1.355 with a slave on a Solaris 10 SPARC machine using Java 1.5.0, with jobs that take 4+ hours.

          I don't have this issue with similar jobs on Solaris 10 x86, but there the job finishes in under 4 hours.

          Jim McCaskey added a comment -

          FWIW: This seems to be happening with a Windows 2003 slave as well, using Hudson 1.362. I have seen it before; this is just the first time I have tried to track down a solution. Here is what the error looks like on this version of Hudson.

          FATAL: command execution failed
          hudson.util.IOException2: Failed to join the process
          at hudson.Proc$RemoteProc.join(Proc.java:312)
          at hudson.Launcher$ProcStarter.join(Launcher.java:280)
          at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:83)
          at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
          at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
          at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:601)
          at hudson.model.Build$RunnerImpl.build(Build.java:174)
          at hudson.model.Build$RunnerImpl.doRun(Build.java:138)
          at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:416)
          at hudson.model.Run.run(Run.java:1253)
          at hudson.matrix.MatrixRun.run(MatrixRun.java:130)
          at hudson.model.ResourceController.execute(ResourceController.java:88)
          at hudson.model.Executor.run(Executor.java:124)
          Caused by: java.util.concurrent.ExecutionException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
          at hudson.remoting.Request$1.get(Request.java:218)
          at hudson.remoting.Request$1.get(Request.java:172)
          at hudson.remoting.FutureAdapter.get(FutureAdapter.java:55)
          at hudson.Proc$RemoteProc.join(Proc.java:304)
          ... 12 more
          Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
          at hudson.remoting.Request.abort(Request.java:257)
          at hudson.remoting.Channel.terminate(Channel.java:602)
          at hudson.remoting.Channel$ReaderThread.run(Channel.java:893)
          Caused by: java.io.IOException: Unexpected termination of the channel
          at hudson.remoting.Channel$ReaderThread.run(Channel.java:875)
          Caused by: java.io.EOFException
          at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552)
          at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
          at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
          at hudson.remoting.Channel$ReaderThread.run(Channel.java:869)
          FATAL: Unable to delete script file C:\DOCUME~1\conman\LOCALS~1\Temp\hudson7729064622458259363.bat
          hudson.util.IOException2: remote file operation failed: C:\DOCUME~1\conman\LOCALS~1\Temp\hudson7729064622458259363.bat at hudson.remoting.Channel@1a8aa2c:cmhslave02-win32
          at hudson.FilePath.act(FilePath.java:749)
          at hudson.FilePath.act(FilePath.java:735)
          at hudson.FilePath.delete(FilePath.java:990)
          at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:93)
          at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
          at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
          at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:601)
          at hudson.model.Build$RunnerImpl.build(Build.java:174)
          at hudson.model.Build$RunnerImpl.doRun(Build.java:138)
          at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:416)
          at hudson.model.Run.run(Run.java:1253)
          at hudson.matrix.MatrixRun.run(MatrixRun.java:130)
          at hudson.model.ResourceController.execute(ResourceController.java:88)
          at hudson.model.Executor.run(Executor.java:124)
          Caused by: hudson.remoting.ChannelClosedException: channel is already closed
          at hudson.remoting.Channel.send(Channel.java:412)
          at hudson.remoting.Request.call(Request.java:105)
          at hudson.remoting.Channel.call(Channel.java:555)
          at hudson.FilePath.act(FilePath.java:742)
          ... 13 more
          FATAL: channel is already closed
          hudson.remoting.ChannelClosedException: channel is already closed
          at hudson.remoting.Channel.send(Channel.java:412)
          at hudson.remoting.Request.call(Request.java:105)
          at hudson.remoting.Channel.call(Channel.java:555)
          at hudson.Launcher$RemoteLauncher.kill(Launcher.java:744)
          at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:421)
          at hudson.model.Run.run(Run.java:1253)
          at hudson.matrix.MatrixRun.run(MatrixRun.java:130)
          at hudson.model.ResourceController.execute(ResourceController.java:88)
          at hudson.model.Executor.run(Executor.java:124)


          Shrinkhla21 added a comment -

          The best solution would be for no ping requests to be made while the slave is running a build.
          Or, at the very least, there should be some way to increase the timeout on the ping.
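To make the two suggestions above concrete, here is a rough sketch of the decision they describe. It assumes, as the comment does, that a missed ping is what tears down the channel; the thread does not confirm this. The class PingPolicySketch, its fields, and the example timeout values are all hypothetical and are not taken from Hudson.

```java
public class PingPolicySketch {
    final long timeoutMillis;        // how long to wait for a ping reply (hypothetical setting)
    final boolean skipWhileBuilding; // if true, never act on a missed ping during a build

    PingPolicySketch(long timeoutMillis, boolean skipWhileBuilding) {
        this.timeoutMillis = timeoutMillis;
        this.skipWhileBuilding = skipWhileBuilding;
    }

    /** Decides whether the channel should be torn down because of a missed ping. */
    boolean shouldTerminate(long millisSinceLastReply, boolean buildRunning) {
        if (skipWhileBuilding && buildRunning) {
            return false; // first suggestion: leave the channel alone while a build runs
        }
        // second suggestion: a larger timeoutMillis makes this less likely to trigger
        return millisSinceLastReply > timeoutMillis;
    }
}
```

For example, with a 4-minute timeout and a reply that is 5 minutes overdue, the policy would terminate the channel unless skipWhileBuilding is set and a build is running, or the timeout is raised above 5 minutes.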

          Tzuchien added a comment -

          I am also experiencing exactly the same problem (same call stack): Hudson 1.363 on RedHat/Tomcat, with Windows XP slaves. I have a matrix job, and each configuration takes about 1.5 hours. Not all of them fail; in my last build, 1 out of 25 configurations failed because of this problem.

          SCM/JIRA link daemon added a comment -

          Code changed in hudson
          User: kohsuke
          Path:
          trunk/hudson/main/remoting/src/main/java/hudson/remoting/Channel.java
          trunk/hudson/main/remoting/src/main/java/hudson/remoting/ChannelClosedException.java
          http://jenkins-ci.org/commit/33537
          Log:
          [JENKINS-5073 JENKINS-3412] improved the error diagnostics on ChannelClosedException by having it report who/how the connection was closed.
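The idea in the commit log above (having the "already closed" error report who or how the connection was closed) can be illustrated with a small sketch. This is not the code from the commit; SketchChannel and ClosedException are hypothetical stand-ins showing one way to remember the original failure and attach it as the cause of later errors.

```java
import java.io.EOFException;
import java.io.IOException;

public class ClosedCauseSketch {
    /** Hypothetical exception that carries the reason the channel was closed. */
    static class ClosedException extends IOException {
        ClosedException(Throwable whyClosed) {
            super("channel is already closed", whyClosed);
        }
    }

    /** Hypothetical channel that records the first reason it was terminated. */
    static class SketchChannel {
        private Throwable closedBy; // null while the channel is open

        synchronized void terminate(Throwable reason) {
            if (closedBy == null) {
                closedBy = reason; // keep only the first reason
            }
        }

        synchronized void send(String command) throws IOException {
            if (closedBy != null) {
                throw new ClosedException(closedBy);
            }
            // a real channel would write the command to its stream here
        }
    }

    public static void main(String[] args) {
        SketchChannel ch = new SketchChannel();
        ch.terminate(new EOFException("peer went away"));
        try {
            ch.send("kill");
        } catch (IOException e) {
            e.printStackTrace(); // the trace now includes a "Caused by" for the original failure
        }
    }
}
```

With this shape, a later "channel is already closed" stack trace would include the earlier termination as its cause instead of appearing in isolation.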

          dogfood added a comment -

          Integrated in hudson_main_trunk #156
          [JENKINS-5073 JENKINS-3412] improved the error diagnostics on ChannelClosedException by having it report who/how the connection was closed.

          kohsuke :
          Files :

          • /trunk/hudson/main/remoting/src/main/java/hudson/remoting/ChannelClosedException.java
          • /trunk/hudson/main/remoting/src/main/java/hudson/remoting/Channel.java


          kalpanab added a comment -

          I integrated the above fix into our Hudson 1.362 installation, but I still do not see the root cause of why the Hudson slave connection was reset.

          Kohsuke Kawaguchi added a comment -

          Can you please report the stack trace?

          Kohsuke Kawaguchi added a comment -

          I'm marking this as a duplicate of JENKINS-5073.

          Both issues are caused by a lost master/slave communication channel. When it happens while your build is waiting for a forked process to complete, you see this error in the build console.
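The failure mode described above can be reproduced in miniature with plain JDK classes. The sketch below is not Hudson's remoting code; ChannelEofSketch and its names are hypothetical. It assumes only what the stack traces show: a reader thread blocked in ObjectInputStream.readObject sees EOF when the peer goes away, and the caller waiting on a pending result receives an ExecutionException whose cause is that EOFException.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

public class ChannelEofSketch {
    /** Returns the class name of the failure seen by a caller waiting on a pending request. */
    static String simulate() throws Exception {
        PipedOutputStream peerOut = new PipedOutputStream();
        PipedInputStream in = new PipedInputStream(peerOut);

        // Write and flush the serialization header so the input stream can be opened.
        ObjectOutputStream oos = new ObjectOutputStream(peerOut);
        oos.flush();
        ObjectInputStream ois = new ObjectInputStream(in);

        // Stands in for the exit code of a forked process that the build is joining on.
        CompletableFuture<Integer> pendingExitCode = new CompletableFuture<>();

        Thread reader = new Thread(() -> {
            try {
                ois.readObject(); // blocks until the peer sends something or the stream ends
            } catch (IOException | ClassNotFoundException e) {
                pendingExitCode.completeExceptionally(e); // abort the pending request
            }
        });
        reader.setDaemon(true);
        reader.start();

        peerOut.close(); // the peer disappears without sending a reply

        try {
            pendingExitCode.get(10, TimeUnit.SECONDS);
            return "completed";
        } catch (ExecutionException e) {
            return e.getCause().getClass().getName();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("join failed with: " + simulate());
    }
}
```

Note that in this sketch, as in the original report, the work on the far side is not necessarily affected: only the waiting side learns that the connection is gone.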

          Kohsuke Kawaguchi added a comment -

          Marking as a duplicate.

            Assignee: Unassigned
            Reporter: chad_lyon
            Votes: 24
            Watchers: 27