Bug
Resolution: Duplicate
Major
None
Platform: PC, OS: Linux
We have a somewhat unusual CI environment: after a project builds, we execute
it remotely and use Hudson to monitor its progress. Remote execution takes a
while, and at certain points no output is sent back to the master for long
periods of time. During these silent intervals (just over 2 hours) I
occasionally see the job fail with the following:
FATAL: command execution failed
hudson.util.IOException2: Failed to join the process
at hudson.Proc$RemoteProc.join(Proc.java:269)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:84)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
at hudson.model.Build$RunnerImpl.build(Build.java:195)
at hudson.model.Build$RunnerImpl.doRun(Build.java:151)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:272)
at hudson.model.Run.run(Run.java:895)
at hudson.model.Build.run(Build.java:112)
at hudson.model.ResourceController.execute(ResourceController.java:93)
at hudson.model.Executor.run(Executor.java:119)
Caused by: java.util.concurrent.ExecutionException: hudson.remoting.RequestAbortedException: java.io.EOFException
at hudson.remoting.Request$1.get(Request.java:188)
at hudson.remoting.Request$1.get(Request.java:157)
at hudson.remoting.FutureAdapter.get(FutureAdapter.java:55)
at hudson.Proc$RemoteProc.join(Proc.java:261)
... 9 more
Caused by: hudson.remoting.RequestAbortedException: java.io.EOFException
at hudson.remoting.Request.abort(Request.java:223)
at hudson.remoting.Channel.terminate(Channel.java:528)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:684)
Caused by: java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2554)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:665)
FATAL: Unable to delete script file /tmp/hudson24564.sh
hudson.util.IOException2: remote file operation failed
at hudson.FilePath.act(FilePath.java:544)
at hudson.FilePath.delete(FilePath.java:741)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:94)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
at hudson.model.Build$RunnerImpl.build(Build.java:195)
at hudson.model.Build$RunnerImpl.doRun(Build.java:151)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:272)
at hudson.model.Run.run(Run.java:895)
at hudson.model.Build.run(Build.java:112)
at hudson.model.ResourceController.execute(ResourceController.java:93)
at hudson.model.Executor.run(Executor.java:119)
Caused by: java.io.IOException: already closed
at hudson.remoting.Channel.send(Channel.java:342)
at hudson.remoting.Request.call(Request.java:104)
at hudson.remoting.Channel.call(Channel.java:481)
at hudson.FilePath.act(FilePath.java:541)
... 10 more
FATAL: already closed
java.io.IOException: already closed
at hudson.remoting.Channel.send(Channel.java:342)
at hudson.remoting.Request.call(Request.java:104)
at hudson.remoting.Channel.call(Channel.java:481)
at hudson.Launcher$RemoteLauncher.kill(Launcher.java:466)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:277)
at hudson.model.Run.run(Run.java:895)
at hudson.model.Build.run(Build.java:112)
at hudson.model.ResourceController.execute(ResourceController.java:93)
at hudson.model.Executor.run(Executor.java:119)
However, this is neither predictable nor reproducible, which makes me think it
corresponds to an external event such as GC, or a network or OS event (e.g. a
TCP error or socket timeout). I thought I would post it here and see whether
anyone else is getting this too.
I am using Hudson ver. 1.293. The master and slave are both RHEL 4.
An interesting development occurred when I upgraded recently and then set
hudson.util.ProcessTreeKiller.disable=true. The jobs were still failing, but the
underlying process was eventually completing its work successfully (copying a
large MySQL DB, if you must know). This is the reason I reported this: it hints
at a bug in Hudson's remoting code.
--Chad
duplicates JENKINS-5073 hudson.util.IOException2: Failed to join the process - on a Windows slave (Resolved)
[JENKINS-3412] For long running jobs (>2 hours) job failing with hudson.util.IOException2: Failed to join the process
I apologize if I was vague. The job is just a shell execute, but it must run in
a particular environment. Thus, it is tied to a slave, and that slave is started
via an ssh command from the master.
The shell script starts by copying tables from a remote data store using the
mysql client. One of those tables is very large and takes just over two hours to
copy. While it is copying there is obviously TCP activity between the slave and
the remote data store, but the slave doesn't send any logging info back to the
master for the entire two-hour-plus period. Since upgrading Hudson from 1.278 to
1.293, the connection seems to be getting dropped at some point during this
period.
Before I turned off ProcessTreeKiller, the underlying mysql transfer was
terminating along with the Hudson job. Now, however, the command started by the
Hudson job completes on the slave, but the slave reports failure.
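For anyone who wants to test the "no output for two hours" theory, one option is to keep some output flowing while the silent step runs. What follows is only a minimal sketch and not anything taken from Hudson: HeartbeatRunner, runWithHeartbeat and sleepQuietly are hypothetical names, and it assumes the long step can be launched from a small Java wrapper. Whether periodic output actually keeps the master/slave channel alive is an assumption that nothing in this thread confirms.

```java
import java.io.PrintStream;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper: prints a heartbeat line while an otherwise silent task runs. */
final class HeartbeatRunner {

    /** Runs the task on the calling thread and prints a line to out every periodMillis. */
    static <T> T runWithHeartbeat(Supplier<T> task, long periodMillis, PrintStream out) {
        ScheduledExecutorService ticker = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "heartbeat");
            t.setDaemon(true); // never keep the JVM alive just for the heartbeat
            return t;
        });
        long start = System.nanoTime();
        ticker.scheduleAtFixedRate(() -> {
            long secs = TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - start);
            out.println("[heartbeat] still running after " + secs + "s");
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        try {
            return task.get();
        } finally {
            ticker.shutdownNow(); // stop the heartbeat whether the task returned or threw
        }
    }

    /** Stand-in for a long, silent step such as the table copy. */
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        Integer exitCode = runWithHeartbeat(() -> {
            sleepQuietly(350); // pretend this is the two-hour copy
            return 0;
        }, 100, System.out);
        System.out.println("task finished with exit code " + exitCode);
    }
}
```

If builds still fail with a heartbeat in place, that would point away from idle output as the trigger and towards the connection itself.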
It seems I'm hitting the same problem with just 5 seconds of sleep time (the job
is executing a shell script that in turn calls ant):
check-resources-library:
[echo] Javascript Library 1_2 available = true
[echo] The file is checked at:
/export/home/j2eetest/hudson/workspace/JSF-core/glassfishv3/glassfish/domains/domain1/applications/guessNumber/resources/js/1_2/validator.js
[echo] Image Library 1_2 available = true
[echo] the file is checked at:
/export/home/j2eetest/hudson/workspace/JSF-core/glassfishv3/glassfish/domains/domain1/applications/guessNumber/resources/images/1_2/wave.med.gif
[echo] Sleeping for 5 seconds...
FATAL: command execution failed
hudson.util.IOException2: Failed to join the process
at hudson.Proc$RemoteProc.join(Proc.java:297)
at hudson.Launcher$ProcStarter.join(Launcher.java:274)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:84)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
at hudson.model.Build$RunnerImpl.build(Build.java:195)
at hudson.model.Build$RunnerImpl.doRun(Build.java:151)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:272)
at hudson.model.Run.run(Run.java:928)
at hudson.model.Build.run(Build.java:112)
at hudson.model.ResourceController.execute(ResourceController.java:93)
at hudson.model.Executor.run(Executor.java:118)
Caused by: java.util.concurrent.ExecutionException: hudson.remoting.RequestAbortedException: java.io.EOFException
at hudson.remoting.Request$1.get(Request.java:188)
at hudson.remoting.Request$1.get(Request.java:157)
at hudson.remoting.FutureAdapter.get(FutureAdapter.java:55)
at hudson.Proc$RemoteProc.join(Proc.java:289)
... 10 more
Caused by: hudson.remoting.RequestAbortedException: java.io.EOFException
at hudson.remoting.Request.abort(Request.java:223)
at hudson.remoting.Channel.terminate(Channel.java:558)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:776)
Caused by: java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2554)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:757)
FATAL: Unable to delete script file /tmp/hudson8537360715477296990.sh
hudson.util.IOException2: remote file operation failed
at hudson.FilePath.act(FilePath.java:645)
at hudson.FilePath.act(FilePath.java:633)
at hudson.FilePath.delete(FilePath.java:863)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:94)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
at hudson.model.Build$RunnerImpl.build(Build.java:195)
at hudson.model.Build$RunnerImpl.doRun(Build.java:151)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:272)
at hudson.model.Run.run(Run.java:928)
at hudson.model.Build.run(Build.java:112)
at hudson.model.ResourceController.execute(ResourceController.java:93)
at hudson.model.Executor.run(Executor.java:118)
Caused by: java.io.IOException: already closed
at hudson.remoting.Channel.send(Channel.java:372)
at hudson.remoting.Request.call(Request.java:104)
at hudson.remoting.Channel.call(Channel.java:511)
at hudson.FilePath.act(FilePath.java:640)
... 11 more
FATAL: already closed
java.io.IOException: already closed
at hudson.remoting.Channel.send(Channel.java:372)
at hudson.remoting.Request.call(Request.java:104)
at hudson.remoting.Channel.call(Channel.java:511)
at hudson.Launcher$RemoteLauncher.kill(Launcher.java:730)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:277)
at hudson.model.Run.run(Run.java:928)
at hudson.model.Build.run(Build.java:112)
at hudson.model.ResourceController.execute(ResourceController.java:93)
at hudson.model.Executor.run(Executor.java:118)
This job executes fine on Solaris but fails on Linux RH5.
I am having the same issue on CentOS 5.
....F...............FATAL: rake execution failed
hudson.util.IOException2: Failed to join the process
at hudson.Proc$RemoteProc.join(Proc.java:297)
at hudson.plugins.rake.Rake.perform(Rake.java:101)
at hudson.tasks.BuildStepCompatibilityLayer.perform(BuildStepCompatibilityLayer.java:56)
at hudson.model.Build$RunnerImpl.build(Build.java:195)
at hudson.model.Build$RunnerImpl.doRun(Build.java:151)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:271)
at hudson.model.Run.run(Run.java:938)
at hudson.model.Build.run(Build.java:112)
at hudson.model.ResourceController.execute(ResourceController.java:93)
at hudson.model.Executor.run(Executor.java:118)
Caused by: java.util.concurrent.ExecutionException: hudson.remoting.RequestAbortedException: java.io.EOFException
at hudson.remoting.Request$1.get(Request.java:188)
at hudson.remoting.Request$1.get(Request.java:157)
at hudson.remoting.FutureAdapter.get(FutureAdapter.java:55)
at hudson.Proc$RemoteProc.join(Proc.java:289)
... 9 more
Caused by: hudson.remoting.RequestAbortedException: java.io.EOFException
at hudson.remoting.Request.abort(Request.java:223)
at hudson.remoting.Channel.terminate(Channel.java:558)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:776)
Caused by: java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2570)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1314)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:368)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:757)
FATAL: already closed
java.io.IOException: already closed
at hudson.remoting.Channel.send(Channel.java:372)
at hudson.remoting.Request.call(Request.java:104)
at hudson.remoting.Channel.call(Channel.java:511)
at hudson.Launcher$RemoteLauncher.kill(Launcher.java:730)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:276)
at hudson.model.Run.run(Run.java:938)
at hudson.model.Build.run(Build.java:112)
at hudson.model.ResourceController.execute(ResourceController.java:93)
at hudson.model.Executor.run(Executor.java:118)
When this happens, the slave log might show some record of why the communication
with the slave JVM failed. Can you please check it?
I hit this error as well on the slave. The slave log seemed to be basically empty, but Hudson's main log had this, which seemed to correspond with the slave and time:
16/07/2009 3:32:31 PM hudson.node_monitors.AbstractNodeMonitorDescriptor$Record run
WARNING: Failed to monitor Worker 4 for Free Temp Space
hudson.util.IOException2: remote file operation failed
at hudson.FilePath.act(FilePath.java:548)
at hudson.node_monitors.TemporarySpaceMonitor$1.getFreeSpace(TemporarySpaceMonitor.java:71)
at hudson.node_monitors.DiskSpaceMonitorDescriptor.monitor(DiskSpaceMonitorDescriptor.java:80)
at hudson.node_monitors.DiskSpaceMonitorDescriptor.monitor(DiskSpaceMonitorDescriptor.java:43)
at hudson.node_monitors.AbstractNodeMonitorDescriptor$Record.run(AbstractNodeMonitorDescriptor.java:161)
Caused by: java.io.IOException: Unable to serialize 229391015936
at hudson.remoting.UserRequest.serialize(UserRequest.java:134)
at hudson.remoting.UserRequest.perform(UserRequest.java:100)
at hudson.remoting.UserRequest.perform(UserRequest.java:46)
at hudson.remoting.Request$2.run(Request.java:236)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at hudson.remoting.Engine$1$1.run(Engine.java:54)
at java.lang.Thread.run(Unknown Source)
Caused by: java.io.NotSerializableException: hudson.node_monitors.DiskSpaceMonitorDescriptor$DiskSpace
at java.io.ObjectOutputStream.writeObject0(Unknown Source)
at java.io.ObjectOutputStream.writeObject(Unknown Source)
at hudson.remoting.UserRequest._serialize(UserRequest.java:123)
at hudson.remoting.UserRequest.serialize(UserRequest.java:132)
... 10 more
This stack trace is reported here:
https://hudson.dev.java.net/issues/show_bug.cgi?id=3381. That issue has already been fixed, but only in 1.296, whereas we are running
1.295, so we are updating now. Hopefully that will also fix the issue reported here.
I am on Hudson 1.319 and am seeing a similar problem. This is not only for long
jobs anymore: it happens after 6 minutes for me. I am running Hudson on a
Fedora Core 6 Linux box, but am doing the builds on a Red Hat Enterprise Linux
server 5.1 slave. It happens intermittently, without any pattern. I leave it
running all weekend doing a build every 2 hours. Over a weekend of about 30 to
40 builds, it fails once with the following while in the middle of
compilation (then works fine on the next run):
FATAL: command execution failed
hudson.util.IOException2: Failed to join the process
at hudson.Proc$RemoteProc.join(Proc.java:297)
at hudson.Launcher$ProcStarter.join(Launcher.java:275)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:83)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:471)
at hudson.model.Build$RunnerImpl.build(Build.java:157)
at hudson.model.Build$RunnerImpl.doRun(Build.java:113)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:345)
at hudson.model.Run.run(Run.java:1090)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:93)
at hudson.model.Executor.run(Executor.java:122)
Caused by: java.util.concurrent.ExecutionException: hudson.remoting.RequestAbortedException: java.io.EOFException
at hudson.remoting.Request$1.get(Request.java:188)
at hudson.remoting.Request$1.get(Request.java:157)
at hudson.remoting.FutureAdapter.get(FutureAdapter.java:55)
at hudson.Proc$RemoteProc.join(Proc.java:289)
... 12 more
Caused by: hudson.remoting.RequestAbortedException: java.io.EOFException
at hudson.remoting.Request.abort(Request.java:223)
at hudson.remoting.Channel.terminate(Channel.java:561)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:819)
Caused by: java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.readObject(Unknown Source)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:800)
FATAL: Unable to delete script file /tmp/hudson5532835365757807889.sh
hudson.util.IOException2: remote file operation failed
at hudson.FilePath.act(FilePath.java:672)
at hudson.FilePath.act(FilePath.java:660)
at hudson.FilePath.delete(FilePath.java:904)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:93)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:471)
at hudson.model.Build$RunnerImpl.build(Build.java:157)
at hudson.model.Build$RunnerImpl.doRun(Build.java:113)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:345)
at hudson.model.Run.run(Run.java:1090)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:93)
at hudson.model.Executor.run(Executor.java:122)
Caused by: java.io.IOException: already closed
at hudson.remoting.Channel.send(Channel.java:375)
at hudson.remoting.Request.call(Request.java:104)
at hudson.remoting.Channel.call(Channel.java:514)
at hudson.FilePath.act(FilePath.java:667)
... 13 more
FATAL: already closed
java.io.IOException: already closed
at hudson.remoting.Channel.send(Channel.java:375)
at hudson.remoting.Request.call(Request.java:104)
at hudson.remoting.Channel.call(Channel.java:514)
at hudson.Launcher$RemoteLauncher.kill(Launcher.java:732)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:350)
at hudson.model.Run.run(Run.java:1090)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:93)
at hudson.model.Executor.run(Executor.java:122)
Been having this problem ever since upgrading from 1.320 to 1.327. See:
http://www.nabble.com/Failed-to-join-process-v1.327-1.328-to25866005.html
Now using hudson 1.329. Master is on fedora core 5. Slaves are on all kinds of
platforms: rhel, windows, hpux, macosx, solaris etc etc.
Using Hudson 1.339 on Windows XP with 5 Windows XP slaves, I'm seeing this error constantly with longer test runs.
I'm pretty sure this happens because our test system writes nothing to the output for a long time. Would it be possible to make the timeout manually adjustable, in either the project or the node configuration in Hudson?
If someone knows whether this can be done with Java command-line options, I would appreciate that too.
I got some new information when I ran one of the slaves in headless mode instead of Java Web Start, and just before this issue was reported in the console output, I saw the following in the command prompt window of the slave:
20.1.2010 14:14:44 hudson.remoting.Engine$2 onDead
INFO: Ping failed. Terminating the socket.
20.1.2010 14:14:44 hudson.remoting.Channel$ReaderThread run
SEVERE: I/O error in channel channel
java.net.SocketException: socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at java.io.ObjectInputStream$PeekInputStream.peek(Unknown Source)
at java.io.ObjectInputStream$BlockDataInputStream.peek(Unknown Source)
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.readObject(Unknown Source)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:852)
20.1.2010 14:14:44 hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
So I think Hudson's new ping mechanism decides that the connection is broken, and that produces the exception. I've been pinging the machines all day long with Windows XP's ping utility, and occasionally I see a ping time out when the timeout value is set to 1 second. I wonder if there is a way to manually adjust the timeout value in Hudson?
I don't know the exact timeout value that would work in my environment, but it seems it should be greater than 1 second.
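To make the request above concrete, here is a rough sketch of what a ping check with an adjustable timeout and a tolerance for occasional misses could look like. This is not Hudson's actual ping code: PingWatchdog, pingOnce and isLinkAlive are hypothetical names, and the thread does not say how the real ping thread is configured.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical ping check; NOT the implementation used by hudson.remoting. */
final class PingWatchdog {

    /** Returns true if the ping completed within timeoutMillis, false otherwise. */
    static boolean pingOnce(Callable<Void> ping, long timeoutMillis) {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Future<Void> reply = exec.submit(ping);
        try {
            reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false; // too slow, or the ping itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            reply.cancel(true); // interrupt a ping that is still hanging
            exec.shutdownNow();
        }
    }

    /** Declares the link dead only after maxMisses consecutive failed pings. */
    static boolean isLinkAlive(Callable<Void> ping, long timeoutMillis, int maxMisses) {
        for (int attempt = 0; attempt < maxMisses; attempt++) {
            if (pingOnce(ping, timeoutMillis)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A ping that takes 300 ms stands in for a briefly congested network.
        Callable<Void> slowPing = () -> {
            Thread.sleep(300);
            return null;
        };
        System.out.println("50 ms timeout:   alive = " + isLinkAlive(slowPing, 50, 1));
        System.out.println("2000 ms timeout: alive = " + isLinkAlive(slowPing, 2000, 1));
    }
}
```

With a single attempt and a short timeout, one slow reply is enough to declare the link dead, which matches the behaviour described above. A longer timeout or a few retries would tolerate the occasional slow ping, at the cost of noticing a genuinely dead slave later.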
I'm occasionally getting the very same error (as described in the initial issue description) with Hudson 1.345, maybe once every 40 builds.
Our server is running on Solaris 9 and I just saw the error happening on a slave running Solaris 10.
Hi,
We are getting the same result (with long-running jobs).
We have a mixture of Linux RHEL installations (running versions 4 and 5).
The Hudson version is the latest.
For Hudson, 'latest' changes every week. Could you specify the version number explicitly?
Hi,
We still have this problem, and not only for long-running jobs.
Currently we are using Hudson 1.355 running on RHEL.
I have this same issue running Hudson ver. 1.355 with a slave on a Solaris 10 SPARC machine using Java 1.5.0, with jobs that take 4+ hours.
I don't have this issue with similar jobs on Solaris 10 x86, but there the job finishes in under 4 hours.
FWIW: This seems to be happening with a Windows 2003 slave as well, using Hudson 1.362. I have seen it before; this is just the first time I have tried to track down a solution. Here is what the error looks like on this version of Hudson.
FATAL: command execution failed
hudson.util.IOException2: Failed to join the process
at hudson.Proc$RemoteProc.join(Proc.java:312)
at hudson.Launcher$ProcStarter.join(Launcher.java:280)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:83)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:601)
at hudson.model.Build$RunnerImpl.build(Build.java:174)
at hudson.model.Build$RunnerImpl.doRun(Build.java:138)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:416)
at hudson.model.Run.run(Run.java:1253)
at hudson.matrix.MatrixRun.run(MatrixRun.java:130)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:124)
Caused by: java.util.concurrent.ExecutionException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
at hudson.remoting.Request$1.get(Request.java:218)
at hudson.remoting.Request$1.get(Request.java:172)
at hudson.remoting.FutureAdapter.get(FutureAdapter.java:55)
at hudson.Proc$RemoteProc.join(Proc.java:304)
... 12 more
Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
at hudson.remoting.Request.abort(Request.java:257)
at hudson.remoting.Channel.terminate(Channel.java:602)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:893)
Caused by: java.io.IOException: Unexpected termination of the channel
at hudson.remoting.Channel$ReaderThread.run(Channel.java:875)
Caused by: java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
at hudson.remoting.Channel$ReaderThread.run(Channel.java:869)
FATAL: Unable to delete script file C:\DOCUME~1\conman\LOCALS~1\Temp\hudson7729064622458259363.bat
hudson.util.IOException2: remote file operation failed: C:\DOCUME~1\conman\LOCALS~1\Temp\hudson7729064622458259363.bat at hudson.remoting.Channel@1a8aa2c:cmhslave02-win32
at hudson.FilePath.act(FilePath.java:749)
at hudson.FilePath.act(FilePath.java:735)
at hudson.FilePath.delete(FilePath.java:990)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:93)
at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:601)
at hudson.model.Build$RunnerImpl.build(Build.java:174)
at hudson.model.Build$RunnerImpl.doRun(Build.java:138)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:416)
at hudson.model.Run.run(Run.java:1253)
at hudson.matrix.MatrixRun.run(MatrixRun.java:130)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:124)
Caused by: hudson.remoting.ChannelClosedException: channel is already closed
at hudson.remoting.Channel.send(Channel.java:412)
at hudson.remoting.Request.call(Request.java:105)
at hudson.remoting.Channel.call(Channel.java:555)
at hudson.FilePath.act(FilePath.java:742)
... 13 more
FATAL: channel is already closed
hudson.remoting.ChannelClosedException: channel is already closed
at hudson.remoting.Channel.send(Channel.java:412)
at hudson.remoting.Request.call(Request.java:105)
at hudson.remoting.Channel.call(Channel.java:555)
at hudson.Launcher$RemoteLauncher.kill(Launcher.java:744)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:421)
at hudson.model.Run.run(Run.java:1253)
at hudson.matrix.MatrixRun.run(MatrixRun.java:130)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:124)
The best solution would be to make no ping requests while the slave is running a build.
At the very least, there should be some way to increase the timeout on the ping.
I am also experiencing exactly the same problem (same call stack). Hudson 1.363 on RedHat/Tomcat, with Windows XP slaves. I have a matrix job, and each configuration takes about 1.5 hours. Not all jobs fail; in my last build, 1 out of 25 configurations failed because of this problem.
Code changed in hudson
User: kohsuke
Path:
trunk/hudson/main/remoting/src/main/java/hudson/remoting/Channel.java
trunk/hudson/main/remoting/src/main/java/hudson/remoting/ChannelClosedException.java
http://jenkins-ci.org/commit/33537
Log:
[JENKINS-5073 JENKINS-3412] improved the error diagnostics on ChannelClosedException by having it report who/how the connection was closed.
Integrated in hudson_main_trunk #156
[JENKINS-5073 JENKINS-3412] improved the error diagnostics on ChannelClosedException by having it report who/how the connection was closed.
kohsuke :
Files :
- /trunk/hudson/main/remoting/src/main/java/hudson/remoting/ChannelClosedException.java
- /trunk/hudson/main/remoting/src/main/java/hudson/remoting/Channel.java
I integrated the above fix into our Hudson 1.362 installation, but I still cannot see the root cause of why the Hudson slave connection is reset.
I'm marking this as a duplicate of JENKINS-5073.
Both issues are caused by a lost master/slave communication channel. When it happens while your build is waiting for a forked process to complete, you see this error in the build console.
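As a rough illustration of why one lost connection produces the whole cascade seen in the console (the failed join, then the failed script delete, then "already closed"), here is a toy model of a request/response channel. It is a sketch only and not the real hudson.remoting.Channel: ToyChannel and its methods are hypothetical, and it is single-threaded for simplicity.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of a master/slave channel; NOT Hudson's hudson.remoting.Channel. */
final class ToyChannel {
    private final Map<Integer, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    private final AtomicInteger nextId = new AtomicInteger();
    private volatile Throwable closedCause; // non-null once the link is gone

    /** Registers a request and returns a future that completes when the reply arrives. */
    CompletableFuture<String> call(String request) throws IOException {
        if (closedCause != null) {
            throw new IOException("already closed", closedCause);
        }
        CompletableFuture<String> reply = new CompletableFuture<>();
        pending.put(nextId.incrementAndGet(), reply);
        return reply;
    }

    /** What a reader thread might do on EOF: abort every request still in flight. */
    void terminate(Throwable cause) {
        closedCause = cause;
        pending.values().forEach(f -> f.completeExceptionally(cause));
        pending.clear();
    }

    public static void main(String[] args) throws Exception {
        ToyChannel channel = new ToyChannel();
        CompletableFuture<String> join = channel.call("join the process");

        channel.terminate(new EOFException()); // the link drops mid-build

        try {
            join.get(); // the build was blocked here, waiting for the process
        } catch (ExecutionException e) {
            System.out.println("join failed, caused by " + e.getCause());
        }
        try {
            channel.call("delete script file"); // cleanup is attempted afterwards
        } catch (IOException e) {
            System.out.println("cleanup failed: " + e.getMessage());
        }
    }
}
```

In this model the first failure carries the real cause (the EOF on the link), while every later call only reports that the channel is already closed, so the first exception in the console output is the one worth investigating.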
If I understand you correctly, Hudson starts a shell (on a slave) and runs your
script, which in turn runs ssh and starts a process on yet another machine?
The exception indicates that the link between the master and the slave was
terminated unexpectedly. How do your master and slave talk to each other?
Finally, I didn't follow your reasoning about ProcessTreeKiller and why that
hints at a bug in the remoting code.