JENKINS-12235: FATAL, Unable to delete script file, IOException2, remote file operation failed, unexpected termination of channel

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Component/s: core, remoting
    • None

      Below is the stacktrace.

      It happened when I ran two jobs on the master. After running for a while, both jobs crashed with this exception.
      I think this might be caused by a brief flip-flop in network connectivity, but I didn't notice any disconnection.
      Another possible cause is the heavy load on Jenkins:

        PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM      TIME+ COMMAND
      25942 hudson    15   0 6902m 5.8g 5720 S  0.3 74.3  401:22.30 java

      Does Jenkins run its own garbage collector at some specified time?
      We have to restart it every few days because it gets slower and slower until it hangs.
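
      One way to check whether garbage collection is behind that slowdown is to enable GC logging on the master JVM. This is a minimal sketch, assuming a Java 6/7-era HotSpot JVM; the variable name, heap size and log path are illustrative, not taken from this installation:

        # Illustrative flags for the master JVM; adjust the heap size and log path to your setup.
        # The GC log shows how often collections run and how long the stop-the-world pauses last.
        JAVA_OPTS="-Xmx4g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/jenkins/gc.log"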

      FATAL: Unable to delete script file /tmp/hudson8303731085225956739.sh
      hudson.util.IOException2: remote file operation failed: /tmp/hudson8303731085225956739.sh at hudson.remoting.Channel@30e472f4:build@autom-1
      at hudson.FilePath.act(FilePath.java:781)
      at hudson.FilePath.act(FilePath.java:767)
      at hudson.FilePath.delete(FilePath.java:1022)
      at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:92)
      at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
      at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
      at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:695)
      at hudson.model.Build$RunnerImpl.build(Build.java:178)
      at hudson.model.Build$RunnerImpl.doRun(Build.java:139)
      at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:461)
      at hudson.model.Run.run(Run.java:1404)
      at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      at hudson.model.ResourceController.execute(ResourceController.java:88)
      at hudson.model.Executor.run(Executor.java:230)
      Caused by: hudson.remoting.ChannelClosedException: channel is already closed
      at hudson.remoting.Channel.send(Channel.java:499)
      at hudson.remoting.Request.call(Request.java:110)
      at hudson.remoting.Channel.call(Channel.java:681)
      at hudson.FilePath.act(FilePath.java:774)
      ... 13 more
      Caused by: java.io.IOException: Unexpected termination of the channel
      at hudson.remoting.Channel$ReaderThread.run(Channel.java:1115)
      Caused by: java.io.EOFException
      at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2554)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
      at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
      at hudson.remoting.Channel$ReaderThread.run(Channel.java:1109)
      FATAL: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      at hudson.remoting.Request.call(Request.java:149)
      at hudson.remoting.Channel.call(Channel.java:681)
      at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:158)
      at $Proxy29.join(Unknown Source)
      at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:859)
      at hudson.Launcher$ProcStarter.join(Launcher.java:345)
      at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:82)
      at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:58)
      at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
      at hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:695)
      at hudson.model.Build$RunnerImpl.build(Build.java:178)
      at hudson.model.Build$RunnerImpl.doRun(Build.java:139)
      at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:461)
      at hudson.model.Run.run(Run.java:1404)
      at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      at hudson.model.ResourceController.execute(ResourceController.java:88)
      at hudson.model.Executor.run(Executor.java:230)
      Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
      at hudson.remoting.Request.abort(Request.java:273)
      at hudson.remoting.Channel.terminate(Channel.java:732)
      at hudson.remoting.Channel$ReaderThread.run(Channel.java:1139)
      Caused by: java.io.IOException: Unexpected termination of the channel
      at hudson.remoting.Channel$ReaderThread.run(Channel.java:1115)
      Caused by: java.io.EOFException
      at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2554)
      at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
      at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
      at hudson.remoting.Channel$ReaderThread.run(Channel.java:1109)


          Zhijun Xu added a comment -

          @Rozendorn, I think you should set TCPKeepAlive to yes; try it.


          Guy Rozendorn added a comment -

          forever_xt, TCPKeepAlive yes is the default, which doesn't work either
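
          For context, these are the OpenSSH keep-alive settings under discussion. A sketch only, with illustrative values: TCPKeepAlive is the OS-level probe (and, as noted, already the default), while ServerAliveInterval/ClientAliveInterval add application-level probes that also detect a hung channel:

            # On the master (the SSH client side), e.g. in ~jenkins/.ssh/config; illustrative values:
            Host *
                # OS-level keep-alive probes; already the default, as noted above
                TCPKeepAlive yes
                # application-level probe every 30 seconds, give up after 5 missed probes
                ServerAliveInterval 30
                ServerAliveCountMax 5

            # On each slave (the SSH server side), in /etc/ssh/sshd_config; illustrative values:
            ClientAliveInterval 30
            ClientAliveCountMax 5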


          sanga added a comment -

          As an update to this: we had a bug in a script of ours which redeployed our build slaves. But even after fixing that (and with both TCP and SSH keep-alives enabled), we're occasionally still seeing this bug. One possible explanation is that it's related to the load on the Jenkins master. That has been mentioned earlier in this case as a possible cause, and it's something we noticed too: updating from 1.489 to 1.509.2 significantly increased the load on our Jenkins master. So we've given the master more resources and tweaked the JVM opts a bit to see if that improves things at all.

          @Zhijun: out of curiosity, how is the load on your Jenkins master? Is it swapping at all?
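
          For reference, the kind of JVM tweak meant here is sketched below with hypothetical values; the variable name depends on how the master is launched (JENKINS_JAVA_OPTIONS on RPM-based installs), and the sizes are examples, not a recommendation for this installation:

            # Illustrative master JVM options: a larger heap plus CMS to shorten stop-the-world
            # pauses that can starve the slave channels. Values are hypothetical.
            JENKINS_JAVA_OPTIONS="-Xms2g -Xmx4g -XX:MaxPermSize=512m -XX:+UseConcMarkSweepGC"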


          Guy Rozendorn added a comment -

          Looking at our master while Jenkins is alive (no job is running), Jenkins' java process takes 100% of one of the CPUs:

          Tasks:  84 total,   1 running,  83 sleeping,   0 stopped,   0 zombie
          %Cpu(s): 51.4 us,  0.0 sy,  0.0 ni, 48.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
          KiB Mem:   8178392 total,  7231500 used,   946892 free,   381796 buffers
          KiB Swap:  8386556 total,    20888 used,  8365668 free,  5538224 cached
          
            PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
          12077 jenkins   20   0 5229m 873m 7428 S  99.7 10.9  11621:35 java
          

          Does anyone here know how to debug this, or have references on how to do so?
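
          One generic way to see what that busy thread is doing is to take a thread dump and match it against per-thread CPU usage. A sketch only: the PID comes from the top output above, the thread id is hypothetical, and it assumes standard JDK tools on the master:

            # Show per-thread CPU usage inside the Jenkins java process (PID 12077 above); note the busiest TID.
            top -H -p 12077
            # Convert that (hypothetical) decimal TID to hex, because jstack prints it as nid=0x...
            printf '0x%x\n' 12345          # -> 0x3039
            # Take a thread dump as the user running Jenkins and find the hot thread's stack.
            sudo -u jenkins jstack 12077 > /tmp/jenkins-threads.txt
            grep -A 20 'nid=0x3039' /tmp/jenkins-threads.txt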


          Guy Rozendorn added a comment -

          after:

          • applying the ssh settings mentioned above on all our slaves
          • adding more RAM and CPUs to the master
          • spreading our nightly runs over a longer timeframe so that we run as few jobs concurrently as possible

          we're not seeing this issue. However, when we start running jobs in parallel (globally in Jenkins, not on the slaves; each slave has only 1 executor), we're seeing this issue again.


          Marc Seeger added a comment -

          I just witnessed it live on a slave today.
          Some findings:

          1. Once the slave started failing, subsequent (different) jobs failed too. (I tested 3 jobs; all of them failed with the same error.)
          2. Just disconnecting and reconnecting the slave made it work again.
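
          For what it's worth, that disconnect/reconnect cycle can also be scripted; a sketch, assuming the connect-node/disconnect-node commands of the Jenkins CLI are available in this version (the URL and node name are hypothetical):

            # Bounce the affected slave's channel without restarting the master.
            java -jar jenkins-cli.jar -s http://jenkins.example.com/ disconnect-node slave-01
            java -jar jenkins-cli.jar -s http://jenkins.example.com/ connect-node slave-01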


          Guy Rozendorn added a comment -

          We had some issues in our lab which forced us to re-install all of our slaves (84 and counting).
          We are still experiencing this issue.

          It seems that after this happens, the slave remains connected to Jenkins. However, I can't tell what happens if you try to run another job on it, because we revert the slave VM from snapshot after every run (whether it is successful or not).


          Danny Staple added a comment -

          OK, I've found something on this today. If you have very "chatty" jobs on the slaves that output a lot of console data, try logging/redirecting that output to a file. The chatty jobs aren't necessarily the root cause, but they make the failure more likely.

          If a job is running but quiet, you can unplug a slave's network cable for a few seconds, plug it back in, and things will pretty much continue as before. However, a slave running a chatty job will die with an I/O error almost immediately.

          If you can redirect the output to a file, you may see a big reduction in these failures.
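
          A minimal sketch of that redirect idea for a shell build step; the script name is hypothetical, and $WORKSPACE is the standard Jenkins build variable:

            # Keep the bulky output in a file in the workspace instead of streaming it over the
            # master-slave channel; print only a short tail to the build console.
            ./run_chatty_tests.sh > "$WORKSPACE/test-output.log" 2>&1
            tail -n 20 "$WORKSPACE/test-output.log"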


          Guy Rozendorn added a comment -

          After updating all our jobs to yield output every 10 seconds, this occurs less frequently, but it still happens a few times a week.
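
          A sketch of the kind of periodic heartbeat described above, for an otherwise quiet shell build step; the wrapped command is hypothetical:

            # Run the quiet task in the background and emit a line every 10 seconds while it runs,
            # so the master-slave channel sees regular traffic.
            ./long_quiet_task.sh &
            task_pid=$!
            while kill -0 "$task_pid" 2>/dev/null; do
                echo "still running: $(date)"
                sleep 10
            done
            wait "$task_pid"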


          Jesse Glick added a comment -

          Essentially a duplicate of JENKINS-1948.


            Assignee: Unassigned
            Reporter: Ghenadie Dumitru (dumghen)
            Votes: 38
            Watchers: 47
