JENKINS-45036

Pipeline job hangs with remote file operation failed / channel is already closed after master restart

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Major
    • Environment: Jenkins LTS 2.46.3
      All plugins latest at time of report (not sure which to list, sorry)

      In part, I'm reporting this because I don't know where to begin.

      I found this while working with an existing, fairly large pipeline script, in which I've only recently started testing whether Jenkins can be restarted during a pipeline run. Having worked around one issue (which was more obviously my fault), I'm now hitting the following when restarting and resuming, tested at various points during the script:

      15:00:02 [<ParallelStage1>] Cannot contact <LinuxNode>: java.io.IOException: remote file operation failed: <Workspace>/<ParallelStage1> at hudson.remoting.Channel@36509a01:<LinuxNode>: hudson.remoting.ChannelClosedException: channel is already closed
      15:00:02 [<ParallelStage2>] Cannot contact <WindowsNode>: java.io.IOException: remote file operation failed: <Workspace>\<ParallelStage2> at hudson.remoting.Channel@5c2c5123:JNLP4-connect connection from 192.168.0.251/192.168.0.251:53989: hudson.remoting.ChannelClosedException: channel is already closed
      15:00:02 [<ParallelStage3>] Cannot contact <WindowsNode>: java.io.IOException: remote file operation failed: <Workspace>\<ParallelStage3> at hudson.remoting.Channel@5c2c5123:JNLP4-connect connection from 192.168.0.251/192.168.0.251:53989: hudson.remoting.ChannelClosedException: channel is already closed
      

      The Linux agent in question is launched by SSH on Debian Jessie.
      The Windows agent is Windows Server 2012 R2 running the agent through JNLP.

      I've tried restarting the instance (using the safe restart from the UI) at various points, and on resume the job almost always fails with this error immediately.

      In one instance I've managed to catch the exception while running the stash step seemingly post-resume:

      13:15:59 [<ParallelStage1>] Caught exception: java.nio.channels.ClosedChannelException
      13:15:59 [<ParallelStage1>] Stacktrace: [hudson.remoting.Request.abort(Request.java:307),
      hudson.remoting.Channel.terminate(Channel.java:896),
      org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208),
      org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222),
      org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832),
      org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213),
      org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800),
      org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173),
      org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:311),
      hudson.remoting.Channel.close(Channel.java:1295),
      hudson.remoting.Channel.close(Channel.java:1263),
      hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:704),
      hudson.slaves.SlaveComputer.kill(SlaveComputer.java:675),
      hudson.model.AbstractCIBase.killComputer(AbstractCIBase.java:87),
      jenkins.model.Jenkins.access$2000(Jenkins.java:307),
      jenkins.model.Jenkins$22.run(Jenkins.java:3340),
      hudson.model.Queue._withLock(Queue.java:1334),
      hudson.model.Queue.withLock(Queue.java:1211),
      jenkins.model.Jenkins._cleanUpDisconnectComputers(Jenkins.java:3334),
      jenkins.model.Jenkins.cleanUp(Jenkins.java:3210),
      hudson.lifecycle.UnixLifecycle.restart(UnixLifecycle.java:73),
      jenkins.model.Jenkins$26.run(Jenkins.java:4196),
      ......remote call to JNLP4-connect connection from 192.168.0.251/192.168.0.251:63146(Native Method),
      hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1545),
      hudson.remoting.Request.call(Request.java:172),
      hudson.remoting.Channel.call(Channel.java:829),
      hudson.FilePath.act(FilePath.java:985),
      hudson.FilePath.act(FilePath.java:974),
      hudson.FilePath.archive(FilePath.java:456),
      org.jenkinsci.plugins.workflow.flow.StashManager.stash(StashManager.java:107),
      org.jenkinsci.plugins.workflow.support.steps.stash.StashStep$Execution.run(StashStep.java:112),
      org.jenkinsci.plugins.workflow.support.steps.stash.StashStep$Execution.run(StashStep.java:100),
      org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:49),
      hudson.security.ACL.impersonate(ACL.java:260),
      org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:46),
      java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471),
      java.util.concurrent.FutureTask.run(FutureTask.java:262),
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145),
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615),
      java.lang.Thread.run(Thread.java:745)]
      

      But normally it just seems to fail immediately on resume.

      After this, all the parallel branches hang and have to be killed in two stages: first by attempting to cancel the job, then by clicking the prompt that appears in the console output.

      Most of the job consists of running one batch script or shell script after another, and the failure almost always occurs when returning from one of these.

      I've been building a test script from scratch that mimics many of the functions of the failing script, in order to find a repro to report here, but I haven't gotten close to causing it to fail yet.
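
      For illustration only, a reduced scripted pipeline of the general shape described in this report (parallel branches on a Linux and a Windows agent, each running a shell or batch script and then stashing) might look like the sketch below. It is not the failing script and is not known to reproduce the problem. The agent labels, commands, file names, and stash names are all hypothetical placeholders.

      ```groovy
      // Hypothetical reduced pipeline. Labels, commands, and stash names are
      // placeholders, not taken from the failing job.
      def branches = [:]

      branches['ParallelStage1'] = {
          node('linux') {                      // assumed label for the SSH-launched agent
              try {
                  // Long-running step: safe-restart Jenkins while this is in progress
                  sh 'sleep 120; echo done > result.txt'
                  stash name: 'stage1-output', includes: 'result.txt'
              } catch (e) {
                  // Log the exception before failing the branch
                  echo "Caught exception: ${e}"
                  throw e
              }
          }
      }

      branches['ParallelStage2'] = {
          node('windows') {                    // assumed label for the JNLP-launched agent
              try {
                  bat 'ping -n 120 127.0.0.1 > NUL'   // roughly two minutes of waiting
                  bat 'echo done > result.txt'
                  stash name: 'stage2-output', includes: 'result.txt'
              } catch (e) {
                  echo "Caught exception: ${e}"
                  throw e
              }
          }
      }

      // Run both branches concurrently
      parallel branches
      ```

      Restarting from the UI while both branches are inside their long-running step would exercise the same resume path, under the assumption that the shared library code is not a factor.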

      I am also using a shared library with a mixture of CPS and NonCPS code across shared functions and classes. However, I normally see no serialisation warnings during pipeline execution or in the Jenkins master log, and no errors other than those shown above when the job fails, so I'm not sure what to look at.

            Assignee: Unassigned
            Reporter: Phil McArdle (philmcardlecg)
            Votes: 0
            Watchers: 3