
Pipeline job hangs with remote file operation failed / channel is already closed after master restart

    • Bug
    • Resolution: Incomplete
    • Major
    • Jenkins LTS 2.46.3
      All plugins latest at time of report (not sure which to list, sorry)

      In part, I'm reporting this because I don't know where to begin.

      I found this while working with an existing, somewhat large pipeline script, with which I've only recently started testing whether I can restart Jenkins during a pipeline run. Having worked around one issue (which was more obviously my fault), I'm now hitting the following when restarting and resuming, tested at various points in the script:

      15:00:02 [<ParallelStage1>] Cannot contact <LinuxNode>: java.io.IOException: remote file operation failed: <Workspace>/<ParallelStage1> at hudson.remoting.Channel@36509a01:<LinuxNode>: hudson.remoting.ChannelClosedException: channel is already closed
      15:00:02 [<ParallelStage2>] Cannot contact <WindowsNode>: java.io.IOException: remote file operation failed: <Workspace>\<ParallelStage2> at hudson.remoting.Channel@5c2c5123:JNLP4-connect connection from 192.168.0.251/192.168.0.251:53989: hudson.remoting.ChannelClosedException: channel is already closed
      15:00:02 [<ParallelStage3>] Cannot contact <WindowsNode>: java.io.IOException: remote file operation failed: <Workspace>\<ParallelStage3> at hudson.remoting.Channel@5c2c5123:JNLP4-connect connection from 192.168.0.251/192.168.0.251:53989: hudson.remoting.ChannelClosedException: channel is already closed
      

      The Linux agent in question is launched by SSH on Debian Jessie.
      The Windows agent is Windows Server 2012 R2 running the agent through JNLP.

      I've now tried restarting the instance (using the safe restart from the UI) at various points, and on resume it almost always fails with this immediately.

      In one instance I managed to catch the exception while running the stash step, seemingly post-resume:

      13:15:59 [<ParallelStage1>] Caught exception: java.nio.channels.ClosedChannelException
      13:15:59 [<ParallelStage1>] Stacktrace: [hudson.remoting.Request.abort(Request.java:307),
      hudson.remoting.Channel.terminate(Channel.java:896),
      org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208),
      org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222),
      org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832),
      org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200),
      org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213),
      org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800),
      org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173),
      org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:311),
      hudson.remoting.Channel.close(Channel.java:1295),
      hudson.remoting.Channel.close(Channel.java:1263),
      hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:704),
      hudson.slaves.SlaveComputer.kill(SlaveComputer.java:675),
      hudson.model.AbstractCIBase.killComputer(AbstractCIBase.java:87),
      jenkins.model.Jenkins.access$2000(Jenkins.java:307),
      jenkins.model.Jenkins$22.run(Jenkins.java:3340),
      hudson.model.Queue._withLock(Queue.java:1334),
      hudson.model.Queue.withLock(Queue.java:1211),
      jenkins.model.Jenkins._cleanUpDisconnectComputers(Jenkins.java:3334),
      jenkins.model.Jenkins.cleanUp(Jenkins.java:3210),
      hudson.lifecycle.UnixLifecycle.restart(UnixLifecycle.java:73),
      jenkins.model.Jenkins$26.run(Jenkins.java:4196),
      ......remote call to JNLP4-connect connection from 192.168.0.251/192.168.0.251:63146(Native Method),
      hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1545),
      hudson.remoting.Request.call(Request.java:172),
      hudson.remoting.Channel.call(Channel.java:829),
      hudson.FilePath.act(FilePath.java:985),
      hudson.FilePath.act(FilePath.java:974),
      hudson.FilePath.archive(FilePath.java:456),
      org.jenkinsci.plugins.workflow.flow.StashManager.stash(StashManager.java:107),
      org.jenkinsci.plugins.workflow.support.steps.stash.StashStep$Execution.run(StashStep.java:112),
      org.jenkinsci.plugins.workflow.support.steps.stash.StashStep$Execution.run(StashStep.java:100),
      org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:49),
      hudson.security.ACL.impersonate(ACL.java:260),
      org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:46),
      java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471),
      java.util.concurrent.FutureTask.run(FutureTask.java:262),
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145),
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615),
      java.lang.Thread.run(Thread.java:745)]
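
      For illustration, catching an exception around a stash step in a scripted pipeline might look roughly like the sketch below. This is not the code from the failing job: the node label, stash name and include pattern are hypothetical placeholders, and calling getStackTrace() may need script approval when the Groovy sandbox is enabled.

      ```groovy
      // Hypothetical scripted-pipeline fragment (runs only inside Jenkins).
      // 'linux', 'stage1-output' and 'build/**' are placeholders, not values
      // taken from the failing job.
      node('linux') {
          try {
              // Archive part of the workspace so a later stage can unstash it
              stash name: 'stage1-output', includes: 'build/**'
          } catch (Exception e) {
              // Log the exception and its frames to the console output
              echo "Caught exception: ${e}"
              echo "Stacktrace: ${e.getStackTrace().toList()}"
              // Rethrow so the branch still fails visibly
              throw e
          }
      }
      ```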
      

      But normally it just seems to fail immediately on resume.

      After this, all the parallel branches hang and have to be killed in two stages: first attempting to cancel the job, then clicking the prompt in the console output.

      Most of the job consists of running one batch script or shell script or another, and the failure almost always occurs on returning from one of these.

      I've been trying to build a test script from scratch that mimics many of the functions of the failing script, in order to find a repro to report here, but I haven't come close to making it fail yet.
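
      As a rough idea of the shape such a test script could take, here is a minimal scripted-pipeline sketch with one parallel branch per agent type, each running a long script step followed by a stash. It is an assumption-laden skeleton, not a known repro: the node labels, commands, file names and stash names are all hypothetical.

      ```groovy
      // Hypothetical test pipeline (runs only inside Jenkins); every label,
      // command and name below is a placeholder.
      def branches = [:]

      branches['ParallelStage1'] = {
          node('linux') {
              // Long-running shell step, leaving time to trigger a safe restart
              sh 'sleep 300'
              writeFile file: 'out/stage1.txt', text: 'stage 1 done'
              stash name: 'stage1', includes: 'out/**'
          }
      }

      branches['ParallelStage2'] = {
          node('windows') {
              // Long-running batch step on the JNLP-connected agent
              bat 'ping -n 300 127.0.0.1 > NUL'
              writeFile file: 'out/stage2.txt', text: 'stage 2 done'
              stash name: 'stage2', includes: 'out/**'
          }
      }

      // Run both branches concurrently; restart Jenkins while they are waiting
      parallel branches
      ```

      The idea would be to trigger the safe restart while both branches are inside their script steps, then watch whether the branches resume or report the closed channel.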

      I am also using a shared library with a mixture of CPS and NonCPS code across shared functions and classes. However, I normally get no serialisation warnings during pipeline execution or in the Jenkins master log, and there are no errors other than those shown above when the job fails, so I'm not sure what to look at.

          [JENKINS-45036] Pipeline job hangs with remote file operation failed / channel is already closed after master restart

          Phil McArdle created issue -
          Phil McArdle made changes -
          Description: stack trace frames that shared a line were split onto separate lines; the text is otherwise unchanged.
          Phil McArdle made changes -
          Summary Original: Pipeline job fails with remote file operation failed / channel is already closed after master restart
          New: Pipeline job hangs with remote file operation failed / channel is already closed after master restart
          Phil McArdle made changes -
          Description Original: In part, I'm reporting this because I don't know where to begin.

          I've found this while working with an existing somewhat large pipeline script, in which I've only recently tried to see if I can restart during the pipeline run. Having worked around one issue (which was more obviously my fault), I'm now hitting the following when restarting and resuming, tested at various points during the script:

          {noformat}
          15:00:02 [<ParallelStage1>] Cannot contact <LinuxNode>: java.io.IOException: remote file operation failed: <Workspace>/<ParallelStage1> at hudson.remoting.Channel@36509a01:<LinuxNode>: hudson.remoting.ChannelClosedException: channel is already closed
          15:00:02 [<ParallelStage2>] Cannot contact <WindowsNode>: java.io.IOException: remote file operation failed: <Workspace>\<ParallelStage2> at hudson.remoting.Channel@5c2c5123:JNLP4-connect connection from 192.168.0.251/192.168.0.251:53989: hudson.remoting.ChannelClosedException: channel is already closed
          15:00:02 [<ParallelStage3>] Cannot contact <WindowsNode>: java.io.IOException: remote file operation failed: <Workspace>\<ParallelStage3> at hudson.remoting.Channel@5c2c5123:JNLP4-connect connection from 192.168.0.251/192.168.0.251:53989: hudson.remoting.ChannelClosedException: channel is already closed
          {noformat}

          The Linux agent in question is launched by SSH on Debian Jessie.
          The Windows agent is Windows Server 2012 R2 running the agent through JNLP.

          I've tried restarting the instance (using the safe restart from the UI) at various points now, and on resume it will fail with this almost always immediately.

          In one instance I've managed to catch the exception while running the stash step seemingly post-resume:

          {noformat}
          13:15:59 [<ParallelStage1>] Caught exception: java.nio.channels.ClosedChannelException
          13:15:59 [<ParallelStage1>] Stacktrace: [hudson.remoting.Request.abort(Request.java:307),
          hudson.remoting.Channel.terminate(Channel.java:896),
          org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208),
          org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222),
          org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832),
          org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287),
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181),
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283),
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503),
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248),
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200),
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213),
          org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800),
          org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173),
          org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:311),
          hudson.remoting.Channel.close(Channel.java:1295),
          hudson.remoting.Channel.close(Channel.java:1263),
          hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:704),
          hudson.slaves.SlaveComputer.kill(SlaveComputer.java:675),
          hudson.model.AbstractCIBase.killComputer(AbstractCIBase.java:87),
          jenkins.model.Jenkins.access$2000(Jenkins.java:307),
          jenkins.model.Jenkins$22.run(Jenkins.java:3340),
          hudson.model.Queue._withLock(Queue.java:1334),
          hudson.model.Queue.withLock(Queue.java:1211),
          jenkins.model.Jenkins._cleanUpDisconnectComputers(Jenkins.java:3334),
          jenkins.model.Jenkins.cleanUp(Jenkins.java:3210),
          hudson.lifecycle.UnixLifecycle.restart(UnixLifecycle.java:73),
          jenkins.model.Jenkins$26.run(Jenkins.java:4196),
          ......remote call to JNLP4-connect connection from 192.168.0.251/192.168.0.251:63146(Native Method),
          hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1545),
          hudson.remoting.Request.call(Request.java:172),
          hudson.remoting.Channel.call(Channel.java:829),
          hudson.FilePath.act(FilePath.java:985),
          hudson.FilePath.act(FilePath.java:974),
          hudson.FilePath.archive(FilePath.java:456),
          org.jenkinsci.plugins.workflow.flow.StashManager.stash(StashManager.java:107),
          org.jenkinsci.plugins.workflow.support.steps.stash.StashStep$Execution.run(StashStep.java:112),
          org.jenkinsci.plugins.workflow.support.steps.stash.StashStep$Execution.run(StashStep.java:100),
          org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:49),
          hudson.security.ACL.impersonate(ACL.java:260),
          org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:46),
          java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471),
          java.util.concurrent.FutureTask.run(FutureTask.java:262),
          java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145),
          java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615),
          java.lang.Thread.run(Thread.java:745)]
          {noformat}

          But normally it just seems to fail immediately on resume.

          I've been trying to build a test script from scratch trying to mimic many of the functions of the script that's failing in order to find a repro to report here, but I haven't gotten close to causing it to fail yet.

          I am also using a shared library with a mixture of CPS and NonCPS code across shared functions and classes, but I've got no serialisation warnings normally on pipeline execution or in the Jenkins master log and no other errors apart from those shown above when the job fails, so I'm not sure what to look at.
          New: In part, I'm reporting this because I don't know where to begin.

          I've found this while working with an existing somewhat large pipeline script, in which I've only recently tried to see if I can restart during the pipeline run. Having worked around one issue (which was more obviously my fault), I'm now hitting the following when restarting and resuming, tested at various points during the script:

          {noformat}
          15:00:02 [<ParallelStage1>] Cannot contact <LinuxNode>: java.io.IOException: remote file operation failed: <Workspace>/<ParallelStage1> at hudson.remoting.Channel@36509a01:<LinuxNode>: hudson.remoting.ChannelClosedException: channel is already closed
          15:00:02 [<ParallelStage2>] Cannot contact <WindowsNode>: java.io.IOException: remote file operation failed: <Workspace>\<ParallelStage2> at hudson.remoting.Channel@5c2c5123:JNLP4-connect connection from 192.168.0.251/192.168.0.251:53989: hudson.remoting.ChannelClosedException: channel is already closed
          15:00:02 [<ParallelStage3>] Cannot contact <WindowsNode>: java.io.IOException: remote file operation failed: <Workspace>\<ParallelStage3> at hudson.remoting.Channel@5c2c5123:JNLP4-connect connection from 192.168.0.251/192.168.0.251:53989: hudson.remoting.ChannelClosedException: channel is already closed
          {noformat}

          The Linux agent in question is launched by SSH on Debian Jessie.
          The Windows agent is Windows Server 2012 R2 running the agent through JNLP.

          I've tried restarting the instance (using the safe restart from the UI) at various points now, and on resume it will fail with this almost always immediately.

          In one instance I've managed to catch the exception while running the stash step seemingly post-resume:

          {noformat}
          13:15:59 [<ParallelStage1>] Caught exception: java.nio.channels.ClosedChannelException
          13:15:59 [<ParallelStage1>] Stacktrace: [hudson.remoting.Request.abort(Request.java:307),
          hudson.remoting.Channel.terminate(Channel.java:896),
          org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:208),
          org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:222),
          org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:832),
          org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:287),
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:181),
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:283),
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:503),
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:248),
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:200),
          org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:213),
          org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:800),
          org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:173),
          org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:311),
          hudson.remoting.Channel.close(Channel.java:1295),
          hudson.remoting.Channel.close(Channel.java:1263),
          hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:704),
          hudson.slaves.SlaveComputer.kill(SlaveComputer.java:675),
          hudson.model.AbstractCIBase.killComputer(AbstractCIBase.java:87),
          jenkins.model.Jenkins.access$2000(Jenkins.java:307),
          jenkins.model.Jenkins$22.run(Jenkins.java:3340),
          hudson.model.Queue._withLock(Queue.java:1334),
          hudson.model.Queue.withLock(Queue.java:1211),
          jenkins.model.Jenkins._cleanUpDisconnectComputers(Jenkins.java:3334),
          jenkins.model.Jenkins.cleanUp(Jenkins.java:3210),
          hudson.lifecycle.UnixLifecycle.restart(UnixLifecycle.java:73),
          jenkins.model.Jenkins$26.run(Jenkins.java:4196),
          ......remote call to JNLP4-connect connection from 192.168.0.251/192.168.0.251:63146(Native Method),
          hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1545),
          hudson.remoting.Request.call(Request.java:172),
          hudson.remoting.Channel.call(Channel.java:829),
          hudson.FilePath.act(FilePath.java:985),
          hudson.FilePath.act(FilePath.java:974),
          hudson.FilePath.archive(FilePath.java:456),
          org.jenkinsci.plugins.workflow.flow.StashManager.stash(StashManager.java:107),
          org.jenkinsci.plugins.workflow.support.steps.stash.StashStep$Execution.run(StashStep.java:112),
          org.jenkinsci.plugins.workflow.support.steps.stash.StashStep$Execution.run(StashStep.java:100),
          org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:49),
          hudson.security.ACL.impersonate(ACL.java:260),
          org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:46),
          java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471),
          java.util.concurrent.FutureTask.run(FutureTask.java:262),
          java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145),
          java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615),
          java.lang.Thread.run(Thread.java:745)]
          {noformat}

          But normally it just seems to fail immediately on resume.

          After this, all the parallel branches hang, and have to be killed with the two-stage prompt in the console output.

          I've been trying to build a test script from scratch that mimics many of the functions of the failing script, in order to find a repro to report here, but I haven't gotten close to causing it to fail yet.

          I am also using a shared library with a mixture of CPS and NonCPS code across shared functions and classes. However, I normally get no serialisation warnings during pipeline execution or in the Jenkins master log, and no errors apart from those shown above when the job fails, so I'm not sure what to look at.
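
          One way to read the stack trace above is to split it at the `......remote call to` line. As a hedged reading (not confirmed anywhere in this issue), the frames before that marker show where the channel was closed, and the frames after it show the call that was waiting on the channel. The Python sketch below does that split on an abbreviated copy of the trace; `split_trace` and `method_name` are hypothetical helper names introduced only for illustration.

```python
import re

# Abbreviated copy of the trace from this report (several frames omitted).
TRACE = """hudson.remoting.Request.abort(Request.java:307),
hudson.remoting.Channel.terminate(Channel.java:896),
hudson.remoting.Channel.close(Channel.java:1295),
hudson.slaves.SlaveComputer.kill(SlaveComputer.java:675),
jenkins.model.Jenkins.cleanUp(Jenkins.java:3210),
hudson.lifecycle.UnixLifecycle.restart(UnixLifecycle.java:73),
......remote call to JNLP4-connect connection from 192.168.0.251/192.168.0.251:63146(Native Method),
hudson.remoting.Channel.call(Channel.java:829),
hudson.FilePath.archive(FilePath.java:456),
org.jenkinsci.plugins.workflow.flow.StashManager.stash(StashManager.java:107),
java.lang.Thread.run(Thread.java:745)]"""

# Line that separates the two halves of the combined trace.
MARKER = "......remote call to"


def split_trace(trace):
    """Return (frames before the marker, frames after the marker)."""
    frames = [line.strip().rstrip(",]") for line in trace.splitlines() if line.strip()]
    for i, frame in enumerate(frames):
        if frame.startswith(MARKER):
            return frames[:i], frames[i + 1:]
    # No marker found: treat the whole trace as a single half.
    return frames, []


def method_name(frame):
    """Drop the trailing '(File.java:NNN)' part of a frame."""
    return re.sub(r"\(.*\)$", "", frame)


closing, calling = split_trace(TRACE)
print("Channel closed via:", method_name(closing[-1]))
print("Call in flight:", [method_name(f) for f in calling])
```

          On this trace the first half ends in `hudson.lifecycle.UnixLifecycle.restart` and the second half contains `StashManager.stash`, which would be consistent with the restart cleanup closing the channel while the stash call was still in flight.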
          Phil McArdle made changes -
          Description (changed sentence only) Original: After this, all the parallel branches hang, and have to be killed with the two-stage prompt in the console output.
          New: After this, all the parallel branches hang, and have to be killed with the two-stage attempt to cancel job, then click the prompt in the console output.
          Phil McArdle made changes -
          Description (added paragraph only) New: Most of the job is running one batch script / shell script or another, and it's almost always returning from one of these where the failure occurs.

          Jesse Glick added a comment -

          Some sort of problem with your agent connection. Unless it is reproducible from scratch, there is not much else to say. Unfortunately Remoting offers poor diagnostic capabilities currently.

          Jesse Glick made changes -
          Component/s New: workflow-durable-task-step-plugin [ 21715 ]
          Component/s Original: pipeline [ 21692 ]
          Labels New: remoting

          Phil McArdle added a comment -

          For the record, I haven't been able to reproduce this, so if you don't need the bug for any other reason it can be closed.


          Jesse Glick added a comment -

          Unfortunately there is probably nothing to be done here until diagnostics have been improved.

          Jesse Glick made changes -
          Resolution New: Incomplete [ 4 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]

            Assignee: Unassigned
            Reporter: Phil McArdle (philmcardlecg)
            Votes: 0
            Watchers: 3