
Auto retry for elastic agents after channel closure

      While my pipeline was running, the node that was executing logic terminated. I see this at the bottom of my console output:

      Cannot contact ip-172-31-242-8.us-west-2.compute.internal: java.io.IOException: remote file operation failed: /ebs/jenkins/workspace/common-pipelines-nodeploy at hudson.remoting.Channel@48503f20:ip-172-31-242-8.us-west-2.compute.internal: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ip-172-31-242-8.us-west-2.compute.internal failed. The channel is closing down or has closed down
      

      There's a spinning arrow below it.

      I have a cron script that uses the Jenkins master CLI to remove nodes which have stopped responding. When I examine this node's page in my Jenkins website, it looks like the node is still running that job, and I see an orange label that says "Feb 22, 2018 5:16:02 PM Node is being removed".

      I'm wondering what would be a better way to say "If the channel closes down, retry the work on another node with the same label"?

      Things seem stuck. Please advise.
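
      For reference, the cron-based cleanup described above could also be done as a system Groovy script on the master. This is only a hypothetical sketch (the offline/no-channel check is an assumption about what "stopped responding" means; it is not the reporter's actual script):

      import hudson.model.Slave
      import jenkins.model.Jenkins

      Jenkins.instance.computers.each { c ->
          def n = c.node
          // Only touch real agents, never the built-in node; treat a missing channel as unresponsive.
          if (n instanceof Slave && c.offline && c.channel == null) {
              println "Removing unresponsive agent: ${c.name}"
              Jenkins.instance.removeNode(n)
          }
      }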

        1. threadDump.txt
          98 kB
        2. jenkins.log
          984 kB
        3. support_2018-07-04_07.35.22.zip
          956 kB
        4. JavaMelodyGrubHeapDump_4_07_18.pdf
          220 kB
        5. NetworkAndMachineStats.png
          224 kB
        6. Thread dump [Jenkins].html
          219 kB
        7. grubSystemInformation.html
          67 kB
        8. slaveLogInMaster.grub.zip
          8 kB
        9. JavaMelodyNodeGrubThreads_4_07_18.pdf
          9 kB
        10. MonitoringJavaelodyOnNodes.html
          44 kB
        11. grub.remoting.logs.zip
          3 kB
        12. jobConsoleOutput.txt
          12 kB
        13. jobConsoleOutput.txt
          12 kB
        14. jenkins_support_2018-06-29_01.14.18.zip
          1.26 MB
        15. jenkins_agents_Thread_dump.html
          172 kB
        16. jenkins_Agent_devbuild9_System_Information.html
          66 kB
        17. jenkins_agent_devbuild9_remoting_logs.zip
          4 kB
        18. image-2018-02-22-17-28-03-053.png
          30 kB
        19. image-2018-02-22-17-27-31-541.png
          56 kB

          [JENKINS-49707] Auto retry for elastic agents after channel closure

          Oleg Nenashev added a comment -

          Not sure what the request is here. You get a system message from Remoting; it's not related to Pipeline or jobs at all in general. If you want to implement retry features or document the suggestions, IMHO it is on the Pipeline side.


          Jon B added a comment -

          oleg_nenashev I'm not sure how to handle this situation. The problem I need to overcome is that my pipeline hangs with the error message I have screenshotted above. I would much prefer that it errors out and fails. Unfortunately, the pipeline keeps running indefinitely.

          Can this instead be configured to throw a catchable exception?


          Jon B added a comment -

          oleg_nenashev Should this be redesignated a remoting bug? I'm not sure how to unblock my pipelines that are hanging from this issue.


          Jon B added a comment -

          I just changed the JIRA "component" field for this to "remoting".


          Oleg Nenashev added a comment -

          Please provide the following info:

          • Support bundle for the timeframe of the outage: https://wiki.jenkins.io/display/JENKINS/Support+Core+Plugin
          • Agent thread dump for the timeframe of the outage
          • Agent filesystem log for the time of the outage

          You can find some pointers here: https://speakerdeck.com/onenashev/day-of-jenkins-2017-dealing-with-agent-connectivity-issues?slide=51


          Shital Savekar added a comment -

          I would like to increase the Priority of this issue to "Major" since this issue is affecting a lot of users.


          Alex Slaughter added a comment -

          We have also been greatly affected by this issue. A resolution would be very nice.


          Eduardo Lezcano added a comment -

          We are receiving this message sporadically in cloud nodes in Azure managed by Jenkins.


          Federico Naum added a comment -

          Hi, 

          We are losing a team to TeamCity, mostly because of this remoting issue.

          Here are the requested logs; the disconnection happened at 10:21 am (agent devbuild9):

          jenkins_agent_devbuild9_remoting_logs.zip

          jenkins_agents_Thread_dump.html

          jenkins_Agent_devbuild9_System_Information.html

          jenkins_support_2018-06-29_01.14.18.zip

          I would appreciate any pointers on where I can start looking for more information. Let me know if you need more logs.

          This is happening several times daily, so I can provide more logs if needed.

           


          Oleg Nenashev added a comment -

          At least we have some data for diagnostics now


          Federico Naum added a comment - - edited

          This is a fresher occurrence, with fewer things going on. This time the agent that got disconnected is called grub.

          The job console output (jobConsoleOutput.txt) shows, at 17:27:54:

          hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on grub failed. The channel is closing down or has closed down  
            at hudson.remoting.Channel.call(Channel.java:948) 
            at hudson.FilePath.act(FilePath.java:1089) 
            at hudson.FilePath.act(FilePath.java:1078)  
             .....  
          17:27:55 ERROR: Issue with creating launcher for agent grub. The agent has not been fully initialized yet

           
           
           The Jenkins master log at that time (jenkins.log) shows the following lines:
           

          Jul 04, 2018 5:27:54 PM hudson.remoting.SynchronousCommandTransport$ReaderThread run
          SEVERE: I/O error in channel grub
          java.io.IOException: Unexpected termination of the channel
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
          Caused by: java.io.EOFException
          at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2328)
          at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2797)
          at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
                  ... [trimmed stacktrace]
          
          Jul 04, 2018 5:27:55 PM hudson.model.Slave reportLauncherCreateError
          WARNING: Issue with creating launcher for agent grub. The agent has not been fully initialized yetProbably there is a race condition with Agent reconnection or disconnection, check other log entries
          java.lang.IllegalStateException: No remoting channel to the agent OR it has not been fully initialized yet
          at hudson.model.Slave.reportLauncherCreateError(Slave.java:524)
          at hudson.model.Slave.createLauncher(Slave.java:496)
                  ... [trimmed stacktrace]
           
          Jul 04, 2018 5:27:55 PM hudson.model.Slave reportLauncherCreateError
          WARNING: Issue with creating launcher for agent grub. The agent has not been fully initialized yetProbably there is a race condition with Agent reconnection or disconnection, check other log entries
          java.lang.IllegalStateException: No remoting channel to the agent OR it has not been fully initialized yet
          at hudson.model.Slave.reportLauncherCreateError(Slave.java:524)
          at hudson.model.Slave.createLauncher(Slave.java:496)
                  ... [trimmed stacktrace]
           
          Jul 04, 2018 5:27:55 PM com.squareup.okhttp.internal.Platform$JdkWithJettyBootPlatform getSelectedProtocol
          INFO: ALPN callback dropped: SPDY and HTTP/2 are disabled. Is alpn-boot on the boot class path?
          Jul 04, 2018 5:27:55 PM org.jenkinsci.plugins.workflow.job.WorkflowRun finish
          INFO: rndtest_vortexLibrary/master #289 completed: ABORTED
            

           
          The agent remoting log that shows the error is the file created at 5:08 pm (remoting.log.2 inside grub.remoting.logs.zip):

          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
          Caused by: java.io.EOFException
          at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2675)
          at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3150)
          at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:859)
          at java.io.ObjectInputStream.<init>(ObjectInputStream.java:355)
          at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
          at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
          at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          
           

           


           
          But it does not have a timestamp in the message; it would be handy to have one, because I cannot work out whether the agent or the Jenkins master initiated the disconnection.
           
          I've also included 

          • The full support log (support_2018-07-04_07.35.22.zip)
          • The logs under ${JENKINS_HOME}/logs/slaves/grub (slaveLogInMaster.grub.zip)
          • Agent system information that I grabbed just minutes after seeing the disconnection:
                - System Information (grubSystemInformation.html)
                - Heap Dump (JavaMelodyGrubHeapDump_4_07_18.pdf)
                - threads (JavaMelodyNodeGrubThreads_4_07_18.pdf)
                - (MonitoringJavaelodyOnNodes.html)
             A screenshot (NetworkAndMachineStats.png) of the stats of the master (jenkinssecure1) and the agent (grub), showing the network activity, memory and CPU history. Hardly anything was going on on either machine.


          Federico Naum added a comment -

          Here are the files: jobConsoleOutput.txt, grub.remoting.logs.zip, JavaMelodyGrubHeapDump_4_07_18.pdf, JavaMelodyNodeGrubThreads_4_07_18.pdf, MonitoringJavaelodyOnNodes.html, grubSystemInformation.html, Thread dump [Jenkins].html, support_2018-07-04_07.35.22.zip, slaveLogInMaster.grub.zip, jenkins.log

          Tom Ghyselinck added a comment -

          Hi oleg_nenashev,

          Do you have any update on this?

          We have seen similar issues: The Jenkins Pipeline hangs when the node becomes unreachable at some point in time.

          It would be great to see this fixed. This issue sometimes blocks many jobs in the queue of our CI.

          In this case it was an intermittent networking issue:

          19:35:55 [Sat Jul 14 17:35:54 2018] Waiting for impl_1 to finish...
          19:37:32 /opt/Xilinx/Vivado/2018.1/bin/loader: line 194:  4860 Killed                  "$RDI_PROG" "$@"
          19:37:32 Makefile:423: recipe for target '../../work/projects/dev1/dev1.sdk/dev1.hdf' failed
          19:37:32 make: *** [../../work/projects/dev1/dev1.sdk/dev1.hdf] Error 137
          19:37:32 make: *** Waiting for unfinished jobs....
          19:38:06 /opt/Xilinx/Vivado/2018.1/bin/loader: line 194:  4859 Killed                  "$RDI_PROG" "$@"
          19:38:06 Makefile:423: recipe for target '../../work/projects/dev0/dev0.sdk/dev0.hdf' failed
          19:38:06 make: *** [../../work/projects/dev0/dev0.sdk/dev0.hdf] Error 137
          19:39:59 Cannot contact ubuntu-16-04-amd64-2: java.io.IOException: remote file operation failed: /var/jenkins/ubuntu-16-04-amd64/workspace/nts-fpga_branches_1.x-bwreq-YHXXBDM77DWWMJ4IUYUNRNT2YKWDIASW4VY4YNK2ULEAULWQGGJA/build/fpga-projects/build at hudson.remoting.Channel@cf0f3fa:ubuntu-16-04-amd64-2: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ubuntu-16-04-amd64-2 failed. The channel is closing down or has closed down
          19:40:30 /opt/Xilinx/Vivado/2018.1/bin/loader: line 194:  4861 Killed                  "$RDI_PROG" "$@"
          19:40:30 Makefile:423: recipe for target '../../work/projects/dev2/dev2.sdk/dev2.hdf' failed
          19:40:30 make: *** [../../work/projects/dev2/dev2.sdk/dev2.hdf] Error 137
          

          Finally, we aborted the build:

          Aborted by me
          09:41:29 Sending interrupt signal to process
          09:41:39 After 10s process did not stop
          

          Please note that in the post steps, we see the errors occur but the build no longer hangs here:

          Error when executing always post condition:
          java.io.IOException: remote file operation failed: /var/jenkins/ubuntu-16-04-amd64/workspace/nts-fpga_branches_1.x-bwreq-YHXXBDM77DWWMJ4IUYUNRNT2YKWDIASW4VY4YNK2ULEAULWQGGJA/packages at hudson.remoting.Channel@cf0f3fa:ubuntu-16-04-amd64-2: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ubuntu-16-04-amd64-2 failed. The channel is closing down or has closed down
          	at hudson.FilePath.act(FilePath.java:1043)
          	at hudson.FilePath.act(FilePath.java:1025)
          	at hudson.FilePath.mkdirs(FilePath.java:1213)
          	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:79)
          	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:67)
          	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:50)
          	at hudson.security.ACL.impersonate(ACL.java:290)
          	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:47)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          Caused by: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ubuntu-16-04-amd64-2 failed. The channel is closing down or has closed down
          	at hudson.remoting.Channel.call(Channel.java:948)
          	at hudson.FilePath.act(FilePath.java:1036)
          	... 12 more
          Caused by: java.io.IOException: Unexpected termination of the channel
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
          Caused by: java.io.EOFException
          	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2679)
          	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3154)
          	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
          	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
          	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
          	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          
          [Pipeline] cleanWs
          Error when executing cleanup post condition:
          java.io.IOException: remote file operation failed: /var/jenkins/ubuntu-16-04-amd64/workspace/nts-fpga_branches_1.x-bwreq-YHXXBDM77DWWMJ4IUYUNRNT2YKWDIASW4VY4YNK2ULEAULWQGGJA at hudson.remoting.Channel@cf0f3fa:ubuntu-16-04-amd64-2: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ubuntu-16-04-amd64-2 failed. The channel is closing down or has closed down
          	at hudson.FilePath.act(FilePath.java:1043)
          	at hudson.FilePath.act(FilePath.java:1025)
          	at hudson.FilePath.mkdirs(FilePath.java:1213)
          	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:79)
          	at org.jenkinsci.plugins.workflow.steps.CoreStep$Execution.run(CoreStep.java:67)
          	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1$1.call(SynchronousNonBlockingStepExecution.java:50)
          	at hudson.security.ACL.impersonate(ACL.java:290)
          	at org.jenkinsci.plugins.workflow.steps.SynchronousNonBlockingStepExecution$1.run(SynchronousNonBlockingStepExecution.java:47)
          	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          	at java.lang.Thread.run(Thread.java:748)
          Caused by: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ubuntu-16-04-amd64-2 failed. The channel is closing down or has closed down
          	at hudson.remoting.Channel.call(Channel.java:948)
          	at hudson.FilePath.act(FilePath.java:1036)
          	... 12 more
          Caused by: java.io.IOException: Unexpected termination of the channel
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
          Caused by: java.io.EOFException
          	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2679)
          	at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3154)
          	at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:862)
          	at java.io.ObjectInputStream.<init>(ObjectInputStream.java:358)
          	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:48)
          	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
          	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
          

          I hope this somewhat helps.

          With best regards,
          Tom.


          Oleg Nenashev added a comment -

          tom_ghyselinck nope, I don't. I have requested the info needed to diagnose the issue, but I have never reviewed it. It is unlikely I will have time for that in the short term; I am busy with other stuff in the community. jthompson is the current Remoting default assignee, so I will assign the issue to him.

           


          Tom Ghyselinck added a comment -

          Hi oleg_nenashev,

          Thanks!

          P.S. I set the Assignee to "Automatic" and it assigned you; it probably needs a change in the component configuration to set it to jthompson by default?

          With best regards,
          Tom.


          Oleg Nenashev added a comment -

          No, I am just the default assignee of the "_unsorted" component, which was first in the component list. The "remoting" component is configured properly, and I have just removed "_unsorted" for now since we have got the diagnostics info.


          Federico Naum added a comment -

          Has anyone experiencing this issue had a look at this new plugin: https://plugins.jenkins.io/remoting-kafka?

          oleg_nenashev I can see you are very involved with it.

          It looks promising but is lacking some documentation. I'll play with it to see if I can get it working, and report back on whether it solves my connection issues.


          Jon B added a comment -

          The repro case here is pretty simple:

          1) Create a parallel job (even a job that just does a sleep)

          2) Terminate the executor's host while it's running

          It hangs with this error every time.

           


          Jon B added a comment - - edited

          I don't mean to be dramatic but this is literally the biggest problem in all of Jenkins as far as I can tell. If we lose an ec2 host while an executor is doing parallel work, we badly need for the parallel item to restart on another healthy executor. When it just plain hangs, we can't do that and the user experience of hanging is not acceptable.

          I would recommend elevating the urgency here to the highest level to get this triaged.


          Jon B added a comment - - edited

          Repro code:

          def jobs = [:]
          jobs["Do Work"] = getWork()
          parallel jobs
          println "Parallel run completed."
          
          def getWork() {
            return {
              node('general') {
                sh """|#!/bin/bash
                      |set -ex
                      |echo "going to sleep..."
                      |sleep 300
                      |echo "yay I made it to the end."
                      |""".stripMargin()
              }
            }
          }
           

          To repro, run this pipeline and once the control flow hits the sleep, terminate the executor's host and it will hang with something like this:

          [Do Work] Cannot contact ip-172-31-237-68.us-west-2.compute.internal: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ip-172-31-237-68.us-west-2.compute.internal failed. The channel is closing down or has closed down 

          It hangs with this error every time I try it.


          Michael McCallum added a comment -

          I concur this is a pretty serious issue. I've tried a number of workarounds like timeouts to restart the job, but once it hangs it's stuck.


          Michael Greco added a comment - - edited

          I've been noticing this for MONTHS. And in case people don't realize it, the master branch of the docker-plugin wasn't building today (9/17/18):

          https://ci.jenkins.io/job/Plugins/job/docker-plugin/ 

          Anyways, this weekend I loaded docker-plugin build 1.1.5, and today on every build I was getting "The channel is closing down or has closed down"; my jobs would still appear to be running even though the container was obviously gone.

           

          I wound up downgrading to an older build I have:

          1.1.5-SNAPSHOT (private-554bbf8a-win2012-6d34b0$)

          in which the problem seems to happen less. I went so far as to rebuild some of my "build containers" as they are created "FROM jenkinsci/slave" and I noticed that has had an update sometime in August.

           

          Again, it made no difference using the "released 1.1.5" version of docker-plugin (everything wound up in the state of "The channel is closing down or has closed down"), and that's when I noticed the master branch isn't building either ... so I just went back to my earlier build.


          Jon B added a comment -

          If left unfixed for much longer, our organization is going to be forced to use another technology for CICD since this is causing widespread pain and confidence lost in this technology among our hundreds of developers who are using Jenkins at our company.


          Michael Greco added a comment - - edited

          Maybe try the LTS? ... uggg, I try to be a "start-up" kind of guy ... it sounds like there are maybe some integration tests that need to be part of the project ... If you have access to spin up another VM, maybe launch the LTS version and try it out. I keep the Jenkins data in a docker volume, so moving between these versions to try stuff out on different docker hosts for exactly these situations is helpful. I'm running 2.140, but maybe the plugin works better with the LTS? (OK, I'm reaching outside the box, because if the plugin has a bug then...)


          Jon B added a comment - - edited

          I just met with jglick who told me that in my case, the underlying mechanism that triggers when I call for "node('general')" selects one of my spot instances. At that moment, an internal ec2 hostname is selected. If, while the work is being performed, that particular node dies, Jenkins intentionally waits for another machine at that hostname to wake up before it will continue. It is for this reason why it appears to hang forever - because my AWS spot/autoscaling does not launch another machine with the same internal hostname.

          He suggested setting a timeout block which would retry the test run if the work does not complete within a given period.

          We both agreed this seems to therefore be a new feature request.

          The new feature would allow Jenkins to re-dispatch the closure of work to any other node that matches the given label if the original executor's host was terminated while the work was being performed.
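
          For illustration, a minimal sketch of the suggested workaround, assuming the 'general' label from the repro above and an arbitrary 30-minute limit (this is not an official fix, and whether the timeout fires for a dead channel can depend on where it wraps the node block and on plugin versions):

          // Cap how long the allocated work may take, and retry the whole closure if it fails or times out.
          retry(2) {
              timeout(time: 30, unit: 'MINUTES') {
                  node('general') {            // assumed label
                      sh 'sleep 300'           // placeholder for the real work
                  }
              }
          }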


          Michael McCallum added a comment - - edited

          I tried using a timeout block but it never triggers; does anyone have an example of that working?

          That's Jenkins 2.141 on Kubernetes with Kubernetes agents, with the latest plugins as of a few days ago.
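
          One variant sometimes suggested (a sketch only, not verified against this setup) is to keep the timeout outside the node allocation so it runs on the controller, and, if your Pipeline: Basic Steps version supports it, use the activity option so the timer counts quiet time rather than absolute duration:

          timeout(time: 15, unit: 'MINUTES', activity: true) {   // fires after 15 minutes with no log output
              node('general') {                                  // assumed label
                  sh 'make build'                                // placeholder command
              }
          }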


          Michael Greco added a comment - - edited

          In my case the original node doesn't die ... I'm not using AWS autoscaling ...


          Federico Naum added a comment -

          I agree with piratejohnny that this is a critical issue. We have one of our teams switching to TeamCity. In the meantime, I'm trying to attack the problem using the new kafka agent plugin. In my tests it seems quite stable, and I'm not encountering the frequent channel disconnections when running parallel jobs, so I will be deploying it to production this week.

          I agree as well that retrying on a new node that satisfies the labels can be a different issue, but I would also say it should be top priority.

           

          PS: We are also not using AWS


          Michael Greco added a comment - - edited

          This is all fine and well, and not to complain, but why is the connection going away? I'll blame myself first (that's experience) and say I'm sure I didn't read something ... or maybe missed something that was said in this report?

           

          It just feels like the issue of this report got changed from "connection closed" to "Auto retry for elastic agents after channel closure", but I'm not seeing my AWS instance die as Jon B is. Can someone enlighten me, please?

           

          Or maybe this is really just some issue where the docker plugin isn't able to reach the container anymore, and the bug is in the retry logic? Why is the channel prematurely going down in the first place? The "closed channel" message does seem to happen during longer-running requests.


          Jon B added a comment -

          mgreco2k I think the issue leading me to this error message is a different set of circumstances. I'm not using the docker plugin for example. You might want to open a new ticket.

          My case is 100% based on how Jenkins is meant to work - it's trying to wait for the node that disconnected to come back up. However, in the case of cloud elastic computing, the worker will never come back up, and that's why I see the hang. It is for this reason that the title was adjusted and also how the ticket is filed.


          Michael Greco added a comment -

          Uggg .. ok ... thanks.

          Mike


          Viacheslav Dubrovskyi added a comment -

          fnaum How does the kafka plugin behave in the event of a node shutdown?


          Federico Naum added a comment -

          There is an issue where the Jenkins master does not reflect a kafka agent disconnection (I have logged this issue: https://issues.jenkins-ci.org/browse/JENKINS-54001).

          • If I reboot an agent and then trigger a build asking for that agent, Jenkins keeps waiting, and when the agent comes back online it runs the job to completion.
          • If the agent does not come online, it will eventually time out at some point, fail the build and mark the agent as offline.
          • If I reboot an agent or stop the remoting process while it is running a job on that agent, Jenkins keeps waiting for the agent or the process to come back online, after printing this line:
            Cannot contact AGENTNAME: java.lang.InterruptedException
            • When it gets back online, it then fails with
              wrapper script does not seem to be touching the log file in /var/kafka/jenkins/workspace/demo@tmp/durable-ec4fef48
              (JENKINS-48300: if on a laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=300)

           

          I thought I had logged https://issues.jenkins-ci.org/browse/JENKINS-53397 for this issue, but it is a bit different.

          Even this situation is not ideal. The kafka agents are much more reliable, and I do not get the ChannelClosedException when running parallel builds. So for me it is more stable, even if recovery from an agent shutdown is not ideal.

           

          Note: these tests were with kafka plugin 1.1.1 (1.1.3 is out, so I will redo the test once I upgrade to that latest version).

           

          • I wrote this systemd unit for my CentOS 7 setup, so the agent reconnects when it is rebooted or the process dies for some reason:

           

          [Unit]
          Description=Jenkins kafka agent
          After=network.target
          
          [Service]
          Type=simple
          Restart=always
          RestartSec=1
          User=buildboy
          Environment=PATH=/usr/lib64/ccache:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/bin/X11:/sbin:/usr/local/sbin
          
          ExecStart=/usr/bin/java -jar /var/kafka/remoting-kafka-agent.jar -name AGENTNAME -master http://myjenkinsinstance:8081/ -secret 611c91c8013e27b8b00e36d66e421a1743604230862f4d290a87b9426a2b3b1f -kafkaURL kafka:9092 -noauth
          
          [Install]
          WantedBy=multi-user.target
          

           

           

           

           


          Viacheslav Dubrovskyi added a comment -

          fnaum thank you for the information. It does not solve the main problem. I checked ssh and swarm agents, and this problem does not depend on the agent's connection type.

          piratejohnny's explanation, that recreating the node with the same hostname and label works, helped me. I use GCE for nodes and a custom script to add or remove them, so I can easily add logic to detect removed nodes and re-add them.
          It's a pity that none of the cloud plugins can do this.


          Amir Barkal added a comment - - edited

          The problem is Jenkins not aborting / cancelling / stopping / whatever the build when the agent is terminated in the middle of a build.
          There's an infinite loop that's easy to reproduce:

          1. Start Jenkins slave with remoting jnlp jar:

          java -jar agent.jar -jnlpUrl "http://jenkins:8080/computer/agent1/slave-agent.jnlp" -secret 123
          

          2. Run the following pipeline:

          node('agent1') {
              sh('sleep 100000000')   
          }
          

          3. Kill the agent (Ctrl+C)

          4. Jenkins output in job console log:

          Started by user admin
          Replayed #23
          Running as admin
          Running in Durability level: MAX_SURVIVABILITY
          [Pipeline] node
          Running on agent1-805fa9fd in /workspace/Pipeline-1
          [Pipeline] {
          [Pipeline] sh
          [Pipeline-1] Running shell script
          + sleep 100000000
          Cannot contact agent-805fa9fd: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from 2163fbb04240.jenkins/172.20.0.3:43902 failed. The channel is closing down or has closed down
          

          threadDump.txt

          System info:
          Jenkins ver. 2.138
          Durable Task: 1.25

          What I would like is a way to configure a maximum timeout for the Jenkins master to wait for the agent to respond, and then just abort the build. It's absolutely unacceptable that builds will hang due to dead agents.


          Jonathan Rogers added a comment -

          Like Amir Barkal, I would like a pipeline step to fail quickly if the Jenkins master loses its connection to the agent for the node running the step. The log mentions that hudson.remoting.ChannelClosedException was thrown. If I can catch that exception in my pipeline script, I can retry the appropriate steps.
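          A rough sketch of the guard Jonathan is describing, assuming a scripted pipeline running with the relevant signatures approved; the label and command are placeholders, and matching the exception is a heuristic since the remoting failure is often nested inside other exception types:

          try {
              node('linux') {          // placeholder label
                  sh 'make test'       // placeholder command
              }
          } catch (Exception x) {
              // Walk the cause chain; ChannelClosedException is usually wrapped rather than thrown directly.
              def cause = x
              def channelClosed = false
              while (cause != null) {
                  if (cause instanceof hudson.remoting.ChannelClosedException) {
                      channelClosed = true
                      break
                  }
                  cause = cause.cause
              }
              if (channelClosed) {
                  echo 'Agent channel closed; callers could retry these steps on another node here'
              } else {
                  throw x
              }
          }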

          Joshua Spiewak added a comment -

          FWIW, we use the EC2 Fleet Plugin and regularly experience this issue.

          It would be great if agents had an attribute to indicate whether they are durable/long-lived or dynamic/transient. That way the channel closure could be handled appropriately for each scenario. At the very least, a global config controlling whether agent disconnection is fatal to a build would allow pipeline authors to handle the disconnection explicitly, without resorting to putting timeouts in place.

          Troni Dale Atillo added a comment - - edited

          I have this problem too. Our script triggers a reboot of the slave machine, and we added a sleep to wait for the slave to come back. When the slave comes back in the middle of the executing node block and the pipeline continues execution, we get this:

          hudson.remoting.ChannelClosedException: Channel "unknown": .... The channel is closing down or has closed down
           

          I noticed that when the agent was disconnected, the workspace we were using before the disconnection seems to be locked when it comes back. Any operation that requires execution in that workspace seems to cause this error; it looks like that workspace cannot be used anymore. My script was run in parallel too.

          The workaround I tried was to run the next part of the script in a different workspace, and it works (a fuller sketch follows after this comment).

          ws (...){ 
          //other scripts need to be executed after the disconnection 
          }

           

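          A minimal sketch of that workaround, assuming a scripted pipeline; the agent label and the '@after-reconnect' workspace suffix are illustrative:

          node('my-agent') {
              // Run the post-reconnection steps in a different workspace directory so the
              // stale lock on the original workspace is never touched.
              ws("${env.WORKSPACE}@after-reconnect") {
                  sh 'echo continuing after the agent reconnected'
              }
          }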

          Jesse Glick added a comment - - edited

          There are actually several subcases mixed together here.

          1. The originally reported RFE: if something like a spot instance is terminated, we would like to retry the whole node block.
          2. If an agent gets disconnected but continues to be registered in Jenkins, we would like to eventually abort the build. (Not immediately, since sometimes there is just a transient Remoting channel outage or agent JVM crash or whatever; if the agent successfully reconnects, we want to continue processing output from the durable task, which should not have been affected by the outage.)
          3. If an agent goes offline and is removed from the Jenkins configuration, we may as well immediately abort the build, since it is unlikely it would be reattached under the same name with the same processes still running. (Though this can happen when using the Swarm plugin.)
          4. If an agent is removed from the Jenkins configuration and Jenkins is restarted, we may as well abort the build, as in #3.

          #4 was addressed by JENKINS-36013. I filed workflow-durable-task-step #104 for #3. For this to be effective, cloud provider plugins need to actually remove dead agents automatically (at some point); it will take some work to see if this is so, and if not, whether that can be safely changed.

          #2 is possible but a little trickier, since some sort of timeout value needs to be defined.

          #1 would be a rather different implementation and would certainly need to be opt-in (somehow TBD).


          Artem Stasiuk added a comment -

          For the first one, could we use something like:

          // Presumably in a Computer subclass that also implements hudson.model.ExecutorListener,
          // since isOffline() and getOfflineCause() are Computer methods.
          @Override
          public void taskCompleted(Executor executor, Queue.Task task, long durationMS) {
              super.taskCompleted(executor, task, durationMS);
              // If the agent went offline while the task was running, put the task back in the queue.
              if (isOffline() && getOfflineCause() != null) {
                  System.out.println("Opa, try to resubmit");
                  Queue.getInstance().schedule(task, 10);
              }
          }
          


          Olivier Boudet added a comment -

          This issue appears in the release notes of kubernetes plugin 1.17.0, so I assume it should be fixed?

          I upgraded to 1.17.1 and I still encounter it.

          My job has been blocked for more than one hour on this error:

          Cannot contact openjdk8-slave-5vff7: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from 10.8.4.28/10.8.4.28:35920 failed. The channel is closing down or has closed down

          The slave pod has been evicted by k8s:

          $ kubectl -n tools describe pods openjdk8-slave-5vff7
          ....
          Normal Started 57m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Started container
          Warning Evicted 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 The node was low on resource: memory. Container jnlp was using 4943792Ki, which exceeds its request of 0.
          Normal Killing 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Killing container with id docker://openjdk:Need to kill Pod
          Normal Killing 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Killing container with id docker://jnlp:Need to kill Pod
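          Not a fix for this issue, but for this particular eviction ("exceeds its request of 0") giving the jnlp container a real memory request usually keeps the kubelet from evicting the agent pod first. A sketch using the kubernetes plugin's podTemplate; the label, image and sizes are illustrative:

          podTemplate(label: 'openjdk8-with-requests', containers: [
              // Override the default jnlp container so the pod carries a memory request/limit
              // instead of the default request of 0, which makes it the first eviction candidate.
              containerTemplate(name: 'jnlp', image: 'jenkins/inbound-agent:latest',
                                resourceRequestMemory: '1Gi', resourceLimitMemory: '4Gi')
          ]) {
              node('openjdk8-with-requests') {
                  sh 'java -version'
              }
          }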

          Jesse Glick added a comment -

          orgoz subcase #3 as above should be addressed in recent releases: if an agent pod is deleted then the corresponding build should abort in a few minutes. There is not currently any logic which would do the same after a PodPhase: Failed. That would be a new RFE.


          Jon B added a comment -

          jglick Just wanted to thank you and everybody else who've been working on Jenkins, and to confirm that the work over on https://issues.jenkins-ci.org/browse/JENKINS-36013 appears to have handled this case in a much better way. I consider the current behavior to be a major step in the right direction for Jenkins. Here's what I noticed:

          Last night, our Jenkins worker pool did its normal scheduled nightly scale down and one of the pipelines got disrupted. The message I see in my affected pipeline's console log is:
          Agent ip-172-31-235-152.us-west-2.compute.internal was deleted; cancelling node body
          The above-mentioned hostname is the one that Jenkins selected at the top of my declarative pipeline as a result of my call for a 'universal' machine (universal is how we label all of our workers):
          pipeline {
              agent { label 'universal' }
              ...
          This particular declarative pipeline tries to "sh" to the console at the end inside a post{} section and clean up after itself, but since the node was lost, the next error that also appears in the Jenkins console log is:
          org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing
          This error was the result of the following code:
          post {
              always {
                  sh """|#!/bin/bash
                        |set -x
                        |docker ps -a -q | xargs --no-run-if-empty docker rm -f || true
                        """.stripMargin()
          ...
          Let me just point out that the recent Jenkins advancements are fantastic. Before JENKINS-36013, this pipeline would have just been stuck with no error messages. I'm so happy with this progress you have no idea.

          Now if there's any way to get this to actually retry the step it was on, such that the pipeline can tolerate losing the node, we would have the best of all worlds. At my company, the fact that a node is deleted during a scale-down is a confusing, irrelevant problem for one of my developers to grapple with. The job of my developers (the folks writing Jenkins pipelines) is to write idempotent pipeline steps, and my job is to make sure all of the developers' steps trigger and the pipeline concludes with a high degree of durability.

          Keep up the great work you are all doing. This is great.


          Jesse Glick added a comment -

          The MissingContextVariableException is tracked by JENKINS-58900. That is just a bad error message, though; the point is that the node is gone.

          if there's any way to get this to actually retry the step it was on such that the pipeline can actually tolerate losing the node

          Well that is the primary subject of this RFE, my “subcase #1” above. Pending a supported feature, you might be able to hack something up in a trusted Scripted library like

          while (true) {
            try {
              node('spotty') {
                sh '…'
              }
              break
            } catch (x) {
              if (x instanceof org.jenkinsci.plugins.workflow.steps.FlowInterruptedException &&
                  x.causes*.getClass().contains(org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.RemovedNodeCause)) {
                continue
              } else {
                throw x
              }
            }
          }
          

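          One way to package that hack so individual pipelines don't repeat it is a custom step in a trusted shared library. A sketch only; nodeWithRetry and vars/nodeWithRetry.groovy are hypothetical names, and the body must be idempotent because it is re-run from the top:

          // vars/nodeWithRetry.groovy (hypothetical) -- wraps the retry loop above as a reusable step.
          def call(String label, Closure body) {
              while (true) {
                  try {
                      node(label) {
                          body()
                      }
                      return
                  } catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException x) {
                      // Retry only when the interruption was caused by the node being removed.
                      if (x.causes*.getClass().contains(org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.RemovedNodeCause)) {
                          continue
                      }
                      throw x
                  }
              }
          }

          A pipeline would then call, for example, nodeWithRetry('spotty') { sh 'echo build' }.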

          Andrey Babushkin added a comment -

          We use the kubernetes plugin with our bare-metal kubernetes cluster, and the problem is that a pipeline can run indefinitely if the agent inside the pod is killed or the underlying node is restarted. Is there any option to tweak this behavior, e.g. some timeout setting (other than an explicit timeout step)?

          Jesse Glick added a comment -

          oxygenxo that should have already been fixed—see linked PRs.


          Jesse Glick added a comment -

          A very limited variant of this concept (likely not compatible with Pipeline) is implemented in the EC2 Fleet plugin: https://github.com/jenkinsci/ec2-fleet-plugin/blob/2d4ed2bd0b05b1b3778ec7508923e21db0f9eb7b/src/main/java/com/amazon/jenkins/ec2fleet/EC2FleetAutoResubmitComputerLauncher.java#L87-L108

            Assignee: Jesse Glick (jglick)
            Reporter: Jon B (piratejohnny)
            Votes: 37
            Watchers: 54