  Jenkins / JENKINS-49707

Auto retry for elastic agents after channel closure

      While my pipeline was running, the node executing it terminated. I see this at the bottom of my console output:

      Cannot contact ip-172-31-242-8.us-west-2.compute.internal: java.io.IOException: remote file operation failed: /ebs/jenkins/workspace/common-pipelines-nodeploy at hudson.remoting.Channel@48503f20:ip-172-31-242-8.us-west-2.compute.internal: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ip-172-31-242-8.us-west-2.compute.internal failed. The channel is closing down or has closed down
      

      There's a spinning arrow below it.

      I have a cron script that uses the Jenkins master CLI to remove nodes that have stopped responding. When I examine this node's page in the Jenkins UI, it looks like the node is still running that job, and I see an orange label that says "Feb 22, 2018 5:16:02 PM Node is being removed".

      I'm wondering: what would be a better way to say "if the channel closes down, retry the work on another node with the same label"?

      Things seem stuck. Please advise.

        1. grub.remoting.logs.zip
          3 kB
        2. grubSystemInformation.html
          67 kB
        3. image-2018-02-22-17-27-31-541.png
          56 kB
        4. image-2018-02-22-17-28-03-053.png
          30 kB
        5. JavaMelodyGrubHeapDump_4_07_18.pdf
          220 kB
        6. JavaMelodyNodeGrubThreads_4_07_18.pdf
          9 kB
        7. jenkins_agent_devbuild9_remoting_logs.zip
          4 kB
        8. jenkins_Agent_devbuild9_System_Information.html
          66 kB
        9. jenkins_agents_Thread_dump.html
          172 kB
        10. jenkins_support_2018-06-29_01.14.18.zip
          1.26 MB
        11. jenkins.log
          984 kB
        12. jobConsoleOutput.txt
          12 kB
        13. jobConsoleOutput.txt
          12 kB
        14. MonitoringJavaelodyOnNodes.html
          44 kB
        15. NetworkAndMachineStats.png
          224 kB
        16. slaveLogInMaster.grub.zip
          8 kB
        17. support_2018-07-04_07.35.22.zip
          956 kB
        18. threadDump.txt
          98 kB
        19. Thread dump [Jenkins].html
          219 kB

          [JENKINS-49707] Auto retry for elastic agents after channel closure

          Troni Dale Atillo added a comment - edited

          I have this problem too. Our script triggers a reboot of the slave machine, and we added a sleep to wait for the slave to come back. Once the slave came back in the middle of the executing node block and our pipeline continued execution, we got this:

          hudson.remoting.ChannelClosedException: Channel "unknown": .... The channel is closing down or has closed down
           

          I noticed that when the agent was disconnected, the workspace we were using before the disconnection seems to be locked once the agent comes back. Any operation that requires execution in that workspace seems to cause this error; it looks like that workspace cannot be used anymore. My script was run in parallel too.

          The workaround I tried was to run the next execution (the rest of the script) in a different workspace, and it works:

          ws(...) {
              // other scripts that need to be executed after the disconnection
          }
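
          For context, a rough sketch of that workaround inside a node block; the label and the "-retry" workspace suffix below are only placeholders, and the actual path passed to ws is whatever fresh directory you choose:

          node('mylabel') {                    // placeholder label
              // ... steps that ran before the agent dropped and reconnected ...

              // Switch to a different workspace, since the original one appears
              // locked after the reconnect; any path other than the original works.
              ws("${env.WORKSPACE}-retry") {   // "-retry" suffix is illustrative
                  sh 'make test'               // placeholder for the remaining steps
              }
          }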

           


          Jesse Glick added a comment - edited

          There are actually several subcases mixed together here.

          1. The originally reported RFE: if something like a spot instance is terminated, we would like to retry the whole node block.
          2. If an agent gets disconnected but continues to be registered in Jenkins, we would like to eventually abort the build. (Not immediately, since sometimes there is just a transient Remoting channel outage or agent JVM crash or whatever; if the agent successfully reconnects, we want to continue processing output from the durable task, which should not have been affected by the outage.)
          3. If an agent goes offline and is removed from the Jenkins configuration, we may as well immediately abort the build, since it is unlikely it would be reattached under the same name with the same processes still running. (Though this can happen when using the Swarm plugin.)
          4. If an agent is removed from the Jenkins configuration and Jenkins is restarted, we may as well abort the build, as in #3.

          #4 was addressed by JENKINS-36013. I filed workflow-durable-task-step #104 for #3. For this to be effective, cloud provider plugins need to actually remove dead agents automatically (at some point); it will take some work to see if this is so, and if not, whether that can be safely changed.

          #2 is possible but a little trickier, since some sort of timeout value needs to be defined.

          #1 would be a rather different implementation and would certainly need to be opt-in (somehow TBD).
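
          In the meantime, a pipeline author can at least bound the hang from #2 with the standard timeout step; a rough sketch, where the label, script, and 30-minute value are purely illustrative:

          // Bounds how long the build can hang if the agent channel never comes back;
          // the timeout step aborts the body when the limit is reached.
          node('universal') {                  // placeholder label
              timeout(time: 30, unit: 'MINUTES') {
                  sh './run-tests.sh'          // placeholder step
              }
          }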


          Artem Stasiuk added a comment -

          For the first one, could we use something like this?

          @Override
          public void taskCompleted(Executor executor, Queue.Task task, long durationMS) {
              super.taskCompleted(executor, task, durationMS);
              // If the computer went offline while the task was running, resubmit the
              // task to the queue (10-second quiet period) so another node can pick it up.
              if (isOffline() && getOfflineCause() != null) {
                  System.out.println("Opa, try to resubmit");
                  Queue.getInstance().schedule(task, 10);
              }
          }
          


          Olivier Boudet added a comment -

          This issue appears in the release notes of kubernetes plugin 1.17.0, so I assume it should be fixed?

          I upgraded to 1.17.1 and I still encounter it.

          My job has been blocked for more than one hour on this error:

          Cannot contact openjdk8-slave-5vff7: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on JNLP4-connect connection from 10.8.4.28/10.8.4.28:35920 failed. The channel is closing down or has closed down 
          

          The slave pod has been evicted by k8s:

          $ kubectl -n tools describe pods openjdk8-slave-5vff7
          ....
          Normal Started 57m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Started container
          Warning Evicted 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 The node was low on resource: memory. Container jnlp was using 4943792Ki, which exceeds its request of 0.
          Normal Killing 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Killing container with id docker://openjdk:Need to kill Pod
          Normal Killing 53m kubelet, gke-cluster-1-pool-0-da2236b1-vdd3 Killing container with id docker://jnlp:Need to kill Pod
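
          The eviction itself can be avoided by giving the jnlp container an explicit memory request and limit; a rough sketch using the kubernetes plugin's containerTemplate, where the image name and sizes are only illustrative:

          // Pod template whose jnlp container declares memory resources, so the kubelet
          // no longer sees a request of 0 when deciding which pods to evict first.
          podTemplate(label: 'openjdk8-slave', containers: [
              containerTemplate(name: 'jnlp',
                                image: 'jenkins/jnlp-slave:latest',   // image is illustrative
                                resourceRequestMemory: '1Gi',
                                resourceLimitMemory: '2Gi')
          ]) {
              node('openjdk8-slave') {
                  sh 'java -version'   // placeholder step
              }
          }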
          

           

           

           


          Jesse Glick added a comment -

          orgoz subcase #3 as above should be addressed in recent releases: if an agent pod is deleted then the corresponding build should abort in a few minutes. There is not currently any logic which would do the same after a PodPhase: Failed. That would be a new RFE.


          Jon B added a comment -

          jglick Just wanted to thank you and everybody else who's been working on Jenkins, and to confirm that the work over on https://issues.jenkins-ci.org/browse/JENKINS-36013 appears to have handled this case in a much better way. I consider the current behavior a major step in the right direction for Jenkins. Here's what I noticed:

          Last night, our Jenkins worker pool did its normal scheduled nightly scale down and one of the pipelines got disrupted. The message I see in my affected pipeline's console log is:
          Agent ip-172-31-235-152.us-west-2.compute.internal was deleted; cancelling node body
          The above-mentioned hostname is the one that Jenkins selected at the top of my declarative pipeline as a result of my call for a 'universal' machine (universal is how we label all of our workers):

          pipeline {
              agent { label 'universal' }
              ...
          This particular declarative pipeline runs an "sh" step at the end, inside a post{} section, to clean up after itself; but since the node was lost, the next error that appears in the Jenkins console log is:
          org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing
          This error was the result of the following code:
          post {
              always {
                  sh """|#!/bin/bash
                        |set -x
                        |docker ps -a -q | xargs --no-run-if-empty docker rm -f || true
                        |""".stripMargin()
          ...
          Let me just point out that the recent Jenkins advancements are fantastic. Before JENKINS-36013, this pipeline would have just been stuck with no error messages. I'm so happy with this progress you have no idea.

          Now, if there were any way to get this to actually retry the step it was on, such that the pipeline can tolerate losing the node, we would have the best of all worlds. At my company, the fact that a node was deleted during a scale-down is a confusing, irrelevant problem for one of my developers to grapple with. The job of my developers (the folks writing Jenkins pipelines) is to write idempotent pipeline steps, and my job is to make sure all of their steps run and the pipeline concludes with a high degree of durability.

          Keep up the great work you are all doing. This is great.


          Jesse Glick added a comment -

          The MissingContextVariableException is tracked by JENKINS-58900. That is just a bad error message, though; the point is that the node is gone.

          if there's any way to get this to actually retry the step it was on such that the pipeline can actually tolerate losing the node

          Well that is the primary subject of this RFE, my “subcase #1” above. Pending a supported feature, you might be able to hack something up in a trusted Scripted library like

          while (true) {
            try {
              node('spotty') {
                sh '…'
              }
              break
            } catch (x) {
              if (x instanceof org.jenkinsci.plugins.workflow.steps.FlowInterruptedException &&
                  x.causes*.getClass().contains(org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.RemovedNodeCause)) {
                continue
              } else {
                throw x
              }
            }
          }
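
          In a trusted Scripted library this could be wrapped up as a custom step so individual pipelines do not repeat the boilerplate; a minimal sketch, where the step name retryOnRemovedNode is made up for illustration:

          // vars/retryOnRemovedNode.groovy in a trusted shared library (hypothetical name)
          def call(String label, Closure body) {
              while (true) {
                  try {
                      node(label) {
                          body()
                      }
                      return
                  } catch (x) {
                      // Retry only when the node block was interrupted because the agent was removed.
                      if (x instanceof org.jenkinsci.plugins.workflow.steps.FlowInterruptedException &&
                          x.causes*.getClass().contains(org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.RemovedNodeCause)) {
                          continue
                      }
                      throw x
                  }
              }
          }

          A pipeline would then call, e.g., retryOnRemovedNode('spotty') { sh '…' }.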
          


          Andrey Babushkin added a comment -

          We use the kubernetes plugin with our bare-metal kubernetes cluster, and the problem is that a pipeline can run indefinitely if the agent inside the pod is killed or the underlying node is restarted. Is there any option to tweak this behavior, e.g. some timeout setting (other than an explicit timeout step)?


          Jesse Glick added a comment -

          oxygenxo that should have already been fixed—see linked PRs.


          Jesse Glick added a comment -

          A very limited variant of this concept (likely not compatible with Pipeline) is implemented in the EC2 Fleet plugin: https://github.com/jenkinsci/ec2-fleet-plugin/blob/2d4ed2bd0b05b1b3778ec7508923e21db0f9eb7b/src/main/java/com/amazon/jenkins/ec2fleet/EC2FleetAutoResubmitComputerLauncher.java#L87-L108

            Assignee: Jesse Glick (jglick)
            Reporter: Jon B (piratejohnny)
            Votes: 37
            Watchers: 54