JENKINS-49707

Auto retry for elastic agents after channel closure


    Description

      While my pipeline was running, the node that was executing it terminated. I see this at the bottom of my console output:

      Cannot contact ip-172-31-242-8.us-west-2.compute.internal: java.io.IOException: remote file operation failed: /ebs/jenkins/workspace/common-pipelines-nodeploy at hudson.remoting.Channel@48503f20:ip-172-31-242-8.us-west-2.compute.internal: hudson.remoting.ChannelClosedException: Channel "unknown": Remote call on ip-172-31-242-8.us-west-2.compute.internal failed. The channel is closing down or has closed down
      

      There's a spinning arrow below it.

      I have a cron script that uses the Jenkins master CLI to remove nodes which have stopped responding. When I examine this node's page in the Jenkins UI, it looks like the node is still running that job, and I see an orange label that says "Feb 22, 2018 5:16:02 PM Node is being removed".

      I'm wondering whether there is a better way to say "If the channel closes down, retry the work on another node with the same label."

      Things seem stuck. Please advise.
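
      (For what it's worth, this is roughly the shape the eventual fix took: recent workflow-basic-steps releases let the retry step retry on agent loss. A minimal sketch, assuming a plugin version that supports the agent() and nonresumable() retry conditions; the label and the sh command are illustrative:)

      retry(count: 2, conditions: [agent(), nonresumable()]) {
          node('universal') {
              // If the agent is lost (channel closed, node removed), the
              // whole block is retried, grabbing a fresh node by label.
              sh 'make'
          }
      }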

      Attachments

        1. grub.remoting.logs.zip
          3 kB
        2. grubSystemInformation.html
          67 kB
        3. image-2018-02-22-17-27-31-541.png
          56 kB
        4. image-2018-02-22-17-28-03-053.png
          30 kB
        5. JavaMelodyGrubHeapDump_4_07_18.pdf
          220 kB
        6. JavaMelodyNodeGrubThreads_4_07_18.pdf
          9 kB
        7. jenkins_agent_devbuild9_remoting_logs.zip
          4 kB
        8. jenkins_Agent_devbuild9_System_Information.html
          66 kB
        9. jenkins_agents_Thread_dump.html
          172 kB
        10. jenkins_support_2018-06-29_01.14.18.zip
          1.26 MB
        11. jenkins.log
          984 kB
        12. jobConsoleOutput.txt
          12 kB
        13. jobConsoleOutput.txt
          12 kB
        14. MonitoringJavaelodyOnNodes.html
          44 kB
        15. NetworkAndMachineStats.png
          224 kB
        16. slaveLogInMaster.grub.zip
          8 kB
        17. support_2018-07-04_07.35.22.zip
          956 kB
        18. threadDump.txt
          98 kB
        19. Thread dump [Jenkins].html
          219 kB

        Issue Links

          Activity

            piratejohnny Jon B added a comment -

            jglick Just wanted to thank you and everybody else who's been working on Jenkins, and to confirm that the work over on https://issues.jenkins-ci.org/browse/JENKINS-36013 appears to have handled this case in a much better way. I consider the current behavior a major step in the right direction for Jenkins. Here's what I noticed:

            Last night, our Jenkins worker pool did its normal scheduled nightly scale-down, and one of the pipelines got disrupted. The message I see in my affected pipeline's console log is:
            Agent ip-172-31-235-152.us-west-2.compute.internal was deleted; cancelling node body
            The above-mentioned hostname is the one that Jenkins selected at the top of my declarative pipeline as a result of my call for a 'universal' machine (universal is how we label all of our workers):

            pipeline {
                agent { label 'universal' }
                ...
            This particular declarative pipeline runs an "sh" step inside a post{} section at the end to clean up after itself, but since the node was lost, the next error that appears in the Jenkins console log is:
            org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing
            This error was the result of the following code:
            post {
                always {
                    sh """|#!/bin/bash
                          |set -x
                          |docker ps -a -q | xargs --no-run-if-empty docker rm -f || true
                          |""".stripMargin()
            ...
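
            A defensive variant, as a sketch (same cleanup command; whether to swallow the exception is a judgment call), would catch the missing-context error so the post section degrades gracefully when the agent is gone:

            post {
                always {
                    script {
                        try {
                            sh 'docker ps -a -q | xargs --no-run-if-empty docker rm -f || true'
                        } catch (org.jenkinsci.plugins.workflow.steps.MissingContextVariableException e) {
                            // The node (and its hudson.FilePath context) is gone,
                            // so there is no workspace left to clean up on.
                            echo 'Agent lost before cleanup; skipping docker rm'
                        }
                    }
                }
            }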
            Let me just point out that the recent Jenkins advancements are fantastic. Before JENKINS-36013, this pipeline would have just been stuck with no error messages. I'm so happy with this progress you have no idea.

            Now if there's any way to get this to actually retry the step it was on, such that the pipeline can actually tolerate losing the node, we would have the best of all worlds. At my company, the fact that a node was deleted during a scale-down is a confusing, irrelevant problem for one of my developers to grapple with. The job of my developers (the folks writing Jenkins pipelines) is to write idempotent pipeline steps, and my job is to make sure all of the developers' steps trigger and the pipeline concludes with a high degree of durability.

            Keep up the great work you are all doing. This is great.

            jglick Jesse Glick added a comment -

            The MissingContextVariableException is tracked by JENKINS-58900. That is just a bad error message, though; the point is that the node is gone.

            if there's any way to get this to actually retry the step it was on such that the pipeline can actually tolerate losing the node

            Well, that is the primary subject of this RFE, my “subcase #1” above. Pending a supported feature, you might be able to hack something up in a trusted Scripted library, like:

            while (true) {
              try {
                node('spotty') {
                  sh '…'
                }
                break
              } catch (x) {
                if (x instanceof org.jenkinsci.plugins.workflow.steps.FlowInterruptedException &&
                    x.causes*.getClass().contains(org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.RemovedNodeCause)) {
                  continue
                } else {
                  throw x
                }
              }
            }
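
            One design note on that hack: while (true) retries indefinitely, so a node that keeps disappearing would loop forever. A bounded variant along the same lines (the cap of three attempts is arbitrary):

            int attempts = 0
            while (true) {
              try {
                node('spotty') {
                  sh '…'
                }
                break
              } catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException x) {
                // Rethrow unless the interruption was caused by the node
                // being removed, and give up after three lost nodes.
                boolean nodeRemoved = x.causes*.getClass().contains(
                    org.jenkinsci.plugins.workflow.support.steps.ExecutorStepExecution.RemovedNodeCause)
                if (!nodeRemoved || ++attempts >= 3) {
                  throw x
                }
              }
            }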
            

            oxygenxo Andrey Babushkin added a comment -

            We use the Kubernetes plugin with our bare-metal Kubernetes cluster, and the problem is that a pipeline can run indefinitely if the agent inside the pod is killed or the underlying node is restarted. Is there any option to tweak this behavior, e.g. some timeout setting (other than an explicit timeout step)?
            jglick Jesse Glick added a comment -

            oxygenxo that should have already been fixed—see linked PRs.

            jglick Jesse Glick added a comment -

            A very limited variant of this concept (likely not compatible with Pipeline) is implemented in the EC2 Fleet plugin: https://github.com/jenkinsci/ec2-fleet-plugin/blob/2d4ed2bd0b05b1b3778ec7508923e21db0f9eb7b/src/main/java/com/amazon/jenkins/ec2fleet/EC2FleetAutoResubmitComputerLauncher.java#L87-L108

            People

              Assignee: jglick Jesse Glick
              Reporter: piratejohnny Jon B
              Votes: 37
              Watchers: 54
