JENKINS-36013

Automatically abort ExecutorPickle rehydration from an ephemeral node

Details

    • Pipeline - July/August

    Description

      ExecutorPickle.rehydrate ought to be able to detect that it has been spinning in circles because the agent node it was supposed to run on is not in the Jenkins node list. In that case it should abort automatically, failing the build with a comprehensible message rather than hanging indefinitely. (This is distinct from a node that is registered but offline, which is normal enough for a JNLP agent and the like; in such cases we just want to wait for the agent to come back online.)

      This would provide a better experience for the case of a build which was running on an EphemeralNode (such as from a Cloud without durable-task integration) when Jenkins was restarted. An agent using an inappropriate RetentionStrategy is trickier since it might still be defined after a restart, but will soon be terminated. Similarly, there may be cases where the agent is actually going to be redefined (with the same name) when it is attached after the restart—not sure about the Swarm plugin, but CloudBees DEV@cloud OPEs work this way. To prevent the build from being killed too aggressively, the cleanup should be delayed until some time has elapsed since rehydration began (or, ideally, since Jenkins completed initialization)—say, five minutes.
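
      The decision described above can be sketched roughly as follows. This is only an illustration of the proposed rule, not the actual ExecutorPickle code: the class RehydrationPolicy, the Decision enum, and the decide method are hypothetical names, and plain sets of node names stand in for the Jenkins node list. The five-minute grace period is the value suggested in this description.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Set;

// Hypothetical sketch of the proposed abort rule; none of these names come from Jenkins.
final class RehydrationPolicy {
    enum Decision { RESUME, KEEP_WAITING, ABORT }

    // Grace period suggested in the issue description ("say, five minutes").
    static final Duration GRACE_PERIOD = Duration.ofMinutes(5);

    /**
     * @param nodeName        name of the agent the build was running on
     * @param definedNodes    names currently present in the node list (assumed input)
     * @param onlineNodes     subset of definedNodes that is currently online (assumed input)
     * @param waitingSince    when rehydration began (or when Jenkins finished initializing)
     * @param now             current time
     */
    static Decision decide(String nodeName,
                           Set<String> definedNodes,
                           Set<String> onlineNodes,
                           Instant waitingSince,
                           Instant now) {
        if (definedNodes.contains(nodeName)) {
            // Registered node: resume if online, otherwise wait for it to come back,
            // however long that takes.
            return onlineNodes.contains(nodeName) ? Decision.RESUME : Decision.KEEP_WAITING;
        }
        // Node is not defined at all: tolerate this only during the grace period,
        // in case the agent is about to be redefined under the same name.
        Duration waited = Duration.between(waitingSince, now);
        return waited.compareTo(GRACE_PERIOD) >= 0 ? Decision.ABORT : Decision.KEEP_WAITING;
    }
}
```

      Keeping the registered-but-offline case separate from the missing-node case is what lets a JNLP agent reconnect at its own pace, while a vanished ephemeral node fails the build only after the delay.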

          Activity

            svanoort Sam Van Oort added a comment (edited)

            basil I am sorry to hear that this caused a regression for you. It appears to be an unanticipated case where EphemeralNodes generated by the Swarm Plugin aren't really following the contract of that interface, since they can reconnect and be recreated but will not do so immediately.

            I have supplied a fix here: https://github.com/jenkinsci/workflow-durable-task-step-plugin/pull/48 (pending review by jglick). This will apply a 5-minute timeout, restoring my original implementation strategy.

            jglick Jesse Glick added a comment -

            basil sounds like a bug in the Swarm plugin to me. The implementation of JENKINS-34593 ought to have removed the EphemeralNode marker if I understand it correctly. The description of -deleteExistingClients does not seem to talk about retaining agents across restarts but it seems that you have discovered that as a use case—probably not a tested one. (As an aside, it seems the plugin contains no tests which actually run Jenkins, much less tests of Pipeline interoperability or of restart behavior.)

            basil Basil Crow added a comment -

            svanoort and jglick, I wanted to say thank you for releasing version 2.15 which restores this functionality. I tested it, and my pipeline jobs that use the swarm plugin once again survive Jenkins restarts. Thanks!

            michaelneale Michael Neale added a comment -

            hi five svanoort (but make sure he has washed his hands, he just finished coming second in a chilli eating competition)!

            svanoort Sam Van Oort added a comment -

            basil Thanks!  I'm glad you're finding it works well for you now

            People

              svanoort Sam Van Oort
              jglick Jesse Glick
              Votes: 6
              Watchers: 11

              Dates

                Created:
                Updated:
                Resolved: