Quickly detecting and restarting a job if the job's slave disconnects

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • core, remoting
    • None
    • Jenkins 2.46.1

      During the day, I run lots of Jenkins slaves. During the evening, I use AWS to autoscale down the number of slaves I'm using. AWS simply terminates the instances. Jenkins probably would call this a "channel disconnect".

      I noticed that any jobs running when the slave is killed off hang for a really long time. For example, the link below shows a job that had a 10-minute timeout set. I kill the job off at the 24-second mark, but the job hangs until the 10-minute mark, when the Jenkins timeout plugin detects a timeout. It then spends the next 7 minutes hanging until Jenkins realizes the channel is disconnected.

      https://gist.github.com/blockjon/6358b4124935fa4e72ba8a7d5bd12291

      What's a better way to have jobs quickly stopped and/or restarted if the slave they are running on is disconnected?

      Desired Behavior:

      Jenkins detects the channel is disconnected within 30 seconds. It proceeds to restart the job via another healthy node.
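
      The desired behavior can be sketched in Python as follows. This is only an illustration of the retry-on-another-node idea, not Jenkins or plugin code: `NodeDisconnected`, `run_with_failover`, `flaky_job`, and the agent names are all hypothetical.

```python
class NodeDisconnected(Exception):
    """Hypothetical signal that the node running a job went away."""


def run_with_failover(job, nodes, max_attempts=3):
    """Run job(node) on each node in turn until one completes.

    Only a NodeDisconnected error triggers a retry on the next node;
    any other exception propagates immediately.
    """
    attempts = 0
    last_error = None
    for node in nodes:
        if attempts >= max_attempts:
            break
        attempts += 1
        try:
            return node, job(node)
        except NodeDisconnected as err:
            last_error = err  # remember the failure, try the next node
    raise RuntimeError(f"job failed on {attempts} node(s)") from last_error


def flaky_job(node):
    # Hypothetical job: the first agent is terminated mid-build.
    if node == "agent-1":
        raise NodeDisconnected(node)
    return "SUCCESS"


print(run_with_failover(flaky_job, ["agent-1", "agent-2"]))
```

      In this sketch the restart is only as fast as the disconnect detection: the retry cannot begin until `NodeDisconnected` is actually raised.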

          [JENKINS-43781] Quickly detecting and restarting a job if the job's slave disconnects

          Oleg Nenashev added a comment -

          There is no plan to implement job failover within the Jenkins core. There is a Naginator plugin for it: https://wiki.jenkins-ci.org/display/JENKINS/Naginator+Plugin

          Regarding the node failure detection time, it really depends on the operation being executed. If the node is only being disconnected after 7 minutes, it likely means that:
          1) The Remoting channel does not notice the disconnect, or does not propagate it to all pending calls
          2) The channel is being disconnected by the PingThread timeout

          Anyway, I need the Jenkins system, agent, and build logs with failure timestamps to analyze the root cause.
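
          To illustrate why a ping-based timeout can take minutes to notice a dead channel, here is a rough Python model. It is a simplification under stated assumptions, not the actual PingThread implementation: pings are assumed to go out at fixed intervals starting at t = 0, and a ping sent at or after the disconnect is assumed to be declared failed `timeout` seconds after it was sent. The function names and the interval and timeout values used below are hypothetical and are not claimed to be Jenkins defaults.

```python
import math


def detection_time(disconnect_at, interval, timeout):
    """Time (in seconds) at which a periodic pinger declares the channel dead.

    Assumed model: pings are sent at t = 0, interval, 2 * interval, ...
    The first ping sent at or after the disconnect never gets a reply and
    is declared failed `timeout` seconds after it was sent.
    """
    first_failed_ping = math.ceil(disconnect_at / interval) * interval
    return first_failed_ping + timeout


def detection_delay(disconnect_at, interval, timeout):
    """Seconds between the disconnect and its detection."""
    return detection_time(disconnect_at, interval, timeout) - disconnect_at


# Hypothetical settings: a long ping interval versus a short one,
# for an agent killed 24 seconds into the build.
print(detection_delay(24, interval=300, timeout=240))
print(detection_delay(24, interval=10, timeout=20))
```

          Under this model the delay never exceeds `interval + timeout`, so detecting a disconnect within 30 seconds would require that sum to be at most 30 seconds.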


          Jesse Glick added a comment -

          JENKINS-49707 would address this for Pipeline builds. Not necessarily in 30s; depends on the agent type.


            Assignee: Unassigned
            Reporter: Jon B (piratejohnny)