[JENKINS-43781] Quickly detecting and restarting a job if the job's slave disconnects

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Component/s: core, remoting
    • Labels: None
    • Environment: Jenkins 2.46.1

      During the day, I run lots of Jenkins slaves. During the evening, I use AWS to autoscale down the number of slaves I'm using. AWS simply terminates the instances. Jenkins would probably call this a "channel disconnect".

      I noticed that any jobs that are running when the slave is killed off hang for a really long time. For example, the link below shows a job that had a 10-minute timeout set. I kill the job off at the 24-second mark, but the job hangs until the 10-minute mark, when the Jenkins timeout plugin detects a timeout. It then spends the next 7 minutes hanging until Jenkins realizes the channel is disconnected.

      https://gist.github.com/blockjon/6358b4124935fa4e72ba8a7d5bd12291

      What's a better way to have jobs quickly stopped and/or restarted if the slave they are running on is disconnected?

      Desired Behavior:

      Jenkins detects within 30 seconds that the channel is disconnected, then restarts the job on another healthy node.
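The desired behavior can be sketched as a heartbeat check followed by a reschedule. This is only a minimal illustration of the idea, not how Jenkins works internally: the `Node` class, the `reschedule_dead_nodes` function, the heartbeat timestamps, and the job names are all hypothetical, and the 30-second threshold is simply the figure requested above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical view of a build node as seen from the controller."""
    name: str
    last_heartbeat: float  # seconds, on the controller's clock
    jobs: list = field(default_factory=list)


def reschedule_dead_nodes(nodes, now, timeout=30.0):
    """Move jobs off nodes whose last heartbeat is older than `timeout`.

    Returns a list of (job, old_node, new_node) moves. If no healthy node
    exists, jobs are left where they are and no moves are reported.
    """
    healthy = [n for n in nodes if now - n.last_heartbeat <= timeout]
    dead = [n for n in nodes if now - n.last_heartbeat > timeout]
    moves = []
    for node in dead:
        if not healthy:
            break
        while node.jobs:
            job = node.jobs.pop(0)
            # Pick the least-loaded healthy node for the restarted job.
            target = min(healthy, key=lambda n: len(n.jobs))
            target.jobs.append(job)
            moves.append((job, node.name, target.name))
    return moves


nodes = [
    Node("agent-1", last_heartbeat=100.0, jobs=["build-42"]),
    Node("agent-2", last_heartbeat=128.0),
]
# agent-1 has been silent for 31 s, agent-2 for 3 s.
moves = reschedule_dead_nodes(nodes, now=131.0)
print(moves)  # [('build-42', 'agent-1', 'agent-2')]
```

Passing the clock in as `now` keeps the check deterministic, so the 30-second rule can be exercised without waiting in real time.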

          Jon B created issue -
          Jon B made changes -
          Component/s New: remoting [ 15489 ]

          Oleg Nenashev added a comment -

          There is no plan to implement job failover within the Jenkins core. The Naginator plugin provides it: https://wiki.jenkins-ci.org/display/JENKINS/Naginator+Plugin

          Regarding the node failure detection time, it really depends on the operation being executed. If the node is only being disconnected after 7 minutes, it likely means one of the following:
          1) The Remoting channel does not notice the disconnect or does not propagate it to all pending calls
          2) The channel is being disconnected by the PingThread timeout

          Anyway, I need the Jenkins system, agent, and build logs with the failure timestamp to analyze the root cause.
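To make the second possibility concrete, here is a rough model of how long a periodic ping with a timeout can take to notice a dead agent. It is a simplified sketch, not the actual PingThread implementation: the function names are hypothetical, and the interval and timeout values are illustrative assumptions rather than confirmed Jenkins defaults.

```python
def ping_detection_window(ping_interval_s, ping_timeout_s):
    """Return (best_case, worst_case) seconds from agent death to detection.

    Simplified model: a ping is sent every `ping_interval_s` seconds and is
    declared failed if no reply arrives within `ping_timeout_s` seconds.
    - Best case: the agent dies just before a ping is sent, so only the
      timeout has to elapse.
    - Worst case: the agent dies just after answering a ping, so a full
      interval passes before the next ping, which then has to time out.
    """
    best_case = ping_timeout_s
    worst_case = ping_interval_s + ping_timeout_s
    return best_case, worst_case


def meets_target(ping_interval_s, ping_timeout_s, target_s):
    """True if even the worst-case detection time fits within `target_s`."""
    _, worst_case = ping_detection_window(ping_interval_s, ping_timeout_s)
    return worst_case <= target_s


# Illustrative values only (assumed, not taken from this issue).
print(ping_detection_window(300, 240))  # (240, 540)
print(meets_target(300, 240, 30))       # False
print(meets_target(20, 10, 30))         # True
```

Under this model, a 30-second detection target requires the ping interval and the ping timeout to sum to 30 seconds or less, which is why the logs are needed to tell a ping timeout apart from a channel that never noticed the disconnect.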

          Jesse Glick added a comment -

          JENKINS-49707 would address this for Pipeline builds. Not necessarily in 30s; depends on the agent type.

          Jesse Glick made changes -
          Link New: This issue duplicates JENKINS-49707 [ JENKINS-49707 ]
          Jesse Glick made changes -
          Resolution New: Duplicate [ 3 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]

            Assignee: Unassigned
            Reporter: Jon B (piratejohnny)
            Votes: 1
            Watchers: 3