During the day, I run lots of Jenkins slaves. During the evening, I use AWS to autoscale down the number of slaves I'm using. AWS simply terminates the instances. Jenkins would probably call this a "channel disconnect".
I noticed that any jobs that are running when the slave is killed off hang for a really long time. For example, the link below shows a job which had a 10-minute timeout set. I kill the job off at the 24-second mark, but the job hangs until the 10-minute mark, when the Jenkins timeout plugin detects a timeout. It then spends the next 7 minutes hanging until Jenkins realizes the channel is disconnected.
https://gist.github.com/blockjon/6358b4124935fa4e72ba8a7d5bd12291
What's a better way to have jobs stopped and/or restarted quickly when the slave they are running on is disconnected?
Desired Behavior:
Jenkins detects that the channel is disconnected within 30 seconds. It then restarts the job on another, healthy node.
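To make the desired behavior concrete, here is a rough sketch of the logic I have in mind. It is not Jenkins code: `Agent`, `is_disconnected`, and `reassign` are hypothetical names, and the heartbeat timestamps stand in for whatever liveness signal Jenkins actually uses on the channel. Only the 30-second threshold comes from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Threshold taken from the desired behavior; everything else is hypothetical.
DISCONNECT_TIMEOUT_S = 30.0


@dataclass
class Agent:
    name: str
    last_heartbeat: float  # time (in seconds) the agent was last heard from


def is_disconnected(agent: Agent, now: float,
                    timeout: float = DISCONNECT_TIMEOUT_S) -> bool:
    """Treat an agent as disconnected once it has been silent longer than timeout."""
    return now - agent.last_heartbeat > timeout


def reassign(job_agent: str, agents: List[Agent], now: float) -> Optional[str]:
    """Return the agent the job should run on.

    Keeps the current agent if it is still alive; otherwise picks the first
    other healthy agent, or None if no healthy agent is available.
    """
    by_name = {a.name: a for a in agents}
    current = by_name.get(job_agent)
    if current is not None and not is_disconnected(current, now):
        return job_agent
    for candidate in agents:
        if candidate.name != job_agent and not is_disconnected(candidate, now):
            return candidate.name
    return None
```

In this sketch a job on an agent that has been silent for more than 30 seconds is moved to the first healthy agent immediately, rather than waiting for the build timeout to expire first.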
Duplicates: JENKINS-49707 Auto retry for elastic agents after channel closure (Resolved)