During the day, I run lots of Jenkins slaves. During the evening, I use AWS autoscaling to scale the number of slaves back down. AWS simply terminates the instances, which Jenkins would presumably call a "channel disconnect".
I noticed that any jobs which are running when the slave is killed off hang for a very long time. For example, the link below shows a job with a 10-minute timeout set. I kill the slave off at the 24-second mark, but the job hangs until the 10-minute mark, when the Jenkins timeout plugin detects the timeout; it then spends roughly another 7 minutes hanging until Jenkins realizes the channel is disconnected.
https://gist.github.com/blockjon/6358b4124935fa4e72ba8a7d5bd12291
What's a better way to have jobs stopped and/or restarted quickly if the slave they are running on is disconnected?
Desired Behavior:
Jenkins detects the channel disconnect within 30 seconds and restarts the job on another healthy node.
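(For reference, the kind of per-build timeout described above would look roughly like the sketch below in a Pipeline job; this is an assumption about the job type, and a freestyle job would set the same limit through the Build Timeout plugin instead.)

{code:groovy}
// Sketch only: a 10-minute overall timeout of the kind described above,
// assuming a declarative Pipeline job. Label and build step are placeholders.
pipeline {
    agent { label 'linux' }                    // placeholder agent label
    options {
        timeout(time: 10, unit: 'MINUTES')     // abort the build after 10 minutes
    }
    stages {
        stage('Build') {
            steps {
                sh './build.sh'                // placeholder build step
            }
        }
    }
}
{code}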
Duplicates: JENKINS-49707 "Auto retry for elastic agents after channel closure" (Resolved)
There is no plan to implement job failover within the Jenkins core. There is a Naginator plugin for it: https://wiki.jenkins-ci.org/display/JENKINS/Naginator+Plugin
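For Pipeline jobs there is also a job-side approximation: wrapping the node allocation in retry, which re-runs the whole body on another available agent once the failure finally propagates. This is a different mechanism from Naginator, it retries on any failure rather than only on agent loss, and the sketch below uses a placeholder label and build step.

{code:groovy}
// Rough scripted-pipeline sketch: if the node is lost and the failure propagates,
// retry() re-acquires an executor (possibly on a different agent) and re-runs the body.
// Assumes the build steps are safe to re-run from scratch; 'linux' is a placeholder label.
retry(2) {
    node('linux') {
        sh './build.sh'
    }
}
{code}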
Regarding the node failure detection time, it really depends on the operation being executed. If the node is only disconnected after 7 minutes, it likely means that...
1) The remoting channel does not notice the disconnect or does not propagate it to all pending calls
2) The channel is disconnected by the PingThread timeout (see the rough arithmetic below)
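As a rough illustration of why case 2 can take several minutes, assuming the default ChannelPinger settings (the property names and default values should be verified against the Jenkins version in use):

{code:groovy}
// Back-of-the-envelope sketch, assuming Jenkins defaults of a 300 s ping interval
// and a 240 s ping timeout for hudson.slaves.ChannelPinger: in the worst case a
// dead channel is only declared broken after interval + timeout.
int pingIntervalSeconds = 300   // assumed default of hudson.slaves.ChannelPinger.pingIntervalSeconds
int pingTimeoutSeconds  = 240   // assumed default of hudson.slaves.ChannelPinger.pingTimeoutSeconds
int worstCaseSeconds = pingIntervalSeconds + pingTimeoutSeconds
println "Worst-case detection: ${worstCaseSeconds} s (~${worstCaseSeconds / 60} minutes)"
// Lowering both values (normally passed as -D JVM options when starting the
// controller) shortens this window, at the cost of more ping traffic.
{code}

A delay of roughly 7 minutes, as observed above, falls inside that worst-case window.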
Anyway, I need the Jenkins system, agent, and build logs with failure timestamps to analyze the root cause.