-
Improvement
-
Resolution: Unresolved
-
Minor
-
None
-
Jenkins LTS 2.479.3
swarm-plugin 3.49
Occasionally, after a communication hiccup between the Jenkins controller and a machine running a Swarm agent, the two end up in an inconsistent state: the controller no longer lists the agent among its "computers", yet the agent's log says it is "Connected". For now I can only theorize about how their views of the world come to differ and why the agent's regular ping does not detect the broken connection, but evidently the ping does not always suffice.
I propose that the ping operation be extended or supplemented by a probe in which the agent asks the controller for its list of known workers (or just for itself), then posts a request carrying a cookie that should be routed back to the agent by its name (as a node label?). The request could be job-like, but ideally it would not require an executor or wait in the queue. The agent then waits for that request to come in; if it does not arrive soon, the agent restarts the connection, or even itself. Ideally this should not disrupt jobs running on the disconnected agent while, for example, the Jenkins controller is restarting and pipelines can allegedly march on.
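To illustrate the "restart the connection, but do not disrupt jobs while the controller is restarting" part, here is a minimal Python sketch of the decision logic only. It is not swarm-plugin code: the names `Probe`, `Action`, `Watchdog` and the `max_missing` threshold are hypothetical. The assumption is that a probe can tell "the controller answered but does not know this agent" apart from "the controller did not answer at all", and that only the former should count toward a reconnect.

```python
from dataclasses import dataclass
from enum import Enum


class Probe(Enum):
    LISTED = "listed"            # controller answered and knows this agent
    MISSING = "missing"          # controller answered but does not know this agent
    UNREACHABLE = "unreachable"  # controller did not answer (perhaps restarting)


class Action(Enum):
    NONE = "none"
    RECONNECT = "reconnect"


@dataclass
class Watchdog:
    """Hypothetical policy: reconnect after several consecutive MISSING probes."""
    max_missing: int = 3   # assumed threshold, not an existing plugin option
    missing_streak: int = 0

    def observe(self, result: Probe) -> Action:
        if result is Probe.LISTED:
            # Both sides agree; forget earlier misses.
            self.missing_streak = 0
            return Action.NONE
        if result is Probe.UNREACHABLE:
            # Controller may be restarting: do not count this against the
            # connection, so running jobs are left alone.
            return Action.NONE
        self.missing_streak += 1
        if self.missing_streak >= self.max_missing:
            self.missing_streak = 0
            return Action.RECONNECT
        return Action.NONE
```

With `max_missing=2`, the sequence MISSING, UNREACHABLE, MISSING yields a reconnect only on the last probe, because the unreachable probe neither resets nor advances the streak.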
Such a probe would verify that the agent is, and remains, known to the job scheduling engine, that it is listed among the known workers under its expected name, and so on.
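The simplest form of the "am I listed under my expected name" check could look roughly like the Python sketch below. It assumes the agent fetches the controller's `/computer/api/json` remote API response and that its node name matches a `displayName` entry exactly; the function name `agent_is_listed` and the sample payload are made up for illustration. It covers only the listing check, not the cookie round trip.

```python
import json


def agent_is_listed(computer_api_json: str, agent_name: str) -> bool:
    """Return True if the controller lists agent_name and reports it online.

    computer_api_json is assumed to be the body of <controller>/computer/api/json.
    """
    data = json.loads(computer_api_json)
    for computer in data.get("computer", []):
        if computer.get("displayName") == agent_name:
            # Listed, but treat a node marked offline as not usable.
            return not computer.get("offline", True)
    return False


# Made-up sample payload, trimmed to the fields used above.
sample = json.dumps({
    "computer": [
        {"displayName": "Built-In Node", "offline": False},
        {"displayName": "swarm-worker-1", "offline": False},
        {"displayName": "swarm-worker-2", "offline": True},
    ]
})

print(agent_is_listed(sample, "swarm-worker-1"))
print(agent_is_listed(sample, "swarm-worker-3"))
```

An agent that finds itself absent from this list (or marked offline) while its own log says "Connected" would be in exactly the inconsistent state described above.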