Status: Open
Environment: Reference working system:
Jenkins ver. 2.176.3, Self-Organizing Swarm Plug-in Modules ver. 3.17
Jenkins ver. 2.150.2, Self-Organizing Swarm Plug-in Modules ver. 3.15
Swarm client ver. 3.17, running with Oracle Java 8.
The swarm client connection hangs when the HTTP handshake with a Jenkins Master hangs.
In our situation, the Jenkins Master was responsive via the UI and there was no issue connecting a conventional Agent via the web launcher. For some reason, however, the connection with a Swarm client would simply hang. The Master didn't log any connection attempt, hence I'm only able to provide a log from the client.
Restarting the Master did solve the issue (that's why I'm reporting the bug as "minor" at first), but my main concern is that, as the Swarm client was designed for auto-discovery, there is a chance that clients would gradually sink into a broken Master and hang indefinitely, leaving the remaining instances in the cluster unattended.
We didn't have any issues connecting the same Swarm clients to other Masters in the same infrastructure. Hence, a network issue was ruled out.
We tried with Swarm client v3.17 and 3.14, to no avail.
The Swarm client failed to connect from both Windows and Linux (CentOS) nodes.
Sadly, I had to replace company name, machine name and stuff like that... sorry about it.
For log collection I have disabled SSL verification on the node.
swarm-healthy-log.txt is the full log of the swarm client connecting to a Master from our infra without issues (for reference).
swarm-issue-log1.txt and swarm-issue-log2.txt are the full logs connecting to the troublesome Master. Notice that the handshake failed at different points. Sometimes the first handshake would succeed, but we never succeeded in the second one.
I understand that the Master could have been in a corrupt state somehow. As said, restarting it brought things back to normal.
However, we should expect the Swarm client to be resilient against any issues with the Master, precisely because of the auto-discovery feature. This client has more autonomy, so if it can't connect to a Master, it should simply move on to the next one.
The suggested fix (maybe I'm being naive on this) would be a timeout for all the HTTP requests.
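To illustrate the idea of a timeout on every HTTP request, here is a minimal sketch using only the JDK's `HttpURLConnection`. It is not the Swarm client's actual code, and I'm not assuming it uses the same HTTP library. The class name `TimeoutProbe`, the methods `probe` and `firstResponsive`, and the timeout values are all hypothetical, chosen for illustration only.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Swarm client.
final class TimeoutProbe {
    // Assumed values for illustration; a real client would make these configurable.
    static final int CONNECT_TIMEOUT_MS = 5_000;
    static final int READ_TIMEOUT_MS = 10_000;

    /**
     * Sends a GET and returns the HTTP status code.
     * Throws java.net.SocketTimeoutException (an IOException) if the
     * connection cannot be established, or no data arrives, within the timeouts.
     */
    static int probe(URL url, int connectTimeoutMs, int readTimeoutMs) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(connectTimeoutMs); // limit on establishing the connection
        conn.setReadTimeout(readTimeoutMs);       // limit on each blocking read
        conn.setRequestMethod("GET");
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    /** Returns the first candidate that answers in time, or null if none does. */
    static URL firstResponsive(List<URL> candidates, int connectTimeoutMs, int readTimeoutMs) {
        for (URL candidate : candidates) {
            try {
                probe(candidate, connectTimeoutMs, readTimeoutMs);
                return candidate;
            } catch (IOException e) {
                // Hung or unreachable Master: log it and move on to the next one.
                System.err.println("Skipping " + candidate + ": " + e);
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        List<URL> candidates = new ArrayList<>();
        for (String arg : args) {
            candidates.add(new URL(arg));
        }
        System.out.println("Selected: "
                + firstResponsive(candidates, CONNECT_TIMEOUT_MS, READ_TIMEOUT_MS));
    }
}
```

Note that the read timeout bounds each blocking read rather than the whole request, so a Master that trickles data slowly could still stall the client for longer; a hard overall deadline would need something extra on top of this.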
For example here and here: