Type: Bug
Resolution: Unresolved
Priority: Minor
Component/s: swarm-plugin
Labels:
- swarm
Environment:

Reference working system:
Jenkins ver. 2.176.3, Self-Organizing Swarm Plug-in Modules ver. 3.17

Broken system:
Jenkins ver. 2.150.2, Self-Organizing Swarm Plug-in Modules ver. 3.15

Swarm client ver. 3.17, running with Oracle Java 8.


The Swarm client hangs indefinitely when the HTTP handshake with a Jenkins Master stalls.

In our situation, the Jenkins Master was responsive via the UI and there was no issue connecting a conventional Agent via the web launcher. For some reason, however, the connection with a Swarm client would simply hang. The Master didn't log any connection attempt, hence I'm only able to provide logs from the client.

Restarting the Master did solve the issue (that's why I'm reporting the bug as "minor" at first), but my main concern is that, as the Swarm client was designed for auto-discovery, there is a chance that clients would gradually sink into a broken Master and hang indefinitely, leaving the remaining instances in the cluster unattended.

Some attempts...

We didn't have any issues connecting the same Swarm clients to other Masters in the same infrastructure, so a network issue was ruled out.

We tried with Swarm client v3.17 and 3.14, to no avail.

The Swarm client failed to connect from both Windows and Linux (CentOS) nodes.

About the logs...

Sadly, I had to replace company name, machine name and stuff like that... sorry about it.

For log collection I disabled SSL verification on the node.

swarm-healthy-log.txt is the full log of the swarm client connecting to a Master from our infra without issues (for reference).

swarm-issue-log1.txt and swarm-issue-log2.txt are the full logs connecting to the troublesome Master. Notice that the handshake failed at different points. Sometimes the first handshake would succeed, but we never succeeded in the second one.

Expectation...

I understand that the Master could have been in a corrupt state somehow. As said, restarting it brought things back to normal.

However, we should expect the Swarm client to be resilient against any issues with the Master, especially because of the auto-discovery feature. This client has more autonomy, so if it can't connect to a Master, it should simply move on to the next one.

The suggested fix (maybe I'm being naive on this) would be a timeout for all the HTTP requests.

For example here and here:

https://github.com/jenkinsci/swarm-plugin/blob/bfbd2c79eea470847335fb6a0ef9ce19d425429b/client/src/main/java/hudson/plugins/swarm/SwarmClient.java#L478

https://github.com/jenkinsci/swarm-plugin/blob/bfbd2c79eea470847335fb6a0ef9ce19d425429b/client/src/main/java/hudson/plugins/swarm/SwarmClient.java#L392
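To illustrate the idea, here is a minimal sketch of a bounded HTTP request followed by "move on to the next Master". It is not the Swarm client's code: it uses the JDK's `HttpURLConnection` rather than whatever HTTP library the client uses, and the class `MasterProbe`, its method names and the timeout values are all hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, for illustration only; not part of the Swarm client.
final class MasterProbe {

    // Returns true only if the Master answers within the given timeouts.
    // A stalled handshake ends in a SocketTimeoutException (an IOException)
    // instead of blocking forever.
    static boolean isResponsive(String masterUrl, int connectTimeoutMs, int readTimeoutMs) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(masterUrl).openConnection();
            conn.setConnectTimeout(connectTimeoutMs); // bound the TCP connect
            conn.setReadTimeout(readTimeoutMs);       // bound the wait for a response
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            return status >= 200 && status < 400;
        } catch (IOException | ClassCastException e) {
            // Timeout, refused connection, malformed URL or non-HTTP URL:
            // treat this Master as unavailable.
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    // Walks the candidate Masters in order and returns the first one that
    // responds in time, so a single broken Master cannot trap the client.
    static Optional<String> firstResponsive(List<String> candidates,
                                            int connectTimeoutMs, int readTimeoutMs) {
        for (String url : candidates) {
            if (isResponsive(url, connectTimeoutMs, readTimeoutMs)) {
                return Optional.of(url);
            }
        }
        return Optional.empty();
    }
}
```

With something along these lines, a call such as `MasterProbe.firstResponsive(discoveredMasters, 5000, 10000)` would skip a Master whose handshake stalls and try the next discovered one. The actual fix in the client would need to set the equivalent timeouts on the requests linked above.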


Attachments:
- swarm-healthy-log.txt (155 kB, 2019-10-17 07:08)
- swarm-issue-log1.txt (16 kB, 2019-10-17 07:08)
- swarm-issue-log2.txt (4 kB, 2019-10-17 07:08)

Assignee: Unassigned
Reporter: Rafael Rezende
Votes: 0
Watchers: 2

Created: 2019-10-17 07:23
Updated: 2020-07-10 15:16
