JENKINS-59817

Swarm client hangs indefinitely while waiting for HTTP handshake to complete

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component: swarm-plugin

      The Swarm client hangs indefinitely when the HTTP handshake with a Jenkins Master stalls.

      In our situation, the Jenkins Master was responsive via the UI and there was no issue connecting a conventional Agent via the web launcher. For some reason, however, the connection with a Swarm client would simply hang. The Master didn't log any connection attempt, hence I'm only able to provide a log from the client.

      Restarting the Master did solve the issue (that's why I'm reporting the bug as "minor" for now), but my main concern is that, since the Swarm client was designed for auto-discovery, clients could gradually get stuck on a broken Master and hang indefinitely, leaving the remaining instances in the cluster unattended.

      Some attempts...

      We didn't have any issues connecting the same Swarm clients to other Masters in the same infrastructure. Hence, a network issue was ruled out.

      We tried Swarm client v3.17 and v3.14, to no avail.

      The Swarm client failed to connect from both Windows and Linux (CentOS) nodes.

      About the logs...

      Sadly, I had to replace company name, machine name and stuff like that... sorry about it.

      For log collection I disabled SSL verification on the node.

      swarm-healthy-log.txt is the full log of the swarm client connecting to a Master from our infra without issues (for reference).

      swarm-issue-log1.txt and swarm-issue-log2.txt are the full logs connecting to the troublesome Master. Notice that the handshake failed at different points. Sometimes the first handshake would succeed, but we never succeeded in the second one.

       

      Expectation...

      I understand that the Master could have been in a corrupt state somehow. As mentioned, restarting it brought things back to normal.

      However, we should expect the Swarm client to be resilient against any issues with the Master, precisely because of the auto-discovery feature. This client has more autonomy, so if it can't connect to one Master, it should simply move on to the next one.
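
      To illustrate the "move on to the next one" behaviour, here is a minimal sketch, not the Swarm client's actual code. The class name `MasterFailoverSketch`, the method `firstResponsive`, and the `handshake` predicate are hypothetical names introduced only for this example. Each connection attempt gets a bounded wait, and a candidate that does not answer in time is skipped.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names exist in the Swarm client.
class MasterFailoverSketch {

    /**
     * Tries each candidate in order and returns the first one whose handshake
     * succeeds within timeoutMs. Hung or failing candidates are skipped.
     */
    static Optional<String> firstResponsive(List<String> candidates,
                                            Predicate<String> handshake,
                                            long timeoutMs) {
        // Daemon threads, so an attempt that never returns cannot keep the JVM alive.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
        try {
            for (String candidate : candidates) {
                Callable<Boolean> task = () -> handshake.test(candidate);
                Future<Boolean> attempt = pool.submit(task);
                try {
                    if (attempt.get(timeoutMs, TimeUnit.MILLISECONDS)) {
                        return Optional.of(candidate);
                    }
                } catch (TimeoutException e) {
                    // No answer in time: give up on this candidate and try the next.
                    attempt.cancel(true);
                } catch (ExecutionException e) {
                    // The handshake itself threw: treat it as a failed candidate.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return Optional.empty();
                }
            }
            return Optional.empty();
        } finally {
            pool.shutdownNow();
        }
    }
}
```

      Note that `cancel(true)` only interrupts the worker thread, and a blocking socket read does not react to interrupts. A wrapper like this keeps the caller from hanging, but a timeout on the HTTP requests themselves is still needed to release the stuck connection.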

      The suggested fix (maybe I'm being naive on this) would be a timeout for all the HTTP requests.

      For example here and here:

      https://github.com/jenkinsci/swarm-plugin/blob/bfbd2c79eea470847335fb6a0ef9ce19d425429b/client/src/main/java/hudson/plugins/swarm/SwarmClient.java#L478

      https://github.com/jenkinsci/swarm-plugin/blob/bfbd2c79eea470847335fb6a0ef9ce19d425429b/client/src/main/java/hudson/plugins/swarm/SwarmClient.java#L392
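
      As a rough illustration of the suggested timeout, the sketch below uses the JDK's `HttpURLConnection`, which is an assumption made to keep the example self-contained and is not necessarily the HTTP library used at the lines linked above. The class name `HandshakeTimeoutSketch`, the method `probe`, and the default URL are hypothetical.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical sketch: illustrates connect/read timeouts only,
// not the Swarm client's real handshake logic.
class HandshakeTimeoutSketch {

    /**
     * Issues a GET and returns the HTTP status code, or -1 if the server
     * could not be reached or did not respond within the given limits.
     */
    static int probe(String url, int connectTimeoutMs, int readTimeoutMs) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(connectTimeoutMs); // limit on establishing the TCP connection
            conn.setReadTimeout(readTimeoutMs);       // limit on each blocking read of the response
            conn.setRequestMethod("GET");
            return conn.getResponseCode();            // sends the request and waits for the status line
        } catch (IOException | IllegalArgumentException e) {
            // SocketTimeoutException is an IOException, so timeouts land here too.
            return -1;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        // The default URL is a placeholder, not a real Master.
        String url = args.length > 0 ? args[0] : "http://localhost:8080/";
        System.out.println("status: " + probe(url, 5_000, 10_000));
    }
}
```

      With both limits set, a Master that accepts the connection but never answers produces a `SocketTimeoutException` instead of an indefinite hang. The read timeout applies to each blocking read rather than to the request as a whole, so a server that trickles data slowly can still exceed it in total.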

       

       


          Basil Crow added a comment -

          I like the suggestion for an HTTP timeout. If this problem occurs again, could you get a thread dump from the Jenkins master and the Swarm client agent at the time of the hang? I would like to see the stack trace on each side of the connection. This should help pinpoint the cause of the hang.


          Basil Crow added a comment -

          Circling back on this somewhat old issue. rafaelrezend, have you seen any hangs due to a stuck HTTP handshake since then? I am still interested in seeing the stack trace of such a hung Swarm client.


          Rafael Rezende added a comment -

          I haven't seen that issue occur for months. Either it simply stopped happening, or the Jenkins admins have become used to restarting their instances at the first sign of failure, which restores everything.

          Also, clients are slowly migrating away from the Swarm client because the use case that required it no longer exists, so the chances of witnessing the issue again are currently very slim. On the bright side, people seem to have started taking care of their Jenkins Masters, and there have been fewer anomalies that could cause the issue with the client.

          basil, I'm sorry to say that I won't be able to get the stack trace after all.

            Assignee: Unassigned
            Reporter: Rafael Rezende (rafaelrezend)