[JENKINS-67082] (Swarm) Agents fail to reconnect to controller after reboot

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Components: remoting, swarm-plugin
    • None
    • Linux (Controller), Windows (Agents)
      Controller Version: 2.263.1
      Swarm Version: 3.24
      Remoting Version: 4.5

      We do daily maintenance on our Windows Agents, which includes a reboot. This works fine most of the time. The machines reboot and the Swarm Agent (which runs as a Windows service) just reconnects to the controller and is ready to run builds again.

      However, after some time (maybe days or a couple of weeks), agents can't connect anymore until the controller is restarted.

      In the agent log I see messages like the following:

      INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
      Nov 08, 2021 12:56:29 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver isPortVisible
      WARNING: Connection refused: connect
      Nov 08, 2021 12:56:29 AM hudson.remoting.jnlp.Main$CuiListener error
      SEVERE: https://<jenkins-controller>/ provided port:35725 is not reachable

      On the controller, on the other hand, the agent shows up as offline, and subsequent connection attempts result in:

      SEVERE: An error occurred
      hudson.plugins.swarm.RetryException: Failed to create a Swarm agent on Jenkins. Response code: 409
      Agent "myAgent" already exists.

      If the agent is removed from the controller, the same happens again. The only way to resolve the situation is to restart the controller.

      I wonder whether this might be related to JENKINS-57831.
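The "provided port:35725 is not reachable" message in the agent log suggests that a plain TCP connection to the controller's agent port is being refused. One way to confirm this independently of the Swarm client is a small standalone probe run from the agent machine. The following is a minimal sketch, not the check that remoting itself performs; the class name `PortCheck`, the 5-second timeout, and the `localhost` default are assumptions made for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    public static boolean isPortReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // "Connection refused" and connect timeouts both end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical defaults: pass the controller host and the port from the agent log.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 35725;
        boolean reachable = isPortReachable(host, port, 5000);
        System.out.println(host + ":" + port + " reachable: " + reachable);
    }
}
```

If the probe fails from the agent but succeeds when run on the controller host itself, the cause is more likely on the network path (firewall, proxy) than in the controller's listener. If it fails in both places, the listener on that port is probably no longer accepting connections.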

          Dirk Heinrichs added a comment - No, I don't.

          ethorsa added a comment - edited

          Using deleteExistingClients, an existing agent (if any) is removed and the reconnecting one is created. Another approach is disableClientsUniqueId, where each connected agent gets a unique ID assigned.

          Docs: https://github.com/jenkinsci/swarm-plugin#available-options 

          Dirk Heinrichs added a comment - Yep, I use the latter.

          Basil Crow added a comment -

          There's a passing test for the use case where the Swarm client should be able to reconnect to the Jenkins controller after a restart.

          Please make sure you are using the -deleteExistingClients option if you aren't already.

          Dirk Heinrichs added a comment - The problem isn't that the Swarm client can't reconnect in general. It's that the controller stops accepting those reconnects after some time, even though many reconnects have succeeded before that point. When this happened, I tried removing the clients from the controller manually, but that didn't help. So I don't really see how adding that option would help.

          Basil Crow added a comment -

          It's that the controller stops accepting those reconnects after some time

          Does the controller print a stack trace in its logs?

          Dirk Heinrichs added a comment - No, it doesn't. The only thing I see is what I've already put into the description above. I've meanwhile updated the Swarm plugin and all agents to 3.30 to see whether that helps. Might take some time, though, until the problem shows up again...

          Basil Crow added a comment -

          I've meanwhile updated the Swarm plugin and all agents to 3.30 to see whether that helps. Might take some time, though, until the problem shows up again...

          That's probably not going to help you very much. To get to the bottom of this you'll need to blast up the logs on the Swarm client side at least:

          https://github.com/jenkinsci/swarm-plugin/blob/master/docs/logging.adoc
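Since the messages quoted in the description come from java.util.logging, raising the log level comes down to a java.util.logging configuration. The sketch below shows what such a configuration could look like and applies it programmatically so its effect can be checked. The logger names are assumptions taken from the package names visible in the log excerpts (`hudson.plugins.swarm`, `hudson.remoting`, `org.jenkinsci.remoting`), and the class name `VerboseLogging` is hypothetical. The linked document remains the authoritative description of how the Swarm client is configured.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.LogManager;
import java.util.logging.Logger;

public class VerboseLogging {
    // Same keys as a logging.properties file; logger names are assumptions
    // based on the packages seen in the agent log.
    static final String CONFIG = String.join("\n",
            "handlers=java.util.logging.ConsoleHandler",
            ".level=INFO",
            "java.util.logging.ConsoleHandler.level=ALL",
            "hudson.plugins.swarm.level=ALL",
            "hudson.remoting.level=ALL",
            "org.jenkinsci.remoting.level=ALL");

    /** Replaces the current java.util.logging configuration with CONFIG. */
    public static void apply() throws IOException {
        LogManager.getLogManager().readConfiguration(
                new ByteArrayInputStream(CONFIG.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws IOException {
        apply();
        // With the configuration applied, messages below INFO are emitted for these loggers.
        Logger.getLogger("hudson.remoting").fine("FINE-level logging is enabled");
    }
}
```

In practice the same keys would normally live in a properties file passed to the JVM through the standard `java.util.logging.config.file` system property, so that the agent does not need to be modified.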

          Dirk Heinrichs added a comment - Thanks for the hint. Need to see when I can find a maintenance window to reconfigure and restart all agents, though...

          Dirk Heinrichs added a comment - We've just updated Jenkins to the latest LTS version (incl. all plugins) and thus all swarm agents to 3.32, so I could enable logging on the agents, using level "ALL".

            Assignee: Unassigned
            Reporter: Dirk Heinrichs (dhs)
            Votes: 1
            Watchers: 4