(Swarm) Agents fail to reconnect to controller after reboot


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Component/s: remoting, swarm-plugin
    • Environment:
      Linux (Controller), Windows (Agents)
      Controller Version: 2.263.1
      Swarm Version: 3.24
      Remoting Version: 4.5

      We do daily maintenance on our Windows Agents, which includes a reboot. This works fine most of the time. The machines reboot and the Swarm Agent (which runs as a Windows service) just reconnects to the controller and is ready to run builds again.

      However, after some time (days, or perhaps a couple of weeks), the agents can no longer connect until the controller is restarted.

      In the agent log I see messages like the following:

      INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
      Nov 08, 2021 12:56:29 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver isPortVisible
      WARNING: Connection refused: connect
      Nov 08, 2021 12:56:29 AM hudson.remoting.jnlp.Main$CuiListener error
      SEVERE: https://<jenkins-controller>/ provided port:35725 is not reachable
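The "provided port ... is not reachable" message can be cross-checked from the agent machine with a plain TCP connect. The following is only a minimal sketch, not part of Remoting or Swarm; the function name `is_port_reachable` and its default timeout are assumptions made for illustration.

```python
import socket


def is_port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for diagnosis only; it does not speak the
    JNLP4-connect protocol, it merely checks that the port accepts
    connections.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and name resolution failures.
        return False
```

Calling it with the controller's host name and the port from the log (35725 above) would show whether the refusal is reproducible outside the Swarm client, which helps to tell a controller-side listener problem from a firewall or network issue.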

      On the controller, on the other hand, the agent shows up as offline, and subsequent connection attempts result in:

      SEVERE: An error occurred
      hudson.plugins.swarm.RetryException: Failed to create a Swarm agent on Jenkins. Response code: 409
      Agent "myAgent" already exists.
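To see what the controller thinks of the agent that "already exists", the node's JSON record can be inspected via the controller's remote access API (`/computer/<name>/api/json`). The sketch below only builds the URL and summarizes a response body that has already been fetched; the helper names and the example host `jenkins.example` are hypothetical, and authentication is deliberately left out.

```python
import json
from urllib.parse import quote


def node_status_url(controller_url: str, agent_name: str) -> str:
    """Build the remote API URL for one agent (hypothetical helper)."""
    base = controller_url.rstrip("/")
    return f"{base}/computer/{quote(agent_name, safe='')}/api/json"


def summarize_node(payload: str) -> str:
    """Summarize the JSON body returned for a node.

    Reads only the "displayName", "offline" and "offlineCauseReason" fields.
    """
    data = json.loads(payload)
    name = data.get("displayName", "<unknown>")
    if data.get("offline"):
        reason = data.get("offlineCauseReason") or "no reason reported"
        return f"{name}: offline ({reason})"
    return f"{name}: online"
```

For example, `node_status_url("https://jenkins.example/", "myAgent")` yields `https://jenkins.example/computer/myAgent/api/json`; fetching that URL with suitable credentials and passing the body to `summarize_node` would show whether the stale entry is still registered as offline at the time of the 409.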

      Removing the agent from the controller does not help: the next connection attempt fails in the same way. The only way to resolve the situation is to restart the controller.

      I wonder whether this might be related to JENKINS-57831.

            Assignee:
            Unassigned
            Reporter:
            Dirk Heinrichs
            Archiver:
            Jenkins Service Account
