JENKINS-75195

Swarm client "Attempting to reconnect" sometimes indefinitely (not bound by timeout and retry)

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Component/s: swarm-plugin
    • Labels: None
    • Environment: Jenkins LTS 2.479.3, swarm-plugin 3.49

      Sometimes, after an internet connectivity hiccup between community members' computers running Swarm agents and the FOSS project's Jenkins controller in the cloud, or during a restart of that controller or its VM, some agents do not come back online until community members restart their services/containers/VMs. I am marking this as a blocker because only manual intervention fixes the run-time issue. For example:

      $ tail -F /var/log/jenkins-swarm-nutci.log
      java.util.concurrent.TimeoutException: Ping started at 1737925236886 hasn't completed by 1737925476910
              at hudson.remoting.PingThread.ping(PingThread.java:135)
              at hudson.remoting.PingThread.run(PingThread.java:87)
      Jan 26, 2025 10:04:36 PM hudson.remoting.Launcher$CuiListener status
      INFO: Terminated
      Jan 26, 2025 10:04:36 PM hudson.plugins.swarm.Client run
      INFO: Retrying in 10 seconds
      Jan 26, 2025 10:04:47 PM hudson.plugins.swarm.Client run
      INFO: Attempting to connect to https://ci.ourproject.org/
      
      $ date
      Mon Jan 27 08:13:48 CET 2025 

      As seen above, the call behind "Attempting to connect" (logged from client/src/main/java/hudson/plugins/swarm/Client.java::run()) had still not returned 10 hours later.

      The code in client/src/main/java/hudson/plugins/swarm/SwarmClient.java::createSwarmAgent() appears to call createHttpClient() and HttpRequest.newBuilder(uri).POST(...). I suppose this may behave in a "generally undefined manner" while the network and/or the controller itself are not fully available, so it should be constrained by some timeout.
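
      A minimal sketch of what such a constraint could look like, assuming the client uses the JDK's java.net.http API (which the HttpRequest.newBuilder(uri).POST(...) call suggests). The class name BoundedPost, its method names, and the timeout values are hypothetical and are not taken from the Swarm code base:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

// Hypothetical helper, not part of the Swarm client.
class BoundedPost {
    // Illustrative limits only; real values would need tuning.
    static final Duration CONNECT_TIMEOUT = Duration.ofSeconds(30);
    static final Duration REQUEST_TIMEOUT = Duration.ofSeconds(60);

    // Bounds the time spent establishing the connection.
    static HttpClient newClient(Duration connectTimeout) {
        return HttpClient.newBuilder()
                .connectTimeout(connectTimeout)
                .build();
    }

    // Bounds the time spent waiting for the response to the POST.
    // Throws HttpTimeoutException instead of blocking indefinitely.
    static int postWithTimeout(HttpClient client, URI uri, Duration requestTimeout)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(requestTimeout)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java BoundedPost.java <url>");
            return;
        }
        HttpClient client = newClient(CONNECT_TIMEOUT);
        try {
            int status = postWithTimeout(client, URI.create(args[0]), REQUEST_TIMEOUT);
            System.out.println("HTTP status: " + status);
        } catch (HttpTimeoutException e) {
            // Covers both connect and request timeouts; a caller could retry here.
            System.out.println("Timed out: " + e.getMessage());
        }
    }
}
```

      With limits like these, a stalled request would surface as an exception that the existing retry logic in run() could presumably handle, rather than leaving the agent waiting forever.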

      Agents in that setup use the LabelFileWatcher. The info-level log message about it is literally the next line in that run() method, and it was not logged, so the agent must be blocked somewhere inside the createSwarmAgent() code.
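
      Since the exact blocking call is not yet known, another option might be to put an overall deadline around the whole step. The following is only a sketch under that assumption: DeadlineGuard and callWithDeadline are hypothetical names, and cancelling via interrupt only helps if the blocked code actually responds to interruption:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of the Swarm client.
class DeadlineGuard {
    // Runs the step on a daemon worker thread and waits at most the given time.
    // On expiry the worker is interrupted and TimeoutException is thrown,
    // so the caller regains control and can log and retry.
    static <T> T callWithDeadline(Callable<T> step, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "deadline-guard");
            worker.setDaemon(true); // a stuck worker must not keep the JVM alive
            return worker;
        });
        try {
            Future<T> future = executor.submit(step);
            try {
                return future.get(timeout, unit);
            } catch (TimeoutException e) {
                future.cancel(true); // best effort: interrupts the worker
                throw e;
            } catch (ExecutionException e) {
                // Unwrap so callers see the step's own exception where possible.
                Throwable cause = e.getCause();
                if (cause instanceof Exception) {
                    throw (Exception) cause;
                }
                throw e;
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
```

      A call site could then wrap the connection step, for example DeadlineGuard.callWithDeadline(() -> connectOnce(), 5, TimeUnit.MINUTES), where connectOnce() stands in for whatever step is being guarded, and treat a TimeoutException like any other failed attempt.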

      A healthy startup, e.g. when restarting the same client, looks like this:

      + exec java -jar /home/abuild/jenkins-swarm/swarm-client-3.49.jar -config jenkins-swarm.yml
      Jan 27, 2025 8:14:25 AM hudson.plugins.swarm.Client logArguments
      INFO: Client invoked with: -config jenkins-swarm.yml
      Jan 27, 2025 8:14:25 AM hudson.plugins.swarm.Client main
      INFO: Load configuration from jenkins-swarm.yml
      Jan 27, 2025 8:14:25 AM hudson.plugins.swarm.SwarmClient <init>
      INFO: Loading labels from jenkins-swarm.labels...
      Jan 27, 2025 8:14:25 AM hudson.plugins.swarm.SwarmClient <init>
      INFO: Labels found in file: nut-builder ...
      Jan 27, 2025 8:14:25 AM hudson.plugins.swarm.SwarmClient <init>
      INFO: Effective label list: [nut-builder, ...]
      
      ### The healthy attempt to connect...
      Jan 27, 2025 8:14:25 AM hudson.plugins.swarm.Client run
      INFO: Connecting to Jenkins controller
      
      Jan 27, 2025 8:14:25 AM hudson.plugins.swarm.Client run
      INFO: Attempting to connect to https://ci.ourproject.org/
      
      Jan 27, 2025 8:14:27 AM hudson.plugins.swarm.Client run
      INFO: Setting up LabelFileWatcher
      ### ...happens quickly
      
      Jan 27, 2025 8:14:27 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using ./remoting as a remoting work directory
      Jan 27, 2025 8:14:27 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
      INFO: Both error and output logs will be printed to ./remoting
      Jan 27, 2025 8:14:27 AM hudson.remoting.Launcher createEngine
      INFO: Setting up agent: nutci-freebsd12-amd64
      Jan 27, 2025 8:14:27 AM hudson.remoting.Engine startEngine
      INFO: Using Remoting version: 3283.v92c105e0f819
      Jan 27, 2025 8:14:27 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
      INFO: Using ./remoting as a remoting work directory
      Jan 27, 2025 8:14:27 AM hudson.remoting.Engine startEngine
      INFO: Using custom JAR Cache: FileSystem JAR Cache: path=/home/abuild/jenkins-swarm/../.jarcache, touch=true
      Jan 27, 2025 8:14:27 AM hudson.remoting.Launcher$CuiListener status
      INFO: Locating server among [https://ci.ourproject.org/]
      Jan 27, 2025 8:14:48 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
      INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
      Jan 27, 2025 8:14:48 AM hudson.remoting.Launcher$CuiListener status
      INFO: Agent discovery successful
        Agent address: ci.ourproject.org
        Agent port:    22033
        Identity:      38:9c:a5:6b:6e:17:b6:b3:3b:6a:15:a7:52:4d:12:34
      Jan 27, 2025 8:14:48 AM hudson.remoting.Launcher$CuiListener status
      INFO: Handshaking
      Jan 27, 2025 8:14:48 AM hudson.remoting.Launcher$CuiListener status
      INFO: Connecting to ci.ourproject.org:22033
      Jan 27, 2025 8:14:48 AM hudson.remoting.Launcher$CuiListener status
      INFO: Server reports protocol JNLP4-connect-proxy not supported, skipping
      Jan 27, 2025 8:14:48 AM hudson.remoting.Launcher$CuiListener status
      INFO: Trying protocol: JNLP4-connect
      Jan 27, 2025 8:14:48 AM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader run
      INFO: Waiting for ProtocolStack to start.
      Jan 27, 2025 8:14:48 AM hudson.remoting.Launcher$CuiListener status
      INFO: Remote identity confirmed: 38:9c:a5:6b:6e:17:b6:b3:3b:6a:15:a7:52:4d:12:34
      Jan 27, 2025 8:14:48 AM hudson.remoting.Launcher$CuiListener status
      INFO: Connected

      This may or may not be related to JENKINS-59817.


            Assignee: Unassigned
            Reporter: Jim Klimov (jimklimov)