Jenkins / JENKINS-69446

Swarm plugin does not reconnect after pinger failure or controller restart


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • remoting, swarm-plugin
    • None
    • swarm-agent 3.34
      Jenkins LTS 2.346.3

      I have some swarm agents hosted on a home PC that connect to a Jenkins controller on the internet. Every once in a while something times out and the client loses its connection. It sometimes reconnects on its own, but not always; in pathological cases it stays "offline" on the controller side for days (if I don't notice), and the agent's log reveals little, e.g.:

      root@nutci-debian-11-amd64:~# journalctl -lfu swarm-client-nutci
      -- Journal begins at Wed 2022-05-18 03:14:14 UTC. --
      Aug 26 04:09:24 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: INFO: Connected
      Aug 26 04:11:49 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: Aug 26, 2022 4:11:49 AM org.eclipse.jgit.lib.Repository close
      Aug 26 04:11:49 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: WARNING: close() called when useCnt is already zero for Repository[/home/abuild/jenkins-nutci-debian-11-amd64/workspace/nut_nut_PR-1621@2/.git]
      Aug 26 04:11:51 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: Aug 26, 2022 4:11:51 AM org.eclipse.jgit.lib.Repository close
      Aug 26 04:11:51 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: WARNING: close() called when useCnt is already zero for Repository[/home/abuild/jenkins-nutci-debian-11-amd64/workspace/nut_nut_PR-1621@2/.git]
      Aug 26 04:33:26 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: Aug 26, 2022 4:33:26 AM hudson.slaves.ChannelPinger$1 onDead
      Aug 26 04:33:26 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.networkupstools.org/216.158.66.104:2203.
      Aug 26 04:33:26 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: java.util.concurrent.TimeoutException: Ping started at 1661488166919 hasn't completed by 1661488406920
      Aug 26 04:33:26 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]:         at hudson.remoting.PingThread.ping(PingThread.java:132)
      Aug 26 04:33:26 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]:         at hudson.remoting.PingThread.run(PingThread.java:88)^C
      root@nutci-debian-11-amd64:~# date
      Fri Aug 26 12:09:34 UTC 2022
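
For reference, the two numbers in the `TimeoutException` message above are epoch milliseconds, so they can be decoded to see how long the ping waited and how long the process then sat idle. This is only a small arithmetic sketch over values copied from the log; the helper name `ms_to_utc` is made up for illustration.

```python
from datetime import datetime, timezone

# Epoch-millisecond values copied from the TimeoutException message above.
PING_STARTED_MS = 1661488166919
PING_DEADLINE_MS = 1661488406920

# Output of `date` on the agent when the stuck process was noticed.
NOTICED_AT = datetime(2022, 8, 26, 12, 9, 34, tzinfo=timezone.utc)


def ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to a UTC datetime, truncated to whole seconds."""
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc)


# How long the ping was allowed to run before the channel was terminated.
ping_window_ms = PING_DEADLINE_MS - PING_STARTED_MS

# How long the agent process stayed up without a link after the ping failed.
stuck_for = NOTICED_AT - ms_to_utc(PING_DEADLINE_MS)

print("ping started: ", ms_to_utc(PING_STARTED_MS).isoformat())
print("ping deadline:", ms_to_utc(PING_DEADLINE_MS).isoformat())
print("ping window:  ", ping_window_ms, "ms")
print("stuck for:    ", stuck_for)
```

The ping window works out to 240001 ms (about four minutes), the deadline matches the 04:33:26 journal timestamp, and the process had been sitting in the disconnected state for roughly seven and a half hours by the time `date` was run.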
      

      (here "swarm-client-nutci.sh" is my script that wraps the agent startup on different platforms, see https://github.com/networkupstools/jenkins-swarm-nutci if interested... maybe I am failing to pass it some arguments that would ensure reconnects?)

      If it helps, the settings on the agent that failed most recently were:

      === Debug: jenkins-swarm.yml:
      url: https://ci.networkupstools.org/
      deleteExistingClients: true
      disableClientsUniqueId: true
      jarCache: /home/abuild/.jarcache
      passwordFile: <swarmfarmuser.token>
      username: <swarmfarmuser>
      mode: exclusive
      labelsFile: jenkins-swarm.labels
      pidFile: jenkins-swarm.pid
      executors: 2
      environmentVariables:
        PATH+LOCAL: "/usr/lib/ccache"
      name: "nutci-debian-11-amd64"
      description: "NUT CI swarm worker from nutci-debian-11-amd64 launched Fri Aug 26 04:08:41 UTC 2022" 

      I would expect the swarm agent to try reconnecting (hopefully even resuming the part of a larger build it was running, if any), or at least to die so that systemd etc. would restart it. Currently it too often stays in a "process up, link down" state until I notice and restart the service on the agent machine, or reboot it.

      The problem seems to be OS-agnostic: I have had such agents freeze on Linux, FreeBSD, OpenBSD and OpenIndiana. Probably something does not treat `java.util.concurrent.TimeoutException` as a reason to reconnect, but I could not quickly find the spot in the swarm plugin sources.
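
To make the expected behaviour concrete, here is a minimal sketch of the "reconnect, or else die" policy described above. It is not the swarm plugin's code (which is Java): `ChannelLost`, `run_session` and `supervise` are hypothetical names, and `run_session` stands for "connect and block until the channel ends". The point is only that every way of losing the channel, a ping timeout included, leads either to another attempt or to a non-zero exit.

```python
import time
from typing import Callable


class ChannelLost(Exception):
    """Hypothetical stand-in for any way the agent channel can end (ping timeout, EOF, ...)."""


def supervise(
    run_session: Callable[[], None],
    max_failures: int = 5,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Keep reconnecting; return exit code 1 after too many consecutive failures.

    run_session() is assumed to block while the link is up, return normally when
    a connected session ends, and raise if the session fails.
    """
    failures = 0
    while True:
        try:
            run_session()
            failures = 0  # a session that returned normally resets the counter
        except (ChannelLost, OSError):  # TimeoutError is a subclass of OSError
            failures += 1
            if failures >= max_failures:
                return 1  # give up loudly instead of idling with the link down
        sleep(delay_s)  # pause before the next connection attempt
```

A wrapper could end with `sys.exit(supervise(run_session))`, so that a service manager configured to restart failed units (for example systemd with `Restart=on-failure`) recycles the process instead of leaving it in the "process up, link down" state.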

      UPDATE: This also tends to happen when my Jenkins controller restarts: some agents remain disconnected (but usually not all; I guess it depends on the "luck" of pinging at the wrong time):

      Mar 14 19:23:58 nutci-cross-mingw swarm-client-nutci.sh[2173040]: INFO: Retrying in 10 seconds
      Mar 14 19:24:08 nutci-cross-mingw swarm-client-nutci.sh[2173040]: Mar 14, 2023 7:24:08 PM hudson.plugins.swarm.Client run
      Mar 14 19:24:08 nutci-cross-mingw swarm-client-nutci.sh[2173040]: INFO: Attempting to connect to https://ci.networkupstools.org/
      Mar 14 19:24:09 nutci-cross-mingw swarm-client-nutci.sh[2173040]: Mar 14, 2023 7:24:09 PM hudson.plugins.swarm.SwarmClient getCsrfCrumb
      Mar 14 19:24:09 nutci-cross-mingw swarm-client-nutci.sh[2173040]: SEVERE: Could not obtain CSRF crumb. Response code: 503
      Mar 14 19:24:09 nutci-cross-mingw swarm-client-nutci.sh[2173040]:
      Mar 14 19:24:09 nutci-cross-mingw swarm-client-nutci.sh[2173040]:
      Mar 14 19:24:09 nutci-cross-mingw swarm-client-nutci.sh[2173040]:     <!DOCTYPE html><html lang="en"><head resURL="/static/816875f6" data-rooturl="" data-resurl="/static/816875f6" data-imagesurl="/static/816875f6/images"><title>Starting Jenkins</title><meta name="ROBOTS" content="NOFOLLOW"><meta name="viewport" content="width=device-width, initial-scale=1"><link rel="stylesheet" href="/static/816875f6/jsbundles/simple-page.css" type="text/css"><link rel="stylesheet" href="/static/816875f6/css/loading.css" type="text/css"></head><body><div class="simple-page" role="main"><div class="modal signup"><div class="signupIntroDefault"><div class="logo"><img src="/static/816875f6/images/svgs/logo.svg" alt="Jenkins logo"></div><h1 class="loading">
      Mar 14 19:24:09 nutci-cross-mingw swarm-client-nutci.sh[2173040]:                             Please wait while Jenkins is getting ready to work
      Mar 14 19:24:09 nutci-cross-mingw swarm-client-nutci.sh[2173040]:                             <span>.</span><span>.</span><span>.</span></h1><p class="restarting">Your browser will reload automatically when Jenkins is ready.</div></div></div><script src="/static/816875f6/scripts/loading.js" type="text/javascript"></script></body></html> 

      That controller restart was 2 weeks ago as of this posting, and the agent has stayed effectively down since (still running, so the OS had no reason to restart it, but useless). As seen here, the swarm agent tried to reconnect, saw the "getting ready to work" message and never tried to dial back in (perhaps treating HTTP 5xx as persistently fatal?). This one was with swarm-agent 3.39 on several OSes.
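
As an illustration of the alternative, a client could treat a 503 from a controller that is still starting up as transient and keep retrying with a growing delay. The sketch below is an assumption about what such a policy might look like, not the swarm client's actual logic; `is_transient`, `fetch_with_retry` and the `fetch` callable (returning a status code and body) are hypothetical.

```python
import time
from typing import Callable, Tuple


def is_transient(status: int) -> bool:
    """Treat any HTTP 5xx (e.g. 503 'Starting Jenkins') as worth retrying."""
    return 500 <= status <= 599


def fetch_with_retry(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 6,
    base_delay_s: float = 10.0,
    max_delay_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until it returns 2xx, retrying transient errors with backoff.

    Raises RuntimeError on a non-transient status or once attempts run out, so
    the caller can exit and let a service manager restart the process.
    """
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if 200 <= status < 300:
            return body
        if not is_transient(status) or attempt == max_attempts:
            raise RuntimeError(
                f"giving up after {attempt} attempt(s), last status {status}"
            )
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped
    raise RuntimeError("max_attempts must be at least 1")
```

With the defaults the waits would be 10, 20, 40, 80 and 160 seconds, which covers a controller restart of a few minutes; a longer-lived policy would raise `max_attempts` or simply never give up on 5xx.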

       

            Assignee: Unassigned
            Reporter: Jim Klimov (jimklimov)