-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
swarm-agent 3.34
Jenkins LTS 2.346.3
I have some swarm agents hosted on a home PC that connect to a Jenkins controller on the internet. Every once in a while something times out and the client loses its connection. Sometimes it reconnects, but not always; in pathological cases it stays "offline" on the controller side for days (if I don't notice), with little information in the agent's log, e.g.:
root@nutci-debian-11-amd64:~# journalctl -lfu swarm-client-nutci
-- Journal begins at Wed 2022-05-18 03:14:14 UTC. --
Aug 26 04:09:24 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: INFO: Connected
Aug 26 04:11:49 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: Aug 26, 2022 4:11:49 AM org.eclipse.jgit.lib.Repository close
Aug 26 04:11:49 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: WARNING: close() called when useCnt is already zero for Repository[/home/abuild/jenkins-nutci-debian-11-amd64/workspace/nut_nut_PR-1621@2/.git]
Aug 26 04:11:51 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: Aug 26, 2022 4:11:51 AM org.eclipse.jgit.lib.Repository close
Aug 26 04:11:51 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: WARNING: close() called when useCnt is already zero for Repository[/home/abuild/jenkins-nutci-debian-11-amd64/workspace/nut_nut_PR-1621@2/.git]
Aug 26 04:33:26 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: Aug 26, 2022 4:33:26 AM hudson.slaves.ChannelPinger$1 onDead
Aug 26 04:33:26 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.networkupstools.org/216.158.66.104:2203.
Aug 26 04:33:26 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: java.util.concurrent.TimeoutException: Ping started at 1661488166919 hasn't completed by 1661488406920
Aug 26 04:33:26 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: at hudson.remoting.PingThread.ping(PingThread.java:132)
Aug 26 04:33:26 nutci-debian-11-amd64 swarm-client-nutci.sh[2733292]: at hudson.remoting.PingThread.run(PingThread.java:88)
^C
root@nutci-debian-11-amd64:~# date
Fri Aug 26 12:09:34 UTC 2022
(Here "swarm-client-nutci.sh" is my script that wraps the agent startup on different platforms; see https://github.com/networkupstools/jenkins-swarm-nutci if interested. Maybe I fail to pass it some arguments that would ensure reconnects?)
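To put the numbers from that log in context, here is a small sketch that decodes the two epoch-millisecond values from the `TimeoutException` message and compares them with the time of the final `date` call. The constant and function names are mine and purely illustrative; only the three timestamps come from the log above.

```python
from datetime import datetime, timezone

# Epoch-millisecond values copied from the TimeoutException message in the log.
PING_STARTED_MS = 1661488166919
PING_DEADLINE_MS = 1661488406920


def ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to a UTC datetime, truncated to whole seconds."""
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc)


started = ms_to_utc(PING_STARTED_MS)
deadline = ms_to_utc(PING_DEADLINE_MS)

# How long the ping was allowed to run before the channel was declared dead.
timeout_s = (PING_DEADLINE_MS - PING_STARTED_MS) / 1000

# Output of the `date` command at the end of the session above.
noticed = datetime(2022, 8, 26, 12, 9, 34, tzinfo=timezone.utc)
stuck_for = noticed - deadline

print(f"ping started:  {started:%Y-%m-%d %H:%M:%S} UTC")
print(f"ping deadline: {deadline:%Y-%m-%d %H:%M:%S} UTC")
print(f"timeout:       {timeout_s:.3f} s")
print(f"stuck for:     {stuck_for}")
```

So the ping started at 04:29:26 UTC and was abandoned about 240 seconds later, which matches the 04:33:26 log entry; by the time I ran `date`, the agent had logged nothing further for more than seven and a half hours.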
If it helps, the settings on the agent that failed most recently were:
=== Debug: jenkins-swarm.yml:
url: https://ci.networkupstools.org/
deleteExistingClients: true
disableClientsUniqueId: true
jarCache: /home/abuild/.jarcache
passwordFile: <swarmfarmuser.token>
username: <swarmfarmuser>
mode: exclusive
labelsFile: jenkins-swarm.labels
pidFile: jenkins-swarm.pid
executors: 2
environmentVariables:
  PATH+LOCAL: "/usr/lib/ccache"
name: "nutci-debian-11-amd64"
description: "NUT CI swarm worker from nutci-debian-11-amd64 launched Fri Aug 26 04:08:41 UTC 2022"
I would expect the swarm agent to try reconnecting (hopefully even resuming the part of a larger build it was at, if any), or at least to die so that systemd etc. would restart it. Currently it too often stays in a "process up, link down" state until I notice and restart the service on the agent machine or reboot it.
The problem seems to be OS-agnostic: I have had such agents freeze on Linux, FreeBSD, OpenBSD and OpenIndiana. Probably something does not treat `java.util.concurrent.TimeoutException` as a reason to reconnect, but I could not quickly find the spot in the swarm plugin sources.
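As a stopgap on the agent side, the "process up, link down" state could be detected from the log itself. The following is only a sketch of such a check, not anything the swarm client provides: the function name is hypothetical, and the two marker strings are simply the substrings seen in the log excerpt above, so they are an assumption about what other versions print.

```python
from typing import Iterable

# Substrings taken from the log excerpt in this report (assumed stable).
CONNECTED_MARKER = "INFO: Connected"
LINK_DOWN_MARKER = "Terminating the channel"


def link_is_down(log_lines: Iterable[str]) -> bool:
    """Return True if the latest connection event in the log is a channel termination."""
    down = False
    for line in log_lines:
        if LINK_DOWN_MARKER in line:
            down = True   # channel was torn down
        elif CONNECTED_MARKER in line:
            down = False  # a later reconnect clears the condition
    return down


sample = [
    "swarm-client-nutci.sh[2733292]: INFO: Connected",
    "swarm-client-nutci.sh[2733292]: INFO: Ping failed. Terminating the channel JNLP4-connect connection",
]
print(link_is_down(sample))
```

Fed with the output of `journalctl -u swarm-client-nutci`, a wrapper script could exit non-zero (or kill the JVM) when this returns `True`, so that systemd restarts the service instead of leaving it idle.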
UPDATE: This also tends to happen when my Jenkins controller restarts: some agents remain disconnected (but usually not all; I guess it depends on the "luck" of pinging at the wrong time):
Mar 14 19:23:58 nutci-cross-mingw swarm-client-nutci.sh[2173040]: INFO: Retrying in 10 seconds
Mar 14 19:24:08 nutci-cross-mingw swarm-client-nutci.sh[2173040]: Mar 14, 2023 7:24:08 PM hudson.plugins.swarm.Client run
Mar 14 19:24:08 nutci-cross-mingw swarm-client-nutci.sh[2173040]: INFO: Attempting to connect to https://ci.networkupstools.org/
Mar 14 19:24:09 nutci-cross-mingw swarm-client-nutci.sh[2173040]: Mar 14, 2023 7:24:09 PM hudson.plugins.swarm.SwarmClient getCsrfCrumb
Mar 14 19:24:09 nutci-cross-mingw swarm-client-nutci.sh[2173040]: SEVERE: Could not obtain CSRF crumb. Response code: 503
Mar 14 19:24:09 nutci-cross-mingw swarm-client-nutci.sh[2173040]:
Mar 14 19:24:09 nutci-cross-mingw swarm-client-nutci.sh[2173040]:
Mar 14 19:24:09 nutci-cross-mingw swarm-client-nutci.sh[2173040]: <!DOCTYPE html><html lang="en"><head resURL="/static/816875f6" data-rooturl="" data-resurl="/static/816875f6" data-imagesurl="/static/816875f6/images"><title>Starting Jenkins</title><meta name="ROBOTS" content="NOFOLLOW"><meta name="viewport" content="width=device-width, initial-scale=1"><link rel="stylesheet" href="/static/816875f6/jsbundles/simple-page.css" type="text/css"><link rel="stylesheet" href="/static/816875f6/css/loading.css" type="text/css"></head><body><div class="simple-page" role="main"><div class="modal signup"><div class="signupIntroDefault"><div class="logo"><img src="/static/816875f6/images/svgs/logo.svg" alt="Jenkins logo"></div><h1 class="loading">
Mar 14 19:24:09 nutci-cross-mingw swarm-client-nutci.sh[2173040]: Please wait while Jenkins is getting ready to work
Mar 14 19:24:09 nutci-cross-mingw swarm-client-nutci.sh[2173040]: <span>.</span><span>.</span><span>.</span></h1><p class="restarting">Your browser will reload automatically when Jenkins is ready.</div></div></div><script src="/static/816875f6/scripts/loading.js" type="text/javascript"></script></body></html>
That controller restart was 2 weeks ago as of this posting, and the agent stayed effectively down (running, so the OS had no reason to restart it, but useless). As seen here, the swarm agent tried to reconnect, saw the "getting ready to work" message, and never tried to dial back in (perhaps treating HTTP 5xx as persistently fatal?). This was with swarm-agent-3.39 on several OSes.
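To illustrate the behaviour I would expect here, below is a minimal sketch of a retry loop that treats any HTTP 5xx as "controller not ready yet" and keeps trying with a capped backoff. It is not the swarm client's actual code: every name in it is hypothetical, the `attempt` callable stands in for whatever request returns the status code, and only the 10-second initial delay is taken from the "Retrying in 10 seconds" log line.

```python
import time
from typing import Callable, Optional


def is_transient(status: int) -> bool:
    """Treat any HTTP 5xx as 'server not ready yet' rather than a fatal error."""
    return 500 <= status <= 599


def connect_with_retry(
    attempt: Callable[[], int],
    delay_s: float = 10.0,
    max_delay_s: float = 300.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call attempt() until it returns a non-5xx status and return that status.

    With max_attempts=None the loop never gives up on 5xx responses; otherwise
    the last 5xx status is returned once the attempt budget is used up.
    """
    tries = 0
    while True:
        status = attempt()
        tries += 1
        if not is_transient(status):
            return status
        if max_attempts is not None and tries >= max_attempts:
            return status
        sleep(delay_s)
        # Double the wait after each failure, up to the cap.
        delay_s = min(delay_s * 2, max_delay_s)
```

For example, with a controller that answers 503 twice while starting and then 200, the loop would wait 10 and then 20 seconds and return 200, instead of stopping at the first 503.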
- relates to: JENKINS-70501 swarm-plugin stops reconnection if controller is starting (In Progress)