We are running several JNLP agents on Windows as Windows services using the WinSW wrapper. On some machines, when an agent loses the connection to the master, all running processes are killed and the jobs never complete.
This happens because the agent tries to restart itself when it loses the connection. There are two possible outcomes:
- If the agent runs as a user that is a local administrator (sadly the default, since services run as the SYSTEM user unless configured otherwise), winsw restarts the service. Upon restarting the service, both winsw and Windows kill all processes that belong to the service, which includes all processes of currently running jobs.
- If the agent runs as an unprivileged user, the agent fails to restart itself and logs a confusing error message. However, it reconnects without issue and jobs keep running.
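The two outcomes above can be illustrated with a small toy model. This is only a sketch of the behaviour described in this report, not Jenkins or winsw code: the `Agent` class, its fields, and the log strings are hypothetical names made up for illustration, and it assumes the agent ends up reconnected in both cases.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Toy model of an agent running as a Windows service (hypothetical, not Jenkins code)."""

    # True when the service account is allowed to restart the service
    # (e.g. SYSTEM or a local administrator).
    can_restart_service: bool
    running_jobs: list = field(default_factory=list)
    connected: bool = True
    log: list = field(default_factory=list)

    def on_connection_lost(self):
        self.connected = False
        if self.can_restart_service:
            # The service restart kills every process that belongs to the
            # service, including the processes of currently running jobs.
            self.log.append("service restarted")
            self.running_jobs.clear()
        else:
            # The restart attempt fails and only produces an error message;
            # the running jobs are left alone.
            self.log.append("restart failed")
        # Assumption: the agent is connected again afterwards in both cases.
        self.connected = True


admin = Agent(can_restart_service=True, running_jobs=["build-1"])
admin.on_connection_lost()

unprivileged = Agent(can_restart_service=False, running_jobs=["build-1"])
unprivileged.on_connection_lost()

print(admin.running_jobs, unprivileged.running_jobs)  # prints: [] ['build-1']
```

In this model the only case in which the jobs survive the connection loss is the one where the restart attempt fails.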
Frankly, I don't see any reason why an agent should restart itself on connection loss. For an agent running as a Windows service, the restart can never work properly: it either fails or kills the running jobs, so it is entirely useless.
A solution would be to remove jenkins.slaves.restarter.WinswSlaveRestarter entirely.