
Agent running as Windows service kills all running jobs on reconnect

    • durable-task 1.38

      We are running several JNLP agents on Windows as Windows services using the WinSW wrapper. On some machines, when an agent loses the connection to the controller, all running processes are killed and the jobs never complete.

      This happens because the agent tries to restart itself when it loses the connection. There are two possibilities:

      • If the agent runs as a user that is a local admin (sadly the default, since services run as the SYSTEM user by default), WinSW restarts the service. When the service restarts, both WinSW and Windows kill all processes that belong to the service, which includes all processes of currently running jobs.
      • If the agent runs as an unprivileged user, the agent fails to restart itself and logs a confusing error message. However, it reconnects without issue and jobs keep running.

      Frankly, I don't see any reason why an agent should restart itself on connection loss. In the case of an agent running as a Windows service, it can never work properly and is thus entirely useless.

      A solution would be to remove jenkins.slaves.restarter.WinswSlaveRestarter entirely.

          [JENKINS-47657] Agent running as Windows service kills all running jobs on reconnect

          Oleg Nenashev added a comment - - edited

          IIRC the Windows service agent restart happens if and only if the JNLP process experiences a severe issue. In such a case the Channel will be broken, and all non-durable tasks will likely get aborted anyway.

          > If the agent runs as an unprivileged user, the agent fails to restart itself and logs a confusing error message. However, it reconnects without issue and jobs keep running.

          Are you talking about Pipeline jobs or other Durable Task implementations?

          > A solution would be to remove jenkins.slaves.restarter.WinswSlaveRestarter entirely.

          Adding a flag for disabling the restarter is definitely reasonable. Complete removal needs more research. I have never been brave enough to run Jenkins agents as a local admin.

          > (sadly the default, since services run as the SYSTEM user by default)

          Yes, I would rather rework the current Installer GUI entirely. It should just generate a sample config and then point the user to installation guidelines. I hope nobody runs JNLP files as administrator, so the service installation in the Web UI should fail by default on Win7+ systems.


          Thomas Bächler added a comment -

          > IIRC Windows service agent restart happens if and only if the JNLP process experiences a severe issue. In such case a Channel will be broken, and all non-durable tasks will likely get aborted anyway.

          From what I can tell from the logs, a restart happens on every connection loss:

          Okt 02, 2017 9:26:05 AM hudson.remoting.jnlp.Main$CuiListener status
          INFORMATION: Connected
          Okt 02, 2017 10:04:20 AM hudson.remoting.jnlp.Main$CuiListener status
          INFORMATION: Terminated
          Okt 02, 2017 10:04:35 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
          INFORMATION: Failed to connect to the master. Will retry again
          [...] (previous message repeats until the master is reachable again)
          Okt 02, 2017 10:06:01 AM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onReconnect
          INFORMATION: Restarting agent via jenkins.slaves.restarter.WinswSlaveRestarter@3c30a4a0

          > > If the agent runs as an unprivileged user, the agent fails to restart itself and logs a confusing error message. However, it reconnects without issue and jobs keep running.

          > Are you talking about Pipeline jobs or other Durable Task implementations?

          We only use Pipeline jobs.

          > Adding a flag for Disabling the restarter is definitely reasonable. Regarding the complete removal, it needs more research.

          Fair enough, though I still fail to see a case where the restart is useful.

          > > (sadly the default, since services run as the SYSTEM user by default)

          > Yes, I would rather rework the current Installer GUI entirely. It should just generate a sample config and then point the user to installation guidelines. I hope nobody runs JNLP files as administrator, so the service installation in Web UI should fail by default Win7+ systems.

          True, the installer from the GUI always fails. However, running jenkins-slave.exe install as admin (after the GUI install failed) installs the service, but sets its executing user to the SYSTEM user (which is the Windows default). This is very bad practice IMO, but if that default is changed, the WinswSlaveRestarter would never work.
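
          For anyone who wants to check the same pattern in their own agent logs, here is a minimal sketch that counts connection losses and restart attempts. It assumes the two-line java.util.logging layout shown in the excerpt above and matches only the message texts that appear there ("Terminated" and "Restarting agent via"); the function name and the sample constant are illustrative, not part of Jenkins.

```python
# Sample taken from the log excerpt in this thread (German locale).
SAMPLE_LOG = r"""
Okt 02, 2017 9:26:05 AM hudson.remoting.jnlp.Main$CuiListener status
INFORMATION: Connected
Okt 02, 2017 10:04:20 AM hudson.remoting.jnlp.Main$CuiListener status
INFORMATION: Terminated
Okt 02, 2017 10:04:35 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFORMATION: Failed to connect to the master. Will retry again
Okt 02, 2017 10:06:01 AM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onReconnect
INFORMATION: Restarting agent via jenkins.slaves.restarter.WinswSlaveRestarter@3c30a4a0
"""


def count_events(log_text):
    """Return (connection_losses, restart_attempts) found in an agent log."""
    losses = 0
    restarts = 0
    for line in log_text.splitlines():
        # Message lines look like "LEVEL: message". Header lines (timestamp,
        # class, method) contain no ": " sequence and are skipped.
        _level, sep, message = line.strip().partition(": ")
        if not sep:
            continue
        if message == "Terminated":
            losses += 1
        elif message.startswith("Restarting agent via"):
            restarts += 1
    return losses, restarts


if __name__ == "__main__":
    losses, restarts = count_events(SAMPLE_LOG)
    print(f"connection losses: {losses}, restart attempts: {restarts}")
```

          If the two counts match over a longer log, that supports the observation that every connection loss is followed by a restart attempt; the level name in front of the colon is ignored, so the sketch does not depend on the log locale.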

          Oleg Nenashev added a comment -

          > However, running jenkins-slave.exe install as admin (after the GUI install failed) installs the service, but sets its executing user to the SYSTEM user (which is the Windows default). This is very bad practice IMO - but if that default is changed, the WinswSlaveRestarter would never work.

          Actually it's configurable: https://github.com/kohsuke/winsw/blob/master/doc/xmlConfigFile.md#service-account . The problem is that the option is not provided by default in the Jenkins config. Defining passwords as plain text is also far from a good recommendation, but WinSW also supports interactive mode.


          m t added a comment -

          Has anybody found a workaround for this? We have a Windows agent that has to run as an unprivileged user and it's quite annoying that it doesn't restart itself when it disconnects.

          I also don't see a reason for the service to ever restart. It completely breaks the durability of pipeline jobs. JENKINS-27617 may also fix this, but imho not restarting in the first place is a better option.

          This is with Jenkins 2.152 and the agent running on Windows 10 x64.


          m t added a comment -

          For anyone else looking for a workaround, it turns out using SSH with Windows works quite well. I installed it as described at the link below and connected the node with "Launch agent via SSH".

          https://github.com/PowerShell/Win32-OpenSSH/wiki/Install-Win32-OpenSSH


          Flemming Steffensen added a comment -

          One of our Win-10 machines was very seriously affected by this, going offline several times per day.

          As a workaround, I added a batch script that checks the status of the service every 10 minutes and restarts the service if it is stopped.

          The batch script is scheduled with the Windows Task Scheduler and set to run as a high-priority task whenever the computer starts. Note that the script must be run by at least a local administrator.

          I've placed the following code in a file called c:\Jenkins\EnsureJenkinsServiceRunnning.cmd :

          @echo off
          rem Name of the agent service to watch
          set "ServiceName=jenkinsslave-C__Jenkins"
          rem The third token of the STATE line printed by "sc query" is the state name, e.g. RUNNING
          for /F "tokens=3 delims=: " %%H in ('sc query "%ServiceName%" ^| findstr "        STATE"') DO (
            if /i "%%H" neq "RUNNING" (
              rem Any other state: try to start the service
              net start "%ServiceName%"
            )
          )
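
          The parsing step of the batch workaround above can also be sketched in Python, which makes it easier to test against captured `sc query` output. This is only an illustration under assumptions: it expects the English `STATE : <code>  <NAME>` line that the batch script also relies on, and the function names are made up for this example.

```python
import subprocess


def parse_service_state(sc_output):
    """Extract the state name (e.g. RUNNING, STOPPED) from `sc query` output."""
    for line in sc_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "STATE":
            # value looks like " 4  RUNNING": numeric code, then the state name
            fields = value.split()
            if len(fields) >= 2:
                return fields[1]
    return None


def ensure_running(service_name):
    """Start the service unless `sc query` reports RUNNING (Windows only)."""
    result = subprocess.run(
        ["sc", "query", service_name], capture_output=True, text=True
    )
    state = parse_service_state(result.stdout) or ""
    if state.upper() != "RUNNING":
        # Like the batch script, this needs local administrator rights.
        subprocess.run(["net", "start", service_name], check=False)
```

          Calling `ensure_running("jenkinsslave-C__Jenkins")` from a scheduled task would mirror the batch script; keeping the parser in its own function means it can be checked without touching a real service.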

          Carroll Chiou added a comment -

          See https://issues.jenkins.io/browse/JENKINS-27617 for a possible solution to this issue


            Assignee: Unassigned
            Reporter: Thomas Bächler (procom_bl)
            Votes: 3
            Watchers: 12