[JENKINS-25582] Installing the Jenkins service: the slave cannot reconnect after the first system restart

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Critical
    • Component: windows-slaves-plugin
    • Environment: Jenkins master (1.580.1) on Ubuntu 14.04 and Jenkins slave on Windows 7 64-bit.

      I tried to use the Windows service, which can be installed via a JNLP connection, to let the slave nodes connect to the master automatically. The installation seems to have a problem shutting down the initial connection to the master. After a system restart, the slave can no longer reconnect to the master because it is still marked as connected. It looks like the slave did not disconnect properly when the initial JNLP connection was killed. Further restarts do not show this behavior. For now, the only way to let the client connect again is to restart Jenkins on the master machine.

      Steps:
      1. Set up Jenkins and a slave node via Java Web Start
      2. Connect a Windows slave
      3. Select 'File | Install as a service'
      4. Wait for the service to be installed
      5. Restart the slave

      Once the slave has been restarted, the service tries to reconnect to the master, but the connection is rejected:

      Nov 12, 2014 11:11:05 PM hudson.TcpSlaveAgentListener$ConnectionHandler run
      INFO: Accepted connection #618 from /192.168.123.3:60703
      Nov 12, 2014 11:11:05 PM jenkins.slaves.JnlpSlaveHandshake error
      WARNING: TCP slave agent connection handler #618 with /192.168.123.3:60703 is aborted: dummy-windows is already connected to this master. Rejecting this connection.
      Nov 12, 2014 11:11:05 PM jenkins.slaves.JnlpSlaveHandshake error
      WARNING: TCP slave agent connection handler #618 with /192.168.123.3:60703 is aborted: Unrecognized name: dummy-windows

      These messages appear hundreds of times in rapid succession.
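To gauge how often the master rejects a given agent, the rejection lines can be counted per node name. This is a minimal Python sketch that assumes the master log uses exactly the message format quoted above; the pattern and function names are hypothetical and not part of Jenkins.

```python
import re
from collections import Counter

# Matches the "already connected" rejection lines quoted above (assumed format).
REJECTED_RE = re.compile(r"is aborted: (\S+) is already connected to this master")


def count_rejections(log_text: str) -> Counter:
    """Count 'already connected' rejections per agent (node) name."""
    return Counter(match.group(1) for match in REJECTED_RE.finditer(log_text))


if __name__ == "__main__":
    sample = (
        "INFO: Accepted connection #618 from /192.168.123.3:60703\n"
        "WARNING: TCP slave agent connection handler #618 with /192.168.123.3:60703 "
        "is aborted: dummy-windows is already connected to this master. "
        "Rejecting this connection.\n"
        "WARNING: TCP slave agent connection handler #618 with /192.168.123.3:60703 "
        "is aborted: Unrecognized name: dummy-windows\n"
    )
    # The "Unrecognized name" line is not counted as an "already connected" rejection.
    print(count_rejections(sample))  # Counter({'dummy-windows': 1})
```

Running it over the full master log would show whether the burst of rejections all targets the same node.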

          Oleg Nenashev added a comment -

          It happens due to the TCP timeout and the PingThread on the master.
          The Jenkins master considers a slave connected as long as:

          • There are no failed requests from the master to the slave
          • The PingThread has not failed with the 4-minute timeout (currently hardcoded)

          To fix the issue, increase the reconnect interval in the slave service configuration.
          This decreases the frequency of connection attempts.
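The timing described in this comment can be illustrated with a small model. It is a sketch under simplifying assumptions, not Jenkins code: the old agent disappears at t = 0, the master drops the stale channel once a fixed timeout (the 4 minutes mentioned above) has elapsed, and the rebooted agent retries at a fixed interval. All names are hypothetical.

```python
from dataclasses import dataclass

# Assumption: the master drops the stale channel exactly this many seconds
# after the old agent went away (the 4-minute ping timeout from the comment).
STALE_TIMEOUT_S = 4 * 60


@dataclass
class ReconnectOutcome:
    rejected_attempts: int  # attempts refused with "already connected"
    accepted_at_s: int      # time of the first attempt the master accepts


def simulate_reconnect(first_attempt_s: int, interval_s: int,
                       stale_timeout_s: int = STALE_TIMEOUT_S) -> ReconnectOutcome:
    """Replay the rebooted agent's retries against the master's stale channel."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rejected = 0
    t = first_attempt_s
    # While the stale channel is still alive, every attempt is rejected.
    while t < stale_timeout_s:
        rejected += 1
        t += interval_s
    return ReconnectOutcome(rejected_attempts=rejected, accepted_at_s=t)


if __name__ == "__main__":
    # Agent comes back 60 s after the reboot began; compare retry intervals.
    for interval in (1, 10, 60, 100):
        print(interval, simulate_reconnect(60, interval))
```

In this model a longer interval mainly reduces the number of rejected attempts (and the log noise); it cannot make the agent reconnect before the stale channel is dropped, and an interval that overshoots the timeout can delay the reconnect slightly (260 s instead of 240 s with a 100 s interval).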

          Daniel Beck added a comment -

          Oleg: Did you investigate this? Shouldn't a proper slave restart disconnect the previous connection?

          I'd check in Task Manager whether two slave processes are somehow running after the service is installed and started.
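Daniel's check can also be scripted. This is a minimal Python sketch that counts Java processes in the CSV output of the Windows `tasklist` command (for example `tasklist /FO CSV /NH`). It assumes the agent runs as `java.exe` or `javaw.exe` and that no other Java processes run on the node, so a count above one would hint at a duplicate agent. The function name is hypothetical.

```python
import csv
import io

# Assumption: the agent shows up under one of these image names.
AGENT_IMAGE_NAMES = {"java.exe", "javaw.exe"}


def count_java_processes(tasklist_csv: str) -> int:
    """Count rows whose first column (Image Name) is a Java executable."""
    count = 0
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if row and row[0].strip().lower() in AGENT_IMAGE_NAMES:
            count += 1
    return count
```

On the slave, the input could be obtained with `subprocess.run(["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True).stdout`; the csv module handles the quoted memory column, which itself contains commas.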

          Oleg Nenashev added a comment -

          danielbeck, I see this behavior on my installations (remoting-2.36). We had to increase the TCP timeout to the order of minutes because of extremely unreliable VPN connections between sites, so our slaves effectively have the 4-minute timeout. In that case the issue appears easily even when the slave behaves correctly.

          BTW, my previous comment applies to the case where both WinSW and remoting behave correctly. There could still be a bug, so it definitely makes sense to check for runaway processes as Daniel proposed.

          Oleg Nenashev added a comment -

          Took the issue to my backlog

          Mark Waite added a comment -

          Won't be fixed. See JENKINS-67604 for the details of the deprecation of agents started by WMI calls using DCOM.

            Assignee: Unassigned
            Reporter: Henrik Skupin (whimboo)
            Votes: 3
            Watchers: 7
