[JENKINS-43128] JNLP Slave run as windows service fail to reconnect after slave reboot

       

      We have an issue where Windows slaves go offline every time our infrastructure team patches them. The scenario is simply this:

      1. The machines get patched with the latest Windows patches.
      2. This triggers a reboot.
      3. The slave service shuts down with a log entry in the jenkins-slave.wrapper log to the effect of:
        2017-03-27 07:50:19 - Shutdown exception
        Message:A system shutdown is in progress. (Exception from HRESULT: 0x8007045B)
        Stacktrace:   at System.Runtime.InteropServices.Marshal.ThrowExceptionForHRInternal(Int32 errorCode, IntPtr errorInfo)
           at System.Management.ManagementScope.InitializeGuts(Object o)
           at System.Management.ManagementScope.Initialize()
           at System.Management.ManagementObjectSearcher.Initialize()
           at System.Management.ManagementObjectSearcher.Get()
           at winsw.WrapperService.GetChildPids(Int32 pid)
           at winsw.WrapperService.StopProcessAndChildren(Int32 pid)
           at winsw.WrapperService.StopIt()
           at winsw.WrapperService.OnShutdown()
         
      4. The slave restarts and we see this in the jenkins-slave_<date>.err log:
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main createEngine
        INFO: Setting up slave: sv20-jenddb-001
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener <init>
        INFO: Jenkins agent is running in headless mode.
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Locating server among [https://jenkins.core.cvent.org/]
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Handshaking
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Connecting to jenkins.core.cvent.org:55087
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Server reports protocol JNLP3-connect not supported, skipping
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Trying protocol: JNLP2-connect
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Server didn't accept the handshake: sv20-jenddb-001 is already connected to this master. Rejecting this connection.
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Connecting to jenkins.core.cvent.org:55087
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Trying protocol: JNLP-connect
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Server didn't accept the handshake: sv20-jenddb-001 is already connected to this master. Rejecting this connection.
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener status
        INFO: Connecting to jenkins.core.cvent.org:55087
        Mar 27, 2017 7:52:52 AM hudson.remoting.jnlp.Main$CuiListener error
        SEVERE: The server rejected the connection: None of the protocols were accepted
        java.lang.Exception: The server rejected the connection: None of the protocols were accepted
        	at hudson.remoting.Engine.onConnectionRejected(Engine.java:380)
        	at hudson.remoting.Engine.run(Engine.java:352)
         

      We then go in and restart the slave service manually and everything is fine.
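To spot this failure without reading each log by hand, the rejection line can be matched directly. The following is only a minimal sketch: the function name `find_stale_rejections` and the sample text are hypothetical, and the pattern is copied from the "already connected" message in the log above.

```python
import re

# Pattern copied from the rejection line in the agent .err log above.
ALREADY_CONNECTED = re.compile(
    r"Server didn't accept the handshake: (?P<agent>\S+) "
    r"is already connected to this master"
)


def find_stale_rejections(log_text):
    """Return the agent names whose reconnect was rejected as 'already connected'.

    Hypothetical helper: each name is listed once, in order of first appearance.
    """
    agents = []
    for line in log_text.splitlines():
        match = ALREADY_CONNECTED.search(line)
        if match and match.group("agent") not in agents:
            agents.append(match.group("agent"))
    return agents


# Sample lines shortened from the log above (timestamps omitted).
sample = "\n".join([
    "INFO: Trying protocol: JNLP2-connect",
    "INFO: Server didn't accept the handshake: sv20-jenddb-001 is already "
    "connected to this master. Rejecting this connection.",
    "INFO: Trying protocol: JNLP-connect",
    "INFO: Server didn't accept the handshake: sv20-jenddb-001 is already "
    "connected to this master. Rejecting this connection.",
    "SEVERE: The server rejected the connection: None of the protocols were accepted",
])

print(find_stale_rejections(sample))
```

A non-empty result right after a reboot would indicate the situation described here, as opposed to some other connection failure.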

      What seems to be happening is that when the slave service shuts down due to a system shutdown request, it fails to notify the master that it is shutting down. As a result, when it starts back up after the reboot, the master still thinks it is connected and refuses to allow it to connect. By the time we get in there to manually restart the service, the master has realized the slave is offline, so the service restart/reconnection works fine at that point.

      It seems there are two possible solutions here:

      1. The slave should notify the master that it is shutting down so that the master will not still think it is 'online'.
      2. The master, when it receives a connection request for a slave that it thinks is 'online', should verify that the old connection is really still active before refusing to accept the new one.

      Or do both?
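The two proposals above can be illustrated with a toy model. This is a sketch under stated assumptions, not Jenkins code: `Connection`, `AgentRegistry`, `ping`, and `disconnect` are hypothetical names, and the boolean `alive` flag stands in for a real liveness probe with a timeout.

```python
class Connection:
    """Hypothetical stand-in for one agent channel."""

    def __init__(self, alive=True):
        self.alive = alive

    def ping(self):
        # Assumption: a real check would probe the channel with a timeout.
        return self.alive

    def close(self):
        self.alive = False


class AgentRegistry:
    """Hypothetical master-side table of connected agents."""

    def __init__(self):
        self._connections = {}

    def connect(self, name, connection):
        """Proposal 2: verify the old connection before rejecting a new one."""
        existing = self._connections.get(name)
        if existing is not None and existing.ping():
            return False  # the old connection really is active, so reject
        if existing is not None:
            existing.close()  # the old connection is stale, so drop it
        self._connections[name] = connection
        return True

    def disconnect(self, name):
        """Proposal 1: the agent tells the master it is shutting down."""
        connection = self._connections.pop(name, None)
        if connection is not None:
            connection.close()


registry = AgentRegistry()
old = Connection()
registry.connect("sv20-jenddb-001", old)

old.alive = False  # the machine reboots without notifying the master
accepted = registry.connect("sv20-jenddb-001", Connection())
print(accepted)
```

In this model the reconnect after an unannounced reboot is accepted because the stale entry fails the liveness check, while a second connection for an agent that is still active is rejected as before.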

      Note we are able to reproduce this simply by rebooting a Windows slave. It always fails to reconnect as described.

          Kenneth Baltrinic created issue -
          Oleg Nenashev made changes -
          Assignee New: Oleg Nenashev [ oleg_nenashev ]
          Oleg Nenashev made changes -
          Component/s New: windows-slave-installer-module [ 21834 ]

          Oleg Nenashev added a comment -

          It duplicates JENKINS-22692, which has been fixed in Jenkins 2.50. The fix (upgrade to WinSW 2.0.2) is not subject to backporting IIRC, so it should be available in the next LTS baseline after 2.46.x. If you need it earlier, please vote in the referenced ticket.

          Oleg Nenashev made changes -
          Link New: This issue duplicates JENKINS-22692 [ JENKINS-22692 ]
          Oleg Nenashev made changes -
          Resolution New: Duplicate [ 3 ]
          Status Original: Open [ 1 ] New: Resolved [ 5 ]

            Assignee: Oleg Nenashev (oleg_nenashev)
            Reporter: Kenneth Baltrinic (kbaltrinic)