Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-24272

jnlp slaves fail to reconnect when master is restarted

XMLWordPrintable

      I have noticed that whenever I restart my Jenkins master my jnlp slaves are not reconnecting and require a manual slave restart to bring them back online.

      I've traced this back to the changes to fix JENKINS-19055. Specifically those changes cause the slave JVM to be restarted when the master disconnects. Prior to this change the remoting engine would wait for the server to restart before attempting to reconnect to the master. With the change it immediately tries to connect to the master and get a connection error because the master is restarting. This causes the slave to immediately terminate.

      Jenkins 1.575 gives the following slave log output when restarting the master

      Aug 12, 2014 3:55:15 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Terminated
      Aug 12, 2014 3:55:15 PM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$2$1 onDisconnect
      INFO: Restarting slave via jenkins.slaves.restarter.UnixSlaveRestarter@32a9f661
      Aug 12, 2014 3:55:17 PM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: bishop
      Aug 12, 2014 3:55:17 PM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      Aug 12, 2014 3:55:17 PM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among [http://jenkins.example/]
      Aug 12, 2014 3:55:18 PM hudson.remoting.jnlp.Main$CuiListener error
      SEVERE: http://jenkins.example/tcpSlaveAgentListener/ is invalid: 503 Service Temporarily Unavailable
      java.lang.Exception: http://jenkins.example/tcpSlaveAgentListener/ is invalid: 503 Service Temporarily Unavailable
              at hudson.remoting.Engine.run(Engine.java:213)
      

      Notice the "jenkins.slaves.restarter.JnlpSlaveRestarterInstaller" onDisconnect log message that performs a slave restart.

      Prior to JENKINS-19055 being integrated the slave called waitForServerToBack() repeatedly until the master came back online. For example

      25-Mar-2014 10:52:16 hudson.remoting.jnlp.Main$CuiListener status
      INFO: Terminated
      25-Mar-2014 10:52:26 hudson.remoting.Engine waitForServerToBack
      INFO: Failed to connect to the master. Will retry again
      java.net.ConnectException: Connection refused
              at java.net.PlainSocketImpl.socketConnect(Native Method)
              at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
              at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
              at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
              at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:385)
              at java.net.Socket.connect(Socket.java:546)
              at sun.net.NetworkClient.doConnect(NetworkClient.java:173)
              at sun.net.www.http.HttpClient.openServer(HttpClient.java:409)
              at sun.net.www.http.HttpClient.openServer(HttpClient.java:530)
              at sun.net.www.http.HttpClient.<init>(HttpClient.java:240)
              at sun.net.www.http.HttpClient.New(HttpClient.java:321)
              at sun.net.www.http.HttpClient.New(HttpClient.java:338)
              at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:935)
              at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:876)
              at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:801)
              at hudson.remoting.Engine.waitForServerToBack(Engine.java:371)
              at hudson.remoting.Engine.run(Engine.java:278)
      ...
      25-Mar-2014 10:54:11 hudson.remoting.Engine waitForServerToBack
      INFO: Master isn't ready to talk to us. Will retry again: response code=503
      25-Mar-2014 10:54:21 hudson.remoting.Engine waitForServerToBack
      INFO: Master isn't ready to talk to us. Will retry again: response code=503
      25-Mar-2014 10:54:31 hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among [http://jenkins.example/]
      25-Mar-2014 10:54:31 hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to jenkins.example:42715
      25-Mar-2014 10:54:31 hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      25-Mar-2014 10:54:32 hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connected
      

      The connection/retry logic is contained in remoting Engine.java
      https://github.com/jenkinsci/remoting/blob/master/src/main/java/hudson/remoting/Engine.java

      When connecting to the slave an error causes the connection to terminate (around line 232)

      if(firstError!=null) {
        events.error(firstError);
        return;
      }
      

      prior to JENKINS-19055 hooking into onDisconnect() a re-connection would not be attempted until waitForServerToBack() had ensured that the master had recovered.

      events.onDisconnect();
      // try to connect back to the server every 10 secs.
      waitForServerToBack();
      

      A quick and dirty fix would likely be to swap the onDisconnect and waitForServerToBack calls around.

            oldelvet Richard Mortimer
            oldelvet Richard Mortimer
            Votes:
            4 Vote for this issue
            Watchers:
            11 Start watching this issue

              Created:
              Updated:
              Resolved: