JENKINS-2566

Windows slave (service) not able to start after reboot

    Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: other
    • Labels:
      None
    • Environment:
      Platform: All, OS: All

      Description

      We have configured our Windows (Server 2003) slave to launch as a service.
      When we restart the slave (while the master keeps running), it is not able to
      connect to the master.
      After restarting the master, the slave is able to connect again.

      Error trace in slave log:
      Nov 3, 2008 11:30:26 AM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Hudson agent is running in headless mode.
      Nov 3, 2008 11:30:26 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating Server
      Nov 3, 2008 11:30:26 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to scrat:42682
      Nov 3, 2008 11:30:26 AM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      Nov 3, 2008 11:30:26 AM hudson.remoting.jnlp.Main$CuiListener error
      SEVERE: unexpected stream termination
      java.io.EOFException: unexpected stream termination
      at hudson.remoting.Channel.<init>(Channel.java:261)
      at hudson.remoting.Channel.<init>(Channel.java:205)
      at hudson.remoting.Engine.run(Engine.java:83)

        Attachments

          Activity

          cbos Cees Bos added a comment -

          found in Tomcat logfile:
          Nov 3, 2008 11:30:27 AM hudson.TcpSlaveAgentListener$ConnectionHandler run
          INFO: Accepted connection #5 from /10.3.30.54:1097
          Nov 3, 2008 11:30:27 AM hudson.TcpSlaveAgentListener$ConnectionHandler error
          WARNING: Connection #5 is aborted: Already connected
          Nov 3, 2008 11:30:48 AM hudson.remoting.Channel$ReaderThread run
          SEVERE: I/O error in channel srv-nl-crd61
          java.net.SocketException: Connection reset
          at java.net.SocketInputStream.read(SocketInputStream.java:168)
          at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
          at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
          at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2247)
          at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2540)
          at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2550)
          at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
          at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
          at hudson.remoting.Channel$ReaderThread.run(Channel.java:637)
          Nov 3, 2008 11:30:48 AM hudson.TcpSlaveAgentListener$ConnectionHandler$1 onClosed
          WARNING: Connection #3 terminated
          java.net.SocketException: Connection reset
          at java.net.SocketInputStream.read(SocketInputStream.java:168)
          at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
          at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
          at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2247)
          at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2540)
          at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2550)
          at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
          at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
          at hudson.remoting.Channel$ReaderThread.run(Channel.java:637)

          scm_issue_link SCM/JIRA link daemon added a comment -

          Code changed in hudson
          User: kohsuke
          Path:
          trunk/hudson/main/core/src/main/java/hudson/TcpSlaveAgentListener.java
          http://fisheye4.cenqua.com/changelog/hudson/?cs=12963
          Log:
          JENKINS-2566 improving error message

          kohsuke Kohsuke Kawaguchi added a comment -

          The error message isn't very friendly (which I fixed in 1.259), but basically
          it's saying that the master thinks this slave is already connected.

          Could it be that you are running two slave agents on the same machine, or that
          multiple slaves are registered under the same name?
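
A minimal sketch of the situation described here, assuming the master tracks connected slaves by name and rejects a second connection under a name it still considers connected. The class `AgentRegistry` and its methods are hypothetical names used only for illustration; they are not taken from Hudson's `TcpSlaveAgentListener`.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration only; not Hudson's actual listener code. */
final class AgentRegistry {
    /** Names of slaves the master currently believes are connected. */
    private final Set<String> connected = ConcurrentHashMap.newKeySet();

    /**
     * Tries to register a connection for the given slave name.
     * Returns false when the name is still marked as connected,
     * which is the "Already connected" case.
     */
    boolean connect(String agentName) {
        return connected.add(agentName);
    }

    /** Called once the master notices that the old channel has closed. */
    void onClosed(String agentName) {
        connected.remove(agentName);
    }
}
```

In the Tomcat log above, connection #5 was rejected at 11:30:27 and the older connection #3 was only reported as terminated at 11:30:48. In terms of this sketch, `connect` keeps returning false until `onClosed` has run for the stale entry.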

          cbos Cees Bos added a comment -

          I already suspected that the slave was not able to connect because the
          master thought it was already connected. That is why I logged it and attached
          both logfile entries.

          At the moment I tried to connect the slave, the Hudson display showed it as away.
          That slave had just been rebooted. After rebooting, it was not able to reconnect.
          So maybe it was not able to disconnect properly while the machine was
          shutting down. This happened two times in a row after rebooting.
          We had some performance problems with the slave, so it may be that the
          slave was not able to disconnect properly due to these performance issues. But
          it should still be able to reconnect, I think? Now I had to reboot the master
          (after first waiting until all jobs were finished) to be able to reconnect the
          slave.

          carlspring carlspring added a comment -

          We are seeing exactly the same behavior.
          Our master is running on a Linux machine and one of the slaves is running Windows 2003 R2, 64-bit, via JNLP.
          It doesn't seem to be a problem with the SCM polling; I checked.

          I am not too sure if the following workaround really works, but at the moment it seems to have tricked it:

          • In the service, click the "Recovery" tab:
            1) First failure: "Restart the service"
            2) Second failure: "Restart the service"
            3) Subsequent failures: "No action"
            4) Reset counter after 0 days
            5) Restart service after three minutes

          So far, it seems to bring it up. I suspect there is a Windows issue that needs to be further investigated and addressed. A fix without a clear understanding of the cause isn't a real fix, and the problem tends to reappear later on.
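
The recovery settings listed above amount to a bounded retry with a fixed delay: two restarts, three minutes apart, then no further action. A rough Java sketch of that policy follows. `BoundedRestart` and `startWithRestarts` are hypothetical names, and this only illustrates the policy; it is not how the Windows service manager is implemented.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical illustration of "restart twice, then give up". */
final class BoundedRestart {
    /**
     * Runs the start action once, then restarts it up to maxRestarts
     * more times, waiting delayMillis between attempts.
     * Returns true as soon as one attempt reports success.
     */
    static boolean startWithRestarts(BooleanSupplier start, int maxRestarts, long delayMillis)
            throws InterruptedException {
        for (int attempt = 0; attempt <= maxRestarts; attempt++) {
            if (start.getAsBoolean()) {
                return true;
            }
            // Wait only when another restart is still allowed.
            if (attempt < maxRestarts) {
                Thread.sleep(delayMillis);
            }
        }
        return false; // corresponds to "Subsequent failures: No action"
    }
}
```

With the settings from the list, the call would be `startWithRestarts(action, 2, 180_000L)`, where `action` stands for whatever starts the slave and reports whether it connected.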

          thewallf Jan-Felix Wall added a comment -

          I ran into the same problem today and suspected that the slave service is missing its proper dependencies (the slave runs on Windows 7 64-bit).
          I manually added "DCOM Server Process Launcher" and "Windows Management Instrumentation" (which it presumably needs) to the service dependencies, and it has worked so far.

          mindful Andriy Semenchenko added a comment -

          I have this issue on my CI.
          Details:
          Jenkins is installed on CentOS release 6.3 (Final).
          The slave was running on Windows 7.

          The issue appeared because the file system on the Jenkins server was full. After I cleared space on the Jenkins server and restarted the Jenkins app, the slaves connected successfully.
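
Since a full file system on the server was the trigger in this case, a simple free-space check can help rule it out before restarting anything. The sketch below uses the standard `java.io.File.getUsableSpace()` call. The class name `DiskSpaceCheck`, the method name, and any directory or threshold passed to it are assumptions for illustration.

```java
import java.io.File;

/** Hypothetical helper; names and thresholds are illustrative only. */
final class DiskSpaceCheck {
    /**
     * Returns true when the partition holding dir reports at least
     * minFreeBytes of space usable by this JVM.
     */
    static boolean hasAtLeast(File dir, long minFreeBytes) {
        return dir.getUsableSpace() >= minFreeBytes;
    }
}
```

For example, `DiskSpaceCheck.hasAtLeast(new File("/var/lib/jenkins"), 1L << 30)` would check for 1 GiB of usable space, assuming that path is where the server keeps its data on the machine in question.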


            People

            Assignee:
            Unassigned
            Reporter:
            cbos Cees Bos
            Votes:
            2
            Watchers:
            4
