Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-45290

windows slaves stops being able to connect to master

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Critical Critical
    • core, remoting

      After a while (how long seems to be random, but between days to ~4 weeks after jenkins is started) windows slaves stop being able to connect to jenkins master.
      After a restart of the jenkins service, all slaves can connect without issues. When this issue occur, all existing connections continue to work.
      There are no issues marking nodes as temporarily offline. But if I reboot a windows slave it can't connect at all until the jenkins master is restarted.
      We tried connecting the slave to another jenkins master (we copied all data from our production jenkins so we got an exact clone) and the slave has no problem connecting
      to the staging jenkins master. We have seen this issue on that server as well.
      The only way for us to resolve this is to restart the jenkins service on the jenkins master. As it starts all jenkins slaves connect automatically and work again.
      This isn't a good solution as we have to abort all jobs in order to restart the service. And for a production server that leads to many (very) unhappy users.

      jenkins-slave.err.log: From launching the java web start until the problem occur. It just hangs there, doing nothing.
      jul 04, 2017 1:46:24 EM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: jenkins_slave
      jul 04, 2017 1:46:24 EM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      jul 04, 2017 1:46:24 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among http://jenkins_master.company.domain/, http://jenkins_master.company.domain:8080/
      jul 04, 2017 1:46:25 EM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
      INFO: Remoting server accepts the following protocols: [JNLP4-connect, JNLP-connect, Ping, JNLP2-connect]
      jul 04, 2017 1:46:25 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Agent discovery successful
      Agent address: jenkins_master.company.domain
      Agent port: 49187
      Identity: 3d:69:2e:de:e9:84:8b:2b:fd:7b:ad:8c:00:ea:cb:32
      jul 04, 2017 1:46:25 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      jul 04, 2017 1:46:25 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to jenkins_master.company.domain:49187
      jul 04, 2017 1:46:25 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Trying protocol: JNLP4-connect

       

      Output from "java -jar slave.jar -jnlpUrl http://jenkins_master.company.domain/computer/jenkins_slave/slave-agent.jnlp -secret very_secret_hex_string":
      jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main createEngine
      INFO: Setting up slave: jenkins_slave
      jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener <init>
      INFO: Jenkins agent is running in headless mode.
      jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Locating server among http://jenkins_master.company.domain/
      jul 04, 2017 2:10:18 EM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
      INFO: Remoting server accepts the following protocols: [JNLP4-connect, JNLP-connect, Ping, JNLP2-connect]
      jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Agent discovery successful
      Agent address: jenkins_master.company.domain
      Agent port: 49187
      Identity: 3d:69:2e:de:e9:84:8b:2b:fd:7b:ad:8c:00:ea:cb:32
      jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Handshaking
      jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Connecting to jenkins_master.company.domain:49187
      jul 04, 2017 2:10:18 EM hudson.remoting.jnlp.Main$CuiListener status
      INFO: Trying protocol: JNLP4-connect

      tcpdump from master side over the course of 5 minutes:
      root@jenkins_master:~# date; tcpdump -nnvvv src slave_IP and dst port 49187 -s0 -vv -X -c 1000; date
      Tue Jul 4 13:51:00 CEST 2017
      tcpdump: listening on ens5f0, link-type EN10MB (Ethernet), capture size 262144 bytes
      13:51:26.751986 IP (tos 0x0, ttl 128, id 24431, offset 0, flags [DF], proto TCP (6), length 52)
      slave_IP.58214 > master_IP.49187: Flags [S], cksum 0xec23 (correct), seq 4123494006, win 8192, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
      0x0000: 4500 0034 5f6f 4000 8006 5831 0a21 14a6 E..4_o@...X1.!..
      0x0010: 0a21 1a3c e366 c023 f5c7 8676 0000 0000 .!.<.f.#...v....
      0x0020: 8002 2000 ec23 0000 0204 05b4 0103 0308 .....#..........
      0x0030: 0101 0402 ....
      13:51:26.752189 IP (tos 0x0, ttl 128, id 24432, offset 0, flags [DF], proto TCP (6), length 40)
      slave_IP.58214 > master_IP.49187: Flags [.], cksum 0xa8dd (correct), seq 4123494007, ack 2362185277, win 256, length 0
      0x0000: 4500 0028 5f70 4000 8006 583c 0a21 14a6 E..(_p@...X<.!..
      0x0010: 0a21 1a3c e366 c023 f5c7 8677 8ccc 163d .!.<.f.#...w...=
      0x0020: 5010 0100 a8dd 0000 0000 0000 0000 P.............
      13:51:26.803856 IP (tos 0x0, ttl 128, id 24433, offset 0, flags [DF], proto TCP (6), length 64)
      slave_IP.58214 > master_IP.49187: Flags [P.], cksum 0xc27a (correct), seq 0:24, ack 1, win 256, length 24
      0x0000: 4500 0040 5f71 4000 8006 5823 0a21 14a6 E..@_q@...X#.!..
      0x0010: 0a21 1a3c e366 c023 f5c7 8677 8ccc 163d .!.<.f.#...w...=
      0x0020: 5018 0100 c27a 0000 0016 5072 6f74 6f63 P....z....Protoc
      0x0030: 6f6c 3a4a 4e4c 5034 2d63 6f6e 6e65 6374 ol:JNLP4-connect
      13:51:26.804301 IP (tos 0x0, ttl 128, id 24434, offset 0, flags [DF], proto TCP (6), length 45)
      slave_IP.58214 > master_IP.49187: Flags [P.], cksum 0x1c72 (correct), seq 24:29, ack 1, win 256, length 5
      0x0000: 4500 002d 5f72 4000 8006 5835 0a21 14a6 E..-_r@...X5.!..
      0x0010: 0a21 1a3c e366 c023 f5c7 868f 8ccc 163d .!.<.f.#.......=
      0x0020: 5018 0100 1c72 0000 0003 4143 4b00 P....r....ACK.
      4 packets captured
      Tue Jul 4 13:55:57 CEST 2017

      So it seems that the slave can try to connect, but the master refuses to respond.

      We have tried asking google for help, but without luck

      We run the slave as a windows service, but we also see the same issue via Java Web Start, and the command line suggested by jenkins
      (java -jar slave.jar -jnlpUrl http://jenkins_master.company.domain/computer/jenkins_slave/slave-agent.jnlp -secret very_secret_hex_string)

      Since we see the same issue when starting the slave as a windows service, java web start and commandline we're happy to support you on the way that is easiest to troubleshoot.

       

      We're happy to provide additional information if need be.

          [JENKINS-45290] windows slaves stops being able to connect to master

          Oleg Nenashev added a comment -

          I need stactraces from both Jenkins Master and the agent for the time when the connection hangs.

          Oleg Nenashev added a comment - I need stactraces from both Jenkins Master and the agent for the time when the connection hangs.

          Robin Ekerhag added a comment -

          I have attached the log files.

          Robin Ekerhag added a comment - I have attached the log files.

          Oleg Nenashev added a comment -

          Have you or your IT applied any security configuration changes on the machine? "An established connection was aborted by the software in your host machine" from antivirus or Firewall. So this is a system issue.

          E.g. see https://social.msdn.microsoft.com/Forums/en-US/212feb8f-cf96-4561-9953-40a21d21ea47/exception-an-established-connection-was-aborted-by-the-software-in-your-host-machine?forum=netfxnetcom

          Oleg Nenashev added a comment - Have you or your IT applied any security configuration changes on the machine? "An established connection was aborted by the software in your host machine" from antivirus or Firewall. So this is a system issue. E.g. see https://social.msdn.microsoft.com/Forums/en-US/212feb8f-cf96-4561-9953-40a21d21ea47/exception-an-established-connection-was-aborted-by-the-software-in-your-host-machine?forum=netfxnetcom

          Robin Ekerhag added a comment -

          That is indeed interesting, but doesn't seem to overlap with the problems we see.

          After the issue occur, none of the windows slaves can connect from being in a disconnected state. We have 50+ windows slaves on this jenkins master.

          If we restart the jenkins master, the problem goes away on all slaves. It seems very improbable to me that this would be a system issue based on that information.

          We can also direct the slaves to a different master and they have no trouble connecting to that one.

          I pasted the TCPdump in the description of this issue, showing all traffic sent over 5 minutes. The master received 4 packets, but sent none.

          Robin Ekerhag added a comment - That is indeed interesting, but doesn't seem to overlap with the problems we see. After the issue occur, none of the windows slaves can connect from being in a disconnected state. We have 50+ windows slaves on this jenkins master. If we restart the jenkins master, the problem goes away on all slaves. It seems very improbable to me that this would be a system issue based on that information. We can also direct the slaves to a different master and they have no trouble connecting to that one. I pasted the TCPdump in the description of this issue, showing all traffic sent over 5 minutes. The master received 4 packets, but sent none.

          Oleg Nenashev added a comment -

          In such case the connection termination probably makes the connection receiver to go into a bogus state. Curious thing is that it happens for both JNLP2 and JNLP4 which are supposed to use different NIO hubs. Would it be possible to get logs before 3:06 PM, probably there is a root cause error there. I would also appreciate a Master's thread dump for the time the agents cannot connect.

           

          I also see another minor issue in REST API, will create a separate defect for it:

          java.lang.Thread.run(Thread.java:748) Caused by: org.kohsuke.stapler.export.NotExportableException: class java.nio.channels.ClosedChannelException doesn't have @ExportedBean so cannot write hudson.slaves.OfflineCause$ChannelTermination.cause at org.kohsuke.stapler.export.Model.<init>(Model.java:79) at org.kohsuke.stapler.export.ModelBuilder.get(ModelBuilder.java:51) at

          Oleg Nenashev added a comment - In such case the connection termination probably makes the connection receiver to go into a bogus state. Curious thing is that it happens for both JNLP2 and JNLP4 which are supposed to use different NIO hubs. Would it be possible to get logs before 3:06 PM, probably there is a root cause error there. I would also appreciate a Master's thread dump for the time the agents cannot connect.   I also see another minor issue in REST API, will create a separate defect for it: java.lang.Thread.run(Thread.java:748) Caused by: org.kohsuke.stapler.export.NotExportableException: class java.nio.channels.ClosedChannelException doesn't have @ExportedBean so cannot write hudson.slaves.OfflineCause$ChannelTermination.cause at org.kohsuke.stapler.export.Model.<init>(Model.java:79) at org.kohsuke.stapler.export.ModelBuilder.get(ModelBuilder.java:51) at

          Robin Ekerhag added a comment -

          We have logs going back to January. I haven't tried, but I expect that the current log (from Jul 2, 2017 06.35 AM) is too large to just attach to this ticket. Uncompressed it is 24GB, compressed it should be between 2-3 GB.

          How should I send this log file to you?

          Can you please explain, or point me to documentation, on how to obtain the Master's thread dump?

          Robin Ekerhag added a comment - We have logs going back to January. I haven't tried, but I expect that the current log (from Jul 2, 2017 06.35 AM) is too large to just attach to this ticket. Uncompressed it is 24GB, compressed it should be between 2-3 GB. How should I send this log file to you? Can you please explain, or point me to documentation, on how to obtain the Master's thread dump?

          Robin Ekerhag added a comment -

          We have been forced to restart jenkins. As I mentioned earlier, this works as a temporary workaround since all windows slaves are able to connect again.

          We will try to set up monitoring for the next time this issue occur so we have less logs to dig through. Are there any logs from a troubleshooting perspective that would be interesting to follow where we can't look up the information afterwards?

          Robin Ekerhag added a comment - We have been forced to restart jenkins. As I mentioned earlier, this works as a temporary workaround since all windows slaves are able to connect again. We will try to set up monitoring for the next time this issue occur so we have less logs to dig through. Are there any logs from a troubleshooting perspective that would be interesting to follow where we can't look up the information afterwards?

          Oleg Nenashev added a comment -

          > Can you please explain, or point me to documentation, on how to obtain the Master's thread dump?

          You can use low-level tools like jstack, but I would rather recommend generating a support bundle using https://wiki.jenkins.io/display/JENKINS/Support+Core+Plugin

          Oleg Nenashev added a comment - > Can you please explain, or point me to documentation, on how to obtain the Master's thread dump? You can use low-level tools like jstack, but I would rather recommend generating a support bundle using https://wiki.jenkins.io/display/JENKINS/Support+Core+Plugin

          Robin Ekerhag added a comment -

          I will install the support core plugin on Monday (July 10th).

          Robin Ekerhag added a comment - I will install the support core plugin on Monday (July 10th).

          Robin Ekerhag added a comment -

          Support Core Plugin is now installed

          Robin Ekerhag added a comment - Support Core Plugin is now installed

          Oleg Nenashev added a comment -

          Waiting for the bundles then, thanks!

          Oleg Nenashev added a comment - Waiting for the bundles then, thanks!

          Oleg Nenashev added a comment -

          ping

          Oleg Nenashev added a comment - ping

          Robin Ekerhag added a comment -

          Apologies for the late reply, I have been on vacation.

           

          We are still waiting on the issue to reappear so we can generate the support bundles for you.

          Robin Ekerhag added a comment - Apologies for the late reply, I have been on vacation.   We are still waiting on the issue to reappear so we can generate the support bundles for you.

          Ronen Bar added a comment -

          Any update on this one? we are facing same issue here with Win Slaves.

          Ronen Bar added a comment - Any update on this one? we are facing same issue here with Win Slaves.

          Oleg Nenashev added a comment -

          I cannot investigate the issue without master/agent threaddumps

          ronen_tef if you are ready to provide them, it would help

          Oleg Nenashev added a comment - I cannot investigate the issue without master/agent threaddumps ronen_tef if you are ready to provide them, it would help

          Robin Ekerhag added a comment -

          I have not experienced the issue since I installed the support core plugin.

          I'm waiting for the issue too occur again so I can provide the logs necessary.

          Robin Ekerhag added a comment - I have not experienced the issue since I installed the support core plugin. I'm waiting for the issue too occur again so I can provide the logs necessary.

          Ronen Bar added a comment -

          oleg_nenashev could you please let me know how to take those threaddumps?

          Ronen Bar added a comment - oleg_nenashev could you please let me know how to take those threaddumps?

          Ronen Bar added a comment - - edited

          Okay, I'm attaching 2 outputs of the Master thread dump,

          one is via Jenkins console and one using java jconsole directly from the unix command line.

          As for the Win slave I can't take any as the proccess even not started, the only thing I could provide is an exeption output.

          Ronen Bar added a comment - - edited Okay, I'm attaching 2 outputs of the Master thread dump, one is via Jenkins console and one using java jconsole directly from the unix command line. As for the Win slave I can't take any as the proccess even not started, the only thing I could provide is an exeption output.

          Ronen Bar added a comment -

          Any progress/estimation for a fix?

          Ronen Bar added a comment - Any progress/estimation for a fix?

          Oleg Nenashev added a comment -

          ronen_tef an agent log would be useful just to have a full snapshots. If this is "java.io.IOException: An established connection was aborted by the software in your host machine" again, likely it's a whatever issue in your system.

          I cannot provide any ETA until the bug is triaged at least. But I would not expect the fix in October even if the issue is confirmed (unless it's a quick fix).

          Oleg Nenashev added a comment - ronen_tef an agent log would be useful just to have a full snapshots. If this is "java.io.IOException: An established connection was aborted by the software in your host machine" again, likely it's a whatever issue in your system. I cannot provide any ETA until the bug is triaged at least. But I would not expect the fix in October even if the issue is confirmed (unless it's a quick fix).

          Ronen Bar added a comment -

          oleg_nenashev as already commented, I can't provide any agent log as the agent is not even started therfore a log is not created.

          "...again, likely it's a whatever issue in your system...." in this case, how could it be that the same is not occur with ver. 2.46.1 ? just once upgrading to ver. 2.78.

          The behavier should be consistent I assume.

          Ronen Bar added a comment - oleg_nenashev as already commented, I can't provide any agent log as the agent is not even started therfore a log is not created. "...again, likely it's a whatever issue in your system...." in this case, how could it be that the same is not occur with ver. 2.46.1 ? just once upgrading to ver. 2.78. The behavier should be consistent I assume.

          Oleg Nenashev added a comment -

          ronen_tef Currently I cannot  confirm you have the same issue as ekerhag. If you are sure it's the same, you should be able to provide agent connection stdout/stderr at least (from either slave.jar STDOUT or wrapper logs).

           

          Oleg Nenashev added a comment - ronen_tef Currently I cannot  confirm you have the same issue as ekerhag . If you are sure it's the same, you should be able to provide agent connection stdout/stderr at least (from either slave.jar STDOUT or wrapper logs).  

          Oleg Nenashev added a comment -

          Unfortunately I have no capacity to work on Remoting in medium term, so I will unassign it and let others to take it. If somebody is interested to submit a pull request, I will be happy to help to get it reviewed and released.

          Oleg Nenashev added a comment - Unfortunately I have no capacity to work on Remoting in medium term, so I will unassign it and let others to take it. If somebody is interested to submit a pull request, I will be happy to help to get it reviewed and released.

          Steve Kipping added a comment -

          The same issue has started happening for me between a Windows master and Windows slave. Nothing we do with the slave fixes it, but restarting the master service allows the connection to be made and it works again.

          Jenkins master 2.190.1, Java  1.8.0_144

          Steve Kipping added a comment - The same issue has started happening for me between a Windows master and Windows slave. Nothing we do with the slave fixes it, but restarting the master service allows the connection to be made and it works again. Jenkins master 2.190.1, Java  1.8.0_144

            Unassigned Unassigned
            ekerhag Robin Ekerhag
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated: