• Bug
    • Resolution: Fixed
    • Critical
    • ssh-slaves-plugin
    • None
    • Jenkins 1.529
      OSX 10.8.4 (running as a VMWare Guest in VMWare Workstation 9.0.2 inside a Windows 7 Host)
      also Jenkins 1.645, OSX 10.9, 10.10 (not vm)
      also observed with Windows and Linux slaves.
    • ssh-slaves-1.31.1

      I configured an OSX slave to use an SSH connection. I have an identical setup for a Linux slave. The Linux slave never hangs, but the OSX one does randomly every couple of days.

      When the slave hangs, I see:

      This node is being launched. See log for more details
      

      When I click on more details I see an empty log (literally no characters) with a spinning wheel.

      I'd like to disconnect the channel and try again. Unfortunately, there is no "disconnect" button, seemingly because the hang occurs too early in the connection phase.

      The only way I found to fix this problem is to restart the Jenkins master. I believe this issue is high priority because:

      1. This hang occurs at least once a day (for over a week now).
      2. There is no known workaround.
      3. There is no way to recover except to restart the master node, which means that all running jobs have to be interrupted.

      If you can add extra logging, I can try collecting more information for you. Where do we get started?

          [JENKINS-19465] Slave hangs while being launched

          Sergii Ovcharenko added a comment -

          Hi hyei,

          I was having this issue today as well. After hours of googling I found the following:

          https://bugs.java.com/view_bug.do?bug_id=4820090

          It looks like there is a bug in Java that makes the SSH Slaves plugin hang while it is establishing a secure connection to the slave. I implemented the suggested workaround and the issue is gone for me.

          In short, Java can hang when it reads random bytes from /dev/random. The suggested workaround is to replace /dev/random with /dev/urandom, which works more reliably with Java:

          sudo rm /dev/random
          sudo ln -s /dev/urandom /dev/random

          So if your master is running on Linux, you can try this workaround.

          Let me know if it helps.
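A less invasive variant of the workaround above, as a minimal sketch rather than a verified fix for this issue: instead of replacing the device node, check how much entropy the kernel reports and point the JVM at /dev/urandom through the standard `java.security.egd` system property. It assumes a Linux master whose start script honors a `JAVA_OPTS` environment variable; that is an assumption, so adjust it to however your master is actually launched.

```shell
# Linux only: print the kernel's current entropy estimate, if it is exposed.
if [ -r /proc/sys/kernel/random/entropy_avail ]; then
    cat /proc/sys/kernel/random/entropy_avail
fi

# Assumption: the Jenkins master is started by a script that reads JAVA_OPTS.
# "file:/dev/./urandom" is the commonly used spelling of this workaround.
JAVA_OPTS="${JAVA_OPTS:-} -Djava.security.egd=file:/dev/./urandom"
export JAVA_OPTS
echo "$JAVA_OPTS"
```

Unlike the symlink, this only affects the JVM that is started with the option and leaves /dev/random untouched for other processes.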

          Ovidiu-Florin Bogdan added a comment -

          I'm still seeing this issue in SSH Slaves plugin 1.25.1 with Jenkins 2.89.3.

          Ovidiu-Florin Bogdan added a comment -

          This issue is still happening on SSH Slaves plugin 1.25.1 with Jenkins 2.89.3.

          The curious thing is that I only see it on one of our slaves.

          Oleg Nenashev added a comment -

          ovidiub13 would it be possible to get stacktraces from agent/master?


          Ovidiu-Florin Bogdan added a comment -

          I'd love to. How do I get them? Can you point me to some docs on this?

          I have both the Jenkins master and the slave running in Docker containers.

          It works now because I changed the slave IP, triggered a connection that failed, then switched the IP back.

          For the moment I've used sovcharenko's solution and symlinked /dev/random to /dev/urandom, but I can change it back if you tell me how to get the stacktraces from a running Jenkins.

          Remember that I don't have any errors or messages in the node connection log. Just the spinning gif thingy.

          Oleg Nenashev added a comment -

          Well, generally you need to dump the stacktraces while the connection is hanging. See https://forums.docker.com/t/how-to-dump-heap-from-a-java-program-running-in-container/3217 . Your mileage may vary.

          For the master side you can also use https://wiki.jenkins.io/display/JENKINS/Support+Core+Plugin
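For a master running in Docker, one way to take such a dump while the launch is hanging can be sketched as follows. This is a hedged example, not a procedure confirmed in this thread: `dump_jvm_threads` is a hypothetical helper, the container name and PID are placeholders, and it assumes the image ships a full JDK so that `jstack` is on the PATH inside the container.

```shell
# Hypothetical helper: print a thread dump of a JVM running inside a container.
# $1 = container name or ID, $2 = PID of the java process inside the container.
# Assumes jstack is available in the container (a JDK, not just a JRE).
dump_jvm_threads() {
    container="$1"
    pid="$2"
    docker exec "$container" jstack "$pid"
}

# Example (placeholder names; run it while the node is stuck launching):
#   dump_jvm_threads jenkins-master 7 > master-threads.txt
```

`docker exec` runs inside the existing container, so this route does not by itself require restarting the container with extra privileges.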


          Ovidiu-Florin Bogdan added a comment - edited

          The Support Core plugin gives empty logs for the slave in question.

          The slave node gets no SSH connection attempt from the master. Getting the slave stack trace is not possible since slave.jar is not being executed.

          I'm having no luck using the nsenter utility to enter the container and obtain the master stack trace. I would need to restart the container holding the master with --privileged to be able to get the stack trace. This would be rather tricky.

          P.S. Symlinking /dev/random to /dev/urandom on the slave has no effect. I realize now that I should have done this on the master.

          /dev/random on the master has enough entropy; it works just fine.


          Oleg Nenashev added a comment -

          FYI ifernandezcalvo. I have never been able to diagnose this issue in detail after the last patches, but it seems there are more unfixed run conditions.

          I have no capacity to work on it anytime soon, so I will assign it and let others take it.


          Ivan Fernandez Calvo added a comment - edited

          Overall recommendations:

          • Use JDK versions that are as close as possible, and of the same major version, on the Jenkins instance and the agents.
          • Tune the TCP stack of the Jenkins instance and the agents:
              • On Linux: http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
              • On Windows: https://blogs.technet.microsoft.com/nettracer/2010/06/03/things-that-you-may-want-to-know-about-tcp-keepalives/
              • On Mac: https://www.gnugk.org/keepalive.html
          • Check for hs_err_pid error files in the root fs of the agent: http://www.oracle.com/technetwork/java/javase/felog-138657.html#gbwcy
          • Check the logs in the root fs of the agent.
          • Set the initial heap of the agent to at least 512M (-Xmx512m -Xms512m); you could start with 512m and lower the value until you find a proper value for your agents.
          • Disable energy-saving options that suspend or hibernate the host.
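To make the TCP keepalive recommendation concrete for Linux, here is a small sketch. The helper name `keepalive_detect_seconds` is made up for illustration, and the value in the final comment is only an example, not a setting recommended in this thread. It shows how long an idle connection to a dead peer can linger with the stock Linux settings and prints the current values on the host. Note that these kernel settings only apply to sockets that have keepalive enabled.

```shell
# Worst-case seconds before the kernel gives up on an idle connection to a dead peer:
# idle time before the first probe + (probe interval * number of probes).
keepalive_detect_seconds() {
    echo $(( $1 + $2 * $3 ))
}

# Linux defaults: tcp_keepalive_time=7200, tcp_keepalive_intvl=75, tcp_keepalive_probes=9
keepalive_detect_seconds 7200 75 9    # prints 7875 (a bit over two hours)

# Current values on this host (Linux only; skipped where /proc is not available).
for f in tcp_keepalive_time tcp_keepalive_intvl tcp_keepalive_probes; do
    if [ -r "/proc/sys/net/ipv4/$f" ]; then
        echo "$f=$(cat "/proc/sys/net/ipv4/$f")"
    fi
done

# To change a value until the next reboot (illustrative number, requires root):
#   sudo sysctl -w net.ipv4.tcp_keepalive_time=600
```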

          Ivan Fernandez Calvo added a comment -

          The default connection timeout and retry settings should resolve this issue:
          https://issues.jenkins-ci.org/browse/JENKINS-52739

            ifernandezcalvo Ivan Fernandez Calvo
            cowwoc cowwoc
            Votes: 12
            Watchers: 23
