• Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • ssh-slaves-plugin
    • None
    • Jenkins 1.529
      OSX 10.8.4 (running as a VMWare Guest in VMWare Workstation 9.0.2 inside a Windows 7 Host)
      also Jenkins 1.645, OSX 10.9, 10.10 (not vm)
      also observed with Windows and Linux slaves.
    • ssh-slaves-1.31.1

      I configured an OSX slave to use an SSH connection. I have an identical setup for a Linux slave. The Linux slave never hangs, but the OSX one does randomly every couple of days.

      When the slave hangs, I see:

      This node is being launched. See log for more details
      

      When I click on more details I see an empty log (literally no characters) with a spinning wheel.

      I'd like to disconnect the channel and try again. Unfortunately, there is no "disconnect" button, seemingly because the hang occurs too early in the connection phase.

      The only way I have found to fix this problem is to restart the Jenkins master. I believe this issue is high priority because:

      1. This hang occurs at least once a day (for over a week now).
      2. There is no known workaround.
      3. There is no way to recover except to restart the master node, which means that all running jobs have to be interrupted.

      If you can add extra logging, I can try collecting more information for you. Where do we get started?

          [JENKINS-19465] Slave hangs while being launched

          cowwoc created issue -

          cowwoc added a comment - - edited

          According to the node's "load statistics" it was running fine until exactly 9am. Then, for an unknown reason, the node got disconnected. When I got around to looking at Jenkins later that day I found the node in the "This node is being launched" state again... hanging forever.

          I'd like to avoid having to restart the Jenkins server once a day (or potentially multiple times a day) to fix the OSX slave. Any ideas?

          I see an open ssh tunnel from master to the OSX machine but I see no proof that Jenkins is running (according to both "jps" and "ps").

          Is there a way for me to find out why the node got disconnected (a log that spans multiple connections/disconnections) and what it's blocked on trying to reconnect?


          Pe Le added a comment -

          Same issue in my case.
          The slave is OS X 10.9.4 running on a genuine Mac Mini, the master is a Windows 7 PC, and both are on the local network.

          I tried to launch the slave headlessly from the slave machine itself, as documented at https://wiki.jenkins-ci.org/display/JENKINS/Distributed+builds#Distributedbuilds-LaunchslaveagentviaJavaWebStart:

          $ java -jar slave.jar -jnlpUrl http://<our-server>/computer/slave-name/slave-agent.jnlp

          Which, unfortunately, gives this:

          Waiting 10 seconds before retry
          Failing to obtain http://<our-server>/computer/macos-builder/slave-agent.jnlp
          java.io.IOException: Failed to load http://<our-server>/computer/macos-builder/slave-agent.jnlp: 403 Forbidden
          at hudson.remoting.Launcher.parseJnlpArguments(Launcher.java:274)
          at hudson.remoting.Launcher.run(Launcher.java:218)
          at hudson.remoting.Launcher.main(Launcher.java:192)
          Waiting 10 seconds before retry
          ... etc. infinitely


          Erik Purins added a comment - - edited

          You may also need to pass '-secret HASH' on the command line to authenticate, where HASH is the secret token for that node. See JENKINS-18342 for a Groovy console script to get your node's secret, which is otherwise never displayed.
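
For reference, the command line with the secret could be assembled as in the sketch below. The `-jnlpUrl` and `-secret` options and the `slave-agent.jnlp` path are taken from the comments above; the function name `build_agent_command`, the host `jenkins.example.com`, and the placeholder secret `HASH` are hypothetical and only there for illustration.

```python
import shlex
from urllib.parse import quote


def build_agent_command(jenkins_url, node_name, secret, jar_path="slave.jar"):
    """Return the argv list for launching a JNLP slave with its secret.

    jenkins_url, node_name and secret are supplied by the caller; nothing
    here talks to Jenkins, it only assembles the command line.
    """
    # Node names may contain spaces, so percent-encode the path segment.
    jnlp_url = "{}/computer/{}/slave-agent.jnlp".format(
        jenkins_url.rstrip("/"), quote(node_name, safe="")
    )
    return ["java", "-jar", jar_path, "-jnlpUrl", jnlp_url, "-secret", secret]


if __name__ == "__main__":
    # Hypothetical values: replace the host, node name and secret with yours.
    argv = build_agent_command("http://jenkins.example.com", "macos-builder", "HASH")
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Whether the missing secret is really what causes the 403 above is not confirmed in this thread, so treat this as something to try rather than a known fix.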


          Erik Purins added a comment -

          When I reconnect on the console to try and work around the issue, the node comes back online (executors available in the left column), but the slave node icon remains blinking as if it is not connected correctly.


          Jan Hudec added a comment - - edited

          For me, when this problem occurred it always worked to mark the slave offline, which abandons the current connection attempt, and then try to connect it again.

          Except, well, not this time.


          Daniel Beck added a comment -

          Please provide a thread dump taken while this is happening (the /threadDump URL).
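
One way to capture that dump while the slave is stuck is sketched below. Only the `/threadDump` path comes from this thread. The rest is assumption: that the master accepts HTTP basic authentication with a user name and API token, that thread entries in the returned text are separated by blank lines, and that `SSHLauncher` is a useful keyword to filter on. The names `fetch_thread_dump` and `threads_mentioning` are hypothetical.

```python
import base64
import re
import urllib.request


def fetch_thread_dump(jenkins_url, user, api_token):
    """Download the raw /threadDump page from the master (assumes basic auth)."""
    url = jenkins_url.rstrip("/") + "/threadDump"
    request = urllib.request.Request(url)
    credentials = base64.b64encode("{}:{}".format(user, api_token).encode()).decode()
    request.add_header("Authorization", "Basic " + credentials)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def threads_mentioning(dump_text, keyword):
    """Return the blank-line-separated blocks of the dump that contain keyword."""
    normalized = dump_text.replace("\r\n", "\n")
    blocks = [block.strip() for block in re.split(r"\n\s*\n", normalized)]
    return [block for block in blocks if block and keyword in block]


if __name__ == "__main__":
    # Hypothetical host and credentials: replace with your own.
    dump = fetch_thread_dump("http://jenkins.example.com", "admin", "API_TOKEN")
    for block in threads_mentioning(dump, "SSHLauncher"):
        print(block)
        print()
```

Attaching the full dump to the issue is still preferable; the filter is only a quick way to see whether a launch thread is blocked.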


          cowwoc added a comment -

          I can no longer reproduce this issue in version 1.570. How about the other reporters? Does this issue still happen for you?


            ifernandezcalvo Ivan Fernandez Calvo
            cowwoc cowwoc
            Votes: 12
            Watchers: 23
