[JENKINS-16078] All types of Windows slaves hang on start in ~1 of 10 build starts

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: slave-status-plugin
    • Labels: None
    • Environment: Jenkins 1.492 on Ubuntu
      JNLP slave on Windows 7 x86

      The slave is started in the Nimbula cloud; our own plugin is used to start the virtual machine. The plugin is not published yet,
      but I really don't think the trouble is in it.

      Steps are:
      1. A job is started.
      2. Our plugin starts a VM to run the job.
      3. The VM boots.
      4. On startup it connects to the JNLP URL.
      5. The build process starts.
      Sometimes, in roughly 1 of 10 starts (rather rare), the build process hangs.
      The console is then in this state:
      Started by timer
      Building remotely on Nimbula-slave-1110 in workspace /jenkins-slave/workspace/PRODUCTION - Sanity
      [Throbber spins here indefinitely]

      Node system info is attached.
      I would be very much obliged if you could help me understand the possible causes!


          Serge Tsygankov added a comment (edited)

          Hi,

          I still reproduce this issue sometimes, even more rarely than 1 in 10.
          Is there a way to turn on additional debug output in the log, to see at exactly which step the hang happens?

          The problem is that it happens really rarely, and I guess it depends a lot on timing.
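
          On the question of extra debug output: Jenkins and its slave.jar log through java.util.logging, so one option is to raise the level of a logger namespace. The following is only a minimal sketch of that mechanism. The class name RemotingDebugLogging is hypothetical, and the choice of the "hudson.remoting" namespace (the package of the remoting classes) is an assumption; nothing in this report confirms that the hang is logged there.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

class RemotingDebugLogging {
    // java.util.logging holds loggers weakly, so keep a strong reference;
    // otherwise the configured logger could be garbage-collected.
    private static Logger pinned;

    /** Raises the given logger namespace to FINE and echoes it to stderr. */
    static Logger enable(String namespace) {
        Logger logger = Logger.getLogger(namespace);
        logger.setLevel(Level.FINE);

        Handler handler = new ConsoleHandler(); // writes to System.err
        handler.setLevel(Level.FINE);           // handlers default to INFO
        logger.addHandler(handler);

        pinned = logger;
        return logger;
    }

    public static void main(String[] args) {
        // Assumption: "hudson.remoting" is the namespace of interest.
        Logger logger = enable("hudson.remoting");
        logger.fine("FINE-level logging is now visible");
    }
}
```

          The same effect can usually be had without code by pointing the JVM at a logging properties file via -Djava.util.logging.config.file; the programmatic form is shown only because it is self-contained.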


          Uwe Schindler added a comment (edited)

          Hi, the same happens with other Windows slave types:

          • Windows slave started via DCOM (no longer used: horrible to configure, needs holes in the firewall, all bad)
          • Windows slave started via JNLP (no longer used because slave.jar is not updated automatically)
          • Windows slave started via SSH (our current solution: a custom script of ours SCPs slave.jar from the WAR file to the slave, then connects to the machine over SSH and starts slave.jar. We have to do this because the Jenkins SSH slaves plugin cannot handle the native Windows cmd32 shell; we don't use Cygwin but the BitviseSSH server instead).

          We see hangs every few builds, in most cases when the slave had started only shortly before. So it seems to affect the first build that starts after the slave has started. (We use VirtualBox to wake a Windows VM from standby, update and start slave.jar, run the jobs tied to this VM, and then stop slave.jar and let the VM go back to standby/saved state.)

          Once it happens again, I will take a jstack trace from a second SSH console on the Windows machine.

          The Windows machine runs Windows 7 Professional 64-bit with the BitviseSSH setup described above.
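
          For reference, jstack is run from outside the process as "jstack <pid>" and prints every thread's stack. The sketch below shows roughly the same information gathered from inside a JVM with the standard java.lang.management API. It is an illustration of what such a dump contains, not a substitute for jstack on a hung slave; the class name ThreadDump is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class ThreadDump {
    /** Returns a jstack-like text dump of every live thread in this JVM. */
    static String capture() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Ask for lock details only where this JVM supports them.
        ThreadInfo[] infos = bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());

        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: skip threads that ended meanwhile
            }
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState());
            if (info.getLockName() != null) {
                // The lock a blocked or waiting thread is parked on.
                out.append(" waiting on ").append(info.getLockName());
            }
            out.append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(capture());
    }
}
```

          In a dump of a hung build start, the threads in BLOCKED or WAITING state and the locks they wait on are the lines worth comparing between a good and a bad run.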


          lsb want added a comment (edited)

          It is similar with the AWS cloud.


            Assignee: Unassigned
            Reporter: Serge Tsygankov (baza)
            Votes: 2
            Watchers: 3
