When using the Amazon EC2 plugin, builds on newly created EC2 instances sometimes hang.
      We have two permanent slaves for a certain label (e.g. 'unittest'),
      which are generated from an AMI.
      The same AMI is specified in the Amazon EC2 plugin settings.

      We have a job that can be executed concurrently.
      When we invoke three builds at once, the two permanent slaves are occupied
      and a new one is created.

      The problem is that the build on the new slave hangs at the very end,
      where the xUnit plugin is aggregating the test results.

      [CHECKSTYLE] Collecting checkstyle analysis files...
      [CHECKSTYLE] Computing warning deltas based on reference build #850
      [FINDBUGS] Collecting findbugs analysis files...
      [FINDBUGS] Computing warning deltas based on reference build #850
      Archiving artifacts
      [xUnit] [INFO] - Starting to record.
      [xUnit] [INFO] - Processing JUnit
      [xUnit] [INFO] - [JUnit] - 581 test report file(s) were found with the pattern '**/testresult/**/*.xml' relative to '/var/lib/jenkins/workspace/400_Precommit_Check_Branch' for the testing framework 'JUnit'.
      

      After aborting the build, the following error is shown.

      ERROR: Publisher org.jenkinsci.plugins.xunit.XUnitPublisher aborted due to exception
      java.lang.InterruptedException
      	at java.lang.Object.wait(Native Method)
      	at hudson.remoting.Request.call(Request.java:146)
      	at hudson.remoting.Channel.call(Channel.java:665)
      	at hudson.FilePath.act(FilePath.java:841)
      	at hudson.FilePath.act(FilePath.java:825)
      	at org.jenkinsci.plugins.xunit.XUnitPublisher.performTests(XUnitPublisher.java:170)
      	at org.jenkinsci.plugins.xunit.XUnitPublisher.performXUnit(XUnitPublisher.java:115)
      	at org.jenkinsci.plugins.xunit.XUnitPublisher.perform(XUnitPublisher.java:92)
      	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:804)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:779)
      	at hudson.model.Build$BuildExecution.post2(Build.java:183)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:726)
      	at hudson.model.Run.execute(Run.java:1541)
      	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      	at hudson.model.ResourceController.execute(ResourceController.java:88)
      	at hudson.model.Executor.run(Executor.java:236)
      Email was triggered for: Failure
      Sending email for trigger: Failure
      

      I tried to capture thread dumps, but both the master and the affected slave returned EMPTY thread dumps, while another slave returned its own dump normally.
      I'd appreciate any advice.

      Jenkins ver. 1.491
      xUnit plugin 1.51
      Amazon EC2 plugin 1.17

        1. hs_err_pid15419.log
          92 kB
          Eckhard Völlm
        2. jstack_master.txt
          105 kB
          Antti Rasinen
        3. jstack_slave.out
          13 kB
          Antti Rasinen
        4. lsof_slave.out
          6 kB
          Antti Rasinen

          [JENKINS-15931] Build is hanging in xUnit plugin processing

          Hiroko Tamagawa created issue -
          Hiroko Tamagawa made changes -
          Assignee Original: Francis Upton [ francisu ]
          Hiroko Tamagawa made changes -
          Labels New: ec2 hang slave xunit

          Antti Rasinen added a comment -

          This error happens very frequently at our site. I've attached thread dumps from the master and from the slave. Note, however, that the dumps were taken during separate incidents.

          Also I've attached lsof -p <pid> output for the slave.

          Antti Rasinen made changes -
          Attachment New: jstack_master.txt [ 23220 ]
          Attachment New: jstack_slave.out [ 23221 ]
          Attachment New: lsof_slave.out [ 23222 ]

          Antti Rasinen added a comment -

          Both server and slave dumps contain threads that are waiting for a channel. Look for xunit in the dumps.


          Antti Rasinen added a comment -

          We have constructed a simple test case that brings out the bug. It is 100% reproducible.

          1) Configure a cloud that can run Jenkins slaves. Use just 1 executor.
          2) Configure a job that puts a JUnit file in the workspace. We simply download junit files from a web server.
          3) Add an xUnit processing step for the file.
          4) Make the job build itself after it finishes.

          With this setup, we have noticed that the freeze happens on the 20th build on the slave, regardless of the number of jobs. For example, with a two-job configuration we completed 10+9 builds successfully, and the slave got stuck on the 20th build.

          We've reproduced this with RHEL 5, RHEL 6 and Ubuntu 12.01.


          Antti Rasinen added a comment -

          We have also found out that the actual limiting number is the number of xUnit steps executed. For example, if the job has two xUnit post-build steps, then the 10th build gets stuck. The same holds with four xUnit steps per build. In each case, it is the 20th xUnit step that freezes.
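
As a rough illustration of this observation only, the build that gets stuck can be estimated from the number of xUnit steps per build. The function name and the constant below are hypothetical. They encode nothing more than the "20th xUnit step" figure reported in this comment, not any known limit inside Jenkins or the xUnit plugin.

```python
import math

# Assumption: taken directly from the observation above that the 20th
# xUnit step executed on a slave is the one that freezes.
HANG_STEP = 20


def first_stuck_build(xunit_steps_per_build: int) -> int:
    """Return the build number in which the HANG_STEP-th xUnit step runs."""
    if xunit_steps_per_build < 1:
        raise ValueError("xunit_steps_per_build must be at least 1")
    return math.ceil(HANG_STEP / xunit_steps_per_build)


if __name__ == "__main__":
    for steps in (1, 2, 4):
        print(f"{steps} xUnit step(s) per build: build {first_stuck_build(steps)} gets stuck")
```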


          Antti Rasinen added a comment -

          Further test cases have shown that the limiting factor in fact depends on the number of test case files and the number of runs. Neither the size of the files nor the number of test cases within them seems to matter.

          To be precise, when we have 0 or 1 file in the result set, the number of runs follows this equation:

          runs * (1.38 + number of files) = 338.

          When we have 2 or more files, the numbers are somewhat different:

          runs * (1.33 + number of files) = 790.
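
To make the two empirical fits above easier to apply, here is a minimal sketch that solves each equation for the number of runs. The function name is hypothetical, and the constants are simply the fitted values quoted in this comment. They describe the observed behaviour on our setup and are not derived from Jenkins internals.

```python
def estimated_runs_before_hang(num_files: int) -> float:
    """Solve the empirical fit 'runs * (offset + files) = budget' for runs."""
    if num_files < 0:
        raise ValueError("num_files must be non-negative")
    if num_files <= 1:
        # Fit reported for 0 or 1 file in the result set.
        offset, budget = 1.38, 338.0
    else:
        # Fit reported for 2 or more files in the result set.
        offset, budget = 1.33, 790.0
    return budget / (offset + num_files)


if __name__ == "__main__":
    for n in (0, 1, 2, 5, 10):
        print(f"{n:>2} file(s): about {estimated_runs_before_hang(n):.0f} runs")
```

The result is a float because the fitted constants are not integers. Round it to get an approximate run count.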


          Francis Upton added a comment -

          Old issue, please refile if broken.

          Francis Upton made changes -
          Resolution New: Cannot Reproduce [ 5 ]
          Status Original: Open [ 1 ] New: Closed [ 6 ]

            nfalco Nikolas Falco
            tmgw165 Hiroko Tamagawa