When using the Amazon EC2 plugin, builds on newly created EC2 instances sometimes hang.
      We have two permanent slaves for a certain label (e.g. 'unittest'),
      which were generated from an AMI.
      The same AMI is specified in the Amazon EC2 plugin settings.

      We have a job which can be executed concurrently.
      When we invoke three builds at the same time, the two permanent slaves are used up
      and a new one is created.

      The problem is that the build on the new slave hangs at the end,
      where the xUnit plugin is aggregating the test results.

      [CHECKSTYLE] Collecting checkstyle analysis files...
      [CHECKSTYLE] Computing warning deltas based on reference build #850
      [FINDBUGS] Collecting findbugs analysis files...
      [FINDBUGS] Computing warning deltas based on reference build #850
      Archiving artifacts
      [xUnit] [INFO] - Starting to record.
      [xUnit] [INFO] - Processing JUnit
      [xUnit] [INFO] - [JUnit] - 581 test report file(s) were found with the pattern '**/testresult/**/*.xml' relative to '/var/lib/jenkins/workspace/400_Precommit_Check_Branch' for the testing framework 'JUnit'.
      

      After aborting the build, the following error is shown.

      ERROR: Publisher org.jenkinsci.plugins.xunit.XUnitPublisher aborted due to exception
      java.lang.InterruptedException
      	at java.lang.Object.wait(Native Method)
      	at hudson.remoting.Request.call(Request.java:146)
      	at hudson.remoting.Channel.call(Channel.java:665)
      	at hudson.FilePath.act(FilePath.java:841)
      	at hudson.FilePath.act(FilePath.java:825)
      	at org.jenkinsci.plugins.xunit.XUnitPublisher.performTests(XUnitPublisher.java:170)
      	at org.jenkinsci.plugins.xunit.XUnitPublisher.performXUnit(XUnitPublisher.java:115)
      	at org.jenkinsci.plugins.xunit.XUnitPublisher.perform(XUnitPublisher.java:92)
      	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:804)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:779)
      	at hudson.model.Build$BuildExecution.post2(Build.java:183)
      	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:726)
      	at hudson.model.Run.execute(Run.java:1541)
      	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      	at hudson.model.ResourceController.execute(ResourceController.java:88)
      	at hudson.model.Executor.run(Executor.java:236)
      Email was triggered for: Failure
      Sending email for trigger: Failure
      

      I tried to capture thread dumps, but both the master and the target slave returned an EMPTY thread dump, while another slave returned its own.
      I'd appreciate it if someone could give me advice.

      Jenkins ver. 1.491
      xUnit plugin 1.51
      Amazon EC2 plugin 1.17

        1. hs_err_pid15419.log
          92 kB
        2. jstack_master.txt
          105 kB
        3. jstack_slave.out
          13 kB
        4. lsof_slave.out
          6 kB

          [JENKINS-15931] Build is hanging in xUnit plugin processing

          Antti Rasinen added a comment -

          We have also found out that the actual limiting number is the number of xUnit steps executed. For example, if the job has two xUnit post-build steps, then the 10th build gets stuck. Same with four xUnit steps per build. In each case, it is the 20th xUnit step that freezes.
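The arithmetic behind this observation can be sketched in a few lines of Python. This is only an illustration of the reported pattern (the 20th xUnit step freezes); the function name `build_that_hangs` and the constant `HANGING_STEP` are made up for this example and are not part of Jenkins or the xUnit plugin.

```python
import math

# Taken from the comment above: the 20th xUnit step is the one that freezes.
HANGING_STEP = 20


def build_that_hangs(xunit_steps_per_build: int) -> int:
    """Return the number of the build in which the 20th xUnit step runs."""
    if xunit_steps_per_build < 1:
        raise ValueError("a build needs at least one xUnit step")
    # Steps are counted across builds, so round up to the build containing step 20.
    return math.ceil(HANGING_STEP / xunit_steps_per_build)


if __name__ == "__main__":
    for steps in (2, 4):
        print(f"{steps} xUnit steps per build: build {build_that_hangs(steps)} hangs")
```

With two steps per build this gives the 10th build, matching the report; with four steps per build it gives the 5th.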


          Antti Rasinen added a comment -

          Further tests have shown that the limiting factor in fact depends on the number of test case files and the number of runs. Neither the size of the files nor the number of test cases within them seems to matter.

          To be precise, when we have 0 or 1 file in the result set, the number of runs before the hang follows this equation:

          runs * (1.38 + number of files) = 338.

          When we have 2 or more files, the numbers are somewhat different:

          runs * (1.33 + number of files) = 790.
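As a rough worked example, the two empirical equations above can be solved for the number of runs. The sketch below only rearranges the reported formulas; the function name `predicted_runs` is hypothetical, and the constants are the ones quoted in this comment, not values taken from Jenkins or xUnit source code.

```python
def predicted_runs(number_of_files: int) -> float:
    """Solve the reported equations for the number of runs before the hang.

    runs * (1.38 + files) = 338   for 0 or 1 file
    runs * (1.33 + files) = 790   for 2 or more files
    """
    if number_of_files < 0:
        raise ValueError("number_of_files must not be negative")
    if number_of_files <= 1:
        return 338 / (1.38 + number_of_files)
    return 790 / (1.33 + number_of_files)


if __name__ == "__main__":
    # 581 is the file count from the console log in the original report.
    for files in (0, 1, 2, 10, 581):
        print(f"{files:4d} file(s): about {predicted_runs(files):.1f} runs")
```

These are fitted observations from one installation, so the predicted values should be read as an estimate of the trend rather than an exact limit.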


          Francis Upton added a comment -

          Old issue, please refile if broken.


          Eckhard Völlm added a comment -

          We frequently have exactly this issue here with jobs that have a large number of test cases.
          What other information is needed to fix this problem?

          Below is the log of the job that got stuck at 16.01 10:47:23.
          This morning I pressed the stop button at 08:24, as you can see in the log:

          16.01 10:47:22 </ul>
          16.01 10:47:23 [xUnit] [INFO] - Starting to record.
          16.01 10:47:23 [xUnit] [INFO] - Processing BoostTest-1.x (default)
          16.01 10:47:23 [xUnit] [INFO] - [BoostTest-1.x (default)] - 2 test report file(s) were found with the pattern '*.xml' relative to '/home/alcatel/workspace/R8Light' for the testing framework 'BoostTest-1.x (default)'.
          18.01 08:24:25 [xUnit] [WARNING] - Caught exception of unexpected type class java.lang.InterruptedException, rethrowing
          18.01 08:24:25 ERROR: Step ‘Publish xUnit test result report’ aborted due to exception:
          18.01 08:24:25 java.lang.InterruptedException
          18.01 08:24:25 at java.lang.Object.wait(Native Method)
          18.01 08:24:25 at hudson.remoting.Request.call(Request.java:147)
          18.01 08:24:25 at hudson.remoting.Channel.call(Channel.java:780)
          18.01 08:24:25 at hudson.FilePath.act(FilePath.java:979)
          18.01 08:24:25 at hudson.FilePath.act(FilePath.java:968)
          18.01 08:24:25 at org.jenkinsci.plugins.xunit.XUnitProcessor.performTests(XUnitProcessor.java:138)
          18.01 08:24:25 at org.jenkinsci.plugins.xunit.XUnitProcessor.performXUnit(XUnitProcessor.java:81)
          18.01 08:24:25 at org.jenkinsci.plugins.xunit.XUnitPublisher.perform(XUnitPublisher.java:112)
          18.01 08:24:25 at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
          18.01 08:24:25 at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:782)
          18.01 08:24:25 at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:723)
          18.01 08:24:25 at hudson.model.Build$BuildExecution.post2(Build.java:185)
          18.01 08:24:25 at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:668)
          18.01 08:24:25 at hudson.model.Run.execute(Run.java:1763)
          18.01 08:24:25 at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
          18.01 08:24:25 at hudson.model.ResourceController.execute(ResourceController.java:98)
          18.01 08:24:25 at hudson.model.Executor.run(Executor.java:410)
          18.01 08:24:25 [PostBuildScript] - Execution post build scripts.
          18.01 08:24:25 [R8Light] $ /bin/bash -xe /tmp/hudson4109735917483500943.sh
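To put a number on how long the build was stuck, the timestamps in the console log above can be subtracted. The following is a minimal sketch, assuming the `DD.MM HH:MM:SS` prefix seen in that log; the log carries no year, so the `year=2016` default is an assumption, and the helper name `hang_duration` is hypothetical.

```python
from datetime import datetime, timedelta


def hang_duration(last_line_before_hang: str,
                  first_line_after_abort: str,
                  year: int = 2016) -> timedelta:
    """Return the time between two console log lines prefixed with 'DD.MM HH:MM:SS'.

    The year is not in the log, so it is supplied by the caller (assumed 2016 here).
    """
    def parse(line: str) -> datetime:
        # The first two whitespace-separated fields are the date and the time.
        stamp = " ".join(line.split()[:2])
        return datetime.strptime(f"{year} {stamp}", "%Y %d.%m %H:%M:%S")

    return parse(first_line_after_abort) - parse(last_line_before_hang)


if __name__ == "__main__":
    before = "16.01 10:47:23 [xUnit] [INFO] - Starting to record."
    after = "18.01 08:24:25 ERROR: Step aborted due to exception:"
    print(hang_duration(before, after))
```

For the two timestamps in this log the difference is 1 day, 21 hours, 37 minutes and 2 seconds, all of it spent waiting inside the xUnit step until the manual abort.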

          Eckhard Völlm added a comment - - edited

          Hi Gregory, can you help me here?
          Thanks in advance, Eckhard!


          Eckhard Völlm added a comment -

          I have attached the full Java stack trace from the core dump of the hanging Jenkins jobs.

          Eckhard Völlm added a comment - - edited

          Jenkins version 1.6.43
          xUnit version 1.99

          java version "1.7.0_91"
          OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.14.04.1)
          OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
          Linux slave 4.2.0-19-lowlatency #23~14.04.1-Ubuntu SMP PREEMPT Thu Nov 12 13:19:01 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

          Here is a thread where Java threads were also hanging at "java.util.zip.ZipFile.getEntry",
          the same as we observe here:
          https://github.com/aws/aws-sdk-java/issues/238
          They also found a solution, IMHO:
          "And yes, using this system property gets rid of the thread blocks :+1: Thank you so much for the suggestion!"
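To check whether a given thread dump shows the same symptom, one can scan it for threads whose stack contains `java.util.zip.ZipFile.getEntry`. The sketch below assumes the usual `jstack` text layout (a header line starting with the quoted thread name, followed by indented `at ...` frames); the helper name `threads_containing_frame` and the sample dump, including its thread names, are invented for illustration and are not taken from the attached files.

```python
from typing import List


def threads_containing_frame(dump_text: str, frame: str) -> List[str]:
    """Return the names of threads whose stack trace mentions `frame`."""
    matches: List[str] = []
    current_name = None
    current_hit = False
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # A new thread header begins; record the previous thread if it matched.
            if current_name is not None and current_hit:
                matches.append(current_name)
            current_name = line.split('"')[1]
            current_hit = False
        elif current_name is not None and line.strip().startswith("at ") and frame in line:
            current_hit = True
    # Do not forget the last thread in the dump.
    if current_name is not None and current_hit:
        matches.append(current_name)
    return matches


# Hypothetical sample in jstack format, for illustration only.
SAMPLE_DUMP = '''\
"Executor #0 for slave-a" prio=10 tid=0x01 nid=0x11 runnable [0x0a]
   java.lang.Thread.State: RUNNABLE
\tat java.util.zip.ZipFile.getEntry(Native Method)
\tat java.util.zip.ZipFile.getEntry(ZipFile.java:306)

"Ping thread" daemon prio=10 tid=0x02 nid=0x12 waiting on condition [0x0b]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
\tat java.lang.Thread.sleep(Native Method)
'''

if __name__ == "__main__":
    print(threads_containing_frame(SAMPLE_DUMP, "java.util.zip.ZipFile.getEntry"))
```

Running the same function over dumps taken a few minutes apart would show whether the same threads stay parked in that frame, which is a stronger hint of a real hang than a single snapshot.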


          Francis Upton added a comment -

          Does not appear to be an ec2 plugin issue.


          Eckhard Völlm added a comment - - edited

          Yes, we do not have the EC2 plugin in use, I can confirm that.
          We decided to try a Java upgrade and have been using Oracle 1.8.0_66-b17 since last week.
          Since then there has been no hang, but it is too early to draw conclusions: not many tests have run since Snowzilla hit the East Coast.

          I propose changing the subject to "Build is hanging in xUnit plugin processing".


          Nikolas Falco added a comment -

          Please try with the latest version of Jenkins and the plugin. I could not reproduce this scenario, even with the details from the thread dump.


            nfalco Nikolas Falco
            tmgw165 Hiroko Tamagawa