Jenkins / JENKINS-10234

Junit result archiver getting stuck for a long time in concurrent builds

      Description

      When a build with JUnit results reaches its end, possibly only when the job is allowed to run concurrently, our system frequently gets stuck on "Recording test results".

      Looking at the thread list, I see the following:
      "Executor #9 for master : executing Run_Manual_SOAK #242 : waiting for Check point JUnit result archiving on Run_Manual_SOAK #241
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:502)
      hudson.model.Run$Runner$CheckpointSet.waitForCheckPoint(Run.java:1266)
      hudson.model.Run.waitForCheckpoint(Run.java:1234)
      hudson.model.CheckPoint.block(CheckPoint.java:144)
      hudson.tasks.junit.JUnitResultArchiver.perform(JUnitResultArchiver.java:159)
      hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:19)
      hudson.model.AbstractBuild$AbstractRunner.perform(AbstractBuild.java:663)
      hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:638)
      hudson.model.AbstractBuild$AbstractRunner.performAllBuildSteps(AbstractBuild.java:616)
      hudson.model.Build$RunnerImpl.post2(Build.java:161)
      hudson.model.AbstractBuild$AbstractRunner.post(AbstractBuild.java:585)
      hudson.model.Run.run(Run.java:1399)
      hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
      hudson.model.ResourceController.execute(ResourceController.java:88)
      hudson.model.Executor.run(Executor.java:145)
      Executor #9 for master : executing Run_Manual_SOAK #242 : waiting for Check point JUnit result archiving on Run_Manual_SOAK #241"

      All the stuck jobs are in the same place. They do eventually come unstuck, but can spend a long time (hours and sometimes a day or so) in this state.
      Machine load average is at 0.23 0.22 0.21.
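The thread list above can also be filtered programmatically. The following is a minimal sketch, assuming it runs inside the same JVM as the stuck executors (in a standalone JVM it simply prints nothing). The class name `BlockedThreadFinder` and the helper `threadsIn` are hypothetical; only `Thread.getAllStackTraces()` is standard Java, and the method name `waitForCheckPoint` is taken from the stack trace above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BlockedThreadFinder {

    /** Returns the names of live threads whose stack contains a frame for the given method name. */
    public static List<String> threadsIn(String methodName) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            for (StackTraceElement frame : entry.getValue()) {
                if (frame.getMethodName().equals(methodName)) {
                    names.add(entry.getKey().getName());
                    break; // one matching frame is enough for this thread
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Method name copied from the stack trace in the description.
        threadsIn("waitForCheckPoint").forEach(System.out::println);
    }
}
```

Because Jenkins puts the build being run and the checkpoint being waited on into the executor thread name, the returned names alone would probably show which build is blocked on which.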

            Activity

            dannystaple Danny Staple created issue -
            dannystaple Danny Staple added a comment -

            We now understand exactly what happens here - it may not be a bug but a "feature".
            When concurrent builds are started, it is possible, especially where parameterization affects the run time of a job, for a later build to finish before an earlier one. When the later build reaches the archive stage, it runs the JUnit analysis.
            Some of our tests take 20 minutes, and some 15 hours.

            JUnit then tries to work out regressions against the previous build. The oldest running build therefore holds up the archiving of any newer ones until it has finished and its own results are available.
            We use junit as a convenient way to display results - but hadn't anticipated this behaviour.
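The blocking described in this comment can be reproduced in miniature. This is a sketch only, not Jenkins code: `CheckpointSketch` and `archiveOrder` are hypothetical names, a `CountDownLatch` stands in for the checkpoint, and the build numbers 241 and 242 are borrowed from the thread dump in the description. It assumes build 242 is ready to archive before build 241.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

public class CheckpointSketch {

    /** Returns build numbers in the order in which they finished archiving. */
    public static List<Integer> archiveOrder(boolean waitForEarlierBuild) {
        List<Integer> order = new CopyOnWriteArrayList<>();
        CountDownLatch earlierArchived = new CountDownLatch(1); // stand-in for the checkpoint

        // Later build (#242): its tests are already done and it wants to archive.
        Thread later = new Thread(() -> {
            try {
                if (waitForEarlierBuild) {
                    earlierArchived.await(); // blocks until #241 reports its checkpoint
                }
                order.add(242);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "build-242");
        later.start();

        if (!waitForEarlierBuild) {
            join(later); // nothing holds #242 back, so it finishes first
        }

        // Earlier build (#241): finishes its long test run only now.
        order.add(241);
        earlierArchived.countDown();
        join(later);
        return order;
    }

    private static void join(Thread thread) {
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(archiveOrder(true));  // prints [241, 242]
        System.out.println(archiveOrder(false)); // prints [242, 241]
    }
}
```

With the wait enabled, the later build cannot record its results until the earlier one has, however long that takes; without it, builds archive in the order they actually finish.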

            dannystaple Danny Staple added a comment -

            The answer we are now considering is to find (or make) a way to disable the regression checking behaviour of junit - preferably as an option we can set per job, so that other jobs that are sequential and not concurrent, or that should consistently take the same time, can have it enabled.

            jayee Jonas Eriksson added a comment -

            We're also having this problem of different running times, as we sometimes are skipping a job in a chain of jobs.

            Danny, have you found any workaround for this?

            Temporarily disabling the checkPoint waiting on certain jobs would help us. If the JUnitResultArchiver were a plugin, it would be simpler to fork the feature.

            dannystaple Danny Staple added a comment -

            We are in the process of removing everything that uses the concurrent build flag from our setup. It means more duplication and a violation of SPOT (single point of truth), but concurrent builds have led us to too many problems - including the far more serious JENKINS-10615. We are looking into the viability of the templatised job plugin to prevent us duplicating stuff, and we keep most of our control logic in an SCM and run it within shell build steps.

            jayee Jonas Eriksson added a comment -

            I've been playing around with Jenkins and found that even if I don't have JUnit Reports enabled, another plugin I use (email ext) will wait for the job to finish by using the checkPoint.

            I guess running concurrent parameterized builds in Jenkins does not fit how the model is implemented in the first place.

            jayee Jonas Eriksson added a comment -

            If JUnit Reports is the only task holding you back from finishing a job when running concurrent builds, I've found that the xunit extension configured with Custom Tool is a solution to the checkPoint waiting problem.

            inbar_rose Inbar Rose added a comment -

            Same problem here. Total blocker. Task A starts, then task B starts. Task B reaches the 'Recording test results' stage and hangs until task A finishes. After testing with simple timed builds with many plugins/options enabled/disabled, I concluded that junit is the problem.

            inbar_rose Inbar Rose made changes -
            Field Original Value New Value
            Priority Major [ 3 ] Blocker [ 1 ]
            kutzi kutzi added a comment -

            IMO this is definitely a feature and not a bug. If you don't like this behaviour, then use e.g. the xunit plugin which doesn't seem to behave in this way.

            kutzi kutzi made changes -
            Link This issue is related to JENKINS-9913 [ JENKINS-9913 ]
            jglick Jesse Glick made changes -
            Resolution Duplicate [ 3 ]
            Status Open [ 1 ] Resolved [ 5 ]
            rb2k Marc Seeger added a comment - edited

            How is this supposed to be a feature? A testrun that doesn't have anything to do with another testrun being blocked?
            This isn't even for parallel runs, this is for completely unrelated jobs.
            This just got closed as a duplicate. Is this in relation to JENKINS-9913?

            test4ever Andy Chen added a comment -

            I was redirected from another issue to this one. My problem is the result recording takes forever for some of my builds. The job in question is a parametrized concurrent job.

            axeda_clint clint axeda added a comment -

            I hit this issue tonight with Jenkins ver. 1.538. I captured a thread dump, if anyone is interested.

            mcklaus Klaus Azesberger added a comment -

            @clint axeda:
            Can you take a CPU sample using VisualVM to check whether you see high CPU time consumption in CipherInputStream.fill_buffer() (sort by column "Self Time (CPU)")?

            Maybe we share the same root cause: JENKINS-22297

            jglick Jesse Glick added a comment -

            Since JENKINS-9913 is covering only the reporting of checkpoints, this should be reopened: JUnitResultArchiver.CHECKPOINT still exists, and probably should not.

            Needs to be determined if anything needs to be done to replace it, in case a build with a higher number in fact finishes before one with a lower number, so calculation of test regressions cannot be done accurately when the result is published (in case anyone even cares about build-to-build diffs for a concurrent-capable job). Until the earlier build finishes, will the later build’s test result display show any “regressions” (against the last completed build), or show no regressions ever, or throw exceptions? After the earlier build finishes, will the later’s result display show regressions against the earlier build, or against the last completed build at the time of this build’s completion, or do something else? In other words, are calls to getPreviousResult made on demand whenever a build-to-build diff is requested (great)? Or made once when the build completes (not great but adequate)? Or does something really break? My casual inspection of the code suggests that there is some improper caching (CaseResult.failedSince) but that code generally defends against a prior build having no test result action, meaning that simply deleting CHECKPOINT would cause little harm.
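The "on demand" option raised in this comment can be illustrated with a small model. This is a sketch under stated assumptions, not the JUnit archiver's code: `OnDemandDiffSketch`, `publish` and `regressions` are hypothetical names, test results are reduced to sets of failed test names, and the diff is taken against the nearest lower-numbered build that has published a result at the moment of the query. Builds without a result are skipped rather than waited for.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class OnDemandDiffSketch {

    // Build number -> names of failed tests. Builds that are still running have no entry.
    private final TreeMap<Integer, Set<String>> failedTests = new TreeMap<>();

    public void publish(int build, Set<String> failures) {
        failedTests.put(build, failures);
    }

    /** Tests failing in the given build but not in the nearest earlier build that has a result. */
    public Set<String> regressions(int build) {
        Set<String> current = failedTests.get(build);
        if (current == null) {
            return new TreeSet<>(); // no result published yet: nothing to compare
        }
        Set<String> result = new TreeSet<>(current);
        // Looked up at query time, so the answer can change as earlier builds finish.
        Map.Entry<Integer, Set<String>> previous = failedTests.lowerEntry(build);
        if (previous != null) {
            result.removeAll(previous.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        OnDemandDiffSketch results = new OnDemandDiffSketch();
        results.publish(240, Set.of("testA"));
        results.publish(242, Set.of("testA", "testB"));  // #241 is still running
        System.out.println(results.regressions(242));    // prints [testB] (diff against #240)
        results.publish(241, Set.of("testA", "testB"));  // #241 finishes later
        System.out.println(results.regressions(242));    // prints [] (diff against #241)
    }
}
```

In this model the later build never blocks; its reported regressions are simply revised once the earlier build's results exist. A variant that computed the diff once at completion would keep reporting `testB` as a regression.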

            jglick Jesse Glick made changes -
            Resolution Duplicate [ 3 ]
            Status Resolved [ 5 ] Reopened [ 4 ]
            jglick Jesse Glick made changes -
            Assignee Jesse Glick [ jglick ]
            jglick Jesse Glick made changes -
            Status Reopened [ 4 ] Open [ 1 ]
            jglick Jesse Glick made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            jglick Jesse Glick made changes -
            Labels checkpoint concurrent
            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Jesse Glick
            Path:
            changelog.html
            core/src/main/java/hudson/tasks/junit/JUnitResultArchiver.java
            http://jenkins-ci.org/commit/jenkins/90ff9f806fcac1a58f4bd40bfcc4ed5273ff116a
            Log:
            [FIXED JENKINS-10234] Removed checkpoint from JUnitResultArchiver.

            scm_issue_link SCM/JIRA link daemon made changes -
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Resolved [ 5 ]
            dogfood dogfood added a comment -

            Integrated in jenkins_main_trunk #3535
            [FIXED JENKINS-10234] Removed checkpoint from JUnitResultArchiver. (Revision 90ff9f806fcac1a58f4bd40bfcc4ed5273ff116a)

            Result = SUCCESS
            Jesse Glick : 90ff9f806fcac1a58f4bd40bfcc4ed5273ff116a
            Files :

            • changelog.html
            • core/src/main/java/hudson/tasks/junit/JUnitResultArchiver.java
            basil Basil Crow made changes -
            Link This issue is related to JENKINS-24450 [ JENKINS-24450 ]
            whimboo Henrik Skupin added a comment -

            Any chance that we could get this backported to the 1.565.x LTS branch? It's one of the most annoying problems for our Jenkins production systems.

            jglick Jesse Glick made changes -
            Labels checkpoint concurrent checkpoint concurrent lts-candidate
            jglick Jesse Glick added a comment -

            Probably too late for 1.565.x and probably already in the next LTS, but marking it as a candidate just in case.

            whimboo Henrik Skupin added a comment -

            Version 1.565.3 LTS is scheduled for Oct 1st. I'm not sure when fixes are taken into consideration for LTS releases. Is there anything besides the label that we could do to try to get it in? The next major version bump for LTS will happen at the end of Oct, where this might be fixed.

            danielbeck Daniel Beck added a comment -

            Next LTS will be based on 1.580, and the 1.565.x line is done.

            danielbeck Daniel Beck made changes -
            Labels checkpoint concurrent lts-candidate checkpoint concurrent
            lan_wu Lan Wu added a comment - edited

            Sorry, I'm new to Jenkins Jira. We are hitting this bug, but I can't see from this listing which version this was fixed in. On this page, http://jenkins-ci.org/changelog-stable, I couldn't find issue 10234. Does that mean it's not in one of the LTS builds? Thanks!

            danielbeck Daniel Beck added a comment -

            Does that mean it's not in one of the LTS builds?

            No, it just means it has not specifically been fixed/backported for one of the LTS releases. It was fixed for 1.575 and is therefore in 1.580 and the LTS releases based on that.

            scm_issue_link SCM/JIRA link daemon added a comment -

            Code changed in jenkins
            User: Martin Bektchiev
            Path:
            src/main/java/hudson/plugins/nunit/NUnitPublisher.java
            http://jenkins-ci.org/commit/nunit-plugin/1e89e0267814d280b7051d7d70d5d2939326b182
            Log:
            Do not wait for checkpoint from previous build

            Fixes an issue with result archiver getting stuck for a long time in concurrent builds.
            A similar issue has been fixed in the JUnit Jenkins plugin:

            https://issues.jenkins-ci.org/browse/JENKINS-10234
            https://github.com/jenkinsci/jenkins/commit/90ff9f806fcac1a58f4bd40bfcc4ed5273ff116a

            rtyler R. Tyler Croy made changes -
            Workflow JNJira [ 140422 ] JNJira + In-Review [ 189038 ]
            josias Josias Inacio da Silva Filho made changes -
            Link This issue is related to JENKINS-42727 [ JENKINS-42727 ]

              People

              Assignee:
              jglick Jesse Glick
              Reporter:
              dannystaple Danny Staple
              Votes:
              15
              Watchers:
              30

                Dates

                Created:
                Updated:
                Resolved: