
StackOverflowError when maximum number of builds archived

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component: junit-plugin
    • Labels: None

      We've been seeing a StackOverflowError with Test Stability enabled and Discard Old Builds configured:

      FATAL: null
      java.lang.StackOverflowError
      	at hudson.tasks.junit.TestResultAction.load(TestResultAction.java:197)
      	at hudson.tasks.junit.TestResultAction.getResult(TestResultAction.java:143)
      	at hudson.tasks.junit.TestResultAction.getResult(TestResultAction.java:62)
      	at hudson.tasks.test.AbstractTestResultAction.findCorrespondingResult(AbstractTestResultAction.java:247)
      	at hudson.tasks.test.TestResult.getPreviousResult(TestResult.java:142)
      	at hudson.tasks.junit.SuiteResult.getPreviousResult(SuiteResult.java:283)
      	at hudson.tasks.junit.CaseResult.getPreviousResult(CaseResult.java:446)
      	at hudson.tasks.junit.CaseResult.freeze(CaseResult.java:575)
      	at hudson.tasks.junit.SuiteResult.freeze(SuiteResult.java:325)
      	at hudson.tasks.junit.TestResult.freeze(TestResult.java:627)
      	at hudson.tasks.junit.TestResultAction.load(TestResultAction.java:200)
      	at hudson.tasks.junit.TestResultAction.getResult(TestResultAction.java:143)
              ... repeated ...
      

      Disabling test stability resolves our issue.

      Attachments:

        1. console.log
          89 kB
        2. consoleText
          79 kB
        3. jenkins_stack_trace.txt
          75 kB
        4. jenkins.log
          74 kB
        5. jenkins-full-exception.log
          85 kB
        6. jenkins-SOE-20170426.log
          24 kB

          [JENKINS-31660] StackOverflowError when maximum number of builds archived

          Dave Hunt added a comment -

          "I wonder what is happening when you view the test results on this job even with test-stability disabled."

          If I recall correctly, we had a very similar stack trace when attempting to view the test results. After disabling test stability on the job, though, this has recovered.

          "Workaround would be to increase stack size using the -Xss JVM argument."

          We've disabled the plugin for the affected jobs, but if it starts happening in other jobs I will experiment with this. Is there a way to get an indication of what I should set this to? Is there a way for the plugin to detect this scenario and make the -Xss suggestion to the user?
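          For reference, the stack size is passed to the JVM when starting Jenkins; a hypothetical invocation (the 4m value is purely illustrative, something to experiment with rather than a recommendation):

              java -Xss4m -jar jenkins.war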


          kutzi added a comment -

          "Is there a way for the plugin to detect this scenario and make the -Xss suggestion to the user?"

          I don't think that is a feasible approach. Changing the -Xss size should only be a last resort when all other things fail, and if there really is an infinite loop (rather than merely a large number of jobs), it wouldn't solve anything at all.

          It would be great if you could make the Jenkins builds folder of the job available, if possible, e.g. if it doesn't contain any sensitive data, so we could try to reproduce this problem.


          Will Harris added a comment -

          I was about to file a similar bug, but not related to the Test Stability plugin. I had a very similar situation, using Jenkins 1.639 and JUnit plugin 1.9, where on one of my jobs with ~2450 builds Jenkins suddenly started throwing a SOE when trying to read in my (Python-generated) junit.xml. Unfortunately I can't remember the exact circumstances, or whether anything had recently been upgraded.

          I also had the problem when trying to display the Test Results for prior builds, even those where I know I had previously been able to see the test results.

          As part of the debugging process for the JIRA ticket, I limited the number of old builds to 90 days, which brought my total builds for the job to ~780. I then went back to the very first build remaining, and was able to see the test results. I was still unable to see the test results for the last build that had recorded them however, so I started a manual binary search through my build history to see if I could pinpoint a particular build from which I could no longer see the results. At some point however, I was no longer seeing the problem.

          Considering that the code in the stack trace refers to getPreviousResult I suspect somehow the history of JUnit results was corrupted, and by manually going through the results I somehow put things back in order.

          I've attached an earlier stack trace for reference. Hope this experience report helps in some way.


          Robert Cody added a comment -

          I don't even have the Test Stability plugin installed and am still getting this error. The console log jenkins.log is attached.
          Jenkins ver. 1.648


          Robert Cody added a comment -

          Here is a more informative exception log: jenkins-full-exception.log


          Dave Hunt added a comment -

          Just seen this again after an upgrade to Jenkins 2.7.1. The first build failure after the upgrade was reported as expected; however, the next build passed but hit this stack overflow. Disabling Test Stability History in the configuration allowed the build to pass without this exception.

          I've attached the full console log including the exception: console.log


          Stefan Thurnherr added a comment (edited) -

          Getting the same SOE with Jenkins v2.55 and junit-plugin v1.20, with test-stability-plugin not installed. Attaching the full stack trace from jenkins.log: jenkins-SOE-20170426.log

          The build configuration (controlled by Jenkinsfile build properties) does not discard any old builds, so we still have all ca. 1500 builds (oldest is from 2017-02-23) inside Jenkins.

          Since it is a multi-branch build pipeline, we have other branches with much shorter build histories. They build without any problems, which further supports the guess from previous comments that this is related to traversing the build history.

          Update: Configuring the build to discard builds older than 1 month has solved the problem in our case.
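          For reference, this kind of retention can be set from a Jenkinsfile via the properties step; a minimal sketch (the 30-day value is illustrative, not necessarily the exact setting used here):

              properties([
                  // discard builds older than ~30 days (value illustrative)
                  buildDiscarder(logRotator(daysToKeepStr: '30'))
              ])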


          Sean Flanigan added a comment -

          The stack traces show that AbstractTestResultAction.findCorrespondingResult() indirectly calls itself recursively, and in these cases that recursion has caused a StackOverflowError.

          From stefanthurnherr's description, this sounds like it may have a similar cause to https://issues.jenkins-ci.org/browse/JENKINS-33168?focusedCommentId=285979&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-285979 - where StabilityTestDataPublisher.buildUpInitialHistory() iterates through many failing tests, getting their previous results across multiple builds. When memory pressure is high (e.g. lots of build history), CaseResult.getPreviousResult() can't use its WeakReference cache and has to load results from disk, thus loading each build's results many times instead of once. And now it's apparent that this involves calling AbstractTestResultAction.findCorrespondingResult() recursively for every previous build.
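          To make the shape of the problem concrete, here is a heavily simplified, self-contained sketch of that pattern. The class and method names are hypothetical stand-ins, not the JUnit plugin's actual code:

              import java.lang.ref.WeakReference;

              // Toy model of the recursion described above (hypothetical, not plugin code).
              class Build {
                  final Build previous;                  // previous build in the history chain
                  private WeakReference<Result> cached;  // may be cleared under memory pressure

                  Build(Build previous) { this.previous = previous; }

                  Result getResult() {
                      Result r = (cached != null) ? cached.get() : null;
                      if (r == null) {                   // cache miss: "load" from disk...
                          r = new Result(this);
                          r.freeze();                    // ...and freezing recurses into older builds
                          cached = new WeakReference<>(r);
                      }
                      return r;
                  }

                  public static void main(String[] args) {
                      Build b = null;
                      for (int i = 0; i < 1_000_000; i++) {
                          b = new Build(b);              // long history, nothing cached yet
                      }
                      b.getResult();                     // throws StackOverflowError
                  }
              }

              class Result {
                  private final Build owner;
                  Result(Build owner) { this.owner = owner; }

                  // To compute stability/age, freezing needs the corresponding previous
                  // result. On a cold cache this loads (and freezes) every older build,
                  // consuming one chain of stack frames per build -> StackOverflowError.
                  void freeze() {
                      if (owner.previous != null) {
                          owner.previous.getResult();
                      }
                  }
              }

          With a long enough chain of uncached builds, getResult() -> freeze() -> getResult() nests once per build, which is exactly the repeating frame pattern in the attached traces.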

          So there are two related problems here:

          1. When loading results with lots of history, the recursion in findCorrespondingResult() causes a StackOverflowError unless (a) previous results were found in the WeakReference cache or (b) the number of previous results fits in the stack. (willharris's binary search from the earliest build mitigated this by preloading a limited number of results into the WeakReference cache.)

          2. Test Stability Plugin calls findCorrespondingResult() a lot when building initial history for a failing test. This produces a lot of memory pressure when there is a lot of build history, thus defeating the caching in 1(a) above. (In JENKINS-33168 the number of builds apparently hasn't been high enough to overflow the stack, but the number of test results is too much for the cache, thus killing performance.)

          So increasing stack size should certainly work around the StackOverflowError unless the number of builds gets too high, but if you use Test Stability Plugin you will probably encounter JENKINS-33168 if you have a lot of builds with a lot of tests in them.

          I think AbstractTestResultAction.findCorrespondingResult() in the JUnit plugin (or something else in that recursive call stack) needs to be redesigned to avoid recursion, otherwise a StackOverflowError is unavoidable when there are a lot of previous builds. (Solving JENKINS-33168, on the other hand, will require iterating in such a way that each build's results are only loaded once.)
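          One possible shape for that redesign, reusing the toy Build class from the sketch above (a sketch only, not a proposed patch): collect the chain with an explicit loop and load results oldest-first, so each freeze() finds its predecessor already loaded and stack depth stays bounded:

              import java.util.ArrayDeque;
              import java.util.Deque;

              class HistoryLoader {
                  // Iterative sketch: stack depth stays bounded regardless of history length,
                  // as long as each predecessor's result stays cached during the walk.
                  static void loadHistory(Build newest) {
                      Deque<Build> chain = new ArrayDeque<>();
                      for (Build b = newest; b != null; b = b.previous) {
                          chain.push(b);                 // collect newest -> oldest
                      }
                      while (!chain.isEmpty()) {
                          chain.pop().getResult();       // oldest first: predecessor already loaded
                      }
                  }
              }

          A production version would need a stronger guarantee than a WeakReference that each predecessor's result stays reachable while the walk is in progress; otherwise a cleared cache entry would reintroduce the recursion.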


          Zbynek Konecny added a comment -

          This happens often when there are skipped tests, because of a bug in the JUnit plugin; see https://github.com/jenkinsci/junit-plugin/pull/117
          It's independent of the Test Stability plugin.

          filler mark added a comment -

          A StackOverflowError simply signals that there is no more stack memory available. It is to the stack what an OutOfMemoryError is to the heap. The JVM allocates a fixed amount of memory for each thread's stack, and if a method call attempts to exceed that memory, the JVM throws the error, just as it would if you tried to write at index N of an array of length N. No memory corruption can happen; the stack cannot write into the heap.

          The common cause of a stack overflow is a bad recursive call. Typically this happens when a recursive function doesn't have the correct termination condition, so it ends up calling itself forever. Or, when the termination condition is fine, it can be caused by requiring too many recursive calls before fulfilling it.

          Here's an example:

              public class Overflow {
                  public static final void main(String[] args) {
                      main(args);
                  }
              }

          That function calls itself repeatedly with no termination condition. Consequently, the stack fills up, because each call has to push a return address onto the stack, but the return addresses are never popped off: the function never returns, it just keeps calling itself.


            Assignee: Unassigned
            Reporter: Dave Hunt (davehunt)
            Votes: 5
            Watchers: 12