Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-2978

Checkstyle is taking too long to calculate the difference of warnigs

    • Icon: Bug Bug
    • Resolution: Fixed
    • Icon: Major Major
    • checkstyle-plugin
    • None
    • Platform: All, OS: All

      We have a project that has a lot warnings (approx 100k warnings). Algorithm that
      does differentiation should be optimized a bit, because now it takes 30 minutes,
      this is incredible long.
      Thanks a lot for your efforts.

          [JENKINS-2978] Checkstyle is taking too long to calculate the difference of warnigs

          weigo added a comment -

          The problem is the implementation of AnnotationDifferencer.difference(target, other).

          It uses the removeAll operation on a clone of the target collection. This will loop over all elements in target and call the contains operation on the ohter collection with the current element as argument. So this can lead to O(n^2) performance of contains where it could be O(1).

          The implementation should look like this:

          Index: plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java
          ===================================================================
          — plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java (Revision 28050)
          +++ plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java (Arbeitskopie)
          @@ -41,7 +41,7 @@
          private static Set<FileAnnotation> difference(
          final Collection<FileAnnotation> target, final Collection<FileAnnotation> other)

          { Set<FileAnnotation> difference = new HashSet<FileAnnotation>(target); - difference.removeAll(other); + difference.removeAll(new HashSet<FileAnnotation>(other)); return difference; }

          This change leads to usable performance of the checkstyle view on modules with an overall warning count of ~84000. Before the patch it took about 6 minutes to compute the difference between two identical sets of the above size.

          weigo added a comment - The problem is the implementation of AnnotationDifferencer.difference(target, other). It uses the removeAll operation on a clone of the target collection. This will loop over all elements in target and call the contains operation on the ohter collection with the current element as argument. So this can lead to O(n^2) performance of contains where it could be O(1). The implementation should look like this: Index: plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java =================================================================== — plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java (Revision 28050) +++ plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java (Arbeitskopie) @@ -41,7 +41,7 @@ private static Set<FileAnnotation> difference( final Collection<FileAnnotation> target, final Collection<FileAnnotation> other) { Set<FileAnnotation> difference = new HashSet<FileAnnotation>(target); - difference.removeAll(other); + difference.removeAll(new HashSet<FileAnnotation>(other)); return difference; } This change leads to usable performance of the checkstyle view on modules with an overall warning count of ~84000. Before the patch it took about 6 minutes to compute the difference between two identical sets of the above size.

          weigo added a comment -

          Patch to resolve this performance issue.

          weigo added a comment - Patch to resolve this performance issue.

          Ulli Hafner added a comment -

          Thanks for the patch. What performance improvement did you get in your case?

          Actually, I don't see why your code is faster? Is it because you use a set rather than a collection? Isn't you code o)? How expensive is the object creation?

          Ulli Hafner added a comment - Thanks for the patch. What performance improvement did you get in your case? Actually, I don't see why your code is faster? Is it because you use a set rather than a collection? Isn't you code o )? How expensive is the object creation?

          weigo added a comment -

          Hi,

          i've written a little test case that logs the time needed by the
          original and the patched variant. It reads a file of ~84.000 checkstyle
          warnings and computes the difference of sublists whose size increases by
          1000 in each iteration. The first column is the size of the list, second
          is the time needed by the original version and the last column shows the
          time needed by the version using a HashSet as other:

          1000: 91, 12
          2000: 148, 19
          3000: 341, 8
          4000: 634, 5
          5000: 1017, 6
          6000: 1505, 9
          7000: 2131, 11
          8000: 2724, 13
          9000: 3563, 15
          10000: 4562, 17
          11000: 5803, 20
          12000: 7223, 24
          13000: 8851, 25
          14000: 10803, 25
          15000: 13014, 27
          16000: 15565, 29
          17000: 18307, 31
          18000: 21308, 36
          19000: 24706, 35
          20000: 29213, 39
          21000: 32408, 39
          22000: 36629, 40
          23000: 41124, 43
          24000: 46726, 46
          25000: 52839, 47
          26000: 56127, 53
          27000: 61806, 51
          28000: 67967, 52
          29000: 74429, 55
          30000: 96905, 81
          31000: 103298, 59
          32000: 110423, 65
          33000: 120556, 62
          34000: 127095, 65
          35000: 136958, 67
          36000: 149837, 69
          37000: 160435, 72
          38000: 170230, 74
          39000: 180313, 78
          40000: 190803, 81

          All times are in milliseconds.

          I think the cost of constructing a Hashset from the other collection is
          negligible.

          My example shows the worst case since the original and other collections
          are identical. When the other collection contains fewer elements than
          the original the numbers will decrease slightly since there are fewer
          elements to test the contains relation against.

          Using the HashSet version is faster because the contains operation used
          to test the existance of the current element from the original is
          constant O(1) instead of O of an ArrayList (that is used for the
          CheckStyle-Plugin).

          My code is indeed O because it has to iterate over the original
          collection. The original collection is O(n^2) because of the combination
          of the underlying implementation of AbstractCollection.removeAll() and
          the used type of collection for other (ArrayList).

          For each element e in the original set E call contains(e) on other. If
          this returns true remove e from e.

          For the pair (Collection, ArrayList) the complexity is O * O(m) =
          O(n^2) for n = m.

          For the pair (Collection, HashSet) the complexity is O * O(1) = O.

          I hope i made my argument clearer this time.

          Regards

          Dirk

          weigo added a comment - Hi, i've written a little test case that logs the time needed by the original and the patched variant. It reads a file of ~84.000 checkstyle warnings and computes the difference of sublists whose size increases by 1000 in each iteration. The first column is the size of the list, second is the time needed by the original version and the last column shows the time needed by the version using a HashSet as other: 1000: 91, 12 2000: 148, 19 3000: 341, 8 4000: 634, 5 5000: 1017, 6 6000: 1505, 9 7000: 2131, 11 8000: 2724, 13 9000: 3563, 15 10000: 4562, 17 11000: 5803, 20 12000: 7223, 24 13000: 8851, 25 14000: 10803, 25 15000: 13014, 27 16000: 15565, 29 17000: 18307, 31 18000: 21308, 36 19000: 24706, 35 20000: 29213, 39 21000: 32408, 39 22000: 36629, 40 23000: 41124, 43 24000: 46726, 46 25000: 52839, 47 26000: 56127, 53 27000: 61806, 51 28000: 67967, 52 29000: 74429, 55 30000: 96905, 81 31000: 103298, 59 32000: 110423, 65 33000: 120556, 62 34000: 127095, 65 35000: 136958, 67 36000: 149837, 69 37000: 160435, 72 38000: 170230, 74 39000: 180313, 78 40000: 190803, 81 All times are in milliseconds. I think the cost of constructing a Hashset from the other collection is negligible. My example shows the worst case since the original and other collections are identical. When the other collection contains fewer elements than the original the numbers will decrease slightly since there are fewer elements to test the contains relation against. Using the HashSet version is faster because the contains operation used to test the existance of the current element from the original is constant O(1) instead of O of an ArrayList (that is used for the CheckStyle-Plugin). My code is indeed O because it has to iterate over the original collection. The original collection is O(n^2) because of the combination of the underlying implementation of AbstractCollection.removeAll() and the used type of collection for other (ArrayList). For each element e in the original set E call contains(e) on other. If this returns true remove e from e. For the pair (Collection, ArrayList) the complexity is O * O(m) = O(n^2) for n = m. For the pair (Collection, HashSet) the complexity is O * O(1) = O . I hope i made my argument clearer this time. Regards Dirk

          weigo added a comment -

          TestCase to reproduce the numbers in the previous post.

          You will have to supply a file checkstyle-warnings.xml
          in the same package to get to the numbers.

          The file should contain several 1000 warnings to get
          meaningful data.

          regards

          Dirk

          weigo added a comment - TestCase to reproduce the numbers in the previous post. You will have to supply a file checkstyle-warnings.xml in the same package to get to the numbers. The file should contain several 1000 warnings to get meaningful data. regards Dirk

          Ulli Hafner added a comment -

          Ok, thanks for the clarification. Now I see the problem.

          Ulli Hafner added a comment - Ok, thanks for the clarification. Now I see the problem.

          Code changed in hudson
          User: : drulli
          Path:
          trunk/hudson/plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java
          trunk/hudson/plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/BuildHistory.java
          trunk/hudson/plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/BuildResult.java
          trunk/hudson/plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/ParserResult.java
          trunk/hudson/plugins/analysis-core/src/main/java/hudson/plugins/analysis/util/model/AnnotationContainer.java
          trunk/hudson/plugins/analysis-core/src/main/java/hudson/plugins/analysis/util/model/AnnotationProvider.java
          trunk/hudson/plugins/analysis-core/src/main/resources/tabview/warnings.jelly
          http://jenkins-ci.org/commit/28271
          Log:
          JENKINS-2978: Use sets to store warnings and to compute differences between builds.

          SCM/JIRA link daemon added a comment - Code changed in hudson User: : drulli Path: trunk/hudson/plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java trunk/hudson/plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/BuildHistory.java trunk/hudson/plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/BuildResult.java trunk/hudson/plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/ParserResult.java trunk/hudson/plugins/analysis-core/src/main/java/hudson/plugins/analysis/util/model/AnnotationContainer.java trunk/hudson/plugins/analysis-core/src/main/java/hudson/plugins/analysis/util/model/AnnotationProvider.java trunk/hudson/plugins/analysis-core/src/main/resources/tabview/warnings.jelly http://jenkins-ci.org/commit/28271 Log: JENKINS-2978 : Use sets to store warnings and to compute differences between builds.

          Ulli Hafner added a comment -

          I replaced all lists with ImmutableSets.

          Ulli Hafner added a comment - I replaced all lists with ImmutableSets.

          g6er1 added a comment -

          We just recently installed the checkstyle plugin, and the build became horribly slow (to be precise: the first checkstyled build was 2x slower, but the second was about 10 times slower, even hanging the server), we have ~15000 warnings.

          I'm quite sure this may be the problem, but Hudson 1.349 changelogs didn't mention this fix. Could you please tell me in which version can we expect this?

          g6er1 added a comment - We just recently installed the checkstyle plugin, and the build became horribly slow (to be precise: the first checkstyled build was 2x slower, but the second was about 10 times slower, even hanging the server), we have ~15000 warnings. I'm quite sure this may be the problem, but Hudson 1.349 changelogs didn't mention this fix. Could you please tell me in which version can we expect this?

          Ulli Hafner added a comment -

          This fix is for the second part of your comment. The fix will be part of the next checkstyle release, since plug-ins have an independent release cycle.

          For the first part of your comment: see related issue JENKINS-1960.

          Ulli Hafner added a comment - This fix is for the second part of your comment. The fix will be part of the next checkstyle release, since plug-ins have an independent release cycle. For the first part of your comment: see related issue JENKINS-1960 .

            drulli Ulli Hafner
            developster developster
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: