Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-2978

Checkstyle is taking too long to calculate the difference of warnigs

    • Icon: Bug Bug
    • Resolution: Fixed
    • Icon: Major Major
    • checkstyle-plugin
    • None
    • Platform: All, OS: All

      We have a project that has a lot warnings (approx 100k warnings). Algorithm that
      does differentiation should be optimized a bit, because now it takes 30 minutes,
      this is incredible long.
      Thanks a lot for your efforts.

          [JENKINS-2978] Checkstyle is taking too long to calculate the difference of warnigs

          developster created issue -

          Ulli Hafner added a comment -

          This is related to https://hudson.dev.java.net/issues/show_bug.cgi?id=1827. 100k
          warnings currently create a 35MByte XML file...

          I haven't fount time yet to optimize the plug-in for such a big number of
          warnings...

          Ulli Hafner added a comment - This is related to https://hudson.dev.java.net/issues/show_bug.cgi?id=1827 . 100k warnings currently create a 35MByte XML file... I haven't fount time yet to optimize the plug-in for such a big number of warnings...
          Ulli Hafner made changes -
          Link New: This issue is duplicated by JENKINS-4863 [ JENKINS-4863 ]
          Ulli Hafner made changes -
          Link New: This issue is duplicated by JENKINS-5330 [ JENKINS-5330 ]

          weigo added a comment -

          The problem is the implementation of AnnotationDifferencer.difference(target, other).

          It uses the removeAll operation on a clone of the target collection. This will loop over all elements in target and call the contains operation on the ohter collection with the current element as argument. So this can lead to O(n^2) performance of contains where it could be O(1).

          The implementation should look like this:

          Index: plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java
          ===================================================================
          — plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java (Revision 28050)
          +++ plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java (Arbeitskopie)
          @@ -41,7 +41,7 @@
          private static Set<FileAnnotation> difference(
          final Collection<FileAnnotation> target, final Collection<FileAnnotation> other)

          { Set<FileAnnotation> difference = new HashSet<FileAnnotation>(target); - difference.removeAll(other); + difference.removeAll(new HashSet<FileAnnotation>(other)); return difference; }

          This change leads to usable performance of the checkstyle view on modules with an overall warning count of ~84000. Before the patch it took about 6 minutes to compute the difference between two identical sets of the above size.

          weigo added a comment - The problem is the implementation of AnnotationDifferencer.difference(target, other). It uses the removeAll operation on a clone of the target collection. This will loop over all elements in target and call the contains operation on the ohter collection with the current element as argument. So this can lead to O(n^2) performance of contains where it could be O(1). The implementation should look like this: Index: plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java =================================================================== — plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java (Revision 28050) +++ plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java (Arbeitskopie) @@ -41,7 +41,7 @@ private static Set<FileAnnotation> difference( final Collection<FileAnnotation> target, final Collection<FileAnnotation> other) { Set<FileAnnotation> difference = new HashSet<FileAnnotation>(target); - difference.removeAll(other); + difference.removeAll(new HashSet<FileAnnotation>(other)); return difference; } This change leads to usable performance of the checkstyle view on modules with an overall warning count of ~84000. Before the patch it took about 6 minutes to compute the difference between two identical sets of the above size.

          weigo added a comment -

          Patch to resolve this performance issue.

          weigo added a comment - Patch to resolve this performance issue.
          weigo made changes -
          Attachment New: AnnotationDifferencer.patch [ 19191 ]

          Ulli Hafner added a comment -

          Thanks for the patch. What performance improvement did you get in your case?

          Actually, I don't see why your code is faster? Is it because you use a set rather than a collection? Isn't you code o)? How expensive is the object creation?

          Ulli Hafner added a comment - Thanks for the patch. What performance improvement did you get in your case? Actually, I don't see why your code is faster? Is it because you use a set rather than a collection? Isn't you code o )? How expensive is the object creation?

          weigo added a comment -

          Hi,

          i've written a little test case that logs the time needed by the
          original and the patched variant. It reads a file of ~84.000 checkstyle
          warnings and computes the difference of sublists whose size increases by
          1000 in each iteration. The first column is the size of the list, second
          is the time needed by the original version and the last column shows the
          time needed by the version using a HashSet as other:

          1000: 91, 12
          2000: 148, 19
          3000: 341, 8
          4000: 634, 5
          5000: 1017, 6
          6000: 1505, 9
          7000: 2131, 11
          8000: 2724, 13
          9000: 3563, 15
          10000: 4562, 17
          11000: 5803, 20
          12000: 7223, 24
          13000: 8851, 25
          14000: 10803, 25
          15000: 13014, 27
          16000: 15565, 29
          17000: 18307, 31
          18000: 21308, 36
          19000: 24706, 35
          20000: 29213, 39
          21000: 32408, 39
          22000: 36629, 40
          23000: 41124, 43
          24000: 46726, 46
          25000: 52839, 47
          26000: 56127, 53
          27000: 61806, 51
          28000: 67967, 52
          29000: 74429, 55
          30000: 96905, 81
          31000: 103298, 59
          32000: 110423, 65
          33000: 120556, 62
          34000: 127095, 65
          35000: 136958, 67
          36000: 149837, 69
          37000: 160435, 72
          38000: 170230, 74
          39000: 180313, 78
          40000: 190803, 81

          All times are in milliseconds.

          I think the cost of constructing a Hashset from the other collection is
          negligible.

          My example shows the worst case since the original and other collections
          are identical. When the other collection contains fewer elements than
          the original the numbers will decrease slightly since there are fewer
          elements to test the contains relation against.

          Using the HashSet version is faster because the contains operation used
          to test the existance of the current element from the original is
          constant O(1) instead of O of an ArrayList (that is used for the
          CheckStyle-Plugin).

          My code is indeed O because it has to iterate over the original
          collection. The original collection is O(n^2) because of the combination
          of the underlying implementation of AbstractCollection.removeAll() and
          the used type of collection for other (ArrayList).

          For each element e in the original set E call contains(e) on other. If
          this returns true remove e from e.

          For the pair (Collection, ArrayList) the complexity is O * O(m) =
          O(n^2) for n = m.

          For the pair (Collection, HashSet) the complexity is O * O(1) = O.

          I hope i made my argument clearer this time.

          Regards

          Dirk

          weigo added a comment - Hi, i've written a little test case that logs the time needed by the original and the patched variant. It reads a file of ~84.000 checkstyle warnings and computes the difference of sublists whose size increases by 1000 in each iteration. The first column is the size of the list, second is the time needed by the original version and the last column shows the time needed by the version using a HashSet as other: 1000: 91, 12 2000: 148, 19 3000: 341, 8 4000: 634, 5 5000: 1017, 6 6000: 1505, 9 7000: 2131, 11 8000: 2724, 13 9000: 3563, 15 10000: 4562, 17 11000: 5803, 20 12000: 7223, 24 13000: 8851, 25 14000: 10803, 25 15000: 13014, 27 16000: 15565, 29 17000: 18307, 31 18000: 21308, 36 19000: 24706, 35 20000: 29213, 39 21000: 32408, 39 22000: 36629, 40 23000: 41124, 43 24000: 46726, 46 25000: 52839, 47 26000: 56127, 53 27000: 61806, 51 28000: 67967, 52 29000: 74429, 55 30000: 96905, 81 31000: 103298, 59 32000: 110423, 65 33000: 120556, 62 34000: 127095, 65 35000: 136958, 67 36000: 149837, 69 37000: 160435, 72 38000: 170230, 74 39000: 180313, 78 40000: 190803, 81 All times are in milliseconds. I think the cost of constructing a Hashset from the other collection is negligible. My example shows the worst case since the original and other collections are identical. When the other collection contains fewer elements than the original the numbers will decrease slightly since there are fewer elements to test the contains relation against. Using the HashSet version is faster because the contains operation used to test the existance of the current element from the original is constant O(1) instead of O of an ArrayList (that is used for the CheckStyle-Plugin). My code is indeed O because it has to iterate over the original collection. The original collection is O(n^2) because of the combination of the underlying implementation of AbstractCollection.removeAll() and the used type of collection for other (ArrayList). For each element e in the original set E call contains(e) on other. If this returns true remove e from e. For the pair (Collection, ArrayList) the complexity is O * O(m) = O(n^2) for n = m. For the pair (Collection, HashSet) the complexity is O * O(1) = O . I hope i made my argument clearer this time. Regards Dirk

          weigo added a comment -

          TestCase to reproduce the numbers in the previous post.

          You will have to supply a file checkstyle-warnings.xml
          in the same package to get to the numbers.

          The file should contain several 1000 warnings to get
          meaningful data.

          regards

          Dirk

          weigo added a comment - TestCase to reproduce the numbers in the previous post. You will have to supply a file checkstyle-warnings.xml in the same package to get to the numbers. The file should contain several 1000 warnings to get meaningful data. regards Dirk
          weigo made changes -
          Attachment New: CheckStyleDifferencerTest.java [ 19203 ]

            drulli Ulli Hafner
            developster developster
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: