[JENKINS-2978] Checkstyle is taking too long to calculate the difference of warnigs

Type: Bug
Resolution: Fixed
Priority: Major
Component/s: checkstyle-plugin
Labels:
None
Environment:
Platform: All, OS: All

Similar Issues:
Powered by SuggestiMate

Show

We have a project that has a lot warnings (approx 100k warnings). Algorithm that
does differentiation should be optimized a bit, because now it takes 30 minutes,
this is incredible long.
Thanks a lot for your efforts.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

CheckStyleDifferencerTest.java
8 kB
2010-03-04 10:28
AnnotationDifferencer.patch
0.8 kB
2010-02-28 13:49

is duplicated by

JENKINS-4863 Hudson hangs right after the line where it logs: [HUDSON] Archiving...

Closed

JENKINS-5330 Checkstyle report collector uses a lot of time comparing with older build

Closed

developster created issue - 2009-02-05 03:09

Ulli Hafner added a comment - 2009-02-05 03:59

This is related to https://hudson.dev.java.net/issues/show_bug.cgi?id=1827. 100k
warnings currently create a 35MByte XML file...

I haven't fount time yet to optimize the plug-in for such a big number of
warnings...

Ulli Hafner added a comment - 2009-02-05 03:59 This is related to https://hudson.dev.java.net/issues/show_bug.cgi?id=1827 . 100k warnings currently create a 35MByte XML file... I haven't fount time yet to optimize the plug-in for such a big number of warnings...

Ulli Hafner made changes - 2009-12-24 06:36

Link

New: This issue is duplicated by ~~JENKINS-4863~~ [ ~~JENKINS-4863~~ ]

Ulli Hafner made changes - 2010-01-20 12:56

Link

New: This issue is duplicated by ~~JENKINS-5330~~ [ ~~JENKINS-5330~~ ]

weigo added a comment - 2010-02-28 13:47

The problem is the implementation of AnnotationDifferencer.difference(target, other).

It uses the removeAll operation on a clone of the target collection. This will loop over all elements in target and call the contains operation on the ohter collection with the current element as argument. So this can lead to O(n^2) performance of contains where it could be O(1).

The implementation should look like this:

Index: plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java
===================================================================
— plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java (Revision 28050)
+++ plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java (Arbeitskopie)
@@ -41,7 +41,7 @@
private static Set<FileAnnotation> difference(
final Collection<FileAnnotation> target, final Collection<FileAnnotation> other)

{ Set<FileAnnotation> difference = new HashSet<FileAnnotation>(target); - difference.removeAll(other); + difference.removeAll(new HashSet<FileAnnotation>(other)); return difference; }

This change leads to usable performance of the checkstyle view on modules with an overall warning count of ~84000. Before the patch it took about 6 minutes to compute the difference between two identical sets of the above size.

weigo added a comment - 2010-02-28 13:47 The problem is the implementation of AnnotationDifferencer.difference(target, other). It uses the removeAll operation on a clone of the target collection. This will loop over all elements in target and call the contains operation on the ohter collection with the current element as argument. So this can lead to O(n^2) performance of contains where it could be O(1). The implementation should look like this: Index: plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java =================================================================== — plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java (Revision 28050) +++ plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java (Arbeitskopie) @@ -41,7 +41,7 @@ private static Set<FileAnnotation> difference( final Collection<FileAnnotation> target, final Collection<FileAnnotation> other) { Set<FileAnnotation> difference = new HashSet<FileAnnotation>(target); - difference.removeAll(other); + difference.removeAll(new HashSet<FileAnnotation>(other)); return difference; } This change leads to usable performance of the checkstyle view on modules with an overall warning count of ~84000. Before the patch it took about 6 minutes to compute the difference between two identical sets of the above size.

weigo added a comment - 2010-02-28 13:49

Patch to resolve this performance issue.

weigo added a comment - 2010-02-28 13:49 Patch to resolve this performance issue.

weigo made changes - 2010-02-28 13:49

Attachment

New: AnnotationDifferencer.patch [ 19191 ]

Ulli Hafner added a comment - 2010-03-02 15:53

Thanks for the patch. What performance improvement did you get in your case?

Actually, I don't see why your code is faster? Is it because you use a set rather than a collection? Isn't you code o)? How expensive is the object creation?

Ulli Hafner added a comment - 2010-03-02 15:53 Thanks for the patch. What performance improvement did you get in your case? Actually, I don't see why your code is faster? Is it because you use a set rather than a collection? Isn't you code o )? How expensive is the object creation?

weigo added a comment - 2010-03-04 10:25

Hi,

i've written a little test case that logs the time needed by the
original and the patched variant. It reads a file of ~84.000 checkstyle
warnings and computes the difference of sublists whose size increases by
1000 in each iteration. The first column is the size of the list, second
is the time needed by the original version and the last column shows the
time needed by the version using a HashSet as other:

1000: 91, 12
2000: 148, 19
3000: 341, 8
4000: 634, 5
5000: 1017, 6
6000: 1505, 9
7000: 2131, 11
8000: 2724, 13
9000: 3563, 15
10000: 4562, 17
11000: 5803, 20
12000: 7223, 24
13000: 8851, 25
14000: 10803, 25
15000: 13014, 27
16000: 15565, 29
17000: 18307, 31
18000: 21308, 36
19000: 24706, 35
20000: 29213, 39
21000: 32408, 39
22000: 36629, 40
23000: 41124, 43
24000: 46726, 46
25000: 52839, 47
26000: 56127, 53
27000: 61806, 51
28000: 67967, 52
29000: 74429, 55
30000: 96905, 81
31000: 103298, 59
32000: 110423, 65
33000: 120556, 62
34000: 127095, 65
35000: 136958, 67
36000: 149837, 69
37000: 160435, 72
38000: 170230, 74
39000: 180313, 78
40000: 190803, 81

All times are in milliseconds.

I think the cost of constructing a Hashset from the other collection is
negligible.

My example shows the worst case since the original and other collections
are identical. When the other collection contains fewer elements than
the original the numbers will decrease slightly since there are fewer
elements to test the contains relation against.

Using the HashSet version is faster because the contains operation used
to test the existance of the current element from the original is
constant O(1) instead of O of an ArrayList (that is used for the
CheckStyle-Plugin).

My code is indeed O because it has to iterate over the original
collection. The original collection is O(n^2) because of the combination
of the underlying implementation of AbstractCollection.removeAll() and
the used type of collection for other (ArrayList).

For each element e in the original set E call contains(e) on other. If
this returns true remove e from e.

For the pair (Collection, ArrayList) the complexity is O * O(m) =
O(n^2) for n = m.

For the pair (Collection, HashSet) the complexity is O * O(1) = O.

I hope i made my argument clearer this time.

Regards

Dirk

weigo added a comment - 2010-03-04 10:25 Hi, i've written a little test case that logs the time needed by the original and the patched variant. It reads a file of ~84.000 checkstyle warnings and computes the difference of sublists whose size increases by 1000 in each iteration. The first column is the size of the list, second is the time needed by the original version and the last column shows the time needed by the version using a HashSet as other: 1000: 91, 12 2000: 148, 19 3000: 341, 8 4000: 634, 5 5000: 1017, 6 6000: 1505, 9 7000: 2131, 11 8000: 2724, 13 9000: 3563, 15 10000: 4562, 17 11000: 5803, 20 12000: 7223, 24 13000: 8851, 25 14000: 10803, 25 15000: 13014, 27 16000: 15565, 29 17000: 18307, 31 18000: 21308, 36 19000: 24706, 35 20000: 29213, 39 21000: 32408, 39 22000: 36629, 40 23000: 41124, 43 24000: 46726, 46 25000: 52839, 47 26000: 56127, 53 27000: 61806, 51 28000: 67967, 52 29000: 74429, 55 30000: 96905, 81 31000: 103298, 59 32000: 110423, 65 33000: 120556, 62 34000: 127095, 65 35000: 136958, 67 36000: 149837, 69 37000: 160435, 72 38000: 170230, 74 39000: 180313, 78 40000: 190803, 81 All times are in milliseconds. I think the cost of constructing a Hashset from the other collection is negligible. My example shows the worst case since the original and other collections are identical. When the other collection contains fewer elements than the original the numbers will decrease slightly since there are fewer elements to test the contains relation against. Using the HashSet version is faster because the contains operation used to test the existance of the current element from the original is constant O(1) instead of O of an ArrayList (that is used for the CheckStyle-Plugin). My code is indeed O because it has to iterate over the original collection. The original collection is O(n^2) because of the combination of the underlying implementation of AbstractCollection.removeAll() and the used type of collection for other (ArrayList). For each element e in the original set E call contains(e) on other. If this returns true remove e from e. For the pair (Collection, ArrayList) the complexity is O * O(m) = O(n^2) for n = m. For the pair (Collection, HashSet) the complexity is O * O(1) = O . I hope i made my argument clearer this time. Regards Dirk

weigo added a comment - 2010-03-04 10:28

TestCase to reproduce the numbers in the previous post.

You will have to supply a file checkstyle-warnings.xml
in the same package to get to the numbers.

The file should contain several 1000 warnings to get
meaningful data.

regards

Dirk

weigo added a comment - 2010-03-04 10:28 TestCase to reproduce the numbers in the previous post. You will have to supply a file checkstyle-warnings.xml in the same package to get to the numbers. The file should contain several 1000 warnings to get meaningful data. regards Dirk

weigo made changes - 2010-03-04 10:28

Attachment

New: CheckStyleDifferencerTest.java [ 19203 ]

Assignee:: Ulli Hafner

Reporter:: developster

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Created:: 2009-02-05 03:09

Updated:: 2010-03-10 02:42

Resolved:: 2010-03-04 15:46

Jenkins

Details

Description

Attachments

Attachments

Issue Links

Activity

Collapse comment: Ulli Hafner added a comment - 2009-02-05 03:59

Expand comment: Ulli Hafner added a comment - 2009-02-05 03:59

Collapse comment: weigo added a comment - 2010-02-28 13:47

Expand comment: weigo added a comment - 2010-02-28 13:47

Collapse comment: weigo added a comment - 2010-02-28 13:49

Expand comment: weigo added a comment - 2010-02-28 13:49

Collapse comment: Ulli Hafner added a comment - 2010-03-02 15:53

Expand comment: Ulli Hafner added a comment - 2010-03-02 15:53

Collapse comment: weigo added a comment - 2010-03-04 10:25

Expand comment: weigo added a comment - 2010-03-04 10:25

Collapse comment: weigo added a comment - 2010-03-04 10:28

Expand comment: weigo added a comment - 2010-03-04 10:28

People

Dates