Hi,
i've written a little test case that logs the time needed by the
original and the patched variant. It reads a file of ~84.000 checkstyle
warnings and computes the difference of sublists whose size increases by
1000 in each iteration. The first column is the size of the list, second
is the time needed by the original version and the last column shows the
time needed by the version using a HashSet as other:
1000: 91, 12
2000: 148, 19
3000: 341, 8
4000: 634, 5
5000: 1017, 6
6000: 1505, 9
7000: 2131, 11
8000: 2724, 13
9000: 3563, 15
10000: 4562, 17
11000: 5803, 20
12000: 7223, 24
13000: 8851, 25
14000: 10803, 25
15000: 13014, 27
16000: 15565, 29
17000: 18307, 31
18000: 21308, 36
19000: 24706, 35
20000: 29213, 39
21000: 32408, 39
22000: 36629, 40
23000: 41124, 43
24000: 46726, 46
25000: 52839, 47
26000: 56127, 53
27000: 61806, 51
28000: 67967, 52
29000: 74429, 55
30000: 96905, 81
31000: 103298, 59
32000: 110423, 65
33000: 120556, 62
34000: 127095, 65
35000: 136958, 67
36000: 149837, 69
37000: 160435, 72
38000: 170230, 74
39000: 180313, 78
40000: 190803, 81
All times are in milliseconds.
I think the cost of constructing a Hashset from the other collection is
negligible.
My example shows the worst case since the original and other collections
are identical. When the other collection contains fewer elements than
the original the numbers will decrease slightly since there are fewer
elements to test the contains relation against.
Using the HashSet version is faster because the contains operation used
to test the existance of the current element from the original is
constant O(1) instead of O
of an ArrayList (that is used for the
CheckStyle-Plugin).
My code is indeed O
because it has to iterate over the original
collection. The original collection is O(n^2) because of the combination
of the underlying implementation of AbstractCollection.removeAll() and
the used type of collection for other (ArrayList).
For each element e in the original set E call contains(e) on other. If
this returns true remove e from e.
For the pair (Collection, ArrayList) the complexity is O
* O(m) =
O(n^2) for n = m.
For the pair (Collection, HashSet) the complexity is O
* O(1) = O
.
I hope i made my argument clearer this time.
Regards
Dirk
The problem is the implementation of AnnotationDifferencer.difference(target, other).
It uses the removeAll operation on a clone of the target collection. This will loop over all elements in target and call the contains operation on the ohter collection with the current element as argument. So this can lead to O(n^2) performance of contains where it could be O(1).
The implementation should look like this:
Index: plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java
{ Set<FileAnnotation> difference = new HashSet<FileAnnotation>(target); - difference.removeAll(other); + difference.removeAll(new HashSet<FileAnnotation>(other)); return difference; }===================================================================
— plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java (Revision 28050)
+++ plugins/analysis-core/src/main/java/hudson/plugins/analysis/core/AnnotationDifferencer.java (Arbeitskopie)
@@ -41,7 +41,7 @@
private static Set<FileAnnotation> difference(
final Collection<FileAnnotation> target, final Collection<FileAnnotation> other)
This change leads to usable performance of the checkstyle view on modules with an overall warning count of ~84000. Before the patch it took about 6 minutes to compute the difference between two identical sets of the above size.