Hi,
i've written a little test case that logs the time needed by the
original and the patched variant. It reads a file of ~84.000 checkstyle
warnings and computes the difference of sublists whose size increases by
1000 in each iteration. The first column is the size of the list, second
is the time needed by the original version and the last column shows the
time needed by the version using a HashSet as other:
1000: 91, 12
2000: 148, 19
3000: 341, 8
4000: 634, 5
5000: 1017, 6
6000: 1505, 9
7000: 2131, 11
8000: 2724, 13
9000: 3563, 15
10000: 4562, 17
11000: 5803, 20
12000: 7223, 24
13000: 8851, 25
14000: 10803, 25
15000: 13014, 27
16000: 15565, 29
17000: 18307, 31
18000: 21308, 36
19000: 24706, 35
20000: 29213, 39
21000: 32408, 39
22000: 36629, 40
23000: 41124, 43
24000: 46726, 46
25000: 52839, 47
26000: 56127, 53
27000: 61806, 51
28000: 67967, 52
29000: 74429, 55
30000: 96905, 81
31000: 103298, 59
32000: 110423, 65
33000: 120556, 62
34000: 127095, 65
35000: 136958, 67
36000: 149837, 69
37000: 160435, 72
38000: 170230, 74
39000: 180313, 78
40000: 190803, 81
All times are in milliseconds.
I think the cost of constructing a Hashset from the other collection is
negligible.
My example shows the worst case since the original and other collections
are identical. When the other collection contains fewer elements than
the original the numbers will decrease slightly since there are fewer
elements to test the contains relation against.
Using the HashSet version is faster because the contains operation used
to test the existance of the current element from the original is
constant O(1) instead of O
of an ArrayList (that is used for the
CheckStyle-Plugin).
My code is indeed O
because it has to iterate over the original
collection. The original collection is O(n^2) because of the combination
of the underlying implementation of AbstractCollection.removeAll() and
the used type of collection for other (ArrayList).
For each element e in the original set E call contains(e) on other. If
this returns true remove e from e.
For the pair (Collection, ArrayList) the complexity is O
* O(m) =
O(n^2) for n = m.
For the pair (Collection, HashSet) the complexity is O
* O(1) = O
.
I hope i made my argument clearer this time.
Regards
Dirk
This is related to https://hudson.dev.java.net/issues/show_bug.cgi?id=1827. 100k
warnings currently create a 35MByte XML file...
I haven't fount time yet to optimize the plug-in for such a big number of
warnings...