Type: Improvement
Resolution: Unresolved
Priority: Major
Labels: None
Affects: gerrit-trigger-plugin 2.33.0 (although the same problematic code is present in the latest master branch)
PROBLEM
We have been observing a gradual degradation in a Jenkins master that has disabled Gerrit servers configured. After running for some time, it would no longer handle all Gerrit events in a timely manner: events would pile up in a queue, so response times and job triggering took longer and longer.
Besides those symptoms, we could also see many log messages like the following:
WARNING [Thread-2291] com.sonymobile.tools.gerrit.gerritevents.GerritHandler.checkQueueSize The Gerrit incoming events queue contains 40 items! Something might be stuck, or your system can't process the commands fast enough. Try to increase the number of receiving worker threads
When the Jenkins master was collapsing, we observed the queue grow as high as 9k events, with job-triggering delays of more than 10-15 minutes.
WORKAROUND
After a thorough analysis, we identified the root cause: Gerrit server listeners that remain configured but disabled for a long time. Removing those listeners, so that only enabled Gerrit servers are kept, solves the problem.
POTENTIAL IMPROVEMENTS
In any case, after analyzing the code, we think the current implementation can be improved.
The problematic code is the synchronized block at https://github.com/jenkinsci/gerrit-trigger-plugin/blob/gerrit-trigger-2.33.0/src/main/java/com/sonyericsson/hudson/plugins/gerrit/trigger/playback/GerritMissedEventsPlaybackManager.java#L339 (the link is for v2.33.0, which we used for the analysis, but the same seems to occur in the latest version). When a Gerrit server stays disabled for several days and many events are received in that period, all of them are queued in the "receivedEventCache" object, which is then traversed linearly with an iterator for every received event, at O(n) cost per event. In our case, heap dumps showed this cache holding more than 600k events by the time the delays became noticeable.
As the list kept growing over time, the processing time per event increased accordingly, which explains the degradation perfectly: only a single thread per server can enter that code, blocking the rest of the threads.
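For illustration, here is a minimal, simplified sketch of the pattern described above. It is not the plugin's actual code; the class and field names are hypothetical. The point is the shape of the problem: a list-backed cache protected by a synchronized block and scanned with an iterator for every incoming event, so each lookup costs O(n) and all callers are serialized.

import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

class ReceivedEventCacheSketch {

    /** Hypothetical stand-in for a cached event entry. */
    static class CachedEvent {
        final String changeId;
        final long receivedAt;

        CachedEvent(String changeId, long receivedAt) {
            this.changeId = changeId;
            this.receivedAt = receivedAt;
        }
    }

    private final List<CachedEvent> receivedEventCache = new LinkedList<>();

    /** O(n) membership check: every new event walks the whole list while holding the lock. */
    synchronized boolean isInCache(String changeId) {
        Iterator<CachedEvent> it = receivedEventCache.iterator();
        while (it.hasNext()) {
            if (it.next().changeId.equals(changeId)) {
                return true;
            }
        }
        return false;
    }

    synchronized void add(String changeId) {
        receivedEventCache.add(new CachedEvent(changeId, System.currentTimeMillis()));
    }
}

With 600k entries in the list, each incoming event pays for a full scan before it can even be considered, which matches the gradual slowdown we observed.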
Some ideas for improvement we considered:
- Implement a TTL for the events in the cache. It probably does not make sense to keep events for more than a day, so older events could be removed while traversing the list with the iterator, although this may be problematic because some events would be lost
- A linear search is not efficient for huge lists. Other approaches, such as storing the events in a HashMap keyed by the event, where lookups cost O(1) on average, would be far more efficient when the structure contains many elements (see the sketch after this list)
The second option is probably preferable, as it would keep the current functional behaviour while lowering the computational cost of finding an event in the cache.
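A minimal sketch of what the second option could look like, assuming the cache only needs a membership check keyed by some event identity (the class, method, and constant names below are hypothetical and not the plugin's API). An optional TTL sweep in the spirit of the first option is included as well:

import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HashedEventCacheSketch {

    private static final long DEFAULT_TTL_MS = 24L * 60 * 60 * 1000; // one day

    /** Maps a hypothetical event key (e.g. change + patchset) to the time it was cached. */
    private final Map<String, Long> receivedEventCache = new ConcurrentHashMap<>();

    /** O(1) average-cost membership check instead of a linear scan. */
    boolean isInCache(String eventKey) {
        return receivedEventCache.containsKey(eventKey);
    }

    void add(String eventKey) {
        receivedEventCache.put(eventKey, System.currentTimeMillis());
    }

    /** Optional TTL eviction: drop entries older than a day. */
    void evictExpired() {
        long cutoff = System.currentTimeMillis() - DEFAULT_TTL_MS;
        Iterator<Map.Entry<String, Long>> it = receivedEventCache.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() < cutoff) {
                it.remove();
            }
        }
    }
}

A ConcurrentHashMap is used here only to show that the per-event synchronized scan could be avoided entirely; a plain HashMap guarded by the existing lock would already remove the O(n) cost per event.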