Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-70477

Performance degradation over time because of stuck events when Gerrit Servers are disabled

XMLWordPrintable

    • Icon: Improvement Improvement
    • Resolution: Unresolved
    • Icon: Major Major
    • gerrit-trigger-plugin
    • None
    • gerrit-trigger-plugin 2.33.0 (although same problematic code seen in latest in master branch)

      PROBLEM

      We have been observing a gradual degradation in a Jenkins master with disabled Gerrit Servers. After some time running, it would not handle all gerrit events timely and they would be enqueued, so response time and triggering jobs take longer and longer.

      Besides those symptoms, we could also see many log messages like the following:

      WARNING [Thread-2291] com.sonymobile.tools.gerrit.gerritevents.GerritHandler.checkQueueSize The Gerrit incoming events queue contains 40 items! Something might be stuck, or your system can't process the commands fast enough. Try to increase the number of receiving worker threads

      And we observed that, when the Jenkins master was collapsing, the queue would go as high as 9k events, with a delay in triggering jobs of more than 10-15 minutes.

       

      WORKAROUND

      After thorough analysis, we identified that the root cause was having listeners for Gerrit Servers configured but disabled for a long time. Removing those listeners, so that we only keep enabled Gerrit Servers, solves the problem.

       

      POTENTIAL IMPROVEMENTS

      Anyway, analyzing the code, we think that the current implementation can be improved.

      The problematic code is the synchronized block at https://github.com/jenkinsci/gerrit-trigger-plugin/blob/gerrit-trigger-2.33.0/src/main/java/com/sonyericsson/hudson/plugins/gerrit/trigger/playback/GerritMissedEventsPlaybackManager.java#L339 (the link is for v2.33.0, which we were using for the analysis, but the same seems to occur in latest), because when we keep disabled Gerrit Servers for several days, and if many events are received in that period of time, all the events are queued in the "receivedEventCache" object, which is iterated linearly for each received event with an iterator (cost O( n )). In our case, we observed in heap dumps that this cache was bigger than 600k events when delays were starting to be noticeable.

      As the list was growing bigger and bigger over time, this was leading to an increased processing time per event, so it explained perfectly the degradation, as only a single thread per server would go into that code, blocking the rest of the threads.

      As some ideas for improvement, we consider:

      1. Implement a TTL for the events in the cache. Probably it does not make sense to keep events more than a day, so events older than that could be removed while traversing the list with the iterator, although this may be problematic because it would lose some events
      2. A linear search is not efficient for searching huge lists. Other approaches like using a HashMap for storing the events, for which searches has a O(log ( n )) cost, would be way more efficient for searches with many elements in the structure

      Probably 2 is preferrable, as it would keep the current functional behaviour and the computational cost of finding the event in the cache would be lower.

       

            rsandell rsandell
            agoconcept Santiago Gallego
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated: