Jenkins / JENKINS-63000

Gerrit Triggers stop triggering builds and the queue builds up

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • gerrit-trigger-plugin
    • None
    • CloudBees CI 2.222.2.1
      Gerrit Trigger Plugin 2.30.5

    Description

      1. Summary
        1. At some point, Gerrit triggers stop triggering builds in Jenkins. The queue grows out of control, which is visible inside Jenkins. The only workaround so far has been to restart the master to get the queue flowing again. The issue has occurred on the affected master twice in one month.
      2. Steps to reproduce
        1. The exact cause is not yet known.
        2. The issue appeared after the Gerrit Trigger Plugin was upgraded from version 2.27.1 to version 2.30.5.
        3. It also followed an upgrade of the Jenkins server from a 1.x version to a 2.x version.
      3. Expected behavior
        1. The Gerrit Trigger queue does not back up, and builds continue to execute.
      4. Actual behavior
        1. The Gerrit queue begins to build up.
        2. No builds are started from Gerrit Triggers.

      The log messages do indicate that the queue has stopped processing:

      2020-05-07 21:35:57.166+0000 [id=1253] WARNING c.s.t.g.g.GerritHandler#checkQueueSize: The Gerrit incoming events queue contains 247 items! Something might be stuck, or your system can't process the commands fast enough. Try to increase the number of receiving worker threads. Current thread-pool size: 4
      

      Increasing the number of worker threads has no impact on this issue. As observed, the queue size reported will continue to grow until the master is restarted.
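
To track how the reported queue size changes over time, the two numbers in the warning can be pulled out of the log. The following is a minimal sketch, assuming the message keeps the exact wording of the single line quoted above; `QueueWarningParser` and `QueueWarning` are hypothetical names, not part of the plugin.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not plugin code): reads the numbers out of a checkQueueSize warning. */
final class QueueWarningParser {

    /** Queue size and thread-pool size reported by one warning line. */
    static final class QueueWarning {
        final int queueSize;
        final int threadPoolSize;

        QueueWarning(int queueSize, int threadPoolSize) {
            this.queueSize = queueSize;
            this.threadPoolSize = threadPoolSize;
        }
    }

    // Assumption: the pattern is derived only from the one log line quoted in this report.
    private static final Pattern WARNING = Pattern.compile(
            "GerritHandler#checkQueueSize: The Gerrit incoming events queue contains (\\d+) items!"
            + ".*Current thread-pool size: (\\d+)");

    /** Returns the parsed values, or empty if the line is not a queue-size warning. */
    static Optional<QueueWarning> parse(String logLine) {
        Matcher m = WARNING.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new QueueWarning(
                Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))));
    }

    public static void main(String[] args) {
        String line = "2020-05-07 21:35:57.166+0000 [id=1253] WARNING "
                + "c.s.t.g.g.GerritHandler#checkQueueSize: The Gerrit incoming events queue "
                + "contains 247 items! Something might be stuck, or your system can't process "
                + "the commands fast enough. Try to increase the number of receiving worker "
                + "threads. Current thread-pool size: 4";
        parse(line).ifPresent(w ->
                System.out.println("queue=" + w.queueSize + " threads=" + w.threadPoolSize));
    }
}
```

Running this over successive warnings would show whether the queue only grows (as described in this report) or grows and then drains.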

      A thread dump was captured when this issue was happening. There were four Gerrit threads sitting in a WAITING state while the queue was growing:

      "Gerrit Worker EventThread_28" id=691286 (0xa8c56) state=WAITING cpu=95%
          - waiting on <0x66dc3d20> (a java.util.concurrent.CountDownLatch$Sync)
          - locked <0x66dc3d20> (a java.util.concurrent.CountDownLatch$Sync)
          at sun.misc.Unsafe.park(Native Method)
          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
          at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
          at com.sonyericsson.hudson.plugins.gerrit.trigger.hudsontrigger.GerritTrigger.waitForProjectListToBeReady(GerritTrigger.java:1876)
          at com.sonyericsson.hudson.plugins.gerrit.trigger.hudsontrigger.EventListener.gerritEvent(EventListener.java:188)
          at sun.reflect.GeneratedMethodAccessor607.invoke(Unknown Source)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:498)
          at com.sonymobile.tools.gerrit.gerritevents.GerritHandler.notifyListener(GerritHandler.java:496)
          at com.sonymobile.tools.gerrit.gerritevents.GerritHandler.notifyListeners(GerritHandler.java:476)
          at com.sonyericsson.hudson.plugins.gerrit.trigger.JenkinsAwareGerritHandler.notifyListeners(JenkinsAwareGerritHandler.java:80)
          at com.sonymobile.tools.gerrit.gerritevents.workers.AbstractGerritEventWork.perform(AbstractGerritEventWork.java:46)
          at com.sonymobile.tools.gerrit.gerritevents.workers.AbstractJsonObjectWork.perform(AbstractJsonObjectWork.java:77)
          at com.sonymobile.tools.gerrit.gerritevents.workers.StreamEventsStringWork.perform(StreamEventsStringWork.java:67)
          at com.sonymobile.tools.gerrit.gerritevents.GerritHandler$EventWorker.run(GerritHandler.java:302)
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)
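
The trace above shows a pool worker parked in CountDownLatch.await. The sketch below is not the plugin's code and does not explain why the latch stays closed; it only illustrates the consequence under the assumption that every worker waits on a latch that is never counted down. In that case each worker holds one event forever, everything else stays queued, and a larger pool only changes how many events are held by blocked threads. `LatchStallDemo` and `queuedWhileBlocked` are hypothetical names.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Illustrative only: a fixed pool whose workers all wait on the same latch. */
final class LatchStallDemo {

    /**
     * Submits `events` tasks to a pool of `workers` threads. Each task waits on `ready`.
     * Returns how many tasks are still queued once every worker has picked one up.
     * The result is only deterministic while `ready` stays closed.
     */
    static int queuedWhileBlocked(int workers, int events, CountDownLatch ready) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch started = new CountDownLatch(Math.min(workers, events));
        try {
            for (int i = 0; i < events; i++) {
                pool.execute(() -> {
                    started.countDown();
                    try {
                        ready.await(); // the same blocking call that appears in the dump
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Wait until every worker has picked up an event.
            if (!started.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not start in time");
            }
            return pool.getQueue().size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow(); // interrupt the parked workers so the demo can exit
        }
    }

    public static void main(String[] args) {
        CountDownLatch neverReleased = new CountDownLatch(1);
        System.out.println("workers=4 queued=" + queuedWhileBlocked(4, 20, neverReleased));
        System.out.println("workers=8 queued=" + queuedWhileBlocked(8, 20, neverReleased));
    }
}
```

With 20 events, 4 workers leave 16 events queued and 8 workers leave 12. No event is ever completed in either case, which is consistent with the observation that raising the worker thread count does not help.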
      

      We also see these Gerrit threads:

      "com.sonyericsson.hudson.plugins.gerrit.trigger.GerritProjectListUpdater for review-tbs Thread" id=37 (0x25) state=TIMED_WAITING cpu=63%
          - waiting on <0x0b482201> (a com.sonyericsson.hudson.plugins.gerrit.trigger.GerritProjectListUpdater)
          - locked <0x0b482201> (a com.sonyericsson.hudson.plugins.gerrit.trigger.GerritProjectListUpdater)
          at java.lang.Object.wait(Native Method)
          at com.sonyericsson.hudson.plugins.gerrit.trigger.GerritProjectListUpdater.waitFor(GerritProjectListUpdater.java:212)
          at com.sonyericsson.hudson.plugins.gerrit.trigger.GerritProjectListUpdater.run(GerritProjectListUpdater.java:169)
      
      "com.sonyericsson.hudson.plugins.gerrit.trigger.GerritProjectListUpdater for review-tbs-dev Thread" id=38 (0x26) state=TIMED_WAITING cpu=63%
          - waiting on <0x2e1418f7> (a com.sonyericsson.hudson.plugins.gerrit.trigger.GerritProjectListUpdater)
          - locked <0x2e1418f7> (a com.sonyericsson.hudson.plugins.gerrit.trigger.GerritProjectListUpdater)
          at java.lang.Object.wait(Native Method)
          at com.sonyericsson.hudson.plugins.gerrit.trigger.GerritProjectListUpdater.waitFor(GerritProjectListUpdater.java:212)
          at com.sonyericsson.hudson.plugins.gerrit.trigger.GerritProjectListUpdater.run(GerritProjectListUpdater.java:169)
      

      The full thread dump is attached (thread-dump.txt).

      1. Workaround
        1. Restart the affected master.
      2. Business impact
        1. No Gerrit builds are triggered, and the master must be restarted to resolve the problem.

      Attachments

        1. configDiff.png (41 kB)
        2. CustomTracesAwait.png (8 kB)
        3. thread-dump.txt (373 kB)

        Issue Links

          Activity

            emassfo Massimiliano Sforzini added a comment -

            Has anybody had a chance to try the latest Gerrit Trigger Plugin version (2.32.0)? Is the issue still happening in that version? Thanks!

            skalbagge Maciek added a comment - - edited

            Reproducible on 2.32.0.
            In our case it takes several weeks before events stop being received.

            EDIT:
            A notable event took place after a custom build of gerrit-trigger-plugin with additional traces was loaded on Jenkins. The entire system stopped processing events a couple of minutes after restart, and this was reproducible on every restart.

            All Gerrit threads were waiting at a latch (CountDownLatch$Sync), and the job name returned by the trace was always the same:

            The job turned out to be the only one (out of about a hundred similar jobs) that had this difference in its config file:

            Unfortunately, after manually applying this difference to another job's config to see whether it would also cause that job to wait at the latch, I lost the ability to reproduce this strange behavior. The system continued to process events normally and I never saw it again. I can't tell how relevant this is.

            bildrulle Lars Berntzon added a comment - - edited

            We are running 2.36.0 and often see this problem here as well. Our Gerrit is a bit old, though: it's 2.16.15. Is that a problem?

            We see the number of events grow quickly (several items per second) up to around 150, then it suddenly drops and the warning reports about 40 items.

            We have 30 worker threads.

            bildrulle Lars Berntzon added a comment -

            One thing I have noticed is that under the folder gerrit-server-event-data in jenkins-home there is a file named "gerrit-trigger-server-timestamps.xml" containing several thousand records of type <com.sonyericsson.hudson.plugins.gerrit.gerritevents.dto.events.RefReplicated>, all of them failed replication reports. This seems wrong. Why does the Jenkins plugin care about replications?

            bildrulle Lars Berntzon added a comment -

            Follow-up on my previous comment: once I fixed the broken replication in Gerrit, we no longer see the messages about "The Gerrit incoming events queue contains xxx items!".


            People

              rsandell rsandell
              mmclaughlin Mitch McLaughlin
              Votes: 3
              Watchers: 11

              Dates

                Created:
                Updated: