Jenkins / JENKINS-51057

EventDispatcher and ConcurrentLinkedQueue ate my JVM

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: sse-gateway-plugin

      We started running out of memory in our JVM (Xmx 8G) and when looking at Melody's memory (heap) histogram (JENKINS_URL/monitoring?part=heaphisto) the top two items were:


      Class Size (Kb) % size Instances % instances Source
      org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 2,890,399 44 92,492,793 43  
      java.util.concurrent.ConcurrentLinkedQueue$Node 2,167,981 33 92,500,553 43  

      77% (and growing as we were researching the problem) of the memory was being used by these two items.

      I have two support bundles from this time and an .hprof as well.

      I can either screen share with someone or if you can tell me how to analyze these files I would be happy to.
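
      For reference, a similar class histogram can be captured from the Jenkins script console without the Monitoring plugin. This is only a minimal sketch and assumes a HotSpot JVM with the jmap tool available on the controller's PATH:

      // Script console sketch: print the top of the live-object class histogram,
      // roughly the same data as the Monitoring heap-histogram page above.
      def pid = java.lang.management.ManagementFactory.runtimeMXBean.name.split('@')[0]
      def histo = ['jmap', '-histo:live', pid].execute().text
      println histo.readLines().take(25).join('\n')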

          [JENKINS-51057] EventDispatcher and ConcurrentLinkedQueue ate my JVM

          Christian Höltje added a comment - - edited

          Our Jenkins server has been up 23 hours and we're already seeing large numbers of EventDispatcher objects:


           Class  Size (Kb)  % size  Instances  % instances  Source
          org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 96,964 2 3,102,853 4  
          java.util.concurrent.ConcurrentLinkedQueue$Node 73,163 1 3,121,633 4  


          It isn't a problem (yet) but this is alarming.


          Our other Jenkins server has no references to EventDispatcher$Retry in the memory histogram, even when expanding details.


          Christian Höltje added a comment -

          In our logs, we're seeing messages like:

          May 02, 2018 1:04:41 PM WARNING org.jenkinsci.plugins.ssegateway.sse.EventDispatcher unsubscribe
          Invalid SSE unsubscribe configuration. No active subscription matching filter: 
          May 02, 2018 1:04:41 PM WARNING org.jenkinsci.plugins.ssegateway.sse.EventDispatcher unsubscribe
          Invalid SSE unsubscribe configuration. No active subscription matching filter: 
          May 02, 2018 1:04:41 PM WARNING org.jenkinsci.plugins.ssegateway.sse.EventDispatcher unsubscribe
          Invalid SSE unsubscribe configuration. No active subscription matching filter: 
          May 02, 2018 1:04:41 PM WARNING org.jenkinsci.plugins.ssegateway.sse.EventDispatcher unsubscribe
          Invalid SSE unsubscribe configuration. No active subscription matching filter: 

          Could that be related?



          Christian Höltje added a comment -

          And now it's at:

          Class  Size (Kb)  % size  Instances  % instances  Source
          org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 105,174 3 3,365,570 4  
          java.util.concurrent.ConcurrentLinkedQueue$Node 79,051 2 3,372,854 4  


          Christian Höltje added a comment -

          There are a lot of changes/improvements in sse-gateway (mostly by rtyler) since 1.15 was released (back on Jan 16th, 2017).

          In the commit log, I see a commit titled "message leaks", which sounds interesting.

          I tried compiling the master branch and I get errors:

          $ docker run --rm -it -v maven-repo:/root/.m2 -v $PWD:/src:rw -w /src maven:3-jdk-8-alpine mvn verify 
          ...
          [ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.0:npm (npm install) on project sse-gateway: Failed to run task: 'npm install' failed. java.io.IOException: Cannot run program "/src/node/node" (in directory "/src"): error=2, No such file or directory -> [Help 1]
          



          Christian Höltje added a comment -

          I got further when using the non-alpine openjdk image. It still doesn't build, but that is due to tests failing.


          $ docker run --rm -it -v maven-repo:/root/.m2 -v $PWD:/src:rw -w /src maven:3-jdk-8 mvn verify
          ...
          Results :
          Failed tests:
          org.jenkinsci.plugins.ssegateway.EventHistoryStoreTest.test_autoDeleteOnExpire(org.jenkinsci.plugins.ssegateway.EventHistoryStoreTest)
           Run 1: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
           Run 2: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
           Run 3: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
           Run 4: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
           Run 5: EventHistoryStoreTest.test_autoDeleteOnExpire:119 expected:<100> but was:<50>
          org.jenkinsci.plugins.ssegateway.EventHistoryStoreTest.test_delete_stale_events(org.jenkinsci.plugins.ssegateway.EventHistoryStoreTest)
           Run 1: EventHistoryStoreTest.test_delete_stale_events:69
           Run 2: EventHistoryStoreTest.test_delete_stale_events:69
           Run 3: EventHistoryStoreTest.test_delete_stale_events:69
           Run 4: EventHistoryStoreTest.test_delete_stale_events:69
           Run 5: EventHistoryStoreTest.test_delete_stale_events:69
          
          Tests run: 12, Failures: 2, Errors: 0, Skipped: 0
          ...
          [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.19.1:test (default-test) on project sse-gateway: There are test failures.
          ...
          



          Christian Höltje added a comment -

          Some more events from our logs:


          May 03, 2018 1:36:00 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher doDispatch
          Error dispatching event to SSE channel. Write failed.
          May 03, 2018 1:36:36 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher doDispatch
          Error dispatching event to SSE channel. Write failed.
          May 03, 2018 1:36:39 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher processRetries
          Error dispatching retry event to SSE channel. Write failed. Dispatcher jenkins-blueocean-core-js-1525286577663-o9g92 (1006446612).
          May 03, 2018 1:36:39 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher processRetries
          Error dispatching retry event to SSE channel. Write failed. Dispatcher jenkins-blueocean-core-js-1525286577663-o9g92 (1006446612).
          May 03, 2018 1:36:39 PM FINE org.jenkinsci.plugins.ssegateway.sse.EventDispatcher processRetries
          Error dispatching retry event to SSE channel. Write failed. Dispatcher jenkins-blueocean-core-js-1525286577663-o9g92 (1006446612).


          Christian Höltje added a comment -

          This morning:

           Class  Size (Kb)  % size  Instances  % instances  Source
          org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 283,979 5 9,087,330 10  
          java.util.concurrent.ConcurrentLinkedQueue$Node 214,241 4 9,140,981 10  


          Is there a way I can find out what's in the queues to track what's causing the Retry to happen?
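
          A rough script-console sketch for peeking into those queues (the retryQueue field and the session attribute name are taken from the workaround script posted further down in this thread):

          def retryQueueField = org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.getDeclaredField('retryQueue')
          retryQueueField.setAccessible(true)
          def sessions = Jenkins.instance.servletContext.this$0._sessionHandler._sessionCache._sessions
          sessions.each { id, session ->
            def dispatchers = session.sessionData._attributes['org.jenkinsci.plugins.ssegateway.sse.EventDispatcher']
            dispatchers?.each { dispatcherId, dispatcher ->
              def queue = retryQueueField.get(dispatcher)
              // Each Retry carries a timestamp, so the head of the queue shows how long retries have been piling up.
              println "${dispatcherId}: ${queue.size()} queued retries" + (queue.peek() ? ", oldest from ${new Date(queue.peek().timestamp)}" : "")
            }
          }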


          Christian Höltje added a comment -

          Looking in $JENKINS_HOME/logs/sse-events/ I see 16 .json files in jobs and 108 .json files in pipeline, and they change frequently (e.g. a minute later they are 6 and 54).

          Is there another place I can look to find out why there are so many Retry objects?
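
          For keeping an eye on that over time, a small script-console sketch (the directory layout is assumed from the paths mentioned above) can snapshot the counts:

          // Count the .json event files per subdirectory of $JENKINS_HOME/logs/sse-events.
          def base = new File(Jenkins.instance.rootDir, 'logs/sse-events')
          if (base.directory) {
            base.eachDir { dir ->
              def count = dir.listFiles().count { it.name.endsWith('.json') }
              println "${dir.name}: ${count} .json files"
            }
          }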


          Christian Höltje added a comment -

          I managed to use jvisualvm to look at the .hprof from the initial comment, and it was our friends EventDispatcher$Retry and ConcurrentLinkedQueue$Node using up 90% of the memory.


          Christian Höltje added a comment -

          I finally got VisualVM working reasonably well (using lots of RAM and a RAM disk for temp files).

          I found that there are 0 instances of org.jenkinsci.plugins.ssegateway.EventHistoryStore and 0 instances of org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.

          Is it possible that the Retry objects are not being reclaimed by GC?

          When I look at the objects in VisualVM, the Retry object has a reference (item) to a ConcurrentLinkedQueue$Node object.

          The Node object has a reference (item) to the Retry object plus a reference to the next Node in the queue.
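
          For what it's worth, that reference chain is just how ConcurrentLinkedQueue retains its contents; a standalone illustration (not plugin code) of why a queue that is never drained pins everything it holds:

          // Illustration only: a queue that is filled but never polled keeps every wrapping
          // Node, and therefore every queued item, strongly reachable, which matches the
          // Retry/Node pair growing in lock-step in the histograms above.
          def queue = new java.util.concurrent.ConcurrentLinkedQueue()
          100000.times { queue.add('stand-in for an EventDispatcher$Retry') }
          println queue.size()   // all 100000 items stay on the heap until the queue itself becomes unreachable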


          Christian Höltje added a comment -

          Since I didn't see this problem in Blue Ocean 1.4.2 but I'm seeing it in 1.5.0, I went looking for changes in Blue Ocean that may have triggered this problem...

          I found PR #1667 by vivek. It was to fix JENKINS-48820. I doubt it caused this problem, but maybe it made it worse?


          Wilfred Hughes added a comment - - edited

          We're seeing this problem with Blue Ocean 1.4.2 and Jenkins 2.89.4.


          Christian Höltje added a comment -

          The only workaround I found was to completely uninstall Blue Ocean and the pubsub plugins.


          rami stern added a comment -

          We think we might've found the root cause, we're checking it now on prod.

          https://github.com/jenkinsci/sse-gateway-plugin/pull/27


          Gustaf Lundh added a comment -

          We just had to restart a critical Jenkins master due to this memory leak, where org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry and java.util.concurrent.ConcurrentLinkedQueue$Node consumed 20GB of our 40GB heap. We need to get rid of blueocean due to this issue. The priority should be raised to critical IMHO.


          Jon Sten added a comment -

          As a temporary workaround, here's a script which will go through and look for old event dispatchers:

          // First off: due to Groovy and private fields in super classes, we need to
          // change the visibility of retryQueue so that we can use reflection instead...
          def retryQueueField = org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.getDeclaredField('retryQueue')
          retryQueueField.setAccessible(true)
          
          def sessions = Jenkins.instance.servletContext.this$0._sessionHandler._sessionCache._sessions
          println("There are ${sessions.size()} sessions...")
          def numEventsRemoved = 0
          def numEventDispatchersPurged = 0
          sessions.each{id, session->
            def eventDispatchers = session.sessionData._attributes['org.jenkinsci.plugins.ssegateway.sse.EventDispatcher']
            if (eventDispatchers) {
              println("Session ${id} has a ssegateway EventDispatcher map...")
              eventDispatchers.findAll{k, v -> k.startsWith('jenkins-blueocean-core-js')}.each { dispatcherId, dispatcher ->
                def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
                if (retryQueue.isEmpty()) {
                  println("  Found one EventDispatcher, '${dispatcherId}', it has no retry events.")
                } else {
                  def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
                  if (oldestAge > 300) {
                    println("  Found one EventDispatcher, '${dispatcherId}', its oldest retry event is ${oldestAge} seconds old, it contains ${retryQueue.size()} retry events, removing events and unsubscribing.")
                    numEventsRemoved += retryQueue.size()
                    numEventDispatchersPurged++
                    dispatcher.unsubscribeAll()
                    retryQueue.clear()
                  } else {
                    println("  Found one EventDispatcher, its oldest retry event is ${oldestAge} seconds old, so sparing it for now...")
                  }
                }
              }
            }
          }
          
          println("Removed ${numEventsRemoved} retry events from ${numEventDispatchersPurged} EventDispatchers!")
          

          Btw, I've verified this bug in a fresh Jenkins install (2.121.3) and with the latest version of Blue Ocean (1.8.2), which pulls in SSE Gateway (1.15).


          Maxfield Stewart added a comment -

          We're having this problem with Blue Ocean 1.4.2 and Jenkins 2.112. Out of an 18 GB heap, 16 GB is the GuavaPubSub object, which contains the concurrent queue and EventDispatchers. We ran jons's script and it cleaned up only 2.5 GB of Retry objects, but we still had 14 GB of references to EventDispatchers. It takes out our Jenkins server in about 14-15 days.

          I took a heap dump and am adding a screenshot of a MAT analyzer view of the memory consumption. I was hoping someone would accept the PR for the memory leak fix, which seems to have been outstanding for some time now.


          Giorgio Sironi added a comment -

          Experiencing the same problem; it takes ~30 days for Jenkins to build up to its VM memory limit. We are on SSE Gateway 1.15 too.


          Heap     Classes: 10,648,     Instances: 57,541,468,     Kilo-Bytes: 1,715,402

           Class  Size (Kb)  % size  Instances  % instances  Source
          org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 800,661 46 25,621,179 44  
          java.util.concurrent.ConcurrentLinkedQueue$Node 600,538 35 25,622,966 44  
          char[] 125,236 7 1,298,894 2  
          java.lang.String 30,420 1 1,297,943 2  
          short[] 20,271 1 9,119 0  


          Mikkel S. Andersen added a comment -

          Bump. Same issue here on SSE Gateway 1.16, it seems.

          Any chance for a fix in the near future?

          Heap     Classes: 20,087,     Instances: 280,652,984,     Kilo-Bytes: 9,119,689

           Class  Size (Kb)  % size  Instances  % instances  Source
          org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry 3,383,758 37 108,280,267 38  
          java.util.concurrent.ConcurrentLinkedQueue$Node 2,538,166 27 108,295,118 38  
          char[] 1,075,470 11 9,044,530 3  
          java.lang.Object[] 205,241 2 5,771,490 2  
          int[] 182,430 2 1,633,983 0  
          byte[] 165,738 1 558,281 0  
          java.lang.String 157,701 1 6,728,615 2  
          java.util.HashMap$Node 99,299 1 3,177,591 1  


          Maxfield Stewart added a comment -

          This problem continues. It keeps taking production Jenkins offline, forcing restarts every 15-25 days. We're beginning to see if there's a way to deploy Jenkins with zero Blue Ocean elements, as it's not production ready, though the plugin seems to install by default. It's a shame there's been zero ownership of this for almost a year now.


          Álvaro Iradier added a comment -

          Just my 2 cents...

          jons, I tried your cleanup script with mixed results. It was able to empty retryQueues but we still had a big leak. After isolating the problem, I noticed that if the HTTP session from the user expires, that script is not able to find the leaking EventDispatchers. They are, however, still in the GuavaPubSubBus subscribers map, and can be seen with:


          import org.jenkinsci.plugins.pubsub.PubsubBus;
          this.bus = PubsubBus.getBus();
          this.bus.subscribers.each{ channelSubscriber, guavaSubscriber -> println channelSubscriber }
          println "Done"
          


          This outputs some lines containing:

          org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$SSEChannelSubscriber@3472875b 

          which is the inner class that implements the ChannelSubscriber interface. I suspect that the memory is still leaking as it is being referenced through this subscriber.
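
          A quick follow-up sketch, building on the snippet above, to gauge how many of those subscribers the bus is still holding, grouped by class:

          import org.jenkinsci.plugins.pubsub.PubsubBus
          // Count bus subscribers by class; a large SSEChannelSubscriber count here means the
          // bus is still keeping that many dispatchers (and their retry queues) reachable.
          def counts = [:].withDefault { 0 }
          PubsubBus.getBus().subscribers.each { channelSubscriber, guavaSubscriber ->
            counts[channelSubscriber.getClass().getName()]++
          }
          counts.each { name, n -> println "${n}\t${name}" }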

          I saw there is already a PR https://github.com/jenkinsci/sse-gateway-plugin/pull/27 and a fork of the plugin https://github.com/taboola/sse-gateway-plugin

          The approach of this fix is to change the handling of retries so they are aborted and the queue is cleared after some amount of time or number of retries. However, I wonder if the problem is that the cleanup should be done when the HttpSession is destroyed. Currently, I can see in https://github.com/jenkinsci/sse-gateway-plugin/blob/master/src/main/java/org/jenkinsci/plugins/ssegateway/sse/EventDispatcher.java that the cleanup code for sessionDestroyed is:

              /**
               * Http session listener.
               */
              @Extension
              public static final class SSEHttpSessionListener extends HttpSessionListener {
                  @Override
                  public void sessionDestroyed(HttpSessionEvent httpSessionEvent) {
                      try {
                          Map<String, EventDispatcher> dispatchers = EventDispatcherFactory.getDispatchers(httpSessionEvent.getSession());
                          try {
                              for (EventDispatcher dispatcher : dispatchers.values()) {
                                  try {
                                      dispatcher.unsubscribeAll();
                                  } catch (Exception e) {
                                      LOGGER.log(Level.FINE, "Error during unsubscribeAll() for dispatcher " + dispatcher.getId() + ".", e);
                                  }
                              }
                          } finally {
                              dispatchers.clear();
                          }
                      } catch (Exception e) {
                          LOGGER.log(Level.FINE, "Error during session cleanup. The session has probably timed out.", e);
                      }
                  }
              }
          

          but although I can see that for every dispatcher there is a call to dispatcher.unsubscribeAll(), I am missing a call to retryQueue.clear(). But I am just guessing here...

          Anyone knowing the internals of the plugin can confirm this? I might try forking the plugin and testing a modified version.

          Regards.


          Álvaro Iradier added a comment - - edited

          After many tests and cleanup scripts, we were still able to find some EventDispatcher$Retry elements in the heap but no subscribers in the GuavaPubSubBus or EventDispatchers in the HttpSessions. Some additional verification in the plugin made me notice that, as AsyncEventDispatcher was being used, it is possible that the EventDispatchers were still being referenced by some asyncContexts or threads that were not completed or released. Finally, I also added a dispatcher.stop() call to my cleanup script, and it looks like there are no traces of leaked EventDispatcher$Retry classes in the heap anymore, but we are still observing our instance and analyzing heap dumps, so it is too soon to confirm.

          Just in case it can help, this is the cleanup script we are running daily:

          import org.jenkinsci.plugins.pubsub.PubsubBus;
          import org.jenkinsci.plugins.ssegateway.sse.*;
          
          def dryRun = false
          this.bus = PubsubBus.getBus();
          
          // change visibility of retryQueue so that we can use reflection instead...
          def retryQueueField = org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.getDeclaredField('retryQueue')
          retryQueueField.setAccessible(true)
          
          def dispatcherCount = 0
          def dispatchersList = []
          
          //Build a list of EventDispatchers in all existing HTTP sessions
          println "DISPATCHERS IN HTTP SESSIONS"
          println "----------------------------"
          def sessions = Jenkins.instance.servletContext.this$0._sessionHandler._sessionCache._sessions
          sessions.each{id, session->
            def eventDispatchers = EventDispatcherFactory.getDispatchers(session)
            if (eventDispatchers) {
              eventDispatchers.each { dispatcherId, dispatcher ->
                dispatchersList.add(dispatcherId)
                def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
                if (retryQueue.peek() != null) {
                  def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
                  println "Dispatcher: " + dispatcher.getClass().getName() + " - " + dispatcher.id + " with " + retryQueue.size() + " events, oldest is " + oldestAge + " seconds."      
                } else {
                  println "Dispatcher: " + dispatcher.getClass().getName() + " - " + dispatcher.id + " with no retryEvents"
                }      
              }
            }
          }
          
          println "There are " + dispatchersList.size() + " dispatchers in HTTP sessions"
          println ""
          
          //Find all subscribers in bus
          println "DISPATCHERS IN PUBSUBBUS"
          println "------------------------"
          this.bus.subscribers.any{ channelSubscriber, guavaSubscriber ->
            if (channelSubscriber.getClass().getName().equals('org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$SSEChannelSubscriber')) {
              dispatcherCount++
              def dispatcher = channelSubscriber.this$0

              def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
              if (retryQueue.peek() != null) {
                def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
                println "Dispatcher: " + dispatcher.id + " with " + retryQueue.size() + " events, oldest is " + oldestAge + " seconds."
                if (oldestAge > 300) {
                  println "  Clearing retryQueue with " + retryQueue.size() + " events"
                  if (!dryRun) {
                    retryQueue.clear()
                    dispatcher.unsubscribeAll()
                    try {
                      dispatcher.stop()
                    } catch (Exception ex) {
                      println "  !! Exception stopping AsyncDispatcher"
                    }
                  } else {
                    println "  Ignoring, dryrun"
                  }
                }
              } else {
                println "Dispatcher: " + dispatcher.id + " with no retryEvents"
              }

              if (dispatchersList.indexOf(dispatcher.id) < 0) {
                println "  Dispatcher is not in HTTP Sessions. Clearing"
                if (!dryRun) {
                  dispatcher.unsubscribeAll()
                  try {
                    dispatcher.stop()
                  } catch (Exception ex) {
                    println "  !! Exception stopping AsyncDispatcher"
                  }
                } else {
                  println "  Ignoring, dryrun"
                }
              }
            }
          }
          
          println ""
          println "Dispatchers in PubSubBus: " + dispatcherCount
          



          Jon Sten added a comment -

          airadier Thank you for the investigation. In our case I've only noticed EventDispatcher objects being kept alive by the HTTP sessions (for some unknown reason sessions aren't terminated, might be related to the LDAP plugin, not sure though). However, as you noted it seems like there are other places where the objects are kept referenced on the heap. So some kind of timeout of the dispatcher objects and sessions is really needed.


          Maxfield Stewart added a comment -

          airadier Thank you for the hard work; your script almost works on some of our servers. I had to modify the code to put an exception trap around "dispatcher.stop()", as in some cases that was blowing up on my server; after that it cleared memory.

          However, on another (older) version of Jenkins the above code explodes trying to access `def dispatcher = channelSubscriber.this$0`, as the reflection property fails and doesn't return the parent class (Dispatcher). Not sure what's up there, but we're upgrading that server soon and I'll try again.


          Álvaro Iradier added a comment -

          Hi maxfields2000, yes, I also got the exception after some tests in our production environment, and had to wrap it in try-catch. I just updated my comment with the latest version of the script we are running. Probably the dispatcher.stop() is not necessary, as I almost always get an exception, but it won't hurt. Also, I noticed that the "Dispatcher is not in HTTP Sessions. Clearing" part was not executed most of the time, so I copied the cleanup part (unsubscribeAll() and the .stop()) to the "Clearing retryQueue with..." part.

          After a couple of weeks running, I examined the heap dumps and saw no trace of any leak regarding the EventDispatcher$Retry elements, so for us it is working. Looking forward to a fix in the official upstream version of the plugin.


          Maxfield Stewart added a comment -

          airadier Confirmed, by the way, that your script needs a newer/newest version of the plugins, not entirely sure which ones. Of the 3-4 deployments I manage, there was 1 the script was not working for (couldn't get the parent dispatcher via this$0) until I moved to the latest versions. Tested on Blue Ocean 1.13 and 1.14, Jenkins v2.168. It doesn't seem to work on versions of Blue Ocean older than 1.10/Jenkins 2.147.


          Dan Mordechai added a comment - - edited

          airadier Thank you for this script! I will definitely try it as we are facing the same issue on our 2.107.2 Jenkins.

          Some things to point out:

          1. I had to change the line

               def dispatcher = channelSubscriber.this$0

             to

               def dispatcher = channelSubscriber.eventDispatcher

          2. Your script will only work for SSE Gateway plugin versions > 1.15, as in older versions the SSEChannelSubscriber inner class in EventDispatcher.java does not have the eventDispatcher member.


          Raihaan Shouhell added a comment -

          A fix got merged to master that definitely helped our masters.


          Olivier Lamy added a comment -

          Fixed with https://github.com/jenkinsci/sse-gateway-plugin/commit/9cfe9d9c4d9e284ee7d099d4e734b3677ee677f9

          Andreas Galek added a comment -

          Having installed the newest plugin version, 1.19, I was facing error messages like this:

          Aug 09, 2019 6:36:46 PM hudson.init.impl.InstallUncaughtExceptionHandler$DefaultUncaughtExceptionHandler uncaughtException
          SEVERE: A thread (EventDispatcher.retryProcessor/37063) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.
          java.lang.OutOfMemoryError: unable to create new native thread
              at java.lang.Thread.start0(Native Method)
              at java.lang.Thread.start(Thread.java:717)
              at java.util.Timer.<init>(Timer.java:160)
              at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.scheduleRetryQueueProcessing(EventDispatcher.java:296)
              at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.processRetries(EventDispatcher.java:437)
              at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$1.run(EventDispatcher.java:299)
              at java.util.TimerThread.mainLoop(Timer.java:555)
              at java.util.TimerThread.run(Timer.java:505)
          Is it the same problem as the one you fixed in this Jira ticket?


          Olivier Lamy added a comment -

          andreasgk It is a side-effect bug of the fix; please read here: https://issues.jenkins-ci.org/browse/JENKINS-58684


            Assignee: olamy Olivier Lamy
            Reporter: docwhat Christian Höltje
            Votes: 16
            Watchers: 24
