[JENKINS-51057] EventDispatcher and ConcurrentLinkedQueue ate my JVM

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: sse-gateway-plugin

      We started running out of memory in our JVM (Xmx 8G) and when looking at Melody's memory (heap) histogram (JENKINS_URL/monitoring?part=heaphisto) the top two items were:

       

      Class                                                        Size (Kb)   % size   Instances    % instances   Source
      org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$Retry   2,890,399   44       92,492,793   43
      java.util.concurrent.ConcurrentLinkedQueue$Node              2,167,981   33       92,500,553   43

       
      77% (and growing as we were researching the problem) of the memory was being used by these two items.
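As a rough sanity check on the histogram, the average size per instance can be worked out from the two rows. The sketch below assumes the "Size (Kb)" column uses 1 Kb = 1024 bytes; the class and method names are made up for illustration. The near-identical instance counts are consistent with each Retry object being held by exactly one queue node.

```java
// Minimal sketch: average bytes per instance from the histogram figures above.
// Assumption: "Size (Kb)" means units of 1024 bytes.
final class HistogramMath {

    /** Average bytes per instance for one histogram row. */
    static double bytesPerInstance(long sizeKb, long instances) {
        return sizeKb * 1024.0 / instances;
    }

    public static void main(String[] args) {
        // Figures copied from the two rows of the histogram.
        double retry = bytesPerInstance(2_890_399L, 92_492_793L);
        double node = bytesPerInstance(2_167_981L, 92_500_553L);
        System.out.printf("EventDispatcher$Retry:      %.2f bytes/instance%n", retry);
        System.out.printf("ConcurrentLinkedQueue$Node: %.2f bytes/instance%n", node);
    }
}
```

Under that assumption the result is about 32 bytes per Retry and 24 bytes per Node. The objects themselves are small, so the memory pressure comes from their count (about 92 million each).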

      I have two support bundles from this time and an .hprof as well.

      I can either screen share with someone or if you can tell me how to analyze these files I would be happy to.


          Álvaro Iradier added a comment - - edited

          After many tests and cleanup scripts, we were still able to find some EventDispatcher$Retry elements in the heap, but no subscribers in the GuavaPubSubBus or EventDispatchers in the HttpSessions. Some additional checks in the plugin made me notice that, because AsyncEventDispatcher was being used, the EventDispatchers could still be referenced by asyncContexts or threads that had not completed or been released. Finally I also added a dispatcher.stop() call to my cleanup script, and it looks like there are no traces of leaked EventDispatcher$Retry classes in the heap anymore. We are still observing our instance and analyzing heap dumps, so it is too soon to confirm.

          Just in case it can help, this is the cleanup script we are running daily:

          import org.jenkinsci.plugins.pubsub.PubsubBus;
          import org.jenkinsci.plugins.ssegateway.sse.*;
          
          def dryRun = false
          this.bus = PubsubBus.getBus();
          
          // change visibility of retryQueue so that we can use reflection instead...
          def retryQueueField = org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.getDeclaredField('retryQueue')
          retryQueueField.setAccessible(true)
          
          def dispatcherCount = 0
          def dispatchersList = []
          
          //Build a list of EventDispatchers in all existing HTTP sessions
          println "DISPATCHERS IN HTTP SESSIONS"
          println "----------------------------"
          def sessions = Jenkins.instance.servletContext.this$0._sessionHandler._sessionCache._sessions
          sessions.each{id, session->
            def eventDispatchers = EventDispatcherFactory.getDispatchers(session)
            if (eventDispatchers) {
              eventDispatchers.each { dispatcherId, dispatcher ->
                dispatchersList.add(dispatcherId)
                def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
                if (retryQueue.peek() != null) {
                  def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
                  println "Dispatcher: " + dispatcher.getClass().getName() + " - " + dispatcher.id + " with " + retryQueue.size() + " events, oldest is " + oldestAge + " seconds."      
                } else {
                  println "Dispatcher: " + dispatcher.getClass().getName() + " - " + dispatcher.id + " with no retryEvents"
                }      
              }
            }
          }
          
          println "There are " + dispatchersList.size() + " dispatchers in HTTP sessions"
          println ""
          
          //Find all subscribers in bus
          println "DISPATCHERS IN PUBSUBBUS"
          println "------------------------"
          this.bus.subscribers.each { channelSubscriber, guavaSubscriber ->
            if (channelSubscriber.getClass().getName().equals('org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$SSEChannelSubscriber')) {
              dispatcherCount++
              def dispatcher = channelSubscriber.this$0

              def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
              if (retryQueue.peek() != null) {
                def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
                println "Dispatcher: " + dispatcher.id + " with " + retryQueue.size() + " events, oldest is " + oldestAge + " seconds."
                // Treat a queue whose oldest event is more than 5 minutes old as leaked
                if (oldestAge > 300) {
                  println "  Clearing retryQueue with " + retryQueue.size() + " events"
                  if (!dryRun) {
                    retryQueue.clear()
                    dispatcher.unsubscribeAll()
                    try {
                      dispatcher.stop()
                    } catch (Exception ex) {
                      println "  !! Exception stopping AsyncDispatcher"
                    }
                  } else {
                    println "  Ignoring, dryrun"
                  }
                }
              } else {
                println "Dispatcher: " + dispatcher.id + " with no retryEvents"
              }

              // Dispatchers no longer attached to any HTTP session are unsubscribed as well
              if (dispatchersList.indexOf(dispatcher.id) < 0) {
                println "  Dispatcher is not in HTTP Sessions. Clearing"
                if (!dryRun) {
                  dispatcher.unsubscribeAll()
                  try {
                    dispatcher.stop()
                  } catch (Exception ex) {
                    println "  !! Exception stopping AsyncDispatcher"
                  }
                } else {
                  println "  Ignoring, dryrun"
                }
              }
            }
          }
          
          println ""
          println "Dispatchers in PubSubBus: " + dispatcherCount
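
The age check at the core of the script (peek at the head of the retry queue, compute its age, clear the queue when it is older than 300 seconds) can be isolated as a small Java sketch. The `Retry` class here is a hypothetical stand-in that carries only a timestamp. It is not the plugin's class, and the sketch assumes the queue is FIFO, so the head is the oldest entry.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal sketch of the "clear the queue if its oldest entry is too old" rule.
final class StaleRetryQueue {

    /** Hypothetical stand-in for a queued retry: only the enqueue time matters here. */
    static final class Retry {
        final long timestamp;
        Retry(long timestamp) { this.timestamp = timestamp; }
    }

    /** Age in whole seconds of the head of the queue, or -1 if the queue is empty. */
    static long oldestAgeSeconds(Queue<Retry> queue, long nowMillis) {
        Retry head = queue.peek();
        return head == null ? -1 : (nowMillis - head.timestamp) / 1000;
    }

    /** Clears the queue when its oldest entry is older than maxAgeSeconds; returns the number dropped. */
    static int clearIfStale(Queue<Retry> queue, long nowMillis, long maxAgeSeconds) {
        if (oldestAgeSeconds(queue, nowMillis) <= maxAgeSeconds) {
            return 0;
        }
        int dropped = queue.size(); // O(n) on ConcurrentLinkedQueue, acceptable for a periodic cleanup
        queue.clear();
        return dropped;
    }

    public static void main(String[] args) {
        Queue<Retry> queue = new ConcurrentLinkedQueue<>();
        long now = System.currentTimeMillis();
        queue.add(new Retry(now - 600_000)); // enqueued 10 minutes ago
        queue.add(new Retry(now - 1_000));   // enqueued 1 second ago
        System.out.println("dropped " + clearIfStale(queue, now, 300));
    }
}
```

Passing the current time in as a parameter keeps the check deterministic and easy to test. The 300-second threshold is the value used in the script, not a recommendation.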
          

           


          Jon Sten added a comment -

          airadier Thank you for the investigation. In our case I've only noticed EventDispatcher objects being kept alive by the HTTP sessions (for some unknown reason sessions aren't terminated, might be related to the LDAP plugin, not sure though). However, as you noted it seems like there are other places where the objects are kept referenced on the heap. So some kind of timeout of the dispatcher objects and sessions is really needed.
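
One way to picture the timeout suggested here is a registry that records when each dispatcher was last active and evicts entries that have been idle for too long. This is only an illustrative sketch. `IdleDispatcherRegistry` and its methods are hypothetical names, and it does not describe how the plugin was eventually fixed.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of an idle timeout for dispatchers, keyed by dispatcher id.
final class IdleDispatcherRegistry {
    private final Map<String, Long> lastSeenMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    IdleDispatcherRegistry(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records activity for a dispatcher (for example, an event was delivered). */
    void touch(String dispatcherId, long nowMillis) {
        lastSeenMillis.put(dispatcherId, nowMillis);
    }

    boolean contains(String dispatcherId) {
        return lastSeenMillis.containsKey(dispatcherId);
    }

    /** Removes every dispatcher idle for longer than the timeout; returns how many were removed. */
    int expire(long nowMillis) {
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = lastSeenMillis.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() > timeoutMillis) {
                // A real implementation would also unsubscribe and stop the dispatcher here.
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        IdleDispatcherRegistry registry = new IdleDispatcherRegistry(300_000);
        registry.touch("active", 900_000);
        registry.touch("idle", 100_000);
        System.out.println("expired " + registry.expire(1_000_000));
    }
}
```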


          Maxfield Stewart added a comment -

          airadier Thank you for the hard work, your script almost works on some of our servers. I had to modify the code to put an exception trap around "dispatcher.stop()", as in some cases that call was blowing up on my server; after that it cleared memory.

          However, on another (older) version of Jenkins the above code explodes when trying to access `def dispatcher = channelSubscriber.this$0`, as the reflection property lookup fails and doesn't return the parent class (Dispatcher). Not sure what's up there, but we're upgrading that server soon and I'll try again.

          Álvaro Iradier added a comment -

          Hi maxfields2000, yes, I also got the exception after some tests in our production environment, and had to wrap the call in a try-catch. I just updated my comment with the latest version of the script we are running. The dispatcher.stop() call is probably not necessary, as I almost always get an exception, but it won't hurt. I also noticed that the "Dispatcher is not in HTTP Sessions. Clearing" part was not executed most of the time, so I copied the cleanup part (unsubscribeAll() and stop()) to the "Clearing retryQueue with..." part.

          After a couple of weeks running, I examined the thread dumps and saw no trace of any leak regarding the EventDispatcher$Retry elements, so for us it is working. Looking forward to a fix in the official upstream version of the plugin.

          Maxfield Stewart added a comment -

          airadier Confirmed, by the way, that your script needs newer versions of the plugins, though I'm not entirely sure which ones. Of the 3 or 4 deployments I manage, there was 1 where the script did not work (it couldn't get the parent dispatcher via this$0) until I moved to the latest versions. Tested on Blue Ocean 1.13 and 1.14, Jenkins v2.168. It doesn't seem to work on versions of Blue Ocean older than 1.10/Jenkins 2.147.

          Dan Mordechai added a comment - - edited

          airadier Thank you for this script! I will definitely try it as we are facing the same issue on our 2.107.2 Jenkins.

          Some things to point out:

          1. I had to change the line

            def dispatcher = channelSubscriber.this$0

            to

            def dispatcher = channelSubscriber.eventDispatcher

          2. Your script will only work for SSE gateway plugin versions > 1.15, as the SSEChannelSubscriber inner class in EventDispatcher.java no longer has the eventDispatcher member in those versions.
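
Since the field that holds the owning dispatcher differs between plugin versions (an explicit `eventDispatcher` member in some, the compiler-generated `this$0` of an inner class in others), a lookup that tries several field names in order is one way to cope with both. The sketch below shows the idea in plain Java. `Dispatcher`, `ExplicitSubscriber` and `OwnerLookup` are hypothetical stand-ins, not the plugin's classes.

```java
import java.lang.reflect.Field;

// Minimal sketch: read the first field that exists from a list of candidate names.
final class OwnerLookup {

    /** Returns the value of the first declared field in names that exists on target's class. */
    static Object readFirstField(Object target, String... names) throws IllegalAccessException {
        for (String name : names) {
            try {
                Field field = target.getClass().getDeclaredField(name);
                field.setAccessible(true);
                return field.get(target);
            } catch (NoSuchFieldException ignored) {
                // This class layout does not have that field; try the next candidate.
            }
        }
        throw new IllegalStateException("No candidate field found on " + target.getClass().getName());
    }

    public static void main(String[] args) throws Exception {
        Dispatcher dispatcher = new Dispatcher("d-1");
        Object viaExplicit = readFirstField(new ExplicitSubscriber(dispatcher), "eventDispatcher", "this$0");
        Object viaInner = readFirstField(dispatcher.new InnerSubscriber(), "eventDispatcher", "this$0");
        System.out.println((viaExplicit == dispatcher) + " " + (viaInner == dispatcher));
    }
}

/** Hypothetical stand-in for a dispatcher. */
class Dispatcher {
    final String id;
    Dispatcher(String id) { this.id = id; }

    /** Inner (non-static) class: javac keeps the outer instance in the synthetic field this$0. */
    class InnerSubscriber {
        String ownerId() { return id; } // uses the outer instance, so the reference is retained
    }
}

/** Hypothetical stand-in for a subscriber that holds its dispatcher in an explicit field. */
class ExplicitSubscriber {
    private final Dispatcher eventDispatcher;
    ExplicitSubscriber(Dispatcher eventDispatcher) { this.eventDispatcher = eventDispatcher; }
}
```

In the Groovy script, the equivalent would be to try one property and fall back to the other. Either way, the cleanup relies on private implementation details that can change between releases.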


          Raihaan Shouhell added a comment -

          A fix got merged to master that definitely helped our masters.

          Olivier Lamy added a comment -

          Fixed with https://github.com/jenkinsci/sse-gateway-plugin/commit/9cfe9d9c4d9e284ee7d099d4e734b3677ee677f9

          Andreas Galek added a comment -

          After installing the newest plugin version, 1.19, I have been getting error messages like this:

           
          Aug 09, 2019 6:36:46 PM hudson.init.impl.InstallUncaughtExceptionHandler$DefaultUncaughtExceptionHandler uncaughtException
          SEVERE: A thread (EventDispatcher.retryProcessor/37063) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code.
          java.lang.OutOfMemoryError: unable to create new native thread
              at java.lang.Thread.start0(Native Method)
              at java.lang.Thread.start(Thread.java:717)
              at java.util.Timer.<init>(Timer.java:160)
              at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.scheduleRetryQueueProcessing(EventDispatcher.java:296)
              at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.processRetries(EventDispatcher.java:437)
              at org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$1.run(EventDispatcher.java:299)
              at java.util.TimerThread.mainLoop(Timer.java:555)
              at java.util.TimerThread.run(Timer.java:505)
           
           
          Is it the same problem as the one you fixed in this Jira ticket?
           


          Olivier Lamy added a comment -

          andreasgk That is a side-effect bug of the fix; please read https://issues.jenkins-ci.org/browse/JENKINS-58684
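
For context on the stack trace: it shows a `java.util.Timer` being constructed inside `scheduleRetryQueueProcessing`, which is itself reached from a running `TimerTask`. Every `Timer` starts its own thread, so if timers are created on each retry round and not cancelled, the thread count can grow until the JVM cannot create more native threads. The sketch below illustrates a common alternative, a single shared `ScheduledExecutorService`, under the assumption that retry rounds can share one scheduler thread. It is not a description of the change tracked in JENKINS-58684, and `SharedRetryScheduler` is a hypothetical name.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch: many scheduled retry rounds served by one scheduler thread.
final class SharedRetryScheduler {

    /** Schedules the given number of rounds and returns the names of the threads that ran them. */
    static Set<String> runRounds(int rounds) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        Set<String> threadNames = ConcurrentHashMap.newKeySet();
        CountDownLatch done = new CountDownLatch(rounds);
        try {
            for (int i = 0; i < rounds; i++) {
                scheduler.schedule(() -> {
                    // Placeholder for "process the retry queue once".
                    threadNames.add(Thread.currentThread().getName());
                    done.countDown();
                }, 10, TimeUnit.MILLISECONDS);
            }
            if (!done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("retry rounds did not finish in time");
            }
        } finally {
            scheduler.shutdownNow(); // release the scheduler thread when done
        }
        return threadNames;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("threads used: " + runRounds(100).size());
    }
}
```

Because all rounds go through the same single-thread scheduler, the number of threads stays constant no matter how many retries are scheduled.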


            olamy Olivier Lamy
            docwhat Christian Höltje