JENKINS-50223

Pipeline Jenkins master ran out of Java Metaspace

    • Fixed in: workflow-cps 2.69

      Setup

      Jenkins v2.89.3 / v2.89.4 LTS
      Pipeline API Plugin: 2.26
      Pipeline Nodes and Processes Plugin: 2.19
      Durable Task Plugin: 1.18
      Pipeline Job Plugin: 2.17
      Pipeline Shared Groovy Libraries Plugin: 2.9
      Pipeline Supporting APIs Plugin: 2.18
      Pipeline Groovy Plugin: 2.45
      Script Security Plugin: 1.41

      Pipeline Default Speed/Durability Level: Performance-Optimized
      "Do not allow the pipeline to resume if the master restarts": Enabled on all jobs

      Problem

      Shortly after upgrading from Jenkins 2.60.3 to Jenkins 2.89.3 LTS, we experienced a java.lang.OutOfMemoryError: Metaspace error. I'm unsure if this is a bug or if increased Java Metaspace usage is to be expected in the newer version of Jenkins and the pipeline plugins.

      Details

      Jenkins v2.60.3

      In Jenkins 2.60.3 we used the memory parameters below, and the Java nonheap space (Metaspace) stayed between 350 MB and 550 MB. We ran for months without restarting Jenkins.

      -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:MaxMetaspaceSize=1g -XX:MetaspaceSize=256M -Xms4g -Xmx8g -Dgroovy.use.classvalue=true

      Jenkins v2.89.3

      After upgrading to Jenkins 2.89.3 and upgrading all of the plugins, we used the Java memory settings below. We removed the "groovy.use.classvalue=true" parameter based on the recommendation here: https://support.cloudbees.com/hc/en-us/articles/115001505172-Jenkins-hangs-due-to-Groovy-ClassInfo-deadlock. We saw the Java nonheap memory fluctuate between 250 MB and 750 MB, slowly trending upwards. After a few weeks, nonheap memory was at 1 GB and Jenkins no longer responded to requests. The log file was full of java.lang.OutOfMemoryError: Metaspace errors.

      -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:MaxMetaspaceSize=1g -XX:MetaspaceSize=256M -Xms4g -Xmx8g

      Debugging

      Our Jenkins masters run a large number of Jenkins Pipeline jobs, and I noticed that all of our Jenkins 2.89.4 masters show higher Java nonheap memory usage than before the Jenkins upgrade. We disabled pipeline durability but still see high memory usage. Standard Java heap memory usage looks fine.

      Example 1:

      Jenkins Master 1 was completely idle, and I performed a System.gc() to force a full garbage collection. Afterwards, I saw that the nonheap memory was still over 700 MB. Below are some of the Java details I collected on that installation just after the full GC.
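      (Data like the following can be collected with the JDK's jcmd tool, for example with the commands below; VM.native_memory requires starting the JVM with -XX:NativeMemoryTracking=summary, and GC.class_stats requires -XX:+UnlockDiagnosticVMOptions.)

      jcmd <jenkins-pid> VM.native_memory summary
      jcmd <jenkins-pid> GC.class_histogram
      jcmd <jenkins-pid> GC.class_stats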

      VM.native_memory summary

      Native Memory Tracking:

      Total: reserved=11014208KB, committed=5551020KB
      Java Heap (reserved=8388608KB, committed=4194304KB)
      (mmap: reserved=8388608KB, committed=4194304KB)

      Class (reserved=1686933KB, committed=707989KB)
      (classes #48326)
      (malloc=13717KB #208203)
      (mmap: reserved=1673216KB, committed=694272KB)

      Thread (reserved=93951KB, committed=93951KB)
      (thread #92)
      (stack: reserved=93548KB, committed=93548KB)
      (malloc=296KB #463)
      (arena=107KB #182)

      Code (reserved=273528KB, committed=155620KB)
      (malloc=23928KB #24978)
      (mmap: reserved=249600KB, committed=131692KB)

      GC (reserved=412351KB, committed=256703KB)
      (malloc=68287KB #693569)
      (mmap: reserved=344064KB, committed=188416KB)

      Compiler (reserved=406KB, committed=406KB)
      (malloc=276KB #2009)
      (arena=131KB #3)

      Internal (reserved=88791KB, committed=88791KB)
      (malloc=88759KB #184270)
      (mmap: reserved=32KB, committed=32KB)

      Symbol (reserved=30516KB, committed=30516KB)
      (malloc=27279KB #301740)
      (arena=3236KB #1)

      Native Memory Tracking (reserved=22549KB, committed=22549KB)
      (malloc=348KB #5361)
      (tracking overhead=22201KB)

      Arena Chunk (reserved=190KB, committed=190KB)
      (malloc=190KB)

      Unknown (reserved=16384KB, committed=0KB)
      (mmap: reserved=16384KB, committed=0KB)

      GC.class_histogram

      num #instances #bytes class name
      ----------------------------------------------
      1: 490462 84832616 [C
      2: 2552835 40845360 com.cloudbees.groovy.cps.impl.ConstantBlock
      3: 930699 37227960 com.cloudbees.groovy.cps.impl.FunctionCallBlock
      4: 1493734 35849616 com.cloudbees.groovy.cps.impl.SourceLocation
      5: 883507 33258176 [Ljava.lang.Object;
      6: 179552 30097544 [B
      7: 922229 29511328 java.util.HashMap$Node
      8: 1151159 27386104 [Lcom.cloudbees.groovy.cps.Block;
      9: 947492 22739808 java.lang.String
      10: 790957 18982968 com.cloudbees.groovy.cps.impl.LocalVariableBlock
      11: 213822 13097984 [Ljava.util.HashMap$Node;
      12: 519301 12463224 com.cloudbees.groovy.cps.impl.SequenceBlock
      13: 452808 10867392 java.util.ArrayList
      14: 320616 10259712 com.cloudbees.groovy.cps.impl.PropertyAccessBlock
      15: 250810 10032400 com.google.common.cache.LocalCache$WeakEntry
      16: 168578 9440368 org.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry
      17: 260734 8343488 java.util.concurrent.locks.ReentrantLock$NonfairSync
      18: 250147 7394208 [Lhudson.model.Action;
      19: 142590 6844320 java.util.HashMap
      20: 139363 6689424 org.jenkinsci.plugins.workflow.cps.nodes.StepAtomNode
      21: 264178 6340272 com.google.common.collect.SingletonImmutableList
      22: 195095 6243040 com.cloudbees.groovy.cps.impl.AssignmentBlock
      23: 49477 6237064 java.lang.Class
      24: 253041 6072984 java.util.concurrent.CopyOnWriteArrayList
      25: 250681 6016344 org.jenkinsci.plugins.workflow.support.storage.BulkFlowNodeStorage$Tag

      GC.class_stats

      We use the Pipeline Shared Groovy Libraries Plugin and noticed a large number of classes with names similar to those below. Each of our shared library classes (org.xxxxxxxx.scripts.xxxxxxx) had about 400 references.

      script15202683643871464209649
      sun.reflect.GeneratedMethodAccessor2903
      org.xxxxxxxx.scripts.xxxxxxx

      Example 2

      I ran a Java heap dump through an analyzer program, and below are what it reported as the top consumers.

      Class Name Objects Shallow Heap Retained Heap
      org.jenkinsci.plugins.workflow.support.actions.LogActionImpl 976,720 23,441,280 >= 343,262,256
      org.jenkinsci.plugins.workflow.cps.actions.ArgumentsActionImpl 735,961 17,663,064 >= 305,348,632

      Example 3

      I ran a different Java heap dump through the analyzer and below are the results. Again this was after a full GC.

      Classes = 35k
      Objects = 47M
      Class Loaders = 4.6k
      GC Roots = 4.3k

      952,652 instances of "org.jenkinsci.plugins.workflow.support.actions.LogActionImpl", loaded by "hudson.ClassicPluginStrategy$AntClassLoader2 @ 0x5c1746598" occupy 342,684,616 (20.83%) bytes. These instances are referenced from one instance of "org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment[]", loaded by "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x5c0000000"

      717,629 instances of "org.jenkinsci.plugins.workflow.cps.actions.ArgumentsActionImpl", loaded by "hudson.ClassicPluginStrategy$AntClassLoader2 @ 0x5c17d9770" occupy 296,727,528 (18.04%) bytes. These instances are referenced from one instance of "org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment[]", loaded by "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x5c0000000"

      293 instances of "org.jenkinsci.plugins.workflow.cps.CpsFlowExecution", loaded by "hudson.ClassicPluginStrategy$AntClassLoader2 @ 0x5c17d9770" occupy 210,909,800 (12.82%) bytes. These instances are referenced from one instance of "org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment[]", loaded by "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x5c0000000"

      Package Retained Heap Retained Heap, % Top Dominators
      workflow 1,340,511,168 81.50% 5,043,559

      Workaround

      We once again added "-Dgroovy.use.classvalue=true" to the Java startup parameters and have noticed a reduction in Java nonheap memory usage. With the parameter, we see memory increase to around 750 MB, but when the server becomes idle the Java nonheap memory drops back to under 300 MB. Without the parameter we don't see the same reduction. Currently, one of our Jenkins masters that doesn't have the parameter started on 3/2/18 at 0.2 GB and is currently (3/16/18) at 0.76 GB. The graph looks like a slow climb without much fluctuation. According to the CloudBees article, this parameter is not recommended for Jenkins 2.89; however, it does seem to help lower Java nonheap memory usage. Since usage increases by over 200 MB a week, I should know shortly whether some type of garbage collection reduces it on this second Jenkins master, which is nearing the 1 GB max. Java nonheap memory usage was reduced by only 1 MB when I manually performed a System.gc().

          [JENKINS-50223] Pipeline Jenkins master ran out of Java Metaspace

          Mike Kozell added a comment -

          We are able to reproduce the memory leak with two examples that are not specific to our code. In both examples, and in our potential workaround, the Jenkins jobs were all successful and the output in the console log was the same. The classes remained in the GC.class_stats query when using the examples and did not when using the workaround.

          The shared library defines an abstract Animal class and a concrete Lion class that extends Animal (a sketch of these classes follows).
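          The library classes themselves are not reproduced in this report; a minimal sketch consistent with how they are used in the examples below (the package, file locations, and the body of sound() are assumptions) would be something like:

          // src/org/example/zoo/Animal.groovy  (assumed location and package)
          package org.example.zoo

          import org.jenkinsci.plugins.workflow.cps.DSL

          abstract class Animal implements Serializable {
              String noise

              // Echo the animal's noise via the pipeline 'steps' DSL object
              void sound(DSL steps, String noise) {
                  steps.echo(noise)
              }
          }

          // src/org/example/zoo/Lion.groovy  (assumed location)
          package org.example.zoo

          class Lion extends Animal {
          }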

          Example 1

          We create a custom pipeline step and pass in an instance of the extended abstract class. The issue occurs when we use "runIdentify(AbstractObject)".

          runZoo.groovy
          
          def call() {
              node {
                  def myLion = new Lion()
                  myLion.noise = "Roar"
                  myLion.sound(steps, myLion.noise)
                  runIdentify(myLion)
              }
          }
          
          
          runIdentify.groovy
          
          def call(Animal animal) {
              if (animal.noise.toLowerCase().contains('roar')) {
                  println('This must be a Lion')
              } else {
                  println('NOT a Lion')
              }
          }

           

          Example 2

          We instantiate an object of the extended abstract class and pass it into a constructor. The issue occurs when we use "new Identity(AbstractObject)" to create the new object.

          runZoo.groovy
          
          def call() {
              node {
                  def myLion = new Lion()
                  myLion.noise = "Roar"
                  myLion.sound(steps,myLion.noise)
                  new Identity(steps,myLion)
              }
          }
          
          Identity.groovy
          
          import org.jenkinsci.plugins.workflow.cps.DSL  // needed in a library src class; the Animal import is elided in this report

          class Identity implements Serializable {
              Identity(DSL steps, Animal animal) {
                  if (animal.noise.toLowerCase().contains('roar')) {
                      steps.echo('This must be a Lion')
                  } else {
                      steps.echo('NOT a Lion')
                  }
              }
          }
          

          Workaround

          We were able to work around the issue by adding a new method to the abstract or concrete class.

          runZoo.groovy
          
          def call() {
              node {
                  def myLion = new Lion()
                  myLion.noise = "Roar"
                  myLion.sound(steps,myLion.noise)
                  myLion.identifyAnimal(steps)
              }
          }
          
              // Added to the abstract Animal (or concrete Lion) class:
              def identifyAnimal(DSL steps) {
                  if (this.noise.toLowerCase().contains('roar')) {
                      steps.echo('This must be a Lion')
                  }
              }
          

           

          I'm not sure if the memory leak is a bug or if I'm doing something incorrect with pipeline in the above examples. The output on the screen for all three scenarios is the same and the builds are successful.


          James Howe added a comment -

          Just to note:

          I performed a System.gc() to force a full garbage collection

          If you have -XX:+ExplicitGCInvokesConcurrent then it doesn't do a "full" collection.


          Jesse Glick added a comment -

          I was able to reproduce a similar leak based on these hints. The use of the node step, as well as a library, appears critical: the leak goes through ExecutorStepExecution.PlaceholderTask.context, but that is not really at fault: StepContext is advertised as Serializable, and a CpsStepContext.outcome will hold a reference to the return value of the step, even after the build finishes. (PlaceholderTask can stick around for a while, for example in Queue.leftItems.) With that fixed, a basic test passes, but in a real Jenkins instance the library classes (and their loaders) are nonetheless retained, though with no apparent root references, so I may be missing a secondary issue.


          Jesse Glick added a comment -

          In a live Jenkins instance, any use of a type (from src), but not a global variable (from vars), triggers a mystery leak for me. The MemoryAssert test does not reproduce this, since there are no apparent root references. Perhaps that is what the comment of 2018-04-02 alludes to. Running with -XX:MetaspaceSize=1m -verbose:gc -verbose:class seems to confirm that while WorkflowScript, globalVar, and sun.reflect.GeneratedSerializationConstructorAccessor1234 get unloaded pretty quickly, pkg.NamedType and pkg.NamedType$method never do. So far I am at a loss to explain why.


          Jesse Glick added a comment -

          Correction: the library classes do eventually get unloaded, it just takes a while. I suspect there is a SoftReference to blame somewhere, but I have yet to find where.


          Jesse Glick added a comment -

          Finally tracked it down, to ClassHelperCache, and managed to fully clear all classes immediately upon end of the build.
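          (As a rough way to check whether this cache is what is pinning library classes on a given master, something like the script-console snippet below can be used. It is only a diagnostic, not the fix; the class and field names are taken from the Groovy 2.4 sources and may differ in other versions.)

          // Diagnostic sketch: count entries in Groovy's ClassHelperCache via reflection.
          def cacheClass = Class.forName('org.codehaus.groovy.ast.ClassHelper$ClassHelperCache')
          def field = cacheClass.getDeclaredField('classCache')
          field.setAccessible(true)
          def cache = field.get(null)  // a ManagedConcurrentMap keyed by Class
          println "ClassHelperCache entries: ${cache.size()}"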


          Jesse Glick added a comment -

          For future reference, my monitoring script for the live version:

          while :; do jmap -histo:live $(jps -lm | fgrep war | cut -f1 -d' ') | fgrep GroovyClassLoader > /tmp/x; clear; cat /tmp/x; sleep 5; done
          


          Devin Nusbaum added a comment -

          A fix for this issue was just released in version 2.69 of Pipeline Groovy Plugin.


          Tatu Saloranta added a comment -

          Excellent! I think we managed to trigger this problem as well, via some seemingly unrelated refactoring, resulting in a maximum uptime of 4-5 days for some masters.

          For what it is worth, the Jenkins Metrics plug-in exposes two useful metrics, `vm.class.loaded` and `vm.class.unloaded`, the difference of which is the number of classes currently retained. So we are graphing that to see class retention over time, which showed a steady increase without the pronounced sawtooth pattern that normally occurs.
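          (The same counters can also be read directly from the JVM in the script console, independent of the Metrics plug-in; a minimal sketch:)

          // Read the JVM's class-loading counters (what vm.class.loaded / vm.class.unloaded are derived from).
          import java.lang.management.ManagementFactory

          def cl = ManagementFactory.classLoadingMXBean
          println "loaded total:   ${cl.totalLoadedClassCount}"
          println "unloaded total: ${cl.unloadedClassCount}"
          println "retained now:   ${cl.loadedClassCount}"  // loaded minus unloaded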


          Tatu Saloranta added a comment (edited) -

          Alas, it is actually not certain that version 2.69 fixes this problem for good. The problem came back (we had initially reverted the code that we deemed must have triggered the retention), and upgrading the Pipeline Groovy plug-in to 2.70 does not seem to have resolved the steady increase (although, since it takes over a week for classes to accumulate, it is difficult to be sure only 2 days after a restart). If confirmed, it's back to Ye Olde Drawing Board to solve the problem.


            Assignee: Jesse Glick (jglick)
            Reporter: Mike Kozell (mkozell)
            Votes: 1
            Watchers: 13
