Jenkins / JENKINS-50223

Pipeline Jenkins master ran out of Java Metaspace

    • workflow-cps 2.69

      Setup

      Jenkins v2.89.3 / v2.89.4 LTS
      Pipeline API Plugin: 2.26
      Pipeline Nodes and Processes Plugin: 2.19
      Durable Task Plugin: 1.18
      Pipeline Job Plugin: 2.17
      Pipeline Shared Groovy Libraries Plugin: 2.9
      Pipeline Supporting APIs Plugin: 2.18
      Pipeline Groovy Plugin: 2.45
      Script Security Plugin: 1.41

      Pipeline Default Speed/Durability Level: Performance-Optimized
      "Do not allow the pipeline to resume if the master restarts": Enabled on all jobs

      Problem

      Shortly after upgrading from Jenkins 2.60.3 to Jenkins 2.89.3 LTS, we experienced java.lang.OutOfMemoryError: Metaspace errors. I'm unsure if this is a bug or if increased Java Metaspace usage is to be expected in the newer versions of Jenkins and the pipeline plugins.

      Details

      Jenkins v2.60.3

      In Jenkins 2.60.3 we used the memory parameters below, and the Java nonheap space (Metaspace) stayed between 350 MB and 550 MB. We ran for months without restarting Jenkins.

      -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:MaxMetaspaceSize=1g -XX:MetaspaceSize=256M -Xms4g -Xmx8g -Dgroovy.use.classvalue=true

      Jenkins v2.89.3

      After upgrading to Jenkins 2.89.3 and upgrading all of the plugins, we used the Java memory settings below. We removed the "groovy.use.classvalue=true" parameter based on the recommendation here: https://support.cloudbees.com/hc/en-us/articles/115001505172-Jenkins-hangs-due-to-Groovy-ClassInfo-deadlock. We saw the Java nonheap memory fluctuate between 250 MB and 750 MB, slowly trending upwards. After a few weeks, nonheap memory was at 1 GB and Jenkins no longer responded to requests. The log file was full of java.lang.OutOfMemoryError: Metaspace errors.

      -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:MaxMetaspaceSize=1g -XX:MetaspaceSize=256M -Xms4g -Xmx8g

      Debugging

      Our Jenkins masters run a large number of Jenkins pipeline jobs, and I noticed that all of our Jenkins 2.89.4 masters show higher Java nonheap memory usage than before the Jenkins upgrade. We disabled pipeline durability but still see high memory usage. Standard Java heap memory usage looks fine.
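      To track this over time, the numbers can also be read from the Jenkins Script Console using standard JMX beans; a minimal sketch (the pool-name match is an assumption about the JVM's pool naming):

      // Script Console sketch: print nonheap and Metaspace usage via standard JMX beans
      import java.lang.management.ManagementFactory

      def mb = { it / (1024 * 1024) }
      println "Nonheap used: ${mb(ManagementFactory.memoryMXBean.nonHeapMemoryUsage.used)} MB"

      // Per-pool breakdown; the Metaspace pool is the one that runs out here
      ManagementFactory.memoryPoolMXBeans
          .findAll { it.name.contains('Metaspace') }
          .each { pool ->
              println "${pool.name}: used=${mb(pool.usage.used)} MB, max=${mb(pool.usage.max)} MB"
          }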

      Example 1:

      Jenkins Master 1 was completely idle, and I performed a System.gc() to force a full garbage collection. Afterwards the nonheap memory was still over 700 MB. Below are some of the Java details I collected on that installation just after the full GC.
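      For reference, a summary like the one below is typically collected with jcmd and requires the JVM to be started with -XX:NativeMemoryTracking=summary (an assumption about how this data was gathered):

      jcmd <jenkins-pid> VM.native_memory summary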

      VM.native_memory summary

      Native Memory Tracking:

      Total: reserved=11014208KB, committed=5551020KB
      Java Heap (reserved=8388608KB, committed=4194304KB)
      (mmap: reserved=8388608KB, committed=4194304KB)

      Class (reserved=1686933KB, committed=707989KB)
      (classes #48326)
      (malloc=13717KB #208203)
      (mmap: reserved=1673216KB, committed=694272KB)

      Thread (reserved=93951KB, committed=93951KB)
      (thread #92)
      (stack: reserved=93548KB, committed=93548KB)
      (malloc=296KB #463)
      (arena=107KB #182)

      Code (reserved=273528KB, committed=155620KB)
      (malloc=23928KB #24978)
      (mmap: reserved=249600KB, committed=131692KB)

      GC (reserved=412351KB, committed=256703KB)
      (malloc=68287KB #693569)
      (mmap: reserved=344064KB, committed=188416KB)

      Compiler (reserved=406KB, committed=406KB)
      (malloc=276KB #2009)
      (arena=131KB #3)

      Internal (reserved=88791KB, committed=88791KB)
      (malloc=88759KB #184270)
      (mmap: reserved=32KB, committed=32KB)

      Symbol (reserved=30516KB, committed=30516KB)
      (malloc=27279KB #301740)
      (arena=3236KB #1)

      Native Memory Tracking (reserved=22549KB, committed=22549KB)
      (malloc=348KB #5361)
      (tracking overhead=22201KB)

      Arena Chunk (reserved=190KB, committed=190KB)
      (malloc=190KB)

      Unknown (reserved=16384KB, committed=0KB)
      (mmap: reserved=16384KB, committed=0KB)

      GC.class_histogram

      num #instances #bytes class name
      ----------------------------------------------
      1: 490462 84832616 [C
      2: 2552835 40845360 com.cloudbees.groovy.cps.impl.ConstantBlock
      3: 930699 37227960 com.cloudbees.groovy.cps.impl.FunctionCallBlock
      4: 1493734 35849616 com.cloudbees.groovy.cps.impl.SourceLocation
      5: 883507 33258176 [Ljava.lang.Object;
      6: 179552 30097544 [B
      7: 922229 29511328 java.util.HashMap$Node
      8: 1151159 27386104 [Lcom.cloudbees.groovy.cps.Block;
      9: 947492 22739808 java.lang.String
      10: 790957 18982968 com.cloudbees.groovy.cps.impl.LocalVariableBlock
      11: 213822 13097984 [Ljava.util.HashMap$Node;
      12: 519301 12463224 com.cloudbees.groovy.cps.impl.SequenceBlock
      13: 452808 10867392 java.util.ArrayList
      14: 320616 10259712 com.cloudbees.groovy.cps.impl.PropertyAccessBlock
      15: 250810 10032400 com.google.common.cache.LocalCache$WeakEntry
      16: 168578 9440368 org.codehaus.groovy.runtime.metaclass.MetaMethodIndex$Entry
      17: 260734 8343488 java.util.concurrent.locks.ReentrantLock$NonfairSync
      18: 250147 7394208 [Lhudson.model.Action;
      19: 142590 6844320 java.util.HashMap
      20: 139363 6689424 org.jenkinsci.plugins.workflow.cps.nodes.StepAtomNode
      21: 264178 6340272 com.google.common.collect.SingletonImmutableList
      22: 195095 6243040 com.cloudbees.groovy.cps.impl.AssignmentBlock
      23: 49477 6237064 java.lang.Class
      24: 253041 6072984 java.util.concurrent.CopyOnWriteArrayList
      25: 250681 6016344 org.jenkinsci.plugins.workflow.support.storage.BulkFlowNodeStorage$Tag

      GC.class_stats

      We use the Pipeline Shared Groovy Libraries Plugin and noticed a large number of classes with names similar to those below. Each of our shared library classes (org.xxxxxxxx.scripts.xxxxxxx) had about 400 references.

      script15202683643871464209649
      sun.reflect.GeneratedMethodAccessor2903
      org.xxxxxxxx.scripts.xxxxxxx
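      The script152... names above are classes generated per Pipeline build; if they are never unloaded, Metaspace grows. One place such classes can be pinned is Groovy's global ClassInfo cache (see the AbstractConcurrentMapBase$Segment[] references in Example 3 below). A Script Console sketch to gauge its size; the 'script1' prefix filter is an assumption based on the generated names above:

      // Script Console sketch: inspect Groovy's ClassInfo cache, which can retain class metadata
      import org.codehaus.groovy.reflection.ClassInfo

      def infos = ClassInfo.allClassInfo
      println "Total ClassInfo entries: ${infos.size()}"

      // Entries for generated Pipeline script classes ('script1' prefix is an assumption)
      println "Pipeline script entries: ${infos.count { it.theClass.name.startsWith('script1') }}"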

      Example 2

      I ran a Java heap dump through an analyzer program; below are the top consumers it reported.

      Class Name Objects Shallow Heap Retained Heap
      org.jenkinsci.plugins.workflow.support.actions.LogActionImpl 976,720 23,441,280 >= 343,262,256
      org.jenkinsci.plugins.workflow.cps.actions.ArgumentsActionImpl 735,961 17,663,064 >= 305,348,632

      Example 3

      I ran a different Java heap dump through the analyzer and below are the results. Again this was after a full GC.

      Classes = 35k
      Objects = 47M
      Class loaders = 4.6k
      GC roots = 4.3k

      952,652 instances of "org.jenkinsci.plugins.workflow.support.actions.LogActionImpl", loaded by "hudson.ClassicPluginStrategy$AntClassLoader2 @ 0x5c1746598" occupy 342,684,616 (20.83%) bytes. These instances are referenced from one instance of "org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment[]", loaded by "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x5c0000000"

      717,629 instances of "org.jenkinsci.plugins.workflow.cps.actions.ArgumentsActionImpl", loaded by "hudson.ClassicPluginStrategy$AntClassLoader2 @ 0x5c17d9770" occupy 296,727,528 (18.04%) bytes. These instances are referenced from one instance of "org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment[]", loaded by "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x5c0000000"

      293 instances of "org.jenkinsci.plugins.workflow.cps.CpsFlowExecution", loaded by "hudson.ClassicPluginStrategy$AntClassLoader2 @ 0x5c17d9770" occupy 210,909,800 (12.82%) bytes. These instances are referenced from one instance of "org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment[]", loaded by "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x5c0000000"

      Package Retained Heap Retained Heap, % Top Dominators
      workflow 1,340,511,168 81.50% 5,043,559

      Workaround

      We once again added "-Dgroovy.use.classvalue=true" to the Java startup parameters and have noticed a reduction in Java nonheap memory usage. With the parameter, we see the memory increase to around 750 MB, but when the server becomes idle the Java nonheap memory drops to under 300 MB. Without the parameter we don't see the same reduction. One of our Jenkins masters that doesn't have the parameter started on 3/2/18 at 0.2 GB and is currently (3/16/18) at 0.76 GB; the graph shows a slow climb without much fluctuation. According to the CloudBees article, this parameter is not recommended for Jenkins 2.89, but it does seem to help lower Java nonheap memory usage. Since usage increases by over 200 MB a week, I should know shortly whether some type of garbage collection reduces it on this second Jenkins master, which is nearing the 1 GB max. Java nonheap memory usage dropped by only 1 MB when I manually performed a System.gc().
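      A quick Script Console check to confirm whether the flag is active on a given master (a sketch; note the caveat about System.gc() in the comments below):

      // Script Console sketch: confirm the classvalue flag and request a GC
      println "groovy.use.classvalue = ${System.getProperty('groovy.use.classvalue')}"
      System.gc()  // with -XX:+ExplicitGCInvokesConcurrent this is a concurrent cycle, not a full stop-the-world GC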


          Mike Kozell created issue -

          Sam Van Oort added a comment -

          mkozell Are these masters also experiencing https://issues.jenkins-ci.org/browse/JENKINS-50199 ? It is quite possible the issues there are preventing the Pipeline memory cleanup routines from running, if a failure occurs before the cleanup can run.

          org/jenkinsci/plugins/workflow/cps/CpsGroovyShell$CleanGroovyClassLoader is an internal ClassLoader which would be removed during the cleanup process.

          Recommendation: for now, switch back to the MAX_SURVIVABILITY durability level (slowest).
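          For a single job, that can be done from the Jenkinsfile itself, mirroring the setup sketch earlier but with the slowest level (a sketch, assuming the durabilityHint property symbol; the global default can also be changed under Manage Jenkins):

          // Sketch: pin one Pipeline job back to maximum durability
          properties([durabilityHint('MAX_SURVIVABILITY')])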


          Mike Kozell added a comment -

          Yes these are the same masters.  I will try to compare the memory usage from successful builds and build failures.  Then I will try changing the durability level.  We had another master run out of Metaspace today.


          Mike Kozell added a comment -

          I switched a test Jenkins master back to MAX_SURVIVABILITY and still see the memory leak when failed builds occur.  I'm going to test if the issue occurs on successful builds.


          Sam Van Oort added a comment -

          mkozell Several questions, then:

          1. Are you on the latest version of the Artifactory plugin, if you are using it? Older versions had a memory leak in its Pipeline step; newer versions include fixes that prevent the leak.
          2. Are you using ConfigSlurper or extending java.util.Level in your Pipelines? Both can cause memory leaks and are discouraged now.
          3. Do you have a fairly simple Pipeline that will reproduce the issue?

          My suspicion is that this represents an edge case perhaps linked to the failures above or a bad plugin interaction, because we have explicit tests for memory leaks in Pipeline.

          Finally... if you are by any chance a CloudBees customer, please open a support case with a support bundle & heap dump, because a heap dump + support bundle will give us the fastest route to a fix. While wearing my Open Source hat and not my CloudBees hat, there's not a good & secure way to transfer a heap dump.

          Either way I consider memory leaks to be critical issues needing an urgent fix, but it's exponentially easier to do that with the additional information in a heap dump.


          Mike Kozell added a comment -

          We are able to reproduce the memory leak with two examples that are not specific to our code.  In both examples and in our potential workaround, the Jenkins jobs were all successful and the output in the console log was the same.  The classes remained in the GC.class_stats query when using the examples and did not when using the workaround.

          Both examples use a shared library with an abstract Animal class and a concrete Lion class that extends Animal; a minimal sketch of these classes follows.
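          A minimal sketch of those classes, assuming only the members the examples below actually use (the noise field and the sound(steps, noise) method are inferred from the calls in runZoo.groovy; the real classes are redacted):

          // Sketch of the shared-library classes assumed by the examples below
          abstract class Animal implements Serializable {
              String noise
              void sound(steps, String noise) {
                  steps.echo("This animal says ${noise}")
              }
          }

          class Lion extends Animal {
          }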

          Example 1

          We create a custom pipeline step and pass in an instantiated subclass of the abstract class.  The issue occurs when we use "runIdentify(AbstractObject)".

          runZoo.groovy
          
          def call() {
              node {
                  def myLion = new Lion()
                  myLion.noise = "Roar"
                  myLion.sound(steps, myLion.noise)
                  runIdentify(myLion)
              }
          }
          
          
          runIdentify.groovy
          
          def call(Animal animal) {
              if (animal.noise.toLowerCase().contains('roar')) {
                  println('This must be a Lion')
              } else {
                  println('NOT a Lion')
              }
          }

           

          Example 2

          We instantiate a subclass of the abstract class and pass it into a constructor.  The issue occurs when we use "new Identity(AbstractObject)" to create the new object.

          runZoo.groovy
          
          def call() {
              node {
                  def myLion = new Lion()
                  myLion.noise = "Roar"
                  myLion.sound(steps,myLion.noise)
                  new Identity(steps,myLion)
              }
          }
          
          Identity.groovy
          
          class Identity implements Serializable {
              Identity(DSL steps, Animal animal) {
                  if (animal.noise.toLowerCase().contains('roar')) {
                      steps.echo('This must be a Lion')
                  } else {
                      steps.echo('NOT a Lion')
                  }
              }
          }
          

          Workaround

          We were able to work around the issue by adding a new method to the abstract or concrete class.

          runZoo.groovy
          
          def call() {
              node {
                  def myLion = new Lion()
                  myLion.noise = "Roar"
                  myLion.sound(steps,myLion.noise)
                  myLion.identifyAnimal(steps)
              }
          }
          
          // New method added to the Animal (or Lion) class:
          def identifyAnimal(DSL steps) {
              if (this.noise.toLowerCase().contains('roar')) {
                  steps.echo('This must be a Lion')
              }
          }
          

           

          I'm not sure if the memory leak is a bug or if I'm doing something incorrect with pipeline in the above examples. The output on the screen for all three scenarios is the same and the builds are successful.

          Mike Kozell made changes -
          Remote Link New: This issue links to "Define Custom Steps (Web Link)" [ 20313 ]

          James Howe added a comment -

          Just to note:

          "I performed a System.gc() to force a full garbage collection"

          If you have -XX:+ExplicitGCInvokesConcurrent then it doesn't do a "full" collection.
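          A genuine full collection for measurement purposes can be forced by temporarily removing that flag, or by capturing a live-object histogram, which triggers a full GC first (standard HotSpot behavior, not something verified in this issue):

          jmap -histo:live <jenkins-pid>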

          Jesse Glick made changes -
          Assignee New: Jesse Glick [ jglick ]
          Jesse Glick made changes -
          Status Original: Open [ 1 ] New: In Progress [ 3 ]
