JENKINS-63975

Jenkins controller failing with java.lang.OutOfMemoryError: GC overhead limit exceeded

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: core
    • Environment: Jenkins 2.235.3, running in a Docker container

      Hi,

      For the last few weeks we have been facing an issue with the Jenkins master: it crashes every now and then with the exception

      java.lang.OutOfMemoryError: GC overhead limit exceeded

      Below is what we found in the logs; the attached file "log_trace.txt" contains the full exception stack trace.

      2020-10-14 15:02:48.604+0000 [id=13455] WARNING h.i.i.InstallUncaughtExceptionHandler#handleException
      org.apache.commons.jelly.JellyTagException: jar:file:/var/jenkins_home/war/WEB-INF/lib/jenkins-core-2.235.3.jar!/hudson/model/View/index.jelly:42:43: <st:include> org.apache.commons.jelly.JellyTagException: jar:file:/var/jenkins_home/war/WEB-INF/lib/jenkins-core-2.235.3.jar!/lib/hudson/projectViewRow.jelly:35:52: <st:include> java.lang.OutOfMemoryError: GC overhead limit exceeded
          at org.apache.commons.jelly.impl.TagScript.handleException(TagScript.java:726)
          at org.apache.commons.jelly.impl.TagScript.run(TagScript.java:281)
          at org.apache.commons.jelly.impl.ScriptBlock.run(ScriptBlock.java:95)
          at org.kohsuke.stapler.jelly.CallTagLibScript$1.run(CallTagLibScript.java:99)
      
      

      When we created a heap dump and class histogram, the entry below looks suspicious because of its size (instance count and total bytes; the full histogram is attached).

      3: 1007784 32249088 java.lang.StackTraceElement

      The GC roots from the dump file for the same class are also attached. We have observed the same trace in the dumpExportTable of all our agents.

      Memory parameters configured for Jenkins: "-Xmx4096m -XX:MaxPermSize=1024m"

      Is a memory leak causing the failures? During the same period our CPU usage also hits 100%; are the two related?

      Please note this Jenkins instance is running on a dedicated VM.

       

        Attachments:
        1. gc.log (5.04 MB)
        2. heap_dump_gc_root.PNG (171 kB)
        3. heap_histo.txt (1.31 MB)
        4. log_trace.txt (48 kB)
        5. node_dump_export_table.txt (1.05 MB)


          Oleg Nenashev added a comment -

          I would need a GC log to say for sure. Right now there is no evidence of a memory leak on the Jenkins controller; it might be legitimate system load.


          Reddysekhar Gaduputi added a comment -

          Hi oleg_nenashev,

          Attached the GC log.

          Reddysekhar Gaduputi added a comment - edited

          We have managed to fix this issue by tuning our Jenkins setup and pipelines. In case anyone faces the same, the points below might help:

          1) Run pipelines with the durability level set to Performance-Optimized; this decreases the write operations Jenkins does to save pipeline state (see the Jenkinsfile sketch after this list). On the other hand, with this option pipelines won't be able to resume after a master restart (which is fine in our case).

          2) Loosely couple the Jenkins nodes by configuring them to come online only when there is demand and to stay offline otherwise (we have observed many exceptions in the node logs, as attached).

          3) Restrict the log size of pipelines using the logfilesizechecker plugin; log size directly impacts memory (some of our pipelines were sometimes generating gigabytes of logs, which was abnormal, so we configured logfilesizechecker to abort the pipeline once the log exceeds a specific size).
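          For reference, a minimal Declarative Jenkinsfile sketch of point 1 using the durabilityHint option (the stage contents are placeholders; the same durability setting can also be applied globally in the Jenkins system configuration):

              // Opt this pipeline into the Performance-Optimized durability level so
              // Jenkins persists pipeline state to disk far less often. Trade-off:
              // the run cannot resume after a controller restart.
              pipeline {
                  agent any
                  options {
                      durabilityHint('PERFORMANCE_OPTIMIZED')
                  }
                  stages {
                      stage('Build') {
                          steps {
                              echo 'Build steps go here'
                          }
                      }
                  }
              }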


          Raihaan Shouhell added a comment - edited

          This doesn't seem like a bug AFAICT; it looks like the usage of Jenkins might have grown, which caused its memory usage to grow. The CPU usage is likely from constant GC cycles due to the low memory available.

          The large number of StackTraceElement instances seems to be coming from remoting.

          CC: jthompson, perhaps remoting should not capture stack traces by default; from my understanding they are purely for debugging purposes, so we could save quite a bit of memory here. WDYT?

          Sample implementation to disable traces: https://github.com/jenkinsci/remoting/pull/441
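          For illustration only, a small Groovy sketch of the general idea of gating stack-trace capture behind an opt-in flag (remoting itself is Java, and the class and property names below are made up for this example; they are not the actual remoting code or the change in the PR above):

              // Hypothetical sketch: only allocate the debugging stack trace when an
              // opt-in system property is set, so normal operation avoids the
              // StackTraceElement overhead seen in the heap histogram.
              class CommandCreationTrace {
                  // Made-up property name, for illustration only.
                  static final boolean CAPTURE =
                      Boolean.getBoolean('org.example.remoting.recordCommandCreation')

                  /** Returns a creation-site stack trace, or null when capture is disabled. */
                  static Throwable record() {
                      return CAPTURE ? new Exception('Command created here') : null
                  }
              }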


          Maybe you are right that it's not really a bug but rather higher usage of Jenkins itself.

          But please note that after the above-mentioned changes it works fine with the same resources (even with much higher load these days compared to earlier).

          I agree that remoting stack traces on the nodes are something that needs to be looked into.

           


            Assignee: Unassigned
            Reporter: Reddysekhar Gaduputi (rgaduput)
            Votes: 1
            Watchers: 4