• Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Critical
    • core
    • docker jenkins/jenkins:2.346.1-lts-jdk11 on CentOS 7

      I run Jenkins with hundreds of nodes and jobs, and monitor metaspace and heap with
      jcmd GC.heap_info.
      Used metaspace keeps growing while used heap stays stable (after GC); see the attached picture.

      My java options:
      -Xmx400g -Xms200g -XX:MetaspaceSize=8192m -XX:MaxMetaspaceSize=8192m -XX:+ExplicitGCInvokesConcurrent -XX:+UseG1GC

      Jenkins eventually ran out of metaspace and became unresponsive:
      2022-07-15 16:36:52.921  [Pipeline] End of Pipeline
      2022-07-15 16:36:52.967  java.lang.OutOfMemoryError: Metaspace
      2022-07-15 16:36:52.973  Finished: FAILURE

      Top class_stats after 10 hours: used metaspace increased by around 22%; see the attached picture.

      The metaspace issue suddenly disappeared on 7/19.

        1. class_stats.png (139 kB)
        2. metaspace.png (389 kB)
        3. metaspace-issue-gone.png (310 kB)
        4. Snipaste_2022-08-15_17-02-34.png (91 kB)

          [JENKINS-69022] Metaspace leak on Jenkins 2.346.1/Java11

          Mark Waite added a comment -

          loblab, could you review the comments and discussion in JENKINS-63766 to see if the condition you're seeing is related to a metaspace memory leak bug in JDK 11? A workaround was implemented in Jenkins 2.346.1, and I believe that a fix is expected in Java 11.0.16.


          Jason Gao added a comment -

          markewaite thanks. I saw that post before and did reproduce that issue on 2.332.3, but it does not occur on 2.346.1 (both on JDK 11).
          BTW, my Jenkins also crashed before the upgrade to 2.346 (from 2.332); however, I did not dig into the issue on 2.332.
          Yes, the Java version of my 2.346.1 is openjdk 11.0.15 2022-04-19.


          Jason Gao added a comment -

          I also attached a comparison of the top class_stats entries after 10 hours. Used metaspace increased by around 22% during that period; see the attached picture.
          Is there any clue?


          Jason Gao added a comment -

          The metaspace issue suddenly disappeared on 7/19.

          In the GC log:

          [2022-07-19T11:15:21.132+0800][325814.684s][info ][gc,metaspace      ] GC(2511) Metaspace: 6449217K->6449217K(7002112K)
          [2022-07-19T11:15:21.132+0800][325814.684s][debug][gc,heap           ] GC(2511)  Metaspace       used 6449217K, capacity 6887476K, committed 6949664K, reserved 7002112K
          [2022-07-19T11:15:23.571+0800][325817.124s][debug][gc,phases,start   ] GC(2512) Purge Metaspace
          [2022-07-19T11:15:25.790+0800][325819.343s][debug][gc,phases         ] GC(2512) Purge Metaspace 2219.016ms
          [2022-07-19T11:15:49.778+0800][325843.330s][debug][gc,heap           ] GC(2513)  Metaspace       used 962031K, capacity 1043251K, committed 5239904K, reserved 5269504K
          [2022-07-19T11:15:49.948+0800][325843.501s][info ][gc,metaspace      ] GC(2513) Metaspace: 962031K->962031K(5269504K)
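
For anyone who wants to track this over a longer log, here is a minimal Python sketch that pulls the "Metaspace used ..." figures out of gc,heap lines like the ones above and reports how much was freed between the first and last sample. The regular expression is an assumption based only on the line format shown in this comment; other JVM versions or log decorations may differ. The names parse_metaspace and METASPACE_RE are illustrative.

```python
import re

# Assumed format: the gc,heap debug lines quoted in this comment.
# The "Metaspace: A->B(C)" info lines deliberately do not match.
METASPACE_RE = re.compile(
    r"GC\((?P<gc>\d+)\)\s+Metaspace\s+used (?P<used>\d+)K, "
    r"capacity (?P<capacity>\d+)K, committed (?P<committed>\d+)K, "
    r"reserved (?P<reserved>\d+)K"
)


def parse_metaspace(lines):
    """Return one dict of integer fields (in KB) per matching log line."""
    samples = []
    for line in lines:
        match = METASPACE_RE.search(line)
        if match:
            samples.append({k: int(v) for k, v in match.groupdict().items()})
    return samples


# Three of the lines from the log excerpt above.
LOG = [
    "[2022-07-19T11:15:21.132+0800][325814.684s][debug][gc,heap           ] GC(2511)  Metaspace       used 6449217K, capacity 6887476K, committed 6949664K, reserved 7002112K",
    "[2022-07-19T11:15:23.571+0800][325817.124s][debug][gc,phases,start   ] GC(2512) Purge Metaspace",
    "[2022-07-19T11:15:49.778+0800][325843.330s][debug][gc,heap           ] GC(2513)  Metaspace       used 962031K, capacity 1043251K, committed 5239904K, reserved 5269504K",
]

samples = parse_metaspace(LOG)
before, after = samples[0], samples[-1]
freed_kb = before["used"] - after["used"]
print(
    f"GC({before['gc']}) -> GC({after['gc']}): freed {freed_kb}K "
    f"({100 * freed_kb / before['used']:.1f}% of used metaspace)"
)
```

Feeding the whole GC log through parse_metaspace and plotting the "used" field over time would give the same curve as the attached screenshots, without needing to poll jcmd.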


          Basil Crow added a comment -

          Hi loblab, that is quite a large heap size you have there! I do not think I have ever seen anything like it for a Jenkins controller. Some more background information about your use case could help us contextualize your deployment pattern.

          The July 2022 releases of OpenJDK 11.0.16 and 17.0.4 just came out. I have no reason to believe they would help you, but I do not think they could possibly hurt you, either.

          Without steps to reproduce this problem from scratch, we cannot do much investigation on our side. But we can try to point you in the right direction to investigate this problem. When running with -verbose:class you should see that every class that is loaded is also unloaded (as in JENKINS-63766 (comment)). If there is a class that is getting loaded but not unloaded, this would be an indication that we need to study which class is not being unloaded and why. In my experience, a frequent cause of metaspace problems is when class loaders fail to let go of a loaded class, which was the case in JENKINS-63766. But I see no evidence (yet) that this is occurring in your case.
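
The loaded-versus-unloaded check described above could be sketched in Python roughly as follows. This is only a sketch: it assumes the unified-logging line shapes "[class,load] <name> source: ..." and "[class,unload] unloading class <name> <address>", which should be verified against your own -verbose:class output. The sample lines and class names (Script1, Script2) are hypothetical, not taken from this issue.

```python
import re
from collections import Counter

# Assumed line shapes for -verbose:class output on JDK 11; check your own log.
LOAD_RE = re.compile(r"\[class,load\s*\]\s+(\S+)")
UNLOAD_RE = re.compile(r"\[class,unload\s*\]\s+unloading class (\S+)")


def never_unloaded(lines):
    """Return {class name: loads minus unloads} for classes with a positive balance."""
    balance = Counter()
    for line in lines:
        unload = UNLOAD_RE.search(line)
        if unload:
            balance[unload.group(1)] -= 1
            continue
        load = LOAD_RE.search(line)
        if load:
            balance[load.group(1)] += 1
    return {name: count for name, count in balance.items() if count > 0}


# Hypothetical sample lines, for illustration only.
SAMPLE = [
    "[10.001s][info][class,load] Script1 source: file:/hypothetical/script1",
    "[10.002s][info][class,load] Script2 source: file:/hypothetical/script2",
    "[55.000s][info][class,unload] unloading class Script1 0x0000000800123456",
]

leftover = never_unloaded(SAMPLE)
print(leftover)
```

Classes loaded by the boot and platform loaders are never unloaded, so on a real log the result needs filtering (for example, to class names generated by pipeline scripts) before it says anything about a leak.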

          The metaspace has undergone significant changes in recent JVMs, and this YouTube video is worth watching to understand the architecture. See also these blog posts:

          https://stuefe.de/posts/metaspace/what-is-metaspace/
          https://stuefe.de/posts/metaspace/metaspace-architecture/
          https://stuefe.de/posts/metaspace/what-is-compressed-class-space/
          https://stuefe.de/posts/metaspace/sizing-metaspace/
          https://stuefe.de/posts/metaspace/analyze-metaspace-with-jcmd/

          The last post describes how to use jcmd VM.metaspace for further analysis. As you can see, this is a complex topic that is deeply tied to your JVM configuration and usage patterns. I hope these resources assist you in your investigation.

          Jason Gao added a comment -

          Hi basil, thank you. Very good videos and posts.

          The issue has gone away on 2.346.1. I upgraded to 2.346.2 and did not see the issue there either.

          See the newly uploaded screenshot of metaspace usage over the last 3 weeks (zero means a reboot). My system has been stable recently.

          Maybe some code in my jobs caused the metaspace leak, and that code was modified later...

           


          Mark Waite added a comment -

          Closing as the issue is resolved as reported by loblab


            Assignee: Unassigned
            Reporter: loblab (Jason Gao)