Jenkins / JENKINS-69352

Multiple concurrent builds cause severe memory spikes

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Component: core
    • Labels: None
    • Environment: Jenkins 2.346.3, Linux 5.4.0-58-generic

      Since the update from 2.346.2 to 2.346.3, multiple builds running in parallel (either concurrent builds of the same pipeline or builds of different pipelines) cause memory spikes that far exceed the configured JVM heap size, which leads to Kubernetes killing the Jenkins pod (OOMKilled).

      Setup:

      Kubernetes container with memory request and limit of 8 GiB

      Jenkins JVM with -Xmx4g
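
      For context, the container limit applies to the whole Jenkins process, not just the Java heap, so metaspace, the JIT code cache, thread stacks and other native allocations all count against the 8 GiB. A minimal sketch of how the non-heap share could be inspected, assuming Native Memory Tracking is enabled on the controller JVM ($PID is a placeholder for the Jenkins java process id; JAVA_OPTS is the environment variable the official jenkins/jenkins image passes to the JVM):

      # enable Native Memory Tracking on the controller (adds a small overhead)
      JAVA_OPTS="-Xmx4g -XX:NativeMemoryTracking=summary"

      # inside the running container, break committed memory down by category
      jcmd $PID VM.native_memory summary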

       

      Steps to reproduce:

      • Upgrade to 2.346.3.
      • Run multiple builds in parallel (they should take a few minutes each).
      • In the Jenkins pod, the memory of the Jenkins process spikes within a couple of seconds (see the sketch after this list).
      • As soon as the memory exceeds the memory limit of the container, Kubernetes kills the pod. We tested the same scenario with a memory limit of 32 GiB and -Xmx4g, which only makes the issue take longer to occur.
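
      For illustration, the spike can be watched from outside and inside the pod (a sketch; <jenkins-pod> is a placeholder, and kubectl top requires metrics-server):

      # pod-level memory usage as seen by Kubernetes
      kubectl top pod <jenkins-pod>

      # resident set size of the controller JVM inside the pod
      kubectl exec <jenkins-pod> -- ps -o pid,rss,vsz,cmd -C java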

      Downgrading to 2.346.2 solved the issue.

       

      Could it be a plugin?

      The exact same set of plugins works with 2.346.2.

          [JENKINS-69352] Multiple concurrent builds cause severe memory spikes

          Basil Crow added a comment -

          This bug report is light on steps to reproduce and details. Which JVM is running out of memory, the controller JVM or an agent JVM? If the controller JVM, does the same issue persist with the 2.346.2 Docker image but the 2.346.3 jenkins.war file? (That would tell us whether the regression is in the Java code or in the environment delivered in the Docker image, such as the OS version, Java version, etc.) And finally, have you done any analysis of the heap dump to report what is using up the heap?


          Basil Crow added a comment -

          I am aware of a metaspace (not heap!) leak in workflow-cps versions prior to 2705.v0449852ee36f that only manifests itself on OpenJDK 11.0.16 or later. The 2.346.2 Docker image ships with OpenJDK 11.0.15 while the 2.346.3 Docker image ships with OpenJDK 11.0.16, so that could be your problem. If it is, upgrade workflow-cps to the latest version on 2.346.3. You should always upgrade your plugins after upgrading Jenkins core anyway.
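
          A quick way to check both parts of that hypothesis (a sketch; <jenkins-pod> is a placeholder, and the CLI example assumes jenkins-cli.jar and suitable credentials are available):

          # JVM version shipped in the running controller image
          kubectl exec <jenkins-pod> -- java -version

          # installed workflow-cps version via the Jenkins CLI
          java -jar jenkins-cli.jar -s http://localhost:8080/ list-plugins | grep workflow-cps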


          Georg Blumenschein added a comment -

          Jumping in for dotstone

          Maybe the description is unclear on that point: the memory spikes should not come from the heap, because the heap is fixed at 4 GiB as stated in the description; the problem is with Jenkins itself, not with the agents.

           

          basil all plugins are kept at their latest versions, so workflow-cps is recent enough as well, but thanks for the suggestion.

           

          One thing we noticed: running the same configuration (same Jenkins image, same requests/limits/-Xmx) on a smaller AWS node size (t3.xlarge) seems to make the problem go away, but I am not sure what that implies.

          Jenkins: 2.346.3
          OS: Linux - 5.4.196-108.356.amzn2.x86_64 

           

           


          Basil Crow added a comment -

          the problem is with Jenkins itself, not with the agents

          Then my previous question remains unanswered:

          If the controller JVM, does the same issue persist with the 2.346.2 Docker image but the 2.346.3 jenkins.war file? (That would tell us whether the regression is in the Java code or in the environment delivered in the Docker image, such as the OS version, Java version, etc.)

          This question also remains unanswered:

          And finally, have you done any analysis of the heap dump to report what is using up the heap?


          Basil Crow added a comment -

          OpenJDK 11.0.16, which we started shipping in the 2.346.3 Docker image, supports cgroups v2. That might impact the memory usage characteristics of the process as described in https://developers.redhat.com/articles/2022/04/19/java-17-whats-new-openjdks-container-awareness#opinionated_configuration.
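
          A quick check of which hierarchy is in use (a sketch; the paths are the standard v1/v2 locations):

          # cgroup2fs means the unified cgroups v2 hierarchy, tmpfs means cgroups v1
          stat -fc %T /sys/fs/cgroup/

          # effective memory limit as the JVM sees it
          cat /sys/fs/cgroup/memory.max                    # cgroups v2
          cat /sys/fs/cgroup/memory/memory.limit_in_bytes  # cgroups v1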


          Basil Crow added a comment -

          Can you try running with -Xlog:os+container=trace and -Xlog:gc=info and compare the os/container and GC logs between the working and failing version? This information would also be helpful (again, both on the working and failing version):

          java -XshowSettings:system -version
          java -Xlog:gc=info -version
          kubectl describe deployment
          jcmd $PID VM.info | grep -A13 '(cgroup)'
          

          Also, are you using cgroups v1, cgroups v2, or the unified hierarchy?

          In OpenJDK 11.0.16, the Kubernetes CPU/memory limits can now be observed by the JVM, which can impact things like the GC algorithm chosen by OpenJDK. Try focusing your investigation along these lines and see if you discover anything interesting.
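
          For reference, on the official jenkins/jenkins image the requested logging can be passed through the JAVA_OPTS environment variable (a sketch; the log file paths are only examples):

          JAVA_OPTS="-Xmx4g -Xlog:os+container=trace:file=/var/jenkins_home/container.log -Xlog:gc=info:file=/var/jenkins_home/gc.log"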


          Basil Crow added a comment -

          allan_burdajewicz informed me about JDK-8292260, which led me to file jenkinsci/docker#1441. At the time of this writing, https://adoptium.net contains the following notice:

          18th of August 2022: Upstream OpenJDK has released fixes to resolve a regression in some of the July releases. We are working to release 11.0.16.1+1 and 17.0.4.1+1. As usual, you can continue to track progress by platform or by detailed release checklist.

          At the time of this writing the 11.0.16.1+1 and 17.0.4.1+1 build tarballs have been published, but the Docker images have not. I think as soon as the Docker images are published we will want to kick off Dependabot updating and backporting.

          Observations:

          • We absolutely want this to be fixed in the next weekly release.
          • I do not think we want to ship 2.361.1 until this is fixed.
          • We may possibly want to issue a new 2.346.x release with this fix as well.

          If you think you are being impacted by JDK-8292260 and need help proving or disproving that theory, please send me your hs_err_pid crash file and I can assist with your investigation and try to come up with a workaround.

          CC dduportal markewaite timja
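
          If the C2 JIT regression is the suspect, a crash normally leaves an hs_err_pid<pid>.log in the JVM's working directory, and the two HotSpot flags below can help with the investigation (a sketch, suggested as diagnostics only, not as a fix):

          # write any HotSpot crash report to a known location on the Jenkins home volume
          -XX:ErrorFile=/var/jenkins_home/hs_err_pid%p.log

          # temporary experiment only: stop tiered compilation before C2 kicks in
          # (noticeably slower, but useful to see whether the memory spikes disappear)
          -XX:TieredStopAtLevel=1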


          Basil Crow added a comment -

          Although we never got proof (in the form of an hs_err log) that this was the C2 JIT issue, I think it is quite likely that it was and this has been fixed in 2.346.3-2. If this issue occurs again, please open a new ticket.


            Assignee: Unassigned
            Reporter: Dominik Steinbinder (dotstone)
            Votes: 3
            Watchers: 8