
Jenkins OOM when agent nodes always keep running

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • workflow-api-plugin
    • None

      We use a Jenkins master-slave setup in our project to run 80 tasks every 15 minutes. Jenkins runs in a Kubernetes cluster and uses the Kubernetes plugin to provision dynamic agents.

      We want the agent nodes to keep running so they can process those tasks, so we set idleMinutes to 15 minutes. But this seems to make Jenkins run out of memory. Below is the memory info of Jenkins after running for 4 days. The young generation grows very fast. With no tasks running, it grows about 1 MB to 2 MB every second. With 80 tasks running at the same time, it grows about 6 GB in one minute and triggers 2 young GCs. A full GC happens about every 40 to 50 minutes, but each one releases very little memory. It seems clear that Jenkins is leaking memory.
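
      As a rough check on the figures above, the arithmetic can be sketched in Java. The class and method names are hypothetical, and 1 GB is assumed to be 1024 MB. Only the 6 GB per minute, the 2 young GCs, and the 1 MB to 2 MB per second figures come from the report.

```java
/** Back-of-the-envelope allocation rates from the figures quoted in the report. */
final class AllocationRates {
    // Assumption: binary units, 1 GB = 1024 MB.
    static final double MB_PER_GB = 1024.0;

    /** Average allocation rate in MB/s for a volume allocated over a time window. */
    static double mbPerSecond(double gigabytes, double seconds) {
        return gigabytes * MB_PER_GB / seconds;
    }

    /** Average volume in GB allocated per young collection in that window. */
    static double gbPerYoungGc(double gigabytes, int youngGcCount) {
        return gigabytes / youngGcCount;
    }

    public static void main(String[] args) {
        // 6 GB in one minute with 80 tasks running, as stated in the report.
        double busy = mbPerSecond(6.0, 60.0);
        System.out.printf("busy rate: %.1f MB/s%n", busy);
        System.out.printf("per young GC: %.1f GB%n", gbPerYoungGc(6.0, 2));
        // Idle rate is reported as 1 MB/s to 2 MB/s.
        System.out.printf("busy vs idle: %.0fx to %.0fx%n", busy / 2.0, busy / 1.0);
    }
}
```

      Under those assumptions the busy rate works out to about 102 MB per second, roughly 50 to 100 times the idle rate, with about 3 GB allocated per young collection. A high young-generation allocation rate does not by itself indicate a leak; the observation that full GCs reclaim very little is the stronger signal.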

      We took a heap dump of Jenkins. Below are the memory leak suspects. DelayBufferedOutputStream objects take more than 700 MB, and CpsFlowExecution objects take 100 MB. Both are referenced from an instance of java.util.HashMap$Node[]. We suspect that one node represents one slave node. If the slave node is not destroyed, the related objects cannot be released.
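
      The suspected retention pattern can be illustrated with a minimal Java sketch. This is not the workflow-api code: AgentBufferRegistry and all of its methods are hypothetical, and it only shows how a long-lived map entry per agent would keep per-build buffers reachable until the entry is removed.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustration only: a long-lived map keyed by agent name that keeps per-build
 * log buffers strongly reachable. Every name here is hypothetical.
 */
final class AgentBufferRegistry {
    private final Map<String, List<ByteArrayOutputStream>> buffersByAgent = new HashMap<>();

    /** Records the log buffer of one finished build against the agent that ran it. */
    void recordBuild(String agent, int logBytes) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        buffer.write(new byte[logBytes], 0, logBytes);
        buffersByAgent.computeIfAbsent(agent, k -> new ArrayList<>()).add(buffer);
    }

    /** Removing the agent's entry is what makes its buffers eligible for collection. */
    void removeAgent(String agent) {
        buffersByAgent.remove(agent);
    }

    /** Bytes still strongly reachable through the map. */
    long retainedBytes() {
        long total = 0;
        for (List<ByteArrayOutputStream> buffers : buffersByAgent.values()) {
            for (ByteArrayOutputStream b : buffers) {
                total += b.size();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        AgentBufferRegistry registry = new AgentBufferRegistry();
        // One agent that is never destroyed runs many builds.
        for (int build = 0; build < 100; build++) {
            registry.recordBuild("agent-1", 64 * 1024);
        }
        System.out.println("retained while agent lives: " + registry.retainedBytes());
        registry.removeAgent("agent-1");
        System.out.println("retained after agent removed: " + registry.retainedBytes());
    }
}
```

      In this sketch the retained size grows with every build for as long as the agent's entry stays in the map, which would match the behaviour described above if the real references follow a similar shape. Whether they do would have to be confirmed from the heap dump's path to GC roots.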

      We also tried setting idleMinutes to 5 minutes, so that agent nodes are destroyed after completing their tasks. With that setting, Jenkins memory is normal.

      Does keeping Jenkins agent nodes running all the time cause memory leaks? Can anyone help with this? Thanks in advance.

          [JENKINS-70388] Jenkins OOM when agent nodes always keep running

          Vincent Latombe added a comment -

          Removing the kubernetes component as this looks only related to workflow plugin.

          You're running a pretty old version of Jenkins so I can't say whether it is a problem that has been addressed between your version and the current one.

          About using idleMinutes, it is in general discouraged because

          • kubernetes agents are in general pretty fast to provision (compared to the overall time it takes to complete a job)
          • they are meant to be one-off agents, with no possibility for builds to cause side-effects affecting other builds.

          shanshan added a comment -

          Hi Vincent,

          Thanks very much for your comments!

          We tried Jenkins version 2.375, but still have the same problem.

          About agents: if we destroy them after they complete their tasks, they are one-off agents.
          But what if idleMinutes is set to a long time so that the agents are not destroyed? That is the situation in which our problem happens, and it causes the Jenkins memory leak.
          From analyzing Jenkins memory, instances of org.jenkinsci.plugins.workflow.log.DelayBufferedOutputStream take too much memory and cannot be released by GC, because they are referenced from one instance of java.util.HashMap$Node[]. We suspect that one Node represents one agent. If the agent is not destroyed, the instances that it references cannot be released.

          Not sure if the above analysis is correct. Any comments on this?


            Unassigned
            fushsh shanshan
            Votes: 0
            Watchers: 4