Jenkins / JENKINS-54757

High CPU caused by dumping the classloader in CpsFlowExecution

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component: workflow-cps-plugin
    • Environment: Jenkins 2.144, Workflow CPS Plugin 2.60, Job DSL Plugin 1.70

      I have a pretty big system: 1000+ Jenkins items (mostly pipelines) and 50+ slaves at all times. I'm investigating some high CPU issues after the upgrade to 2.144 (I've upgraded the plugins as well, so I doubt it's really Jenkins core).

      What caught my eye is the following stacktrace (a few variations of it):

         java.lang.Thread.State: RUNNABLE
              at java.lang.String.valueOf(String.java:2994)
              at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.cleanUpGlobalClassValue(CpsFlowExecution.java:1345)
              at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.cleanUpLoader(CpsFlowExecution.java:1291)
              at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.cleanUpHeap(CpsFlowExecution.java:1265)
              at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:375)
              at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$200(CpsThreadGroup.java:83)
              at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:244)
              at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:232)
              at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:64)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
              at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
              at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      

       

      jvisualvm is reporting over 60% Self Time (CPU) in cleanUpGlobalClassValue, and it looks like the offender is CpsFlowExecution.java:1345:

       

      LOGGER.log(Level.FINEST, "ignoring {0} with loader {1}", new Object[] {klazz, /* do not hold from LogRecord */String.valueOf(encounteredLoader)});
      

      jvisualvm was also reporting 300k loaded classes at that time. I presume the encounteredLoader is huge.
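      For context, java.util.logging evaluates log parameters at the call site: the Object[] array, including the String.valueOf(encounteredLoader) call inside it, is built before log() ever checks whether FINEST is enabled, so that work happens on every invocation. A minimal, self-contained sketch of that behavior (class and variable names here are illustrative, not from the plugin):

      import java.util.logging.Level;
      import java.util.logging.Logger;

      public class EagerLogParamDemo {
          private static final Logger LOGGER = Logger.getLogger(EagerLogParamDemo.class.getName());

          public static void main(String[] args) {
              Object loader = new Object() {
                  @Override public String toString() {
                      System.out.println("toString() evaluated");  // prints even though FINEST is disabled
                      return "pretend-huge-classloader";
                  }
              };

              // The parameter array is constructed (and String.valueOf runs) before
              // Logger.log() checks the level, so the cost is paid on every call.
              LOGGER.log(Level.FINEST, "ignoring loader {0}", new Object[] {String.valueOf(loader)});

              // The Supplier-based overload defers message construction until the
              // level check passes, so toString() is never invoked here.
              LOGGER.log(Level.FINEST, () -> "ignoring loader " + loader);
          }
      }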

      I've made a patch for myself that removes the line; it will take me a few days to confirm whether this remedies the issue - the count of total loaded classes grows over time and it will take 2-3 days to get back to 300k. I still don't know if that growth is an issue by itself. Initial observations show that it's the Job DSL Plugin, which runs several times a day, that's responsible for the spikes in the metric.
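      A less invasive variant of the same patch would be to guard the call instead of deleting it, so that String.valueOf(encounteredLoader) only runs when FINEST logging is actually enabled. A sketch of that shape, reusing the names from the quoted line (not the actual upstream fix):

      if (LOGGER.isLoggable(Level.FINEST)) {
          // String.valueOf(encounteredLoader) is now only evaluated when FINEST is on
          LOGGER.log(Level.FINEST, "ignoring {0} with loader {1}",
                  new Object[] {klazz, /* do not hold from LogRecord */String.valueOf(encounteredLoader)});
      }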

        1. cpu-sampling.png
          cpu-sampling.png
          25 kB
        2. memory-sampling.png
          memory-sampling.png
          13 kB
        3. tdump-23-11.txt
          1.28 MB
        4. trend.png
          trend.png
          89 kB

          [JENKINS-54757] High CPU caused by dumping the classloader in CpsFlowExecution

          Ben Herfurth added a comment - edited

          Encountering the same issue here.

          Jenkins 2.138.3


          Grigor Lechev added a comment -

          I've attached a couple of screenshots from the JVM and a thread dump. This is after removing the log line at 1345. The CPU consumption is lower than before, but it's still pretty high.


          Grigor Lechev added a comment -

          Attaching a build trend screenshot. I'm reproducing this successfully with the following scenario:
          1) Create 2 jobs: one should execute content generation using the DSL plugin, the other should be an empty pipeline. Schedule both of them to run every minute.

          2) The empty pipeline will initially run for < 1 second. It will start regressing with time; in a couple of hours its execution time will exceed a minute.

          3) Stopping the generation job will stop the regression. As you can see in the screenshot, I stopped it at build #719.

          I've not confirmed whether the item count that the generation job creates matters in this case (currently, my seed generates over 1000 items).

          I'll try downgrading the DSL plugin to 1.69 as this was the previous known stable version for me and go from there.


          Grigor Lechev added a comment -

          Since my last comment:
          1. Downgrading the DSL plugin does not solve the issue.
          2. The volume of items that the DSL plugin generates does impact performance (more items -> faster regression per run).

          I've minimized the issue on my side by building a custom seed that generates only a fraction of the 1000 items, based on the changed files. Jenkins has been running 4 days straight now and I'm not seeing anything like the CPU numbers I had before. This is not a real solution, but it significantly increases the lifespan of the system.


          rives davy added a comment - edited

          We are suffering from a similar issue. We have a huge project with thousands of builds per day dispatched on 10 K8S slaves, and performance degrades abnormally over time, to the point that we need to schedule an automatic restart every night (which is another source of problems because of night jobs).

          Here is my analysis: on the master, we have threads matching every job running on the slaves, and those take a lot of CPU when they are supposed to be idle (the work is done on the slaves).
           

          > jvmtop.sh 7
          PID 7: /usr/share/jenkins/jenkins.war
          ARGS: --argumentsRealm.passwd.admin=******** --argumentsRealm.roles.adm[...]
          VMARGS: -Dhudson.spool-svn=true -Dpermissive-script-security.enabled=no_s[...]
          VM: Oracle Corporation OpenJDK 64-Bit Server VM 1.8.0_242
          UP: 165:19m  #THR: 333  #THRPEAK: 400  #THRCREATED: 3428879  USER: jenkins
          GC-Time: 22:41m  #GC-Runs: 109913  #TotalLoadedClasses: 428328
          CPU: 93.08%  GC: 2.59%  HEAP: 5942m /6353m  NONHEAP: 404m / n/a

            TID    NAME                               STATE      CPU  TOTALCPU BLOCKEDBY
          3420190  Running CpsFlowExecution[Owner     RUNNABLE 11.99%     0.01%
          3416563  Running CpsFlowExecution[Owner     RUNNABLE 11.54%     0.01%
          3401804  Running CpsFlowExecution[Owner     RUNNABLE 11.09%     0.00%
          3426542  Running CpsFlowExecution[Owner     RUNNABLE  9.49%     0.00%
          3425950  Running CpsFlowExecution[Owner     RUNNABLE  9.40%     0.00%
          3419231  Running CpsFlowExecution[Owner     RUNNABLE  9.27%     0.01%
          3425470  Running CpsFlowExecution[Owner     RUNNABLE  8.12%     0.00%
          3421926  Running CpsFlowExecution[Owner     RUNNABLE  7.67%     0.00%
          3419964  Running CpsFlowExecution[Owner     RUNNABLE  7.47%     0.00%
          3399448  Running CpsFlowExecution[Owner     RUNNABLE  7.41%     0.01%

          Note: Only top 10 threads (according cpu load) are shown!

           
           

          > jstack 7 | grep CpsFlow 
          ...
          "Running CpsFlowExecution[Owner[9.2.x-PROD/u_commons/9:9.2.x-PROD/u_commons #9]]" #3419232 daemon prio=5 os_prio=0 tid=0x00007fe10867c000 nid=0x5471 runnable [0x00007fe0e9d38000]
          "Jetty (winstone)-3419228" #3419228 prio=5 os_prio=0 tid=0x00007fe17c12f000 nid=0x546c runnable [0x00007fe0f46d6000]
          ...

           
          NOTE: we are using Jenkins 2.249.1 and the workflow-cps plugin in version 2.83.

          This behavior is really a big issue for big projects with intensive build usage.


            Assignee: Unassigned
            Reporter: Grigor Lechev (mooncrosser)
            Votes: 4
            Watchers: 5
