
[JENKINS-38741] Job DSL execution is occasionally slow on large scale production servers

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component: job-dsl-plugin
    • Labels: None
    • Environment: Jenkins 1.642.4, Job DSL plugin v1.42

      We have several Jenkins jobs which do nothing more than execute a single build step that runs Job DSL code. Some of these scripts are responsible for dynamically generating hundreds of new jobs. Sometimes these scripts run through to completion quickly (i.e. 60 seconds or less), but other times they take extremely long to complete (i.e. 20-30 minutes or more).

      Based on some crude investigation on my part, trolling through build logs and poking around in the source code for the Job DSL plugin, what I think is happening is that after the Job DSL gets processed, a list of newly generated and/or modified job configs is stored in memory, and the Jenkins master is then 'scheduled' to run each of the individual job operations whenever it can. In these extreme situations it appears to take a very long time for all of these individual job operations (of which there are hundreds) to get scheduled and run. For example, when looking at the timestamps of specific log messages on the master, we see that as much as 60 seconds can pass between when one job update operation completes and the next one begins, with other unrelated output from the master in between.

      Why this happens exactly, I am not sure. The conditions causing the slowdown are difficult at best to reproduce. I can say that in our production environment the Jenkins master runs on a virtual machine with considerable resources assigned to it, and we don't have any executors configured on the master so any load it may see is 100% associated with the master processes and not some unpredictable build operation.

      If anyone can shed some light on why this behaves as it does, and whether there's anything we can do on our end to try and mitigate the issue, that would be awesome. It would also be nice to see some sort of implementation change in the plugin itself to ensure that the performance of the generation process is kept consistent between runs (e.g. if all job configuration updates could be scheduled as a single larger operation instead of many smaller ones).


          Kevin Phillips added a comment -

          I'm not sure whether this is relevant, but I have noticed that some of the jobs generated by our scripts are being detected as 'changed' even when there haven't been any changes to their respective DSL code. As a side effect, immediately after running our DSL scripts, some of the generated jobs show notification text indicating that their job configuration has been changed since they were created, even though no manual modifications have been made to them.

          I'm guessing that somewhere in our DSL scripts the XML being generated is slightly different from what the Jenkins service expects, and that Jenkins performs some sort of pre-processing on the XML when it gets stored to disk, causing a discrepancy between what is running in production and what is generated by the scripts.

          I suspect that one reason the DSL generation is occasionally slow is the extra overhead on the Jenkins master of performing these pre-processing operations; if there were some way for us to avoid these false positives, it might reduce the overhead on the master and avoid whatever bottleneck we are hitting.

          Christopher Shannon added a comment -

          I can confirm that running a Job DSL script causes significant load on the master. Our Jenkins instance was previously running on my laptop, and my PC would become almost unusable while a Job DSL script was running.

          We have since moved to a real server and considerably increased the max heap size that Jenkins can consume, so it is no longer an issue for us.

          I'd be curious to learn from this thread what exactly it is doing, but I can at least confirm that there is a very noticeable load on the master while a script is being processed.

          Vlad Aginsky added a comment -

          We observe this as well, on Jenkins ver. 1.617.


          Daniel Spilker added a comment -

          The "Process Job DSLs" build step always runs on the master, even if the job is assigned to an executor on an agent (aka slave). The Job DSL build step needs to modify the internal memory structures in Jenkins, so running the step on an agent would require a lot of remote communication, which would make the step very slow and cause load on both master and agent. When running on an agent, the seed job's workspace resides on the agent machine; since the script runs on the master, it needs to load all scripts and files through the remote channel, which does not make things faster. So in general it is discouraged to run the seed job on an agent.

          There are several reasons why the performance may be slow sometimes. One is memory pressure and garbage collection. There are several tools to monitor a Java process; for example, you could try the Monitoring Plugin. If your Jenkins master is short of memory and/or doing a lot of GC, your system will be slow. You need to find out if that's the case.

          Another reason why the seed job may be slow is that a lot of lock contention is happening in the Jenkins master. The Job DSL build step modifies internal structures in Jenkins (job config, view config, etc.). To ensure that the internal data is consistent, Jenkins uses locks to prevent multiple threads from modifying data concurrently. If your Jenkins instance is under load, e.g. many jobs are running, then a lot of internal data structures are modified, so it may happen that a thread (e.g. the executor running the seed job) must wait to acquire a lock to modify data, and your seed job may take time until all locks are acquired and all data is modified. There are a lot of profiling tools for analyzing lock contention. A simple way to debug lock contention is using thread dumps: if your seed job is slow, take a few thread dumps from the Jenkins master at short intervals (e.g. 5 seconds) and then analyze them for threads that are blocked.
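          A minimal sketch of that kind of sampling (essentially what the jstackSeries script linked further down automates). The PID lookup, the jenkins user name and the output file names are assumptions for illustration, not details from this issue:

              # Sample thread dumps from the Jenkins master every 5 seconds (sketch only).
              PID=$(pgrep -f jenkins.war | head -n1)
              for i in $(seq 1 10); do
                  sudo -u jenkins jstack -l "$PID" > "jstack_${PID}_$(date +%H%M%S).txt"
                  sleep 5
              done

          Threads that stay in BLOCKED state across several consecutive dumps, all waiting on the same lock, are the usual sign of the contention described above.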

          Kevin Phillips added a comment -

          Lock contention does seem like a plausible explanation for this behavior. I'm not an avid Java developer, so I'll need to figure out how to take thread dumps and set up the tools necessary to analyze thread usage to see if I can confirm this. If you have any tips / links that could help me along, please let me know.

          Christopher Shannon added a comment -

          Waiting for locks could certainly explain the variability in the latency of a Job DSL run, which I suppose is the crux of the original issue, but it would not explain the high load that we see on the master node while the job is running...

          Daniel Spilker added a comment -

          w60001 what about garbage collection?

          Christopher Shannon added a comment -

          I suppose that this is possible. I have looked at the output of the Monitoring plugin, and the spikes in CPU utilization don't necessarily correlate with drops in memory utilization, which is what I'd expect from a garbage collection...

          Daniel Spilker added a comment -

          You can use this script to create a few thread dumps from the Jenkins master: http://wiki.eclipse.org/How_to_report_a_deadlock#jstackSeries_--_jstack_sampling_in_fixed_time_intervals_.28tested_on_Linux.29

          Run the script when the load is high on the master, zip the generated files and attach the archive to this issue.

          Martin Nonnenmacher added a comment - - edited

          leedega: Jenkins does indeed modify the XML generated by Job DSL. For example, it adds plugin attributes referencing the plugin version to some tags. As a result, many more jobs are updated than required. This especially leads to performance problems when the Job Config History plugin is used without configuring a retention policy, because that plugin reads the full history of a job when it is updated, and the history can become very large when the seed job runs often. I think the XML comparison done by Job DSL could be improved, but I have not spent much time investigating.

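          One way to see the kind of difference Martin describes is to diff two consecutive config-history entries for a single generated job. This is only a sketch: the config-history location below is the plugin's usual default but should be treated as an assumption, and the job name and timestamp directories are made up for illustration:

              # Compare what the last two seed runs persisted for one generated job (illustrative paths).
              JOB_HISTORY="$JENKINS_HOME/config-history/jobs/my-generated-job"
              diff "$JOB_HISTORY/2016-10-06_12-00-00/config.xml" \
                   "$JOB_HISTORY/2016-10-06_13-00-00/config.xml"

          If the only differences are attributes such as plugin versions that Jenkins itself adds on save, the job is being rewritten on every seed run without any real change.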

          Christopher Shannon added a comment -

          manonnen: Super informative comment. I can confirm that on our side we do have the Job Config History plugin, many jobs, and likely many jobs without a retention policy in place. That probably goes a long way toward explaining the server load we're seeing on our end.

          The part that isn't clear, though, is what you mean by "many more jobs are updated than required". Is Jenkins updating jobs that aren't generated by that particular run of the Job DSL? Or is it just the job history of those generated jobs that increases the load?

          Martin Nonnenmacher added a comment -

          I meant that because Jenkins is changing the XML, Job DSL thinks the job has changed and updates it instead of skipping it. This creates a new entry in the job config history for the affected jobs every time the seed job runs. For example, one of our seed jobs is always updating about 90% of the jobs, even when only one of them was changed. Some of our jobs had hundreds of entries in the config history after a while, and suddenly the updates became extremely slow. By configuring a retention policy for the config history we were able to improve the runtime of the seed job by a factor of 100.

          Please also be aware that the retention for the config history needs to be configured globally in the system settings, not per job. Currently we only run into locking issues occasionally when a folder on the root level is updated (1-2 minutes delay), but job updates are very fast.

          Daniel Spilker added a comment -

          I'm going to close this as not reproducible if there is no proof that something in Job DSL can be fixed for this issue.

          Kevin Phillips added a comment -

          Here are a couple of thread dumps I generated today on one of our production servers, using the script Daniel referred to above. Both were captured within a minute of one another, so hopefully they'll provide a good basis for comparison.

          Kevin Phillips added a comment -

          I should probably also mention that the script used to generate those thread dumps produced 10 dump files, but more than half of them were empty because the jstack call threw the following exception several times:

          sun.jvm.hotspot.debugger.UnmappedAddressException: 3a80170

          The memory address was different each time, of course, but the error was the same. I'm not sure whether this detail is important to the analysis of this issue, but I thought I'd mention it.
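          For what it's worth, exceptions from the sun.jvm.hotspot.* classes usually indicate the dump went through the forced serviceability-agent path (jstack -F, or a fallback when the normal attach fails), which can stumble over a live, moving heap. A hedged sketch of the usually more reliable variants, assuming the master runs as the jenkins user; the PID lookup and file names are illustrative:

              # Prefer the regular attach mechanism, run as the user that owns the Jenkins process (sketch only).
              PID=$(pgrep -f jenkins.war | head -n1)
              sudo -u jenkins jstack -l "$PID" > "threaddump_$(date +%H%M%S).txt"
              # jcmd offers an equivalent on Java 8:
              sudo -u jenkins jcmd "$PID" Thread.print > "threaddump_jcmd_$(date +%H%M%S).txt"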

          Kevin Phillips added a comment -

          Reading through the other tips on the same wiki as the thread dump script, I decided to try creating a heap dump to see if that revealed anything helpful. After many attempts to do so using the 'jmap' tool described on that wiki, I have given up. Each time the dump failed with the following exception:

          $ sudo jmap -F -dump:format=b,file=HeapDump3.hprof 3028

          Attaching to process ID 3028, please wait...
          Debugger attached successfully.
          Server compiler detected.
          JVM version is 25.102-b14
          Dumping heap to HeapDump3.hprof ...
          Exception in thread "main" java.lang.reflect.InvocationTargetException
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:498)
              at sun.tools.jmap.JMap.runTool(JMap.java:201)
              at sun.tools.jmap.JMap.main(JMap.java:130)
          Caused by: sun.jvm.hotspot.types.WrongTypeException: No suitable match for type of address 0x00000007347dc788
              at sun.jvm.hotspot.runtime.InstanceConstructor.newWrongTypeException(InstanceConstructor.java:62)
              at sun.jvm.hotspot.runtime.VirtualBaseConstructor.instantiateWrapperFor(VirtualBaseConstructor.java:109)
              at sun.jvm.hotspot.oops.Metadata.instantiateWrapperFor(Metadata.java:68)
              at sun.jvm.hotspot.oops.Oop.getKlassForOopHandle(Oop.java:211)
              at sun.jvm.hotspot.oops.ObjectHeap.newOop(ObjectHeap.java:251)
              at sun.jvm.hotspot.classfile.ClassLoaderData.getClassLoader(ClassLoaderData.java:64)
              at sun.jvm.hotspot.memory.DictionaryEntry.loader(DictionaryEntry.java:63)
              at sun.jvm.hotspot.memory.Dictionary.classesDo(Dictionary.java:67)
              at sun.jvm.hotspot.memory.SystemDictionary.classesDo(SystemDictionary.java:190)
              at sun.jvm.hotspot.memory.SystemDictionary.allClassesDo(SystemDictionary.java:183)
              at sun.jvm.hotspot.utilities.HeapHprofBinWriter.writeClasses(HeapHprofBinWriter.java:954)
              at sun.jvm.hotspot.utilities.HeapHprofBinWriter.write(HeapHprofBinWriter.java:427)
              at sun.jvm.hotspot.tools.HeapDumper.run(HeapDumper.java:62)
              at sun.jvm.hotspot.tools.Tool.startInternal(Tool.java:260)
              at sun.jvm.hotspot.tools.Tool.start(Tool.java:223)
              at sun.jvm.hotspot.tools.Tool.execute(Tool.java:118)
              at sun.jvm.hotspot.tools.HeapDumper.main(HeapDumper.java:83)
              ... 6 more
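          The jmap failure above is also the forced (-F) serviceability-agent path. Two alternatives that usually avoid it, sketched under the assumption that they are run as the user that owns the Jenkins process and that the PID is 3028 as in the session above (the output path is illustrative):

              # Heap dump via the regular attach mechanism instead of -F (sketch only).
              sudo -u jenkins jmap -dump:format=b,file=/tmp/jenkins-heap.hprof 3028
              # Or the jcmd equivalent on Java 8:
              sudo -u jenkins jcmd 3028 GC.heap_dump /tmp/jenkins-heap.hprof

          Note that writing the .hprof file pauses the JVM, which on a large heap can take a noticeable amount of time.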

          Anand Shah added a comment - - edited

          We are running a production system and we use Job DSL heavily.

          For every commit, we re-generate all the jobs on the production system, roughly 1158 jobs each time. I noticed that over time the DSL job which generates all the jobs gets slower and slower (from 25 minutes to 1 hour 23 minutes). The attached DSL_Perf.png will give you some insight. I took 10 thread dumps at 1-second intervals; I am attaching the zip thread_dumps_master_01_run_dsl_slow_07_19_2017.zip here and hope to get some help or pointers.

          Mikołaj Morawski added a comment -

          ajmsh I recently posted an issue that is very similar to your problem; unfortunately there is still no solution, and only a Jenkins restart helps for some time.

          https://issues.jenkins-ci.org/browse/JENKINS-44056

          Martin Nonnenmacher added a comment -

          ajmsh: From your screenshot I can see that you are using the JobConfigHistory plugin. Have you configured a retention policy for it (like <100 entries per job) as I recommended above? We have a seed job updating ~500 jobs that finishes in less than a minute; the only time we had builds taking more than half an hour was before we limited the number of history entries.

          Anand Shah added a comment -

          Hi manonnen, there was no retention configured for JobConfigHistory, so by default it was keeping all the history. I have changed this value to keep only 10 entries and restarted Jenkins. I will update you on how this plays out in a week or so. After the restart it generates ~1158 jobs in 10 minutes; fingers crossed it will stay at that time.

          Anand Shah added a comment - - edited

          So it seems limiting JobConfigHistory does not help much. I started taking a heap dump every day to figure out whether there is a memory leak. I analyzed the heap dumps in the MAT tool and found the following two leak suspects.

          Problem Suspect 1

          106 instances of "hudson.remoting.Channel", loaded by "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x400305b48", occupy 1,073,486,048 (40.31%) bytes.

          Biggest instances:

          • hudson.remoting.Channel @ 0x4203a2778 - 98,671,000 (3.70%) bytes.

          Problem Suspect 2

          272,374 instances of "java.lang.Class", loaded by "<system class loader>", occupy 715,804,624 (26.88%) bytes.

          Biggest instances:

          • class java.beans.ThreadGroupContext @ 0x40504ed98 - 149,043,688 (5.60%) bytes.
          • class org.codehaus.groovy.reflection.ClassInfo @ 0x405037168 - 126,458,224 (4.75%) bytes.

          These instances are referenced from one instance of "org.codehaus.groovy.util.AbstractConcurrentMapBase$Segment[]", loaded by "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x400305b48".

          Suspect 2 might be related to a memory leak in Groovy itself. We are using job-dsl plugin version 1.47 and have "groovy-all-2.4.7.jar" under "WEB-INF/lib".

          Groovy leak: https://issues.apache.org/jira/browse/GROOVY-7683

          As described in JENKINS-33358, setting -Dgroovy.use.classvalue=true might help.
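          For reference, a sketch of how such a system property can be passed to the Jenkins JVM; the exact mechanism depends on how Jenkins is installed, and the paths below are assumptions:

              # When launching the WAR directly:
              java -Dgroovy.use.classvalue=true -jar jenkins.war
              # For packaged installations, append the flag to the JVM arguments instead
              # (e.g. JAVA_OPTS in /etc/default/jenkins, or the <arguments> element of
              # jenkins.xml on Windows), then restart Jenkins for it to take effect.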

          Trey Bohon added a comment - - edited

          We're seeing a very similar linear increase in Job DSL execution time at my organization to Anand's. We also trigger all of our seeds (nearly 500 today) via P4 SCM within 15 minutes. Each batch of these regenerations results in a single seed's execution time increasing by about 0.5 seconds (starting around 6 seconds). Because our master has 12 executors, these 0.5-second increments gradually result in these 500-job batch regenerations taking 2+ hours instead of 15 minutes. Rebooting Jenkins resets this degradation.

          For this problem, I wouldn't call it "Job DSL execution is occasionally slow on large scale production servers" but instead something like "Job DSL execution time increases linearly with usage for a given Jenkins uptime instance". Our trendline looks very similar to the attached.

          I'm trying to narrow this down by generating a single job from an existing Groovy file on disk, with enough seed jobs scheduled to run every minute that all 12 executors are in use. It will take several hours to confirm, but it appears to be leaking on our production master. It definitely doesn't leak on a dev machine with a very similar setup (it has been running for days, tens of thousands of seed runs). The only meaningful differences between the setups are the JDK build:

          Production:
          jdk1.8.0_131

          Dev machine:
          1.8.0_112

          And the Java flags, which I know are absolutely essential for a large production server; for example, not setting InitiatingHeapOccupancyPercent to something like 70 with the G1 GC means you will run out of memory within hours or days, depending on your setup.

          We're using Jenkins LTS 2.60.2, so that Groovy leak fix shouldn't be a factor.

          I'll post back when I have more data. If my debugging setup reproduces it, I think that rules out a lot of the XML mutation and job config history ideas, because we're talking about one job that's generated with a mostly unchanging history over hours.

          I'd be open to working around this with infrequent reboots, since we generally have weekly maintenance windows for OS updates and security; that ultimately may be the realistic solution if the issue is "Job DSL does something that causes something somewhere in Jenkins to leak".

          EDIT: And the linear trend on our production master reset. I knew I posted this too soon. Our production master has a massive heap (100 GB) whereas my dev machine does not, so maybe 2.60.2 does fix this (we recently upgraded to it) but with larger heaps it takes longer for GC to clean up periodically. I'm probably going to have to watch this over a couple of days to see if the data follows this behavior.

          Trey Bohon added a comment -

          So a full GC run, e.g.:

          jcmd.exe %pid% GC.run

          only seems to recover a small amount of execution time, which reflects what Anand implied about there being at least two major leaks, with the second one now fixed. Something independent of the GC also seems to recover execution time periodically, but it never recovers completely and trends upward. I'm just going to work around this with weekly restarts that coincide with our IT maintenance window.

          In case it's useful: our production server leaks even when just calling Job DSL, whereas my dev machine does not. The major differences are:

          Production (leaks):

          JDK 1.8.0_131 x64
          <arguments>-server -XX:+AlwaysPreTouch -Xloggc:$JENKINS_HOME/gc%t.log -XX:NumberOfGCLogFiles=5 -XX:+UseGCLogFileRotation -XX:GCLogFileSize=20m -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCCause -XX:+PrintTenuringDistribution -XX:+PrintReferenceGC -XX:+PrintAdaptiveSizePolicy -XX:+UseG1GC -XX:+ExplicitGCInvokesConcurrent -XX:+ParallelRefProcEnabled -XX:+UseStringDeduplication -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=20 -XX:+UnlockDiagnosticVMOptions -XX:G1SummarizeRSetStatsPeriod=1 -XX:InitiatingHeapOccupancyPercent=70 -Xrs -Xms100g -Xmx100g -Dhudson.lifecycle=hudson.lifecycle.WindowsServiceLifecycle -jar "%BASE%\jenkins.war" --httpPort=8080 --webroot="%BASE%\war"</arguments>

          Dev (does not):

          JDK 1.8.0_112 x64
          <arguments>-XX:InitiatingHeapOccupancyPercent=70 -Xrs -Xms4g -Xmx4g -Dhudson.lifecycle=hudson.lifecycle.WindowsServiceLifecycle -jar "%BASE%\jenkins.war" --httpPort=9090 --webroot="%BASE%\war"</arguments>

          Anand Shah added a comment -

          Jenkins version: 2.32.2

          Just an update: after setting -Dgroovy.use.classvalue=true, the Jenkins master has been running healthily for a week without any increase in Job DSL build times.

          Build Time Trend:

          Daniel Spilker added a comment -

          Is there still a problem when using newer versions of Jenkins and Job DSL? If not, I'm going to close this ticket.

          Ben Hines added a comment -

          Hey daspilker, this is definitely still the case. However, for us, we have narrowed the issue down to running on agents. We have quite a large amount of DSL code. When run on remote executors it takes about 12-15 minutes; when run on the 'main' node it takes 30 seconds. It seems like Job DSL is simply not at all optimized to run on a remote agent.

          Is '-Dgroovy.use.classvalue=true' still a suggested workaround? We haven't tried that yet.

            Assignee: Jamie Tanna (jamietanna)
            Reporter: Kevin Phillips (leedega)
            Votes: 9
            Watchers: 15
