  Jenkins / JENKINS-38741

Job DSL execution is occasionally slow on large-scale production servers

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component: job-dsl-plugin
    • Labels: None
    • Environment: Jenkins 1.642.4
      Job DSL plugin v1.42

      We have several Jenkins jobs which do nothing more than execute a single build step that runs Job DSL code. Some of these scripts are responsible for dynamically generating hundreds of new jobs. Sometimes these scripts run through to completion quickly (i.e. 60 seconds or less), but other times they take very long periods of time to complete (i.e. 20-30 minutes or more).

      Based on some crude investigation on my part, trawling through build logs and poking around in the source code for the Job DSL plugin, what I think is happening is that after the Job DSL gets processed, a list of newly generated and/or modified job configs is stored in memory and the Jenkins master is then 'scheduled' to run each of the individual job operations whenever it can. In these extreme situations it appears to take a very long time for all of these individual job operations (of which there are hundreds) to get scheduled and run. For example, when looking at the timestamps of specific log messages on the master, we see that it may be as much as 60 seconds from when one job update operation completes to when the next one begins, with other unrelated output from the master in between.

      Why this happens exactly, I am not sure. The conditions causing the slowdown are difficult at best to reproduce. I can say that in our production environment the Jenkins master runs on a virtual machine with considerable resources assigned to it, and we don't have any executors configured on the master, so any load it sees is 100% associated with the master processes and not some unpredictable build operation.

      If anyone can shed some light on why this behaves as it does, and whether there's anything we can do on our end to mitigate the issue, that would be awesome. Also, it would be nice to see some sort of implementation change in the plugin itself to ensure that the performance of the generation process is kept consistent between runs (e.g. if all job configuration updates could be scheduled as a single larger operation instead of many smaller ones).
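
      For illustration, a minimal sketch of the kind of seed script being described; the job names, loop bound, repository URLs, and build steps here are hypothetical placeholders, not the actual DSL code from this report:

      // Hypothetical seed script: the seed job's single "Process Job DSLs" build
      // step runs something like this, generating hundreds of jobs in one pass.
      (1..300).each { i ->
          job("generated-module-${i}-build") {
              description('Auto-generated by the seed job')
              scm {
                  git("https://example.com/module-${i}.git", 'master')
              }
              steps {
                  shell('make build')
              }
          }
      }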

          [JENKINS-38741] Job DSL execution is occasionally slow on large-scale production servers

          Kevin Phillips created issue -

          Kevin Phillips added a comment -

          I'm not sure whether this is relevant, but I have noticed that some of the jobs generated by our scripts are being detected as 'changed' even when there have been no changes to their respective DSL code. As a side effect, I've noticed that immediately after running our DSL scripts, if you examine some of the jobs that get generated, they carry notification text indicating that their job configuration has been changed since they were created, even though no manual modifications have been made to them.

          I'm guessing that somewhere in our DSL scripts the XML being generated is slightly different from what the Jenkins service expects, and that Jenkins performs some sort of pre-processing on the XML when it gets stored to disk, causing a discrepancy between what is running in production and what is generated by the scripts.

          I suspect that one reason the DSL generation is slow on occasion may be the extra overhead on the Jenkins master of performing these pre-processing operations; if there were some way for us to avoid these false positives, it might minimize the overhead on the master and avoid whatever bottleneck we are hitting.

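          One way to check that theory (a sketch, assuming access to the Jenkins script console; 'some-generated-job' is a placeholder for one of the generated job names) is to print the config.xml that Jenkins actually persisted and diff it against the XML the DSL run produced:

          // Script console sketch: print the on-disk config.xml for a generated job.
          import jenkins.model.Jenkins

          def item = Jenkins.instance.getItemByFullName('some-generated-job')
          if (item != null) {
              println item.configFile.asString()
          } else {
              println 'job not found'
          }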

          Christopher Shannon added a comment -

          I can confirm that running a Job DSL script causes a significant load on the master. Our Jenkins instance was previously running on my laptop, and my PC would become almost unusable while a Job DSL script was running.

          We have since moved to a real server and increased the max heap size that Jenkins can consume considerably, so it is no longer an issue for us.

          I'd be curious to learn from the outcome of this thread what exactly it is doing, but I can at least confirm that there is a very noticeable load put on the master while a script is being processed.


          Vlad Aginsky added a comment -

          We observe it as well, in Jenkins ver. 1.617.


          The "Process Job DSLs" build steps always runs on the master, even if the job is assigned to an executor on an agent (aka slave). The Job DSL build steps needs to modify the internal memory structures in Jenkins, so running the step on an agent would require a lot of remote communication which would make the step every slow and cause load on both, master and agent. When running on an agent, the seed job's workspace will reside on the agent machine. Since the script runs on the master it needs to load all scripts and files through the remote channel, which does not make things faster. So in general it is discouraged to run the seed job on an agent.

          There are several reasons why performance may be slow sometimes. One is memory pressure and garbage collection. There are several tools to monitor a Java process; for example, you could try the Monitoring Plugin. If your Jenkins master is short on memory and/or doing a lot of GC, your system will be slow. You need to find out whether that is the case.
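
          As a minimal spot check, assuming access to the Jenkins script console, heap usage and cumulative GC activity can be read via the standard java.lang.management API:

          // Script console sketch: print current heap usage and GC totals.
          import java.lang.management.ManagementFactory

          def heap = ManagementFactory.memoryMXBean.heapMemoryUsage
          println "heap used: ${heap.used >> 20} MB of ${heap.max >> 20} MB max"

          ManagementFactory.garbageCollectorMXBeans.each { gc ->
              println "${gc.name}: ${gc.collectionCount} collections, ${gc.collectionTime} ms total"
          }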

          Another reason why the seed job may be slow is lock contention in the Jenkins master. The Job DSL build step modifies internal structures in Jenkins (job config, view config, etc). To ensure that this internal data stays consistent, Jenkins uses locks to prevent multiple threads from modifying data concurrently. If your Jenkins instance is under load, e.g. many jobs are running, then a lot of internal data structures are being modified. So it may happen that a thread (e.g. the executor running the seed job) must wait to acquire a lock before it can modify data, and your seed job may take time until all locks are acquired and all data is modified. There are a lot of profiling tools for analyzing lock contention. A simple method to debug lock contention is using thread dumps: if your seed job is slow, take a few thread dumps from the Jenkins master at short intervals (e.g. 5 seconds) and then check the dumps for any threads that are blocked.

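          A minimal sketch of that check, assuming access to the Jenkins script console (taking a few dumps with the standard jstack tool works just as well); it simply lists threads currently blocked on a monitor:

          // Script console sketch: list BLOCKED threads and the top of their stacks
          // as a crude indicator of lock contention.
          Thread.getAllStackTraces().each { thread, stack ->
              if (thread.state == Thread.State.BLOCKED) {
                  println "${thread.name} is BLOCKED"
                  stack.take(5).each { frame -> println "    at ${frame}" }
              }
          }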

          Lock contention does seem like a plausible explanation for this behavior. I'm not an avid Java developer, so I'll need to figure out how to take thread dumps and set up the tools necessary to analyze the thread usage to see if I can confirm it. If you have any tips / links that could help me along, please let me know.


          Waiting for locks could certainly explain the variability in the latency of a Job DSL run, which I suppose is the crux of the original issue, but it would not explain the high load that we see on the master node while the job is running...


            Assignee: Jamie Tanna (jamietanna)
            Reporter: Kevin Phillips (leedega)
            Votes: 9
            Watchers: 15
