[JENKINS-47724] Pipeline with parallel jobs performance issue with EC2 slaves - Jenkins Jira

Type: Improvement
Resolution: Unresolved
Priority: Major
Component/s: workflow-api-plugin, workflow-durable-task-step-plugin
Labels:
- performance
- pipeline
Environment:
Jenkins 2.60.3 & 2.86
Pipeline 2.5
Pipeline API 2.23.1
Pipeline Nodes and Processes 2.17
Pipeline Step API 2.13
EC2 Plugin 1.36

Similar Issues:
Powered by SuggestiMate

Show

We are in the process of converting our freestyle matrix jobs over to parallel pipeline and we are seeing a few performance issues when using pipeline. We are able to reproduce the issue using sample pipeline code that is very similar to -~~JENKINS-45553~~-. We are also using the EC2 plugin to start and stop slaves. In all of my testing below, the slaves were started outside of pipeline jobs to prevent slave startup time from skewing any results.

Parallel Pipeline code:

def stepsForParallel = [:]
for (int i = 0; i < 150; i++) {
  def s = "subjob_${i}" 
  stepsForParallel[s] = {
    node("JENKINS-SLAVE-NODE-LABEL") {
      sh '''
      date +%c
      '''
    }
  }
}
timestamps {
parallel stepsForParallel
}

1. Freestyle Matrix VS Parallel Pipeline

Using 16 slave VMs:
I have a freestyle matrix axis job that runs the date command on slaves across 355 slave nodes. This build took 28 seconds to run after slave startup and around 23 seconds thereafter.

Using the sample parallel pipeline above with 355 branches, it takes 4 mins and 49 seconds to complete after the slaves are started and then around 2 minus 47 seconds thereafter.

I'm unsure why the time discrepancy is so large when both jobs are performing the same work with the same output.

2. The first parallel pipeline job on a slave is slower

I'm noticing that after the 16 slaves are started, the first pipeline job that runs takes a lot longer than subsequent runs. See below.
First build after slaves started: 4m 49s
Subsequent runs: 2m 47s, 2m 44s, 2m 43s, 2m 32s
After Jenkins slaves were stopped and then started without restarting Jenkins
the parallel pipeline job took 6m 36s to complete.
With extra debugging I did see this.

pipeline performance after slave restart

Slaves started:
4m 49s
timings for OwnerJOBNAME/34:JOBNAME #34:
Unknown macro: {classLoad=79ms, flowNode=29209ms, parse=49ms, run=287301ms, saveProgram=9880ms}
Build Reran:
2m 44s
timings for OwnerJOBNAME/35:JOBNAME #35:
Unknown macro: {classLoad=8ms, flowNode=34032ms, parse=21ms, run=162601ms, saveProgram=9192ms}
Slaves restarted:
5m 4s
timings for OwnerJOBNAME/36:JOBNAME #36:
Unknown macro: {classLoad=108ms, flowNode=33247ms, parse=71ms, run=301825ms, saveProgram=8836ms}

I tried first running a freestyle job and then a parallel pipeline job after slaves were restarted but it didn't help. I'm unsure why there is a performance penalty for being the first pipeline job to run on a slave after it is started.

3. Parallel pipeline job appears to pause when starting on slaves

This is a major issue for us. When running a parallel pipeline job, I can see the slaves on the left nav all show part of the pipeline job and an increasing queue. This process appears to hang or pause for a significant large amount of time. I verified on the slaves that they are idle and not doing anything. The Jenkins master shows little CPU and disk I/O load. This issue seems to get worse with an increased slave count.

Duration	Branches	Slave VMs
14m	100	100
3m	100	16
1m 13s	100	4
29m	1000	16
34m	1000	4

I would expect to see that parallel pipeline jobs run faster with more slaves not slower. I'm attaching thread dumps that I took during the pause.

log files during pause period

In the debug log I see this repeat slowly

Oct 30, 2017 2:05:30 AM FINEST org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep

-~~JENKINS-34021~~-: DurableTaskStep.Execution.listener present in CpsStepContext[471:sh]:OwnerJOBNAME #83

In the console log I see the following. Note a 2 minute pause occurs after the first line when watching the console log in real time

02:05:35 [subjob_149] Waiting for next available executor on JENKINS-SLAVE-LABEL
02:05:30 [subjob_0] + date +%c
02:05:39 [subjob_6] + date +%c
02:05:49 [subjob_5] + date +%c
02:05:58 [subjob_1] + date +%c
02:06:07 [subjob_2] + date +%c
02:06:16 [subjob_4] + date +%c
02:06:25 [subjob_3] + date +%c
02:06:34 [subjob_10] + date +%c
02:06:44 [subjob_7] + date +%c
02:06:53 [subjob_8] + date +%c
02:07:02 [subjob_15] + date +%c
02:07:11 [subjob_12] + date +%c
02:07:20 [subjob_9] + date +%c
02:07:29 [subjob_11] + date +%c
02:07:39 [subjob_13] + date +%c
02:07:48 [subjob_14] + date +%c
..

Notes:

I ran the sample parallel pipeline code on the Jenkins master and with 100 branches and 100 executors and couldn't reproduce the problem. I also ran the sample pipeline code with ' echo Hello ' instead of ' sh "date +%c" ' and also couldn't reproduce the problem. Adding additional sh commands to run on the slaves didn't add any significant time to the builds. I'm not sure if this issue is specific to EC2 plugin slaves. The slaves were started and fully running before any of the above tests were performed to minimize EC2 plugin involvement.

Tickets that maybe related:
-~~JENKINS-45553~~-
~~JENKINS-33761~~
JENKINS-39489
JENKINS-47170

4. High CPU on Jenkins Masters

I'm seeing higher CPU usage on our Jenkins masters that are running parallel pipeline code compared to freestyle. In addition, the more parallel branches that are ran the higher the load. I tried to quantify the differences in the table below using a freestyle matrix job and parallel pipeline job that performs close to the same code on 2 CPU / 8 GB RAM sized Jenkins Master instances.

Type	number of branches	number of slaves agents	peak 1m load avg	peak 5m load avg	15m load avg when job finished
Freestyle	259	55	2.0	.53	.08
Pipeline	259	60	3.19	1.7	.85
Pipeline	259	120	4.38	1.7	.89
Pipeline	259	190	4.34	2.03	1.06
Pipeline	355	190	6.26	2.12	.98

In summary, when using parallel pipeline I'm seeing higher load average on the Jenkins masters as the number of branches and slaves agents increases even though all of the testing occurs on Jenkins slave agents. Matrix freestyle jobs compared to parallel pipeline seem to have a fraction of the Jenkins master load after the initial peak.

UPDATE: Running two parallel pipeline jobs with a total of 110 slave agents pushed the Jenkins master instance to a peak 1m load avg of 10.77 and peak 5m load avg to 7.18.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

job-thread_dump.txt
19 kB
2017-10-30 01:46
system-thread_dump.txt
11 kB
2017-10-30 01:46

Details

Description

Parallel Pipeline code:

1. Freestyle Matrix VS Parallel Pipeline

2. The first parallel pipeline job on a slave is slower

3. Parallel pipeline job appears to pause when starting on slaves

Notes:

4. High CPU on Jenkins Masters

Attachments

Attachments

Activity

People

Dates