[JENKINS-65308] Jenkins.trimLabels gets increasingly slower as number of nodes and labels increase

Type: Bug
Resolution: Fixed
Priority: Major
Component/s: core
Labels:
- 2.277.4-rejected
Environment:

agents jdk information:
swarm-client-3.24.jar
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~16.04-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)

primary node jdk information:
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~18.04-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)

Jenkins: 2.276
OS: Linux - 5.4.0-1036-gcp
---
ace-editor:1.1
pipeline-rest-api:2.19
pipeline-input-step:2.12
dtkit-api:3.0.0
build-name-setter:2.1.0
blueocean-pipeline-editor:1.24.4
extended-read-permission:3.2
swarm:3.24
pipeline-utility-steps:2.6.1
ldap:2.3
jaxb:2.3.0
blueocean-display-url:2.4.0
blueocean-pipeline-api-impl:1.24.4
handlebars:1.1.1
trilead-api:1.0.13
bootstrap4-api:4.6.0-1
conditional-buildstep:1.4.1
scm-api:2.6.4
powershell:1.4
pipeline-stage-view:2.19
okhttp-api:3.14.9
command-launcher:1.5
aws-java-sdk:1.11.930
blueocean-bitbucket-pipeline:1.24.4
s3:0.11.6
blueocean-rest:1.24.4
credentials-binding:1.24
pipeline-graph-analysis:1.10
lockable-resources:2.10
postbuildscript:2.11.0
github-branch-source:2.10.0
build-user-vars-plugin:1.6
cloudbees-disk-usage-simple:0.10
blueocean-git-pipeline:1.24.4
node-iterator-api:1.5.0
junit-attachments:1.6
momentjs:1.1.1
plain-credentials:1.7
workflow-support:3.8
workflow-aggregator:2.6
credentials:2.3.14
workflow-cps-global-lib:2.17
email-ext:2.81
authorize-project:1.3.0
docker-commons:1.17
gradle:1.36
cobertura:1.16
blueocean-commons:1.24.4
antisamy-markup-formatter:2.1
favorite:2.3.3
embeddable-build-status:2.0.3
jjwt-api:0.11.2-7.a257d5ff5a6b
jsch:0.1.55.2
mailer:1.32.1
pipeline-stage-tags-metadata:1.8.1
external-monitor-job:1.7
ssh-slaves:1.31.5
built-on-column:1.1
jenkins-multijob-plugin:1.36
popper-api:1.16.1-1
token-macro:2.15
blueocean-github-pipeline:1.24.4
ws-cleanup:0.38
workflow-basic-steps:2.23
rebuild:1.31
pipeline-model-definition:1.8.1
junit:1.48
ghprb:1.42.1
job-dsl:1.77
jenkins-design-language:1.24.4
pipeline-milestone-step:1.3.2
blueocean-rest-impl:1.24.4
github-api:1.122
workflow-multibranch:2.22
pipeline-model-api:1.8.1
allure-jenkins-plugin:2.29.0
docker-workflow:1.25
display-url-api:2.3.4
git-server:1.9
durable-task:1.35
pipeline-build-step:2.13
pam-auth:1.6
vsphere-cloud:2.25
google-storage-plugin:1.5.3
slack:2.45
google-oauth-plugin:1.0.3
authentication-tokens:1.4
snakeyaml-api:1.27.0
resource-disposer:0.14
workflow-job:2.40
apache-httpcomponents-client-4-api:4.5.13-1.0
jdk-tool:1.4
ssh-credentials:1.18.1
blueocean-core-js:1.24.4
blueocean-config:1.24.4
github-scm-trait-notification-context:1.1
workflow-scm-step:2.11
yaml-axis:0.3.0
build-timeout:1.20
basic-branch-build-strategies:1.3.2
variant:1.4
pipeline-githubnotify-step:1.0.5
envinject-api:1.7
matrix-combinations-parameter:1.3.1
plugin-util-api:1.6.1
checks-api:1.3.0
htmlpublisher:1.25
generic-webhook-trigger:1.72
blueocean-pipeline-scm-api:1.24.4
cloudbees-folder:6.15
blueocean-autofavorite:1.2.4
blueocean-i18n:1.24.4
echarts-api:4.9.0-3
structs:1.22
metrics:4.0.2.7
cloudbees-bitbucket-branch-source:2.9.7
blueocean-web:1.24.4
github:1.32.0
windows-slaves:1.7
maven-plugin:3.9
jackson2-api:2.12.1
git:4.5.2
envinject:2.4.0
handy-uri-templates-2-api:2.1.8-1.0
matrix-project:1.18
ssh-agent:1.20
timestamper:1.11.8
disable-github-multibranch-status:1.2
bouncycastle-api:2.18
blueocean:1.24.4
run-condition:1.5
blueocean-events:1.24.4
jquery3-api:3.5.1-2
pipeline-stage-step:2.5
font-awesome-api:5.15.2-1
blueocean-dashboard:1.24.4
git-client:3.6.0
leastload:3.0.0
javadoc:1.6
oauth-credentials:0.4
workflow-api:2.41
sse-gateway:1.24
blueocean-jwt:1.24.4
script-security:1.76
github-oauth:0.33
branch-api:2.6.3
custom-build-properties:1.9.1
ant:1.11
workflow-durable-task-step:2.37
workflow-step-api:2.23
hashicorp-vault-plugin:3.7.0
ansicolor:0.5.3
pipeline-github:2.7
parameterized-trigger:2.40
code-coverage-api:1.3.1
copyartifact:1.46
jquery:1.12.4-1
mask-passwords:3.0
blueocean-personalization:1.24.4
prometheus:2.0.9
google-metadata-plugin:0.3.1
pipeline-model-extensions:1.8.1
xunit:3.0.0
workflow-cps:2.87
pubsub-light:1.13
matrix-auth:2.6.5

Released As:
2.288 - Apr 13, 2021, 2.289 - Apr 20, 2021

We've been trying to track down some issues we've been seeing around Queue lock
contention on one of our Jenkins clusters. The lock contention manifests in
both UI instability/slowness and failures with REST API calls to add, update or
remove nodes. We use the Swarm plugin on the primary and swarm-client (version
3.24) on the agents to connect to the primary. The REST API failures aren't due
to exceptions from Jenkins, but to the API calls exceeding the configured
proxy_read_timeout (180s) for the nginx instance we have in front of Jenkins.
That manifests in the swarm-client process on the agents receiving a 504 from
nginx since Jenkins didn't respond in time.

Thread dumps gathered during periods of instability show that hundreds of
threads are waiting for the Queue lock to be able to add, update or remove
a node.

"Handling POST /plugin/swarm/createSlave from 10.224.1.234 : Jetty (winstone)-1218487" #1218487 prio=5 os_prio=0 tid=0x00007f4732c7f800 nid=0x2f96 waiting on condition [0x00007f3c6b5f2000]
   java.lang.Thread.State: WAITING (parking)
       at sun.misc.Unsafe.park(Native Method)
       - parking to wait for  <0x00007f3f117ca288> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
       at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
       at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
       at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
       at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
       at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
       at hudson.model.Queue._withLock(Queue.java:1441)
[snip..]

The vast majority of the time, the thread holding the Queue lock during each
thread dump is performing operations within the Jenkins.trimLabels method as
part of adding, updating or removing a node.

"Handling POST /plugin/swarm/createSlave from 10.240.81.215 : Jetty (winstone)-1218362" #1218362 prio=5 os_prio=0 tid=0x00007f47265fd000 nid=0x2e11 runnable [0x00007f3c85ceb000]
   java.lang.Thread.State: RUNNABLE
        at hudson.util.QuotedStringTokenizer.hasMoreTokens(QuotedStringTokenizer.java:184)
        at hudson.model.Label.parse(Label.java:585)
        at hudson.model.Node.getAssignedLabels(Node.java:303)
        at hudson.model.Label.matches(Label.java:196)
        at hudson.model.Label.getNodes(Label.java:233)
        at hudson.model.Label.isEmpty(Label.java:430)
        at jenkins.model.Jenkins.trimLabels(Jenkins.java:2201)
        at jenkins.model.Nodes$4.call(Nodes.java:214)
        at jenkins.model.Nodes$4.call(Nodes.java:210)
        at hudson.model.Queue._withLock(Queue.java:1443)
        at hudson.model.Queue.withLock(Queue.java:1304)
        at jenkins.model.Nodes.updateNode(Nodes.java:210)
        at jenkins.model.Jenkins.updateNode(Jenkins.java:2176)
        at hudson.model.Node.save(Node.java:139)
        at hudson.model.Node.setTemporaryOfflineCause(Node.java:274)
        at hudson.model.Computer.setNode(Computer.java:820)
        at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:895)
        at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:137)
        at hudson.model.AbstractCIBase.access$000(AbstractCIBase.java:43)
        at hudson.model.AbstractCIBase$2.run(AbstractCIBase.java:223)
        at hudson.model.Queue._withLock(Queue.java:1384)
        at hudson.model.Queue.withLock(Queue.java:1261)
        at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:206)
        at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1632)
        at jenkins.model.Nodes$2.run(Nodes.java:151)
        at hudson.model.Queue._withLock(Queue.java:1384)
        at hudson.model.Queue.withLock(Queue.java:1261)
        at jenkins.model.Nodes.addNode(Nodes.java:147)
        at jenkins.model.Jenkins.addNode(Jenkins.java:2155)
        at hudson.plugins.swarm.PluginImpl.doCreateSlave(PluginImpl.java:224)

I've attached a couple of archives created using the collectPerformanceData
script that contain the relevant thread dumps.

During the aforementioned periods of instability there are between 1500 and
1600 unique labels and 400 to 500 workers, as gathered from the script console
using Jenkins.instance.labels.size() and Jenkins.instance.nodes.size().

I'm able to replicate the increasing slowness using Groovy scripts that mirror
what our worker creation steps look like. I've attached both scripts.
create-workers.groovy creates the workers, remove-workers.groovy removes
them. To make it match our swarm-client workflow we create SwarmSlave agents
in the script but that detail probably doesn't matter for reproduction
purposes.

Creating and then removing workers with a single label is fast, as you'd
expect. Here's some snipped output for creation (full output attached as create-workers-single-label.log):

...
uniqueLabels: 395 nodes: 393 swarm-test-392: 63ms
uniqueLabels: 396 nodes: 394 swarm-test-393: 65ms
uniqueLabels: 397 nodes: 395 swarm-test-394: 87ms
uniqueLabels: 398 nodes: 396 swarm-test-395: 62ms
uniqueLabels: 399 nodes: 397 swarm-test-396: 62ms
uniqueLabels: 400 nodes: 398 swarm-test-397: 62ms
uniqueLabels: 401 nodes: 399 swarm-test-398: 63ms
uniqueLabels: 402 nodes: 400 swarm-test-399: 63ms
uniqueLabels: 403 nodes: 401 swarm-test-400: 64ms
Total time to create 400 workers: 9183ms

And then the same for removal (full output attached as remove-workers-single-label.log):

...
uniqueLabels: 10 nodes: 8 swarm-test-91: 0ms
uniqueLabels: 9 nodes: 7 swarm-test-92: 0ms
uniqueLabels: 8 nodes: 6 swarm-test-93: 1ms
uniqueLabels: 7 nodes: 5 swarm-test-94: 0ms
uniqueLabels: 6 nodes: 4 swarm-test-95: 1ms
uniqueLabels: 5 nodes: 3 swarm-test-96: 0ms
uniqueLabels: 4 nodes: 2 swarm-test-97: 1ms
uniqueLabels: 3 nodes: 1 swarm-test-98: 0ms
uniqueLabels: 1 nodes: 0 swarm-test-99: 1ms
Total time to remove 401 workers: 8675ms

But once you start adding more labels, things start slowing down drastically.
Here's some snipped output for creation (full output attached as create-workers-multiple-labels.log):

...
uniqueLabels: 809 nodes: 393 swarm-test-392: 1875ms
uniqueLabels: 811 nodes: 394 swarm-test-393: 1875ms
uniqueLabels: 813 nodes: 395 swarm-test-394: 1883ms
uniqueLabels: 815 nodes: 396 swarm-test-395: 1888ms
uniqueLabels: 817 nodes: 397 swarm-test-396: 1901ms
uniqueLabels: 819 nodes: 398 swarm-test-397: 1913ms
uniqueLabels: 821 nodes: 399 swarm-test-398: 1915ms
uniqueLabels: 823 nodes: 400 swarm-test-399: 1927ms
uniqueLabels: 825 nodes: 401 swarm-test-400: 1939ms
Total time to create 400 workers: 261866ms

And then the same for removal (full output attached as remove-workers-multiple-labels.log):

...
uniqueLabels: 39 nodes: 8 swarm-test-91: 3ms
uniqueLabels: 37 nodes: 7 swarm-test-92: 2ms
uniqueLabels: 35 nodes: 6 swarm-test-93: 2ms
uniqueLabels: 33 nodes: 5 swarm-test-94: 1ms
uniqueLabels: 31 nodes: 4 swarm-test-95: 1ms
uniqueLabels: 29 nodes: 3 swarm-test-96: 0ms
uniqueLabels: 27 nodes: 2 swarm-test-97: 0ms
uniqueLabels: 25 nodes: 1 swarm-test-98: 1ms
uniqueLabels: 1 nodes: 0 swarm-test-99: 0ms
Total time to remove 401 workers: 258555ms

Roughly doubling the number of unique labels makes the same process go from
about 9s for each operation to about 4 minutes and 20 seconds for each
operation.
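For a rough sense of scale, the figures quoted above can be compared directly. The sketch below is only arithmetic on the logged numbers (403 vs. 825 unique labels after the last create, and 9183 ms vs. 261866 ms total creation time); the class and method names are illustrative and not part of Jenkins or the attached scripts.

```java
// Arithmetic on the figures from the creation logs above; names are illustrative.
class SlowdownRatio {

    // Growth factor between two measurements.
    static double factor(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        double labelGrowth = factor(403, 825);     // unique labels after the last create
        double timeGrowth = factor(9183, 261866);  // total creation time in ms
        System.out.printf("labels grew %.2fx, creation time grew %.1fx%n",
                labelGrowth, timeGrowth);
    }
}
```

With the node count held at 400, about twice as many labels costs roughly 28 times as much total time, which is well beyond linear growth.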

Is there some way to make Jenkins.trimLabels less expensive even in the
face of thousands of labels and hundreds of workers? To my eye, the current
code path has several nested loops (an outer loop over every label, then a
loop over every worker, then a loop over every token parsed by the label
tokenizer, then a loop over every char in the raw label string), and these
are what drive the increase in execution time as the inputs get larger.
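The nesting described above can be modeled with a toy cost counter. This is a minimal sketch, not the Jenkins source: the class, the method names, and the assumption that each node carries two labels of its own are all hypothetical, and it only counts character visits instead of measuring real time.

```java
import java.util.ArrayList;
import java.util.List;

// Toy cost model of the loop nesting described above. NOT the Jenkins source;
// all names here are hypothetical.
class TrimLabelsCostModel {

    // Splits a raw label string on spaces, counting every character visited.
    static List<String> tokenize(String rawLabels, long[] charVisits) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < rawLabels.length(); i++) {
            charVisits[0]++;
            char c = rawLabels.charAt(i);
            if (c == ' ') {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // One full pass: every label is checked against every node, and every
    // check re-parses that node's raw label string from scratch.
    static long charVisitsForOnePass(List<String> labels, List<String> nodeLabelStrings) {
        long[] charVisits = {0};
        for (String label : labels) {
            for (String raw : nodeLabelStrings) {
                for (String token : tokenize(raw, charVisits)) {
                    if (token.equals(label)) {
                        break; // match found, stop comparing tokens
                    }
                }
            }
        }
        return charVisits[0];
    }

    public static void main(String[] args) {
        for (int n : new int[] {100, 200, 400}) {
            List<String> labels = new ArrayList<>();
            List<String> nodeLabelStrings = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                // Assumption: each node carries two labels of its own.
                labels.add("self-" + i);
                labels.add("extra-" + i);
                nodeLabelStrings.add("self-" + i + " extra-" + i);
            }
            System.out.println(n + " nodes -> "
                    + charVisitsForOnePass(labels, nodeLabelStrings)
                    + " character visits per pass");
        }
    }
}
```

In this model the work for a single pass scales with labels × nodes × label-string length, and one such pass runs for every node that is added, updated or removed.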

Attachments:
- performanceData.13272-3.output.tar.gz (5.52 MB, 2021-04-07 18:52)
- performanceData.13272-2.output.tar.gz (6.06 MB, 2021-04-07 18:52)
- create-workers-multiple-labels.log (20 kB, 2021-04-07 18:51)
- create-workers-single-label.log (19 kB, 2021-04-07 18:51)
- create-workers.groovy (1 kB, 2021-04-07 18:51)
- remove-workers-multiple-labels.log (20 kB, 2021-04-07 18:51)
- remove-workers-single-label.log (19 kB, 2021-04-07 18:51)
- remove-workers.groovy (0.6 kB, 2021-04-07 18:51)

Issue Links:
- causes JENKINS-67099 trimLabels calling Cloud#canProvision too often (Closed)
- causes JENKINS-68055 load statistics not working for label expressions (regression in 2.289) (Closed)
- links to PR-5402
- links to PR-5412

Jonah Bull added a comment - 2021-04-12 21:36

Really appreciate the quick response to this issue! After the linked PR was
merged I built jenkins.war from master locally and built a jdk11 Docker
image based on that WAR file using the Makefile from the Jenkins docker
repo. Unfortunately in my tests the behavior I described looks worse and not
better after these changes.

Here's some snipped output for creation using latest master:

jenkins@3fd5d3f4801e:~$ tail create-workers-with-labels.log
uniqueLabels: 811 nodes: 393 swarm-test-392: 1019ms
uniqueLabels: 813 nodes: 394 swarm-test-393: 949ms
uniqueLabels: 815 nodes: 395 swarm-test-394: 2334ms
uniqueLabels: 817 nodes: 396 swarm-test-395: 2048ms
uniqueLabels: 819 nodes: 397 swarm-test-396: 996ms
uniqueLabels: 821 nodes: 398 swarm-test-397: 972ms
uniqueLabels: 823 nodes: 399 swarm-test-398: 1036ms
uniqueLabels: 825 nodes: 400 swarm-test-399: 1096ms
uniqueLabels: 827 nodes: 401 swarm-test-400: 997ms
Total time to create 400 workers: 361632ms

And then the same for removal:

jenkins@3fd5d3f4801e:~$ tail remove-workers-with-labels.log
uniqueLabels: 41 nodes: 8 swarm-test-91: 3ms
uniqueLabels: 39 nodes: 7 swarm-test-92: 2ms
uniqueLabels: 37 nodes: 6 swarm-test-93: 2ms
uniqueLabels: 35 nodes: 5 swarm-test-94: 3ms
uniqueLabels: 33 nodes: 4 swarm-test-95: 2ms
uniqueLabels: 31 nodes: 3 swarm-test-96: 2ms
uniqueLabels: 29 nodes: 2 swarm-test-97: 2ms
uniqueLabels: 27 nodes: 1 swarm-test-98: 2ms
uniqueLabels: 3 nodes: 0 swarm-test-99: 1ms
Total time to remove 401 workers: 391983ms

In contrast, here's the snipped output for creation using the 2.276 docker
image:

jenkins@58c6fe073fc5:~$ tail create-workers-with-labels.log
uniqueLabels: 811 nodes: 393 swarm-test-392: 3430ms
uniqueLabels: 813 nodes: 394 swarm-test-393: 3445ms
uniqueLabels: 815 nodes: 395 swarm-test-394: 1824ms
uniqueLabels: 817 nodes: 396 swarm-test-395: 1619ms
uniqueLabels: 819 nodes: 397 swarm-test-396: 1682ms
uniqueLabels: 821 nodes: 398 swarm-test-397: 1665ms
uniqueLabels: 823 nodes: 399 swarm-test-398: 1652ms
uniqueLabels: 825 nodes: 400 swarm-test-399: 1676ms
uniqueLabels: 827 nodes: 401 swarm-test-400: 1668ms
Total time to create 400 workers: 234127ms

And then the same for removal:

jenkins@58c6fe073fc5:~$ tail remove-workers-with-labels.log
uniqueLabels: 41 nodes: 8 swarm-test-91: 5ms
uniqueLabels: 39 nodes: 7 swarm-test-92: 4ms
uniqueLabels: 37 nodes: 6 swarm-test-93: 4ms
uniqueLabels: 35 nodes: 5 swarm-test-94: 3ms
uniqueLabels: 33 nodes: 4 swarm-test-95: 2ms
uniqueLabels: 31 nodes: 3 swarm-test-96: 2ms
uniqueLabels: 29 nodes: 2 swarm-test-97: 2ms
uniqueLabels: 27 nodes: 1 swarm-test-98: 2ms
uniqueLabels: 3 nodes: 0 swarm-test-99: 1ms
Total time to remove 401 workers: 262418ms

This is all using the groovy scripts attached to this issue. I'll dig some more
tomorrow and see if I can provide some further data.

Raihaan Shouhell added a comment - 2021-04-12 22:04

Hey Jonah thanks for the feedback, I'll probably write another PR which you can hopefully test.

In my limited testing, this improved things but it doesn't seem to have helped you.

Raihaan Shouhell added a comment - 2021-04-12 23:23

Hey jonahbull, if you get the chance, could you try out the newer PR?

Jonah Bull added a comment - 2021-04-13 16:30

Hey Raihaan, thanks again for the quick work on these PRs! I tested the newer PR this morning and got not-very-improved results, which didn't make any sense to me. I slowly realized that my timing/reporting code in the worker creation loop was probably what was dominating the time now, since it has to get the number of labels and nodes to print on each iteration. I commented out the timing code in the worker creation loop, leaving just the measurement of the total time to create/remove all the workers. That seems to have done the trick as far as reporting accurate results. So the first PR was actually a massive improvement over our baseline on 2.276, and the second PR is even better still!

2.276:

Total time to create 400 workers: 240958ms
Total time to remove 401 workers: 255528ms

PR-5402 (first PR):

Total time to create 400 workers: 2903ms
Total time to remove 401 workers: 1871ms

PR-5412 (second PR):

Total time to create 400 workers: 1637ms
Total time to remove 401 workers: 1441ms

Personally I find the implementation of the second PR clearer, so I'd love to see that one land as well. I'm not sure if there are any concerns over whether the second PR changes the contract of the trimLabels method significantly; hopefully folks who know that area of the code better will chime in on that aspect.
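The measurement pitfall described in this comment can be illustrated with a small model. It is only a sketch under stated assumptions: the class and method names are hypothetical, the per-iteration report is assumed to rescan every node to count unique labels, and it counts abstract work units instead of timing the real Jenkins calls.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of per-iteration reporting overhead; not the attached
// Groovy scripts and not Jenkins code.
class ReportingOverheadDemo {

    // Stand-in for an expensive report: recounts unique labels by scanning
    // every node, adding one work unit per label visited.
    static int countUniqueLabels(List<String[]> nodes, long[] work) {
        Set<String> unique = new HashSet<>();
        for (String[] labels : nodes) {
            for (String label : labels) {
                work[0]++;
                unique.add(label);
            }
        }
        return unique.size();
    }

    // Total work units for creating n nodes, with or without a report after
    // each creation. The creation itself is modeled as one unit.
    static long workUnits(int n, boolean reportEachIteration) {
        List<String[]> nodes = new ArrayList<>();
        long[] work = {0};
        for (int i = 0; i < n; i++) {
            nodes.add(new String[] {"node-" + i, "extra-" + i});
            work[0]++; // the operation actually being measured
            if (reportEachIteration) {
                countUniqueLabels(nodes, work);
            }
        }
        return work[0];
    }

    public static void main(String[] args) {
        System.out.println("total only:      " + workUnits(400, false));
        System.out.println("report per loop: " + workUnits(400, true));
    }
}
```

Once the measured operation is cheap, a report whose cost grows with the number of nodes accounts for almost all of the work in this model, which is why timing only the whole loop gives a cleaner number.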

Raihaan Shouhell added a comment - 2021-04-13 16:44

Hey Jonah, thanks a ton for testing. I'll be getting the newer PR merged as well, but since the first PR already improves your situation, you can use the official 2.288, which contains the first implementation, in the meantime. Hopefully the second implementation gets merged next week.

Gavin Williams added a comment - 2021-04-15 15:30

raihaan Could we consider this improvement for backporting to an LTS release as well?

Raihaan Shouhell added a comment - 2021-04-15 15:34

I can label it as such, but there are no guarantees it will make it into LTS.

Gavin Williams added a comment - 2021-04-15 15:51

Cool... Well, at least it will be considered now...

Raihaan Shouhell added a comment - 2021-04-19 09:23

Marking this resolved, as the first fix technically rectifies the issue. The second fix will be going in for 2.289.

Daniel Beck added a comment - 2021-04-21 08:51

It is unclear what should be backported into 2.277.x as there are two related PRs, and one isn't even out a full day yet.

Gavin Williams added a comment - 2021-04-23 13:59

I just thought I'd drop by and say thank you from myself and Jonah for the awesome work on getting this issue resolved so quickly.

We deployed the latest `2.289` version to production yesterday, and have seen dramatic improvements on the performance of creating new agents. The 99th percentile for the `POST /plugin/swarm/createSlave` call is now a mere 2.2 seconds!!!

Raihaan Shouhell added a comment - 2021-04-23 18:32

Hey Gavin, thanks to both you and Jonah that this was able to be resolved so quickly. The initial investigation was superb, which allowed me to focus on the issue. It's collaboration like this that helps issues get resolved quickly.

Gavin Williams added a comment - 2021-04-26 14:00

Hi bmunoz

I just noticed that this issue has been labelled as `2.277.4-rejected`.

Does that mean it's not going to be part of the next LTS release?

Could you elaborate on the reasoning behind that? From our testing and production usage thus far, this change has a massive benefit on overall Jenkins performance, especially when running hundreds of connected agents that are being dynamically created...

Thanks

Raihaan Shouhell added a comment - 2021-04-26 14:52

Hey Gavin, it was rejected because the change has not been in weekly releases for long and it is not a regression. Thus it was decided that it be left out because this change poses a risk of a regression. So yes it will not be in 2.277.4, it will likely be in the LTS after that whose baseline is likely to be 2.289.

Beatriz Muñoz added a comment - 2021-04-27 07:36

Hey Gavin. What Raihaan said is correct. Here the discussion

Gavin Williams added a comment - 2021-04-28 09:24

Thanks for the detail both

Assignee: Raihaan Shouhell
Reporter: Jonah Bull
Votes: 3
Watchers: 9
Created: 2021-04-07 18:54
Updated: 2022-04-07 20:29
Resolved: 2021-04-19 09:23
