Bug
Resolution: Fixed
Major
agents jdk information:
swarm-client-3.24.jar
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~16.04-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
primary node jdk information:
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~18.04-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
Jenkins: 2.276
OS: Linux - 5.4.0-1036-gcp
---
ace-editor:1.1
pipeline-rest-api:2.19
pipeline-input-step:2.12
dtkit-api:3.0.0
build-name-setter:2.1.0
blueocean-pipeline-editor:1.24.4
extended-read-permission:3.2
swarm:3.24
pipeline-utility-steps:2.6.1
ldap:2.3
jaxb:2.3.0
blueocean-display-url:2.4.0
blueocean-pipeline-api-impl:1.24.4
handlebars:1.1.1
trilead-api:1.0.13
bootstrap4-api:4.6.0-1
conditional-buildstep:1.4.1
scm-api:2.6.4
powershell:1.4
pipeline-stage-view:2.19
okhttp-api:3.14.9
command-launcher:1.5
aws-java-sdk:1.11.930
blueocean-bitbucket-pipeline:1.24.4
s3:0.11.6
blueocean-rest:1.24.4
credentials-binding:1.24
pipeline-graph-analysis:1.10
lockable-resources:2.10
postbuildscript:2.11.0
github-branch-source:2.10.0
build-user-vars-plugin:1.6
cloudbees-disk-usage-simple:0.10
blueocean-git-pipeline:1.24.4
node-iterator-api:1.5.0
junit-attachments:1.6
momentjs:1.1.1
plain-credentials:1.7
workflow-support:3.8
workflow-aggregator:2.6
credentials:2.3.14
workflow-cps-global-lib:2.17
email-ext:2.81
authorize-project:1.3.0
docker-commons:1.17
gradle:1.36
cobertura:1.16
blueocean-commons:1.24.4
antisamy-markup-formatter:2.1
favorite:2.3.3
embeddable-build-status:2.0.3
jjwt-api:0.11.2-7.a257d5ff5a6b
jsch:0.1.55.2
mailer:1.32.1
pipeline-stage-tags-metadata:1.8.1
external-monitor-job:1.7
ssh-slaves:1.31.5
built-on-column:1.1
jenkins-multijob-plugin:1.36
popper-api:1.16.1-1
token-macro:2.15
blueocean-github-pipeline:1.24.4
ws-cleanup:0.38
workflow-basic-steps:2.23
rebuild:1.31
pipeline-model-definition:1.8.1
junit:1.48
ghprb:1.42.1
job-dsl:1.77
jenkins-design-language:1.24.4
pipeline-milestone-step:1.3.2
blueocean-rest-impl:1.24.4
github-api:1.122
workflow-multibranch:2.22
pipeline-model-api:1.8.1
allure-jenkins-plugin:2.29.0
docker-workflow:1.25
display-url-api:2.3.4
git-server:1.9
durable-task:1.35
pipeline-build-step:2.13
pam-auth:1.6
vsphere-cloud:2.25
google-storage-plugin:1.5.3
slack:2.45
google-oauth-plugin:1.0.3
authentication-tokens:1.4
snakeyaml-api:1.27.0
resource-disposer:0.14
workflow-job:2.40
apache-httpcomponents-client-4-api:4.5.13-1.0
jdk-tool:1.4
ssh-credentials:1.18.1
blueocean-core-js:1.24.4
blueocean-config:1.24.4
github-scm-trait-notification-context:1.1
workflow-scm-step:2.11
yaml-axis:0.3.0
build-timeout:1.20
basic-branch-build-strategies:1.3.2
variant:1.4
pipeline-githubnotify-step:1.0.5
envinject-api:1.7
matrix-combinations-parameter:1.3.1
plugin-util-api:1.6.1
checks-api:1.3.0
htmlpublisher:1.25
generic-webhook-trigger:1.72
blueocean-pipeline-scm-api:1.24.4
cloudbees-folder:6.15
blueocean-autofavorite:1.2.4
blueocean-i18n:1.24.4
echarts-api:4.9.0-3
structs:1.22
metrics:4.0.2.7
cloudbees-bitbucket-branch-source:2.9.7
blueocean-web:1.24.4
github:1.32.0
windows-slaves:1.7
maven-plugin:3.9
jackson2-api:2.12.1
git:4.5.2
envinject:2.4.0
handy-uri-templates-2-api:2.1.8-1.0
matrix-project:1.18
ssh-agent:1.20
timestamper:1.11.8
disable-github-multibranch-status:1.2
bouncycastle-api:2.18
blueocean:1.24.4
run-condition:1.5
blueocean-events:1.24.4
jquery3-api:3.5.1-2
pipeline-stage-step:2.5
font-awesome-api:5.15.2-1
blueocean-dashboard:1.24.4
git-client:3.6.0
leastload:3.0.0
javadoc:1.6
oauth-credentials:0.4
workflow-api:2.41
sse-gateway:1.24
blueocean-jwt:1.24.4
script-security:1.76
github-oauth:0.33
branch-api:2.6.3
custom-build-properties:1.9.1
ant:1.11
workflow-durable-task-step:2.37
workflow-step-api:2.23
hashicorp-vault-plugin:3.7.0
ansicolor:0.5.3
pipeline-github:2.7
parameterized-trigger:2.40
code-coverage-api:1.3.1
copyartifact:1.46
jquery:1.12.4-1
mask-passwords:3.0
blueocean-personalization:1.24.4
prometheus:2.0.9
google-metadata-plugin:0.3.1
pipeline-model-extensions:1.8.1
xunit:3.0.0
workflow-cps:2.87
pubsub-light:1.13
matrix-auth:2.6.5
2.288 - Apr 13, 2021, 2.289 - Apr 20, 2021
We've been trying to track down some issues we've been seeing around Queue lock
contention on one of our Jenkins clusters. The lock contention manifests in
both UI instability/slowness and failures with REST API calls to add, update or
remove nodes. We use the Swarm plugin on the primary and swarm-client (version
3.24) on the agents to connect to the primary. The REST API failures aren't due
to exceptions from Jenkins, but to the API calls exceeding the configured
proxy_read_timeout (180s) for the nginx instance we have in front of Jenkins.
That manifests in the swarm-client process on the agents receiving a 504 from
nginx since Jenkins didn't respond in time.
Thread dumps gathered during periods of instability show that hundreds of
threads are waiting for the Queue lock to be able to add, update or remove
a node.
"Handling POST /plugin/swarm/createSlave from 10.224.1.234 : Jetty (winstone)-1218487" #1218487 prio=5 os_prio=0 tid=0x00007f4732c7f800 nid=0x2f96 waiting on condition [0x00007f3c6b5f2000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00007f3f117ca288> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at hudson.model.Queue._withLock(Queue.java:1441)
    [snip..]
The vast majority of the time, the thread holding the Queue lock during each
thread dump is performing operations within the Jenkins.trimLabels method as
part of adding, updating or removing a node.
"Handling POST /plugin/swarm/createSlave from 10.240.81.215 : Jetty (winstone)-1218362" #1218362 prio=5 os_prio=0 tid=0x00007f47265fd000 nid=0x2e11 runnable [0x00007f3c85ceb000]
   java.lang.Thread.State: RUNNABLE
    at hudson.util.QuotedStringTokenizer.hasMoreTokens(QuotedStringTokenizer.java:184)
    at hudson.model.Label.parse(Label.java:585)
    at hudson.model.Node.getAssignedLabels(Node.java:303)
    at hudson.model.Label.matches(Label.java:196)
    at hudson.model.Label.getNodes(Label.java:233)
    at hudson.model.Label.isEmpty(Label.java:430)
    at jenkins.model.Jenkins.trimLabels(Jenkins.java:2201)
    at jenkins.model.Nodes$4.call(Nodes.java:214)
    at jenkins.model.Nodes$4.call(Nodes.java:210)
    at hudson.model.Queue._withLock(Queue.java:1443)
    at hudson.model.Queue.withLock(Queue.java:1304)
    at jenkins.model.Nodes.updateNode(Nodes.java:210)
    at jenkins.model.Jenkins.updateNode(Jenkins.java:2176)
    at hudson.model.Node.save(Node.java:139)
    at hudson.model.Node.setTemporaryOfflineCause(Node.java:274)
    at hudson.model.Computer.setNode(Computer.java:820)
    at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:895)
    at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:137)
    at hudson.model.AbstractCIBase.access$000(AbstractCIBase.java:43)
    at hudson.model.AbstractCIBase$2.run(AbstractCIBase.java:223)
    at hudson.model.Queue._withLock(Queue.java:1384)
    at hudson.model.Queue.withLock(Queue.java:1261)
    at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:206)
    at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1632)
    at jenkins.model.Nodes$2.run(Nodes.java:151)
    at hudson.model.Queue._withLock(Queue.java:1384)
    at hudson.model.Queue.withLock(Queue.java:1261)
    at jenkins.model.Nodes.addNode(Nodes.java:147)
    at jenkins.model.Jenkins.addNode(Jenkins.java:2155)
    at hudson.plugins.swarm.PluginImpl.doCreateSlave(PluginImpl.java:224)
I've attached a couple of archives created using the collectPerformanceData
script that contain the relevant thread dumps.
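To illustrate why slow work under one shared lock shows up as hundreds of parked request threads, here is a minimal standalone Java sketch. It is not Jenkins code: QueueLockSketch and runUnderSharedLock are hypothetical names, and the single ReentrantLock only stands in for the Queue lock seen in the dumps above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class QueueLockSketch {
    /**
     * Runs `requests` tasks, each doing `work` while holding one shared lock.
     * Returns {completed tasks, maximum number of threads ever inside the lock}.
     */
    public static int[] runUnderSharedLock(int requests, Runnable work) {
        ReentrantLock lock = new ReentrantLock(); // stand-in for the Queue lock
        AtomicInteger inside = new AtomicInteger();
        AtomicInteger maxInside = new AtomicInteger();
        AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(requests);
        for (int i = 0; i < requests; i++) {
            pool.execute(() -> {
                lock.lock(); // every other request thread parks here
                try {
                    int now = inside.incrementAndGet();
                    maxInside.accumulateAndGet(now, Math::max);
                    work.run(); // the expensive part, e.g. a label trim
                    inside.decrementAndGet();
                    completed.incrementAndGet();
                } finally {
                    lock.unlock();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] {completed.get(), maxInside.get()};
    }

    public static void main(String[] args) {
        int[] r = runUnderSharedLock(50, () -> { });
        System.out.println("completed: " + r[0] + ", max threads inside lock: " + r[1]);
        // prints "completed: 50, max threads inside lock: 1"
    }
}
```

Because only one thread is ever inside the lock, the wait for the last request is roughly the sum of everyone else's hold times. If each holder needs about 1.9s, as in the multiple-label reproduction below, something like 95 queued requests would be enough for the last one to wait past a 180s proxy timeout.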
During the aforementioned periods of instability there are between 1,500 and
1,600 unique labels and between 400 and 500 workers, as gathered from the script
console using Jenkins.instance.labels.size() and Jenkins.instance.nodes.size().
I'm able to replicate the increasing slowness using Groovy scripts that mirror
what our worker creation steps look like. I've attached both scripts.
create-workers.groovy creates the workers, remove-workers.groovy removes
them. To match our swarm-client workflow, the scripts create SwarmSlave agents,
but that detail probably doesn't matter for reproduction purposes.
Creating and then removing workers with a single label is fast, as you'd
expect. Here's some snipped output for creation (full output attached as create-workers-single-label.log):
...
uniqueLabels: 395 nodes: 393 swarm-test-392: 63ms
uniqueLabels: 396 nodes: 394 swarm-test-393: 65ms
uniqueLabels: 397 nodes: 395 swarm-test-394: 87ms
uniqueLabels: 398 nodes: 396 swarm-test-395: 62ms
uniqueLabels: 399 nodes: 397 swarm-test-396: 62ms
uniqueLabels: 400 nodes: 398 swarm-test-397: 62ms
uniqueLabels: 401 nodes: 399 swarm-test-398: 63ms
uniqueLabels: 402 nodes: 400 swarm-test-399: 63ms
uniqueLabels: 403 nodes: 401 swarm-test-400: 64ms
Total time to create 400 workers: 9183ms
And then the same for removal (full output attached as remove-workers-single-label.log):
...
uniqueLabels: 10 nodes: 8 swarm-test-91: 0ms
uniqueLabels: 9 nodes: 7 swarm-test-92: 0ms
uniqueLabels: 8 nodes: 6 swarm-test-93: 1ms
uniqueLabels: 7 nodes: 5 swarm-test-94: 0ms
uniqueLabels: 6 nodes: 4 swarm-test-95: 1ms
uniqueLabels: 5 nodes: 3 swarm-test-96: 0ms
uniqueLabels: 4 nodes: 2 swarm-test-97: 1ms
uniqueLabels: 3 nodes: 1 swarm-test-98: 0ms
uniqueLabels: 1 nodes: 0 swarm-test-99: 1ms
Total time to remove 401 workers: 8675ms
But once you start adding more labels, things start slowing down drastically.
Here's some snipped output for creation (full output attached as create-workers-multiple-labels.log):
...
uniqueLabels: 809 nodes: 393 swarm-test-392: 1875ms
uniqueLabels: 811 nodes: 394 swarm-test-393: 1875ms
uniqueLabels: 813 nodes: 395 swarm-test-394: 1883ms
uniqueLabels: 815 nodes: 396 swarm-test-395: 1888ms
uniqueLabels: 817 nodes: 397 swarm-test-396: 1901ms
uniqueLabels: 819 nodes: 398 swarm-test-397: 1913ms
uniqueLabels: 821 nodes: 399 swarm-test-398: 1915ms
uniqueLabels: 823 nodes: 400 swarm-test-399: 1927ms
uniqueLabels: 825 nodes: 401 swarm-test-400: 1939ms
Total time to create 400 workers: 261866ms
And then the same for removal (full output attached as remove-workers-multiple-labels.log):
...
uniqueLabels: 39 nodes: 8 swarm-test-91: 3ms
uniqueLabels: 37 nodes: 7 swarm-test-92: 2ms
uniqueLabels: 35 nodes: 6 swarm-test-93: 2ms
uniqueLabels: 33 nodes: 5 swarm-test-94: 1ms
uniqueLabels: 31 nodes: 4 swarm-test-95: 1ms
uniqueLabels: 29 nodes: 3 swarm-test-96: 0ms
uniqueLabels: 27 nodes: 2 swarm-test-97: 0ms
uniqueLabels: 25 nodes: 1 swarm-test-98: 1ms
uniqueLabels: 1 nodes: 0 swarm-test-99: 0ms
Total time to remove 401 workers: 258555ms
Roughly doubling the number of unique labels (from about 400 to about 825)
makes the same process that originally took about 9s for each operation take
about 4 minutes and 20 seconds for each operation, a slowdown of roughly 28x
to 30x.
Is there some way to make Jenkins.trimLabels less expensive even in the
face of thousands of labels and hundreds of workers? To my eye it looks like
the current code path has several nested loops (an outer loop over every label,
then inner loops over every worker, every parsed token from the label
tokenizer, and every char in the raw label string), which is what drives the
increase in execution time as the inputs get larger.
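To make the shape of that cost concrete, here is a minimal standalone Java sketch. It is not the Jenkins implementation: TrimLabelsSketch, trimNaive and trimIndexed are hypothetical names, labels are plain whitespace-separated strings, and label expressions, quoting and clouds are ignored. The naive variant re-parses every node's label string for every known label, which is the pattern I'm assuming from the Label.isEmpty / Label.getNodes / Label.parse frames in the stack trace. The alternative parses each node once and keeps only the labels some node still carries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TrimLabelsSketch {
    /** Number of raw label strings tokenized so far; a rough proxy for cost. */
    public static long parseCount = 0;

    /** Splits a whitespace-separated label string into atoms (no quoting support). */
    public static Set<String> parse(String rawLabels) {
        parseCount++;
        String trimmed = rawLabels.trim();
        if (trimmed.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    /** Naive shape: for every known label, scan and re-parse every node. */
    public static Set<String> trimNaive(Set<String> knownLabels, List<String> nodeLabelStrings) {
        Set<String> kept = new HashSet<>();
        for (String label : knownLabels) {
            boolean used = false;
            for (String raw : nodeLabelStrings) {
                // No early exit: assumed to mirror a full scan over all nodes.
                if (parse(raw).contains(label)) {
                    used = true;
                }
            }
            if (used) {
                kept.add(label);
            }
        }
        return kept;
    }

    /** Alternative: parse each node once, then keep labels that are still in use. */
    public static Set<String> trimIndexed(Set<String> knownLabels, List<String> nodeLabelStrings) {
        Set<String> inUse = new HashSet<>();
        for (String raw : nodeLabelStrings) {
            inUse.addAll(parse(raw));
        }
        Set<String> kept = new HashSet<>(knownLabels);
        kept.retainAll(inUse);
        return kept;
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>();
        Set<String> known = new HashSet<>();
        for (int i = 0; i < 400; i++) {
            // Each hypothetical worker has two labels of its own plus a shared one.
            String a = "worker-" + i + "-a";
            String b = "worker-" + i + "-b";
            nodes.add(a + " " + b + " shared");
            known.add(a);
            known.add(b);
        }
        known.add("shared");
        known.add("stale"); // carried by no node, so it should be trimmed

        parseCount = 0;
        Set<String> naive = trimNaive(known, nodes);
        long naiveParses = parseCount;

        parseCount = 0;
        Set<String> indexed = trimIndexed(known, nodes);
        long indexedParses = parseCount;

        System.out.println("same result: " + naive.equals(indexed)); // true
        System.out.println("naive parses: " + naiveParses);          // 320800 (802 labels x 400 nodes)
        System.out.println("indexed parses: " + indexedParses);      // 400 (one per node)
    }
}
```

In this toy model the naive variant does labels x nodes parses per trim, while the indexed variant does one parse per node. Whether the same restructuring is safe in Jenkins itself depends on details the sketch leaves out, such as label expressions and cloud-provisioned labels.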
causes
JENKINS-67099 trimLabels calling Cloud#canProvision too often (Closed)
JENKINS-68055 load statistics not working for label expressions (regression in 2.289) (Closed)