
Jenkins.trimLabels gets increasingly slower as number of nodes and labels increase

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • core
    • 2.288 - Apr 13, 2021, 2.289 - Apr 20, 2021

      We've been trying to track down some issues we've been seeing around Queue lock
      contention on one of our Jenkins clusters. The lock contention manifests in
      both UI instability/slowness and failures with REST API calls to add, update or
      remove nodes. We use the Swarm plugin on the primary and swarm-client (version
      3.24) on the agents to connect to the primary. The REST API failures aren't due
      to exceptions from Jenkins, but to the API calls exceeding the configured
      proxy_read_timeout (180s) for the nginx instance we have in front of Jenkins.
      That manifests in the swarm-client process on the agents receiving a 504 from
      nginx since Jenkins didn't respond in time.

      Thread dumps gathered during periods of instability show that hundreds of
      threads are waiting for the Queue lock to be able to add, update or remove
      a node.

      "Handling POST /plugin/swarm/createSlave from 10.224.1.234 : Jetty (winstone)-1218487" #1218487 prio=5 os_prio=0 tid=0x00007f4732c7f800 nid=0x2f96 waiting on condition [0x00007f3c6b5f2000]
         java.lang.Thread.State: WAITING (parking)
             at sun.misc.Unsafe.park(Native Method)
             - parking to wait for  <0x00007f3f117ca288> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
             at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
             at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
             at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
             at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
             at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
             at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
             at hudson.model.Queue._withLock(Queue.java:1441)
      [snip..]
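
To illustrate the pile-up seen in the thread dump above, here is a minimal, self-contained sketch. It is not Jenkins code: the class name, the `withLock` helper and the sleep-based "work" are assumptions made for illustration only. It shows how request latency grows when many threads queue behind one `ReentrantLock` whose holder does slow work.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class QueueLockContentionSketch {
    // Hypothetical stand-in for the single shared Queue lock (not hudson.model.Queue).
    private static final ReentrantLock LOCK = new ReentrantLock();

    /** Runs the given work while holding the shared lock. */
    public static void withLock(Runnable work) {
        LOCK.lock();
        try {
            work.run();
        } finally {
            LOCK.unlock();
        }
    }

    /** Sleeps to simulate slow work done under the lock. */
    public static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Starts the given number of threads, each of which holds the lock for
     * holdMillis, and returns the longest time any thread needed to finish.
     */
    public static long worstCaseLatencyMillis(int threads, long holdMillis) {
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);
        long[] latencies = new long[threads];
        for (int i = 0; i < threads; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    start.await(); // release all threads at roughly the same time
                    long begin = System.nanoTime();
                    withLock(() -> sleepQuietly(holdMillis));
                    latencies[id] = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - begin);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }).start();
        }
        start.countDown();
        try {
            done.await(); // also makes the latency writes visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long worst = 0;
        for (long latency : latencies) {
            worst = Math.max(worst, latency);
        }
        return worst;
    }

    public static void main(String[] args) {
        System.out.println("worst-case latency: " + worstCaseLatencyMillis(10, 20) + "ms");
    }
}
```

Because the lock serializes the work, the last thread in line waits for roughly the sum of everyone else's hold times. This is the same shape as hundreds of `createSlave` requests waiting behind one slow lock holder until the proxy timeout expires.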
      

      The vast majority of the time, the thread holding the Queue lock during each
      thread dump is performing operations within the Jenkins.trimLabels method as
      part of adding, updating or removing a node.

      "Handling POST /plugin/swarm/createSlave from 10.240.81.215 : Jetty (winstone)-1218362" #1218362 prio=5 os_prio=0 tid=0x00007f47265fd000 nid=0x2e11 runnable [0x00007f3c85ceb000]
         java.lang.Thread.State: RUNNABLE
              at hudson.util.QuotedStringTokenizer.hasMoreTokens(QuotedStringTokenizer.java:184)
              at hudson.model.Label.parse(Label.java:585)
              at hudson.model.Node.getAssignedLabels(Node.java:303)
              at hudson.model.Label.matches(Label.java:196)
              at hudson.model.Label.getNodes(Label.java:233)
              at hudson.model.Label.isEmpty(Label.java:430)
              at jenkins.model.Jenkins.trimLabels(Jenkins.java:2201)
              at jenkins.model.Nodes$4.call(Nodes.java:214)
              at jenkins.model.Nodes$4.call(Nodes.java:210)
              at hudson.model.Queue._withLock(Queue.java:1443)
              at hudson.model.Queue.withLock(Queue.java:1304)
              at jenkins.model.Nodes.updateNode(Nodes.java:210)
              at jenkins.model.Jenkins.updateNode(Jenkins.java:2176)
              at hudson.model.Node.save(Node.java:139)
              at hudson.model.Node.setTemporaryOfflineCause(Node.java:274)
              at hudson.model.Computer.setNode(Computer.java:820)
              at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:895)
              at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:137)
              at hudson.model.AbstractCIBase.access$000(AbstractCIBase.java:43)
              at hudson.model.AbstractCIBase$2.run(AbstractCIBase.java:223)
              at hudson.model.Queue._withLock(Queue.java:1384)
              at hudson.model.Queue.withLock(Queue.java:1261)
              at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:206)
              at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1632)
              at jenkins.model.Nodes$2.run(Nodes.java:151)
              at hudson.model.Queue._withLock(Queue.java:1384)
              at hudson.model.Queue.withLock(Queue.java:1261)
              at jenkins.model.Nodes.addNode(Nodes.java:147)
              at jenkins.model.Jenkins.addNode(Jenkins.java:2155)
              at hudson.plugins.swarm.PluginImpl.doCreateSlave(PluginImpl.java:224)
      

      I've attached a couple of archives created using the collectPerformanceData
      script that contain the relevant thread dumps.

      During the aforementioned periods of instability there are between 1500 and
      1600 unique labels and between 400 and 500 workers, as gathered from the
      script console using Jenkins.instance.labels.size() and
      Jenkins.instance.nodes.size().

      I'm able to replicate the increasing slowness using Groovy scripts that mirror
      what our worker creation steps look like. I've attached both scripts.
      create-workers.groovy creates the workers, remove-workers.groovy removes
      them. To make it match our swarm-client workflow we create SwarmSlave agents
      in the script but that detail probably doesn't matter for reproduction
      purposes.

      Creating and then removing workers with a single label is fast, as you'd
      expect. Here's some snipped output for creation (full output attached as create-workers-single-label.log):

      ...
      uniqueLabels: 395 nodes: 393 swarm-test-392: 63ms
      uniqueLabels: 396 nodes: 394 swarm-test-393: 65ms
      uniqueLabels: 397 nodes: 395 swarm-test-394: 87ms
      uniqueLabels: 398 nodes: 396 swarm-test-395: 62ms
      uniqueLabels: 399 nodes: 397 swarm-test-396: 62ms
      uniqueLabels: 400 nodes: 398 swarm-test-397: 62ms
      uniqueLabels: 401 nodes: 399 swarm-test-398: 63ms
      uniqueLabels: 402 nodes: 400 swarm-test-399: 63ms
      uniqueLabels: 403 nodes: 401 swarm-test-400: 64ms
      Total time to create 400 workers: 9183ms
      

      And then the same for removal (full output attached as remove-workers-single-label.log):

      ...
      uniqueLabels: 10 nodes: 8 swarm-test-91: 0ms
      uniqueLabels: 9 nodes: 7 swarm-test-92: 0ms
      uniqueLabels: 8 nodes: 6 swarm-test-93: 1ms
      uniqueLabels: 7 nodes: 5 swarm-test-94: 0ms
      uniqueLabels: 6 nodes: 4 swarm-test-95: 1ms
      uniqueLabels: 5 nodes: 3 swarm-test-96: 0ms
      uniqueLabels: 4 nodes: 2 swarm-test-97: 1ms
      uniqueLabels: 3 nodes: 1 swarm-test-98: 0ms
      uniqueLabels: 1 nodes: 0 swarm-test-99: 1ms
      Total time to remove 401 workers: 8675ms
      

      But once you start adding more labels, things start slowing down drastically.
      Here's some snipped output for creation (full output attached as create-workers-multiple-labels.log):

      ...
      uniqueLabels: 809 nodes: 393 swarm-test-392: 1875ms
      uniqueLabels: 811 nodes: 394 swarm-test-393: 1875ms
      uniqueLabels: 813 nodes: 395 swarm-test-394: 1883ms
      uniqueLabels: 815 nodes: 396 swarm-test-395: 1888ms
      uniqueLabels: 817 nodes: 397 swarm-test-396: 1901ms
      uniqueLabels: 819 nodes: 398 swarm-test-397: 1913ms
      uniqueLabels: 821 nodes: 399 swarm-test-398: 1915ms
      uniqueLabels: 823 nodes: 400 swarm-test-399: 1927ms
      uniqueLabels: 825 nodes: 401 swarm-test-400: 1939ms
      Total time to create 400 workers: 261866ms
      

      And then the same for removal (full output attached as remove-workers-multiple-labels.log):

      ...
      uniqueLabels: 39 nodes: 8 swarm-test-91: 3ms
      uniqueLabels: 37 nodes: 7 swarm-test-92: 2ms
      uniqueLabels: 35 nodes: 6 swarm-test-93: 2ms
      uniqueLabels: 33 nodes: 5 swarm-test-94: 1ms
      uniqueLabels: 31 nodes: 4 swarm-test-95: 1ms
      uniqueLabels: 29 nodes: 3 swarm-test-96: 0ms
      uniqueLabels: 27 nodes: 2 swarm-test-97: 0ms
      uniqueLabels: 25 nodes: 1 swarm-test-98: 1ms
      uniqueLabels: 1 nodes: 0 swarm-test-99: 0ms
      Total time to remove 401 workers: 258555ms
      

      Roughly doubling the number of unique labels (from about 400 to about 825)
      makes the same process that originally took about 9s for each operation take
      about 4 minutes and 20 seconds for each operation.
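
As a quick check of the arithmetic, the sketch below computes the slowdown factors from the totals in the logs above. The class and method names are made up for illustration; the only inputs are the totals and label counts quoted in this report.

```java
public class SlowdownRatio {
    /** Returns how many times larger the first value is than the second. */
    public static double ratio(long larger, long smaller) {
        return (double) larger / smaller;
    }

    public static void main(String[] args) {
        // Totals (ms) from the multiple-label and single-label runs above.
        System.out.printf("create slowdown: %.1fx%n", ratio(261866, 9183));
        System.out.printf("remove slowdown: %.1fx%n", ratio(258555, 8675));
        // Unique label counts at the end of each creation run.
        System.out.printf("label growth:    %.2fx%n", ratio(825, 403));
    }
}
```

So about twice as many unique labels costs close to 30 times the total time in these runs, which suggests the cost grows much faster than linearly with the label count. Note that the two runs also differ in how many labels each worker carries, so this is not a clean single-variable measurement.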

      Is there some way to make Jenkins.trimLabels less expensive even in the
      face of thousands of labels and hundreds of workers? To my eye, the
      current code path has several nested loops (an outer loop over every label,
      then inner loops over every worker, every token parsed by the label
      tokenizer, and every character in the raw label string), and these are what
      drive the increase in execution time as the inputs get larger.
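
The nested-loop shape described above can be sketched in isolation. This is not the Jenkins implementation and not necessarily what any fix does: `TrimLabelsSketch`, `parse`, `trimNaive` and `trimIndexed` are hypothetical names, and labels are modelled as plain whitespace-separated strings. The sketch counts how often a worker's label string is re-parsed when every label scans every worker, compared with parsing each worker once.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TrimLabelsSketch {
    /** Counts how many times a worker's raw label string is tokenized. */
    public static long parseCount = 0;

    /** Hypothetical stand-in for parsing a whitespace-separated label string. */
    public static Set<String> parse(String rawLabels) {
        parseCount++;
        return new HashSet<>(Arrays.asList(rawLabels.trim().split("\\s+")));
    }

    /** Naive shape: for every label, re-parse every worker's label string. */
    public static Set<String> trimNaive(Set<String> labels, List<String> workerLabelStrings) {
        Set<String> kept = new HashSet<>();
        for (String label : labels) {
            for (String raw : workerLabelStrings) {
                // No early exit: this mirrors a full scan over all workers per label.
                if (parse(raw).contains(label)) {
                    kept.add(label);
                }
            }
        }
        return kept;
    }

    /** Alternative: parse each worker once, then keep only labels still in use. */
    public static Set<String> trimIndexed(Set<String> labels, List<String> workerLabelStrings) {
        Set<String> inUse = new HashSet<>();
        for (String raw : workerLabelStrings) {
            inUse.addAll(parse(raw));
        }
        Set<String> kept = new HashSet<>(labels);
        kept.retainAll(inUse);
        return kept;
    }

    public static void main(String[] args) {
        Set<String> labels = new HashSet<>(Arrays.asList("linux", "docker", "gpu", "stale"));
        List<String> workers = Arrays.asList("linux docker", "linux gpu");

        parseCount = 0;
        Set<String> naive = trimNaive(labels, workers);
        System.out.println("naive kept " + naive.size() + " labels, parses: " + parseCount);

        parseCount = 0;
        Set<String> indexed = trimIndexed(labels, workers);
        System.out.println("indexed kept " + indexed.size() + " labels, parses: " + parseCount);
    }
}
```

In the naive version the number of parses is labels × workers, so with the 825 labels and 401 workers from the reproduction above a single trim would tokenize label strings 330,825 times, against 401 times when each worker is parsed once. Both versions keep the same set of labels.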

          [JENKINS-65308] Jenkins.trimLabels gets increasingly slower as number of nodes and labels increase

          Jonah Bull created issue -
          Jonah Bull made changes -
          Description: updated to add the missing closing parentheses after the attached log file names
          Raihaan Shouhell made changes -
          Assignee New: Raihaan Shouhell [ raihaan ]
          Raihaan Shouhell made changes -
          Remote Link New: This issue links to "PR-5402 (Web Link)" [ 26612 ]
          Raihaan Shouhell made changes -
          Status Original: Open [ 1 ] New: In Progress [ 3 ]
          Raihaan Shouhell made changes -
          Status Original: In Progress [ 3 ] New: In Review [ 10005 ]

          Jonah Bull added a comment -

          Really appreciate the quick response to this issue! After the linked PR was
          merged I built jenkins.war from master locally and built a jdk11 Docker
          image based on that WAR file using the Makefile from the Jenkins docker
          repo. Unfortunately, in my tests the behavior I described looks worse, not
          better, after these changes.

          Here's some snipped output for creation using latest master:

          jenkins@3fd5d3f4801e:~$ tail create-workers-with-labels.log
          uniqueLabels: 811 nodes: 393 swarm-test-392: 1019ms
          uniqueLabels: 813 nodes: 394 swarm-test-393: 949ms
          uniqueLabels: 815 nodes: 395 swarm-test-394: 2334ms
          uniqueLabels: 817 nodes: 396 swarm-test-395: 2048ms
          uniqueLabels: 819 nodes: 397 swarm-test-396: 996ms
          uniqueLabels: 821 nodes: 398 swarm-test-397: 972ms
          uniqueLabels: 823 nodes: 399 swarm-test-398: 1036ms
          uniqueLabels: 825 nodes: 400 swarm-test-399: 1096ms
          uniqueLabels: 827 nodes: 401 swarm-test-400: 997ms
          Total time to create 400 workers: 361632ms
          

          And then the same for removal:

          jenkins@3fd5d3f4801e:~$ tail remove-workers-with-labels.log
          uniqueLabels: 41 nodes: 8 swarm-test-91: 3ms
          uniqueLabels: 39 nodes: 7 swarm-test-92: 2ms
          uniqueLabels: 37 nodes: 6 swarm-test-93: 2ms
          uniqueLabels: 35 nodes: 5 swarm-test-94: 3ms
          uniqueLabels: 33 nodes: 4 swarm-test-95: 2ms
          uniqueLabels: 31 nodes: 3 swarm-test-96: 2ms
          uniqueLabels: 29 nodes: 2 swarm-test-97: 2ms
          uniqueLabels: 27 nodes: 1 swarm-test-98: 2ms
          uniqueLabels: 3 nodes: 0 swarm-test-99: 1ms
          Total time to remove 401 workers: 391983ms
          

          In contrast, here's the snipped output for creation using the 2.276 docker
          image:

          jenkins@58c6fe073fc5:~$ tail create-workers-with-labels.log
          uniqueLabels: 811 nodes: 393 swarm-test-392: 3430ms
          uniqueLabels: 813 nodes: 394 swarm-test-393: 3445ms
          uniqueLabels: 815 nodes: 395 swarm-test-394: 1824ms
          uniqueLabels: 817 nodes: 396 swarm-test-395: 1619ms
          uniqueLabels: 819 nodes: 397 swarm-test-396: 1682ms
          uniqueLabels: 821 nodes: 398 swarm-test-397: 1665ms
          uniqueLabels: 823 nodes: 399 swarm-test-398: 1652ms
          uniqueLabels: 825 nodes: 400 swarm-test-399: 1676ms
          uniqueLabels: 827 nodes: 401 swarm-test-400: 1668ms
          Total time to create 400 workers: 234127ms
          

          And then the same for removal:

          jenkins@58c6fe073fc5:~$ tail remove-workers-with-labels.log
          uniqueLabels: 41 nodes: 8 swarm-test-91: 5ms
          uniqueLabels: 39 nodes: 7 swarm-test-92: 4ms
          uniqueLabels: 37 nodes: 6 swarm-test-93: 4ms
          uniqueLabels: 35 nodes: 5 swarm-test-94: 3ms
          uniqueLabels: 33 nodes: 4 swarm-test-95: 2ms
          uniqueLabels: 31 nodes: 3 swarm-test-96: 2ms
          uniqueLabels: 29 nodes: 2 swarm-test-97: 2ms
          uniqueLabels: 27 nodes: 1 swarm-test-98: 2ms
          uniqueLabels: 3 nodes: 0 swarm-test-99: 1ms
          Total time to remove 401 workers: 262418ms
          

          This is all using the Groovy scripts attached to this issue. I'll dig some more
          tomorrow and see if I can provide some further data.
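
          For a rough sense of scale, the four totals quoted above can be compared
          directly. This is a minimal sketch in Python, not one of the attached Groovy
          scripts; the names totals_ms and slowdown are illustrative only, and the
          numbers are copied from the "Total time" lines of the log tails.

```python
# Totals copied from the "Total time to ..." log lines above (milliseconds).
# "master" is the locally built WAR; "2.276" is the 2.276 Docker image.
totals_ms = {
    "create": {"master": 361632, "2.276": 234127},
    "remove": {"master": 391983, "2.276": 262418},
}


def slowdown(master_ms: int, baseline_ms: int) -> float:
    """Return how many times longer the master run took than the baseline run."""
    return master_ms / baseline_ms


ratios = {op: slowdown(t["master"], t["2.276"]) for op, t in totals_ms.items()}

for op, ratio in ratios.items():
    print(f"{op}: master took {ratio:.2f}x as long as 2.276")
# prints:
# create: master took 1.54x as long as 2.276
# remove: master took 1.49x as long as 2.276
```

          This only compares the overall totals. It says nothing about where the extra
          time is spent, since the tails show just the last few nodes of each run.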

          Comment added by Jonah Bull.

          Raihaan Shouhell added a comment:

          Hey Jonah, thanks for the feedback. I'll probably write another PR, which you can hopefully test.

          In my limited testing this improved things, but it doesn't seem to have helped you.
          Raihaan Shouhell made changes -
          Status Original: In Review [ 10005 ] New: In Progress [ 3 ]
          Raihaan Shouhell made changes -
          Status Original: In Progress [ 3 ] New: Open [ 1 ]

            raihaan Raihaan Shouhell
            jonahbull Jonah Bull
            Votes: 3
            Watchers: 9
