[JENKINS-27650] Page loads slow with hundreds of throttled builds in queue

      When there are hundreds of throttled builds in the queue, page load times increase by an order of magnitude.

      Steps to reproduce:

      1. Run Jenkins 1.580.2 with the latest throttle-concurrent-builds plugin.
      2. Create a matrix job with 200 combinations (configuration attached).
      3. In the same job, enable "Throttle Concurrent Builds", throttling as part of a category called 'semaphore' with a maximum of 7 concurrent builds.
      4. Set the number of executors on the 'master' node to 200.
      5. Run the job. Only 7 builds should be running, due to the throttling.

      Page load times increase by an order of magnitude; I observed 10 seconds when running:

      time curl http://localhost:8080/jenkins/ajaxBuildQueue

      If you remove the throttling in the job configuration, the page load times will be under 50 ms.

          Jesse Glick added a comment -

          Will see if https://github.com/jenkinsci/throttle-concurrent-builds-plugin/pull/26 can be improved upon. From thread dumps it is clear that this plugin is to blame for wasting resources. Yes, 1.607+ will no longer block the web UI due to problems like this, but the backend will still be excessively loaded.

          Ryan Campbell added a comment -

          The referenced job configuration.

          Jesse Glick added a comment -

          Turns out those steps to reproduce do not work after all.

          Oleg Nenashev added a comment -

          @Jesse
          The issue appears when there is heavy job traffic through the queue (real submissions rather than just canTake() calls). I would recommend:

          • decreasing the execution time of test jobs (e.g. to 1 second) and increasing the build queue to about 10k;
          • reconfiguring throttling policies to support many submissions at once.

          Jesse Glick added a comment -

          To reproduce: clone the attached repo and, inside it, run

          docker build -t jenkins-27650 .
          docker run -p 8080:8080 jenkins-27650

          and then, from http://localhost:8080/, click the Build button next to runme. Jenkins will quickly become unresponsive.

          Jesse Glick added a comment -

          Experimenting, no luck so far.

          Jesse Glick added a comment -

          JENKINS-19623 apparently was not enough.

          Jesse Glick added a comment -

          Tried various things in PR 27, but as explained there, the result is not satisfactory. I suspect JENKINS-27708 needs to be addressed first.

          My fear is that the current basic design of ThrottleQueueTaskDispatcher just cannot be made to scale well. I wonder if it would be better to invert the logic: implement ExecutorListener (as a second extension) to track what is running in each category, keeping a map from nodes to a histogram of task counts running by category (WeakHashMap<Node,HashMap<String,Integer>>?). Then canTake/canRun would only need to look up configuration for the proposed job, and do a table lookup to see the current count and compare that to the configured limit.
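The following is a minimal Java sketch of the inverted design described above. It keeps running counts per node and per category, so a canTake-style check becomes a map lookup instead of a scan over executors. Everything here is hypothetical:

- plain String node and category names stand in for Jenkins Node objects (the comment suggests a WeakHashMap keyed by Node);
- only a per-node limit is modelled;
- how taskStarted/taskCompleted would be hooked into Jenkins is left open.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the inverted design: keep a histogram of running
// tasks per node and category instead of scanning executors on every check.
// String keys stand in for Jenkins Node/category objects; this is not plugin API.
class CategoryCounts {
    // node name -> (category name -> number of running tasks)
    private final Map<String, Map<String, Integer>> running = new HashMap<>();

    /** Hypothetical hook: called when a task in the category starts on the node. */
    synchronized void taskStarted(String node, String category) {
        running.computeIfAbsent(node, n -> new HashMap<>()).merge(category, 1, Integer::sum);
    }

    /** Hypothetical hook: called when such a task finishes, whatever its result. */
    synchronized void taskCompleted(String node, String category) {
        Map<String, Integer> histogram = running.get(node);
        if (histogram == null) {
            return; // nothing recorded for this node
        }
        // Decrement; returning null removes the entry once the count reaches zero.
        histogram.computeIfPresent(category, (c, n) -> n > 1 ? n - 1 : null);
        if (histogram.isEmpty()) {
            running.remove(node);
        }
    }

    /** Running tasks in the category on the node: a table lookup, not an executor scan. */
    synchronized int runningOn(String node, String category) {
        return running.getOrDefault(node, Map.of()).getOrDefault(category, 0);
    }

    /** canTake-style check: may one more task in the category start on the node? */
    synchronized boolean canTake(String node, String category, int maxPerNode) {
        return runningOn(node, category) < maxPerNode;
    }

    public static void main(String[] args) {
        CategoryCounts counts = new CategoryCounts();
        for (int i = 0; i < 7; i++) {
            counts.taskStarted("master", "semaphore");
        }
        System.out.println(counts.canTake("master", "semaphore", 7)); // false
        counts.taskCompleted("master", "semaphore");
        System.out.println(counts.canTake("master", "semaphore", 7)); // true
    }
}
```

With a 7-build limit on 'semaphore', each check is one lookup regardless of how many executors or queued builds exist. The cost moves to keeping the counts exact, which depends on reliable start and finish notifications.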

          I am not sure how that would relate to JENKINS-27708. ExecutorListener seems to be called with the Queue lock held, which is good, but that problem seems to stem from QueueTaskDispatcher being asked to make decisions about multiple jobs before any of them are actually scheduled. The call to taskAccepted does come from new WorkUnitContext, within maintain, so the question is whether this is interleaved with QueueTaskDispatcher calls, or after all of them have completed.
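Suppose the dispatcher answers for several queue items in one maintenance pass before any of them is recorded as running. In that case, a check against the running count alone can over-admit. This hypothetical Java sketch (illustrative names and numbers, not Jenkins API) compares that with counting admissions made earlier in the same pass as reservations. Whether Jenkins core actually interleaves these calls is the open question in the comment; the sketch only shows what happens if it does not.

```java
// Hypothetical illustration of the batching hazard: several queue items are
// checked against a limit in one pass before any of them is recorded as running.
class BatchAdmission {
    /** Checks only the count of tasks already running; returns tasks running after the pass. */
    static int admitWithoutReservation(int alreadyRunning, int limit, int candidates) {
        int admitted = 0;
        for (int i = 0; i < candidates; i++) {
            // The running count is not updated until after the pass,
            // so every candidate sees the same value.
            if (alreadyRunning < limit) {
                admitted++;
            }
        }
        return alreadyRunning + admitted;
    }

    /** Also counts admissions made earlier in the same pass; returns tasks running after the pass. */
    static int admitWithReservation(int alreadyRunning, int limit, int candidates) {
        int reserved = 0;
        for (int i = 0; i < candidates; i++) {
            if (alreadyRunning + reserved < limit) {
                reserved++;
            }
        }
        return alreadyRunning + reserved;
    }

    public static void main(String[] args) {
        // 5 of 7 slots in use, 10 queued candidates considered in one pass.
        System.out.println(admitWithoutReservation(5, 7, 10)); // 15
        System.out.println(admitWithReservation(5, 7, 10));    // 7
    }
}
```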

          Oleg Nenashev added a comment -

          /** Update to the previous comment:

          • ExecutorListener is not an extension point, so we cannot make this approach work.
          • There are no listeners in Jenkins core that could reliably deliver this information.
            */

          I've tried to introduce light-weight off-the-queue caching in PR #28. The result was not satisfactory either: canTake() performance improves by up to 10 times in my local benchmarks, but that is still not enough to resolve the issue.
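This is a rough sketch of what light-weight caching of an expensive count could look like, assuming a time-to-live (TTL) policy. It is not the code from PR #28, and all names are hypothetical. An expensive per-category count (for instance, one computed by scanning executors) is recomputed at most once per TTL. Repeated canTake-style checks in between reuse the cached value.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;
import java.util.function.ToIntFunction;

// Hypothetical sketch: cache an expensive per-category count for a fixed TTL.
// Not taken from Jenkins or plugin code; all names are illustrative.
class CachedCategoryCount {
    /** A cached value together with the time it was computed. */
    private static final class Entry {
        final int value;
        final long computedAtMillis;

        Entry(int value, long computedAtMillis) {
            this.value = value;
            this.computedAtMillis = computedAtMillis;
        }
    }

    private final ToIntFunction<String> expensiveCount; // e.g. a scan over executors
    private final LongSupplier clockMillis;              // e.g. System::currentTimeMillis
    private final long ttlMillis;
    private final Map<String, Entry> cache = new HashMap<>();

    CachedCategoryCount(ToIntFunction<String> expensiveCount, LongSupplier clockMillis, long ttlMillis) {
        this.expensiveCount = expensiveCount;
        this.clockMillis = clockMillis;
        this.ttlMillis = ttlMillis;
    }

    /** Returns the cached count, recomputing only if missing or at least ttlMillis old. */
    synchronized int count(String category) {
        long now = clockMillis.getAsLong();
        Entry entry = cache.get(category);
        if (entry == null || now - entry.computedAtMillis >= ttlMillis) {
            entry = new Entry(expensiveCount.applyAsInt(category), now);
            cache.put(category, entry);
        }
        return entry.value;
    }

    public static void main(String[] args) {
        int[] scans = {0};
        long[] now = {0};
        CachedCategoryCount cached = new CachedCategoryCount(
                category -> { scans[0]++; return 7; }, () -> now[0], 100);
        for (int i = 0; i < 1000; i++) {
            cached.count("semaphore");
        }
        System.out.println(scans[0]); // 1: one scan served 1000 checks
        now[0] = 150;
        cached.count("semaphore");
        System.out.println(scans[0]); // 2: the entry expired and was recomputed
    }
}
```

The trade-off is staleness. Within the TTL, a cached count can lag behind what is actually running, so a limit check based on it may admit too many builds unless extra synchronisation is added.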

          We could somehow merge PRs #27 and #28, but I'm afraid the solution would stay unreliable. Additional synchronisation would be required in that case, so scheduling behaviour would be affected by the injected quietTimes.

          Hacking the load balancer could help, but it would conflict with other plugins.

          Oleg Nenashev added a comment -

          Not in progress anymore.

          Some improvements have been integrated into the plugin, but they are not enough IMHO.

            Assignee: Unassigned
            Reporter: Ryan Campbell (recampbell)
            Votes: 12
            Watchers: 20