
Massive Jenkins slowdown when jobs in Queue (due to Queue.maintain())

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: core
    • Environment: Ubuntu 12.04
      Jenkins 1.509.3
      Up-to-date plugins

      As soon as more than a handful of builds get queued, the entire GUI slows to a crawl.

      The reason is that the executor thread running Queue.maintain() holds the exclusive lock on the queue while it runs a very time-consuming loop that builds the list of applicable hosts matching a certain label.

      As a result, every Jenkins GUI page and every method that needs access to the queue is delayed by ~30 seconds, and the delay grows with the number of queued builds because Queue.maintain() is called more often.

      The server only becomes responsive again once the entire queue is empty. Setting the server to "shutdown now" does not help.

      A typical stack trace when this occurs looks like this (the first is from /threadDump, the second from jstack at a different time):
      {{
      "Executor #6 for musxbird038" prio=10 tid=0x00007fe108024800 nid=0x7008 runnable [0x00007fe0f5a99000]
         java.lang.Thread.State: RUNNABLE
          at hudson.model.Slave.getLabelString(Slave.java:245)
          at hudson.model.Node.getAssignedLabels(Node.java:241)
          at hudson.model.Label.matches(Label.java:168)
          at hudson.model.Label.getNodes(Label.java:193)
          at hudson.model.Label.contains(Label.java:405)
          at hudson.model.Node.canTake(Node.java:322)
          at hudson.model.Queue$JobOffer.canTake(Queue.java:250)
          at hudson.model.Queue.maintain(Queue.java:1032)
          - locked <0x00000000e01d3490> (a hudson.model.Queue)
          at hudson.model.Queue.pop(Queue.java:863)
          - locked <0x00000000e01d3490> (a hudson.model.Queue)
          at hudson.model.Executor.grabJob(Executor.java:285)
          at hudson.model.Executor.run(Executor.java:206)
          - locked <0x00000000e01d3490> (a hudson.model.Queue)

      "Executor #0 for musxbird006" Id=591 Group=main RUNNABLE
          at java.util.TreeMap.successor(TreeMap.java:1975)
          at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1101)
          at java.util.TreeMap$KeyIterator.next(TreeMap.java:1154)
          at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1010)
          at hudson.model.Label$2.resolve(Label.java:159)
          at hudson.model.Label$2.resolve(Label.java:157)
          at hudson.model.labels.LabelAtom.matches(LabelAtom.java:149)
          at hudson.model.labels.LabelExpression$Binary.matches(LabelExpression.java:124)
          at hudson.model.Label.matches(Label.java:157)
          at hudson.model.Label.matches(Label.java:168)
          at hudson.model.Label.getNodes(Label.java:193)
          at hudson.model.Label.contains(Label.java:405)
          at hudson.model.Node.canTake(Node.java:322)
          at hudson.model.Queue$JobOffer.canTake(Queue.java:250)
          at hudson.model.Queue.maintain(Queue.java:1032)
          - locked hudson.model.Queue@2962c1e0
          at hudson.model.Queue.pop(Queue.java:863)
          - locked hudson.model.Queue@2962c1e0
          at hudson.model.Executor.grabJob(Executor.java:285)
          at hudson.model.Executor.run(Executor.java:206)
          - locked hudson.model.Queue@2962c1e0
      }}

      As you can see, Queue.maintain() does finish successfully, but it takes more than 30 seconds to do so. The server does not stop working and will return to normal once the queue has been fully processed.

      We have ~20 nodes with 12 executor slots each (= 240 executor threads). There is an equal number of jobs running, but not all of them consume CPU time on the host (most are idling and waiting for certain events).
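To make the cost concrete, here is a rough toy model of the call pattern that the stack traces suggest: for every queued item and every executor offer, the list of nodes matching the label is recomputed by scanning all nodes, and all of it happens while the queue monitor is held. This is not Jenkins code. The class `QueueCostSketch`, its methods, and the one-label-per-node simplification are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (NOT Jenkins code) of the nested work seen in the stack traces. */
class QueueCostSketch {
    // Assumption: one label per node, to keep the model small.
    private final List<String> nodeLabels = new ArrayList<>();
    private final int executorsPerNode;
    private long labelChecks = 0;

    QueueCostSketch(int nodes, int executorsPerNode) {
        this.executorsPerNode = executorsPerNode;
        for (int i = 0; i < nodes; i++) {
            nodeLabels.add(i % 2 == 0 ? "linux" : "windows");
        }
    }

    /** Stands in for the Label.getNodes() frame: scans every node for a label match. */
    private List<Integer> nodesMatching(String label) {
        List<Integer> result = new ArrayList<>();
        for (int n = 0; n < nodeLabels.size(); n++) {
            labelChecks++;
            if (nodeLabels.get(n).equals(label)) {
                result.add(n);
            }
        }
        return result;
    }

    /** Stands in for Node.canTake(): recomputes the matching node list on every call. */
    private boolean canTake(int node, String label) {
        return nodesMatching(label).contains(node);
    }

    /** One maintenance pass; the monitor is held for the whole nested loop. */
    synchronized long maintain(List<String> queuedLabels) {
        labelChecks = 0;
        for (String label : queuedLabels) {
            for (int node = 0; node < nodeLabels.size(); node++) {
                for (int e = 0; e < executorsPerNode; e++) {
                    canTake(node, label);
                }
            }
        }
        return labelChecks;
    }
}
```

With the figures from this report (20 nodes, 12 executors each) and five queued items, a single pass in this model performs 5 x 240 x 20 = 24,000 label checks under the lock. The real per-check cost and the exact loop structure in Jenkins core may differ.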

      This issue has occurred since upgrading from 1.509.1 to 1.509.3.

      Thanks in advance.

          [JENKINS-20046] Massive Jenkins slowdown when jobs in Queue (due to Queue.maintain())

          Martin Schröder created issue -

          Daniel Beck added a comment -

          Do you still experience this issue? Are you still on 1.509.3? What happens when you temporarily reduce the online executor count so the total is around 50?

          Cannot reproduce on 1.532.3 with the queue sometimes ~7000 items long. So either it's resolved, or it really seems to be more related to the number of executors (and possibly labels/nodes) you have, of which I only have ~15 and 4 respectively. Another instance has ~14 nodes with ~40 executors total, and sometimes queues of up to 10 items, and also doesn't have that problem. Same Jenkins version.


          Oleg Nenashev added a comment -

          The queue handling time depends on the extensions you use: QueueTaskDispatchers, JobProperties, Nodes, ...
          For example, there was a severe performance issue in the Throttle Concurrent Builds Plugin several months ago.
          Operations like canTake() lock the queue and may take a long time, hence the issue is valid.

          BTW, the only way to fix it inside the core is to implement support for multiple queues within the core (there was a feature request for this case). It would be useful for features like priorities, but the implementation in the current core seems to be extremely hard.

          @Martin
          If the issue is still relevant for you, please provide a list of the plugins you use.


          Martin Schröder added a comment -

          Hi Daniel, hi Oleg.

          We are currently using Jenkins 1.509.4 on our servers. The servers have up to 50 build hosts connected to them, with roughly 1000 executors spawned in total.

          During peaks in test execution load, a significant queue can accumulate. During these times, we found that even a handful of builds in the queue could lock up the servers so badly that they did not start new builds in a timely manner, causing the queue to grow ever longer and thus worsening the problem.

          In short: (High number of executors = Expensive Queue Maintenance) + Queued builds = Constant synchronization on Queue.

          This also causes the entire GUI to become unresponsive, as it waits for the queue to be free before rendering the "nodes" side-panel.

          We are not currently experiencing this issue because we are running a modified version of Jenkins in which the majority of the "synchronized" calls have been replaced with read/write locks that allow several methods to access the queue simultaneously.
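As a rough illustration of the read/write-lock idea described above, the sketch below guards a small queue with `java.util.concurrent.locks.ReentrantReadWriteLock` instead of `synchronized`. It is not the code from the linked branches. The class `RwLockedQueue` and its methods are hypothetical and only show the general technique: readers share the lock, while mutations take it exclusively.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical queue guarded by a read/write lock (not the actual Jenkins patch). */
class RwLockedQueue {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> items = new ArrayList<>();

    /** Mutation: needs the exclusive write lock. */
    void add(String item) {
        lock.writeLock().lock();
        try {
            items.add(item);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Mutation: removes and returns the head, or null if the queue is empty. */
    String poll() {
        lock.writeLock().lock();
        try {
            return items.isEmpty() ? null : items.remove(0);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Read-only view for UI rendering: many threads may hold the read lock at once. */
    List<String> snapshot() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(items);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Readers such as page renderers no longer block each other, but they still wait while a writer holds the lock, so the benefit depends on how much of the expensive work can run under the read lock. A snapshot can also be slightly out of date by the time it is rendered.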

          You can find the patches against 1.509.4 and 1.554.1 here:

          https://github.com/HedAurabesh/jenkins/tree/queue-1.509.4
          https://github.com/HedAurabesh/jenkins/tree/queue-1.554.1

          We intend to submit a formal pull request to Jenkins soon.

          The only side effect of this change is that the "scheduled builds" side-panel on the Job overview sometimes shows a build as BOTH scheduled AND in progress. But that is a small price to pay for the server not getting stuck. Additionally, it is nothing a quick "F5" can't fix.

          Best regards,
          Martin

          Oleg Nenashev made changes -
          Link New: This issue is related to JENKINS-2487 [ JENKINS-2487 ]

          Oleg Nenashev added a comment - - edited

          Hi Martin,

          A PR for such a change would be useful in any case.

          Regarding the web UI...
          On our installations we have increased the refresh timeout of the internal queue cache (JENKINS-19691, see https://github.com/jenkinsci/jenkins/pull/1221).
          Together with a periodic cache refresh via a kick-starter task, this solution greatly improves the responsiveness of the UI.
          BTW, it also leads to glitches in the side-panel.
          Unfortunately, "F5" does not help in that case.
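The caching approach mentioned above can be sketched roughly as follows: the UI reads a cached snapshot of the queue, and the snapshot is only reloaded once it is older than a configurable maximum age. This is not the implementation from the referenced pull request. The class `CachedSnapshot`, its parameters, and the injectable clock are assumptions chosen to keep the example small and deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical time-based snapshot cache (not the code from the referenced PR). */
class CachedSnapshot<T> {
    private final Supplier<List<T>> loader;   // expensive call, e.g. one that needs the queue lock
    private final LongSupplier clockMillis;   // injectable clock, so the behaviour is testable
    private final long maxAgeMillis;          // the "refresh timeout"
    private volatile List<T> cached;
    private volatile long loadedAt;

    CachedSnapshot(Supplier<List<T>> loader, LongSupplier clockMillis, long maxAgeMillis) {
        this.loader = loader;
        this.clockMillis = clockMillis;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Returns the cached view, reloading it only when it is missing or too old. */
    List<T> get() {
        List<T> current = cached;
        if (current == null || clockMillis.getAsLong() - loadedAt >= maxAgeMillis) {
            refresh();
            current = cached;
        }
        return current;
    }

    /** Forces a reload; a periodic background task could call this. */
    synchronized void refresh() {
        cached = Collections.unmodifiableList(new ArrayList<>(loader.get()));
        loadedAt = clockMillis.getAsLong();
    }
}
```

A larger `maxAgeMillis` means fewer expensive reloads but a staler view, which fits the side-panel glitches described above: reloading the page cannot help while the cached snapshot is still within its maximum age.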

          Oleg Nenashev made changes -
          Link New: This issue is related to JENKINS-19691 [ JENKINS-19691 ]
          Jenkins IRC Bot made changes -
          Component/s Original: gui [ 15492 ]
          Jesse Glick made changes -
          Labels New: performance
          Jesse Glick made changes -
          Labels Original: performance New: performance queue

            Assignee: Unassigned
            Reporter: Martin Schröder (mhschroe)
            Votes: 13
            Watchers: 22