Jenkins / JENKINS-20046

Massive Jenkins slowdown when jobs in Queue (due to Queue.maintain())

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: core
    • Environment: Ubuntu 12.04
      Jenkins 1.509.3
      Up-to-date plugins

      As soon as more than a handful of builds get queued, the entire GUI grinds to a halt.

      The reason is that the executor thread running the Queue.maintain() method holds the exclusive lock on the queue while it runs a very time-consuming loop building the list of applicable hosts matching a certain label.

      Because of this, every Jenkins GUI page and every method that needs access to the queue is delayed by ~30 seconds, and the delay grows with the number of builds in the queue, since Queue.maintain() is then called more often.

      The server only becomes responsive again once the entire queue is empty. Setting the server to "shutdown now" does not help.
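The contention pattern described above, one thread doing expensive label matching while holding the queue's exclusive monitor so that every other caller blocks behind it, can be sketched in isolation. This is a minimal illustration, not Jenkins code; all class and method names below are hypothetical:

```java
import java.util.*;

// Minimal sketch (hypothetical names, not Jenkins internals): why doing
// expensive matching while holding the queue's monitor stalls every other
// caller, and how shrinking the critical section helps.
public class QueueLockDemo {
    private final Object queueLock = new Object();
    private final List<String> nodes = Arrays.asList("node1", "node2", "node3");

    // Anti-pattern: the expensive scan runs inside the exclusive lock, so
    // any other thread that needs queueLock waits for the whole scan.
    String assignSlow(String label) {
        synchronized (queueLock) {
            return expensiveMatch(label);   // long work under the lock
        }
    }

    // Better: compute the candidate outside the lock, then take the lock
    // only for the short, cheap state update.
    String assignFast(String label) {
        String candidate = expensiveMatch(label);  // no lock held here
        synchronized (queueLock) {
            return candidate;                      // brief critical section
        }
    }

    // Stand-in for the label-to-node matching loop seen in the traces.
    String expensiveMatch(String label) {
        for (String n : nodes) {
            if (n.contains(label)) return n;
        }
        return null;
    }

    public static void main(String[] args) {
        QueueLockDemo d = new QueueLockDemo();
        System.out.println(d.assignFast("node2")); // prints node2
    }
}
```

The usual remedy, and the general direction later queue patches took, is to shrink the critical section: do the expensive computation against a snapshot outside the lock and re-take the lock only for the cheap state mutation.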

      A typical stack trace when this occurs looks like the following (the first from /threadDump, the second from jstack at a different time):
      "Executor #6 for musxbird038" prio=10 tid=0x00007fe108024800 nid=0x7008 runnable [0x00007fe0f5a99000]
      java.lang.Thread.State: RUNNABLE
      at hudson.model.Slave.getLabelString(Slave.java:245)
      at hudson.model.Node.getAssignedLabels(Node.java:241)
      at hudson.model.Label.matches(Label.java:168)
      at hudson.model.Label.getNodes(Label.java:193)
      at hudson.model.Label.contains(Label.java:405)
      at hudson.model.Node.canTake(Node.java:322)
      at hudson.model.Queue$JobOffer.canTake(Queue.java:250)
      at hudson.model.Queue.maintain(Queue.java:1032)

      - locked <0x00000000e01d3490> (a hudson.model.Queue)
        at hudson.model.Queue.pop(Queue.java:863)
      - locked <0x00000000e01d3490> (a hudson.model.Queue)
        at hudson.model.Executor.grabJob(Executor.java:285)
        at hudson.model.Executor.run(Executor.java:206)
      - locked <0x00000000e01d3490> (a hudson.model.Queue)

      "Executor #0 for musxbird006" Id=591 Group=main RUNNABLE
      at java.util.TreeMap.successor(TreeMap.java:1975)
      at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1101)
      at java.util.TreeMap$KeyIterator.next(TreeMap.java:1154)
      at java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1010)
      at hudson.model.Label$2.resolve(Label.java:159)
      at hudson.model.Label$2.resolve(Label.java:157)
      at hudson.model.labels.LabelAtom.matches(LabelAtom.java:149)
      at hudson.model.labels.LabelExpression$Binary.matches(LabelExpression.java:124)
      at hudson.model.Label.matches(Label.java:157)
      at hudson.model.Label.matches(Label.java:168)
      at hudson.model.Label.getNodes(Label.java:193)
      at hudson.model.Label.contains(Label.java:405)
      at hudson.model.Node.canTake(Node.java:322)
      at hudson.model.Queue$JobOffer.canTake(Queue.java:250)
      at hudson.model.Queue.maintain(Queue.java:1032)

      - locked hudson.model.Queue@2962c1e0
        at hudson.model.Queue.pop(Queue.java:863)
      - locked hudson.model.Queue@2962c1e0
        at hudson.model.Executor.grabJob(Executor.java:285)
        at hudson.model.Executor.run(Executor.java:206)
      - locked hudson.model.Queue@2962c1e0

      As you can see, the Queue.maintain() method does finish successfully, but takes more than 30 seconds to do so. The server does not stop working and returns to normal once the queue has been fully processed.

      We have ~20 nodes with 12 executor slots each (= 240 executor threads). An equal number of jobs is running, but not all of them consume CPU time on the host (most are idling, waiting for certain events).

      This issue has occurred since upgrading from 1.509.1 to 1.509.3.

      Thanks in advance.


          Florian Manschwetus added a comment -

          Hi Oleg,

          the copyLogs thread seems to be extremely busy; is it possible that it iterates over all branches of the pipeline and not only the active ones?

          Florian Manschwetus added a comment -

          Here is the stack trace of the significantly most busy thread:

          WorkflowRun.copyLogs [#5]
          sun.misc.Unsafe.park(Native Method)
          java.util.concurrent.locks.LockSupport.park(Unknown Source)
          java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown Source)
          java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
          java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
          java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
          java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          java.lang.Thread.run(Unknown Source)

          Florian Manschwetus added a comment -

          Hi Oleg,

          it seems to spend a lot of time in LinearBlockHoppingScanner:120-123:

          WorkflowRun.copyLogs [#5] (test/dev #127)
          org.jenkinsci.plugins.workflow.graphanalysis.LinearBlockHoppingScanner.next(LinearBlockHoppingScanner.java:123)
          org.jenkinsci.plugins.workflow.graphanalysis.AbstractFlowScanner.next(AbstractFlowScanner.java:212)
          org.jenkinsci.plugins.workflow.graphanalysis.AbstractFlowScanner.next(AbstractFlowScanner.java:94)
          org.jenkinsci.plugins.workflow.graphanalysis.AbstractFlowScanner.findFirstMatch(AbstractFlowScanner.java:255)
          org.jenkinsci.plugins.workflow.graphanalysis.LinearScanner.findFirstMatch(LinearScanner.java:135)
          org.jenkinsci.plugins.workflow.graphanalysis.AbstractFlowScanner.findFirstMatch(AbstractFlowScanner.java:274)
          org.jenkinsci.plugins.workflow.support.actions.LogActionImpl.isRunning(LogActionImpl.java:153)
          org.jenkinsci.plugins.workflow.support.actions.LogActionImpl.getLogText(LogActionImpl.java:128)
          org.jenkinsci.plugins.workflow.job.WorkflowRun.copyLogs(WorkflowRun.java:441)
          org.jenkinsci.plugins.workflow.job.WorkflowRun.access$600(WorkflowRun.java:125)
          org.jenkinsci.plugins.workflow.job.WorkflowRun$3.run(WorkflowRun.java:313)
          java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
          java.util.concurrent.FutureTask.runAndReset(Unknown Source)
          java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)
          java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
          java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
          java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
          java.lang.Thread.run(Unknown Source)

          Florian Manschwetus added a comment -

          Hi Oleg,

          maybe calling Collection.contains in a loop is not a good idea if this is done quite often on larger sets (it makes the whole scan O(n^2))?

          https://github.com/jenkinsci/workflow-api-plugin/blob/master/src/main/java/org/jenkinsci/plugins/workflow/graphanalysis/LinearBlockHoppingScanner.java#L123

          Regards,

          Florian
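The complexity concern raised here can be illustrated in isolation. A hypothetical sketch (not the actual LinearBlockHoppingScanner code): membership tests via List.contains() inside a loop cost O(n) each, so the loop is O(n^2) overall, while copying the collection into a HashSet up front makes each test O(1) on average:

```java
import java.util.*;

// Hypothetical sketch of the contains-in-a-loop pattern: same result,
// different asymptotic cost.
public class ContainsDemo {
    // O(n^2): every contains() call walks the list.
    static int countVisitedSlow(List<String> walk, List<String> visited) {
        int hits = 0;
        for (String node : walk) {
            if (visited.contains(node)) hits++;   // O(n) per call
        }
        return hits;
    }

    // O(n): one up-front copy into a hash set, then O(1) average lookups.
    static int countVisitedFast(List<String> walk, List<String> visited) {
        Set<String> seen = new HashSet<>(visited); // single O(n) copy
        int hits = 0;
        for (String node : walk) {
            if (seen.contains(node)) hits++;       // O(1) average per call
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> walk = Arrays.asList("a", "b", "c", "b");
        List<String> visited = Arrays.asList("b", "c");
        System.out.println(countVisitedSlow(walk, visited)); // 3
        System.out.println(countVisitedFast(walk, visited)); // 3
    }
}
```

For graph scans over large pipeline flows this is the difference between quadratic and linear behavior; whether it actually applies at the linked line is for the plugin maintainers to confirm.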

          Jesse Glick added a comment -

          manschwetus: your issue is totally unrelated; see JENKINS-40934.

          Oleg Nenashev added a comment -

          I doubt it makes sense to keep this issue open. The queue has changed significantly in 1.609.x, and there are other performance tweaks such as jimilian's recent https://github.com/jenkinsci/jenkins/pull/3038/files. The inputs here are probably no longer relevant.

          I would just close this issue as Resolved.

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Akbashev Alexander
          Path:
          core/src/main/java/hudson/model/Queue.java
          test/src/test/java/hudson/model/QueueTest.java
          http://jenkins-ci.org/commit/jenkins/be0238644911948da4123b5338f0299198dcc048
          Log:
          JENKINS-20046 - Do not query queue dispatchers from UI (#3038)

          • Do not query queue dispatchers from UI
          • Address comments from review
          • Restore old constructors and mark them as @deprecated
          • Optimise query from UI even more
          • Check non-concurrent builds in getCauseOfBlockageForItem

          Daniel Beck added a comment -

          Does jimilian's change towards 2.85 resolve this issue?

          Florian Straub added a comment - edited

          We still have this behavior running v2.92 ...

          Update: ... but actually we managed to fix it by enabling LDAP caching in "global security"!

          Oleg Nenashev added a comment -

          I am closing this issue, since a lot of queue performance patches have been applied since 1.509.x. If anybody sees performance degradation on recent versions of Jenkins, I suggest opening new tickets so that we can handle other cases separately.

            Assignee: Unassigned
            Reporter: Martin Schröder (mhschroe)
            Votes: 13
            Watchers: 22