Jenkins / JENKINS-58101

Jenkins slowdown with many offline nodes

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Component/s: core
    • Labels: None
    • Environment: Linux, Java 8 x64,
      Jenkins 2.176.1,
      workflow-durable-task-step 2.30
    • Released As: Jenkins 2.274 - released 5 Jan 2021 and 2.277.1

    Description

      Having a large number of offline executors causes a massive slowdown in hudson.model.Queue. In some cases the maintain method holds the queue lock over 80% of the time.

      "AtmostOneTaskExecutor[Periodic Jenkins queue maintenance] 
         java.lang.Thread.State: RUNNABLE
              at org.jenkinsci.plugins.durabletask.executors.ContinuedTask$Scheduler.canTake(ContinuedTask.java:66)
              at hudson.model.Queue$JobOffer.getCauseOfBlockage(Queue.java:278)
              at hudson.model.Queue.maintain(Queue.java:1616)
              at hudson.model.Queue$1.call(Queue.java:325)
              at hudson.model.Queue$1.call(Queue.java:322)
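
A rough way to quantify how often a thread sits in a given method is to take repeated stack samples (for example from thread dumps or Thread.getAllStackTraces()) and count how many contain the frame of interest. The sketch below only does the counting step. StackSampleStats and fractionContaining are hypothetical names, not part of Jenkins, and the result measures time spent inside the method, which only approximates how long the queue lock is held.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Jenkins or any plugin.
class StackSampleStats {

    /** Returns the fraction of stack samples containing a frame for className.methodName. */
    static double fractionContaining(List<StackTraceElement[]> samples,
                                     String className, String methodName) {
        if (samples.isEmpty()) {
            return 0.0;
        }
        long hits = 0;
        for (StackTraceElement[] stack : samples) {
            for (StackTraceElement frame : stack) {
                if (frame.getClassName().equals(className)
                        && frame.getMethodName().equals(methodName)) {
                    hits++;
                    break; // count each sample at most once
                }
            }
        }
        return (double) hits / samples.size();
    }

    public static void main(String[] args) {
        // Synthetic samples; in practice these would come from periodic thread dumps.
        StackTraceElement maintain =
                new StackTraceElement("hudson.model.Queue", "maintain", "Queue.java", 1616);
        StackTraceElement other =
                new StackTraceElement("example.Idle", "park", "Idle.java", 1);

        List<StackTraceElement[]> samples = new ArrayList<>();
        samples.add(new StackTraceElement[] {maintain});
        samples.add(new StackTraceElement[] {maintain});
        samples.add(new StackTraceElement[] {maintain});
        samples.add(new StackTraceElement[] {other});

        System.out.println(fractionContaining(samples, "hudson.model.Queue", "maintain"));
    }
}
```

A profiler such as jvisualvm does this sampling for you; the helper is only meant to make the "80% of the time" style of figure concrete.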
      

      Steps to reproduce:

      1. Install Jenkins with job-dsl-plugin, matrix-project-plugin, ssh-slaves-plugin and workflow-durable-task-step (Pipeline: Nodes and Processes).
      2. Create an SSH node with 500 executors and add some random labels, e.g. "a b c d e f g h i".
      3. Mark the node offline using configure->availability->"bring online according to schedule".
      4. Create the jobs using the Job DSL script below.
      5. Wait for the jobs to start, observe the sluggish queue, and fire up jvisualvm to analyze.
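
      // Job DSL script: build the 100 axis values "0".."99" shared by all matrix jobs
      configs = []
      for (int i = 0; i < 100; i++) {
        configs.add(String.valueOf(i))
      }

      // Create 10 matrix jobs with 100 configurations each
      for (int i = 0; i < 10; i++) {
        matrixJob("matrix-" + i) {
          axes {
            // single text axis named 'cfg'
            text('cfg', configs)
          }
          triggers {
            // trigger every minute so the queue fills up quickly
            cron('* * * * *')
          }
          steps {
            // each configuration just sleeps for 30 seconds
            shell('sleep 30')
          }
        }
      }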
      

      It seems each "parked executor" causes a Queue$JobOffer to be created, which in turn triggers some getCauseOfBlockage analysis. This appears to perform blockedItems * buildableItems operations, which can get quite slow for a large job queue.

      How it was found: we have ~80 nodes with 10 executors each. We took half of them offline during a hardware migration, and our jobs soon filled the queue with 2000 items. Jenkins started timing out due to queue lock contention; a single maintain() call took around 60 seconds.
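
To make the suspected scaling concrete, here is a toy model of one maintenance pass. It is a sketch under assumptions: it reads the stack trace as "each queued item is checked against each JobOffer", and it assumes every offline executor yields one offer. QueueCostModel and countChecks are hypothetical names, not Jenkins code. If the check itself scans the queue again, as the blockedItems * buildableItems observation suggests, the real cost would be higher still.

```java
import java.util.function.BiPredicate;

// Toy cost model for illustration; names are hypothetical, not Jenkins APIs.
class QueueCostModel {

    /** Counts how many (item, offer) checks one pass performs when every pair is tested. */
    static long countChecks(int offers, int items, BiPredicate<Integer, Integer> canTake) {
        long calls = 0;
        for (int item = 0; item < items; item++) {
            for (int offer = 0; offer < offers; offer++) {
                calls++;
                canTake.test(offer, item); // result ignored: offline offers never accept work
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        BiPredicate<Integer, Integer> neverTakes = (offer, item) -> false;

        // Reproduction setup: 500 offline executors, 10 jobs x 100 configurations queued.
        System.out.println(countChecks(500, 10 * 100, neverTakes));

        // Reported incident: about half of ~80 nodes x 10 executors offline, 2000 queued items.
        System.out.println(countChecks(40 * 10, 2000, neverTakes));
    }
}
```

Under these assumptions both scenarios land in the range of several hundred thousand checks per pass, all performed while the queue lock is held.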

          Activity

            mbakhoff Märt Bakhoff created issue -
            mbakhoff Märt Bakhoff made changes -
            Field Original Value New Value
            Description: list marker of the first reproduction step changed from "#1)" to "#"; the rest of the text is unchanged
            mbakhoff Märt Bakhoff made changes -
            Link: This issue relates to JENKINS-20046 [ JENKINS-20046 ]
            mbakhoff Märt Bakhoff made changes -
            Description: second reproduction step extended with: add some random labels "a b c d e f g h i"; the rest of the text is unchanged
            raihaan Raihaan Shouhell made changes -
            Assignee: Raihaan Shouhell [ raihaan ]
            raihaan Raihaan Shouhell made changes -
            Status: Open [ 1 ] -> In Progress [ 3 ]
            raihaan Raihaan Shouhell made changes -
            Status: In Progress [ 3 ] -> In Review [ 10005 ]
            raihaan Raihaan Shouhell made changes -
            Remote Link: This issue links to "PR-5082 (Web Link)" [ 26340 ]
            markewaite Mark Waite made changes -
            Released As: Jenkins 2.274 - released 5 Jan 2021
            Resolution: Fixed [ 1 ]
            Status: In Review [ 10005 ] -> Closed [ 6 ]
            markewaite Mark Waite made changes -
            Released As: Jenkins 2.274 - released 5 Jan 2021 -> Jenkins 2.274 - released 5 Jan 2021 and 2.277.1

            People

              Assignee: raihaan Raihaan Shouhell
              Reporter: mbakhoff Märt Bakhoff
              Votes: 0
              Watchers: 1
