JENKINS-57161

ec2 plugin locks queue until excessWorkload is 0


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Component: ec2-plugin
    • Environment: Jenkins 2.172 (from the jenkins/jenkins:2.172 docker image)
      ec2 1.42
      durable-task 1.29

      We have periods where we enqueue a burst of jobs (> 100) and need to spin up approximately that many workers. When this happens, the ec2 plugin appears to be taking out a substantial number of locks on the Queue, which prevents builds from being processed out of the Queue and assigned to an executor.

      We see the number of available executors go up until the excessWorkload drops below the NodeProvisioner's thresholdMargin, at which point the Queue locks are released and jobs are assigned to the executors. In cases where we need to spin up hundreds of workers, the Queue can stay locked for a fairly long time (> 15 minutes).

      While the Queue is locked, if any of our EC2 workers reach the configured idle timeout, Jenkins.updateComputerList (which is called during Jenkins.addNode) decides it needs to remove those workers, despite the large Queue we're in the middle of provisioning workers for. EC2OndemandSlave.terminate then blocks on its own Queue lock and makes another call to Jenkins.updateComputerList, which results in yet more Queue locking. This prevents us from using a low idle timeout.
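      To make the interaction easier to follow, here's a rough Groovy sketch of the two code paths as we understand them; only the Jenkins core calls (Jenkins.get().addNode and Jenkins.get().removeNode) are real API, while the wrapper class and method names are ours for illustration:

      // Rough sketch only; the class and method names here are illustrative.
      import hudson.model.Node
      import jenkins.model.Jenkins

      class ProvisioningContentionSketch {
          // Provisioning path: each addNode() runs Jenkins.updateComputerList(),
          // which holds the Queue lock, so scheduling stalls while ~100 new
          // agents are added one at a time.
          void provisionOne(Node newAgent) {
              Jenkins.get().addNode(newAgent)      // updateComputerList() under the Queue lock
          }

          // Idle-timeout path: per the behavior described above,
          // EC2OndemandSlave.terminate() first blocks on the Queue lock itself;
          // the node removal shown here then runs updateComputerList() again,
          // adding more lock traffic while the burst is still queued.
          void removeIdle(Node idleAgent) {
              Jenkins.get().removeNode(idleAgent)  // another round of Queue locking
          }
      }

      In other words, both the scale-out path and the idle-timeout path funnel through the same Queue lock, which is why they end up serializing each other during a burst.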

      Here are the NodeProvisioner options we're using:

      -Dhudson.slaves.NodeProvisioner.MARGIN=30
      -Dhudson.slaves.NodeProvisioner.MARGIN0=0.67
      -Dhudson.slaves.NodeProvisioner.MARGIN_DECAY=0.5

      Here are the relevant options we've configured for our EC2 clouds in the Jenkins UI:

      AMI Type: Unix
      Usage: Only build jobs with label expressions matching this node
      Idle Termination Time: 60
      Number of Executors: 1
      Delete root device on instance termination: True
      Connect by SSH Process: True

      I've attached an example Jenkinsfile that we've been using to test this behavior, along with graphs showing that builds were not assigned to workers until after Jenkins had provisioned enough workers to process the excess workload. I've also attached graphs from running the same test with the ECS plugin, which assigns a build to a worker as soon as one is available (as shown by the "Jenkins Executor Count" graph in the top right). Finally, I've attached a thread dump from the case where many workers reach the idle timeout while a new node is being added.
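      For anyone who wants to reproduce this without the attachments, a pipeline along these lines exercises the same behavior (an illustrative sketch only, not Jenkinsfile.stress_test verbatim; the 'ec2-stress' label is a placeholder for the label expression on our EC2 template):

      // Queues a burst of node blocks, each needing its own single-executor agent.
      def branches = [:]
      for (int i = 0; i < 150; i++) {
          def idx = i                      // copy the loop variable for the closure
          branches["worker-" + idx] = {
              node('ec2-stress') {         // placeholder label for our EC2 template
                  sh 'sleep 300'           // hold the executor long enough to force scale-out
              }
          }
      }
      parallel branches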

      Attachments:

        1. Jenkinsfile.stress_test (0.4 kB)
        2. ecs-test.png (161 kB)
        3. ec2-test.png (155 kB)
        4. add_many_removes_thread_dump.txt (56 kB)

            Assignee: FABRIZIO MANFREDI (thoulen)
            Reporter: Kevin Boschert (kcboschert)
            Votes: 4
            Watchers: 6
