Status: Closed
Jenkins 2.172 (from the jenkins/jenkins:2.172 docker image)
We have periods where we enqueue a burst of jobs (> 100) and need to spin up approximately that many workers. When this happens, the ec2 plugin appears to be taking out a substantial number of locks on the Queue, which prevents builds from being processed out of the Queue and assigned to an executor.
We see the number of available executors go up until the excessWorkload is under the NodeProvisioner's thresholdMargin, at which point the Queue locks are released, and jobs are assigned to the executors. In cases where we need to spin up 100s of workers, we can see the Queue locked for a fairly long time (> 15 minutes).
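To make the observation above concrete, here is a toy model of it. This is not the plugin's or Jenkins core's code: the class name, method name, and the "one new worker per tick" assumption are made up for illustration, and the threshold comparison is simplified to match how the behavior looks from the outside.

```java
// Toy model only: names and the tick-based timing are hypothetical.
class QueueLockSketch {
    /**
     * Returns the tick at which the first queued build can be assigned.
     * Assumes one worker comes online per tick. When holdLockWhileProvisioning
     * is true, no build is assigned while the excess workload
     * (queued builds minus available workers) is above the threshold.
     */
    static int firstAssignmentTick(int queued, int threshold, boolean holdLockWhileProvisioning) {
        int workers = 0;
        for (int tick = 1; ; tick++) {
            workers++; // one new worker becomes available this tick
            int excess = queued - workers;
            boolean lockHeld = holdLockWhileProvisioning && excess > threshold;
            if (!lockHeld) {
                return tick; // the Queue is free, so a build can be assigned
            }
        }
    }

    public static void main(String[] args) {
        // Queue held during provisioning: nothing starts until all 100 workers exist.
        System.out.println(firstAssignmentTick(100, 0, true));  // prints 100
        // Queue not held: the first build starts as soon as one worker is up.
        System.out.println(firstAssignmentTick(100, 0, false)); // prints 1
    }
}
```

The second call corresponds to what we would expect to see: builds start as soon as a worker is available, instead of waiting for the whole burst to be provisioned.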
During the time that the Queue has been locked, if any of our EC2 workers reach the specified idle timeout, Jenkins.updateComputerList (which is called during Jenkins.addNode) will decide it needs to remove those workers, despite having a large Queue that we're in the middle of provisioning workers for. EC2OndemandSlave.terminate will then block on its own Queue lock, and then make another call to Jenkins.updateComputerList, which results in more Queue locking. This prevents us from being able to have a low idle timeout.
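As a rough sketch of the behavior we would expect instead, an idle check could take the pending Queue into account before removing a worker. The class and method below are hypothetical and are not taken from the ec2 plugin; the parameter names and the use of minutes as the unit are assumptions made for the example.

```java
// Hypothetical illustration only; not the ec2 plugin's idle-timeout logic.
class IdleCheckSketch {
    /**
     * Decides whether an idle worker should be terminated.
     * A worker is kept if any builds are still queued for it,
     * even if it has been idle for longer than the timeout.
     */
    static boolean shouldTerminate(long idleMinutes, long idleTimeoutMinutes, int queuedBuilds) {
        if (queuedBuilds > 0) {
            return false; // work is waiting, so keep the worker around
        }
        return idleMinutes >= idleTimeoutMinutes;
    }

    public static void main(String[] args) {
        System.out.println(shouldTerminate(61, 60, 0));   // prints true
        System.out.println(shouldTerminate(61, 60, 100)); // prints false
    }
}
```

With a check along these lines, a worker that hits the idle timeout in the middle of a provisioning burst would stay available for the queued builds.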
Here are the NodeProvisioner options we're using:
Here are the relevant options we've configured for our EC2 clouds in the Jenkins UI:
AMI Type: Unix
Usage: Only build jobs with label expressions matching this node
Idle Termination Time: 60
Number of Executors: 1
Delete root device on instance termination: True
Connect by SSH Process: True
I've attached an example Jenkinsfile that we've been using to test this behavior, along with some graphs showing that builds were not assigned to workers until after Jenkins had provisioned enough workers to process the excess workload. I've also attached graphs taken while running the same test with the ECS plugin, which allows a build to be assigned to a worker as soon as one is available (as shown by the "Jenkins Executor Count" graph in the top right). Finally, I've attached a thread dump of the case where many workers reach the idle timeout while a new node is being added.
I will merge the patch soon; I am waiting to close out some pending items for 1.43. After that, it will be part of 1.44.
Proposed fix: https://github.com/jenkinsci/ec2-plugin/pull/346