Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Description
We have a setup with multiple AMIs that are tag-based. Here is the scenario that is causing a problem.
Setup:
AMI_1 has tag ONE and only runs Jenkins jobs with that tag. This AMI has an instance cap of 1 and the executor count on the slave is 1 as well, meaning there can be at most one Jenkins job with tag ONE running at a time.
AMI_2 has no tags and is used as much as possible. It also has an instance cap of 6 and each slave has 4 executors.
The overall max cap is 10.
Jenkins currently has one running instance of AMI_1 (at cap), plus two running and two stopped instances of AMI_2 (below cap). All executors are busy. In the queue there is 1 job waiting for AMI_1 and 6 jobs waiting for AMI_2.
Problem:
What happens is that the plugin tries to provision a new slave for AMI_1 but cannot because that AMI is at its cap, and then DOES NOT attempt to provision (start) any AMI_2 slaves.
I looked through the code and saw the following in EC2Cloud.java:
private synchronized EC2AbstractSlave provisionSlaveIfPossible(SlaveTemplate template) {
    /*
     * Note this is synchronized between counting the instances and then allocating the node. Once the node is
     * allocated, we don't look at that instance as available for provisioning.
     */
    int possibleSlavesCount = getPossibleNewSlavesCount(template);
    if (possibleSlavesCount < 0) {
        LOGGER.log(Level.INFO, "Cannot provision - no capacity for instances: " + possibleSlavesCount);
        return null;
    }

    try {
        return template.provision(StreamTaskListener.fromStdout(), possibleSlavesCount > 0);
    } catch (IOException e) {
        LOGGER.log(Level.WARNING, "Exception during provisioning", e);
        return null;
    }
}
I think that
possibleSlavesCount < 0
should be
possibleSlavesCount <= 0
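In the snippet above, that one-character change would make the guard read as follows (just a sketch of the suggested fix, not a tested patch):

    if (possibleSlavesCount <= 0) {
        LOGGER.log(Level.INFO, "Cannot provision - no capacity for instances: " + possibleSlavesCount);
        return null;
    }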
This issue does resolve itself once the queue for AMI_1 is empty, but it can mean an unnecessarily long wait for jobs to complete when there should be available capacity.
Update: this code change did not solve the problem.
I tested a slightly different scenario than the one above, where both AMIs had slaves provisioned up to their cap amounts, but all of them were in the stopped state, i.e. no more slaves could be provisioned. When I started several builds spread across both AMIs, none of the slaves were restarted. I suspect this happened because provisionSlaveIfPossible returned early without trying to start any stopped slaves. With the old version of the check (possibleSlavesCount < 0), the method would still reach template.provision and therefore still try to start a stopped slave.
I believe there is an issue in the logic in that provisioning new slaves and starting stopped slaves are treated the same. Instead, they should be treated differently, and a stopped slave should be started whenever one exists and excess work is queued. Alternatively, countCurrentEC2Slaves could be changed to not count stopped slaves.
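As a rough sketch of the first option (based only on the snippet above, and assuming that passing false as the second argument of template.provision makes the template reuse or restart existing stopped instances rather than launch new ones), provisionSlaveIfPossible could stop returning early and instead distinguish the two cases:

    private synchronized EC2AbstractSlave provisionSlaveIfPossible(SlaveTemplate template) {
        int possibleSlavesCount = getPossibleNewSlavesCount(template);
        try {
            if (possibleSlavesCount > 0) {
                // Below the caps: a brand new instance may be launched.
                return template.provision(StreamTaskListener.fromStdout(), true);
            }
            // At or over the caps: do not launch anything new, but still give the
            // template a chance to restart one of its stopped instances.
            // (Assumption: provision(listener, false) only reuses existing instances.)
            return template.provision(StreamTaskListener.fromStdout(), false);
        } catch (IOException e) {
            LOGGER.log(Level.WARNING, "Exception during provisioning", e);
            return null;
        }
    }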