The EC2 plugin occasionally takes an excessive amount of time to provision new machines. In the example I have, it took 15 minutes from start to finish to provision and connect to 88 new machines: 11 minutes to create the PlannedNode instances and another 4 minutes to call start on each of the new Node instances being added to Jenkins.
The 4-minute period now appears to have been caused by a second cloud plugin, and that issue has been resolved. The 11-minute period is still outstanding. Moreover, this period happens entirely in the NodeProvisioner.update thread, so the provisioning lock is held throughout and our second cloud plugin cannot provision machines during this time. We do not see this all the time, but I suspect it becomes more of an issue during heavy periods, when we may be running into request throttling on the AWS side.
I believe this is due to a line in the EC2AbstractSlave constructor, which fetches instance data from EC2 on creation of the Slave instance. Reading the code, this data is designed to be fetched when required, so there's no reason to fetch it in the EC2AbstractSlave constructor and block the provisioning thread. I've opened PR #574 to make this change.
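To illustrate the idea only (this is not the plugin's actual code, and every class and method name below is a hypothetical stand-in), here is a minimal Java sketch of deferring a slow remote fetch from the constructor to first use. It assumes the fetched data is only needed later, which is how I read the existing code.

```java
/** Hypothetical sketch: defer a slow remote lookup until the data is first needed. */
class LazyAgentSketch {
    /** Counts simulated remote calls so the effect of laziness is visible. */
    static int remoteCalls = 0;

    /** Hypothetical stand-in for a slow, possibly throttled describe-instance request. */
    static String describeInstance(String instanceId) {
        remoteCalls++;
        return "details-for-" + instanceId;
    }

    /** Hypothetical stand-in for an agent object created on the provisioning thread. */
    static final class Agent {
        private final String instanceId;
        private String instanceData; // stays null until first requested

        Agent(String instanceId) {
            // No remote call here, so construction does not block the caller.
            this.instanceId = instanceId;
        }

        /** Fetches the data on first access and reuses the cached value afterwards. */
        synchronized String getInstanceData() {
            if (instanceData == null) {
                instanceData = describeInstance(instanceId);
            }
            return instanceData;
        }
    }

    public static void main(String[] args) {
        Agent[] agents = new Agent[88];
        for (int i = 0; i < agents.length; i++) {
            agents[i] = new Agent("i-" + i);
        }
        System.out.println("Remote calls after creating 88 agents: " + remoteCalls);

        agents[0].getInstanceData();
        agents[0].getInstanceData(); // second call is served from the cached field
        System.out.println("Remote calls after first use: " + remoteCalls);
    }
}
```

With this shape, constructing many agents in a row makes no remote requests, and each agent pays for its own lookup only when something actually asks for the data.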
Please find logs, filtered by thread ID, at https://gist.github.com/mtughan/cbad70db249b8eedb485f9253acb135b. Sensitive company information has also been stripped from them. If there are any other potential causes of the slowness that we can work to resolve, I am open to discussion.