Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-6691

EC2 plugin will launch more instances than cap allows if enough jobs are queued up while no instances are active

    • Icon: Bug Bug
    • Resolution: Fixed
    • Icon: Major Major
    • ec2-plugin
    • None
    • EC2 plugin 1.8

      I have an instance cap of 3 set. If I have no instances currently running and I queue up 8 jobs that are tied to one AMI, more than three instances will be started. I believe this is because the instance cap is checked against the number of instances currently running from EC2 and then launches the instance. I'd think that, rather than just checking against the EC2 running instance count, we should be also including the number of instances we've started launching. That said, I'm not entirely sure how we'd go about doing that.

          [JENKINS-6691] EC2 plugin will launch more instances than cap allows if enough jobs are queued up while no instances are active

          mike bayer added a comment -

          this is a major issue that is costing me real cash. over at http://jenkins.sqlalchemy.org/ , a new build will add about ten jobs simultaneously, and despite my instance limit of two, I immediately get five or more instances starting up, which I then have to go and kill manually or pay the price, literally. It seems a lot like the EC2 plugin is spawning based on the number of EC2 instances that amazon reports that are running, without taking into account the number of instances the plugin is already starting to spawn but won't yet be reported by amazon.

          mike bayer added a comment - this is a major issue that is costing me real cash. over at http://jenkins.sqlalchemy.org/ , a new build will add about ten jobs simultaneously, and despite my instance limit of two, I immediately get five or more instances starting up, which I then have to go and kill manually or pay the price, literally. It seems a lot like the EC2 plugin is spawning based on the number of EC2 instances that amazon reports that are running, without taking into account the number of instances the plugin is already starting to spawn but won't yet be reported by amazon.

          mike bayer added a comment -

          pull request at https://github.com/jenkinsci/ec2-plugin/pull/25, seems to be working for me.

          mike bayer added a comment - pull request at https://github.com/jenkinsci/ec2-plugin/pull/25 , seems to be working for me.

          Code changed in jenkins
          User: Mike Bayer
          Path:
          src/main/java/hudson/plugins/ec2/EC2Cloud.java
          http://jenkins-ci.org/commit/ec2-plugin/8d0a9fa1d53d148601e33cb68d50b47b5ff97276
          Log:

          • add a new mechanism to help count the total number of EC2 instances for a particular AMI. As an EC2Slave
            is being provisioned, a temporary count is placed in a HashMap, which is used in addition to the
            count reported by Amazon itself for a particular ami. This is so that the count of total
            nodes provisioned takes into account those which Amazon doesn't report yet. The count returned
            may be too high, if amazon reports it in addition to the "provision" count, but better to err on the
            side of not spawning a node too soon; it will get spawned on the next go-around.
            Tries to fix JENKINS-6691.

          SCM/JIRA link daemon added a comment - Code changed in jenkins User: Mike Bayer Path: src/main/java/hudson/plugins/ec2/EC2Cloud.java http://jenkins-ci.org/commit/ec2-plugin/8d0a9fa1d53d148601e33cb68d50b47b5ff97276 Log: add a new mechanism to help count the total number of EC2 instances for a particular AMI. As an EC2Slave is being provisioned, a temporary count is placed in a HashMap, which is used in addition to the count reported by Amazon itself for a particular ami. This is so that the count of total nodes provisioned takes into account those which Amazon doesn't report yet. The count returned may be too high, if amazon reports it in addition to the "provision" count, but better to err on the side of not spawning a node too soon; it will get spawned on the next go-around. Tries to fix JENKINS-6691 .

          Kevin Fleming added a comment -

          I have another patch coming for this shortly that (in my testing) completely resolves the problem of excess instances being launched when there is work queued, when the underlying reason for the excess launches is that the new instances take a non-zero amount of time to come online and accept work. My fix involves a small change to the Jenkins core API (in the Cloud class) so that will have to be addressed first. Stay tuned.

          Kevin Fleming added a comment - I have another patch coming for this shortly that (in my testing) completely resolves the problem of excess instances being launched when there is work queued, when the underlying reason for the excess launches is that the new instances take a non-zero amount of time to come online and accept work. My fix involves a small change to the Jenkins core API (in the Cloud class) so that will have to be addressed first. Stay tuned.

          Emil Sit added a comment -

          A related issue is that the nodes launched do not take into account the number of executors that will be provided by the new slave. Something like the below might address that.

          diff --git a/src/main/java/hudson/plugins/ec2/EC2Cloud.java b/src/main/java/hudson/plugins/ec2/EC2Cloud.java
          index 8952222..06e22b9 100644
          --- a/src/main/java/hudson/plugins/ec2/EC2Cloud.java
          +++ b/src/main/java/hudson/plugins/ec2/EC2Cloud.java
          @@ -319,7 +319,7 @@ public abstract class EC2Cloud extends Cloud {
                       final SlaveTemplate t = getTemplate(label);
                       int amiCap = t.getInstanceCap();
           
          -            for( ; excessWorkload>0; excessWorkload-- ) {
          +            while (excessWorkload>0) {
           
                           if (!addProvisionedSlave(t.ami, amiCap)) {
                               break;
          @@ -350,6 +350,9 @@ public abstract class EC2Cloud extends Cloud {
                                       }
                                   })
                                   ,t.getNumExecutors()));
          +
          +                excessWorkload -= t.getNumExecutors();
          +
                       }
                       return r;
                   } catch (AmazonClientException e) {
          

          Emil Sit added a comment - A related issue is that the nodes launched do not take into account the number of executors that will be provided by the new slave. Something like the below might address that. diff --git a/src/main/java/hudson/plugins/ec2/EC2Cloud.java b/src/main/java/hudson/plugins/ec2/EC2Cloud.java index 8952222..06e22b9 100644 --- a/src/main/java/hudson/plugins/ec2/EC2Cloud.java +++ b/src/main/java/hudson/plugins/ec2/EC2Cloud.java @@ -319,7 +319,7 @@ public abstract class EC2Cloud extends Cloud { final SlaveTemplate t = getTemplate(label); int amiCap = t.getInstanceCap(); - for( ; excessWorkload>0; excessWorkload-- ) { + while (excessWorkload>0) { if (!addProvisionedSlave(t.ami, amiCap)) { break; @@ -350,6 +350,9 @@ public abstract class EC2Cloud extends Cloud { } }) ,t.getNumExecutors())); + + excessWorkload -= t.getNumExecutors(); + } return r; } catch (AmazonClientException e) {

          Kevin Fleming added a comment -

          This has already been addressed in the master branch of the plugin, and will be included in the next release. Thanks!

          Kevin Fleming added a comment - This has already been addressed in the master branch of the plugin, and will be included in the next release. Thanks!

          Emil Sit added a comment -

          Cool; which commit fixes this? Thanks!

          Emil Sit added a comment - Cool; which commit fixes this? Thanks!

          Kevin Fleming added a comment -

          Ahh... oops. I'm wrong

          This is fixed in a branch of mine that is queued for the 1.19 release of the plugin; there is a build currently being tested to be the 1.18 release. Once that is pushed out Francis plans to merge my template-instance-cap-fix branch, which corrects this issue among others, but requires Jenkins 1.503 or later. If you'd like to help test my branch, it's at github.com/kpfleming/jenkins-ec2-plugin, called template-instance-cap-fix.

          Kevin Fleming added a comment - Ahh... oops. I'm wrong This is fixed in a branch of mine that is queued for the 1.19 release of the plugin; there is a build currently being tested to be the 1.18 release. Once that is pushed out Francis plans to merge my template-instance-cap-fix branch, which corrects this issue among others, but requires Jenkins 1.503 or later. If you'd like to help test my branch, it's at github.com/kpfleming/jenkins-ec2-plugin, called template-instance-cap-fix.

          Daniel Hyon added a comment -

          Can some generous soul point me to the changes/diffs that were introduced to remedy this bug?

          Thank you!

          Daniel Hyon added a comment - Can some generous soul point me to the changes/diffs that were introduced to remedy this bug? Thank you!

          Francis Upton added a comment -

          This appears to be fixed thanks to Mike Bayer <mike_mp@zzzcomputing.com> on 10/9/12 6:42 PM

          Francis Upton added a comment - This appears to be fixed thanks to Mike Bayer <mike_mp@zzzcomputing.com> on 10/9/12 6:42 PM

            francisu Francis Upton
            abayer Andrew Bayer
            Votes:
            4 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated:
              Resolved: