Uploaded image for project: 'Jenkins'
  1. Jenkins
  2. JENKINS-6691

EC2 plugin will launch more instances than cap allows if enough jobs are queued up while no instances are active

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • ec2-plugin
    • None
    • EC2 plugin 1.8

    Description

      I have an instance cap of 3 set. If I have no instances currently running and I queue up 8 jobs that are tied to one AMI, more than three instances will be started. I believe this is because the instance cap is checked against the number of instances currently running from EC2 and then launches the instance. I'd think that, rather than just checking against the EC2 running instance count, we should be also including the number of instances we've started launching. That said, I'm not entirely sure how we'd go about doing that.

      Attachments

        Activity

          abayer Andrew Bayer created issue -

          I have the same problem.
          It's really annoying and the extra money I spend because of this bug is driving me towards bamboo.

          cctkn Tobias Knierim added a comment - I have the same problem. It's really annoying and the extra money I spend because of this bug is driving me towards bamboo.
          cctkn Tobias Knierim added a comment - - edited

          Shouldn't the part to check for more instances and start a new one be synchronized?

          for( ; excessWorkload>0; excessWorkload-- ) {
          if(countCurrentEC2Slaves()>=instanceCap)
          break; // maxed out

          r.add(new PlannedNode(t.getDisplayName(),
          Computer.threadPoolForRemoting.submit(new Callable<Node>() {
          public Node call() throws Exception {
          // TODO: record the output somewhere
          EC2Slave s = t.provision(new StreamTaskListener(System.out));
          Hudson.getInstance().addNode(s);
          // EC2 instances may have a long init script. If we declare
          // the provisioning complete by returning without the connect
          // operation, NodeProvisioner may decide that it still wants
          // one more instance, because it sees that (1) all the slaves
          // are offline (because it's still being launched) and
          // (2) there's no capacity provisioned yet.
          //
          // deferring the completion of provisioning until the launch
          // goes successful prevents this problem.
          s.toComputer().connect(false).get();
          return s;
          }
          })
          ,t.getNumExecutors()));
          }

          cctkn Tobias Knierim added a comment - - edited Shouldn't the part to check for more instances and start a new one be synchronized? for( ; excessWorkload>0; excessWorkload-- ) { if(countCurrentEC2Slaves()>=instanceCap) break; // maxed out r.add(new PlannedNode(t.getDisplayName(), Computer.threadPoolForRemoting.submit(new Callable<Node>() { public Node call() throws Exception { // TODO: record the output somewhere EC2Slave s = t.provision(new StreamTaskListener(System.out)); Hudson.getInstance().addNode(s); // EC2 instances may have a long init script. If we declare // the provisioning complete by returning without the connect // operation, NodeProvisioner may decide that it still wants // one more instance, because it sees that (1) all the slaves // are offline (because it's still being launched) and // (2) there's no capacity provisioned yet. // // deferring the completion of provisioning until the launch // goes successful prevents this problem. s.toComputer().connect(false).get(); return s; } }) ,t.getNumExecutors())); }
          zzzeek mike bayer added a comment -

          this is a major issue that is costing me real cash. over at http://jenkins.sqlalchemy.org/ , a new build will add about ten jobs simultaneously, and despite my instance limit of two, I immediately get five or more instances starting up, which I then have to go and kill manually or pay the price, literally. It seems a lot like the EC2 plugin is spawning based on the number of EC2 instances that amazon reports that are running, without taking into account the number of instances the plugin is already starting to spawn but won't yet be reported by amazon.

          zzzeek mike bayer added a comment - this is a major issue that is costing me real cash. over at http://jenkins.sqlalchemy.org/ , a new build will add about ten jobs simultaneously, and despite my instance limit of two, I immediately get five or more instances starting up, which I then have to go and kill manually or pay the price, literally. It seems a lot like the EC2 plugin is spawning based on the number of EC2 instances that amazon reports that are running, without taking into account the number of instances the plugin is already starting to spawn but won't yet be reported by amazon.
          zzzeek mike bayer added a comment -

          pull request at https://github.com/jenkinsci/ec2-plugin/pull/25, seems to be working for me.

          zzzeek mike bayer added a comment - pull request at https://github.com/jenkinsci/ec2-plugin/pull/25 , seems to be working for me.

          Code changed in jenkins
          User: Mike Bayer
          Path:
          src/main/java/hudson/plugins/ec2/EC2Cloud.java
          http://jenkins-ci.org/commit/ec2-plugin/8d0a9fa1d53d148601e33cb68d50b47b5ff97276
          Log:

          • add a new mechanism to help count the total number of EC2 instances for a particular AMI. As an EC2Slave
            is being provisioned, a temporary count is placed in a HashMap, which is used in addition to the
            count reported by Amazon itself for a particular ami. This is so that the count of total
            nodes provisioned takes into account those which Amazon doesn't report yet. The count returned
            may be too high, if amazon reports it in addition to the "provision" count, but better to err on the
            side of not spawning a node too soon; it will get spawned on the next go-around.
            Tries to fix JENKINS-6691.
          scm_issue_link SCM/JIRA link daemon added a comment - Code changed in jenkins User: Mike Bayer Path: src/main/java/hudson/plugins/ec2/EC2Cloud.java http://jenkins-ci.org/commit/ec2-plugin/8d0a9fa1d53d148601e33cb68d50b47b5ff97276 Log: add a new mechanism to help count the total number of EC2 instances for a particular AMI. As an EC2Slave is being provisioned, a temporary count is placed in a HashMap, which is used in addition to the count reported by Amazon itself for a particular ami. This is so that the count of total nodes provisioned takes into account those which Amazon doesn't report yet. The count returned may be too high, if amazon reports it in addition to the "provision" count, but better to err on the side of not spawning a node too soon; it will get spawned on the next go-around. Tries to fix JENKINS-6691 .
          kpfleming Kevin Fleming added a comment -

          I have another patch coming for this shortly that (in my testing) completely resolves the problem of excess instances being launched when there is work queued, when the underlying reason for the excess launches is that the new instances take a non-zero amount of time to come online and accept work. My fix involves a small change to the Jenkins core API (in the Cloud class) so that will have to be addressed first. Stay tuned.

          kpfleming Kevin Fleming added a comment - I have another patch coming for this shortly that (in my testing) completely resolves the problem of excess instances being launched when there is work queued, when the underlying reason for the excess launches is that the new instances take a non-zero amount of time to come online and accept work. My fix involves a small change to the Jenkins core API (in the Cloud class) so that will have to be addressed first. Stay tuned.
          sit Emil Sit added a comment -

          A related issue is that the nodes launched do not take into account the number of executors that will be provided by the new slave. Something like the below might address that.

          diff --git a/src/main/java/hudson/plugins/ec2/EC2Cloud.java b/src/main/java/hudson/plugins/ec2/EC2Cloud.java
          index 8952222..06e22b9 100644
          --- a/src/main/java/hudson/plugins/ec2/EC2Cloud.java
          +++ b/src/main/java/hudson/plugins/ec2/EC2Cloud.java
          @@ -319,7 +319,7 @@ public abstract class EC2Cloud extends Cloud {
                       final SlaveTemplate t = getTemplate(label);
                       int amiCap = t.getInstanceCap();
           
          -            for( ; excessWorkload>0; excessWorkload-- ) {
          +            while (excessWorkload>0) {
           
                           if (!addProvisionedSlave(t.ami, amiCap)) {
                               break;
          @@ -350,6 +350,9 @@ public abstract class EC2Cloud extends Cloud {
                                       }
                                   })
                                   ,t.getNumExecutors()));
          +
          +                excessWorkload -= t.getNumExecutors();
          +
                       }
                       return r;
                   } catch (AmazonClientException e) {
          
          sit Emil Sit added a comment - A related issue is that the nodes launched do not take into account the number of executors that will be provided by the new slave. Something like the below might address that. diff --git a/src/main/java/hudson/plugins/ec2/EC2Cloud.java b/src/main/java/hudson/plugins/ec2/EC2Cloud.java index 8952222..06e22b9 100644 --- a/src/main/java/hudson/plugins/ec2/EC2Cloud.java +++ b/src/main/java/hudson/plugins/ec2/EC2Cloud.java @@ -319,7 +319,7 @@ public abstract class EC2Cloud extends Cloud { final SlaveTemplate t = getTemplate(label); int amiCap = t.getInstanceCap(); - for( ; excessWorkload>0; excessWorkload-- ) { + while (excessWorkload>0) { if (!addProvisionedSlave(t.ami, amiCap)) { break; @@ -350,6 +350,9 @@ public abstract class EC2Cloud extends Cloud { } }) ,t.getNumExecutors())); + + excessWorkload -= t.getNumExecutors(); + } return r; } catch (AmazonClientException e) {
          kpfleming Kevin Fleming added a comment -

          This has already been addressed in the master branch of the plugin, and will be included in the next release. Thanks!

          kpfleming Kevin Fleming added a comment - This has already been addressed in the master branch of the plugin, and will be included in the next release. Thanks!
          sit Emil Sit added a comment -

          Cool; which commit fixes this? Thanks!

          sit Emil Sit added a comment - Cool; which commit fixes this? Thanks!
          kpfleming Kevin Fleming added a comment -

          Ahh... oops. I'm wrong

          This is fixed in a branch of mine that is queued for the 1.19 release of the plugin; there is a build currently being tested to be the 1.18 release. Once that is pushed out Francis plans to merge my template-instance-cap-fix branch, which corrects this issue among others, but requires Jenkins 1.503 or later. If you'd like to help test my branch, it's at github.com/kpfleming/jenkins-ec2-plugin, called template-instance-cap-fix.

          kpfleming Kevin Fleming added a comment - Ahh... oops. I'm wrong This is fixed in a branch of mine that is queued for the 1.19 release of the plugin; there is a build currently being tested to be the 1.18 release. Once that is pushed out Francis plans to merge my template-instance-cap-fix branch, which corrects this issue among others, but requires Jenkins 1.503 or later. If you'd like to help test my branch, it's at github.com/kpfleming/jenkins-ec2-plugin, called template-instance-cap-fix.
          jntkflow Daniel Hyon added a comment -

          Can some generous soul point me to the changes/diffs that were introduced to remedy this bug?

          Thank you!

          jntkflow Daniel Hyon added a comment - Can some generous soul point me to the changes/diffs that were introduced to remedy this bug? Thank you!
          francisu Francis Upton added a comment -

          This appears to be fixed thanks to Mike Bayer <mike_mp@zzzcomputing.com> on 10/9/12 6:42 PM

          francisu Francis Upton added a comment - This appears to be fixed thanks to Mike Bayer <mike_mp@zzzcomputing.com> on 10/9/12 6:42 PM
          francisu Francis Upton made changes -
          Field Original Value New Value
          Assignee Kohsuke Kawaguchi [ kohsuke ] Francis Upton [ francisu ]
          Resolution Fixed [ 1 ]
          Status Open [ 1 ] Resolved [ 5 ]
          francisu Francis Upton made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          rtyler R. Tyler Croy made changes -
          Workflow JNJira [ 136786 ] JNJira + In-Review [ 204219 ]

          People

            francisu Francis Upton
            abayer Andrew Bayer
            Votes:
            4 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: