
PATCH: EC2-plugin always starting new slaves instead of restarting existing

      Every time I start a build, Jenkins launches a new slave rather than restarting one of the stopped instances. It is successfully stopping the instance when it hits the idle time.

      During idle, the log shows lots of this (the _check, idleTimeout, stop entries appear once for every instance currently registered):

      Jul 16, 2014 12:40:47 PM hudson.model.AsyncPeriodicWork$1 run
      INFO: Started EC2 alive slaves monitor
      Jul 16, 2014 12:40:48 PM hudson.model.AsyncPeriodicWork$1 run
      INFO: Finished EC2 alive slaves monitor. 1172 ms
      Jul 16, 2014 12:41:53 PM hudson.plugins.ec2.EC2RetentionStrategy _check
      INFO: Idle timeout: edifestivalsapi build slave (i-ce32a08c)
      Jul 16, 2014 12:41:53 PM hudson.plugins.ec2.EC2AbstractSlave idleTimeout
      INFO: EC2 instance idle time expired: i-ce32a08c
      Jul 16, 2014 12:41:53 PM hudson.plugins.ec2.EC2AbstractSlave stop
      INFO: EC2 instance stopped: i-ce32a08c
      

      Then a build is triggered and the idle timeout checks run again (again, one set of entries for every instance):

      Jul 16, 2014 12:44:23 PM com.cloudbees.jenkins.GitHubPushTrigger$1 run
      INFO: SCM changes detected in edifestivalsapi-master. Triggering  #36
      Jul 16, 2014 12:45:53 PM hudson.plugins.ec2.EC2RetentionStrategy _check
      INFO: Idle timeout: edifestivalsapi build slave (i-ce32a08c)
      Jul 16, 2014 12:45:53 PM hudson.plugins.ec2.EC2AbstractSlave idleTimeout
      INFO: EC2 instance idle time expired: i-ce32a08c
      Jul 16, 2014 12:45:53 PM hudson.plugins.ec2.EC2AbstractSlave stop
      INFO: EC2 instance stopped: i-ce32a08c
      

      And then the plugin starts to provision a new instance - apparently without any attempt to restart a stopped slave.

      Jul 16, 2014 12:46:33 PM hudson.plugins.ec2.EC2Cloud provision
      INFO: Excess workload after pending Spot instances: 1
      Jul 16, 2014 12:46:33 PM hudson.plugins.ec2.EC2Cloud addProvisionedSlave
      INFO: Provisioning for AMI ami-57ea3d20; Estimated number of total slaves: 0; Estimated number of slaves for ami ami-57ea3d20: 0
      Launching ami-57ea3d20 for template edifestivalsapi build slave
      Jul 16, 2014 12:46:33 PM hudson.slaves.NodeProvisioner update
      INFO: Started provisioning edifestivalsapi build slave (ami-57ea3d20) from ec2-eu-west-1 with 1 executors. Remaining excess workload:0.0
      Looking for existing instances: {InstanceIds: [],Filters: [{Name: image-id,Values: [ami-57ea3d20]}, {Name: group-name,Values: [jenkins-build-slave]}, {Name: key-name,Values: [build-slave]}, {Name: instance-type,Values: [t1.micro]}, {Name: tag:Name,Values: [edifestivalsapi-build-slave]}, {Name: tag:Project,Values: [edifestivalsapi]}, {Name: instance-state-name,Values: [stopped, stopping]}],}
      No existing instance found - created: {InstanceId: i-eb35a8a9,ImageId: ami-57ea3d20,State: {Code: 0,Name: pending},"**REDACTED**}
      

      Then another block of the idle timeout checks while the instance is launched, and then this:

      Jul 16, 2014 12:47:44 PM hudson.slaves.NodeProvisioner update
      INFO: edifestivalsapi build slave (ami-57ea3d20) provisioningE successfully completed. We have now 8 computer(s)
      Jul 16, 2014 12:47:47 PM hudson.node_monitors.ResponseTimeMonitor$1 monitor
      WARNING: Making edifestivalsapi build slave (i-ce32a08c) offline because it’s not responding
      

      The UI shows all the slaves that were launched for previous jobs but shows them as offline with "Time out for last 5 try". When I manually start the instance (by clicking onto the slave page and clicking "Launch slave agent") I see that the stopped instance is restarted and comes online as expected.

      It does successfully run a subsequent build on the correct existing node if there is one still running.

      So my hunch is that Jenkins somehow isn't detecting that it has a stopped instance for the given AMI?

      I have two slave types configured, both using the same AMI but with different labels.

      I had initially posted this as a comment on https://issues.jenkins-ci.org/browse/JENKINS-23787 but it sounds like it may not be related: the reporter there sees Jenkins waiting indefinitely for the instance to start, whereas I see it skip the restart and spin up a new node straight away, leaving the old one marked as not available.

          [JENKINS-23850] PATCH: EC2-plugin always starting new slaves instead of restarting existing

          Andrew Coulton added a comment - edited

          Originally thought this was related to https://issues.jenkins-ci.org/browse/JENKINS-23787 but it may not be.

          Andrew Coulton added a comment -

          I've identified that this happens when using the eu-west-1 region, because the describeInstances API does not return results when used with the group-name filter. The same behaviour has been observed at https://github.com/aws/aws-sdk-java/issues/213 and I have opened a discussion on the AWS forum.

          Per the EC2 API docs the instance.group-name filter is functionally identical, and I have found that this query does produce expected results in eu-west-1 and elsewhere.

          I've submitted a 1-line pull request to switch to the instance.group-name filter at https://github.com/jenkinsci/ec2-plugin/pull/99 which I think will resolve this issue ahead of any response/resolution by the AWS team.
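The filter swap described above can be sketched in plain Java. This is only an illustration under stated assumptions: the class StoppedSlaveFilters and its build method are hypothetical names, a Map stands in for the AWS SDK's filter objects, and the filter names and values are taken from the "Looking for existing instances" log line in the issue description. It is not the code from the pull request.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the describeInstances filter list; no AWS SDK involved.
class StoppedSlaveFilters {

    // Filter name used after the change; the name used before was "group-name".
    static final String GROUP_FILTER = "instance.group-name";

    // Builds filter name -> values for the stopped-slave lookup.
    static Map<String, List<String>> build(String ami, String securityGroup,
                                           String keyName, String instanceType) {
        Map<String, List<String>> filters = new LinkedHashMap<>();
        filters.put("image-id", List.of(ami));
        filters.put(GROUP_FILTER, List.of(securityGroup));
        filters.put("key-name", List.of(keyName));
        filters.put("instance-type", List.of(instanceType));
        // Only stopped or stopping instances are candidates for a restart.
        filters.put("instance-state-name", List.of("stopped", "stopping"));
        return filters;
    }

    public static void main(String[] args) {
        // Values copied from the log excerpt in the issue description.
        System.out.println(build("ami-57ea3d20", "jenkins-build-slave",
                                 "build-slave", "t1.micro"));
    }
}
```

Keeping the filter name in a single constant means the workaround stays a one-line change, and it could be reverted just as easily if the group-name behaviour is fixed on the AWS side.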

          Craig Ringer added a comment -

          Consider adding some instrumentation to SlaveTemplate.java, particularly the provisionOndemand(...) function. The changeset at https://github.com/jenkinsci/ec2-plugin/pull/101 might be useful for that purpose.
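As a rough idea of what such instrumentation might look like, here is a minimal sketch using java.util.logging. The ProvisionTrace class and its methods are hypothetical and are not part of the ec2-plugin or of the linked changeset; the sketch assumes the caller already knows the template name, the filter names it queried with, and how many stopped instances the lookup returned.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical logging helper; not part of the ec2-plugin.
class ProvisionTrace {
    static final Logger LOGGER = Logger.getLogger(ProvisionTrace.class.getName());

    // Formats one line describing the stopped-instance lookup and its outcome.
    static String describe(String template, List<String> filterNames, int matches) {
        String outcome = matches == 0
                ? "no stopped instance matched, a new one would be launched"
                : matches + " stopped instance(s) matched, one can be restarted";
        return "Template '" + template + "' filters " + filterNames + ": " + outcome;
    }

    // Logs at WARNING when nothing matched, since that is the case reported here.
    static void log(String template, List<String> filterNames, int matches) {
        Level level = matches == 0 ? Level.WARNING : Level.INFO;
        LOGGER.log(level, describe(template, filterNames, matches));
    }
}
```

Called right after the lookup, for example as ProvisionTrace.log("edifestivalsapi build slave", List.of("image-id", "group-name"), 0), a line like this would show at a glance which filters were in play when the lookup came back empty.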

          SCM/JIRA link daemon added a comment -

          Code changed in jenkins
          User: Andrew Coulton
          Path:
          src/main/java/hudson/plugins/ec2/SlaveTemplate.java
          http://jenkins-ci.org/commit/ec2-plugin/14d537a08728f0baef88ed2d33e73696f681433f
          Log:
          [FIXED JENKINS-23850] Workaround AWS bug that hides stopped slaves

          Use instance.group-name instead of group-name as the describeInstances
          filter when searching for stopped slaves to restart.

          Some EC2 regions are incorrectly returning empty describeInstances
          results when the group-name filter is used. This causes jenkins to start
          new slaves instead of restarting stopped instances.

          instance.group-name is functionally identical and seems to work
          consistently.

          https://issues.jenkins-ci.org/browse/JENKINS-23850

            francisu Francis Upton
            acoulton Andrew Coulton