[JENKINS-56036] Spot Instance Plugin Spawns Arbitrary Number of Instances

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • ec2-plugin
    • None

      Feb 07, 2019 4:55:53 PM FINE com.amazonaws.http.AmazonHttpClient$RequestExecutor executeOneRequest
      Sending Request: POST https://ec2.us-east-1.amazonaws.com/ / Parameters: ({"Action":["DescribeInstances"],"Version":["2016-11-15"],"InstanceId.1":["i-0cda2343e023df94c"]}Headers: (User-Agent: aws-sdk-java/1.11.457 Linux/4.4.0-47-generic Java_HotSpot(TM)_64-Bit_Server_VM/25.111-b14 java/1.8.0_111 groovy/2.4.12, amz-sdk-invocation-id: 1eb7a6a4-e994-6b97-17bf-e412668194d8, )

      The EC2 Spot instance functionality keeps spawning new spot instances while ignoring the ones that are already running.

      I see log messages and stack traces like this in Jenkins when this happens:

      INFO: Unexpected number of reservations reported by EC2 for instance id 'i-0cda2343e023df94c', expected 1 result, found []. Instance seems to be dead.
      Feb 07, 2019 4:56:38 PM hudson.plugins.ec2.EC2Cloud provision
      WARNING: SlaveTemplate{ami='ami-005b3a8001dab02a9', labels=''}. Exception during provisioning
      com.amazonaws.AmazonClientException: Unexpected number of reservations reported by EC2 for instance id 'i-0cda2343e023df94c', expected 1 result, found []. Instance seems to be dead.
      at hudson.plugins.ec2.CloudHelper.getInstance(CloudHelper.java:54)
      at hudson.plugins.ec2.CloudHelper.getInstanceWithRetry(CloudHelper.java:25)
      at hudson.plugins.ec2.EC2AbstractSlave.fetchLiveInstanceData(EC2AbstractSlave.java:499)
      at hudson.plugins.ec2.EC2AbstractSlave.<init>(EC2AbstractSlave.java:159)
      at hudson.plugins.ec2.EC2SpotSlave.<init>(EC2SpotSlave.java:44)
      at hudson.plugins.ec2.EC2SpotSlave.<init>(EC2SpotSlave.java:37)
      at hudson.plugins.ec2.SlaveTemplate.newSpotSlave(SlaveTemplate.java:979)
      at hudson.plugins.ec2.SlaveTemplate.provisionSpot(SlaveTemplate.java:919)
      at hudson.plugins.ec2.SlaveTemplate.provision(SlaveTemplate.java:464)
      at hudson.plugins.ec2.EC2Cloud.getNewOrExistingAvailableSlave(EC2Cloud.java:578)
      at hudson.plugins.ec2.EC2Cloud.provision(EC2Cloud.java:594)
      at hudson.slaves.NodeProvisioner$StandardStrategyImpl.apply(NodeProvisioner.java:715)
      at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:320)
      at hudson.slaves.NodeProvisioner.access$000(NodeProvisioner.java:62)
      at hudson.slaves.NodeProvisioner$1.run(NodeProvisioner.java:177)
      at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
      at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:59)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

      I believe this is the result of calling a method on a partially constructed object.

      1) An EC2SpotSlave is constructed with a non-null spot instance request id (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/SlaveTemplate.java#L1060)

      2) That constructor calls the super constructor of EC2AbstractSlave. The instance variable spotInstanceRequestId is only assigned after the super constructor has returned (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2SpotSlave.java#L51)

      3) The EC2AbstractSlave constructor calls fetchLiveInstanceData (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2AbstractSlave.java#L164)

      4) fetchLiveInstanceData ends up calling getInstanceId() (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2AbstractSlave.java#L503)

      5) getInstanceId() is overridden in EC2SpotSlave, where it calls getSpotRequest() (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2SpotSlave.java#L165)

      6) getSpotRequest() calls describeSpotInstanceRequests with spotInstanceRequestId, which is still null at this point (https://github.com/jenkinsci/ec2-plugin/blob/master/src/main/java/hudson/plugins/ec2/EC2SpotSlave.java#L129)

      7) If you pass in null, it returns ALL the spot requests in the region, in whatever order Amazon wants to give them back

      8) We take the first spot request and use that, which is not necessarily the spot request we are interested in. Maybe this plugin appears to work a lot of the time because EC2 mostly gives the latest instance back first. Hmmm..

      9) Unpredictable things start happening because the slave is now working with data from the wrong spot request
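
      The sequence in steps 1 to 9 can be reproduced in isolation. The following is a minimal sketch, not the plugin's code: BaseSlave, SpotSlave and lookupInstance are hypothetical stand-ins for EC2AbstractSlave, EC2SpotSlave and the describeSpotInstanceRequests call, and the "null means unfiltered" behaviour is simulated rather than taken from the AWS SDK.

```java
public class PartialConstructionDemo {

    /** Hypothetical stand-in for EC2AbstractSlave: its constructor calls an overridable method. */
    static class BaseSlave {
        final String instanceIdSeenDuringConstruction;

        BaseSlave() {
            // Mirrors fetchLiveInstanceData() ending up in getInstanceId() (steps 3 and 4).
            this.instanceIdSeenDuringConstruction = getInstanceId();
        }

        String getInstanceId() {
            return "on-demand-instance";
        }
    }

    /** Hypothetical stand-in for EC2SpotSlave. */
    static class SpotSlave extends BaseSlave {
        private final String spotInstanceRequestId;

        SpotSlave(String spotInstanceRequestId) {
            super(); // the overridden getInstanceId() runs in here, while the field below is still null
            this.spotInstanceRequestId = spotInstanceRequestId; // assigned too late (step 2)
        }

        @Override
        String getInstanceId() {
            // Step 5 and 6: the override uses a field the subclass has not set yet.
            return lookupInstance(spotInstanceRequestId);
        }
    }

    /** Simulated lookup: a null id behaves like a query with no filter (assumption for this sketch). */
    static String lookupInstance(String requestId) {
        if (requestId == null) {
            return "UNFILTERED";
        }
        return "instance-for-" + requestId;
    }

    public static void main(String[] args) {
        SpotSlave slave = new SpotSlave("sir-123");
        // Value captured while the object was only partially constructed.
        System.out.println(slave.instanceIdSeenDuringConstruction);
        // Value seen once construction has finished.
        System.out.println(slave.getInstanceId());
    }
}
```

      Under these assumptions the first line printed is UNFILTERED and the second is instance-for-sir-123: the same method gives a different answer depending on whether it is called during or after construction.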

      So.. considering that this code can spawn lots of instances when things become corrupt, and that I was seeing errors like 'Unexpected number of reservations', I wonder if we can be a bit more defensive and set a flag that prevents further instances from being spawned once things are corrupt.
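
      One possible shape for such a defensive flag is sketched below. This is only an illustration of the idea, not a proposed patch: ProvisioningGuard and all of its methods are hypothetical names that do not exist in the plugin.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical guard: refuses unfiltered spot lookups and blocks provisioning once state looks corrupt. */
public class ProvisioningGuard {

    private final AtomicBoolean corrupt = new AtomicBoolean(false);

    /**
     * Returns the id unchanged if it is usable. Otherwise marks the state as corrupt
     * and throws, so that no query is ever sent without a request id.
     */
    public String requireRequestId(String spotInstanceRequestId) {
        if (spotInstanceRequestId == null || spotInstanceRequestId.isEmpty()) {
            corrupt.set(true);
            throw new IllegalStateException(
                "spot instance request id is not set; refusing to query all spot requests");
        }
        return spotInstanceRequestId;
    }

    /** Callers would check this before asking the cloud for another instance. */
    public boolean canProvision() {
        return !corrupt.get();
    }

    /** Clears the flag, for example after an operator has inspected the situation. */
    public void reset() {
        corrupt.set(false);
    }

    public static void main(String[] args) {
        ProvisioningGuard guard = new ProvisioningGuard();
        System.out.println(guard.canProvision());
        try {
            guard.requireRequestId(null);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println(guard.canProvision());
    }
}
```

      Failing fast on a null id addresses step 6 directly, while the flag covers the broader wish to stop spawning once inconsistent data has been observed.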

          Ben Murphy added a comment -

          This actually might not be so bad. I think when it fails it rechecks the instance limit by looking at all the instances tagged with jenkins so it won't actually try and spawn more than the instance limit. This is a kind of cloudy interpretation of the code so don't take my word on it. :/

          Ben Murphy added a comment -

          This seems to have been broken in this commit: dd7bdefc4a214934facb93306c33bcda1c9a3a9a

          Oh.. did I forget to mention I hate OO and random side effects :/

          FABRIZIO MANFREDI added a comment -

          Thanks, I will provide a fix in the 1.43 (I hope)

          Andy Kennealy added a comment -

          I really need a fix for the deadlock issue, which is apparently fixed in the latest version, but I had to roll back to 1.39 because of this issue, JENKINS-55720, and JENKINS-55639.

          I need to restart Jenkins multiple times a day because of the deadlock issue. I think I'm going to try reverting to 1.36.

          Thai Pham added a comment -

          thoulen, do you have any ETA on when 1.43 will be released?

          Laszlo Gaal added a comment -

          Saw the same thing happening on Jenkins v2.150.2 and EC2 plugin version 1.42.

          benmmurphy, my experience seems to confirm yours: after a single request for a worker, the plugin spun up as many spot instances as the instance limit allowed.

          Shubham Dhoka added a comment -

          I have seen this issue with on-demand instances as well. I'm using version 1.42 of the plugin.

          Joe Canuel added a comment -

          I'm seeing this issue with on-demand instances and version 1.45 of the plugin. Any update on this issue? Right now it is spinning up on-demand workers but not tracking them due to this issue, so we have to go in and kill the orphaned instances manually.

            thoulen FABRIZIO MANFREDI
            benmmurphy Ben Murphy
            Votes: 8
            Watchers: 11