INFRA-3104

Use AWS Spot Instances for the EKS cluster used in ci.jenkins.io

      Description

      Why

      Read the EPIC (AWS cost decrease)

      What

      As per a reminder from Jesse Glick (:heart:) in the parent EPIC, we could decrease the cost of agents by using Spot instances for the AWS EKS worker pools.

      This is supported by the EKS Terraform module that we use (https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest): https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/spot-instances.md

      How

          Activity

          jglick Jesse Glick added a comment -

          Note that there is some small chance of a Spot instance being abruptly terminated; there is currently no simple way to arrange for the node block to be rerun automatically without this happening in unrelated cases of genuine failure. Probably quite acceptable for typical CI jobs, just avoid such labels for anything critical.

          dduportal Damien Duportal added a comment -

          Documentation:

          • https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest
          • https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/docs/spot-instances.md
          • https://github.com/aws/aws-node-termination-handler
          • https://aws.amazon.com/blogs/compute/cost-optimization-and-resilience-eks-with-spot-instances/
          • https://github.com/pusher/k8s-spot-rescheduler

          dduportal Damien Duportal added a comment - edited

          • https://github.com/jenkins-infra/aws/pull/40 creates 2 worker pools: 1 for the "static" services such as the cluster autoscaler, and the upcoming "custom" chart for handling spots.
          • https://github.com/jenkins-infra/charts/pull/1643 ensures that the autoscaler runs only on the new "static" worker pool.
          • https://github.com/jenkins-infra/charts/pull/1643 also adds the AWS node termination handler to manage Spot terminations.
          • https://github.com/jenkins-infra/charts/pull/1644 enables automatic upgrades of the 2 EKS-related Helm charts.
          • https://github.com/jenkins-infra/aws/pull/41 removes the former (expensive) on-demand pool.

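For readers unfamiliar with how a Spot termination is announced: EC2 publishes an interruption notice on the instance metadata endpoint shortly before reclaiming a Spot instance, and that notice is the kind of signal a termination handler can react to. The following is only a minimal Python sketch of reading that notice, not the aws-node-termination-handler implementation. The function names are hypothetical, and the unauthenticated request assumes the instance still allows IMDSv1 (IMDSv2 would need a session token first).

```python
import json
from datetime import datetime, timezone
from typing import Optional, Tuple
from urllib import error, request

# EC2 instance metadata path that carries the Spot interruption notice.
# It answers 404 while no interruption is scheduled.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def parse_instance_action(body: str) -> Tuple[str, datetime]:
    """Parse a notice such as {"action": "terminate", "time": "2017-09-18T08:22:00Z"}."""
    payload = json.loads(body)
    when = datetime.strptime(payload["time"], "%Y-%m-%dT%H:%M:%SZ")
    return payload["action"], when.replace(tzinfo=timezone.utc)


def seconds_until(when: datetime, now: datetime) -> float:
    """Time left before the announced action, in seconds."""
    return (when - now).total_seconds()


def fetch_instance_action(url: str = SPOT_ACTION_URL,
                          timeout: float = 1.0) -> Optional[str]:
    """Return the raw notice, or None when no interruption is scheduled.

    Hypothetical helper: assumes IMDSv1 is reachable from the caller.
    """
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except error.HTTPError as exc:
        if exc.code == 404:  # no interruption notice yet
            return None
        raise
```

A poller would call fetch_instance_action() every few seconds and, once it returns a body, use parse_instance_action() to decide how long it has to cordon and drain the node.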
          dduportal Damien Duportal added a comment -

          Applied to the EKS cluster, only 2 "on demand" machines (t3.xlarge) are available:

          • They can handle 2 jnlp pods (before triggering autoscaling to spawn big Spot workers with 16 vCPUs and 64 GB of memory)
          • They host the "static" services (such as CoreDNS, autoscaler, etc.)
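The capacity reasoning above can be sketched as simple arithmetic: a pod fits on a node only if both its CPU and memory requests fit in what is left after the "static" services, and pods that do not fit stay pending, which is what prompts the cluster autoscaler to add workers. The Python sketch below is illustrative only. The reserved amounts and pod requests are hypothetical values, since the real requests used on ci.jenkins.io are not stated in this ticket. Only the nominal t3.xlarge size (4 vCPUs, 16 GiB) is a known figure, and real allocatable capacity is somewhat lower.

```python
def pods_per_node(node_cpu_m: int, node_mem_mib: int,
                  reserved_cpu_m: int, reserved_mem_mib: int,
                  pod_cpu_m: int, pod_mem_mib: int) -> int:
    """How many identical pods fit on one node, limited by the scarcer resource."""
    free_cpu = max(node_cpu_m - reserved_cpu_m, 0)
    free_mem = max(node_mem_mib - reserved_mem_mib, 0)
    return min(free_cpu // pod_cpu_m, free_mem // pod_mem_mib)


def pending_pods(requested: int, nodes: int, per_node: int) -> int:
    """Pods left unschedulable; a non-zero value is what triggers a scale-up."""
    return max(requested - nodes * per_node, 0)


# Nominal t3.xlarge: 4 vCPUs (4000 millicores) and 16 GiB (16384 MiB).
# Everything below is a hypothetical assumption, not the cluster's real config.
per_node = pods_per_node(
    node_cpu_m=4000, node_mem_mib=16384,
    reserved_cpu_m=1000, reserved_mem_mib=2048,  # assumed share of static services
    pod_cpu_m=2000, pod_mem_mib=4096,            # assumed jnlp pod requests
)
print(per_node)                       # pods per on-demand node under these assumptions
print(pending_pods(3, 2, per_node))   # a third pod would not fit on the 2 nodes
```

Under these assumed numbers CPU is the limiting resource, so each on-demand node takes one agent pod and any further pod waits for a Spot worker.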

            People

            Assignee:
            dduportal Damien Duportal
            Reporter:
            dduportal Damien Duportal