Infrastructure / INFRA-3086

EKS cluster "cik8s" cannot scale to more than 29 workers due to subnet too small


    Details


      Description

      Why

      Users of the service ci.jenkins.io want their builds to be handled as soon as possible, to keep the feedback loop as short as possible.

      What

      The Kubernetes cluster "cik8s" (https://github.com/jenkins-infra/charts/blob/master/clusters/cik8s.yaml), hosted on EKS and defined with Terraform in https://github.com/jenkins-infra/aws/blob/main/eks-cluster.tf#L6, cannot scale above 29 workers.
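
      For context, both the worker-pool size ceiling and the subnets the workers draw their IP addresses from are wired together in that Terraform definition. Below is a minimal sketch of the shape of such a definition, assuming the terraform-aws-modules/eks module; every name, version and size in it is illustrative, not the actual content of eks-cluster.tf:

      # Minimal illustrative sketch, NOT the actual eks-cluster.tf.
      module "eks" {
        source  = "terraform-aws-modules/eks/aws"
        version = "~> 17.0"

        cluster_name    = "cik8s"
        cluster_version = "1.21"              # illustrative
        vpc_id          = module.vpc.vpc_id
        # Worker ENIs (and, with the AWS VPC CNI, every pod) take IPs from these subnets.
        subnets         = module.vpc.private_subnets

        worker_groups = [
          {
            name          = "ci-workers"      # illustrative
            instance_type = "m5.xlarge"       # illustrative
            asg_min_size  = 0
            asg_max_size  = 50                # the ASG ceiling; the subnets must be able to cover it
          },
        ]
      }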

      We are using the "autoscaler" chart (as per https://github.com/jenkins-infra/charts/blob/master/clusters/cik8s.yaml#L8), which reports an error when scaling past 29: the autoscaler logs state that a scale-up is needed but was backed off because of an error from the AWS EC2 API.

      From the AWS UI or CLI, the following error appears on the autoscaling group associated with the worker node pool:

      Status Reason: There are not enough free addresses in subnet 'subnet-05b452693696b38d7' to satisfy the requested number of instances. Launching EC2 instance failed.
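
      Each worker node, and with the default AWS VPC CNI each of its pods as well, consumes an IP address directly from that subnet, so a small subnet is exhausted long before any EC2 instance quota comes into play. The remaining capacity can be checked from Terraform itself; a minimal sketch, using the subnet ID from the error above (the data source and output names are ours):

      data "aws_subnet" "cik8s_workers" {
        id = "subnet-05b452693696b38d7"
      }

      output "cik8s_workers_free_ips" {
        # Addresses still available in the subnet (AWS reserves 5 addresses per subnet).
        value = data.aws_subnet.cik8s_workers.available_ip_address_count
      }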
      

      Requirement Before Starting

      None: go go go

      How

      Some resources to help:
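
      The obvious direction is to re-create the worker subnets with larger CIDR blocks. A minimal sketch, assuming the VPC is managed with the terraform-aws-modules/vpc module; all CIDRs, availability zones and names below are illustrative:

      module "vpc" {
        source  = "terraform-aws-modules/vpc/aws"
        version = "~> 3.0"

        name = "cik8s-vpc"          # illustrative
        cidr = "10.0.0.0/16"        # illustrative

        azs = ["us-east-2a", "us-east-2b", "us-east-2c"]
        # A /20 per subnet gives 4091 usable addresses each, far above any realistic node+pod count.
        private_subnets = ["10.0.0.0/20", "10.0.16.0/20", "10.0.32.0/20"]
        public_subnets  = ["10.0.96.0/24", "10.0.97.0/24", "10.0.98.0/24"]

        enable_nat_gateway = true
      }

      Note that a subnet's CIDR cannot be changed in place: Terraform has to destroy and re-create the subnets, which is consistent with the EKS cluster re-creation mentioned in the comments below.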


          Activity

          dduportal Damien Duportal added a comment -

          AWS Terraform changes:
          https://github.com/jenkins-infra/aws/pull/33
          https://github.com/jenkins-infra/aws/pull/34

          Changes due to the EKS cluster re-creation:
          https://github.com/jenkins-infra/charts-secrets/commit/b87296d49c99ee557360a102c25d05b2cd8a4715 (updating secrets for infra.ci to allow chart deployment)
          https://github.com/jenkins-infra/jenkins-infra/pull/1922
          https://github.com/jenkins-infra/jenkins-infra/pull/1923 (hotfix which breaks tests but fixes production \o/)
          dduportal Damien Duportal added a comment -

          Tested a bunch of random BOM PR builds: the cluster successfully scaled to 49 workers, and we hit a peak of 148 containers.

          Next limits: cost and the Docker Hub rate limit.

            People

            Assignee:
            dduportal Damien Duportal
            Reporter:
            dduportal Damien Duportal
            Votes:
            0
            Watchers:
            2
