Infrastructure / INFRA-3036

DockerHub API Rate Limit slows down builds on ci.jenkins.io + increases costs

      Description

      Following INFRA-2918, we discovered that the Docker Hub API rate limit is hit pretty quickly for the Kubernetes agents on ci.jenkins.io.

      There are several reasons:

      • ACI agents did not hit the limit because Microsoft's infrastructure for ACI is not constrained by it (their image pulls are either not rate limited, or use their own unlimited account plus public IPs that are rotated quite often)
      • We use a "get the latest container image" pattern with the images from https://github.com/jenkins-infra/docker-inbound-agents/, which generates a LOT of unneeded requests to Docker Hub (2-3 requests per agent start)
        • Functionally, we NEED to ensure we always get the latest image.

      The following mitigations / solutions are being considered:

      • Consider publishing images to another Docker registry (or 2 different ones): keep Docker Hub, but also publish somewhere free such as GHCR, to ensure we do not hit API limits when pulling
      • Switch the tag policy from `latest` to tag based (or better: sha based)
        • This is what we do with VM agents
        • Requires a bit of updatecli configuration in the jenkins-infra/jenkins-infra repository to update checksums daily
        • If we need to "test" a new image different from the default podtemplate, the pipeline can override the image through `podTemplate` (or `agent { kubernetes {} }` in declarative)
      • Use a Docker local registry acting as "pull-through" cache (ref. https://docs.docker.com/registry/recipes/mirror/)
        • Per namespace in the EKS cluster
        • Private access only (not exposed publicly)
        • Authentication may or may not be required for worker nodes to pull from the proxy registry
        • This proxy should be authenticated to DockerHub
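
As a rough illustration of the pull-through option, the sketch below builds the two configuration fragments involved: the `proxy` section of the registry configuration and the `registry-mirrors` entry of a Docker Engine `daemon.json`, both described in the mirror recipe linked above. The mirror URL, the credentials, and the helper names are hypothetical, and containerd-based nodes are configured differently.

```python
import json

# Upstream registry URL used by the Docker mirror recipe.
DOCKER_HUB_URL = "https://registry-1.docker.io"


def registry_proxy_config(username: str, password: str) -> dict:
    """Build the `proxy` section of a registry config acting as a pull-through cache.

    The credentials are the (hypothetical) Docker Hub account the proxy
    uses when it has to fetch from upstream.
    """
    return {
        "proxy": {
            "remoteurl": DOCKER_HUB_URL,
            "username": username,
            "password": password,
        }
    }


def daemon_mirror_config(mirror_url: str) -> dict:
    """Build the `daemon.json` fragment pointing a Docker Engine at the mirror."""
    return {"registry-mirrors": [mirror_url]}


if __name__ == "__main__":
    # Hypothetical in-cluster address: private access only, not exposed.
    mirror = "http://registry-mirror.example.internal:5000"
    print(json.dumps(daemon_mirror_config(mirror), indent=2))
```

Keeping the Docker Hub credentials on the proxy side means the worker nodes themselves do not need a pull secret for the upstream registry.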

          Activity

          dduportal Damien Duportal added a comment -

          Authentication already done in https://github.com/jenkins-infra/jenkins-infra/pull/1825

          BUT

          it must be added in the "jenkins-agents" chart (TODO)

          olblak Olivier Vernin added a comment -

          I think having a local registry would be the best solution. It reduces network pressure and decreases image pull time; only then would we experiment with other solutions.

          Authentication with Docker Hub could be done at the proxy cache level, but having it available is better anyway.

          I like the idea of using checksums as it makes debugging easier, but I wouldn't update them on a daily basis because of the noise it introduces.

          dduportal Damien Duportal added a comment -

          Started to study these 2 solutions (pull-through registry and checksums).

          I tried a pull-through registry within EC2 with a single Docker Engine. The download times are not really improved (the gain is sometimes a few seconds for the `jenkins/inbound-agent:jdk11` image).
          The problem is that with a "latest" model, the pull-through registry always emits requests to the Docker Hub API to check whether there is a new layer: I was able to hit the limit in ~1h40 (better than the 50 min we had last week, but still far from the 6 hours).
          Another issue is that configuring the container runtimes used by the kubelets, even though possible, is not straightforward and varies across Kubernetes providers (yay).
          The good thing, though, is that the registry does not act as a single point of failure: if it is down, the container runtimes (at least containerd and dockerd, which I tested) fall back to Docker Hub.

          About the checksums, agreed with your comment, Olivier Vernin: a weekly update (if it requires human approval) sounds good. But we should be able to do it "on demand" if there is a special request (thinking about a CVE or a new feature): a job that can be triggered manually should do the trick.
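
To watch how quickly the limit is consumed, Docker Hub reports the quota in the `ratelimit-limit` and `ratelimit-remaining` response headers, using the form `100;w=21600` (a count, then the window in seconds; 21600 s is the 6-hour window mentioned above). A minimal sketch of parsing such a value follows. The function and type names and the sample value are illustrative, and fetching the headers (which needs a registry token) is left out.

```python
from typing import NamedTuple, Optional


class RateLimit(NamedTuple):
    count: int                     # number of pulls (limit or remaining)
    window_seconds: Optional[int]  # length of the window, if reported


def parse_ratelimit_header(value: str) -> RateLimit:
    """Parse a header value such as '100;w=21600' into (count, window)."""
    parts = [part.strip() for part in value.split(";")]
    count = int(parts[0])
    window = None
    for part in parts[1:]:
        key, _, raw = part.partition("=")
        if key.strip() == "w":
            window = int(raw)
    return RateLimit(count, window)


if __name__ == "__main__":
    limit = parse_ratelimit_header("100;w=21600")
    print(limit.count, limit.window_seconds)
```

Logging the remaining count at regular intervals would give a direct measure of how long a given setup lasts before hitting the limit.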

          dduportal Damien Duportal added a comment -

          Got started with the "IfNotPresent" policy + digest-based images:

          • https://github.com/jenkins-infra/jenkins-infra/pull/1833 => defined "IfNotPresent"
          • https://github.com/jenkins-infra/jenkins-infra/pull/1834 => enable automatic update for Docker
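
The combination can be sketched as follows: once the image is referenced by an immutable digest, `imagePullPolicy: IfNotPresent` lets a node reuse its local copy instead of contacting the registry at each agent start. The helper names, the `jnlp` container name, and the placeholder digest are assumptions for illustration, not taken from the pull requests above; the field names are those of the Kubernetes container spec.

```python
import re

# A sha256 digest is "sha256:" followed by 64 lowercase hex characters.
DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def pinned_image(repository: str, digest: str) -> str:
    """Return an image reference pinned by digest, e.g. repo@sha256:..."""
    if not DIGEST_RE.match(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    return f"{repository}@{digest}"


def agent_container(repository: str, digest: str, name: str = "jnlp") -> dict:
    """Build a container spec fragment for an agent pod.

    With a digest-pinned image, IfNotPresent is safe: the same reference
    always designates the same content, so a cached copy is never stale.
    """
    return {
        "name": name,
        "image": pinned_image(repository, digest),
        "imagePullPolicy": "IfNotPresent",
    }


if __name__ == "__main__":
    placeholder_digest = "sha256:" + "0" * 64  # not a real image digest
    print(agent_container("jenkins/inbound-agent", placeholder_digest))
```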
          dduportal Damien Duportal added a comment -
          https://github.com/jenkins-infra/jenkins-infra/pull/1835 => First automated upgrade: we should be able to try without pullsecret now that Casc defines agent sha
          dduportal Damien Duportal added a comment -
          PR to remove imagePullSecrets: https://github.com/jenkins-infra/jenkins-infra/pull/1842
          dduportal Damien Duportal added a comment -
          • The secret has been removed from the ci.jenkins.io configuration AND from the Kubernetes cluster
          • The Docker Hub account cijenkinsiok8s has been disabled and deleted (it was only meant to be temporary)

            People

            Assignee:
            dduportal Damien Duportal
            Reporter:
            dduportal Damien Duportal