
Combined reference repositories are too heavy-weight (git takes ages to parse them)

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Component: git-plugin
    • Labels: None

      For our project I made a combined reference repository along the lines of https://plugins.jenkins.io/git/#combining-repositories, and more specifically with the https://github.com/jimklimov/git-scripts/blob/master/register-git-cache.sh script that I maintain for that purpose.
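      Roughly, such a combined cache amounts to the following (a minimal sketch with example paths and URLs; the real script also handles locking, pruning, deduplication of remotes and so on):

          # One bare repository accumulates objects from many remotes:
          git init --bare /mnt/gitcache/combined.git
          cd /mnt/gitcache/combined.git
          git remote add repo-a https://github.com/example/repo-a.git
          git remote add repo-b https://github.com/example/repo-b.git
          git fetch --all

          # Clients then borrow objects from it instead of re-downloading:
          git clone --reference /mnt/gitcache/combined.git \
              https://github.com/example/repo-a.git workspace-a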

      This repository now hosts several hundred replicas and has well over a million objects, and we found that it no longer provides the speedup it did when it was young - instead, it has become a huge bottleneck for us. Part of the problem is that we have many workers sharing the same cache, so it is served over NFS, and while it looks like a local filesystem to Git, reading it all to find stuff takes many minutes (especially when many readers compete).

      My CLI git tracing seems to show that if the client checks out a commit already present in the reference repo, it is quickly found in the index and served. If the commit is not there, however (a new remote branch HEAD, a PR HEAD, etc.), git scans all the objects and gigabytes of pack data - perhaps making sure the commit is not hiding there un-indexed? No idea. Lately this takes tens of minutes, and updating the reference repo from all its remotes can spread over hours. And it is not that many gigabytes, living on an SSD...
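      For reference, this is the kind of tracing meant above, plus generic git maintenance that can cut lookup costs in a huge object store (example paths; these are standard git commands, and whether such maintenance helps enough here is exactly what is in question):

          # Trace where the time goes during a clone against the big cache:
          GIT_TRACE=1 GIT_TRACE_PERFORMANCE=1 \
              git clone --reference /mnt/gitcache/combined.git \
              https://github.com/example/repo-a.git /tmp/probe

          # Standard maintenance for a large object store:
          cd /mnt/gitcache/combined.git
          git repack -a -d -b                  # single pack plus reachability bitmaps
          git commit-graph write --reachable   # precomputed commit lookups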

      So the bright idea I had was that the single point of configuration for reference repositories (e.g. in a pipeline Organization Folder job) could instead lead the git client to individual smaller-scope repositories relevant to each checkout. At least those can be trawled in finite time for each cloning request.
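      A hypothetical layout (example paths and URLs): one mirror per project under a common root, so that a checkout only has to consult the one relevant cache:

          CACHE_ROOT=/mnt/gitcache
          for url in https://github.com/example/repo-a.git \
                     https://github.com/example/repo-b.git ; do
              name=$(basename "$url" .git)
              git clone --mirror "$url" "$CACHE_ROOT/$name.git"
          done

          # A checkout then only trawls the one relevant repository:
          git clone --reference "$CACHE_ROOT/repo-a.git" \
              https://github.com/example/repo-a.git workspace-a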

      I have a small PoC that "works for me" which I'll PR in a moment, but it would really benefit from thoughtful design, and I can guess at a lot of corner cases where the PoC implementation is not portable or convenient. So next steps should come after experts agree on the direction - or confirm that the right one was already taken here.

      PoC PR: https://github.com/jenkinsci/git-client-plugin/pull/644


          Jim Klimov created issue -
          Jim Klimov made changes -
          Description edited: added the PoC PR link
          Jim Klimov made changes -
          Description edited: noted that the smaller-scope repositories can be trawled in finite time for each cloning request
          Mark Waite made changes -
          Assignee Original: Mark Waite [ markewaite ]
          Mark Waite made changes -
          Summary Original: Combined reference repositories are too heavy-weight (git takes ages to parse them) New: Combined reference repositories over NFS are too heavy-weight (git takes ages to parse them)

          Jim Klimov added a comment - edited

          Hi Mark, while NFS does not help performance-wise, it helps organizationally, letting dozens of workers share a single download of everything (which by itself takes 30 minutes to several hours just to refresh, when initiated from the storage server to the internet).

          I re-checked that a local filesystem copy of the big refrepo, even one kept in tmpfs, was still a brake. It is not that big, by the way - about 3 GB, though with millions of objects. The NFS server serves it all from RAM, not hitting the HDDs or the SSD cache. At gigabit speed, a full read of that directory might take half a minute by itself - at least if the index is compacted into a few huge files, or even one. The thing is, we never expected git to have to read it all. And besides that read, the workers often spend minutes churning CPU looking for something...

          A complete CLI Git checkout of a test repo took about a minute from our local Bitbucket without a refrepo, and half a minute if shallow (depth=1 and depth=30 differed marginally).

          A checkout with the big refrepo added about a minute to each of those.

          Using that first complete checkout as a smaller-scope refrepo, tests completed in about 5 seconds for a new complete checkout (into another filesystem, to rule out hardlinks) and were consistently a bit slower for shallow ones.
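          The rough shape of that comparison, with example URLs and paths (not the literal commands used):

              time git clone https://bitbucket.example.com/proj/test.git full-nocache
              time git clone --depth=1 https://bitbucket.example.com/proj/test.git shallow-nocache
              time git clone --reference /mnt/gitcache/combined.git \
                  https://bitbucket.example.com/proj/test.git full-bigcache
              # Reuse the first full clone as the smaller-scope reference:
              time git clone --reference "$PWD/full-nocache/.git" \
                  https://bitbucket.example.com/proj/test.git full-smallcache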

          So I'm heading for a manageable single-point-of-config tree of refrepos rather than one huge repo; I hope that PoC PR gets me there soon.

          I'm rather sure the pathology is in git itself, but addressing the symptoms from the Jenkins git-client-plugin is good enough if it works in practice.

          Mark Waite made changes -
          Summary Original: Combined reference repositories over NFS are too heavy-weight (git takes ages to parse them) New: Combined reference repositories are too heavy-weight (git takes ages to parse them)
          Jim Klimov made changes -
          Assignee New: Jim Klimov [ jimklimov ]
          Jim Klimov made changes -
          Status Original: Open [ 1 ] New: In Progress [ 3 ]

          Jim Klimov added a comment -

          Status update from PR #644:

          The current version of this plugin PR is already bringing us value in experiments with the GIT_SUBMODULES token interpretation, although parsing of the actual submodule data (.gitmodules) is not yet completed. That seems like a nice speed-up for getting into the right directory when the needed repository is exactly the one named in the URL configured by the submodule definition, but it is less helpful for co-hosting forks of the same repository in the same directory (the mode that still uses a combined repo with several remotes, but with a much smaller scope to walk, across much more relevant commit objects).
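          For illustration: .gitmodules uses plain gitconfig syntax, so stock git can list the configured submodule URLs (just one way to read it; the PR's parser may differ):

              git config --file .gitmodules --get-regexp 'submodule\..*\.url'
              # e.g.: submodule.libfoo.url https://github.com/example/libfoo.git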

          The "fallback" modes of just looking for subdirectories that are git repositories, and recursing into such to inspect the remotes' configurations there, is more I/O intensive but already works

          On the unit-testing side, I'll probably deprecate the original GIT_URL token expansion (which just expects the original URL made into a directory tree on a local filesystem): while it was useful for getting the feet wet, and "just worked" on illumos and Linux systems, it does not work on Windows (as anticipated, with ':' being a reserved character), and more useful and portable tokens have been designed since that PoC step, so there is not much point in complicating matters to make this mode work everywhere by mangling paths or something. GIT_URL_SHA256 already mangles the URL by hashing the normalized URL string into a subdirectory name, and GIT_URL_BASENAME strips the tree part and just relies on the final path component, like "git-client-plugin(.git)", matching a subdirectory name in the refrepo.
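          For illustration, how those tokens could map a URL to a subdirectory name (the exact normalization rules live in the PR, not here):

              URL="https://github.com/jenkinsci/git-client-plugin.git"

              # GIT_URL_BASENAME: the final path component, ".git" suffix optional
              basename "$URL" .git
              # -> git-client-plugin

              # GIT_URL_SHA256: hash of the normalized URL string as the directory name
              printf '%s' "$URL" | sha256sum | cut -d' ' -f1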


            Assignee: Jim Klimov [ jimklimov ]
            Reporter: Jim Klimov [ jimklimov ]
            Votes: 0
            Watchers: 2
