[JENKINS-64383] Combined reference repositories are too heavy-weight (git takes ages to parse them)

Type: Bug
Status: In Progress
Resolution: Unresolved
Priority: Minor
Assignee: Jim Klimov
Labels: None
For our project I made a combined reference repository along the lines of https://plugins.jenkins.io/git/#combining-repositories, and more specifically with the https://github.com/jimklimov/git-scripts/blob/master/register-git-cache.sh script that I maintain for that purpose.
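For context, here is a minimal sketch of how such a combined cache is assembled and consumed (the cache path and remote URLs are illustrative; the real script handles registration and maintenance details):

{code}
# One big bare repository accumulating objects from many remotes
git init --bare /srv/gitcache
cd /srv/gitcache
git remote add projA https://github.com/example/projA.git   # illustrative remotes
git remote add projB https://github.com/example/projB.git
git fetch --all

# Consumers point their clones at the shared cache; objects already present
# are borrowed via .git/objects/info/alternates instead of being re-downloaded
git clone --reference /srv/gitcache https://github.com/example/projA.git workspace
{code}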
This repository now hosts several hundred replicas and well over a million objects, and we found that it no longer provides the speedup it did when it was young - instead, it has become a huge bottleneck for us. Part of the problem is that many workers share the same cache, so it is served over NFS; while it *looks* like a local filesystem to Git, reading through all of it to find anything takes many minutes (especially when many readers compete).
My CLI git tracing seems to show that if the client checks out a commit already present in the reference repo, it is quickly found in the pack indexes and served. If the commit is not there, however (a new remote branch HEAD, a PR HEAD, etc.), git scans all the objects and gigabytes looking for something... perhaps making sure the commit is not hiding there un-indexed? No idea. Lately this takes tens of minutes, and updating the reference repo from all its remotes can spread over hours. And it is not that many gigabytes, and it lives on an SSD...
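The tracing mentioned above can be reproduced with git's standard tracing knobs; a sketch, assuming the illustrative paths from the earlier example:

{code}
# GIT_TRACE and GIT_TRACE_PERFORMANCE are standard git tracing variables;
# trace output goes to stderr, so capture it into a log file
GIT_TRACE=1 GIT_TRACE_PERFORMANCE=1 \
    git clone --reference /srv/gitcache \
    https://github.com/example/projA.git workspace 2>clone-trace.log

# The long-running phases show up as "performance:" lines in the log
grep performance clone-trace.log
{code}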
So the bright idea I had was that the single point of configuration for reference repositories (e.g. in a pipeline Organization Folder job) could instead lead the git client to an individual smaller-scope repository relevant to each checkout. At least those can be trawled in finite time for each cloning request.
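To illustrate the idea (this is not the actual PoC, which lives in the Java git-client-plugin code; the helper name and cache layout here are hypothetical): derive a small per-repo cache path under the one configured root from the remote URL, and fall back to a plain clone if no such cache exists:

{code}
# Hypothetical helper: map a remote URL to a small per-repo cache
# under the single configured root, instead of one huge combined repo
cache_for_url() {
    # e.g. https://github.com/example/projA.git => /srv/gitcache/projA.git
    echo "/srv/gitcache/$(basename "$1")"
}

URL=https://github.com/example/projA.git
REF="$(cache_for_url "$URL")"
if [ -d "$REF" ]; then
    git clone --reference "$REF" "$URL" workspace
else
    git clone "$URL" workspace
fi
{code}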
I have a small PoC that "works for me", which I'll PR in a moment, but it would really benefit from thoughtful design, and I can guess at a lot of corner cases where the PoC implementation is not portable or convenient. So next steps should come after experts agree on the direction - or confirm that the right one was already taken here.
PoC PR: https://github.com/jenkinsci/git-client-plugin/pull/644