I worked on VFS for Git for a while, helping create the initial Linux implementation by bridging libfuse with the core VFS for Git .NET implementation. It was neat — filesystem virtualisation feels like such a powerful tool — but I’m unsurprised and pleased to see that what we’ve ended up with avoids that entirely.
It seems to me that the ideal solution would be both. It is nice to do a sparse checkout when you truly only need part of a repo and it is easy to specify which part. But a VFS is super useful for having the full repo available, especially if you use build systems that aren’t sparse-checkout aware and would otherwise force you to check out your whole dependency graph.
So I think it is very valuable to make the non-VFS solution as fast and scalable as possible. But, more or less fundamentally, the working directory doesn’t scale without a virtual filesystem, so it is worth pursuing that option as well for the users who need it.
This is about the scalar command, available since Git 2.38, which sets appropriate defaults for blobless cloning, periodic maintenance of large repositories, etc. Unlike a shallow clone, this doesn’t break git log or git blame.
I’ll have to see whether I can use scalar clone --full-clone --no-src as a drop-in replacement for git clone on larger repos (e.g. the Linux kernel).
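For anyone else trying this, a rough sketch of the commands involved (the URL is a placeholder, and the plain-git lines only approximate part of what scalar actually configures):

    # defaults: blobless partial clone into repo/src, sparse-checkout enabled,
    # background maintenance registered
    scalar clone https://example.com/big/repo.git

    # the variant above: full working tree, no extra src/ directory
    scalar clone --full-clone --no-src https://example.com/big/repo.git

    # an approximate plain-git equivalent of the blobless part of those defaults
    git clone --filter=blob:none https://example.com/big/repo.git
    cd repo && git maintenance start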
If git blame supports on-the-fly downloading of blob data, it seems like it should be possible to teach most history commands to download missing history on the fly in a shallow clone.
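Today that deepening has to be asked for explicitly; here is a small sketch of the manual steps that such on-the-fly support would effectively automate (example.com is a placeholder):

    git clone --depth=1 https://example.com/big/repo.git
    cd repo
    git fetch --deepen=100    # pull in 100 more commits of history
    git fetch --unshallow     # or fetch all of the remaining history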
How have I not heard of these commands? I see a lot of submissions on the pros and cons of git, along with in-depth technical explainers of git’s inner workings. For some reason, this is the first time I can say I was consciously presented with these specific tools and strategies, with the exception of shallow clone and sparse checkout, which I already use.
I wondered why treeless clones should be ignored. This article from Derrick Stollee explains things nicely. IIRC the git protocol allows clients to say which parts of the commit graph they already have, but not which parts of the forest of trees, so it makes sense that it’s hard to incrementally back-fill treeless clones efficiently.
On the other hand, in a blobless clone, surely there’s a way for things like git blame to get the missing blobs in one round trip instead of sequentially, one after the other?!
I was thinking about the blame case last night. I’m speculating, but… it seems like blame would be pulling commits from newest to oldest for a given file until every line in the newest commit has been traced back to the commit that introduced it? Based on my understanding of git so far, it seems like it just wouldn’t know at all until it started walking backwards. It might be one commit, or it might be every commit all the way back to the start of time (e.g. an append-only change log). Heuristic-wise it might make sense to pull all of the history down when you do a blame, but there are degenerate cases where that could pull significantly more data than necessary (e.g. a file with lots of churn that has since been replaced by “// don’t use this now”).
It probably makes sense to do some sort of batching. Even a simple option like fetching N revisions at a time would provide a significant speedup (if the file isn’t huge and round-trip time dominates, the speedup could be nearly a factor of N). I imagine some sort of smarter batching could work too, like fetching more revisions on each successive trip, but the heuristics here would be complicated and imperfect.
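For what it’s worth, here’s a rough sketch of doing that batching by hand in a blobless clone today (path/to/file is a placeholder; this mirrors the kind of fetch Git issues internally for missing objects, and assumes the promisor remote allows fetching objects by ID, which servers that support partial clone generally do):

    # commits and trees are local in a blobless clone, so we can resolve every
    # historical blob ID for the file without touching the network...
    for c in $(git rev-list HEAD -- path/to/file); do
        git rev-parse --verify -q "$c:path/to/file"
    done | sort -u |
        git fetch --no-tags --no-write-fetch-head --stdin origin
    # ...and blame then reads local objects instead of fetching one blob per commit
    git blame path/to/file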
One of the problems with sparse checkouts in very large repositories is that you need to know what to check out. Your project is likely to have dependencies on other code in the repository, and that code depends transitively on even more code. (If you didn’t have intra-repo dependencies of some sort, then you wouldn’t really be benefiting from putting all your code in the same repository to begin with.)
Either you need to have tooling just to manage the sparse checkouts, or you need to manually curate sparse checkout lists. In that sense, a VFS-based approach is a little more attractive because, in principle, you wouldn’t really need to figure out all the dependencies ahead of time: you could just pull in your dependencies as your tooling makes requests for them. On the other hand, poorly-integrated tooling can easily do a filesystem trawl and accidentally load everything, largely defeating the point of the “sparse” checkout.
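As a sketch of what that tooling might ultimately feed into git (the list-deps helper and the directory name are made up for illustration):

    # hypothetical: a dependency-aware tool prints the directories a project
    # needs, and we hand them to sparse-checkout in cone mode
    git sparse-checkout init --cone
    git sparse-checkout set services/frontend $(./tools/list-deps services/frontend)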