I was hoping this post would show improvements that make monorepos in Git easier to manage. But instead, the article’s summary is more like: “it’s bad, split things up instead”.
I hope we’ll some day have great open source tools for managing even large, multi-project repos like this.
That’s a bit of a bleak summary, and it wasn’t quite my intention.
In general git is pretty good at managing even large repositories. So I think most people probably don’t have to worry.
But if you choose to move to a monolithic repository (that is large in a number of those dimensions I talk about) you should know what to expect and where it could be affecting you negatively.
And lastly, I don’t see a reason why some of the improvements that are being discussed for Mercurial couldn’t work for git as well.
It would be great if Atlassian gave some thought to supporting monorepo workflows in Mercurial, where they work right now. You have by far the best Mercurial cloud support, but it seems like you have decided that it’s a legacy project that you shouldn’t emphasize.
Theoretically, all of the changes Mercurial is doing right now could be done in Git, but there are a lot of things working against it.
The first is that Git lacks a clean extension mechanism. It’s largely, even still, a collection of Perl scripts on top of small C programs. That’s been changing rapidly over the last year—more and more is being pulled into C—but it’s still the actual situation today. This in turn means that Git lacks a real API beyond the shell commands, which means that writing tools in the Git ecosystem is pretty difficult: to do it “properly”, you’d have to constantly invoke scripts and C programs to work with the Git object store.
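To make the "constantly invoke scripts and C programs" point concrete, here is a rough sketch of what a tool ends up doing when it goes through the command line instead of an API. It uses two real commands, `git rev-parse` and `git ls-tree`; the helper names and the `run` parameter (a hook so the process call can be swapped out) are my own and purely illustrative.

```python
import subprocess


def git_plumbing(args, cwd=".", run=subprocess.run):
    """Run one git command in a child process and return its stdout as text."""
    result = run(["git", *args], cwd=cwd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def head_tree_entries(cwd=".", run=subprocess.run):
    """List (mode, type, sha, path) for the top-level tree of HEAD.

    Even this small question costs two separate process launches.
    """
    commit = git_plumbing(["rev-parse", "HEAD"], cwd, run)
    listing = git_plumbing(["ls-tree", commit], cwd, run)
    entries = []
    for line in listing.splitlines():
        # ls-tree prints "<mode> <type> <sha>", then a TAB, then the path.
        meta, path = line.split("\t", 1)
        mode, kind, sha = meta.split()
        entries.append((mode, kind, sha, path))
    return entries
```

Every question a tool asks this way means another process launch and another round of text parsing, which is why so many tools skip the command line entirely.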
That first problem caused the second, much nastier problem: since most tools don’t want to have to shell out to a bunch of scripts and small executables, many—possibly even most—tools rely directly on the Git file format and protocol. This means that everything from niche tools like bup to major tools like Gerrit, Eclipse, and anything else using libgit2, Dulwich, JGit, or one of the other “native” Git libraries will break if Git changes its file format or protocol in a meaningful sense.
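For a sense of what "relying directly on the file format" looks like, here is a minimal sketch of how a loose object is laid out in a SHA-1 repository, as far as I understand it: a `type size` header, a NUL byte, the content, all zlib-compressed and named by the SHA-1 of the uncompressed bytes. The function names are mine; packfiles and SHA-256 repositories are not covered.

```python
import hashlib
import zlib


def encode_loose_object(kind, content):
    """Return (object id, compressed bytes) for a loose object of the given type."""
    store = kind.encode() + b" " + str(len(content)).encode() + b"\0" + content
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def decode_loose_object(compressed):
    """Parse loose-object bytes back into (type, content)."""
    store = zlib.decompress(compressed)
    header, content = store.split(b"\0", 1)
    kind, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size header does not match content length")
    return kind.decode(), content
```

Code like this is baked into every third-party library, so any change to the layout has to be repeated in each of them.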
Combine these two issues, and you’ve got your problem: to do something like Mercurial’s remotefilelog, you really need to make the change to Git proper, and to the key libraries in the Git ecosystem, or it won’t take. And that’s really, really hard.
Mercurial, by contrast, has always had a real extension API, and for a very long time has also had something called the command server, which is a formalized protocol for tools to communicate with Mercurial precisely so that it can alter its internal architecture without breaking third-party tools. This means that when you write something like remotefilelog for Mercurial, all tools that interact with Mercurial can immediately make use of it.
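For the curious, the command server framing is small enough to sketch. This is based on my reading of the protocol as used with `hg serve --cmdserver pipe`: the server sends frames made of a one-byte channel, a four-byte big-endian length, and a payload, and the client requests a command by sending `runcommand`, a newline, a length, and the NUL-separated arguments. The function names are mine, and this is a sketch under those assumptions rather than a full client.

```python
import struct


def encode_runcommand(args):
    """Build the bytes a client writes to run one command, e.g. [b"log", b"-l", b"1"]."""
    data = b"\0".join(args)
    return b"runcommand\n" + struct.pack(">I", len(data)) + data


def read_frame(stream):
    """Read one server frame from a binary stream.

    Returns (channel, payload) for output channels such as b"o", b"e" and b"r".
    For the input channels b"I" and b"L" the length field is the amount of data
    the server is asking for and no payload follows, so (channel, length) is
    returned instead.
    """
    header = stream.read(5)
    if len(header) < 5:
        raise EOFError("command server closed the pipe")
    channel, length = struct.unpack(">cI", header)
    if channel in (b"I", b"L"):
        return channel, length
    return channel, stream.read(length)
```

A tool only ever sees command-line arguments going in and channel data coming out, so the storage layer underneath can change without the tool noticing.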
Is it possible to make a tool that looks and acts like Git, but behaves more like the monorepo-capable Mercurials? Absolutely. It’s even possible to do that to the actual Git suite, and my understanding is that Twitter, at least, wants to do so. But it’s a much, much harder hill to climb at this point. I wouldn’t hold my breath.
Facebook’s conclusion was basically that modifying git was too difficult (too much C). Sure, any free software can be made to do anything else, but the architecture and implementation language can complicate it.
Mercurial’s Python allowed Facebook to dynamically replace (monkeypatch) any part of it with their own implementation, and with fewer segfaults than they would have had in C.
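If monkeypatching is unfamiliar, here is a toy illustration of the idea. `FileLog`, `wrap_function` and `remote_read` are all made up for this example and are not Mercurial's real classes or extension API; the point is only that Python lets you swap a method at runtime while keeping the original around to fall back on.

```python
class FileLog:
    """Stand-in for a storage class (hypothetical, not Mercurial's real filelog)."""

    def read(self, node):
        return "local:" + node


def wrap_function(container, name, wrapper):
    """Replace container.name with wrapper, passing the original as first argument."""
    original = getattr(container, name)

    def wrapped(*args, **kwargs):
        return wrapper(original, *args, **kwargs)

    setattr(container, name, wrapped)
    return original


def remote_read(original, self, node):
    """Pretend that nodes starting with "r" live on a server; defer otherwise."""
    if node.startswith("r"):
        return "remote:" + node
    return original(self, node)


# Patch the class in place; every existing and future instance is affected.
wrap_function(FileLog, "read", remote_read)
```

Doing the equivalent in a C codebase generally means changing the source and rebuilding, which is a much bigger commitment.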
I think I was just hoping for improvements I could make use of in my own workflow. :) (I use Git with the Firefox source tree, which is pretty big compared to most projects, but of course still smaller than the megarepos at Facebook or Google.)
Git does work pretty well locally for large monorepos. Hosted tools like GitHub and Bitbucket have a lot more trouble at these extreme sizes, as you’ve discussed in the article. It’s a hard problem to handle all the features of these hosted tools efficiently for large repos, especially since it only affects a small set of users.
We… do? Mercurial.
To be fair, I don’t think the tools needed to make Mercurial scale are yet turnkey or well-documented, and one of them (treemanifest) hasn’t even shipped yet. Once treemanifest does land (which I think is going to be in 3.6, so, imminently), I’ll probably make a “big-Mercurial-in-a-box” post on how to use treemanifest, remotefilelog, and Phabricator to get truly massive repositories working sanely. (And if someone wants to beat me to the punch on that blog post, no hard feelings.)
Edit: And I’ll add that, putting my bias towards Mercurial away for a moment, both Git and Mercurial, today, honestly scale just fine for anything short of large multi-project repositories and projects with lots of binary assets, and even there, they both can handle most use cases if you use something like git-lfs or git-attic on the Git side, or largefiles on the Mercurial side. I’m really thrilled that Mercurial is going to be the DVCS when you need Google- or Facebook-level repositories, but I think that relatively few teams really need that.
Link for the Facebook reference: https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/
I’m glad to see this kind of dissection in light of articles like this floating around: http://www.wired.com/2015/09/google-2-billion-lines-codeand-one-place/