If you have 17 million lines of code in one repo you’ve made at least one horrible mistake. Big tech companies like to shoot themselves in the foot all the time and then brag about how good their surgeons are.
Big tech companies like to shoot themselves in the foot all the time and then brag about how good their surgeons are.
This perfectly describes the whole software industry.
At the risk of sounding pedantic: why? What is wrong with keeping all of your code in a single repository?
As a current Google employee, I second this question. We get lots of benefits from keeping everything together, and I can’t really imagine not doing this in a large, server-oriented organization. Well, I can, but I much prefer this approach.
For a public citation: “We have a single large depot with almost all of Google’s projects on it. This aids agile development and is much loved by our users, since it allows almost anyone to easily view almost any code, allows projects to share code, and allows engineers to move freely from project to project. Documentation and data is stored on the server as well as code.” http://research.google.com/pubs/pub39983.html
You can view code in other repos very easily (via github, gitweb, etc), you share code by packaging libraries as libraries and specifying them in the dependency system for your environment (pip, rubygems, apt, w/e), and being able to switch projects is determined by your company’s coding standards, which is orthogonal.
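As a concrete sketch of that dependency-system approach, a pip requirements file names each shared library and pins its version explicitly; the internal package name below is hypothetical:

```
requests==2.31.0      # public library from PyPI
acme-commonlib==4.2.0 # hypothetical internal library, released as a versioned package
```

The point is that the consuming project states exactly which version it depends on, rather than implicitly tracking whatever is at head in a shared tree.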
Yep, there are certainly multiple valid ways to organize things.
We avoid libraries and prefer to build everything from head, which one giant repository helps make possible, though I’m sure that tradeoff isn’t right for everyone. There’s some info about our build system at http://google-engtools.blogspot.com/2011/08/build-in-cloud-how-build-system-works.html, and I’d expect Facebook does something similar (especially since the config format of their open sourced Buck build tool is very similar to the format of the example BUILD file).
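As a sketch of what that looks like in practice, a Buck/Blaze-style BUILD file declares each target’s sources and dependencies explicitly, which is what makes building everything from head tractable (target and file names here are made up):

```python
cxx_library(
    name = "stringutil",
    srcs = ["stringutil.cpp"],
    exported_headers = ["stringutil.h"],
)

cxx_binary(
    name = "server",
    srcs = ["main.cpp"],
    # Built from the library's current source, not a prebuilt artifact.
    deps = [":stringutil"],
)
```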
Submodules or the equivalent in another system would also help.
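A minimal sketch of the submodule approach, assuming git is available (the /tmp paths are illustrative): the shared library lives in its own repository, and the application pins it at a specific commit.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# A standalone library repo, and an application repo that consumes it.
rm -rf /tmp/subdemo && mkdir -p /tmp/subdemo && cd /tmp/subdemo
git init -q lib && (cd lib && git commit -q --allow-empty -m "library initial commit")
git init -q app && cd app && git commit -q --allow-empty -m "app initial commit"

# Pin the library at its current commit; protocol.file.allow is needed
# on newer git versions for local-path submodules.
git -c protocol.file.allow=always submodule add /tmp/subdemo/lib vendor/lib
git commit -q -m "pin lib as a submodule"
git submodule status   # shows the exact commit the app depends on
```

Updating the library then becomes an explicit, reviewable commit in the app repo, which is the “be explicit about dependencies” property people want from split repos.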
Short answer: having everything in the same repo encourages coupling, while having each service in a different repo forces you to be explicit about dependencies.
These constraints can be enforced in other ways. For example, Chromium uses modules with explicit dependencies, and lower modules aren’t allowed to depend on higher modules: http://www.chromium.org/developers/how-tos/getting-around-the-chrome-source-code. While using different repos is one way to help maintain separation, it’s not the only way. Having the code together helps keep dependencies in sync and eases refactors that cross module boundaries, while module separation can still be enforced at a higher layer.
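As a sketch of that kind of higher-layer enforcement, a check like the following could run in CI over a single repo; the two-layer layout and directory names (base/, chrome/) are made up for illustration, and a real checker would parse declared dependencies rather than grep for includes:

```shell
set -e
# Hypothetical two-layer tree: base/ (lower) may not depend on chrome/ (higher).
rm -rf /tmp/layercheck && mkdir -p /tmp/layercheck/base /tmp/layercheck/chrome
printf '#include "chrome/browser.h"\nint f(void);\n' > /tmp/layercheck/base/bad.c
printf '#include "base/util.h"\nint g(void);\n' > /tmp/layercheck/chrome/ok.c

# Flag any upward include from the lower layer.
if grep -Rl '#include "chrome/' /tmp/layercheck/base; then
    echo "layering violation in base/"
fi
```

The downward include in chrome/ok.c is allowed; only the upward one in base/bad.c is flagged.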
I’m not claiming that it’s impossible to have everything in the same repo, just that the drawbacks of doing so are clear and the benefits are dubious in my opinion.
That’s fair. Some of the benefit could also be cultural as opposed to technical, encouraging the mindset of being “Facebook” instead of “<some component>”.
DVCSs that require the full history are especially problematic to scale to huge repositories, which is why Facebook’s use of mercurial here is especially interesting. Though I can certainly see where the comment about shooting themselves in the foot and bragging about patching it up is coming from :)
Isn’t tight coupling a bad thing?
At a past company we had a single monolithic codebase and multiple daily deployments. When there was a problem during deployment, we had to roll back and re-deploy after kicking out the bad patch. This process wasted a lot of people’s time and limited us to two, maybe three deployments a day.
To combat this, the company moved toward SOA, and newer products were deployed as decoupled services. This allowed teams to push as often as they liked and iterate much faster. Running the full test suite on the old codebase took an hour, while the services' test suites ran in minutes or less.
At the 17 million line level you probably have several distinct products that are bundled together in one repository.
I worked at a company that built embedded Linux systems; many of the products we supported were years old. All of these systems needed to stay in sync with certain core components, but there were slight modifications to the build for each (for supported features, chipsets, kernel customisations, etc). In this case, a single repo worked well.
Despite the many apparent advantages of multiple repos, and the lack of obvious advantages to a single one, every large software company tends toward the mega-repo. When multiple organizations independently develop the same solution to a problem that doesn’t exist, that suggests to me that the problem does exist; I just can’t see it.
Why does Facebook need 17 million lines of code, again?
Unsure why you got downvoted. You and I aren’t the only ones who are curious about this.
I agree: I’d like to see a breakdown out of curiosity!
For a sense of scale, their Android app had too many methods for Android to install without crashing, so they had to runtime-patch the Dalvik VM. https://www.facebook.com/notes/facebook-engineering/under-the-hood-dalvik-patch-for-facebook-for-android/10151345597798920 (I suppose that’s almost circular. They have a lot of code because they have a lot of code.)
They have an iOS app too. And Messenger and photo apps. Somewhere they also have a blog post explaining that the mobile apps used to be HTML5, but the experience sucked, so they rewrote it all in native code, and now it sucks a little less.
They use mysql. That’s 1.5 million lines of code there. Have a look at https://code.facebook.com/projects/ for more.
Wow, that’s certainly…daunting.
Counting their fork towards the total line-count makes it slightly misleading, IMO. If I do that, frameworks and jQuery push one of my side projects way over 25K, so I can start to see how 17 million becomes a reality!
I’d like to know, even if it’s a poor statistic, how much FB wrote.
I have no idea really, but if it lives in the same source control repo, it doesn’t matter who wrote it, mercurial still needs to deal with it.
hhvm (their compiler/jit) is a 30MB zip file (no history), but it’s mostly testcases, which is something else to consider. You don’t need to ship 17 million lines of code to have 17 million lines of testcases, but again, the source control doesn’t care what ships and what doesn’t.
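A rough sketch of why the headline number is slippery: a naive line count over a checkout lumps test data in with shipping code. The tiny tree below is fabricated to make the point.

```shell
set -e
# Fabricated checkout: a little "product" code and a lot of test data.
rm -rf /tmp/locdemo && mkdir -p /tmp/locdemo/src /tmp/locdemo/test
printf 'int main(void) { return 0; }\n' > /tmp/locdemo/src/main.c
seq 1 500 > /tmp/locdemo/test/cases.txt

# Total lines in the repo vs. lines that actually ship.
find /tmp/locdemo -type f -exec cat {} + | wc -l       # 501
find /tmp/locdemo/src -type f -exec cat {} + | wc -l   # 1
```

Either way, as the comment above says, the version control system has to carry all of it.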
They use mysql
I don’t think you get to count mysql, or you might as well go down to the kernel and count that too :)
I did find this Quora article that claimed they actually had 62M lines of code. I think that might be counting too much, too.
This PDF explains why Facebook is a hard problem and what a lot of that code might be doing: