1. 20

I’m seeing more and more startups using GitHub plus Jenkins/CircleCI/Travis and saying: “We want a monorepo”. In my experience, once the code and the team grow big, both Git and GitHub become an impediment to scaling and quality.

One of the reasons is Git’s way of handling binary files and history, and the fact that in monorepos not everything depends on everything else. Thus, there will be jobs that only need parts of the repo, yet end up doing a full checkout every time on CI. This is just one instance, but in general, all operations start taking more and more time, leading people to do workarounds, which leads to too much special-casing.
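
For the checkout problem specifically, newer Git versions offer a partial mitigation: sparse checkout combined with partial clone. A minimal sketch, assuming Git 2.25+, a server that supports partial clone, and hypothetical paths:

    # Clone without file contents and without populating the working tree
    git clone --filter=blob:none --no-checkout https://example.com/big-monorepo.git
    cd big-monorepo

    # Restrict the working tree to the directories this CI job actually needs
    git sparse-checkout init --cone
    git sparse-checkout set services/foo libs/common

    # Materialize only those paths; blobs are fetched on demand
    git checkout main

That said, every job definition still has to know its paths, which is exactly the kind of special-casing mentioned above.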

Another reason is that GitHub’s workflows, especially around reviewing, are simply subpar. For example, there is no tracking of differences between PR updates. Or, if one sets up CODEOWNERS (good), then everyone gets spammed for every PR. This just leads to: a) people stop reading emails, and then everyone pings that one guy who owns a single component, and/or b) adding more tools on top of GitHub, meaning more components to take care of in the pipeline, each of which might fail, causing lost productivity.

Thus, I’d like to hear your experiences: what did you do to make big monorepos work well and provide a nice experience for developers?

PS. I think both Git and GitHub are great, one as a nice VCS and the other as a platform for collaboration, just not for this particular use case.

  2. 6

    in monorepos not everything depends on everything else. Thus, there will be jobs that only need parts of the repo, yet end up doing a full checkout every time on CI.

    While this may be true for Git, it’s not true everywhere. We have a monorepo at work, but we use Perforce, where you can sync just a single folder (or file); it’s super useful.
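
    For anyone unfamiliar with Perforce, the partial sync is a one-liner (hypothetical depot paths):

        # Sync only one folder out of the whole depot
        p4 sync //depot/services/foo/...

        # Or even a single file
        p4 sync //depot/services/foo/main.go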

    1. 10

      in monorepos not everything depends on everything else. Thus, there will be jobs that only need parts of the repo, yet end up doing a full checkout every time on CI. This is just one instance, but in general, all operations start taking more and more time, leading people to do workarounds, which leads to too much special-casing.

      This leads me to the question: why do you want a monorepo in the first place? Multiple repositories are exactly the intended solution to this problem from the Git perspective.

      1. 7

        Multiple repos usually implies submodules. In my opinion, submodules do not scale well to many developers. Nothing that cannot be solved with additional scripts and other workarounds, but it adds friction. Submodules are annoying even in small teams, because you have to commit twice for every change in a submodule.
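
        For reference, the double commit looks like this (hypothetical paths):

            # First commit: inside the submodule itself
            cd libs/common
            git commit -am "Fix parsing bug"
            git push

            # Second commit: the superproject must record the new submodule pointer
            cd ../..
            git add libs/common
            git commit -m "Bump libs/common to pick up the parsing fix"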

        The power to change lots of places with a single atomic commit is sometimes critical. At Amazon, if you change an API, you had better do it in a backwards-compatible way and maintain it that way for a while. You need version numbers and a release process. At Google, you just adapt all the API users in the same commit. OK, this over-simplifies: in both cases you need special tooling. But it is a feature that is simply not possible with multiple Git repos. Just like permission control is not really possible within a single Git repo.

        1. 6

          At Amazon, if you change an API, you had better do it in a backwards-compatible way and maintain it that way for a while. You need version numbers and a release process.

          Yes, and the idea is that a service run by one team is consumed by other teams. This is designed to remove the need for multi-repo/service/team lockstep changes. It encourages good ownership and reduces conflict between teams.

          Unfortunately, the hype around microservices led many companies to have more services than employees. API versioning and backward compatibility become very expensive… and the monorepo model comes in.

          With one big repo and lockstep changes, builds, and deployments, it’s the perfect “distributed monolith”: all the problems of the old monolith (complexity, team interactions, lack of modularity…) plus the difficult debugging and overhead of microservices.

          1. 3

            more services than employees

            I am sincerely sad I never thought of such a great, short, obvious, accurate, and vicious description of the problem I have seen at many client sites.

          2. 2

            Yes, exactly this. Working with multiple repositories leads to a lot of submodules to model code dependencies. The natural progression of avoiding submodules then leads either to storing compiled artifacts in something like Artifactory and having binary dependencies (which leads to horrible cascading releases), or to bringing everything into a single repository (which breaks down if developers do not unify build systems).

          3. 9

            Cross-repo changes are annoying, especially when the company and codebase are young. It’s far easier to avoid technical debt with a monorepo.

            Personally, I think the best solution is a middle ground: scale up with a monorepo until it becomes too clunky, and then split things out where it makes sense.

            1. 3

              Particularly if the org has a standard like “only QA against RELEASE dependencies”. It can take f-cking months for all the planets to align to get new features into downstream libraries and applications. There’s also typically a lot of ceremony around releases in companies that build and sell turnkey software, or even host it for customers. It makes the entire process infuriating and slow (but this is in fact a feature).

              In this type of case I can see where a monorepo would be advantageous.

              1. 2

                Cross-repo changes are annoying, especially when the company and codebase are young

                Exactly, so why do you need multiple repositories? Because of the microservices madness? When you don’t even know the boundaries of your software, why try to split it?

              2. 4

                I’m not sure the concept of “monorepo” is well defined in the case of startups. It comes from the vocabulary of huge corporations with myriad different projects, but startups typically work on just a single thing, which may be divided into components like server and client. If you previously had two repositories, foo-server and foo-client, which you now merge into one repository foo, do you now have a “monorepo”? I don’t really think so.

                From a startup perspective, I think there’s a real issue with premature multiplication of repositories. This has been the case at several startups I’ve worked at. Many people report increased productivity and happiness after merging their startup’s several repositories into one. When a team works across several repositories, it’s as if your very source code were a complex distributed system. Working on developer tooling and build configuration becomes trickier.

                A recent contract of mine was with a small business where around a dozen developers worked on literally hundreds of repositories, one for each “project”. GraphViz had trouble plotting the dependency graph. Adding a feature to the system would usually involve committing to at least two or three repositories, and some changes might require committing to dozens. Of course, this also multiplies external dependencies, since each project pins its own versions. I think merging all of their repositories into one would have been great for the developers.

                1. 2

                  I’m not sure the concept of “monorepo” is well defined in the case of startups.

                  Sure it is: When you make a commit, can you make it across all components at once, or do you need to break it up into one commit per component? That’s all a monorepo means.
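
                  In Git terms, that atomic commit looks like this (hypothetical component paths):

                      # One commit touches the interface and every consumer at once
                      git add api/schema.proto services/billing/ services/auth/
                      git commit -m "Rename user_id to account_id across all consumers"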

                  1. 1

                    What if your startup doesn’t consist of many different “projects” but is just, e.g., one Rails app developed by a single team? That doesn’t really seem like a special monorepo… just a repo! For example, Wikipedia says (I know it’s not an authority) that “few build tools work well in a monorepo”, and this indicates that the name refers to a repo that is an especially large amalgamation of different projects.

                    1. 2

                      The key here is that ‘monorepo’ is not a bad thing. Indeed, for many many companies and teams, it’s a good thing.

                      One doesn’t necessarily need a single repo, but the number of repos should be relatively small, because the number of pairwise interactions tends to grow quadratically with the number of repos. Before too long, all any developer does is manage repo interactions, and no one has time to write any actual code.

                  2. 2

                    True, “monorepo” might be a stretch for some startups, but I’ve seen cases where it starts from exactly the case you mention and then moves on to having several components, not each needing all the others, as there might be several products, or at least several attempts at products. I can definitely confirm that after merging several repositories into a single one, happiness and productivity jump up, but only as long as the tooling can keep up.

                    The comparison with a complex distributed system is spot on :)

                  3. 4

                    Monorepos are one of those things that sound stupid until you try them; everyone I know who has developed on a monorepo laments whenever they can’t do it anymore.

                    1. 1

                      Fair enough! I hope I end up in a position where I can try it, then.

                      1. 1

                        It seems to be necessary only in a professional setting, though. There are not that many monorepos in the open-source world, which leads to a situation where open-source version control systems do not support that use case very well. Allegedly, closed-source ones (Plastic, Perforce) do, but I have no experience with them.

                        1. 1

                          I use a personal monorepo for all my projects.

                          1. 1

                            The BSDs (FreeBSD, certainly) use a monorepo. It’s one of the defining differences.

                      2. 3

                        I haven’t heard about a monorepo hype. Is this actually happening? I structure my Git projects as one repo per project and it works well.

                        At work we have a few SVN repositories that contain more than one project. These contain tools for one specific software product or a specific topic, and we usually set them up so that each tool has its own dedicated subfolder, for example Repo/ToolA and Repo/ToolB. This allows us to check out only the specific subfolder in Jenkins, as sketched below.
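
                        The partial checkout is a one-liner in Subversion (hypothetical repository URL):

                            # Check out just one tool's subfolder, not the whole repository
                            svn checkout https://svn.example.com/repo/ToolA ToolA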

                        1. 6

                          If you have a large organization where many services interact with one another, a monorepo is fantastic because you can change the interface between services in a single commit (and have CI fail if you didn’t realize that some additional team was depending on your interface).

                          1. 1

                            I see your point with this one, but given the cons from your question, and with static code checks being a pro, I would still go for multiple repositories and check the compatibility between the components with a system test suite or similar. But maybe we are talking about different sizes here; and luckily, I see that other people have ideas and solutions for your problem.

                            Just to make sure we are not talking about totally different things: in my company (~10k employees) we have a few software packages we sell, each developed by around 20 people, sometimes up to 50. Since these are monolithic desktop applications, I assume each resides in its own repository. That makes 5 repositories for 5 software products. A monorepo in this situation would mean that 5 software products get developed in 1 repository.

                            So, if you’re talking about the field of microservices (which I gather from the other answers), then we’re definitely talking about different situations :D

                            Maybe there is also a bit of a distinction in usefulness between desktop applications and hosted-only ones. If we shipped our desktop product P in version 1.2 to customer C, that customer expects us to keep the API stable for a few more versions. This means we have to keep the API backwards-compatible for a few months (in practice, I think it’s years) anyway.

                            1. 1

                              Monolithic desktop applications don’t tend to have frequently changing network/library interfaces, so I doubt the benefits would be as visible there, yeah.

                        2. 3

                          Despite being ugly as sin and having quite a steep learning curve, Gerrit is very nice for reviewing. Due to the learning curve, I’d avoid it for regular OSS.

                          As far as CI goes, the hosts should already have practically all of the Git history stored locally (from running the previous build). Writing the worktree to disk is only going to get slow once you’re Google-scale. Ephemeral storage is dirt cheap.
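
                          The incremental update on such a host is cheap. A minimal sketch, assuming a cached clone on disk and a hypothetical $BUILD_COMMIT variable supplied by the CI system:

                              # Reuse the existing clone: fetch only what's new, then check out the build revision
                              cd /cache/big-monorepo
                              git fetch --prune origin
                              git checkout --force "$BUILD_COMMIT"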

                          1. 1

                            Interesting! What do you think are the biggest advantages of Gerrit? Why would it work in an org but not for regular OSS?

                            Re: CI and ephemeral storage, agreed. I saw this breaking down when people take the route of “elastic” CI, spinning up EC2/GCE instances on demand to do the builds – there you must check out the repository on every new instance.

                            1. 1

                              Regular OSS is best served by making it easy for new contributors to get a patch merged. Putting a completely unfamiliar workflow, with its own terminology and so on, in front of them is too high a barrier.

                              When you work with it every day, the cost of learning how to operate it is well worth the benefits.

                          2. 2
                            1. 2

                              I did, and it looked promising. However, there is still no stable version that works on Linux and macOS.

                            2. 2

                              At coder.com, I wrote a Go command that figures out which targets (directories) need to be built, based on the dependencies and code a PR modifies, transitively (it automatically deduces Go package imports), and then generates a Buildkite pipeline with a concurrent step for every target. It uses default steps for Go targets without explicit steps.

                              It’s basically Bazel, but for Go and Buildkite. It’s only about a thousand lines, so much lighter than Bazel.

                              Buildkite docs for this feature: https://buildkite.com/docs/pipelines/defining-steps#dynamic-pipelines

                              It’s really cool and I would highly recommend this approach. We might open-source our pipeline generator one day.

                              See https://i.imgur.com/n0YjCFO.jpg for how it looks in the UI.
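
                              The change-detection idea can be sketched in shell (an inefficient sketch with hypothetical paths, not the actual coder.com tool, which maps directories to import paths more carefully):

                                  # Directories touched by the PR, relative to the merge base with master
                                  changed=$(git diff --name-only "$(git merge-base origin/master HEAD)" HEAD \
                                    | xargs -rn1 dirname | sort -u)

                                  # Rebuild every package whose transitive deps mention a changed directory;
                                  # each printed target would become one concurrent Buildkite step
                                  for pkg in $(go list ./...); do
                                    for dir in $changed; do
                                      if go list -deps "$pkg" | grep -q "/$dir\$"; then
                                        echo "$pkg"
                                        break
                                      fi
                                    done
                                  done | sort -u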

                              1. 1

                                Neat. Any chance of seeing this code open-sourced at some point in the future?

                              2. 2

                                I think, from what I’ve seen, the main thing to do is use submodules. Then the review and work for a given submodule can be encapsulated in its own repo, and developers who need the whole kit can clone the monorepo and then pull all the submodules (see the sketch below).
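
                                A minimal sketch of that whole-kit clone (hypothetical URL):

                                    # Clone the top-level repo and fetch every submodule in one go
                                    git clone --recurse-submodules https://example.com/kit.git

                                    # Or, in an existing clone, pull the submodules afterwards
                                    git submodule update --init --recursive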

                                Keep in mind that maybe it’s not time to restructure everything. It may be that you’re premature in thinking about breaking the project up, and the problems you perceive in terms of workability may not actually be problems yet. Having a monorepo where everyone has access to everything also helps developers of specific sections identify problems in code elsewhere, and lets them expand their competencies and explore and understand the application as a whole. If I’m a web developer and I never see the server-side code, that’s a lost opportunity to learn and expand my understanding of how the application works.

                                For instance, if I’m working on a request to get some data and what’s returned isn’t what I need, as a hungry developer I’ll go track down where that data is coming from, try to understand what needs to change, communicate it to the developer in charge of that code, and maybe even attempt to get it working myself. Walling that off is a lost opportunity, and it signals a lack of trust that your developers are mature enough to see the whole enchilada. Developers lose the feeling of ownership of a feature when they can’t see the places that feature will impact.

                                Another idea (contrary to the above) is to organize multiple monorepos around development context. For instance, a client-side team would have a client monorepo made up of the submodules they need access to, the server-side team would have a server-side monorepo, and the services team would have their own. This works at a relatively large scale, as you can reduce the exposure of individual teams to the whole project and limit access to sensitive business processes, etc. Then your deployment/release team has a monorepo with all the sub-projects plus deployment/build scripts, maybe using Ninja or something similar to create targets so they can build only specific sub-projects. It reduces the number of checkouts CI has to do; you’d just have to configure the CI to build only the monorepos whose submodules changed.

                                You can always set up a local Git server somewhere, not import any code, but just set up different structures and play around “acting” like a developer in a specific context, to see how a given setup would “feel” to work in and to identify pain points.

                                It’s really hard to think this through as a general concept, though, as projects differ so much: a desktop application is going to be different from a web app, mobile from desktop and web, etc., and you may need to really think about how to organize things around what you’ve got. Changing to another source control system that supports folder checkouts and the like might help, but there’s a cost in your whole team re-learning version control, adopting new tools if they use GUIs, etc., which can be a hard hit on productivity if you have a sizeable team.

                                1. 2

                                  To solve the first problem, Mozilla uses Mercurial and a homegrown CI system called Taskcluster. We have heuristics to determine which tasks should run for a given push.

                                  For the second, we use Phabricator with some custom tooling that allows submitting a series of commits, provides automated review comments (e.g. for linting), and supports automated landing from the web UI. It’s the best review experience I’ve ever used, and I can’t stand going back to GitHub anymore.

                                  1. 1

                                    If using Git, the useful feature I’d use all the time is specifying a path to most Git commands:

                                        cd bin
                                        git status .   # status for just this directory
                                        git add .      # stage only changes under bin/
                                        git log .      # history limited to bin/
                                        ...
                                    

                                    The BSDs mostly use a monorepo as well.

                                    With CVS, it is possible to check out only a subset of the whole monorepo. This feature shrinks the difference between a monorepo and a set of split repos to close to zero.
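
                                    For example (hypothetical CVSROOT and path):

                                        # Check out a single subdirectory of the repository instead of everything
                                        cvs -d :pserver:user@cvs.example.com:/cvsroot checkout src/bin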