1. 16

I was wondering if this process has been explained anywhere.

  1. 9

    It depends on what you mean by “their copy of git”. As far as I’m aware, this comes in at least 2 different forms, possibly more.

    The backend component of the website, written in Rails, most likely uses bindings for libgit2 (especially because GitHub maintains libgit2), probably rugged. Their development of libgit2 and rugged happens fairly out in the open. The SSH service may use the actual git binary (or rather git-receive-pack/git-upload-pack).
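
    To make the “bindings instead of shelling out” idea concrete: GitHub’s backend would be Ruby/rugged, but pygit2 exposes the same libgit2 operations, so here is a rough Python sketch (the repository path and branch name are hypothetical):

    ```python
    import pygit2

    # Open an existing bare repository and resolve the commit a branch
    # points at, entirely in-process via libgit2, with no `git` subprocess.
    repo = pygit2.Repository("/var/repos/example.git")  # hypothetical path
    commit = repo.revparse_single("refs/heads/main")     # hypothetical branch
    print(commit.id, commit.message.strip())
    ```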

    When I worked at Bitbucket, we used pygit2 and a small patch applied to git itself which improved NFS performance. There were a number of interesting edge cases, tested for both performance and accuracy, that determined whether we’d call out to the binary with a very specific command or use pygit2; all of that was hidden behind a wrapper library that abstracted over hg and git.
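
    A minimal sketch of that kind of dispatch, with hypothetical names and a hypothetical cutoff (this is not Bitbucket’s actual wrapper):

    ```python
    import subprocess

    import pygit2

    def rev_list_count(repo_path, ref="HEAD"):
        """Count commits reachable from ref.

        Walking a huge history through the bindings is slow, so past a
        cutoff we fall back to `git rev-list --count`, which is heavily
        optimized. The cutoff and the operation are illustrative only.
        """
        repo = pygit2.Repository(repo_path)
        target = repo.revparse_single(ref).id
        count = 0
        for _ in repo.walk(target):
            count += 1
            if count > 10_000:  # hypothetical threshold: use the binary
                out = subprocess.run(
                    ["git", "-C", repo_path, "rev-list", "--count", ref],
                    capture_output=True, text=True, check=True,
                )
                return int(out.stdout.strip())
        return count
    ```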

    It is theoretically possible to implement git-receive-pack and git-upload-pack in code rather than calling the binaries, but unless there’s a custom storage backend, this probably isn’t worth it.
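
    For reference, “calling the binaries” usually just means wiring the stock tools into your transport, roughly like this (a hypothetical sketch; a real server also handles auth, timeouts, and the SSH framing):

    ```python
    import subprocess

    def serve_fetch(repo_path, client_stdin, client_stdout):
        # Let the stock binary speak the pack protocol for a clone/fetch:
        # stdin/stdout carry the git wire protocol to and from the client,
        # and `git upload-pack` does the ref advertisement and pack sending.
        subprocess.run(
            ["git", "upload-pack", repo_path],
            stdin=client_stdin,
            stdout=client_stdout,
            check=True,
        )
    ```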

    EDIT: the git/libgit2 patches from Bitbucket I’m aware of are listed below.

    1. 2

      unless there’s a custom storage backend

      I would be surprised if there wasn’t in GitHub’s case. Is there anything public about that?

      1. 6

        Not directly, but they talk about how they handle replication with distributed operations at https://github.blog/2016-09-07-building-resilience-in-spokes/. I can’t seem to find anything on routing or how they handle repo accesses though.

        There’s also some pretty low-level stuff in https://github.blog/2015-09-22-counting-objects/ where they talk about what they did to optimize git clones for very large repos. They also mention “alternates” in passing, which can be very powerful (we used these at Bitbucket for diffs: we generated the diff in a separate working directory using alternates, because Bitbucket uses a rather unusual diffing algorithm).
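
        For anyone who hasn’t run into alternates: objects/info/alternates is just a plain-text list of extra object directories, so a scratch repository can borrow objects from another repo without copying them. A hypothetical helper (not Bitbucket’s code):

        ```python
        import os

        def add_alternate(scratch_repo, source_repo):
            """Let scratch_repo resolve objects from source_repo's store.

            git reads objects/info/alternates as a list of additional
            object directories, so the scratch repo can diff commits it
            never physically fetched.
            """
            alternates = os.path.join(scratch_repo, "objects", "info", "alternates")
            os.makedirs(os.path.dirname(alternates), exist_ok=True)
            with open(alternates, "a") as f:
                f.write(os.path.join(source_repo, "objects") + "\n")
        ```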

        As far as I can tell, GitHub tends to do a pretty good job of upstreaming any changes they make… and these blog posts seem to imply that they’re using raw git under the hood for most operations, which makes sense. If you can use the canonical implementation of something, you probably should… especially if it’s the technology on which your whole product is built.

        1. 7

          Thanks! I found it after reading your comment:

          I had a hard time believing they weren’t using a distributed filesystem or a specialized storage layer, because I’ve worked on storage-at-scale problems before and you run into needing something fancy pretty quickly.

      2. 1

        NFS? Really?

        1. 1

          Yep, it was a NetApp storage cluster that exported NFS volumes.

          1. 3

            I still have flashbacks to that exact setup. Of course, this was at Apple, so we were eating our own dogfood and using OS X as a client. It was very bad. I used to regularly find bugs in the NFS stack by searching our video sources for 1 kB runs of zeroes.

      3. 4

        This earlier Lobsters thread and the link therein give some hints about how it likely works.

        https://lobste.rs/s/7khgtp/barebones_git

        I shared some thoughts here:

        https://lobste.rs/s/7khgtp/barebones_git#c_rtjjpm

        I assume you are referring to the client/server setup and not something like a forked git source tree?

        1. 1

          Exactly. Thanks.