1. 19

I am interested in database systems that use git for storage/branching/sync/replication.

  1. 9

    uhm.. pass[0] ? The problem is, git isn’t really built for this use case. Git is great for text data, but quickly breaks down for binary data, and for anything beyond a very small database, performance is one of the main things you want from a DB.

    Git as a storage backend for a database would get terrible quickly from a performance perspective. The MS Windows codebase uses Git now[1], and they had to put in a LOT of engineering effort to get it to mostly perform OK-ish. Standard Git couldn’t handle it, and it’s 3.5M source files, and ~ 300GB on disk, which is on the small side for most of today’s databases.

    Even Redis, which is an in-memory database, would be hard pressed to store its on-disk copies in git with any sort of performance.

    0: https://git.zx2c4.com/password-store/about/

    1: https://devblogs.microsoft.com/bharry/the-largest-git-repo-on-the-planet/

    1. 9

      Irmin provides a git filesystem as a storage backend, and I’m aware they use a fetch/push/pull mechanism for synchronisation. However, I’m not sure if they use git for replication.

      Edit: One of the main problems with using git as a backend is that merging operates over text structure, and you can’t be sure what the outcome of a 3-way merge will be. So first, you need to fit your application data into something that git can handle. And second, you need to really make sure that (a) you won’t need to do manual merges, and (b) the result of a merge makes sense from the application point of view.
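
      A minimal sketch of problem (b), under stated assumptions: this is a simplified field-by-field three-way merge over dicts, not git's line-based merge, and `three_way_merge`, `is_valid` and the `min`/`max` record are hypothetical names made up for illustration. It shows two edits that are each valid on their own merging cleanly into a record the application would reject.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited copies of `base` field by field.

    Returns (merged, conflicts). A missing key reads as None, so this
    toy version cannot store None as a real value.
    """
    merged = {}
    conflicts = []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o              # both sides agree, or neither changed
        elif o == b:
            value = t              # only theirs changed this field
        elif t == b:
            value = o              # only ours changed this field
        else:
            conflicts.append(key)  # both changed it differently
            value = o              # keep ours as a placeholder
        if value is not None:
            merged[key] = value
    return merged, conflicts


def is_valid(record):
    """Hypothetical application invariant: min must not exceed max."""
    return record["min"] <= record["max"]


base = {"min": 1, "max": 10}
ours = {"min": 1, "max": 5}     # valid on its own
theirs = {"min": 7, "max": 10}  # valid on its own

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged, conflicts, is_valid(merged))
```

      The merge reports no conflict because the two sides touched different fields, yet the merged record has `min` greater than `max`. Nothing at the merge layer knows about that invariant.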

      At that point, you’re better off using something like CRDTs (like, say, Automerge), where you can use higher-level structures (lists, counters, sets, strings, etc), you won’t have merge conflicts, and you can be sure of what the outcome of a merge will be. There are several databases that use CRDTs under the hood, Riak was one of them, and Antidote is another one that comes to mind.
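
      To make the "no merge conflicts" point concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter. It is a textbook construction, not how Automerge, Riak or Antidote implement theirs, and the `GCounter` class and replica ids are hypothetical names for illustration.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by per-slot maximum."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, replica_id, amount=1):
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Return a new counter holding the per-replica maximum of both."""
        keys = set(self.counts) | set(other.counts)
        return GCounter(
            {k: max(self.counts.get(k, 0), other.counts.get(k, 0)) for k in keys}
        )


# Two replicas count independently, then sync in either order.
a = GCounter()
b = GCounter()
a.increment("a", 3)
b.increment("b", 2)

ab = a.merge(b)
ba = b.merge(a)
print(ab.value(), ba.value())
```

      Because the merge is a per-slot maximum, it is commutative, associative and idempotent, so replicas converge to the same value no matter the order or number of syncs.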

      1. 1

        Irmin was the system that piqued my interest in the subject, and then I lost track of it. Thank you for bringing it back to my attention. I’ll make a top-level post (after I have read more of the material) to discuss Irmin specifically, because I think it has many very interesting properties along with MRDTs (Mergeable Replicated Data Types).

        1. 1

          I just stumbled across a talk by Martin Kleppmann where he discusses CRDTs in relation to Automerge (Martin is the main developer). Automerge looks like very beautiful software.

        2. 8

          git-bug is an issue tracker using git as database. It’s not a general purpose database, but it has a great document explaining how it stores the data without having to deal with merge conflicts. Maybe you’ll find this document useful.

          1. 4

            very similar: git-dit:

            Git-dit stores issues and associated data directly in commits rather than in blobs within the tree. Similar to threads in a mailing list, issues and comments are modeled as a tree of messages. Each message is stored in one commit.
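
            A toy model of that layout, assuming nothing about git-dit's real on-disk format: each message is an immutable, content-addressed record that points at its parent, loosely like a commit. The `store`, `commit_message`, `replies` and `thread` names are hypothetical, and this is plain Python rather than real git objects.

```python
import hashlib
import json

store = {}  # hypothetical in-memory object store: id -> message record


def commit_message(text, parent=None):
    """Store one message as an immutable record and return its id."""
    record = {"text": text, "parent": parent}
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    msg_id = hashlib.sha1(payload).hexdigest()
    store[msg_id] = record
    return msg_id


def replies(msg_id):
    """Ids of all messages whose parent is `msg_id`."""
    return [i for i, r in store.items() if r["parent"] == msg_id]


def thread(msg_id, depth=0):
    """Render a message and its replies as indented lines."""
    lines = ["  " * depth + store[msg_id]["text"]]
    for child in sorted(replies(msg_id), key=lambda i: store[i]["text"]):
        lines.extend(thread(child, depth + 1))
    return lines


issue = commit_message("Issue: crash on startup")
c1 = commit_message("Comment: reproduced on Linux", issue)
c2 = commit_message("Comment: fixed in latest build", c1)
c3 = commit_message("Comment: also on macOS", issue)

print("\n".join(thread(issue)))
```

            In this model a new comment only ever adds a record and never edits an existing one, so two people replying at the same time produce two sibling messages rather than a textual merge.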

          2. 4

            In line with the other comments, GNU has recutils for text-based databases. https://www.gnu.org/software/recutils/

            1. 2

              I really like the philosophy/motivation for GNU recutils, but it’s unfortunate that there aren’t bindings for languages other than C/C++. The folks in the GNU Recutils IRC said the closest they’ve seen is a Python implementation, and even that one isn’t completely compliant with the specification; it implements just the core functions, but none of the advanced features (like distributed database support). I would love a Go implementation.

              1. 3

                Good opportunity to write one.

                1. 2

                  I’ve looked into it, and I have higher priorities at the moment. Not opposed to revisiting it

                  1. 2

                    There are several attempts, both bindings and translations.


                    I vaguely recall using recutils from within Delphi, of all things… I do not think it had (at least back then) anything like a merge function, which is what I thought the OP was looking for.

            2. 4

              You might want to have a look at Gerrit’s NoteDb.

              (I’m not sure, based on my limited experience of Gerrit administration, that you’d want to emulate it, but it’s interesting.)

              1. 3

                I looked into doing this once. At work, we had a tool where users could write little boolean expressions to classify bonds (for example something like: highYield = rating < BBB). These expressions got pretty complicated, and versioning them was tricky. I thought that git would be a nice fit.

                The main problem was replication. We had a strict requirement at the company that all services be distributed over multiple datacenters. With clustered databases, you can get this for free. With git, you need to commit to one repository, and then have some job/hook push out changes to other replicas. This generates a number of small nightmares: which repository is master? What if replication fails? How do you guarantee that commits aren’t lost when switching master?
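
                The "lost commits" case can be shown with a toy model. This is plain Python, not real git or a real hook, and `Repo`, `commit_and_replicate` and the datacenter names are hypothetical. The assumption is that one primary accepts commits and then pushes its full history to every reachable replica.

```python
class Repo:
    """Stand-in for a repository: just an ordered list of commit ids."""

    def __init__(self, name):
        self.name = name
        self.commits = []
        self.online = True


def commit_and_replicate(primary, replicas, commit):
    """Commit on the primary, then push to each replica.

    Returns the names of replicas that could not be reached.
    """
    primary.commits.append(commit)
    failed = []
    for replica in replicas:
        if replica.online:
            replica.commits = list(primary.commits)  # like a fast-forward push
        else:
            failed.append(replica.name)
    return failed


primary = Repo("dc1")
replica = Repo("dc2")

commit_and_replicate(primary, [replica], "c1")

replica.online = False                   # replication to dc2 starts failing
failed = commit_and_replicate(primary, [replica], "c2")

primary.online = False                   # dc1 goes down
replica.online = True
new_primary = replica                    # dc2 is promoted to master

lost = [c for c in primary.commits if c not in new_primary.commits]
print(failed, lost)
```

                The commit that was accepted while the push was failing exists only on the old master, so promoting the replica silently drops it unless something retries or reconciles later.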

                Using git turned out to be a pretty awkward thing to do; using a standard RDBMS seemed much more maintainable.

                I don’t think I would recommend using git as a database unless:

                • All (or almost all) the data you’re storing is plain text
                • It’s a non critical service that can withstand some downtime
                1. 3

                  What do you mean?

                  A database that can be controlled and configured using git tooling e.g. something that maps branches and history to clones and schema migrations?

                  Or an implementation of SQL that uses (e.g.) libgit2 to store tables, and can be push’d/pull’d like a git repository?

                  1. 3

                    Mirage Project’s Irmin. “Irmin - A distributed database built on the same principles as Git”

                    1. 2

                      I’m currently building a DVCS for relational data because using git for this use case seemed infeasible.

                      1. 2

                        One of the biggest systems I have seen using Git as a database is Gerrit. Starting with version 3, they switched the code review backend from SQL to pure git. This includes comments, review labels, … Prior to that, they were already storing per-repo configuration in a special git branch. In git-config format, obviously :) The whole system is written in Java and uses JGit as the git implementation. A nice side effect of that is that Gerrit replicas essentially git fetch the various repositories from the master.

                        1. 2

                          Matrix is, in addition to being an alternative to Slack, at its core a distributed graph database. The creators themselves liken it to git.

                          1. 2

                            Git as such isn’t appropriate for a database engine. Git is meant for trees of text files that change every few hours or days.

                            There are some databases that have DVCS-like semantics that could be compared to git; for example, Noms. I’ve seen some others whose names I don’t remember, but hopefully you can find more using Noms as a starting point.