1. 18

  2. 3

    This looks promising as what I really want is the diff/blame versioning of git, but for data. Lots of people build versioned databases, but they are easy to bypass and change the underlying data by skipping the app layer and going straight to the db.

    I’d like to have a database that I can tell if even the admin has changed data and what they changed without having to invoke blockchains.

    1. 3

      Datomic will let you do this (and doesn’t use block chains).

      1. 3

        This is exactly right. Blockchain is a major headache and terrible for efficiency / performance if what you really want is just an audit. The genius of git is that anybody can clone your repo and become an independent source of truth, including all past history. As long as multiple people have clones, it’s basically impossible to commit fraud without getting caught.

      2. 2

        You can get pretty far with PostgreSQL (and probably many other traditional SQL databases). Something like Liquibase for versioning the schema(and static data if you wanted) and an “audit log”[0] that takes transactions against tables and records them somewhere. Then you get unlimited(for as long as you want to store the audit log) undo, auditing, etc.

        It’s not the same certainly and merging across databases gets tricky, but for many use cases it’s probably good enough.

        0: https://severalnines.com/database-blog/postgresql-audit-logging-best-practices

        1. 2

          Yeah, if what you’re looking for is “versioned database,” there are lots of mature and performant commercial solutions. What none of them really address is what Git is good at: collaboration and distribution. If you have multiple independent parties contributing to a data set, standard DBs don’t really give you any help to make it all work. All the features we’re used to in source control (forking, merging, PRs) just don’t exist. Even among data sharing sites like Kaggle, it’s impossible to contribute edits to someone else’s data without emailing CSVs back and forth. We built Dolt to specifically to address the collaboration use case.

          1. 2

            Agreed. Besides the obvious of just giving them an account in your PG instance, or access to your liquibase git repo.

            Dolt seems useful for public datasets, but I don’t see much value in private datasets, since if you are managing access anyway, the obvious solutions above seem easier, and you get all the benefits of a nice SQL database.

            Anyways, I really enjoyed your post and the new perspective, as before now I never saw any value or even thought about git for databases really. Now I’ve started to think about it, so thanks!

            1. 3

              Well, I hope you’re wrong about the value in private datasets. Kind of building a whole company around that model :p

              The way I think about this is, whenever the primary interaction mode for data is human-curated, offline edits, maybe with human review on merge, then the git model of collaboration has a ton of value. We are betting that there are actually very many private data sets like this. It’s not a good match for OLTP, or most application backends that get frequent writes. But compare it to the state of the art for sharing data, which is literally still emailing CSV files around, and the value is obvious.

              1. 1

                I can think of a few human-curated, human-merge datasets at $WORK, but we solved this problem through just having an approval system as part of our PG DB(i.e. person A requests change Z and person B approves change Z).

                Your perspective seems to do the same thing(s) but using the git model of collaboration, so it’s much fuzzier who can approve change Z here, where my model is hard-defined(via some other tables, groups, etc) as to who is allowed to approve change Z. For us, the hard-definition(s) is a win, but I’m in a much-regulated organization that’s > 100 years old. I could see the softer approval definitions, useful in fuzzier orgs, or even at the beginning of an orgs life.

                I agree about CSV. I def. could see some value around sharing of data across orgs, or maybe even inside of orgs with something like this, but a lot of the time sharing this stuff is circulated around reporting, which means you need strong data exploration tools (which is usually just $spreadsheet for CSV).

                But thinking about this more, I can see a use-case for tentative edits, where some $bossy type wants a bunch of changes made and the software isn’t really built for that change, so we end up sending CSV files back and forth between us and $bossy type(s). For us the value would be in being able to then integrate the “finished” back into our PG/SQL DB. I don’t see $bossy type navigating dolthub though. They can barely handle $spreadsheet some of the time.

                I’m not sure I’m a target audience for dolthub. Anyways, good luck, and for your sake I hope I’m wrong too!