1. 75
  1.  

    1. 11

      This is a neat demo of the custom “diff” feature in Git, which I hadn’t seen before.

      I built my own tooling to solve this problem: https://datasette.io/tools/sqlite-diffable - which outputs a “diffable” copy of the data in a SQLite database, precisely so you can store it in Git and look at the differences later.

      I’ve been running that for a couple of years in this repo: https://github.com/simonw/simonwillisonblog-backup - which provides a backup of my blog’s PostgreSQL Django database (first converted to SQLite and then dumped out using sqlite-diffable).

      Here’s an example diff: https://github.com/simonw/simonwillisonblog-backup/commit/72e73b2cdd714fb1f3cd87d6a752971fc6398890
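
      The core idea fits in a few lines of Python - one JSON line per row, per table, so Git diffs show row-level changes. This is a simplified sketch, not the actual sqlite-diffable code:

      ```python
      import json
      import sqlite3
      from pathlib import Path

      def dump_diffable(db_path: str, out_dir: str) -> None:
          """Write each table as newline-delimited JSON so the dump diffs cleanly."""
          out = Path(out_dir)
          out.mkdir(parents=True, exist_ok=True)
          conn = sqlite3.connect(db_path)
          conn.row_factory = sqlite3.Row
          tables = [r[0] for r in conn.execute(
              "SELECT name FROM sqlite_master WHERE type = 'table'")]
          for table in tables:
              with open(out / f"{table}.ndjson", "w") as f:
                  for row in conn.execute(f'SELECT * FROM "{table}"'):  # assumes sane table names
                      f.write(json.dumps(dict(row), default=str) + "\n")

      dump_diffable("blog.db", "dump/")  # then commit dump/ to Git as usual
      ```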

      1. 4

        Simon, just today I went down the rabbit hole that is your blog and, by extension, Datasette. I’m honored to hear from you!

        I’m curious: what advantage does this data format have over plain SQL dumps? It’s hard to justify any additional dependency, but I’m sure you had a compelling reason to build a custom format.

        1. 3

          Mainly I wanted a format that’s somewhat database independent. Every entry on my blog is available in this single newline-delimited JSON file: https://github.com/simonw/simonwillisonblog-backup/blob/main/simonwillisonblog/blog_entry.ndjson

          I don’t need SQLite to work with that data - I can pull it into anything that speaks JSON.

          I also really like the diffs I get back from that - there’s no confusion with different formats of INSERT statement etc.
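
          Consuming it needs nothing beyond the standard library - a minimal sketch, assuming one JSON value per line:

          ```python
          import json

          # Parse the newline-delimited JSON dump without touching SQLite at all.
          with open("simonwillisonblog/blog_entry.ndjson") as f:
              entries = [json.loads(line) for line in f if line.strip()]

          print(f"{len(entries)} blog entries loaded")
          ```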

        2. 2

          There is one huge benefit of storing the data in a textual format: repository size.

          Git’s storage strategy is notoriously bad at handling frequent changes to binary files. Though, if someone’s already made this choice and you have to deal with it, the strategy you mention here is a fantastic one.

          1. 3

            Git’s storage strategy is notoriously bad at handling frequent changes to binary files.

            Although I’m sure git could have better binary deltas, the main reason for this is that binary formats are very poor at preserving information across changes. When everything gets scrambled on every change, there’s not much that can be done.

            A conversion process is an opportunity to re-structure the data into a more regular format, thus giving deltas the opportunity to do something. You should get similar results with a more regular binary format, and similar issues with scrambled text.

            1. 4

              Although I’m sure git could have better binary deltas, the main reason for this is that binary formats are very poor at preserving information across changes. When everything gets scrambled on every change, there’s not much that can be done.

              I think that’s not quite true. The problem is in semantics-unaware comparison.

              In text, this is less obvious because there is very little embedded semantics. I work around git’s foibles in text by putting one sentence on each line and so changing a sentence changes a line. Git understands this and works well. I’ve worked on other projects that want to line-wrap text at 80 column boundaries. If you make a one-word correction near the start of a long paragraph and re-wrap it, git doesn’t see this as ‘word changed and rewrap’, it just sees that a load of newline characters have changed. It isn’t aware that the newlines are not semantic markers (if it were, it would strip all single newlines, capture double newlines, and re-wrap on checkout).

              If I have cross references in a text format, I will express them symbolically with a target marker and a reference to that target and so it’s fine to treat them as just a range of bytes. In a binary format, I will have a structured notion of records and cross references will either point to a specific record number or to an entry in an indirection table. Moving the target record causes something else to be updated elsewhere.

              The CODA distributed filesystem dealt with this by having plugins that could extract semantic changes. For an SQLite database, this might be a SQL statement that drops records that have gone away and inserts ones that are added. For something like a Word document, it might be the track-changes record that describes the set of modifications that are applied. These changes could capture higher-level changes and were often smaller than a text or binary diff.
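
              A rough sketch of that kind of plugin for SQLite - toy code only: it assumes a single primary-key column named id, ignores in-place updates, and does no proper SQL quoting:

              ```python
              import sqlite3

              def semantic_diff(old_db, new_db, table, pk="id"):
                  """Emit DELETE/INSERT statements describing how `table` changed
                  between two snapshots (toy sketch of a semantic-diff plugin)."""
                  def rows(path):
                      conn = sqlite3.connect(path)
                      conn.row_factory = sqlite3.Row
                      return {row[pk]: dict(row) for row in conn.execute(f'SELECT * FROM "{table}"')}

                  old, new = rows(old_db), rows(new_db)
                  statements = []
                  for key in old.keys() - new.keys():  # records that have gone away
                      statements.append(f'DELETE FROM "{table}" WHERE {pk} = {key!r};')
                  for key in new.keys() - old.keys():  # records that were added
                      cols = ", ".join(new[key])
                      vals = ", ".join(repr(v) for v in new[key].values())  # not SQL-safe quoting
                      statements.append(f'INSERT INTO "{table}" ({cols}) VALUES ({vals});')
                  return statements

              for stmt in semantic_diff("before.db", "after.db", "blog_entry"):
                  print(stmt)
              ```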

              Some years ago, I had a student build a system that did tree transforms on ASTs to effectively run refactoring tools in reverse: given a before and after snapshot of source code, track what refactoring has been applied. The tool was able to distill multi-thousand-line diffs into short summaries like ‘the foo function was renamed to bar and all callers were updated except the one in mistake.c’ (and, when you saw something like that, you’d know what the next commit was going to be). If git were able to capture things like that, it would handle diffs on source code more densely (though I doubt anyone would care because source code changes are limited by the speed at which people can write code and so tend to be small even if you store a complete copy of the source file per diff). It’s more noticeable in files where there is not a linear relationship between typing speed and data size.

              1. 4

                The problem is in semantics-unaware comparison.

                A VCS cannot stake its storage layer on a boondoggle.

                On the other hand, you can design your binary format with sync-ability in mind.

                In text, this is less obvious because there is very little embedded semantics. I work around git’s foibles in text by putting one sentence on each line and so changing a sentence changes a line. Git understands this and works well.

                Git does not actually care a whit; delta-ification does not deal with lines. The diffs you see in your terminal are generated on the fly; that’s not how git stores things.

                Git’s delta-ification is quite literally copy and paste between objects. A closer analogy would be something like LZ77, except between a base object and a new object instead of having an intra-object window (there is also an intra-object window, because the literal or delta-ified objects are DEFLATE-ed).
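
                To make “copy and paste between objects” concrete, here is a toy delta encoder - greedy and quadratic; Git’s real packfile deltas use the same copy/insert idea but a smarter match search and a compact binary encoding:

                ```python
                def naive_delta(base: bytes, target: bytes, min_copy: int = 4):
                    """Encode `target` as ("copy", offset, length) ops against `base`
                    plus ("insert", literal) ops for bytes the base can't supply."""
                    ops, i, literal = [], 0, bytearray()
                    while i < len(target):
                        best_off, best_len = -1, 0
                        for off in range(len(base)):  # brute-force longest-match search
                            length = 0
                            while (off + length < len(base) and i + length < len(target)
                                   and base[off + length] == target[i + length]):
                                length += 1
                            if length > best_len:
                                best_off, best_len = off, length
                        if best_len >= min_copy:
                            if literal:
                                ops.append(("insert", bytes(literal)))
                                literal = bytearray()
                            ops.append(("copy", best_off, best_len))
                            i += best_len
                        else:
                            literal.append(target[i])
                            i += 1
                    if literal:
                        ops.append(("insert", bytes(literal)))
                    return ops

                # [('copy', 0, 4), ('insert', b'slow'), ('copy', 9, 10)]
                print(naive_delta(b"the quick brown fox", b"the slow brown fox"))
                ```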

                If you make a one-word correction near the start of a long paragraph and re-wrap it, git doesn’t see this as ‘word changed and rewrap’, it just sees that a load of newline characters have changed.

                It sees a bunch of stuff has changed. Not especially newlines. Git doesn’t see newlines as anything more than a byte value.

                It isn’t aware that the newlines are not semantic markers

                Thankfully, since they routinely are.

                (if it were, it would strip all single newlines, capture double newlines, and re-wrap on checkout).

                https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes#filters_a
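
                A rough sketch of what such a clean/smudge filter pair could look like - the rewrap driver name and file glob are made up:

                ```python
                #!/usr/bin/env python3
                # rewrap.py: store one line per paragraph in the repo, re-wrap at 80
                # columns on checkout. Hook it up with something like:
                #   .gitattributes:  *.txt filter=rewrap
                #   git config filter.rewrap.clean  "python3 rewrap.py clean"
                #   git config filter.rewrap.smudge "python3 rewrap.py smudge"
                import re
                import sys
                import textwrap

                def clean(text):
                    """Strip single newlines inside paragraphs; keep blank lines as separators."""
                    paragraphs = re.split(r"\n\s*\n", text)
                    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip()) + "\n"

                def smudge(text):
                    """Re-wrap each stored paragraph at an 80-column boundary."""
                    paragraphs = re.split(r"\n\s*\n", text)
                    return "\n\n".join(textwrap.fill(" ".join(p.split()), width=80)
                                       for p in paragraphs if p.strip()) + "\n"

                if __name__ == "__main__":
                    data = sys.stdin.read()  # Git pipes file contents through stdin/stdout
                    sys.stdout.write(clean(data) if sys.argv[1] == "clean" else smudge(data))
                ```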

                If I have cross references in a text format, I will express them symbolically with a target marker and a reference to that target and so it’s fine to treat them as just a range of bytes. In a binary format, I will have a structured notion of records and cross references will either point to a specific record number or to an entry in an indirection table. Moving the target record causes something else to be updated elsewhere.

                And that is an issue with how the binary format is designed. You could encode records and cross references to limit cascading changes as you do in your text document.

                It’s more noticeable in files where there is not a linear relationship between typing speed and data size.

                It’s more noticeable in files where there is a non-linear relationship between the number of changes and the amount of change. It’s not “typing speed” and “data size” which cause the entirety of a PNG file to get rewritten and scrambled when you change a few pixels at the start of the image.

                Do the same with text, and you’ll have the same issue.

      2. 0

        ah! thanks for that

    2. 4

      Note that when viewing the textual representation, for example in a Git GUI, you might expect partial staging/discarding and merging changes to work as well; that’s not the case.

    3. 4

      It’s essentially a shell command hook that runs prior to diff, which is nice and flexible. Curious: other than the database diffs in the example here, what other binary file formats have folks seen using this in the wild, and what creative examples are possible?

      1. 10

        I’ve used this before to transparently decrypt and diff ansible vault secrets. And I’ve seen it used to diff (text extracted from) PDFs.
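
        For the PDF case, the plumbing can be as small as a wrapper around poppler’s pdftotext - driver name and file names here are illustrative:

        ```python
        #!/usr/bin/env python3
        # pdf_textconv.py: let `git diff` show PDFs as extracted text.
        # Wiring (names are illustrative):
        #   .gitattributes:  *.pdf diff=pdf
        #   git config diff.pdf.textconv "python3 pdf_textconv.py"
        # Git invokes the textconv command with the path of the file to convert.
        import subprocess
        import sys

        pdf_path = sys.argv[1]
        subprocess.run(["pdftotext", pdf_path, "-"], check=True)  # "-" writes text to stdout
        ```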

        1. 3

          Oh blimey, I have a rant about ansible vault, in the form of a few thousand lines of code that I regret every part of. https://dotat.at/prog/regpg/doc/rationale.html

          The short version is that a good secret should have about 100 bits of entropy. There’s no reason a person should ever need to look at a decrypted secret: there’s nothing useful to see, just random bits. If secrets are properly separated from their metadata, the secrets can be kept encrypted and the metadata can be left in the clear. Then there is never any need for transparently (and carelessly) decrypting secrets for trivial reasons.
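
          For scale: generating such a secret is a one-liner with Python’s secrets module (names are illustrative):

          ```python
          import secrets

          # 16 random bytes = 128 bits of entropy, comfortably past the ~100-bit bar.
          # There's nothing here worth looking at once it's generated, so it can stay
          # encrypted at rest while the metadata around it lives in the clear.
          api_token = secrets.token_urlsafe(16)
          db_password = secrets.token_hex(16)
          ```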

        2. 3

          Yep. Other examples include Office files (Word, Excel, etc.).

      2. 2

        Long ago I hooked it up to the ImageMagick compare utility, so changes to image files could be diffed by opening a window with the image, highlighting the changed pixels.

        I switched it off later because it meant that e.g. git log would randomly spawn new windows, but it was pretty handy aside from that annoyance.

        I’ve also hooked it up to [sops](https://github.com/mozilla/sops) to decrypt and diff secrets.

    4. 2
      1. 1

        I believe this technique applies to Postgres as well. You might have to tweak the dump command a bit.

        Also check out @simonw’s comment, they’re using a custom format for DB backups.

    5. 1

      oh cool!!!!!

    6. 1

      this is gold!! nice article!