1. 4

Anyone have a good algorithms for minimum 2-dimensional diff? Input is 2 trimmed and clean CSVs (files don’t end in \n; lines don’t end in ,; no quotes or escaping). (one note: There often is a varying number of columns in rows)

Example, given the 2 inputs below, write efficient diff and patch methods and a nice diff encoding.

rank,color
1,blue
2,red
3,green
rank,color
1,blue
2,orange
3,purple

A nice diff format would also be helpful. Can’t seem to find much on this yet, but there must be some stuff out there in the Spreadsheet/Finance world.

  1.  

  2. 7

    Here’s a fun, really not quite what you asked for, but quite effective one…..

    Convert each cell via some algorithm into an RGB or HSV colour. Must only depend on the contents of the cell not the row/column index.

    Convert each file into an image, let’s call them image A and image B. The simplest tool is to use something like a pnm (portable anymap format).

    Now flick rapidly between image A and image B.

    Usually the difference is obvious.

    Our eyes are bloody good pattern matches / diff tools.

    1. 2

      This does sound like a fun one! I’ll have to try it. Could be good for minimaps.

    2. 3

      A regular git diff patch file would work at the minimum, so that’s a good start. Anything better, or different in any other way, would probably try to find smaller diffs. That may or may not make the diff output more readable…

      “nice” is subjective. Are you looking for a plaintext-ish format, or a visual presentation that may have something else (such as XML, cursèd be its name) backing it?

      1. 1

        Yup, regular line based ones are a good baseline.

        By “nice” I mean more about clever semantics.

        Maybe something like: insertColumn at row 2 col 1 values orange purple So yeah, I’ll end up with a clean plain text format, but looking for some neat semantic ideas. I gotta imagine a number of people have written papers on this sort of thing, since spreadsheets are so ubiquitous.

        1. 2

          I bet there are a lot of different metrics for this kind of thing. Probably the smallest diff would be from just compressing the patch file using, e.g., gzip. If you’re looking to do the cleverness manually, I’d probably try to outline the minimal number of blocks (rectangles) that cover only stuff that changed; then store their position, size, and the stuff that changed as a linear array.

          1. 2

            Good point about identifying what metrics I’m trying to optimize. I think it’s about having the minimum number of understandable steps (over the wire that can be gzipped, but I’m more interested in the high level format).

            I like the rectangles idea. It seems like maybe a recursive/dynamic rectangle approach might work well.

            Thanks!

            1. 2

              Looks like the type of problem you’d have to solve in the general case (at least for 2D data) is a rectilinear polygon covering using rectangles—multiple times, for non-contiguous changes—which happens to be NP-hard. Seems like something that’s not worth optimizing all the way, but a similar probabilistic approach might be worth it!

              1. 2

                “Polygon covering”! Why didn’t I think of that ;). So helpful, thank you. Exactly the sort of research I was looking for.

      2. 2

        I took a crack at this problem a few years ago for a side project: https://www.nicediff.com/

        My approach was similar to derek-jones’ comment, except transposed. Row diff first, then column diff.

        Example result: https://www.nicediff.com/view/327cb8bd638fff4bb30677aa957f0e01fda5e2fc

        1. 2

          Love it! Some really nice ideas in here. Check back with me in a few weeks and maybe we’ll have something to show.

        2. 2

          diff a column at a time and merge results to give a row by row diff?

          1. 2

            Here are 2 recs from elsewhere:

            “Paul Fitzpatrick wrote a program called daff that takes two csv files as input and produces a csv diff output. https://github.com/paulfitz/daff

            Specification for Tabular Diff Format https://specs.frictionlessdata.io/tabular-diff/