1. 106

  2. 6

    It’s a really wonderful piece of work that you have shared with us, Wilfred, and knowing the thoughts and ideas behind it only makes me appreciate it more.

    Thank you.

    1. 4

      Thanks for the kind words :)

    2. 1

      I’ve been using difftastic occasionally for the past few months and it’s already become an invaluable tool for me when I’m trying to figure out what actually changed in big patches to Rust code. The default diffing algorithm seems to love to latch onto two different single-line {s and decide that they are an unchanged line, which then causes a bunch of unrelated code to get scattered between the actual changes.

      For anyone who wants to try it out without mucking with their normal workflow, I’ve set up git config --global alias.difft "-c diff.external=difft diff", which allows me to just run git difft when I want to use difftastic.

      The only thing I wish I could get it to do is be used by interactive diffing, like git add -p, because it’s so much better at finding unchanged lines. Though, I’m not sure that’s even possible. Anyone know?

      1. 2

        There’s no equivalent of git add -p yet, but several users have asked for it. I’m not sure if it’s possible yet, or to what extent git lets you override how it splits patches.

      2. 0

        It is fantastic, but not generic, and not fast.

        Why do it at the AST level? I would think it would be possible to produce a diff that is practically as good at the text level. That would work on any text, and possibly faster.

        1. 13

          Based on the first line of the post, it does not appear the author was attempting to solve your problem:

          “I’ve always wanted a structural diff tool, so I built difftastic.”

          I’m not sure if you’re attempting to provide constructive criticism. The author said he always wanted to design a more comfortable seat, and you told him he failed to build a faster car.

          1. 4

            Also, it does not seem like a well-posed problem – “non-line based diff tool and format for text in general (that could work on words or characters)”

            I.e., it’s one of those things where you’re probably imagining something that not only doesn’t exist but can’t, and then if you actually tried to build it, you would realize that’s not the problem, and you would end up building something else.

            1. 2

              I don’t know why a non-line based diff format is so hard to imagine. When you see red and green words in the terminal, you are looking at one form of it: Color escape codes. It works on words and characters, and the human readability is excellent. It’s probably the most universal diff format in terms of support, except it’s just an output format.

              1. 2

                I agree, @anordal. I am working on a diff format that I hope will eventually facilitate a common language between diff tools like difftastic, and generic tools for patching and visualization. It’s explicitly not just an output format. The idea is simple but it still has difficult trade-offs. I’d be delighted if you want to help out: https://github.com/svenssonaxel/diff-format

                1. 1

                  Not the parent, but it seems to me that looking for sub-line diffs is more poorly specified than looking for line-based ones. Once you allow sub-line diffs, the question becomes how you decide whether to prefer a partial line edit over treating the change as deleting a line. It seems to me that there are a lot of different metrics you could use, with wildly different results.

                  1. 3

                    The two problems are exactly equivalent: Find an edit that transforms one sequence to another. Whether that sequence consists of lines, lexer tokens or characters, you can get several possibilities where selecting the “best” one is difficult, especially since the minimum edit isn’t always the best one in practice.

              2. 3

                “Couldn’t I do this with a text based diff?” is a sufficiently common question that I’ve also discussed it in the FAQ: https://github.com/wilfred/difftastic#isnt-this-basically---word-diff---ignore-all-space

                People have been building and optimising text-based diffs for decades, so there are plenty of great options for that use case already. I personally like git-diff with the patience algorithm.

              3. 3

                I think you’re casting the problem as something akin to sequence alignment, if I’m understanding correctly? My mental model is that, without lines, you treat the two strings you’re diffing as something akin to DNA sequences and you’re trying to find the optimal alignment that minimizes Levenshtein distance?

                If not, please elaborate. If so, though, cool—that’s how I used to think about diffing too! I, like OP, was pretty unhappy with my git diffs. But I think the barrier to diffs that happen at the “character level” is the algorithmic complexity. I think even the fast heuristic algorithms that do this without “word splitting” are something like O(m * n), where m and n are the lengths of the two strings. DNA databases do speed this up with “word splitting” heuristics, but that just means your algorithmic complexity is O(m * n), where m and n are the numbers of words.

                In other words, I think doing this at the “line level” vs. doing this at the “AST level” vs. doing this at the “character level” comes down to a time vs. quality tradeoff. You probably can produce amazing diffs without considering lines or ASTs, but it probably takes a while.
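
To make the granularity tradeoff discussed in this thread concrete, here is a minimal Python sketch. It is not how difftastic or git works: it is the textbook Levenshtein dynamic program, applied unchanged to lines, words, or characters. The names `edit_distance`, `tokenize`, and `compare` are made up for illustration, and the "cells" figure is just the size of the m × n table the algorithm fills in.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of comparable tokens.

    Fills a (len(a) + 1) x (len(b) + 1) table one row at a time, so the
    running time is proportional to len(a) * len(b).
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning the first i tokens of a into nothing
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # keep x, or substitute y for it
            ))
        prev = curr
    return prev[-1]


def tokenize(text, level):
    """Split text into lines, whitespace-separated words, or characters."""
    if level == "line":
        return text.splitlines()
    if level == "word":
        return text.split()
    if level == "char":
        return list(text)
    raise ValueError(f"unknown level: {level}")


def compare(old, new, level):
    """Return (edit distance, number of table cells) at the given level."""
    a, b = tokenize(old, level), tokenize(new, level)
    return edit_distance(a, b), len(a) * len(b)


if __name__ == "__main__":
    old = "fn main() {\n    let x = 1;\n}\n"
    new = "fn main() {\n    let x = 2;\n}\n"
    for level in ("line", "word", "char"):
        distance, cells = compare(old, new, level)
        print(f"{level}: distance={distance}, cells={cells}")
```

The same function handles all three granularities, which is the "exactly equivalent" point made above; only the length of the sequences, and therefore the size of the table, changes. Real diff tools use smarter algorithms and heuristics than this full-table version, so treat the cell counts as a rough picture of the growth, not a benchmark.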