I’m writing a small interpreter to explore writing a REPL with all the ‘send this function to the interpreter’ fun that you get in lisps. Languages are much more fun once you have an AST to play with, so I’m trying to get the parser over with quickly.
The steady progress of compiler technology would be interesting here: not only are compilers far more effective at optimising, but modern machines are so much more powerful than e.g. a C64.
Superoptimisation, the technique of brute-forcing short instruction sequences to find the most efficient option, is also sometimes viable.
Having used Ruby a few times now, with its own opinionated approach to “little languages”, my thought is simple: It’s easier to learn a big language than it is to learn two little languages. 🤷‍♂️
Very much this: big languages tend to win usage. They’re also more work to implement, so you’re at less risk of many competing implementations (cf. Scheme).
I’ve been using difftastic occasionally for the past few months and it’s already become an invaluable tool for me when I’m trying to figure out what actually changed in big patches to Rust code. The default diffing algorithm seems to love to latch onto two different single-line {s and decide that that is an unchanged line, which then causes a bunch of unrelated code to get scattered between the actual changes.
For anyone who wants to try it out without mucking with their normal workflow, I’ve set up git config --global alias.difft "-c diff.external=difft diff", which allows me to just run git difft when I want to use difftastic.
The only thing I wish I could do is use it for interactive diffing, like git add -p, because it’s so much better at finding unchanged lines. Though I’m not sure that’s even possible. Anyone know?
There’s no equivalent of git add -p yet, but several users have asked for it. I’m not sure if it’s possible, or to what extent git lets you override how it splits patches.
It is fantastic, but not generic, and not fast.
Why do it at the AST level? I would think it would be possible to produce a diff that’s practically as good at the text level, which would work on any text, and possibly faster.
Based on the first line of the post, it does not appear the author was attempting to solve your problem:
I’ve always wanted a structural diff tool, so I built difftastic.
I’m not sure if you’re attempting to provide constructive criticism. The author said he always wanted to design a more comfortable seat, and you told him he failed to build a faster car.
Also, it does not seem like a well-posed problem – “non-line based diff tool and format for text in general (that could work on words or characters)”
i.e. it’s one of those things where you’re probably imagining something that not only doesn’t exist, but can’t, and then if you actually tried to build it, you would realize that’s not the problem, and you would end up building something else.
I don’t know why a non-line based diff format is so hard to imagine. When you see red and green words in the terminal, you are looking at one form of it: Color escape codes. It works on words and characters, and the human readability is excellent. It’s probably the most universal diff format in terms of support, except it’s just an output format.
I agree, @anordal. I am working on a diff format that I hope will eventually facilitate a common language between diff tools like difftastic, and generic tools for patching and visualization. It’s explicitly not just an output format. The idea is simple but it still has difficult trade-offs. I’d be delighted if you want to help out: https://github.com/svenssonaxel/diff-format
Not the parent, but it seems to me that looking for sub-line diffs is a more poorly specified problem than line-based diffing. Once you allow sub-line diffs, how do you decide whether to prefer a partial-line edit over treating the change as a deleted line? It seems to me that there are a lot of different metrics you could use, with wildly different results.
The two problems are exactly equivalent: find an edit that transforms one sequence into another. Whether that sequence consists of lines, lexer tokens, or characters, you can get several possibilities where selecting the “best” one is difficult, especially since the minimum edit isn’t always the best one in practice.
“Couldn’t I do this with a text based diff?” is a sufficiently common question that I’ve also discussed it in the FAQ: https://github.com/wilfred/difftastic#isnt-this-basically---word-diff---ignore-all-space
People have been building and optimising text-based diffs for decades, so there are plenty of great options for that use case already. I personally like git-diff with the patience algorithm.
I think you’re casting the problem as something akin to sequence alignment, if I’m understanding correctly? My mental model is that, without lines, you treat the two strings you’re diffing as something akin to DNA sequences and you’re trying to find the optimal alignment that minimizes Levenshtein distance?
If not, please elaborate. If so, though, cool—that’s how I used to think about diffing too! I, like OP, was pretty unhappy with my git diffs. But I think the barrier to diffs that happen at the “character level” is algorithmic complexity. I think even the fast heuristic algorithms that do this without “word splitting” are something like O(m * n), where m and n are the lengths of the two strings. DNA databases do speed this up with “word splitting” heuristics, but that just means your algorithmic complexity is O(m * n), where m and n are the numbers of words in each sequence.
In other words, I think doing this at the “line level” vs. doing this at the “AST level” vs. doing this at the “character level” comes down to a time vs. quality tradeoff. You probably can produce amazing diffs without considering lines or ASTs, but it probably takes a while.
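To make the complexity concrete, here’s a minimal sketch (in Python, not taken from any diff tool) of the classic dynamic-programming edit distance that a character-level diff ultimately rests on; the nested loops are where the O(m * n) comes from:
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, O(len(a) * len(b)) time."""
    m, n = len(a), len(b)
    # prev[j] holds the distance between a[:i-1] and b[:j] from the previous row.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete a[i-1]
                          curr[j - 1] + 1,     # insert b[j-1]
                          prev[j - 1] + cost)  # match or substitute
        prev = curr
    return prev[n]

print(edit_distance("kitten", "sitting"))  # 3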
It’s a really wonderful piece of work that you have shared with us, Wilfred, and knowing the thoughts and ideas behind it only makes me appreciate it more.
Thank you.
Rust doesn’t seem to have first class support on Ubuntu… What is the recommended way of installing something like this on Ubuntu LTS?
I’ve had good luck building with Rust via asdf
recommended way
Difftastic seems to use bleeding-edge rust features, so it was easier for me to package for Guix than Debian. If you can get guix on your system I could send you a recipe (or I’ve submitted it for inclusion in the guixrus channel also)
I don’t have a good sense of what rust versions are available on different distros. I’m trying to be conservative, and only increase the rust version when there’s a benefit, but I’m not sure what threshold to use.
How old are the default rust versions on your distro?
I tried this last week! It worked very well but I can’t incorporate it into my workflow because it only supports side-by-side diffs. I find those unreadable.
It supports inline if you set INLINE=y, but this is experimental and a bit buggy, so I’m trying the side by side for now.
Ah, so it does, thanks! It ended up panicking, but it’s good to know this feature might be supported in the future.
You can get most of the way there without needing to actually learn how to parse every language you want to diff, with a combination of whitespace-insensitivity and word-level diffing. The only difference I see in the basic example when run through patdiff instead of difftastic is that patdiff makes a slightly worse decision about the line wrapping (it sees an insertion of yvonne",\n" instead of "yvonne",\n).
Obviously this becomes problematic for whitespace-sensitive languages, since you’re potentially eliding meaningful changes. However, turning that off (-keep-whitespace) makes the output somewhat noisier, but still quite readable. I think word-level diffing is the real key to readable diffs.
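As a rough illustration of what word-level diffing buys you (a hypothetical Python sketch using the standard difflib module, not how patdiff or difftastic actually work), you can tokenise each line into words, whitespace, and punctuation, then diff the token sequences instead of whole lines:
import difflib
import re

def word_diff(old: str, new: str) -> str:
    """Diff two strings at the word level, marking deletions [-...-] and insertions {+...+}."""
    def tokens(s):
        # words, runs of whitespace, and single punctuation characters
        return re.findall(r"\s+|\w+|[^\w\s]", s)
    matcher = difflib.SequenceMatcher(a=tokens(old), b=tokens(new))
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.append("".join(matcher.a[i1:i2]))
        if op in ("delete", "replace"):
            out.append("[-" + "".join(matcher.a[i1:i2]) + "-]")
        if op in ("insert", "replace"):
            out.append("{+" + "".join(matcher.b[j1:j2]) + "+}")
    return "".join(out)

print(word_diff("the quick brown fox", "the quick red fox"))
# the quick [-brown-]{+red+} fox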
Agreed word-level diffing is super important – difftastic actually uses word-level diffing inside comments for exactly this reason.
Word boundaries do depend on syntax though. x+1 is three words in JS and one word in lisp. Whitespace is also significant in string literals: " " (one space) and "  " (two spaces) are different in any language.
I guess I’ve found that getting the word boundaries exactly right doesn’t matter a whole lot. Like, the whitespace-insensitivity I rely on includes inside strings, and I’ve never felt like I’ve missed something I cared about because of it. That said, I haven’t used it on anything as syntactically weird as lisp (punctuation other than _ and maybe - in words sounds pretty weird).
Edit: Oh, I think the reason I assumed difftastic didn’t do word-level diffing was:
If a file has an unrecognised extension, difftastic uses a line-oriented diff.
Maybe I understood that incorrectly?
I find git’s bundled diff-highlight paired with the occasional --color-words to be sufficient for this use case, with the strong benefit of working across all files.
diff-highlight:
[core]
# Requires /usr/local/share/git-core/contrib/diff-highlight in $PATH
# or the full path to diff-highlight.
# See https://github.com/git/git/tree/master/contrib/diff-highlight
pager = diff-highlight | less
--color-words:
git diff --color-words
# or
git log --color-words
That’s definitely a great option, although difftastic will do a similar word-based highlight for files it doesn’t understand too.
--color-words can get confused. For example, it assumes all whitespace is insignificant, which isn’t true in e.g. Python.
Lisps are the only languages that I know of which can really make heavy use of a REPL.
The difference is that there is tight integration between the editor and the REPL in Lisps. When I work with Clojure, my IDE is connected to the running application. I can inspect and reload any code in the application straight from the editor.
IPython is excellent, but it doesn’t quite support incrementally writing a module in the same way as a lisp. E.g. if foo.py contains from bar import func, redefining func in bar is difficult.
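For what it’s worth, a tiny Python sketch of why that’s awkward (foo and bar here are the hypothetical modules from above): reloading bar replaces its contents, but the name that foo bound with from bar import func still points at the old function object, so it has to be fixed up by hand:
import importlib

import bar            # hypothetical module defining func
import foo            # hypothetical module whose foo.py does: from bar import func

# ... edit bar.py so that func behaves differently ...

importlib.reload(bar)        # bar.func now refers to the new definition
print(foo.func is bar.func)  # False: foo still holds the old function object
foo.func = bar.func          # manual fix-up, needed in every importing module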
Julia has a lovely REPL experience.
Smalltalks are arguably REPL environments: you can evaluate code anywhere, and the debug/edit/continue experience is amazing.
With Python you can get a little bit closer using Jupyter, since it blends an editing environment with a REPL more than using IPython directly. I try to remember to use it instead of IPython when I need a REPL. It’s especially nice if you’re doing exploratory data work, because you can draw graphs and charts.
Great post.
There’s a siren song that this time we’ll do it right, and loper-os contains some great criticisms of today’s technology. Bret Victor and Tunes (now resurrected in Houyhnhnm Computing) also offer valuable glimpses of what could be.
That said, having actual projects matters. It’s much easier to discuss the limitations of today’s tool du jour than to build something better. I’m guilty of this too.
I don’t share your view that today’s designs will not be supplanted though:
I do think the computing mainstream benefits from fringe projects too. V8 and the JVM are both mature projects that have lispers contributing: there’s a lot to be learnt from rebuilding or relearning from older and less popular tech.
When you have a single entity that blows the scale of the graph out that much, it would be nice to offer a second graph with everyone else. Or just exclude PHP from the graph entirely and put in a note. But I’m not a visual/designer type.
Hello and interesting post!
I noticed that you didn’t include Dylan, which was one of the first languages to tackle a more complex macro system in infix-syntax code.
Perhaps that was due to reasons that you indicated in your post (but you didn’t name languages, so it is difficult to tell):
I had to cut a number of languages from early drafts of this post because their docs were so poor or I needed extensive knowledge of the grammar or even compiler itself!
Anyway, our macro system is documented in a few places independently:
There’s even a definition of swap in that chapter. As for your examples …
In Dylan, a swap macro is pretty easy:
define macro swap!
{ swap! (?place1:expression, ?place2:expression) }
=>
{ let value = ?place1;
?place1 := ?place2;
?place2 := value; }
end;
Dylan macros are hygienic, so there are no problems with that. This is also pretty similar to syntax-rules from Scheme.
As for each-it, that isn’t hard either, as unlike syntax-rules, Dylan makes it easy to violate hygiene when needed without a lot of ceremony:
define macro each-it
{ each-it (?collection:expression)
?:body
end }
=>
{ for (?=it in ?collection)
?body
end };
end;
This wasn’t hard either … the ?=it just means that instead of being a substitution from the pattern like ?collection and ?body, it is a hygiene violation.
Using it is simple:
each-it (arguments)
format-out("%=\n", it);
end;
That approaches the simplicity of the Common Lisp version of this definition and is (to me) much nicer than the syntax-case definition from Scheme.
Thanks for your kind words!
I’m reluctant to name the languages I struggled with, as it’s hard to draw a line between their complexity and my inexperience. I completely overlooked Dylan, though (to my shame!); it would definitely have been an interesting addition.
Your macro implementations are pretty readable. How does Dylan fare when you need to use code to manually build a parse tree?
Well, our normal macro system isn’t like Common Lisp style macros. It is strictly like syntax-rules from Scheme. That said, some pretty complicated things have been built with the standard Dylan macro system, like the binary-data library.
Inside the compiler however, we have an extended version of this macro system that does allow for quotation and such. I like it a lot and it is really powerful. It is how large parts of the compiler are actually written, including our C-FFI interface. If you check out the D-Expressions paper that I reference above, it deals with this pretty extensively, although some of what is discussed in the paper is not implemented (the parts about the module system).
It is an open project for someone to make it so that a project can have a compiler plugin that uses this macro system extension. I don’t know that it would be incredibly difficult. It would probably be challenging, and would almost certainly be enjoyable.
Inside the compiler, these special compiler macros are defined by define &macro rather than define macro. From there, they even start out looking like a regular macro; however, instead of the pattern match being used (effectively) for template substitution, it executes code which returns AST fragments. The compiler builds upon this and has &converter and other forms which just expand to &macro definitions and are key to how things go from the parsed tokens to the actual compiler IR, eventually.
That would all be quite tedious if one still had to construct AST stuff by hand, but thankfully it works with quoting as well, using #{ ... } and allowing for substitution. Inside the compiler, one can use that to construct AST fragments. Unfortunately, these tend to be used in fairly complex areas, so I don’t have a really nice simple example.
define &macro c-address-definer
  { define c-address ?var-name:name :: ?pointer-designator:expression
      ?options:*;
    end }
  =>
  begin
    let initargs = parse-options($c-address-options, options, var-name);
    let options = apply(make, <c-address-options-descriptor>, initargs);
    let c-name = c-address-c-name(options);
    if (c-name)
      let import = c-address-import(options) | #{ #f };
      #{ define constant ?var-name
           = make(check-c-address-designator(?var-name, ?pointer-designator),
                  address: primitive-wrap-machine-word
                             (primitive-cast-pointer-as-raw
                               (%c-variable-pointer(?c-name, ?import)))) };
    else
      note(<missing-c-name>,
           source-location: fragment-source-location(form),
           definition-name: var-name);
      #{ };
    end if;
  end;
end &macro;
That one isn’t too difficult though, and comes from our C-FFI. You can see that it defines a pattern to be matched, define c-address ...., and then it executes some code to parse the options, does some basic validation, and then checks to be sure that a c-name has been provided in the options. If it has, it returns the quoted AST (define constant ?var-name ....). If it hasn’t gotten a c-name option, then it issues a compiler warning (note(<missing-c-name>, ...)) and returns an empty AST fragment (#{ }). That wasn’t too terrible! :)
Great article. It’s nice seeing articles about the core of Emacs design that discuss more than “everything is a buffer” (which is true, but less interesting).
Emacs’ text representation is interesting. It’s a gap buffer, not a rope – this simplifies the implementation a bunch, especially for the regexp engine searching the currently open file.
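In case it helps, here’s a minimal gap buffer sketch (in Python, purely illustrative; Emacs’ real implementation is in C and also handles markers, text properties, and multibyte text). The text is stored as two runs separated by a movable gap at point, so insertion at point is cheap and only moving point pays to shuffle characters:
class GapBuffer:
    """Text stored as [left of gap][gap][right of gap]; edits happen at the gap."""
    def __init__(self, text="", gap_size=16):
        self.buf = list(text) + [None] * gap_size
        self.gap_start = len(text)   # the cursor ("point") sits at gap_start
        self.gap_end = len(self.buf)

    def move_gap(self, pos):
        """Move the gap so that insertions happen at character index pos."""
        while self.gap_start > pos:   # shift characters rightward across the gap
            self.gap_start -= 1
            self.gap_end -= 1
            self.buf[self.gap_end] = self.buf[self.gap_start]
        while self.gap_start < pos:   # shift characters leftward across the gap
            self.buf[self.gap_start] = self.buf[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def insert(self, ch):
        if self.gap_start == self.gap_end:               # gap exhausted: grow it
            self.buf[self.gap_end:self.gap_end] = [None] * 16
            self.gap_end += 16
        self.buf[self.gap_start] = ch
        self.gap_start += 1

    def text(self):
        return "".join(self.buf[:self.gap_start] + self.buf[self.gap_end:])

b = GapBuffer("hello world")
b.move_gap(5)
b.insert(",")
print(b.text())   # hello, world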
Attributes (essentially arbitrary properties associated with spans of text) compose pretty well, but it’s always surprising when you copy text in Emacs and it sometimes preserves the highlighting of the source file.
The hard bit about text is representing different character encodings. Emacs will allow you to open e.g. an invalid UTF-8 file and fix it. The ‘multibyte’ representation is robust and versatile, but it definitely complicates the core.