Never having thought about the difference between an AST and a LST and a parse tree in such detail before, my first idea was to to re-render the AST as text again, with the translations made in the tree. But of course, then you’d lose formatting information. Or you include these details in the tree, but then you’d have the bigger and thus heavier parse tree.
At the end of the article it finally clicked for me:
To perform style-preserving source translation, we replace the significant spans, while leaving the insignificant ones in tact.
This is really neat.
Thanks! It would be neater if you could get two algorithms out of it: as I mentioned, I think a reformatter like clang-format and gofmt is the converse: change the insignificant spans and keep the significant ones.
I didn’t actually implement that yet. But I think that using the LST will work. If anyone has experience, let me know.
gofmt will still keep e.g. blocks delimited by blank lines separate - it’s not quite the same as ‘remove insignificant spans’.
I’ve seen this referred to as a “roundtrippable AST”. You might be interested in related discussion from the GHC team.
Thanks! I added this to my wiki page: https://github.com/oilshell/oil/wiki/Lossless-Syntax-Tree-Pattern
It’s a great write-up but I’m not sure a new term is needed. What’s described… in my little experience… appears to be a concrete, syntax tree with new representation. In CompSci work on parsers, I also often saw them use “annotated” AST’s that had the extra information as nodes of a different type or some extra variables/info attached to regular nodes. This was especially common for comments or preprocessor directives. To me it’s just a CST, though, like I saw with langX-to-langY translators of the 90’s and 2000’s.
An example of those doing annotated AST’s and such:
Yeah that’s a good thread discussing this exact problem.
“The core problem is that comments can appear anyplace that white space can appear. For this reason, they are generally deleted by the lexer. Retaining them isn’t a problem per se, but deciding what AST to attach them to is a problem.”
Note that my solution is different. I don’t attach comments or whitespace to nodes. I just represent the thing as a tree (annotated with span IDs) plus an array of spans.
Also I point out in the article why it’s not a CST. The Dragon Book says:
A parse tree pictorially shows how the start symbol of a grammar derives a string in the language.
That is NOT what the Lossless Syntax Tree does. Examples are given.