Nice tutorial, it’s great to see all these ideas summarized! I implemented pretty much exactly this parser and syntax tree design for Typst, after studying rust-analyzer’s implementation. It works great.
I appreciate this:
It is a subset of Rust, which has just enough features to make some syntax mistakes.
There are so many parsing tutorials that either try to be a full language - resulting in a huge amount of code - or have such a small grammar that the things that make parsing difficult never arise.
The article is about handling errors, so it quite reasonably focuses on the error handling/recovery options available in LL parsers, but for production parsing there's an additional advantage: LL parsers are significantly faster than LALR or LR parsers, often by whole-number multiples. LL parsers can also be designed to more easily handle mode switching without duplicating code or grammar rules.
Many years ago I was teaching the course at my university that encompassed parsing (“concurrency and programming languages”???), and it was super irksome that all the parsing theory taught there revolved around a specific set of definitions for language grammars and recursive descent parsing that precluded all the techniques used by production parsers. As a result, many people end up believing they need to use parser generators because hand-writing an LALR+ parser for any real language isn't really possible, and as a follow-on they never learn how to actually write Real Parsers(tm).
To me this is an issue because in practice there are very few cases where you should be implementing a parser and do not care about error handling or performance. The most common cases where people want to parse something are things like configuration files, and those are things for which there are numerous existing languages, standards, and implementations (innumerable JSON parsers, xml/binary plists are endemic to Darwin, as much as I dislike it, TOML, etc). For cases where you are parsing actual real programming languages, you want to provide useful errors and support partial parsing, and performance matters once you have any significant volume of code. The first couple of points argue for recursive descent in new/not-yet-popular languages immediately, and the last becomes increasingly important as the volume of code increases (JS parsers handle so much code that they give up the first points in favor of the latter).
As someone who has only written non-production parsers, I’m surprised packrat parsers aren’t more popular. O(n) time and space complexity (with respect to codebase size - space grows with the square of the number of rules), error recovery is straightforward, and you can parse any PEG.
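For reference, the memoization at the heart of packrat parsing can be sketched in a few lines (a hypothetical helper and memo layout, not any particular library):

```rust
// Hypothetical sketch: memoize each (rule, position) pair so that, for a fixed
// grammar, every rule is attempted at most once per input position.
use std::collections::HashMap;

type Pos = usize;
// Memo entry: did this rule match at this position, and if so, where did it end?
type Memo = HashMap<(&'static str, Pos), Option<Pos>>;

fn memoized(
    memo: &mut Memo,
    rule: &'static str,
    pos: Pos,
    parse: impl FnOnce(&mut Memo, Pos) -> Option<Pos>,
) -> Option<Pos> {
    if let Some(&cached) = memo.get(&(rule, pos)) {
        return cached; // reuse the earlier result instead of re-parsing
    }
    let result = parse(&mut *memo, pos);
    memo.insert((rule, pos), result);
    result
}
```

Every (rule, position) pair is computed at most once, which is where both the linear time bound and the memory overhead come from.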
I went down this rabbit hole a few years ago, and it seems like GLL/GLR parsers (mentioned at the top of this article) are the natural evolution of packrat parsers, and are seen more often in the wild these days. To hand-wave, GLL memoizes (index, result) parse outputs on the grammar as it goes, which allows it to parse left-recursive grammars in addition to the ones packrat can do. (I have a long walk-through on the packrattle site if you’re curious.)
I don’t think packrat or GLL are very good at error recovery, though – when they fail, they tend to leave behind no clue about what the simplest “fix” would be. That’s been the subject of a lot of recent papers/blogs, and also a big motivator for writing treesitter & friends, as far as I can tell.
CPython and Zig are both using PEG (with lookahead, and maybe backtracking), but not Packrat, as far as I know.
In practice, it seems you don’t need Packrat – it bounds the worst case, but the common case is fast. I read some papers exploring this ~10 years ago, I think by Grimm.
Related to this post – I would say IDEs don’t use Packrat parsing because there’s no known or straightforward way to make it “resilient”, which you can do with hand-written techniques.
So, there are two things here:
Memoization indeed isn’t too useful(*), as it adds a lot of complexity, some overhead, and helps only with the worst case. But for the worst case you want:
manually handle realistic cases so that they are linear (eg, I remember writing some code to handle (expr) vs (expr, )).
for unrealistic cases, add something like fuel. For example, for an IDE it would be reasonable to bound the nesting depth, to protect code up the stack from stack overflows.
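A rough hypothetical sketch of the depth-bounding idea (made-up names, not rust-analyzer's actual mechanism):

```rust
// Hypothetical sketch: thread a depth budget through the recursive rules, so
// absurdly nested input produces an error node instead of blowing the stack.
struct Parser {
    depth_left: u32,
    // ... tokens, position, output events, etc. elided
}

impl Parser {
    fn expr(&mut self) {
        if self.depth_left == 0 {
            self.error("expression is nested too deeply");
            return; // give up on this subtree, but keep the rest of the parse alive
        }
        self.depth_left -= 1;
        // ... parse an atom, then operators, recursing back into expr() ...
        self.depth_left += 1;
    }

    fn error(&mut self, _message: &str) {
        // ... record a diagnostic / emit an error node ...
    }
}
```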
But backtracking absolutely is used by IDEs, and is a fairly straightforward extension here. To make backtracking resilient, you only need the concept of commit points. It could look like this:
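A hypothetical sketch of the shape of the idea, assuming a Parser in the article's style plus made-up checkpoint/rewind/commit methods:

```rust
// Hypothetical sketch (made-up Parser methods): backtrack freely until the
// rule reaches its commit point, and only error-recover after that.
fn let_stmt(p: &mut Parser) -> bool {
    let checkpoint = p.checkpoint();   // remember the position and event count
    let m = p.open();
    if !p.eat(LetKw) {
        p.rewind(checkpoint);          // not a let-statement after all: backtrack
        return false;
    }
    p.commit();                        // saw `let`: from here on, never backtrack
    p.expect(Ident);                   // failures now produce error nodes instead
    p.expect(Eq);
    expr(p);
    p.expect(Semi);
    p.close(m, LetStmt);
    true
}
```

IIRC Grammar-Kit's pin attribute expresses essentially the same idea declaratively.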
The semantics here is that, once the parser gets to a commit point of a node, it won’t backtrack.
And it’s totally possible to use a parser generator that does “PEG with backtracking sans memoization”, and that’s exactly what IntelliJ is using for many languages:
https://github.com/JetBrains/Grammar-Kit
So, it’s not really about generated-vs-hand-written, it’s rather LR-vs-LL.
(*) memoization isn’t too useful for a single language. If you want to parse whatever, and allow anyone to write grammars, going full GLR makes a lot of sense, as proved by Tree-sitter.
As a result many people end up believing they need to use parser generators because hand writing an LALR+ parser for any real language isn’t really possible, and as a follow on they don’t actually learn how to actually write Real Parsers(tm).
Yeah to me it seems like all the “real languages” that have come out in the last 10 years have narrowed / converged with respect to syntax (but not semantics). If you look at all of them - e.g. the languages mentioned in this recent thread: https://lobste.rs/s/m9ismq/what_is_last_programming_language_you - then it feels to me like less syntactic diversity, which I think is generally good.
(I would put Prolog/Erlang, Haskell/OCaml, Forth, Lisp as outside that list, and claim those language families are not as active recently. )
If that’s true, then there may ALSO be a narrowing of parsing techniques – both for resilient parsers and strict parsers.
Again I think that’s mostly good, but it does mean that university compiler courses may want to change a few things.
I’ve also noted that both Python and JavaScript grew shell-style string interpolation in the last 10 years, and Swift has it too. So you need a way to handle that cleanly (e.g. “lexer modes” in Oils)
I also wonder how that interacts with resilient LL parsing, if you want to complete expressions within string literals. Seems like it should be fine since both techniques are top-down. Actually it’s an argument for using a single lexer in the lexer-modes style, because the other way of doing it means you look for the closing quote and re-parse, which seems worse for resilience. (CPython started this way, but is changing it AFAIK. So the state of the art isn’t quite settled.)
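A minimal hypothetical sketch of the lexer-modes idea (not the Oils implementation): a single lexer with a mode stack, so the pieces inside ${...} come out as ordinary expression tokens in one pass:

```rust
// Hypothetical sketch of "lexer modes": one lexer, a mode stack, and string
// interpolation lexed as normal expression tokens in a single pass.
#[derive(Debug)]
enum Token {
    Quote,            // opening/closing "
    StrChunk(String), // literal text inside a string
    InterpStart,      // ${
    InterpEnd,        // }
    Ident(String),    // a code-mode token (only identifiers in this toy)
}

fn lex(src: &str) -> Vec<Token> {
    #[derive(Clone, Copy)]
    enum Mode { Code, Str }
    let mut modes = vec![Mode::Code]; // a stack, so interpolations can nest
    let mut out = Vec::new();
    let mut buf = String::new();      // current literal chunk in string mode
    let mut chars = src.chars().peekable();
    let flush = |buf: &mut String, out: &mut Vec<Token>| {
        if !buf.is_empty() {
            out.push(Token::StrChunk(std::mem::take(buf)));
        }
    };
    while let Some(c) = chars.next() {
        match modes.last().copied().unwrap() {
            Mode::Str => match c {
                '"' => { flush(&mut buf, &mut out); out.push(Token::Quote); modes.pop(); }
                '$' if chars.peek() == Some(&'{') => {
                    chars.next();
                    flush(&mut buf, &mut out);
                    out.push(Token::InterpStart);
                    modes.push(Mode::Code); // lex the interpolation as ordinary code
                }
                _ => buf.push(c),
            },
            Mode::Code => match c {
                '"' => { out.push(Token::Quote); modes.push(Mode::Str); }
                '}' if modes.len() > 1 => { out.push(Token::InterpEnd); modes.pop(); }
                c if c.is_alphabetic() => {
                    let mut ident = c.to_string();
                    while matches!(chars.peek(), Some(ch) if ch.is_alphanumeric()) {
                        ident.push(chars.next().unwrap());
                    }
                    out.push(Token::Ident(ident));
                }
                _ => {} // everything else ignored in this toy sketch
            },
        }
    }
    out
}

fn main() {
    // One pass over the source, no re-lexing of the string body.
    println!("{:?}", lex(r#"greet("hello ${name}!")"#));
}
```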
My “Real Parsers(tm)” quip was meant as a tongue-in-cheek joke, since what counts as a “real parser” or “real language” is clearly subjective :D
have come out in the last 10 years have narrowed / converged with respect to syntax
I’m not sure it’s good or bad. It’s more a by-product of how you make a language popular, given the wonders of how humans work. It doesn’t matter how much better your language is: people are more likely to try it, and will have less difficulty adopting it, if it’s familiar. Because bracey languages basically already won, new languages that aren’t bracey are at a disadvantage next to every bracey language. As a result all new languages tend towards braceyness and the cycle continues.
I would put Prolog/Erlang, Haskell/OCaml, Forth, Lisp as outside that list, and claim those language families are not as active recently.
Syntactically I would agree, but you have to ask what a language is. The bracey “trait” languages have type systems that are largely identical to Haskell/OCaml (traits are a near 1:1 with Haskell type classes, even in the actual implementation). You could make an argument that Rust or Swift are a bracification of functional languages - not the strongest argument, but it really depends on what you decide are the defining characteristics of the language.
If that’s true, then there may ALSO be a narrowing of parsing techniques – both for resilient parsers and strict parsers.
This has been the case for years - recursive descent is what production parsers use if they care about error handling or performance.
Again I think that’s mostly good, but it does mean that university compiler courses may want to change a few things.
Absolutely agree. The problem is that university CS departments are trying to teach parsing theory (LL < LALR < LR < GLR, and similar for varying lookaheads). This view of parsing doesn’t cleanly apply to the structure of LL(*) (or even LL(infinity), which may be different?) parsers, which makes teaching about the relative strength of grammars harder, and also leads to “why would we do this other parsing type if it is slower and more complicated?”, which is certainly a challenge. My experience teaching led to the conclusion that parsing provided a good tool to help explain PDAs more effectively, as it provides the answer to “why would you do this”, so it is useful.
Of course I don’t think it’s necessarily useful for university-level teaching to go into the practical details of real-world implementation, as the trade-offs and decisions become much more specific to the environment, grammar, language, etc. than any general course could teach.
I’ve also noted that both Python and JavaScript grew shell-style string interpolation in the last 10 years, and Swift has it too.
I can’t speak for Python, but in JS you lex the string normally, then break it up into components. If the splits are invalid (e.g. unterminated) you report the literal as broken; otherwise you’re just parsing expressions, and the failure mode is no different from parsing any other expression.
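A tiny hypothetical sketch of that split-after-lexing approach (made-up names, ignoring nested braces and escapes):

```rust
// Hypothetical sketch: the whole template body arrives as one token, and the
// ${...} holes are carved out afterwards. Nested braces/escapes are ignored here.
#[derive(Debug)]
enum Part {
    Text(String),
    Expr(String), // raw source of a ${...} hole, parsed as an expression later
}

fn split_template(body: &str) -> Result<Vec<Part>, &'static str> {
    let mut parts = Vec::new();
    let mut rest = body;
    while let Some(start) = rest.find("${") {
        parts.push(Part::Text(rest[..start].to_string()));
        let after = &rest[start + 2..];
        let end = after.find('}').ok_or("unterminated ${...}: report the literal as broken")?;
        parts.push(Part::Expr(after[..end].to_string()));
        rest = &after[end + 1..];
    }
    parts.push(Part::Text(rest.to_string()));
    Ok(parts)
}

fn main() {
    println!("{:?}", split_template("hello ${name}, you have ${n} messages"));
    println!("{:?}", split_template("broken ${oops"));
}
```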
It’s worth pointing out that Forth and Lisp languages tend to have simple parsers. A basic strict S-expression parser is less than 100 lines of RPython. Block-oriented ALGOL descendants are quite complex in comparison!
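For a sense of scale, a hypothetical sketch of a strict S-expression reader (in Rust here rather than RPython):

```rust
// Hypothetical sketch: a strict S-expression reader; the whole "grammar" is
// atoms and parenthesized lists.
#[derive(Debug)]
enum Sexp {
    Atom(String),
    List(Vec<Sexp>),
}

fn tokenize(src: &str) -> Vec<String> {
    src.replace('(', " ( ")
        .replace(')', " ) ")
        .split_whitespace()
        .map(str::to_string)
        .collect()
}

fn parse(tokens: &[String], pos: &mut usize) -> Result<Sexp, String> {
    match tokens.get(*pos) {
        None => Err("unexpected end of input".to_string()),
        Some(t) if t == ")" => Err("unexpected )".to_string()),
        Some(t) if t == "(" => {
            *pos += 1;
            let mut items = Vec::new();
            while tokens.get(*pos).map(String::as_str) != Some(")") {
                if *pos >= tokens.len() {
                    return Err("missing )".to_string());
                }
                items.push(parse(tokens, pos)?);
            }
            *pos += 1; // consume ")"
            Ok(Sexp::List(items))
        }
        Some(t) => {
            *pos += 1;
            Ok(Sexp::Atom(t.clone()))
        }
    }
}

fn main() {
    let tokens = tokenize("(define (square x) (* x x))");
    let mut pos = 0;
    println!("{:#?}", parse(&tokens, &mut pos));
}
```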
Hm nice to see this all put together in one place. It makes sense, and has a nice structure.
The nerd in me wants to ask how to generate the first/follow sets from some declarative grammar-ish spec, and even the whole parser, but maybe there are diminishing returns on that?
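On the first part of that question: FIRST sets at least fall out of a small fixed-point computation over the grammar. A hypothetical sketch (toy encoding: nonterminals are the map keys, an empty production stands for epsilon, and "" marks epsilon in the output):

```rust
// Hypothetical sketch: compute FIRST sets from a declarative grammar by
// iterating to a fixed point. "" stands for epsilon in the resulting sets.
use std::collections::{HashMap, HashSet};

// nonterminal -> productions, each production a sequence of symbols.
type Grammar = HashMap<&'static str, Vec<Vec<&'static str>>>;

fn first_sets(grammar: &Grammar) -> HashMap<&'static str, HashSet<&'static str>> {
    let mut first: HashMap<&'static str, HashSet<&'static str>> =
        grammar.keys().map(|nt| (*nt, HashSet::new())).collect();
    let mut changed = true;
    while changed {
        changed = false;
        for (nt, productions) in grammar {
            for production in productions {
                let mut add: HashSet<&'static str> = HashSet::new();
                let mut nullable_prefix = true;
                for symbol in production {
                    if grammar.contains_key(symbol) {
                        // Nonterminal: its FIRST set (minus epsilon) can start us off.
                        add.extend(first[symbol].iter().copied().filter(|s| !s.is_empty()));
                        if !first[symbol].contains("") {
                            nullable_prefix = false;
                            break;
                        }
                    } else {
                        // Terminal: it is the first token of this production.
                        add.insert(*symbol);
                        nullable_prefix = false;
                        break;
                    }
                }
                if nullable_prefix {
                    add.insert(""); // the whole production can derive epsilon
                }
                let set = first.get_mut(nt).unwrap();
                let before = set.len();
                set.extend(add);
                changed |= set.len() != before;
            }
        }
    }
    first
}

fn main() {
    let mut g: Grammar = HashMap::new();
    // Expr  -> Term Expr'
    // Expr' -> "+" Term Expr' | epsilon
    // Term  -> "num" | "(" Expr ")"
    g.insert("Expr", vec![vec!["Term", "Expr'"]]);
    g.insert("Expr'", vec![vec!["+", "Term", "Expr'"], vec![]]);
    g.insert("Term", vec![vec!["num"], vec!["(", "Expr", ")"]]);
    println!("{:#?}", first_sets(&g));
}
```

FOLLOW sets come from a similar iteration, and from there generating the rest of an LL parser is largely mechanical, which is what the parser generators discussed above do.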
I also wonder how closely it matches the approach of IDEs like IntelliJ and Microsoft Visual Studio and VS Code? Do they have names for their approaches?
Or is it more like a bunch of conventions documented somewhere in the code (e.g. the red-green trees IIRC) ?
So this post essentially describes how IJ’s PsiBuilder works:
https://github.com/JetBrains/intellij-community/blob/idea/231.8109.175/platform/core-api/src/com/intellij/lang/PsiBuilder.java
I actually haven’t looked at those (or, rather, I looked very closely at Roslyn for syntax tree structure, but not for the parser). Also, it seems to me that TypeScript resilience is not great.
Another interesting case would be Swift. IIRC, they rewrote the parser in Swift from scratch for libsyntax. And before that, they did a fun thing where the C++ parser produced both AST and CST at the same time: parsing functions included the open/close calls, but also returned structs.
Hm you said “post” but linked a Java file? Is that the right link?
I mean that the “Resilient LL Parsing” article describes how PsiBuilder works. It’s not exactly that, but is very close (eg, mark and precede stuff).
FWIW the resilient stuff is interesting to me, though it feels like shell falls slightly outside of the normal usage.
I introduced a mechanism called a “trail” that is passed through the mutually recursive parsers in OSH. The trail collects information even when the parser fails, which is indicated by a Python/C++ exception.
So basically this is an incomplete shell fragment:
for i in 1 2; do ls /dir/$i /dir/$(hostname -z<TAB>
But the caller will be able to see that the last 2 words to complete are hostname -z, with the “trail”. It can work through any number of nested parsers.
This seems to work pretty well and is definitely more principled than what other shells do. Other shells are often “wrong”, although there is no strict spec for what the right behavior is.
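Roughly, the trail idea might be sketched like this (a purely hypothetical toy, not the actual OSH implementation): the trail lives outside the parse result, so it survives the failure.

```rust
// Hypothetical toy sketch (not the actual OSH code): the trail is a side
// channel that keeps the words seen so far, even if the parser then fails.
#[derive(Default, Debug)]
struct Trail {
    words: Vec<String>, // most recently parsed words, innermost parser last
}

struct Parser<'a> {
    tokens: Vec<&'a str>,
    pos: usize,
    trail: Trail,
}

impl<'a> Parser<'a> {
    fn word(&mut self) -> Result<(), &'static str> {
        match self.tokens.get(self.pos) {
            Some(w) => {
                self.trail.words.push(w.to_string()); // recorded before anything can fail
                self.pos += 1;
                Ok(())
            }
            None => Err("unexpected end of input"),
        }
    }

    fn simple_command(&mut self) -> Result<(), &'static str> {
        self.word()?; // command name
        while self.pos < self.tokens.len() && self.tokens[self.pos] != ")" {
            self.word()?; // arguments
        }
        if self.tokens.get(self.pos) == Some(&")") {
            self.pos += 1;
            Ok(())
        } else {
            Err("unterminated command substitution") // the fragment ended mid-command
        }
    }
}

fn main() {
    // Roughly the tail of `$(hostname -z<TAB>`: the `)` is missing.
    let mut p = Parser { tokens: vec!["hostname", "-z"], pos: 0, trail: Trail::default() };
    let result = p.simple_command();
    // The parse failed, but the completion code can still see the last words.
    println!("{result:?}, trail = {:?}", p.trail.words);
}
```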