1. 4

    Excellent article! I agree with all points made, and wrote a response at https://rust-analyzer.github.io//blog/2020/09/16/challeging-LR-parsing.html.

    There’s one specific criticism I have though, which I’d rather submit as a comment. Note that I’ll only criticize this one thing, simply because I am silently agreeing with everything else :-)

    I think the article strawmans recursive descent parsing a bit. Or, rather, it is 100% precise and technically correct about pure recursive descent, but in practice people don’t use pure recursive descent. A specific claim I disagree with is this one:

    lack of left recursion makes expressing many standard programming language constructs as awkward as with recursive descent parsers.

    I claim that handling left-recursion with a hand-written parser is easy. Specifically, the code you write to parse xs := xs x | x is this:

    def parse_xs():
        # xs := xs x | x, handled with a loop instead of left recursion
        while not eof:
            parse_x()
    

    Loops are the crucial difference between pure recursive descent as described in the literature, and recursive descent as implemented in practice.

    I think the article would benefit from mentioning, even if only in passing, that by adding loops and Pratt parsing, hand-written recursive descent can be made to deal with natural grammars nicely. Of course this has no bearing on the issue that you’ll have no idea what language is actually parsed! I fully agree with this criticism of hand-written parsers.

    I am somewhat salty about this issue because I took five different courses in formal grammars at university, and all of them equated hand-written parsers with pure recursive descent, and none mentioned Pratt parsers (which I discovered by accident from Crockford’s article). It’s a shame, given that a Pratt parser is what I have to use in production :-)

    1. 1

      My major problem with recursive descent parsers is that they rarely accept the languages that people thought they did when they wrote them – the fact that they’re generally more verbose than CFGs is a secondary issue IMHO. It may amuse you to know that the same Michael Van De Vanter I referenced in my post is the same Michael Van De Vanter whose master’s thesis expanded and popularised the not-yet-then-called Pratt parser concept and the same Michael Van De Vanter who recommended Tim Wagner’s thesis to me as the way to go with incremental parsing :) Michael’s influence on the field is much underrated, possibly because he’s too much of a gentleman to toot his own horn.

      As that suggests, I think Tim’s work (which underlies Tree Sitter) can probably address some of your challenges. I don’t know if Tree Sitter implements Tim’s full approach to error recovery, which is much closer to what you call “error resilience” than it is to other error recovery approaches. When Lukas looked at this part of Tim’s work, he found a bug or two and some missing bits: you might find Lukas’s thesis worth reading (particularly Chapter 3). However, to fully address challenge 1 specifically you would need to be able to inform the parsing system of more of the language’s semantics to get the sort of output you would like. That’s eminently doable – lrpar does a very limited form of this with its %avoid_insert directive – but current systems don’t do it very much.

      I’d argue that Yacc more or less solves your second challenge already – you can use the stategraph -> statetable conversion to write ambiguous expression grammars that are resolved to unambiguous ones using declarations such as %left and so on. I’m not convinced that people always understand the consequences of these declarations, so I tend to discourage them, but when they’re used judiciously, they work well.
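
      For example, the idiom looks something like this (an illustrative fragment only, where INT stands for an integer token declared elsewhere):

      %left '+' '-'
      %left '*' '/'
      %%
      Expr: Expr '+' Expr
          | Expr '-' Expr
          | Expr '*' Expr
          | Expr '/' Expr
          | INT
          ;

      The %left declarations resolve the conflicts in the stategraph so that '*' and '/' bind more tightly than '+' and '-', and all four operators associate to the left.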

      The third challenge… well isn’t that Tree Sitter, or something like it? I wasn’t quite sure exactly what you were asking for there to be honest!

      BTW, in case I sound grumpy, I don’t mean to be! Thanks for rust-analyzer: I use it daily!

      1. 1

        My major problem with recursive descent parsers is that they rarely accept the languages that people thought they did when they wrote them.

        Yeah, I am in full agreement with you on this point.

        The third challenge… well isn’t that Tree Sitter, or something like it? I wasn’t quite sure exactly what you were asking for there to be honest!

        So, I mean I need IDE support for the grammar meta language itself. AFAIK, Tree Sitter doesn’t have this. Live preview of the grammar, like in this short video, would make writing a grammar much more approachable IMO. This is purely an implementation/usability concern, but a pretty important one I would say. Parser generators have a pretty high barrier to entry; lowering it would be valuable.

        1. 1

          So, I mean I need IDE support for the grammar meta language itself.

          Ah, yes, I agree, this is nice to have: Eco supports its own grammar meta-language, for example.

          1. 1

            About the first point, do you have tests that the parser indeed corresponds to the grammar it’s supposed to parse? I mean more than unit tests. I imagine one could describe the grammar as a generator (in quickcheck or proptest) and then try the parser on that… But is it done in practice?
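
            Something along these lines is what I have in mind (just a sketch in Python; parse is the hypothetical parser under test, and the grammar is a toy E := E "+" E | E "*" E | NUM):

            import random

            def gen_expr(depth=0):
                # Generate a random sentence of the toy expression grammar.
                if depth >= 3 or random.random() < 0.5:
                    return str(random.randint(0, 99))  # NUM
                op = random.choice([" + ", " * "])
                return gen_expr(depth + 1) + op + gen_expr(depth + 1)

            def check_parser_matches_grammar(parse, n=1000):
                # Every sentence generated from the grammar should be accepted.
                for _ in range(n):
                    s = gen_expr()
                    assert parse(s) is not None, "parser rejected %r" % s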

      1. 5

        Nice article! FWIW I went through a similar evolution of avoiding LR and coming back to it (to a degree).

        I posted this thread last year, which resulted in many interesting links and more approaches to parsing than I would have thought (which is further evidence that parsing is still a problem!)

        Survey: What parser generator are you using, if any?

        Although I didn’t “come back” as far as you did. I appreciate the work on error recovery and will have to look into it more deeply.

        Here are the “unresolved” tradeoffs I see:

        • As you mention, I would still go for recursive descent for a production quality implementation of an existing language. You can test against existing code, and you can produce better error messages. It takes a little practice but I think it’s worth it for most programmers.
        • Interfacing with the lexer is surprisingly nontrivial and people do awkward things to avoid it: https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-Should-Be-Separate
        • I like grammars for new languages. For the time being, I’m sticking with Python’s pgen LL(1) parser generator for Oil, not LR parsing. Python itself switched to PEG last year, so it no longer uses pgen! But the problems that drove that switch don’t apply to Oil because it has a slightly more explicit syntax. Still, I did see very clearly the limitations of this LL style of parsing after using it.

        Why did I stick with it? Mainly because I could borrow a good implementation of it (one that’s been used for decades by millions of people), and because I like the EBNF syntax in the meta language. I find EBNF with + and * to be a lot more readable than “recursion”. I asked Guido and he said this is one of the main reasons he developed his own parsing system for Python.

        Example: https://docs.python.org/3/reference/grammar.html (you can see a lot of the tortured productions that caused them to switch metalanguages; Python 2 had fewer of these, while making elegant use of EBNF)

        There doesn’t appear to be an LR parser generator that uses + and * in the metalanguage rather than recursion. And I believe when you do an automatic conversion, you end up with a bunch of “extra” rules that make the parser harder to debug and less efficient. I didn’t go deep into that.
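
        To make the contrast concrete, here’s a hypothetical argument-list rule written three ways:

        # EBNF with repetition (pgen style):
        arglist: expr ("," expr)*

        # Hand-written left recursion, as you'd feed to an LR generator:
        arglist: expr | arglist "," expr

        # A mechanical EBNF-to-BNF conversion, with the "extra" helper rule
        # (the first alternative of arglist_tail is empty):
        arglist: expr arglist_tail
        arglist_tail: | arglist_tail "," expr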

        I will be reading this article more carefully, but all-in-all I agree there are a lot of problems to be solved with parsing. And coming back to LR and fixing its problems is one thing we should be doing!

        1. 3

          I wonder if you know of anyone working on these lexing issues?

          The design space is complicated by the fact that some people want more out of parsers/lexers than others. For a traditional “single language batch parser”, something simple like lex has always worked for me. But if you want to do things like embed languages with different syntaxes, you then have a lot of possible choices to muddy the waters. Eric Van Wyk and his gang did some interesting work in this area that’s worth looking at, in the sense of fairly traditional parsers interacting with changing lexers. I don’t know if anyone’s looking at it at the moment or not though.

          1. 2

            OK, but I would say the lexing/parsing problem has gotten more demanding in the last 20 years, and certainly since most LR parsing tools were written. We aren’t parsing Pascal or Fortran anymore!

            Hence all the ad hoc hacks you see in the various lexing and parsing metalanguages.

            I think the strongest argument that the problem is more difficult now is that most languages have string interpolation of arbitrary expressions: in the past 10 years, Python added it with “f-strings”, JS added it with backticks, Swift has it with \(), etc. It’s just a problem that most people have to deal with.

            Although there are others, like regexes in JS and Perl: When Are Lexer Modes Useful?
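
            To make the “lexer modes” idea concrete, here’s a toy sketch (nothing like Oil’s real lexer) that keeps a stack of modes and switches between an expression mode and a string mode for ${expr}-style interpolation:

            def lex(src):
                # Toy modal lexer: double-quoted strings may contain ${expr}
                # interpolations, so the lexer tracks a stack of modes and
                # switches between expression mode and string mode.
                modes = ["expr"]
                tokens = []
                i = 0
                while i < len(src):
                    c = src[i]
                    if modes[-1] == "expr":
                        if c == '"':                    # a string literal starts
                            tokens.append(("DQ", c))
                            modes.append("string")
                            i += 1
                        elif c == '}':                  # an interpolation ends
                            tokens.append(("RBRACE", c))
                            if len(modes) > 1:
                                modes.pop()
                            i += 1
                        elif c.isspace():
                            i += 1
                        else:                           # crude catch-all token
                            j = i
                            while j < len(src) and not src[j].isspace() and src[j] not in '"}':
                                j += 1
                            tokens.append(("WORD", src[i:j]))
                            i = j
                    else:                               # string mode
                        if c == '"':                    # the string literal ends
                            tokens.append(("DQ", c))
                            modes.pop()
                            i += 1
                        elif src.startswith("${", i):   # an interpolation starts
                            tokens.append(("INTERP", "${"))
                            modes.append("expr")
                            i += 2
                        else:                           # literal characters
                            j = i + 1
                            while j < len(src) and src[j] != '"' and not src.startswith("${", j):
                                j += 1
                            tokens.append(("CHARS", src[i:j]))
                            i = j
                return tokens

            The string rules and the expression rules can’t live in one flat set of regexes; the lexer has to know which mode it’s in, and that interface between lexer and parser is exactly the part that generator metalanguages tend to handle badly.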

            I think the issue is conflating LR parsing and parser generators. All parser generators seem to have problems with these constructs, regardless of whether they are LR or not!

            You see it in all of the /r/ProgrammingLanguages threads I linked. Most implementers/language designers still complain about all the hacks involved in grammars, and they choose to write parsers by hand for that reason. Here’s another similar thread:

            https://old.reddit.com/r/ProgrammingLanguages/comments/dwxwyq/what_lexerparser_generators_do_people_use_these/


            Now that I’ve read the article a bit more carefully, I didn’t find some of the arguments in favor of LR that convincing. They seemed to be along the lines of “well the problems aren’t that bad, especially once you get used to it”.

            I would say the same thing about recursive descent: it’s not that bad once you get used to it. (But again, I prefer grammars for new languages.)

            But I did go through a similar evolution – I went back and looked at why I was avoiding LR, although again I didn’t get as far as you.

            I now firmly believe that LR parsing is the best approach for the vast majority of purposes: it has the strongest practical safety guarantees, allows good grammar readability, and has decent performance.

            So I would have to quibble with this conclusion now. I believe “the vast majority of purposes” involves some problem that parser generators (LR or not) are bad at handling – at least in terms of implementing programming languages, vs. simpler DSLs. It might be only one or two parts of the language, but those can be a “dealbreaker”.

            I think composing parsing algorithms to fit the language is the most flexible and powerful approach. Oil uses hand-written recursive descent, Pratt parsing, and an LL(1) generated parser, all with the same “modal” lexer. Of course shell is one of the most syntactically complex languages out there, but I think overall languages are trending in that direction!

            People want more elaborate syntax, not less! Rust is another good example of this. It has much more syntax than C, and arguably more than C++.

            1. 1

              Until recently Java (not a small language, syntactically or semantically) had an LR grammar (maybe it still could, but I’m not sure), so grammar size is not an argument against LR. Of course, some languages are an impossible (e.g. because they’re ambiguous) or poor (e.g. see Ruby’s Yacc grammar) fit for LR, and it would be silly to use LR for those. But as Guy Steele suggests, new languages that don’t have an LR grammar are making users’ lives harder than necessary.

              1. 1

                Well Steele says that he uses LR parser generators to check his language design, but he doesn’t actually use them to implement the language!

                I totally agree with that, and that’s what I’m doing for Oil! The new language is prototyped with an LL(1) parser generator. Eventually it will probably be replaced by a hand-written parser, but, as I mentioned, I prefer grammars for new languages.

                The Lua authors expressed the same sentiment in a paper – they used yacc to check their language design, and then switched to a hand-written parser.

                I also think there’s a subtlety: it’s great to use a parser generator to check the design of a large subset of a language. That’s still useful. But yacc falls down for a couple of cases in a big language.

                So I would still make the following claim:

                Humans are better at parsing than common algorithms.

                Humans like “subtle syntax”. They don’t have any problems reading string interpolation – they like it.

                Humans like things to be “Huffman coded”, which requires more syntax. And they like syntax to match semantics. The opposite is Lisp, which has sort of “lost” syntactically.

                That said, I appreciate the work going back to look at LR parsing and improving upon the state of the art. I think LR is super useful but I personally think of it as part of the story – i.e. composing LR with other algorithms where appropriate.

        1. 7

          A very good overview, though I’ve gone through a similar process and come to the opposite conclusion: an LL grammar and recursive descent parser are the simplest and most practical, and if your language can’t be parsed by it, it’s best for all involved to change your grammar until it can be.

          The thing is, the classic Mathematical Expression problem extends into dealing with prefix and postfix operators in general, which are really pretty common in most programming languages. Once you allow a function call to be Expr "(" [Expr] {"," Expr} ")" or whatever, you have the left-recursion problem again. Similarly with a[1][2] array dereferences and foo.bar.baz structure fields. There’s nothing wrong with a Lisp syntax IMO, but people don’t seem to like it, so it’s worth putting a bit of effort into making these shortcuts, and they all have to deal with left recursion.

          So you do the traditional solution of expressing the precedence rules involved in these with a chain of different rules, and rapidly you end up with a giant chain of 10-12 different rules. Some people seem to just Get This and absorb it fine, but even in those moments when I can actually hold what’s going on in my head, it falls out the instant I move on to something else. I’ve been much, much happier writing parsers since I discovered Pratt parsing and equivalent algorithms, which are extensions to a simple recursive-descent parser that solve these problems in a way I find much more comprehensible.
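
          For what it’s worth, here’s roughly the shape that finally stuck in my head: the whole precedence chain collapses into a table of binding powers plus one loop. This is only a sketch (numbers, parentheses and four infix operators, with the token stream as a plain list of strings):

          BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

          def parse_expr(tokens, pos=0, min_bp=0):
              # Parse one "prefix" item: a number or a parenthesised expression.
              tok = tokens[pos]
              if tok == "(":
                  left, pos = parse_expr(tokens, pos + 1, 0)
                  pos += 1                             # skip ")"
              else:
                  left, pos = ("num", tok), pos + 1
              # Then loop, consuming infix operators while they bind tightly enough.
              while pos < len(tokens):
                  op = tokens[pos]
                  bp = BINDING_POWER.get(op)
                  if bp is None or bp < min_bp:
                      break
                  right, pos = parse_expr(tokens, pos + 1, bp + 1)  # +1 => left-assoc
                  left = (op, left, right)
              return left, pos

          So parse_expr("2 + 3 * 4".split())[0] gives ("+", ("num", "2"), ("*", ("num", "3"), ("num", "4"))). Postfix things like calls, indexing and field access fall out of the same loop: they’re just more operators with high binding power whose “right-hand side” happens to be an argument list, an index or a name.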

          1. 3

            […] if your language can’t be parsed by it, it’s best for all involved to change your grammar until it can be.

            Is there a tool that takes a grammar and tells whether it’s LL?

            1. 4

              An LL parser generator (e.g. JavaCC) should be able to reject non-LL grammars (if it doesn’t… it’s probably not a tool you want to use!). [Caveat: there are different classes of LL (e.g. LL(1), LL(k)), but mostly people tend to mean LL(1).]

              1. 2

                Thanks!

                (Ab)using a parser generator crossed my mind, but the ones I’m familiar with (Yacc etc.) are LR.

          1. 1

            One question. You could use LR grammars without using an LR parser, right? I.e. if your target was to avoid ambiguity, you could still have used LR grammars with Earley.

            PS: This link is 404ing: https://web.archive.org/web/20150919164029/https://www.cs.berkeley.edu/Research/Projects/harmonia/papers/twagner-thesis.pdf

            1. 1

              One question. You could use LR grammars without using an LR parser, right?

              Yes… but unless you check with a non-Earley parser you won’t know if changes introduce ambiguity (plus generalised parsers are likely to be slower than non-generalised parsers, as they have to deal with more cases). I’m not sure that buys you very much. Thanks for the link report – oddly it does work if you wait for archive.org to reload itself for several seconds, but I’ve changed it to a link that I hope works more reliably.

              1. 1

                I can use an external parser to check it; for example, I can use your parser in Rust to check my grammar, but I can stick with my Earley parser in Python for my Python application because it is much simpler than an LR parser. About speed, do we have any benchmarks comparing parsers with LR grammars? Even Earley has linear-time parsing for unambiguous grammars.

                1. 1

                  Even Earley has linear-time parsing for unambiguous grammars.

                  That’s only true if you apply the relevant optimisations, such as Leo’s. And if speed matters at all to you, you need to apply Practical Earley Parsing from Aycock and Horspool instead of building each Earley item manually.

                  Fast Earley parsing is a ton of work.

                  1. 1

                    Indeed, no arguments there. I was wondering how slow Earley (and other fast general context-free parsers) is in practice (with or without Leo’s optimization) when they are pitted against LR parsers for the same grammar.

                    Personally, I find table driven parsers more difficult to debug, but I agree with Laurence that ambiguity in grammars is a pain.

            1. 2

              Hi Laurence, thanks for the great write-up. Looking at your blog, I found that you have previously published about error correction with LR parsers, and the code examples on that page were in Python. Is there any chance you might have an implementation of the CPCT+ algorithm in Python lying around?

              1. 2

                I’m afraid there’s not a Python implementation of CPCT+ as such, because it needs to be plugged into an LR parser. It’s probably not rocket science to add it to an existing Python LR parser (especially as the algorithm examples in the paper are Python-ish), but I’m very out of date on what the best Python LR parsers might be unfortunately. Sorry!

              1. 14

                I’d need to be convinced that another interpreter solves any major problems of cross-language interoperability. As far as I can tell, the challenge is squaring the circle of different semantics.

                What happens when you take a Rust affine value, pass it to Python, which assigns a few copies of it around, and passes all of those copies back? What happens when you pass something with a destructor to a language that doesn’t support destructors? What happens with weak references that are implicitly nullable when you pass them to a language without NULL?

                Getting all of that to work is where the effort in cross-language interop lies.

                You might be able to get it to work in languages that feel close – the .NET CLR does this, for example. But getting Haskell, C++, Rust, and Python to easily interoperate without tons of glue code seems… unlikely.

                And if you’re ok with glue code, the C ABI and the various language FFIs more or less let you do the job.

                1. 6

                  The C ABI does the job, but I don’t think I’m revealing anything shocking when I say it could be greatly improved. Have you seen Fine-grained Language Composition by Laurence Tratt? It is a clear proof of concept of what is possible, and we are very far from it.

                  1. 3

                    Of course there’s room for improvement – the x86_64 calling convention is way too complicated. But that’s beside the point. The hard part isn’t the calling convention or the VM, it’s figuring out how to handle all of the language semantics.

                    That proof of concept is interesting – but I didn’t see any challenging features shared. What does PHP do with a StopIteration exception from Python? Can it consume a Python iterator at all? From that description, it seems like you can’t easily pass a list from PHP to Python – you need to manually adapt it. And you can get exceptions if the adapted PHP variable stops looking like a list.

                    And, these are two languages that most people would consider close.

                    1. 3

                      And, these are two languages that most people would consider close.

                      I made that mistake when we started that work, but was quickly disabused of it. PHP and Python look close only if you don’t look closely, mostly because PHP is a very unusual language (PHP’s references, for example, are rather unique). You might find the full paper (https://soft-dev.org/pubs/html/barrett_bolz_diekmann_tratt__fine_grained_language_composition/) has more detail about the challenges and how they were resolved. For example, Section 6 tackles cross-language variable scoping, which stretched my (admittedly meagre) language design abilities to, and possibly beyond, their maximum!

                      1. 2

                        the x86_64 calling convention is way too complicated

                        It’s quite simple. First few integral parameters in registers. First few floating parameters in floating registers. Further arguments on the stack. Large parameters (structs) passed either as pointers or on the stack.

                        1. 2

                          Large parameters (structs) passed either as pointers or on the stack.

                          The complexity comes from how small structs are passed and returned. It’s not unmanageable, but it’s enough to be painful compared to the Windows ABI. The rules around type interactions in small structs are more complicated than they need to be. This makes determining which registers to use to pass a struct overly complex.

                    2. 4

                      Rather than cross-language interoperability, I think that the real value would be in providing a single platform that could be ported to different systems and bring tons of existing code to them. Basically, taking the JVM portability advantage and democratizing it. It would really benefit niche systems. (Commoditize your complements!)

                      1. 1

                        But getting Haskell, C++, Rust, and Python to easily interoperate without tons of glue code seems… unlikely.

                        But that doesn’t mean there wouldn’t be wins in other areas. Programming in Haxe was like programming in 1 + (targets * .5) languages because you had to be mindful of how everything would be translated. However, things like unit testing were O(1) because the test suites would be automagically generated and run for each target.

                        And if you’re ok with glue code, the C ABI and the various language FFIs more or less let you do the job.

                        I agree with you insofar as I’m skeptical language interoperability can go beyond functional interfaces passing serializable data structures. However, I think we can significantly improve on the C FFI baseline and piping strings around.

                        TruffleRuby’s implementation of the Ruby C FFI requires interpreting C code, yet it still manages to beat the pants off of mainline because the common AST representation of the two languages allows the JVM to perform optimizations on a shared in-memory representation of their data. Or take the WASM type spec, which would allow for similar optimizations on similar data structures while providing sane defaults when converting between primitive values.

                      1. 12

                        I agree with one implicit message of this post which is that parsing is an almost endless series of trade-offs. For example, for a simple binary format, I’d probably use PEGs, but when parsing textual formats, bitter experience has conditioned me to prefer LR. There are various reasons for this, such as: PEGs are easy to write but, in general, it’s impossible to work out what language they accept (most medium/large PEG parsers I’ve looked at in detail do not work as their authors expected); LR parsers are easy to reason about but more awkward to write and generally have poor error handling. I’ve tried to address that latter issue in grmtools/lrpar https://softdevteam.github.io/grmtools/master/book/errorrecovery.html – it’s not the right solution for everyone, but it might be another interesting point in the endless space of parsing trade-offs!

                        1. 4

                          menhir is an advanced parser generator in OCaml and features pretty good error recovery mechanisms, as far as I know. It’s used in merlin, the semantic server for OCaml, which is extremely robust. LR or LALR generators can be better than yacc!

                          1. 5

                            Menhir has a very cool, but very different, approach to error recovery. In essence, users can manually annotate the stategraph with detailed error messages (and Menhir has a cunning mechanism to make it semi-automatic to migrate those messages as the grammar changes) that can help users understand what’s going on. grmtools/lrpar’s approach is fully automated. As always in parsing there are pros and cons to this! There’s more details in the related work section of https://arxiv.org/abs/1804.07133 if you’re interested.

                        1. 1

                          Left recursion and PEGs is a surprisingly hard topic. The first algorithm I know of in this area, by Warth et al., can produce incorrect parses (I wrote that up here).

                          I found it impossible to understand the output of your parser, so I quickly added this pretty printer so that I could see the tree structure more easily:

                          def pp(t, i=0):
                            # Indent by i, print each node's label (its first
                            # element), then recurse into the remaining children.
                            for x in t:
                              print("%s%s" % (" " * i, x[0]))
                              for y in x[1:]:
                                pp(y, i + 1)
                          

                          With that, I can see that your algorithm does correctly parse the example that Warth et al. have a problem with (1-2-3), but I’m not sure how you encode precedence (so 2+3*4 is not parsed correctly, no matter what order I reorder the productions in E), so I’m not quite sure how fair a comparison this is.

                          Anyway, I wish you luck in your quest! This stuff made my head hurt so much that I gave up on PEGs entirely and moved on to LR parsing.

                          1. 1

                            Thank you for the link! I haven’t really thought about precedence as encoded in the PEG order, but I see that it is a major issue.