1. 55
  1.  

  2. 9

    I hope you don’t mind if I take this opportunity to plug my parser, Lark: https://github.com/lark-parser/lark/

    It’s written in Python (pro or con? you decide), and supports both Earley and LALR(1), and has a long list of features.

    Unlike the package written by the author of the article, it’s not YACC-compatible, and I think it’s for the best :)

    It actually improves on the classic LALR(1) algorithm, and can parse things that YACC (or PLY) cannot.

    1. 3

      I used Lark for a recent project and it was an absolute pleasure, thanks for the hard work!

      1. 1

        Nice, the implementation looks very clean!

        Just one note, you’re using ambiguity="explicit", but doesn’t seem like there’s code to handle an ambiguity (although there probably shouldn’t be one). If it’s just there for debugging, maybe add an _ambig rule to your transformer that throws a nice error.

        1. 1

          Thanks. You’re correct about not correctly handling _ambig - it should throw an error.

          As author of Lark, what are your thoughts on the ideas in this blog post?

          Vis-a-vis my project: before I build something with it, I’d like to make the error handling better, and improve the performance - do you think these are both feasible whilst still using Lark?

          1. 2

            Re performance - absolutely. You’re using the Earley algorithm, which is strong but pretty slow. If you switch to LALR you will see a big improvement in performance (something like 10x if not more), and it should be capable of parsing your language. You might not even need to change that much of your grammar.

            Re errors - I’ll be honest, I didn’t read the article that deeply. But the examples it gives at the end (in the table), are definitely things you can do with Lark. Check out this example to see how: https://github.com/lark-parser/lark/blob/master/examples/advanced/error_reporting_lalr.py

            Error recovery is also possible. It’s still a new feature, but it’s already usable, and I believe it will become even better soon. See this example: https://github.com/lark-parser/lark/blob/master/examples/advanced/error_puppet.py

      2. 2

        Thanks for Lark!

        I’ve used Lark in the past to parse (a subset of) the LLVM-IR.

        1. 2

          Oh dang, lark is awesome, thanks for building and maintaining such a great project. The ability to swap LALR and Earley without having to change the grammar and the terse way of notating the grammar to elide unused nodes to keep the parse tree clean are pretty killer.

          1. 2

            I used Lark for a take-home interview question recently and it came in super handy. Thanks for your work on it!

            1. 1

              Does it support error recovery (general context-free error recovery as in Aho et al.) by any chance?

              1. 1

                It supports manual error recovery, not automatic. The article you linked to sounds interesting, unfortunately I’ll have to pay to view it and find out if it’s actually interesting.

            2. 7

              A very good overview though I’ve gone through a similar process and come to the opposite conclusion: A LL grammar and recursive descent parser are the simplest and most practical and if your language can’t be parsed by it, it’s best for all involved to change your grammar until it can be.

              The thing is, the classic Mathematical Expression problem extends into dealing with prefix and postfix operators in general, which are really pretty common in most programming languages. Once you allow a function call to be Expr "(" [Expr] {"," Expr} ")" or whatever you have the left-recursion problem again. Similar with a[1][2] array dereferences and foo.bar.baz structure fields. There’s nothing wrong with a Lisp syntax IMO, but people don’t seem to like it so it’s worth putting a bit of effort into making these shortcuts, and they all have to deal with left recursion.

              So you do the traditional solution of expressing the precedence rules involved in these with a chain different rules, and rapidly you end up with a giant chain of 10-12 different rules. Some people seem to just Get This and absorb it fine, but even in those moments when I can actually hold what’s going on in my head, it falls out the instant I move on to something else. I’ve been much, much happier writing parsers since I discovered Pratt parsing and equivalent algorithms, which are extensions to a simple recursive-descent parser that solves these problems in a way I find much more comprehensible.

              1. 3

                […] if your language can’t be parsed by it, it’s best for all involved to change your grammar until it can be.

                Is there a tool that takes a grammar and tells whether it’s LL?

                1. 4

                  An LL parser (e.g. JavaCC) should be able to reject non-LL grammars (if it doesn’t… it’s probably not a tool you want to use!). [Caveat: there are different classes of LL (e.g. LL(1), LL(k)), but mostly people tend to mean LL(1).]

                  1. 2

                    Thanks!

                    (Ab)using a parser generator crossed my mind, but the ones I’m familiar with (Yacc etc.) are LR.

              2. 5

                Nice article! FWIW I went through a similar evolution of avoiding LR and coming back to it (to a degree).

                I posted this thread last year, which resulted in many interesting links and more approaches to parsing than I would have thought (which is further evidence that parsing is still a problem!)

                Survey: What parser generator are you using, if any?

                Although I didn’t “come back” as far as you did. I appreciate the work on error recovery and will have to look into it more deeply.

                Here are the “unresolved” tradeoffs I see:

                • As you mention, I would still go for recursive descent for a production quality implementation of an existing language. You can test against existing code, and you can produce better error messages. It takes a little practice but I think it’s worth it for most programmers.
                • Interfacing with the lexer is surprisingly nontrivial and people do awkward things to avoid it: https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-Should-Be-Separate
                • I like grammars for new languages. For the time being, I’m sticking with Python’s pgen LL(1) parser generator for Oil, not LR parsing. Python switched to PEG last year, so it no longer uses it! But those problems don’t apply to Oil because it has a slightly more explicit syntax. Still, I did see very clearly the limitations of this LL style of parsing after using it.

                Why did I stick with it? Mainly because I could borrow a good implementation of it (one that’s been used for decades by millions of people), and because I like the EBNF syntax in the meta language. I find the EBNF with + and * to be a lot more readabable than “recursion”. I asked Guido and he said this is one of the main reasons he developed his own parsing system for Python.

                Example: https://docs.python.org/3/reference/grammar.html (you can see a lot of the tortured productions that caused them to switch metalanguages; Python 2 had fewer of these, while making elegant use of EBNF)

                There doesn’t appear to be an LR parser generator that uses + and * in the metalanguage rather than recursion. And I believe when you do an automatic conversion, you end up with a bunch of “extra” rules that makes the parser harder to debug and less efficient. I didn’t go deep into that.

                I will be reading this article more carefully, but all-in-all I agree there are a lot of problems to be solved with parsing. And coming back to LR and fixing its problems is one thing we should be doing!

                1. 3

                  I wonder if you know of anyone working on these lexing issues?

                  The design space is complicated by the fact that some people want more out of parsers/lexers than others. For a traditional “single language batch parser”, something simple like lex has always worked for me. But if you want to do things like embed languages with different syntaxes, you then have a lot of possible choices to muddy the waters. Eric Van Wyk and his gang did some interesting work in this area that’s worth looking at, in the sense of fairly traditional parsers interacting with changing lexers. I don’t know if anyone’s looking at it at the moment or not though.

                  1. 2

                    OK but I would say the lexing / parsing problem has gotten more demanding in the last 20 years, and since most LR parsing tools were written. We aren’t parsing Pascal or Fortran anymore!

                    Hence all the ad hoc hacks you see in the various lexing and parsing metalanguages.

                    I think the strongest argument that the problem is more difficult now is that most languages have string interpolation of arbitrary expressions: in the past 10 years, Python added it with “f-strings”, JS added it with backticks, Swift has it with \(), etc. It’s just a problem that most people have to deal with.

                    Although there are others, like regexes in JS and Perl: When Are Lexer Modes Useful?

                    I think the issue is conflating LR parsing and parser generators. All parser generators seem to have problems with these constructs, regardless of whether they are LR or not!

                    You see it all of the /r/ProgrammingLanguages threads I linked. Most implementers/language designers still complain about all the hacks involved in grammars, and they choose to write parsers by hand for that reason. Here’s another similar thread:

                    https://old.reddit.com/r/ProgrammingLanguages/comments/dwxwyq/what_lexerparser_generators_do_people_use_these/


                    Now that I’ve read the article a bit more carefully, I didn’t find some of the arguments in favor of LR that convincing. They seemed to be along the lines of “well the problems aren’t that bad, especially once you get used to it”.

                    I would say the same thing about recursive descent: it’s not that bad once you get used to it. (But again, I prefer grammars for new languages.)

                    But I did go through a similar evolution – I went back and looked at why I was avoiding LR, although again I didn’t get as far as you.

                    I now firmly believe that LR parsing is the best approach for the vast majority of purposes: it has the strongest practical safety guarantees, allows good grammar readability, and has decent performance.

                    So I would have to quibble with this conclusion now. I believe “the vast majority of purposes” involves some problem that parser generators (LR or not) are bad at handling – at least in terms of implementing programming languages, vs. simpler DSLs. It might be only one or two parts of the language, but those can be a “dealbreaker”.

                    I think composing parsing algorithms to fit the language is the most flexible and powerful approach. Oil uses hand-written recursive descent, Pratt parsing, and a LL(1) generated parser, all with the same “modal” lexer. Of course shell is one of the most syntactically complex languages out there, but I think overall languages are trending in that direction!

                    People want more elaborate syntax not less elaborate! Rust is another good example of this. It has much more syntax than C, and arguably C++.

                    1. 1

                      Until recently Java (not a small language, syntactically or semantically) had an LR grammar (maybe it still could, but I’m not sure), so grammar size is not an argument against LR. Of course, some languages are an impossible (e.g. because they’re ambiguous) or poor (e.g. see Ruby’s Yacc grammar) fit for LR, and it would be silly to use LR for those. But as Guy Steele suggests, new languages that don’t have an LR grammar are making user’s lives harder than necessary.

                      1. 1

                        Well Steele says that he uses LR grammar generators to check his language design, but he doesn’t actually use them to implement the language!

                        I totally agree with that, and that’s what I’m doing for Oil! The new language is prototyped with an LL(1) parser generator. Eventually it will probably be replaced by a hand-written parser, but, as I mentioned, I prefer grammars for new languages.

                        The Lua authors expressed the same sentiment in a paper – they used yacc to check their language design, and then switched to a hand-written parser.

                        I also think there’s a subtlety: it’s great to use a grammar generator to check the design of a large subset of a language. That’s still useful. But yacc falls down for a couple cases in a big language.

                        So I would still make the following claim:

                        Humans are better at parsing than common algorithms.

                        Humans like “subtle syntax”. They don’t have any problems reading string interpolation – they like it.

                        Humans like things to be “huffman coded”, which requires more syntax. And for syntax to match semantics. The opposite is Lisp, which has sort of “lost” syntactically.

                        That said, I appreciate the work going back to look at LR parsing and improving upon the state of the art. I think LR is super useful but I personally think of it as part of the story – i.e. composing LR with other algorithms where appropriate.

                2. 4

                  Excellent article! I agree with all points made, and wrote a response at https://rust-analyzer.github.io//blog/2020/09/16/challeging-LR-parsing.html.

                  There’s one specific criticism I have though, which I’d rather submit as a comment. Note that I’ll only criticize this one thing, simply because I am silently agreeing with everything else :-)

                  I think the article strawmans recursive descent parsing a bit. Or, rather, it is 100% precise and technically correct about pure recursive descent, but in practice people don’t use pure recursive descent. A specific claim I disagree with is this one:

                  lack of left recursion makes expressing many standard programming language constructs as awkward as with recursive descent parsers.

                  I claim that handing left-recursion with a hand-written parser is easy. Specifically, the code you write to parse xs := xs x | x is this:

                  def parse_xs():
                      while not eof:
                          parse_x()
                  

                  Loops are the crucial difference between pure recursive descent as described in the literature, and recursive descent as implemented in practice.

                  I think the article would benefit from mentioning, if even in passing, that by adding loops and Pratt parsing, hand-written recursive descent can be made to deal with natural grammars nicely. Of course this has no bearing on the issue that you’ll have no idea about what actual language will be parsed! I fully agree with this criticism of hand-written parsers.

                  I am somewhat salty about this issue because I took five different courses in formal grammars at the University, and all of them equated hand-written parser with pure recursive descent, and none mentioned Pratt parsers (which I have discovered by accident from Crockford article). It’s a shame, given that Pratt parser is what I have to use in production :-)

                  1. 1

                    My major problem with recursive descent parsers is that they rarely accept the languages that people thought they did when they wrote them – the fact that they’re generally more verbose than CFGs is a secondary issue IMHO. It may amuse you to know that the same Michael Van De Vanter I referenced in my post is the same Michael Van De Vanter whose masters thesis expanded and popularised the not-yet-then-called-Pratt parser concept and the same Michael Van De Vanter who recommended Tim Wagner’s thesis to me as the way to go with incremental parsing :) Michael’s influence on the field is much underrated, possibly because he’s too much of a gentleman to toot his own horn.

                    As that suggests, I think Tim’s work (which underlies Tree Sitter) can probably address some of your challenges. I don’t know if Tree Sitter implements Tim’s full approach to error recovery, which is much closer to what you call “error resilience” than it is to other error recovery approaches. When Lukas looked at this part of Tim’s work, he found some a bug or two and some missing bits: you might find Lukas’s thesis worth reading (particularly Chapter 3). However, to fully address challenge 1 specifically you would need to be able to inform the parsing system of more of the language’s semantics to get the sort of output you would like. That’s eminently doable – lrpar does a very limited form of this with its %avoid_insert directive – but current systems don’t do it very much.

                    I’d argue that Yacc more or less solves your second challenge already – you can use the stategraph -> statetable conversion to write ambiguous expression grammars that are resolved to unambiguous ones using declarations such as %left and so on. I’m not convinced that people always understand the consequences of these declarations, so I tend to discourage them, but when they’re used judiciously, they work well.

                    The third challenge… well isn’t that Tree Sitter, or something like it? I wasn’t quite sure exactly what you were asking for there to be honest!

                    BTW, in case I sound grumpy, I don’t mean to be! Thanks for rust-analyzer: I use it daily!

                    1. 1

                      My major problem with recursive descent parsers is that they rarely accept the languages that people thought they did when they wrote them.

                      Yeah, I am in full agreement with you on this point.

                      The third challenge… well isn’t that Tree Sitter, or something like it? I wasn’t quite sure exactly what you were asking for there to be honest!

                      So, I mean I need IDE support for the grammar meta language itself. AFAIK, tree sitter doesn’t have this. Live preview of the grammar, like in this short video, would make writing a grammar much more approachable imo. This is purely an implementation/usability concern, but a pretty important one I would say. Parser generators have a pretty high barrier to entry, lowering it would be valuable.

                      1. 1

                        So, I mean I need IDE support for the grammar meta language itself.

                        Ah, yes, I agree, this is nice to have: Eco supports its own grammar meta-language, for example.

                        1. 1

                          About the first point, do you have tests that the parser indeed corresponds to the grammar it’s supposed to parse? I mean more than unit tests. I imagine one could describe the grammar as a generator (in quickcheck or proptest) and then try the parser on that… But is it done in practice?

                    2. 3

                      Parsing Expression Grammars (PEG)s and “parser combinators” in some functional languages are just recursive descent parsers in disguise.

                      That’s just silly. While PEGs do work by recursive descent, they’re a big advance over the way a hand-coded RD parser works. PEG generators let you describe a grammar with unlimited look ahead, which really simplifies things. I’ve written RD parsers by hand, I’ve used yacc, and most recently I used a PEG-based tool. The last experience was by far the best.

                      1. 2

                        And I’ve written backtracking RD parsers by hand, just because you have unlimited lookahead, doesn’t mean it’s not using an LL algorithm to parse.

                      2. 2

                        Hi Laurence, thanks for the great write up. Looking at your blog, I found that you have published previously about error correction with LR parsers, and the code examples in that page were in Python. Is there any chance you might have the implementation of CPCT+ algorithm in Python lying around?

                        1. 2

                          I’m afraid there’s not a Python implementation of CPCT+ as such, because it needs to be plugged into an LR parser. It’s probably not rocket science to add it to an existing Python LR parser (especially as the algorithm examples in the paper are Python-ish), but I’m very out of date on what the best Python LR parsers might be unfortunately. Sorry!

                        2. 2

                          Nothings has an interesting lexing strategy here.

                          I remember something about SIMD parsing? Someone wrote a twc clone using SIMD to accelerate it, but I can’t find it. Anyone else remember?

                          1. 1

                            I’m a bit surprised by this. I’ve not had huge exposition to parsers but I thought (since I had learned about them in university 15+ years ago) LR parsers were commonly accepted as being superior. Granted, that doesn’t say anything about ease of use and PEG sounded really enticing when I looked at it.

                            In the practical sense though, every time I needed something I used Ragel or Jison in order to not write the whole parser myself :)

                            1. 1

                              Is there such a thing as “LR parser combinators”?

                              Basically, could you keep the idea of parser objects being first-class, but use a shift-reduce algorithm instead of recursive descent? Would you be able to warn about ambiguity statically?

                              One nice thing about parser combinators is you can factor out common patterns. For example, “zero or more items, separated and optionally ended by commas”.

                              1. 1

                                One question. You could use LR grammars without using LR parser right? i.e if your target was to avoid ambiguity, you could still have used LR grammars with Earley.

                                PS: This link is 404ing: https://web.archive.org/web/20150919164029/https://www.cs.berkeley.edu/Research/Projects/harmonia/papers/twagner-thesis.pdf

                                1. 1

                                  One question. You could use LR grammars without using LR parser right?

                                  Yes… but unless you check with a non-Earley parser you won’t know if changes introduce ambiguity (plus generalised parsers are likely to be slower than non-generalised parsers, as they have to deal with more cases). I’m not sure that buys you very much. Thanks for the link report – oddly it does work if you wait for achive.org to reload itself for several seconds, but I’ve changed it to a link that I hope works more reliably.

                                  1. 1

                                    I can use an external parser to check it; for example, I can use your parser in rust to check my grammar, but I can stick with my Earley parser in Python for my python application because it is much more simpler than an LR parser. About speed, do we have any benchmarks comparing parsers with LR grammars? even Earley has linear time parse for non ambiguous grammars.

                                    1. 1

                                      even Earley has linear time parse for non ambiguous grammars.

                                      That’s only true if you apply the relevant optimisations, such as Leo’s. And if speed matters at all to you, you need to apply Practical Earley Parsing from Aycock and Horspool instead of building each Earley item manually.

                                      Fast Earley parsing is a ton of work.

                                      1. 1

                                        Indeed, no arguments there. I was wondering how slow Earley (and other fast general context free parsers) is in practice (with or without Leo’s optimization) when they are pitted against LR parsers for the same grammar.

                                        Personally, I find table driven parsers more difficult to debug, but I agree with Laurence that ambiguity in grammars is a pain.