1. 24
  1.  

  2. 11

    Couple of thoughts:

    • Pedagogically, I think it’s better to talk about recursive descent+Pratt parsing, or just Pratt parsers. Pure recursive descent is indeed quite inadequate, and many CS students (raises hand) indeed don’t realize how to parse arithmetic expressions manually.
    • Checking grammar for absence of ambiguities is an important task! I am not quite sold on CFGs being the right tool here. With regular languages, it’s pretty clear that that’s the formalism to use, because the underlying computational device (fsm) is a very well-contained thing. With CFGs it’s much more murky. You don’t want to use the whole class, as it includes ambiguous grammars. “An unambiguous CFG grammar” is not something we can easily recognize. LL(k), LR(k), LL(*) always looked like arbitrary restrictions to me.
    • Real grammars never looked particularly declarative to me. Like, the hard part is parsing expressions, and that requires either contorting the grammar to be unreadable mess, or precedence annotations. Another case which is handled badly by grammars is flags. In Rust, there are grammatical categories of “expressions which are not a struct literal” and “block-like expressions”, expressing that in grammar requires forking the whole expression grammar twice.
    • There’s also a tension that grammars recognize strings, but what you want out of the parser is AST. The more you tweak the grammar to make it amendable to automated parsing, the worse the resulting AST structure becomes.
    • To make it easier to understand what a hand written parser parses (or, really, any small function which parses text), I strongly recommend including the example of the parsed thing in the comment, or, better, actually writing tests next to each function and each branch in the function.

    Ultimately, I think we have an array of tools which are good for some jobs, but we don’t have a universal thing.

    • If you want a human-readable grammar which nails down the AST, you want ungrammar.
    • if you want a parser, write Pratt parser by hand, fuzz it agains grammar to suss out positive and negative bugs. (If you want parsers for many languages, throw tree sitter at it).
    • if you want to make sure you language has an unambiguous grammar, then, ideally, you’d write you grammar in a well-defined formalism which is at least ungrammar’s style recursive regular expressions, but also includes some way to deal with precedence and flags. Then there’s be a separate tool which would check this grammar for ambiguities, and spit out a suite of tests for the language.
    • we don’t have such a magical grammar checker yet, so you have to get by with yacc style parser generators.
    1. 4

      Pedagogically, I think it’s better to talk about recursive descent+Pratt parsing, or just Pratt parsers. Pure recursive descent is indeed quite inadequate, and many CS students (raises hand) indeed don’t realize how to parse arithmetic expressions manually.

      I just do not get this. Pratt parsers, shunting, etc. are way more complex thank simple, hand-done recursive descent parsers. I’m lazy, hence I use recursive descent.

      The discussions here about + - * / seem overblown to me. Take a language def something like what was found in one of the other posts:

      <E> ::= <E> <op> <E>
           | ( <E> )
           | <D>
      <op> ::= + | - | * | /
      <D> = [0-9]+
      

      Convert it to an LALR(1) form:

      Expression
           AdditiveExpression
      
      AdditiveExpression
          MultiplicativeExpression
          AdditiveExpression "+" MultiplicativeExpression
          AdditiveExpression "-" MultiplicativeExpression
      
      MultiplicativeExpression
          PrimaryExpression
          MultiplicativeExpression "*" PrimaryExpression
          MultiplicativeExpression "/" PrimaryExpression
      
      PrimaryExpression
          "(" Expression ")"
          Digit*
      

      And you’re already pretty much done. Now you just need some code, something like:

      Expression parseExpression()
          {
          return parseAdditive();
          }
          
      Expression parseAdditive()
          {
          Expression expr = parseMultiplicative();
          while (true)
              {
              switch (peek())
                  {
                  case Add:
                  case Sub:
                      expr = new RelationalExpression(expr, current(), parseMultiplicative());
                      break;
      
                  default:
                      return expr;
                  }
              }
          return expr;
          }
      
      Expression parseMultiplicative()
          {
          Expression expr = parsePrimary();
          while (true)
              {
              switch (peek())
                  {
                  case Mul:
                  case Div:
                      expr = new RelationalExpression(expr, current(), parsePrimary());
                      break;
      
                  default:
                      return expr;
                  }
              }
          }
      
      Expression parsePrimary()
          {
          switch (peek())
              {
              case LParen:
                  expect(LParen);
                  Expr expr = parseExpression();
                  expect(RParen);
                  return expr;
                  
              case Digit:
              default:
                  // this will correctly report an error if there are no digits
                  return eatDigitsAsLiteral();
              }
          }
      

      Sorry for so much code, but after you do this a few times, the scary complexity disappears, and it’s just cut & paste. But I admit, the first time was really hard to understand how to transform weird EBNF (etc.) into something simpler to comprehend. It’s the one time that the dragon book came in handy.

      1. 2

        What I usually do is

        Expression parseAdd(Expression left, int myLevel) {
            while (true) {
                if (accept(Add))
                    left = new AddExpr(left, parseArithmetic(myLevel + 1));
                else if (accept(Sub))
                    left = new SubExpr(left, parseArithmetic(myLevel + 1));
                else return left;
            }
        }
        ...
        Expression parseArithmetic(int level = 0) {
            Expression left = parseExpression();
            if (level <= 1) left = parseMul(left, 1);
            if (level <= 0) left = parseAdd(left, 0);
            return left;
        }
        

        That way I can easily reorder my precedence and insert new operators. And it’s easy to intuitively understand what the code does: each parse term picks up “its level” of operators, while using parseArithmetic to eat terms with higher precedence.

        Not sure if that’s an established thing.

        1. 4

          Not sure if that’s an established thing.

          It is! It actually is a Pratt parser, with its characteristic shape:

          parseExpr() {
            while (true) {
              ... parseExpr()
            }
          }
          

          If you expand ..., you’d have parseMul which wold looks exactly like parseAdd, and, if you DRY them out, you get exactly the classical Pratt parser. But that’s just a detail, the important thing is the while loop.

        2. 2

          I would call this a Pratt parser though! The core thing is having a combination of a while loop and recursion. To get canonical Pratt, you need to dry out multiplicative and additive branches:

          Expression parseExpression(int level = 0)
              {
              Expression expr = parsePrimary();
              while (true)
                  {
                  switch (peek())
                      {
                      case Add:
                      case Sub:
                          if (level > 0) break
                          expr = new RelationalExpression(expr, current(), parseExpression(1));
                          break;
                      
                      case Mul:
                      case Div:
                          if (level > 1) break
                          expr = new RelationalExpression(expr, current(), parseExpression(2));
                          break;
          
                      default:
                          return expr;
                      }
                  }
              }
          

          And, having done that, you can skip the first step of eliminating the left recursion from the grammar.

          1. 4

            FWIW it’s possible to parse expressions in a recursive descent style without pratt parsing.

            For example, bash does that simply by encoding the precedence levels in the grammar, and then preserving that structure in recursive procedures.

            https://opensource.apple.com/source/bash/bash-94.1.2/bash-3.2/expr.c.auto.html

            I noted that here and probably in a few other places:

            https://www.oilshell.org/blog/2017/03/30.html

            It’s very wordy for C-style arithmetic, but definitely works.

            That seems to be what @cpurdy is doing.

            The while loop handles associativity, not precedence.

            I’d say the difference between the two algorithms is whether the precedence is data or code. When it’s data, you have less recursion.

            1. 2

              The while loop handles associativity, not precedence.

              Correct.

              Here’s an example of a full Java parser from ’97 or so: https://github.com/oracle/coherence/blob/30b2188ddee49718a9496d636b81ce8239b5d62a/prj/coherence-core/src/main/java/com/tangosol/dev/compiler/java/Compiler.java#L1355

              1. 1

                Not sure, for me passing precedence as a (compile time) parameter feels equivalent to manually “monomorphising” each precedence level to a differently named function.

                1. 1

                  Right. You’re recursing on the same method, but with a parameter indicating that it has an ever-so-slightly different purpose.

            2. 1

              Now add another 4 precedence levels, prefix, and postfix operators (including function calls) to your LALR(1) form. That’s the bit that always made me sad.

              1. 3

                Each precedence level is a “descent”, i.e. just another function. I don’t personally find it to be a bad thing 🤔

                OTOH, I can imagine that other people have other tastes in code layout, and for them, having more small functions might seem ugly. And obviously, when you’re stepping through in a debugger, the 10 levels of precedence you have to step through (i.e. 10 function calls) gets old in a hurry 🤣

            3. 2

              Pure recursive descent is indeed quite inadequate, and many CS students (raises hand) indeed don’t realize how to parse arithmetic expressions manually.

              Is this a hard problem? if you have a grammar say

              <E> ::= <E> <op> <E>
                   | ( <E> )
                   | <D>
              <op> ::= + | - | * | /
              <D> = [0-9]+
              

              which produces various parses for 1+2*3 =*> [1+2]*2; 1+[2*3] etc, isn’t it sufficient to extract the first parse, and apply shunting yard in the tree that correctly manages precedence? e.g. in any tree that has a child [], flatten the node post-parse, and apply the shunting yard on the children of the node.

              1. 3

                This sounds like a shunting yard with extra steps? But you generally don’t want shunting yard, at least not for “student writes their first compiler”, as it’s imperative formulation with extra stack is not super intuitive. You want a Pratt parser here (which is recursive version of shunting yard).

                This also matches my knowledge at the end of the university: to parse expressions, you either eliminate left recursion, do shunting yard, or use yacc’s precedence annotations. I took 5 or 6 grammar/compile courses, and none of them taught Pratt, which actually is the first thing one should reach out for when parsing expressions.

                1. 2

                  You want a Pratt parser here (which is recursive version of shunting yard).

                  What I wanted to illustrate was that you can decompose the problem of parsing languages containing arithmetic expressions into two steps — the first, where you parse the syntactic structure, and second, where you move from a parse tree to an AST.

                  There are numerous algorithms that let you do parsing. IMO, it is better for the students to learn how to implement precedence in infix separately without tying it to a specific parsing.

                  1. 1

                    What I wanted to illustrate was that you can decompose the problem of parsing languages containing arithmetic expressions into two steps — the first, where you parse the syntactic structure,

                    I haven’t seen this approach in practice. Is there some example which shows how it generalizes to complex expression grammars (with unary, binary and ternary operators, and - which is both prefix and binary)?

                    1. 2

                      Not so far; I am a new teacher, developing my course materials now. I will certainly try both, and I should be able to answer which works best for students in a couple of years.

                      1. 1

                        By the way, do you have specific examples of this kind? which are particularly difficult to parse otherwise?

                        1. 3

                          I don’t know of the top of my head what would be the most difficult for the suggested two phase parsing, but I think it’s useful to parse basically everything:

                          • + - * / as the basic case
                          • ., function composition, as an example of right associative operator
                          • prefix + - to cover both prefix and ambiguous ops
                          • ! factorial as a basic postfix op
                          • [] postfix array access with an argument.
                          • ?: ternary
                          1. 2

                            Thank you! much appreciated.

                2. 2

                  “An unambiguous CFG grammar” is not something we can easily recognize.

                  Offhand I believe I’ve read that telling whether an arbitrary CFG is ambiguous or not is equivalent to the halting problem.

                  1. 1

                    Any chance of a reference? I am curious how this proof was formulated.

                    1. 3

                      I think it’s a corollary of the Post Correspondence Problem. Roughly iff a pair of sets α, β have a solution to PCP, then the grammar

                      S → A | B
                      A → (x ∈ α)A | ε
                      B → (y ∈ β)B | ε
                      

                      is ambiguous.

                      1. 1

                        Thank you!, I will look it up.

                        1. 1

                          Well-stated! I think that you’re right. I was going to mention some folklore, but I see that Wikipedia now has a list of undecidable facts about context-free grammars, complete with bluelinks, and I’ll just link to that section.

                    2. 2

                      There’s a point in the design space I don’t think has been well explored: declarative Pratt parsing with some extensions like unary/binary and multifix. Simple implementation, linear time parsing, really good error messages. Probably not quite powerful enough to parse big existing languages but I haven’t tested enough to find out. The parse tree it produces is a bit further away from ASTs than what an CFG parser would produce. I implemented a parser generator in this space:

                      https://github.com/justinpombrio/panfix

                      Example grammar: https://github.com/justinpombrio/panfix/blob/master/tests/minilang.rs

                      1. 1

                        It is a very lax parser, and will happily generate a parse tree in the presense of a variety of errors

                        Niiiiiice! https://github.com/JetBrains/Grammar-Kit is the most production ready tool in this space that I know

                    3. 8

                      I agree with 90%-95% of the article. I would phrase it more pithily as:

                      1. Use recursive descent for recognizing existing languages, where you have a good test corpus. (And very good point about testing INVALID inputs too!)

                      2. Use grammars to design new languages

                      I follow this with https://www.oilshell.org/ – the compatible OSH is a huge recursive descent parser, and Oil is designed with a grammar.

                      Guy Steele and Russ Cox also mentioned this methodology – design the language with a grammar, and then implement it a second time by hand. Then you REALLY know it! :)

                      It is definitely possible to not know the language you’re implementing / creating!


                      As an undergrad, I never tried to design a language, or even thought about it honestly. But I agree that undergrads should know of both approaches, without necessarily getting into the details of CFGs.

                      It would be a mistake to just teach recursive descent.

                      1. 9

                        Eeeeeeeh. The point is pretty decent but to me the argument is mediocre.

                        There are various ways of implementing parsers, but for me there are generally only two which we really need to consider: LR parsing and recursive descent parsing.

                        I think what they are really trying to say is that there is “automatically generated parsers” and “hand written parsers”, and they want to argue in favor of automatically generated parsers. There’s nothing, as far as I know, stopping anyone from writing a parser generator that outputs a recursive descent parser.

                        Although few people think of it as such, LR is effectively a type checker for grammar ambiguity: if you don’t get any errors, your grammar is guaranteed to be unambiguous.

                        …however, there are unambiguous grammars that are not LR. Try feeding one of those into yacc and see how much it likes it.

                        …learners cannot, in any sensible time, understand from the recursive descent parser the language’s syntax.

                        Few people have a perfect understanding of any syntax more complicated than Scheme. It’s not as though understanding the nuances of an LR syntax is easy; if it were, writing input for a LR parser generator would be less of a gigantic pain in the ass. You talk about the bugs you’ve discovered in recursive descent parsers; reasonable, now may I talk about the bugs I’ve created trying to use LR parser generators?

                        Although there aren’t, to the best of my knowledge, modern performance numbers, it is reasonable to assume that recursive descent parsing is generally faster than LR parsing.

                        Bae, if you don’t have numbers to support it, do the honorable thing and say “I don’t know” instead of making “reasonable assumptions”. I would reasonably assume the opposite, that table-based parsers are generally faster than recursive descent parsers, since a small table and a small amount of driver code should all be able to fit in the CPU’s cache.

                        1. 3

                          @ltratt: “There are various ways of implementing parsers, but for me there are generally only two which we really need to consider: LR parsing and recursive descent parsing. “

                          Parent: I think what they are really trying to say is that there is “automatically generated parsers” and “hand written parsers”

                          I think what @ltratt is trying to say here is that for him there are only two sane choices. He is not trying to differentiate between handwritten and LR parsers.

                          He also clarifies later on:

                          @ltratt: In practice, a recursive descent parser can be thought of as meaning “a hand-written written parser”: unless you go out of your way to do otherwise, nearly all hand-written parsers are recursive descent parsers.

                          Of course, ANTLR, GLL, etc exist, but I think that is external from the point he is making, which is having a sort of (sound, but incomplete) ambiguity checker.

                          and they want to argue in favor of automatically generated parsers.

                          No, @ltratt wants to argue in favor of LR parsing, specifically because @ltratt believes that they are able to identify and reject some of the bad grammars (here bad grammars are those that contain ambiguity). — The bad grammar here is very subjective;

                          Fundamentally, recursive descent parsers have no equivalent “type checker”, and we know that we can’t build such an equivalent for the general case.

                          Personally, I prefer LL(1) grammars, and that also provide us with a type checker like what he claims for LR.

                          @ltratt: There is an additional, easily over-looked, factor that comes from relying on recursive descent parsers: learners cannot, in any sensible time, understand from the recursive descent parser the language’s syntax.

                          Here though, I think he is talking about handwritten traditional style that implements PEG.

                          1. 2

                            I would reasonably assume the opposite, that table-based parsers are generally faster than recursive descent parsers, since a small table and a small amount of driver code should all be able to fit in the CPU’s cache.

                            I think LALRPOP optimizes LR by emitting a recursive ascent parser (essentially compile the table to a bunch of functions). It feels (:)) like that should be faster: you essentially say that interpreter is faster than compiler :)

                            1. 1

                              Looks table driven to me. Digging a bit deeper, it appears it can do both.

                              I’m saying that function calls have overhead and that walking through a loop and switch statement can be faster than calling lots of functions. ie, a threaded interpreter vs tree-walking. I won’t speculate more without data though.

                              1. 2

                                Performance is discussed a bit (/s) in this post: https://smallcultfollowing.com/babysteps/blog/2015/09/14/lalrpop/

                            2. 2

                              I agree, the OP is conflating “recursive descent” with “hand-written.”

                              There’s nothing, as far as I know, stopping anyone from writing a parser generator that outputs a recursive descent parser.

                              Like ANTLR, one of the most popular parser generators, which produces RD parsers.

                              IIRC, PEG parser generators also output recursive descent parsers. I may be misremembering this, but I’ve looked at the gnarly C code emitted by one and it consisted of a whole lot of little functions, not the giant tables used by LALR parsers.

                              1. 4

                                I think you’re looking for “top-down”

                                • LL parsers and recursive descent parsers are top-down; LR parsers are bottom-up.
                                • ANTLR produces top-down parsers, not recursive descent parsers. (it has had a variety of top-down algorithms over the years, e.g. LL(*) )
                                • PEGs are grammars that can be recognized with 2 different top-down parsing techniques: memoized packrat parsing, or backtracking.

                                I would also say all recursive descent parsers are hand-written; I’ve never seen any other usage. Recursive descent is basically a technique taking a grammar and creating a bunch of recursive procedures in a language like C or Pascal.

                                There is also the case where you just write a bunch of ad hoc recursive procedures without knowing precisely what the language you’re recognizing is (i.e. viewed from the generative, grammar-based approach)

                                1. 2
                                  1. 1

                                    PEG grammars can be converted directly into a recursive descent parser, but with the worst-case exponential time complexity, so they are usually implemented as memoized packrat parsers, which have a linear time complexity, but use considerably more memory.

                                    1. 1

                                      Well, it looks like Terence Parr, the creator of ANTLR, disagrees with you, in a paper he co-wrote in 2011 (p.2.):

                                      ANTLR generates top-down, recursive-descent, mostly non-speculating parsers…

                                      1. 2

                                        Ah OK sorry, I thought only the earliest versions of ANTLR generated RD parsers, and then the more advanced top-down algorithms they developed moved away from that approach. But I guess I have not looked at their generated code in a very very long time.

                                        If they generate a bunch of procedures where each one goes with a rule in the grammar, then I’d agree that’s recursive descent (wish I could correct my original comment).

                                        But there is a fuzzy line, because it probably has to call procedures for lookahead, some of which may involve tables … i.e. I’m pretty sure the LL(*) has to do something pretty complex for lookahead, but I don’t have ANTLR handy now to look at its output.


                                        I think PEGs are basically a formalization of backtracking recursive descent parsers, but they have the optional faster algorithm of packrat parsing (which is not recursive descent, but is top-down). They precisely define what lookahead constructs are allowed.

                                        FWIW the parser I use for Oil is based on Python’s top-down parser generator, which is an LL, table driven parser (also not RD).


                                        edit: I looked it up and the newer ALL(*) used by ANTLR 4 also says it generates recursive descent with a custom lookahead function:

                                        https://www.antlr.org/papers/allstar-techreport.pdf

                                        (I last used ANTLR around the time ANTLR 4 was coming out)

                                      2. 1

                                        What would you consider recursive descent? Would something like peg_parse() here qualify as recursive descent? or does the nonterminals have to be implemented as procedures?

                                        1. 1

                                          I just looked it up, and my conception is EXACTLY what Wikipedia says as its first sentence.

                                          https://en.wikipedia.org/wiki/Recursive_descent_parser

                                          In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure implements one of the nonterminals of the grammar.

                                          The key point is that you take a grammar and write a program based on it. (I would disagree with “non-recursive equivalent” but meh)

                                          It’s one form of top-down parsing, which is hand-written

                                          I would call peg_parse() a top-down parser but not a recursive descent parser (assuming it’s doing what I think it is)

                                          1. 1

                                            Ah I see, so recursive descent is sort of grammar compiled into a custom parser. Would you call GLL such a parser? it substitutes label based jumping for subroutine call and return, but I think that is more of an optimization than anything else.