1. 31
  1.  

  2. 4

    Hand-written recursive descent parsers are also faster and can handle errors better.

    But seriously, maintaining a hand-written parser is much more work than a generated parser.

    I’m not sure which is best, but I think fundamentally it’s a use-case-specific question.

    I’ve written very high performance parsers for real languages, and while they achieve their goals you’d be delusional if you thought the code was easy to follow :)

    1. 2

      I think having a good, machine-readable grammar is important, but I agree that writing things by hand can often be beneficial, especially because of error handling.

      My basic rule of thumb is “do not fail at parsing if possible”; I try to get things through name resolution too, and ideally only fail during type checking. This helps minimize the erroneous “area”, reduces the number of follow-up errors generated and helps emit better error messages.
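
      To make that concrete, here is a minimal sketch of the idea (a toy token list and made-up names, not from any real compiler): instead of aborting on the first bad token, record a diagnostic, emit an error node and resynchronize, so later phases still see the rest of the tree.

      ```python
      # Minimal sketch of "do not fail at parsing": on an unexpected token,
      # record a diagnostic, emit an ErrorNode, and resynchronize at the next
      # ';' so the rest of the input still gets parsed and later phases
      # (name resolution, type checking) can report something more precise.
      # All names here are illustrative.

      class ErrorNode:
          def __init__(self, message, skipped):
              self.message = message
              self.skipped = skipped

      def parse_statements(tokens):
          stmts, diagnostics, pos = [], [], 0
          while pos < len(tokens):
              if tokens[pos] == "print":            # the only "real" statement here
                  stmts.append(("print", tokens[pos + 1]))
                  pos += 3                          # print <name> ;
              else:
                  start = pos
                  while pos < len(tokens) and tokens[pos] != ";":
                      pos += 1                      # skip to a statement boundary
                  diagnostics.append(f"unexpected token {tokens[start]!r}")
                  stmts.append(ErrorNode("unexpected token", tokens[start:pos]))
                  pos += 1                          # consume the ';'
          return stmts, diagnostics

      # parse_statements(["print", "x", ";", "?!", "y", ";", "print", "z", ";"])
      # -> two print statements plus one ErrorNode, and a single diagnostic.
      ```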

      1. 1

        But seriously, maintaining a hand-written parser is much more work than a generated parser.

        In my experience, not really, if you have only one language to parse. A hand-written parser is a bit more verbose than a grammar (but only by a constant factor), but it is significantly easier to debug and modify, because it is just code. Generators have a high fixed cost, which doesn’t amortize across a single language.

        1. 2

          I’m not saying “write a generator for your grammar” :D

          But take JS (I wrote the JS parser used by JSC): the complexity of that parser dwarfs the old yacc (or bison? I can’t recall) one. It also uses less memory and is substantially faster (despite there being an academic version of yacc/bison that was 2x faster than its baseline, it never went anywhere).

          With more recent changes to the language (“recent”) like strict mode changing valid parse rules, let being a valid identifier, and identifiers as property names, I doubt a generated parser would have a nice grammar anymore - except maybe ANTLR? but it generates recursive descent parsers anyway, so shrug.

          That said, if you look at https://trac.webkit.org/browser/webkit/trunk/Source/JavaScriptCore/parser/Parser.cpp it’s not trivial either :D

      2. 2

        Thanks for this. It’s very timely for me, because I am just coming to terms with having to write a multi-step parser, and this is encouraging to read.

        1. 2

          I agree, a recursive descent parser plus an operator precedence parser should be easy enough to follow (treating the operator precedence parser as a black box). Recursive descent parsing is basically the approach you would end up with if you implemented parsing naively anyway.
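
          For concreteness, here is a minimal precedence-climbing sketch of the kind of operator precedence parser meant here (a toy token list, left-associative binary operators only; every name is made up for illustration):

          ```python
          # Minimal precedence-climbing sketch: the recursive descent parser calls
          # parse_expression() for an expression, and the loop below handles binary
          # operators by consulting a precedence table instead of one grammar rule
          # per precedence level. Toy setup: tokens are strings in a flat list.

          PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

          def parse_expression(tokens, pos=0, min_prec=1):
              lhs, pos = parse_atom(tokens, pos)
              while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
                  op = tokens[pos]
                  # Left-associative: the right operand binds at precedence + 1.
                  rhs, pos = parse_expression(tokens, pos + 1, PRECEDENCE[op] + 1)
                  lhs = (op, lhs, rhs)
              return lhs, pos

          def parse_atom(tokens, pos):
              # In a real parser this would recurse back into the descent parser
              # for parenthesized expressions, calls, and so on.
              return tokens[pos], pos + 1

          # parse_expression(["1", "+", "2", "*", "3"])
          # -> (('+', '1', ('*', '2', '3')), 5)
          ```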

          1. 1

            Alternatively, just a recursive descent parser with precedence built into the grammar.

            You have BaseExpr, which covers your “atomic” expressions (literals, variables, function calls), then MultiplyExpr, which is 1 or more BaseExpr separated by * or /, then AddExpr, which is 1 or more MultiplyExpr separated by + or -, then CommaExpr, which is 1 or more AddExpr separated by ,, etc.

            Both approaches are good, both approaches are perfectly reasonable to implement by hand.
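
            A minimal sketch of that layered style under the same toy assumptions (only the + - * / levels shown; every name is made up for illustration):

            ```python
            # Each precedence level is its own function: add_expr is one or more
            # mul_expr separated by + or -, mul_expr is one or more base_expr
            # separated by * or /. Adding a precedence level means adding a function.

            def add_expr(tokens, pos=0):
                lhs, pos = mul_expr(tokens, pos)
                while pos < len(tokens) and tokens[pos] in ("+", "-"):
                    op = tokens[pos]
                    rhs, pos = mul_expr(tokens, pos + 1)
                    lhs = (op, lhs, rhs)
                return lhs, pos

            def mul_expr(tokens, pos):
                lhs, pos = base_expr(tokens, pos)
                while pos < len(tokens) and tokens[pos] in ("*", "/"):
                    op = tokens[pos]
                    rhs, pos = base_expr(tokens, pos + 1)
                    lhs = (op, lhs, rhs)
                return lhs, pos

            def base_expr(tokens, pos):
                # "Atomic" expressions: here just a literal or a name.
                return tokens[pos], pos + 1

            # add_expr(["1", "+", "2", "*", "3"])
            # -> (('+', '1', ('*', '2', '3')), 5)
            ```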

            1. 1

              Yup. When your precedences are set in stone that would be feasible as well. I think a well-implemented operator precedence parser is easier to change, though.

              1. 1

                A proper operator precedence parser can even support things like programmer-definable operators, like in Haskell :)

          2. 2

            I’ve fought too much with the parser generators I’ve used. Writing a simple recursive descent parser avoids this pain.

            I’ve been wondering if a code generator that outputs helpers for recursive descent parsers would be useful. Not a full mid-layer, but optional functions you can choose to call from your hand-written recursive descent parser.
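
            Something like the following is one shape such helpers could take - purely hypothetical, not from any existing tool: small generated routines (token matching, separated lists) that a hand-written recursive descent parser calls only where convenient.

            ```python
            # Hypothetical sketch of "generated helpers" for a hand-written
            # recursive descent parser: the generator would emit small reusable
            # routines like these, and the parser author decides where to call them.

            def expect(tokens, pos, kind):
                # Fail with a clear message if the next token isn't what we expect.
                if pos >= len(tokens) or tokens[pos] != kind:
                    raise SyntaxError(f"expected {kind!r} at position {pos}")
                return pos + 1

            def separated_list(tokens, pos, parse_item, separator=","):
                # One or more items separated by `separator`; returns (items, pos).
                items = []
                item, pos = parse_item(tokens, pos)
                items.append(item)
                while pos < len(tokens) and tokens[pos] == separator:
                    item, pos = parse_item(tokens, pos + 1)
                    items.append(item)
                return items, pos

            # A hand-written parser might then call, for example:
            #   args, pos = separated_list(tokens, pos, parse_expr)  # parse_expr is your own routine
            #   pos = expect(tokens, pos, ")")
            ```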

            1. 2

              I absolutely love the website layout. It might be my new favorite thing. The dynamic brackets are really something special.

              I’m glad that the operator precedence parser was mentioned; recursive descent parsers are generally not scalable to large numbers of operators with varying associativity. Sure you can do it but it’s not fun.

              Recursive descent parsers do still run into issues with deeply nested constructs in some cases, though: you hit the maximum call-stack depth. In languages like C, where running out of stack space is not generally recoverable, you end up having to pass context around that tracks how deep you are in the call stack and throw an error when you get too deep. Not usually a problem except in pathological cases, though.

              (In every recursive descent parser I’ve written, I always have an error that, when printed, says ?FORMULA TOO COMPLEX when we’ve gone too deep; this is the same error given by Commodore BASIC when too many levels of nesting have been encountered.)
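
              A minimal sketch of that depth-limiting idea (the limit and the error text are placeholders, not from any particular parser):

              ```python
              # Thread a depth counter through the recursive calls and bail out with
              # a parse error once it exceeds a fixed limit, instead of letting a
              # pathological input blow the call stack. Limit and message are
              # placeholders only.

              MAX_DEPTH = 500

              class TooComplexError(Exception):
                  pass

              def parse_expr(tokens, pos=0, depth=0):
                  if depth > MAX_DEPTH:
                      raise TooComplexError("?FORMULA TOO COMPLEX")
                  if pos < len(tokens) and tokens[pos] == "(":
                      inner, pos = parse_expr(tokens, pos + 1, depth + 1)
                      if pos >= len(tokens) or tokens[pos] != ")":
                          raise SyntaxError("expected ')'")
                      return inner, pos + 1
                  return tokens[pos], pos + 1   # atom

              # parse_expr(["("] * 800 + ["x"] + [")"] * 800) raises TooComplexError
              # instead of overflowing the call stack deep inside the parser.
              ```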

              1. 1

                The audience of this appears to be educators. I agree - for educational purposes, people should absolutely write parsers by hand. However, I will say that for performance, you really do want a parser generator. Parsing is on the hot path in compilers and in code that handles serialization formats, and it is a more significant expense than you might imagine. If you want to build an intuition for why that is: The lexer and the parser are the only phases of code analysis which necessarily traverse the entire input text, in order.

                1. 3

                  You can find a lot of old books saying that parsing is a bottleneck for compilers, but my own benchmarks of the Rust compiler don’t show this to be the case. It would be very interesting to see a similar analysis for a “fast-compiling” language like C, Go or Zig, though.

                  1. 2

                    Parsing is a significant bottleneck in rust-analyzer, but that’s because:

                    • ra is pretty lazy, in that it doesn’t do anything with most of the code after parsing
                    • ra is too eager during parsing: we should be fast-skipping function bodies, but we aren’t doing that at the moment
                    1. 1

                      That’s pretty interesting, actually. So, would you say that ra is too lazy, and needs to put the work in to be lazier? :D

                  2. 1

                    Parsing being the bottleneck of a compiler is a problem I would love to have.

                    1. 1

                      It’s certainly not the bottleneck, but it necessarily must complete before the other phases can run, so improvements to it do matter.

                      1. 1

                        Can you give an example where a parser would take a significant amount of time compared to e.g. type checking?

                        (Not to mention that parallelizing parsing is pretty easy compared to parallelizing other phases.)

                        1. 1

                          To clarify, I’ve been told this a few times over the years by friends who directly work on compiler stuff. I do not know it of my own knowledge, and I don’t have citations that I can dig up. I apologize if it sounded like I was saying something more authoritative than that.

                          It’s also true that languages with modern type systems spend a lot more effort on type checking than languages with simple type systems.

                          Of course, the only time you really care how efficient a compiler is, as long as it’s within the general range of no more than a couple seconds per file, is when you’re directly responsible for building many large software projects on a continuous basis. That’s not a problem that most people have, so to most people compiler performance really isn’t a big deal.

                  3. 1

                    Context-free grammar formalisms are less relevant than one might think. Most realistic compilers need degrees of context-sensitivity and thus use hand-written parsers: either because the core language grammar is context-sensitive (e.g., significant whitespace as in Python) or because the general user experience profits from context-sensitivity (e.g., error handling, feedback in IDEs).

                    How hard is it to use a skeletal CFG and work on the tree for context-sensitive features, compared to hand-rolling the complete parser? Has anyone done both for a comparison?

                    1. 1

                      I don’t think this is a well-formed question. „Context sensitivity“ isn’t a thing and is just unfortunate terminology. I am with Okhotin on this one, let’s call this class just „grammars“ or, if we want to be specific, „ordinary grammars“.

                      Or rather, context sensitivity has an intuitive meaning (things are different based on context), and CFGs are context-sensitive in this way. But this is still a hand-wavy property, and it doesn’t really make sense to say „this needs context sensitivity“, because you can’t replace CS with a fixed definition here.

                      In technical terms, context sensitivity means that you allow general rewrite rules, like ABC -> CdeFg, with the restriction that the rhs is not shorter than the lhs. That is, CFGs have a single symbol to the left of the arrow, while CSGs have a string there. This is irrelevant to any real-world grammars or IDEs.