  2. 3

    If someone reads this and thinks: OK, but I want to achieve something today without “all of that”, I recommend looking at PEG parsers, for example pyparsing - I think they have a better developer experience: https://github.com/pyparsing/pyparsing/blob/master/examples/cpp_enum_parser.py
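
    To give a flavour, here’s a much-simplified sketch in the spirit of that example (not the linked code itself, just basic pyparsing combinators):

    from pyparsing import (Group, Keyword, Optional, Suppress, Word,
                           alphanums, alphas, delimitedList, nums)

    # Roughly the shape of the linked cpp_enum_parser example, much simplified.
    identifier = Word(alphas + "_", alphanums + "_")
    integer = Word(nums)
    entry = Group(identifier("name") + Optional(Suppress("=") + integer("value")))
    enum = (Keyword("enum") + identifier("enum_name")
            + Suppress("{") + delimitedList(entry)("entries") + Suppress("}"))

    print(enum.parseString("enum Color { RED = 1, GREEN, BLUE }"))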

    1. 3

      I would second PEGs, though PEGs in general have problems with left recursion. I find it more natural to write my grammars to be tail-recursive, so I haven’t had problems with direct left recursion. Some PEG libraries handle it, but this can lead to slightly confusing behaviour.

      The other thing that can make PEGs problematic stems from the same property that makes them easy to use: they have unlimited read-ahead (equivalently, unlimited backtracking). This makes useful error reporting difficult. Most PEG implementations record the longest successful parse, because without that a failure backtracks all of the way to the beginning and the reported error position is useless.

      Oh, and the backtracking means that they hit scalability issues, either in terms of parsing time or memory overhead.

      That said, I like PEGs so much for prototyping that I wrote a PEG library for teaching students how to write compilers. Once I’ve got a grammar that I’m happy with, I’ll generally rewrite it as a hand-written recursive-descent parser so that I can locally control the amount of read-ahead and can get better error messages.
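
      To make the error-reporting point concrete, here’s a toy Python sketch (not my library, just the shape of the idea) of PEG combinators that record the furthest failure position:

      # Toy PEG combinators: each parser is a function
      # (text, pos, state) -> (new_pos, value), or None on failure.
      class Parser:
          def __init__(self, fn):
              self.fn = fn

          def parse(self, text):
              state = {"furthest": 0}
              result = self.fn(text, 0, state)
              if result is not None and result[0] == len(text):
                  return result[1]
              if result is not None:  # parsed a prefix, but not the whole input
                  state["furthest"] = max(state["furthest"], result[0])
              raise SyntaxError(f"parse error at offset {state['furthest']}")

      def lit(s):
          def fn(text, pos, state):
              if text.startswith(s, pos):
                  return pos + len(s), s
              state["furthest"] = max(state["furthest"], pos)  # remember the failure
              return None
          return Parser(fn)

      def seq(*parsers):
          def fn(text, pos, state):
              values = []
              for p in parsers:
                  r = p.fn(text, pos, state)
                  if r is None:
                      return None  # the caller backtracks from here
                  pos, v = r
                  values.append(v)
              return pos, values
          return Parser(fn)

      def choice(*parsers):  # ordered choice: the first success wins
          def fn(text, pos, state):
              for p in parsers:
                  r = p.fn(text, pos, state)
                  if r is not None:
                      return r
              return None
          return Parser(fn)

      ab = choice(seq(lit("a"), lit("b")), lit("a"))
      print(ab.parse("ab"))  # ['a', 'b']

      Without the “furthest” tracking, a failed parse could only be reported at offset 0, because every choice backtracks all of the way to where it started.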

      1. 1

        Why recommend PEGs, though? If you have a grammar at hand, isn’t it more natural to use one of the general context-free grammar parsers such as Earley, GLR, GLL and others? Even if you do not want to use an off-the-shelf parser, or you want to hook in your own semantic features, writing an Earley parser is not difficult. These parsers can get you $O(N)$ performance on all deterministic CFGs.
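
        For example, with Lark, a Python library whose default algorithm is Earley, left recursion needs no rewriting at all. A minimal sketch (grammar invented for illustration):

        from lark import Lark

        # A directly left-recursive rule, which a pure PEG cannot express.
        grammar = r"""
            ?expr: expr "+" term   -> add
                 | term
            ?term: INT
            %import common.INT
            %import common.WS
            %ignore WS
        """

        parser = Lark(grammar, start="expr", parser="earley")
        print(parser.parse("1 + 2 + 3").pretty())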

        1. 1

          If you have a defined grammar in a form that something can consume, that’s great. If you are trying to define a grammar then typically you want multiple rounds of iteration, and you want tooling that makes it easy to write many different versions of the grammar until you’re happy. In my experience, PEGs are great for that. Once you have something that works, you can rewrite it for whatever parsing technology you want in production.

          1. 1

            Thanks!

            Do you mean the DSL for writing PEG parsers (the parser-combinator style)? Or is there something specific to the way PEGs parse input that makes it easy to write such parsers? What I mean is: if you had a general CFG parser in the same combinator style, one that lets you construct parsers part by part, would you still prefer PEGs?

            1. 1

              There are a few things I like about PEGs:

              • The parser-combinator style.
              • There’s no distinction between parsing and lexing: PEGs just consume characters.
              • Parsing is greedy and left-to-right, which is how I think about grammars. A PEG always has a single unambiguous parse, because earlier alternatives take precedence over later ones (see the sketch at the end of this comment); for LR(1)-style generators I have to transform my grammars into an unambiguous left-recursive form instead.
              • PEGs have unlimited read-ahead / backtracking. I don’t want that in a production implementation, but for prototyping it’s fantastic.

              Most PEG implementations do handle left recursion in some way; it’s only the basic model that doesn’t support it.

              I’ve not found any parser generator that provides the right interfaces, performance, and error-reporting ability for production use, so I only want tooling like this for prototyping. PEGs favour ease of use over everything else, which is exactly what I want for prototyping.
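
              Here is the ordered-choice sketch I mentioned, using pyparsing, whose “|” operator (MatchFirst) has exactly the PEG semantics: the earlier alternative wins, so a keyword rule has to be listed before the general identifier rule.

              from pyparsing import Keyword, Word, alphas

              ident = Word(alphas)
              good = Keyword("if")("kw") | ident("name")  # keyword tried first
              bad = ident("name") | Keyword("if")("kw")   # identifier swallows "if"

              print(good.parseString("if").asDict())  # {'kw': 'if'}
              print(bad.parseString("if").asDict())   # {'name': 'if'}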

        2. 1

          Can you expand a bit on how you make your grammars tail recursive and hence avoid left recursion? I find writing PEG grammars pretty easy until I have to deal with removing left recursion, at which point I have to really think about it.

          1. 2

            I always found the requirement to write head-recursive grammars for tools like YACC annoying, because when I write a grammar the tail-recursive form feels more natural to me. I struggle with indirect left recursion though, because it’s common to write things like:

            expr := addexpr | subexpr;
            addexpr := expr '+' expr;
            subexpr := expr '-' expr;
            

            For a pure PEG, you need to write something like:

            number := ('0'-'9')+;
            expr := number (addexpr / subexpr)?;
            addexpr := '+' expr;
            subexpr := '-' expr;
            

            In this form, every rule starts with a terminal, so there is no left recursion.
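
            The transformed grammar drops straight into a hand-written recursive-descent parser. A minimal Python sketch (invented helper names, no error recovery):

            def parse_number(s, i):
                j = i
                while j < len(s) and s[j].isdigit():
                    j += 1
                if j == i:
                    raise SyntaxError(f"expected a digit at offset {i}")
                return j, int(s[i:j])

            def parse_expr(s, i=0):
                i, left = parse_number(s, i)
                if i < len(s) and s[i] in "+-":
                    op = s[i]
                    i, right = parse_expr(s, i + 1)  # the tail-recursive step
                    return i, (op, left, right)
                return i, left

            # Note the right-associative tree the tail-recursive form produces:
            print(parse_expr("1+2-3")[1])  # ('+', 1, ('-', 2, 3))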

        3. 1

          Agreed; I would not want to go back to 1970s parser technology again! I like PEGs, though as David says, they have trouble with error messages. A recursive-descent tool like ANTLR is good too, especially if you want the generated code to be human-comprehensible.

        4. 3

          It was very helpful for me, when learning parsing and compiler development, to start with a tool similar to Flex and Bison (PLY in particular), but ever since I have written hand-rolled scanners/tokenizers and recursive-descent parsers for my projects. Neither is that difficult once you get the hang of them, and it’s code you fully control, versus having to learn a library or tool with its own operational semantics and then maintain it as a dependency.
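
          For instance, a hand-written tokenizer is little more than one left-to-right pass over the source. A minimal sketch (token set invented for illustration):

          def tokenize(src):
              tokens, i = [], 0
              while i < len(src):
                  c = src[i]
                  if c.isspace():
                      i += 1
                  elif c.isdigit():
                      j = i
                      while j < len(src) and src[j].isdigit():
                          j += 1
                      tokens.append(("NUMBER", src[i:j]))
                      i = j
                  elif c.isalpha() or c == "_":
                      j = i
                      while j < len(src) and (src[j].isalnum() or src[j] == "_"):
                          j += 1
                      tokens.append(("IDENT", src[i:j]))
                      i = j
                  elif c in "+-*/=();":
                      tokens.append(("PUNCT", c))
                      i += 1
                  else:
                      raise SyntaxError(f"unexpected character {c!r} at offset {i}")
              return tokens

          print(tokenize("x = 10 + y2;"))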

          1. 1

            Honestly, you’re better off with something like ANTLR, or just writing a recursive-descent parser by hand.

            The [f]lex/bison/yacc infrastructure is just not practical for any real language: the error handling is miserable, they default to using global state, they produce slow lexers and parsers, etc.

            In all honesty, writing recursive-descent parsers is not hard, and making them maintainable is similarly trivial. RD parsers are generally faster than any PDA-based algorithm.

            Writing a fast lexer can admittedly be miserable, but writing a lexer that is sufficiently fast to give you a significant net speed win isn’t difficult.

            I know people seem to love PEG parsers these days, but my experience has found them wanting. Again, IME they make handling a lot of standard PL grammar features challenging, and their handling of parsing decisions that require more than a single token of lookahead seems to result in more complex grammars and fairly brittle parsing.