1. 38

  2. 12

    I agree with one implicit message of this post which is that parsing is an almost endless series of trade-offs. For example, for a simple binary format, I’d probably use PEGs, but when parsing textual formats, bitter experience has conditioned me to prefer LR. There are various reasons for this, such as: PEGs are easy to write but, in general, it’s impossible to work out what language they accept (most medium/large PEG parsers I’ve looked at in detail do not work as their authors expected); LR parsers are easy to reason about but more awkward to write and generally have poor error handling. I’ve tried to address that latter issue in grmtools/lrpar https://softdevteam.github.io/grmtools/master/book/errorrecovery.html – it’s not the right solution for everyone, but it might be another interesting point in the endless space of parsing trade-offs!
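
    To make the “PEGs don’t accept the language you think” point concrete, here’s the classic ordered-choice pitfall, sketched with the rust-peg crate that comes up later in this thread (the grammar is made up for illustration). In a CFG, ("a" | "ab") followed by "c" matches both "ac" and "abc"; a PEG commits to the first alternative that succeeds and never revisits it:

    ```rust
    // Sketch: PEG ordered choice is not CFG alternation.
    peg::parser! {
        grammar pitfall() for str {
            // Once "a" matches, the choice is committed; "ab" is never
            // retried, even though the following "c" then fails on "abc".
            pub rule p() = ("a" / "ab") "c"
        }
    }

    fn main() {
        assert!(pitfall::p("ac").is_ok());
        assert!(pitfall::p("abc").is_err()); // surprising if you expect CFG semantics
    }
    ```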

    1. 4

      Menhir is an advanced parser generator for OCaml and, as far as I know, it features pretty good error-recovery mechanisms. It’s used in merlin, the semantic server for OCaml, which is extremely robust. LR or LALR generators can be better than yacc!

      1. 5

        Menhir has a very cool, but very different, approach to error recovery. In essence, users can manually annotate the stategraph with detailed error messages that help users understand what’s going on (and Menhir has a cunning mechanism to make it semi-automatic to migrate those messages as the grammar changes). As always in parsing, there are pros and cons to this! There are more details in the related-work section of https://arxiv.org/abs/1804.07133 if you’re interested.

    2. 10

      I’d love to see a crate that combines the usability of LALRPOP’s format with the expressivity of PEG parsers, for when I want to quickly hack together a prototype of something involving a parser, or for cases where optimization is not as important.

      Check out my library, rust-peg. It’s a PEG-based parser generator implemented as a Rust procedural macro, with Rust code and Rust types embedded in the grammar for you to build an AST. It’s designed to give you reasonable-looking parse errors by default, with facilities to tweak what gets reported, and the ability to extract positions/spans to include in your AST. Input can be &str or &[u8] without a lexer; or it can integrate with an external lexer, since it handles any &[T], or your own type if you implement a few traits on it.
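
      For a flavour of the grammar syntax, here’s a minimal example in the style of the crate’s README (assuming rust-peg ~0.8): a bracketed, comma-separated list of numbers parsed into a Vec<u32>.

      ```rust
      // Minimal rust-peg grammar: parse "[1,1,2,3,5,8]" into Vec<u32>.
      peg::parser! {
          grammar list_parser() for str {
              // $(...) captures the matched slice; {? ... } lets the action fail.
              rule number() -> u32
                  = n:$(['0'..='9']+) {? n.parse().or(Err("u32")) }

              // `**` is repetition with a separator.
              pub rule list() -> Vec<u32>
                  = "[" l:(number() ** ",") "]" { l }
          }
      }

      fn main() {
          assert_eq!(list_parser::list("[1,1,2,3,5,8]"), Ok(vec![1, 1, 2, 3, 5, 8]));
      }
      ```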

      1. 2

        Oh, cool! I’ll check it out next time I want to parse something, that does look pretty nice :)

      2. 7

        Nice overview – FWIW all these issues are language-independent, i.e. they would arise for parser combinator, LALR, and PEG libraries in any other language.

        After writing many parsers, and using many parsing models, my conclusion is that parsing is an “art” with many tradeoffs and non-obvious choices. No single solution is right for all problems.

        The biggest distinction seems to be if you’re trying to parse an existing language or designing a new language. (Hand-written is better for the former IMO, and generated has advantages for the latter.)

        Some rough wiki pages that may be relevant:

        https://github.com/oilshell/oil/wiki/Parsing-Models-Cheatsheet

        https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-Should-Be-Separate (In addition to the things listed on the page, lexing is straightforward / “solved”, while parsing isn’t.)
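
        As a tiny illustration of that separation, here’s a hand-rolled lexer feeding an equally hand-rolled parser for sums of integers (all names made up for the sketch). The lexer absorbs the character-level concerns, so the parser only deals in tokens:

        ```rust
        // Sketch: a separate lexer keeps the parser free of character-level noise.
        #[derive(Debug, PartialEq, Clone, Copy)]
        enum Token {
            Number(i64),
            Plus,
        }

        // The lexer handles whitespace and character classes ("solved" problems)...
        fn lex(input: &str) -> Result<Vec<Token>, String> {
            let mut tokens = Vec::new();
            let mut chars = input.chars().peekable();
            while let Some(&c) = chars.peek() {
                match c {
                    ' ' | '\t' | '\n' => { chars.next(); }
                    '+' => { chars.next(); tokens.push(Token::Plus); }
                    '0'..='9' => {
                        let mut n = 0i64;
                        while let Some(v) = chars.peek().and_then(|d| d.to_digit(10)) {
                            n = n * 10 + v as i64;
                            chars.next();
                        }
                        tokens.push(Token::Number(n));
                    }
                    _ => return Err(format!("unexpected character {c:?}")),
                }
            }
            Ok(tokens)
        }

        // ...so the parser only has to think about token-level structure.
        fn parse_sum(tokens: &[Token]) -> Result<i64, String> {
            let mut iter = tokens.iter();
            let mut total = match iter.next() {
                Some(Token::Number(n)) => *n,
                _ => return Err("expected a number".into()),
            };
            while let Some(tok) = iter.next() {
                match (tok, iter.next()) {
                    (Token::Plus, Some(Token::Number(n))) => total += n,
                    _ => return Err("expected '+ number'".into()),
                }
            }
            Ok(total)
        }

        fn main() {
            let tokens = lex("1 + 2 + 39").unwrap();
            assert_eq!(parse_sum(&tokens), Ok(42));
        }
        ```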

        1. 1

          “The biggest distinction seems to be if you’re trying to parse an existing language or designing a new language. (Hand-written is better for the former IMO, and generated has advantages for the latter.)”

          This is an interesting claim - I like it.

          1. 2

            I disagree with it. If people have to hand-write their parsers, their tools aren’t good enough: they need to use better ones, or tool builders should try to extend their tools to cover those problems.

            A simple example would be the people who used GLR to handle harder grammars, or who use multiple parsers to get each one’s advantages. sklogic would use one parser for fast parsing of correct code and, if that failed, another for its good error messages. Ira Baxter is probably the champion, given that his and Semantic Designs’ tooling can probably handle anything. A less extreme example from the formal-methods world is the K Framework, which (IIRC) handled full C, among many other languages.

            We have all that capability, and yet someone still can’t easily automate parsing a toy or less-complex language. That means there are tooling gaps to work on. It might also mean someone needs to build a modern Semantic Designs with a freemium or open-core model.

            1. 2

              I agree that there could be better tools, but that doesn’t change the claim. If you want to write a “production quality” parser for an existing language right now, it’s better to write a parser by hand. I’m thinking of demanding problems like the parsers for JavaScript, TypeScript, C#, and C/C++ (Clang). Error messages, speed, and resource usage matter there. The latter two matter all the more because these parsers are also reused as frontends for tools and IDEs.

              Although I should qualify that with what I said here: https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-Should-Be-Separate

              That is: it’s safer and less effort to use the same model that the original authors did, because the parsing technique depends on the language. So if you’re writing a Python parser, it’s probably best to do what CPython did and use the bespoke pgen code generator – and that’s in fact what PyPy, IronPython, Jython, etc. all do. (But not MicroPython, whose parser is hand-written, I presume for resource-usage reasons.)

              So I guess the former rule is a special case of the latter, where a large portion of existing parsers are hand-written (add in Lua, Go, Rust, etc.) :-)

              1. 1

                “If you want to write a ‘production quality’ parser for an existing language right now, it’s better to write a parser by hand… Error messages, speed, and resource usage matter there.”

                I’d like to see proof of that, with a comparison against one of the automated tools or a mix of them. My guess is that people hand-wrote parsers because they didn’t know about the alternatives, because those alternatives hadn’t been invented when the older languages were implemented, and/or (a common thing) because it’s what everyone else was doing.

                Take two parsers: one great at efficiency, and a slower one great at error messages, both generated from a common grammar with annotations or patterns for errors if necessary. Yet virtually nobody I ever suggested that to had considered it. Likewise, they probably didn’t try a rewriting-logic tool like K, which was used to cover legacy C, produced error messages for arbitrary programs (their static analyzer), and builds on a fast rewriting system (Maude).

                Although I asked for proof, I’d also like to see someone attempt to prove what I’m saying. I’d love to see someone pick combinations of the best approaches for efficient code, error handling, complex languages, etc., feed them all the same grammar (with error conditions where necessary), and see what happens. Does it work as well as the custom stuff? Close enough? Way off? I’d love to see it put to the test, versus the super-narrow focus I usually see in the research.

                “So if you’re writing a Python parser, it’s probably best to use what CPython did, and use the bespoke pgen code generator”

                This I agree with if you want to do something quick: reuse what they had. Funnily enough, I was thinking about doing that with Python recently, either to get rid of some boilerplate or to extract it into a higher-performance language. I didn’t have time for it, though.

        2. 3

          Nice writeup!

          One thing I don’t see mentioned here is error recovery, which is really important for parsers of human-facing programming languages! LALRPOP can do this, but AFAIK Nom can’t as of yet – not sure about Pest. You can also hook up a custom lexer to LALRPOP, such as one implemented using Logos (or a hand-written one).
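
          For instance, a minimal Logos lexer of the kind you might feed into LALRPOP looks like this (a sketch assuming Logos 0.13’s derive API; the token set is made up):

          ```rust
          use logos::Logos;

          // Token definitions: Logos derives the state machine from these attributes.
          #[derive(Logos, Debug, PartialEq)]
          #[logos(skip r"[ \t\r\n]+")] // skip whitespace between tokens
          enum Token {
              #[token("let")]
              Let,
              #[token("=")]
              Equals,
              #[regex("[a-zA-Z_][a-zA-Z0-9_]*")]
              Ident,
              #[regex("[0-9]+")]
              Number,
          }

          fn main() {
              let mut lex = Token::lexer("let answer = 42");
              // Each item is Ok(Token), or Err(()) for unrecognized input; spans
              // are byte ranges, which is what you’d pass along for error reporting.
              while let Some(tok) = lex.next() {
                  println!("{:?} {:?} {:?}", tok, lex.slice(), lex.span());
              }
          }
          ```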

          1. 1

            Good read. Thanks for sharing :)

            1. 1

              I don’t get the value proposition of parser generators. They don’t seem that much easier than just rolling your own recursive-descent parser, and with the homegrown option you have full flexibility over error messages, recovery, etc.

              1. 4

                Formal grammars are typically easier to read and understand. E.g. we have a recursive-descent parser for SQL at work, and the parser alone is ~3kloc, with ~7kloc of supporting code.

                This is especially important for multiple implementations of the same standard, where you really want to be sure whether something is an intended part of the syntax or a quirk of the implementation. For a new language project I would go for:

                • a hand-written recursive descent parser that is used in production
                • a formal grammar that is included in the documentation / language spec
                • randomized tests that compare the two for differences (a sketch of this idea follows below)
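
                Here’s a minimal, self-contained sketch of that third bullet using the proptest crate, with a deliberately tiny “language” (signed integers) so both parsers fit in a comment. In a real project the two functions would be the production recursive-descent parser and one generated from the spec grammar:

                ```rust
                // Differential testing sketch: compare a hand-written parser against
                // a reference on random inputs. Requires proptest as a dev-dependency;
                // run with `cargo test`.
                use proptest::prelude::*;

                // "Hand-written" parser: explicit sign handling and digit loop.
                fn hand_written_parse(input: &str) -> Option<i64> {
                    let (negative, digits) = match input.as_bytes().first().copied() {
                        Some(b'-') => (true, &input[1..]),
                        Some(b'+') => (false, &input[1..]),
                        _ => (false, input),
                    };
                    if digits.is_empty() || !digits.bytes().all(|b| b.is_ascii_digit()) {
                        return None;
                    }
                    // Accumulate negatively so i64::MIN parses without overflow.
                    let neg = digits.bytes().try_fold(0i64, |acc, b| {
                        acc.checked_mul(10)?.checked_sub(i64::from(b - b'0'))
                    })?;
                    if negative { Some(neg) } else { neg.checked_neg() }
                }

                // Stand-in for the "generated" parser: the standard library's.
                fn reference_parse(input: &str) -> Option<i64> {
                    input.parse().ok()
                }

                proptest! {
                    #[test]
                    fn parsers_agree(input in "[+-]?[0-9]{0,19}|\\PC*") {
                        // The two must agree on validity and on the parsed value.
                        prop_assert_eq!(hand_written_parse(&input), reference_parse(&input));
                    }
                }
                ```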