  2. 11

    I’m not surprised GCC and Clang’s parsers are handwritten; C++ is infamously difficult to parse (or even lex) so I’ll bet a grammar file for it would have to contain all sorts of crazy cruft to handle irregularities.

    I’ve written parsers both ways. Handwritten is a lot more work to implement and debug, and requires more knowledge. It does give you more flexibility and optimization potential, but it’s probably only worth it if you’ve got a really serious implementation that lots of people are using. (Kind of like building your own hash table or b-tree, instead of using std::unordered_map or LMDB.)

    BTW, I would recommend against LALR (Yacc-like) parser generators. They’re hard to learn and unintuitive. Recursive-descent (ANTLR-style) grammars are a lot simpler. PEG is even better, in my experience, because it can handle a lot of rules that other types would reject as ambiguous. Also, PEG parser generators let you unify lexing with parsing.
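
    To make the ordered-choice point concrete, here is a toy sketch in Haskell. ReadP’s <++ operator is left-biased, so it behaves like PEG’s ordered choice; it’s a combinator library rather than a generator, but it shows the idea, and it works straight off characters with no separate lexer:

      import Text.ParserCombinators.ReadP
      import Data.Char (isDigit)

      -- A number is a float if it can be, otherwise an int. Declaring
      -- both alternatives in a grammar tool often triggers a conflict
      -- on the shared digit prefix; ordered choice just tries the
      -- float branch first and falls back.
      data Number = I Integer | F Double deriving Show

      number :: ReadP Number
      number = float <++ int   -- (<++) = left-biased, PEG-style choice
        where
          float = do
            whole <- munch1 isDigit
            _     <- char '.'
            frac  <- munch1 isDigit
            return (F (read (whole ++ "." ++ frac)))
          int = I . read <$> munch1 isDigit

      -- readP_to_S number "3.14"  ==>  [(F 3.14, "")]
      -- readP_to_S number "42"    ==>  [(I 42, "")]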

    1. 6

      Handwritten is a lot more work to implement and debug, and requires more knowledge. It does give you more flexibility and optimization potential, but it’s probably only worth it if you’ve got a really serious implementation that lots of people are using

      While true, this isn’t the main reason I’d go the hand-written route: most parser-generating tools can produce just-about-okay error messages, but if you’re building a compiler and want decent tooling that produces genuinely helpful error messages, a hand-written parser is much easier to work with.
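
      For instance, here’s a minimal sketch in Haskell (all names hypothetical) of the kind of diagnostic a hand-written parser gives you almost for free, because every call site knows exactly what it expected:

        -- a couple of made-up token and position types
        data Token = TLParen | TRParen | TIdent String deriving (Show, Eq)
        data Pos   = Pos { line :: Int, col :: Int }

        -- consume one expected token, or explain the failure in terms
        -- the user can act on
        expect :: Token -> [(Pos, Token)] -> Either String [(Pos, Token)]
        expect want ((pos, tok) : rest)
          | tok == want = Right rest
          | otherwise   = Left ("expected " ++ show want
                                ++ " but found " ++ show tok
                                ++ " at line " ++ show (line pos)
                                ++ ", column " ++ show (col pos))
        expect want [] = Left ("expected " ++ show want
                               ++ " but the input ended")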

      They’re also far easier to adapt to incremental parsing. If you want to extend your parser to cache partial results so that you can build LSP integration from the same source, a hand-written parser is also better. That may change with things like TreeSitter, though a separate TreeSitter grammar can drift out of sync with the one used in the compiler, resulting in a worse developer experience.

      1. 8

        If you want to extend your parser to cache partial results so that you can write LSP integration from the same source

        Incremental parsing is not needed for IDE functionality. As a case in point, IntelliJ only does incremental parsing for stuff like XML, where it’s not uncommon to have multi-megabyte files. For reasonable programming languages and a reasonably performant implementation language, incremental parsing would be of very low priority.

        What is a deal breaker for IDE functionality is error resilience: the ability to build a somewhat well-formed tree from arbitrary inputs (see this article for an example). Here, most parser generators are unusable (a sketch of the resilience idea follows the list). The two exceptions I know of are:

        • IntelliJ’s GrammarKit, which naturally is built around resilience
        • TreeSitter, which has some amount of resilience
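
        As mentioned above, here is a rough sketch in Haskell of what resilience means mechanically (the "let" statement grammar is made up): wrap unparseable input in an Error node, skip to a synchronizing token, and keep going, so the rest of the file still gets a tree:

          data Tree = Stmt String | Error String deriving Show

          -- hypothetical statement form: "let" <name> ";"
          parseStmt :: [String] -> Maybe (Tree, [String])
          parseStmt ("let" : name : ";" : rest) = Just (Stmt name, rest)
          parseStmt _ = Nothing

          parseStmts :: [String] -> [Tree]
          parseStmts [] = []
          parseStmts toks = case parseStmt toks of
            Just (node, rest) -> node : parseStmts rest
            Nothing ->
              -- crude synchronization: everything up to the next ';'
              -- becomes an Error node; real implementations also stop
              -- at tokens that can begin a statement
              let (junk, rest) = break (== ";") toks
              in Error (unwords junk) : parseStmts (drop 1 rest)

          -- parseStmts (words "let x ; oops ; let y ;")
          --   ==> [Stmt "x", Error "oops", Stmt "y"]
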
        1. 1

          Incremental parsing is not needed for IDE functionality.

          Correct, though from what I’ve heard from people working on rust-analyzer, it can be pretty nice.

          1. 3

            To clarify, although the rust-analyzer parser is incremental, its incremental capabilities are not used for real (incremental diffs from the editor translate to full reparses).

            One consideration I forgot to mention is that, in a language where each syntax node is a separate heap-allocated object, you might want to incrementally re-use the tree, because node allocation is what makes parsing slow. Not re-allocating a ton of JS objects was one of the motivations behind TreeSitter, and IntelliJ has special code which zips the existing syntax tree with the stream of events from a full text parse to surgically modify only a subset of the tree.
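
            Very roughly, and purely as a speculative illustration (the Node type here is made up), that zipping idea looks something like this: walk the old tree and the fresh parse together, and keep the old node wherever its source text is unchanged, so unchanged subtrees cost no new allocations:

              data Node = Node { src :: String, kids :: [Node] } deriving Show

              zipTrees :: Node -> Node -> Node
              zipTrees old new
                | src old == src new = old   -- unchanged span: reuse the old node
                | otherwise          = new { kids = go (kids old) (kids new) }
                where
                  go (o:os) (n:ns) = zipTrees o n : go os ns
                  go _      ns     = ns        -- children with no old counterpart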

    2. 4

      I’ve had my fair share of parser generators (I even wrote one myself), of consuming the parsers they produce (e.g. for a Ruby-based XML/HTML parser), and of hand-written recursive descent parsers (e.g. for Inko).

      I’ve grown to really prefer hand-written parsers. Parser generators are great if you have a reasonable and well established grammar and language to parse. However, you can hit walls quickly, and often there are no escape hatches. They also complicate your build process, which was a big issue for me in the past due to the use of Ragel for my lexers. I also really hate debugging parser generators, as the generated code is often a total mess.

      Parser combinators are nice when you don’t want to separate lexing and parsing, you don’t want a separate build phase, and you can’t be bothered writing a recursive descent parser from scratch. For example, I used a Ruby parser-combinator library to write a parser for a templating language GitLab uses for generating changelogs (parser here). Debugging parser combinators can be a pain though, so I generally prefer hand-written parsers. I opted not to use one here to make it easier for others to maintain the parsing code.

      When evaluating what parsing approach to use, I would recommend the order: hand-written recursive descent > parser combinators > parser generators. Hand-written parsers just work much better in the long term, but parser combinators can be easier to grok/maintain (if you don’t care about webscale performance). Parser generators I would just avoid if you can.

      1. 1

        Even when I use a parser combinator approach, which is often, I still separate the lexing and parsing phases. I think it’s just a lot cleaner and easier to reason about. I personally don’t think there is a large amount of distance between parser combinators and hand-rolled recursive descent parsing, but everyone is a little different in their opinions on that. Either way, I find most people I talk to end up gravitating to recursive descent and parser combinators as their top two, with parser generators used mainly to prove out a grammar while you are still playing with the syntax, or when the grammar is already available in a format the generator can parse.
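
        As a minimal illustration of that separation, in Haskell (the token-level parser is hand-rolled for brevity, but the same shape works with a combinator library that is generic over its input stream):

          import Data.Char (isDigit, isSpace)

          -- the lexer is its own pass: characters in, Tokens out
          data Token = TNum Int | TPlus deriving (Show, Eq)

          lexer :: String -> [Token]
          lexer [] = []
          lexer (c:cs)
            | isSpace c = lexer cs
            | c == '+'  = TPlus : lexer cs
            | isDigit c = let (ds, rest) = span isDigit (c:cs)
                          in TNum (read ds) : lexer rest
            | otherwise = error ("unexpected character: " ++ [c])

          -- the parser only ever sees Tokens, never raw characters;
          -- it accepts sums like "1 + 2 + 3"
          parseSum :: [Token] -> Maybe Int
          parseSum (TNum n : rest) = go n rest
            where
              go acc (TPlus : TNum m : more) = go (acc + m) more
              go acc []                      = Just acc
              go _   _                       = Nothing
          parseSum _ = Nothing

          -- parseSum (lexer "1 + 2 + 3")  ==>  Just 6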

      2. 4

        Why does no one use generic parsers such as Marpa? Are they too slow?

        1. 4

          I love how easy it is to parse something in Haskell.

          • For small ad-hoc parsers, there is the built-in Text.ParserCombinators.ReadP. I end up using it in places where I would use a regex or strpos/splitting/slicing in other languages (see the sketch after this list).

          • For anything more complicated I love to use Megaparsec, which is fast, has great error reporting, and has an interface very similar to ReadP above.
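
          As mentioned above, a small ReadP sketch for the kind of input one might otherwise attack with a regex, key=value pairs separated by semicolons:

            import Text.ParserCombinators.ReadP
            import Data.Char (isAlpha)

            pair :: ReadP (String, String)
            pair = do
              key <- munch1 isAlpha
              _   <- char '='
              val <- munch1 (/= ';')
              return (key, val)

            pairs :: ReadP [(String, String)]
            pairs = sepBy pair (char ';') <* eof

            -- readP_to_S pairs "host=local;port=80"
            --   ==> [([("host","local"),("port","80")], "")]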

          1. 3

            GCC Clang

            I was amused that this completely ignored the other language front-ends, so I went and took a look: GCC’s Ada, D, Fortran and Go parsers all seem to be hand-written too! I also found it interesting that the D and Go parsers seem to be directly imported from “upstream” (Digital Mars/the Go project), while the Ada and Fortran ones are “native” to GCC. It seems that gccrs is choosing to build its own “gcc-native” parser too.

            Although parser generators are still used in major language implementations, maybe it’s time for universities to start teaching handwritten parsing?

            I’m not sure about that. I think students are more likely to find themselves parsing ad-hoc formats that don’t need the flexibility or performance of hand-written parsers than they are to find themselves working on a real compiler front-end. In this case, it might make more sense to learn how to use a parser generator, some of which come with really fancy features (e.g. incrementality for TreeSitter, formal verification for RecordFlux…).

            1. 1

              Slightly tangential, but I was amused to see Brian Goetz referred to as a Java contributor. I mean, it’s technically true, but his official role is Java Language Architect at Oracle. So he definitely contributes to Java.