1. 30
  1.  

    1. 7

      I’ve written a lot of parsers in my day.

      Parser combinators are especially elegant, if the language has good support. In my experience, parser combinators are married to Parsing Expression Grammars: you can do one without the other but at some point everything seems to converge between them.

      There are a few problems with using this approach, however, which are similar to the problems of writing recursive descent parsers (since PEGs generally just end up being RD under the hood):

      • Expressing lots of levels of operator precedence either gets really hairy or requires you to break the abstraction. I’ve seen good solutions to this, but most of the time it seems as though parser authors just limit themselves to fewer levels of precedence.
      • Left recursion can result in infinite loops which you have to detect and break out of.
      • Highly recursive input (e.g. ((((((((((1 + 2))))))))))) can result in stack exhaustion, so you need to pass context around to keep track of how deep you are.

      Now, with parser combinators and PEGs you can do things under the hood that end up transforming stuff into a table-driven parser at runtime (or, I suppose, at compile time with macros), but it’s a lot more hairy to get right.

      None of the above invalidates the assertion in TFA, though: this is good for 99% of parsing needs.

      I would also say that sometimes regexes are parser combinators work well together. I wrote an ECMAScript parser combinator library many (many, I’m so old) years ago and allowed regular expressions as a “terminal” combinator. It allowed things to be a lot more succinct at the bottom of the input parsing.

      I’m working on a project in Rust that is eventually going to need to do some parsing, and I think I’m going to end up using something like this. My only concern is that the target language has a ton of precedence levels and so I’m not sure if I want to deal with that…

      1. 2

        Your first point is still true today and still a pain point to me.

        For left recursion, a modern parser-combinator will use something like GLL, where each (position, parser) is memoized, so you only go down paths that are going to make progress. For recursive input, the parser uses a work queue, which should only expand when a parser is an “or” combinator. (Each ( should be a job that memoizes itself and eventually adds a new job for the next position, so the queue doesn’t expand, only the memos, linear to the input.)

        1. 1

          Expressing lots of levels of operator precedence either gets really hairy or requires you to break the abstraction. I’ve seen good solutions to this, but most of the time it seems as though parser authors just limit themselves to fewer levels of precedence.

          One idea I’ve seen is to embed an operator-precedence parser as one of the combinators, like Parsec’s buildExpressionParser. Is that what you mean by “breaking the abstraction” though?

          1. 2

            Right, though I suppose “breaking the abstraction” is a bit too strong phrasing. In “pure” parsing combinators, at least when expressing PEGs, there’s a fairly obvious correspondence between the grammar and what actually gets run in the most common implementations. With this, there’s a either a hidden operator-precedence parser being created under the hood or a bunch of synthetic nonterminals being added in to account for precedence.

            There’s nothing wrong with it per se (and a lot of compilers are, for example, recursive descent wrapping an operator-precedence parser as an “Expression” terminal). That’s a best-of-both-worlds approach, but it does hide things more than a “pure PEG” would.

            I’m completely unfamiliar with Parsec, but if I had to guess, I’d assume that’s what they’re doing here.

        2. 2

          As someone who is already familiar with parsec, I found the discussions of where you have to use explicit boxing the most interesting part.

          It would have been nice to mention the name Parsec near the top of the article somewhere. ;)