1. 26
  1.  

  2. 4

    Unrelated: I read an article recently on Haskell programming that asserted one should never write a parser, and to always use a parser combinator library. Yet on the other end of the spectrum, I see a lot of people claiming you should never use a parser generator as they universally produce awful error messages, and to always write your own. Is this actually a contradiction? Are error messages from parser combinator libraries as bad as the ones from yacc? I’ve never used a parser combinator library as I’ve never needed to do any parsing in Haskell.

    Ontopic:

    The article here says:

    if the BNF production A recognizes a language, and B recognizes a different language, then the production A | B will recognize the union of the two languages recognized by A and B. As a result, swapping the order of an alternative from A | B to B | A will not change the language recognized.

    However, most implementations of parser combinators do not satisfy these kinds of properties! Swapping the order of the alternatives usually can change the parse tree returned from a parser built from parser combinators.

    Is it not the case that the property that swapping ‘A | B’ for ‘B | A’ does not change the language recognised, and the property that swapping them does not change the parse tree produced for a particular string, are quite different things? Formally, grammars are not instructions for building parse trees; they are string predicates.
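The distinction can be made concrete with a toy combinator sketch (the names `str` and `alt` are my own, not from any real library): each alternative accepts the same strings in either order, but on input matched by both branches the result handed back differs.

```haskell
import Data.List (isPrefixOf)

-- A deliberately naive combinator parser: a parser consumes a prefix
-- of the input and returns a result plus the remaining input, or fails.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

-- Match a literal string.
str :: String -> Parser String
str s = Parser $ \inp ->
  if s `isPrefixOf` inp then Just (s, drop (length s) inp) else Nothing

-- Biased choice: commit to the first parser that succeeds.
alt :: Parser a -> Parser a -> Parser a
alt p q = Parser $ \inp -> maybe (runParser q inp) Just (runParser p inp)
```

On input "ab", `str "a" `alt` str "ab"` yields `Just ("a","b")` while the swapped version yields `Just ("ab","")`: the same strings are accepted either way, but the parse result differs — the parse-tree property fails even where the language property holds.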

    1. 2

      It is desirable that the combinator A | B recognise the same language as B | A, as that is more declarative; otherwise this can introduce hard-to-detect problems. For an implementation of parser combinators this is difficult to guarantee if efficiency is a concern. Yacc has a global view of the grammar, so the property is easier to guarantee in a Yacc-generated parser than in a combinator-based one, where each combinator typically has only a local perspective.

      1. 3

        First, the standard Haskell parser combinator library, Parsec, does not have commutative disjunction; only ReadP managed that. Second, the PEG formalism is biased towards A by design, and for a reason: this reduces ambiguity in languages and makes parsing certain aspects of programming languages easier.

        1. 1

          Second, the PEG formalism is biased towards A by design, and for a reason: this reduces ambiguity in languages and makes parsing certain aspects of programming languages easier.

          I understand that pattern matching also has a first-match policy, and I don’t complain about that. But I am still not convinced it is the right choice for parsing a language, which is typically much larger than a run-time value being deconstructed. In Parsing: a timeline, Jeffrey Kegler writes about PEG:

          But PEG is, in fact, pseudo-declarative – it uses the BNF notation, but it does not parse the BNF grammar described by the notation: PEG achieves unambiguity by finding only a subset of the parses of its BNF grammar. And, as with its predecessor GTDPL, in practice it is usually impossible to determine what the subset is. This means that the best a programmer usually can do is to create a test suite and fiddle with the PEG description until it passes. Problems not covered by the test suite will be encountered at runtime.

          Yacc admittedly has its own problems with shift/reduce conflicts.
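Kegler's "subset of the parses" point can be reproduced with a two-combinator PEG sketch (hand-rolled, hypothetical names; `/` written as `</>` here): ordered choice commits, so a string the BNF grammar accepts can be silently rejected.

```haskell
import Data.List (isPrefixOf)

newtype P a = P { runP :: String -> Maybe (a, String) }

-- Match a literal string.
lit :: String -> P String
lit s = P $ \inp ->
  if s `isPrefixOf` inp then Just (s, drop (length s) inp) else Nothing

-- PEG ordered choice: once the first branch succeeds, the second is
-- never revisited, even if a later failure would call for it.
(</>) :: P a -> P a -> P a
P p </> P q = P $ \inp -> maybe (q inp) Just (p inp)

-- Sequencing, concatenating the matched text.
(<.>) :: P String -> P String -> P String
P p <.> P q = P $ \inp -> do
  (x, rest)  <- p inp
  (y, rest') <- q rest
  pure (x ++ y, rest')

-- The BNF grammar ("a" | "ab") "c" accepts both "ac" and "abc".
-- The PEG ("a" / "ab") "c" rejects "abc": the choice commits to "a",
-- and the subsequent "c" fails on 'b' with no backtracking into the choice.
s :: P String
s = (lit "a" </> lit "ab") <.> lit "c"
```

`runP s "ac"` succeeds while `runP s "abc"` fails — exactly the kind of gap a test suite has to catch, since nothing in the notation flags it.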

      2. 2

        It’s definitely hard to produce good error messages from a parser generator, and especially from a parser combinator library, because the parser is built up dynamically and there’s no preprocessing stage.

        The system in this paper does have a first-class ADT representation of grammars, though, which is translated into executable code via staging (similar to my PEG library).
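As a rough illustration of what a first-class grammar representation buys (my own simplified sketch, not the paper's actual datatype; I use a direct interpreter where the paper uses staging): because the grammar is plain data, it can be analysed before any parsing happens.

```haskell
-- A grammar as an ordinary data type, not opaque functions.
data G
  = Tok Char   -- a single terminal
  | Seq G G    -- concatenation
  | Or  G G    -- alternation
  | Nil        -- the empty string
  deriving Show

-- Static analysis is possible precisely because the grammar is data:
-- e.g. does this grammar accept the empty string?
nullable :: G -> Bool
nullable (Tok _)   = False
nullable (Seq a b) = nullable a && nullable b
nullable (Or a b)  = nullable a || nullable b
nullable Nil       = True

-- A direct recognizer returning the unconsumed input on success;
-- a staged system would instead generate specialized code from G.
recog :: G -> String -> Maybe String
recog (Tok c) (x:xs) | x == c = Just xs
recog (Tok _) _               = Nothing
recog Nil s                   = Just s
recog (Seq a b) s             = recog a s >>= recog b
recog (Or a b) s              = maybe (recog b s) Just (recog a s)
```

A combinator library built from bare functions cannot run an analysis like `nullable` at all, which is why the preprocessing stage matters for diagnostics.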

        1. 1

          I believe your point about swapping A | B for B | A is exactly the problem the authors describe with parser combinators; for most parser generators, the resulting code should behave exactly the same. For any tool, the problem only arises if there’s ambiguity: if every string is recognized by at most one of A and B, then clearly A | B and B | A will behave the same. Parser combinators have a tough time detecting ambiguity, and so will often just take the first alternative that matches; so on the set of strings recognized by both A and B, most parser combinator libraries will treat A | B and B | A differently. Tools like yacc and bison tend not to tolerate ambiguity, which means each string is recognized by only one of A or B, so A | B and B | A behave the same way.
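The unambiguous case is easy to sketch with hand-rolled combinators (my own names, assuming the usual first-match `alt`): when no string is matched by both branches, biased choice is order-insensitive.

```haskell
import Data.Char (isAlpha, isDigit)

newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

-- Longest nonempty run of characters satisfying the predicate.
run1 :: (Char -> Bool) -> Parser String
run1 f = Parser $ \inp -> case span f inp of
  ([], _)    -> Nothing
  (xs, rest) -> Just (xs, rest)

-- Biased choice, as in most combinator libraries.
alt :: Parser a -> Parser a -> Parser a
alt p q = Parser $ \inp -> maybe (runParser q inp) Just (runParser p inp)

number, word :: Parser String
number = run1 isDigit
word   = run1 isAlpha
-- No input starts with both a digit run and a letter run, so for every inp:
--   runParser (number `alt` word) inp == runParser (word `alt` number) inp
```

With overlapping branches (like the `str "a"` / `str "ab"` case discussed above), this equality breaks, which is the whole problem.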

        2. 3

          Very interesting paper, especially the appendices, which extend derivatives from regular languages to derivatives of context-free languages. Afterwards, a comment is made regarding intersection: context-free languages are not closed under intersection, but it is possible to add an intersection type to the type system, and all of their theorems remain true. This increases the expressivity of the grammars beyond that of context-free languages.
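For the regular-language case, the derivative construction with an added intersection form fits in a few lines (a standard Brzozowski-derivative sketch of my own, not the paper's context-free extension): intersection threads through derivatives pointwise, which is what makes adding it so cheap.

```haskell
-- Regular expressions with an intersection constructor.
data Re
  = Empty          -- the empty language
  | Eps            -- the empty string
  | Chr Char
  | Alt Re Re
  | Cat Re Re
  | Star Re
  | And Re Re      -- intersection: derivatives distribute over it directly

-- Does the language contain the empty string?
nullable :: Re -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Chr _)   = False
nullable (Alt a b) = nullable a || nullable b
nullable (Cat a b) = nullable a && nullable b
nullable (Star _)  = True
nullable (And a b) = nullable a && nullable b

-- Brzozowski derivative: the language of suffixes after consuming c.
deriv :: Char -> Re -> Re
deriv _ Empty     = Empty
deriv _ Eps       = Empty
deriv c (Chr d)   = if c == d then Eps else Empty
deriv c (Alt a b) = Alt (deriv c a) (deriv c b)
deriv c (Cat a b)
  | nullable a    = Alt (Cat (deriv c a) b) (deriv c b)
  | otherwise     = Cat (deriv c a) b
deriv c (Star a)  = Cat (deriv c a) (Star a)
deriv c (And a b) = And (deriv c a) (deriv c b)

-- Match by repeated derivation: accept iff the final residue is nullable.
matches :: Re -> String -> Bool
matches r = nullable . foldl (flip deriv) r
```

The `And` case is a one-liner precisely because derivatives commute with intersection; the hard part in the paper is keeping the analogous property through the context-free extension.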