1. 19

  2. 5

    “Ideally, I could look at the parser generators for the language I’m working in and pick the one that suited my use case. “

    There’s the problem. The same thing happens in many formal verification projects: the author wants to shoehorn all requirements into one tool or algorithm. sklogic, who writes static analyzers, always said on Hacker News: just use two, one for speed and one for good error messages. As far as ambiguity goes, my research led to submissions here of tools like Elkhound that mix high-speed and ambiguity-friendly algorithms. So I think a combination like that would make parsing a solved problem for most use cases in programming languages. Natural language is a totally different issue…
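
    The “use two” idea is simple enough to sketch: run the fast parser on the happy path, and only re-parse with a slower, diagnostics-oriented parser when it fails. Everything below (the two stand-in parsers and their names) is invented for illustration, not taken from any real tool:

```python
# Hedged sketch of the two-parser strategy: fast_parse stands in for a
# high-speed parser (say, table-driven LALR) that fails without a useful
# message; diagnostic_parse stands in for a slower parser that exists
# only to explain the error well. Both are toys.

def fast_parse(src):
    # Toy "fast" parser: rejects an unclosed '(' with a bare error.
    if "(" in src and ")" not in src:
        raise SyntaxError("parse error")
    return ("ast", src)

def diagnostic_parse(src):
    # Toy "good errors" parser: slower, but points at the offending spot.
    col = src.index("(")
    return "unclosed '(' starting at column %d" % col

def parse(src):
    try:
        return fast_parse(src)
    except SyntaxError:
        # Only on failure do we pay for the message-friendly re-parse.
        raise SyntaxError(diagnostic_parse(src))
```

    Well-formed input never pays for the good error messages, since the diagnostic parser only runs on input that already failed.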

    The blog post is still useful for highlighting lots of other considerations. Anyone curious about this topic might compile a list of techniques that are good at each requirement, look at combinations of them, and filter those down to the bare minimum number of primitives and combinations. That gives a language designer what they need to have it all. From there, ask what essential pieces of each could be merged into fewer tools; that gets the bloat down. It might even lead to novel algorithms.

    1. 4

      Very strong writing. I also have problems with parsing despite all the algorithms and tooling. I am no serious PLT programmer, but the other day I was playing with languages and it was a real puzzle to tokenize multi-line strings with a lexer generator (Alex). I am sure it is easy to do; it is just not immediately obvious how.
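
      What lexer generators like Alex address with start codes can be shown with a tiny hand-rolled tokenizer: on the opening delimiter the lexer switches into a string state and keeps consuming across newlines until the closing delimiter. This is a sketch of the state-switching idea only, not Alex itself; the delimiter choice and token names are made up for the example:

```python
# Hand-rolled sketch of stateful lexing for multi-line strings: seeing
# the opening delimiter flips the lexer into a "string" state that
# ignores newlines until the closing delimiter. Token names invented.

def tokenize(src, delim='"""'):
    tokens, i = [], 0
    while i < len(src):
        if src.startswith(delim, i):
            # Enter string state: scan past newlines to the close delimiter.
            end = src.find(delim, i + len(delim))
            if end == -1:
                raise SyntaxError("unterminated multi-line string")
            tokens.append(("STRING", src[i + len(delim):end]))
            i = end + len(delim)
        elif src[i].isspace():
            i += 1  # normal state: whitespace separates tokens
        else:
            j = i
            while j < len(src) and not src[j].isspace() and not src.startswith(delim, j):
                j += 1
            tokens.append(("WORD", src[i:j]))
            i = j
    return tokens
```

      The same two-state structure is what an Alex start code expresses declaratively: one set of rules in the default state, another while inside the string.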

      1. 2

        Some of those requirements reminded me of the fairly recent announcement of Tree-sitter, “a parser generator tool and an incremental parsing library”. I haven’t explored it in depth, though, so I’m not sure whether it can be used in compilers, and if so, whether it can display syntax errors in some nice way. Does anybody know more about it?

        1. 2

          One thing that came to my mind recently is the total lack of good tooling for writing grammars.

          For instance:

          I’d love to have a tool that gave me grammar construction-by-example.

          Pass in code, and incrementally construct a parser from it, allowing tool-assisted refactorings, like

          “replace all instances of identifier ":" type by a rule param in the grammar, wherever they appear in the code example”

          and then offer to

          replace instances of identifier ":" type, identifier ":" type, ... by REP(identifier, ",")

          Basically the things we take for granted when writing programming languages in an IDE.
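
          The first of those refactorings could be mechanized as a plain rewrite over the grammar itself. As a hedged sketch (the dict-of-alternatives grammar representation and the name extract_rule are invented here), extracting a repeated symbol sequence into a named rule might look like:

```python
# Sketch of a grammar refactoring: pull every occurrence of a symbol
# sequence out into its own named rule. A grammar is modeled as a dict
# mapping rule names to lists of alternatives (lists of symbols).

def extract_rule(grammar, seq, name):
    new = {name: [list(seq)]}  # the freshly extracted rule
    for rule, alts in grammar.items():
        new_alts = []
        for alt in alts:
            out, i = [], 0
            while i < len(alt):
                if tuple(alt[i:i + len(seq)]) == tuple(seq):
                    out.append(name)   # replace the sequence with the new rule
                    i += len(seq)
                else:
                    out.append(alt[i])
                    i += 1
            new_alts.append(out)
        new[rule] = new_alts
    return new
```

          For example, extracting identifier ":" type as a rule named param rewrites an alternative ["name", "(", "identifier", ":", "type", ")"] into ["name", "(", "param", ")"]. The by-example part of the wish — inferring seq from a selection in concrete code — is the genuinely hard step this sketch leaves out.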

          The best things I was able to find were things like https://ohmlang.github.io/editor/, which at least shows whether your grammar parses some code examples.

          If I’m wrong and such tooling exists, please let me know!

          1. 1

            Great stuff. I share every concern described here, to a greater or lesser extent. It’s a very thoughtful piece, and I can tell it’s coming from long experience. I suspect that the speed with which it goes through these topics may be off-putting to readers without background in them, but it looks like the author has other posts that explain a lot of it. I’m really happy to learn about this blog. :)

            1. 1

              The article advocates the classic separation of lexing and parsing. This assumes that the language has a single set of tokens, but I have found that this is often not the case, and that it is better addressed by approaches that do not separate lexing and parsing. For example, try parsing XML by tokenizing first. To maintain the separation, the parser would have to tell the lexer which part of the grammar it is in, so that the lexer can apply the appropriate tokenization rules. This kind of context sensitivity exists in many languages, and it is not appropriately addressed by current tools.
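
              One common workaround is a lexer that tracks its own mode, so the same characters tokenize differently inside a tag than in text. As a rough sketch for an XML-like input (token names and the mode handling are invented for illustration; real XML needs far more than this):

```python
# Mode-switching lexer sketch for XML-ish input. In "tag" mode, '=',
# quoted strings, and '>' are meaningful tokens; in "text" mode the
# same characters are just character data. Token names are invented.

import re

def lex_xml(src):
    tokens, i, in_tag = [], 0, False
    while i < len(src):
        if in_tag:
            # Inside <...>: whitespace separates tokens, '=' and quoted
            # strings are meaningful, '>' or '/>' ends the tag.
            m = re.match(r'\s+|/>|>|=|"[^"]*"|[^\s=>/]+', src[i:])
            text = m.group()
            if text in (">", "/>"):
                in_tag = False
            if not text.isspace():
                tokens.append(("TAG", text))
        elif src[i] == "<":
            in_tag = True
            m = re.match(r'</?', src[i:])
            tokens.append(("TAG", m.group()))
        else:
            # In text mode everything up to the next '<' is one token,
            # including '=' and quotes that would matter inside a tag.
            m = re.match(r'[^<]+', src[i:])
            tokens.append(("TEXT", m.group()))
        i += len(m.group())
    return tokens
```

              On input like <a href="x">hi = there</a>, the '=' inside the tag comes out as its own token while the '=' in the text stays part of the character data — exactly the context sensitivity a single fixed token set cannot express.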