1. 67

  2. 13

    The author has a pretty explicit bias toward Earley parsing, and using LL(k) over LR(k) methods (which I mostly share). There’s a wealth of information here, but he’s missing a couple other significant strands of generalized parsing research outside of Marpa’s particular Earley variant (perhaps because they don’t fit within his narrative).

    GLR (Generalized LR):

    • Tomita, LR parsers for natural languages, 1984.
    • Tomita, Efficient parsing for natural language, 1986.
    • Scott and Johnstone, Generalised bottom up parsers with reduced stack activity, 2005. This segues into their later work with Earley and GLL.

    GLL (Generalized LL):

    • Scott and Johnstone, GLL Parsing, 2010.
    • Afroozeh and Izmaylova, Faster, Practical GLL Parsing, 2015.

    Outside of Earley, GLL seems like a very practical generalized parsing approach. Instaparse is one implementation.

    Earley / Shared Packed Parse Forests:

    • Scott, SPPF-style parsing from Earley recognisers, 2008 (and several related papers by Scott and Johnstone). Note: I’ve never been able to get the approach described in the paper implemented correctly, and not for lack of trying.

    Earley Intersection Parsing (not sure if there’s a canonical name for this):

    • Bar-Hillel, Perles, and Shamir, On formal properties of simple phrase structure grammars in Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 1961. Proves some important results about intersecting automata and context free grammars.
    • Chapter 13 (“Parsing as Intersection”) of Grune and Jacobs (also cited elsewhere in his timeline), particularly 13.4 (pgs. 437-439): This describes Bar-Hillel’s Intersection Parsing approach using contemporary CS & parsing terminology, and then suggests combining it with Earley parsing. While the basic intersection parsing method produces an astonishing amount of dead-end nodes to garbage-collect later, Earley parsers limit searching to productions reachable from the start symbol. If the completer is modified to produce non-terminals in the intersection parsing format (which is easy), intersection parsing nicely handles producing the parse forest (with structural sharing, when ambiguity produces multiple derivations).

    I’ve been working on a C library for Earley Intersection parsing, and an awk-like, pattern-match-directed language based on it, but working on testing them thoroughly tends to lead me to working on theft instead.

    1. 3

      Thanks for details about alternatives.

      I have played around with Bison’s GLR option and have almost managed to create a C parser that does not require special handling of typedefs in the lexer (ambiguities still appearing at runtime for a few constructs).

      Grune and Jacobs’ Parsing Techniques - A Practical Guide is sitting in a pile waiting to be read (the first edition was manageable, the second looks daunting)

      1. 1

        I have a print copy of the second, and it’s never struct me as daunting – reading it cover to cover, perhaps, but it’s well suited to dipping into for just specific techniques, with a well-curated bibliography for further details.

      2. 2

        packrattle is another example of a GLL parser.

        The times I’ve read about Earley parsers, they seem like the same solution as GLL, but attacking the problem from a different angle. Hopefully someone in academia will eventually prove that.

        1. 2

          I’m particularly interested in Earley parsers because some use cases of mine assume ambiguity (reverse-engineering binaries, scraping damaged data), and the chart that Earley parsers build up supports several operations beyond just recognizing & building parse trees:

          • what terminals/nonterminals would advance the current parse(s) in progress? (i.e., autocomplete)
          • if we don’t assume where the starting position is, what overlapping instruction encodings are there?
          • if we inserted a fake nonterminal here, would enough of the parse have completed to match any instances of (some structure) between here and there?
          • incremental re-parsing of changing input, since earlier columns in the chart are constant once the parser has moved on.
        2. 2

          Thank you for adding this. I understand why everybody is excited about Earley parsing, though I personally don’t feel sufficiently grounded in it to share that excitement, but at the very least other lines of research deserve to be mentioned.

        3. 6

          Amazing. Well written, well researched and concise. Parsing is such an interesting problem - it arises naturally almost anywhere you find text.
          I’ve read the phrase “parsing is a solved problem” so often in discussions about it but this article puts that in its place. Thoroughly.

          1. 5

            Parsing is a solved problem in the sense that if you spend enough time working on a grammar it is often possible to produce something that Bison can handle. I once spent several months trying to get the grammar from the 1991(?) SQL standard into a form that Bison could handle without too many complaints.

            Bison’s GLR option helps a lot, but at the cost of sometimes getting runtime ambiguity errors, something that non-GLR guarantees will not happen.

            So the ‘newer’ parsing techniques don’t require so much manual intervention to get them into a form that the algorithm can handle.

            1. 3

              I think that usually means for most languages and problems people are applying it to. If so, then it probably is a solved problem given we have parsers for all of them plus generators for quite a lot of those kinds of parsers. Just look at what Semantic Designs does with GLR. There’s FOSS tools that try to similarly be thorough on parsing, term-rewriting, metaprogramming, etc. At some point, it seemed like we should be done with parsing algorithm research for most scenarios to instead polish and optimize implementations of stuff like that. Or put time into stuff that’s not parsing at all much of which is nowhere near as solved.

              I still say it’s a solved problem for most practical uses if it’s something machine-readable like a data format or programming language.

              Note: It is as awesome an article as you all say, though. I loved it.

              1. 5

                A solved problem, but for a subset of constrained grammars and languages. Still people say it without any qualification as if there’s nothing more to be said or understood.

                I take your points though and there are definitely other areas that need work.

                I think my point is just that I’m glad people are still putting effort and research into solved problems like this. No telling where breakthroughs in one area can lead and we’re still restricted constructing and representing language grammar.

                1. 3

                  A solved problem, but for a subset of constrained grammars and languages. Still people say it without any qualification as if there’s nothing more to be said or understood.

                  Yeah, they should definitely be clear on what’s solved and isn’t. It’s inaccurate if they don’t.

            2. 3

              Dumb question. How does Floyd’s operator precedence grammar (1963) fit into all this?

              There’s this brief passage.

              Why not use the algorithms that parse operator expressions for the whole language? Samuelson and Bauer 1959 had suggested exactly that. But, alas, operator expression parsing is not adequate for languages as a whole[49].

              But it doesn’t explain why they are “not adequate”.

              This is a very well researched article and I understand not everything can be included but I found a lot of the opinions included to lack justification.

              I can still take it for its historical value but not a summary of what likely works or doesn’t work.

              1. 5

                In my experience the most common source of ambiguity in programming-language grammars, other than operator precedence, is repetition. Function parameters, struct members, array literals, statements in a block of code, chained else-if cases… these are all grammatical constructs which can occur zero to n times, and they all have to deal in various ways with making sure the parser has a way to know when one item stops and the next begins, and when the overall list stops.

                Natural languages also have phenomena such as rank-shift, where for example you take an entire verb phrase and put it in a different tense and use it inside a noun or preposition phrase that’s part of a larger verb phrase. If that sentence made no sense to you, it also makes no sense to Bison. :)

                Oh - and in natural language, ambiguity is a property of the language that’s there “by design”, and clever writers very often exploit it. It can actually enhance clarity in certain situations, by eliding detail that shouldn’t be the focus. It can also be the basis of humor. So to truly “parse” natural language, you absolutely must model ambiguity.

                1. 4

                  Re: rank-shift. I finally understood rank-shift when I read the Complete Lojban Language. Lojban is a constructed language with a formal grammar which still tries to support all kinds of features of natural languages. Lojban has an explicit support for rank-shift, although naturally, in order to be machine-parseable, it needs to be explicitly marked.

                  Lojban rank-shift, which is called “raising”, is described in the Complete Lojban Language, 11.10.

              2. 6

                This is the best article I’ve read in a long time.