1. 12

Left recursion has traditionally been a weak point for top-down recursive descent parsers. While there have been a number of solutions, they have typically been complex to implement. I think there might be a simple solution to this problem (detailed here) that may be applicable to a large class (all?) of context-free grammars.

The idea is to statically compute the minimum length of the string that a grammar nonterminal requires for a successful parse (or the minimum length of the string the nonterminal can produce, if the grammar is turned into a producer). Then, we can bound the left recursion: a new expansion is abandoned when the remaining string to parse falls short of the minimal length needed to complete it.
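The minimum lengths themselves can be computed with a simple fixpoint iteration over the grammar: start every nonterminal at infinity and repeatedly relax until nothing changes. A minimal sketch (the `min_lengths` helper name and the small demo grammar are my own, not from the linked code):

```python
import math

def min_lengths(grammar):
    # Start every nonterminal at infinity, then repeatedly relax: the cost of
    # a nonterminal is the minimum over its alternatives of the sum of its
    # symbols' costs (terminals contribute their own length).
    cost = {k: math.inf for k in grammar}
    changed = True
    while changed:
        changed = False
        for key, alternatives in grammar.items():
            best = min(sum(cost[t] if t in grammar else len(t) for t in alt)
                       for alt in alternatives)
            if best < cost[key]:
                cost[key] = best
                changed = True
    return cost

grammar = {
    "<E>": [["<E>", "+", "<E>"], ["(", "<E>", ")"], ["<digit>"]],
    "<digit>": [[str(i)] for i in "0123456789"],
}
print(min_lengths(grammar))  # {'<E>': 1, '<digit>': 1}
```

The iteration terminates because costs only ever decrease and are bounded below by zero; nonterminals that can never derive a finite string simply stay at infinity.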

I have provided a working implementation in Python here. For the simplest implementation, scroll to the end, and look for the PEG (lite) parser.

An example left recursive grammar:

import string

grammar = {
    "<start>": [ ["<E>"] ],
    "<E>": [
        ["<E>", "+", "<E>"],
        ["<E>", "-", "<E>"],
        ["<E>", "*", "<E>"],
        ["<E>", "/", "<E>"],
        ["(", "<E>", ")"],
        ["<digits>"],
    ],
    "<digits>": [["<digits>", "<digit>"], ["<digit>"]],
    "<digit>": [[str(i)] for i in string.digits],
}

Given the simplicity of the approach, and the attractiveness of recursive descent parsing, I am pretty sure that either it was noticed before (published elsewhere) or the approach has some problems I am overlooking (other than performance, which I ignore for now). Can anyone tell me if this is already well known, or if there are problems with this approach?

Note: The context-free parsers are only recognizers for now (they don’t yet produce a parse tree).
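To give a rough feel for how the bound works, here is a tiny recognizer sketch (simplified, my own illustration, not the linked implementation): it threads a length budget through each expansion, and a nonterminal is only expanded if the shortest string it and its pending right siblings still need fits in that budget. Since each left-recursive step subtracts the siblings' minimum lengths from the budget, the descent bottoms out.

```python
def compute_min(grammar):
    # Fixpoint iteration: shortest string each nonterminal can derive.
    cost = {k: float("inf") for k in grammar}
    changed = True
    while changed:
        changed = False
        for key, alts in grammar.items():
            best = min(sum(cost[t] if t in grammar else len(t) for t in alt)
                       for alt in alts)
            if best < cost[key]:
                cost[key], changed = best, True
    return cost

def recognize(grammar, key, text):
    cost = compute_min(grammar)

    def sym_cost(t):
        return cost[t] if t in grammar else len(t)

    def parse_key(k, at, budget):
        # Refuse to expand if even the shortest parse of k cannot fit in the
        # budget -- this is what bounds the left-recursive descent.
        if cost[k] > budget:
            return []
        results = []
        for alt in grammar[k]:
            results += parse_alt(alt, at, budget)
        return results  # all end positions reachable by parsing k from `at`

    def parse_alt(alt, at, budget):
        positions = [at]
        for i, t in enumerate(alt):
            # Reserve the minimum lengths of the remaining siblings, so the
            # budget passed into a recursive call shrinks on every nesting.
            rest = sum(sym_cost(s) for s in alt[i + 1:])
            new_positions = []
            for p in positions:
                if t in grammar:
                    new_positions += parse_key(t, p, budget - (p - at) - rest)
                elif text[p:p + len(t)] == t:
                    new_positions.append(p + len(t))
            positions = new_positions
        return positions

    return len(text) in parse_key(key, 0, len(text))

grammar = {
    "<E>": [["<E>", "+", "<E>"], ["(", "<E>", ")"], ["<digit>"]],
    "<digit>": [[str(i)] for i in "0123456789"],
}
print(recognize(grammar, "<E>", "1+2+3"))  # True
print(recognize(grammar, "<E>", "1+"))     # False
```

Note that this sketch assumes every recursive alternative has siblings with positive minimum length; nullable symbols or unit rules like A := A would leave the budget unchanged, which is exactly the issue raised in the discussion below.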

  1. 2

    Interesting, so if you have a rule like

    A := A B

    Then when you try to parse an A, the program goes:

    1. I’m going to parse an A
    2. I’m going to parse an A, then a B
    3. I’m going to parse an A, then a B B
    4. I’m going to parse an A, then a B B B

    … 99. I’m going to parse an A, then a B B B … B, but any B B B … B is longer than my entire input, so fail.

    It seems like it terminates as long as that B B B suffix grows on every recursive call. What happens if you have a rule like this?

    A := A

    Or, what if B’s minimum length is zero, so the pending B B B … B’s minimum length is also zero?

    1. 1

      Great observation, thank you! You are right that it relies on progress on every recursive call. It seems I have to think more about handling the nullable (minimum length 0) nonterminals, and recursion without progress.

      (Any suggestions to handle it would be welcome too).

      Edit: I think that if I eliminate empty strings (epsilons) from the grammar, I can detect recursion such as A := A or indirect ones such as A := B, B := A and mark them as infinite parses or skip such rules.
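      A possible sketch of that check (helper name and demo grammar are hypothetical, my own): after epsilon elimination, I think an expansion can fail to make progress only when an alternative is a single nonterminal, so rules like A := A, or A := B with B := A, show up as cycles in the graph of such unit alternatives.

```python
def unit_cycles(grammar):
    # Edge A -> B for every alternative of A that is exactly [B]; a
    # nonterminal that can reach itself along these edges can recurse
    # without consuming input.
    edges = {k: {alt[0] for alt in alts
                 if len(alt) == 1 and alt[0] in grammar}
             for k, alts in grammar.items()}

    def reachable(start):
        seen, stack = set(), [start]
        while stack:
            n = stack.pop()
            for m in edges[n]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return seen

    return {k for k in grammar if k in reachable(k)}

g = {"<A>": [["<B>"]], "<B>": [["<A>"], ["b"]], "<C>": [["c"]]}
print(unit_cycles(g))  # the mutually recursive pair <A>, <B>
```

      Rules flagged this way could then be dropped or marked as infinite parses, as suggested above.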

    2. 2

      I would try posting this to the comp.compilers newsgroup. It has been a while since I read it regularly, but posters were pretty good at citing sources.

      1. 2

        The main things that I note about this solution are:

        1. At each point you run into a left-recursive parse you have to keep recurring until your recursion depth gets to the length limit. So you get an O(input-length) extra complexity factor.
        2. You need to know the size of the input. So you can’t use it for streaming parsers.

        It’s a cool and simple idea, but unless your inputs will all be short, on its own it’s not very usable because of the extra O(l) factor.

        Shameless self-promotion regarding methods of resolving left-recursion:

        I’m actually currently writing a paper about a novel way of handling left-recursion and an expressive parsing system built with it: parsing with delimited continuations. The implementation is here if you’re curious, though it doesn’t have documentation of any kind yet aside from my in-progress academic paper draft. But the basic idea is that when you are about to enter a left-recursive infinite loop… just don’t! Instead, capture the continuation of the current parser, set it aside, and choose a different alternate to work on. Once you have a result from some other alternate, you can feed it to the old continuation you captured and make progress on it. When you get to a point where you can’t make progress that resolves a left-recursion dependency cycle, the system can just feed it a failure result that stops the recursion.

        The full system is basically an extended GLL, but instead of the primitives being literals, alternation, and concatenation, the primitives are alternation and procedures that take the input stream as an argument. Using those you can build up your usual suspects of parser combinators, a BNF interface, etc. But since you can use arbitrary parsing procedures you can have productions that call pre-existing parsers built with other frameworks, dynamically extend your parser at run-time for ad-hoc language composition, filter away ambiguity, do data-dependent and other kinds of non-context-free parsing, etc. It’s basically the bee’s knees for expressive parsing.

        On the other hand, my implementation is currently, uh, a bit too slow for any practical use. I’m hoping I can optimize it a bit more – with a factor of 10 improvement it’ll be worthwhile for Racket language projects where you care enough about parser composability and extensibility that you don’t mind the parser taking as much time as the macro expander. If I could squeeze out a factor of 100 I think it would be an easy sell to the Racket community as a go-to parser for writing new languages, since at that point it would only be a small fraction of the overall compilation time. I think 10 is possible, but I’m not holding my breath for 100 without a major re-design of my implementation.

        If you want the full details, you can email me for a (still fairly rough) paper draft. Or you can simply hope it gets accepted for OOPSLA 2020 ;).

        1. 2

          Thank you for the insights, and indeed, you are right about the O(l) factor, and the fact that this cannot be used for a streaming parser.

          I would love to get a draft (sent a message to your account with my email).

          I had a simple GLL-like implementation in Python here. It doesn’t contain any noteworthy idea other than prioritizing the alternatives based on how far along they are in parsing, and how deeply nested they are (so left-recursive items have the least priority).

        2. 2

          Yes, the idea of stopping left-recursion at run-time has been done before. I came up with it myself (using a different stopping condition than yours) and found out that others had published it before, see for instance: https://github.com/PhilippeSigaud/Pegged/wiki/Left-Recursion#bounded-left-recursion That’s a clear overview and includes references to a couple of academic papers.

          1. 1

            Thank you for the link. Much appreciated. I had looked at papers by Medeiros and Laurent before, but I thought that they were fairly complex to implement. My attraction to the stopping condition here is purely its simplicity (which matches with the simplicity of the top down recursive descent approach).

          2. 1

            Left recursion and PEGs is a surprisingly hard topic. The first algorithm I know of in this area, by Warth et al., can produce incorrect parses (I wrote that up here).

            I found it impossible to understand the output of your parser, so I quickly added this pretty printer so that I could see the tree structure more easily:

            def pp(t, i=0):
              for x in t:
                print("%s%s" % (" " * i, x[0]))
                for y in x[1:]:
                  pp(y, i + 1)

            With that, I can see that your algorithm does correctly parse the example that Warth et al. have a problem with (1-2-3), but I’m not sure how you encode precedence (2+3*4 is not parsed correctly, no matter how I reorder the productions in E), so I’m not quite sure how fair a comparison this is.

            Anyway, I wish you luck in your quest! This stuff made my head hurt so much that I gave up on PEGs entirely and moved on to LR parsing.

            1. 1

              Thank you for the link! I haven’t really thought about precedence as encoded in the PEG order, but I see that it is a major issue.