1. 36

I’ll concede that “most innovative” is a bold claim. But here’s why I believe it:

  • Lark lets you switch between 3 parsing algorithms with the same grammar (Earley, LALR, CYK)

  • It includes unique algorithms that I invented (afaik), like a contextual lexer for LALR, and a dynamic lexer for Earley.

  • It constructs a usable parse-tree without any additional code, only the grammar (see the sketch below).

  • It can generate a stand-alone parser.

And there are many more tiny details, which few if any other parsing libraries offer.
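For illustration, here is a minimal sketch of that workflow using Lark's public API; the toy grammar and rule names are mine, not from the project's docs:

from lark import Lark

# The grammar is all you write; Lark builds the tree for you.
parser = Lark(r"""
    start: NAME "=" NUMBER
    NAME: /[a-z]+/
    NUMBER: /\d+/
    %ignore " "
""", parser='lalr')  # the same grammar also runs with parser='earley'

# Prints a 'start' node holding the two tokens; the "=" punctuation
# is filtered out of the tree automatically.
print(parser.parse("answer = 42").pretty())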

  2. 4

    Very nice!

    It includes unique algorithms that I invented (afaik), like a contextual lexer for LALR, and a dynamic lexer for Earley.

    Do you have a high level description of what the algorithm is somewhere?

    It can generate a stand-alone parser.

    What do you mean by this? Aren’t most parsers parser generators? (In fact, mine is the only one I’m aware of that isn’t a parser generator.)

    (* PEGs cannot handle non-deterministic grammars. Also, according to Wikipedia, it remains unanswered whether PEGs can really parse all deterministic CFGs)

    I’m not sure about this statement. You could just as well say CFGs can’t handle PEGs because of the lack of a choice operator (or rather, that CFGs have unpredictable outputs on non-deterministic grammars, for inputs with more than one parse).

    Notice punctuation doesn’t appear in the resulting tree. It’s automatically filtered away by Lark. Build a parse-tree automagically, no construction code required

    Because of these statements, I looked at python_parser.py

    >>> from python_parser import *
    >>> python_parser2.parse('a >= b\n') == python_parser2.parse('a == b\n')
    True
    

    One statement is >= and the other is ==, but the information is lost.

    >>> python_parser2.parse('a >= b\n')
    Tree(expr_stmt, [Tree(comparison, [Token(NAME, 'a'), Tree(comp_op, []), Token(NAME, 'b')])])
    >>> python_parser2.parse('a == b\n')
    Tree(expr_stmt, [Tree(comparison, [Token(NAME, 'a'), Tree(comp_op, []), Token(NAME, 'b')])])
    

    Lark looks pretty good otherwise.

    1. 6

      Hi, thanks for your comments!

      Do you have a high level description of what the algorithm is somewhere?

      I kept postponing it because I wanted to write a lengthy blog post, but you’re right, and I will add an explanation. Meanwhile, here it is in a couple of sentences:

      • Earley is a chart parser, which means it goes through the input letter by letter, and that is very slow. You can use a lexer, but then you lose a lot of the ambiguity in the text. My improvement is a “skipping” chart parser, which lets Earley match regexps, to the same effect as scannerless parsing but much faster in the common case.

      • The contextual lexer communicates with the LALR(1) parser and uses the parser’s lookahead prediction to narrow its choice of tokens. So at each point, the lexer only matches the terminals that are legal at that parser state, instead of all of the terminals (which is what Yacc & PLY do). It’s surprisingly effective at resolving common terminal collisions (a rough sketch follows below).
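      To make that concrete, here is a rough sketch of the contextual-lexing mechanism as described above, not Lark’s actual code; all the names are made up:

      import re

      # Sketch of contextual lexing: the lexer only tries the terminals
      # that are legal in the parser's current state, instead of all of
      # them, avoiding collisions between terminals that never co-occur.
      class ContextualLexer:
          def __init__(self, terminals, allowed_by_state):
              # terminals: {name: regex string}
              # allowed_by_state: {parser_state: iterable of terminal names}
              compiled = {name: re.compile(rx) for name, rx in terminals.items()}
              # Group the terminals per parser state ahead of time.
              self.groups = {state: [(name, compiled[name]) for name in names]
                             for state, names in allowed_by_state.items()}

          def next_token(self, text, pos, parser_state):
              # Ask the parser what state it's in; test only that subgroup.
              for name, pattern in self.groups[parser_state]:
                  m = pattern.match(text, pos)
                  if m:
                      return name, m.group(), m.end()
              raise SyntaxError("no legal terminal at position %d" % pos)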

      Aren’t most parsers parser generators

      The most common Python parsers aren’t. Namely PLY, pyparsing and the rest. In the sense that you always need the entire library to build a parser. I agree it’s not innovative in the grand scheme of the parser world, but it’s pretty unique in the Python landscape.

      CFGs have unpredictable outputs

      Lark lets you control that with weights, and you can also get a parse tree that contains all of the ambiguities, for you to trim as you see fit.
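      For the curious, a sketch of what that looks like; ambiguity='explicit' is a real Lark option (check the docs for your version), and the toy grammar is mine:

      from lark import Lark

      # Toy ambiguous grammar: "1 + 2 + 3" can group either way.
      parser = Lark(r"""
          start: expr
          expr: expr "+" expr
              | NUMBER
          NUMBER: /\d+/
          %ignore " "
      """, parser='earley', ambiguity='explicit')

      # The tree contains an "_ambig" node holding both derivations,
      # (1+2)+3 and 1+(2+3), for you to trim as you see fit.
      print(parser.parse("1 + 2 + 3").pretty())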

      CFGs can’t handle PEGs because of the lack of a choice operator

      I’m not sure why you think that. The choice operator is an “or” operator that forces determinism. But when your parser can handle non-determinism, it’s not necessary. You can make your “choice” later, after the parse is complete.

      The choice operator does have advantages in terms of efficiency. If enough users ask for it, I will add it as a feature (it’s entirely possible to add to the Earley parser).

      python_parser.py

      These are the comments at the header of the grammar:

      // This grammar should parse all python 2.x code successfully,
      // but the resulting parse-tree is still not well-organized.
      

      There are multiple ways to keep the operator, for example defining comp_op like this:

      !comp_op: "<" | "==" | ...
      

      I did fix that in the Python 3 grammar, but there are many small details like that which I never got around to; it’s just an example in the examples folder.
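      As a quick sanity check of that fix (with a cut-down toy grammar of mine, not the real Python one): with the ! prefix, the operator token is kept inside comp_op, so the two statements no longer parse to equal trees:

      from lark import Lark

      parser = Lark(r"""
          start: NAME comp_op NAME
          !comp_op: "<" | "<=" | "==" | ">=" | ">"
          NAME: /[a-z]+/
          %ignore " "
      """)

      # The "!" prefix keeps the matched operator token in the tree:
      assert parser.parse("a >= b") != parser.parse("a == b")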

      Thanks again for your input.

      1. 3

        “skipping” chart parser, which lets Earley match regexps

        Thanks for the description. I do hope you write that blog post at some point.

        So it can match either a regexp or a character (instead of just a character)?

        On a side note, I always found it weird to use regexps in parsing, because regexp matching is already a parsing task, so in effect a sub-parser is used for part of the grammar. There’s no reason not to do this, of course.

        contextual lexer communicates with the LALR(1) parser

        Oh so you are doing both steps at once (or sort of alternating between them).

        The most common Python parsers aren’t. Namely PLY, pyparsing and the rest. In the sense that you always need the entire library to build a parser. I agree it’s not innovative in the grand scheme of the parser world, but it’s pretty unique in the Python landscape.

        Oh, I see. I never looked at those two in more detail. Maybe I was thinking of Parsimonious or Parsley? It definitely felt like they were essentially generating Python code although it may be encoded in some form or other.

        Lark lets you control that with weights, and you can also get a parse tree that contains all of the ambiguities, for you to trim as you see fit.

        Weights? Sounds pretty interesting!

        You can make your “choice” later, after the parse is complete.

        Yes, that’s true. I’ve never seen it stated that way.

        The choice operator does have advantages in terms of efficiency. If enough users ask for it, I will add it as a feature (it’s entirely possible to add to the Earley parser).

        How does that work with Earley parsing? I probably haven’t thought about this as much as you have, but it’s not obvious to me how to keep track of multiple alternatives.

        I don’t think a lot of people will ask for this as a feature, mind you. I’m more curious about how it can be done.

        These are the comments at the header of the grammar: // This grammar should parse all python 2.x code successfully, // but the resulting parse-tree is still not well-organized.

        Yeah, when I saw that I thought it meant the tree could be shaped differently from what the ast module would produce, but would still be an AST that you could, say, pretty-print the parsed Python source code with.

        !comp_op: "<" | "==" | ...

        Ok, that looks like a pretty simple fix, so it isn’t really an issue then. I didn’t find anything about the prefixes ! and ? in the documentation though. Maybe I just looked too fast.

        1. 4

          I do hope you write that blog post at some point.

          I’m sure I will. And your expressed interest makes it even more likely!

          Yes, it can match regexps. And characters are just very short regexps :)

          I always found it weird to use regexp in parsing

          It actually makes a lot of sense. A stronger parser is always slower than a weaker one, so in terms of speed: string match > regexp > LALR > Earley. Splitting the parser into several layers lets us pick the low-hanging fruit first, and use the heavy lifting only when necessary. In fact, I believe it’s possible to make the chart-skipping Earley parser run LALR(1) on the non-ambiguous subgrammars of what it’s trying to parse, so it will have three layers. It’s a little trickier than regexps, but hopefully I’ll manage to write it some day.

          So you are doing both steps at once

          Actually, it’s the same classic architecture that both Yacc and PLY use: read token, advance parser state, read token, advance parser state, etc.

          The only difference is that now, before the lexer reads a token, it asks the parser what state it’s in, and tests only the relevant subgroup of terminals (which are organized into groups beforehand, of course).

          I don’t know how nobody thought of that before :)

          Weights? Sounds pretty interesting!

          It’s still a little primitive: you have to use integers, and the higher one gets chosen. But it’s used rarely enough that it’s okay for now. And I do sum up the descendants, so weights propagate up the grammar.

          There is, however, more than one weight-selection algorithm. The other one was contributed by a guy who used it for NLP.
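          For reference, a sketch of what the weights look like in the grammar; the “.N” rule-priority suffix is real Lark syntax (it works with the Earley parser), while the toy grammar is mine:

          from lark import Lark

          # Both rules match "x", so the parse is ambiguous; under the
          # default ambiguity resolution, the higher-priority rule wins.
          parser = Lark(r"""
              start: a | b
              a.2: "x"
              b: "x"
          """, parser='earley')

          print(parser.parse("x").pretty())  # resolves to rule "a"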

          it’s not obvious to me how to keep track of multiple alternatives.

          Actually, Earley already keeps track of multiple alternatives. The hard part is to make it consider only one! So I think the way I would do it is: when I see a choice operator, I would only let Earley see the first option, and I’d mark the spot with a “checkpoint”. That can happen many times during the parse. Then, when the parse fails (either immediately or when it reaches the end), I would backtrack to the most recent checkpoint and yield the second option, if any, or backtrack some more. The chart parser will have to be modified to support incremental parsing, but I think there’s nothing preventing it from doing so.

          It’s only efficient if used correctly, of course, because the worst case performance is atrocious. But I think PEGs have the same problem, no?
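          As a toy illustration of that search order (none of the chart-parser machinery; a real implementation would resume the chart incrementally instead of re-running the whole parse):

          # Depth-first over choice points: try the first option at every
          # checkpoint, and on failure backtrack to the most recent one
          # that still has untried alternatives. Worst case: exponential.
          def parse_with_choices(attempt, n_options):
              # attempt(selection) returns a tree or raises SyntaxError;
              # n_options[i] is the number of alternatives at choice i.
              def rec(selection):
                  depth = len(selection)
                  if depth == len(n_options):
                      return attempt(tuple(selection))
                  for option in range(n_options[depth]):  # checkpoint
                      try:
                          return rec(selection + [option])
                      except SyntaxError:
                          continue  # backtrack: try the next option
                  raise SyntaxError("exhausted choice %d" % depth)
              return rec([])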

          I didn’t find anything about the prefixes ! and ? in the documentation though.

          The ? prefix is mentioned here: https://github.com/erezsh/lark/wiki/Tree-Construction

          But I did forget about the ! prefix. I’ll add it, thanks!!

    2. 3

      thanks for sharing this!

      1. 2

        The support for standalone parsers is really tantalizing! I wasted quite some time on this problem a few years back, for a bug in the Salt devops tooling that appears to remain unfixed to this day.

        If you’re looking for potent demonstrators of your library (and 9k GitHub stars can’t be wrong ;)), it might be worth your time investigating Salt.

        I’ve long forgotten what the original bug was; however, at the time, things such as prefixing the expression with “not” were totally broken. I wouldn’t be surprised if that’s still the case. That single bug is hardly representative: minion filtering is a core part of the tool, and there are probably hundreds of bugs, logged or otherwise, that a real parser would fix.

        1. 2

          I’ve stumbled upon Lark recently and started playing with it. I wrote a grammar and a piece of text that should be parsable with it. That piece of text fails to parse, though, and I’m left with just the position where the error occurred and which tokens were expected. If the grammar were complete/correct, that would nicely tell me where my parsed text’s syntax is wrong; however, since I’m still writing the grammar and I’m sure the text is correct, I’d like to debug the grammar instead. Unfortunately, Lark doesn’t seem to help me there. I’m a beginner when it comes to parsers, so I don’t really know how best to proceed from here, though a dump of the parser state would probably help, right? Can I get that out of Lark somehow?

          1. 1

            I do want to make debugging grammars easier. But that’s a bit like asking the Python interpreter to guess what lines of code you should write.

            Perhaps you can post or PM me your attempt (code + input), and we’ll take it from there.

          2. 2

            How good are the error messages it creates? One of the big weaknesses of *parsec in Haskell is how inscrutable the error messages can get.

            1. 2

              Lark provides you with the line & column in the text where the error occurred (it counts them automatically), and also what input it expected.

              You can see for yourself how Lark utilizes this information to provide useful errors when users make mistakes in the grammar: https://github.com/erezsh/lark/blob/master/lark/load_grammar.py#L593

              Of course, it’s not always clear what the error is, or what the best way to solve it would be. I am open to hearing about ways I can improve the error messages.
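              For reference, a small sketch of what that looks like from the API side; UnexpectedInput with .line/.column and get_context() exist in Lark, though the exact attributes may vary between versions, and the grammar is a toy:

              from lark import Lark
              from lark.exceptions import UnexpectedInput

              parser = Lark(r"""
                  start: "a" "b"
                  %ignore " "
              """, parser='lalr')

              text = "a c"
              try:
                  parser.parse(text)
              except UnexpectedInput as e:
                  print(e.line, e.column)     # where the error occurred
                  print(e.get_context(text))  # offending line with marker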

              1. 3

                That’s a great start. Some helpers for displaying the line, and perhaps the relevant text span, would help too. Another useful thing would be displaying which grammar rule the parse failed within. Many tools support this in principle, but the API could do more to encourage it.

                1. 2

                  Good ideas!