1. 2

    I’ve stumbled upon Lark recently and started playing with it. I wrote a grammar and a piece of text that should be parsable with it. That piece of text fails to parse, though, and I’m left with the position where the error occurred and which tokens were expected. If the grammar were complete/correct, that would nicely tell me where my text’s syntax is wrong; however, since I’m just writing the grammar and I’m sure the text is correct, I’d like to debug the grammar instead. Unfortunately, Lark doesn’t seem to help me there. I’m a beginner when it comes to parsers, so I don’t really know how best to proceed, though a dump of the parser state would probably help, right? Can I get that out of Lark somehow?

    1. 1

      I do want to make debugging grammars easier. But that’s a bit like asking the Python interpreter to guess what lines of code you should write.

      Perhaps you can post or PM me your attempt (code + input), and we’ll take it from there.

    1. 2

      How are the error messages it creates? One of the big weaknesses of *parsec in Haskell is how inscrutable the error messages can get.

      1. 2

        Lark provides you with the line & column in the text where the error occurred (it counts them automatically), and also which input it expected.

        You can see for yourself how Lark utilizes this information to provide useful errors when users make mistakes in the grammar: https://github.com/erezsh/lark/blob/master/lark/load_grammar.py#L593

        Of course, it’s not always clear what the error is, or what the best way to solve it is. I am open to hearing about ways I can improve the error messages.
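        The line/column bookkeeping mentioned above (“it counts them automatically”) is easy to sketch in plain Python. The function names here are illustrative, not Lark’s actual API:

```python
def line_and_column(text, pos):
    """Translate a 0-based character offset into 1-based line/column."""
    line = text.count("\n", 0, pos) + 1
    last_newline = text.rfind("\n", 0, pos)
    column = pos - last_newline  # rfind returns -1 on line 1, which still works
    return line, column

def format_error(text, pos, expected):
    """Render the kind of message a parser gives on failure."""
    line, col = line_and_column(text, pos)
    return "Unexpected input at line %d, column %d. Expected one of: %s" % (
        line, col, ", ".join(sorted(expected)))

# The '?' sits at offset 7, i.e. line 2, column 3:
assert line_and_column("a = \nb ?", 7) == (2, 3)
```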

        1. 3

          That’s a great start. Some helpers for displaying the line, and perhaps the relevant text span, would be useful. Another useful thing would be displaying which grammar rule the parse failed within. Many tools support this in principle, but the API could do more to encourage it.

          1. 2

            Good ideas!

      1. 4

        Very nice!

        It includes unique algorithms that I invented (afaik), like a contextual-lexer for LALR, and a dynamic lexer for Earley.

        Do you have a high level description of what the algorithm is somewhere?

        It can generate a stand-alone parser.

        What do you mean by this? Aren’t most parsers parser generators? (In fact, mine is the only one I’m aware of that isn’t a parser generator.)

        (* PEGs cannot handle non-deterministic grammars. Also, according to Wikipedia, it remains unanswered whether PEGs can really parse all deterministic CFGs)

        I’m not sure about this statement. You could just as well say CFGs can’t handle PEGs because of the lack of a choice operator (or rather that CFGs have unpredictable outputs on non-deterministic grammars, for inputs with more than one parse).

        Notice punctuation doesn’t appear in the resulting tree. It’s automatically filtered away by Lark. Build a parse-tree automagically, no construction code required

        Because of these statements, I looked at python_parser.py

        >>> from python_parser import *
        >>> python_parser2.parse('a >= b\n') == python_parser2.parse('a == b\n')
        True
        

        One statement is >= and the other is ==, but the information is lost.

        >>> python_parser2.parse('a >= b\n')
        Tree(expr_stmt, [Tree(comparison, [Token(NAME, 'a'), Tree(comp_op, []), Token(NAME, 'b')])])
        >>> python_parser2.parse('a == b\n')
        Tree(expr_stmt, [Tree(comparison, [Token(NAME, 'a'), Tree(comp_op, []), Token(NAME, 'b')])])
        

        Lark looks pretty good otherwise.

        1. 6

          Hi, thanks for your comments!

          Do you have a high level description of what the algorithm is somewhere?

          I kept postponing it because I wanted to write a lengthy blog post, but you’re right, and I will add an explanation. Meanwhile, here it is in a couple of sentences:

          • Earley is a chart parser, which means it goes through the input letter-by-letter, which is very slow. You can use a lexer, but then you lose a lot of the ambiguity in the text. My improvement is a “skipping” chart parser, which lets Earley match regexps, to the same effect as scanless parsing but much faster in the common case.

          • The contextual lexer communicates with the LALR(1) parser and uses the parser lookahead prediction to narrow its choice of tokens. So at each point, the lexer only matches the terminals that are legal at that parser state, instead of all of the terminals (which is what yacc & ply do). It’s surprisingly effective at resolving common terminal collisions.
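          The contextual-lexer idea can be sketched with a toy, illustrative example (this is not Lark’s implementation): the lexer is handed the set of terminals legal in the parser’s current state and tries only those, so a classic keyword/identifier collision resolves itself:

```python
import re

# Toy terminal definitions with a classic collision: "if" matches
# both the IF keyword and the NAME pattern.
TERMINALS = {
    "IF":   r"if",
    "NAME": r"[a-z]+",
}

def contextual_token(text, pos, allowed):
    """Match only terminals legal in the current parser state,
    preferring the longest match (ties go to definition order)."""
    best = None
    for name in TERMINALS:  # dict order = definition order
        if name not in allowed:
            continue
        m = re.match(TERMINALS[name], text[pos:])
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best

# In a state where only NAME is legal, "if" lexes as a NAME instead
# of colliding with the keyword:
assert contextual_token("if", 0, {"NAME"}) == ("NAME", "if")
# Where both are legal, longest-match still lets "iffy" be a NAME:
assert contextual_token("iffy = 1", 0, {"IF", "NAME"}) == ("NAME", "iffy")
```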

          Aren’t most parsers parser generators

          The most common Python parsers aren’t. Namely PLY, pyparsing and the rest. In the sense that you always need the entire library to build a parser. I agree it’s not innovative in the grand scheme of the parser world, but it’s pretty unique in the Python landscape.

          CFGs have unpredictable outputs

          Lark lets you control that with weights, and you can also get a parse tree that contains all of the ambiguities, for you to trim as you see fit.

          CFGs can’t handle PEGs because of the lack of a choice operator

          I’m not sure why you think that. The choice operator is an “or” operator that forces determinism. But when your parser can handle non-determinism, it’s not necessary. You can make your “choice” later, after the parse is complete.

          The choice operator does have advantages in terms of efficiency. If enough users ask for it, I will add it as a feature (it’s entirely possible to add to the Earley parser).

          python_parser.py

          These are the comments at the header of the grammar:

          // This grammar should parse all python 2.x code successfully,
          // but the resulting parse-tree is still not well-organized.
          

          There are multiple ways to keep the operator, for example to define comp_op like this:

          !comp_op: "<" | "==" | ...
          

          I did fix that in the Python3 grammar, but there are many small details like that. I just never got around to it, it’s just an example in the examples folder.
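          The effect of that missing ! can be sketched in a few lines; this is a toy model of token filtering, not Lark’s actual tree-builder (the __ANON naming is just an illustrative convention for unnamed punctuation tokens):

```python
def build_children(tokens, keep_all=False):
    """Drop anonymous punctuation tokens unless the rule opts out
    (mimicking a rule marked with '!')."""
    if keep_all:
        return [text for name, text in tokens]
    return [text for name, text in tokens if not name.startswith("__ANON")]

# comp_op's operators were anonymous, so '>=' and '==' both filter to []:
assert build_children([("__ANON_0", ">=")]) == []
# With the '!'-style opt-out, the operator survives into the tree:
assert build_children([("__ANON_0", ">=")], keep_all=True) == [">="]
```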

          Thanks again for your input.

          1. 3

            “skipping” chart parser, which lets Earley match regexps

            Thanks for the description. I do hope you write that blog post at some point.

            So it can match either a regexp or a character (instead of just a character)?

            On a side note, I always found it weird to use regexps in parsing, because regexp matching is already a parsing task, so effectively a sub-parser is used for part of the grammar. There’s no reason not to do this, of course.

            contextual lexer communicates with the LALR(1) parser

            Oh so you are doing both steps at once (or sort of alternating between them).

            The most common Python parsers aren’t. Namely PLY, pyparsing and the rest. In the sense that you always need the entire library to build a parser. I agree it’s not innovative in the grand scheme of the parser world, but it’s pretty unique in the Python landscape.

            Oh, I see. I never looked at those two in more detail. Maybe I was thinking of Parsimonious or Parsley? It definitely felt like they were essentially generating Python code although it may be encoded in some form or other.

            Lark lets you control that with weights, and you can also get a parse tree that contains all of the ambiguities, for you to trim as you see fit.

            Weights? Sounds pretty interesting!

            You can make your “choice” later, after the parse is complete.

            Yes, that’s true. I’ve never seen it stated that way.

            The choice operator does have advantages in terms of efficiency. If enough users ask for it, I will add it as a feature (it’s entirely possible to add to the Earley parser).

            How does that work with Earley parsing? I probably haven’t thought about this as much as you have, but it’s not obvious to me how to keep track of multiple alternatives.

            I don’t think a lot of people will ask for this as a feature mind you. I’m more curious about how that can be done.

            These are the comments at the header of the grammar: // This grammar should parse all python 2.x code successfully, // but the resulting parse-tree is still not well-organized.

            Yeah, when I saw that, I thought it meant the tree could be shaped differently from what the ast module would produce, but would still be an AST that you could, say, pretty-print the parsed Python source code with.

            !comp_op: “<” | “==” | …

            Ok, that looks like a pretty simple fix, so it isn’t really an issue then. I didn’t find anything about the prefixes ! and ? in the documentation, though. Maybe I just looked too fast.

            1. 4

              I do hope you write that blog post at some point.

              I’m sure I will. And your expressed interest makes it even more likely!

              Yes, it can match regexps. And characters are just very short regexps :)

              I always found it weird to use regexp in parsing

              It actually makes a lot of sense. A stronger parser is always slower than a weaker one, so (string match > regexp > LALR > Earley). Splitting the parser into several layers lets us pick the low-hanging fruit faster, and use the heavy lifting only when necessary. In fact, I believe it’s possible to make the chart-skipping Earley parser run LALR(1) on the non-ambiguous subgrammars of what it’s trying to parse, so it will have 3 layers. It’s a little more tricky than regexps, but hopefully I’ll manage to write it some day.

              So you are doing both steps at once

              Actually, it’s the same classic architecture that both Yacc and PLY use: read token, advance parser state, read token, advance parser state, etc.

              The only difference is that now, before the lexer reads a token, it asks the parser what state it’s in, and tests only the relevant subgroup of terminals (which are organized into groups beforehand, of course).

              I don’t know how nobody thought of that before :)

              Weights? Sounds pretty interesting!

              It’s still a little primitive: you have to use integers, and the higher one gets chosen. But it’s used rarely enough that it’s okay for now. And I do sum up the descendants, so weights propagate up the grammar.

              There is, however, more than one weight-selection algorithm. The other one was contributed by a guy who used it for NLP.
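              The scheme described (integer weights on rules, summed over descendants, highest total wins) can be sketched roughly like this; the Tree class here is illustrative, not Lark’s:

```python
class Tree:
    def __init__(self, rule, children, weight=0):
        self.rule = rule
        self.children = children  # child Trees or plain token strings
        self.weight = weight      # integer priority attached to this rule

def total_weight(tree):
    """Sum a tree's own weight with all of its descendants' weights."""
    if not isinstance(tree, Tree):
        return 0  # plain tokens carry no weight
    return tree.weight + sum(total_weight(c) for c in tree.children)

def resolve_ambiguity(alternatives):
    """Pick the parse whose summed weight is highest."""
    return max(alternatives, key=total_weight)

# Two ambiguous parses of the same text; the weighted rule wins:
a = Tree("stmt", [Tree("expr", ["x"], weight=0)])
b = Tree("stmt", [Tree("decl", ["x"], weight=1)])
assert resolve_ambiguity([a, b]).children[0].rule == "decl"
```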

              it’s not obvious to me how to keep track of multiple alternatives.

              Actually, Earley already keeps track of multiple alternatives. The hard part is to make it consider only one! So I think the way I would do it is that when I see a choice operator, I would only let Earley see the first option, and I’d mark the spot with a “checkpoint”. That can happen many times during the parse. Then, when the parse fails (either immediately or when it reaches the end), I would backtrack to the most recent checkpoint and yield the second option, if any, or backtrack some more. The chart parser will have to be modified to support incremental parsing, but I think there’s nothing preventing it from doing so.

              It’s only efficient if used correctly, of course, because the worst case performance is atrocious. But I think PEGs have the same problem, no?
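              Independent of Earley’s chart internals, the checkpoint-and-backtrack idea above can be sketched as a depth-first search over choice points; everything here is illustrative:

```python
def parse_with_choices(try_parse, choices):
    """Commit to the first alternative at each choice point; on failure,
    backtrack to the most recent checkpoint with untried alternatives.

    `choices` maps choice-point names to ordered alternatives;
    `try_parse` gets a {name: alternative} dict and returns a parse
    result, or None on failure."""
    names = list(choices)
    picks = {}

    def descend(i):
        if i == len(names):
            return try_parse(dict(picks))
        for alt in choices[names[i]]:  # first option first, as described
            picks[names[i]] = alt      # this loop frame is the "checkpoint"
            result = descend(i + 1)
            if result is not None:
                return result          # success: commit, stop backtracking
        return None                    # exhausted: backtrack further up

    return descend(0)

# Toy "grammar" with one choice point: parsing succeeds only with the
# second alternative, so the driver backtracks exactly once.
def try_parse(picks):
    return "ok" if picks["op"] == "==" else None

assert parse_with_choices(try_parse, {"op": [">=", "=="]}) == "ok"
```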

              I didn’t find anything about the prefixes ! and ? in the documentation though.

              The ? prefix is mentioned here: https://github.com/erezsh/lark/wiki/Tree-Construction

              But I did forget about the ! prefix. I’ll add it, thanks!!

        1. 18

          Some feedback:

          I have seen many oil shell posts, but still don’t know what the heck the actual OIL language looks like.

          1. 4

            OK thanks, maybe I should link these at the very top of the “A Glimpse of Oil” post:

            http://www.oilshell.org/blog/tags.html?tag=osh-to-oil#osh-to-oil

            They are linked somewhere in the middle, which is probably easy to miss.

            It’s sort of on purpose, since Oil isn’t implemented yet, as I mention in the intro. But I think those posts give a decent idea of what it looks like (let me know if you disagree).

            1. 7

              I’ve seen your posts and hat around and never really understood what Oil was really about, but this link is really wonderful. The comparison to shell, the simplifications, the 4 different languages of shell vs the two of Oil, it all really clicked. Really cool project.

              1. 3

                I agree with the others. Until I see what your vision for the language is, I’m not motivated to get involved.

                The only example you give contains “if test -z $[which mke2fs]”, which can’t be what you’re aiming at.

                IMHO If you really want Oil to be easy to use, you should take as much syntax from Python or Javascript as you can. And use similar semantics too.

                1. 11

                  I’m willing to be convinced that a new syntax would be better for shell programming.

                  I’m not very confident that moving towards an existing non-shell scripting language will get us there.

                  The problem I have with writing shell programs in some non-shell language is that I expect to keep using the same syntax on the command line as I do in scripts I save to disk, and non-shell languages don’t have the things that make that pleasant. For example, a non-shell language has a fixed list of “words” it knows about, and using anything not on that list is a syntax error. That’s great in Python, where such a word is almost certainly a spelling error, but in a shell, most words are program names and I don’t want my shell constantly groveling through every directory in my $PATH so it knows all my program names before I try to use them.

                  I’ve also never seen a non-shell language of any type with piping and command substitution as elegant as bash and zsh, but I’m willing to be convinced. I’m afraid, though, anyone in the “Real Language” mindset would make constructions such as diff <(./prog1 -a -b) <(./prog1 -a -c) substantially more verbose, losing one of the main reasons we have powerful shells to begin with.

                  1. 3

                    Yes it has to be a hybrid. I talk a little about “command vs expression” mode in the post. I guess you’ll have to wait and see, but I’m aware of this and it’s very much a central design issue.

                    Of course “bare words” behave in Oil just as they do in bash, e.g.

                    echo hi
                    ls /
                    

                    I will not make you type

                    run(["echo", "hi"])
                    

                    :-)

                    One of the reasons I reimplemented bash from scratch is to be aware of all the syntactic issues. Process substitution should continue to work. In fact I’ve been contemplating this “one line” rule / sublanguage – that is, essentially anything that is one line in shell will continue to work.

                    Also, OSH and Oil will likely be composed, and OSH already implements the syntax you are familiar with. This is future work so I don’t want to promise anything specific, but I think it’s possible to get the best of both worlds – familiar syntax for interactive use and clean syntax for maintainable programs.

                    1. 1

                      For example, a non-shell language has a fixed list of “words” it knows about, and using anything not on that list is a syntax error. That’s great in Python, where such a word is almost certainly a spelling error, but in a shell, most words are program names and I don’t want my shell constantly groveling through every directory in my $PATH so it knows all my program names before I try to use them.

                      tclsh is an interesting example of not having this problem.

                      I’m afraid, though, anyone in the “Real Language” mindset would make constructions such as diff <(./prog1 -a -b) <(./prog1 -a -c) substantially more verbose, losing one of the main reasons we have powerful shells to begin with.

                      You get constructs that look like this in pipey libraries for functional languages (the likes of fs2 or conduit), though they’re controversial.

                      1. 1

                        Well put. There’s also loads of muscle memory built up that is hard to leave behind. That point keeps me off of fish; I like almost everything else about it, but I don’t see why it can’t have a separate bash-like syntax.

                      2. 2

                        OK that’s fair. I’m on the fence about outside contributions – some people have contributed, but I think most projects have more users and more “bones” before getting major contributions. I’m really looking for people to test OSH on real shell scripts, not necessarily adopt it or contribute. (although if you can figure out the code, I applaud you and encourage your contributions :) )

                        As I mention in the post, the OSH language is implemented (it runs real shell scripts), but Oil isn’t.

                        There will be a different way to test if a string is empty, but as for the auto-conversion: if you have [ -z foo ], it will become test -z foo. The auto-conversion is going to make your script RUN, not make it idiomatic.
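                        That kind of mechanical rewrite can be sketched with a regex; the real converter works on a parse tree, so this is only an illustration:

```python
import re

def convert_test_brackets(line):
    """Illustration only: rewrite '[ ... ]' into 'test ...'.
    (A real osh-to-oil converter operates on the parse tree.)"""
    return re.sub(r"\[\s+(.*?)\s+\]", r"test \1", line)

assert convert_test_brackets("if [ -z foo ]; then") == "if test -z foo; then"
```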

                        As far as appearance, you can definitely think of Oil as a hybrid between shell and Python/JavaScript.

                        I can probably write up a cheatsheet for those curious. I haven’t really done so because it feels like promising something that’s not there. But since I’ve written so many blog posts, it might be worth showing something in the style of:

                        https://learnxinyminutes.com/docs/bash/

                    2. 0

                      Yes and I don’t think I’ll care about it until I do. It could look like APL for all we know.