1. 13
  1.  

  2. 5

    Parsing Models Cheatsheet

    I recently got the Earley parsing religion. Earley is O(n³) in the worst case, just like other general CFG algorithms, but it can handle most sensible CFGs in linear time and space, even with infinite lookahead. No shift/reduce conflicts like yacc, no accidental infinite recursion like PEG parsers, just reliable operation.
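
    To make that concrete, here is a minimal recognizer sketch in Python (my own illustration of the textbook algorithm, not any particular library's code; it omits the Aycock/Horspool nullable fix and Leo's right-recursion optimization, so rules that can derive the empty string are not handled):

        from collections import namedtuple

        # A dotted rule plus the column where it started.
        Item = namedtuple("Item", "lhs rhs dot start")

        def earley_recognize(grammar, start, tokens):
            # grammar: {nonterminal: [tuple_of_symbols, ...]}
            cols = [set() for _ in range(len(tokens) + 1)]
            cols[0] = {Item(start, alt, 0, 0) for alt in grammar[start]}
            for i in range(len(tokens) + 1):
                work = list(cols[i])
                while work:
                    it = work.pop()
                    if it.dot < len(it.rhs):
                        sym = it.rhs[it.dot]
                        if sym in grammar:                          # predict
                            for alt in grammar[sym]:
                                new = Item(sym, alt, 0, i)
                                if new not in cols[i]:
                                    cols[i].add(new); work.append(new)
                        elif i < len(tokens) and tokens[i] == sym:  # scan
                            cols[i + 1].add(it._replace(dot=it.dot + 1))
                    else:                                           # complete
                        for parent in list(cols[it.start]):
                            if parent.dot < len(parent.rhs) and parent.rhs[parent.dot] == it.lhs:
                                new = parent._replace(dot=parent.dot + 1)
                                if new not in cols[i]:
                                    cols[i].add(new); work.append(new)
            return any(it.lhs == start and it.dot == len(it.rhs) and it.start == 0
                       for it in cols[-1])

        # Right recursion is no problem (though without Leo it builds O(n^2) items):
        print(earley_recognize({"A": [("a",), ("a", "A")]}, "A", list("aaaa")))  # True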

    1. 1

      By the way, I think that GLL parsing can be as elegant and concise as Earley parsing. I haven’t gotten around to translating that to Python, though.

      1. 1

        The religion has its appeal… but doesn’t get the job done on its own. Can you recommend some implementations?

        1. 2

          If you are looking for a simple, well-explained Python implementation that includes the related optimizations, such as Aycock’s fix for nullable rules and Leo’s right-recursion optimization, do look at ours. We are writing this as part of a larger book on fuzzing with grammars.

          1. 1

            Thanks! That book looks quite useful. I will share it around.

          2. 1

            The industrial-strength implementation is Marpa, which is a C library. There are bindings for various languages, but the original author’s Perl binding is the most complete.

            Searching for “earley” on nearly any language package manager will probably give you a bunch of implementations of varying quality.

            Naturally, I’m writing my own: as a learning exercise, and because I’ve got Strong Opinions about documentation quality I’d like to demonstrate. I haven’t yet got it to the point of public presentability, but it’s coming along nicely.

            1. 1

              Would love to see your progress. Are you going the SPPF route or are you building the trees after the Earley algorithm has completed?

              Here is our attempt at explaining Earley parsing in Python.

              1. 1

                I currently have zero documentation, but if you’d like to take a look at the source, I put it on sr.ht.

                I confess I don’t know exactly what SPPF actually is, except that Loup Vaillant couldn’t understand Scott’s paper about it, and Jeffrey Kegler independently invented something equivalent which he doesn’t really explain. Basically, I build a parse-tree node for each completed item by stitching together parse-tree nodes for non-terminal symbols that appear in the right order and cover the required span. I happen to do it while the parse is in progress, but I think the same approach would work if done afterward.
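
                To illustrate the “stitching”, here is a rough, simplified sketch of the idea (not my actual code; it assumes an unambiguous parse, i.e. at most one completed rule per (symbol, start, end) span):

                    def build_tree(symbol, start, end, grammar, tokens, completed):
                        # completed: {(symbol, start, end): rhs} for every completed Earley item
                        rhs = completed[(symbol, start, end)]

                        def stitch(pos, syms):
                            # Find child nodes that tile tokens[pos:end] with the symbols in syms.
                            if not syms:
                                return [] if pos == end else None
                            sym, rest = syms[0], syms[1:]
                            if sym in grammar:                      # nonterminal child
                                for (s, lo, hi) in completed:
                                    if s == sym and lo == pos and hi <= end:
                                        tail = stitch(hi, rest)
                                        if tail is not None:
                                            return [build_tree(sym, lo, hi, grammar, tokens, completed)] + tail
                            elif pos < end and tokens[pos] == sym:  # terminal child covers one token
                                tail = stitch(pos + 1, rest)
                                if tail is not None:
                                    return [(sym, tokens[pos])] + tail
                            return None

                        return (symbol, stitch(start, list(rhs)))

                    # e.g. for A -> 'a' | 'a' A over "aa":
                    grammar = {"A": [("a",), ("a", "A")]}
                    completed = {("A", 0, 1): ("a",), ("A", 1, 2): ("a",), ("A", 0, 2): ("a", "A")}
                    print(build_tree("A", 0, 2, grammar, list("aa"), completed))
                    # ('A', [('a', 'a'), ('A', [('a', 'a')])])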

                Thank you very much for that link! I just discovered the hard way that the Leo optimization breaks naïve parse-tree construction, and yours is the first document I’ve seen that seems to acknowledge that’s an issue. Unfortunately, it only mentions it in an exercise, so there’s no worked solution, but at least there’s a hint which is more than I’ve found anywhere else.

                1. 1

                  Unfortunately, it only mentions it in an exercise,

                  The solution is there! Look at the LeoParser that uses the Tagged class. (Most exercises do come with a worked solution.)

                  Thank you for your link.

                  1. 1

                    Ah, right! It was late at night when I first looked, now that I’ve had some rest I can see the solution.

                    Although I’m sure I’d fare better if I’d used your tutorial from the beginning, I’ve had some trouble following the code examples:

                    • Exercise 5 has a big block of text describing the concept of a deterministic reduction path, and then it has a block of code that adds a tag field to the State class. Nothing in the text motivates this addition, and it’s not actually used in the following LeoParser class.
                    • The tag field is eventually used in the solution to that first exercise, but very subtly: inside the new leo_complete() method, a single call is changed to take an extra parameter. That’s the only change to the previous leo_complete() function, and it doesn’t have a comment pointing out the change, or even a keyword argument tag=state.name to highlight that something is being done with tags.
                    • It turns out that tags are not even necessary to make a right-recursion-optimised recogniser, only a parser, so this is all a distraction from the idea (“deterministic right reduction”) that the text is trying to convey, and it would be very easy for a reader to ignore this scaffolding and assume they’ve understood the material.
                    • By the time we get to the hint that says “any time you see a tagged state, look at its end point”, there hasn’t been any discussion of how, when or why a state gets tagged, or what a state gets tagged with. It turns out that the code does provide answers to most of those things, but (as I said) it’s pretty subtle.
                      • EDIT: Now that I look more closely, the state is tagged with the LHS of the production that was originally produced, but the code never examines the tag content, so effectively the tag is a boolean “produced by Leo reduction” flag. Is that right?
                    • Another thing that confused me: inside the parse() method of the final LeoParser, it starts by calling parse_prefix(), checking the result, and then calling chart_parse(). If I understand correctly, that code will recognise the whole input string, then throw away the intermediate and final results, then start parsing again from scratch?
                    1. 1

                      Thank you for the detailed review. While the rest of the text has gone through multiple revisions, the exercises haven’t received as much care as the main body, other than checking that the idea is correct and the implementation works. I will go through the exercises again, clarify the points you have noted, and update you.

                      1. 1

                        As I mentioned, I have updated my tutorial with your feedback in mind. The updated tutorial is here: Exercise 5. Have I addressed all your concerns? Any place I can make it better?

                        Note that I took the easy way out in expanding the Leo items: I put back everything that was trimmed, not just the items required for derivation trees.

                        Thanks again for your feedback.

                        1. 2

                          Thank you for being patient! I’ve been studying your tutorial and working on my code, and I have something I think I’m happy with. Here are the notes I took as I was figuring things out:

                          • It’s a bit difficult for me to follow the text because I’m diving in at the end rather than reading from the beginning, and because I started with Loup Vaillant’s tutorial which uses different terms (“Earley set” instead of “column”, etc.) but that’s my problem, not a problem with the text.
                          • Thank you very much for the exhaustive tests; I found a problem with my existing recogniser due to RR_GRAMMAR4.
                          • It’s nice to have a diagram of the right-recursive-reduction chain, but I think the “topmost” reduction is depicted on the bottom, which is a little confusing.
                          • The “deterministic reduction” operation is described as resulting in a new completed item, so I was surprised to see the completion operation done in get_top() rather than in uniq_postdot()… why not have one method for the “one level” reduction and another method that calls it recursively for the topmost reduction?
                          • When it comes to actual parsing, the code adds a TState class and updates the Column class to use it… but TState seems identical to State in every way, so I’m not sure why this change is made. Is it just to distinguish between ordinary Earley items and Leo-reduced items? Wouldn’t adding an is_reduced boolean field be a clearer way to distinguish them?
                          • I assume the parameters to traverse_constraints() are named for how they’ll be called in the first iteration of the example structure, but I generally expect parameters to be named by what they mean in general, not in a specific instance. Is this a general convention for the textbook?
                          • If I understand the example code correctly, you’re adding a transitives field to each column, which stores a copy of each item in the reduction chain (get_top() iterates the chain recursively, and calls add_transitive() at each step). Isn’t that the O(n²) behaviour Leo is supposed to avoid?
                          • I’ve come up with my own approach (rough sketch in code after this list):
                            • for each topmost item, we store the set of items in the reduction chain, and the numerically largest end offset for any item in the set
                            • inside get_top(), once we’ve found the topmost item for state_A, we add the one-level-reduced state st_B to the chain for that topmost item, and increase the end offset to state_A’s end offset, if it was smaller
                            • because our extra cache stores reduced items once per topmost, instead of once per column per topmost, I believe this preserves the O(n) behaviour of Earley parsing
                            • when we’re done recognising, we can update the end-offset of each state in the chain to match the largest observed end-offset, and insert them into the regular state storage for the parsing phase to find.
                          • When implementing this, I had a bug where I was storing state_A in the chain instead of the one-level-reduced st_B. I discovered this bug because one of my testcases was the grammar <A> ::= 'a' | 'a' <A> and I wound up stretching all the non-recursive 'a' items to cover the right-recursion span instead of the recursive 'a' <A> items. You might want to add that test-case to your examples.
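
                          Here is the rough bookkeeping sketch promised above (illustrative only, with made-up names; it is not code from my repository or from the tutorial):

                              from dataclasses import dataclass, field
                              from typing import NamedTuple

                              # ChainItem is a hypothetical stand-in for a one-level-reduced state.
                              class ChainItem(NamedTuple):
                                  rule: str
                                  start: int
                                  end: int

                              @dataclass
                              class TopmostRecord:
                                  chain: set = field(default_factory=set)  # the st_B items in the chain
                                  max_end: int = 0                         # largest end offset seen so far

                              records = {}  # topmost item -> TopmostRecord

                              def note_reduction(topmost, st_b, end_of_state_a):
                                  # Called from get_top() once the topmost item above state_A is known.
                                  rec = records.setdefault(topmost, TopmostRecord())
                                  rec.chain.add(st_b)                      # store st_B, not state_A
                                  rec.max_end = max(rec.max_end, end_of_state_a)

                              def expand_for_parsing(insert_state):
                                  # After recognising: stretch every chained item to the largest observed
                                  # end offset and hand it back to the regular per-column state storage.
                                  for rec in records.values():
                                      for item in rec.chain:
                                          insert_state(item._replace(end=rec.max_end))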

                          I’m not sure how helpful that all is to you, but I found it very helpful to have a working example to follow, especially for finer points of the reduction memoisation. Thanks again for your help!

                          1. 2

                            Thank you so much for the exhaustive review!

                            It’s nice to have a diagram of the right-recursive-reduction chain, but I think the “topmost” reduction is depicted on the bottom, which is a little confusing.

                            I agree! It is better represented with the topmost at the top.

                            The “deterministic reduction” operation is described as resulting in a new completed item, so I was surprised to see the completion operation done in get_top() rather than in uniq_postdot()… why not have one method for the “one level” reduction and another method that calls it recursively for the topmost reduction?

                            The reason I went with this organization is that the transitive has to be stored in the end column of st_B_inc. Unless I pass the end column of st_B_inc separately, I cannot use uniq_postdot the way you suggest. (I think this also has to do with our different understanding of where transitives are stored. See the later point.)

                            When it comes to actual parsing, the code adds a TState class and updates the Column class to use it… but TState seems identical to State in every way, so I’m not sure why this change is made. Is it just to distinguish between ordinary Earley items and Leo-reduced items? Wouldn’t adding an is_reduced boolean field be a clearer way to distinguish them?

                            Indeed, the sole function of TState is to mark it as a transitive (the copy could have been done separately). You are right that in an independent parser that implements Leo’s reduction, this would have been better represented as a boolean field. However, we have an Earley parser that is separate in the main text, which I did not want to modify, and which I want to be importable from the main chapter. With this organization, a person who imports the State will not need to see the TState update. (I also like the fact that a transitive is a different kind of state, as it is a shorthand for a sequence of states, and hence worthy of getting its own class. But that is subjective.)

                            I assume the parameters to traverse_constraints() are named for how they’ll be called in the first iteration of the example structure, but I generally expect parameters to be named by what they mean in general, not in a specific instance. Is this a general convention for the textbook?

                            I actually switched back and forth multiple times between naming them in a general sense and naming them using the names in the example. Having found a rather large number of bugs, and having to re-understand what exactly was happening each time, I found it better to rely on the names in the example to clarify what exactly I was trying to do. (This is not the general convention followed in the book, and I will probably have to rename them at a later point. Until that time, this makes my life simpler :) )

                            If I understand the example code correctly, you’re adding a transitives field to each column, which stores a copy of each item in the reduction chain (get_top() iterates the chain recursively, and calls add_transitive() at each step). Isn’t that the O(n²) behaviour Leo is supposed to avoid?

                            Only the first time for any particular pattern; after the first time the transitive items are cached. See the quote from Loup Vaillant: “Now, when we find the topmost item in the reduction path, we add the corresponding transitive item to the current Earley set. And all the Earley sets we just examined, while we’re at it.” This is also my reading of Def 2.1 from Leo’s paper.

                            “To determine the topmost item on a deterministic reduction path we do not always have to construct the complete reduction path. Suppose $[C \rightarrow \delta\cdot, m]$ is the topmost item on the deterministic reduction path above $[A \rightarrow \gamma\cdot, i]$, and $[B \rightarrow \beta\cdot, k]$ is some other item on this path. Then we add to set $I_k$ a so-called transitive item $[C \rightarrow \delta\cdot, B, m]$. Subsequently, if an item of the form $[B \rightarrow \beta'\cdot, k]$ is added to some set $I_j$, we can directly add $[C \rightarrow \delta\cdot, m]$ to $I_j$.” (Please see the link I gave above for the original.)

                            That is, if $[B \rightarrow \beta\cdot, k]$ is some item in the path (at end column k), we add the transitive item to $I_k$.
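
                            In code, my reading of that rule is roughly the following (my own paraphrase with made-up names, not the tutorial’s add_transitive):

                                # For each (origin column k, nonterminal B) on the path, remember the
                                # topmost completed item [C -> delta., m].
                                transitives = {}

                                def remember_transitive(k, B, topmost):
                                    transitives.setdefault((k, B), topmost)

                                def on_completion_of(B, k, j, add_to_set):
                                    # An item [B -> beta'., k] was just completed into set I_j:
                                    # jump straight to the topmost item instead of re-walking the chain.
                                    topmost = transitives.get((k, B))
                                    if topmost is not None:
                                        add_to_set(j, topmost)   # add [C -> delta., m] directly to I_j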

                            On the other hand, my implementation of the parser is a naive one. The leo_expand essentially gives back all the gains made during the recognizer. It would be possible to have a clever implementation that only lazily expands when a reduced item is required. However, at this point, I have opted for a naive but simple and correct one.

                            I’ve come up with my own approach:

                            • for each topmost item, we store the set of items in the reduction chain, and the numerically largest end offset for any item in the set
                            • inside get_top(), once we’ve found the topmost item for state_A, we add the one-level-reduced state st_B to the chain for that topmost item, and increase the end offset to state_A’s end offset, if it was smaller
                            • because our extra cache stores reduced items once per topmost, instead of once per column per topmost, I believe this preserves the O(n) behaviour of Earley parsing
                            • when we’re done recognising, we can update the end-offset of each state in the chain to match the largest observed end-offset, and insert them into the regular state storage for the parsing phase to find.

                            This might work; however, I caution that Leo is rather hard to get right. (The deterministic reduction as I implement it does preserve O(n) behavior, as I mentioned above.)

                            One of my testcases was the grammar <A> ::= 'a' | 'a' <A> and I wound up stretching all the non-recursive 'a' items to cover the right-recursion span instead of the recursive 'a' <A> items. You might want to add that test-case to your examples.

                            Thank you so much! This actually exposed a bug I had. My implementation was not able to parse with RR_GRAMMAR7, which is your test case. I am not entirely sure if my fix is the right one, but I don’t see another explanation for it.

                            I’m not sure how helpful that all is to you, but I found it very helpful to have a working example to follow, especially for finer points of the reduction memoisation. Thanks again for your help!

                            Your review has been very helpful, and thank you for this wonderful new year gift!

                            1. 1

                              On the other hand, my implementation of the parser is a naive one. The leo_expand essentially gives back all the gains made during the recognizer.

                              I found a way to avoid expanding all states (updated in the code), which is simple enough.

                          2. 1

                            Thank you very much! I haven’t had time to look at your changes yet, but I will try to get around to it very soon.

            2. 4

              I’ve given this some thought in the context of my own projects, but I consider the ‘completion script’ angle a workaround for a more fundamental issue: there is no option to query the client itself for completion or validation, and you would really want both tied to the actual binary that will receive the constructed arg/env, to avoid any impedance mismatch and to provide better feedback.

              The legacy-compatible way for that would be another entry point alongside main(), one where the contract is to provide validation and completion for argc/argv without execution and, for binaries which lack this fictive “eval” entry point, to provide a wrapper that exports the completion. Alas, this is not something that ELF was really designed for, nor is dlopen()ing a non-.so ELF binary (do I hear the chimes of rundll32.exe?). A much less appealing option would be yet-another-section and a static format.

              1. 4

                git, npm, and Clang actually already handle completion in the binary. It’s a common architecture – it’s just that there is a bunch of bash glued on, and they all do it in a slightly different way.

                So the Shellac protocol should be designed so that they can migrate to something that’s not bash-specific, without rewriting all their logic.

                I could have mentioned this directly in the post, but it’s on the wiki page:

                https://github.com/oilshell/oil/wiki/Shell-Autocompletion

                http://blog.llvm.org/2017/09/clang-bash-better-auto-completion-is.html

                I think the easiest way is to simply use a flag or environment variable for the completion mode. A flag is better in that it’s not inherited, but environment variables feel more decoupled from the main program.

                1. 2

                  I think if you got npm, git and clang to adopt something like this, your adoption would jump up pretty fast, and each is representative of a major group: package managers, common/important non-shell tools, and language-based things. I’d love to see adoption by languages like Go (which, as an example, already has a flags package that this should tie into); that way it’s easy to generate binaries that adhere to the protocol. I don’t use Rust (yet), but this seems like the sort of thing that would appeal to that community as well. At a different level, it seems like supporting this protocol would appeal to Linux distros/*BSDs for their package management as well.

                  Autocomplete comes to mind every time I use Acme (which has only filename autocompletion built-in). I’ve always wanted something that was equally general to help out with command line autocompletion. This could potentially make other tools move some of their logic to a more reasonable place (IDE plugins would either evaporate or become much smaller and simpler).

                  @andyc I’m actually a little overwhelmed at the possibilities here and how it can make so many tools so much better. Is there anything I can jump in and help with?

                  1. 2

                    I think you could get significant (albeit implicit) adoption of a new completion protocol just by adding support for it to libraries like Python’s argparse or docopt or Haskell’s optparse-applicative: these already know about all of a program’s valid command-line options. (The latter package can already do bash completion via the hidden --bash-completion-* options that get injected into programs that use it.)
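
                    As a sketch of why this is cheap for such libraries (the SHELLAC_WORD variable below is made up, and _actions is a private argparse attribute, so treat it as an illustration only):

                        import argparse, os, sys

                        def maybe_serve_completions(parser):
                            word = os.environ.get("SHELLAC_WORD")  # hypothetical completion-mode trigger
                            if word is None:
                                return                             # normal execution
                            for action in parser._actions:         # argparse already knows every option
                                for opt in action.option_strings:
                                    if opt.startswith(word):
                                        print(opt)
                            sys.exit(0)

                        parser = argparse.ArgumentParser(prog="mytool")
                        parser.add_argument("--verbose", action="store_true")
                        parser.add_argument("--version", action="version", version="mytool 1.0")
                        maybe_serve_completions(parser)  # SHELLAC_WORD=--ver mytool  ->  prints --verbose, --version
                        args = parser.parse_args()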

                    1. 1

                      That’s similar to, but more informative than, the point I was making about Go’s flags. Especially since it looks like docopt (at least) has versions in multiple languages.

                      Autocomplete all the things!

                      1. 1

                        Thanks for the Haskell link, I added it here:

                        https://github.com/oilshell/oil/wiki/Shell-Autocompletion

                        There’s an argcomplete package for argparse which someone else mentioned. That’s definitely the idea – instead of generating bash completion scripts, those libraries could implement a shell-independent protocol.

                        However, for backward compatibility, we could first turn OSH itself into a Shellac server! It could simply interpret the scripts generated by clap-rs and serve completions to another shell.

                      2. 1

                        Glad you like the idea! It’s still quite early and there’s a lot to do.

                        If you’re interested in completion for Acme, then that’s absolutely what I’m going for… I want to be able to complete my shell scripts in Vim!!! So Vim, Acme, Oil, Elvish, zsh, etc. are perfect clients for this. However it’s not obvious that they can all share the same logic… that’s what we have to work out.

                        The main issue I see is how to separate completion of the shell language itself, e.g. echo $V<TAB> to complete var names, vs. completion of the argv array. Those are two separate things that are intertwined in the way most shells including bash do things.

                        Autocompletion is sort of a “polyglot” problem, so if you have expertise in Go or the JS ecosystem that would help.

                        Does Go have any existing packages for completion of its flags package? I know that Python does for argparse and Rust does as well. There are some links off the threads in my wiki page.

                        We’re discussing this on the #shell-completion channel on https://oilshell.zulipchat.com/ , which is easy to log into with Github. The conversation is dormant now, but I’d like to revive it for a while and get everyone on the same page. There was a debate about grammar-based approaches vs. this server-like approach. The server approach is closer to what npm, git, and Clang already do.

                  2. 3

                    @andyc That could potentially be very interesting to me for the Ultimate Plumber tool! That said, not knowing LSP well yet, I don’t really have any idea what the difference between the two protocols is. Would you mind if I asked what the main differences are, or especially what limitations in LSP made you want to create a new protocol? Skimming the articles, I couldn’t find any elaboration on that; I’d be grateful for a pointer if I missed such a text. I think it could be interesting to others too! Thanks.

                    (And btw, the “shellac” name is just awesome :D)

                    1. 3

                      Ha, glad you got the name :) I think the tagline should be Shellac finishes your commands (groan).

                      So I actually haven’t worked much with the language server protocol. I know it uses JSON-RPC, and pretty much everyone was in unanimous agreement that we don’t want to impose that dependency on Shellac clients (shells and editors) and Shellac servers (command line tools).

                      There is also the issue that the argv array you want to complete is different than the actual line of text in the shell or the editor. I don’t know that the LSP can’t accommodate this… but I suppose the problem just seems somewhat different on the surface.

                      As far as I know, the LSP is basically for when you type . in a Java-like language. I think it has notions of methods and arguments, which shell doesn’t have.

                      But I could be wrong and I would like it if someone tells me more about the language server protocol :)

                      It’s still very early, and the point of this post was to get people thinking about the problem!

                      As I mentioned in another comment, we’re discussing this on #shell-completion at https://oilshell.zulipchat.com , which is easy to log into with Github.

                      1. 2

                        In the fountain-pen world, shellac is used mainly as a general-purpose glue, which also fits well here! :)

                        As to the protocol, is it expected that I could eventually call for completion on:

                        SHELLAC_ARGV=["oil", "-c", "echo $"]
                        SHELLAC_ARGC=2
                        SHELLAC_OFFSET=6
                        

                        and get a list of env variable names? or recursively for ["oil", "-c", "ls --"] have ls completion queried via oil?

                        (sorry for not going to zulipchat, but I have enough spam already in my inbox, and don’t expect to get involved unfortunately, having more than enough hobby projects on my head…)

                        1. 2

                            For the first question, Oil will be a Shellac server as well as a client, so the answer is yes. It will know about its own syntax.

                          For the second question, I can imagine that will work, but that’s not the “right” way to do it. There would instead be a request you make to a shellac server regarding ls specifically. You wouldn’t have to go through Oil necessarily.

                          1. 1

                            Thanks! As for the second question, please try to look at it as a “degenerate case” of the first question. I.e. this would be more like ["oil", "-c", "if ls --"] or something similar in reality — that is, I’d need oil to parse its own syntax and then delegate further to ls. (Heheh, imagine an Inception-like scenario of oil -c "oil -c ..."; cue “We can go deeeeeper…” ;P)