1. 13

  2. 5

    Parsing Models Cheatsheet

    I recently got the Earley parsing religion, which is O(n³) worst-case just like everything else, but can handle most sensible CFGs in linear time and space, even with infinite lookahead. No shift/reduce errors like yacc, no accidental infinite recursions like PEG parsers, just reliable operation.

    1. 1

      By the way, I think that GLL parsing can be similarly elegant and concise as that of Earley. I haven’t gotten around to translating that to Python though.

      1. 1

        The religion has its appeal… but doesn’t get the job done on its own. Can you recommend some implementations?

        1. 2

          If you are looking for a simple well-explained Python implementation that does the related optimizations such as Aycock’s fix and Leo’s right-recursion optimization, do look at ours. We are writing this as part of a larger book on fuzzing with grammars.

          1. 1

            Thanks! That book looks quite useful. I will share it around.

          2. 1

            The industrial-strength implementation is Marpa, which is a C library. There are bindings for various languages, but the original author’s Perl binding is the most complete.

            Searching for “earley” on nearly any language package manager will probably give you a bunch of implementations of varying quality.

            Naturally, I’m writing my own: as a learning exercise, and because I’ve got Strong Opinions about documentation quality I’d like to demonstrate. I haven’t yet got it to the point of public presentability, but it’s coming along nicely.

            1. 1

              Would love to see your progress. Are you going the SPPF route or are you building the trees after the Earley algorithm has completed?

              Here is our attempt at explaining Earley parsing in Python.

              1. 1

                I currently have zero documentation, but if you’d like to take a look at the source, I put it on sr.ht.

                I confess I don’t know exactly what SPPF actually is, except that Loup Vaillant couldn’t understand Scott’s paper about it, and Jeffrey Kegler independently invented something equivalent which he doesn’t really explain. Basically, I build a parse-tree node for each completed item by stitching together parse-tree nodes for non-terminal symbols that appear in the right order and cover the required span. I happen to do it while the parse is in progress, but I think the same approach would work if done afterward.

                Thank you very much for that link! I just discovered the hard way that the Leo optimization breaks naïve parse-tree construction, and yours is the first document I’ve seen that seems to acknowledge that’s an issue. Unfortunately, it only mentions it in an exercise, so there’s no worked solution, but at least there’s a hint which is more than I’ve found anywhere else.

                1. 1

                  Unfortunately, it only mentions it in an exercise,

                  The solution is there! Look at the LeoParser using Tagged class. (Most exercises do come with a worked solution.)

                  Thank you for your link.

                  1. 1

                    Ah, right! It was late at night when I first looked, now that I’ve had some rest I can see the solution.

                    Although I’m sure I’d fare better if I’d used your tutorial from the beginning, I’ve had some trouble following the code examples:

                    • Exercise 5 has a big block of text describing the concept of a deterministic reduction path, and then it has a block of code that adds a tag field to the State class. Nothing in the text motivates this addition, and it’s not actually used in the following LeoParser class.
                    • The tag field is eventually used in the solution to that first exercise, but very subtly: inside the new leo_complete() method, a single call is changed to have an extra parameter. That’s the only change to the previous leo_complete() function, and it doesn’t have a comment pointing out the change or even a keyword argument tag=state.name to highlight that something is being done with tags
                    • It turns out that tags are not even necessary to make a right-recursion-optimised recogniser, only a parser, so this is all distraction from the idea (“deterministic right reduction”) that the text is trying to convey and it would be very easy for a reader to ignore this scaffolding and assume they’ve understood the material.
                    • By the time we get to the hint that says “any time you see a tagged state, look at its end point”, there hasn’t been any discussion of how, when or why a state gets tagged, or what a state gets tagged with. It turns out that the code does provide answers to most of those things, but (as I said) it’s pretty subtle.
                      • EDIT: Now that I look more closely, the state is tagged with the LHS of the production that was originally produced, but the code never examines the tag content, so effectively the tag is a boolean “produced by Leo reduction” flag. Is that right?
                    • Another thing that confused me: inside the parse() method of the final LeoParser, it starts by calling parse_prefix(), checking the result, and then calling chart_parse(). If I understand correctly, that code will recognise the whole input string, then throw away the intermediate and final results, then start parsing again from scratch?
                    1. 1

                      Thank you for the detailed review. While the rest of the text has gone through multiple revisions, the exercises hasn’t received as much care as the main body other than checking that the idea is correct and the implementation works. I will go through the exercise again, and clarify the points you have noted; and update you.

        2. 4

          I’ve given this some thought in the context of my own projects but I consider the ‘completion script’ angle as a workaround for a more fundamental issue: there is no option to query the client itself for completion or validation, and you would really want both tied to the actual binary that will receive the constructed arg/env to avoid any impedance mismatch, and to provide better feedback.

          The legacy- compatible way for that would be another entrypoint to main(), one where the contract is to provide validation and compltion for argc/argv without execution and, for binaries which miss this fictive “eval” entry point, provide a wrapper that exports the completion. Alas this is not something that ELF was really designed for, neither is the case of dlopen()ing non- .so ELF binary (do I hear the chimes of rundll32.exe?). A much less appealing option would be yet-another-section and a static format.

          1. 4

            git, npm, and Clang actually already handle completion in the binary. It’s a common architecture – it’s just that there is a bunch of bash glued on, and they all do it in a slightly different way.

            So the Shellac protocol should be designed so that they can migrate to something that’s not bash-specific, without rewriting all their logic.

            I could have mentioned this directly in the post, but it’s on the wiki page:



            I think the easiest way is to simply use a flag or environment variable for the completion mode. A flag is better in that it’s not inherited, but environment variables feel more decoupled from the main program.

            1. 2

              I think if you got npm, git and clang to adopt something like this, your adoption would jump up pretty fast and each is representative of some major groups; package managers, common/important non-shell tools, and language based things. I’d love to see adoption by by languages like Go (which as an example, already has a flags package that this should tie into) that way it’s easy to generate binaries that adhere to the protocol. I don’t use Rust (yet), but this seems like the sort of thing that would appeal to that community as well. At a different level it seems like supporting this protocol would appeal to linux distros/*BSDs for their package management as well.

              Autocomplete comes to mind every time I use Acme (which has only filename autocompletion built-in). I’ve always wanted something that was equally general to help out with command line autocompletion. This could potentially make other tools move some of their logic to a more reasonable place (IDE plugins would either evaporate or become much smaller and simpler).

              @andyc I’m actually a little overwhelmed at the possibilities here and how it can make so many tools so much better. Is there anything I can jump in and help with?

              1. 2

                I think you could get significant (albeit implicit) adoption of a new completion protocol just by adding support for it to libraries like Python’s argparse or docopt or Haskell’s optparse-applicative: these already know about all of a program’s valid command-line options. (The latter package can already do bash completion via the hidden --bash-completion-* options that get injected into programs that use it.)

                1. 1

                  That’s similar to, but more informatitve than, the point I was making about Go’s flags. Especially since it looks like docopt (at least) has versions in multiple languages.

                  Autocomplete all the things!

                  1. 1

                    Thanks for the Haskell link, I added it here:


                    There’s an argcomplete package for argparse which someone else mentioned. That’s definitely the idea – instead of generating bash completion scripts, those libraries could implement a shell-independent protocol.

                    However, for backward compatibility, we could first turn OSH itself into a Shellac server! It could simply interpret the scripts generated by clap-rs and serve completions to another shell.

                  2. 1

                    Glad you like the idea! It’s still quite early and there’s a lot to do.

                    If you’re interested in completion for Acme, then that’s absolutely what I’m going for… I want to be able to complete my shell scripts in Vim!!! So Vim, Acme, Oil, Elvish, zsh, etc. are perfect clients for this. However it’s not obvious that they can all share the same logic… that’s what we have to work out.

                    The main issue I see is how to separate completion of the shell language itself, e.g. echo $V<TAB> to complete var names, vs. completion of the argv array. Those are two separate things that are intertwined in the way most shells including bash do things.

                    Autocompletion is sort of a “polyglot” problem, so if you have expertise in Go or the JS ecosystem that would help.

                    Does Go have any existing packages for completion of its flags package? I know that Python does for argparse and Rust does as well. There are some links off the threads in my wiki page.

                    We’re discussing this on the #shell-completion channel on https://oilshell.zulipchat.com/ , which is easy to log into with Github. The conversation is dormant now but I’d like to revive it for awhile and get everyone on the same page. There was a debate about grammar-based approaches vs. this server-like approach. The server approach is closer to what npm, git, and Clang already do.

              2. 3

                @andyc That could be potentially very interesting to me for the Ultimate Plumber tool! That said, not knowing LSP well yet, I don’t really have any idea what’s the difference between the two protocols. Wouldn’t you mind if I asked you what are the main differences, or especially limitations in LSP which made you want to create a new protocol? Skimming the articles, I couldn’t find any elaboration on that; I’d be grateful for a pointer if I missed such a text. I think it could be interesting to others too! Thanks.

                (And btw, the “shellac” name is just awesome :D)

                1. 3

                  Ha, glad you got the name :) I think the tagline should be Shellac finishes your commands (groan).

                  So I actually haven’t worked much with the language server protocol. I know it uses JSON-RPC, and pretty much everyone was in unanimous agreement that we don’t want to impose that dependency on Shellac clients (shells and editors) and Shellac servers (command line tools).

                  There is also the issue that the argv array you want to complete is different than the actual line of text in the shell or the editor. I don’t know that the LSP can NOT accomodate this … but I suppose the problem just seems somewhat different on the surface.

                  As far as I know, the LSP is basically for when you type . in a Java-like language. I think it has notions of methods and arguments, which shell doesn’t have.

                  But I could be wrong and I would like it if someone tells me more about the language server protocol :)

                  It’s still very early, and the point of this post was to get people thinking about the problem!

                  As I mentioned in another comment, we’re discussing this on #shell-completion at https://oilshell.zulipchat.com , which is easy to log into with Github.

                  1. 2

                    In fountain pens world, shellac is used mainly as a general purpose glue, which also fits well here! :)

                    As to the protocol, is it expected that I could eventually call for completion on:

                    SHELLAC_ARGV=["oil", "-c", "echo $"]

                    and get a list of env variable names? or recursively for ["oil", "-c", "ls --"] have ls completion queried via oil?

                    (sorry for not going to zulipchat, but I have enough spam already in my inbox, and don’t expect to get involved unfortunately, having more than enough hobby projects on my head…)

                    1. 2

                      For the first question, Oil will be a shellac server as ewll as a client, so the answer is yes. It will know about its own syntax.

                      For the second question, I can imagine that will work, but that’s not the “right” way to do it. There would instead be a request you make to a shellac server regarding ls specifically. You wouldn’t have to go through Oil necessarily.

                      1. 1

                        Thanks! As for the second question, please try to look at it as a “degenerate case” of the first question. I.e. this would be more like ["oil", "-c", "if ls --"] or something similar in reality — that is, I’d need oil to parse its own syntax and then delegate further to ls. (Heheh, imagine an Inception-like scenario of oil -c "oil -c ..."; cue “We can go deeeeeper…” ;P)