1. 2

    This looks cool but I don’t understand what a ring is in this case! The readme’s pretty sparse… do you have some more information?

    1. 6

      Working on implementing an Earley parser using this amazing tutorial. I’ve read that site about a dozen times, and last night it finally clicked. Money quote from the author:

      Most compiler texts are obsessed with slaying the compiler dragons of the 60s with 70s technology.

      If you have any sort of appreciation for algorithms or parsing, you should study it.

      1. 4

        The coverage in Dick Grune’s Parsing Techniques: A Practical Guide is great, and his page for the first edition includes PDFs. If you’re interested in parsers in general, it’s definitely worth picking up.

        I’ve also been working on implementing Earley parsers, and the Earley/intersection parsing approach seems particularly compelling. It’s described in chapter 13 in the second edition, doesn’t seem to appear in the first.

        1. 2

          So Earley will recognize all context-free grammars, including ambiguous ones? I have two problems with that:

          1. Doesn’t it have to return a “forest” of all possible derivations of ambiguous grammars? What do you do with that? That doesn’t seem to be solving the problem; it’s more like pushing it to a later stage.
          2. Context free grammars aren’t powerful enough to represent most languages we use (C, C++, shell, Ruby, Perl 5 and 6, etc.). So even if Earley magically solved problem #1, it still wouldn’t be an ideal algorithm.

          I’ve written about this length in various comments, but I think CFGs are the wrong abstraction for human programming languages. They’re simultaneously not powerful enough and too inefficient.

          My argument is analogous to the one here regarding CFGs and network protocols. This is admittedly is a fringe claim, where as “CFGs should be used to parse programming languages” is the claim you will read in nearly every textbook:


          One of the langsec people responded on HN: https://news.ycombinator.com/item?id=16126234

          I read that paper about “calc-regular languages” but it was immature – which the authors admitted – it still can’t handle some common constructs.

          Anyway, my overall point is that we should use formalism, but formalism should match practice, not the other way around. Practice shouldn’t be put in the straitjacket of a theory (especially when that theory doesn’t provide you with good efficiency or correctness guarantees, as CFGs don’t).

          (FWIW I have a very messy wiki page about the difficulties with parsing here: https://github.com/oilshell/oil/wiki/Parsing-is-Difficult

          I don’t believe any of them are addressed by Earley parsing. The main bugaboo I have now is that every language has at least TWO parsers. One for the compiler/interpreter, and a totally separate one for linting/formatting. The former ignores whitespace; the latter must recognize whitespace. Even a brand new language like Julia has two parsers!!!

          Any system based on CFGs don’t solve that problem. I do NOT want parse trees (which include the full derivation of the string from the grammar); I want ASTs or “lossless syntax trees”.

          1. 2

            Well, I’m not happy about how Earley detects ambiguity TBH. But I am looking for something that is simple enough to take the question off the table so I can continue moving forward for now.

            You ended up using Pratt parsing for Oil, correct? The choice of a parser for my project is still very much up in the air; the most important consideration is whether I can build the grammar immediately before parsing a module, which isn’t a great use case for a lot of existing tools. I don’t need the ability to mutate the parser while I’m parsing, as I don’t mind cheating by pre-processing to locate all the grammars in use.

            From my research, Pratt parsing looks really underutilized, but I wasn’t able to discern whether it was painful to build the rules programmatically after parsing patterns and such, and whether it’d be substantially different from PEGs. For example, if I’m building a for x in iterable statement, would matching the in keyword be painful, especially if the metagrammar specification was declarative in nature? I won’t be writing the led() and nud() methods by hand here, they’ll be generated from specifications provided by the user. Also, I was initially put off by all the precedences on everything, then later realized those are implicit in most grammar specifications anyway, so that is pretty unfounded.

            1. 2

              Also remember you can always use more than one parser. sklogic used to point that out on HN when people were choosing between one that succeeds/fails fast and one easy fod error messages. Something like that.

              His tool on “combinatorylogic” Github is a LISP that embeds a bunch of DSL’s he mainly uses for program analysis. So, his double-parser strategy might be well-informed. ;)

              1. 1

                Oil is best described as 3 interleaved recursive descent parsers + one Pratt parser for expressions, with 15 or so lexer modes. [1]

                Pratt parsing is really best suited for expression languages, not a whole programming language, as Crockford does in his widely read article (which I reviewed here [2]).

                Although Pratt parsing isn’t a grammar-based approach, there is something called an “operator precedence grammar” which is a subset of CFGs, and what it handles roughly corresponds to that set of languages:


                Bob Nystrom has a good way of putting it:


                If recursive descent is peanut butter, Pratt parsing is jelly. When you mix the two together, you get a simple, terse, readable parser that can handle any grammar you throw at it.

                I would add the caveat that this technique is really good for implementing existing languages. If you are designing a language from scratch, it might be helpful to use some kind of parser generator to guide the design.

                In summary, I would say the parsing method you use depends very heavily on the language. And many languages are compositions of multiple languages. I don’t have experience with Earley but I’m sure the same thing is true: it’s good for some things and not good for other things.

                [1] http://www.oilshell.org/blog/2017/12/17.html

                [2] http://www.oilshell.org/blog/2016/11/02.html

              2. 2

                Have you heard about parsing with derivatives?

                I recently explored the possibility of doing this with Clojure’s spec. Here’s an example:

                (s/conform ::s-expression "(abc  def)")
                [[:sub {:left  \(
                        :nest  [[:atom [\a \b \c]]
                                [:whitespace [\space \space]]
                                [:atom [\d \e \f]]]
                        :right \)}]]

                Notice that in the third line, the whitespace has been captured. A linter could examine whitespace and a compiler/interpreter could throw it out. Specs can be bidirectional too, so a formatting tool could take a data structure like the above, adjust white space, and unform it, producing formatted source code.

                Regarding returning a parse forest, the author of parsing with derivatives goes into that here. On the next slide, the author alludes to the possibility of modifying the grammar during parsing (!) which may be of particular interest to you.

            1. 2

              I had a hard time understanding the Matt Might paper, but his lecture on it really helped me. He also motivates it really well.

              1. 1

                Somewhere around the 40 minute mark, those definitions of D and δ using define/memoize and define/fix really made my day. :-)

              1. 1

                Great stuff! I love the format of iterating on implementations and criticisms.

                Clojure also has de-nesting shortcuts via -> and related macros, but while the first argument to nest is the outermost expression, the first argument to -> and similar is the innermost expression. This makes it convenient for expressing function composition. However, it seems backward to use it for binding. These two expressions are equivalent:

                (let [x 1 y 2]
                   (when (even? y)
                (->> x
                     (when (even? y))
                     (let [x 1 y 2]))

                Which reminds me of Haskell’s where.

                Similarly, the author illustrates the utility of nest with binding forms, but using it for function composition may feel backwards. If I understand nest correctly, then these should be equivalent:

                (reduce + (filter odd? (map inc [1 2 3 4])))
                (nest (reduce +)
                      (filter odd?)
                      (map inc)
                      [1 2 3 4])

                But, this outermost-first order is the usual order of function composition in math, Haskell, and others.

                It’s interesting that you can express either binding or composition in either order.

                PS: nest is implemented in a library for Clojure as <<-