This is cool! I wrote a tiny backtracking PEG interpreter too in Python, but I never used it to parse Python.
But you’re interpreting Python’s CFG as a PEG. Have you thought about where that will break, if anywhere? Or tested it?
I have been meaning to look into Python’s bespoke parser generator pgen.c. I don’t quite understand the algorithm they are using. It seems similar to the top-down algorithms that ANTLR uses, but I haven’t grokked all the details.
Thanks! Do you have link to your PEG parser?
I did encounter some places where I reordered the alternatives being tested. It parses all the scripts in this project (and a few others I’ve tried with no glaringly obvious problem to my eyes). At the moment, I am not aware of a Python script that gets the wrong AST (except for the whitespace issue I wrote about where some valid scripts are rejected). I don’t know if my rearrangement of the grammar was needed or if it was bugs in my parser at the time.
I think the fact that Python is LL(1) helps because if a string matches, it only matches in one way (if two alternatives match the next character/token of the input, you may have to backtrack if you start with the wrong one of the two). I’m not entirely sure of what I just said. Maybe someone else can confirm or deny. Also since I don’t tokenize first and pgen might, this might not be enough in my case even if what I said about LL(1) grammars is true.
I haven’t looked inside pgen. The fact that compiling C was needed just to build the AST for a slightly different language using the ast module did bother me.
Somewhat embarrasingly it is only available on the defunct Google Code. I have it as an hg repo on my machine but haven’t touched it in a long time:
I plan to revive this code, or at least the idea, as part of my http://www.oilshell.org/ project. If you have any questions let me know.
It’s not pure PEG actually – it’s a regex-based lexing system, with a PEG-like parsing system on top (using ordered choice). It’s a backtracking interpreter – no code generation.
Thanks! The regex part makes it sound a bit like parsimonious.
Clicking through some links, I just found your recent article on Pratt parsing.
Can I ask you some questions about that? I guess most of this should probably be in a PM or email.
Could you expand the shorthands Pratt uses? From your post, I have:
but what’s nud, led, etc?
And in what way does it differ from Floyd’s algorithm? The first step of Floyd’s algorithm computes a table and two arrays f and g of precedences. For Pratt, it seems that lbp and rbp need to be first built by hand but then how does the two differ after that?
Sure, nud and led are abbreviations for null denotation and left denotation, which IMO are not very meaningful terms. I explain it like this:
In Pratt parsing, tokens can be used in the null and/or left position, based on whether they take an expression on
the left or not (nud or led in Pratt’s terminology). Examples of left operators are infix +, postfix ++, or the
pseudo-infix a (with operator [). Examples of null operators are unary minus -, and grouping parentheses (.
I actually don’t know Floyd’s algorithm. I just remember it was mentioned in one of the papers, and I thought that it used a partial order on precedences rather than a total order. Do you know if that’s true and do you have a pointer to a good explanation of Floyd’s algorithm?
What are the arrays f and g in Floyd’s algorithm?
f and g are the left and right precedence functions (for each token, returns an integer and so can be though of as two arrays indexed by tokens).
The Floyd article first gives an algorithms that uses a table of pairwise comparisons (that could potentially only form partial order). Later in the article, the mention that to save memory, you can compute f and g first and then just compare f(currenttoken) to g(nexttoken).
I was hoping there was a good explanation of Floyd too but haven’t found it. I think that Wikipedia page is rather helpful, if incomplete in some places. I ended up reading Floyd’s article. In the article, near the end, there’s a huge (# tokens)x(# tokens) table for a sample ALGO-like language and at the bottom and right of the table are the values of f and g.
Edit to add that I didn’t read Pratt’s article (even though a reformatted version is on Github).
Hm yeah I have read that Wikipedia page and been a little confused.
I sort of understand what they are doing with turning the matrix into two functions. To me that seems to be turning a matrix that happens to be a total order into a more compact representation of a total order.
I don’t really understand where they use f and g in the algorithm.
Also the operator precedence algorithm on the page isn’t given a name. It is referenced to the dragon book “Aho, Sethi & Ullman 1988, p. 206.”, but I don’t think my later edition has it. I looked in the index for “precedence” and it was about LR parsers and yacc. Maybe they removed it?
In any case, it’s definitely possible that Floyd’s algorithm and Pratt parsing are the same in the special case that you have f and g functions, rather than a full matrix? I really can’t tell though because I think the page is incomplete.
Although my question would be: is anyone using Floyd’s algorithm? Have you ever seen it in code?
I’ve seen Pratt’s algorithm in many places. Lua uses a simple variant of it (more like the “precedence climbing” structure).
One thing I did mention in my article is that Guy Steele said the Fortress programming language disallowed certain combinations of operators without explicit parens. That is, they used a partial order rather than a total order for usability reasons. So I think that could be a good use case for Floyd’s? Fortress is open source but I haven’t looked at the code!
If you find any further connections feel free to PM me or e-mail me at firstname.lastname@example.org!
Ok, I’ll sent you email for the rest. I’ll answer some parts that might be interesting to others here.
Yeah, that page is more of a helper for when Floyd’s article wasn’t so clear. It does have every piece. It just doesn’t tell you how they all fit together.
Since its optional, Floyd (and Wikipedia’s) first description of the algorithm does not make use of f and g. To use them, instead of checking if t1 <. t2 (for every time we attempt this check in the first description), check if f[t1] < g[t2]. Instead of t1 .> t2, check if f[t1] > g[t2].
In any case, it’s definitely possible that Floyd’s algorithm and Pratt parsing are the same in the special case that you have f and g functions, rather than a full matrix?
But doesn’t a total order means you have f and g? And doesn’t Pratt only work when there are total orders (since they are entered by hand)? If yes, then the special case would actually be all cases. I’m hoping the answer is no and someone can tell me what makes it a no.
[Comment removed by author]
Thank you! Indeed, what’s next? I don’t want to overhype anything but the first options makes it easier to experiment with changes to the language.