Threads for ltratt

  1. 1

    You might look at Pager’s algorithm or ELR(1) if you’re interested in reducing the table size.

    1. 2

      I implemented Pager’s algorithm – or, more accurately, a hybrid of Pager’s algorithm, Chen’s thesis, and some guesswork – in grmtools. As that suggests, Pager’s algorithm is a bit harder to implement than one would like!

      The IELR paper (I wondered if ELR is a typo for IELR? But maybe there’s another state reduction algorithm I don’t know about!) makes a convincing case that Pager’s algorithm has flaws, and that IELR corrects those flaws. I haven’t implemented IELR, though if I was to revisit that part of grmtools/lrtable, I would start with IELR and not Pager.

    1. 3

      As others have stated, the problem isn’t that POP/IMAP are being phased out by Office, but “basic authentication” (i.e. traditional username/password) is being phased out. My employer pushed the necessary button to do this a few weeks ago, causing me some confusion as to why my email setup stopped working!

      There are several tools that can help make existing email setups work with OAuth e.g. Email OAuth 2.0 Proxy, mailctl, mutt_oauth2.py.

      None of those quite worked for me, so I’ve written my own called pizauth (https://github.com/ltratt/pizauth) which I now consider to be at the alpha release stage, and ready for somewhat wider testing. I and others have so far only tested it on Office’s OAuth servers, so I can’t yet guarantee it works for other sites. Bug reports are most welcome!

      1. 4

        It’s a fascinating read, and I thank @ltratt for their work here, but the main takeaway I’m getting is “don’t install OpenBSD on a laptop unless you’ve got a spare week for debugging” TBH.

        1. 2

          That may or may not be a fair takeaway, but it wasn’t the one I intended :) Rather, my experience is that if one gets very new hardware from a manufacturer that doesn’t also provide your OS, one should expect some fun and games. Unlike, say, a Windows user, my fun and games is “look at the source code” rather than “try installing and uninstalling random drivers found on the internet”. Is that better? Well, that’s up to you ;)

          1. 2

            I reinstalled Windows on my new laptop and had the fresh fortune of discovering my WiFi card was too new to be supported in the latest images.

            Also the fortune to have my panic cut short by remembering that a family member had a USB-C dock. It took me too long to realize I could have just put the drivers on a flash drive instead of using Windows Update.

        1. 5

          You missed a method! I’ve been meaning to write about a version of this forever but the basic idea is that the total size of the executing binary is stored within the ELF data, and ELF files do not give a shit about what comes after them.

          Thus, you can simply do:

          $ cat foo.txt >> program.bin
          

          This shifts the difficulty to:

          • getting the ELF header of the currently executing program
          • reading from that location
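
          A minimal sketch of both steps, assuming Linux (/proc/self/exe) and that the section header table is the last thing in the ELF image (true of typical linker output, though not guaranteed), might look like this:

          /* Sketch: find data appended after this executable's own ELF image. */
          #include <elf.h>
          #include <stdio.h>

          int main(void) {
              FILE *f = fopen("/proc/self/exe", "rb");
              if (!f)
                  return 1;

              Elf64_Ehdr ehdr;
              if (fread(&ehdr, sizeof(ehdr), 1, f) != 1)
                  return 1;
              /* End of the ELF image proper: section header table offset plus its size. */
              long elf_end = (long)(ehdr.e_shoff + (long)ehdr.e_shnum * ehdr.e_shentsize);

              /* Everything between elf_end and EOF is whatever was appended afterwards. */
              fseek(f, 0, SEEK_END);
              long file_end = ftell(f);
              printf("appended payload: %ld bytes at offset %ld\n", file_end - elf_end, elf_end);

              fseek(f, elf_end, SEEK_SET);
              /* ... read file_end - elf_end bytes of payload here ... */
              fclose(f);
              return 0;
          }
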
          1. 3

            I will cheerfully admit that I hadn’t thought of that :) Though it would make it hard (without adding some sort of header) to add more than one binary blob. There’s also a secondary problem, which is that the newly added blob will be outside the ALLOC segment, so one would probably have to mmap that part of the executable in manually. It’s definitely doable, albeit somewhat laborious. Still, a neat trick!

            1. 5

              Note that, depending on the sandboxing technology that you use, the application may not be able to access its own binary file. This may also interfere with code signing flows.

              However, perhaps surprisingly, xxd is part of Vim

              I was very surprised by this. I went to refute this claim but it looks as if that’s where my copy comes from. Checking the history, it looks as if the original author of vim made some small tweaks to xxd and bundled it. I’m fairly sure that it was a separate package when I first encountered it, but that was more than 20 years ago now.

              Mind you, I’ve never used a *NIX system that hasn’t had it installed - xxd is one of my go-to tools for a bunch of tasks and I’ve never needed to know which package it came from before.

              The performance issue that you encounter comes from the fact that clang generates an AST node for every single one of the initialiser elements. This is a lot faster if you insert it as a C string ("\x23\x42" instead of 0x23, 0x42), though you need to be careful of the extra null byte that the compiler will insert at the end of a C-string literal. I wrote a very small (about 4 lines of code) C++ tool for a project to generate C/C++ headers with string literals from arbitrary binary files a few years ago. The compiler happily passes them through and you can use things like section attributes to put them in specific locations in the output.
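
              A minimal sketch of what such a generated header can boil down to (the section name and the bytes here are invented, and __attribute__((section)) is a GCC/Clang extension):

              /* blob.h -- hypothetical output of a binary-to-string-literal generator. */
              __attribute__((section(".blobdata")))
              static const char blob[] =
                  "\x23\x42\x00\x10"
                  "\xde\xad\xbe\xef";
              /* sizeof includes the compiler's trailing '\0', hence the -1. */
              #define BLOB_LEN (sizeof(blob) - 1)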

              1. 3

                you need to be careful of the extra null byte that the compiler will insert at the end of a C-string literal

                Include the size indicator, cf. char c[5] = "abcde", and it goes away. As I recall, this does not work in C++ for some reason.
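
                A tiny illustration of that, with arbitrary bytes:

                /* In C, giving the exact size drops the implicit trailing '\0'. */
                static const char blob[4] = "\x23\x42\x00\x10";  /* exactly 4 bytes */
                /* Without the size, sizeof(blob) would be 5 because of the added NUL;
                   in C++ the sized form is ill-formed (no room for the terminator). */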

                1. 1

                  This is a lot faster if you insert it as a C string (”\x23\x42” instead of 0x23, 0x42), though you need to be careful of the extra null byte that the compiler will insert at the end of a C-string literal.

                  Ah, interesting! I must admit that I hadn’t thought to try this.

                  1. 3

                    I wrote a C tool that converts binary data into C strings (https://github.com/mulle-nat/mulle-c-string-escape), which I use for just this purpose. The created .inc file can then easily be included like so:

                    static char  data[] =
                    #include "data.inc"
                    ;
                    #define s_data  (sizeof( data) - 1)
                    
                    1. 1

                      This is what #embed is defined to be as-if, by the way.

                      1. 1

                        We used this technique in the nineties. It was fine except that the compiler would use much memory per byte, so sometimes the compile would run out of RAM on even large boxes (of that time). Of course computers have more memory now, I’ve 64GB on my desk now and had 64MB then, but perhaps the included blobs are bigger too.

                        1. 1

                          With clang 14.0.6 and mulle-c-string-escape I can get a turnaround of 3s on my system for a file of 100 MB (like the author used). The compiler takes 2.1s and 912 MB RAM and mulle-c-string-escape uses 0.9s and 1.4MB RAM. (measured with /usr/bin/time -v). For comparison: the author sees 15s for hexdump and 90s for clang. My machine takes 12s for the 100 MB hexdump. So I would estimate it would be 4s vs 105s.

                    2. 1

                      I was very surprised by this. I went to refute this claim but it looks as if that’s where my copy comes from. Checking the history, it looks as if the original author of vim made some small tweaks to xxd and bundled it. I’m fairly sure that it was a separate package when I first encountered it, but that was more than 20 years ago now.

                      The repository I found when writing the “Related work” section of my Cedro pre-processor (you know about it) states the following:

                      “This version is based on V1.10 27oct98 by Juergen Weigert. Between 2000 and 2011 it accumulated patches by the Vim team and others. In 2013 it was pulled out of the Vim tree, had a couple of features added and was pronounced version 1.11.” – https://github.com/ConorOG/xxd

                      I too didn’t know which package it came from.

                      1. 1

                        you need to be careful of the extra null byte that the compiler will insert at the end of a C-string literal

                        See Rui Ueyama’s suggestion for avoiding this. It’s quite clever and, I think, portable (but I might be wrong!).

                        1. 1

                          Lack of access is a real problem in a case I’ve come across.

                          Some postscript drivers generate code that reads the, uhm, the part of the postscript file that’s after the postscript program. The postscript parser parses the first part of the foo.ps file until the syntactical EOF, and the parsed postscript code reads and uses the rest.

                          This collides unprettily with boxes that accept input from many client systems, count pages, assign costs to the right team/department, and send postscript onwards to the right printer.

                        2. 2

                          Though it would make it hard (without adding some sort of header) to add more than one binary blob.

                          the header format already exists. try concatenating a zipfile to the end of an executable. unzip will still manage to extract, and the file will execute fine. self-extracting archives work this way.

                        3. 1

                          I’ve always liked this method and the fact that it works on Windows executables is great too. I was wondering if there’s a similar trick that will work with shared objects – is there a straightforward substitute for argv[0]?

                          1. 1

                            This works on Windows too! Taking advantage of it is part of the reason why the ZIP file format is designed with the dictionary of filenames and offsets at the end of the file instead of the beginning. :)

                            (You can skip all the work of parsing the ELF header if you do something like putting the length of the file as the last 4 bytes of the file.)
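
                            A minimal sketch of that length-at-the-end variant, assuming the build step appended the payload followed by a little-endian uint32 length, and using Linux’s /proc/self/exe rather than argv[0]:

                            #include <stdint.h>
                            #include <stdio.h>

                            int main(void) {
                                FILE *f = fopen("/proc/self/exe", "rb");
                                if (!f)
                                    return 1;

                                /* The last 4 bytes of the file hold the payload length. */
                                uint32_t len;
                                fseek(f, -(long)sizeof(len), SEEK_END);
                                fread(&len, sizeof(len), 1, f);

                                /* The payload sits immediately before the length word. */
                                fseek(f, -(long)(sizeof(len) + len), SEEK_END);
                                /* ... read len bytes of payload here ... */
                                printf("payload is %u bytes\n", len);
                                fclose(f);
                                return 0;
                            }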

                          1. 2

                            That’s probably the main downside of AOT-compiled languages: you have to decide to inline or not during compilation, not during run time, and you can’t inline after dynamic linking. And function calls are relatively expensive.

                            I’m curious: are there examples of JIT compilers that can also monomorphize at run time? Probably JITs for dynamic languages do almost that, but are there JITs for static languages that do monomorphization?

                            1. 3

                              JVM probably does it.

                              Dart and TypeScript pretend to have types at “compile” time, and then run in a dynamically-typed VM with hidden classes optimization.

                              1. 3

                                Note that inlining itself is, in some sense, an AOT specific concept. In a JIT, you don’t need to care about function boundaries at all, you can do a tracing JIT.

                                The TL;DR is that you observe a program at runtime and identify a runtime loop: a sequence of instructions that is repeatedly executed. You then compile this loop as a whole. The loop can span multiple source functions/runtime function calls. In each function, the loop includes only the hot path parts.

                                So, a powerful JIT can transparently tear through arbitrarily many layers of dynamic dispatch; the code is fully monomorphized in terms of instruction sequence.

                                What a JIT can’t do transparently, without the help from source level semantics, is optimize the data layout. If a Point is heap allocated, and a bunch of Points is stored in a HashMap, the JIT can’t magically specialize the map to store the points inline. Layout of data in memory has to be fixed, as it must be compatible with non-optimized code. The exception here is that, when an object doesn’t escape, JIT might first stack-allocate it, and then apply scalar replacement of aggregates.

                                1. 1

                                  The exception here is that, when an object doesn’t escape, JIT might first stack-allocate it, and then apply scalar replacement of aggregates.

                                  JITs can be a bit more clever with escape analysis: they don’t have to prove that an object never escapes in order to deconstruct its parts, they just have to make sure that any deconstruction of an object is never visible to the outside world. In other words, one can deconstruct an object temporarily provided it’s reconstructed at exit points from the JIT compiled code.

                                  1. 1

                                    For dynamic language JIT compilers—e.g. LuaJIT—the compiler has to insert type check guards into the compiled code, right? And other invariant checks. How much do these guards typically cost?

                                    I can imagine how a large runtime loop (100s of instructions) could place guards only at the entry point, leaving the bulk of the compiled section guard-free. I can also imagine eliding guards if the compiler can somehow prove a variable can only ever be one type. But for dynamic languages like Lua it could be too hard to perform meaningful global analysis in that way. If you have any insight I’d appreciate it, I’m just speculating.

                                    1. 2

                                      I am not really an expert, but here’s my understanding.

                                      Language dynamism is orthogonal to AOT/JIT and inlining. JITs for dynamic languages do need more deoptimization guards. The guards themselves are pretty cheap: they are trivial predicated branches.

                                      As usual, what kills you is not the code, it’s data layout in memory. In a static language, an object is a bunch of fields packed tightly in memory, in a dynamic language, a general object is some kind of hash-map. Optimizing those HashMap lookups to direct accesses via offset is where major performance gains/losses are.

                                  2. 2

                                    I believe C#’s CLR does this: it acts like Java at compile time but then monomorphizes generics at run time.

                                    1. 1

                                      .NET generics use monomorphization for value types and a shared instantiation for reference types. .NET Generics Under the Hood shows some of the implementation details.

                                    2. 2

                                      To alleviate this, I think Clang has a feature (profile-guided optimisation) to profile a program and to use that information to guide the optimisation when you compile it again. This is used by Apple when building Clang itself.

                                    1. 5

                                      Excellent article! I agree with all points made, and wrote a response at https://rust-analyzer.github.io//blog/2020/09/16/challeging-LR-parsing.html.

                                      There’s one specific criticism I have though, which I’d rather submit as a comment. Note that I’ll only criticize this one thing, simply because I am silently agreeing with everything else :-)

                                      I think the article strawmans recursive descent parsing a bit. Or, rather, it is 100% precise and technically correct about pure recursive descent, but in practice people don’t use pure recursive descent. A specific claim I disagree with is this one:

                                      lack of left recursion makes expressing many standard programming language constructs as awkward as with recursive descent parsers.

                                      I claim that handling left-recursion with a hand-written parser is easy. Specifically, the code you write to parse xs := xs x | x is this:

                                      def parse_xs():
                                          while not eof:
                                              parse_x()
                                      

                                      Loops are the crucial difference between pure recursive descent as described in the literature, and recursive descent as implemented in practice.

                                      I think the article would benefit from mentioning, if even in passing, that by adding loops and Pratt parsing, hand-written recursive descent can be made to deal with natural grammars nicely. Of course this has no bearing on the issue that you’ll have no idea about what actual language will be parsed! I fully agree with this criticism of hand-written parsers.
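
                                      As a concrete (though hedged) illustration of the loops-plus-binding-powers idea, here is a minimal Pratt-style parser, sketched in C, handling only single-digit operands and the + and * operators:

                                      #include <stdio.h>

                                      static const char *p;  /* cursor into the input */

                                      static int binding_power(char op) {
                                          switch (op) {
                                          case '+': return 10;
                                          case '*': return 20;
                                          default:  return 0;  /* not an operator: stop looping */
                                          }
                                      }

                                      static long parse_expr(int min_bp) {
                                          long lhs = *p++ - '0';  /* assume a single-digit operand */
                                          /* The loop, not a recursive rule, gives left associativity. */
                                          while (binding_power(*p) > min_bp) {
                                              char op = *p++;
                                              long rhs = parse_expr(binding_power(op));
                                              lhs = (op == '+') ? lhs + rhs : lhs * rhs;
                                          }
                                          return lhs;
                                      }

                                      int main(void) {
                                          p = "2+3*4";
                                          printf("%ld\n", parse_expr(0));  /* prints 14 */
                                          return 0;
                                      }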

                                      I am somewhat salty about this issue because I took five different courses in formal grammars at the University, and all of them equated hand-written parsers with pure recursive descent, and none mentioned Pratt parsers (which I discovered by accident from Crockford’s article). It’s a shame, given that a Pratt parser is what I have to use in production :-)

                                      1. 1

                                        My major problem with recursive descent parsers is that they rarely accept the languages that people thought they did when they wrote them – the fact that they’re generally more verbose than CFGs is a secondary issue IMHO. It may amuse you to know that the same Michael Van De Vanter I referenced in my post is the same Michael Van De Vanter whose masters thesis expanded and popularised the not-yet-then-called-Pratt parser concept and the same Michael Van De Vanter who recommended Tim Wagner’s thesis to me as the way to go with incremental parsing :) Michael’s influence on the field is much underrated, possibly because he’s too much of a gentleman to toot his own horn.

                                        As that suggests, I think Tim’s work (which underlies Tree Sitter) can probably address some of your challenges. I don’t know if Tree Sitter implements Tim’s full approach to error recovery, which is much closer to what you call “error resilience” than it is to other error recovery approaches. When Lukas looked at this part of Tim’s work, he found a bug or two and some missing bits: you might find Lukas’s thesis worth reading (particularly Chapter 3). However, to fully address challenge 1 specifically you would need to be able to inform the parsing system of more of the language’s semantics to get the sort of output you would like. That’s eminently doable – lrpar does a very limited form of this with its %avoid_insert directive – but current systems don’t do it very much.

                                        I’d argue that Yacc more or less solves your second challenge already – you can use the stategraph -> statetable conversion to write ambiguous expression grammars that are resolved to unambiguous ones using declarations such as %left and so on. I’m not convinced that people always understand the consequences of these declarations, so I tend to discourage them, but when they’re used judiciously, they work well.

                                        The third challenge… well isn’t that Tree Sitter, or something like it? I wasn’t quite sure exactly what you were asking for there to be honest!

                                        BTW, in case I sound grumpy, I don’t mean to be! Thanks for rust-analyzer: I use it daily!

                                        1. 2

                                          My major problem with recursive descent parsers is that they rarely accept the languages that people thought they did when they wrote them.

                                          Yeah, I am in full agreement with you on this point.

                                          The third challenge… well isn’t that Tree Sitter, or something like it? I wasn’t quite sure exactly what you were asking for there to be honest!

                                          So, I mean I need IDE support for the grammar meta language itself. AFAIK, tree sitter doesn’t have this. Live preview of the grammar, like in this short video, would make writing a grammar much more approachable imo. This is purely an implementation/usability concern, but a pretty important one I would say. Parser generators have a pretty high barrier to entry, lowering it would be valuable.

                                          1. 1

                                            So, I mean I need IDE support for the grammar meta language itself.

                                            Ah, yes, I agree, this is nice to have: Eco supports its own grammar meta-language, for example.

                                            1. 1

                                              About the first point, do you have tests that the parser indeed corresponds to the grammar it’s supposed to parse? I mean more than unit tests. I imagine one could describe the grammar as a generator (in quickcheck or proptest) and then try the parser on that… But is it done in practice?

                                        1. 5

                                          Nice article! FWIW I went through a similar evolution of avoiding LR and coming back to it (to a degree).

                                          I posted this thread last year, which resulted in many interesting links and more approaches to parsing than I would have thought (which is further evidence that parsing is still a problem!)

                                          Survey: What parser generator are you using, if any?

                                          Although I didn’t “come back” as far as you did. I appreciate the work on error recovery and will have to look into it more deeply.

                                          Here are the “unresolved” tradeoffs I see:

                                          • As you mention, I would still go for recursive descent for a production quality implementation of an existing language. You can test against existing code, and you can produce better error messages. It takes a little practice but I think it’s worth it for most programmers.
                                          • Interfacing with the lexer is surprisingly nontrivial and people do awkward things to avoid it: https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-Should-Be-Separate
                                          • I like grammars for new languages. For the time being, I’m sticking with Python’s pgen LL(1) parser generator for Oil, not LR parsing. Python switched to PEG last year, so it no longer uses it! But those problems don’t apply to Oil because it has a slightly more explicit syntax. Still, I did see very clearly the limitations of this LL style of parsing after using it.

                                          Why did I stick with it? Mainly because I could borrow a good implementation of it (one that’s been used for decades by millions of people), and because I like the EBNF syntax in the meta language. I find the EBNF with + and * to be a lot more readable than “recursion”. I asked Guido and he said this is one of the main reasons he developed his own parsing system for Python.

                                          Example: https://docs.python.org/3/reference/grammar.html (you can see a lot of the tortured productions that caused them to switch metalanguages; Python 2 had fewer of these, while making elegant use of EBNF)

                                          There doesn’t appear to be an LR parser generator that uses + and * in the metalanguage rather than recursion. And I believe when you do an automatic conversion, you end up with a bunch of “extra” rules that makes the parser harder to debug and less efficient. I didn’t go deep into that.

                                          I will be reading this article more carefully, but all-in-all I agree there are a lot of problems to be solved with parsing. And coming back to LR and fixing its problems is one thing we should be doing!

                                          1. 3

                                            I wonder if you know of anyone working on these lexing issues?

                                            The design space is complicated by the fact that some people want more out of parsers/lexers than others. For a traditional “single language batch parser”, something simple like lex has always worked for me. But if you want to do things like embed languages with different syntaxes, you then have a lot of possible choices to muddy the waters. Eric Van Wyk and his gang did some interesting work in this area that’s worth looking at, in the sense of fairly traditional parsers interacting with changing lexers. I don’t know if anyone’s looking at it at the moment or not though.

                                            1. 2

                                              OK but I would say the lexing / parsing problem has gotten more demanding in the last 20 years, i.e. since most LR parsing tools were written. We aren’t parsing Pascal or Fortran anymore!

                                              Hence all the ad hoc hacks you see in the various lexing and parsing metalanguages.

                                              I think the strongest argument that the problem is more difficult now is that most languages have string interpolation of arbitrary expressions: in the past 10 years, Python added it with “f-strings”, JS added it with backticks, Swift has it with \(), etc. It’s just a problem that most people have to deal with.

                                              Although there are others, like regexes in JS and Perl: When Are Lexer Modes Useful?

                                              I think the issue is conflating LR parsing and parser generators. All parser generators seem to have problems with these constructs, regardless of whether they are LR or not!

                                              You see it in all of the /r/ProgrammingLanguages threads I linked. Most implementers/language designers still complain about all the hacks involved in grammars, and they choose to write parsers by hand for that reason. Here’s another similar thread:

                                              https://old.reddit.com/r/ProgrammingLanguages/comments/dwxwyq/what_lexerparser_generators_do_people_use_these/


                                              Now that I’ve read the article a bit more carefully, I didn’t find some of the arguments in favor of LR that convincing. They seemed to be along the lines of “well the problems aren’t that bad, especially once you get used to it”.

                                              I would say the same thing about recursive descent: it’s not that bad once you get used to it. (But again, I prefer grammars for new languages.)

                                              But I did go through a similar evolution – I went back and looked at why I was avoiding LR, although again I didn’t get as far as you.

                                              I now firmly believe that LR parsing is the best approach for the vast majority of purposes: it has the strongest practical safety guarantees, allows good grammar readability, and has decent performance.

                                              So I would have to quibble with this conclusion now. I believe “the vast majority of purposes” involves some problem that parser generators (LR or not) are bad at handling – at least in terms of implementing programming languages, vs. simpler DSLs. It might be only one or two parts of the language, but those can be a “dealbreaker”.

                                              I think composing parsing algorithms to fit the language is the most flexible and powerful approach. Oil uses hand-written recursive descent, Pratt parsing, and an LL(1) generated parser, all with the same “modal” lexer. Of course shell is one of the most syntactically complex languages out there, but I think overall languages are trending in that direction!

                                              People want more elaborate syntax not less elaborate! Rust is another good example of this. It has much more syntax than C, and arguably C++.

                                              1. 1

                                                Until recently Java (not a small language, syntactically or semantically) had an LR grammar (maybe it still could, but I’m not sure), so grammar size is not an argument against LR. Of course, some languages are an impossible (e.g. because they’re ambiguous) or poor (e.g. see Ruby’s Yacc grammar) fit for LR, and it would be silly to use LR for those. But as Guy Steele suggests, new languages that don’t have an LR grammar are making users’ lives harder than necessary.

                                                1. 1

                                                  Well Steele says that he uses LR grammar generators to check his language design, but he doesn’t actually use them to implement the language!

                                                  I totally agree with that, and that’s what I’m doing for Oil! The new language is prototyped with an LL(1) parser generator. Eventually it will probably be replaced by a hand-written parser, but, as I mentioned, I prefer grammars for new languages.

                                                  The Lua authors expressed the same sentiment in a paper – they used yacc to check their language design, and then switched to a hand-written parser.

                                                  I also think there’s a subtlety: it’s great to use a grammar generator to check the design of a large subset of a language. That’s still useful. But yacc falls down for a couple cases in a big language.

                                                  So I would still make the following claim:

                                                  Humans are better at parsing than common algorithms.

                                                  Humans like “subtle syntax”. They don’t have any problems reading string interpolation – they like it.

                                                  Humans like things to be “huffman coded”, which requires more syntax. And for syntax to match semantics. The opposite is Lisp, which has sort of “lost” syntactically.

                                                  That said, I appreciate the work going back to look at LR parsing and improving upon the state of the art. I think LR is super useful but I personally think of it as part of the story – i.e. composing LR with other algorithms where appropriate.

                                          1. 7

                                            A very good overview, though I’ve gone through a similar process and come to the opposite conclusion: an LL grammar and recursive descent parser are the simplest and most practical, and if your language can’t be parsed by it, it’s best for all involved to change your grammar until it can be.

                                            The thing is, the classic Mathematical Expression problem extends into dealing with prefix and postfix operators in general, which are really pretty common in most programming languages. Once you allow a function call to be Expr "(" [Expr] {"," Expr} ")" or whatever you have the left-recursion problem again. Similar with a[1][2] array dereferences and foo.bar.baz structure fields. There’s nothing wrong with a Lisp syntax IMO, but people don’t seem to like it so it’s worth putting a bit of effort into making these shortcuts, and they all have to deal with left recursion.

                                            So you do the traditional solution of expressing the precedence rules involved in these with a chain of different rules, and rapidly you end up with a giant chain of 10-12 different rules. Some people seem to just Get This and absorb it fine, but even in those moments when I can actually hold what’s going on in my head, it falls out the instant I move on to something else. I’ve been much, much happier writing parsers since I discovered Pratt parsing and equivalent algorithms, which are extensions to a simple recursive-descent parser that solve these problems in a way I find much more comprehensible.

                                            1. 3

                                              […] if your language can’t be parsed by it, it’s best for all involved to change your grammar until it can be.

                                              Is there a tool that takes a grammar and tells whether it’s LL?

                                              1. 4

                                                An LL parser (e.g. JavaCC) should be able to reject non-LL grammars (if it doesn’t… it’s probably not a tool you want to use!). [Caveat: there are different classes of LL (e.g. LL(1), LL(k)), but mostly people tend to mean LL(1).]

                                                1. 2

                                                  Thanks!

                                                  (Ab)using a parser generator crossed my mind, but the ones I’m familiar with (Yacc etc.) are LR.

                                            1. 1

                                              One question. You could use LR grammars without using an LR parser, right? I.e. if your target was to avoid ambiguity, you could still use LR grammars with Earley.

                                              PS: This link is 404ing: https://web.archive.org/web/20150919164029/https://www.cs.berkeley.edu/Research/Projects/harmonia/papers/twagner-thesis.pdf

                                              1. 1

                                                One question. You could use LR grammars without using an LR parser, right?

                                                Yes… but unless you check with a non-Earley parser you won’t know if changes introduce ambiguity (plus generalised parsers are likely to be slower than non-generalised parsers, as they have to deal with more cases). I’m not sure that buys you very much. Thanks for the link report – oddly it does work if you wait for archive.org to reload itself for several seconds, but I’ve changed it to a link that I hope works more reliably.

                                                1. 1

                                                  I can use an external parser to check it; for example, I can use your parser in Rust to check my grammar, but I can stick with my Earley parser in Python for my Python application because it is much simpler than an LR parser. About speed, do we have any benchmarks comparing parsers with LR grammars? Even Earley has linear-time parsing for non-ambiguous grammars.

                                                  1. 1

                                                    Even Earley has linear-time parsing for non-ambiguous grammars.

                                                    That’s only true if you apply the relevant optimisations, such as Leo’s. And if speed matters at all to you, you need to apply Practical Earley Parsing from Aycock and Horspool instead of building each Earley item manually.

                                                    Fast Earley parsing is a ton of work.

                                                    1. 1

                                                      Indeed, no arguments there. I was wondering how slow Earley (and other fast general context free parsers) is in practice (with or without Leo’s optimization) when they are pitted against LR parsers for the same grammar.

                                                      Personally, I find table driven parsers more difficult to debug, but I agree with Laurence that ambiguity in grammars is a pain.

                                              1. 2

                                                Hi Laurence, thanks for the great write-up. Looking at your blog, I found that you have published previously about error correction with LR parsers, and the code examples on that page were in Python. Is there any chance you might have an implementation of the CPCT+ algorithm in Python lying around?

                                                1. 2

                                                  I’m afraid there’s not a Python implementation of CPCT+ as such, because it needs to be plugged into an LR parser. It’s probably not rocket science to add it to an existing Python LR parser (especially as the algorithm examples in the paper are Python-ish), but I’m very out of date on what the best Python LR parsers might be unfortunately. Sorry!

                                                1. 14

                                                  I’d need to be convinced that another interpreter solves any major problems of cross-language interoperability. As far as I can tell, the challenge is squaring the circle of different semantics.

                                                  What happens when you take a Rust affine value, pass it to Python, which assigns a few copies of it around, and passes all of those copies back? What happens when you pass something with a destructor to a language that doesn’t support destructors? What happens with weak references that are implicitly nullable when you pass them to a language without NULL?

                                                  Getting all of that to work is where the effort in cross-language interop lies.

                                                  You might be able to get it to work in languages that feel close – the .NET CLR does this, for example. But getting Haskell, C++, Rust, and Python to easily interoperate without tons of glue code seems… unlikely.

                                                  And if you’re ok with glue code, the C ABI and the various language FFIs more or less lets you do the job.

                                                  1. 6

                                                    The C ABI does the job, but I think I am not revealing anything shocking when I say the C ABI could be greatly improved. Have you seen Fine-grained Language Composition by Laurence Tratt? It is a clear proof of concept of what is possible, and we are very far from it.

                                                    1. 3

                                                      Of course there’s room for improvement – the x86_64 calling convention is way too complicated. But that’s beside the point. The hard part isn’t the calling convention or the VM, it’s figuring out how to handle all of the language semantics.

                                                      That proof of concept is interesting – but I didn’t see any challenging features shared. What does PHP do with a StopIteration exception from Python? Can it consume a Python iterator at all? From that description, it seems like you can’t easily pass a list from PHP to Python – you need to manually adapt it. And you can get exceptions if the adapted PHP variable stops looking like a list.

                                                      And, these are two languages that most people would consider close.

                                                      1. 3

                                                        And, these are two languages that most people would consider close.

                                                        I made that mistake when we started that work, but was quickly disabused of it. PHP and Python look close only if you don’t look closely, mostly because PHP is a very unusual language (PHP’s references, for example, are rather unique). You might find the full paper (https://soft-dev.org/pubs/html/barrett_bolz_diekmann_tratt__fine_grained_language_composition/) has more detail about the challenges and how they were resolved. For example, Section 6 tackles cross-language variable scoping, which stretched my (admittedly meagre) language design abilities to, and possibly beyond, their maximum!

                                                        1. 2

                                                          the x86_64 calling convention is way too complicated

                                                          It’s quite simple. First few integral parameters in registers. First few floating parameters in floating registers. Further arguments on the stack. Large parameters (structs) passed either as pointers or on the stack.

                                                          1. 2

                                                            Large parameters (structs) passed either as pointers or on the stack.

                                                            The complexity comes from how small structs are passed and returned. It’s not unmanageable, but it’s enough to be painful compared to the Windows ABI. The rules around type interactions in small structs are more complicated than they need to be. This makes determining which registers to use to pass a struct overly complex.

                                                      2. 4

                                                        Rather than cross-language interoperability, I think that the real value would be in providing a single platform that could be ported to different systems and bring tons of existing code to them. Basically, taking the JVM portability advantage and democratizing it. It would really benefit niche systems. (Commoditize your complements!)

                                                        1. 1

                                                          But getting Haskell, C++, Rust, and Python to easily interoperate without tons of glue code seems… unlikely.

                                                          But that doesn’t mean there wouldn’t be wins in other areas. Programming in Haxe was like programming in 1 + (targets * .5) languages because you had to be mindful of how everything would be translated. However, things like unit testing were O(1) because the test suites would be automagically generated and run for each target.

                                                          And if you’re ok with glue code, the C ABI and the various language FFIs more or less lets you do the job.

                                                          I agree with you insofar as I’m skeptical language interoperability can go beyond functional interfaces passing serializable data structures. However, I think we can significantly improve on the C FFI baseline and piping strings around.

                                                          TruffleRuby’s implementation of the Ruby C FFI requires interpreting C code, yet it still manages to beat the pants off of mainline because the common AST representation of the two languages allows the JVM to perform optimizations on a shared in-memory representation of their data. Or take the WASM type spec, which would allow for similar optimizations on similar data structures while providing sane defaults when converting between primitive values.

                                                        1. 12

                                                          I agree with one implicit message of this post which is that parsing is an almost endless series of trade-offs. For example, for a simple binary format, I’d probably use PEGs, but when parsing textual formats, bitter experience has conditioned me to prefer LR. There are various reasons for this, such as: PEGs are easy to write but, in general, it’s impossible to work out what language they accept (most medium/large PEG parsers I’ve looked at in detail do not work as their authors expected); LR parsers are easy to reason about but more awkward to write and generally have poor error handling. I’ve tried to address that latter issue in grmtools/lrpar https://softdevteam.github.io/grmtools/master/book/errorrecovery.html – it’s not the right solution for everyone, but it might be another interesting point in the endless space of parsing trade-offs!

                                                          1. 4

                                                            menhir is an advanced parser generator in OCaml and features pretty good error recovery mechanisms, as far as I know. It’s used in merlin, the semantic server for OCaml, which is extremely robust. LR or LALR generators can be better than yacc!

                                                            1. 5

                                                              Menhir has a very cool, but very different, approach to error recovery. In essence, users can manually annotate the stategraph with detailed error messages (and Menhir has a cunning mechanism to make it semi-automatic to migrate those messages as the grammar changes) that can help users understand what’s going on. grmtools/lrpar’s approach is fully automated. As always in parsing there are pros and cons to this! There’s more details in the related work section of https://arxiv.org/abs/1804.07133 if you’re interested.

                                                          1. 1

                                                            Left recursion and PEGs is a surprisingly hard topic. The first algorithm I know of in this area, by Warth et al., can produce incorrect parses (I wrote that up here).

                                                            I found it impossible to understand the output of your parser, so I quickly added this pretty printer so that I could see the tree structure more easily:

                                                            def pp(t, i=0):
                                                              for x in t:
                                                                print("%s%s" % (" " * i, x[0]))
                                                                for y in x[1:]:
                                                                  pp(y, i + 1)
                                                            

                                                            With that, I can see that your algorithm does correctly parse the example that Warth et al. have a problem with (1-2-3) but I’m not sure how you encode precedence (so 2+3*4 is not parsed correctly, no matter what order I reorder the productions in E), so I’m not quite sure how fair a comparison this is.

                                                            Anyway, I wish you luck in your quest! This stuff made my head hurt so much that I gave up on PEGs entirely and moved on to LR parsing.

                                                            1. 1

                                                              Thank you for the link! I haven’t really thought about precedence as encoded in the PEG order, but I see that it is a major issue.