1. 2

    Nicely explained. You may want to examine relational parsing, a different approach which has the same characteristic performance properties as Marpa-style Earley parsing.

    1. 2

      While on the subject of novel techniques, a non-Earley-like approach (sorry for being off-topic) that I really like is the Pika Parser. It has superb error recovery, which is how I found it in the first place.

      1. 1

        Indeed, there is also the Glush parser that is very recent. There is also the derivatives approach for parsing. I hope that at some point, we are able to bring all these together (implementations in the same language, accepting same grammar format), and do a systematic study of pros and cons.

        1. 1

          The linked-to paper is an excellent overview of parsing techniques.

        1. 4

          Indeed, I link to that and also Marpa in the beginning paragraphs. Hopefully, being interactive, this tutorial can share the space?

          1. 2

            Well, if you implement Leo’s optimisation, you already have a significant advantage compared to my tutorial. I’m personally very interested in closing that gap, so I’m definitely going to read your tutorial.

            1. 1

              Thanks! Your tutorial was one of the main references I used while implementing most parts.

        1. 1

          I would add some examples for each defined term (at the end of each definition). I always get tripped up by this vocab when reading about parsing, so every little hint helps.

          1. 1

            Thanks! I will do that.

          1. 4

            I’d like to work on Prolog for some stuff. For me, it’s a very concise language, and I feel confident that the code I write is correct even when I write fewer tests. It is not perfect: there are sometimes performance issues and a lack of tooling, so for anything serious I will probably not choose it. My last public work with Prolog is an HTTP server for Scryer Prolog. It still needs more work, but it is already there if you want to try it. Some months ago I also shipped an HTML5 game with some parts written in Tau Prolog.

            1. 1

              What is your experience with Prolog? Do you use any of the advanced CLP features in some of the Prologs?

            1. 1

              If you write loop or equivalently,


              This tends to be read as an infinite loop, where the termination condition is exceptional. That does not seem to be the use you intend. While it could help you to prototype, I think it would be a disservice to your readers to leave it in the final code.

              1. 2

                Thank you for the rankings! Just at the right time too :)

                1. 1

                  I have recently become interested in accessibility of programming languages, and I think that the scope indication would be a good idea for visually impaired programmers.

                  1. 15

                    I think it’s because we don’t have better tooling. Wireshark makes observing binary protocols much easier and richer than just text, and if we had richer structural editing (think Interlisp-like editing s-exprs in memory) and structures (again, closer to TLV over C structures), we could just make binary encodings for everything reasonable.

                    Alas, Unix.
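
                    The TLV point is worth making concrete: a generic tool can walk any TLV stream knowing only the framing convention, with no per-format code. A toy decoder (the 1-byte-tag / 2-byte-length framing here is invented purely for illustration):

```python
import struct

def decode_tlv(buf):
    """Decode a simple tag-length-value stream: 1-byte tag,
    2-byte big-endian length, then `length` bytes of value."""
    records = []
    i = 0
    while i < len(buf):
        tag = buf[i]
        (length,) = struct.unpack_from(">H", buf, i + 1)
        records.append((tag, buf[i + 3 : i + 3 + length]))
        i += 3 + length
    return records

# Two records: tag 1 -> b"hi", tag 2 -> b"xyz"
data = b"\x01\x00\x02hi" + b"\x02\x00\x03xyz"
print(decode_tlv(data))  # [(1, b'hi'), (2, b'xyz')]
```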

                    1. 25

                      But text as a common denominator is the reason that there are good tools. Bespoke binary formats introduce a fundamental O(M * N) problem. You have M formats and N operations, and writing M*N tools is infeasible, even for the entire population of programmers in the world.

                      For example, I can grep a Python file, a JavaScript file, Markdown, HTML, a log file, etc. How do I grep a PDF, a Word document, or an RTF file? You need a tool for each format.

                      I can git pull and git diff any of those files too. How do I diff two Word files? I have to wait for somebody to build it into Word.

                      How do I version control a bunch of video clips cut from the same source? You need a custom tool if you want to do anything but copy the whole thing. (Again video editing tools have this knowledge built in, e.g. with an app-specific notion of a “project”)

                      How do I diff two Smalltalk images?

                      I’m not saying those are necessarily bad designs, but there’s an obvious tradeoff, and it’s why Unix is popular. And it’s especially popular in open source where you can’t just throw highly paid programmers at the problem.

                      It is not an absolute; in reality people do try to fill out every cell in the O(M * N) grid, and they get partway there, and there are some advantages to that for sure.

                      Editing with a human-friendly UI is one of the most expensive “rows” in the grid to fill. Many structural editors exist, but they don’t catch on. They have a fundamental limitation. The problem is: which structure? The lowest common denominator of structural formats doesn’t exist. In other words, the lowest common denominator is an unstructured byte stream.

                      IntelliJ is partway there – they have a generic text editor, but structure is laid on top. It’s nontrivial to get this layering right, and you still have a bunch of work to do for each language. Try doing Zig in IntelliJ, etc. I think you will mostly have a text editor (and that’s good).

                      Don’t forget that parsing is at least two rows in the grid: parsing for validity/interpretation/compilation, and parsing incomplete code for a rich UI. IntelliJ has its own parsing framework to make these operations easier in a polyglot manner. But it’s expensive to produce and Eclipse wasn’t able to duplicate it in the open source world.

                      Wireshark supports pretty-printing network protocols. But what about sed, grep, wc (counting records), diff, and awk for all network protocols? I’m sure it has support for many of those things by now, although it’s not easy to find the abstraction that lets you do it. There’s an inherent tradeoff.

                      Some projects addressing the tradeoff:

                      (I should probably write a blog post about this)

                      Other resources: Always Bet On Text by Graydon Hoare

                      Text is the most efficient communication technology

                      Text is the most socially useful communication technology

                      It can be compared, diffed, clustered, corrected, summarized and filtered algorithmically. It permits multiparty editing. It permits branching conversations, lurking, annotation, quoting, reviewing, summarizing, structured responses, exegesis, even fan fic. The breadth, scale and depth of ways people use text is unmatched by anything. There is no equivalent in any other communication technology for the social, communicative, cognitive and reflective complexity of a library full of books or an internet full of postings. Nothing else comes close.

                      A related rant about protobufs from a couple days ago: https://news.ycombinator.com/item?id=25582962

                      Protobufs give you structured data – now what’s the problem? Isn’t that what we want?

                      No: the problem is that simple operations like copy are no longer generic. Copying doesn’t work like Unix cp or memcpy(); you need machine code to implement this one operation for every single type. Tons of complexity results from this (as explained by multiple primary authors of protobufs):

                      The protobuf core runtime is split into two parts, “lite” and “full”. If you don’t need reflection for your protos, it’s better to use “lite” by using “option optimize_for = LITE_RUNTIME” in your .proto file (https://developers.google.com/protocol-buffers/docs/proto#op…). That will cut out a huge amount of code size from your binary. On the downside, you won’t get functionality that requires reflection, such as text format, JSON, and DebugString().

                      Even the lite runtime can get “lighter” if you compile your binary to statically link the runtime and strip unused symbols with -ffunction-sections/-fdata-sections/–gc-sections flags. Some parts of the lite runtime are only needed in unusual situations, like ExtensionSet which is only used if your protos use proto2 extensions (https://developers.google.com/protocol-buffers/docs/proto#ex…). If you avoid these cases, the lite runtime is quite light.

                      However, there is also the issue of the generated code size ..

                      1. 5

                        Another concept to look into is the “thin waist” of networking, of operating systems, and compilers. They are all least common denominators that solve O(M*N) problems:

                        • TCP/IP is a thin waist. On one side you have low-level networks like Ethernet, Wireless, etc. On the other side you have application formats like HTTP, e-mail, IRC, etc.
                          • It doesn’t make sense to have e-mail work over wired, then have to port it to wireless, etc. In the early days, it DID work like that. You lacked the abstraction of the thin waist. This was the whole point of the Internet – to bridge incompatible networks.
                        • Unix is a thin waist. On one side you have heterogeneous hardware (supercomputers, PCs, embedded devices). On the other side you have applications.
                        • LLVM is a thin waist. On one side you have languages, and on the other you have ISAs.
                        1. 2

                          We should be careful to note that the cost is not O(1), as we might hope, but O(|M| + |N|). We have M parsers, N backends, and a single format that they all use for interchange. This cost shows up in tools like Pandoc and in the design of protocols like NDN which aim to explicitly have a thin-waist model.

                        2. 4

                          or an RTF file

                          Bad example: RTF IS a text-based format. https://en.wikipedia.org/wiki/Rich_Text_Format

                          And the text parts of PDF and Word documents frequently are actually stored as text too. The problem is grep doesn’t actually work on “text”… try something like strings foo.doc | grep ...; decent chance it will actually work. Of course “text” is in scare quotes because you might have to convert from UTF-16 or something too.

                          And grep only works if there are reasonable line breaks, which is frequently untrue: grepping HTML often just returns one gigantic line, and ditto for diff. You might have to put some format-specific formatter in the middle of the pipeline to make it actually usable by those other tools.

                          The general principle of just sticking another pipe in the middle to convert applies to lots of formats. In my other comment I used disassemblers as an example and they work pretty well.

                          Text and binary isn’t really that different.
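
                          As a sketch of that principle, here is a rough approximation of what strings(1) does (a simplification, not its exact behavior) – exactly the kind of pipe you can stick in the middle to make binary data greppable:

```python
import re

def strings(data: bytes, min_len: int = 4):
    """Rough equivalent of Unix strings(1): pull out runs of
    printable ASCII at least min_len bytes long."""
    return [m.group().decode("ascii")
            for m in re.finditer(rb"[ -~]{%d,}" % min_len, data)]

blob = b"\x00\x01Hello, world!\xff\x00hi\x00another run here\x02"
print(strings(blob))  # ['Hello, world!', 'another run here']
```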

                          1. 1

                            Adding to your comment: PDF is also a text-based format. New Word documents too (OOXML).

                            1. 1

                              That’s all true, but it doesn’t invalidate the main point. Text is a compromise that works with a lot of different tools. You get more reuse.

                              strings foo.doc | grep is a good example.

                              There are tools that take an HTML or JSON tree structure and make it line-based so you can grep it, e.g.


                              These are features, not bugs.

                            2. 2

                              I realize this isn’t the main thrust of your argument, but Word can definitely diff two Word documents, and act as a merge tool for them. It’s not as straightforward as regular diff, but it’s definitely capable there.

                              1. 2

                                I think a good argument against your O(M * N) problem is that the onus should be on the binary format provider to supply some kind of binary -> text tooling. FlatBuffers, for instance, provides tooling to go to and from JSON.

                                Now, an argument against that is that it’s herding cats to get everyone to actually provide said tooling. Which I guess is why we can’t have nice things.

                                1. 1

                                  You can also write your own text -> binary conversion, and lots of people have done that for specific use cases.

                                  That is an argument FOR text, not against it (but also not against binary). It’s not either/or or absolute like I said in the original comment. Binary formats and text formats are both valid tools, but unstructured text is a useful “thin waist” that reduces engineering effort.

                                  Related: https://lobste.rs/s/vl9o4z/case_against_text_protocols#c_jjkocy

                                  I recommend reading “The Art of Unix Programming”, there is a section about textual formats for images.

                                  Related: Use echo/printf to write images in 5 LoC with zero libraries or headers

                                  How to create minimal music with code in any programming language

                                  (i.e. use the Unix style. If you’re writing code in a niche language, you don’t need to wait for somebody to write a library.)

                                2. 2

                                  I have two recurring problems with text diffs:

                                  One is when a colleague has changed something in a UI definition and it has been serialised into XML or JSON, and values in a dictionary have switched places, or a numeric value has been re-rounded, leading to a bigger diff than necessary.

                                  The other is when I’ve refactored an if clause into a guard, re-indenting much of a function. The structure has changed, but the diff covers an entire function instead of the structure itself, which is just a few lines.

                                  Text is just an approximation of structure.

                                  1. 1

                                    Well, yes… Most programming languages and text formats are liberal when it comes to whitespace and newlines. This is a design choice with trade-offs like the one you mention. But it is not always the case. For example, HTTP headers do not let you squeeze in ad hoc whitespace or newlines.

                                    Notice that the problem arises from using a specialized UI tool to output text from structure… then it doesn’t play well with simple text-based tools. Using a simple text editor gives you full control of every single character and will not have the problem you describe.

                                    This is exactly the same problem with binary formats, because you pretty much have to use a specialized tool.

                                    1. 1

                                      Yes, that’s valid and true. But it doesn’t contradict anything I said – byte streams are a compromise, a thin waist, just like TCP and LLVM are. TCP is very inefficient for certain applications; likewise LLVM is an awkward fit for many problems. But it’s uneconomical to rewrite the entire ecosystem for a specific application.

                                      Though note you can also write your own GIT_EXTERNAL_DIFF for XML and JSON, and I’m sure someone has done so already. Unix is flexible and customizable.
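
                                      For instance, such a helper could canonicalize both sides before diffing, so key order and formatting stop producing noise (a sketch of the idea, not a specific existing tool):

```python
import json

def canonical(text: str) -> str:
    """Re-serialize JSON with sorted keys and fixed indentation, so
    semantically equal documents become byte-for-byte identical."""
    return json.dumps(json.loads(text), sort_keys=True, indent=2)

a = '{"width": 100, "height": 50}'
b = '{"height": 50, "width": 100}'   # same data, keys reordered
print(canonical(a) == canonical(b))  # True
```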

                                    2. 1

                                      If you have used uchex, how different is that approach from simply using a parser with automatic error correction (something like what ANTLR does)? (I find it strange that they do not mention any standard error recovery except for a passing mention in 2.1).

                                      1. 2

                                        I haven’t used it, but I think I should have called it “micro-grammars” rather than “uchex” (they use that name in the paper). It’s an approach for partial parsing and semantic analysis that works on multiple languages.

                                        Looks like some people have implemented this:


                                        I think error recovery is related, but philosophically different? There is no error recovery in microgrammars. They purposely don’t specify some information. They don’t specify ALL of the language and then “recover” from errors. And as I understand it, they also do semantic analysis. I’d be interested in details from anyone who has more experience.

                                        1. 1

                                          Thanks for the link! Much appreciated.

                                      2. 1

                                        Thanks a bunch for this comment, I’ve personally learned a lot!

                                        An interesting thin waist solution for binary formats would be having canonical text representation. WASM does this right I think.

                                        1. 1

                                          Right, WASM needs to be compact on the wire and quickly parsed. It is mostly generated by compilers, but occasionally you want to write it by hand, and read it. The economical solution is to project it onto text.

                                          Sure, you can write a structured editor for WASM as the OP was suggesting, but does anyone really think that is a good solution? Do you also want to write a WASM version control system that knows how to diff and merge WASM functions perfectly?

                                          It’s definitely possible, but I want to see someone pursue those projects with a straight face.

                                          When you convert to text, you get a bunch of tools for “free”: you can use sed, grep, git diff, pull, merge, etc. on it. You are moving onto the thin waist; WASM binary is off the thin waist.

                                          Glad you got something out of this comment!

                                          I often find myself in these “systems vs. code” arguments. Most people are trying to optimize the code (for local convenience) whereas I’m trying to optimize the system (for global properties). The whole point of Oil is to optimize systems, not code.

                                          Ken Thompson understands this; he wrote about the lack of records in his “sermonette” on Unix design in the very first paper on the Unix shell: https://lobste.rs/s/asr9ud/unix_command_language_ken_thompson_1976#c_1phbzz

                                          (UTF-8 is also a Thompson design that lacks metadata unlike UTF-16, and that’s a feature.)

                                          Systems vs. code is a pretty big mindset difference, and it often leads to opposite conclusions. A lot of people here claim they don’t understand Rich Hickey

                                          https://news.ycombinator.com/item?id=23905198 (my reply as chubot)

                                          which linked:


                                          The point here is that dynamic typing has advantages in systems (similar to unstructured text). (And I just got a raised eyebrow from a CS professor when I made this claim, so I know it’s not obvious.)

                                          Field presence is checked at runtime, not compile time, because it enables graceful upgrade. That is a systems thing and not a code thing.

                                          Static typing is also valid, but should be layered on top. Just like binary is valid, but should gracefully co-exist with tools for text. Haskell, OCaml and Rust are great, but they run on top of untyped Unix, which is working as intended. I consider them all domain-specific type systems. (This is a real argument: Unikernels are optimizing locally, not globally, which is a fundamentally limited design.)

                                          I have some blog posts queued up on this, referencing Ken Thompson vs. Kubernetes, and system design, although it’s not clear when they will appear …

                                    1. 1

                                      What is a text protocol? Does the poster mean line based protocols (like the example in the article), or are they referring to any protocol that is not TLV encoded (but not necessarily line based, like PDF or JSON)?

                                      1. 1

                                        Yeah, I also got confused by the wording. I think the article is talking about a ‘serialization format’, not a protocol per se.

                                        A protocol is usually a set of commands + a data serialization format for each command + a state machine with well-defined transition events.

                                        I think the article is talking about just data serialization part of the protocol.

                                        Another confusion for me – maybe this is more from the comments – is document storage formats (e.g. PDF, RTF, etc.), and whether they are related to ‘protocols’.

                                        Usually a document storage format consists of: a schema (or metaschema) of object types, a serialization format for each of the allowed object types, and reference object types (that represent a form of foreign keys or references within a document). But document storage formats usually (at least in my experience) do not have ‘protocols’.

                                      1. 3

                                        How do you debug a binary blob when one of your fields is corrupted so that your tag length is off for a single object?

                                        1. 2

                                          Am I correct in understanding that current RL techniques are equivalent to learning an automaton? (i.e., Markov decision processes require behavior to depend only on the current state, and there are only finitely many states.)

                                          Is there research in learning a pushdown automaton?

                                          1. 3

                                            For those who are unfamiliar, concatenative programming uses combinatory calculus as its base rather than lambda calculus. Combinatory calculus works very well as a base for stack-based languages like PostScript, Forth, Factor, and Joy.

                                            1. 5

                                              There is a fundamental sense in which combinatory calculus is not the base for concatenative programming.

                                              In Schönfinkel’s calculus, a combinator is a function that takes a function as an argument and returns a function as a result. As per the original post, the revolutionary idea here is to promote functions to first class values. Prior to Schönfinkel, mathematical functions existed at a different level than data. A function takes data as arguments, and returns a datum as a result, but a function is not data.

                                              Concatenative languages are stuck in a pre-Schönfinkel world view where functions are not data. And frankly, that makes these languages less interesting to me. A ‘function’ (what Forth calls a ‘word’ and Joy calls an ‘operator’) takes a stack of data as an argument, and returns another stack of data as a result. Words are not data. You cannot directly pass a word as an argument to another word. In Joy, unlike Forth, there is a quoting mechanism for converting a function into data: you enclose it in square brackets, creating a list. Then you can use the i function to convert a list back into a function and execute it.
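
                                              That quoting mechanism can be sketched with a toy evaluator (a deliberately minimal model, not real Joy): a quotation is pushed as plain data, and i pops it and executes it.

```python
def run(program, stack=None):
    """Tiny Joy-like evaluator (a toy model, not real Joy).
    A program is a list of tokens; a nested list is a quotation,
    i.e. a program represented as data on the stack."""
    stack = [] if stack is None else stack
    words = {
        "dup": lambda s: s.append(s[-1]),
        "*":   lambda s: s.append(s.pop() * s.pop()),
        "i":   lambda s: run(s.pop(), s),  # unquote: run the list as code
    }
    for token in program:
        if isinstance(token, list):
            stack.append(token)        # quotations are pushed, not run
        elif token in words:
            words[token](stack)
        else:
            stack.append(token)        # literals (numbers) are data too
    return stack

# [dup *] sits on the stack as data until `i` executes it: 3 -> 9
print(run([3, ["dup", "*"], "i"]))  # [9]
```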

                                            1. 3

                                              This is great news! I wish that more programming languages (those that have not yet adopted it) would adopt it with the same operator.

                                              Any chance this will be consistently adopted? It is still a bit irksome when I have to use + in ggplot but not anywhere else in tidyverse.

                                              1. 3

                                                In the linked history, the timeline starts with SCCS; however, I believe that PATCHY from CERN (which still exists!) came first (1960s, AFAIK).

                                                1. 3

                                                  In my twenties, I used to do a lot of programming in Tcl. A language that most would agree is about as far removed from a modern statically typed programming language as you can get. (Often pointed out with something close to utter disgust). And yet the idea of making illegal states unrepresentable was entirely natural to myself and other Tcl programmers. I would frequently start a programming project by creating a blank-slate interpreter, stripping out all the built-in language constructs (loops, procedures, even arithmetic), and then adding back in a carefully selected set of high-level domain-specific primitives. This DSL would then become the application’s configuration file, safe in the knowledge that illegal configurations cannot be expressed.
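
                                                  A rough Python analogue of that blank-slate pattern (Tcl’s safe interpreters do far more; this models only the command whitelist):

```python
def make_interp(allowed):
    """Blank-slate interpreter in the spirit of Tcl safe interps:
    only the whitelisted commands exist, so a config file simply
    cannot express anything outside the DSL."""
    def eval_line(line):
        cmd, *args = line.split()
        if cmd not in allowed:
            raise ValueError("unknown command: " + cmd)
        return allowed[cmd](*args)
    return eval_line

config = {}
interp = make_interp({
    "set_port": lambda p: config.update(port=int(p)),
    "set_host": lambda h: config.update(host=h),
})
interp("set_port 8080")
interp("set_host example.com")
print(config)  # {'port': 8080, 'host': 'example.com'}
# interp("exec rm -rf /") raises ValueError: the command doesn't exist
```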

                                                  Tcl had/has such depth, often lurking around in corners. One of my favorite languages of all time!

                                                  1. 11

                                                    I have done both a masters and a Ph.D., both in CS, after my bachelors in Civil Engineering (I joined the SE industry after my bachelors and went back to school after 10 years in the industry for my masters and, later, my Ph.D.).

                                                    You do your masters for an extra leg up in the industry; very few jobs actually require it, though. On the other hand, depending on what subjects are in your masters, it can be really worth it.

                                                    Don’t go for a Ph.D. unless you want it. The jobs associated with a Ph.D. in CS typically pay less than the typical jobs you will get with an M.S., and they come with longer work hours. So if you do not absolutely love independent research, do not go for it.

                                                    Seeing that everyone seems to be offering the same advice, let me tell you the other side too. When starting my Ph.D., I assumed that a Ph.D. was like a better masters. I had never done independent research before, and if someone had told me that a Ph.D. was exclusively about being able to do independent research, I wouldn’t have done it. However, having gone through it, and having done independent research, I find it immensely fulfilling. I don’t think I will ever be happy going back to the industry setting.

                                                    1. 5

                                                      Don’t go for a Ph.D. unless you want it.

                                                      I can’t agree with this more. A mentor of mine while I was interning during undergrad told me to not get a Ph.D. if I just wanted more letters after my name, but instead, to get it if I wanted to do Ph.D. work. I went for the Ph.D. and left early with a Masters. While I enjoyed research a lot, academia was never for me…but I wanted those extra letters after my name.

                                                      With that said, there are also plenty of opportunities to do research in industry with or without a Ph.D.

                                                    1. 12

                                                      One downside of using the normal Unix utilities for csv files is that handling quoted fields is complicated. If you have a comma inside a field, the field needs to be quoted.

                                                      I’ve used xsv table, and xsv includes other useful commands for csv files that you can pipe together: https://github.com/BurntSushi/xsv
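
                                                      The quoting problem is easy to demonstrate: naive comma-splitting breaks on a quoted field, while a real CSV parser handles it.

```python
import csv, io

line = 'id,note\n1,"hello, world"\n'

# Naive comma-splitting breaks on the quoted field:
print(line.splitlines()[1].split(","))  # ['1', '"hello', ' world"']

# A real CSV parser respects the quoting:
rows = list(csv.reader(io.StringIO(line)))
print(rows[1])  # ['1', 'hello, world']
```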

                                                      1. 2

                                                        One downside of using the normal Unix utilities for csv files is that handling quoted fields is complicated. If you have a comma inside a field, the field needs to be quoted.

                                                        Yes, this is a classic problem. Fortunately, CSV is one of the formats supported by Relational pipes, so you can do e.g.

                                                        cat data.csv | relpipe-in-csv | relpipe-out-tabular | less -RSi

                                                        and see the well-parsed CSV content. Or do some filtering or transformation using relpipe-tr-sql, relpipe-tr-awk, relpipe-tr-cut, relpipe-tr-grep, relpipe-tr-sed, relpipe-tr-scheme etc. You can also convert data from different formats to valid CSV using relpipe-out-csv.

                                                        1. 1

                                                          FWIW I’ve had relational pipes on my wiki for a while:


                                                          Although I don’t think the use of binary formats plays well with shell-like tools. Also, it seems there is no spec; e.g., how do you escape the record separator?

                                                        2. 2

                                                          Yeah, quoting is one reason I started using TSV over CSV.

                                                          There’s an RFC that actually defines TSV not to contain tabs! So you can’t express them at all. But at least it’s consistent, unlike CSV. I have seen some really horrible things people call “CSV”, e.g. emitted by hardware!

                                                          So I have an extension planned which I’m now calling QTSV: https://github.com/oilshell/oil/wiki/TSV2-Proposal

                                                          It’s very simple in that you’re allowed to put QSN in cells: http://www.oilshell.org/release/latest/doc/qsn.html

                                                          So a QTSV file is syntactically a TSV file, and it will work fine with column, and be able to express tabs in fields.
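
                                                          As a sketch of the general idea (a hypothetical illustration of escaping the structural characters inside cells, not the actual QSN syntax):

```python
# Hypothetical illustration only -- not the actual QSN spec.
# The idea: escape the characters that carry TSV structure, so any
# cell value fits on one physical, tab-separated line.
ESCAPES = {"\t": "\\t", "\n": "\\n", "\\": "\\\\"}

def encode_cell(s: str) -> str:
    return "".join(ESCAPES.get(ch, ch) for ch in s)

row = ["a\tb", "line1\nline2"]
encoded = "\t".join(encode_cell(c) for c in row)
print(repr(encoded))  # 'a\\tb\tline1\\nline2'
```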

                                                          1. 3

                                                            I wonder why tooling doesn’t emit the record separator and friends, which are specified by ASCII…

                                                            1. 2

                                                              I can think of a couple reasons:

                                                              • text editors don’t support them well. If tools don’t support them, then they might as well not exist. Imagine if text editors didn’t support the \n character to mean the end of line – would people still use it?
                                                              • ditto for terminals. It’s nice to be able to view a TSV or CSV file in a terminal.
                                                              • grep doesn’t work anymore, since it’s inherently line-based.
                                                              • You still have an escaping problem. Just like TSV fields can’t contain tabs, RS-delimited files can’t have fields containing RS. So it just shifts the problem around.

                                                              (QTSV doesn’t have these problems. Being able to just grep is a feature, but you could also make your own qtsv-grep, etc.)

                                                          2. 1

                                                            Any idea why we do not use escaping for commas that are data rather than field separators?

                                                            sentence, here is an escaped\, comma, and more
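A parser for this scheme is straightforward; here is a minimal sketch (my own illustration, not from any CSV spec) that splits on commas not preceded by a backslash and then unescapes:

```python
import re

s = "sentence, here is an escaped\\, comma, and more"

# Split on ", " only where the comma is NOT preceded by a backslash,
# then turn the escape sequence back into a literal comma.
fields = [f.replace('\\,', ',') for f in re.split(r'(?<!\\), ', s)]
print(fields)  # → ['sentence', 'here is an escaped, comma', 'and more']
```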

                                                            1. 1

                                                              My guess is that someone had the idea to separate numeric and string fields by enclosing strings in double quotes. This is supported by a SuperCalc manual from 1983 (linked from the Wikipedia page for CSV).

                                                              1. 1

                                                                Quotes are also used to embed newlines into a field, so you’d need some mechanism for escaping those as well.

                                                                1. 1

                                                                  Indeed. But I wonder if the same mechanism can be used:

                                                                   sentence, here is an escaped\
                                                                   newline, and more
                                                                  1. 1

                                                                    Yeah, I think it could. There might be some trickiness involved in handling CRLF though.
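The same negative-lookbehind trick extends to record splitting, and it also shows the CRLF trickiness mentioned above (this is my own sketch of the idea, not an implementation of any spec):

```python
import re

data = "sentence, here is an escaped\\\nnewline, and more\r\nsecond, record"

# Split only on newlines NOT preceded by a backslash; \r?\n covers CRLF.
# Trickiness: in an escaped CRLF ("\\\r\n") the lookbehind before \n sees
# \r, not the backslash, so a naive pattern would still split there.
records = re.split(r'(?<!\\)\r?\n', data)
print(len(records))  # → 2
```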

                                                            1. 2

                                                              Tcl is one language that, even after all these years, I still love (I learned it in 2000). Combined with its two superpowers, Tk and Expect, it is really hard to beat when you want a glue language that is cleaner than shell.

                                                              1. 2

                                                                I learned Tcl through Expect, back in like 1998 when I had to manage a large number of serial-connected network devices. I didn’t even know there was a Tcl outside of Expect for the first year or so.

                                                              1. 1

                                                                Context-free grammar formalisms are less relevant than one might think. Most realistic compilers need degrees of context-sensitivity and thus use hand-written parsers: either because the core language grammar is context-sensitive (e.g., significant whitespace as in Python) or because the general user experience profits from context-sensitivity (e.g., error handling, feedback in IDEs).

                                                                How hard is it to use a skeletal CFG and work on the tree for context sensitive features compared to hand-rolling the complete parser? Has anyone done both for a comparison?

                                                                1. 1

                                                                  I don’t think this is a well-formed question. “Context sensitivity” isn’t a thing; it’s just unfortunate terminology. I am with Okhotin on this one: let’s call this class just “grammars” or, if we want to be specific, “ordinary grammars”.

                                                                  Or rather, context sensitivity has an intuitive meaning (things are different based on context), and CFGs are context-sensitive in this way. But this is still a hand-wavy property, and it doesn’t really make sense to say “this needs context sensitivity”, because you can’t replace CS with a fixed definition here.

                                                                  In technical terms, context sensitivity means that you allow general rewrite rules, like ABC -> CdeFg, with the restriction that the rhs is not shorter than the lhs. That is, CFGs have a single nonterminal to the left of the arrow, while CSGs have a string there. This is irrelevant to any real-world grammars or IDEs.
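As a concrete illustration of that formal definition (my own example, not from the thread): the textbook context-sensitive language aⁿbⁿcⁿ cannot be generated by any CFG, yet recognizing it is trivial:

```python
def is_anbncn(s: str) -> bool:
    """Recognize a^n b^n c^n (n >= 1), the classic context-sensitive language.

    A CSG for it needs rules like CB -> BC, i.e. a string (not a single
    nonterminal) on the left-hand side -- exactly the generalization
    described above.
    """
    n = len(s) // 3
    return n > 0 and len(s) == 3 * n and s == 'a' * n + 'b' * n + 'c' * n

print(is_anbncn('aabbcc'))  # → True
print(is_anbncn('aabcc'))   # → False
```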

                                                                1. 1

                                                                  The paper does not come with any data to download :-(

                                                                  “However, beyond the observation that the three variables are positively associated, the strength of the associations and their precise relationships are a matter of open debate and controversy in the research community.”

                                                                  One solution is for the testing community to learn how to do statistical analysis that is more powerful than the t-test. I don’t mean to pick on you; this statement could be said of virtually all researchers in software engineering.

                                                                  1. 1

                                                                    We will have the data available soon. Will keep you updated.