1. 41
    On Parsers plt wiki.alopex.li
  1.  

  2. 10

    Great article. There are a few things that I’d phrase slightly differently:

    The article claims that parsing is not slow anymore. This is true, for languages and implementations that meet the all of the following conditions:

    • Parsed units are self-contained and import others via a structured module system.
    • The implementation is an AoT compiler or a JIT / interpreter that is not sensitive to start-up time.
    • The majority of the parsed text generates non-trivial executable code.

    This is not true for C, C++, or JavaScript (until very recently), where the module system is basically cat. V8, for example, does a trick where the parse top-level declarations and then skip to matching braces. They only parse function bodies the first time that they’re called because a web page may include a megabyte of JavaScript and needs to be responsive in a few tens of milliseconds.

    Clang puts a huge amount of effort into parse speed because a typical [Objective-]C[++] compilation unit may be a few hundred lines of code and a few megabytes of headers. Parsing the declarations in the headers can easily become a bottleneck. Most of the header code (in [Objective-]C) does not generate any code, so the fact that parsing is faster than optimisation / codegen doesn’t help.

    None of these apply to modern well-designed languages, so it’s fair to ignore them.

    On separating lexing and parsing, there’s one big reason to want to combine them: context-dependent keyword. For example, in Objective-C, the tokens atomic or retain are either identifiers or a keywords, depending on where they appears in the program. If you combine the lexer and parser then this kind of thing is trivial to do because the rule that parses the token will be different depending on the current parsing state. You don’t need any special logic for it. If you have a separate lexer and parser, then you either need to either have some special handling in the parse rules to match a identifier-or-keyword-from-this-set in most of the places where you want an identifier, or you need a notion of subtyping in your token stream (so that atomic is a kind of keyword that is a subtype of identifier) and then you can handle it differently depending on your context (treat it as an identifier in most places, expect the keyword in the places where it’s a keyword). This is much easier to do if you have a ‘pull’ model for your lexer rather than a ‘push’ model.

    The difference between a ‘pull’ and a ‘push’ lexer is quite important and is somewhat glossed over. Most textbooks assume a push model: the lexer reads some characters, generates a token, and tells the parser ‘I have this token now!’. In my experience, the ‘pull’ model is now more common and is definitely more common. In this model, the parser calls the lexer to say ‘do you have a token of type X? If so, please consume it and advance the input stream’. This makes handling subtypes trivial: from the above example, if the front of the character stream is atomic then you can call the lexer and say ‘Is the next token an identifier?’ and it will return a keyword with the value atomic but you can also say is the next token the atomic` keyword?’ and it will also return true.

    The point about PEGs and error reporting is spot on. I love PEGs for rapid prototyping languages (to the point of writing a library that provides a C++ eDSL for wrting PEG parsers. PEGs, depending on how you think about them, have unlimited read-ahead or unlimited backtracking. This is fantastic for quickly writing something that parses a grammar, but it means that on parse failure they backtrack to the start of the program and say ‘starting at index 0 in the input stream, there is not a valid parse matching the rule “complete program”’. Modifying PEGs to support good error messages destroys a lot of their value.

    1. 2

      Thank you! Point taken about C-family headers and such, I didn’t think about that.

      I confess that I’d forgotten about the difference between a “pull” and “push” lexer; I automatically think of a lexer as producing a lazy stream of tokens, which automatically gives you something that uses the “pull” model. I’m not sure I’ve seen textbooks use the “push” model, I’ll have to check. I’ve never actually run into languages with context dependent keywords in practice, at least that I have noticed.

      1. 2

        The traditional lex / yacc combination is very much a push model. The parser is driven by being fed a stream of tokens. Most hand-written recursive-descent parsers end up implementing a pull model where they can query the parser for token types with some subtyping relationship (e.g. a valid function name may have more restrictions than a valid identifier and in some contexts you need a function name, in others you just need some kind of identifier).

        Most non-trivial languages have at least a small number of context-dependent token types. Older, successful, languages are more likely to have context-dependent keywords because they forgot to reserve names for keywords early on and so end up needing to steal some values from the identifier space in contexts that explicitly disallow identifiers.

      2. 1

        I too came to the comments section to quibble with the notion that parsing is not slow. Well, good parsers are not slow, but have you seen webpack/babel/metro? Turns out that some lines of business are struggling using parsers that are about 2 orders of magnitude (or more) slower than they need to be, and it’s enough to cause productivity headaches.

        That said, their statement certainly wasn’t meant to apply to everything everywhere.

        1. 2

          One can always write slow programs if one tries hard enough. ;-) I suppose it might be more accurate to say that a good parser for a decently-designed language should not be the main bottleneck of a compiler/interpreter, or at least not the first one that a beginner language-writer hits.

          1. 2

            I think it’s important to differentiate between parsers for languages (which your article was talking about) and parsers for data structures (which @0x2ba22e11 is alluding to). Often the amount of processing that you need to do on a stream of, say, JSON data is very small and so parsing speed can quickly become a bottleneck. In contrast, a compiler typically does a lot of work per expression and so parsing of things that end up generating code is not usually a bottleneck. The exceptions to this are when you have fast-start requirements and need to parse a lot of code before you can start executing (e.g. JavaScript) or where you have a lot of declarations in the source code that do not generate anything for the middle-end of the compiler (e.g. C).

            1. 1

              Yeah, I’m just kvetching. :)

        2. 3

          Regular languages (type 3) > Context-free languages (type 2) > Context-sensitive languages (type 1) > Unrestricted languages (type 0)

          I have always felt that the distinction drawn between context-free and context-sensitive languages was a bit arbitrary. There is a clean separation between regular languages and context-free languages: The former corresponds to automata and the later corresponds to an automata with a stack (a push down automata). To get to context-sensitive, you need a linear bounded automaton which is fairly different from stacks. Add one more stack, and you get unrestricted languages. Similarly, regular grammar is defined usually as

          A → a
            | aB
            | ε
          B → b
            | aB
          

          where A and B are nonterminals and a and b are terminal symbols, and (single) nonterminals are only allowed at the end of the expansions. However, it can also be expressed with arbitrary LHS with any number of terminal and nonterminal symbols, where recursion is disallowed but Kleene star is allowed. This gets you the following grammar

          A → a
            | aB
            | ε
          B → b
            | (ab)*
          

          You can prove that this grammar is equivalent to a regular expression. Now, if you allow recursion to this grammar, you have context-free. Allow arbitrary symbols on the LHS and you have unrestricted. To have context-sensitive grammars, you have to add an ugly rule saying that the grammar is non-contracting. That is, ε is only allowed as an alternative expansion to the start symbol.

          That is, every regular grammar is a context-free grammar. However, not every context-free grammar is a context-sensitive grammar. Finally, the special casing of ε makes context-sensitive grammars unusable for practical purposes (I have not seen an application that would be adequately served by context-sensitive grammars but not by context-free or unrestricted grammars).

          Finally, regular tree grammars are quite interesting structures that sadly seems to be unused in most software engineering tasks. The only difference from context-free grammars is that the RHS is represented as a tree. They have the power of context-free languages, while allowing interesting properties of regular languages such as intersection and complement. Subset and equivalence of regular tree grammars is also decidable unlike context-free grammars.

          1. 4

            However, not every context-free grammar is a context-sensitive grammar.

            I don’t think that is true (see Chomsky’s hierarchy), and I think you can show that by the pumping lemma.

            Can you provide an example of a context-free language (or grammar) which is not a strict subset of the context-sensitive language (or grammar)?

            I think it is possible to see by the class of recognisers (nondeterministic pushdown automata, for context-free languages, versus linear-bounded nondeterministic Turing machine, for context sensitive languages) that context-free languages (and grammars) are a strict subset of context-sensitive languages (or grammars).

            Finally, the special casing of ε makes context-sensitive grammars unusable for practical purposes (I have not seen an application that would be adequately served by context-sensitive grammars but not by context-free or unrestricted grammars).

            It seems that your perspective is reversed: I don’t think one wants to model a problem as context-sensitive to solve it, but there are a lot of daily problems (such as interpretation of nearly all natural languages) that is in that class, and we have to find a way to solve it. (Usually, one simplifies the problem. But that is not enough, and automatic translation is a good example of that.)

            1. 2

              The difference is not in the languages but in the grammar definition; by definition, a context-sensitive grammar has to be monotonic (non contracting). So, the following grammar is not context-sensitive because of ε in the definition of B even though the language it represents is a context-free language (it is finite) and hence a context-sensitive language. Note: I am not saying that the language represented by this grammar can’t be represented by a context sensitive grammar, but that this grammar is not context-sensitive.

              A → a
                | aB
                | ε
              B → b
                | ε
              

              It seems that your perspective is reversed

              I am not an expert on natural language interpretation. However, see the wikipedia entry you linked. As it says, CSGs are far more powerful than what is required for natural languages, and I am not clear what we gain by modeling them as CSGs. Why not interpret them as context-free grammars with extra semantics as we do with programming languages? That is, what property of context-sensitive languages that is not available in unrestricted languages are you using?

              1. 1

                by definition, a context-sensitive grammar has to be monotonic (non contracting).

                That is the flaw: every context-sensitive language can be generated by a noncontracting grammar or a context-sensitive grammar, but they are not the same thing.

                From the Wikipedia’s page:

                A formal language that can be described by a context-sensitive grammar, or, equivalently, by a noncontracting grammar or a linear bounded automaton, is called a context-sensitive language. Some textbooks actually define CSGs as non-contracting, although this is not how Noam Chomsky defined them in 1959. This choice of definition makes no difference in terms of the languages generated (i.e. the two definitions are weakly equivalent), but it does make a difference in terms of what grammars are structurally considered context-sensitive; the latter issue was analyzed by Chomsky in 1963.

                […]

                Kuroda normal form

                Every context-sensitive grammar which does not generate the empty string can be transformed into a weakly equivalent one in Kuroda normal form. “Weakly equivalent” here means that the two grammars generate the same language. The normal form will not in general be context-sensitive, but will be a noncontracting grammar.

                The Kuroda normal form is an actual normal form for non-contracting grammars.

                So it seems that

                context-free grammar ⊂ context-sensitive grammar
                ∧ noncontracting grammar ⊂ context-sensitive grammar
                ∧ context-free grammar \ (context-free grammar ∩ noncontracting gramma)r ≠ ∅
                

                And, thus, the grammar you show is context-sensitive, and that can be seen by the formal definition of a context-sensitive grammar, in which ɣ can be ɛ.

                As it says, CSGs are far more powerful than what is required for natural languages, and I am not clear what we gain by modeling them as CSGs. Why not interpret them as context-free grammars with extra semantics as we do with programming languages?

                Well, context-free is not enough to represent all natural languages, and that is the original work of Chomsky.

                It might be that all the power of context-sensitive grammars are not necessary, but a subset of it is.

                And about your suggestion of “context-free with additional rules”, that is not how it works. First, because enumerating all the “semantic rules” would be a ridiculously laborious work (and this kind of approximation is what current works on automatic translation do), but it doesn’t represent all the classes of constructs of a language: one cannot increase the power of a grammar with “workarounds”.

                In case of programming languages, the grammars (of the languages) are context-sensitive, and the semantic is artificially limited by the designer.

                1. 2

                  Perhaps I am mistaken here; but the definition in wiki is that all the rules are of the form

                  αAβ → αγβ
                  

                  with A ∈ N α,β ∈ (N∪Σ)* and γ ∈ (N∪Σ)+ where G = (N, Σ, P, S), with N a set of nonterminal symbols, Σ a set of terminal symbols, P a set of production rules, and S the start symbol.

                  My reading of this definition is that γ can’t be empty because of +. This seems to be implied by the next sentence “The only difference between this definition of Chomsky and that of unrestricted grammars is that γ can be empty in the unrestricted case.”

                  What is the definition of context-sensitive grammar you are using?

                  1. 2

                    You are right, and I have overlooked the fact that γ ∈ (N∪Σ)+ .

                    However, I decided to give a fresh start on my arguments after reading a little more, and I hope my other (new) comment makes matter clearer.

              2. 2

                …Oh thank Eris, I’m so glad that someone with better theoretical chops than me replied to that post. XD

              3. 1

                I have always felt that the distinction drawn between context-free and context-sensitive languages was a bit arbitrary.

                It is not arbitrary:

                • CFG can generate languages with central auto-recursive item, or { aⁿc*bⁿ | n ≥ 0 };

                • CSG can generate of the type { aⁿcⁿbⁿ | n ≥ 0 }.

                That comes from the pumping lemma.

                The former corresponds to automata and the later corresponds to an automata with a stack (a push down automata). To get to context-sensitive, you need a linear bounded automaton which is fairly different from stacks.

                A pushdown automata is also quite different from a DFA: it has the concept of memory (the stack), while the DFA just needs to know the current state and the current input to find the next state.

                So, going from stack to a limited free memory (i.e., restricted Turing machine) is such a powerful jump as was from DFA to pushdown automata.

                To have context-sensitive grammars, you have to add an ugly rule saying that the grammar is non-contracting. That is, ε is only allowed as an alternative expansion to the start symbol.

                Yes, you are right.

                That is, every regular grammar is a context-free grammar. However, not every context-free grammar is a context-sensitive grammar.

                And here lies the flaw, which took me a while (consult some dead-tree books) to get to the core of the issue.

                Every CFG is contained in the CSG because grammars are measured by the languages they produce, and it is possible to generate any CFL from a CSG. They are not measure by the “shape” of their grammar. And it is here that lies the confusion, I believe.

                The fact that CFG grammars allow rules with ɛ, while CSG doesn’t (except for the initial rule), it doesn’t mean that CSG are not capable to generate any language generated by CFG.

                What you are telling is that the language L₂ representing all the CSG grammars is not a strict superset of the language L₁ representing all the CFG grammars.

                The problem you are adressing is representation of the grammars, which are languages, and not the grammars itself.

                Finally, the special casing of ε makes context-sensitive grammars unusable for practical purposes (I have not seen an application that would be adequately served by context-sensitive grammars but not by context-free or unrestricted grammars).

                It is still not clear what you meant with it, but in another comment I did earlier, I said about natural languages, to which you wrote

                Why not interpret them as context-free grammars with extra semantics as we do with programming languages?

                If you are parsing a context-free language, then you have to consult “an oracle” (i.e., a rule in memory outside the stack of your pushdown automata), then you get a restricted Turing machine, which is the recogniser of a CSL.

                That is, what property of context-sensitive languages that is not available in unrestricted languages are you using?

                Finitude: a Turing machine (the acceptor of unrestricted languages) implies an infinite tape.

                And decidability of the halting problem.

                1. 2

                  It is not arbitrary: … That comes from the pumping lemma.

                  Again, I am not disputing the language hierarchy, rather I am making a subjective point. For me the importance attributed to context-sensitive languages seems arbitrary (seems to be based entirely on being invented by Chomsky as part of the original hierarchy.). There are a number of containments in between context-free and unrestricted languages; for example, you can have indexed languages which are recognized by nested stack automata or by indexed grammars, a generalization of CFGs. They are a proper subset of CSLs.

                  A pushdown automata is also quite different from a DFA

                  My point is that a PDA is DFA + (XXX feature), while it is hard to describe LBA as DFA + (XXX feature).

                  Why is this ability to represent it as PDA + XXX or PDA - YYY important for me? The reason is as follows:

                  There is a pleasing pattern in how we define DFAs. If you look at a graph, where each node represents a state and each edge represents consumption of a character, a directed acyclic graph is equivalent to table lookup, and represents the combinational logic. Next, if you allow cycles in the graph, then you get a finite automata representing regular languages. Then, if allow graphs to be named and reused, you get recursive transition networks representing context-free languages. Add registers to RTNs and you get augmented transition networks which are Turing equivalent. This way of representing languages as a series of enhancement over a graph structure holds intuitive appeal for me. It allows me to reason about the language hierarchy by looking at the machinery required.

                  That is, every regular grammar is a context-free grammar. However, not every context-free grammar is a context-sensitive grammar.

                  And here lies the flaw, which took me a while (consult some dead-tree books) to get to the core of the issue.

                  No, I am very specific here. I am not disputing CSLs are superset of CFLs. I am simply saying that not all CFGs are CSGs based on the structure.

                  The fact that CFG grammars allow rules with ɛ, while CSG doesn’t (except for the initial rule), it doesn’t mean that CSG are not capable to generate any language generated by CFG.

                  Again, this is not what I am saying. I have no dispute with the language hierarchy and the hierarchy of languages representable by the different grammar formalisms. Indeed, any language that is representable by a CFG is also representable by some CSG. However, the particular CFG need not be a CSG. If this is the core of disagreement, perhaps this should clear it up.

                  The problem you are addressing is representation of the grammars, which are languages, and not the grammars itself.

                  No, I am addressing the grammars, not the languages; That is, my point is about the machinery used. Not about what it accomplishes.


                  I have only worked in writing parsers and fuzzers for programming languages. In parsing, the standard technique to parse is to use a lexer (a regular expression, that is an automata) to recognize tokens and use a context-free grammar to specify the parser. Finally, use separate semantic rules to specify what string is valid and what is invalid. The reverse for a fuzzer. I have not yet seen any one use context-sensitive grammars for this. Perhaps it is different in your field and CSGs are used. If so, I would be glad to hear about it.

                  1. 1

                    For posterity: I have found that context-sensitive languages are languages of a one way stack automation. These are PDAs with the additional read-only ability to inspect the stack.

                    “Sets Accepted by One-Way Stack Automata Are Context Sensitive” Hopcroft and Ullman 1968.

              4. 2

                a language/grammar is “context free” this means that in any state you can look at a fixed-sized chunk of the input without backtracking and always figure out what state to switch to next.

                I don’t think this is entirely correct: CFG are parsed by non-deterministic push down automata, they fundamentally need backtracking. If you could always figure out the next state, linear time will be enough.

                My guess (sorry if it’s wrong!) is that here you try to rationalize the “context free” terminology. I think the ”context free” terminology is just bad and confusing (got this idea from okhotin).

                My understanding is that “context-free” comes from “restricted context-sensitive”. In CSG, we have a sequence on the left hand side, aBc -> DE, and it’s the context of the left hand side we are talking about. Then, the rule like B -> DE can be called context-free. However, this doesn’t make any sense if you don’t give a definition of CSG first (which you shouldn’t, as that’s a pretty useless class)!

                The problem is that intuitively, CFG are exactly about capturing “context”. When parsing expression, + is addition operator, but when parsing trait with bounds, it is a separator. I wish we called CFGs just grammars.

                On the topic of time, parsing CFG is better than O(N^3). There’s adorably named “four Russians method” for O(N^3 / log(N)), and I think the best we can is “as fast as we can multiply two matrices together”, which is like O(N^2.37287). The core idea is to stare hard at the O(N^3) algorithm and realize that it is matrix multiplication.

                You can note which rule each function implements in a comment to make it convenient to look back at later, and all in all I’ve found it very easy to look at a rule and a function and verify that they do the Right Thing.

                I’ve found it even more useful to just write tests in comments in front of functions and various conditions inside: https://github.com/rust-analyzer/rust-analyzer/blob/ee4505f396249234bba05bfdbd14ae2a9d813004/crates/parser/src/grammar/type_params.rs#L151

                1. 1

                  Thanks! I’ll work these corrections into the post, probably once it drops off the front page.

                2. 1

                  ghc – Uses the Happy parser generator (LALR/GLR?)

                  GHC uses Happy with an LALR grammar. The Happy parser generator itself has some additional support/options for GLR, although I’m not sure how much maintenance and use that codepath has gotten over the years.

                  1. 1

                    Common values of k [lookahead in LL(k)] are 0 or 1, but you can make just about anything work fine as long as there’s a fixed upper bound to it. Nobody seems to outright say it, but if your value for k is more than 2 you’re probably making life complicated for yourself somehow.

                    ISTR Python (at least in the 2.2/2.4 era) intentionally restricting itself to an LL(1) grammar specifically to keep syntactic complexity down. I haven’t followed the language for years, so maybe it’s relaxed since then.

                    1. 2

                      At some point (2.7/early 3?), some features were added that weren’t really doable in LL(1) rules, and they relied on some hacks to get them to work. Since 3.9, they’ve switched to using a PEG parser.

                      Edit: The relevant PEP mentions examples of these constructs: https://www.python.org/dev/peps/pep-0617/#some-rules-are-not-actually-ll-1

                      1. 1

                        Thanks for the update.

                    2. 1

                      I only write recursive-descent Unicode-token parsers. They’re basically the most trivial parser possible.

                      You want to get an if statement out of an input text? Write a function that takes an old parser state and returns either a new parser state and an if statement, or an error type.

                      That is it. That is all. Add a bit of helper functions for number parsing, string parsing, etc, and you can write a parser very easily even in poor languages like C.

                      1. 1

                        I hang out in a couple programming language development chat channels

                        Can you share these channels if they are public and accept new people?

                        1. 2

                          Probably they mean #lang-dev on the Rust discord (https://discord.gg/rust-lang-community) and the Programming Languages discord, which I unfortunately no longer have a link to.

                          1. 2

                            Correct on both counts. The latter is https://discord.gg/yqWzmkV ; it’s actually an offshoot of Reddit r/programminglanguages , which I’d forgotten. The Rust discord is more fun to actually talk to people in, but the Programming Languages discord occasionally has outright deluges of interesting and/or useful information.