1. 25
    1. 14

      See also: Why Lexing and Parsing should be separate (from the oil shell wiki)

      notably,

      separating lexing and parsing reduces the n in your O(n^3). In any real program, there are many more characters than tokens

      1. 4

        That wiki page is a great summary.

        My main intuition is: lexing can be done in a fixed amount of memory with a finite-state machine (or equivalent regular language); parsing requires an arbitrary amount of memory (i.e. it needs the stack) to recognize recursive/nested structures.
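
        A minimal sketch of that intuition (hypothetical helper functions, not from any particular implementation): recognizing a token needs only the current state, while recognizing nesting needs unbounded memory (here, the call stack).

        #include <stddef.h>  // NULL

        // Lexing a number: a DFA loop, fixed memory.
        static const char *lex_number(const char *p) {
            while (*p >= '0' && *p <= '9') p++;
            return p;
        }

        // Parsing P -> '(' P ')' | empty: nesting depth is unbounded,
        // so we need a stack (here, the recursion).
        static const char *parse_parens(const char *p) {
            if (*p != '(') return p;            // base case: no nesting
            p = parse_parens(p + 1);            // recurse into the nested group
            if (p == NULL || *p != ')') return NULL;
            return p + 1;                       // consume the matching ')'
        }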

        1. 4

          (or equivalent regular language)

          That’s not true for many practical languages: things like nested comments or Rust-style raw strings are not regular. Conversely, lex-style lexer generators allow for custom state manipulation.

            So, while to a first approximation it is true that lexing is regular and parsing is context-free, the reality is quite a bit murkier (e.g., Rust lexing isn’t even context-free).

          1. 5

            This is why I say

            • lexing is non-recursive
            • parsing is recursive

            The regular / context-free thing that’s taught is actually the wrong distinction. Oil’s lexer maintains a stack of “lexer hints”, so I guess it’s a pushdown automaton (which is non-recursive).

            The lexer USES regular languages, but the lexer itself isn’t entirely a regular language. It has both lexer modes and lexer hints.

            (I don’t think I wrote about the hints that much, but it’s for figuring out what a right paren ) means, which is a tricky problem in shell, e.g. due to weird unbalanced ALGOL-style case syntax, nested with command sub, etc.)


            Likewise with Rust, you can match raw strings non-recursively, but it’s not a regular language.
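
            For example, something along these lines (a hypothetical sketch, not the real Rust lexer) matches r##"..."## with one counter and no recursion, even though the language of raw strings isn’t regular:

            #include <stddef.h>  // NULL

            // Count the opening '#'s, then scan for '"' followed by that many '#'s.
            // Constant code, one integer of extra state: non-recursive, but not regular.
            static const char *lex_raw_string(const char *p) {
                if (*p++ != 'r') return NULL;
                int hashes = 0;
                while (*p == '#') { hashes++; p++; }
                if (*p++ != '"') return NULL;
                for (; *p; p++) {
                    if (*p != '"') continue;
                    int i = 1;                          // check for '"' followed by N '#'s
                    while (i <= hashes && p[i] == '#') i++;
                    if (i == hashes + 1) return p + i;  // past the closing delimiter
                }
                return NULL;                            // unterminated
            }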

            1. 1

              A pushdown automaton is enough to parse a context free language. Lexer modes are equivalent to selecting between different lexers at run time.

              What is really fun is string interpolation, especially when the complete expression grammar is available inside an interpolation, including interpolated strings. This looks like arbitrary recursion, but I think the way it is usually implemented is that "string${ is treated as a combination of the first part of the string literal and an opening bracket, and }string" is the corresponding close bracket and the remainder of the string. (Mutatis mutandis for multiple interpolations.)
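
              A hypothetical sketch of that trick (token names invented): the string lexer emits bracket-like tokens, and the parser treats them as brackets around an ordinary expression.

              // "abc"     ->  STR_PLAIN
              // "abc${    ->  STR_BEGIN  (acts like an open bracket)
              // }abc${    ->  STR_MID    (closes one interpolation, opens the next)
              // }abc"     ->  STR_END    (acts like a close bracket)
              enum strtok { STR_PLAIN, STR_BEGIN, STR_MID, STR_END };

              // Called just after an opening '"', or after the '}' that ends an
              // interpolation; scans the literal text up to "${" or the closing '"'.
              static enum strtok lex_string_part(const char **pp, int after_interp) {
                  const char *p = *pp;
                  while (*p && *p != '"') {
                      if (p[0] == '$' && p[1] == '{') {       // interpolation starts here
                          *pp = p + 2;
                          return after_interp ? STR_MID : STR_BEGIN;
                      }
                      p++;                                    // (escapes omitted)
                  }
                  *pp = *p ? p + 1 : p;                       // consume the closing '"'
                  return after_interp ? STR_END : STR_PLAIN;
              }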

              1. 2

                A pushdown automaton is enough to parse a context free language.

                I’ve written an entire series on Earley parsing, and I highly doubt that. I mean if that were true why would we need GLR and Earley in the first place?

                I’m not entirely sure but as far as I know a pushdown automaton (finite state machine + stack) can only parse LL grammars. I’m pretty sure it cannot parse all LR grammars, let alone all non-ambiguous context free grammars, let alone all context-free grammars.

                1. 1

                  Oops, yes, I missed out the magic word “deterministic” before “context-free”, which in grammar terms roughly means lacking ambiguity.

                  1. 1

                    Err, not quite, actually:

                    According to Wikipedia, “Every context-free grammar can be transformed into an equivalent nondeterministic pushdown automaton”, so in a sense you were exactly right: pushdown automata can parse all CFGs, even the ambiguous ones.

                    However if we restrict ourselves to the deterministic ones, still according to Wikipedia, “there are non-deterministic unambiguous CFLs”, so some unambiguous grammars still cannot be parsed by a deterministic pushdown automaton.

                    Still, I stand corrected: deterministic pushdown automatons don’t seem to be limited to LL grammars.
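
                    For reference, the hierarchy (as I understand it) is:

                        LL(k) languages ⊊ deterministic CFLs (= LR(1) languages) ⊊ unambiguous CFLs ⊊ all CFLs

                    with DPDAs recognizing exactly the deterministic CFLs and NPDAs exactly the CFLs.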

                    1. 1

                      Indeed, my weasel word “roughly” meant to say DCFGs are a subset of unambiguous CFGs.

                      I wonder if the kind of tricks that regex engines use to get the speed of DFAs with the compactness of NFAs might be useful for implementing NPDAs - otherwise I doubt they would be an attractive implementation technique in the way deterministic pushdown automata are. Tho my very vague handwavy understanding of GLR suggests it is a bottom-up analogue of an NPDA.

                      1. 2

                        An LR(k) parser can be implemented using a DPDA. When there are ambiguities, the algorithm naturally extends to an NPDA.

                        A GLR implementation is essentially an efficient interpretation of this NPDA, somewhat analogous to Thompson’s trick to recognize a regexp in linear time on an NFA (using a set of integers to represent all the active states, and advancing them in lockstep).

                        A naive NPDA traversal would require maintaining a list of stacks (one per “thread”). Instead you maintain a DAG that represents all the stacks by sharing common chunks, so that as much work as possible can be shared between the different analyses.
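
                        To make the NFA analogy concrete, here is a hypothetical toy version of the lockstep trick (hard-coded NFA; the GLR case adds the shared-stack DAG described above):

                        #include <stdbool.h>
                        #include <string.h>

                        // NFA for "strings over {a,b} ending in ab":
                        //   state 0 loops on a/b, 0 -a-> 1, 1 -b-> 2 (accepting).
                        // Keep the SET of live states and advance them all per character:
                        // linear time, no backtracking, no per-"thread" bookkeeping.
                        static bool ends_in_ab(const char *s) {
                            bool live[3] = { true, false, false };    // start in state 0
                            for (; *s; s++) {
                                bool next[3] = { false, false, false };
                                if (live[0]) {
                                    next[0] = true;                   // 0 loops on any input
                                    if (*s == 'a') next[1] = true;    // ...and may start the final "ab"
                                }
                                if (live[1] && *s == 'b') next[2] = true;
                                memcpy(live, next, sizeof live);
                            }
                            return live[2];                           // accept if state 2 is live
                        }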

      2. 2

        Also, most syntax highlighters are based on lexing. So you’ll want a distinct lexing grammar, anyways.

      3. 1

        Though if your parser isn’t at least in the ballpark of O(n), you have way bigger problems than the size of the constant.

    2. 10

      The only reason to separate lexing and parsing into two distinct phases is that it’s easier to write. Otherwise, doing the lexing “on-demand” is always better, as it allows the parser to provide context, or even switch entirely between lexers as necessary, which also solves the problem of composition that the article mentions.

      (still, it’s usually a good idea to use different DSLs to specify terminals vs rules, and different mechanism to extract them from the text)

      1. 2

        The only real reason to split it into two phases is because it performs better that way. Parsing is slow, lexing is fast. Do as much work in your lexer and only involve the parser when you need it.

        1. 2

          Parsing is not slow. Parsing (at least with all the parsers I’ve worked with) is nearly instantaneous O(1) stuff, and is stunningly fast even for large code bases.

          I’m not claiming that it’s impossible to build a slow parser, but that would seem to require some effort. (Edit: Like using regex or something.)

          1. 3

            Relative to lexing, it’s slow

            It’s not a bottleneck in many compilers, but certain workloads definitely expose it. For example:

            • C/C++ – I remember an early Lattner talk on Clang – a lot of it was about parsing speed relative to GCC
            • JavaScript and v8 – parsing is extremely optimized, because it adds latency to web pages.
            1. 3

              I just find it funny when people talk about lexing and parsing. They are like 1% of writing any compiler of any value, yet they get 99% of the research papers and blog posts and articles written about them. They represent less than 1% of the wall clock run time of a typical compiler. It would be like if all the press covered the original launch of Doom by examining only the icon that was installed on the Windows desktop, and never actually launching the game. It’s skin deep, at most.

              I understand that in 1970, lexing and parsing may have been a significant part of the cost of compilation. But back then, adding two numbers was a good excuse to run to lunch while the calculation was in progress. And compiles were done overnight.

              Nonetheless, these are fascinating subjects, and I still love working on lexers and parsers. And optimizing them. And reading papers about them. So obviously I’m a flaming hypocrite.

              (You can safely ignore this message. I’m tired.)

              Also, this is slightly out of date (the bnf has been updated a few times since this was last updated), but this is the Ecstasy lexer, written in Ecstasy: https://github.com/xtclang/xvm/blob/master/lib_ecstasy/src/main/x/ecstasy/lang/src/Lexer.x

              1. 5

                they get 99% of the research papers

                We must be reading different papers.

                1. 3

                  99% of the teaching material people actually read, maybe?

                2. 2

                  I’ve been known to exaggerate 473% of the time.

              2. 4

                Compilers aren’t the only things that process source code! The bar has been raised in the last 10 years, which is probably why people talk about parsers more.

                clang-format is huge, and now every language needs something like it. Tools like ASAN raised the bar and rely on the same location info. Debuggers use location info, etc.

                JS processors like WebPack are absolutely bottlenecked on parsing (hence esbuild, etc.).

                Also, languages have grown more elaborate syntax over the years.


                Does Ecstasy have an auto-formatter, linter, and debugger support with accurate location info? What about a language server?

                If not, people probably won’t use it. That functionality depends on a well-architected parser. It’s also a lot of work.

                Most real languages also have at least 2 parsers. It would be nice if you could just write 1, but that seems to be an open problem.


                (On the other hand, I kind of agree that it’s “obvious” why lexers and parsers should be separate, for essentially any language, and maybe weird to have a huge discussion about it. There are a bazillion production codebases out there that show you that. But there does seem to be some genuine confusion – there were production codebases on my wiki page that did otherwise, and then they split it up, and it was faster. I wrote the page because it wasn’t theoretical.)

                1. 1

                  Does Ecstasy have an auto-formatter, linter, and debugger support with accurate location info? What about a language server?

                  Debugger is just a prototype (not supporting existing debuggers via dwarf or something, just our own debugger). But yes, accurate location info, type and variable info, break points, frames, evals, whatever, it’s in there.

                  We had a prototype of an IntelliJ IDEA plug-in, but gave up on that route; instead, we have a language server project now under way (not yet ready). But I’ve built custom language editors in the past with syntax highlighting etc. (previous life, Windows 4gl stuff; don’t ask, won’t tell).

                  On the other hand, I kind of agree that it’s “obvious” why lexers and parsers should be separate, for essentially any language, and maybe weird to have a huge discussion about it.

                  Ironically, this morning I learned a bunch about parser combinators from Jamie Willis. So now I have an entire new view on the possibility of not splitting a lexer out separately 🤣

        2. 1

          Actually, that’s not the only reason. Most parsers are constrained by the requirement to be context-free, mainly for performance reasons. So the fact that most lexers use regular expressions is very convenient, because those are capable of several things that parsers can’t do, such as backreferences (\1 etc.) and arbitrary lookahead/lookbehind.

          1. 3

            Backreferences make regexes non-regular, so the usual algorithms for reducing a lexer to a DFA do not work. I can’t remember off the top of my head if the same is true for lookbehind and lookahead. (I guess lookbehind is analogous to a backref; lookahead seems to be benign enough.) For example, flex and re2c have lookahead, but not lookbehind nor backrefs.

      2. 1

        The only reason to separate lexing and parsing into two distinct phases is that it’s easier to write. Otherwise, doing the lexing “on-demand” is always better

        Wait a minute, the two aren’t mutually exclusive. Especially if what matters to you is the ease of writing: typically lexer.next_token() would be defined separately from your parser. Heck, it’s this very separation that allows your parser to switch lexers without causing your source code to be a complete mess: one file per lexer, another file for the parser, all neatly separated.

        It’s not a runtime separation of phases (lexing and parsing would actually be interleaved), but that hardly hurts source code clarity.

        1. 3

          No argument here. When I said two distinct phases I meant doing all lexing first, and then doing parsing.

          I agree that they can and should live in different components.

          Sorry if I was unclear in my phrasing.

    3. 7

      This intro sentence is odd:

      Since some parsing approaches such as recursive descent parsing unify these phases, why do lex and yacc split them apart?

      I disagree that recursive descent parsing unifies lexing and parsing. At best it’s agnostic – your tokens could be bytes, but every time I’ve seen it explained, the tokens are output from a separate lexer.

      e.g. wikipedia: https://en.wikipedia.org/wiki/Recursive_descent_parser (typedef enum ident, etc.)

      or any CS class I’ve seen: https://web.stanford.edu/class/cs143/lectures/lecture06.pdf

      I think it’s very reasonable to define lexing as non-recursive and parsing as recursive (since the line between them isn’t always obvious), so using recursive-descent for lexing doesn’t make that much sense.


      At this point the question should be: Why unify lexing and parsing? IMO if anyone wants to do that, the “burden of proof” is on them to explain the design decision (to the extent that you’re obligated to explain any code you write :) ).

      Someone else linked this wiki page, but just to repeat - https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-Should-Be-Separate

      I think this article is in agreement with everything there, but the intro/motivation seems odd to me. (I also think it can be explained a lot more clearly WITHOUT lex and yacc, as is done in the article.)

    4. 3

      I have tried using lpeg (the Lua library for parsing expression grammars) for a complicated parsing task. The main thing that caused me trouble was how to handle whitespace in a way that is systematic without polluting the higher-level grammar, eg. space term space addop space term. I made much better progress with a traditional lexer/parser split. (The complicated task was a revamp of unifdef which has to deal with C’s early translation phases, which are lexically very tangled.)

      Also, wrt the discussion of scannerless parsers, PEGs have a way to express reject rules, and their ordered choice operator is a bit more systematic than lex’s disambiguation rule. On the other hand I don’t know if there are good linters for PEGs like there are for classic context free grammars.

      1. 2

        Yes, I get tripped up on PEGs for the same reason: what are distinct phases in my mind get complected together, and I have to remember that. It is really handy to have all the parsing rules in one place, but it is more complicated. Really depends on the task at hand.

        I think the traditional way to handle what you’re talking about is to settle on a convention of consuming excess whitespace either immediately before or after reading in a lexical entity.

      2. 1

        PEG libraries that I’ve used have an explicit token wrapper that means ‘match this with no whitespace’ and then they permit whitespace (or whitespace and things that match the comment rule, if you’re discarding comments) in any other matches. Does lpeg not have some equivalent?

        1. 1

          It isn’t that hard to write your own :-) TBH a large part of my problem was to do with the complexity of whitespace in the C preprocessor, and the requirement to keep it intact as much as possible. The fenceposts start to matter a lot, such as which tokens own leading or trailing space (and comments) and how space attaches to brackets.

    5. 3

      I’m writing a lexer at the moment for TLA+. Unfortunately the lexing grammar is non-regular; the lexer behaves differently depending on whether you are inside a module, or inside a block comment section, or whatever. It has to keep track of nestable start/end tokens of various types, meaning it’s context-free. So it kind of looks like I’m writing very simple parsing code mixed in with the lexer. Currently the design has the actual parser coming in later and creating a full parse tree within each context section, but it did get me thinking what it would look like to fully combine the lexer & parser. Tree-sitter does this; it’s an LR(1) parser and the lexer is informed of the set of tokens that the parser expects to see next as it scans text, and the lexer can change its behavior based on that info. It’s interesting, it leads to a design where all the very difficult context-sensitive or unrestricted parsing actually happens at the lexer level.

      1. 3

        Unfortunately the lexing grammar is non-regular; the lexer behaves differently depending on whether you are inside a module, or inside a block comment section, or whatever.

        You need several lexers, one per context. Those will be regular. Then your parser can just call the relevant lexer for their terminal symbols, depending on what you are parsing. Hopefully your parsing tool can do that. Otherwise, hand written recursive descent will be able to.

        Note that you cannot just lex a whole stream of tokens, and then parse them. You need to lex on demand, with something like a next_token() function. And you need your lexers to play nice with each other (when one reads a token the other must know where to resume exactly, which might be a problem depending on how you buffer data).
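
        Something like this hypothetical sketch (names invented), where both lexers share a single cursor so each resumes exactly where the other stopped:

        #include <ctype.h>
        #include <stddef.h>

        enum ctx  { CTX_CODE, CTX_COMMENT };
        enum kind { TOK_WORD, TOK_PUNCT, TOK_COMMENT_TEXT, TOK_EOF };
        struct token { enum kind kind; const char *start; size_t len; };
        struct lexer { const char *pos; };                   // the single shared cursor

        static struct token lex_code(struct lexer *lx) {     // regular lexer for one context
            while (isspace((unsigned char)*lx->pos)) lx->pos++;
            const char *s = lx->pos;
            if (*s == '\0') return (struct token){ TOK_EOF, s, 0 };
            if (isalpha((unsigned char)*s)) {
                while (isalnum((unsigned char)*lx->pos)) lx->pos++;
                return (struct token){ TOK_WORD, s, (size_t)(lx->pos - s) };
            }
            lx->pos++;
            return (struct token){ TOK_PUNCT, s, 1 };
        }

        static struct token lex_comment(struct lexer *lx) {  // regular lexer for another context
            const char *s = lx->pos;                         // raw text up to end of line
            while (*lx->pos && *lx->pos != '\n') lx->pos++;
            return (struct token){ TOK_COMMENT_TEXT, s, (size_t)(lx->pos - s) };
        }

        // On-demand entry point: the parser supplies the current context.
        static struct token next_token(struct lexer *lx, enum ctx ctx) {
            return ctx == CTX_COMMENT ? lex_comment(lx) : lex_code(lx);
        }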

    6. 3

      How do people deal with comments when combining lexing and parsing? An upfront pass to strip comments?

      1. 4

        While the other replies cover the standard approach (treating it as whitespace), that’s not always what you want, especially if you’re building a “blessed” parser for a language since a lot of use cases want to understand your comments (for example formatters which need to preserve comments and LSPs which might use documentation comments).

        You can check what rust-analyzer and black do for references.

      2. 2

        In my parser, every lexeme stores its position in the input text. (This can be done as a separate array.) Then a comment can just be a single COMMENT token; to get the content of the comment (if you need it for some reason, like doc generation), just grab the position and look up the original text.

        (Though most parse rules treat it as equivalent to whitespace, and ignore it.)
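
        A hypothetical sketch of that layout (parallel arrays, names invented): the COMMENT token carries no text of its own; you slice the original source with its offset/length only when you actually need the content.

        #include <stddef.h>
        #include <stdio.h>

        enum kind { TOK_COMMENT, TOK_IDENT /* , ... */ };

        struct tokens {
            enum kind *kind;          // kind[i] of token i
            size_t    *off, *len;     // position of token i in the source text
            size_t     count;
        };

        // e.g. for doc generation:
        static void print_comments(const struct tokens *t, const char *source) {
            for (size_t i = 0; i < t->count; i++)
                if (t->kind[i] == TOK_COMMENT)
                    printf("%.*s\n", (int)t->len[i], source + t->off[i]);
        }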

      3. 2

        Recognize them as a kind of whitespace, and let them get handled by the rule (which might be explicit or implicit) that ignores whitespace between significant tokens.

      4. 1

        You treat it as whitespace; for example given functions like this:

        static void eatspaces() { // consume whitespace and comments from input stream
            for (;; nextchr()) {
                int c = peekchr();
                if (c == '#') { // comment
                    while ((c = peekchr()) != '\n' && c != EOF) nextchr();
                    continue;
                } // you could match multiline comments, even nested ones, too
                if (!aisspace(peekchr())) {
                    break;
                }
            }
        }
        
        static bool match_nospc(int c) {
            eatspaces();
            return match(c);
        }
        
        static void expect_nospc(int c) {
            eatspaces();
            expect(c);
        }
        

        and the parser:

        static Expr primary() {
            eatspaces();
            int c = nextchr();
            if (aisdigit(c)) { // number literal
                ...
            } else if (c == '(') { // '(' EXPR ')'
                Expr ex = expression();
                expect_nospc(')');
                return ex;
            } else ...
        }
        
        static Expr unary() {
            if (match_nospc('-')) { // '-' UNARY
                return makeunaryexpr('-', unary());
            } else if ... {
                ...
            } else {
                return primary();
            }
        }
        

        I wouldn’t recommend doing things this way (unified lexer and parser), it can get pretty messy.

    7. 2

      I also read somewhere that it’s fine to add as many IRs in your own compiler as you want, as long as they are helping you. The point was mainly made as an answer to: does using more IRs in a compiler slow it down?

      1. 4

        It doesn’t have to slow it down, but it can slow it down. Each transformation from one IR to the next causes overhead. But that new IR may make the analysis/transformation you want to do on it easier or more efficient. It’s a tradeoff. Some compiler tools, such as nanopass, encourage you to do lots of IR transformations, but I haven’t used them in practice.

    8. 1

      benefits of doing this:

      • logical clarity of defining tokenization independently from high-level parsing, just like we define semantics in terms of parse trees and not text.
      • parsing can operate on tokens rather than characters, which means you need 1 token of lookahead for certain types of parser instead of N bytes of lookahead, where N is the length of the longest token in your vocabulary.