1. 38
  1.  

  2. 13

    In practice almost all languages end up forcing the lexer peanut butter to get mixed into the parser’s chocolate: just off the top of my head, C, C++, Ruby, every shell language I’ve seen, Ada, Perl, Rust, Lisp, Haskell, …

    The idea that that boundary is “right” always struck me as something of a siren’s song in compiler implementation: it feels like there should be a clear, hard boundary between the lexer and parser, but in practice what people want out of languages is more contextual than that allows, so it’s just a set of rocks waiting to ruin your day.

    1. 6

      One of the worst decisions that I think lex and yacc made was to make the lexer push the parser, rather than the parser pull from the lexer. This means that you can’t have any kind of context-specific parsing, which turns out to be something that’s incredibly valuable for a language (particularly when you want to extend the language later: being able to claim things from the identifier space, but only in a new context, avoids backwards compatibility problems).

      In the pull model, the parser says ‘I am in this rule, is the next token one of these?’ and the lexer replies ‘yes, here it is’, or ‘no’. The parser can then be structured to say ‘is the next token this context-dependent keyword?’ and if the lexer says ‘no’, then ask ‘is it an identifier?’. This means that you don’t need to have stupid parser rules that say ‘I accept an identifier or any one of these keywords that happens to look like an identifier’ or move lexing logic into the parser (to say ‘I accept any identifier, but if it isn’t one of the ones that is really a keyword then we’ll abort during parsing’).

      If one token class is a subset of another (e.g. a keyword is also a valid identifier) then you can even cache the result of the first call to the lexer so that the second one is fast. I’ve never actually needed to do this because the number of ambiguous rules tends to be fairly small and the cost of re-lexing a few tokens is low.

      1. 7

        every shell language I’ve seen

        Not Oil’s front end, which parses both OSH/bash and the Oil language!

        I agree the the line between the lexer and parser is non-trivial. And if you’re parsing C you can’t avoid “the lexer hack”, which goes by many names.

        But I draw a pretty bright line in Oil’s front end – lexing is non-recursive and parsing is recursive – and it has held up for years (which wasn’t obvious at the outset).

        https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-Should-Be-Separate

        I wrote about the architecture and “lexer modes” technique here:

        Without lexer modes, it probably would have been impossible to maintain the line. Because shell is composed of mutually recursive sublanguages. (But so is almost every complex language if you look closely enough, e.g. C and C++, and I think Rust)

        Rough summary is:

        • The parser tells the lexer what mode to lex in. Besides the input lines, that is the ONLY input to the lexer.
        • The lexer returns a stream of tokens.
        • Sometimes the whole parser must be reinvoked dynamically, but these invocations maintain the same strict separation between lexing and parsing.

        Another technique I use which I didn’t write about is what I call “pre-structuring” in the lexer. It might lead to a different solution for the Rust issue described:

        This is problematic for nested tuple struct. They need syntax like foo.1.2, but to the lexer this string looks like three tokens: foo, ., 1.2. That is, 1.2 is a floating point number, 6/5. So, historically one had to write this expression as foo.1 .2, with a meaningful whitespace.

        Today, this is hacked in the parser, which takes the 1.2 token from the lexer, inspects its text and further breaks it up into 1, . and 2 tokens.

        Basically if you have this kind of ambiguity, you create a mode, and design the tokens so it doesn’t prefer either one. (And shell has a lot of this.) You maintain the property that lexing is non-recursive and reads over each input byte once, and then the parser can do more logic with those tokens (which might not be recursive).

        It’s a slight difference, but I think one advantage is that it’s easier to maintain location info. I like maintaining a hard correspondence between lexing <-> error locations for error messages. When you start re-lexing and re-parsing tokens then you also have the issue of keeping track of error locations. If you think the tokens are short this may not matter, but in shell it matters in several places.

        1. 4

          So to a certain extent what you’ve written about with “lexer modes” is still in the realm of what I meant by “lexer peanut butter to get mixed into the parser’s chocolate” (the modes essentially are an opaque communication of parser state into the lexer that let the lexer vary operation in response to said state) — not that I think that’s a bad thing.

          The “siren’s song” in compiler design is really the idea that your lexer can just be this ignorant, firewall’d module with zero knowledge of what’s going on in the parser (the Token Read() approach) — design things like that, and inevitably you end up hacking around the context sensitive parts in all kinds of messy ways.

          But the “lexer mode” approach you mentioned is really a more clear-headed approach — acknowledge upfront that you can never hope to get away from needing some kind of knowledge about where the parser’s state is at in the lexer, and you can build a flexible API abstraction (modes, in your case) around that expectation which lands you in a cleaner place than if you’d started off following the siren’s song and ended up having to hack around the “clean” separation.

          1. 4

            Agreed, the “textbook compiler” divide isn’t sufficient for “real languages”, but I think that’s just a reminder to not take what you read in books too seriously!

            It’s more important to understand why they chose to do things that way, and then come up with a solution for your problem that respects the same principles.

            As opposed to copying the solution and then hacking around it when it doesn’t match the problem, which leads to a mess. (This seems to be an instance of the “bad abstraction handed down from [authority]” problem, e.g. which I see in some applications of Kubernetes)

            In the lexing/parsing case, having SOME separation really does make it easier to recognize languages. Some people go too far in the other direction, e.g. I wrote a bit about scannerless parsing and why I think it’s bad idea for most problems.

          2. 1

            A central example may be the string literal: Because string literals and command substitutions can contain each other in infinite recursion …

            "$("$("$(…)")")"
            

            … a string literal is itself a program of arbitrary complexity. So: How does one lex a string literal, and is there even a point?

            Most languages have different syntax inside and outside string literals (such as the meaning of whitespace) – same with shellscript. So the lexer needs to be aware of whether its current position is inside a string literal or not. That’s two different modes, or states, if you will.

            My idea for Shellharden was a recursive state machine. You could say I took lexer modes to the extreme:

            1. If the lexer is state aware, it can output state transitions.
            2. For a recursive language, that ought to be push and pop state transitions.
            3. If you then make the lexer aware of every state, the lexer output becomes a complete trace of the AST.

            I think the result is beautifully simple: The parser degenerates to a state machine. Although I can’t say it has a separation between lexing and parsing – it doesn’t have that kind of token stream – the separation still exists, between the lexer and the state machine. I’m able to do this with no backtracking, albeit some forward looking (as I think any lexer would). Incidentally, since Shellharden is just a filter program, it doesn’t even need to scoop up the AST, but that’s what it would do if it was e.g. an optimizing compiler.

            1. 2

              Ah OK interesting. That sounds similar but I’m not sure how it would interact with the rest of the language. I don’t use an explicit state machines… I just change the mode and reinvoke the parser!

              The key point is that the lexer always marches forward, and multiple parsers read from it “concurrently”. You don’t re-lex text you’ve already seen.


              I addressed this issue in two posts:

              When Are Lexer Modes Useful?

              String Literals With Recursive Structure

              This is the most compelling use case for lexer modes because the language inside strings is not only recursive, but it’s mutually recursive with the “main” language. The same expressions can appear both inside and outside the string literal.

              I guess I didn’t explain the whole algorithm, but more than one person told me that they changed their string literal parsing algorithm based on this post. If it’s all recursive descent (which I imagine ShellHarden should be) then it should be pretty easy to add the modes. In the simplest case you have one mode for string literals and one mode for expressions. But shell is a fair bit harder than that unfortunately!

              And this one: https://www.oilshell.org/blog/2016/10/19.html

              You can’t do this in shell, because a double-quoted string can contain an entire subprogram:

              echo “BEGIN $(if grep pat input.txt; then echo ‘FOUND’; fi) END”

              That is, the “token” would have a recursive tree structure, which means it’s not really a token anymore. The modal lexer pattern lets us easily handle these more complicated string literals.

            2. 1

              But I draw a pretty bright line in Oil’s front end – lexing is non-recursive and parsing is recursive – and it has held up for years (which wasn’t obvious at the outset).

              From an academic view point, isn’t this how it is taught? i.e., What the lexer encodes is a regular language (zero or more tokens where each token matches a regular expression) and the parer encodes a context-free grammar. The intersection of both still remains context-free. If the lexer produces context-free tokens, the intersection between lexer and parser can be much more complex (no longer limited to context-free).

              1. 6

                Well most languages don’t have a lexer that is regular, nor a parser that is context-free.

                But they can still be divided into a lexer that’s non-recursive and a parser that’s recursive.

                I think the confusion is generally conflating those pairs of things. There are many ways of lexing that are not recursive but also not regular, and many ways of parsing that are recursive but not recognizing CFGs.

                The intersection of both still remains context-free.

                Yeah this isn’t not useful, realistic, and it doesn’t occur very much in practice. A language being context-free doesn’t have any useful engineering properties. (A language being regular does)

                I’ve explained why in many comments but never wrote a blog post about that. Some notes here: https://github.com/oilshell/oil/wiki/Language-Design-and-Theory-of-Computation#context-free-grammars-are-not-powerful-enough-in-practice

                Related misunderstandings: https://lobste.rs/s/h9ypxi/is_golang_s_grammar_context_free

                If the lexer produces context-free tokens, the intersection between lexer and parser can be much more complex (no longer limited to context-free).

                Again, being context-free isn’t something to aim for. That’s a “meme from textbooks” that people copy without understanding, and then when you inevitably need to break the rules you have a mess. Actually a perfect example is the way bash uses yacc. (Although using yacc doesn’t mean your grammar is context-free, for more than one reason)

                1. 2

                  Yes it is how it is taught, or at least it is how I was taught and learned. On the other hand, it is ambiguous in academia too. Haskell is supposed to be academic, but Haskell lexical syntax includes nested comments, and we know nested comments can’t be a regular language.

                  1. 2

                    Yeah I remember a paper than said that OCaml has something like 50 ambiguities in its LALR(1) grammar … so it’s just not a powerful enough model for languages that people want to use. Mathematicians like syntax too!

                    Brains are better at recognizing syntax than computers :) We have trouble when there’s lots of redundancy. (Though honestly I don’t understand Rust’s foo.1.2, seems like foo[1][2] is just as good and more familiar.)

                    1. 1

                      I think the main concern there was that foo[1] encourages people to write foo[1+1] but foo.1 doesn’t. It is to induce people to think literals, not expressions, are expected.

              2. 3

                There are some very specific technical reasons for wanting a clear boundary here:

                • lexers are much better amendable to generation from grammar
                • lexers are very perf sensitive, as the size of input is larger (every byte, rather than every token)
                • lexers are perf sensitive, as they are useful for quick approximate analysis (first phase of syntax highlighting mainly)
                • lexers are much easier to make incremental
              3. 9

                Good article. I was going to skim it but it ended up holding my attention the whole way through.

                1. 6

                  I’m pretty sure Go only has two namespaces: goto labels and everything else. Can someone correct me if I’m wrong?

                  1. 5

                    That’s why, I think, larger companies can benefit from microservices architecture: in theory, if we just solve human coordination problem, a monolith can be architectured just as cleanly, while offering much better performance. In practice, at sufficient scale, maintaining good architecture across teams is hard, and becomes much easier if the intended internal boundaries are reified as processes.

                    Oh man, I’ve been struggling to express this for a while. There’s no technical reason monoliths need to decay into a ball of madness, but my experience is that they often do. It’s too easy to import a library that we’re not supposed to in the name of expedience (and an unscrupulous manager will likely command his team to do so to meet a deadline professing, “we will fix this right away after the release” and “we definitely won’t do this all the time, this will be an exception”).

                    It also reminds me of how programmers brought up on dynamic languages experience Go, TypeScript, etc–these languages make things so hard! Of course, those “hard things” are very often antipatterns, and while there’s no technical reason someone can’t make a well-written application in a dynamic language (and indeed, there are probably quite a few such well-written applications), for some reason Python/JS/etc projects on average are far messier than TypeScript or Go projects.

                    It seems like boundaries are a really good thing, perhaps counterintuitively.

                    1. 3

                      Swift elegantly disambiguates the two at the syntax level, by requiring a leading . for enum variants. Rust just hacks this at the name-resolution layer, by defaulting to a new binding unless there’s a matching constant in the scope.

                      I really like Swift’s solution. Can’t Rust just adopt it in a future edition?

                      1. 3

                        Would be a massively breaking change. I think rust is committed to not doing those

                        1. 3

                          My understanding is that adding a new feature is not a breaking change. I think .id is currently invalid syntax, so making it valid wouldn’t be breaking.

                          1. 1

                            Albeit one that could probably be automatically applied to legacy code with a linter-like tool.

                            1. 3

                              That’s not enough. There’s still churn and outdated books/tutorials/QA.

                              1. 3

                                RFC 2005 (match ergonomics) had similarly large churn problem to my proposed enum variant pattern syntax, but Rust still did it.

                                1. 1

                                  Match ergonomics was subtle. It left the old syntax valid, so there were no source code changes required, and it didn’t add any new tokens/sigils/keywords, so anyone who already knew the hard way could figure out the easy way, or just ignore it and keep writing “unergonomic” matches.

                                  1. 1

                                    To clarify, .id syntax addition also would leave the old syntax valid.

                          2. 1

                            I’ve definitely always been jealous of Swift’s enum syntax.

                          3. 2

                            This reminds me of another context dependence I’ve wondered about:

                            As in C, | is normally the binary or, and || the logical or operator. But look what happens in a match:

                            match optionalValue {
                                Some(1 | 2) => {}
                                _ => {}
                            }
                            

                            The inside of a match is another language where Some(1 | 2) does not mean Some(3)!

                            (I don’t know if the lexer is as agnostic as to really emit binary-or tokens here. It seems like it might as well be another lexer mode.)

                            1. 2

                              As the post explains, Rust pattern is disjoint with expression, so there is nothing shady going on. Some(1 + 2) also does not mean Some(3), it is a syntax error.

                              1. 4

                                Fair enough, rust patterns are not expressions. But that merely names the beast. I realize this context dependence in itself is more or less inevitable (for the same reason that I can’t expect things to mean the same inside and outside string literals), but when patterns and expressions are not syntactically distinguishable, it makes the reader also context dependent.

                                If I had instead used the matches macro, the reader just has to know that the second argument is a pattern in order to resolve the ambiguity:

                                matches!(optionalValue, Some(1 | 2))
                                
                            2. 2

                              Nice article with concrete examples!

                              It seems that we are generally overly-optimistic about internal boundaries, and they seem to crumble under the pressure of feature requests, unless the boundary in question is physically reified

                              Yes! People often wonder why I started Oil in a dynamic language, and then used gradually typing to make it strictly typed. That’s a good one-sentence summary!

                              In this case the “feature requests” are really “language design over many years” as well as “unexpected quirks of bash figured out after years”. Bash basically has a dynamically typed implementation in C, i.e. the “AST” of all shells is homogeneous.

                              I have been ripping up the codebase with freedom for years. This requires nearly 100% test coverage, mostly on the external boundaries, and I find that more useful than elaborate types. (Although now that it’s more mature and strictly typed, I’ve certainly gotten mileage out of reliable refactoring.)

                              There are a bunch of invariants I maintained through “metaprogramming with small code” (regular languages, ASDL, etc.) rather than “types”, like the ones described in this sibling comment: https://lobste.rs/s/ckntjg/almost_rules#c_j7bzgo

                              1. 1

                                Perhaps I’m missing something here, but in the “Patterns and Expressions” example, surely there is syntactic difference between none and None? Purely from the fact that None is capitalized, we can tell that it is a constructor while none is not. And no, the LHS of = (or match rules for that matter) is not an expression, it is a pattern.

                                1. 6

                                  You would think so, but no, case is completely nonsyntactic in Rust and you can have lower case constructors and upper case variables. In Rust code, constructors are usually capitalized and variables are not, but that’s just a convention.

                                  1. 1

                                    You’re right! Thanks.

                                  2. 5

                                    Case of the identifier doesn’t affect its syntactic category. You can have const nope: Option<i32> = None and a binding Nope.

                                    LHS of = is a (place) expression, otherwise stuff like foo.bar = baz won’t work.

                                    fn main() {
                                      let mut x = 1;
                                      let mut y = 2;
                                      *if true { &mut x } else { &mut y } = 92;
                                    }
                                    

                                    https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=a0a0bc560a63c331d74c3eac66fedae0

                                    1. 2

                                      Thank you for clearing that up. I guess I was expecting a greater similarity to ML.

                                      Edit: With regards to the LHS.

                                  3. 1

                                    Internal boundaries can be very useful for security and other aspects of a dependable system. Reminds me of djb’s “Some thoughts on security after ten years of qmail 1.0” paper.