1. 7

  2. 9

    I keep grumbling about this in different contexts, but I wish people would stop writing push-model lexers. This one is another instance. The lexer is responsible for deciding the type of the token. This pushes a lot of complexity into the parser for situations where there is some context sensitivity. You have to check against identifier and all of the context-dependent token types.

    A pull-model lexer takes the expected token type as an argument and returns either no token or a valid token of that type. This makes it trivial to handle ambiguous token types, because the parser (which has the parse state) is able to request the most specialised token type that makes sense in the current context.
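    A minimal sketch of what such an API might look like in C (all names and token kinds here are my own invention, not from the project under discussion). The key point is that `lex_expect` only commits the cursor when the requested kind actually matches, so the parser can try the most specialised kind first and fall back:

    ```c
    #include <ctype.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical token kinds; a real lexer would have one per terminal. */
    typedef enum { TOK_IDENT, TOK_NUMBER, TOK_KEYWORD_IF } TokenKind;

    typedef struct {
        TokenKind kind;
        size_t start;   /* offset into the source */
        size_t length;
    } Token;

    typedef struct {
        const char *src;
        size_t pos;
        size_t len;
    } Lexer;

    /* Pull model: the parser asks for a specific kind; we return true and
     * advance the cursor only if the input at the cursor matches that kind. */
    static bool lex_expect(Lexer *lx, TokenKind kind, Token *out) {
        size_t p = lx->pos;
        while (p < lx->len && isspace((unsigned char)lx->src[p])) p++;
        size_t start = p;
        switch (kind) {
        case TOK_KEYWORD_IF:
            if (lx->len - p >= 2 && memcmp(lx->src + p, "if", 2) == 0 &&
                (p + 2 == lx->len || !isalnum((unsigned char)lx->src[p + 2])))
                p += 2;
            break;
        case TOK_IDENT:
            if (p < lx->len && (isalpha((unsigned char)lx->src[p]) || lx->src[p] == '_'))
                while (p < lx->len && (isalnum((unsigned char)lx->src[p]) || lx->src[p] == '_'))
                    p++;
            break;
        case TOK_NUMBER:
            while (p < lx->len && isdigit((unsigned char)lx->src[p])) p++;
            break;
        }
        if (p == start) return false;  /* no token of the requested kind here */
        out->kind = kind; out->start = start; out->length = p - start;
        lx->pos = p;                   /* commit only on success */
        return true;
    }
    ```

    With this shape, a context-sensitive spot in the grammar becomes `if (lex_expect(&lx, TOK_KEYWORD_IF, &t)) … else if (lex_expect(&lx, TOK_IDENT, &t)) …` and the ambiguity never leaks out of the parser.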

    As a minor nit, it looks as if the token is returned as a null-terminated C string, which requires a copy, so it is easy to leak memory. If this is not a copy, then it lacks a length and so the parser has to re-lex to find the end. I prefer the Token type to contain a source location pair. I typically use something inspired by clang here, where the source location is a 32-bit integer that uses 1 bit as a discriminator to differentiate between values that encode a line, column, and file tuple in 31 bits, or a 31-bit index into a table of source locations that don’t fit in this encoding. You can then expose APIs for getting the length of a token and copying it as a string (into a caller-provided buffer). These can be static inline functions. This looks like it requires a single C string as an input, so you’d need something extra for building an include stack and your source locations can just be offsets into the stream.
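    Something like the following, where the exact bit layout (8-bit file id, 15-bit line, 8-bit column) is my own assumption for illustration, not clang’s actual scheme:

    ```c
    #include <stdint.h>

    /* The top bit selects between a packed (file, line, column) value and
     * an index into an overflow table of locations that don't fit. */
    typedef uint32_t SourceLoc;

    #define LOC_TABLE_FLAG 0x80000000u

    static SourceLoc loc_pack(uint32_t file, uint32_t line, uint32_t col) {
        if (file < 256 && line < 32768 && col < 256)
            return (file << 23) | (line << 8) | col;  /* fits: top bit is 0 */
        return LOC_TABLE_FLAG | 0; /* index into an overflow table, not shown */
    }

    static int loc_is_packed(SourceLoc loc) { return (loc & LOC_TABLE_FLAG) == 0; }
    static uint32_t loc_file(SourceLoc loc) { return (loc >> 23) & 0xFF; }
    static uint32_t loc_line(SourceLoc loc) { return (loc >> 8) & 0x7FFF; }
    static uint32_t loc_col(SourceLoc loc)  { return loc & 0xFF; }
    ```

    A Token then carries two of these (begin and end), and length / string-copy accessors fall out as static inline helpers over the pair.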

    In general, it’s a good idea to allow the input to be pulled in chunks, rather than having the whole thing provided at once. In C, I’d write this as a structure with a pointer, a start location, a length, an internal buffer, and a callback to update the pointer to point to the requested location (and another void* for stream context).
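    A sketch of that structure (field and function names are mine). The callback repositions the window so the requested offset is available; the in-memory refill at the bottom is just one possible implementation, standing in for a file- or socket-backed one:

    ```c
    #include <stddef.h>
    #include <string.h>

    typedef struct InputStream InputStream;
    struct InputStream {
        const char *data;     /* current window into the stream */
        size_t start;         /* stream offset of data[0] */
        size_t length;        /* valid bytes at data */
        char buffer[4096];    /* internal buffer the callback may fill */
        /* Returns nonzero on success; updates data/start/length. */
        int (*refill)(InputStream *s, size_t offset);
        void *ctx;            /* stream context, e.g. a FILE* */
    };

    /* Read one byte at an absolute stream offset, refilling as needed. */
    static int stream_byte(InputStream *s, size_t offset) {
        if (offset < s->start || offset >= s->start + s->length)
            if (!s->refill(s, offset))
                return -1;    /* end of stream */
        return (unsigned char)s->data[offset - s->start];
    }

    /* Example refill over an in-memory string stored in ctx. */
    static int mem_refill(InputStream *s, size_t offset) {
        const char *src = (const char *)s->ctx;
        size_t n = strlen(src);
        if (offset >= n) return 0;
        s->data = src + offset; s->start = offset; s->length = n - offset;
        return 1;
    }
    ```

    The lexer then addresses the input purely by stream offset, which also matches the offsets-as-source-locations idea above.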

    1. 1

      I haven’t found many pull-model lexers, are you aware of any that would be great to learn from?

      The only one I’ve really looked at is pulldown-cmark.

      1. 1

        Most real compilers end up writing an ad-hoc one. I’ve not found a general-purpose tool for writing them, so I’ve tended to write ad-hoc ones in various places.

    2. 4

      Nitpick: I like the simplicity of your approach, but maybe call it a lexer generator in C. I initially read it as something that could parse C, and then was befuddled by the example.

      1. 1

        I’m playing with it. The code is straightforward and the API is simple.

        I’ve submitted a pull request so that calling NFAFromRe("[") (which is of course a syntax error) returns an error instead of getting stuck in an endless loop; this way the caller can tell the user which regex caused the problem.

        There are some things I miss, for instance negated character classes such as [^"\\] for “anything except double quotes and escapes”, and also repetition counts such as [a-fA-F0-9]{4}, which admittedly is not hard to emulate by copy-and-paste.

        Very happy to see a program that delivers what it promises.

        1. 1

          Looks interesting, but I actually found that learning to handle a lexer generator and writing the syntax doesn’t take less time than just writing the lexer directly myself. Meanwhile I even doubt whether a parser generator saves time; most of the time goes into refactoring the grammar so it isn’t e.g. left-recursive. I mostly used ANTLR in the past and then switched to Coco/R, but now I only use it to check the syntax and calculate first and follow sets, and write everything myself.

          1. 1

            The benefit from a lexer generator depends a lot on two things:

            • How efficient your language’s regular expression engine is.
            • How complex your tokens are.

            If your tokens are all strings or simple expressions (e.g. letter-or-underscore followed by arbitrary non-whitespace characters, or string-of-digits) then writing a simple matcher for them is trivial and you don’t need to go via a regular-expression formalism as an intermediate step. If your tokens are more complex and you do need to express them as regular expressions, a lexer generator is generally a win for AoT-compiled languages because it can expand them into efficient matchers at compile time, whereas a generic regex engine will run interpreted bytecode at best. In a JIT’d language implementation, things are a bit more complex, because a trace-based JIT running over a good interpreter can generate code similar to the specialised matchers.
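            To illustrate the “trivial” case: hand-written matchers for the two token shapes mentioned above fit in a few lines each (these are my own sketch, returning the token length at the cursor, or 0 on no match):

            ```c
            #include <ctype.h>
            #include <stddef.h>

            /* Letter-or-underscore followed by arbitrary non-whitespace. */
            static size_t match_ident(const char *s) {
                if (*s != '_' && !isalpha((unsigned char)*s)) return 0;
                size_t n = 1;
                while (s[n] && !isspace((unsigned char)s[n])) n++;
                return n;
            }

            /* String-of-digits. */
            static size_t match_digits(const char *s) {
                size_t n = 0;
                while (isdigit((unsigned char)s[n])) n++;
                return n;
            }
            ```

            Nothing here benefits from a regex intermediate; the generator starts paying for itself only once token shapes need real alternation and backtracking-free disambiguation.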

            1. 1

              > are more complex and you do need to express them as regular expressions

              Here is one of my projects as an example: https://github.com/rochus-keller/Oberon/. It includes both generated and hand-crafted lexers and parsers. I have never met a language where at least the number literals weren’t candidates for regular expressions, but recognizing keywords is more expensive in my experience. I would never use a generic regex engine in a lexer.