1. 25
  1. 3

    One alternative that meets the io.Reader support requirement and minimizes allocations is to have the lexer maintain a buffer and return byte slices. That is, each time you read from the source, you also write to the buffer. When emitting tokens, the Val string field of Token becomes Val []byte and you slice from your internal buffer. Mutability isn’t a problem since the buffer is created by the lexer and ownership of the slices are transferred to the consumer of the tokens. When the buffer fills up, you simply replace it with a new empty buffer, leaving the GC to cleanup references to the old buffers.

    If you want to eek out even more performance by avoiding the GC’d pointer inherit in each Val []byte, you can replace that with Pos, End int and make the lexer’s client responsible for buffer management and lazily reconstructing the slices from the buffer plus the slice indexes.

    1. 3

      Kind of related, I’ve been meaning to do some basic experiments/benchmarks to understand when handwritten lexing is faster than regexps (for example, comparing some typical [not necessary complete] email regexp matcher with a handwritten matcher that matches the same strings).

      At the highest level I assume that kind of quickly a lexer will be faster than most regexps since my lexer would be a precise function and regexp is a general purpose engine.

      But I’m sure there will be surprises.

      1. 2

        Yes, that’s almost certainly the case, especially given that Go’s regexp package is quite slow (guaranteed linear-time, but slow): https://github.com/golang/go/issues/26623

        1. 2

          FWIW you can also generate code from regexes – i.e. it can be a regex compiler rather than an interpreter.

          For example Oil’s lexer uses re2c to generate C code.

          intermediate source: https://www.oilshell.org/release/0.10.0/source-code.wwz/_devbuild/tmp/osh-lex.re2c.h

          generated code: https://www.oilshell.org/release/0.10.0/source-code.wwz/_devbuild/gen/osh-lex.h

          The generated code is a big state machine with switch and goto, and it doesn’t allocate memory or anything like that. It’s quite fast in my experience.


          I meant to write a blog post on re2c, but haven’t gotten around to it. Some pictures here:

          https://github.com/oilshell/blog-code/tree/master/fgrep-problem-benchmarks

          Another benefit of regular languages is that you get (simulated) nondeterminism FOR FREE. It’s still a linear pass, but it can be in multiple states at once, and then make the decision much later.

          So it can be easier to write certain kinds of lexers, and even move work from the parser to the lexer. I wrote about related topics here but didn’t quite emphasize the nondeterminism:

          https://www.oilshell.org/blog/2020/07/eggex-theory.html