1. 30

  2. 26

    These articles are a lovely read. As the articles explicitly discuss, this documents the theory and some of the implementation details of RE2. RE2 and these articles also directly inspired Go’s regexp package (authored by Russ himself) and Rust’s regex crate (authored by me).

    One of the main motivations of RE2 itself was that most popular regex engines were implemented via backtracking. Depending on the regex, executing a backtracking implementation may take exponential time in the size of the input, which can lead to things like REDoS. RE2 on the other hand, guarantees that it will always execute a regex search in O(mn) time, where m ~ len(regex) and n ~ len(haystack). For folks familiar with RE2 and these articles, that’s not news. So with that in mind, I figured I’d document some of the downsides of implementing regexes using the approach described by RE2. (And of course, I do not mean for the list to be exhaustive.)

    For context, and for folks not familiar with RE2 internals, RE2 is actually composed of a number of regex engines. In general, each regex engine occupies a particular space in performance vs features. That is, the faster the engine, the less versatile it is.

    • The Pike VM is the most versatile. It most closely resembles Thompson’s construction followed by an NFA simulation with the additional power of tracking capture group match locations. It can service all requests but has very high constant factors. This makes it extremely slow in practice, but, predictably slow.
    • The “bitstate backtracker” implements the classical backtracking algorithm, but maintains a bitset of which states have been visited. This maintains the algorithmic complexity guarantees at the expense of some overhead. The size of the bitset is proportional to len(regex) * len(haystack), so it can only be used on small regexes and small haystacks. But, it is in practice a bit quicker than the Pike VM. Other than the restrictions on sizes, it can service all requests.
    • The “one pass NFA” is like the Pike VM, but can only execute on regexes that never need to consider more than one possible transition for any byte of input. This leads to a substantial reduction in constant factors when compared to the Pike VM. For the subset of regexes on which this can execute, it can service all requests.
    • Finally, the “lazy DFA” or “hybrid NFA/DFA” is very much like a DFA, except it compiles itself at search time. In a large number of cases, this provides the performance benefits of a traditional DFA without having to pay the exorbitant cost of building the entire DFA before searching. This is the fastest regex engine in RE2, but can only provide the starting and ending offsets of a match. It cannot report the positions of capturing groups. (Getting the starting location of a match actually requires a separate reverse scan of the haystack, starting at the end of the match.)

    OK, so with that in mind, here are some downsides:

    • Even though RE2 and its ilk will never take exponential time on any input, they can still take a long time to execute. While search time is usually said to be linear in the size of the haystack, this assumes that the size of the regex is held as a constant. As I said above, execution is actually O(mn), so if your regex is large, then search time can be especially slow. For example: https://github.com/golang/go/issues/7608
    • DFAs don’t handle look-around too well. And if you try to add it (even for a small fixed size), it can blow up their size pretty quickly. RE2 does add limited single-byte look-around support to the DFA to deal with ^, $ and \b. Unfortunately, this prevents things like $ from treating \r\n as a line terminator and \b from being Unicode-aware. Rust’s regex crate does make \b Unicode-aware by default, so this is a pain point.
    • In order to resolve the capture group locations, RE2 will typically first run the DFA to find the span of the entire match and then one of {Pike VM, bitstate backtracker, one-pass NFA} to find the spans of each capturing group. This results in three passes over the matching portion of the text (twice with the DFA), which adds overhead to each match that a pure backtracking implementation doesn’t typically have. Moreover, the Pike VM and bitstate backtrackers are quite slow on their own, typically much slower than the backtracking implementations found in places like PCRE. (When PCRE doesn’t exhibit catastrophic backtracking, of course.)
    • Unicode is a tricky thing to figure out. Both RE2 and Go have fairly limited support for Unicode. Things like \w and \s, for example, are not Unicode-aware by default. Neither is \b. Rust’s regex crate, on the other hand, makes all of those things Unicode-aware. Plus a lot of other stuff. The main problem here is that Unicode really explodes the sizes of things, particularly the DFA. For example, if you fully compile the DFA for Unicode-aware \w, its size in memory is about 300KB. But a non-Unicode-aware \w is a mere 1.3KB. Now, this memory usage isn’t fully realized because of the lazy DFA, but it does mean you need to spend more time building out parts of the DFA depending on what kind of text you’re searching.
    • For similar reasons, large bounded repetitions aren’t handled well because they lead to very large automatons. When the automatons get large, you wind up with huge constant factors that slow down your search. In general, the worst cases are the things that would otherwise cause a fully compiled DFA to be exponential in size with respect to the regex. For example, [01]*1[01]{N} will produce about 2^N states in a traditional DFA.

    I’ve never studied a production grade pure-backtracking regex engine in a lot of detail, but one of the main conclusions to draw here is that in order to get a regex engine that is both fast in practice and fast in theory, you need a lot of complexity. That is, you need a lot of different regex engines implemented internally to deal with all of the different cases. If you give up on needing to be fast in practice, then you can just implement the Pike VM and be done with it. At that point, the hardest parts are probably the parser and the Thompson construction.

    With all that said, Hyperscan also guarantees linear time, but it goes about the problem in a very different way.

    1. 1

      Could you have a really dumb (exponential time) backtracking implementation that counts how many times it has backtracked, and bails out and switches to the O(nm) implementation stack once it has backtracked more than say 2 times per input byte?

      I expect this would not be an improvement to predictability. :)

      1. 2

        In theory yes. But usually you use backtracking to provide additional features that can’t be provided using finite automata. So in order for your idea to work, you also need to restrict the regex features used.

        IIRC, PCRE has a backtracking limit feature. That’s one of the reasons why its searches can fail. (Searches in RE2 or the regex crate can never fail.)

      2. 1

        Thank you for the awesome writeup! Do you have a version of it on your blog? I’m pretty sure a lot of people would find it interesting!

        I’m curious if you explored PEGs too, and LPEG in particular, and if yes would it be a huge stretch to ask if you fancied a similar writeup and in particular comparison to RE2? I know it’s a huge ask, but that’s probably the only chance I’ll ever have to pose it with a non-0 probability of getting an answer, so I’m not gonna skip it :D

        1. 3

          No I don’t and no I haven’t really explored PEGs. Sorry! PEGs live in a different domain than general purpose regex engines I think, which is why I haven’t explored them much.

          As for whether I could write such a post, maybe. The problem is that blog posts take a long time to write. I have a couple in the pipeline, but they’ve been there for over a year already. :-( I started a notes repo with the intent of publishing lower effort and possibly non-technical posts. So maybe I could put it there.

          1. 2

            In the back of my head, I’m slowly baking a thought how to make a lowest effort pipeline for picking some comments I wrote here or on HN and converting them into something like the “notes” you mention. Not yet sure how to present them to make the result attractive/approachable for readers. (Especially the “question” part.) Some “Socratic” dialogs or what?