1. 56

Ergex is a regular expression engine written in Rust. It has some unusual features:

  • A push architecture (so you can look for matches in streaming or distributed data)
  • Simultaneous matching of arbitrarily many expressions
  • No allocations at match time
  • Always linear performance
  • A push-oriented implementation of Aho-Corasick
  • An interesting ShrinkSet data structure

It is nowhere close to ready for prime time, but some stuff has come up in my personal life that means I won’t be able to dedicate much time to it for the foreseeable future (everything’s fine, thank you), and rather than let it languish, I put it out there. It’s got a fairly comprehensive test suite and works pretty well for the state it’s in.

  2. 10

    Another example (other than hyperscan) of a regex implementation with streaming support is pire. I’m a bit surprised that this is supposedly not a very common feature of regex libraries.

    Would be nice to see benchmarks.

    1. 2

      Oh interesting, I can’t believe I’d never seen pire before. Looks really cool.

      Ergex is relatively fast in what limited benchmarking I’ve done, but because it tracks POSIX submatches, it will always be slower than engines that don’t.

      The push-oriented Aho-Corasick does help considerably with simultaneous matching - there’s only a single AC automaton, regardless of how many regular expressions are in the database (versus the simpler approach of having an AC per expression).

      I experimented with multi-Rabin-Karp and Commentz-Walter in place of AC, but AC ended up being the fastest in my testing. I’m sure there are cases where that would not be true.

    2. 2

      Hm looks cool! re2c apparently supports the push style as an option, though I haven’t used it:

      https://re2c.org/manual/manual_c.html#command-line-options

      -f --storable-state

      Generate a lexer which can store its inner state. This is useful in push-model lexers which are stopped by an outer program when there is not enough input, and then resumed when more input becomes available.


      As far as matching many patterns simultaneously, I did a little benchmark here that takes thousands of entries from /usr/share/dict/words and uses them as patterns, i.e. the “fgrep problem” (so called because fgrep supports multiple fixed strings at once):

      https://github.com/oilshell/blog-code/tree/master/fgrep-problem-benchmarks

      It compares a bunch of regex engines, but doesn’t use the push style. The main goal was to show that regex engines handle the special case of many fixed strings quite well, even without Aho-Corasick.

      1. 1

        Interesting.

        I did a quick benchmark between ripgrep and ergex. The sample ergex program is:

        use ergex::*;
        use std::fs::File;
        use std::io::{self, BufRead, Read};

        struct PrintingHandler {}

        impl MatchHandler for PrintingHandler {
            // Called once for every match; returning Yes keeps the scan going.
            fn on_match(&mut self, _id: usize, matches: &[Capture]) -> ContinueMatching {
                println!("Got match at {:?}..{:?}", matches[0].start, matches[0].end);
                ContinueMatching::Yes
            }
        }

        fn main() {
            // Compile one expression per dictionary word into a single database.
            let mut builder = DatabaseBuilder::new();
            let file = File::open("strings.txt").unwrap();
            for (number, line) in io::BufReader::new(file).lines().enumerate() {
                if let Ok(s) = line {
                    builder = builder.with_expression(Regex::new(number, s.trim()).build().unwrap());
                }
            }

            let database = builder.build();
            let mut handler = PrintingHandler {};
            let mut scratch = database.make_scratch(&mut handler);

            // Push the whole haystack as one block, then signal end-of-input.
            let mut file = File::open("all-1.txt").unwrap();
            let mut contents: Vec<u8> = vec![];
            file.read_to_end(&mut contents).unwrap();
            scratch.push(&contents);
            scratch.finish();
        }
        

        I took a random sample of 6000 words from /usr/share/dict/words on my Mac and used the all-1.txt from your website.

        Here are the results:

        ; time rg -f strings.txt all-1.txt > /dev/null
        rg -f strings.txt all-1.txt > /dev/null  0.33s user 0.01s system 99% cpu 0.349 total
        
        ; time ./target/release/greptest > /dev/null  
        ./target/release/greptest > /dev/null  0.78s user 0.09s system 99% cpu 0.870 total
        

        So, ~2.5 times slower than ripgrep, though the ergex version reports one line per match (so if a given line matches more than once, it’ll get more than one result line) and reports submatch positions.

        So, not terrible.

        I did find a bug with the lifetimes in the AC scratch, though (which is why I’m reading the content all at once rather than line-at-a-time). I’ll fix that tonight.

        1. 2

          OK nice!

          Since you implemented Aho-Corasick and I suppose “the Thompson construction” (?), do you have any insight as to how (or whether) Aho-Corasick does better on the “fgrep problem”?

          My memory is fuzzy, but the whole argument was basically that, at both build time and run time, you may not get a big benefit from knowing you have the “fgrep problem” and using Aho-Corasick, versus treating it as the general problem of matching many regexes. At least in theory, you can build a DFA quickly either way.

          (I did learn that RE2 sometimes doesn’t build a full DFA for memory reasons, and rust/regex follows that decision. But re2c must build the full DFA, since it compiles to C code, and it doesn’t appear slow or memory-intensive.)

          I guess the obvious thing to do is to compare my benchmark with Aho-Corasick, but I didn’t do that, partly because I don’t know Rust!

          1. 3

            On the regular expression side, ergex uses a Pike-style virtual machine exclusively (this is the same underlying mechanism used in, for example, the Go regexp package). A Pike-style VM essentially simulates an NFA.

            DFAs are faster but are, as you said, memory-intensive. More importantly, though, they cannot report submatch boundaries. One of the hard requirements for ergex is that it support POSIX submatch semantics.

            (In the Mysterious Future, if I have the time, ergex will use different matching backends for situations where the caller doesn’t specify any submatches or says they don’t care about them, but for the sake of simplicity that isn’t done here. Doing so would give some very large speedups when it could be used.)
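
            To make the submatch point concrete: in a Pike-style VM, each simulated NFA “thread” carries its own capture slots, which is exactly the per-thread information a classic DFA throws away. A minimal sketch, not ergex’s actual types:

            struct Thread {
                pc: usize,                // position in the compiled program
                caps: Vec<Option<usize>>, // start/end offsets for each capture group
            }

            All live threads step in lockstep over each input byte, so the work per byte is bounded by the program size - which is also what keeps matching linear.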

            To match multiple expressions, ergex divides the set of expressions it’s matching into three groups:

            • expressions with a constant prefix (which could, theoretically, be the whole string)
            • expressions without a constant prefix
            • expressions with a left anchor (^)

            It then builds a single AC automaton (more on that in a sec) covering all of the expressions with constant prefixes, by building an AC that matches those prefixes.

            Then, every time a block of bytes is pushed (remember, ergex is push-oriented and you push bytes at your leisure; ergex never asks for more), the bytes are handed to all of the prefixless and left-anchored expressions.

            As soon as a left-anchored expression doesn’t match a byte, it’s removed from the set of expressions that get future bytes (using a clever data structure called a ShrinkSet; removal and refilling are O(1) operations).
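
            Not ergex’s actual code, but a minimal sketch of the swap-remove trick that gives a set like this O(1) removal and O(1) refill:

            struct ShrinkSet {
                items: Vec<usize>, // items[..live] are active; items[live..] removed
                slot: Vec<usize>,  // slot[x] = current index of x within `items`
                live: usize,
            }

            impl ShrinkSet {
                fn new(n: usize) -> Self {
                    ShrinkSet { items: (0..n).collect(), slot: (0..n).collect(), live: n }
                }

                // O(1) removal: swap the item into the dead region and shrink.
                fn remove(&mut self, x: usize) {
                    let i = self.slot[x];
                    if i >= self.live { return; } // already removed
                    self.live -= 1;
                    let last = self.items[self.live];
                    self.items[i] = last;
                    self.items[self.live] = x;
                    self.slot[last] = i;
                    self.slot[x] = self.live;
                }

                // O(1) refill: every item becomes active again.
                fn refill(&mut self) {
                    self.live = self.items.len();
                }

                fn active(&self) -> &[usize] {
                    &self.items[..self.live]
                }
            }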

            Prefixless expressions get the bytes as they come and do whatever they need to do.

            Prefixed expressions can be in one of two states:

            • waiting for the start of a prefix
            • matching

            If they’re already matching, they’ll get the bytes just like any other expression. If they’re waiting for the start of their prefix, though, they’re “idle”.

            Rather than one AC per expression (since an expression might have multiple constant prefixes), we have one AC per expression database. As the bytes come in, they are passed to the AC and the AC advances. The AC stores a single integer of state: the node in the automaton. When a matching node is reached, all of the associated strings are reported to any regular expressions that were waiting on those prefixes, and those expressions are kicked off. Because we can’t be sure the prefix is still available in memory (it may have been pushed in multiple blocks, straddling one or more block boundaries), the AC subsystem simply backs up the virtual offset and hands the entire prefix to the regular expression. This doesn’t mean we’ve buffered anything; we know the prefix because it was in the automaton.
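
            That single-integer state is what makes the AC push-friendly. A sketch of the idea (fully dense transitions for brevity; not the real implementation):

            struct Automaton {
                next: Vec<[u32; 256]>, // next[node][byte], failure links baked in
                out: Vec<Vec<usize>>,  // pattern ids that end at each node
            }

            struct AcScratch {
                node: u32,     // the only state carried between pushes
                offset: usize, // virtual offset into the stream as a whole
            }

            impl AcScratch {
                fn push(&mut self, ac: &Automaton, block: &[u8]) {
                    for &b in block {
                        self.node = ac.next[self.node as usize][b as usize];
                        self.offset += 1;
                        for &pat in &ac.out[self.node as usize] {
                            // A prefix straddling earlier blocks is still found,
                            // because the node survived the block boundary.
                            println!("pattern {} ends at offset {}", pat, self.offset);
                        }
                    }
                }
            }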

            Once a prefixed expression is kicked off, it proceeds as any other until it enters an idle state again or becomes disabled for whatever reason.

            One enhancement that should be made, and that is not currently in ergex, is handling the case where the prefix is the whole expression - that is, where the expression is a static string. Currently, once the prefix is matched, it is handed off to the regular expression engine, which matches the prefix again and then continues on (in the “that’s the whole expression” case, it continues on just to report matches). This could be fixed, but wasn’t considered worth the effort at this point. It’s more complicated than it might otherwise appear because we need to track submatches.

            The AC itself (I think I’m finally getting to your actual question) is implemented as a series of nodes arranged as you’d expect from the Aho-Corasick algorithm. Each node has a (potentially empty) list of strings that match at that node and a set of successor nodes reachable by a given next byte.

            Borrowing an idea from the excellent aho_corasick crate, we make the AC construction optionally as dense as you’d like. Dense nodes have fully-specified transitions to their next states via a 256-entry array. A fully dense AC would be a DFA. We specify a “prefix set” such that the first N nodes are fully dense, and subsequent ones are sparse. Sparse nodes do a simple linear lookup of their next states to see if they would match the current input byte.

            Surprisingly, a DFA (a fully dense AC) is often slower because fewer nodes fit in the data cache. By default, only the first 1-3 nodes are dense; this is tunable, however.
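
            A hypothetical layout for that split (ergex’s real types may well differ):

            enum AcNode {
                Dense {
                    next: Box<[u32; 256]>, // one slot per possible input byte
                    matches: Vec<usize>,
                },
                Sparse {
                    next: Vec<(u8, u32)>,  // (byte, target) pairs, scanned linearly
                    matches: Vec<usize>,
                },
            }

            impl AcNode {
                fn step(&self, b: u8) -> Option<u32> {
                    match self {
                        // Dense: a single array index.
                        AcNode::Dense { next, .. } => Some(next[b as usize]),
                        // Sparse: a short linear scan; fine for deep, rarely-visited nodes.
                        AcNode::Sparse { next, .. } => {
                            next.iter().find(|&&(byte, _)| byte == b).map(|&(_, t)| t)
                        }
                    }
                }
            }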

            So…yeah…that was a wall of text. I apologize. I hope it answered your question! :)

            1. 2

              Thanks for the explanation! I think it does give me a better idea of the relationship. I did not quite realize that you use Aho-Corasick as part of the “big matcher”. I have used that type of matcher in the past (on thousands of regexes) but I don’t remember exactly how it worked. I think RE2 has a version of this too (MatchSet or something like that).

              I think the high level question is “Why Aho-Corasick?” and I guess the reason must be either speed or memory usage.

              However there is a separate issue of scaling with the number of fixed strings or the number of patterns.

              The important thing I forgot to say is that even with 6000 strings, re2c generates less than 213 KB of native executable code, and it is solving the more general problem, not just Aho-Corasick. And it scales better than RE2 or ripgrep. Notice that the DFA method is roughly constant with respect to #strings, while RE2 and ripgrep are linear.

              https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/fixed-strings.sh#L339

              So the separate question is “Why not use a DFA?” And the answer I’ve been getting is “because it takes more memory”. However, if it’s also constant with respect to #strings, that’s a very important property!

              And the DFA size is linear with respect to the pattern size.

              So I guess my overall claim is that at least some implementations of Aho-Corasick don’t solve the “fgrep problem” as well as the DFA that re2c creates, though I haven’t looked into whether they are better than RE2 and rust/regex.


              Also there is a paper here about POSIX submatch extraction with DFAs (I haven’t read it):

              https://re2c.org/#papers-articles

      2. 2

        Do you have literature on push matching? I’d like to read more about it.

        1. 2

          I’m afraid not. Before @rzhikharevich mentioned pire in this thread, the only other instance I was aware of was Hyperscan.

          The push-oriented Aho-Corasick implementation in ergex is, AFAIK, the only one of its kind.

          1. 2

            Do you have further reading on what “push-oriented” could mean, though? Otherwise I will take a look at the code.

            1. 2

              Ah, sorry. It’s nothing as fancy as you might think; it’s not describing a new algorithm or anything.

              Most regular expression engines assume either that all of the input to be searched is available the whole time, so they can move backwards and forwards in it if need be, or that it arrives in order and they can ask for “the next byte” at will.

              I would call that approach pull-oriented. Ergex is push-oriented: when you get some bytes, you call scratch.push(bytes) and Ergex will tell you via callback if any matches are detected yet. The scratch space keeps a minimal amount of state (it doesn’t buffer the input).

              The easiest practical example is network traffic: I want to run 1000 regexes over arbitrarily many TCP streams that I’m passively monitoring (that is, I’m not an endpoint, so I can’t “read” from the stream). I create a scratch space for each stream, and then as packets come across, I push that packet’s data to the scratch space, which then will tell me which, if any, regexes matched. I keep doing that until the connection is closed.
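
              In terms of the API from the benchmark example earlier in the thread, that looks roughly like this (the patterns and packet plumbing here are invented for illustration):

              use ergex::*;

              struct AlertHandler { stream_id: u64 }

              impl MatchHandler for AlertHandler {
                  fn on_match(&mut self, id: usize, _matches: &[Capture]) -> ContinueMatching {
                      println!("stream {}: expression {} matched", self.stream_id, id);
                      ContinueMatching::Yes
                  }
              }

              fn main() {
                  // One database for all expressions, built once and shared.
                  let database = DatabaseBuilder::new()
                      .with_expression(Regex::new(0, "GET /admin").build().unwrap())
                      .with_expression(Regex::new(1, "password=").build().unwrap())
                      .build();

                  // One scratch per stream; push each packet's payload as it arrives.
                  let mut handler = AlertHandler { stream_id: 42 };
                  let mut scratch = database.make_scratch(&mut handler);
                  scratch.push(b"GET /adm");        // a match may straddle packets
                  scratch.push(b"in HTTP/1.1\r\n");
                  scratch.finish();                 // the connection closed
              }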

              The scratch spaces are a bit bigger than I’d like them to be at the moment (a huge part of that is due to submatch tracking, which can be reduced but not completely eliminated), and there’s some duplication that exists just to appease the borrow checker and will eventually be removed.

              1. 2

                This is excellent, thanks. Does it require some amount of concurrency for the callbacks or can it be synchronous?

                1. 2

                  Anything that implements the MatchHandler trait can be used. You pass one in when creating your regular expression scratch space, and its on_match method will get called whenever a match is detected. It’s agnostic about synchronicity. I’m not super familiar with async Rust, but I see no reason why it couldn’t be used in async code as well.

                  To maybe clarify a bit: Ergex is entirely externally-driven. It only does something when you call push; as a side effect of that push, your Handler callback might get called some number of times before the push returns. The scratch space maintains enough state such that the next push is understood to be the next set of bytes in the stream. When the stream ends, you call scratch.finish, which may also result in your handler getting called back before finish returns for any matches that happen at the end-of-input.

          2. 2

            I think this is mostly a programming style issue and not an algorithmic issue, since the input is handled by a DFA, which naturally works either way. (EDIT: it’s an NFA, but I think a similar argument applies)

            That is, I think it’s easier to “invert” a DFA than arbitrary functions that use a call stack. That’s because the DFA already naturally takes one byte of input at a time. That’s how it’s defined!
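
            To make that concrete: the entire resumable “execution context” of a DFA is a single integer, so a push-style runner is trivial. A sketch, not any particular engine’s code:

            struct Dfa {
                trans: Vec<[u32; 256]>, // trans[state][byte] -> next state
                accepting: Vec<bool>,
            }

            struct Runner<'a> {
                dfa: &'a Dfa,
                state: u32, // everything needed to resume later
            }

            impl<'a> Runner<'a> {
                // Push however much input you have; call again when more arrives.
                fn push(&mut self, bytes: &[u8]) -> bool {
                    for &b in bytes {
                        self.state = self.dfa.trans[self.state as usize][b as usize];
                    }
                    self.dfa.accepting[self.state as usize]
                }
            }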

            Personally I think this push vs. pull issue in programming is very annoying (and leads to threads / coroutines, etc.)

            Related comment with a bunch of links: https://lobste.rs/s/khbbac/generate_all_things#c_1i5c1n

            I have kept a log of such things to turn into a blog:

            https://oilshell.zulipchat.com/#narrow/stream/266575-blog-ideas/topic/blog.3A.20push.20vs.2E.20pull

            • node.js HTTP parser rewrite to use code generation (it’s a push-based parser written in the nginx style. Compare with the Go style)
            • re2c push vs. pull code generation
            • I believe GNU yacc supports push or pull code gen the same way (pull is much more common, and came first)
            • There is something of a debate about whether Pratt parsing is the “pull” version and shunting yard is the “push” version of the same algorithm. Though I guess it is clear that the first is pull and the second is push.
            • One weird insight I had is that the two “phases” of a shell are push vs. pull:
              • GNU readline / ble.sh is a push architecture, i.e. for getting each byte of input. ble.sh actually implements a crazy state machine parser for bash! It’s quite impressive.
              • The shell interpreter itself is a normal top-down “pull” that works one line at a time. It pulls lines, then lexes, parses, and evaluates them.

            Tweet I linked:

            https://twitter.com/halvarflake/status/1454804904342654976

            (and of course DFAs are a particular kind of state machine with some nice properties)

          3. 2

            Having a streaming API is interesting. I’m using PCRE2 within a text editor, and it supports “partial matching”, but the entire matched string (including possible lookbehind) must be in RAM as one piece.

            1. 1

              You know how much I love text editors! I’d love to hear about what you’re working on.

              1. 2

                I’m writing a little emacs using the data structures from vis: https://github.com/leahneukirchen/te