1. 24
  1.  

  2. 2

    I’ve implemented Aho-Corasick a few times in my career. It’s one of my favorite algorithms. A visualization like this would have been helpful. Nice work on that! I remember having to work it out on paper each time.

    I posted one of my implementations here: https://github.com/codeplea/ahocorasickphp

    1. 1

      does anyone know how it is related to thompsons idea (https://swtch.com/~rsc/regexp/regexp1.html , the original paper seems to be behind a paywall) of how to match regular expressions?

      1. 10

        I’m not quite sure how to answer the question since it’s kind of broad. In the most simplistic sense, they are both implement matching using finite state machines that executes in time proportional to the input being searched. Aho-Corasick is one of many different ways to search for multiple strings, and IIRC, Navaro and Raffinot classify it was a prefix oriented approach that generalizes from the idea of KMP for single pattern search. (The other categories of string search being “suffix” and “factor based” approaches.)

        In the usual framing, an Aho-Corasick matcher is typically implemented as a simulation of a non-deterministic finite automaton, where the “non determinism” comes from following failure transitions. That is, for each byte of input, it is possible to follow N failure transitions before advancing to the next byte of input (where I believe N is proportional to the number of strings you’re trying to match). However, such a matcher can be converted to a deterministic finite automaton by pre-computing all of the failure transitions. Once you have that, each byte of input can be processed in a constant number of instructions, regardless of how many strings you’re searching. Of course, the downside of this approach is that it requires more memory. (In the literature, you usually see the NFA variant just called “Aho-Corasick” and the DFA variant called “Advanced Aho-Corasick.”)

        Thompson’s construction targets a strictly more general use case beyond just searching for literal strings, and is usually implemented by compiling a regex down into a small set of bytecodes for implementing alternations and repetitions. An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions. The actual time complexity here remains the same I think, but the processing of the various “split” bytecodes in Thompson’s construction in a VM of sorts is generally going to be quite slow.

        It would be interesting to try and draw a more concrete correspondence between Thompson’s “split” bytecode and Aho-Corasick’s failure transitions.

        I do also know that Paul Wankadia (current maintainer of RE2) has done some really cool work on removing the various split bytecodes from the normal Thompson construction, at the cost of doing a bit more work translating the regex to bytecodes. But I still don’t think it will approach to speed of Aho-Corasick. To a first approximation, I would somewhat expect Aho-Corasick to be about an order of magnitude faster than the typical Thompson construction with a VM. If you JIT the Thompson construction (which has been done), then all bets are off. :-)

        For completeness, note that there is also Glushkov’s construction algorithm, which also constructs an NFA for matching regexes, but with different trade offs than Thompson’s construction. That is, Glushkov’s algorithm does epsilon removal, where as Thompson’s doesn’t. But this means you get a bigger machine (more transitions) and spend more time building it. I’ve wanted to experiment with Glushkov’s construction for a while now, but I always get stuck when I try to figure out how to handle giant Unicode character classes (which also thwarts a lot of interesting bit-parallel implementations of NFAs).

        1. 2

          An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions.

          I don’t think this follows because both algorithms essentially compile their problems down to DFAs. The compile time might be different (and mostly negligible), but if the DFA is optimized, then I don’t see any reason that the execution time should be different.

          The regex -> NFA -> DFA -> optimization transformation does feel like “magic” in that sense, in that you can solve a harder problem in the same amount of time, at least from the POV of computational complexity.

          In Oil, I use re2c to compile a huge number of regexes down to a series of switch/goto statements, with no backtracking, and it seems to be very fast. There are some links to code at the beginning of this post: http://www.oilshell.org/blog/2017/12/17.html

          As far as I understand, re2c is basically doing the “Thompson” or Dragon Book transformation. Perhaps with some DFA optimizations that I don’t understand. But the expressiveness of the problem is the same – it accepts alternations of arbitrary regexes.

          It doesn’t do any backtracking or “split” – it’s all just switch and goto.

          The actual time complexity here remains the same I think,

          As long as both algorithms produce a DFA, the search should use O(n) time and O(1) space.

          I think you are saying that the “split” bytecodes from [2] necessary to simulate the NFA make things slower.

          But I am not sure this is true, based on [1]:

          In a sense, Thompson’s NFA simulation is executing the equivalent DFA: each List corresponds to some DFA state, and the step function is computing, given a list and a next character, the next DFA state to enter. Thompson’s algorithm simulates the DFA by reconstructing each DFA state as it is needed.


          All this is complicated by another fact I just read here:

          https://research.swtch.com/yaccalive

          The now-standard framing of Thompson’s paper as an NFA construction and the addition of caching states to produce a DFA were both introduced later, by Aho and Ullman in their various textbooks. Like many primary sources, the original is different from the textbook presentations of it. (I myself am also guilty of this particular misrepresentation.)

          (my emphasis)

          I think he’s saying that Thompson’s original method was actually based on regular expression derivatives (Brzozowski derivatives) and not the McNaughton-Yamada algorithm.

          The derivatives method produces an optimized DFA directly – it doesn’t require the NFA -> unoptimized DFA intermediate stages.


          I just watched a couple talks about derivatives recently, including one about redgrep (also by Paul Wankadia), which uses derivatives – not the McNaughton-Yamada algorithm used by RE2:

          https://github.com/google/redgrep

          redgrep: from regular expression derivatives to LLVM

          This one is a lot more friendly and shows the code on one slide basically:

          Regular Expression Derivatives in Python

          https://github.com/MichaelPaddon/epsilon

          I would like to replace Oil’s usage of re2c with this technique eventually. I’m not sure if it will make things faster, but it might make the code size smaller. The lexer is a single function that’s around 23K of native code right now.

          Anyway I still have a lot more to learn about this, and to actually try out the “new” algorithms in code :) But I thought it was funny to see that little footnote that Thompson’s algorithm as explained in those long articles isn’t really Thompson’s algorithm. This is not a minor point – derivatives are a REALLY different algorithm.

          (Note that parsing with derivatives is different than regex with derivatives. Cox is criticizing parsing with derivatives in that article, but he says that regex with derivatives is a “win-win” – the algorithm is smaller, and the output is more optimized (smaller DFA). I guess that was the point of the redgrep exploration.)


          [1] https://swtch.com/~rsc/regexp/regexp1.html:

          [2] https://swtch.com/~rsc/regexp/regexp2.html

          1. 3

            Minor update: The original re2c [1] paper does cite the Dragon Book, so at least in 1994 it used the McNaughton-Yamada algorithm to generate a DFA (not an NFA).

            From there it says that generating assembly code from the DFA is “straightforward” (section 3.2).

            The code is quite large now but I’m guessing it still follows that basic algorithm, along with a bunch of optimizations for DFA size, and then the big complication of submatch extraction (which Russ Cox’s articles also deal heavily with).

            [1] http://re2c.org/1994_bumbulis_cowan_re2c_a_more_versatile_scanner_generator.pdf


            Also if anyone is still reading who wants to manipulate regexes for a practical purpose (i.e. portably implementing extended glob as in bash/ksh), check out my issues :) https://github.com/oilshell/oil/issues?q=is%3Aissue+is%3Aopen+label%3Acomputer-science

            1. 1

              Hiya! I’m a huge fan of the work you’re doing with Oil. I am really excited to see how it turns out. :-)

              I don’t think this follows because both algorithms essentially compile their problems down to DFAs. The compile time might be different (and mostly negligible), but if the DFA is optimized, then I don’t see any reason that the execution time should be different.

              The regex -> NFA -> DFA -> optimization transformation does feel like “magic” in that sense, in that you can solve a harder problem in the same amount of time, at least from the POV of computational complexity.

              In Oil, I use re2c to compile a huge number of regexes down to a series of switch/goto statements, with no backtracking, and it seems to be very fast. There are some links to code at the beginning of this post: http://www.oilshell.org/blog/2017/12/17.html

              As far as I understand, re2c is basically doing the “Thompson” or Dragon Book transformation. Perhaps with some DFA optimizations that I don’t understand. But the expressiveness of the problem is the same – it accepts alternations of arbitrary regexes.

              It doesn’t do any backtracking or “split” – it’s all just switch and goto.

              I’m actually not terribly familiar with re2c, but I think there might be some confusion here. Thompson’s construction is really about building an NFA and then simulating that. At least, that’s what his original paper describes. (scihub this: https://dl.acm.org/citation.cfm?id=363387) The original paper actually describes something closer to a JIT, since the output of his compiler is object code. But the object code is still executing a simulation of an NFA. This is very clear via the discussion around CLIST and NLIST handling at match time.

              So to be clear, I wouldn’t say the output of Thompson’s construction is itself a DFA, but rather, an NFA that can either 1) be interpreted as-is or 2) further transformed into an actual DFA. (2) can itself be split into a couple different strategies. One of them is the “lazy DFA,” which I believe was pioneered by Thompson, but in his implementation of grep, and definitely not in his original IBM 7094 paper. A lazy DFA is something that essentially caches the result of interpreting the NFA during match time. This tends to work well in practice and also permits setting a bound of the amount of memory you use, at the cost of essentially falling back to the performance characteristics of NFA interpretation if you reach that bound. The other strategy for (2) is your normal DFA building: transform the NFA into a DFA via powerset construction, and then possibly, DFA minimization. I’m guessing that’s what re2c is doing. They are still likely using Thompson’s construction, but it’s only one piece of the puzzle.

              But yeah, the expressiveness of everything here is all equivalent. It’s a weird framing, but at this point, we tend to couple theoretical terms (“NFA”, “DFA”, and so on) with specific performance characteristics. e.g., an NFA uses a small amount of memory to execute, but has very high overhead while a DFA can use lots of memory but has very small constant factors.

              As long as both algorithms produce a DFA, the search should use O(n) time and O(1) space.

              I think you are saying that the “split” bytecodes from [2] necessary to simulate the NFA make things slower.

              But I am not sure this is true, based on [1]:

              In a sense, Thompson’s NFA simulation is executing the equivalent DFA: each List corresponds to some DFA state, and the step function is computing, given a list and a next character, the next DFA state to enter. Thompson’s algorithm simulates the DFA by reconstructing each DFA state as it is needed.

              I think it’s pretty uncontroversial to state that the interpretation of an NFA is going to be, in practice, slower than an implementation that reflects the key characteristic of a DFA: a small constant number of instructions per processed byte, typically in the form of a transition lookup, advancing to the next state and checking for a match condition. An interpretation of the NFA is usually going to result in a number of instructions proportional to the size of your regex.

              There are some interesting exceptions, but I don’t think it really changes the above much. One exception is bit parallel NFAs, which can work quite well—perhaps even better than DFAs—but only for smaller machines. Another exception is JIT. If you JIT Thompson’s construction, then the performance characteristics change from the normal “interpretation” of an NFA, which usually resembles a VM of sorts.

              The now-standard framing of Thompson’s paper as an NFA construction and the addition of caching states to produce a DFA were both introduced later, by Aho and Ullman in their various textbooks. Like many primary sources, the original is different from the textbook presentations of it. (I myself am also guilty of this particular misrepresentation.)

              (my emphasis)

              I think he’s saying that Thompson’s original method was actually based on regular expression derivatives (Brzozowski derivatives) and not the McNaughton-Yamada algorithm.

              The derivatives method produces an optimized DFA directly – it doesn’t require the NFA -> unoptimized DFA intermediate stages.

              Thompson’s original paper isn’t long, so I think I’d actually recommend that you just read it. :-) It should be clear that Thompson isn’t talking about a DFA, but an NFA.

              I’m not sure what Cox means by his statement. Certainly, Thompson’s original paper does not talk about DFA state caching, but the construction and execution is undoubtedly an NFA simulation. The only thing I can think of here is that Thompson never actually mentions “NFA” or “DFA” in his paper, and therefore, framing Thompson’s contribution in those terms is something that others have added. But it kind of seems like a distinction without much of a difference, so I might be missing Cox’s point here.

              I would like to replace Oil’s usage of re2c with this technique eventually. I’m not sure if it will make things faster, but it might make the code size smaller. The lexer is a single function that’s around 23K of native code right now.

              Anyway I still have a lot more to learn about this, and to actually try out the “new” algorithms in code :) But I thought it was funny to see that little footnote that Thompson’s algorithm as explained in those long articles isn’t really Thompson’s algorithm. This is not a minor point – derivatives are a REALLY different algorithm.

              (Note that parsing with derivatives is different than regex with derivatives. Cox is criticizing parsing with derivatives in that article, but he says that regex with derivatives is a “win-win” – the algorithm is smaller, and the output is more optimized (smaller DFA). I guess that was the point of the redgrep exploration.)

              Yeah, Paul’s talk on redgrep is pretty freaking sweet.

              I wish I could remember why I haven’t done much with regex derivatives. I’ve researched them before, but it’s been a while now at this point. If I had to guess, I’d say that I probably get stuck with figuring out how to handle large Unicode character classes (which is also where I get stuck in Glushkov’s construction and in trying bit parallel NFAs).

              1. 2

                BTW Darius Bacon had the same question about Thompson’s paper and derivatives and addressed it here:

                https://news.ycombinator.com/item?id=18434225

                I guess there is a related idea called Antimirov Derivatives / Partial Derivatives, and that leads to the NFA? I have to understand it better. It looks like Antimirov derivatives didn’t have a name until 1995, but Thompson must have used that idea (?).

                The algorithm to generate a DFA in redgrep / epsilon based on derivatives seem seems to be completely different. It does NOT generate an NFA; it generates a DFA directly (and it has fewer states than some implementations of Thompson construction -> DFA minimization).

                The hard part of that algorithm seems to be canonicalizing the DFA states so that you can compare them with equality. Everybody seems to mention that issue. But it seems like it’s a short algorithm.

                Paul W. mentioned partial derivatives in his redgrep talk with respect to capturing groups. I guess that was an open problem in redgrep.

                1. 2

                  Thanks for the kind words about Oil. I’m happy to have help from people familiar with regex implementation :)

                  Two somewhat separate points:

                  1. With regard to the original question: How does Aho-Corasick relate to Thompson’s construction? I agree in the question isn’t super clear, but I don’t think “Thompson’s method uses an NFA and Aho-Corasick uses a DFA, and NFA has more overhead than DFA” (i.e. because there are split bytecodes) quite captures it.

                  If I understand correctly, the difference between NFA and DFA really comes down to submatch extraction. You can extract submatches with the slower NFA method, but doing it with a DFA was impossible or very difficult.

                  The authors of re2c published a new algorithm in 2017 for doing submatch extraction with compiled DFAs, but I haven’t gone over it:

                  http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf

                  Anyway, the point is that submatch extraction with DFAs is hard. That is borne out in both RE2 and re2c. IIRC, the whole point of the “multi-engine” backend in RE2 by Cox is to handle this. It uses the slower method when it needs to extract submatches, and the faster DFA method when it doesn’t.

                  If you are NOT extracting submatches, which is the case for the Aho-Corasick problem, then you might as well just convert your NFA to a DFA and match more quickly.

                  I basically think of NFAs and DFAs as equivalent, since they are mathematically. They recognize the exact same class of languages.

                  This all boils down to terminology, but the way I think of it, there are two worlds. (And this is pretty much exactly the point that the Cox articles make.)

                  a) Perl-style regexes, PCRE, Python – this is backtracking world. They’re not using NFAs or DFAs.

                  b) awk, egrep, libc, RE2, re2c – This is NFA/DFA world or “Thompson’s construction” / Dragon Book / McNaughton-Yamada.

                  In my mind I don’t think of “Thompson’s construction” as NFA. But yes I see that Cox’s articles use “Thompson NFA” a lot. However I think the issue is not NFA vs. DFA. It’s NFA vs. backtracking, or really NFA/DFA vs. backtracking. That’s how I think about it and I think it captures the salient difference.

                  Anyway, this has probably hopelessly confused the original poster … long story short, I have never actually worked with Aho-Corasick, but my point is that if you create a regex like const_string_1|const_string_2|...|const_string_N, and you do the Thompson NFA -> DFA, you will end up with something that runs as fast as Aho-Corasick (since all DFAs can be interpreted in linear time and constant space).

                  And if you optimize the DFA, maybe you will end up with the exact same one (just guessing, I don’t know that.)


                  1. I went back and skimmed the Thompson paper, and I see the parts about CLIST and NLIST. It is not clear to me that this is about NFAs. As you say, it doesn’t mention those words. And it talks about Brzozowski derivatives in the first paragraph.

                  http://www.oilshell.org/archive/Thompson-1968.pdf

                  I don’t know 7094 assembly or ALGOL-60, so it’s hard for me to interpret. I tend to trust what Russ Cox says.

                  I think it’s plausible that it is using NFAs, because the Dragon Book explains it that way, and Russ Cox explained it that way in the first paragraphs of his first article.

                  But I think he went back and read it later and realized it was about derivatives! It’s also plausible to me that the technique is based on derivatives. In fact it is the most plausible to me because the paper mentions derivatives and does not mention NFAs or DFAs!!!

                  I guess to really answer this question I’d have to go implement both algorithms and see. (Or maybe just e-mail Russ Cox.)

                  (I actually implemented what were known as 2 separate parsing algorithms here and realized they are the same: http://www.oilshell.org/blog/2016/11/01.html . The author of precedence climbing acknowledged this to be true – links at end. Derivatives and McNaughton-Yamada seem very different to me, but maybe there is a deep connection.)

                  1. 1

                    With regard to the original question: How does Aho-Corasick relate to Thompson’s construction? I agree in the question isn’t super clear, but I don’t think “Thompson’s method uses an NFA and Aho-Corasick uses a DFA, and NFA has more overhead than DFA” (i.e. because there are split bytecodes) quite captures it.

                    Hmm I don’t think I said that. :-/ Sorry if that’s how it came across. The typical Aho-Corasick formulation uses failure transitions at match time, so it isn’t a DFA. It’s an NFA. “Advanced” Aho-Corasick is the classical DFA variant.

                    It doesn’t really make sense to compare and contrast the theoretical properties of Aho-Corasick and Thompson’s construction, because they are both just alternative ways of framing the construction of an NFA, which can eventually be turned into equivalent DFAs through powerset construction and minimization. The interesting comparison is at the implementation level, and specifically, the construction algorithms themselves. Aho-Corasick is a fundamentally more constrained algorithm, and doesn’t need to handle the full generality of regular expressions.

                    If I understand correctly, the difference between NFA and DFA really comes down to submatch extraction. You can extract submatches with the slower NFA method, but doing it with a DFA was impossible or very difficult.

                    Your second sentence is true, but the first sentence doesn’t really make sense to me honestly. If we’re talking about an implementation that reflects the structure of an NFA, then you’re almost always going to be looking at some interpretation of the NFA at match time (or, perhaps, a bit-parallel implementation if the machine is small enough or a JIT). This looks very very different from what a DFA implementation looks like.

                    Thank you for the link about re2c. I hadn’t seen that. Currently, Rust’s regex library (like RE2) needs to use an NFA to do submatch capture, which stinks, because it’s much slower than the DFA.

                    If you are NOT extracting submatches, which is the case for the Aho-Corasick problem, then you might as well just convert your NFA to a DFA and match more quickly.

                    This is too much of a simplification. The DFA variant of Aho-Corasick requires a lot more memory than the NFA variant. In my Aho-Corasick library, this is exposed as an option for users, so they can decide the time vs space trade off themselves.

                    I wonder if your Oil use case is making this a bit more difficult to grasp. In that context, where you have a tokenizer that rarely changes and a tool to convert your DFA into code, then an NFA seems rather silly. But in a more general context, where you’re building regexes at runtime and matching them, the specific trade offs between space and time become much more relevant. In that context, eagerly building a DFA is a pretty rare thing to do (as supported by numerous implementations such as RE2, Go’s regex library, GNU grep, Thompson’s original grep and of course my own regex library too).

                    I basically think of NFAs and DFAs as equivalent, since they are mathematically. They recognize the exact same class of languages.

                    Yes… But like I said in my comment above, we are well beyond theory here. The vernacular in this domain couples theoretical terms with implementation details. There are very real differentiating properties between “I implemented an NFA matcher” and “I implemented a DFA matcher.”

                    Take a look at section 6 of the original paper that introduced Aho-Corasick. It explicitly introduces a DFA variant, and explicitly calls out its downside as increased memory usage but faster matching. This is in contrast to the initial and classical formulation with failure transitions that uses less memory but is slower at matching.

                    In my mind I don’t think of “Thompson’s construction” as NFA. But yes I see that Cox’s articles use “Thompson NFA” a lot. However I think the issue is not NFA vs. DFA. It’s NFA vs. backtracking, or really NFA/DFA vs. backtracking. That’s how I think about it and I think it captures the salient difference.

                    No, I don’t think so. I mean, yes, of course it’s a critical difference, but that’s really not what we’re talking about here. We’re talking about Aho-Corasick vs the Thompson construction (or, RE2 if you like). Bringing up backtracking engines it well outside the scope, and moreover, it is not the only distinction that matters. You yourself noted the variety of implementation strategies used by RE2. These are precisely a reflection of the various properties of typical NFA vs DFA implementations. Waving all of this away honestly makes this entire discussion rather moot.

                    Anyway, this has probably hopelessly confused the original poster … long story short, I have never actually worked with Aho-Corasick, but my point is that if you create a regex like const_string_1|const_string_2|…|const_string_N, and you do the Thompson NFA -> DFA, you will end up with something that runs as fast as Aho-Corasick (since all DFAs can be interpreted in linear time and constant space).

                    That sounds right, yes, but it’s a somewhat trivial theoretical result. e.g., Given two regular expressions, you can determine whether the languages they recognize are equivalent by converting both to a minimal DFA. If the resulting DFAs are equivalent up to state renaming, then you know they recognize the same languages. Framing it this way misses the forest for the trees; there are very real implementation distinctions that arise in practice between regex implementations and Aho-Corasick implementations. A lot of it is really about tweaking time vs space trade offs. For example, RE2 (and also Rust’s regex library) will never actually build the full DFA of a regex eagerly. Instead, it’s constructed lazily. That lazy construction is really useful because it gets you very close to the full speed of a DFA but allows more precise control over the space used. Aho-Corasick, on the other hand, is going to use far less memory. In practice, for small regexes of literal alternations, you might choose to use an Aho-Corasick DFA (“Advanced Aho-Corasick”) instead of using the full blown regex engine, because the Aho-Corasick DFA is likely faster than the fully general lazy DFA.

                    I went back and skimmed the Thompson paper, and I see the parts about CLIST and NLIST. It is not clear to me that this is about NFAs.

                    It is. No DFA implementation is going to be explicitly manipulating states at match time. Derivatives are certainly used to motivate construction; but it’s still an NFA.

                    I’m not quite sure how to fix your confusion on this topic. :-/ I think if I could say one thing, it would be that NFA vs DFA have very real distinctions in their typical implementations, even if theoretically they give the same power. This is why we can speak separately about “Thompson’s construction” and “DFA”, because they are not intricately coupled.

                    I feel like a read of Navaro and Raffinot’s “Flexible Pattern Matching in Strings” would help clear up a lot of this for you, because they make the NFA vs DFA distinction very clear at the implementation level with lots of examples. Unfortunately, the book is a little pricey. If we were friends in real life, I’d lend you my copy. :-)

                    1. 1

                      I did some work on this that clarifies things somewhat – at least I think so, let me know what you think.

                      https://github.com/oilshell/blog-code/tree/master/fgrep-problem-benchmarks

                      I don’t think we are disagreeing that much about technical points. I was somewhat nitpicking the original answer because I didn’t think it illuminated the core issue. And because I want to understand the difference myself, for Oil! I think this will matter eventually and I’m not just arguing for its own sake :)

                      It doesn’t really make sense to compare and contrast the theoretical properties of Aho-Corasick and Thompson’s construction, because they are both just alternative ways of framing the construction of an NFA, which can eventually be turned into equivalent DFAs through powerset construction and minimization.

                      I think that totally makes sense! That is very important to understand!!! Does everyone reading this thread know that?

                      I don’t believe it’s common knowledge and it’s one of the first things I would say to someone who says “compare Thompson’s construction and Aho-Corasick”. If it’s true that they boil down to the same DFA, then that’s important to understand. (I think Aho-Corasick might behave differently with respect to overlapping matches. If that’s the main difference then that would clarify things.)

                      People generally care about execution time, not compilation time. Compilation time/memory depends on the size of the expressions, which is generally small (like bytes or kilobytes), while execution depends on the size of the input data, which could be gigabytes, terabytes, etc.

                      For one project I routinely used regexes over terabytes of data. I never thought about compilation time, or NFA vs. DFA. It wasn’t a choice you’re really able to make as a user of regular expressions. But you can make the choice of backtracking vs. automata-based (and not everyone is aware of that choice).

                      This is too much of a simplification. The DFA variant of Aho-Corasick requires a lot more memory than the NFA variant. In my Aho-Corasick library, this is exposed as an option for users, so they can decide the time vs space trade off themselves.

                      I don’t understand this… Look at the generated code I linked for re2c:

                      https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/_gen/fixed-strings.cc

                      https://raw.githubusercontent.com/oilshell/blog-code/master/fgrep-problem-benchmarks/_gen/code-size.txt

                      This code uses almost no memory, and the code size is tiny as well. It’s matching using a DFA. Why can’t Aho-Corasick do the same? Doesn’t it boil down to the same Trie-like DFA?

                      Framing it this way misses the forest for the trees; there are very real implementation distinctions that arise in practice between regex implementations and Aho-Corasick implementations. A lot of it is really about tweaking time vs space trade offs. For example, RE2 (and also Rust’s regex library) will never actually build the full DFA of a regex eagerly. Instead, it’s constructed lazily.

                      For the case that the input is an alternation of constant strings, why not do it? That seems to be exactly what re2c does. I get a trie-like DFA out of re2c and that appears to be what Aho-Corasick does.

                      Of course re2c has more time to compile things, since it’s offline, so that could be a difference. It does have a --dfa-minimization flag for the strategy. It could be that I am underestimating how hard that is.

                      But I’m only talking about the constant string case. I know that DFA minimization in general is hard, but what about minimizing an NFA with no repetition at all? (which is the Aho-Corasick problem)


                      I didn’t realize you had implemented the Rust aho-corasick library, but I linked it in my repo. (You should have mentioned that in the first reply! :) )

                      If I’m understanding correctly, I think it’s going to construct the same trie-like DFA as re2c did, and therefore it will run no faster than re2c. Or possibly slower since it’s interpreted and not compiled. I would love a PR for a benchmark :)

                      (Note that grep runs faster than re2c! Pretty amazing, and I note that I think it’s due to skipping input, but I didn’t verify that in any way. grep also runs faster than fgrep on this problem!)


                      About Thompson’s paper: I now agree it’s about NFAs. After looking at Figure 5, it’s obvious it is an NFA for a(b|c)d.

                      I’m still confused by the derivatives comment in the first paragraph. The 2009 paper that “reintroduced” regex derivatives actually makes a comment on this exact line at the end in the “Related Work” section:

                      Regular-expression derivatives re-examined

                      I’ll definitely be looking at derivatives more! Let me know if you do any work with them. Apparently they do support Unicode well and this paper specifically talks about large character classes. The “episilon” video by Michael Paddon I linked earlier says that Unicode was one of his primary motivations for derivatives.

                      There is a big Unicode data file in the repo: https://github.com/MichaelPaddon/epsilon/blob/master/epsilon/ucd.py

                      Also take a look at the table in section 5.2 on DFA size. It compares the derivatives method vs. a Thompson construction and says it’s always better.

                      I guess Aho-Corasick is probably always better than Thompson too for the subset of the problem it solves, as far as DFA size, but I would love to see if that translates to execution time. I think it will be hard to construct a case where re2c’s algorithms isn’t good enough and you have to use Aho-Corasick.


                      Also I see the clist and nlist thing here. It wasn’t clear from your comment what those were supposed to be doing in the Thompson paper, but it’s a lot clearer in this explanation:

                      https://swtch.com/~rsc/regexp/regexp1.html

                      The simulation uses two lists: clist is the current set of states that the NFA is in, and nlist is the next set of states that the NFA will be in, after processing the current character.


                      Anyway, thanks for this discussion! I’m definitely learning a lot. I want to eventually unify the mess of regexes in a typical shell script for Oil, so that’s why I’m so interested in this.

                      1. 1

                        I think that totally makes sense! That is very important to understand!!! Does everyone reading this thread know that?

                        Not sure. I jumped right over the theory part because in the theory, everything folds into equivalence with the same expressive power. The only real practical distinction there that I can think to make is that I remember NFAs being easier to construct than DFAs in my computability theory class (many moons ago), which in turn made it easier to use NFAs instead of DFAs to prove a particular language to be regular.

                        If it’s true that they boil down to the same DFA, then that’s important to understand. (I think Aho-Corasick might behave differently with respect to overlapping matches. If that’s the main difference then that would clarify things.)

                        Right… Good point. You probably can’t say for certain that they’ll boil down to the same DFA without being a bit more careful. Overlapping matches are probably part of it, but even in the non-overlapping case, the typical Aho-Corasick implementation will still probably have subtly different match semantics than the typical regex DFA implementation. That is, regexes typically implement “leftmost first” match semantics (the natural consequence of a backtracking implementation) or “leftmost longest” match semantics (found primarily in POSIX compatible regex implementations). That is, given the strings Sam and Samwise matching the string Samwise:

                        • Sam|Samwise matches Samwise in leftmost longest.
                        • Samwise|Sam matches Samwise in leftmost longest.
                        • Sam|Samwise matches Sam in leftmost first.
                        • Samwise|Sam matches Samwise in leftmost first.

                        So with leftmost longest semantics, the order of your alternation doesn’t matter. But with leftmost first semantics, the order does matter. RE2 (along with Go’s and Rust’s regex engines) both provide leftmost first semantics even though they are finite automata implementations. The reason for doing so is to match the semantics of popular regex implementations such as PCRE. Notably, RE2 (and Go) both permit the caller to switch to leftmost longest semantics if they want. Generally speaking, I’ve found leftmost first semantics to be more useful.

                        Back to Aho-Corasick… In most typical implementations that I know of (including my own), it will report all matches. So for Sam|Samwise, it would match at both Sam and Samwise. However, it’s also easy to tweak the matching routine (without changing the automaton) to not report overlapping matches by simply moving back to the start state after a match is found. In this case, Aho-Corasick will report Sam as a match but not Samwise regardless of the order of the patterns. In this sense, Aho-Corasick has neither leftmost first nor leftmost longest semantics, but leftmost shortest. (Generally, I see this as a problem and it’s one of the things on my list to fix. In particular, I’d like for Aho-Corasick to have the same match semantics as the regex engine so that Aho-Corasick can be used to completely replace the use of the regex engine internally when you have an alternation of literals.)

                        People generally care about execution time, not compilation time. Compilation time/memory depends on the size of the expressions, which is generally small (like bytes or kilobytes), while execution depends on the size of the input data, which could be gigabytes, terabytes, etc.

                        For one project I routinely used regexes over terabytes of data. I never thought about compilation time, or NFA vs. DFA. It wasn’t a choice you’re really able to make as a user of regular expressions. But you can make the choice of backtracking vs. automata-based (and not everyone is aware of that choice).

                        Sure, but this is mostly a domain specific thing. I’d agree that if you’re working on a problem in which re2c is worth using in the first place, then you probably always want to spend the extra time in compilation to make matching faster. In that sense, choosing between NFA and DFA is an easy choice: pick the DFA. But this is because you’re paying for compilation when you compile your program itself. In a typical regex engine, the cost of compilation is paid during the execution of your program. While users have generally grown to expect that regex compilation is not exactly super cheap, you generally don’t have all day either. For re2c, it’s conceivable that you might be willing to spend minutes building the DFA. But for a normal regex engine, you probably don’t want to spend more than a millisecond building your machine for reasonably sized regexes.

                        The other thing to look at here is that, in terms of implementation choices, an implementation modeled after an NFA tends to have more power in that it can resolve capture groups. Now, you did point out a paper that purports to do this in a DFA, but this generally isn’t how it has been done in practice (RE2, Go, Rust). I haven’t read the paper yet, so I’m also not sure what the trade offs are. My point here is that, depending on implementation, choosing between NFA and DFA might depend on what question you’re asking.

                        I don’t understand this… Look at the generated code I linked for re2c:

                        https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/_gen/fixed-strings.cc

                        https://raw.githubusercontent.com/oilshell/blog-code/master/fgrep-problem-benchmarks/_gen/code-size.txt

                        This code uses almost no memory, and the code size is tiny as well. It’s matching using a DFA. Why can’t Aho-Corasick do the same? Doesn’t it boil down to the same Trie-like DFA?

                        I don’t think looking at generated code out of context can really do much. e.g., I don’t know what set of strings went into the production of that machine. I’d guess it was small though. So in that sense, it’s no surprise that the corresponding DFA is small. When it comes to Aho-Corasick, I suspect that the size of the DFA is indeed proportional to the total size of all the patterns you’ve compiled into it. (I’m not 100% on that.) But in the case of regexes, with its repetition operators, the size of the DFA can be exponential in the size of the original regex.

                        Try building an automaton that matches any word in /usr/share/dict/words.

                        For the case that the input is an alternation of constant strings, why not do it? That seems to be exactly what re2c does. I get a trie-like DFA out of re2c and that appears to be what Aho-Corasick does.

                        It really just depend on your tolerance for memory-vs-speed-vs-compile-time. If you’re using re2c, then you almost certainly fall into the “I don’t care about compile time, and I probably have a very high tolerance of memory usage since that’s merely reflected in code size.” There are some secondary effects here of course, e.g., a large code size might result in thrashing your instruction cache.

                        Of course re2c has more time to compile things, since it’s offline, so that could be a difference. It does have a –dfa-minimization flag for the strategy. It could be that I am underestimating how hard that is.

                        But I’m only talking about the constant string case. I know that DFA minimization in general is hard, but what about minimizing an NFA with no repetition at all? (which is the Aho-Corasick problem)

                        Yes, while the general case is exponential, I believe that at least building an Aho-Corasick DFA is linear with respect to the number of literals you’re trying to match. I don’t know whether minimization is also minimal. But remember, memory usage is another key component here. The DFA is going to use a lot more of it. Usually, if you’re building a DFA, you’re either doing it at source compile time (so it’s embedded in the code as a state machine) or you’re doing it at runtime, which means it’s probably a table based DFA. Naively, for a byte based automaton, that’s 256 transitions for each state, where each transition needs to be big enough to point to any arbitrary state. If you choose 32 bits, then you’re already at 1KB per state. Now, obviously, there are different ways to represent DFAs. You could pick a sparse representation of the transitions, but this almost always sacrifices execution time because a sparse lookup will probably be slower. Beyond that, there are other tricks you can do like condensing the number of transitions per state to the minimal number of transitions needed by separating the space of all 256 bytes into equivalence classes. This incurs an extra lookup per byte at match time, but it’s generally worth it for the space savings.

                        I didn’t realize you had implemented the Rust aho-corasick library, but I linked it in my repo. (You should have mentioned that in the first reply! :) )

                        If I’m understanding correctly, I think it’s going to construct the same trie-like DFA as re2c did, and therefore it will run no faster than re2c. Or possibly slower since it’s interpreted and not compiled. I would love a PR for a benchmark :)

                        (Note that grep runs faster than re2c! Pretty amazing, and I note that I think it’s due to skipping input, but I didn’t verify that in any way. grep also runs faster than fgrep on this problem!)

                        Right, yeah, hah. I’ve been working on string search for a lot of years now. In terms of credentials, I also wrote Rust’s regex library, ripgrep (which bests grep in many cases) and one of the fastest CSV parsers in existence (which is particularly relevant to our conversation here, because I did it by compiling the CSV parser into a DFA on the stack, where as most CSV parsers use the standard NFA simulation embedded in the code).

                        Most or all of my work is focused on building finite state machines at runtime. This biases toward a very different set of trade offs than the offline compilation found in the domain of re2c. So I don’t think I would necessarily claim anything about performance. In general, I’d probably almost always give the edge to re2c over any equivalent machine compiled at runtime.

                        That grep runs faster is probably what I’d expect. A finite state machine has its limits. Its key advantage is the ability to express complex match semantics in a general way. But there are all sorts of special cases that can be optimized to go quite a bit faster than a normal finite state machine. You hinted at it with the “skipping input” aspect of grep (you’ve probably read this). But in modern times, it’s less to do with skipping input and more to do that the skip loop itself is written to use optimized vector instructions via memchr (glibc’s implementation of that is in x86 Assembly, my own implementation uses normal Rust code with vendor intrinsics). To that end, it turns out that you don’t even need to skip input to go very fast, but it’s instead better to pick the byte you skip on more wisely. That’s what led to my own FreqyPacked algorithm that uses a static notion of how common certain bytes are, and chooses the skip loop based on that. This is distinct from Boyer-Moore, which is usually written with a skip loop on the very last byte. That’s what GNU grep does. This specific algorithmic difference is a key reason why ripgrep edges out GNU grep in a lot of cases.

                        I’ll definitely be looking at derivatives more! Let me know if you do any work with them. Apparently they do support Unicode well and this paper specifically talks about large character classes. The “episilon” video by Michael Paddon I linked earlier says that Unicode was one of his primary motivations for derivatives.

                        Interesting! Thanks for the links. One of the tasks that’s nextish on my list (after building out a bit more infrastructure) is to rethink the optimization and compilation passes in Rust’s regex crate. No small part of that will be figuring out how to address big Unicode classes, because right now, they are a primary thing that causes Thompson’s NFA construction to go into the weeds.

                        Anyway, thanks for this discussion! I’m definitely learning a lot. I want to eventually unify the mess of regexes in a typical shell script for Oil, so that’s why I’m so interested in this.

                        Good luck!

                        1. 1

                          I will respond more later, but I just tested ripgrep and it beats grep (and fgrep)! Impressive :)

                          https://github.com/oilshell/blog-code/tree/master/fgrep-problem-benchmarks

                          Is ripgrep basically a wrapper around the Rust regex crate, or is there more to it?

                          I’m curious about RE2 now. Also curious about your aho-corasick library. I’m going to be surprised if can actually take advantage of anything to perform better than ripgrep :)

                          The fact that fgrep is slower than grep makes me doubt that the constant string alternation problem is really “easier” than the regex problem.

                          1. 1

                            Is ripgrep basically a wrapper around the Rust regex crate, or is there more to it?

                            Hah… You’ll want to read the blog post I linked you to. :-) There is quite a bit more to it, but most of the interesting algorithmy stuff is in the regex engine. ripgrep itself is mostly engineering. Note that ripgrep’s regex engine is actually pluggable. By default it uses Rust’s regex engine, but the -P flag enables PCRE2, assuming you build ripgrep with PCRE2 support. (See README.)

                            Your RE2 and re2c benchmarks aren’t quite measuring the same thing as your grep/fgrep/ripgrep benchmarks. The latter are counting lines, but the former are counting all matches. Consider this comparison, where the former corresponds to counting all matching lines while the latter corresponds to counting every individual match:

                            $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt -c
                            2969020
                            real    0.673
                            user    0.655
                            sys     0.017
                            maxmem  343 MB
                            faults  0
                            
                            $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt -c -o
                            3830430
                            real    1.497
                            user    1.458
                            sys     0.037
                            maxmem  343 MB
                            faults  0
                            

                            Since your query has a lot of matches, this actually winds up being quite significant. I don’t think GNU grep actually has a way of counting every individual match, short of using -o and piping the output to wc -l. But then you wind up measuring the time it takes to not just count every match, but print every match too. You can see the difference with ripgrep, which admittedly isn’t too much, but it’s not nothing:

                            $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt -o -c
                            3830430
                            real    1.511
                            user    1.492
                            sys     0.017
                            maxmem  343 MB
                            faults  0
                            
                            $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt -o | wc -l
                            3830430
                            real    1.684
                            user    1.648
                            sys     0.033
                            maxmem  343 MB
                            faults  0
                            

                            I’m going to be surprised if can actually take advantage of anything to perform better than ripgrep :)

                            I welcome all competitors! Beating ripgrep in all or most cases will be tricky. If you limit yourself to a specific subset of regexes and allow yourself to use offline compilation, then I’m sure there are wins to be had. There are lots of interesting optimizations at play here though, and generally finite state machines are your last line of defense because they are slower than optimized vector routines. But, vector optimizations don’t really scale to large inputs. They only really work for small regexes.

              2. 2

                wow, thanks for this long answer. seems like i have quite a bit of reading to do! :) so only an limited answer for some points:

                An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions.

                I haven’t really thought about it this way (regular expressions as matched unit instead of words) until now. Maybe because most regexp examples / uses tend to match against a limited set of words, not infinite ones.

                For completeness, note that there is also Glushkov’s construction algorithm

                I didn’t know about Glushkov’s algorithm, interesting!

              3. 3

                BTW I put the original paper here, based on the long discussion below:

                http://www.oilshell.org/archive/Thompson-1968.pdf

                This got me interested and I pubilshed some benchmarks. Supposedly fgrep originally used Aho-Corasick. I’m not sure if GNU fgrep actually does now though! But regular grep is faster than fgrep on this problem, perhaps because it has better Boyer-Moore like “input skipping”. That’s my conjecture anyway.

                https://github.com/oilshell/blog-code/tree/master/fgrep-problem-benchmarks


                EDIT: Yeah I see this in grep.c in the GNU source. It seems like it doesn’t try to treat the fgrep problem specially? For example, by using Aho-Corasick. Not sure why fgrep is slower than grep then. It should be the same speed if the same engine is used.

                  /* In a unibyte locale, switch from fgrep to grep if                                                                                                                                                                                        
                     the pattern matches words (where grep is typically faster).                                                                                                                                                                              
                     In a multibyte locale, switch from fgrep to grep if either                                                                                                                                                                               
                     (1) case is ignored (where grep is typically faster), or                                                                                                                                                                                 
                     (2) the pattern has an encoding error (where fgrep might not work).  */                                                                                                                                                                  
                  if (compile == Fcompile                                                                                                                                                                                                                     
                      && (MB_CUR_MAX <= 1                                                                                                                                                                                                                     
                          ? match_words                                                                                                                                                                                                                       
                          : match_icase || contains_encoding_error (keys, keycc)))                                                                                                                                                                            
                    {                                                                                                                                                                                                                                         
                      size_t new_keycc;                                                                                                                                                                                                                       
                      char *new_keys;                                                                                                                                                                                                                         
                      fgrep_to_grep_pattern (keycc, keys, &new_keycc, &new_keys); 
                
                1. 1

                  Yeah, GNU fgrep and GNU grep should be the same speed on the same input, assuming both inputs are just normal literals. That’s because grep should be able to detect the case of a simple literal and run as if it were fgrep. For your benchmark, what input are you searching? I can try and re-run the benchmark for you and explain the discrepancy if one indeed exists.

                  Also, if you’re interested in more details on how grep is optimized, you might want to read my ripgrep article (or perhaps skim it, since it’s long, but don’t miss the parts about Teddy and memchr :-)).

                  1. 1

                    Thanks, that would be useful. You should be able to clone the repo and run it with the instructions at the top of fixed-strings.sh. Instead of all-benchmarks you can run grep-fixed-benchmark.

                    Basically this will just download a file, make it 10x bigger, and run grep on it.

                    The input is an alternation of literal strings, which I deem the “fgrep problem”. For me, grep is reliably faster, but I don’t understand why since I think they should be using the same engine either way. There is some attempt to convert fgrep to grep, but I guess I don’t understand under what conditions that happens.

                    But even if the conversion were failing, I would expect fgrep to do better than grep on the problem it’s designed for!!!

                    Actually this is my beef with Aho-Corasick too. I cannot think of an example where it’s going to be measurably faster or use less memory in practice than Thompson construction (at runtime, I don’t think you can measure compile time in a benchmark like this). Generally I think the difference between fgrep and grep comes down to engineering and not algorithms, which is why the results are inverted (conjecture).

                    There are 13 keywords, described in the README:

                    Matching this file:
                    -rw-rw-r-- 1 andy andy 338M Nov 10 11:09 _tmp/all-10.txt
                    
                    Against 13 keywords:
                       for while continue break if fi then else elif case esac do done
                    
                    1. 1

                      All right… So I played with your benchmark script. I hacked at it until it ran without error, and here’s my output:https://gist.github.com/BurntSushi/1203d5b56112a8c1bacff5b19273d956

                      There are definitely some issues with how you’re benchmarking this here. Namely, you can’t give regexes to fgrep, and the BRE syntax used by standard grep doesn’t support alternation. So in order to test with fgrep, you need to pass the alternation via a file using the -f flag (or, as you’ll see, use the -e flag). For testing standard grep, you need to use egrep. For GNU grep specifically, grep, fgrep and egrep are all just different front ends. They share the same backend essentially. What this means is that your benchmarks aren’t actually measuring the speed of alternations, but rather, a very long single literal search.

                      I’m going to diverge from your benchmark script and just type the actual commands we want to run. It will be easier this way to verify results. So I started with just a regex alternation. The first two are ripgrep, where the first searches with a memory map and the latter uses standard read calls (the latter is what GNU grep does):

                      $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt| wc -l
                      2969020
                      real    0.838
                      user    0.803
                      sys     0.033
                      maxmem  343 MB
                      faults  0
                      
                      $ time rg --no-mmap 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt| wc -l
                      2969020
                      
                      real    0.891
                      user    0.823
                      sys     0.067
                      maxmem  6 MB
                      faults  0
                      
                      $ time egrep 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt| wc -l
                      2969001
                      real    1.223
                      user    1.111
                      sys     0.110
                      maxmem  5 MB
                      faults  0
                      

                      For fgrep, we can’t use an alternation. Instead, we have to pass each pattern individually since fgrep doesn’t take regexes of any kind as input:

                      $ time fgrep -e for -e while -e continue -e break -e if -e fi -e then -e else -e elif -e case -e esac -e do -e done /tmp/oil.txt| wc -l
                      2969001
                      real    1.515
                      user    1.427
                      sys     0.087
                      maxmem  5 MB
                      faults  0
                      

                      Now this is interesting. There is a repeatable performance difference here between egrep and fgrep. However, this doesn’t appear to have anything to do with fgrep per se, but rather, the method of input. Specifically, if we give each pattern individually to egrep, then it has the same slow down (and similarly for grep):

                      $ time egrep -e for -e while -e continue -e break -e if -e fi -e then -e else -e elif -e case -e esac -e do -e done /tmp/oil.txt| wc -l
                      2969001
                      real    1.537
                      user    1.411
                      sys     0.123
                      maxmem  5 MB
                      faults  0
                      

                      One of my hypotheses for the difference is match overhead. In particular, you’ve chosen a query that has a lot of matches. This tends to be a less common case for tools like grep which can used interactively. Large results sets aren’t as useful because we humans can’t look through everything, so we tend to run searches that produce small result sets. As a result, cases with a large number of results might not receive as much optimization love (see, for example, the silver searcher). I tested my hypothesis by tweaking your query such that it doesn’t match anything:

                      $ time egrep 'faor|wbhile|ccontinue|bdreak|iXf|f#i|tghen|ehlse|eilif|cjase|eksac|d!o|dmone' /tmp/oil.txt
                      real    1.099
                      user    1.035
                      sys     0.063
                      maxmem  5 MB
                      faults  0
                      
                      $ time egrep -e faor -e wbhile -e ccontinue -e bdreak -e iXf -e 'f#i' -e tghen -e ehlse -e eilif -e cjase -e eksac -e 'd!o' -e dmone /tmp/oil.txt
                      real    1.482
                      user    1.433
                      sys     0.047
                      maxmem  5 MB
                      faults  0
                      

                      So my hypothesis is wrong. The -e variant is still slower, even though I believe they are semantically the same search. As a fun side note, ripgrep does really well here, and doesn’t suffer from the same (seeming) performance bug:

                      $ time rg 'faor|wbhile|ccontinue|bdreak|iXf|f#i|tghen|ehlse|eilif|cjase|eksac|d!o|dmone' /tmp/oil.txt
                      real    0.098
                      user    0.068
                      sys     0.029
                      maxmem  343 MB
                      faults  0
                      
                      $ time rg -e faor -e wbhile -e ccontinue -e bdreak -e iXf -e 'f#i' -e tghen -e ehlse -e eilif -e cjase -e eksac -e 'd!o' -e dmone /tmp/oil.txt
                      real    0.102
                      user    0.078
                      sys     0.024
                      maxmem  342 MB
                      faults  0
                      

                      The reason why ripgrep is fast, is that for both searches, it isn’t actually using Aho-Corasick at all! Instead, it’s using a SIMD algorithm called Teddy. Teddy is part of the regex engine and is described in detail here: https://github.com/rust-lang/regex/blob/master/src/literal/teddy_ssse3/imp.rs — Teddy doesn’t do too well when there are a lot of matches, but when there are very few matches, like in the latter search above, it absolutely screams.

                      Looking at grep’s source code (current master), this comment in GEAcompile sheds some light:

                        /* For GNU regex, pass the patterns separately to detect errors like
                           "[\nallo\n]\n", where the patterns are "[", "allo" and "]", and
                           this should be a syntax error.  The same for backref, where the
                           backref should be local to each pattern.  */
                      

                      ripgrep’s default regex engine doesn’t support backreferences, so it doesn’t have to deal with this. In particular, ripgrep will take -e pat1 -e pat2 ... -e patN and just translate that to (?:pat1)|(?:pat2)|...|(?:patN) such that it’s one giant regex. Since it doesn’t have to worry about backreferences, it can do this. At this point, I’m out of gas w.r.t. to investigation, and it’s not a stretch at this point to say that these two invocations go through different code paths. Since fgrep cannot invoke the pat1|pat2|...|patN code path, you can’t quite do a fair comparison unless you make egrep use the same code path as fgrep for multiple literals. And when you do that, they come out to be even.

                      But even if the conversion were failing, I would expect fgrep to do better than grep on the problem it’s designed for!!!

                      I don’t know the origins of fgrep, but in today’s world, it’s primarily a UX thing and not a performance thing. Namely, it’s nice to be able to search for literal strings like Foo(. But if that’s interpreted as a regex, it’s probably a syntax error (unless you’re using BREs). So fgrep is really nice for that.

                      In a suboptimal implementation, it’s possible that fgrep will be faster than the equivalent egrep command. If that is indeed the case, then that just means that the egrep implementation isn’t falling back to normal substring search when the pattern it has been given is just a literal. If egrep is just “parse and feed to the regex engine,” then there will be cases in which fgrep is faster. You need an explicit analysis phase that ducks out to normal substring search that bypasses the regex engine.

                      Actually this is my beef with Aho-Corasick too. I cannot think of an example where it’s going to be measurably faster or use less memory in practice than Thompson construction (at runtime, I don’t think you can measure compile time in a benchmark like this).

                      The problem with this statement is that “Thompson construction” isn’t really enough to go on. Are you referring to the NFA simulation of the Thompson construction? A lazy DFA of the Thompson construction? A eagerly compiled DFA of the Thompson construction? All of these things actually exist in practice.

                      Generally I think the difference between fgrep and grep comes down to engineering and not algorithms, which is why the results are inverted (conjecture).

                      I think this is too much of a simplification. As someone who has spent a lot of time building these types of tools, the real answer is very solidly “both.” :-)

                      1. 1

                        What this means is that your benchmarks aren’t actually measuring the speed of alternations, but rather, a very long single literal search.

                        That’s not true, as I just wrote, grep, fgrep, and ripgrep all return 2969020 results. BRE supports alternation with \|. You don’t need a flag to get alternation with fgrep either.

                        $ { echo foo; echo bar; }  | grep 'foo\|bar' 
                        foo
                        bar
                        
                        $ { echo foo; echo bar; }  | fgrep $'foo\nbar' 
                        foo
                        bar
                        
                        $ grep --version
                        grep (GNU grep) 2.25
                        Copyright (C) 2016 Free Software Foundation, Inc.
                        
                        1. 1

                          Cute. I was misled by this page then I guess. This page also indicates that BREs don’t have alternations: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html

                          I forgot that \n could be used to delimit patterns in fgrep. But it’s still taking the same code path AFAIK inside GNU grep when compared with -e foo -e bar. The code path is different from foo|bar.

                        2. 1

                          After reading this, I’m still confused by why fgrep is slower than egrep.

                          It should be either

                          • The exact same speed, because it’s just a front end for the same engine.
                          • Faster, because it’s solving a “simpler” problem and can use a faster algorithm. The whole thing that started this discussion was this:

                          An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions.

                          I still don’t buy this! The fixed string alternation is not really easier. (I agree that there might be a small difference in theory, but I don’t think it matters in practice.)

                          If you run your aho-corasick algorithm on this data set and it’s faster than the RE2 benchmark, that might convince me! (because RE2 is also counting all matches and not all lines, and it’s interpreted not compiled).

                          I think this is too much of a simplification. As someone who has spent a lot of time building these types of tools, the real answer is very solidly “both.” :-)

                          Maybe? But your post didn’t show that?

                          The point that the re2c and RE2 benchmarks aren’t measuring the same thing is well taken, but that’s separate from the grep/fgrep issue.

                          Are you referring to the NFA simulation of the Thompson construction? A lazy DFA of the Thompson construction? A eagerly compiled DFA of the Thompson construction? All of these things actually exist in practice.

                          The eagerly compiled DFA, because I don’t think that’s hard to create in the alternation of constant strings case.

                          I think you always get a DFA that’s going to run as fast as Aho-Corasick in that case. The NFA is very simple, and the DFA is very simple. I guess the test I should do is to run the simple powerset/subset construction on that case. I would be surprised if it “blows up” in this particular case (i.e. there are no repetitions, so the graph is very simple).

                          Every alternation should look the same from the engine’s point of view, e.g. foo|bar or spam|eggs|ham are the same problem in terms of creating an NFA, converting to DFA, and optimizing (if it’s necessary in that case).

                          So do you think Aho-Corasick is going to do anything on this benchmark? To a first approximation, it seems like a different way to build a DFA. As you point out, you don’t even use it for this case! Teddy is apparently better.

                          That’s essentially my point. I can’t think of any problems where Aho-Corasick will be better than an automata-based regex engine (“Thompson’s construction” in my terminology. They are all based on that.)

                          Can you think of any examples of such problems?

                          1. 1

                            After reading this, I’m still confused by why fgrep is slower than egrep.

                            Because that’s not the most precise statement we can make given the data. I carved the problem down to egrep 'pat1|pat2|...|patN' vs egrep -e pat1 -e pat2 ... -e patN in my previous comment, where the former is faster than the latter, and, crucially, the latter matches the speed of fgrep. So it’s not about “fgrep vs egrep,” but rather, a distinction in the code path. In other words, it looks like an unnecessary implementation quirk. But this is not a complete explanation. It looks like you’ll need to either ask someone who’s familiar with GNU grep automata construction, or dig into the code yourself. :-)

                            An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions.

                            I still don’t buy this! The fixed string alternation is not really easier. (I agree that there might be a small difference in theory, but I don’t think it matters in practice.)

                            If you run your aho-corasick algorithm on this data set and it’s faster than the re2c benchmark, that might convince me! (because re2c is also counting all matches and not all lines).

                            I think we are unfortunately approach diminishing returns in this discussion. We’re getting hung up on terminology. In my comment above, I was using “Thompson’s construction” as short hand for “the NFA simulation of Thompson’s construction,” which roughly matches the algorithm in Thompson’s original paper. That is going to be very obviously and very trivially slower than even the NFA fomulation of Aho-Corasick, and certainly much slower than the DFA formulation of Aho-Corasick.

                            The above paragraph is 100% about implementation details. It has nothing to do with the theory.

                            I think this is too much of a simplification. As someone who has spent a lot of time building these types of tools, the real answer is very solidly “both.” :-)

                            Maybe? But your post didn’t show that?

                            I’m not in a debate with you here, and I’m not trying to convince you of anything. I’m just trying to offer up my knowledge and experience. Do with it as you will.

                            I think if you read my blog post about ripgrep, it should pretty solidly convince you that both algorithms and engineering are required. ripgrep, to a first approximation, uses the same algorithm as GNU grep. That is, both of them fundamentally use a lazy DFA as their work horse. But to a second approximation, the algorithms surrounding the lazy DFA, and in particular, the analysis performed on the regex, are very different. e.g., GNU grep does not use frequency based substring search and it does not have a vectorized multi-pattern matcher. Those are algorithmic differences. They aren’t huge, and someone could add them to GNU grep if they wanted to. But they are still differences. One other algorithmic difference has to do with how large Unicode character classes are handled. ripgrep is much faster than GNU grep in that space, and I suspect it largely has to do with how GNU grep implements POSIX locale support. But I’m not certain.

                            The engineering aspect of these tools is also quite large, but not for any particularly interesting reasons. You just need to make sure you’re doing as little work as possible per line (or per byte) of input. Naive implementations (e.g., “read a file line by line and apply a regex to each line”) will invariably do much more work than is necessary. Rising above naive implementations requires engineering and smarter choices about how you find your matches.

                            Are you referring to the NFA simulation of the Thompson construction? A lazy DFA of the Thompson construction? A eagerly compiled DFA of the Thompson construction? All of these things actually exist in practice.

                            The eagerly compiled DFA, because I don’t think that’s hard to create in the alternation of constant strings case.

                            OK, but you’ll wind up needing to pick a cutoff point where you don’t eagerly compile the DFA because it will use up too much memory. You don’t need to pick this cutoff point, probably, in the context of re2c.

                            I think you always get a DFA that’s going to run as fast as Aho-Corasick in that case.

                            I think I agree. At least, I can’t see any specific reason for a performance difference.

                            What algorithm does rust/regex use for DFA minimization?

                            It doesn’t do DFA minimization. Neither does RE2. It’s too expensive. Neither RE2 nor rust/regex ever eagerly computes a DFA for general regexes. It’s always done lazily. rust/regex has an optimization pass that RE2 doesn’t have, where, if it detects an alternation of literals, it will use Aho-Corasick specifically because Aho-Corasick is eagerly built and is therefore faster than the lazy DFA. The heuristics used for this currently are not optimal, but the code is there and works in a lot of cases.

                            The NFA is very simple, and the DFA is very simple. I have never implemented that algorithm but I would be surprised if it “blows up” in this particular case.

                            I would be too, in the sense that “blow up” means “exponentially blow up.” Other than that, I think I already characterized the memory usage of an Aho-Corasick DFA. In a typical table based DFA, every state uses 1KB of memory. That’s a lot.

                            Every alternation should look the same from the engine’s point of view, e.g. foo|bar or spam|eggs|ham are the same problem in terms of creating an NFA, converting to DFA, and optimizing (if it’s necessary in that case).

                            So do you think Aho-Corasick is going to do anything on this benchmark? To a first approximation, it seems like a different way to build a DFA. As you point out, you don’t even use it for this case! Teddy is apparently better.

                            That’s essentially my point. I can’t think of any problems where Aho-Corasick will be better than an automata-based regex engine (“Thompson’s construction” in my terminology. They are all based on that.)

                            I wouldn’t suspect so either. But if my regex engine never bothers with the eager DFA construction or minimization after applying the Thompson construction, then it makes sense to just use Aho-Corasick when possible. In order for me to be convinced to do otherwise, I think I’d need to be convinced that building and minimizing a DFA by starting with Thompson’s construction won’t be noticeably slower than using the normal Aho-Corasick construction algorithm, and then converting that into a table based DFA directly. Intuitively speaking, I would be surprised if both construction algorithms ran at similar speeds. In your use cases, this distinction is probably wildly uninteresting, since this is all paid for when you compile your program, and therefore you have a very high tolerance. In that case, you probably don’t care about Aho-Corasick, and for implementation simplicity, you’d just stick with Thompson’s construction -> DFA -> minimization. That’s not true in my case.

                            1. 1

                              FWIW I constructed a benchmark to match 1000 - 6000 strings “in parallel” with re2c, RE2, and ripgrep:

                              https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/fixed-strings.sh#L328

                              ripgrep (and grep) are faster than re2c for n = 100 just like they are for n = 13 (which was surprising to me since they’re not compilers).

                              But in the range of 1000 - 6000 strings, re2c match time is essentially constant (< 3 seconds), while ripgrep increases linearly to 13+ seconds. RE2 also increases linearly but not as much.

                              The re2c compile time is linear as well, and the generated code size is linear, with a small constant factor. For 6000 strings, it only uses 213 KiB of code (representing the DFA).

                              I’m not pointing this out because I think it’s important for ripgrep to handle. GNU grep blows up somewhere around 400 strings and I gave up running it.

                              But it is interesting in the sense that this is precisely the Aho-Corasick problem – an alternation of constant strings. (modulo the overlapping matches issue)

                              My point was that “Thompson’s construction” can handle the Aho-Corasick problem, and do very well at it. It’s possible that I’m underestimating what re2c is doing after that, but considering that the compile time is low, I don’t think anything very advanced is going on.

                              If I have time in the future I’ll explore that further, although actually I’m more interested in derivatives, so I may never get to it. If I do experiment with derivatives, this would be a great test case.


                              In theory, I would expect DFAs to be able to match in constant time with respect to the number of strings, not linear time. But re2c is the only implementation that gets close to that!

                              I would have thought RE2 would too, so I suppose it’s harder than I thought. RE2 also doesn’t want to use more than 8 MB of memory for the DFA by default, although it’s easy to bump up.

                              I could say more about this, but I’ll probably blog about this in the future instead… I’ve been meaning to write a post/demo about re2c for a long time. If you have any comments on this feel free to mail me at andy@oilshell.org instead of this deep lobste.rs thread!

                              1. 1

                                Aye. Just some small quick notes:

                                • In RE2’s case, it will never use Aho-Corasick and it will never use an eagerly compiled DFA.
                                • In ripgrep’s case, it will use Aho-Corasick for very small alternations, but falls back to the same lazy DFA approach that RE2 uses beyond that. This is a failing of the current implementation. You can tweak the DFA size limit with the --dfa-size-limit flag.
                                • In both of the aforementioned cases, once the DFA exceeds the internal limit, it starts thrashing. If it thrashes enough, it falls back to the NFA simulation of Thompson’s construction.
                                • “Thompson’s construction” can only handle the Aho-Corasick problem well if you eagerly compile the DFA. 6,000 strings might not be enough to really test the differences between the traditional Aho-Corasick compilation algorithm and Thompson’s construction -> DFA -> minimization. /usr/share/dict/words for instance is over 100,000 entries on my system.

                                Just as a quick example, here is the time difference in compilation between NFA Aho-Corasick (traditional) and DFA Aho-Corasick (“advanced”) using an example program from my Aho-Corasick library:

                                $ cargo build --release --example dict-search
                                
                                $ time ./target/release/examples/dict-search -d /usr/share/dict/words /dev/null
                                pattern,start,end
                                
                                real    0.109
                                user    0.089
                                sys     0.020
                                maxmem  38 MB
                                faults  0
                                
                                $ time ./target/release/examples/dict-search --full -d /usr/share/dict/words /dev/null
                                pattern,start,end
                                
                                real    1.018
                                user    0.903
                                sys     0.113
                                maxmem  321 MB
                                faults  0
                                

                                Note also the difference in peak memory usage. :-)

                          2. 1

                            I think we might be getting hung up on “Thompson construction” terminology [1], so let me put it another way:

                            I don’t think Aho-Corasick is currently used to speed up string search problems.

                            GNU grep doesn’t appear to use it, and ripgrep doesn’t use it either.

                            The automata-based regex techniques seemed to have “subsumed it” (at least for any problems I can think of).

                            In your article I see this, which seems to make sense, since you have to go beyond automata to get further speedup!

                            Handling the case of multiple literals (e.g., foo|bar) is just as important. GNU grep uses a little known algorithm similar to Commentz-Walter for searching multiple patterns. In short, Commentz-Walter is what you get when you merge Aho-Corasick with Boyer-Moore: a skip table with a reverse automaton.

                            (However I’m not seeing the results of this in practice either! But that is probably because I haven’t looked hard enough.)

                            [1] Which I think of from a user’s perspective as “automata-based regex engine”, but you’re thinking of from an implementer’s perspective.

                            1. 1

                              GNU grep doesn’t appear to use [Aho-Corasick], and ripgrep doesn’t use it either.

                              Both do, absolutely. For GNU grep, see: http://git.savannah.gnu.org/cgit/grep.git/tree/src/kwset.c

                              For ripgrep, given that I wrote every line of string searching code in it, I can tell you that it will use Aho-Corasick. Teddy is an optimization that, when applicable, will best Aho-Corasick. But it is not always applicable. For example, it doesn’t work well when you have a large number of literals. In those cases, ripgrep falls back to Aho-Corasick.

                              With respect to Commentz-Walter, I do not believe GNU grep uses it any more. I think it did at one time.

                              If GNU grep and/or ripgrep compiled regexes eagerly to a DFA and then minimized them, then there might not be any difference between using that machine to search and Aho-Corasick to search, other than, conjecturally, the time it takes to build the machine. But since both use lazy DFAs built directly from Thompson’s NFA construction at match time, there is no infrastructure for eagerly building out and minimizing the DFA. So we both just use Aho-Corasick instead.

                          3. 1

                            In my previous comment, one other interesting thing stood out to me is that GNU grep’s match count is slightly lower than ripgrep. It looks like this is because GNU grep is, for some reason, thinking that some of the matched lines contain invalid UTF-8. This is easier to see on a smaller example:

                            $ time egrep 's a very, very long f' oil.txt | wc -l
                            51
                            
                            $ time rg 's a very, very long f' oil.txt | wc -l
                            60
                            

                            Specifically, if you look at the output of egrep, it actually stops printing results and says Binary file oil.txt matches. In my case, it looks like GNU grep thinks its a binary file because it contains invalid UTF-8 in places and my locale is set to en_US.UTF-8. I can work around this by either setting my locale to C (via LC_ALL=C), or by enabling --text.

                            This doesn’t seem to impact performance though.

                            1. 2

                              Yes, LC_ALL=C is essential, otherwise on my machine GNU grep stops after 10% of the file!!!

                              I spent awhile figuring that out! Honestly I think it is a bug I should file, but I haven’t looked into it further. The test file is a concatenation of the same file 10 times, and a specific character sequence in the last part of the file makes grep stop!

                              See ESSENTIAL BUG FIX comment here.

                              https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/fixed-strings.sh

                              I just re-ran the verification and grep, fgrep, and ripgrep are all returning 2969020 results, with the same timings as reported in the README.

                              1. 1

                                I don’t think it’s a bug. It’s intended behavior.

                                My output from the benchmark script is very different: https://gist.github.com/BurntSushi/1203d5b56112a8c1bacff5b19273d956

                                1. 1

                                  I dunno, why would you ever want grep to stop in the middle of a file, no matter what other text is there? Suppose I have:

                                  grep foo

                                  on a file that looks like this:

                                  foo
                                  <unintelligible poorly encoded garbage>
                                  foo
                                  

                                  Shouldn’t it find both foo’s ? Or I guess it’s saying it doesn’t know that the second foo is even a foo after the encoding errors. That is plausible (for non-UTF8 encodings, since UTF-8 is designed to be able to resynchronize). But at least it should print an error rather than stopping silently!

                                  1. 1

                                    It does print an error. Or at least, it should. It says “Binary file foo matches”:

                                      if (!out_quiet && (encoding_error_output
                                                         || (0 <= nlines_first_null && nlines_first_null < nlines)))
                                        {
                                          printf_errno (_("Binary file %s matches\n"), input_filename ());
                                          if (line_buffered)
                                            fflush_errno ();
                                        }
                                    

                                    I did some digging and found this: http://lists.gnu.org/archive/html/bug-grep/2016-02/msg00037.html

                      2. 3

                        Complementary. Here’s the “original paper” you’re looking for: String Matching: An Aid to Bibliographic Search. It learns from Thompson’s method and improves its performance and utility in exchange for some features (no backtracking).