1. 13
  1. 10

    I’ve used Aho-Corasick a lot. Not just on its own, but also as a tool to optimize special cases in a regex engine. Here are some various tidbits:

    • Aho-Corasick can work fairly nicely in my experience up to ~low millions of entries. Somewhere around this point, constructing the automaton can get quite slow.
    • Aho-Corasick is very nice when you want solid predictable performance.
    • Aho-Corasick won’t beat specialized algorithms for searching a smaller number of patterns, especially some of the new packed substring algorithms that use vector instructions.
    • If you can prove that every match of an AC automaton starts with at most a few different distinct bytes (e.g., Inspector Lestrade|Sherlock Holmes|Irene Adler has two distinct bytes), then you can use optimized routines in a skip loop to search for one of those bytes when your automaton is in a start state. For example, if there is only one beginning byte, you can use memchr. (You can extrapolate the implementation of memchr to memchr2 and memchr3 variants.) One must be judicious with such optimizations; feeding a very common byte to a skip loop can destroy performance. This negates the “predictable” performance aspect that you typically get with Aho-Corasick.
    • Aho-Corasick can be used to find overlapping or non-overlapping matches.
    • Aho-Corasick is fairly easy to adapt to report matches over streams such that your haystack needn’t be resident in memory at once. (“easy” in this context means “you don’t need to hand roll a special buffer that deals with edge cases.”)
    • Commentz-Walter is an interesting concoction that combines the ideas of Aho-Corasick and Boyer-Moore. Namely, it constructs a backwards automaton that matches patterns in reverse while using the same type of skipping logic that Boyer-Moore does. Unfortunately, Commentz-Walter doesn’t scale as well as Aho-Corasick as you add patterns, because the more patterns you have, the less likely the skipping logic is going to buy you anything. However, for a smallish number of patterns, Commentz-Walter can beat Aho-Corasick. This may be why GNU grep used to use Commentz-Walter.
    • In the literature, there is a variant of Aho-Corasick called “Advanced” Aho-Corasick. This variant compiles away all of the “failure” transitions found in the original algorithm down to a table based DFA. This makes the number of operations performed during search per byte constant instead of proportional to patterns themselves. This can result in a nice boost in performance, but of course, will cost quite a bit more memory. (There are tricks to reduce memory size—typically generic to any table based DFA—but they also come with costs.)
    1. 6

      I found it a little hard to understand the aims and tradeoffs of this algorithm from wikipedia. The original paper is here: https://cr.yp.to/bib/1975/aho.pdf

      This algorithm lets you search for all occurrences of many short patterns in a long text.

      One application might be syntax hilighting keywords in a text editor.

      It works by building a finite state machine out of a list of patterns. This is similar to the common lexing technique we use of making a vector of regular expressions and running them all “in parallel”.

      1. 6

        The historic articles on string matching and dynamic programming are a real treasure. They covered ground-breaking algorithms in just a few pages, didn’t hold anything back, and the work could be replicated without needing to refer to earlier articles. The pseudo-code is so clean and efficient that when re-implementing these algorithms from the original paper, it is possible (IMO) to get back into the exact thought patterns of the authors. Almost like they preserved a part of themselves in pseudo-code.

      2. 1

        (Flagged) Posting wikipedia articles doesn’t seem appropriate for lobsters.

        1. 1

          My gut instinct as someone that has lurked here for a while but am only new to commenting recently is that one of the worst parts of proggit is people posting plain links to wikipedia articles or software libraries with no context.

          But I’m glad this was posted because burntsushi’s comment is highly valuable.