What, other than concurrency, makes ripgrep so fast?
The short answer is that much of Mike Haertel’s advice in this post is still good. The bits about literal scanning, avoiding searching line-by-line, and paying attention to your input handling are on the money.
But the stuff about Boyer-Moore is outdated. The fact that it skips some of the input sounds great until you realize that most searches contain fairly short literals, and thus, the opportunity for skipping is not great. In those cases, the reason why it is still fast is because of BM’s “skip” loop, which involves feeding the final byte of your needle to memchr, and only looking for a match whenever memchr matches.
Memchr is very fast because it will use AVX2 vector instructions on a modern machine. Glibc’s unrolled inner loop will process 128 bytes at a time (I think, ripgrep’s memchr implementation definitely will).
So the key observation here is that the more time you spend in memchr, the faster your search will be. So in ripgrep’s case, we don’t even bother with BM most of the time and instead try to make more efficient use of memchr. Instead of limiting ourselves to the last byte, we instead guess which byte in the needle will match least frequently and use that. It’s a heuristic, but a smarter one than the one used by BM. ripgrep will edge out grep on searches for short literals in part because of this optimization. (On Linux, searching a single file using a memory map is also faster than searching via traditional buffered reading.)
But there’s other stuff. Ripgrep has a much faster vectorised multi pattern matcher for a small number of literals. Works great for case insensitive searches. Its line counting is also vectorised, where as GNU grep’s is not, which can actually slow it down quite a bit.
And there are other things such as building automata that are more friendly to Unicode. This is in part because ripgrep doesn’t need to deal with POSIX locales, but GNU grep could in theory specialize UTF-8 handling since it’s so common.
But yeah, the full answer is in my blog post.
Thank you for your detailed explanation and for a tool that makes my life that much easier. I appreciate it!
The author (burntsushi) has written a blog post about just that :)
Very interesting read, thank you for posting! I especially liked the insight that a fast grep needs to bundle its own regex engine because POSIX regex functions operate on NUL-terminated strings. I suppose if you have a regex library that can handle streamed input that should be possible to re-use for grep.
In this message, see that line that contains a word that starts with =21? What is that?
Quoted-printable. In this particular case, it decodes to !@#%$!@#%
Thank you! I’m unsure why I didn’t know this; I’ve seen this encoding in emails forever. I wasn’t aware that it was any more complicated than “equal sign plus two hex digits”; I wasn’t aware of the end-of-line rules.