This is pretty clever! But is it really a Bloom filter?
To me, the special sauce of a Bloom filter is the use of multiple hash functions, and finding a match only when the bits corresponding to each hash are true.
The string matcher is the degenerate case with a single hash, and a degenerate hash that’s simply c & 0x3F. But again, not to disparage it, it’s a cool trick!
c & 0x3F
There is only one hash. That they use just the lower bits does allow for false positives and indexes the bits in a Bloomy way, but I would call it more “Bloom-inspired” or maybe “degenerate order-1 Bloom” or something to highlight the 1-hash aspect.
That same CPython source itself uses scare quotes around “Bloom filters” and TFA mentions them being different, too. The usual performance problem of Bloom filters relative to competitors is all these random probes to memory (higher level caches or DIMMs or whatnot) which this idea does not have.
Also, I am unsure why they would both have ascii_line_break and put the same ASCII characters into this “Bloomy” filter for splitlines except that maybe they re-use _PyUnicode_IsLinebreak somewhere (which I do not see). Maybe they thought _PyUnicode_IsNonAscBreak was too long a name or something.
Finally, on a similar note, for just the line break part where there are literally only 3 things, I am not sure 3 failed cmp are even slower than their bit-vector comparison prefix thing on a modern CPU when the failures are all very well predicted. gcc will often do sequential comparisons for so few switch targets, but its perfect hash alternative is also more expensive than just a binary and.
You could see it as a bloom filter where the hash functions are “if the last 5 bits equal this, the output is this, 0 otherwise” and are implemented very efficiently. These hashes aren’t very good, the number used is fairly low, but if you don’t need to match that many different characters and your main objective is to be very quick, it’s an ok bloom filter.
Thank you for reading.
I agree, if they didn’t call it a bloom filter in the names and comments I would not have thought of it as a bloom filter. But it makes things relatable and makes the logic more obvious.