1.

Does anyone have good rules of thumb/numbers you should know for bloom filters? I’ve never actually had to pick the values, and I don’t know what common sizes people use or shoot for, or what tradeoffs people pick (especially around things like latency for the number of hashes, disk I/O if it’s not in memory, etc.). I’m thinking of numbers like the ones at https://hur.st/bloomfilter/

  1.

    Here’s a rule of thumb I use all the time: to get a false positive probability (FPP) of about 2%, your bloom filter needs 1 byte per set element. (1% FPP requires about 10 bits/element, but I find the 1-byte rule more useful.)
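
    The standard sizing formulas behind numbers like these are easy to compute yourself: m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions. Here’s a quick Python sketch (the function name is mine, just for illustration) that reproduces the ~8 bits/element for ~2% figure:

        import math

        def bloom_parameters(n_items, target_fpp):
            # Optimal size and hash count for a classic bloom filter:
            #   m = -n * ln(p) / (ln 2)^2,   k = (m / n) * ln 2
            m_bits = math.ceil(-n_items * math.log(target_fpp) / (math.log(2) ** 2))
            k_hashes = max(1, round((m_bits / n_items) * math.log(2)))
            return m_bits, k_hashes

        m, k = bloom_parameters(1_000_000, 0.02)
        print(m / 1_000_000, k)   # ~8.14 bits per element, 6 hash functions

    Plugging in 10 bits/element the same way gives roughly 0.8% FPP with k = 7, which is where the usual 1%/10-bit figure comes from.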

    Other parameters, like the number of hash functions, are already determined (for optimal configurations) by your choice of FPP. Most bloom filters in practice are “blocked” (i.e. values are first hashed to a block, within which all the hash functions are evaluated) for cache or I/O efficiency. In that case you need to pick a block size as well, which is usually determined by hardware requirements, e.g. cache line or page or disk sector size. See also “partitioning” (which, unlike blocking, dedicates a separate partition to each hash function): https://arxiv.org/pdf/2009.11789.pdf.
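
    To make the blocking idea concrete, here’s a toy Python sketch (the class name, SHA-256 standing in for a real fast hash, and the 512-bit block size are my choices for illustration): each key is hashed once to pick a cache-line-sized block, and all k bits for that key live inside that block, so a lookup touches one cache line instead of k.

        import hashlib

        class BlockedBloomFilter:
            BLOCK_BITS = 512                       # 64-byte blocks, i.e. one cache line each

            def __init__(self, num_blocks, k=6):
                self.num_blocks = num_blocks
                self.k = k
                self.bits = bytearray(num_blocks * self.BLOCK_BITS // 8)

            def _positions(self, key):
                # One digest yields the block index plus k bit offsets within that block.
                d = hashlib.sha256(key).digest()
                block = int.from_bytes(d[:8], "little") % self.num_blocks
                base = block * self.BLOCK_BITS
                return [base + (int.from_bytes(d[8 + 2 * i:10 + 2 * i], "little") % self.BLOCK_BITS)
                        for i in range(self.k)]

            def add(self, key):
                for p in self._positions(key):
                    self.bits[p // 8] |= 1 << (p % 8)

            def __contains__(self, key):
                return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

        bf = BlockedBloomFilter(num_blocks=16_384)    # 1 MiB: room for ~1M keys at ~8 bits each
        bf.add(b"hello")
        print(b"hello" in bf, b"world" in bf)         # True, (almost certainly) False

    The trade-off is a slightly higher FPP than an unblocked filter of the same size, because some blocks end up more loaded than others.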

    1.

      Back when I was in academia I used to refer to this really good, short explanation of bloom filters: https://llimllib.github.io/bloomfilter-tutorial/

      The article does have an “Advanced Topics” section covering the number of hash functions and the size to use for a bloom filter. I hope you find it to be somewhat useful.