An important notice, since it is buried at the end: xor filter is immutable. You can’t add or delete items once it is constructed. (You can’t delete in Bloom filter either, but you can add. Counting bloom filter supports both, but with worse overhead.)
In theory you create Bloom filter that would allow for removal of the entries (if you are sure that the entry was previously added). Just instead bits as flags you store count of elements that has that flag. Unfortunately that enormously increase the size of the filter.
If you want a filter which supports both adding and substraction with log(n) space, take a look at this new datastructure: Utreexo https://eprint.iacr.org/2019/611.pdf
This is what you want: https://arxiv.org/pdf/1912.08258.pdf
Even if “what you wanted” was a closer link to published paper, I think it’s nicer and more likely closer to that aim to link to the main Arxiv page for it, rather than to the PDF directly, dogg.
Also, the post in the OP is by one of the paper’s authors, so it’s not like it wasn’t a primary source and worthy of checking out.
I apologize without reservation, dogg – I should have checked the Arxiv posting guidelines in the Lobsters FAQ beforehand.
Thanks for this!
I was wondering if Xor Filters got included in Chapter 3 (membership) of Probabilistic Data Structures and Algorithms for Big Data Applications but I don’t see it listed in the TOC.
Pretty excited work here; the immutable nature of this is not really a problem and the author covers this pretty well and for the intended use cases.
What is a problem is having to hold your unique set in RAM for the construction of the filter. The problem space I am thinking of is datasets are awkward to fit into RAM, especially so when it is single use and immediately discarded; it pushes out data that I do want to use.
Fortunately I think a cheat is available as your construction success rate is >95% you can just stream the construction into three or more separate bit arrays each using different hash seeds, and then use one of any that succeed…if it fails, rerun and it uses a different sets of random seeds.
This neatly sidesteps all that awkwardness of building and retaining a set of unique keys.
This works (for me) as it is easy to arrange for my data source to re-spam my constructor if it fails as I would treat this just as if there was a network/server glitch forcing me to retry anyway.