Surprised they don’t mention the singular value decomposition at all for dimensionality reduction. A friend also pointed me to the Johnson-Lindenstrauss Theorem, which “shows that a set of n points in high dimensional Euclidean space can be mapped into an O(log n/e^2)-dimensional Euclidean space such that the distance between any two points changes by only a factor of (1 +- e)”.

Maybe it’s because their solution needs to run online? But they don’t mention that in their blog post.

EDIT: Sorry, didn’t want to come across as totally negative. I definitely think the solution they ended up going with is clever. I just think it would be more interesting if they’d talked about more prior art.

My understanding of SVD is that for it to make sense the dimensions need to be corelated in some way. This is not the case for our workload - for all intends and purposes the dimension are pretty random.

Johnson-Lindenstrauss Theorem - amazing. I’m not sure what implications are, would need to think about it. Thanks for the link!

Thanks for the very interesting post! I am not very familiar with image search, but I wonder if (Optimized) Product Quantization would help here. In word embeddings, we have found that OPQ can compress vectors an order of magnitude with barely any l2 loss. Another benefit is that distances between vectors can be computed by using a precomputed table of pairwise distances of subquantizer centroids.

This post is less about image search and more about just “nearest neighbor” in 144 dimensions. Or to be more precise - proving that nearest neighbor within given threshold doesn’t exist :)

We tried various centroids, but the data is quite random so no obvious clusters emerge.

This post is less about image search and more about just “nearest neighbor” in 144 dimensions.

Which is what PQ-based NN search will give you, except that you can do the computations in the lower-dimensional quantized space.

We tried various centroids, but the data is quite random so no obvious clusters emerge.

I am not sure what you mean here. The vectors must have some underlying structure, otherwise it wouldn’t be useful to compute distances between vectors. As you mention in the blog post, the problem is more that with such large dimensionalities, you run into the curse of dimensionality and all data points and points tend to equidistant, which is problematic for k-means.

However, PQ uses k-means clustering in low-dimensional spaces (for each subquantizer).

I found this problem interesting and took a shot at simplifying the math. It turns out you can view this problem as an argmax after a matrix multiplication. The result is pretty easy to implement in floating point math with a library. A V100 could do 1500 queries in 0.070 seconds

I’d imagine (after fiddling with the precision and overflows) it’d be a nice way of rephrasing the problem for optimization on a CPU.

I don’t know anything about the problem domain so maybe this is a silly question, but is the Euclidean metric especially meaningful here? Since the space is finite-dimensional all norms on it are equivalent and at a glance the l_1 or l_\infty norms look like they’re easier to compute.

I wonder if the constant needed to “convert” to an L1/Linf norm would cause overflow in the uint16 implementation. I don’t think the engineer in this post had access to the underlying hash algorithm (which might be able to absorb that change).

The problem here is finding the Nearest Neighbor in the sense of the Euclidean Metric.
Once you calculated the distance from each one to any of the others you already lost.

You can do much cheaper calculations to reduce the number of elements to check.

For instance you can set a lower bound to the distance between 2 vectors by their mean and variance.
Using that you may create only a sub set of the data which requires the exact distance.

I’m surprised that similar inputs produce similar hashes. Seems like that would make it easier to go in the reverse direction, which I thought defeats the purpose of a hash.

Not all hashes are cryptographic hashes or checksums, which aim to vary greatly with every bit of input. These hashes are fuzzy image matching hashes, designed to be as similar as possible for similar images. These hashes can then be used for detecting illicit material with a certain degree of certainty (the Euclidean distance threshold chosen), even if they’ve been edited to attempt circumventing detection.

Imagine using a cryptographic hash to fingerprint illicit images. An adversary need only change 1 pixel and they’ve circumvented your detection.

Since this is all about fuzzy matching, why wouldn’t a simple subtraction cut it? Each of dimensions has 256 possible values, just computing the sum of differences should give a good approximation of “close”. (This is a genuine question, as I think I might have misunderstood something)

Does anyone know what they are (might be) using for the hash function? I’ve worked on similar locality-sensitive-hash problems before and find the properties of such hashes to be pretty interesting

How does it work? While there has been lots of work on fuzzy hashing published, the innards of the process are intentionally a bit of a mystery. The New York Times recently wrote a story that was probably the most public discussion of how such technology works. The challenge was if criminal producers and distributors of CSAM knew exactly how such tools worked then they might be able to craft how they alter their images in order to beat it. To be clear, Cloudflare will be running the CSAM Screening Tool on behalf of the website operator from within our secure points of presence. We will not be distributing the software directly to users. We will remain vigilant for potential attempted abuse of the platform, and will take prompt action as necessary.

Which is unfortunate because that is much more interesting than the contents of this article.

Thanks for the link! Seems like a reasonable call on their part.

Thinking on this a bit more, image fingerprinting seems like really interesting problem; the problems I’ve thought about before all deal with byte streams, but with an image the hash has to be like, perceptual? Not sure what the right word is, but seems really interesting. I’ve probably found a weekend tangent haha

The problem here is finding the Nearest Neighbor in the sense of the Euclidean Metric.
Once you calculated the distance from each one to any of the others you already lost.

You can do much cheaper calculations to reduce the number of elements to check.

Surprised they don’t mention the singular value decomposition at all for dimensionality reduction. A friend also pointed me to the Johnson-Lindenstrauss Theorem, which “shows that a set of n points in high dimensional Euclidean space can be mapped into an O(log n/e^2)-dimensional Euclidean space such that the distance between any two points changes by only a factor of (1 +- e)”.

Maybe it’s because their solution needs to run online? But they don’t mention that in their blog post.

EDIT: Sorry, didn’t want to come across as totally negative. I definitely think the solution they ended up going with is clever. I just think it would be more interesting if they’d talked about more prior art.

My understanding of SVD is that for it to make sense the dimensions need to be corelated in some way. This is not the case for our workload - for all intends and purposes the dimension are pretty random.

Johnson-Lindenstrauss Theorem - amazing. I’m not sure what implications are, would need to think about it. Thanks for the link!

Thanks for the very interesting post! I am not very familiar with image search, but I wonder if (Optimized) Product Quantization would help here. In word embeddings, we have found that OPQ can compress vectors an order of magnitude with barely any l2 loss. Another benefit is that distances between vectors can be computed by using a precomputed table of pairwise distances of subquantizer centroids.

This post is less about image search and more about just “nearest neighbor” in 144 dimensions. Or to be more precise - proving that nearest neighbor within given threshold doesn’t exist :)

We tried various centroids, but the data is quite random so no obvious clusters emerge.

Which is what PQ-based NN search will give you, except that you can do the computations in the lower-dimensional quantized space.

I am not sure what you mean here. The vectors must have some underlying structure, otherwise it wouldn’t be useful to compute distances between vectors. As you mention in the blog post, the problem is more that with such large dimensionalities, you run into the curse of dimensionality and all data points and points tend to equidistant, which is problematic for k-means.

However, PQ uses k-means clustering in low-dimensional spaces (for each subquantizer).

I found this problem interesting and took a shot at simplifying the math. It turns out you can view this problem as an argmax after a matrix multiplication. The result is pretty easy to implement in floating point math with a library. A V100 could do 1500 queries in 0.070 seconds

I’d imagine (after fiddling with the precision and overflows) it’d be a nice way of rephrasing the problem for optimization on a CPU.

writeup linked on this submission https://lobste.rs/s/pti6im/optimizing_euclidean_distance_queries

here’s a direct link: https://jott.live/markdown/euclidean_distance

I spent

waaaaytoo long wondering how the maximum distance between points could be zero.I don’t know anything about the problem domain so maybe this is a silly question, but is the Euclidean metric especially meaningful here? Since the space is finite-dimensional all norms on it are equivalent and at a glance the l_1 or l_\infty norms look like they’re easier to compute.

I wonder if the constant needed to “convert” to an L1/Linf norm would cause overflow in the uint16 implementation. I don’t think the engineer in this post had access to the underlying hash algorithm (which might be able to absorb that change).

Regardless this is a good idea

I wonder if performance could be improved significantly by moving to a higher dimensional binary space. Then you just need to calculate the Hamming distance using XOR and POPCOUNT. It sounds like AVX2 doesn’t have a POPCOUNT, but you can implement one using a lookup table: https://stackoverflow.com/questions/50081465/counting-1-bits-population-count-on-large-data-using-avx-512-or-avx-2

Or use the AVX-512 POPCOUNT.

The problem here is finding the Nearest Neighbor in the sense of the Euclidean Metric. Once you calculated the distance from each one to any of the others you already lost.

You can do much cheaper calculations to reduce the number of elements to check.

Like what? They tested and explained several strategies that didn’t work.

For instance you can set a lower bound to the distance between 2 vectors by their mean and variance. Using that you may create only a sub set of the data which requires the exact distance.

I’m surprised that similar inputs produce similar hashes. Seems like that would make it easier to go in the reverse direction, which I thought defeats the purpose of a hash.

Not all hashes are cryptographic hashes or checksums, which aim to vary greatly with every bit of input. These hashes are fuzzy image matching hashes, designed to be as similar as possible for similar images. These hashes can then be used for detecting illicit material with a certain degree of certainty (the Euclidean distance threshold chosen), even if they’ve been edited to attempt circumventing detection.

Imagine using a cryptographic hash to fingerprint illicit images. An adversary need only change 1 pixel and they’ve circumvented your detection.

Cloudflare compares these fuzzy hashes against hash databases of child pornography.

Since this is all about fuzzy matching, why wouldn’t a simple subtraction cut it? Each of dimensions has 256 possible values, just computing the sum of differences should give a good approximation of “close”. (This is a genuine question, as I think I might have misunderstood something)

Does anyone know what they are (might be) using for the hash function? I’ve worked on similar locality-sensitive-hash problems before and find the properties of such hashes to be pretty interesting

According to their blog, they’re not telling

Which is unfortunate because that is much more interesting than the contents of this article.

Thanks for the link! Seems like a reasonable call on their part.

Thinking on this a bit more, image fingerprinting seems like really interesting problem; the problems I’ve thought about before all deal with byte streams, but with an image the hash has to be like, perceptual? Not sure what the right word is, but seems really interesting. I’ve probably found a weekend tangent haha

This comparison of perceptual hashing is a good intro to the topic: https://tech.okcupid.com/evaluating-perceptual-image-hashes-okcupid/

The problem here is finding the Nearest Neighbor in the sense of the Euclidean Metric. Once you calculated the distance from each one to any of the others you already lost.

You can do much cheaper calculations to reduce the number of elements to check.