
I would have added another addendum. Usually, the probability is small (if it isn’t small, its exact value doesn’t matter – collisions are too likely anyway). In that case, the log-odds of the probability is well approximated by log(p), so:

log(k^2 / 2N) = 2log(k) - log(N) - 1

This is the “strength of belief” (or, since it is negative, “strength of disbelief” or “confidence that it won’t happen”, measured in bits).

For example, if you use SHA-256 (log(N)=256) and you have about a trillion docs (log(k) = 40), then the confidence that there won’t be a collision is 257 - 80 = 177 bits. On the other hand, if you’re worried about colliding UUID v4 (that’s not a hash, but the math doesn’t care), you have 122 random bits, so the confidence is 123 - 80 = 43 bits. Still pretty good!
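The two examples above can be checked with a quick script (a sketch – the function name is mine, not from the parent comment):

```python
def collision_confidence_bits(log2_N, log2_k):
    # "Confidence there won't be a collision", in bits:
    # -log2(p) where p ~= k^2 / (2N), i.e. log2(N) + 1 - 2*log2(k).
    return log2_N + 1 - 2 * log2_k

# SHA-256 with 2^40 documents:
print(collision_confidence_bits(256, 40))  # 177
# UUID v4 (122 random bits) with 2^40 identifiers:
print(collision_confidence_bits(122, 40))  # 43
```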

(Credit for thinking about confidence as log odds goes to ET Jaynes “Probability Theory: The Logic of Science” where he suggested decibels – but since most people here are software engineers rather than sound engineers, bits are more intuitive units than decibels)

Oooh. 2log(k)-log(N)-1 perfectly explains the birthday problem bound too. Thanks

How well does the uniform distribution assumption hold in real hashing functions?

It varies hugely. For cryptographic hash functions, it either holds, or you’ve found a weakness in the hash function. For other hash functions, it generally doesn’t hold, but it often does on non-malicious input. There have been quite a few denial-of-service vulnerabilities as a result of this. If I can craft data that you will put into a hash table such that it always produces a hash collision, then I can make you keep growing your hash table, or make you fall back to secondary chaining with a single chain (turning your expected O(1) data structure into an O(n) one). That causes you to either exhaust memory and crash, or consume vast amounts of CPU for simple tasks. Neither is good.
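The attack can be sketched in a few lines (a toy illustration – real hash-flooding attacks target the language’s built-in string hash rather than a user-defined `__hash__`, and CPython dicts use open addressing rather than chaining, but the quadratic blow-up is the same):

```python
class CraftedKey:
    """Adversarial key: every instance hashes to the same bucket."""
    def __init__(self, payload):
        self.payload = payload

    def __hash__(self):
        return 0  # all keys collide

    def __eq__(self, other):
        return isinstance(other, CraftedKey) and self.payload == other.payload

table = {}
for i in range(1000):
    # Each insert must compare against every existing colliding key,
    # so n inserts cost O(n^2) instead of the expected O(n).
    table[CraftedKey(i)] = i
```

The table still works correctly – it just does a linear scan of all 1000 keys on every insert and lookup, which is exactly the CPU-exhaustion behaviour described above.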

Just to add: there are a bunch of places where mediocre hash functions behave really well because the people preparing the inputs are helping out by putting lots of entropy in. For instance, usernames on most web forms, or pointers returned from a malloc() that is deliberately designed to hand out random-looking addresses from all over the address space.
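This effect is easy to demonstrate (a toy sketch – `mediocre_hash` and the key sets are my own illustrations, not from the comment): a deliberately weak hash spreads high-entropy inputs tolerably, but clusters structured, low-entropy inputs badly.

```python
import random
import string

def mediocre_hash(s, buckets=64):
    # Deliberately weak: sum of byte values mod bucket count.
    # Note it is order-insensitive, so "ab" and "ba" always collide.
    return sum(s.encode()) % buckets

def max_bucket_load(keys, buckets=64):
    counts = [0] * buckets
    for k in keys:
        counts[mediocre_hash(k, buckets)] += 1
    return max(counts)

random.seed(0)
# High-entropy inputs (random "usernames") spread across buckets reasonably...
random_keys = ["".join(random.choices(string.ascii_lowercase, k=8))
               for _ in range(1000)]
# ...while structured, low-entropy inputs pile up: "user0000".."user0999"
# only differ in their digit sum, which takes very few distinct values.
sequential_keys = [f"user{i:04d}" for i in range(1000)]

print(max_bucket_load(random_keys), max_bucket_load(sequential_keys))
```

With the random keys, the fullest bucket holds roughly its fair share; with the sequential keys, a handful of buckets absorb almost everything.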

I thought this was going to be about calculating the randomness of the hash function itself too; I guess this proof assumes uniform randomness?

Yes, it does.