This is the problem with using power of two bucket growth in a hashed data structure.

Modulos should be prime numbers. That’s what God created prime numbers for.

2 is a prime. The real problem is that the hashing algorithm is easily reversible and has no randomization. E.g. in Python, a random seed is generated on every interpreter startup and mixed into each hash used in dictionaries, precisely to prevent hash-collision attacks – combined with a hash function that is hard to generate collisions for.
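A quick sketch of what that per-startup randomization looks like in practice: str hashes are salted with a seed chosen at interpreter startup, so the same string hashes differently across runs unless `PYTHONHASHSEED` is pinned. (The seed values below are arbitrary illustrations.)

```python
# Demonstrate Python's per-process hash randomization by hashing the
# same string in fresh interpreters with different PYTHONHASHSEED values.
import os
import subprocess
import sys

def hash_of(s: str, seed: str) -> int:
    """Hash s in a fresh interpreter pinned to the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({s!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

h1 = hash_of("hello", "1")
h2 = hash_of("hello", "2")
h1_again = hash_of("hello", "1")

print(h1 != h2)        # different seeds -> (almost surely) different hashes
print(h1 == h1_again)  # same seed -> reproducible hash
```

With an unset (random) seed, an attacker who cannot predict the salt cannot precompute a set of keys that all land in one bucket.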

It’s true that the hash function itself should be improved; that will solve most of the problems right away. The problem that I was referring to is that when there is a clump of values in a hashed data structure, and you double the size of the data structure, the clump doesn’t go away. At the very best, it chops the clump in half.

But if you change the modulo from one prime number to another prime number, the clump will often disappear entirely – unless all of the items in the clump have the same exact hash code, which that article actually accomplished 😢.
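A small sketch of the clump behavior described above, using hypothetical hash codes chosen to collide mod 16: doubling the table only splits the clump in half, while switching to a prime modulus disperses it entirely.

```python
# Hypothetical hash codes that all collide in a 16-bucket table.
clump = [16 * i + 5 for i in range(8)]   # every value is 5 mod 16

def bucket_count(hashes, size):
    """Number of distinct buckets the hashes land in for a given table size."""
    return len({h % size for h in hashes})

print(bucket_count(clump, 16))   # 1: the whole clump shares one bucket
print(bucket_count(clump, 32))   # 2: doubling only chops the clump in half
print(bucket_count(clump, 17))   # 8: a prime modulus spreads them out fully
```

If every item had the exact same hash code (not just the same low bits), no choice of modulus would help – which is the degenerate case mentioned above.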

TLDR - you’re correct

Most people prefer a power-of-two table size so they can compute the bucket index with `& (pow2 - 1)` instead of `%`. In the old days, performing the `&` was an order of magnitude faster.
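The mask trick works because for a power-of-two size, `size - 1` is a mask of the low bits, so the two index computations agree (for non-negative hashes):

```python
# For a power-of-two table size, h % size == h & (size - 1),
# since size - 1 is a mask selecting the low log2(size) bits.
size = 1 << 10          # 1024 buckets
mask = size - 1         # 0b1111111111

for h in (0, 7, 1023, 1024, 123_456_789):
    assert h % size == h & mask

print("ok")
```

Note this equivalence only holds for power-of-two sizes – which is exactly why it is in tension with the prime-modulus advice above.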

Indeed. And we used single cycle bit twiddling instead of multiplication for string hashes, because multiply took over 1 cycle per bit (39 cycles for a 32 bit multiply on a 386, IIRC). Yet memory access was considered pretty much free.

Now I can pipeline 2x 32-bit multiplies inside a single clock cycle on x86, but a memory access is 300+ clock cycles.

Very funny how things have changed.

Integer division is still way slower than bitwise ops. According to the Agner Fog instruction tables, on Zen 3 the 64-bit DIV instruction has a latency of 9–17 cycles, while an AND instruction has a latency of 1 cycle. Exactly how much slower the modulo ends up being depends on the surrounding code and how intelligently the core can find other work to do while waiting for the DIV to complete, but an order of magnitude is realistic.

I do favour POT-sized tables, but you absolutely do not need a div for lookups in prime-sized tables. See eg https://libdivide.com/, https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
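A sketch of the trick from the Lemire post linked above: map a 32-bit hash onto `[0, n)` with one multiply and a shift, no division needed. The result is not `x % n` – it is a different but similarly uniform mapping – and it works for any `n`, prime or not. (The function name and test values here are illustrative, not from the post.)

```python
# Lemire's multiply-shift "fast range" reduction: treat x as a 32-bit
# value, take the high 32 bits of the 64-bit product x * n.
# This maps uniform x in [0, 2**32) to roughly uniform i in [0, n).
def fast_range(x: int, n: int) -> int:
    x &= 0xFFFFFFFF              # interpret x as unsigned 32-bit
    return (x * n) >> 32         # high half of the 64-bit product

n = 101                          # a prime table size, no DIV required
for x in (0, 1, 0xDEADBEEF, 0xFFFFFFFF):
    i = fast_range(x, n)
    assert 0 <= i < n
print("ok")
```

In C this compiles to a single widening multiply and a shift, so prime-sized tables don't have to pay the DIV latency on every lookup.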

I was under the impression the performance gap had started to close in recent times. I see less bitwise magic in today's code compared to what there used to be.

Probably in large part because compilers are awesome these days; a division or modulus where either operand is known at compile time won't be compiled to an actual div instruction anymore. And of course, CPUs are so fast that the extra 9 cycles probably doesn't matter in most cases.

9 cycles can be easily hidden by a L2 cache hit (L1 miss).

Modulos won’t help if the hashes are exactly the same.

Yes, ignaloidas already pointed that out :)

Reminds me of this talk about using hash collisions to DoS webservices