You don’t have to use the golden ratio; multiplying by any constant with ones in the top and bottom bits and about half those in between will mix a lot of input bits into the top output bits. One gotcha is that it only mixes less-significant bits towards more-significant ones, so the 2nd bit from the top is never affected by the top bit, 3rd bit from the top isn’t affected by top two, etc. You can do other steps to add the missing dependencies if it matters, like a rotate and another multiply for instance. (The post touches on a lot of this.)

FNV hashing, mentioned in the article, is an old multiplicative hash used in DNS, and the rolling Rabin-Karp hash is multiplicative. Today Yann Collet’s xxHash and LZ4 use multiplication in hashing. There have got to be a bajillion other uses of multiplication for non-cryptographic hashing that I can’t name, since it’s such a cheap way to mix bits.

It is, as author says, kind of interesting that something like a multiplicative hash isn’t the default cheap function everyone’s taught. Integer division to calculate a modulus is maybe the most expensive arithmetic operation we commonly do when the modulus isn’t a power of two.

Nice!
About the leftward bit propagation: can you do multiplication modulo a compile time constant fast? If you compute (((x * constant1) % constant2) % (1<<32)) where constant1 is the aforementioned constant with lots of ones, and constant2 is a prime number quite close to 1<<32 then that would get information from the upper bits to propagate into the lower bits too, right? Assuming you’re okay with having just slightly fewer than 1<<32 hash outputs.

(Replace 1<<32 with 1<<64 above if appropriate of course.)

You still have to do the divide for the modulus at runtime and you’ll wait 26 cycles for a 32-bit divide on Intel Skylake. You’ll only wait 3 cycles for a 32-bit multiply, and you can start one every cycle. That’s if I’m reading the tables right. Non-cryptographic hashes often do multiply-rotate-multiply to get bits influencing each other faster than a multiply and a modulus would. xxHash arranges them so your CPU can be working on more than one at once.

(But worrying about all bits influencing each other is just one possible tradeoff, and, e.g. the cheap functions in hashtable-based LZ compressors or Rabin-Karp string search don’t really bother.)

you’ll wait 26 cycles for a 32-bit divide on Intel Skylake

And looking at that table, 35-88 cycles to divide by a 64 bit divide. Wow. That’s so many cycles, I didn’t realize. But I should have: on a 2.4 GHz processor 26 cycles is 10.83 ns per op, which is roughly consistent with the author’s measurement of ~9 ns per op.

That’s not what I asked. I asked a specific question.

can you do multiplication modulo a compile time constant fast?

similarly to how you can do division by a constant fast by implementing it as multiplication by the divisor’s multiplicative inverse in the group of integers modulo 2^(word size). clang and gcc perform this optimisation out the box already for division by a constnat. What I was asking is if there’s a similar trick for modulo by a constant. You obviously can do (divide by divisor, multiply by divisor, subtract from original number), but I’m wondering if there’s something quicker with a shorter dependency chain.

So (much later) Daniel Lemire actually did this, and, yes, you can get an advantage out of using the normally-thrown-away top bits of the multiply result!

OK, I get it. Although I knew about the inverse trick for avoiding DIVs for constant divisions, I didn’t know or think of extending that to modulus even in the more obvious way. Mea culpa for replying without getting it.

I don’t know the concrete answer about the best way to do n*c1%(2^32-5) or such. At least does intuitively seem like it should be possible to get some win from using the high bits of the multiply result as the divide-by-multiplying tricks do.

So does that mean that when the author says Dinkumware’s FNV1-based strategy is too expensive, it’s only more expensive because FNV1 is byte-by-byte and fibonacci hashing multiplying by 2^64 / Φ works on 8 bytes at a time?

Does that mean you could beat all these implementations by finding a multiplier that produces an even distribution when used as a hash function working on 8 byte words at a time? That is, he says the fibonacci hash doesn’t produce a great distribution, whereas multipliers like the FNV1 prime are chosen to produce good even distributions. So if you found an even-distribution-producing number for an 8 byte word multiplicative hash, would that then work just as well whatever-hash-then-fibonacci-hash? But be faster because it’s 1 step not 2?

I think you’re right about FNV and byte- vs. word-wise multiplies.

Re: 32 vs. 64, it does look like Intel’s latest big cores can crunch through 64-bit multiplies pretty quickly. Things like Murmur and xxHash don’t use them; I don’t know if that’s because perf on current chips is for some reason not as good as it looks to me or if it’s mainly for the sake of older or smaller platforms. The folks that work on this kind of thing surely know.

Re: getting a good distribution, the limitations on the output quality you’ll get from a single multiply aren’t ones you can address through choice of constant. If you want better performance on the traditional statistical tests, rotates and multiplies like xxHash or MurmurHash are one approach. (Or go straight to SipHash, which prevents hash flooding.) Correct choice depends on what you’re trying to do.

That makes me wonder what hash algorithm ska::unordered_map uses that was faster than FNV1 in dinkumware, but doesn’t have the desirable property of evenly mixing high bits without multiplying the output by 2^64 / φ. Skimming his code it looks like std::hash.

On my MacOS system, running Apple LLVM version 9.1.0 (clang-902.0.39.2), std::hash for primitive integers is the identity function (i.e. no hash), and for strings murmur2 on 32 bit systems and cityhash64 on 64 bit systems.

// We use murmur2 when size_t is 32 bits, and cityhash64 when size_t
// is 64 bits. This is because cityhash64 uses 64bit x 64bit
// multiplication, which can be very slow on 32-bit systems.

Looking at CityHash, it also multiplies by large primes (with the first and last bits set of course).

Assuming then that multiplying by his constant does nothing for string keys—plausible since his benchmarks are only for integer keys—does that mean his benchmark just proves that dinkumware using FNV1 for integer keys is better than no hash, and that multiplying an 8 byte word by a constant is faster than multiplying each integer byte by a constant?

A fair point that came up over on HN is that people mean really different things by “hash” even in non-cryptographic contexts; I mostly just meant “that thing you use to pick hashtable buckets.”

In a trivial sense a fixed-size multiply clearly isn’t a drop-in for hashes that take arbitrary-length inputs, though you can use multiplies as a key part of variable-length hashing like xxHash etc. And if you’re judging your hash by checking that outputs look random-ish in a large statistical test suite, not just how well it works in your hashtable, a multiply also won’t pass muster. A genre of popular non-cryptographic hashes are like popular non-cryptographic PRNGs in that way–traditionally judged by running a bunch of statistical tests.

That said, these “how random-looking is your not-cryptographically-random function” games annoy me a bit in both cases. Crypto-primitive-based functions (SipHash for hashing, cipher-based PRNGs) are pretty cheap now and are immune not just to common statistical tests, but any practically relevant method for creating pathological input or detecting nonrandomness; if they weren’t, the underlying functions would be broken as crypto primitives. They’re a smart choice more often than you might think given that hashtable-flooding attacks are a thing.

If you don’t need insurance against all bad inputs, and you’re tuning hard enough that SipHash is intolerable, I’d argue it’s reasonable to look at cheap simple functions that empirically work for your use case. Failing statistical tests doesn’t make your choice wrong if the cheaper hashing saves you more time than any maldistribution in your hashtable costs. You don’t see LZ packers using MurmurHash, for example.

“map a large number to a small number.” It’s literally the slowest operation in the hash table.

This is why tries always outperform hash tables. Your operations is not O(k × hashing coefficient) + some probing cost related to density, but simply O(k).

I must admit I never noticed Knuth names this variant Fibonacci hashing in TAOCP, I’ve always just seen it referred to as Knuth’s multiplicative hashing. It appears in a number of data compression algorithms – for instance you can try searching for the constant 2654435761, which is a prime close to 2^32/phi often used.

Somehow It seems unlikely that the developers of STL implementations were not aware of this.

And the only alternative was the “power of two binary and” version which discards bits from the hash function and can lead to all kinds of problems. So your options are either slow and safe, or fast and losing bits and getting potentially many hash collisions if you’re ever not careful.

That sounds awkward. mod division throws away high bits just the same as masking.

It’s almost as fast as the power of two size, but it introduces far fewer problems because it doesn’t discard any bits.

This is similarly awkward. The big shift at the end of fibonacci hash is exactly the same as a mask, except for high bits instead of low bits. Where do you think all those shifted bits go?

Yeah I think that part is just wrong. I don’t think the author understands that the mathematical properties of prime modulo addressing make it valuable, not the bitwise properties. Obviously masking just divides by some power of 2, and don’t magically throw away more bits than dividing by any other number.

I think this only demonstrates that a trivial multiplicative hash speeds up hash tables with integer keys. He’s essentially proven that the dinkumware implementation’s strategy of hashing integers with FNV1 is valuable, but works just as well with a much simpler hash function that also saves some cycles. I wrote more in my other comments.

You don’t have to use the golden ratio; multiplying by any constant with ones in the top and bottom bits and about half those in between will mix a lot of input bits into the top output bits. One gotcha is that it only mixes less-significant bits towards more-significant ones, so the 2nd bit from the top is never affected by the top bit, 3rd bit from the top isn’t affected by top two, etc. You can do other steps to add the missing dependencies if it matters, like a rotate and another multiply for instance. (The post touches on a lot of this.)

FNV hashing, mentioned in the article, is an old multiplicative hash used in DNS, and the rolling Rabin-Karp hash is multiplicative. Today Yann Collet’s xxHash and LZ4 use multiplication in hashing. There have got to be a bajillion other uses of multiplication for non-cryptographic hashing that I can’t name, since it’s such a cheap way to mix bits.

It is, as author says, kind of interesting that something like a multiplicative hash isn’t the default cheap function everyone’s taught. Integer division to calculate a modulus is maybe the most expensive arithmetic operation we commonly do when the modulus isn’t a power of two.

Nice! About the leftward bit propagation: can you do multiplication modulo a compile time constant fast? If you compute

`(((x * constant1) % constant2) % (1<<32))`

where constant1 is the aforementioned constant with lots of ones, and constant2 is a prime number quite close to 1<<32 then that would get information from the upper bits to propagate into the lower bits too, right? Assuming you’re okay with having just slightly fewer than 1<<32 hash outputs.(Replace 1<<32 with 1<<64 above if appropriate of course.)

You still have to do the divide for the modulus at runtime and you’ll wait 26 cycles for a 32-bit divide on Intel Skylake. You’ll only wait 3 cycles for a 32-bit multiply, and you can start one every cycle. That’s if I’m reading the tables right. Non-cryptographic hashes often do multiply-rotate-multiply to get bits influencing each other faster than a multiply and a modulus would. xxHash arranges them so your CPU can be working on more than one at once.

(But worrying about all bits influencing each other is just one possible tradeoff, and, e.g. the cheap functions in hashtable-based LZ compressors or Rabin-Karp string search don’t really bother.)

And looking at that table, 35-88 cycles to divide by a 64 bit divide. Wow. That’s so many cycles, I didn’t realize. But I should have: on a 2.4 GHz processor 26 cycles is 10.83 ns per op, which is roughly consistent with the author’s measurement of ~9 ns per op.

That’s not what I asked. I asked a specific question.

similarly to how you can do division by a constant fast by implementing it as multiplication by the divisor’s multiplicative inverse in the group of integers modulo 2^(word size). clang and gcc perform this optimisation out the box already for division by a constnat. What I was asking is if there’s a similar trick for modulo by a constant. You obviously can do (divide by divisor, multiply by divisor, subtract from original number), but I’m wondering if there’s something quicker with a shorter dependency chain.

So (much later) Daniel Lemire actually did this, and, yes, you can get an advantage out of using the normally-thrown-away top bits of the multiply result!

https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/

Good to know. Seems like you could surely make better hashes with __uint128_t too, though it looks like it’s an LLVM/GCC thing and not fully standard.

Thanks!

OK, I get it. Although I knew about the inverse trick for avoiding DIVs for constant divisions, I didn’t know or think of extending that to modulus even in the more obvious way. Mea culpa for replying without getting it.

I don’t know the concrete answer about the best way to do n*c1%(2^32-5) or such. At least does intuitively seem like it should be possible to get

somewin from using the high bits of the multiply result as the divide-by-multiplying tricks do.So does that mean that when the author says Dinkumware’s FNV1-based strategy is too expensive, it’s only more expensive because FNV1 is byte-by-byte and fibonacci hashing multiplying by 2^64 / Φ works on 8 bytes at a time?

Does that mean you could beat all these implementations by finding a multiplier that produces an even distribution when used as a hash function working on 8 byte words at a time? That is, he says the fibonacci hash doesn’t produce a great distribution, whereas multipliers like the FNV1 prime are chosen to produce good even distributions. So if you found an even-distribution-producing number for an 8 byte word multiplicative hash, would that then work just as well whatever-hash-then-fibonacci-hash? But be faster because it’s 1 step not 2?

I think you’re right about FNV and byte- vs. word-wise multiplies.

Re: 32 vs. 64, it does look like Intel’s latest big cores can crunch through 64-bit multiplies pretty quickly. Things like Murmur and xxHash don’t use them; I don’t know if that’s because perf on current chips is for some reason not as good as it looks to me or if it’s mainly for the sake of older or smaller platforms. The folks that work on this kind of thing surely know.

Re: getting a good distribution, the limitations on the output quality you’ll get from a single multiply aren’t ones you can address through choice of constant. If you want better performance on the traditional statistical tests, rotates and multiplies like xxHash or MurmurHash are one approach. (Or go straight to SipHash, which prevents hash flooding.) Correct choice depends on what you’re trying to do.

That makes me wonder what hash algorithm ska::unordered_map uses that was faster than FNV1 in dinkumware, but doesn’t have the desirable property of evenly mixing high bits without multiplying the output by 2^64 / φ. Skimming his code it looks like std::hash.

On my MacOS system, running Apple LLVM version 9.1.0 (clang-902.0.39.2), std::hash for primitive integers is the identity function (i.e. no hash), and for strings murmur2 on 32 bit systems and cityhash64 on 64 bit systems.

Looking at CityHash, it also multiplies by large primes (with the first and last bits set of course).

Assuming then that multiplying by his constant does nothing for string keys—plausible since his benchmarks are only for integer keys—does that mean his benchmark just proves that dinkumware using FNV1 for integer keys is better than no hash, and that multiplying an 8 byte word by a constant is faster than multiplying each integer byte by a constant?

A fair point that came up over on HN is that people mean really different things by “hash” even in non-cryptographic contexts; I mostly just meant “that thing you use to pick hashtable buckets.”

In a trivial sense a fixed-size multiply clearly isn’t a drop-in for hashes that take arbitrary-length inputs, though you can use multiplies as a key part of variable-length hashing like xxHash etc. And if you’re judging your hash by checking that outputs look random-ish in a large statistical test suite, not just how well it works in your hashtable, a multiply also won’t pass muster. A genre of popular non-cryptographic hashes are like popular non-cryptographic PRNGs in that way–traditionally judged by running a bunch of statistical tests.

That said, these “how random-looking is your not-cryptographically-random function” games annoy me a bit in both cases. Crypto-primitive-based functions (SipHash for hashing, cipher-based PRNGs) are pretty cheap now and are immune not just to common statistical tests, but

anypractically relevant method for creating pathological input or detecting nonrandomness; if they weren’t, the underlying functions would be broken as crypto primitives. They’re a smart choice more often than you might think given that hashtable-flooding attacks are a thing.If you

don’tneed insurance against all bad inputs, and you’re tuning hard enough that SipHash is intolerable, I’d argue it’s reasonable to look at cheap simple functions that empirically work for your use case. Failing statistical tests doesn’t make your choice wrong if the cheaper hashing saves you more time than any maldistribution in your hashtable costs. You don’t see LZ packers using MurmurHash, for example.This is why tries

alwaysoutperform hash tables. Your operations is not O(k ×hashing coefficient) +some probing cost related to density, but simply O(k).I must admit I never noticed Knuth names this variant Fibonacci hashing in TAOCP, I’ve always just seen it referred to as Knuth’s multiplicative hashing. It appears in a number of data compression algorithms – for instance you can try searching for the constant 2654435761, which is a prime close to 2^32/phi often used.

Somehow It seems unlikely that the developers of STL implementations were not aware of this.

That sounds awkward. mod division throws away high bits just the same as masking.

This is similarly awkward. The big shift at the end of fibonacci hash is exactly the same as a mask, except for high bits instead of low bits. Where do you think all those shifted bits go?

Yeah I think that part is just wrong. I don’t think the author understands that the mathematical properties of prime modulo addressing make it valuable, not the bitwise properties. Obviously masking just divides by some power of 2, and don’t magically throw away more bits than dividing by any other number.

I think this only demonstrates that a trivial multiplicative hash speeds up hash tables with integer keys. He’s essentially proven that the dinkumware implementation’s strategy of hashing integers with FNV1 is valuable, but works just as well with a much simpler hash function that also saves some cycles. I wrote more in my other comments.