I suspect the hash function he uses is not ideal.
Multiplying integers is really surprisingly slow. On Skylake, you can start one r32 IMUL per cycle and you have to wait 4 cycles for the result. You can start four AND/OR/XORs per cycle with 1 cycle latency, so that’s 16 bitops in the same time as a single multiply.
The multiplier is 31, which is a sign the hash function is designed to be implemented by bit shifting. That is, h * 31 is the same as (h << 5) - h. I use a similar DJB hash here: https://github.com/google/rappor/blob/master/analysis/cpp/find_cliques.cc#L81
In that case the multiplier is 33, so you do (h << 5) + h. I think Java has problems with shifts because it doesn’t have unsigned integers? But the JVM is pretty performance-minded, so I imagine it should do “the right thing”.
The example code is in Java – I’d like to see a version of this in C!
Is there a C hashmap implementation you’re particularly thinking of?
I was going to say CPython’s hash tables with string keys. That might work, but they’re pretty complicated, and with the methodology he uses, I don’t know how you would take out the bytecode dispatch overhead. It looks like he just directly calls the hash function (hash('abc') in Python) in a loop and subtracts.

However, I seem to recall that Python was actually faster than Go in some hash table benchmarks (early Go, at least). That is, if you wrote a simple loop to insert into hash tables like this one, Python’s highly tuned hash tables would outperform a compiled language! They’re both of course written in C – in some sense you’re comparing C to C – but it shows that the interpreter overhead sometimes doesn’t matter.
(And I don’t have a reference unfortunately.)
Maybe an easier one would be C++ unordered_map, which I use here with a similar hash function: https://github.com/google/rappor/blob/master/analysis/cpp/find_cliques.cc#L81