
  2. 3

    I love the note that a multiply and shift mostly fixes the initial ‘bad’ hash: if the problem is entropy badly distributed throughout the word, a multiply can quickly mix all of the bits into the upper few. I suspect some inherited wisdom about designing a non-crypto hash dates back to when multiplying by an arbitrary 64-bit number in a hot loop was, relatively speaking, much more expensive. (xxhash, though not made for hashtables, hashes at 19.4GB/s on Intel Coffee Lake with 64-bit multiplies and no SIMD.)
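
    A minimal sketch of that multiply-and-shift fixup (the constant is the golden-ratio multiplier often used for Fibonacci hashing; the function name is made up for illustration, not any particular library’s API):

    ```rust
    /// One 64-bit multiply by a well-mixed odd constant smears every input
    /// bit upward, so the top `bucket_bits` bits of the product depend on
    /// all of the input bits.
    fn mix_to_bucket(h: u64, bucket_bits: u32) -> usize {
        // 2^64 divided by the golden ratio; assumes 1 <= bucket_bits <= 63.
        const K: u64 = 0x9E37_79B9_7F4A_7C15;
        (h.wrapping_mul(K) >> (64 - bucket_bits)) as usize
    }
    ```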

    I tend towards “just use something SipHash-y” for roughly the reason I like ciphers as PRNGs: they’re fast enough and get rid of a whole class of potential problems. Maybe the argument is weaker for hashtables, because more apps really want accesses to be fast, and there are other ways to attack the hash-flooding/collision problem. Still, hashes are at least an area where tuning too much for the wrong thing can end up penny-wise and pound-foolish.

    1. 3

      (This article isn’t really about C++ per se. I wish there were an algorithms or data structures tag.)

      Thus, each hash map has a way to map a hash h to a bucket index i. … The discerning reader might already see the problem: This mapping now amounts to throwing away most of the bits of h.

      I ran into this last year. I had a homemade flat hash map that was too slow. Turned out many of the keys (short strings) differed only in the last byte — stuff like “foo1”, “foo2”, “bar8”, “bar9”… This was causing tons of collisions in my map.

      I was using the common FNV-1a hash algorithm. When I looked at the algorithm, it was obvious that the last byte hashed would only alter the low 8 bits of the result, since each step ends by xor’ing the accumulator with the current byte. (Strictly speaking, that xor-last ordering is the original FNV-1; FNV-1a does the multiply after the xor.) Then the hash table would throw away those bits when computing a table index. End result: any keys that differed only in the last byte would collide.
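
      A sketch of that failure mode, assuming the xor-last ordering described above and a table that derives its bucket index from the high bits of the hash (both assumptions for illustration):

      ```rust
      /// FNV-style loop whose final operation is xor’ing in the last byte:
      /// two keys differing only in that byte produce hashes differing only
      /// in their low 8 bits.
      fn fnv1_64(data: &[u8]) -> u64 {
          let mut h: u64 = 0xcbf29ce484222325; // FNV 64-bit offset basis
          for &b in data {
              h = h.wrapping_mul(0x100000001b3); // FNV prime: multiply first...
              h ^= b as u64; // ...so the last byte’s xor is the final step
          }
          h
      }

      fn main() {
          // A table that takes its bucket index from the *high* bits of the
          // hash throws away exactly the bits the last byte touched:
          for key in ["foo1", "foo2", "bar8", "bar9"] {
              let h = fnv1_64(key.as_bytes());
              println!("{key}: bucket {}", h >> 56); // top 8 bits, 256 buckets
          }
          // foo1/foo2 share one bucket and bar8/bar9 share another: collisions.
      }
      ```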

      I switched to a different hash function (I forget which, I don’t have the source handy) and that fixed it.

      1. 1

        I like that the documentation for Rust’s HashMap almost immediately focuses on the hash function used: https://doc.rust-lang.org/std/collections/struct.HashMap.html

        Performance-sensitive use cases should consider the hash function an important input to a hash table, not just a given. Hence this article isn’t about the hash map’s badness, but rather about the hash function itself.

        1. 1

          I didn’t know Rust’s default hash map was so advanced. I’m not sure I agree with using SipHash as the default, since it’s relatively uncommon for a hash map to be a DoS vulnerability, and they’re trading performance for resistance to that. (SipHash would be a bad choice in a (non-networked) game, for instance.)
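
          For what it’s worth, that trade-off is per-map in Rust: the hasher is a type parameter, so opting out of SipHash is a one-line change. A sketch with a toy FNV-1a hasher (in practice you’d reach for a vetted crate such as fxhash or ahash):

          ```rust
          use std::collections::HashMap;
          use std::hash::{BuildHasherDefault, Hasher};

          /// Toy FNV-1a hasher: fast, but gives up SipHash’s flooding resistance.
          struct Fnv1a(u64);

          impl Default for Fnv1a {
              fn default() -> Self {
                  Fnv1a(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
              }
          }

          impl Hasher for Fnv1a {
              fn write(&mut self, bytes: &[u8]) {
                  for &b in bytes {
                      self.0 = (self.0 ^ b as u64).wrapping_mul(0x100000001b3);
                  }
              }
              fn finish(&self) -> u64 {
                  self.0
              }
          }

          fn main() {
              // HashMap’s third type parameter selects the hash builder.
              let mut map: HashMap<&str, u32, BuildHasherDefault<Fnv1a>> =
                  HashMap::default();
              map.insert("answer", 42);
              assert_eq!(map.get("answer"), Some(&42));
          }
          ```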

          This paragraph is weird:

          The behavior resulting from such a logic error is not specified, but will not result in undefined behavior. This could include panics, incorrect results, aborts, memory leaks, and non-termination.

          Is that a typo that means to say it will not panic, etc.? Or is it that those bad effects are possible but are not considered undefined behavior?

          1. 2

            “Undefined behavior” has a specific meaning in Rust literature: it refers to cases where a program performs operations that violate Rust’s soundness requirements, after which the language makes no guarantees whatsoever about what that program does.

            This paragraph in HashMap’s documentation attempts, perhaps unsuccessfully, to convey that your program will behave incorrectly and unpredictably in those cases, but only in ways which are “safe”[1] because they don’t invoke behaviour which is undefined by the language.

            I’ve written what I now realise is somewhat of a tome below, which explains this a bit more with some technical details on why this distinction is being made.

            The idea behind partitioning Rust with unsafe is that the compiler prevents “safe Rust” code from using language operations that can invoke said undefined behaviour[2]. Safe Rust needs to be able to rely on the protection of Rust’s guarantees for safe code, which the operations allowed in unsafe can violate if used incorrectly. This also means that Rust imposes a contract upon authors of code in unsafe blocks, which form the boundary between safe and unsafe Rust code: they too must not allow safe Rust code to invoke undefined behaviour via those unsafe regions.

            This means that code in safe Rust which has unspecified behaviour still cannot invoke Rust’s language-level undefined behaviour. By extension, the contract between authors of unsafe Rust and the language requires them to defend against wild misuse and logic errors in safe traits like Eq[3] or Hash, which are expected to comply with certain basic rules, similar to typeclass laws in e.g. Haskell. This is explicitly called out in the Rust language reference. It’s not permissible for your unsafe Rust code to let even the worst errors in safe code cause things like stack or heap corruption, dangling references, data races, or violations of Rust’s ownership model.

            [1] As this paragraph implies, some surprising things are “safe” because the program state is well defined by the language: panics cause a defined abort or stack unwind, aborts cause the program to halt (which is a defined program state), memory leaks are a performance issue rather than a memory safety issue (see the famous “ballistic missile malloc” anecdote), and obviously, going into an infinite loop or returning incorrect values for safe types is defined behaviour. Integer overflow is also safe (surprising to C and C++ users in particular): the behaviour is implementation-defined but specified, since the language only permits implementations to choose between panicking or two’s-complement wraparound on overflow. This can still result in logic errors, but as discussed later, logic errors in safe code are explicitly defined to be safe.
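
            To make the overflow point concrete (whether the plain + panics or wraps depends on whether overflow checks are enabled at build time, but both outcomes are specified and safe):

            ```rust
            fn main() {
                let x = i32::MAX;
                // `let y = x + 1;` either panics (overflow checks on, the debug
                // default) or wraps to i32::MIN (checks off): specified and safe,
                // though possibly a logic error.
                // The explicit methods pin down one behaviour regardless of build:
                assert_eq!(x.wrapping_add(1), i32::MIN);
                assert_eq!(x.checked_add(1), None);
                assert_eq!(x.saturating_add(1), i32::MAX);
            }
            ```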

            [2] Such restricted operations are typically useful operations which the language cannot prove sound in all cases: use of memory with uninitialised regions, calling functions across the FFI boundary, bit-reinterpret casts (known as transmute in Rust jargon), reading or writing mutable globals, accessing fields of unions, or dereferencing raw pointers. Misuse of these can either directly result in undefined behaviour, or can allow undefined behaviour to occur by violating Rust’s guarantees.
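
            Two of those gated operations in miniature (illustrative only; each unsafe block is the author vouching for soundness the compiler can’t check):

            ```rust
            fn main() {
                let x = 42u32;
                let p = &x as *const u32; // creating a raw pointer is safe...
                let v = unsafe { *p };    // ...dereferencing one is not
                assert_eq!(v, 42);

                // A bit-reinterpret cast, “transmute” in Rust jargon:
                let one: f32 = unsafe { std::mem::transmute::<u32, f32>(0x3F80_0000) };
                assert_eq!(one, 1.0);
            }
            ```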

            [3] A clear logic error in Eq would be always returning random results, because Eq specifies that implementors must return the same result for multiple invocations with the same values. A more surprising logic error is failing to implement the Eq trait with the reflexive (a == a is true)[4], transitive (a == b and b == c implies a == c) and symmetric (a == b implies b == a) properties, because Eq is the trait for total equality, which includes these properties as a requirement, unlike PartialEq. An emergent logic error is one like HashMap’s documentation describes, where safely managed interior mutability through types like Cell or RefCell allows implementations of Eq to violate the basic rule that equality produces the same result for the same pair of input values, because those types defer their Eq implementation to the values they contain, which can change between invocations.
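
            A sketch of that emergent logic error (the Sneaky type is made up for illustration; Rc lets us keep mutating the key after the map takes ownership of it):

            ```rust
            use std::cell::Cell;
            use std::collections::HashMap;
            use std::hash::{Hash, Hasher};
            use std::rc::Rc;

            // A key that defers Eq and Hash to interior-mutable state.
            #[derive(PartialEq, Eq)]
            struct Sneaky(Cell<u32>);

            impl Hash for Sneaky {
                fn hash<H: Hasher>(&self, state: &mut H) {
                    self.0.get().hash(state);
                }
            }

            fn main() {
                let key = Rc::new(Sneaky(Cell::new(1)));
                let mut map: HashMap<Rc<Sneaky>, &str> = HashMap::new();
                map.insert(Rc::clone(&key), "hello");

                key.0.set(2); // mutate the key *after* insertion: a logic error

                // The entry is (almost certainly) stranded in a bucket that no
                // longer matches its hash. No UB, just unspecified-but-safe
                // misbehaviour:
                println!("lookup: {:?}", map.get(&key)); // usually None
                println!("len: {}", map.len()); // still 1
            }
            ```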

            [4] Rust’s builtin floating-point types are defined to always be IEEE 754 compliant, and therefore do not implement Eq, because IEEE 754 equality is not reflexive (a == a is false if a is NaN). This has the effect that floating-point values can’t be used as keys in a BTreeMap, or as values in a BTreeSet; not that you should really want to do so.
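
            The non-reflexivity in two lines:

            ```rust
            fn main() {
                let a = f64::NAN;
                assert!(a != a); // IEEE 754: NaN is unequal to everything, itself included
                assert_eq!(a.partial_cmp(&a), None); // hence PartialOrd but no Ord
            }
            ```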

            1. 1

              It is the second option: none of those listed effects are considered undefined behaviour in Rust.