1. 24
    1. 3

      In the “Handling Collisions” section, he writes

      It is possible that hash_b will return 0, reducing the second term to 0. This will cause the hash table to try to insert the item into the same bucket over and over. We can mitigate this by adding 1 to the result of the second hash, making sure it’s never 0.

      How does adding 1 to a hash ensure that it’s not 0?

      1. 1

        It’s possible for hash_b(x) = 0, so that (hash_a(x) + i) * hash_b(x) is always 0, no matter how many times you increase i to get another bucket.

        1. 2

          It’s possible hash_b(x) = -1

          1. 2

            Looking at the ht_hash implementation, it doesn’t look like it will ever return negative values.

            1. 2

              So the hash function can return [0, INT_MAX], meaning that (INT_MAX + 1) is possible. If hash_a() can also return 0, we thus have 0 + (undefined behavior probably being -1) % num_buckets.

              -1 % num_buckets still returns -1. Not a good idea for an index.

              The guide is interesting, pretty simple. But those issues are fundamental in C and should be tackled right now. It does not add too much complexity to limit the number of buckets, only use unsigned values (avoid undefined behavior), and check for edge-cases.

              The hash function itself can easily overflow a long integer, meaning that the precaution of up-casting to long, modulo then restrict to int is useless and incorrect. Still undefined behavior.

              There are other well-defined stream hash-functions, easy to write in a few lines that can be used instead.

              1. 5

                The hash function returns [0, m] where m is the number of buckets (at least in ht_get_hash). That means, unless the hash table has INT_MAX buckets, hash_b + 1 won’t overflow.

                The overflow potential that is there is that hash += (long)pow(a, len_s - (i+1)) * s[i]; could overflow the long hash; it probably won’t because long is probably 64 bits, but on some systems (maybe especially the kind of system where you would want a home grown hash table implementation), long could be 32 or 16 bits. I agree that using unsigned there would be better.

                Also, while it doesn’t really matter to this discussion, 32-bit INT_MAX + 1 overflows from 01111111111111111111111111111111 to 10000000000000000000000000000000, which represents -2147483648 in two’s complement, not -1. -1 would be all bits set to 1. (Of course, this is UB and the compiler could just make the program terminate or launch nethack or whatever instead)