1. 32
  1. 19

    could use a collision-resistant hash, like perhaps SHA256

    No, what, why are you going straight to the full-on expensive cryptographic hash functions — there’s things like SipHash designed specifically for this.

    1. 15

      This is a pretty common pitfall with hash tables, which has resulted in vulnerabilities in (at least) php and python in the past, just off the top of my head.

      The general pattern was only identified in 2011 though, so the hash table library involved here pre-dates this being a known common pitfall. and fixing it in Haskell is a bit thorny; the standard approach is to use SipHash instead of a non-cryptographic hash function like FNV. But SipHash is a MAC not just a hash function; it requires a secret key, normally chosen randomly. This doesn’t really work for purely functional languages Haskell, at least without changing APIs, since that would break referential transparency. You would need to provide an rng to a function for creating a map.

      This is why I tend to use Map from the containers package (which is a balanced tree) rather than HashMap, unless I know the inputs were trusted and I actually need a performance boost that HashMap can provide.

      1. 3

        You could probably fudge something with unsafePerformIO and {-# NOINLINE #-} inside unordered-containers, but it seems fraught, especially if you want to share structure between tables.

        I tend to default to Map as well, because unioning singleton HashMaps is O(n^2), but there’s an optimisation in Map that makes it fast.

        1. 2

          It might be enough to pick a single global key on process startup, using unsafe constructs to make this happen – But you then also have to make sure that nothing in the API makes the difference observable between runs. That probably means sorting the output of toList, to start with, which introduces an asymptotic performance hit.

      2. 5

        Very interesting article. While reading the ‘Random salt on startup’ section, I was reminded of this passage in the OCaml Hashtbl module documentation:

        A hash table that is created with ~random set to false uses a fixed hash function (Hashtbl.hash) to distribute keys among buckets. As a consequence, collisions between keys happen deterministically. In Web-facing applications or other security-sensitive applications, the deterministic collision patterns can be exploited by a malicious user to create a denial-of-service attack: the attacker sends input crafted to create many collisions in the table, slowing the application down.

        A hash table that is created with ~random set to true uses the seeded hash function Hashtbl.seeded_hash with a seed that is randomly chosen at hash table creation time. In effect, the hash function used is randomly selected among 2^{30} different hash functions. All these hash functions have different collision patterns, rendering ineffective the denial-of-service attack described above. However, because of randomization, enumerating all elements of the hash table using Hashtbl.fold or Hashtbl.iter is no longer deterministic: elements are enumerated in different orders at different runs of the program.

        Now, personally I don’t think JSON objects should have a guaranteed ordering of keys, but it is understandable that it’s nicer when they do. I’ve come across this in the past when logging in JSON format–you really do want the log lines to be ordered consistently. In my case I converted into an ordered format explicitly.

        Side note, OCaml libraries don’t use hashes but rather association lists to represent JSON objects, so they don’t suffer from collision issues, but I guess memory usage issues instead.

        1. 2

          This is really neat, it adds 4 bytes of memory but if the table is above a few elements (where it would be a DoS) it won’t matter.

          Thank you for sharing the OCaml docs

          1. 4

            It’s pretty interesting how most OCaml libraries for JSON/YAML/TOML/S-exp/etc. use algebraic types when the standard library provides mutable hashes. I always felt that aeson complicated things for no good reason.

            I definitely prefer it that way, and I made my OTOML library work that way too, as opposed to the old To.ml that did use mutable hashes (though it was my least significant objection to its design).

        2. 3

          Does this bug class have a common name yet? That’s usually the first step towards getting a Wikipedia article and educating language designers.

          1. 7

            The article on SipHash calls it “hash flooding,” though I don’t know if that’s widespread: https://en.wikipedia.org/wiki/SipHash

            1. 5

              I have heard it called hash flooding. Denial of Service via Algorithmic Complexity Attacks introduced these attacks.

            2. 3

              There are also some other nasty side-effects of this solution, namely that the HashMap.toList would produce a potentially different ordering on every run of the program.

              Isn’t that part of the deal with hash-based data structures? That ordering is not something you should rely on, as it’s non-deterministic between runs?

              In other news, next up on Lobsters: “Haskell considered harmful” /s

              1. 2

                Doesn’t have to be. Python dictionaries iterate on insertion order while using SipHash to prevent attacks. They do that by maintaining two arrays: one for the contents that are added in insertion order and another that maps the hashed key to the location in the first array. It takes two hops, but keeps a stable order across runs.

                This is a really nice property when you are working with, for example, JSON config. If you load and dump it, the config values don’t get shuffled around and the git diff will only show the relevant changes.

                1. 1

                  Being unspecified doesn’t necessarily mean it can vary between runs. That would make it impure, at least unless you have a type system where you can express that order independence as a quotient type. This subtle impurity would wreak havoc on an abstraction like Workflow.