1. 16
  1.  

  2. 8

    UUID keys can be really expensive for some workloads, and the chance of performance degradation really shouldn’t be treated with such a relaxed attitude.

    Most databases use some form of lexicographic range for storing items. Even high performance hash table architectures may sort their items on disk due to the extreme compression you can achieve when your keys share a significant amount of information with their neighbors. Modern b-trees will use prefix encoding to only store partial keys on each tree node, and suffix truncation is used to chop off extra bytes on split points that get percolated up to indexes, resulting in a lot of savings. sled goes so far as to completely skip storing any key material other than a node’s boundary if keys follow a fixed stride and they are constant length, as any key can be losslessly recomputed given the boundary key and the distance. This also allows for completely skipping interpolation search in favor of a branch-free lookup, avoiding a significant portion of the CPU cycles usually spent on searching through children in the read path of many indexes.

    UUIDs destroy all of these optimizations by injecting entropy into various locations of an ID. This isn’t just a size issue. Longer keys mean that you have to use more tree nodes overall to store a data set, which means more index nodes are required, which may mean that the index needs extra levels of indirection that any operation needs to pass through before reaching the relevant node. More entropy means that the RAM you’ve spent money on for keeping recently accessed nodes in cache is less cost-effective. You’ll need to spend more money on RAM if you want to keep similar hit ratios, but due to the key entropy you’re likely to need to traverse more index nodes on every access, so you may not necessarily be able to throw money at the problem in this straightforward way to resolve the performance hit.

    It’s fine if you really don’t care about performance, or if low-coordination is worth sacrificing storage price performance for, but measure carefully and don’t brush off the chance of this really messing up metrics that matter otherwise.

    1. 4

      There are a few misconceptions in the article. It tries to differentiate between integer keys and UUID keys, but a UUID is a 128-bit integer, so it’s conflating two things:

      • Identifier space (16-, 32-, 64-, and 128-bit)
      • Identifier allocation policy (sequential, random, other)

      UUIDs define several allocation policies. The simplest is completely random, but Version 1 is the MAC address concatenated with a 60-bit nanoseconds time. That works well for your purposes: on a single database machine, every UUID will have 48 bits the same, little entropy in the high bits of the time stamp, and only a single value for most low bits (and so you can design your tree on the assumption that conflicts are rare and have a less-efficient secondary structure for the few times they occur). That; however, reduces the claim in the article about unpredictable values in URLs, because now if you know the second that something was generated then you have a 30-bit space to attack to find it, which is nontrivial (and probably easy to detect) but still quite feasible, and trivial if you ever accidentally do something that allows an offline attack.

      A fully-random 128-bit UUID has all of the problems that you describe but so would a fully random 64-bit identifier.

      1. 2

        Just like we have salts for passwords, we can expose public identifiers that have used some localized secret to deterministically map from a difficult to enumerate space into a compact monotonic fixed length keyspace.

        1. 3

          I’ve had doing this in the back of my mind for a little while now, this article inspired me to do some googling just now. This looks nice as a glance and is implemented in a ton of languages: https://hashids.org/. Lots of people talking about it being insecure, but as far as I can tell it seems fine for obfuscating ids.

      2. 3

        I’ve used GUIDs for keys in SQL Server, in the days when it was purported to require ‘comb GUID’ allocation in order to save performance (avoid index fragmentation). Performance didn’t suffer anyway, with databases of only a few gigs. We tested ints vs GUIDs all over the place and didn’t find a single instance of a measurable slowdown.

        I’m sure it would have been possible to get to index sizes where this made a measurable difference in speed, or cost us money or something, but I agree with your suggestion to measure carefully, though I’d also add that don’t ‘optimise’ by avoiding GUIDs/UUIDs before doing so, as you may be leaving the benefits behind based on what you’ve heard rather than what you’ve tested for yourself.

        1. 1

          More entropy means that the RAM you’ve spent money on for keeping recently accessed nodes in cache is less cost-effective.

          Could you quantify these in a real world situation?

          We use something similar to UUIDs but since the number of keys are limited to < 10_000 we never experienced any problems. I think it also matters what you do with these UUID keys.

          1. 3

            UUIDs trade more space for less coordination. More space consumed by the unique key means less space that is available on a system for potentially valuable data. You are spending money keeping fluff in memory and persisted, and in some cases this is less problematic than the coordination required to use less fluff, but this trade-off is one that exists.

            10k keys is not a size that puts too many constraints on an architecture, unless you have some situation where you’re changing many things per second and it maps to a very dense, highly contended space. But if your workload is one where you need to decide how many machines you will be using to provide enough resources, this kind of consideration starts to have significant TCO implications.

            1. 2

              On the other hand UUIDs fill the keyspace more uniformly, so you might have different easier at the point you need to start distributing the system to multiple nodes (I suppose less worry in having key collisions also if you want to fully utilize the write throughput of the cluster) – not an expert, but I suppose it boils down to the age old “it depends”. Not to downplay the boost you might get from getting the keys fit into CPU-happy chunks for faster processing, or just the lower cost of saving thin rows. It’s turtles all the way down :-)

              1. 1

                But if your workload is one where you need to decide how many machines you will be using to provide enough resources, this kind of consideration starts to have significant TCO implications.

                Absolutely. I usually save 30% TCO for companies with optimization projects. These project very rarely revolve around what is stored in a database and how long is the text field. There is so much inefficiency that you got in an average infrastructure that using UUIDs are usually the least of the problems.

          2. 8

            You can get this with a bit less effort if you use mix phx.new --binary-id as described in the docs.