1. 18

  2. 7

    I’ve heard lots of anecdotes and admonishments around UUID primary keys but little evidence.

    Pros:

    • you won’t run out of them any time soon
    • merging rows is easy
    • obscurity - n+1 is pretty easy to guess, after all

    Cons:

    • B+ trees and random keys go together like chalk and cheese
    • huuuuuge
    • you probably have a natural key you can use instead
    • sorting? nope (unless you use v1 UUIDs or something)

    Does anyone know of actual best practices here?

    1. 6

      With respect to sorting, I always stored a UUID with a timestamp. This has the added benefit of being much more informative (and good for auditing purposes).

      UUIDs are far superior to integers because the key no longer has to be generated by the database. This makes it much easier for external agents to generate data, send it to the database, and essentially keep a pointer to it without negotiating in some crazy way with the DB. And in many cases, a UUID is a “natural” key.
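
      For what it’s worth, a minimal sketch of that pattern (Python standard library plus an in-memory SQLite table; the `events` table and its columns are invented for illustration): the client mints the key itself and keeps it, with no round-trip to the database to learn the new row’s id.

      ```python
      import sqlite3
      import uuid
      from datetime import datetime, timezone

      conn = sqlite3.connect(":memory:")
      conn.execute(
          "CREATE TABLE events (id TEXT PRIMARY KEY, created_at TEXT NOT NULL, payload TEXT)"
      )

      # The client generates the key itself; it never has to ask the database
      # what id the new row was assigned.
      event_id = uuid.uuid4()
      created_at = datetime.now(timezone.utc)

      conn.execute(
          "INSERT INTO events (id, created_at, payload) VALUES (?, ?, ?)",
          (str(event_id), created_at.isoformat(), "hello"),
      )

      # event_id is now a stable pointer to the row, and the separate
      # created_at column is what gives you sane time-ordered scans.
      print(event_id, created_at)
      ```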

      And natural keys are great, if you can get ‘em. A lot of the time they’re messy.

      Edit: Just want to stress that in my experience, even for moderately sized databases with only one server, UUIDs have been demonstrably better to work with than integer keys.

      1. 2

        With a BRIN index on that timestamp, this method sounds brilliant.

      2. 3

        I agree with all of your pros and cons and ultimately the answer has to be with regard to the performance needs and characteristics of your database, which are more complicated than just the characteristics of the underlying engine.

        I’d nitpick that “you probably have a natural key you can use instead” applies equally to all synthetic keys. With regard to choosing natural vs synthetic, in every situation I’ve encountered, it was a very clear choice - the wrong one would have been dramatically inappropriate. I haven’t particularly specialized in databases, though, and would defer to anyone who feels they’re an expert.

        1. 2

          Your trade-offs are correct. UUIDs are bad for performance, but many people don’t actually need to care about a performance impact at this level. They let you avoid managing an ID generator across databases (or doing autoincrement modulo the number of shards, or something like that), but they will make your databases slower, and you give up the advantages of a lexicographically ordered keyspace: no more efficient scans over causally ordered data. You CAN have a timestamp be part of your UUID (UUID v1, which MySQL uses by default), so you are basically relying on NTP + a wrapping counter + a bit of entropy, which may be enough to give you usable scan semantics depending on your workload.
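
          For illustration, a small sketch using only Python’s standard uuid module: a v1 UUID really does embed a wall-clock timestamp that you can recover. (Whether the key also *sorts* by that timestamp depends on the byte layout; MySQL 8’s UUID_TO_BIN(uuid, 1) reorders the time fields for exactly this reason.)

          ```python
          import uuid
          from datetime import datetime, timezone

          # Offset between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01),
          # expressed in 100-nanosecond ticks.
          UUID_EPOCH_OFFSET = 0x01B21DD213814000

          def uuid1_timestamp(u: uuid.UUID) -> datetime:
              """Recover the wall-clock time embedded in a version-1 UUID."""
              if u.version != 1:
                  raise ValueError("only UUID v1 embeds a timestamp")
              unix_ticks = u.time - UUID_EPOCH_OFFSET  # 100 ns ticks since the Unix epoch
              return datetime.fromtimestamp(unix_ticks / 1e7, tz=timezone.utc)

          u = uuid.uuid1()
          print(u, "->", uuid1_timestamp(u))
          ```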

          (Skip this chunk if you don’t care how databases work.) Databases like MySQL split their storage files into chunks that each contain a subset of rows (called pages). When mutations occur, the update is first recorded in a persistent log, which is written sequentially and is therefore fast. But a sequential log is not a constant-time lookup structure, so the crash-recovery log has to be supplemented by a persistent, fast lookup structure if you want a dataset bigger than memory (a B+ tree in many classic cases, still awesome for read-heavy stuff; LSM or fractal trees in newer systems, depending on workload characteristics). When you update a row, the page for that section of the tree needs to be updated. Many databases buffer writes to pages so that fewer total writes occur (like Nagle’s algorithm in TCP: a bus doesn’t slam on the gas when the first passenger jumps on, but waits for a few more to bump up the utility:work ratio). The persistent trees are laid out so that you can efficiently scan through rows in lexicographic order of their primary key.

          • the longer the primary key => the more bytes need to be compared to find matching rows + the bigger the database + the less of it fits in memory + the more cache thrashing occurs. UUIDs are big, and can dramatically increase the number of bytes that need to be compared before a match can be determined.
          • as you mentioned, random keys can be bad for persistent tree implementations. random keys => more pages being written to because locality has been thrown away => more pages end up in the dirty page buffer => fewer utility:work buffering optimizations can occur => total write throughput goes down (this doesn’t matter quite as much with UUID v1, depending on your workload). See the toy sketch after this list.
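
          To make the locality point concrete, here’s a toy model (not any real storage engine: fixed-size leaf pages, uniformly random placement for random keys) comparing how many pages a batch of inserts dirties with sequential versus random primary keys:

          ```python
          import random

          PAGE_SIZE = 128           # rows per leaf page in this toy model
          TABLE_ROWS = 1_000_000    # rows already in the table
          BATCH = 100_000           # new rows to insert

          existing_pages = TABLE_ROWS // PAGE_SIZE

          # Sequential (autoincrement-style) keys: every insert lands at the right-hand
          # edge of the tree, so the batch only dirties the pages it fills up.
          sequential_dirty = BATCH // PAGE_SIZE

          # Random (uuid4-style) keys: each insert lands in an essentially random leaf
          # page, so most pages in the table end up dirty and must be rewritten.
          random_dirty = len({random.randrange(existing_pages) for _ in range(BATCH)})

          print(f"sequential keys dirty ~{sequential_dirty} pages")
          print(f"random keys dirty ~{random_dirty} pages")
          ```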

          here’s a percona blog post on the subject

        2. 3

          RE: live schema updates, I always got a kick out of how pt-online-schema-change can handle alterations for big tables without blocking.

          1. 2

            I mean, owen already explained the downsides of UUIDs, but I’m surprised people don’t react to “over-normalization”; that just makes me cringe. I agree you shouldn’t over-normalize (going NF7), but BCNF should be the target IMHO.

            The example given is even worse: “In the above case categories could very easily just be an array of varchar fields on a post”. What does that mean? Comma-separated values in a VARCHAR field? I don’t understand. Or a Postgres array? In my opinion both are bad: a comma-separated VARCHAR field prevents indexing and makes it hard to query for “all posts in category X”, and the Postgres array just leads to duplication. And a JOIN with the right indexes is pretty fast and cheap.
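
            For what it’s worth, a self-contained sketch (SQLite for portability; the post/category table names are invented) of the join-table version being argued for, where “all posts in category X” is just an indexed join:

            ```python
            import sqlite3

            conn = sqlite3.connect(":memory:")
            conn.executescript("""
                CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
                CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
                CREATE TABLE post_categories (
                    post_id     INTEGER NOT NULL REFERENCES posts(id),
                    category_id INTEGER NOT NULL REFERENCES categories(id),
                    PRIMARY KEY (post_id, category_id)
                );
                -- The reverse index is what makes "all posts in category X" cheap.
                CREATE INDEX idx_post_categories_category ON post_categories (category_id);

                INSERT INTO posts VALUES (1, 'UUID primary keys'), (2, 'BRIN indexes');
                INSERT INTO categories VALUES (1, 'databases'), (2, 'performance');
                INSERT INTO post_categories VALUES (1, 1), (1, 2), (2, 1);
            """)

            # "All posts in category X" is a straightforward indexed join.
            rows = conn.execute("""
                SELECT p.title
                FROM posts p
                JOIN post_categories pc ON pc.post_id = p.id
                JOIN categories c       ON c.id = pc.category_id
                WHERE c.name = ?
            """, ("databases",)).fetchall()
            print(rows)
            ```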

            1. 1

              I have to admit that I’ve read enough articles about databases written for an audience that self-defines as not knowing better that I find them hard to concentrate through. :) Properly justifying opinions about database practices tends to require a lot more detail than this sort of thing goes into. As such, I didn’t really react to this one.

              I totally agree with you about BCNF. Is NF7 a real thing, or is it a joke about an imagined progression past 3NF?

              1. 2

                It was a joke about an imagined progression past 6NF.

            2. 2

              It’s funny that he didn’t think of not indexing data, which in my opinion is beginner mistake number zero when coming to databases.

              1. 1

                All good stuff in the article. “Not storing images” cannot be stressed enough. I had to deal with a database that stored lots of binary data, and it was a total nightmare. The size gets out of hand in no time, and extracting the data is difficult without a programmatic interface. Furthermore, browsing and searching the content is pretty much impossible.

                Keep “files” out of databases and maintain your sanity.
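
                A minimal sketch of the “files stay out of the database” pattern, under assumed names (`MEDIA_ROOT`, a `files` table): write the bytes to a content-addressed path on disk or object storage, and store only the path and a hash in the database.

                ```python
                import hashlib
                import sqlite3
                from pathlib import Path

                MEDIA_ROOT = Path("media")  # hypothetical blob store root (disk, NFS, S3 mount, ...)

                def store_file(conn: sqlite3.Connection, filename: str, data: bytes) -> str:
                    """Write the bytes outside the database and record only a reference."""
                    digest = hashlib.sha256(data).hexdigest()
                    path = MEDIA_ROOT / digest[:2] / digest
                    path.parent.mkdir(parents=True, exist_ok=True)
                    path.write_bytes(data)
                    conn.execute(
                        "INSERT INTO files (name, sha256, path, size) VALUES (?, ?, ?, ?)",
                        (filename, digest, str(path), len(data)),
                    )
                    return digest

                conn = sqlite3.connect(":memory:")
                conn.execute(
                    "CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, sha256 TEXT UNIQUE, path TEXT, size INTEGER)"
                )
                store_file(conn, "cat.png", b"\x89PNG...")
                print(conn.execute("SELECT name, sha256, size FROM files").fetchall())
                ```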