1. 23

  2. 53

    UUID keys shred index cache friendliness and price performance, so take the author’s claims of “high scale” with a nice chunk of salt.

    B-tree-based databases that have put any effort into optimizing their index layout will employ prefix encoding and suffix truncation to reduce the space consumed in the index layer significantly, even with wasteful high-entropy approaches like this, but leaf nodes are only able to employ prefix encoding, which is pretty low impact on high-entropy workloads like this. And leaf nodes will make up over 99% of all nodes in the tree in most cases.
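
    A rough way to see the leaf-node effect is to sort a batch of keys and measure how many leading bytes each key shares with its neighbour, which is about what prefix encoding can strip. This is only a sketch: the helper names are made up for illustration, `os.urandom(16)` stands in for v4 UUIDs, and no real database lays out its pages this simply.

```python
import os
import struct
from typing import List


def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that two keys have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def avg_shared_prefix(keys: List[bytes]) -> float:
    """Average prefix shared between neighbours once the keys are sorted."""
    ordered = sorted(keys)
    pairs = list(zip(ordered, ordered[1:]))
    return sum(shared_prefix_len(a, b) for a, b in pairs) / len(pairs)


N = 10_000
# 8-byte big-endian counters: neighbours differ only in the trailing bytes.
sequential_keys = [struct.pack(">Q", i) for i in range(N)]
# 16 random bytes per key, standing in for high-entropy UUIDs.
random_keys = [os.urandom(16) for _ in range(N)]

sequential_avg = avg_shared_prefix(sequential_keys)
random_avg = avg_shared_prefix(random_keys)
print(f"sequential: {sequential_avg:.2f} of 8 bytes shared")
print(f"random:     {random_avg:.2f} of 16 bytes shared")
```

    With these sizes the sequential keys share almost all of their 8 bytes with a neighbour, while the random keys share only the first byte or two of 16.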

    The main property the author is getting at (unique, low-coordination identifiers) can be achieved in a much better way that will play well with decent b-tree indexes by assigning a very short prefix to a local node, and having monotonic IDs generated locally. This plays nicely with prefix encoding and avoids redundant bloat which causes prices and latency to go up. Another approach is to just batch ID allocations to servers by having them claim an ID range periodically and then allocate from that. 16 bytes is a huge amount of data to waste for uniqueness. For context, with 8 bytes you can dish out nearly 6 billion unique IDs per second for 100 years. Since this article is talking about “high scale”, this approach is totally inappropriate for indexing dozens of terabytes and up from a storage efficiency perspective. High scale is when TCO matters.
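
    A minimal sketch of the node-prefix idea, assuming a 64-bit key split into a 10-bit node ID and a 54-bit local counter. The split, the class name, and the in-memory counter are all assumptions for illustration; a real generator would have to persist the counter (or claim ranges) so it never reuses values after a restart. The last lines check the 8-byte arithmetic.

```python
import threading

NODE_BITS = 10      # assumed split: room for 1024 nodes
COUNTER_BITS = 54   # the rest of a 64-bit key


class LocalIdGenerator:
    """Hands out 64-bit IDs of the form [node prefix | local counter]."""

    def __init__(self, node_id: int) -> None:
        if not 0 <= node_id < (1 << NODE_BITS):
            raise ValueError("node_id does not fit in the prefix")
        self._prefix = node_id << COUNTER_BITS
        self._next = 0  # in-memory only; not durable across restarts
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            counter = self._next
            if counter >= (1 << COUNTER_BITS):
                raise OverflowError("local counter space exhausted")
            self._next += 1
        return self._prefix | counter


gen = LocalIdGenerator(node_id=3)
first, second = gen.next_id(), gen.next_id()

# How fast can a full 8-byte space be consumed over a century?
SECONDS_PER_CENTURY = 100 * 365.25 * 24 * 60 * 60
ids_per_second = 2**64 / SECONDS_PER_CENTURY  # roughly 5.8e9
```

    IDs from one node are consecutive integers behind a shared prefix, so neighbouring index entries differ only in their low-order bytes.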

    You trade a lot of tools that can increase the robustness of your applications by going down this path at all though. The author touts avoiding the frontend having to wait on the database server as a nice property, and sometimes it is, but just know that you are getting into the territory of what a database does a lot better than you or your engineers probably can do: guaranteeing the consistency of contents through transactions, foreign-key constraints, etc… Presumably your business uses that data for something useful, so it can be advantageous to have it exist in a non-conflicted state from time to time.

    If you pay attention to things like index size, it’s not unreasonable to expect a round trip to a relational database in the same dc to complete in under 500-700 microseconds. That isn’t very much compared to how long your users are waiting anyway to establish a TLS connection with your web app and waiting for it to slap everything together and ship it across the internet. But users definitely notice when their shit disappears, and this general approach to data modeling is a very sharp knife that begs for data loss unless people who effectively know how databases achieve the desired correctness properties are the ones writing the logic, and it’s pretty rare that they are.

    1. 1

      I don’t buy his speed argument either. You’d still have to wait for the server to do data validation and permission checks (jeez, I certainly hope they’re not doing client-side data validation!) including cross-record consistency checks which can’t be done on the client. So the server may still respond with either “permission denied” or validation errors.

      1. 1

        Let me summarize what I think you said so I can respond to it.

        One concern is that 16 byte UUIDs are too big and would consume extra disk space (and therefore in-memory cache for those pages.) Since UUIDs’ prefixes are random, prefix compression won’t be very helpful. With sequential IDs, you get better latency because you can fit more data in the in-memory index and better disk utilization because each row takes less space.

        You’re also saying that while UUIDs save you from having to ask for a new ID, that’s only a small part of ensuring data consistency (especially between entities in your data model.) You still have to do more complex validation when manipulating the database anyway. The small latency savings of not asking another service to provide a UUID is not worth ignoring these other consistency checks. (UUIDs may lull you into a sense of not having to deal with these problems.)

        I think the points above are valid and make sense.

        UUIDs have other benefits though. Let’s say you have a streaming ingestion system. You need to stamp every incoming message with an ID. Relying on a centralized service to give you an ID per message is too slow; even requesting extents of message IDs requires the presence of the ID service at startup. In this way, UUIDs can increase decoupling between systems and improve availability. You may still have dependencies on other services for quota/permissions checking, but you don’t have to rely on the ID service.

        The ULID spec mentioned elsewhere seems to give you some of the best of both worlds. Its prefix is based on time, so it has good prefix compression (still not as good as auto-increment IDs). It can be generated anywhere so it can improve availability.

        So we sorta have a tradeoff between storage and availability here.
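
        For reference, a ULID-style ID can be sketched in a few lines: a 48-bit millisecond timestamp followed by 80 random bits, rendered as 26 Crockford base32 characters. The function name is made up, and this is a sketch rather than a compliant implementation; in particular it skips the spec's monotonicity rule for IDs generated within the same millisecond.

```python
import os
import time
from typing import Optional

# Crockford base32 alphabet (no I, L, O, U); it is in ascending ASCII order,
# so string comparison matches numeric comparison.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_like(timestamp_ms: Optional[int] = None) -> str:
    """Return a 26-character, time-prefixed, lexicographically sortable ID."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if not 0 <= timestamp_ms < (1 << 48):
        raise ValueError("timestamp does not fit in 48 bits")
    randomness = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (timestamp_ms << 80) | randomness           # 128 bits in total
    chars = []
    for _ in range(26):                                 # 26 * 5 bits >= 128 bits
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


earlier = ulid_like(1_000)
later = ulid_like(2_000)
```

        The first 10 characters carry the timestamp, so IDs created close together in time share a prefix and sort in creation order (to millisecond resolution).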

        1. 2

          ID generation can be batched to have amortized near-0 cost, and then you don’t need to burn all those extra bytes for ULIDs. Batch size can be chosen to allow for servers to prefetch ID batches long before they need them, giving a nice ops runway for addressing any issues with the generator before services start draining it.

          Thoughtful architecture saves money and reduces operator stress.
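
          A minimal sketch of the batching idea, assuming a central allocator that hands out disjoint ranges and a client that claims its next range once it runs low. Both class names and the `low_water` parameter are made up for illustration, the allocator is an in-process stand-in for what would really be a database row or a service call, and the prefetch here is synchronous rather than done in the background.

```python
import threading
from typing import Optional


class RangeAllocator:
    """Stand-in for the central ID service: hands out disjoint ranges."""

    def __init__(self, batch_size: int) -> None:
        self._batch_size = batch_size
        self._next_start = 0
        self._lock = threading.Lock()

    def claim(self) -> range:
        with self._lock:
            start = self._next_start
            self._next_start += self._batch_size
        return range(start, start + self._batch_size)


class BatchedIdClient:
    """Allocates IDs locally and prefetches the next range before running dry."""

    def __init__(self, allocator: RangeAllocator, low_water: int) -> None:
        self._allocator = allocator
        self._low_water = low_water
        self._current = allocator.claim()
        self._pos = 0
        self._spare: Optional[range] = None

    def next_id(self) -> int:
        remaining = len(self._current) - self._pos
        if self._spare is None and remaining <= self._low_water:
            # Claim the next range early so an allocator outage leaves a runway.
            self._spare = self._allocator.claim()
        if remaining == 0:
            self._current, self._spare, self._pos = self._spare, None, 0
        value = self._current[self._pos]
        self._pos += 1
        return value


allocator = RangeAllocator(batch_size=4)
server_a = BatchedIdClient(allocator, low_water=1)
server_b = BatchedIdClient(allocator, low_water=1)
ids_a = [server_a.next_id() for _ in range(6)]
ids_b = [server_b.next_id() for _ in range(6)]
```

          The allocator is touched once per batch rather than once per ID, and the size of the runway is set by `batch_size` and `low_water`.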

          1. 1

            I agree with what you said. Still, even prefetching batches introduces a dependency on the ID service. It’s possible to avoid this with UUIDs. YMMV with how much this actually helps if you have other reasons for a dependency on the ID service (perhaps it’s a database that stores configs you need.)

      2. 7

        Let’s earn money in the nearest decades, most systems anyway will die after 10-20 years because of rapid technology grow.

        And other jokes you can tell yourself.

        1. 5

          Universally unique identifier. Absolutely unique once you generate it ANYWHERE. It is possible because of using a large (128-bit) random

          Randomness doesn’t imply uniqueness.

          1. 7

            Not only that, but most v4 UUIDs use six bits to say, essentially, “I’m a v4 UUID,” meaning that only 122 of the bits consist of random data. That’s why most UUIDs you see have a third “group” that starts with a 4 and a fourth group that starts with 8, 9, A, or B.

            (Normally it would be excessively pedantic to point this out, but if you’re going to go to the trouble of writing an entire article about this, these missing bits of knowledge count against your credibility a little.)
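
            This is easy to check with Python's standard `uuid` module. The snippet only inspects the version and variant fields of one freshly generated v4 UUID; the variable names are arbitrary.

```python
import uuid

u = uuid.uuid4()
groups = str(u).split("-")      # five hex groups: 8-4-4-4-12

version_nibble = groups[2][0]   # first hex digit of the third group
variant_nibble = groups[3][0]   # first hex digit of the fourth group

# 4 bits encode the version and 2 bits encode the variant,
# leaving the rest of the 128 bits for random data.
random_bits = 128 - 4 - 2

print(u, version_nibble, variant_nibble, random_bits)
```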

            1. 2

              And the author is a CTO. This is frustrating.

              1. 1

                Randomness doesn’t guarantee uniqueness, but it is good enough for real world applications. (Assuming each client has access to a good RNG library and a good implementation of UUIDs. You can’t get the MAC address in JS, for example, if you want to use the version of UUIDs that incorporates MAC addresses.)
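
                To put a number on "good enough", here is the usual birthday-bound approximation, assuming 122 uniformly random bits per ID and a sound RNG. The function name is made up, and the formula is an approximation rather than an exact probability.

```python
import math


def collision_probability(n: int, random_bits: int = 122) -> float:
    """Approximate chance of at least one collision among n random IDs.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    space = 2 ** random_bits
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n * (n - 1) / (2 * space))


p_trillion = collision_probability(10**12)  # a trillion IDs
p_huge = collision_probability(2**61)       # about 2.3 quintillion IDs
```

                Under those assumptions a trillion IDs give a collision chance below one in a trillion, and it takes on the order of 2^61 IDs before a collision becomes a realistic prospect.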

              2. 2

                I prefer https://github.com/ulid/spec instead. It is sortable and does not wreck indices.

                1. 2

                  Having worked in a company that used UUID everywhere, I would like to claim that UUIDs are over-rated.

                  The only usable UUID version is 4.

                  UUID v1 has a bunch of issues; MAC addresses are not that unique in the wild. A lot of implementations don’t use monotonic clocks so time might go backwards. It also allows enumeration attacks which is not a problem in itself but because they look like UUIDv4 there are more chances that it will happen (vs auto-increment IDs which are clearly enumerable).

                  v2 and v3 have other issues and are generally considered deprecated.

                  Like @bdesham said below, you lose a few of the random bits to encoding the version. It also involves bit-shifting the random number into the various fields.

                  A last issue is that since all UUIDs have the same representation, it can be hard to locate the UUID in a system that uses them everywhere. I remember users being confused and using the API_KEY value instead of the API_SECRET because both were generated as UUIDs.

                  Might as well just use base64(rand(128)) with maybe an embedded unique key name in front.
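
                  A minimal sketch of that idea with Python's standard library, assuming 128 random bits from `secrets` and a plain-text prefix naming what the value is. The function name, the prefix strings, and the choice of the URL-safe base64 variant without padding are all assumptions for illustration.

```python
import base64
import secrets


def make_token(kind: str) -> str:
    """Return '<kind>_<base64 of 128 random bits>'."""
    raw = secrets.token_bytes(16)  # 16 bytes = 128 random bits
    # URL-safe base64 with the trailing '=' padding dropped (22 characters).
    body = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return f"{kind}_{body}"


api_key = make_token("api_key")
api_secret = make_token("api_secret")
```

                  Because the prefix says what the value is, a key and a secret can no longer be mistaken for each other at a glance, and all 128 bits stay random.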

                  1. 1

                    It’s more “Why you shouldn’t use artificial (auto-incrementing) primary keys” and, yeah, that’s true. But UUIDs are just another artificial primary key (albeit with fewer problems than autoinc-ints) and not necessarily the answer if you already have a primary key inherent in your data.