We switched to using UUIDs for one of our projects recently - you’ll likely run into issues if using postgres once you start using more unusual index types, such as GIN or GIST indexes. This is because the uuid type is missing the gist and gin operator classes, so postgresql doesn’t really know what to do here.
For example to index a uuid array field requires the gin operator class for the uuid type, so we ended up having to cast the uuid to a text and index that instead, however the uuid type doesn’t support casting to the text type so we had to create a custom function for this… you get the idea.
It can be quite restrictive when trying to use other features of your database when your primary key type doesn’t quite behave like a first class citizen.
I think the underselling of the performance penalties of using UUIDs for keys makes this article borderline misleading. You are taking about massive increase in index storage, and at times extraordinary cost on complex queries across multiple tables (all those jumbo UUIDs add up), you lose clustering, you lose consistent semantics external tools and systems often expect (read: some search systems, /u/ephess pointed out even gist/gin indexes!).
My main issue with it – this is a classic “works great in dev, explodes in prod” solution. As systems grow, you tend to grow in both data and query complexity, both of which often will bite you with UUIDs in non-linear ways.
I’ve been using UUIDs in production for some time now, with no problems. I would advise against them in tables with lots of rows (let’s say more than 100M) but that’s not that common.
Well – our production experience differs. I have helped organizations “walk back” UUID primary keys (at non-trivial cost) 4 times in the last 7 years. In each case, they were selected by a developer, not a DBA during the early dev process for convenience and “safety” without a real understanding of the costs when you put UUIDs in a X way join… or they get into complex stored procs. The scaling is poor because of the way the DB engine leans on those primary keys.
I am not opposed to UUIDs, I use them all over the place, but NOT in the primary key the DB engine needs to use to do work. You can get almost most of the benefits of UUIDs with none of the overhead by just having them in a separate field.
I have a similar experience to robertmeta (although, most of my databases have had a fair amount of data in them). There are two types of indices: machine indices, and human indices. It’s best to leave the machine indices alone (the ‘id’ column in your table), and to make up whatever other indices you need to feel comfortable. In the cases where you mix these two worlds you often end up with hard-to-reason-about performance issues and a lack of some rare but still useful tools (range queries on id).
I’m interested in the phrase
Relational DBs are very good at storing a relatively small amount of data, and are not well suited for immutable databases.
What is well-suited for an immutable database then?
I suspect the author is referring to something like Datomic or Event Store
I’ve used Clojure & Datomic in production, I won’t use Datomic ever again and I’m mostly Haskell now.
I’d rather Goldman-Sachs style immutable'ish data in PostgreSQL (valid_from, valid_to).
Datomic is not a production-ready product.
pkill -9 -f java
We begged them to give us a way to control the thread pool, set a timeout, let us timeout ourselves and kill the query. We offered them money via contract to be paid to fix the problem - they refused. This is the tip of the iceberg.
There is no project, company, or individual I will ever recommend Datomic to. That doesn’t mean it hasn’t worked for some people, but it does mean those people either haven’t used it for very long or have been very lucky to avoid the pitfalls.
And this isn’t like, “I was using Riak inappropriately and then I was unhappy”, we were using Datomic for a problem thoroughly within its remit.
Got a link on that Goldman-Sachs style Postgres? :)
Thanks for the opinion on Datomic. I had to spend a couple weeks dissuading a previous employer from it, but I didn’t have direct experience to go on, just a sense of unease that their promises had insufficient technical detail. I ultimately got through by pointing out that they’d be lucky if their business model succeeded well enough to have the problems for which Datomic claims an advantage over Postgres.
I’ve already made this point to clementd, but if you can’t eat the perf cost of UUIDs but would like to distribute ID generation, still get unique ids, and get provenance + timestamps to boot, I strongly recommend looking into Snowflake/Flake style unique ids.
My version in Haskell, used in production a couple places
I think k-ordered unique ids with timestamps + provenance are a pretty happy “medium” when you don’t need to share primary keys with other entities, need something that indexes better than UUIDs, but also don’t want a global singleton / synchronization for primary key generation.
If you need to share objects with third parties who aren’t exclusively talking to you about those objects, UUIDs are still useful referentially unless you want to make a natural key of (Org id x k-ordered unique id).
It is also recommended against using auto incrementing ids on S3 (and I think DynamoDB) to best distribute load across partitions. In this case UUIDs aren’t necessary: you can simply reverse the order of the digits in the auto incrementing id.
Additionally, I’ve always felt uncomfortable with treating UUIDs as unique because the chance of a collision is so small. Surely if you want something to be unique (as in the case of a primary key in the database) you’d have to deal with collision resolution in the code, no matter how small? I wouldn’t feel conformable with the code otherwise :(
By that same measure you shouldn’t trust cryptography.
UUIDs also have non-random variants, though. They just require coordination as you scale.
After posting I gave it lots more thought (and ready the wiki link above) and am actually warming up to the idea of UUIDs. Even if a collision does occur, if it’s set as the primary key on your database, the error handling is probably there already.
Yeah - I’ve used UUIDs as primary keys, actually, in Postgres. I’m definitely guilty of being a programmer rather than a DBA (… is there a word for “non-DBA programmer”? Surely, DBA is programming), and if I’d had the advice that they work badly with GIN and GIST indices, I’d have avoided them. I’m sure the docs said it somewhere…
But they worked decently well. At that company’s scale, there was no performance issue. It was probably less than 1M rows in the largest table, for what that’s worth.
Additionally, I’ve always felt uncomfortable with treating UUIDs as unique because the chance of a collision is so small. Surely if you want something to be unique (as in the case of a primary key in the database) you’d have to deal with collision resolution in the code, no matter how small?
Not to dog-pile, but another point to consider: the filesystem that stores your database is likely using UUIDs internally to represent chunks/extents.
Regarding the points in “Why serial IDs can be a problem”: Wouldn’t having proper authentication prevent users from cycling through ids to get records? What the author proposes is simply security by obscurity.
No, it’s not security by obscurity. Understanding how the app generates IDs will not allow anyone to sequence them.
And most of the time “proper authentication” looks like “knows HTTP”.
If you have a public facing API, then automated scrapes are something you have to handle, regardless of your underlying schema design. To have failures in the design and operation of the API drive the underlying schema into unperformant and tricky workarounds is, in my personal opinion, not optimal.
There are other concerns. Like competitors guessing how many users you have from your ids, bad guys using how many of certain objects exist to tune an attack or using ids associated with objects to guess when they were created. It’s pretty rich.
Or knowing that a row has been deleted, because there’s a gap in the numbering. There are very few scenarios where that’s a big deal, but…
oh for sure. On the other hand, it’s pretty easy to have two indices, and use one for one purpose, and another for another purpose. Making your primary key a UUID has subtle rich tradeoffs as well.