1. 38

  2. 15

    I find it really odd that UUIDs didn’t come up here.

    They solve the same problems without the drawback that they can be accidentally used as temporary values when iterating or doing maths.

    1. 17

      That’s a start, but using a common ID type besides integer or string only prevents some of the errors. You can still inadvertently pass a Person ID to an operation expecting a Product ID, unless they are distinct in your type system. I like to pair each table with a one-field struct type holding its ID value. Some languages make this zero cost at runtime for a nice compile time benefit.

      Types are very good when they give you guarantees about data, but even better when they give you guarantees about the problem domain.
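
      As a sketch of that pattern (names here are illustrative, using Python's frozen dataclasses to get cheap one-field wrapper types):

```python
from dataclasses import dataclass

# One wrapper type per table, so a PersonId can never be passed
# where a ProductId is expected. (Type names are invented.)
@dataclass(frozen=True)
class PersonId:
    value: int

@dataclass(frozen=True)
class ProductId:
    value: int

def fetch_product(product_id):
    # A static type checker rejects a PersonId here; this runtime
    # check catches the same mistake at untyped call sites.
    if not isinstance(product_id, ProductId):
        raise TypeError(f"expected ProductId, got {type(product_id).__name__}")
    return f"product #{product_id.value}"
```

      Here fetch_product(PersonId(42)) fails loudly, even though both types wrap a plain int.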

      1. 14

        You can still inadvertently pass a Person ID to an operation expecting a Product ID, unless they are distinct in your type system

        You can, but if there actually is a collision, then you have much worse problems on your hands.

        1. 1

          Or a hugely unlikely, but not impossible, event happened once… either way, you’d have to be even more unlucky, or have a bigger problem, to get a collision within the same namespace.

          1. 1

            It’s not either-or. You can make your types mutually exclusive at compile time and your values collision-resistant at runtime too.

          2. 3

            That’s a good idea 💡! I’ve also seen the issue you mentioned solved by prepending the table name to the id (“person:12345”), or by using opaque type aliases (if the language’s type system supports this).
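
            A minimal sketch of the prefix approach (helper names are made up):

```python
def make_id(table, value):
    # Embed the table name so the ID is self-describing: "person:12345".
    return f"{table}:{value}"

def parse_id(expected_table, raw):
    # Refuse to treat a "person:..." ID as anything but a person.
    table, _, value = raw.partition(":")
    if table != expected_table:
        raise ValueError(f"expected a {expected_table} id, got {raw!r}")
    return int(value)
```

            The parse step is where the safety lives: a “person:12345” handed to product code fails immediately instead of fetching the wrong row.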

            1. 5

              Embedding the type along with the identifier is such a good idea that I’ve surprised myself with how much I used to scoff at this when looking at IDs on paper communications, e.g. ‘CST12345’ having a ‘CST’ prefix so that it’s unambiguously (for the organisation) a customer identifier rather than a widget or a date or … whatever.

              Within a relational database, in normalised tables, the ‘type’ of a row is effectively identified by the table name. With other forms of data storage, there are other approaches. With CSV, for example, the column name serves this purpose. In a structured document, e.g. JSON, we might have:

              "customer-id": 12345
              

              It’s still far too easy for confusion to happen, though.

              In data storage, we might do this:

              "customers": [
                  { "id": 42, "name": "Alice", "ref": 99 },
                  { "id": 34, "name": "Bob", "ref": 123 }
              ]
              

              ref … to what? It’s something that has an ID, and that ID is stored as an integer, but:

              1. We don’t know what it is, so it’s easy to make a wrong assumption.
              2. It’s an integer, so if we get it wrong, we might not even notice until catastrophe occurs.

              In code, too, we can do (pseudocode here):

              function update_customer(id, ref, from_date):
                  save(id = id, address_ref = ref, from_date = from_date)
              

              Again we’ve made an assumption, and if ‘ref’ wasn’t supposed to be an address_ref, we have a catastrophe. Instead of recording when the customer started the job given by ‘ref’, we’ve set when they moved into the address given by ‘ref’, which was not the intention.

              Does this mean that we should lean on the type system? In some languages, this works really well. Apologies because I haven’t used C++ for a long time, but I seem to remember you can do this:

              void update_customer(Customer customer, Job job, Date from_date) {
                  save_job_details(customer.id, job.id, from_date);
              }
              

              Here we’ve fixed the issue of understanding what we’ve been passed, but there’s still a boundary here between the code and the underlying data, where we can easily get things confused, because the data is stored without strict types - customer ID is stored as a number, job ID is stored as a number - even date is (let’s assume here) stored as a number - and they could be switched around without the code or the data storage caring.

              We could have a storage system that enforced the types in the same way as the type system we’ve built in the code, but I don’t think I’ve ever used a system like this. Once I’m working with data storage or transmission, I’m mapping to/from a structure where the highest level enforced types are only as sophisticated as integer vs bigger integer vs date vs text, for example.

              We’ve gone some way to helping ourselves avoid a particular kind of catastrophe by using GUID/UUIDs as unique identifiers, so even if we get confused, at least the chance of mistaken identity is effectively zero.

              Why doesn’t an XML/JSON schema fix the problem? Because it only works as well as the mapping code. I’ve noticed there’s been a trend away from schemas as JSON’s become the lingua franca. I used to really like them in the XML world because they provided a little safety ‘for free’ and then building some checks on top was much less work.

              Why don’t ORMs solve the problem? They do help, but I think we’ve all seen plenty of positive and negative aspects of using ORMs, so they aren’t ubiquitous, especially in the case where they are hooked into the types through application code, and of course the world isn’t using relational databases for everything any more.

              Back to labelling identifiers with their types: it’s a safety belt, it adds boilerplate, and we often don’t like ‘unnecessary’ fluff, but with a little tooling support, perhaps it’s one of the better options we have for avoiding type confusion once data leaves the confines of code that uses high-level types for safety.

          3. 13

            They don’t only solve that problem. They also solve the problem of generating identifiers in a distributed manner without collisions. And because of this they are rather huge numbers, which makes them a very bad choice for a database index.

            I don’t quite agree with this article, in the sense that the examples provided by the author are pretty well-known anti-patterns. The reason you probably shouldn’t do it is that you shouldn’t do it at all: an incremental integer primary key is not meant to be used as a public reference for external systems. If you need another unique identifier, add one as needed; you can generate a large random integer, for example. I don’t think anyone would write a for loop and use a loop counter for database lookups. The point of having a database is to NOT do such things.

            1. 7

              I’ve mapped “UUIDs are a bad idea for database keys” to a “needs more research” thunk in my head. I have no idea if it’s true or not. One place it was discussed: https://news.ycombinator.com/item?id=14523523.

              1. 4

                They don’t only solve that problem. They also solve the problem of generating identifiers in a distributed manner without collisions. And because of this they are rather huge numbers, which makes them a very bad choice for a database index.

                Why? Indexes aren’t using the integers directly. Binary search – and b-trees – don’t care how big a value is. Neither do hash indexes. UUIDs aren’t ordered, but I doubt that user ids would have a huge locality effect – or that you’d have enough users that the index pages would fall out of cache.

                What’s the concern? Storage size?

                1. 2

                  So, from personal experience: I worked at a place using MSSQL, and they had GUIDs as the clustered index (CI). Aside from the fact that these were stored as the wrong type (MSSQL has a UUID type which would have handled them a bit more space-efficiently), the other problem was table bloating.

                  See, when you have an ordered index and you insert a user into the table, the row just gets appended to the last page at the end of the table. Then the indexes get updated, and those entries might have to be inserted in the middle of a page somewhere in the middle of the index, but because index entries are usually smaller than table records this isn’t a massive deal.

                  Now if your clustered index isn’t ordered, you eventually end up with an incredibly sparse table: consider a completely defragmented and compacted table; any insert is guaranteed to land somewhere in the middle of the table, in the middle of a page. This causes a page split (which means you’ve now got two pages). This keeps happening until your tables all take up twice as much space.

                  And when your key type is a 36-byte fixed string, and everything refers to everything, and your keys end up included as index elements for various reasons, this means that your database is now growing at an enormous rate all the time.

                  It wasn’t a pretty picture, especially since we were trying to use the free tier of MSSQL, which has a 10 GiB per-database limit.

                  When the solution of splitting the database up into multiple databases (which brought in a whole slew of problems) was eventually implemented after many of my protests we tried to change out all the CIs in the database. GUIDs would still be used for references but an integer RowID was being used as the CI.

                  Databases grew slower and things actually performed better.

                  Although the above is an anecdote.

              2. 8

                When running my code in ‘test mode’, I adjusted the autoincrement sequences in the schema so that none of them generate overlapping values (so that if I use the wrong table ID I can’t accidentally find a different record). Not a perfect system, but a very cheap one to implement that catches 90% of my mistakes.
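
                In SQLite, for example, the same trick can be done by seeding sqlite_sequence with disjoint ranges per table (the table names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY AUTOINCREMENT, item TEXT);
""")
# Seed each table's sequence with a disjoint range, so a people ID
# accidentally used against orders finds nothing rather than the wrong row.
conn.execute("INSERT INTO sqlite_sequence (name, seq) VALUES ('people', 1000000)")
conn.execute("INSERT INTO sqlite_sequence (name, seq) VALUES ('orders', 2000000)")

conn.execute("INSERT INTO people (name) VALUES ('Alice')")
conn.execute("INSERT INTO orders (item) VALUES ('widget')")
person_id = conn.execute("SELECT id FROM people").fetchone()[0]
order_id = conn.execute("SELECT id FROM orders").fetchone()[0]
```

                SQLite picks the next AUTOINCREMENT value as one more than the stored sequence, so people IDs start at 1000001 and orders at 2000001, and the ranges never overlap.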

                1. 1

                  Indeed, this is a great idea to catch more bugs in an existing system that is using integer IDs!

                2. 4

                    Or you simply use a UserId for the user id and be done, because it completely solves the issue described in the article (hint: it’s not a collision/uniqueness problem).

                  1. 1

                    hint: it’s not a collision/uniqueness problem

                    Exactly. UUIDs aren’t a bad solution, but they aren’t targeting the actual problem being discussed, even if they happen to work well for it.

                3. 10

                  I always loved how Stripe’s REST API handles opaque IDs as a way to prevent confusion. While the Backwards-compatible changes documentation calls out “adding or removing fixed prefixes” as a backwards-compatible change, you’ll notice opaque IDs generated by Stripe usually include a short, human-readable prefix describing the ID. Some examples:

                  • Publishable API key: pk_test_TYooMQauvdEDq54NiTphI7jx
                  • Secret API key: sk_test_4eC39HqLyjWDarjtT1zdp7dc
                  • Charge ID: ch_1IVMF02eZvKYlo2CyPTlPI5a
                  • Balance Transaction ID: txn_1032HU2eZvKYlo2CEPtcnUvl
                  • Payment Method ID: card_19yUNL2eZvKYlo2CNGsN6EWH

                  You’re not meant to rely on these within your own code (I think some of the other suggestions in this post around strict type systems are far more applicable in that case), but they’re brilliant sanity-checks while running through a debugger’s stack view to make sure you’ve not accidentally referenced the wrong variable. Doubly so since Stripe’s documentation provides examples of the fixed prefixes for their API responses.
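
                  Generating this style of ID for your own systems is cheap; a rough sketch (the prefix convention and body length are arbitrary choices, not Stripe’s actual scheme):

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits

def opaque_id(prefix, length=24):
    # Short human-readable prefix plus a random alphanumeric body,
    # in the spirit of Stripe's "ch_..." / "txn_..." identifiers.
    body = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"{prefix}_{body}"

charge_id = opaque_id("ch")
```

                  The prefix costs a few bytes and buys you exactly the debugger-level sanity check described above.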

                  1. 3

                    This is nice, and probably works well with the more “dynamic” languages used on the web. I wonder if they use this representation in the database as well, or if this is somehow “decoded” somewhere, and if it is, what they use as an internal representation.

                  2. 6

                    This is why using some form of class/object system to represent… “objects” works really fucking well, for the vast majority of systems the vast majority of us will ever work on.

                    UUIDs, large starting integer offsets, etc - none of those fix the actual problem, they just hide it.

                    If loading entire model instances is too memory-hungry and your language/framework doesn’t have the ability to lazy-load, or partial-load models using only a subset of the fields, using a type-specific wrapper around the identifying integer (similar to what @kevinc mentions in another comment, and what I mentioned in https://lobste.rs/s/yqvbvt/things_you_should_do_now_from_phabricator#c_5c8xi5) is another option (and not mutually exclusive with using full on Models).

                    I realise this article is a couple of years old, but it’s not like it’s from 1997.

                    1. 5
                      for message in messages:
                          ban_account(message.sender)
                      

                      Fixed!

                      1. 1

                        To explain myself better: I think the problem is not using integer ids, but rather using integer ids with languages that make you create an index to iterate over a collection.

                      2. 5

                        This is a special case of the problem that Hungarian notation was intended to solve: representing units in languages where the type system can’t express units natively. The canonical example is in Excel, using variables like int colFoo and int rowBar. It’s trivial to see if you’re adding a row to a column. It’s better in C++, where you can write a generic wrapper type that has an explicit constructor (and can explicitly or implicitly decay to an integer, depending on your requirements), so your function takes a Column or a UserID, not an int, and you will get a failure in type checking if you got it wrong.
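
                        A rough Python analogue of that generic C++ wrapper, minting a distinct type per unit so a Column and a UserID can’t be swapped (all names invented):

```python
def tagged_int(name):
    # Mint a distinct wrapper type around int; like the C++ version,
    # conversion back to a plain integer is explicit, via int(...).
    class Tagged:
        __slots__ = ("value",)
        def __init__(self, value):
            self.value = int(value)
        def __int__(self):
            return self.value
    Tagged.__name__ = name
    return Tagged

Column = tagged_int("Column")
UserID = tagged_int("UserID")

def fetch_user(uid):
    # Each tagged type is a separate class, so a Column is rejected here.
    if not isinstance(uid, UserID):
        raise TypeError(f"expected UserID, got {type(uid).__name__}")
    return int(uid)
```

                        Python checks this at runtime rather than compile time, but the shape of the safety net is the same: mixing up two integer-backed identifiers becomes a loud error instead of silent corruption.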

                        1. 2

                          The scenario in the post is also a great example of where “security by obscurity isn’t security” misses the point. You’re “just” obscuring your user IDs, but you’ve managed to protect yourself against an accidental availability issue and small amounts of data exfiltration (knowing how many users there are).

                          1. 1

                            A few thoughts

                            • if your language allows you to cast ints to other types that are not numeric in the traditional sense (PHP), you are opening yourself up to a huge class of bugs regardless of whether your IDs are ints or not
                            • when combining large amounts of data, some jurisdictions treat incremental IDs as PII (IANAL). Although I suppose if you’re at the stage where you need to choose between bigint and UUID, you don’t have to be concerned with that just yet!
                            • UUIDs help protect against IDORs (but don’t rely on that alone!)
                            • UUIDs shouldn’t be a problem for database keys; if you’re at the scale where you’re having issues, you’d probably need to shard your database (mod n) anyway. Also, every modern database supports robust and fast UUID-based indexing.
                            • If you’re building a community site, low user IDs are considered a badge of pride (or can cause problems!), so there’s a social aspect to this choice as well
                            1. 2

                              if your language allows you to cast ints to other types that are not numeric in the traditional sense (php),

                              I think the problem is less “can you type cast?” and more “can you accidentally type cast?” https://3v4l.org/bmLu5