1. 39
  1. 9

    UUID keys allow the merging of old, new, and foreign data sets so they win for me.

    BUT a bigger win is using natural keys, including natural compound keys. UUIDs should only be used if the row is indeed universally unique but otherwise contains no distinguishing information.

    1. 24

      “natural” keys have burned me so many times now that I don’t believe in them. Government-issued unique IDs? They’ll change or stop being unique or required. External sequential numbers? They’ll be reset during a system upgrade. “Globally unique identifiers”? They’ll silently become “per-source unique” one day.

      In each case you’ll end up with an exception to fix one day, which requires key fixups across many tables. It’s just not worth getting into that situation.

      1. 8

        The issue with natural keys is less what I used to think – that SQL makes it unbelievably tedious to use them – and more that the limitations of data modelling itself mean there will always be abstraction violations. Natural keys are by definition not resilient to out-of-model violations, so, because we live in a fallen world, we might as well use surrogate keys.

      2. 8

        Reminder: UUID is just a (really big) number. You can use whatever numbers you want for keys.

        For example, if you want sequential keys, but have multiple nodes and want the ability to randomly partition the keyspace, this would probably do the job in only 64 bits – 48 bits for sequence, 5 for partition = 32 partitions, and 11 for node_id.

            create or replace function sequential_id(node_id integer)
                returns bigint
                language sql as $$
                    select
                        nextval('seq1') -- sequence, LSB
                        | ((floor(random() * (1<<5))::bigint)<<48) -- partition
                        | ((node_id::bigint & 2047) << 53) -- node identifier, MSB
                    $$;
        
        # select id, id&281474976710655 seq,
             (id&8725724278030336)>>48 part
           from k
           order by seq asc;
                 id          | seq | part
        ---------------------+-----+------
         1303510617147048054 | 118 |   23
         1318991740866134135 | 119 |   14
         1364590687093260408 | 120 |   16
         1359524137512468601 | 121 |   30
         1302947667193626746 | 122 |   21
         1344605963746803835 | 123 |    9
         1335598764492062844 | 124 |    9
         1333909914631798909 | 125 |    3
         1297318167659413630 | 126 |    1
         1318710265889423487 | 127 |   13
         1337287614352326784 | 128 |   15
         1363183312209707137 | 129 |   11
         1338694989235880066 | 130 |   20
         1328561890074296451 | 131 |   16
        (14 rows)
        
        > bitstring(8725724278030336)
        "0000000000011111000000000000000000000000000000000000000000000000"
        > bitstring(281474976710655)
        "0000000000000000111111111111111111111111111111111111111111111111"
        

        This is not necessarily better than UUID, but it might be for your use case! Put another way: the designers of a specific UUID implementation may have been optimizing for your specific use case, but more likely they weren’t.
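
        For reference, here is a minimal setup that the snippets above appear to assume – the seq1 sequence and a table k like the one queried in the transcript (node_id 3 is an arbitrary choice):

            -- sequence feeding the low 48 bits of sequential_id()
            create sequence seq1;

            -- table whose key defaults to the partitioned sequential id
            create table k (
                id      bigint primary key default sequential_id(3),
                payload text
            );

            insert into k (payload) values ('example row');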

          1. 4

            Another thing to consider is that a cryptographic hash can reliably give you 128 bits that will uniquely and deterministically identify any piece of data. Very handy if you need to make UUIDs for something that doesn’t have them.
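
            For instance, in Postgres (11+ has a built-in sha256()) you could derive such an id by truncating the hash to 16 bytes. A minimal sketch – note the result carries no RFC 4122 version bits, it’s just 128 deterministic bits:

                -- deterministic 128-bit id from arbitrary content
                select encode(substring(sha256(content) from 1 for 16), 'hex')::uuid as content_id
                from (values ('some document bytes'::bytea)) as t(content);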

            1. 1

              But even a cryptographic hash has collisions, however unlikely. So there is always a chance that two distinct pieces of data will end up with the same id. The same can probably happen, for example, with a random UUID (except there you rely on the quality of your randomness source rather than the quality of your hash function and the shape of your data). Somehow, using a crypto hash as a long-term id always feels iffy to me.

              1. 3

                Given the size of hashes, using the hash of the content as its id is totally safe. Hashing algorithms are designed such that collisions are so unlikely that they are effectively impossible.

                If this wasn’t the case, all systems based on content addressing would be in serious trouble. Systems like git or IPFS.

                1. 1

                  If this wasn’t the case, all systems based on content addressing would be in serious trouble. Systems like git or IPFS.

                  Any system that assumes a hash of some content uniquely identifies that content is in fact in serious trouble! They work most of the time, but IPFS is absolutely unsound in this regard. So is ~every blockchain.

                  1. 2

                    I’ve seen several cryptographic systems rely on exactly this fact for their security. So while it’s probabilistic, you’re relying on the Birthday Paradox to ensure it’s highly unlikely.

                    From that table, for a 128-bit hash function, for a 0.1% chance of a collision, you’d need to hash 8.3e17 items. In practical terms, a machine that can hash 1,000,000 items per second would need to run for just over 26 millennia to have 0.1% chance of a collision.

                    For systems that use 256-bit digests (like IPFS), it would take many orders of magnitude longer.
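
                    A quick sanity check of those 128-bit figures, using the usual birthday bound n ≈ sqrt(2 * 2^128 * ln(1/(1-p))):

                        -- hashes needed for a 0.1% collision chance in a 128-bit space,
                        -- and how long that takes at 1,000,000 hashes per second
                        select
                            sqrt(2 * power(2::numeric, 128) * ln(1 / (1 - 0.001)))  as hashes_needed,        -- ~8.3e17
                            sqrt(2 * power(2::numeric, 128) * ln(1 / (1 - 0.001)))
                                / 1e6 / (60 * 60 * 24 * 365.25)                     as years_at_1m_per_sec;  -- ~2.6e4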

                    1. 1

                      I’ve seen several cryptographic systems rely on exactly this fact for their security. So while it’s probabilistic, you’re relying on the Birthday Paradox to ensure it’s highly unlikely.

                      If collisions are an acceptable risk as long as their frequency is low enough, then sure, no problem! Engineering is all about tradeoffs like this one. You just can’t assert that something which is very improbable is impossible.

                      1. 1

                        I can still pretend and almost certainly get away with it. If the chance of getting bitten by this pretense is ten orders of magnitude lower than the chance of a cosmic ray hitting the right transistor in a server’s RAM and causing something really bad to happen, then for all practical purposes I’m safe to live in blissful ignorance. And what bliss it is: assuming a SHA-256 hash uniquely identifies a given string can immensely simplify your system architecture.

                      2. 1

                        You’ve misread the table. 1.5e37 is for 256-bit hashes. For 128 bits it’s 8.3e17, which is obviously a lot smaller.

                        For context with IPFS, the Google search index is estimated to contain between 10 and 50 billion pages.

                        1. 1

                          You’ve misread the table

                          Thanks. Although that’s what happens when you start with a 256-bit example, then remember everyone’s talking about UUIDs, and hastily re-calculate everything. :/

                  2. 3

                    No, this is absolutely not a consideration. Your CPU and RAM have lower reliability than a 128-bit cryptographic hash. If you ever find a collision by chance, it’ll more likely be a false positive due to hardware failure (we’re talking a 100-year timespan at a constant rate of billions of UUIDs per second).

                    And before you mention potential cryptographic weaknesses, consider that useful attacks need a preimage attack, and the “collision” attacks currently known are useless for making such UUIDs collide.

                    1. 2

                      Whenever you map from a set of cardinality N (content) to a set of cardinality less than N (hashes of that content) by definition you will have collisions. A hash of something is just definitionally not equivalent to that thing, and doesn’t uniquely identify it.

                      As an example, if I operate a multi-tenant hosting service, and a customer uploads a file F, I simply can’t assert that any hash of the content of F can be used as a globally unique reference to F. Another customer can upload a different file F’ which hashes identically.

                      “Highly unlikely” isn’t equivalent to “impossible”.

                      1. 4

                        It’s absolutely impossible for all practical purposes. It’s useless pedantry to consider otherwise.

                        Remember we’re talking about v4 UUIDs here, which already assume a “risk” of collisions. Cryptographic hashes are indistinguishable from random data, and are probably more robust than your PRNG.

                        The risk of an accidental collision is so small that you can question whether there’s enough energy available to our entire civilisation to compute enough data to ever collide in the 128-bit space in its lifetime.

                        1. 1

                          It’s absolutely impossible for all practical purposes. It’s useless pedantry to consider otherwise.

                          I mean it literally isn’t, right? “Absolutely impossible” is just factually not equivalent to “highly improbable” — or am I in the wrong timeline again? 😉 Going back to the hosting example: if you want to use UUIDs to identify customer documents, that’s fine, but you can’t depend on the low risk of collisions to establish tenant isolation; you still have to namespace them.

                          1. 1

                            By your definition of impossible, there is literally no possible solution, since it would require infinite memory. At that point you should question why you’re using a computer at all.

                            The fact is that people don’t quite understand UUIDs and use them everywhere in meme fashion. In most UUID usages as database keys I’ve seen, the value is even stored as a string, which creates much more serious problems than those under discussion here.

                            1. 1

                              You don’t need infinite memory to uniquely identify documents among customers. You just need a coördination point to assign namespaces.

                              I agree that UUID usage in the wild is… wild. I don’t think most developers even really understand that the dashed-hex form of a UUID is actually just one of many possible encodings of what is ultimately just 16 bytes in memory.
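
                              A small illustration of that point (a sketch assuming PostgreSQL 13+, where gen_random_uuid() is built in; uuid_send() exposes the raw binary form):

                                  select u                           as dashed_hex_form,
                                         uuid_send(u)                as raw_bytes,    -- the same value as 16 raw bytes
                                         octet_length(uuid_send(u))  as byte_count    -- always 16
                                  from (select gen_random_uuid() as u) as t;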

                        2. 3

                          The only system behaviour that the universe guarantees is that all systems will eventually decay.

                          For any other behaviour you have to accept some probability that it won’t happen (hardware failure, bugs, operator error, attacks, business failure, death, and, yes, hash collisions).

                          Hash collisions with a good algorithm will often be a risk much lower than other factors that you can’t control. When they are, what sense does it make to worry about them?

                          1. 1

                            There is a categorical difference between hash collisions and the kind of unavoidable risk you’re describing, e.g. solar flares flipping a bit in my main memory.

                        3. 1

                          After doing some more thinking/reading, I would agree with you for something like SHA-256. But using a 128-bit hash still seems like a bad idea. I found that this paragraph from a reply on HN summarizes it quite well:

                          In cases where you can demonstrate that you only care about preimage resistance and not collision resistance, then a 128-bit hash would be sufficient. However often collision attacks crop up in unexpected places or when your protocol is used in ways you didn’t design for. Better to just double the hash size and not worry about it.

                          I think Git’s messy migration from SHA-1 is a good cautionary tale. Or do you believe it’s completely unnecessary?

                          1. 1

                            Git’s migration is due to a known weakness in SHA-1, not due to the hash size being too small. I believe git would be perfectly fine if it used a different 128-bit cryptographic hash.

                            The first sentence you’ve quoted is important. There are uses of git where this SHA-1 weakness could matter. For UUID generation it’s harder to imagine scenarios where it could be relevant. But remember you don’t need to use SHA-1 — you can use 128 bits of any not-yet-broken cryptographic hash algorithm, and you can even pepper it if you’re super paranoid about that algorithm getting cracked too.

                            1. 1

                              The first sentence you’ve quoted is important. There are uses of git where this SHA-1 weakness could matter.

                              Yes, but isn’t git’s situation exactly what the rest of that paragraph warns about? SHA-1 is susceptible to a collision attack, not a preimage attack, and now everyone is trying to figure out whether this could be exploited in some way, even though on the surface git is just a simple content-addressable system where collision attacks shouldn’t matter. And as far as I can tell there is still no consensus either way.

                              And as the rest of that reply explains, 128 bits is not enough to guarantee collision resistance.

                              1. 1

                                If the 128-bit space is not enough for you, then it means you can’t use UUID v4 at all.

                                The whole principle of these UUIDs is based on the fact that random collisions in the 128-bit space are so massively improbable that they can be safely assumed to never, ever happen. I need to reiterate that the outputs of a not-broken cryptographic hash are entirely indistinguishable from random.

                                Resistance of a hash algorithm to cryptographic (analytic) attacks is only slightly related to the hash size. There are other much more important factors like the number of rounds that the hash uses, and that factor is independent of the output size, so it’s inaccurate to say that 128-bit hashes are inherently weaker than hashes with a larger output.

                                Please note that you don’t need to use SHA-1. SHA-1’s weakness is unique to its specific algorithm, not to 128-bit hashes in general. You can pick any other algorithm. You can use SHA-2, SHA-3, bcrypt/scrypt, or whatever else you like, maybe even an XOR of all of them together.

                    2. 3

                      I’d thoroughly recommend reading every post on Bradur’s site. Some really excellent thoughts, solutions, and insight.

                      1. 2

                        They’re opaque from a user-facing perspective. With sequences, users may be able to make inferences about how much data is being generated by watching them increment over time. Malicious users might take advantage of them to try attacks based on ID iteration or collision.

                        I don’t mean to be condescending, but this is objectively wrong. The point of a primary key is to refer to a row unambiguously. You can use an existing attribute if you are confident it will be unique and are aware of the implications. As such, the safest bet is usually to add a column for reference purposes, typically an integer called id, incremented from an initial value. Now if you do this and for some reason need to expose that table to the internet, and have concerns about revealing implicit information or allowing guesses based on that value… then add a column fitting whatever requirements you have and use it. That is a totally distinct problem and has no business in the choice of primary key. Seriously, where did this silly idea come from that unambiguous references to a row must be made via the primary key only?
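
                        To make that concrete, a sketch of the two-column approach (illustrative names; PostgreSQL 10+ identity syntax, with gen_random_uuid() built in from version 13 or available via pgcrypto before that):

                            create table customer (
                                id        bigint generated always as identity primary key,  -- internal surrogate key
                                public_id uuid not null unique default gen_random_uuid(),   -- safe to expose externally
                                email     text not null
                            );
                            -- joins and foreign keys use id; URLs and APIs hand out public_id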

                        1. 2

                          My experience with UUIDs vs integers in Postgres is that for operations involving multiple rows (joins, selecting sets), bigints and ints end up being significantly faster (i.e. the queries take less time), even with indexing. I assume all the usual factors come into play (index size, number of processor cycles per comparison). Of course, you can have more than one unique key for different purposes.
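
                          If you want to measure this on your own data, one possible setup for the comparison (timings will of course depend on hardware, row counts, and data distribution):

                              create table parent_big  (id bigint generated always as identity primary key);
                              create table child_big   (id bigint generated always as identity primary key,
                                                        parent_id bigint references parent_big);

                              create table parent_uuid (id uuid primary key default gen_random_uuid());
                              create table child_uuid  (id uuid primary key default gen_random_uuid(),
                                                        parent_id uuid references parent_uuid);

                              -- after loading comparable data into both pairs:
                              explain analyze
                              select count(*) from child_big  c join parent_big  p on p.id = c.parent_id;

                              explain analyze
                              select count(*) from child_uuid c join parent_uuid p on p.id = c.parent_id;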

                          1. 1

                            This definitely mirrors a problem of mine. There’s also the question of which DBMS handles UUIDs and multi-UUID primary keys better. I still have the chance to switch from MariaDB to PostgreSQL, and I can’t seem to decide…

                            And for combined UUID primary keys I could start to use a uuid -> serial conversion table (roughly as sketched below), so the primary keys are less costly?! But that’d require a lookup all the time…

                            (I don’t have the option of not using UUIDs on the client-facing side.)
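
                            The conversion table mentioned above could look roughly like this (Postgres syntax, illustrative names; it does cost an extra lookup or join on client-facing queries):

                                create table entry_ids (
                                    internal_id bigint generated always as identity primary key,
                                    client_uuid uuid not null unique   -- the id the offline client generated
                                );
                                -- other tables reference entry_ids (internal_id) in their foreign keys,
                                -- while clients keep talking in terms of client_uuid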

                            1. 1

                              To clarify: I want clients to be able to generate datasets offline and later sync them, so I’ll have to give each entry a UUID (and their sub entries also).