1. 11
  1. 11

    The common solution to this problem is using a UUID (Universally Unique Identifier) instead. UUIDs are great because it’s nearly impossible to generate a duplicate and they obscure your internal IDs. They have one problem though. They take up a lot of space in a URL: api.planetscale.com/v1/deploy-requests/7cb776c5-8c12-4b1a-84aa-9941b815d873. Try double clicking on that ID to select and copy it. You can’t. The browser interprets it as 5 different words. It may seem minor, but to build a product that developers love to use, we need to care about details like these.

    Remove the hyphens? 7cb776c58c124b1a84aa9941b815d873.

    Encode the UUID bytes as base32 and strip the equals signs?

    In [7]: base64.b32encode(uuid.uuid4().bytes).decode().strip("=")
    Out[7]: '46LIVXKUY5A57HIDNFDEUYCOLM'
    

    So there are at least two options for keeping all 128 bits of entropy and being able to double-click it.
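
    A minimal sketch of those two options, assuming only Python's standard `uuid` and `base64` modules. The helper names `to_hex`, `to_base32` and `from_base32` are made up for illustration. It checks that both forms round-trip back to the same 128-bit value:

    ```python
    import base64
    import uuid


    def to_hex(u: uuid.UUID) -> str:
        """Option 1: the usual hex form with the hyphens removed (32 chars)."""
        return u.hex


    def to_base32(u: uuid.UUID) -> str:
        """Option 2: base32 of the 16 raw bytes, '=' padding stripped (26 chars)."""
        return base64.b32encode(u.bytes).decode("ascii").rstrip("=")


    def from_base32(s: str) -> uuid.UUID:
        """Restore the padding that was stripped, then decode back to a UUID."""
        padded = s + "=" * (-len(s) % 8)
        return uuid.UUID(bytes=base64.b32decode(padded))


    u = uuid.uuid4()
    print(to_hex(u))
    print(to_base32(u))

    # Both encodings keep every bit, so they convert back losslessly.
    assert uuid.UUID(hex=to_hex(u)) == u
    assert from_base32(to_base32(u)) == u
    ```

    Neither form contains a hyphen, so a double-click selects the whole ID.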

    The longer and more complex the ID, the less likely it is to happen. Determining the complexity needed for the ID depends on the application. In our case, we used the NanoID collision tool and decided to use 12 character long IDs with the alphabet of 0123456789abcdefghijklmnopqrstuvwxyz. This gives us a 1% probability of a collision in the next ~35 years if we are generating 1,000 IDs per hour.

    But your scheme gives log(36 ^ 12) / log(2) = 62 bits of entropy. So you are not solving the same problem that UUID is. Of course the IDs are shorter.
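
    For anyone who wants to check the arithmetic, here is a rough sketch. The function names are made up, and the collision estimate uses the standard birthday approximation, which is an assumption about how the quoted figure was derived and not necessarily what the NanoID collision tool does internally:

    ```python
    import math


    def entropy_bits(alphabet_size: int, length: int) -> float:
        """Bits of entropy in a uniformly random ID: length * log2(alphabet size)."""
        return length * math.log2(alphabet_size)


    def collision_probability(num_ids: int, alphabet_size: int, length: int) -> float:
        """Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * N))."""
        space = alphabet_size ** length
        return -math.expm1(-num_ids * (num_ids - 1) / (2 * space))


    # 1,000 IDs per hour for 35 years, as in the quoted claim.
    ids_in_35_years = 1_000 * 24 * 365 * 35

    print(entropy_bits(36, 12))                            # 12 chars, 36-symbol alphabet
    print(entropy_bits(64, 21))                            # NanoID defaults, for comparison
    print(collision_probability(ids_in_35_years, 36, 12))
    ```

    The first figure comes out just over 62 bits, and the collision probability lands close to 1%, so the quoted numbers are consistent with each other. They just describe a much smaller space than a UUID's.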

    1. 4

      Seems like a case of picking a technology based on aesthetics for development purposes, which appears to lead to bad decisions. And according to the NanoID readme, it uses A-Za-z0-9_- by default, so it does encode a similar number of bits of entropy. Removing the hyphens really seems like the best idea, lol.

      1. 5

        Yeah, the author seems to be treating UUIDs as more special than they really are. The type of UUID everyone uses (version 4) is just a securely-random 128-bit number, with six of the bits set to constant values (four for the version, two for the variant) to identify it as such, encoded in hex with a few hyphens.

        A nanoid is the same thing, just with control over the size and the alphabet used to encode it.

        BTW, I recently discovered the d64 encoding, which is a modified base64 that avoids hyphens and = signs, and I’m using it in my current project.
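
        To make the "random number with a chosen size and alphabet" point concrete, here is a minimal sketch using Python's `secrets` module. This is not the NanoID library's implementation. `random_id` is a hypothetical helper, and the default alphabet and length are the ones quoted from the article:

        ```python
        import secrets

        # Alphabet and length quoted from the article; any other choice works the same way.
        ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


        def random_id(length: int = 12, alphabet: str = ALPHABET) -> str:
            """Pick `length` symbols independently and uniformly with a CSPRNG."""
            return "".join(secrets.choice(alphabet) for _ in range(length))


        print(random_id())                 # 12 chars from the 36-symbol alphabet
        print(random_id(32, "0123456789abcdef"))  # 32 hex chars, i.e. 128 random bits
        ```

        With a 16-symbol alphabet and length 32 this produces 128 random bits, slightly more than the 122 in a version 4 UUID. Only the alphabet and length change.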

        1. 1

          Thanks for the pointer to d64! I’ve been looking for something like this.

    2. 5

      This gives us a 1% probability of a collision in the next ~35 years if we are generating 1,000 IDs per hour.

      1k IDs an hour seems incredibly low for a SaaS database provider to be using as a benchmark, doesn’t it?

      1. 5

        I was thinking the same, and I think these IDs are for the databases, not the actual data in the tables. 1000 databases/hour is a good goal for a database start-up.

      2. 2

        They take up a lot of space in a URL: api.planetscale.com/v1/deploy-requests/7cb776c5-8c12-4b1a-84aa-9941b815d873. Try double clicking on that ID to select and copy it. You can’t. The browser interprets it as 5 different words. It may seem minor, but to build a product that developers love to use, we need to care about details like these.

        UUIDs are 128 bit numbers; the 7cb776c5-8c12-4b1a-84aa-9941b815d873 form is one of many possible encodings. You’re not beholden to it! I’ve become quite fond of just plain old hex encoding, for example.

        1. 1

          Similarly, my company just base64url-encodes UUIDs. We store UUIDs in the database and encode them whenever we need the string representation.
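
          A sketch of what that can look like in Python, assuming the padding is stripped on encode and restored on decode (the helper names are hypothetical, and this is not necessarily how the commenter's company does it):

          ```python
          import base64
          import uuid


          def encode_uuid(u: uuid.UUID) -> str:
              """base64url of the 16 raw bytes, '=' padding stripped (22 chars)."""
              return base64.urlsafe_b64encode(u.bytes).decode("ascii").rstrip("=")


          def decode_uuid(s: str) -> uuid.UUID:
              """Re-add the stripped padding and rebuild the UUID."""
              padded = s + "=" * (-len(s) % 4)
              return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))


          stored = uuid.uuid4()            # what would live in the database
          public = encode_uuid(stored)     # what would appear in a URL
          print(public)
          assert decode_uuid(public) == stored
          ```

          One caveat: the base64url alphabet includes `-` and `_`, so some encoded IDs will still contain a hyphen and break double-click selection.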

        2. 1

          Another concern: will the identifier ever be visible to (non-technical) end users? In most languages that use the Latin alphabet, vulgar words and slurs tend to be brief, so the chance of a randomly generated NanoID identifier sounding inappropriate in some language seems decidedly non-zero.

          Of course, as I understand it, Chinese speakers occasionally use Arabic numerals to write words that sound like the equivalent number spoken in Chinese, so maybe there’s no ID generation system that’s completely safe.

          1. 2

            Could pare down the set of letters like what Multics did.

            1. 1

              This is a concern I have too. The thing @calvin refers to below says Multics:

              reduced the alphabet to sixteen characters to eliminate the possibility of obscenities: all vowels were removed, “v” because you can use it to look like an “u”, and “f”, of course, and “y” because it’s like a vowel, and 2 others.

              But what were the two others?

              Anyway, the above is one approach: alphabet limiting. The other approach, given randomly generated IDs or keys, is to just filter them against a list of words you don’t want; if it’s random, you can always generate a new one.
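
              Both approaches can be sketched in a few lines. Everything here is an assumption for illustration. `SAFE_ALPHABET` drops the vowels plus v, f and y as described, but it is not the actual Multics set, since the two further removals are unknown. `BLOCKLIST` is a placeholder for a real word list:

              ```python
              import secrets

              # Approach 1: limited alphabet (hypothetical; consonants minus v, f, y).
              SAFE_ALPHABET = "bcdghjklmnpqrstwxz"

              # Approach 2: full alphabet, filtered against unwanted words (placeholder list).
              FULL_ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"
              BLOCKLIST = frozenset({"darn", "heck"})


              def contains_blocked(candidate: str, blocklist=BLOCKLIST) -> bool:
                  """True if any blocked word appears anywhere in the candidate."""
                  lowered = candidate.lower()
                  return any(word in lowered for word in blocklist)


              def generate_id(length: int = 12, alphabet: str = SAFE_ALPHABET,
                              blocklist=BLOCKLIST) -> str:
                  """Draw random IDs until one passes the blocklist filter."""
                  while True:
                      candidate = "".join(secrets.choice(alphabet) for _ in range(length))
                      if not contains_blocked(candidate, blocklist):
                          return candidate


              print(generate_id())                        # approach 1
              print(generate_id(alphabet=FULL_ALPHABET))  # approach 2
              ```

              Rejections are rare with a short blocklist, so the retry loop almost always returns on the first draw. The shrunken alphabet costs more: 18 symbols give about 4.2 bits per character against about 5.2 for 36.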

              1. 1

                Is this really a serious concern, or is this just being overly cautious? Has this ever caused real trouble? (Unless of course it gets out of proportion)

              2. 1

                We did something similar for very high speed stuff for NATS.io.

                https://github.com/nats-io/nuid