UUID keys can be really expensive for some workloads, and the risk of performance degradation shouldn’t be treated with such a relaxed attitude.
Most databases use some form of lexicographic range for storing items. Even high-performance hash table architectures may sort their items on disk, due to the extreme compression you can achieve when your keys share a significant amount of information with their neighbors. Modern b-trees will use prefix encoding to store only partial keys on each tree node, and suffix truncation to chop extra bytes off the split points that get percolated up to indexes, resulting in a lot of savings. sled goes so far as to completely skip storing any key material other than a node’s boundary when keys are constant-length and follow a fixed stride, since any key can be losslessly recomputed given the boundary key and the distance. This also allows for completely skipping interpolation search in favor of a branch-free lookup, avoiding a significant portion of the CPU cycles usually spent on searching through children in the read path of many indexes.
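To make the fixed-stride trick concrete, here is a minimal sketch in Rust, assuming constant-length 8-byte big-endian keys spaced a fixed stride apart. The function names are hypothetical illustrations, not sled’s actual API:

```rust
// Hypothetical sketch: with constant-length, fixed-stride keys, a node only
// needs its boundary key; every other key is recomputable on demand.
fn recompute_key(boundary: u64, stride: u64, index: u64) -> [u8; 8] {
    // The index-th key is just the boundary plus index strides.
    (boundary + stride * index).to_be_bytes()
}

// Child search becomes branch-free arithmetic instead of binary or
// interpolation search over stored key material.
fn child_index(boundary: u64, stride: u64, key: u64) -> u64 {
    (key - boundary) / stride
}

fn main() {
    let (boundary, stride) = (1_000_u64, 10_u64);
    assert_eq!(recompute_key(boundary, stride, 3), 1_030_u64.to_be_bytes());
    assert_eq!(child_index(boundary, stride, 1_034), 3);
}
```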
UUIDs destroy all of these optimizations by injecting entropy at various locations in the ID. This isn’t just a size issue. Longer keys mean you need more tree nodes overall to store a data set, which means more index nodes are required, which may mean the index needs extra levels of indirection that every operation must pass through before reaching the relevant node. More entropy means that the RAM you’ve spent money on for keeping recently accessed nodes in cache is less cost-effective. You’ll need to spend more money on RAM to keep similar hit ratios, but due to the key entropy you’re likely to traverse more index nodes on every access, so you may not be able to throw money at the problem in this straightforward way to resolve the performance hit.
It’s fine if you really don’t care about performance, or if low coordination is worth sacrificing storage price-performance for, but otherwise measure carefully and don’t brush off the chance of this seriously hurting the metrics that matter.
There are a few misconceptions in the article. It tries to differentiate between integer keys and UUID keys, but a UUID is a 128-bit integer, so it’s conflating two things:
UUIDs define several allocation policies. The simplest is completely random, but Version 1 is the MAC address concatenated with a 60-bit timestamp (counted in 100-nanosecond intervals). That works well for your purposes: on a single database machine, every UUID will have 48 bits the same, little entropy in the high bits of the timestamp, and only a single value for most low bits (and so you can design your tree on the assumption that conflicts are rare and have a less-efficient secondary structure for the few times they occur; see the layout sketch below). That, however, weakens the claim in the article about unpredictable values in URLs, because now if you know the second that something was generated, you have a 30-bit space to attack to find it, which is nontrivial (and probably easy to detect) but still quite feasible, and trivial if you ever accidentally do something that allows an offline attack.
A fully random 128-bit UUID has all of the problems that you describe, but so would a fully random 64-bit identifier.
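For reference, here is a sketch of the RFC 4122 version-1 byte layout mentioned above (the field extraction is standard; the example bytes are made up). It shows why v1 UUIDs minted on one machine share most of their bytes, and also that time_low, the fastest-changing field, comes first in byte order:

```rust
// RFC 4122 v1 layout: time_low (4) | time_mid (2) | time_hi_and_version (2)
//                   | clock_seq (2) | node/MAC (6)
fn v1_fields(uuid: [u8; 16]) -> (u32, u16, u16, u16, [u8; 6]) {
    let time_low = u32::from_be_bytes([uuid[0], uuid[1], uuid[2], uuid[3]]);
    let time_mid = u16::from_be_bytes([uuid[4], uuid[5]]);
    // Mask off the 4 version bits to recover the top of the 60-bit timestamp.
    let time_hi = u16::from_be_bytes([uuid[6], uuid[7]]) & 0x0fff;
    // Mask off the 2 variant bits.
    let clock_seq = u16::from_be_bytes([uuid[8], uuid[9]]) & 0x3fff;
    let node = [uuid[10], uuid[11], uuid[12], uuid[13], uuid[14], uuid[15]];
    (time_low, time_mid, time_hi, clock_seq, node)
}

fn main() {
    // Made-up v1-style bytes: the node field (last 6 bytes) and the clock
    // sequence are fixed per machine, so consecutive IDs share them.
    let id = [0x7d, 0x44, 0x98, 0x40, 0x3d, 0xc1, 0x11, 0xb2,
              0x9b, 0x2e, 0x08, 0x00, 0x20, 0x0c, 0x9a, 0x66];
    let (tl, tm, th, cs, node) = v1_fields(id);
    println!("time_low={tl:08x} time_mid={tm:04x} time_hi={th:03x} clock_seq={cs:04x} node={node:02x?}");
}
```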
Just like we have salts for passwords, we can expose public identifiers that use some localized secret to deterministically map from a difficult-to-enumerate space into a compact, monotonic, fixed-length keyspace.
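A minimal sketch of one way to do that, assuming 64-bit internal ids: a keyed Feistel permutation, so every compact internal id maps to a unique opaque public id and back, with no lookup table. The construction and constants below are illustrative only, not vetted cryptography; SECRET stands in for a per-deployment secret:

```rust
// Stand-in for a per-deployment secret; do not use these literal values.
const SECRET: [u64; 4] = [
    0x9e3779b97f4a7c15, 0xbf58476d1ce4e5b9,
    0x94d049bb133111eb, 0x2545f4914f6cdd1d,
];

// Round function: mix one 32-bit half with a round key.
fn round(half: u32, key: u64) -> u32 {
    let mut x = (half as u64) ^ key;
    x = x.wrapping_mul(0xff51afd7ed558ccd);
    (x ^ (x >> 33)) as u32
}

// A Feistel network is a bijection, so encoding produces no collisions
// and decoding needs no stored mapping.
fn encode(id: u64) -> u64 {
    let (mut l, mut r) = ((id >> 32) as u32, id as u32);
    for k in SECRET {
        let (nl, nr) = (r, l ^ round(r, k));
        l = nl;
        r = nr;
    }
    ((l as u64) << 32) | r as u64
}

fn decode(public: u64) -> u64 {
    let (mut l, mut r) = ((public >> 32) as u32, public as u32);
    for k in SECRET.iter().rev() {
        let (nl, nr) = (r ^ round(l, *k), l);
        l = nl;
        r = nr;
    }
    ((l as u64) << 32) | r as u64
}

fn main() {
    let id = 42u64; // compact monotonic key the database actually indexes
    let public = encode(id);
    assert_eq!(decode(public), id);
    println!("internal {id} -> public {public:016x}");
}
```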
I’ve had this in the back of my mind for a while now; this article inspired me to do some googling just now. This looks nice at a glance and is implemented in a ton of languages: https://hashids.org/. Lots of people talk about it being insecure, but as far as I can tell it seems fine for obfuscating ids.
I’ve used GUIDs for keys in SQL Server, in the days when it was purported to require ‘comb GUID’ allocation to preserve performance (by avoiding index fragmentation). Performance didn’t suffer anyway, with databases of only a few gigs. We tested ints vs GUIDs all over the place and didn’t find a single instance of a measurable slowdown.
I’m sure it would have been possible to get to index sizes where this made a measurable difference in speed, or cost us money, or something, and I agree with your suggestion to measure carefully. But I’d also add: don’t ‘optimise’ by avoiding GUIDs/UUIDs before doing so, as you may be leaving benefits behind based on what you’ve heard rather than what you’ve tested for yourself.
“More entropy means that the RAM you’ve spent money on for keeping recently accessed nodes in cache is less cost-effective.”

Could you quantify these in a real-world situation?
We use something similar to UUIDs, but since the number of keys is limited to < 10,000 we never experienced any problems. I think it also matters what you do with these UUID keys.
UUIDs trade more space for less coordination. More space consumed by the unique key means less space that is available on a system for potentially valuable data. You are spending money keeping fluff in memory and persisted, and in some cases this is less problematic than the coordination required to use less fluff, but this trade-off is one that exists.
10k keys is not a size that puts too many constraints on an architecture, unless you have some situation where you’re changing many things per second and it maps to a very dense, highly contended space. But if your workload is one where you need to decide how many machines you will be using to provide enough resources, this kind of consideration starts to have significant TCO implications.
On the other hand, UUIDs fill the keyspace more uniformly, so you might have an easier time at the point you need to start distributing the system to multiple nodes (and, I suppose, less worry about key collisions if you want to fully utilize the write throughput of the cluster) – not an expert, but I suppose it boils down to the age-old “it depends”. Not to downplay the boost you might get from making the keys fit into CPU-happy chunks for faster processing, or just the lower cost of storing thin rows. It’s turtles all the way down :-)
“But if your workload is one where you need to decide how many machines you will be using to provide enough resources, this kind of consideration starts to have significant TCO implications.”

Absolutely. I usually save 30% TCO for companies with optimization projects. These projects very rarely revolve around what is stored in a database or how long a text field is. There is so much inefficiency in an average infrastructure that UUIDs are usually the least of the problems.
Now if only there was a matching dpkg for each one…..
Sadly, fuzzing is mostly an academic paper mill rather than software mill.
Out of the box on ubuntu you only get
afl/bionic,now 2.52b-2 amd64 [installed] instrumentation-driven fuzzer for binary formats
afl-cov/bionic,bionic 0.6.1-2 all code coverage for afl (American Fuzzy Lop)
fusil/bionic,bionic 1.5-1 all Fuzzing program to test applications
libfuzzer-9-dev/bionic-updates,bionic-security 1:9-2~ubuntu18.04.2 amd64 [installed] Library for coverage-guided fuzz testing
wfuzz/bionic,bionic 2.2.9-1 all Web application bruteforcer
zzuf/bionic,now 0.15-1 amd64 [installed] transparent application fuzzer
Interesting that https://gitlab.com/akihe/radamsa is not packaged in apt. IIRC homebrew, for example, has it. Well, it’s best to install from source in any case – usually you want the latest and best when fuzzing.
Fascinating.
Nice idea…..
Very light on dependencies… trivial to build and install.
Comes along with its own Scheme interpreter and a bunch of Scheme programs by the look of it!
We have enough fuzzers already, we want people to run them and find interesting stuff.
Anything slightly useful eventually becomes an academic paper mill. It’s the nature of the system. Research is like VC investments, most die and a few take off spectacularly.
Anyway back to fuzzing.
Here is a discussion of one of the listed papers, which I thought was excellent work: http://shape-of-code.coding-guidelines.com/2020/01/27/how-useful-are-automatically-generated-compiler-tests/
Yes, I read your blog post when it popped up on my feed.
Very interesting indeed.
It reminds me of a moment of “Too Much Truth in Advertising”….
One of the big static analysis firms used to have a page of recommendations from happy customers.
One customer, a Big household name, said something like, “We used X to find tens of thousands of real bugs in code that we have been shipping to our customers for more than a decade!”
Which immediately told me most bugs are never found in testing, and if they are, they probably aren’t triggered, and if they are, it probably doesn’t matter…..
Which also says, by far, most software is in the grade of serving up cat pictures; if it fails to serve one up to one person… who cares? I.e., most software isn’t doing stuff that really matters.
Which also says, in the fields where it really really really does matter (Avionics / Self driving cars / ….) by far most practical experience of software engineering isn’t really relevant.
Except as a warning that “Here be Dragons! Do you really want to trust this stuff?”
And also, don’t use C. I’m not sure what The One True language is…. but I bet it is one that makes automated whole-program static analysis a lot easier than C does.
All this said, to me, defects really do matter, even if you’re only serving cat pictures….
Why?
Because testing and debugging a change built on top of a pile of flakiness is much much much harder than testing and debugging one built on a rock solid foundation.
Because as our systems get bigger and bigger, built on more and more layers, the probability of one of the tens of thousands of very-low-probability bugs biting us tends to one.
As usual, MonkeyUser puts it succinctly… https://www.monkeyuser.com/assets/images/2019/139-mvp.png
Which brings me back to fuzzing: I’m using fuzzing and watching the field because of one simple habit.
When I start working on an area of code…. I stress the hell out of it, and make it rock solid.
Then I start with any enhancements……
Then I stress the hell out of my work.
“We have enough fuzzers already, we want people to run them and find interesting stuff.”

I would say: not really. In the hierarchy of fuzzers, we are struggling to get to or beyond level 3; that is, we can generate syntactically valid programs if we have the grammar, but anything beyond that is really hard. We are still making progress, but we are nowhere near fuzzing programs that take multi-level inputs (most of the interesting stuff happens beyond the first-level parsers).
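For readers unfamiliar with “level 3”, this is roughly what grammar-based generation looks like: a toy sketch that expands an expression grammar into random, syntactically valid strings. The grammar, the depth cap, and the xorshift RNG are all illustrative assumptions:

```rust
use std::collections::HashMap;

// Expand a nonterminal into a random string of terminals. Symbols that are
// not keys in the grammar are treated as terminals and emitted as-is.
fn expand(sym: &str, grammar: &HashMap<&str, Vec<Vec<&str>>>, depth: u32, seed: &mut u64) -> String {
    // Dependency-free xorshift RNG.
    *seed ^= *seed << 13;
    *seed ^= *seed >> 7;
    *seed ^= *seed << 17;
    match grammar.get(sym) {
        None => sym.to_string(), // terminal
        Some(rules) => {
            // Past the depth cap, always pick the first (shortest) rule so
            // the expansion terminates.
            let idx = if depth == 0 { 0 } else { (*seed as usize) % rules.len() };
            rules[idx]
                .iter()
                .map(|&s| expand(s, grammar, depth.saturating_sub(1), seed))
                .collect()
        }
    }
}

fn main() {
    let mut g = HashMap::new();
    g.insert("expr", vec![vec!["num"], vec!["expr", "+", "expr"], vec!["(", "expr", ")"]]);
    g.insert("num", vec![vec!["0"], vec!["1"], vec!["7"]]);
    let mut seed = 0x12345678u64;
    for _ in 0..5 {
        println!("{}", expand("expr", &g, 8, &mut seed));
    }
}
```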
“Sadly, fuzzing is mostly an academic paper mill rather than software mill.”

Unfortunately, I mostly agree with this. Quite a lot of fuzzing papers seem to be making rather limited improvements to the domain, and original ideas are few and far between. I believe that part of the reason is faulty measurement: given a target such as a benchmark suite of programs and faults, it is relatively easy to over-optimize for it. On the other hand, finding bugs in numerous programs often means only that you went looking for them, and may not say anything more about the impact or originality of your approach.
“Quite a lot of fuzzing papers seem to be making rather limited improvements to the domain, and original ideas are few and far between.”

Actually, I’d argue most fuzzing papers are tweaks on afl (or, to a lesser extent, libfuzzer), since those are so easily available.
If the first task in reproduction is downloading and building/patching/fixing an arcane set of out-of-date dependencies…. no giants are going to be standing on your shoulders.
IANAL, but consider if FB hadn’t created and distributed the patent grant file. My guess is we would probably be worse off. But creating it draws attention to something normally invisible, because I don’t think any software license automatically protects you from patent litigation.
Also not a lawyer, but Apache 2.0 explicitly mentions patents. As I understand it, it says that you have an automatic grant to all relevant patents owned by all contributors, but if you claim one of your patents is infringed by people using the software, you lose your licence to the software.
Compare to the React licence, which says you lose your licence to the software if you sue Facebook over any patent at all, regardless of whether it’s related to React or not.
The Apache 2.0 licence is a well-regarded Free Software licence, but the React licence, it seems, is not.
I’ll see myself out. :)
Please don’t remove comments even if they’re incorrect–it makes reading threads later a lot harder.
OK
But editing a comment to state that you retract it would seem valuable. (Elsewhere I’d propose striking it out, e.g. by enclosing the whole shebang in <s></s>; alas, no strikeouts on Lobsters.)

This is incorrect. If you sue Facebook over any patent at all, it does not terminate your software license. It does terminate the additional patent grant. So the patent grant plus the BSD license gives you strictly more legal protection than the BSD license alone does. The Apache 2.0 license also revokes the patent grant, but not the entire license, if you initiate patent litigation against the copyright holder.
See the last question at https://code.facebook.com/pages/850928938376556
Apache 2.0 covers you, but the point still stands for other libraries licensed under licenses such as BSD. I’d presume GPL(v2) also protects you against patents, but this is just an assumption. It would be nice to get confirmation for this from a source that is at least somewhat official (regarding US jurisdiction).
GPLv2 does nothing for patents that MIT or BSD doesn’t. That was one of several reasons for GPLv3.