Has anyone else experienced that whenever you mention that code uses a randomized algorithm, you immediately get at least one person insisting it would be better to use a deterministic algorithm? It seems like a large part of the profession assumes that randomization is complex and determinism is simple, but in my experience it's generally the opposite: randomized algorithms tend to be easier to get right and to keep working right.
I’m thinking of a case where I had randomized property testing and people freaked out (and later asked me how to implement it), and now of a case where my boss is saying “why don’t we just transfer all these days in order” and doesn’t seem to see the substantial increase in complexity and fragility, or the substantial decrease in throughput, that this represents.
I guess I’m asking for war stories or advice here.
No no no, you just gotta phrase it the right way. “For extra robustness, input is perturbed by a stochastic noise function generating Gaussian noise from an input nonce. The function was selected to ensure its statistical soundness to p < 0.01…”
For automated testing, randomness seems ideal to me. Just make sure to save the input seed. If anyone gives you guff, ask them how a hashtable works, and whether the standard hashtable implementation in their language-of-choice protects against HashDoS.
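Something like this is usually enough (a minimal Python sketch; the RANDOM_SEED variable and random_case generator are made up for illustration):

```python
import os
import random

# Seed the PRNG from an env var if one is provided, otherwise pick a fresh
# seed and print it so a failing run can be reproduced with RANDOM_SEED=<n>.
seed = int(os.environ.get("RANDOM_SEED", random.randrange(2**32)))
print(f"test seed: {seed}")
rng = random.Random(seed)

def random_case():
    # Hypothetical generator for one randomized test input.
    return [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 50))]
```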
For automated testing, randomness is a huge pain in the neck. Let’s say you make a pull request and everything works fine in CI, then you merge it into master and everything still works fine in CI. Then you want to deploy a hotfix because things are on fire, and suddenly your pipeline fails due to a random seed that triggered a bug. And when you try to hunt down the bug, you run the failing test with the seed that produced the failure, only to realize the failure depends on larger system state built up by the rest of the tests, so it only fails with this seed if you run the entire damn suite. Not ideal, and best practices warn against it, but it’s certainly not uncommon to have some db or generator state that is shared across tests. I’ve experienced this often enough that I never ever want to feel that pain again. I would be very reluctant to work with anyone who thinks this is a good idea, as it suggests a lack of experience with this type of situation in the real world.
At my previous job we used to have randomization in our test suite, even up to the point of using faker to generate names and phone numbers etc., but at some point we got so fed up with it that we ripped it out completely and turned it into a fully deterministic test suite. By then the test suite was quite big, so it took quite an effort to change! Besides failures of the test suite due to unlucky random seeds at inconvenient moments, we also had situations where the library would produce invalid values (or worse, combinations of values) that the code rejected for good reasons. One example: after implementing an address validation system we needed addresses that actually existed, which meant we had to stop using the generated ones. And we more than once exhausted the pool of available values when filtering out disallowed ones. I realize this is more of a problem with this particular library, or with using generated names like this at all, but it’s also a result of wanting to randomize stuff.
Oh, and re: hashtable implementation protecting against HashDoS. This has its own problems as well: Witness this nice bug in the CHICKEN Scheme runtime and note that it had been open for 2 years until we figured out that the symbol table randomization was the culprit (and not even a bug as such - just an unexpected side effect of the hash bucket memory allocation pattern being different depending on the hash table’s randomization factor).
I see the value in randomizing tests for catching unexpected bugs, as long as it’s like a smoke test or a fuzzing job that runs on the side of the main test suite. That main one should be deterministic as fuck, and deterministic regression tests should be added as new bugs are found, either in the wild or by the randomized smoke tests.
So you ditched randomized testing because it found bugs which deterministic testing wasn’t able to?
For your CI pipeline, use a fixed seed. Then repeatedly run a separate, randomly seeded test job. That way genuine bugs still get found, but when you’re developing you don’t get your test suite claiming you just introduced a bug.
The rest of your criticism is just a design failure of the test suite. Obviously your random inputs must be valid (although you should be randomly testing failure cases too).
Condescending much?
I suppose if you have enough resources to keep running the pipeline with enough different seeds, that could be a solution. Otherwise you might as well just drop the randomization.
Of course the test suite was not designed perfectly, I even mentioned this in my reply. Show me a constantly evolving real world system developed under time pressure that doesn’t have warts like that.
If you have the resources to run tests when someone makes a PR or whatever, you probably have the resources to run the tests again a couple of times a day. Obviously prioritize it below other runs but it doesn’t strike me as a massive resource drain.
And my point was that blaming bugs in the test suite on randomized testing was incorrect. Randomized testing was not at fault here.
Compared to the cost of a developer spending a few minutes, it’s pretty cheap.
For instance, the $work CI pipeline for a 14-year-old Rails app with extensive browser testing costs 40 cents per build (thanks to EC2 spot pricing) and takes 10 minutes (thanks to concurrency - it’s 50+ minutes run serially).
Our small team averages fewer than 10 builds a day - $4 or so, or about $1200 a year - for something that makes the entire team substantially more productive.
Depends on your costs. On an open source project run by volunteers, who may or may not have Silicon Valley salaries, $1200 a year can be quite a lot of money. The time spent setting up, fiddling with, fixing and figuring out such a CI pipeline when said volunteers really want to be compiler engineers and not devops people is also quite expensive.
I don’t get your point. If your project can’t afford a CI pipeline (sr.ht costs from $20 per year) then it’s not running tests for every PR automatically anyway. I am assuming a developer can set his machine up to run a fixed-seed test with random tests running once in a while. Some random coverage is always going to be better than no random coverage.
I wouldn’t expect an OSS project to have a comparable test suite runtime or commit frequency. It’s a very different situation from a paid team.
This summarizes my favorite use of randomization for testing. I like using long-running fuzzing (outside of CI) to find edge cases and failures, then minimizing those failures into unit tests that can be quickly applied during development as well as in CI going forward.
That way I get what I see as the benefits of random testing while keeping CI/CD deterministic. I’m not saying this is the perfect approach for every use case, but it has struck a useful balance for me.
Pseudorandomness is great for this though. For the Verona runtime, we have a systematic testing mode that effectively runs one thread at a time so that we get a deterministic interleaving. This is driven by a pseudorandom number generator (that we’ve implemented, so we don’t get different behaviour on different platforms) and the CI tests run this with a load of different seeds. When a test fails, it tells you the seed that failed and if you rerun the tests locally with the same seed then you get the same crash.
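For anyone curious what that looks like in miniature, here’s a toy sketch of the general technique (Python, with cooperative generators standing in for threads; this is not Verona’s actual code):

```python
import random

def systematic_run(tasks, seed):
    # Step exactly one task at a time, choosing the next runnable task from a
    # PRNG seeded with `seed`, so a failing interleaving can be replayed
    # exactly by rerunning with the same seed.
    rng = random.Random(seed)
    runnable = list(tasks)
    while runnable:
        task = rng.choice(runnable)
        try:
            next(task)  # run the task up to its next yield (scheduling) point
        except StopIteration:
            runnable.remove(task)

def worker(name, log):
    # Each task is a generator that yields at every scheduling point.
    for i in range(3):
        log.append((name, i))
        yield

log = []
systematic_run([worker("a", log), worker("b", log)], seed=42)
print(log)  # same seed => same interleaving, every time
```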
On the opposite extreme, I spent a while tracking down a bug in libUCL a few weeks ago. UCL uses a random number in the hash table implementation to prevent targeted (untrusted) data from being able to hit the pathological cases. The deletion code did the wrong thing, but my test case failed only when the random number (an integer clamped to the range of indexes in the table) was 0. The random number generator was seeded with the current time in seconds. This meant that I had a reproducible test that failed when I reran it for an entire second, but then passed for the next four seconds, then failed again. This was an incredibly painful debugging experience.
This is great experience, thanks a lot! I suppose I should have added the disclaimer that I’ve only done randomized testing like this on relatively small projects; obviously the larger things get, the more possibilities there are for unkind interactions.
I do like the idea of splitting the two. Run automated tests with a fixed input seed and have a “fuzzing” pass that runs on a separate schedule (i.e., doesn’t block merges), chooses random seeds, and automatically creates “investigate this when you have time” bug reports.
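If the suite happens to use a property-testing library like Hypothesis (Python shown here as one example; the profile names and env var are just a convention I’m assuming), the split can be expressed directly:

```python
import os
from hypothesis import settings

# Deterministic profile for PR/merge CI: same examples every run.
settings.register_profile("ci", derandomize=True, max_examples=100)

# Exploratory profile for the scheduled fuzzing job: many random examples.
settings.register_profile("nightly", max_examples=5000)

# Pick the profile via an environment variable set by the pipeline.
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "ci"))
```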
Applications of Randomness in Systems Performance Measurement, Trevor Blackwell’s PhD thesis, makes a strong argument that systems should use randomness to avoid deterministic failures and reduce configuration sensitivity.
Here are some excerpts from the abstract.
This is very good, thank you.
Unless you actually want the testing to be slightly different on every run, you can make it randomised and deterministic - use seed=1. That’s a pretty common approach in quite a few places.
Doesn’t this defeat the point of randomizing, though?
The point of randomisation should be to increase the state space that you explore per unit of developer effort in writing tests. If you drive your tests from a PRNG, you can write a single test and run it with a large range of random seeds to get different behaviour, and you can scale this up by buying more CI infrastructure (which is a lot cheaper than buying more developers who write good tests). If you do this, the most important thing is to ensure that the PRNG is the only source of nondeterminism in the tests (everything else that introduces entropy should, in testing, pull from the testing PRNG), so that when you find a test failure in CI you can reproduce it locally.
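Concretely, that usually means passing the RNG in rather than reaching for global randomness or the clock. A hypothetical example (Python; the function is made up):

```python
import random

def jittered_backoff(rng: random.Random, attempt: int) -> float:
    # Takes its RNG as a parameter: tests pass a seeded random.Random so the
    # whole run is reproducible from one seed; production can pass
    # random.SystemRandom() instead.
    return min(60.0, 2.0 ** attempt) * rng.random()

# In a test, everything nondeterministic pulls from the one seeded PRNG.
rng = random.Random(12345)
delays = [jittered_backoff(rng, a) for a in range(5)]
```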
It depends where you want the randomness. Do you want to generate varied tests, or are you after different tests on each run? Sometimes one is enough, sometimes not.
Yes. When I worked in a university research lab, I suggested using a cryptographic hash function, demonstrably sufficient to make the collision probability negligible, to implement a relational database DISTINCT operator without having to retain the full keys (just their hash codes). Some world-famous faculty reacted quite negatively and irrationally to that idea; apparently they couldn’t emotionally accept even a negligible probability of failure.
After making the argument as quantitative as possible, and pointing out that cryptographic systems and content-addressable storage (like git) rely on the same assumptions, I finally won them over, but it was a lot harder than I had expected, given such an intelligent and knowledgeable audience.
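The core of the idea is tiny; here’s a rough sketch (Python, not the lab’s actual code; repr() stands in for whatever canonical key encoding the real operator would use):

```python
import hashlib

def distinct_count(rows):
    # Keep only a fixed-size SHA-256 digest of each key instead of the full
    # key; with a cryptographic hash the collision probability is negligible,
    # so the number of distinct digests stands in for DISTINCT.
    seen = set()
    for row in rows:
        seen.add(hashlib.sha256(repr(row).encode()).digest())
    return len(seen)

print(distinct_count([("alice", 1), ("bob", 2), ("alice", 1)]))  # 2
```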
It’s good to provide some perspective for this kind of discussion. What is the probability of, say, a SHA-256 collision introducing a bug versus the probability of a bit flip in memory (or even an ECC failure) introducing a bug? Both are fairly easy to quantify (rough arithmetic sketched after the footnote). I’d be willing to bet that for your application, transient hardware glitches introduce many orders of magnitude more unreliability into your system than using a cryptographically secure hash function[1]. Often the big psychological jump is accepting that the system is not 100% reliable; once you’ve got people to make that leap, you can have a rational conversation about how a proposed change will impact reliability.
[1] You need to be a bit careful here because there are attacks on some well-known hash functions that allow an attacker to craft a collision, but if you’re salting the data with a random salt before hashing then you should be fine.
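The rough arithmetic (my own back-of-envelope numbers, not from the comment above):

```python
# Birthday bound: hashing n distinct keys with a 256-bit hash collides with
# probability roughly n^2 / 2^257.
n = 10**12                   # a trillion keys
p = n**2 / 2**257            # ~4.3e-54
print(p)
# Any realistic rate of undetected memory corruption is astronomically
# higher than that, which is the point being made above.
```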
Yes, I think “relative risk” is the way to frame these discussions (much like say COVID vaccinations for kids, but I don’t really want to go there).
I’ve definitely faced this in the context of property-based testing. The resistance dissipated when we hit a bug which clearly would have been caught by a natural property-based test, FWIW.
I should start a list of fallacies of complexity theory, because this is one that I’ve encountered before. The complexity class BPP contains problems which can be solved efficiently with randomness, and we don’t know how to solve all of them efficiently without it. If there exists an approach which derandomizes the entirety of BPP to P, it likely involves constructing specific PRNGs which are hard to predict.
What’s the goal of the nondeterminism in your case?
I don’t think nondeterminism is a goal; it’s that determinism is not always a goal. For instance, if you need a way to shed 20% of your traffic once it goes over some threshold, you could deterministically drop every 5th request, keeping the state and coordination necessary to do so, or you could drop each request with 20% probability. The latter is much simpler. It’s not that nondeterminism is desirable, it’s that determinism adds complexity we don’t really need for our goals here.
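A minimal sketch of the two options (illustrative Python, not any particular system):

```python
import random

# Deterministic: drop every 5th request; needs a counter shared by whatever
# is handling the traffic (and coordination, if that's more than one process).
counter = 0
def should_drop_deterministic() -> bool:
    global counter
    counter += 1
    return counter % 5 == 0

# Randomized: no shared state at all; each request is dropped independently
# with probability 0.2, which sheds ~20% of traffic on average.
def should_drop_random() -> bool:
    return random.random() < 0.2
```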
Or a recommender that scores everything you’re likely to buy. There’s likely to be something the algorithm thinks you’re really likely to buy but you aren’t, so it will sit right at the top of the list, and every time you see recommendations it’s always the top item. (If we remove things after you’ve bought them, this result is virtually guaranteed.) We could recompute the scores often, keep track of what you’ve already seen and how often we re-show the same product, and enforce diversity by occasionally promoting items lower in the list. That would work. Or we could multiply each score by a random float, effectively weighting the likelihood by the computed score while still letting the list look fresh every time, and get the diversity that way. If you need determinism then this quick hack isn’t available to you, but if you don’t actually need it, it is.
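The hack is about one line (Python sketch, purely illustrative; the item names and scores are made up):

```python
import random

def freshen(scored_items, rng=random):
    # Multiply each score by a random factor so the ordering varies between
    # page loads while still favouring high-scoring items.
    return sorted(scored_items, key=lambda pair: pair[1] * rng.random(), reverse=True)

items = [("widget", 0.9), ("gadget", 0.7), ("gizmo", 0.4)]
print(freshen(items))  # usually leads with "widget", but not always
```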
Such things are perfectly fine to do, if somewhat tricky to test. But randomizing tests is a big no-no in my book (see my note above). In fact, if you drop requests based on a probability rather than deterministically, it will probably also be much harder to abuse the system in a denial of service attack.
Property testing will generate random cases. If you run a test 100 times with random data, you might generate data which breaks a test; that’s good!
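For example, with a property-testing library like Hypothesis (Python as an illustration), one small test gets exercised with a fresh batch of random inputs on every run, and any failing input is shrunk to a small reproducer:

```python
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    # The framework generates many random lists; if any of them breaks this
    # property, that's a real bug worth knowing about.
    assert sorted(sorted(xs)) == sorted(xs)
```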
Random tests are just as annoying as flaky tests - they break at inopportune moments and can take a lot of effort to track down. They only sound good in theory, but in practice they’re just a headache.
Nah, I use them all day every day. Not a headache, not a lot of effort. Finds bugs all the time and is great documentation.
Maybe it is me missing something in this whole thread (seems the most likely explanation), but why are we talking about randomness and determinism as if they are mutually exclusive, when in practice all randomized algorithms are deterministic? As long as you apply the randomness correctly, you should be able to save the randomiser seed and then play back the exact same sequence of numbers at any time in the future, enabling each test to be repeated exactly to debug failures, while still having novel, randomised data and properties on each test run.
According to the mathematical definitions there are no actual random algorithms in programming, only deterministic algorithms with unpredictable inputs. If you save the input used to generate the random values, you can repeat the ‘random’ behaviour every time with no deviation. So if people object, it seems like you can simply take your current random algorithm and say ‘don’t worry, this is deterministic’.
It may be theoretically possible in modern processors to implement an algorithm that makes use of quantum tunnelling to generate random values, but processors are set up to avoid tunnelling and it would probably be horribly resource intensive and generate garbage values.
Seems like next time someone says your randomised tests need to be deterministic you can just say ‘Don’t worry, the algorithm is completely deterministic’
I have a sneaking feeling that I am in fact misunderstanding this whole conversation and that the word determinism in this context means ‘always the same’.
Random and deterministic are mutually exclusive. A random algorithm depends on a source of entropy, which is by definition not deterministic. A pseudorandom algorithm depends on an approximation of entropy that is unpredictable (in the ideal case) without knowing a seed value. Most computers now provide a hardware random number generator that uses some free-running ring oscillators and generates a sequence of bits that are either 1 or 0 depending on some quantum properties at the gate level that are either random or our understanding of physics is more wrong than we think it is. OS random number generators typically use this as one of the inputs of something like Fortuna (which is designed to be able to merge multiple entropy sources without knowing the quality of the entropy and still give cryptographically secure random numbers).
The information-theoretic definition of a random sequence is one that cannot be generated by a program shorter than the sequence. Most random number generators come very close to this: you cannot predict the sequence without encoding the PRNG algorithm and all of the entropy measurements, so as long as you query less often than you add entropy they meet the definition, and if you don’t, the difference between theory and practice matters only in theory.
All of that said, I agree that we need to be careful about terminology, because a random algorithm can be fed a fake entropy source to give deterministic output. This can be incredibly useful for exploring a large state space but still being able to debug test failures.
How can I access this as a programmer and how can I find out if a given random function is using this feature? The first example I looked up (https://docs.microsoft.com/en-us/dotnet/api/system.random?view=net-5.0) is not using it. What languages have support for this?
Also, I feel like this is only useful in cryptography. Using a non-deterministic true random has the advantage of being opaque and non-reproducible, I see no advantage of being opaque in unit testing. The other use-case for me for random generators is procedural content generation in games and virtual worlds, where opaqueness is also more of a downside than a desired feature.
On x86, the RDRAND instruction reads from this. On most *NIX systems it’s one of the things that seeds a cryptographically secure random number generator exposed via /dev/random or similar. It’s generally recommended to use this rather than the hardware random number generator directly, because there are some theoretical supply chain attacks on the hardware random number generator that are undetectable in practice. Before being exposed to users, the entropy is fed through a ‘whitening’ process that guarantees a uniform distribution. It’s possible to build one of these that uses a secret key that, if known to an attacker, makes it possible to guess the next random number with fairly high probability (on the order of 1/2^16), but which provides a completely uniform output that is unguessable if you don’t know the secret key. Filtering through Yarrow, Fortuna, or similar avoids this.

As far as telling whether a given random function uses this, you can do that only by reading the documentation. Detecting whether an entropy source is a true random source is not computable in the general case. You can do some statistical sampling that will tell you if something has a good distribution, but you can’t tell whether it’s the output of a purely deterministic PRNG as part of black-box testing.
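In application code it usually surfaces as something like this (Python shown as one example; other languages have equivalents):

```python
import os
import random
import secrets

# All of these ultimately draw from the OS CSPRNG, which is seeded from
# hardware entropy sources such as RDRAND.
key = os.urandom(32)           # 32 bytes straight from the OS CSPRNG
token = secrets.token_hex(16)  # convenience wrapper over the same source
rng = random.SystemRandom()    # random.Random-style API backed by the OS CSPRNG
print(rng.randrange(2**32))
```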
C? I’d be pretty shocked if any higher-level language didn’t provide an interface to seed a PRNG with an OS-provided entropy source.
Or anything where your security (including availability) depends on a truly unguessable random number. For example, TCP sequence numbers, UUIDs, and so on.
Completely agreed. Reproducibility is the most important constraint for a test suite, so truly random numbers should not be used. Seeded pseudo-random number generators are a great fit for this.
Also agreed. You almost always want a PRNG for these. You might want to seed it with a random number, but you often want to record it for later. Elite was the first game I’m aware of to really take advantage of this by generating 255 galaxies procedurally on systems with 32 KiB of RAM (not enough to hold all of the galaxies in memory at the same time), one for each of the possible seeds to their 8-bit PRNG.
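The trick in miniature (a toy Python sketch of the idea, nothing to do with Elite’s actual generator):

```python
import random

def galaxy(seed: int, n_systems: int = 8):
    # The whole "galaxy" is recreated on demand from its seed, so only the
    # seed ever needs to be stored.
    rng = random.Random(seed)
    return [(rng.randrange(256), rng.randrange(256)) for _ in range(n_systems)]

# Each seed deterministically names one galaxy; regenerate it at will.
assert galaxy(7) == galaxy(7)
print(galaxy(7))
```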