1. 22

Has anyone else experienced that whenever you mention that code uses a randomized algorithm, you immediately get at least one person who insists it would be better to use a deterministic algorithm? It seems like a large part of the profession assumes that randomization is complex and determinism is simple, but in my experience it’s generally the opposite - randomized algorithms tend to be easier to get right and to keep working right.

I’m thinking of a case where I had randomized property testing and people freaked out (and later asked me how to implement it), and now where my boss is saying “why don’t we just transfer all these days in order” and doesn’t seem to get the substantial increase in complexity and fragility or substantial decrease in throughput that represents.

I guess I’m asking for war stories or advice here.

  1.  

  2. 23

    No no no, you just gotta phrase it the right way. “For extra robustness, input is perturbed by a stochastic noise function generating Gaussian noise from an input nonce. The function was selected to ensure its statistical soundness to p < 0.01…”

    For automated testing, randomness seems ideal to me. Just make sure to save the input seed. If anyone gives you guff, ask them how a hashtable works, and whether the standard hashtable implementation in their language-of-choice protects against HashDoS.
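
    As a sketch of the “save the input seed” discipline (the names here are invented, not from any particular framework): route all of a test’s randomness through one seeded PRNG and report the seed on failure, so the exact run can be replayed.

```python
import random

def run_randomized_test(test_fn, seed=None):
    """Run test_fn against a seeded RNG, reporting the seed on failure."""
    if seed is None:
        seed = random.randrange(2**32)  # fresh seed per run
    rng = random.Random(seed)  # the only source of randomness in the test
    try:
        test_fn(rng)
    except AssertionError:
        # Surface the seed so this exact run can be replayed.
        raise AssertionError(f"test failed; replay with seed={seed}")

# Example property: sorting is idempotent for random integer lists.
def prop_sort_idempotent(rng):
    xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 50))]
    assert sorted(sorted(xs)) == sorted(xs)

run_randomized_test(prop_sort_idempotent)           # random seed each run
run_randomized_test(prop_sort_idempotent, seed=42)  # exact replay
```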

    1. 4

      For automated testing, randomness is a huge pain in the neck. Let’s say you make a pull request, everything works fine in CI, then you merge it into master and everything still works fine in CI. Then you want to deploy a hotfix because things are on fire, and suddenly your pipeline fails due to a random seed that triggered a bug. And when you try to hunt down the bug, you run the test that failed with the seed that produced the failure, only to realize that the problem is caused by larger system state left behind by the rest of the tests, so it only fails with this seed if you run the entire damn suite. It’s not ideal, and best practice warns against it, but it’s certainly not uncommon to have some db or generator state that is shared across tests. I’ve experienced this often enough that I never ever want to feel that pain again. I would be very reluctant to work with anyone who thinks this is a good idea, as it shows a lack of experience with this type of situation in the real world.

      At my previous job we used to have randomization in our test suite, even up to the point of using faker to generate names and phone numbers etc, but at some point we got so fed up with it that we ripped it out completely and turned it into a fully deterministic test suite. By then the test suite was quite big, so it was quite an effort to change this! Besides failures of the test suite due to unlucky random seeds at inconvenient moments, we also had situations where the library would produce invalid values (or worse, combinations of values) that the code rejected for good reasons. One example: after implementing an address validation system we needed addresses that actually existed, which meant we had to stop using the generated ones. And more than once we exhausted the pool of available values when filtering out disallowed ones. I realize this is more of a problem with this library or with using generated names like this, but it’s also a consequence of wanting to randomize stuff.

      Oh, and re: hashtable implementation protecting against HashDoS. This has its own problems as well: Witness this nice bug in the CHICKEN Scheme runtime and note that it had been open for 2 years until we figured out that the symbol table randomization was the culprit (and not even a bug as such - just an unexpected side effect of the hash bucket memory allocation pattern being different depending on the hash table’s randomization factor).

      I see the value in randomizing tests for catching unexpected bugs, as long as it’s like a smoke test or a fuzzing job that runs on the side of the main test suite. That main one should be deterministic as fuck, and deterministic regression tests should be added as new bugs are found, either in the wild or by the randomized smoke tests.

      1. 9

        So you ditched randomized testing because it found bugs which deterministic testing wasn’t able to?

        For your CI pipeline, use a fixed seed. Then repeatedly run a separate random seeded test. This way actual genuine bugs still get to be found but when you’re developing you don’t get your test suite claiming you just introduced a bug.
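
        One way to get both behaviours out of the same suite - a hedged sketch, where `TEST_SEED` and `FUZZ_RUN` are invented environment variable names, not any standard convention:

```python
import os
import random

def choose_seed():
    """Fixed seed on ordinary CI runs; fresh seed on a scheduled fuzz run.

    TEST_SEED and FUZZ_RUN are invented names, not a convention.
    """
    if os.environ.get("FUZZ_RUN") == "1":
        seed = random.randrange(2**32)
    else:
        seed = int(os.environ.get("TEST_SEED", "1"))
    print(f"running suite with seed={seed}")  # always log it for replay
    return seed

suite_rng = random.Random(choose_seed())
```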

        The rest of your criticism is just a design failure of the test suite. Obviously your random inputs must be valid (although you should be randomly testing failure cases too).

        1. 2

          So you ditched randomized testing because it found bugs which deterministic testing wasn’t able to?

          Condescending much?

          For your CI pipeline, use a fixed seed. Then repeatedly run a separate random seeded test. This way actual genuine bugs still get to be found but when you’re developing you don’t get your test suite claiming you just introduced a bug.

          I suppose if you have enough resources to keep running the pipeline with enough different seeds, that could be a solution. Otherwise you might as well just drop the randomization.

          The rest of your criticism is just a design failure of the test suite. Obviously your random inputs must be valid (although you should be randomly testing failure cases too).

          Of course the test suite was not designed perfectly, I even mentioned this in my reply. Show me a constantly evolving real world system developed under time pressure that doesn’t have warts like that.

          1. 5

            I suppose if you have enough resources to keep running the pipeline with enough different seeds, that could be a solution. Otherwise you might as well just drop the randomization.

            If you have the resources to run tests when someone makes a PR or whatever, you probably have the resources to run the tests again a couple of times a day. Obviously prioritize it below other runs but it doesn’t strike me as a massive resource drain.

            Of course the test suite was not designed perfectly, I even mentioned this in my reply. Show me a constantly evolving real world system developed under time pressure that doesn’t have warts like that.

            And my point was that blaming bugs in the test suite on randomized testing was incorrect. Randomized testing was not at fault here.

            1. 2

              I suppose if you have enough resources to keep running the pipeline with enough different seeds, that could be a solution. Otherwise you might as well just drop the randomization.

              Compared to the cost of a developer spending a few minutes, it’s pretty cheap.

              For instance, the $work CI pipeline for a 14-year-old Rails app with extensive browser testing costs 40 cents per build (thanks to EC2 spot pricing) and takes 10 minutes (thanks to concurrency - it’s 50+ minutes run serially).

              Our small team averages fewer than 10 builds a day - $4 or so - or about $1200 a year for something that makes the entire team substantially more productive.

              1. 2

                Depends on your costs. On an open source project run by volunteers, who may or may not have Silicon Valley salaries, $1200 a year can be quite a lot of money. The time spent setting up, fiddling with, fixing and figuring out such a CI pipeline when said volunteers really want to be compiler engineers and not devops people is also quite expensive.

                1. 2

                  I don’t get your point. If your project can’t afford a CI pipeline (sr.ht costs from $20 per year) then it’s not running tests for every PR automatically anyway. I am assuming a developer can set his machine up to run a fixed-seed test with random tests running once in a while. Some random coverage is always going to be better than no random coverage.

                  1. 1

                    I wouldn’t expect an OSS project to have a comparable test suite runtime or commit frequency. It’s a very different situation from a paid team.

                2. 1

                  I suppose if you have enough resources to keep running the pipeline with enough different seeds, that could be a solution. Otherwise you might as well just drop the randomization.

                  This summarizes my favorite use of randomization for testing. I like using long running fuzzing (outside of CI) for finding edge cases/failures, then minimizing those failures in to unit tests that can be quickly applied during development as well as CI going forward.

                  I get what I see as the benefits of random testing, while keeping CI/CD deterministic. Not saying this is the perfect approach for every use case, but has struck a useful balance for me.

              2. 3

                Pseudorandomness is great for this though. For the Verona runtime, we have a systematic testing mode that effectively runs one thread at a time so that we get a deterministic interleaving. This is driven by a pseudorandom number generator (that we’ve implemented, so we don’t get different behaviour on different platforms) and the CI tests run this with a load of different seeds. When a test fails, it tells you the seed that failed and if you rerun the tests locally with the same seed then you get the same crash.
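
                A toy version of that idea (not Verona’s actual scheduler): drive a cooperative scheduler from a seeded PRNG, so every interleaving is a pure function of the seed and a failing schedule can be replayed exactly.

```python
import random

def systematic_run(tasks, seed):
    """Step cooperative tasks in a PRNG-chosen order.

    Each task is a generator; every yield is a preemption point. The
    schedule is a pure function of the seed, so a failing interleaving
    can be reproduced by rerunning with the same seed.
    """
    rng = random.Random(seed)
    live = list(tasks)
    trace = []
    while live:
        task = rng.choice(live)  # scheduling decision comes from the PRNG
        try:
            trace.append(next(task))
        except StopIteration:
            live.remove(task)
    return trace

def worker(name, steps):
    for i in range(steps):
        yield (name, i)

# Same seed, same interleaving - every time.
t1 = systematic_run([worker("a", 3), worker("b", 3)], seed=5)
t2 = systematic_run([worker("a", 3), worker("b", 3)], seed=5)
assert t1 == t2
```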

                On the opposite extreme, I spent a while tracking down a bug in libUCL a few weeks ago. UCL uses a random number in the hash table implementation to prevent targeted (untrusted) data from being able to hit the pathological cases. The deletion code did the wrong thing but my test case failed only when the random number (an integer clamped to the range of indexes in the table) was 0. The random number generator was seeded with the current time in seconds. This meant that I had a reproducible test that failed when I reran it for an entire second, but then passed for the next four seconds, then failed again. This was an incredibly painful debugging experience.

                1. 2

                  This is great experience, thank you a lot! I suppose I should have made the disclaimer I’ve only done randomized testing like this on relatively small projects; obviously the larger things get the more possibilities there are for unkind interactions.

                  I do like the idea of splitting the two. Run automated tests with a fixed input seed and have a “fuzzing” pass that runs on a separate schedule (ie, doesn’t block merges) that chooses random seeds and automatically creates “investigate this when you have time” bug reports.

              3. 12

                Applications of Randomness in Systems Performance Measurement, Trevor Blackwell’s PhD thesis, makes a strong argument that systems should use randomness to avoid deterministic failures and reduce configuration sensitivity.

                Here are some excerpts from the abstract.

                This thesis presents and analyzes a simple principle for building systems: that there should be a random component in all arbitrary decisions. If no randomness is used, system performance can vary widely and unpredictably due to small changes in the system workload or configuration. This makes measurements hard to reproduce and less meaningful as predictors of performance that could be expected in similar situations.

                For TCP/IP, changes of a few percent in link propagation delays and other parameters caused order of magnitude shifts in bandwidth allocation between competing connections. For memory systems, changes in the essentially arbitrary order in which functions were arranged in memory caused changes in runtime of tens of percent for single benchmarks, and of a few percent when averaged across a suite of benchmarks. In both applications the measured variability is larger than performance increases often reported for new improved designs, suggesting that many published measurements of the benefits of new schemes may be erroneous or at least irreproducible.

                To make TCP/IP and memory systems measurable enough to make benchmark results meaningful and convincing, randomness must be added. Methods for adding randomness to conventional program linkers, to linkers which try to optimize memory system performance by avoiding cache conflicts, and to TCP/IP are presented and analyzed. In all of the systems, various amounts of randomness can be added in many different places. We show how to choose reasonable amounts of randomness based on measuring configuration sensitivity, and propose specific recipes for randomizing TCP/IP and memory systems. Substantial reductions in the configuration sensitivity are demonstrated, making measurements much more robust and meaningful. The accuracy of the results increases with the number of runs and thus is limited only by the available computing resources.

                When the overall performance of a system is strongly influenced by the worst case behavior, reducing the sensitivity of the system can also make it perform better. Using average waiting time as a metric, TCP/IP performance is shown to improve significantly when randomization is added to the sending host’s congestion window calculations. Although the improvements are less than those achieved by previously proposed schemes using randomized packet discard algorithms inside the network, the proposed modifications can be implemented entirely in the sending host and so can be deployed more easily.

                1. 2

                  This is very good, thank you.

                2. 5

                  Unless you actually want the testing to be slightly different on every run, you can make it randomised and deterministic - use seed=1. That’s a pretty common approach in quite a few places.

                  1. 2

                    Doesn’t this defeat the point of randomizing, though?

                    1. 3

                      The point of randomisation should be to increase the state space that you explore per unit of developer effort in writing tests. If you drive your tests from a PRNG then you can write a single test and run it with a large range of random seeds to get different behaviour and you can scale this up by buying more CI infrastructure (which is a lot cheaper than buying more developers who write good tests). If you do this, the most important thing is to ensure that the PRNG is the only source of nondeterminism in the tests (everything else that introduces entropy should, in testing, pull from the testing PRNG) so that when you find a test failure in CI then you can reproduce it locally.
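
                      One way to enforce that single-PRNG rule, sketched in Python with invented names: inject the PRNG into the code under test instead of reaching for global randomness, so tests can substitute a seeded instance.

```python
import random

def backoff_delays(attempts, rng):
    """Exponential backoff with jitter; the RNG is injected, never global.

    In production, pass random.Random() (or SystemRandom); in tests, pass
    a seeded instance so every delay sequence is reproducible.
    """
    return [(2 ** i) * rng.uniform(0.5, 1.5) for i in range(attempts)]

# Replaying with the same seed yields the identical sequence.
assert backoff_delays(4, random.Random(1234)) == backoff_delays(4, random.Random(1234))
```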

                      1. 1

                        It depends where you want the randomness. Do you want to generate varied tests, or are you after different tests on each run. Sometimes one is enough, sometimes not.

                    2. 5

                      Yes. When I worked in a university research lab, I suggested using a cryptographic hash function, demonstrably sufficient to reduce collisions to negligible probability, to implement a relational database DISTINCT operator without having to retain the full keys (just their hash codes). Some world-famous faculty reacted quite negatively and irrationally to the idea; apparently they couldn’t emotionally accept even a negligible probability of failure.

                      After making the argument as quantitative as possible, and pointing out that cryptographic systems and content-addressable storage (like git) rely on the same assumptions, I finally won them over, but it was a lot harder than I had expected, given such an intelligent and knowledgeable audience.
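
                      A rough sketch of that DISTINCT trick (illustrative only; a real engine would use a canonical key serialization rather than `repr`):

```python
import hashlib

def distinct_count(rows):
    """Count DISTINCT rows while retaining only a digest of each key.

    Keeping 128 bits of a SHA-256 digest instead of the full key makes a
    collision (a wrongly merged pair) astronomically unlikely. A real
    engine would use a canonical key serialization instead of repr().
    """
    seen = set()
    for row in rows:
        seen.add(hashlib.sha256(repr(row).encode()).digest()[:16])
    return len(seen)

rows = [("alice", 1), ("bob", 2), ("alice", 1), ("carol", 3)]
assert distinct_count(rows) == 3
```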

                      1. 3

                        It’s good to provide some perspective for this kind of discussion. What is the probability of, say, a SHA-256 collision introducing a bug versus the probability of a bit flip in memory (or even an ECC failure) introducing a bug? Both of these are fairly easily quantifiable. I’d be willing to bet that for your application, transient hardware glitches introduce many orders of magnitude more unreliability into your system than using a cryptographically secure hash function[1]. Often the big psychological jump is accepting that the system is not 100% reliable; once you’ve got people to make that leap you can have a rational conversation about how a proposed change will impact reliability.

                        [1] You need to be a bit careful here because there are some attacks to some well-known hash functions that allow an attacker to craft a collision, but if you’re salting the data before hashing with a random salt then you should be fine.
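
                        Making the argument quantitative is a one-liner via the birthday bound, P(collision) ≈ n²/2^(b+1) for n keys hashed to b bits:

```python
# Birthday bound: P(collision) ~= n^2 / 2^(b+1) for n keys in b bits.
def collision_prob(n, bits):
    return n * n / 2 ** (bits + 1)

# Even a trillion distinct keys under SHA-256 stay many orders of
# magnitude below any plausible hardware error rate.
assert collision_prob(10**12, 256) < 1e-53
```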

                        1. 1

                          Yes, I think “relative risk” is the way to frame these discussions (much like say COVID vaccinations for kids, but I don’t really want to go there).

                      2. 4

                        I’ve definitely faced this in the context of property-based testing. The resistance dissipated when we hit a bug which would have clearly been hit by a natural property-based test, FWIW.

                        1. 3

                          I should start a list of fallacies of complexity theory, because this is one that I’ve encountered before. The complexity class BPP contains problems which can be solved efficiently but only randomly. If there exists an approach which derandomizes the entirety of BPP to P, then it likely involves creating specific PRNGs which are hard to predict.

                          1. 2

                            What’s the goal of the nondeterminism in your case?

                            1. 13

                              I don’t think it’s that nondeterminism is a goal, it’s that determinism is not always a goal. For instance, if you need a way to shed 20% of your traffic once it goes over some threshold, you could deterministically drop every 5th request, keeping the state and coordination necessary to do so, or you could randomly drop each request with a 20% probability. The latter is much simpler. It’s not that nondeterminism is desirable, it’s that determinism adds more complexity and we don’t really need it for our goals here.
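
                              The stateless version really is just a few lines (helper names invented for illustration):

```python
import random

def should_shed(rng, shed_fraction=0.2):
    """Stateless load shedding: drop each request independently with
    probability shed_fraction. No counters, no coordination between
    workers - every process decides locally."""
    return rng.random() < shed_fraction

rng = random.Random(0)
dropped = sum(should_shed(rng) for _ in range(100_000))
assert 19_000 < dropped < 21_000  # hovers near 20% by the law of large numbers
```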

                              Or a recommender that scores everything that you’re likely to buy. There’s likely to be something that the algorithm thinks you’re really likely to buy but you aren’t so it will stack it way at the top of the list, so every time you see recommendations it’s always the top item. (If we remove things after you’ve bought them, this result is virtually guaranteed.) We could recompute the scores often and keep track of what you’ve already seen and frequencies of re-showing the same product, enforcing diversity by occasionally promoting items lower in the list. That would work. Or we could multiply the score by a random float, effectively weighting the likelihood by the computed score but still allowing it to look fresh every time you look and getting the diversity that way. If you need determinism then this quick hack isn’t available to you, but if it’s not something that you actually need then it is.
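
                              A sketch of that jittered ranking (function and item names are made up): multiplying each score by a random factor keeps high scorers dominant on average while letting the top of the list vary per view.

```python
import random

def diversified_top_k(scores, k, rng):
    """Weight each predicted score by a random factor: high scorers still
    dominate on average, but the top of the list varies per view."""
    jittered = {item: score * rng.random() for item, score in scores.items()}
    return sorted(jittered, key=jittered.get, reverse=True)[:k]

scores = {"book": 0.9, "lamp": 0.5, "mug": 0.4, "sock": 0.1}
picks = diversified_top_k(scores, 2, random.Random(7))
assert len(picks) == 2 and set(picks) <= set(scores)
```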

                              1. 1

                                Such things are perfectly fine to do, if somewhat tricky to test. But randomizing tests is a big no-no in my book (see my note above). In fact, if you drop requests based on a probability rather than deterministically, it will probably also be much harder to abuse the system in a denial of service attack.

                              2. 4

                                Property testing will generate random cases. If you run a test 100 times with random data, you might generate data which breaks a test; that’s good!

                                1. 1

                                  Random tests are just as annoying as flaky tests - they break at inopportune moments and can take a lot of effort to track down. They only sound good in theory, but in practice they’re just a headache.

                                  1. 3

                                    Nah, I use them all day every day. Not a headache, not a lot of effort. Finds bugs all the time and is great documentation.

                              3. 2

                                Maybe it is me missing something in this whole thread (seems the most likely explanation), but why are we talking about randomness and determinism as if they are mutually exclusive, when in fact all random algorithms are deterministic. As long as you apply the randomness correctly, you should be able to save the randomiser seed and then play back the exact sequence of numbers again at any time in the future, enabling each test to be repeated exactly to debug failures, while still having novel, randomised data and properties on each test run.

                                According to the mathematical definitions there are no actual random algorithms in programming, only deterministic algorithms with unpredictable inputs. If you save the input used to generate the random values, you can repeat the ‘random’ behaviour every time without any deviation. Seems like based on this, if people object you can simply take your current random algorithm and say ‘don’t worry, this is deterministic’.

                                It may be theoretically possible in modern processors to implement an algorithm that makes use of quantum tunnelling to generate random values, but processors are set up to avoid tunnelling and it would probably be horribly resource intensive and generate garbage values.

                                Seems like next time someone says your randomised tests need to be deterministic you can just say ‘Don’t worry, the algorithm is completely deterministic’

                                I have a sneaking feeling that I am in fact misunderstanding this whole conversation and that the word determinism in this context means ‘always the same’.

                                1. 4

                                  why are we talking about randomness and determinism as if they are mutually exclusive, when in fact all random algorithms are deterministic

                                  Random and deterministic are mutually exclusive. A random algorithm depends on a source of entropy, which is by definition not deterministic. A pseudorandom algorithm depends on an approximation of entropy that is unpredictable (in the ideal case) without knowing a seed value. Most computers now provide a hardware random number generator that uses some free-running ring oscillators and generates a sequence of bits that are either 1 or 0 depending on some quantum properties at the gate level that are either random or our understanding of physics is more wrong than we think it is. OS random number generators typically use this as one of the inputs of something like Fortuna (which is designed to be able to merge multiple entropy sources without knowing the quality of the entropy and still give cryptographically secure random numbers).

                                  The information-theoretic definition of a random sequence is one that cannot be generated by a program that is shorter than the sequence. Most random number generators come very close to this: you cannot predict the sequence without encoding the PRNG algorithm and all of the entropy measurements, so as long as you query less often than you add entropy they meet the definition and if you don’t then the difference between the theory and practice matters only in theory.

                                  All of that said, I agree that we need to be careful about terminology, because a random algorithm can be fed a fake entropy source to give deterministic output. This can be incredibly useful for exploring a large state space but still being able to debug test failures.

                                  1. 2

                                    Most computers now provide a hardware random number generator that uses some free-running ring oscillators and generates a sequence of bits that are either 1 or 0 depending on some quantum properties at the gate level

                                    How can I access this as a programmer and how can I find out if a given random function is using this feature? The first example I looked up (https://docs.microsoft.com/en-us/dotnet/api/system.random?view=net-5.0) is not using it. What languages have support for this?

                                    Also, I feel like this is only useful in cryptography. Using a non-deterministic true random has the advantage of being opaque and non-reproducible, I see no advantage of being opaque in unit testing. The other use-case for me for random generators is procedural content generation in games and virtual worlds, where opaqueness is also more of a downside than a desired feature.

                                    1. 3

                                      How can I access this as a programmer and how can I find out if a given random function is using this feature?

                                      On x86, the RDRAND instruction reads from this. On most *NIX systems it’s one of the things that seeds a cryptographically secure random number generator exposed via /dev/random or similar. It’s generally recommended to use this rather than using the random number generator directly because there are some theoretical supply chain attacks on the hardware random number generator that are undetectable in practice. Before being exposed to users, the entropy is fed through a ‘whitening’ process that guarantees a uniform distribution. It’s possible to build one of these that uses a secret key that, if known to an attacker, makes it possible to guess the next random number with fairly high probability (on the order of 1/2^16) but which provides a completely uniform output that is unguessable if you don’t know the secret key. Filtering through Yarrow, Fortuna, or similar avoids this.

                                      As far as telling whether a given random function uses this, you can do that only by reading the documentation. Detecting whether an entropy source is a true random source is not computable in the general case. You can do some statistical sampling that will tell you if something has a good distribution but you can’t tell if it’s the output of a purely deterministic PRNG as part of black-box testing.

                                      What languages have support for this?

                                      C? I’d be pretty shocked if any higher-level language didn’t provide an interface to seed a PRNG with an OS-provided entropy source.
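
                                      In Python, for instance, the kernel-backed entropy source is a stdlib call away:

```python
import os
import random
import secrets

raw = os.urandom(16)             # 16 bytes from the kernel CSPRNG
token = secrets.token_hex(16)    # convenience wrapper over the same source
sysrand = random.SystemRandom()  # random.Random API, kernel-backed

assert len(raw) == 16
assert len(token) == 32
assert 0.0 <= sysrand.random() < 1.0
```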

                                      Also, I feel like this is only useful in cryptography

                                      Or anything where your security (including availability) depends on a truly unguessable random number. For example, TCP sequence numbers, UUIDs, and so on.

                                      Using a non-deterministic true random has the advantage of being opaque and non-reproducible, I see no advantage of being opaque in unit testing

                                      Completely agreed. Reproducibility is the most important constraint for a test suite, random numbers should not be used. Pseudo-random number generators are a great fit for this.

                                      The other use-case for me for random generators is procedural content generation in games and virtual worlds, where opaqueness is also more of a downside than a desired feature.

                                      Also agreed. You almost always want a PRNG for these. You might want to seed it with a random number, but you often want to record it for later. Elite was the first game I’m aware of to really take advantage of this by generating 255 galaxies procedurally on systems with 32 KiB of RAM (not enough to hold all of the galaxies in memory at the same time), one for each of the possible seeds to their 8-bit PRNG.
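
                                      The Elite trick in miniature (using a generic linear congruential generator, not Elite’s actual one): the world is a pure function of the seed, so it can be regenerated on demand instead of stored.

```python
def lcg(seed, a=1103515245, c=12345, m=2**31):
    """A minimal linear congruential generator (generic constants)."""
    state = seed % m
    while True:
        state = (a * state + c) % m
        yield state

def planet_names(seed, count):
    """Derive a reproducible list of names purely from the seed, so the
    'galaxy' can be regenerated on demand instead of stored."""
    gen = lcg(seed)
    syllables = ["la", "ve", "ti", "so", "qu", "an", "di", "or"]
    return ["".join(syllables[next(gen) % 8] for _ in range(3)).title()
            for _ in range(count)]

assert planet_names(1982, 5) == planet_names(1982, 5)  # same seed, same world
```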