A talk by a FoundationDB developer about how they got it so reliable that the Jepsen team didn’t see a need to test it.
re Single-threaded, pseudo concurrency
He said you need it single threaded so it’s deterministic. There are methods available for deterministic multi-threading. DTHREADS submission comes to mind. The paper claims its performance often matched pthreads, sometimes exceeding it. That might have improved simulation speed.
re flow language. This part is a good argument for macros combined with languages that are easy to transform. ZL language comes out ahead of C++ here with end result being C. Then, a la my Brute-Force Assurance concept, you can hit it with every verification tool that works on C before using an optimizing or certifying compiler.
re simulated networking. I’m pretty sure someone has built something open-source that can help with this if people don’t want to build it. I bookmarked ns-3 in case it was useful for that. However, the FoundationDB approach to simulation is pretty simple: each node is a class instance passing memory to each other that can fail on a call, chosen by random number. Doesn’t get easier to simulate failure sticking to how someone normally uses the language.
re determinism. Mentions examples like conditional checks for random numbers, system time, and disk usage. Claims they still screw up despite actively watching out for such things. Perhaps we need static analyzers that check for known sources of non-determinism. Also, like in information-flow control, we might label specific operations as non-deterministic when be put them in the problem. We similarly label things that have to be deterministic. A language-based analysis catches when they mix.
Also mentions a cheat they have where they re-run a tiny percentage of their simulation runs twice on same inputs in same order looking at the output for differences. I’d have not guessed that would work given how complicated traces get on non-determinism. I’d have guessed a pile of runs mixing guided and random inputs would be necessary like with most testing. Maybe I’m missing something about what he’s saying or nature of the setup. This should get more replication on various types of non-determinism in case there’s quick, easy wins to be had like their 2 runs.
re testing. Interesting form of test scripts where each has (a) a goal the database wants to achieve such as performance and (b) an attack on the systems to see if it fails.
Swizzling is interesting. Clogging is just stopping nodes from sending and receiving packets. Swizzling, maybe a reference to this, takes a subset of network connections, “stops them on a rolling basis,” and then brings them “back up in reverse order.” He adds, “And for reasons that we totally don’t understand, this is better at finding bugs than normal clogging.” This is worth looking into more deeply. It could even be a known phenomenon with existing solutions that are just silo’d away from mainstream. Maybe just wrong order invariants or something simple.
Next is a nice torture test:
“Sometime in those 30 seconds, we’re going to change the configuration of the database. This particular configuration change is designed to provoke a coordination change… which means the cluster is going to have to execute Multi-Paxos. And it’s going to be doing that while network connections are randomly clogged… while machines are going down and rebooting left and right… and while it’s trying to get a thousand transactions done per second.”
Proving that same scenario is safe in TLA+ or SPIN without a simulator is left as an exercise for the reader. And not even mentioned in the video. ;)
re other simulated disasters. “Broken machine” makes all future system calls have 1% chance of failure. Simulate dumb sysadmins with tests like swapping IP addresses of two machines to see what happens and swap the data files (eg HD’s) of two computers. This concept needs more experimentation.
re finding bugs before customers. One benefit of simulation is you can do this by causing failures yourself and increasing their frequency by speeding up time. They make the “sim-seconds” way higher than seconds of customer runtime with limitation that it must run on single core. Already mentioned potential improvements on that part.
re buggification. @hwayne, he said “didn’t understand the contract.” They really are enlightened. ;) Back to talk, he says buggificaiton is making your code randomly do something other than caller thinks it will. Examples include sending things in wrong order vs what’s in interface, not sending everything that’s ready to send, or just never sending a timeout.
re Hurst exponent. The concept sounded nice. Then he lost me on the details. His talk was great on bringing in a general audience up to this point where I think he failed. Maybe he wanted a tangent focused only on people who get this stuff. Anyway, I get the point when he gets back to hardware failures. Gives example of hard, drive failure mentioning it could be part of a bad batch, gotten bad maintenance, or a humidity problem in datacenter. Says multiple, cascading failures are among hardest things to test in real-world, physical testing. Since it’s easy in simulation, they simulate as many cascading failures as they can. Could be tie-in here to Combinatorial Testing. I was already considering it for modeling with ASM’s for interleaving or cascading failures.
re test runs. They estimate they’ve run trillions of hours of CPU tests. They run 5-10 million simulation runs per night on their cluster (simulated or their physical cluster?). “Each one of those simulates an entire cluster behaving anywhere between… five minutes and an hour or two depending on the test. And plus all that buggification and speeding up time and so on for years. We’ve done a lot of testing.”
re bad news is broken debugging. “Stepping through code that’s laced with callbacks is horrible.” Only a few people are skilled enough to get through “2 or 3 levels of the stack. So, you’re left with… printf.” Of course haha. The fact that the simulation is deterministic with same output supposed to come out lets them put in “non-spammy printfs” with conditionals that gradually “work deeper into something.” Still room for improvement here in underlying tooling.
re nightmare case of simulation is wrong. One is simulation misses something that happens in real world. Two is they might “misunderstand the contracts of the operating systems.” Maybe a specific call behaving unexpectedly, cross-platform code behaving differently on another platform whose calls have same name, or just an OS or hardware bug. Solution is backup cluster called sinkhole that’s a bunch of server mobos with ECC RAM connected to switches. Both servers and switches are connected to a programmable, network, power supply that turns them on and off all night while database runs. In just two days, they managed to physically destroy… the network power supplies. They went through hundreds of them. SSD’s, too, which doesn’t surprise me given failures reported on flash-based devices. Conclusion: that they never found a bug in their database doing this makes them think “simulation is doing a good job.”
re software dependencies. They have stuff in their stack they didn’t write, can’t simulate in Flow models, and Sinkhole found bugs in them. One flaw in ZooKeeper. More interesting was flaw in APT since it didn’t call fsync. There’s was a prior submission on Lobsters dedicated to properly doing that, esp across filesystems. Had to call it more than once to be sure.
re simulation time. He said running it breaks flow. Damn, these folks are smart: talking about simulations and maintaining flow. Yeah, that is a drawback. Notes these days they can just add lots of hardware to increase productivity. Maybe another team just adds model-checking and analysis, too, to do it without lots of hardware. Not in the talk but give them time.
re mental conditioning. He writes a bug the simulation catches. That’s negative feedback. Next bug isn’t caught. No feedback. So, he is “slowly being trained to write the kinds of bugs that slip through” their simulation framework. He says that keeps CTO up at night. Says they’re training their programmers to “defeat the simulation framework.” (I say hire the CTO, too). One solution is highly-different frameworks (diversity, yep!) with one they use day-to-day but other just before releases. The other is stuff like Sinkhole. Reiterates they blow through lots of power supplies. Btw, @stevelord, are there affordable, power supplies that can take that which software engineers like them might have not been familiar with?
Another point he mentions is how complex systems interact with components’ limitations causing problems during those interactions. That’s pretty close to the core of Leveson’s STAMP Mode that @ToKi and @hwayne told me about. I suggest everyone interested in safety modeling read that paper. She’s fucking brilliant. I especially love how holistic her method is: works on specific details keeping big picture in mind whole time. I’m thinking about attempting a security version in the future that stays close to the style and examples of the paper as a tribute. Anyone wanting to beat me to it feel free. :)
EDIT: Saw that @ac originally submitted this. I missed it. I think I got it off blog article by @davidk01. Thanks to both for bringing it up since I really wanted to know how FoundationDB verification worked. It’s possibly the only high-assurance (or nearly so) database done by small players that got into production… at high profit, too. Without formal verification. New angles to explore. :)