The entire book “Is Parallel Programming Hard, And, If So, What Can You Do About It?” is a gem, edited by Paul McKenney, the guy behind the RCU synchronization primitive in the Linux kernel.
But I find Chapter 11 on Validation the most fascinating…
Here is a TL;DR for you…
From the guy who is probably one of the best on the planet at this game…
I have had a few parallel programs work the first time, but that is only because I have written a large number of parallel programs over the past two decades. And I have had far more parallel programs that fooled me into thinking that they were working correctly the first time than actually were working the first time.
His core idea…
The basic trick behind parallel validation, as with other software validation, is to realize that the computer knows what is wrong. It is therefore your job to force it to tell you. This chapter can therefore be thought of as a short course in machine interrogation.
But never forget that the two best debugging tools are a solid design and a good night’s sleep!
Why do we have bugs? We’re insanely optimistic…
These insane levels of optimism extend to self-assessments of programming ability, as evidenced by the effectiveness of (and the controversy over) interviewing techniques involving coding trivial programs [Bra07]. In fact, the clinical term for a human being with less-than-insane levels of optimism is “clinically depressed.” Such people usually have extreme difficulty functioning in their daily lives, underscoring the perhaps counter-intuitive importance of insane levels of optimism to a normal, healthy life.
If you have a sporadic bug that fails 10% of the time, how many times do you have to retest before you’re 99% certain you’ve fixed it? Answer: 44 times.
And if it’s a 1% probability of failure? You have to rerun it roughly 458 times to be 99% certain you have fixed it.
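The math behind those numbers is a one-liner: if an unfixed bug makes any single run fail with probability f, then n clean runs in a row happen with probability (1 − f)^n, and you want that below 1 − c. A quick sketch in Python (the function name is my own, not from the book):

```python
import math

def runs_needed(failure_prob: float, confidence: float = 0.99) -> int:
    """Consecutive clean runs needed before being `confidence` sure a bug
    is fixed, given it would fail each run with probability `failure_prob`.

    If the bug were still present, all n runs would pass with probability
    (1 - failure_prob)**n; we want that to drop below 1 - confidence.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_prob))

print(runs_needed(0.10))  # 44
print(runs_needed(0.01))  # 459 (the chapter's ~458 is the raw 458.2 before
                          # rounding up to a whole number of runs)
```

Note how brutally the count grows as the bug gets rarer, which is exactly what his graph shows.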
See his figure 11.4 for a scary graph.
- Add delay to race-prone regions.
If you suspect something is getting corrupted by a race… slap a fat delay before and/or after each suspect access to widen the race window and raise the odds of hitting it.
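A minimal sketch of the trick in Python (the counter and the delay placement are my own illustration, not code from the book): an unprotected read-modify-write with a deliberate sleep dropped into the window between the read and the write-back, which makes lost updates show up almost every run.

```python
import threading
import time

counter = 0  # shared, deliberately unprotected

def racy_increment(iterations: int, delay_s: float) -> None:
    """Read-modify-write with an injected delay widening the race window."""
    global counter
    for _ in range(iterations):
        tmp = counter        # read
        time.sleep(delay_s)  # fat delay right inside the race window
        counter = tmp + 1    # write back (may clobber other threads' writes)

threads = [threading.Thread(target=racy_increment, args=(50, 0.001))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 4 threads x 50 increments should give 200; with the delay in place,
# lost updates almost always leave the counter far short of that.
print(counter)
```

Remove the sleep and the race mostly hides; with it, the failure reproduces nearly every time, which is exactly the point.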
- Increase workload intensity.
I.e., Murphy’s Law of Thermodynamics: “things get worse under pressure”.
- Test suspicious subsystems in isolation
Decrease the amount of the system that is under test.
- Simulate unusual events
Deliberately induce random failures in anything that can fail under normal conditions (and that your code is expected to handle).
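A sketch of that kind of fault injection in Python (the `flaky` wrapper and its names are hypothetical, not from the chapter): it randomly raises the error your code claims to handle, so the error path actually gets exercised instead of only the happy path.

```python
import random

def flaky(func, failure_rate, exc=OSError, rng=random.Random(42)):
    """Return a wrapper that randomly raises `exc`, simulating failures
    that can happen in production and that callers must tolerate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise exc("injected fault")
        return func(*args, **kwargs)
    return wrapper

# Exercise the error-handling path, not just the happy path.
read_block = flaky(lambda: b"data", failure_rate=0.5)
ok = errors = 0
for _ in range(1000):
    try:
        read_block()
        ok += 1
    except OSError:
        errors += 1
print(ok, errors)  # with the seeded RNG, roughly half of the calls fail
```

Seeding the RNG keeps the injected failures reproducible, so a run that trips a bug can be replayed.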
- Count near misses.
This is a good one I had never thought of. If a locking primitive takes at least N cycles to release and re-acquire, then two accesses to the protected data from different threads less than N cycles apart are a near miss: the lock probably wasn’t held correctly.
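A sketch of a near-miss counter in Python (the class, the 1 µs hand-off threshold, and the simulated timestamps are all my own illustration, not the book's instrumentation): every access to the supposedly protected data is timestamped, and two accesses from different threads closer together than the lock's minimum hand-off time get flagged.

```python
MIN_HANDOFF_NS = 1_000  # assumed minimum lock hand-off latency; tune per machine

class NearMissCounter:
    """Flag accesses from different threads that land closer together
    than the lock could possibly have been released and re-acquired."""

    def __init__(self) -> None:
        self.last_ns = None
        self.last_tid = None
        self.near_misses = 0

    def record(self, tid: int, now_ns: int) -> None:
        if (self.last_tid is not None and tid != self.last_tid
                and now_ns - self.last_ns < MIN_HANDOFF_NS):
            self.near_misses += 1  # too soon: lock probably wasn't held
        self.last_ns, self.last_tid = now_ns, tid

# Simulated trace (nanosecond timestamps): threads 1 and 2 touch the data
# 100 ns apart, far less than the hand-off time, so it counts as a near miss.
c = NearMissCounter()
c.record(1, 0)
c.record(2, 100)         # near miss
c.record(1, 10_000_000)  # 10 ms later: fine
print(c.near_misses)  # 1
```

In real code you would call `record(threading.get_ident(), time.monotonic_ns())` from inside the access itself; counting near misses gives you far more signal than waiting for the much rarer outright corruption.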