A detailed analysis of finding and fixing database driver bugs at scale in production.
When I read the fix for the first problem (read-to-see-if-error-before-write) it occurred to me that there was still a race (since the timeout could occur after the read but before the write).
That is, this is not a “fix” in the sense that the problem cannot occur; it only makes the problem so unlikely that it is fixed for all practical purposes, at least while there are other bugs in the system.
I find that interesting, since it sits on the boundary between correctness and practicality.
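To make the window concrete, here is a rough sketch of the check-then-write pattern as I understand it. This is a toy simulation, not the driver's code: `FakeConn`, `looks_alive`, `checked_write` and the `between` hook are all names I made up for illustration, and the "timeout" is just a flag flipped by hand.

```python
class FakeConn:
    """Hypothetical stand-in for a pooled connection; not a real driver API."""

    def __init__(self):
        self.closed_by_server = False

    def server_timeout(self):
        # Simulates the server dropping an idle connection.
        self.closed_by_server = True

    def looks_alive(self):
        # The pre-write check: a read to see whether the peer already closed.
        return not self.closed_by_server

    def write(self, query):
        if self.closed_by_server:
            raise ConnectionError("write on a connection the server already closed")
        return "sent: " + query


def checked_write(conn, query, between=lambda: None):
    """Check the connection, then write. `between` runs inside the gap."""
    if not conn.looks_alive():
        return "stale: get another connection"
    between()  # the race window: the timeout can still land here
    return conn.write(query)


# Case 1: the timeout happens before the check, so the check catches it.
c1 = FakeConn()
c1.server_timeout()
result_before = checked_write(c1, "SELECT 1")

# Case 2: the timeout lands between the check and the write.
c2 = FakeConn()
try:
    checked_write(c2, "SELECT 1", between=c2.server_timeout)
    result_in_gap = "ok"
except ConnectionError:
    result_in_gap = "write failed"

# Case 3: no timeout at all.
result_normal = checked_write(FakeConn(), "SELECT 1")
```

In a real system the gap is presumably tiny, which is why the check makes the failure rare rather than impossible.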