[My reply to author in Disqus. Done in a hurry since I gotta go to work now. ]
Good write-up on the difficulties. The first I saw was Guttman’s paper that set back the hype quite a bit:
His post and yours both oversell the problems while underselling the successes. It’s best to keep big picture in mind. Yes, it’s extremely hard. Yes, we shouldn’t rely exclusively on it or have too much expectations. On Guttman’s end, the Orange Book of TCSEC (later, EAL7 of Common Criteria) required many activities like I included in my own security framework:
Each was supposed to check the others. Each time, you want experienced people doing it since it’s the people that make these processes work. What’s mostly missing in this post and esp in Guttman’s is all the evidence of defect reduction. Most of the work he cites were tracking problems they found throughout their lifecycle. They all said the formal specifications caught bugs due to ambiguity. Some found formal proofs beneficial with often mentioning “deep” errors they likely wouldn’t find with testing or code review. TLA+ use at Amazon was great example where errors showed up 30+ steps into protocols. Most of old verifications found they caught errors just simplifying their specs for the prover even if they wouldn’t run proofs. And for high-assurance security, the A1-class systems would go through 2-5 years of analysis and pentesting where evalutators said they did way better than systems using mainstream practices. Most of those can’t even tell you how many covert channels they have.
If you’re interested in methods that helped in practice, I dropped a start on a survey paper on Lobste.rs trying to shorten several decades worth of material into an outline post:
Now, all that said, full proof is hard with a lot of problems. You hit nail on the head about the infrastructure problem. I’ve always strongly encouraged people to concentrate on building up one or two projects to have infrastructure for most common ways of solving problems in programming with as much automation as possible. They have at least been building most things on Coq, Isabelle/HOL, HOL4, and ACL2. Having prior work to build on has been helping them. As you’ve seen, though, they’ve not made this a priority: the tooling still is horrible. So, horribly-difficult work with horribly-difficult tooling. That leads to the next proposal I did that might help you or some of your readers who want 80/20 rule of most benefits without most problems:
Here’s some of those methods on top of the TLA+ and Alloy links:
Here’s my recommendation in general. Learn lightweight methods such as Meyer’s Design-by-Contract for most stuff, TLA+ for concurrency/order, Alloy for structure, SPARK for low-level code, and so on. TLA+ on distributed stuff seems to sell people the most. With a formal method in hand, then apply this methodology: memory-safe language + lightweight method to model-check important stuff + Design-by-Contract + tooling to generate tests from specs/contracts + fuzzer running with runtime checks enabled. This combination will bring the most quality improvement for least effort. In parallel, we have the experts formally verifying key components such as kernels, filesystems, network stacks, type systems, compilers, security protocols, and so on. They do so modelling real-world usage with each one coming with contracts and guidance on exactly how they should be used. The baseline goes up dramatically with specific components getting nearly bullet-proof.
Also, you said you wanted entire system to be reliable. There was one that set the gold standard on that by noting how everything failed with systematic mitigations for it in prevention and/or recovery:
The OpenVMS clustering software alone let their systems run for years at a time. One ran 17 years at a rail yard or something supposedly. The NonStop systems took that to hardware itself with them just running and running and running. Applying a combination of balanced assurance like I describe, NonStop techniques for reliability, and Cambridge’s CHERI for security should make one hell of a system. You can baseline it all with medium-assurance using tools that already exist with formal verification applied to a piece of the design at a time. The work by Rockwell-Collins on their commercially-deployed AAMP7G’s shows that’s already reachable.
So, there’s you some weekend reading. Merry Christmas! :)
Wow, that pastebin alone has a massive amount of information. Merry Christmas and thank you for the work :)
You’re very welcome. Merry Christmas to you, too. I hope you have a good one!
I largely agree with this and I especially agree with the sliver of hope in the conclusion. The outside world really undermines your assumptions, but formal verification can and will help in isolated facets of software development that are sufficiently abstracted, like protocol correctness. It’ll be a mix of art and science, just like the blend between (sophisticated) static typing and unit testing today. As always, cost/reward ratios should always be considered. Formally verifying small embedded systems that are mission critical and high risk? Medium effort, high reward. Formally verifying a web app? High effort, low reward. Formally verifying a network protocol instead of ad-hoc analyzing it on paper? Low effort, HIGH reward.
Good points but I cant resist trying.
“Verifying a web app? High effort, low reward”
Healthcare.gov? Tax site? National credentials? Online banking?
Good news is the prototypes for parts that need the assurance have already been built. It could be turned into something turn-key with templates if necessary.
Well, let’s talk about healthcare.gov. What would you like to verify? When I said “formally verify a web app”, I meant the whole thing, top to bottom, at an insanely fine grain level. But yes, I would love to see parts or layers verified. Atomicity of transactions. Security. Things like that.
That it functions the way it’s intended to. Nothing more or less. There’s enough robust stuff for web apps that this might just mean verifying a model of the app, the database backends, and their integrations. Data validation, too. I imagine those are main points failure would happen since insurance packages are likely done in mainframe apps or whatever on suppliers’ end.
So I haven’t gotten to the point where I’m adding much in terms of theorem statements to my own Idris code (or spent time with Coq, period) so I’m not fully familiar with what’s probably the crucial part of Idris in relationship to this article but even just as far as I’ve gotten to learning the system it actually seems that Idris, given the proper work, could at least solve a bit of the multiplication of code between implementation and proof. This is done through the proofs themselves being defined through types, described as “The formulae-as-types correspondence”
An example from the Idris documentation, which should look somewhat familiar and yet fairly bizarre to anyone here that uses Haskell:
four_eq : 4 = 4
four_eq = Refl
Here, the first line is the signature, the second line is the body, and Refl is the Type describing reflexivity. The actual work of proving characteristics of code then, is the work of evaluating the definitions of types, first class to the exact work used in evaluating your final program that utilizes those types to ensure those characteristics. It sure has been something else for me to wrap my head around. But it’s fairly neat stuff.