1. 48

Formal methods (theorem proving, model checking, static analysis, abstract interpretation, advanced type systems, program semantics, etc) have remained a relatively niche topic during the last two and a half decades. The field saw much interest during the 1980s and early 1990s. Tony Hoare’s 1996 speech recognized formal methods were so far unable to scale to real-world software, and that event perhaps marked the beginning of a formal methods winter.

Do you think the field is starting to regain attention? I have observed many positive indicators:

  • Strong interest in the mathematics community to mechanize proofs, driven by Lean

  • Some technical milestones, such as end-to-end proofs carried by the DeepSpec project

  • Emerging synergies with AI, e.g. synthesis

  • Some job openings in mainstream companies, e.g. Google

  • Respectable number of grant calls and studentships related to the topic, growing from close to zero

  • Funded work by cryptocurrency platforms to verify smart contracts, after multi-million fiascos

Is this a promising field to work on? What advances do you expect during this decade?

  1. 15

    One big thing that I expect will change, or rather, that I hope will change, is that formal methods techniques will become more widespread in mainstream programming languages. [1] As much as I am not a fan of Rust, it is leading the way in this space by making lifetimes and borrowing, among other things, mainstream, and I appreciate it for that.

    I emphasize techniques because I don’t think full formal methods (proving correctness of code) is going to go mainstream anytime soon, but there are smaller techniques that I believe can. I think that those techniques can add a lot of value and get us closer to correct software in the small. That’s because formal verification, at a micro level, is about proving certain properties of programs. Prove enough properties, and you’ve proven correctness. Full correctness is what people think of when they think of formal methods, but it’s just as valid to only prove one or some of the properties required to prove full correctness; that’s still proving something useful.

    (Note: in my opinion, formal methods and testing are complements because they are both about proving properties of programs. Formal methods are good at proving certain kinds of properties, and testing is good at proving other kinds of properties. Being able to produce quality software requires learning how to use both, when each should be applied, and how to make them work together.)

    An example: I hope that concepts from Design by Contract will become mainstream as well, probably helped by languages taking inspiration from Ada/SPARK and Eiffel. Design by Contract concepts are really a formal methods technique that concentrates on individual types and functions, both of which are “in the small” as mentioned. With Design by Contract concepts, a compiler might be able to do conservative analysis to figure out if a function meets its postconditions if its preconditions are true. The compiler may not be able to prove full correctness, but it may help catch most errors by proving the non-existence of some bad behaviors with respect to the contract.

    Another example: property-based testing. With property-based testing, combined with Design by Contract, the testing apparatus could actually take the preconditions and postconditions and use them as the properties to test.

    Another example: say you manage to prove the correctness of one low-level routine in your software. At that point, you know that wherever it is used, as long as it is used correctly, it is not the reason for a failing test. In fact, you could even use the fact that it is known to be correct to more easily narrow down the cause of bugs by skipping that routine entirely. It would also be easier to tell when a caller is not abiding by its preconditions.

    Another example (an extension of the one above): software is complex because of the size of the state space. Using formal methods, you can prove that entire portions of the state space will never be reached, even if you do not prove full correctness (which is the same as proving that every incorrect state is unreachable). That then means that your test suite never needs to consider those unreachable states, making the software much easier to test!

    Feel free to ignore the rest of this post; I’m just going to talk about my own projects.

    I am attempting to make a lot of “in the small” techniques from formal methods available and easy to use by designing a programming language with a lot of those techniques, mostly the ones that are easiest to apply and give the biggest bang for buck from learning and applying them, in a design that I hope makes them easy to use. I don’t have much to share on my work because I’m in the middle of a licensing conundrum and can’t make it public yet. But I am attempting to “be the change [I] want in the world.”

    [1]: In this context, “mainstream programming languages” refers to languages that programmers can get jobs in without picking one specific company. That may be a poor definition, but that’s the definition in my head.

    1. 8

      That’s because formal verification, at a micro level, is about proving certain properties of programs. Prove enough properties, and you’ve proven correctness. Full correctness is what people think of when they think of formal methods, but it’s just as valid to only prove one or some of the properties required to prove full correctness; that’s still proving something useful.

      Yes, exactly! And this has already happened, and the result of it is the type systems we have now. In other words, ordinary type systems are the parts of formal methods that have gone mainstream.

      A type system is in essence a proof search. For example, type checkingfunc map<A, B>(f: A -> B, list: List<A>) -> List<B> { body... } is searching for a proof that for all types A and B, if f is a value of type A -> B and list is a value of type List<A> then body... will evaluate for a value of type List<B>. (Or run forever, or crash in an “acceptable” way.)

      However, there is one very important property that ordinary type systems have that makes them practical, and tends to draw a line between “type systems” and other formal methods. It’s that the proof search always halts (typically in linear-ish time), and either says “yes it type checks” or “no it doesn’t type check and here’s why”. The guarantee of getting a (relatively) explanatory error message if your program fails to type check is very important. It’s what lets you easily see what the problem is, and let’s you reason about how to modify your program to satisfy the type checker.

      1. 1

        I wonder if there’s a possible halfway point where the interface the programmer uses is not so well behaved - uses tactics, may be slow and use interactive guidance - and then the output of that is a proof that can be reliably checked in linear time for linters and checking. Kind of like applying the de Bruijn criterion to complicated type systems.

        1. 1

          You are absolutely correct, and I should have mentioned better type systems as an example in my post.

        2. 2

          I don’t have much to share on my work because I’m in the middle of a licensing conundrum and can’t make it public yet.

          Oof, I don’t envy you. Hope all goes through amicably, and more so, that the project does see the light of day; it sounds really helpful & cool!

          1. 1

            Thank you. :) The licensing conundrum is around GitHub Copilot, actually. I need to get the money to ask a lawyer about the consequences of Copilot for licenses and if the use of code in Copilot is or could be prevented by a license.

        3. 13

          No, not until things get a lot worse.

          The world is the way it is because we re-purpose infrastructure designed for cat video databases for everything else. We can barely rally small investments from aerospace and military. Even in the cryptocurrency space, designers of new languages don’t start with formal semantics but bolt them on after-the-fact.

          While there has been some progress, the thaw won’t come naturally. I expect that it will be blasted away by a terrorist attack. Only then will the computing industry take safety engineering seriously 😞.

          1. 11


            My pet hobby horse.

            Let me ride it.

            It’s a source of great frustration to me that formal methods academia, compiler writers and programmers are missing the great opportunity of our life time.

            Design by Contract.

            (Small Digression: The industry is hopelessly confused by what is meant by an assert. And subtle disagreements about what is meant or implied by different programmers is an unending source of programmers talking passed each other).

            You’re welcome to your own opinion, but for the following to make any sense at all, put aside whatever you mean by “assert” for the duration, and accept what I mean. You can go back to your meaning after we have finished discussing this comment.

            By an assert in the following I mean, it’s a programmer written boolean expression, that if it ever evaluates to false, the programmer knows that the preceding code has an unknown bug that can only be fixed or handled by a new version of the code.

            If it evaluates to true, the programmer full expects the subsequent code to work and that code will fully rely on the assert expression being true.

            In fact, if the assert expression is false, the programmer is certain that the subsequent code will fail to work, so much so, there is no point in executing it.

            So going back to DbC and formal methods.

            Seriously. Writing postconditions is harder than just writing the code. Formal methods are way harder than just programming.

            But we can get 90% of the benefit by specializing the postconditions to a few interesting cases…. aka. Unit testing.

            So where can Formal Methods really help?

            Assuming we’re choosing languages that aren’t packed with horrid corner cases… (eg. Signed integer overflow in C)…

            Given a Design by Contract style of programming, where every function has a bunch of precondition asserts and a bunch of specializations of the postconditions……

            My dream is a future where formal methods academics team up with compiler writers and give us…

            • Assuming that every upstream assert expression is true, if it can be determined that any downstream assert will fail, the compile will fail with a useful warning.
            • Where for every type, the programmer can associate an invariant expression, and the compiler will attempt to verify that it is true at the end of the constructor, and the start and end of every public method and at the start of the destructor. If it can’t, it will fail the compile with a warning.
            • Wherever a type is used, the invariant expression can be used in this reasoning described above.

            So far, you might say, why involved the compiler? Why not a standalone linter?

            Answer is simple… allow the optimizer to rely on these expressions being true, and make any downstream optimizations and simplifications based on the validity of these expressions.

            A lot of optimizations are base on dataflow analysis, if the analysis can be informed by asserts, and the analysis can check the asserts, and be made more powerful and insightful by relying on these asserts… then we will get a massive step forward in performance.

            My experience of using a standalone linter like splint… is it forces you to write in a language that is almost, but not quite like C. I’d much rather whatever is parsed as valid (although perhaps buggy) program in the language by the compiler, is parsed and accepted as a valid program by the linter (although hopefully it will warn if it is buggy), and vice versa.

            I can hear certain well known lobste.rs starting to the scream about C optimizers relying on no signed integer overflow since that would be, according the standard, undefined and resulting in generated assembler that results in surprised pikachu faced programmers.

            I’m not talking about C. C has too much confused history.

            I’m talking about a new language that out of the gate takes asserts to have the meaning I describe and explains carefully to all users that asserts have power, lots and lots of power, to both fail your compile AND optimize your program.

            1. 5

              As someone who has been using Frama-C quite a lot lately, I can’t but agree with this. There’s potential for a “faster than C” language that is also safer than C because you have to be explicit with things like overflow and proving that some code can’t crash. Never assume. Instead, prove.

              1. 3

                for the following to make any sense at all, put aside whatever you mean by “assert” for the duration, and accept what I mean. You can go back to your meaning after we have finished discussing this comment

                Didactic/pedagogical critique: in such a case, it may be more appropriate to introduce a new term rather than using one which has a common lay meaning.

                My dream is a future where formal methods academics team up with compiler writers and give us […]

                Sounds a lot like symbolic execution.

                1. 3

                  using one which has a common lay meaning.

                  Does assertions have a different meaning than the one given here?

                  1. 4

                    I have colleagues for whom it means, “Gee, I didn’t think that input from outside the system was possible, so I want to know about it if I see it in unit test, and log it in production, but I must still handle it as a possibility”.

                    When I personally put in an assert at such a point, I mean, “a higher layer has validated the inputs already, and such a value is by design not possible, and this assert documents AND checks that is true, so in my subsequent code I clearly don’t and won’t handle that case”.

                    I have also seen debates online where people clearly use it to check for stuff during debugging, and then assume it is compiled out in production and hence has no further influence or value in production.

                    1. 1

                      GTest (Google Test), which I’ve recently had to use for C++ school assignments, refers to this in their macro names as EXPECT. Conditions whose failure is fatal are labelled with ASSERT. This makes intuitive sense to me: if you expect something to be true you accept its potential falsehood, whereas when you assert something to be true you reject its potential falsehood.

                      1. 1

                        TIL! Thanks! I only knew about assertions from the contract perspective.

                      2. 1

                        A common design choice is that assertions are evaluated in test environments, but not production. In that case, plus a test environment that you’re not confident fully covers production use cases, you might use assertions for hypotheses about the system that you’re not confident you can turn into an error yet.

                        I’m not sure that’s a good idea, but it’s basically how we’ve used assertions at my current company.

                      3. 2

                        Alas, my aim there is to point out people think there is a common lay meaning…. but I have been involved in enough long raging arguments online and in person to realize… everybody means whatever they damn want to mean when they write assert. And most get confused and angry when you corner them and ask them exactly what they meant.

                        However, the DbC meaning is pretty clear and for decades explicitly uses the term “assert”… except a lot of people get stuck on their own meaning of “assert” and conclude DbC is useless.

                        Sounds a lot like symbolic execution.

                        Ahh, there is that black or white thinking again that drives me nuts.

                        Symbolic execution and program proving is a false aim. The halting problem and horrible things like busy beavers and horrible fuzzy requirements at the UX end of things make it certain that automated end to end program proving simply will never happen.

                        That said, it can be incredibly useful. It’s limited for sure, but within it’s limits it can be extraordinarily valuable.

                        Odds on symbolic execution is going to fail on a production scale system. Not a chance.

                        However, it will be able to reason from assert A to assert B that given A, B will fail in these odd ball corner cases… ie. You have a bug. Hey, thats’ your grandfathers lint on steroids!

                      4. 2

                        You might find the Lean theorem proving language meets some of your requirements. As an example:

                        structure Substring :=
                        ( original : string )
                        ( offset length : ℕ )
                        ( invariant : offset + length ≤ original.length )

                        In order to construct an instance of this Substring type, my code has to provide proof of that invariant proposition. Any function that consumes this type can rely on that invariant to be constrained by the compiler, and can also make use of that invariant to prove proofs about the function’s postcondition.

                      5. 11

                        Various hot takes for discussion:

                        • The majority of software (>99.9%) doesn’t significantly benefit from formal methods, especially with the reliance on both plumbing together bits of garbage and also mixing together business and display logic.
                        • The majority of businesses don’t care or aren’t interested in limiting their domains/growth in ways amenable to formal methods. This is, incidentally, why Haskell will forever be a niche language.
                        • We’ve had some of these methods for decades (design by contract dates back to the 80s with Eiffel) and they haven’t stuck. Beyond the advent of faster computers with which to run analyses, I’m curious what changes would suddenly make programmers eat their vegetables–we still don’t do that with testing, which is far simpler than trying to pick the right types or model system interactions.
                        • For maximum benefit, new systems need to be built on either vetted new code (vetted via analysis), or old code (vetted via years of development and testing). Unfortunately, old code is oftentimes too gnarly/involved to just slather in formal methods–say, for example, old Fortran or C numerical code…and the business case for “well, it already works, let’s rewrite it so we can prove it works!” is hard to make (and that’s assuming you can even find people with the domain expertise to do a rewrite safely).
                        • So much software just isn’t worth writing well, given its lifetime. This is especially obvious in webdev. There is a transactional cost to using good, sound formal methods, and for some minor tasks it just isn’t worth paying. As mentioned above, most of what we do are those minor tasks.
                        1. 2

                          Agreed, but DbC plays nice with such a scenario.

                          I use asserts all over the place when working with, aaah, “legacy” software.

                          I analyse what I think the big ball of mud is doing, and then I executably document the results of that analysis as asserts, and then if I’m wrong with my analysis, I find out reliably at the point where I stuffed up.

                          1. 2

                            I agree - there’s a huge class of software for which even integration tests are pointless because the person asking for the software doesn’t know what they actually want.

                            1. 3

                              I’ve found myself in that situation pretty often over the years, and I don’t exactly disagree, but I think there’s still value in integration tests. An integration test can’t tell you whether the software is useful, but it can at least tell you that the software isn’t behaving in ways the programmers didn’t intend. If your integration test starts spewing database errors, it’s probably a sign of a bug even if your customer doesn’t know what end results they want.

                              But I agree broadly that the less you know about what you’re trying to accomplish, the less useful it becomes to spend a lot of time writing exhaustive tests or machine-readable formal specifications.

                          2. 4

                            Wrote 2000 words of thoughts on this! Here

                            1. 2

                              OP here. Thanks for writing such a detailed reply so quickly. It’s impressive!

                              I generally agree with most of your points but I have some comments about a few items:

                              My reference to a winter was a bit provocative to catch reader attention. But I think it was quite appropriate. Before the AI winter, all we had were brittle expert systems. I don’t think there was a huge industry interest, it was mostly academic and government funding. FM were pretty close by the mid 90s, just before Tony Hoare’s speech.

                              The Vienna Development Method was popular in some scandinavian and germanic CS departments. I have met Bjørner, one of the developers, and refinements of this method were taught and used well into the late 2000s. I am familiar with some current railway systems developed using VDM derivatives. I actually worked on one. Interestingly, Bjørner’s son is behind Z3, together with De Moura.

                              I like your point about Z3. In this regard, we should think about FM in a broader way. Type systems should be certainly included. Refinement (liquid) types are possible thanks to fancy SMT solvers. And refinement types are in many ways static DbC. In fact, some refinement types implementations push checks to runtime when they fail to prove something.

                              Static analysis is often ignored when talking about FM, but it scales to massive codebases and it has had huge successes. Galois connections are a great conceptual framework, and fast Datalog implementations help a lot in reducing the complexity of implementations. I think static analysis is gaining a lot of momentum, and we will see many more static analyzers by the end of the decade.

                              The last thing is that I wouldn’t discount major advances in theorem proving that automate many proofs using DL, augmenting the capabilities of things like Isabelle’s Sledgehammer. There are already promising publications in this avenue.

                              1. 2

                                Static analysis is often ignored when talking about FM, but it scales to massive codebases and it has had huge successes.

                                Facebook’s Infer is a perfect example of that. It’s used on both C++ and Java codebases. There were concurrency checkers that looked at various orders code could execute to spot likely problems. Non-FM experts use Rust’s borrow checker to solve such problems with effectiveness that used to take separation logic that Infer uses internally. Runtime Verification Inc built a tool used in industry on a formal semantics of C that leads to low or no false positives. A Google tool and ForAllSecure’s Mayhem automatically try to fix the problems. On impact, Microsoft’s driver verifier probably deserves higher mention given both deployment numbers and nearly eliminating the blue screens.

                                In academia or for hobbyists, many teams whose tools are entering competitions have been converging to use the same underlying codebase (CPA-Checker?). That they do means that a combo of them could be polished and packaged into an industrial tool. Other projects could generate automatic patches for whatever is found. Test generators could show equivalence on top of the benefits they already provide. If equivalent, automatically include the patch without even bothering to further analyze it. Lower and lower developer involvement with higher payoff (esp development velocity) = more adoption.

                            2. 4

                              TLA+, IMHO should be used by everyone building large complex software. It’s the only clear way to think about such systems.

                              1. 3

                                It depends on the nature of the proof. Honestly, though, I think that your thesis is wrong; there’s not been a slowdown in formal methods, but an explosion in the total amount of software. Polynomial amounts of progress in formal methods are dominated by exponential amounts of software as the ability to develop is democratized further.

                                Right now I would say that the field is marked by schism, not crisis. You mention Lean, but Mizar and (my favorite) Metamath show off completely different approaches to formalizing maths which have made similar progress on the Formalizing 100 Theorems benchmark. In particular, Metamath is syntax-driven and not type-driven, making it orders of magnitude faster than other systems.

                                Cryptocurrency research is usually not useful as presented; only the generalized results are really applicable. For example, compare Stay et al. on RChain research from the blockchain enthusiast’s perspective and Stay & Williams’ version for mathematicians which is general enough to be usable outside of blockchains.

                                1. 2

                                  @hwayne ping! I am sure you have things to say.

                                  1. 7

                                    Fun fact, lobsters doesn’t actually notify you when you do that. But I saw the question and I should really write up my thoughts!

                                    1. 3

                                      Yes, please do! This thread needs your thoughts.

                                      1. 3

                                        This is going long and will probably have to be a newsletter

                                        1. 1

                                          I am looking forward to these thoughts and / or a newsletter!

                                          1. 1

                                            I look forward to it… but please address the very real world issue of DbC interacting (in both directions) with compilers, linters and optimizers and unit tests.

                                            I’ve been writing code in longish list of languages, from mainframe days to deeply embedded systems, in government scientific research institutes to commercial and industrial.

                                            I promise you, as I described above this is the route for formal methods to make a huge, and growing and very real impact on the world.

                                      2. 2

                                        I am particularly interested in using logics to describe business rules and checking (optimistically) for completeness and consistency. Can the business rules handle every possible scenario? Do they always terminate? Are there contradictions between rules?

                                        Catala is a DSL for law but it seems like it could handle business rules in general.

                                        On the other hand I had success with solvers in operations research, if that counts as formal methods.

                                        1. 1

                                          I think (hope) the techniques of formal methods can be used to better describe systems of systems. Having a specification of how a set of services interact has long been a dream of mine. The types of properties I would love to prove are:

                                          • If service A is slow, how will a down stream can a downstream dependent system end in a different state?
                                          • Can I safely migrate this information from an old schema to a new schema?
                                          • Does the system design still meet all previous criteria after an expected change.

                                          One aspect of this that think needs better tooling and exposure is creating more meaningful simulations. I imagine other approaches might work but my personal experience with using formal methods in a organizations end with “that’s magic I don’t trust it”, “I don’t understand the presentation of this” or “I think the journey is more important that the outcome”. The few break through moments I’ve had are with simulations[1]. I think that this simulation approach allows a wider audience and meaning to any investment in a specification.

                                          [1] https://runway.systems/ or https://www.mcrl2.org/web/user_manual/index.html

                                          1. 2

                                            Who’s got two thumbs and just received his copy of Modeling and Analysis of Communicating Systems?

                                            this guy

                                            1. 2

                                              Probabilistic model checking can also help a bit with this.

                                              1. 1

                                                That’s an interesting idea. I’ve purely looked at it from an ordering of events or actions. If you have any recommended toolsets or articles about that, I would love to try them out.

                                                1. 1

                                                  PRISM introduced probabilistic model checking. There’s a lot of literature, tutorials and case studies associated to it: https://www.prismmodelchecker.org/

                                            2. 1

                                              An example of a lobster thread worth bookmarking. Wishing other threads would adopt a similar tone when presenting/responding.