1. 76

  2. 15

    This post is entitled “Why Don’t People Use Formal Methods?”, but I feel it’s maybe underselling itself. It does explain why people don’t use formal methods, in the author’s opinion. But it also explains why you might use formal methods in the first place — better than most other sources I’ve seen.

    1. 9

      Formal methods have applications beyond “verification”. Several problems can, with relative ease, be modeled in decidable fragments of logic. Coupled with advances in efficient decision (and semi-decision) procedures, this gives us a new set of tools for analysis.

      Consider, for example, efficient code generation in compilers [1, 2], and the successful synthesis of system architectures for Boeing’s 787 [3].

      So formal methods are not just about proving correctness as an afterthought. Rather, developments in formal methods can provide value in other aspects of the design flow as well.

      [1]https://github.com/google/souper/

      [2] Static Analysis for GPU Program Performance

      [3] Synthesizing Cyber-Physical Architectural Models with Real-Time Constraints
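
      For a feel of what [1] does, here’s a toy sketch in Python (hypothetical, and nothing like souper’s actual algorithm): over a finite domain such as 8-bit integers, exhaustively checking all inputs is a complete decision procedure for expression equivalence, which is enough to “synthesize” a cheaper equivalent expression.

```python
# Toy sketch of superoptimizer-style synthesis (hypothetical example).
# Because 8-bit integers form a finite domain, checking all 256 inputs
# is a complete decision procedure for equivalence of two expressions.

MASK = 0xFF  # model 8-bit machine integers

def target(x):
    return (x * 2) & MASK  # the expression we want to optimize

# Candidate replacement expressions, each assumed cheaper than a multiply.
CANDIDATES = [
    ("x + 1",  lambda x: (x + 1) & MASK),
    ("x | 2",  lambda x: (x | 2) & MASK),
    ("x << 1", lambda x: (x << 1) & MASK),
]

def synthesize():
    for name, f in CANDIDATES:
        # Exhaustive check over the whole (finite) input space.
        if all(f(x) == target(x) for x in range(256)):
            return name
    return None

print(synthesize())  # → x << 1
```

      Real tools search a vastly larger candidate space and use SMT solvers instead of enumeration, but the decidability of the fragment is what makes the whole approach workable.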

      1. 4

        Static Analysis for GPU Program Performance

        This one caught my attention, so I tracked down the paper (a 2018 thesis) and the GitHub repository, for future reference.

      2. 6

        Your point about “people assume their current technique is sufficient” is also borne out by my experience. It’s coupled with a second observation: people assume the alternative technique is too hard/expensive/inapplicable, or they would already be doing it. I gave a talk a couple of weeks back about Java by Contract, and one of the first “questions” was “but I would have to learn this and train my team, when they already know TDD”. People forget that they were saying the same about TDD training 1-20 years ago.

        In my reading of history, formal methods was largely divided along the Alan Kay “observation that the Atlantic ocean has two shores” line until the SV startup-style, “worse is better” school won out. That school tends to be quite pessimistic about reuse (why would we document or modularise this, when our funding runs out in three months and a new backend stack will lead us to desire a rewrite in six?), so many applications have traditionally been rewritten from scratch, using libraries only in the infrastructure and basic computing domains. By the time reuse became popular with rubygems, the startup school had come as far as thinking that TDD was a correctness technique. Turns to camera: it isn’t.

        1. 5

          One reason is that they provide no benefit for a large proportion of applications, particularly those that model large, complex domain problems where the domain knowledge lies outside the development team: problems outside the domains of computing, its infrastructure, or data science. In these, the problem is not the correctness of the software.

          1. 14

            I would disagree. For example, business process modelling is a huge market, and how to implement such processes securely is a big research problem (SAP has whole offices of researchers for it). Large companies regularly go bankrupt by essentially finding modelling errors too late. A lot of the techniques they research are very similar to things like TLA+. When I worked there, I worked on proving properties about supplier graphs.

            I’ve had clients with something as seemingly simple as a messenger app crash and burn when they hit all the edge cases they hadn’t seen. Sure, there are other factors here (other bad development practices, etc.), but I don’t agree that hiring an expert to essentially model and bullet-proof their messenger system would have cost more than a team of 5 programmers figuring everything out on their own.

            The current problem I see is that we’re still lacking easily usable tools. I’ve seen clients using BPMN to inform developers of product processes, instead of a bunch of stories, as a shared artifact between management and devs. Encouraging that, and putting some form of additional checking by experts in the middle, sounds like a workable sketch.

            1. 3

              The building blocks can still be proven correct.

            2. 3

              Absolutely magisterial. I’m saving this as a good starting point for all such future conversations.

              1. 3

                I just thought of an argument that should probably be included in a sales pitch for formal methods. Most people will agree that code is a liability, and tests are more code. So one way to mitigate the risk associated with code is to “diversify” with something that is not code. Formal specifications, in that sense, are a de-risking instrument.

                Maybe this is implicit in the article, but I think it would be worthwhile to unpack, because I don’t see any obvious shortcomings with the “diversity”/“diversification”/“portfolio management” argument.

                1. 2

                  I found this educational and interesting, but I still didn’t find anything to tell me why I need formal methods on top of or in contrast to a testing + static types approach (leaving aside contracts, which I’ve barely touched, and which you mention as one of the three methods for specifying code in the “Getting the Spec” section). You seem to suggest that you can get to 99% correct without formal methods in the “Why Bother?” section…so why would I bother with formal methods? I don’t feel like you answered this in a way I could understand–forgive me if I’m just not getting it.

                  More specifically, if I’ve already invested a lot of time and energy into learning about type systems and everything they can encode, why wouldn’t I just push ahead with learning more about e.g. dependent types, which seem like they can do a lot of the same, like encoding the proof for a sorting algorithm, similarly to the insertion sort example you gave?

                  Or say we’re concerned with distributed system issues–would we be better served if we could actually encode our message-passing protocol in the type system? Or what about starting from the spec itself to generate programs/proofs–amazing stuff! But do they achieve everything you can achieve with formal methods, with the same amount of effort (acknowledging that some of the stuff I listed above is still more theoretical than practical)?

                  You say there is overlap in PLT and formal methods…where is that overlap, meaningfully speaking? I’d love for you to expand more on your points re: Pony/Rust/Haskell at the end of the “Partial Code Verification” section. Or, let’s assume I’ve bought into the value of spec/design languages as a given…what can they express, or what mechanistic approaches do they provide for verification, that I couldn’t get through using a (sufficiently sophisticated) type system? Your description of a model checker sounds a lot like property-based testing–how are these effectively different for my purposes as an industry programmer working on…um, web and mobile apps?

                  I hope it’s clear that I’m sincerely curious about this, and not trying to cast doubt on the value of formal methods. I also want to be clear that I don’t think static typing is a silver bullet or sufficient on its own, and I recognize that the approaches I’ve suggested above have their own limitations and challenges; I myself have written plenty of Haskell code that type-checks perfectly and is absolutely, horrendously incorrect (I think I’m quite good at this actually). I don’t have the knowledge to state that one is better than the other and I believe it’s more nuanced than that regardless. I guess what I’m trying to glean, fundamentally, is how much energy to expend on learning more about formal methods, and in order to do so I want to know exactly what they can and can’t do in contrast to the approaches I already know, and how hard they are to leverage. I didn’t get that from this piece, and in fact it almost seems like you’re arguing “well, if 99% is good enough then maybe formal methods aren’t for you…” but I suspect that’s not the right takeaway.

                  Let me just reiterate: this was a very informative and interesting piece, and thank you for writing it. It just left me with more questions!

                  1. 2

                    Thanks for the questions! Lemme see if I can answer a few:

                    You seem to suggest that you can get to 99% correct without formal methods in the “Why Bother?” section…so why would I bother with formal methods? I don’t feel like you answered this in a way I could understand–forgive me if I’m just not getting it.

                    Well it’s called “Why don’t people use formal methods?”, not “why should people use formal methods?” :P

                    More seriously, I wrote this because, as a formal methods expert, I think there are important challenges in the field and serious tradeoffs in practice. Most anti-FM pieces I see, though, miss almost all the nuance, focusing on minor issues and ignoring the major ones. I don’t want to evangelize something I can’t also tear to pieces.

                    wouldn’t I just push ahead with learning more about e.g. dependent types, which seem like they can do a lot of the same, like encoding the proof for a sorting algorithm, similarly to the insertion sort example you gave.

                    Dependent types are considered formal methods. Here’s an implementation of insertion sort in Idris, which leverages dependent types to prove the list is sorted.

                    The challenge with dependent types, though, is the same as the challenge with logics and theorems: specifying them is hard, and verifying them is much harder.

                    Or say we’re concerned with distributed system issues–would we be better served if we could actually encode our message-passing protocol in the type system?

                    Message passing protocols are a relatively small part of the challenges of a distributed system. It definitely would help to be able to say we don’t send messages after we close the channel. But what if I want to guarantee that my MapReduce design will always compute the correct answer? Or that my deployment system does not surface different application versions? Or that S3 works right? All of these are pretty easy to do with design specifications, but very hard to do with code verification.

                    The difference, I think, is that concurrent types are theoretically effective for simple problems, while design specifications are empirically effective for large problems.

                    Or what about starting from the spec itself to generate programs/proofs–amazing stuff!

                    I was lucky enough to see that talk in person: Nadia Polikarpova is one of my heroes and I probably would have attended even if the talk was “birdwatching 101”. I don’t remember whether she talked about it in the talk or in the questions after, but she cautioned us that synthesis is nowhere near production ready. I vaguely remember it being something like “definitely not in the next ten years”, at least, and probably a lot longer.

                    But do they achieve everything you can achieve with formal methods, with the same amount of effort (acknowledging that some of the stuff I listed above is still more theoretical than practical)?

                    The “more theoretical than practical” part is really important here. In theory, there’s no difference between theory and practice. In practice, there is. Remember, the first formal verification languages came out fifty years ago, and dependent types as a concept are even older. We’ve had a lot of ideas on how to make formal verification easier, but so far most of them haven’t panned out.

                    Your description of a model checker sounds a lot like property-based testing–how are these effectively different for my purposes as an industry programmer working on…um, web and mobile apps?

                    Model checking tends to work as formal design verification, PBT is usually informal code verification. The advantages of the former are:

                    1. Exhaustive coverage
                    2. Higher level analysis
                    3. Model checkers can check additional properties, such as liveness (“all received messages are eventually responded to”).

                    The downside, of course, is that they can’t check the code itself. There’s not a whole lot of overlap between the PBT and FM worlds, which is a huge shame, because the two have a lot to learn from each other.
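
                    To make the “exhaustive coverage” contrast concrete, here’s a toy explicit-state model checker in Python (an illustrative sketch, nothing like a real tool such as TLC). Unlike PBT’s random sampling, it visits every reachable state of a small model, so the classic lost-update counterexample is found with certainty.

```python
# A minimal explicit-state model checker: exhaustively explore every
# interleaving of two non-atomic "increment" threads and check an
# invariant in every reachable state.
from collections import deque

def next_states(state):
    """Each thread does: (pc 0) read x into temp, (pc 1) write temp + 1."""
    pcs, temps, x = state
    for i in (0, 1):        # nondeterministically schedule either thread
        if pcs[i] == 0:     # read step
            t = list(temps); t[i] = x
            p = list(pcs); p[i] = 1
            yield (tuple(p), tuple(t), x)
        elif pcs[i] == 1:   # write step (lost updates possible!)
            p = list(pcs); p[i] = 2
            yield (tuple(p), temps, temps[i] + 1)

def check(invariant):
    """BFS over ALL reachable states; return a counterexample or None."""
    init = ((0, 0), (0, 0), 0)
    seen, queue = {init}, deque([init])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            return state
        for s in next_states(state):
            if s not in seen:
                seen.add(s); queue.append(s)
    return None

# Invariant: once both threads have finished, the counter equals 2.
bad = check(lambda s: s[0] != (2, 2) or s[2] == 2)
print(bad)  # → ((2, 2), (0, 0), 1): both threads read 0, so one update is lost
```

                    A random tester might or might not hit the bad interleaving; the BFS provably visits all of them. The flip side, as noted above, is that it checks a hand-written model, not the production code.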

                    Re web and mobile apps: I got started with FM by applying it to web apps. I gave a talk on it here. It probably saved us about 300k a year purely in business savings, and about half as much again in saved maintenance time.

                    I didn’t get that from this piece, and in fact it almost seems like you’re arguing “well, if 99% is good enough then maybe formal methods aren’t for you…” but I suspect that’s not the right takeaway.

                    Haha, well it might be! Most of my essays are super, super pro formal methods. My biases are that in the vast majority of cases, formal design and code specification are incredibly valuable. But in almost all cases, code verification is not worth the expense. However, I also wanted to be clear on why it’s not worth the expense, and how we know when we’ve found cases where it is worth the expense. There’s a lot of careful nuance here and I don’t want to overgeneralize.

                    1. 1

                      Hillel, sorry for my slow response–I just got busy. Thank you very much for your thoughtful reply. It helped me to understand better where I’ve been making the wrong assumptions (I did not realize dependent types were considered a form of formal methods!) or confusing concepts (I think I’d been conflating code verification and design verification, for one…there are more). I’m going to go back and read your post again, but I’ll just add that this sentence in particular:

                      The difference, I think, is that concurrent types are theoretically effective for simple problems, while design specifications are empirically effective for large problems.

                      …helped some things click for me, even thinking more generally than just about session types–it seems like there’s a germ of a statement about the qualities that make static typing distinct from formal methods in terms of where and when and how they are useful. Although, I could be reading too much into it.

                      I was lucky enough to see that talk in person: Nadia Polikarpova is one of my heroes and I probably would have attended even if the talk was “birdwatching 101”. I don’t remember whether she talked about it in the talk or in the questions after, but she cautioned us that synthesis is nowhere near production ready.

                      Ha, then we were in the same room together, assuming we’re both talking about Strange Loop last year. I loved that talk; it was maybe my favorite talk at Strange Loop, leaving aside Frank Pfenning’s talk at PWLConf (one of my other links was to the slides for that talk). I was very impressed with Prof. Polikarpova, both in terms of the vision she presented and in her ability to make a compelling presentation. I have to admit I left the room somewhat quickly at the end because there was such a crowd forming around her, but it sounds like I should have stuck around!

                      1. 1

                        “There’s not a whole lot of overlap between the PBT and FM worlds, which is a huge shame, because the two have a lot to learn from each other.”

                        There has been some effort in this direction to have testing and theorem proving work together.

                        Integrating Testing and Interactive Theorem Proving (https://www.ccs.neu.edu/home/pete/pub/acl2-2011.pdf)

                    2. 2

                      I wouldn’t even know how to go about formally proving code is correct.

                      For instance (for a personal project)—given a string of ASCII characters (32 to 126 inclusive) break it into a series of lines no longer than N characters wide (for display purposes). Each line is to be as long as possible, but no longer than N characters. The original string can be split at space (character 32) or dash (character 45). If broken at a space, the space character is not included in either half; if broken at a dash, the dash is to remain at the end of the (and here, I want to say left hand side, but I haven’t formally defined that yet, but I will anyway—see how hard this is getting?) left hand string. If there is no break point in N characters, then just break at N characters.

                      So how would I 1) formally specify that, and 2) formally prove my code works?

                      And this problem is the simplified version, as I’m dealing with UTF-8, which is another level of madness.
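
                      For what it’s worth, the ASCII version of those rules has a short greedy implementation. This is an illustrative sketch with no claim that it matches every intended corner case (which is exactly the specification problem being discussed):

```python
# Greedy line breaker for the rules as stated above (ASCII only):
# break at a space (dropped) or a dash (kept on the left), each line
# as long as possible but at most n chars; hard-break at n otherwise.
def split_lines(s, n):
    out = []
    while len(s) > n:
        space = s[:n + 1].rfind(" ")  # a space at index n still works: it's dropped
        dash = s[:n].rfind("-")       # a dash must fit on the line it ends
        if space == -1 and dash == -1:
            out.append(s[:n])         # no break point at all: hard break at n
            s = s[n:]
        elif dash > space:            # the dash yields the longer line here
            out.append(s[:dash + 1])  # dash stays at the end of the left-hand string
            s = s[dash + 1:]
        else:
            out.append(s[:space])     # the space is not included in either half
            s = s[space + 1:]
    out.append(s)
    return out

print(split_lines("one two three four", 10))  # → ['one two', 'three four']
print(split_lines("a" * 24, 5))  # → ['aaaaa', 'aaaaa', 'aaaaa', 'aaaaa', 'aaaa']
```

                      Notice how much of the prose spec (rightmost vs. first break point, whether a space exactly at the boundary counts) had to be pinned down just to write this; formally specifying it faces the same questions.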

                      1. 5

                        I totally recognize that problem (it’s the problem of “What’s the right spec?”) and think it’s a deep issue in FM and program design, but just for fun I wanted to try writing a spec in weird Dafny-ish pseudocode:

                        def line_splitter(str: string) returns (out: seq<string>)
                        requires
                          ∀ char ∈ str: 0 ≤ ASCII(char) ≤ 126
                        ensures
                          concat(out) == str
                          ∀ line ∈ out: len(line) < N //max size
                          ∀ i  ∈ 0..len(out)-1: //break point
                             either 
                               1) out[i].last = "-"
                               2) space_exists_at_boundary(out, i)
                               3) let extra := N - len(out[i]) in //there was no better point to break at
                                 ∀ char ∈ 0..extra: 
                                   either 
                                     1) out[i+1][char] ≠ " ", "-"
                                      2) len(out) < extra
                        

                        “As long as possible” is a weird one, because it potentially implies global optima. If you’re doing greedy matching you could move the postcondition about extra to also hold even if the last character is - or =. Note that this would be impossible to verify as you can’t ensure the user won’t pass in aaaaaaaaaaaaaaaaaaaaaaaa, which breaks the length invariants.

                        (the above is super janky and improvised; I just whipped it up on the spot, so there are probably a bunch of holes in that spec)

                        1. 2

                          One bug—requires should be:

                          ∀ char ∈ str: 32 ≤ ASCII(char) ≤ 126
                          

                          The “as long as possible” is to avoid breaking too soon, like at the first possible break point, so that with an input of “one two three four” and N=10, you don’t get

                          one
                          two
                          three
                          four
                          

                          Your last case of “aaaaaaaaaaaaaaaaaaaaaaaa” I did hint at with “[i]f there is no break point in N characters, then just break at N characters.” So if we have N = 5 we end up with

                          aaaaa
                          aaaaa
                          aaaaa
                          aaaaa
                          aaaa
                          

                          But I was able to follow the logic. Thank you.

                          1. 2

                            that would be something like len(out[i]) == N => "-" not in out[i] and " " not in out[i]

                            Where I don’t have fancy existential operators on my keyboard. Also => is the implication operator.

                          2. 2

                            This pseudo-specification that you’ve whipped up is exactly the sort of thing that I’d like a programming language to let me define inline, right next to the code it specifies, and check at compile-time that the code I wrote matches the spec. If there was some tooling that let me do this for some programming language, that would be ideal, even if this pseudo-specification is off-the-cuff and “wrong” in some way.

                            1. 4

                              How about VeriFast?

                              1. 1

                                … a programming language to let me define inline, right next to the code it specifies, and check at compile-time that the code I wrote matches the spec.

                                Absolutely. I recall watching an @hwayne talk about this sort of thing. I believe it was this talk.

                                There’s also this post on contract programming which shows some demo contract code in a few languages. The D and Eiffel examples look pretty interesting.

                                1. 1

                                  SPARK Ada and Frama-C are the ones with the biggest industrial adoption that mix specs and code. Check them out. Anything you can’t prove might turn into a runtime check.

                                2. 2

                                  Wow, this thing, as well as the Dafny thing, really looks reasonably approachable! The Let’s Prove Leftpad repo of yours, which you linked in the article, is also super interesting. That said, based on the repo, I find Dafny the most friendly (surprisingly so), followed by Why3. TLA+ is already stretching it, and all the remaining ones look super alien and waaaay over my head. At first glance, Dafny looks like something I could try using. Which leads me to a question: is there some catch? Is Dafny notably worse than the others in some way? Or is it equivalent, and just some recent development where its authors tried hard to make it friendly to a “common programmer”?

                                  Also, can I just go and use Dafny now, when writing my regular day-to-day CRUD/CLI/GUI/… apps?

                                  Also, is there something like Dafny for some other languages than C#?

                                  1. 1

                                    I’ll also note that Why3 is a backend targeted by SPARK Ada, Frama-C, and many tools in CompSci. It feeds multiple solvers with different strengths and weaknesses. You get that, plus possible porting of features from other languages, if you pick Why3.

                                3. 3

                                  Formal proof is most useful when implementation is significantly more complex than specification. A classic example is sorting: merge sort (especially optimized one) is more complex than what sorting is. A more involved example is register allocation: you specify “live ranges should not conflict”, and prove graph coloring implementation.
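
                                  That asymmetry is visible even in a plain runtime check (an illustrative Python sketch): the specification of sorting fits in one line, while the merge sort it checks is several times longer and contains all the tricky parts.

```python
# Sorting: the spec is two short properties, the implementation is not.
from collections import Counter

def is_sorted_permutation_of(out, inp):
    """The whole spec: output is ordered and is a permutation of the input."""
    return all(a <= b for a, b in zip(out, out[1:])) and Counter(out) == Counter(inp)

def merge_sort(xs):
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):   # the merge step: the tricky part
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]         # append whichever half remains

data = [3, 1, 4, 1, 5]
assert is_sorted_permutation_of(merge_sort(data), data)
```

                                  A formal proof plays the same role as the assert, except it covers every possible input rather than one.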

                                4. 1

                                  This was outstanding, just really well written. It would be nice to have a similar outline/history from someone conversant with the history of theorem proving. The introduction of the book Coq’Art actually does this to a limited degree, but it’d be nice to have something a bit more comprehensive.

                                  1. 1

                                    @hwayne What is the story behind VCS removing chapters of a book, that you’re not bitter about?

                                    1. 2

                                      Some weird interaction between OneDrive, Sharepoint, Office Uploader, and being offline made the system overwrite my then-current draft of a chapter with a version two months older.

                                    2. 1

                                      I’m super grateful to you for writing and sharing this article; I find it extremely interesting and enlightening. I especially admire and respect the way you succeed in translating a lot of the alien FM terminology into things understandable for me as a programming practitioner. If only all FM texts were written like this!… (Is your book also so approachable?)

                                      Which leads me to one factor I believe to be important, and not explicitly pointed out by you in the article. That is, a language (terminology) mismatch between FM researchers and programming practitioners. Expanding on that, I believe good education, documentation, and marketing efforts could go a long way to help popularize the FMs. See how popular your article got on lobste.rs, HN, and r/programming? If only all FM texts… Please know I’m fully aware it’s not easy to write so well. I have huge respect for what you did here! In fact, I think this article might already be one of the most important things I read in 2019.

                                      A somewhat related point: I think ease of use and familiarity could also be important factors in making FM more appealing to programming practitioners. Being one of them, I was immensely positively surprised with how friendly and attractive the Dafny example in your proof systems comparison repo looks to me. I attribute it to how it’s similar to my regular day-to-day imperative code, and also relatively concise (compared to some verbosity monsters in the repo). More than that, the invariants and “ensures” clauses are not far from what I might already want to write in comments! If I could use them like that in some language I’m already writing in (Go, Rust, …), that would be really tempting, and would make a potential switch so much easier! But if I have to learn a completely new language, this adds a lot of cognitive effort; and the actual programming language would also have to be really good and usable to convince me to use it in practice… On the other end of the spectrum, when reading a TLA+ tutorial recently, seeing a paragraph describing how one needs to save and restart often because the IDE is regularly crashing made me sigh deeply. That does not sound like the kind of reliable, easily usable tool a practitioner in any profession would be tempted to use…

                                      Finally, one more thing that you already mentioned in the article, but which I believe has quite significant importance. When trying to learn some Idris (IIRC) and OCaml GADTs, I also realized that this often requires a very different mindset than regular programming. That already makes it hard. But I also noticed one more thing — that I believe the need to build proofs has a tendency to notably constrain the creativity space of the “regular” code (i.e. the “classical” code which is the subject of the proof). I think this is the underlying truth behind what Rust newbies call “fighting the borrow checker” (and practitioners call “listening to what it tells you”); what JS people trying statically typed languages call “losing the prototyping speed”; what was, at the “dawn of times”, probably called “losing access to some important techniques” when going from assembly to C. Per this argument, I’ve recently been thinking I would be really interested in a language that would support “gradual typing & partial verification” along a huge spectrum of typing disciplines — e.g. spanning something like:

                                      JS/LISP/Lua/… (dynamic typing)
                                       ↕
                                      Go/… (basic static typing)
                                       ↕
                                      Rust (linear typing?)
                                       ↕
                                      Haskell/OCaml GADTs
                                       ↕
                                      Idris/… (full provers)

                                      I’m thinking it could be really interesting if the language would allow choosing the desired balance between speed of prototyping (i.e. quickly capturing some vague thoughts as a draft) vs. robustness of full proving/FM, and choosing it differently for any fragment of a single codebase, and then changing it with time when advancing the code’s maturity. Do you know of anything like this? Maybe some kind of a language with pluggable type systems? Would it even be possible? Or does it maybe already exist?

                                      Ah, and one more thing. Regarding the “Programmers tend to mistrust software artifacts that aren’t code or forcibly synced with code” quote, and then “documentation, comments […] are often neglected”. In a way, I would urge FM advocates to look at this as an opportunity, not a problem. My argument here is that of doxygen/javadoc/godoc. While still not a perfect panacea, I believe those tools were a breakthrough which significantly improved the way documentation/comments are maintained. Can DV tools be used like this? If not, could they?

                                      1. 1

                                        Clojure does have spec which allows optional requirements of function calls

                                        1. 1

                                          Sorry, I can’t grasp/parse the fragment: “optional requirements of function calls”; would you fancy rephrasing/elaborating a bit more, and/or sharing a link to the spec?

                                          1. 2

                                            Basically, after you have written a loosely typed function, you can go back and describe the arguments. You can add any restriction you can check for with code, and the language will enforce it. The tool is called clojure.spec, and it allows what you describe: taking a loosely typed thing and adding rigor as you need or want it.
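
                                            For readers who don’t know Clojure, a rough Python analogue of that workflow might look like this (a hypothetical sketch, not clojure.spec itself): the function is written loosely first, and arbitrary code-checkable restrictions are bolted on afterwards.

```python
# A "spec added after the fact" sketch: a decorator that attaches
# arbitrary predicates to a function's arguments and return value.
import functools

def spec(*arg_preds, ret=None):
    """Attach per-argument predicates (any checkable restriction) to a function."""
    def decorate(f):
        @functools.wraps(f)
        def wrapper(*args):
            for pred, arg in zip(arg_preds, args):
                assert pred(arg), f"argument {arg!r} violates its spec"
            result = f(*args)
            if ret is not None:
                assert ret(result), f"return value {result!r} violates its spec"
            return result
        return wrapper
    return decorate

# The function existed first; the spec was added later, as tight as desired.
@spec(lambda n: isinstance(n, int) and n >= 0, ret=lambda r: r >= 1)
def factorial(n):
    return 1 if n == 0 else n * factorial(n - 1)

print(factorial(5))  # → 120; factorial(-1) would fail its argument spec
```

                                            Like clojure.spec, this checks at runtime rather than proving anything statically, but it shows the “gradual rigor” idea: the checks can be added, tightened, or removed independently of the code they wrap.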

                                      2. 1

                                        The question, then: “is 90/95/99% correct significantly cheaper than 100% correct?” The answer is very yes.

                                        :D

                                        Great article. And I’m even more hyped to read my fresh copy of Practical TLA+.