Awesome experiment and writeup. It was fun reading all the tweets. I’m commenting on just a few.

“Right after that the challenge hit theorem-prover Twitter and everybody started sending in solutions. We now have verified Leftpad in Liquid Haskell, Agda, Coq, F*, Isabelle, and Idris.”

Given that, maybe a future challenge should be primitives and meta-functions for something like LISP, Forth, or Red. Throw every testing tool we have at them, running tests generated by one implementation against all the others. Then we’ll have almost 10 formally verified, rigorously tested versions of the same ultra-flexible interpreter, able to express anything in terms of a few verified primitives. At some point, the people who build automated methods pick it up. Success.

Note: Myreen et al.’s HOL-to-ASM tool is a precedent here, since it was used in Milawa and CakeML after his LISP 1.5 implementation. A good-enough component can get lots of reuse there, just like with regular programming.

“One person is currently trying to prove it with mypy. I’m like 90% sure this is impossible but really hope it isn’t.”

That’s hilarious. It’s also an opportunity for a verification student to test their ability to prove properties of type systems: compare the properties people are trying to express against what mypy’s type system can actually handle, and prove it possible or impossible.

“shared his thoughts on what made imperative “harder” than FP. His thoughts: the heap. A bunch of other TPPs immediately went “oh yeah, that makes perfect sense.””

Yeah, that was my thought, because most of the verification work for efficient imperative code uses separation logic. The introductions I’ve seen say it’s specifically about modeling heaps to help cover things like pointers and concurrency. People are still extending it in different ways to make expression and analysis easier and/or more automated.

The mental model I got as an outsider, partly informed by Clay and Simpl (which do it differently), is that having the heap in there makes the program like two programs interacting with each other. One is your functions. One is the heap operations with internal state. Instead of proving just one, you now have to prove both of them plus their interactions. It’s unsurprising to me that it got used for concurrency, since that mental model is similar to both it and methods for I/O verification. You should even be able to model these concepts in PlusCal. I know pointer interactions were attempted in SPIN in the past, too.
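A toy illustration of that two-program picture (my own sketch, not from the thread): a postcondition that is trivial under pure-function reasoning becomes conditional on heap layout as soon as arguments can alias.

```python
def append_all(dst: list, src: list) -> None:
    # "Obvious" postcondition: len(dst) grows by len(src).
    # The proof silently assumes dst and src occupy disjoint
    # heap regions -- the assumption that separation logic's
    # separating conjunction makes explicit.
    for x in src:
        dst.append(x)

a, b = [1, 2], [3]
append_all(a, b)
assert a == [1, 2, 3]  # holds: a and b are separate lists

# append_all(a, a) would never terminate: appending to the list
# being iterated keeps extending it, so under aliasing the
# postcondition is not even well-defined.
```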

“We can even look to integrate with constraint solvers and see how they can helps us.”

People have been using constraint solving with proving for a while, even though it’s not really popular. I’m not sure if that’s for social or technical reasons. Two people here told me that Mozart/Oz had a constraint solver built in, with many uses. When I submitted the CHR paper, the real reason I was looking at it was as a candidate for integrating constraint solving into the Why3 platform, so we’d have yet another class of automated solvers to hit stuff the others couldn’t handle. Like Why’s language, CHR had already been used with multiple solvers. If you want to play with solvers themselves, I’ve gotten more recommendations for MiniZinc over time than anything else, with mzl mentioning Gecode and Choco.

“Graydon Hoare and Ron Pressler got into a debate on the economics of formal verification.”

Same nonsense he said to me about formalisms, where he said we should all just use set theory and first-order logic like in TLA+; nothing else is needed past regular math as he describes it. I pointed out some productivity benefits new methods brought. However, I think Graydon utterly destroyed his argument by the sheer weight of the examples he listed. I particularly like how he shows that each generation of formalisms or tools does things nobody would even attempt before them, things people thought might be impossible in general or with automated analysis. The new forms of expression made one or both easier. So the evidence favors new formalisms being a good thing when they bring such benefits.

Probably just going to save that discussion for when Ron brings it up again. He definitely will.

“It’s a diverse field”

Very much. The only thing I wish I saw was the algebraic/rewriting side of verification represented on the list, as in the K Framework or Maude style. Rosu et al. even invented matching logic specifically to address weaknesses of the separation logic most have standardized on for heaps. Saying that and proving it are different things, though. So, I’d rather see their versions in K, matching logic, or whatever they’d use.

“ the most robust “big programs” are all written in Ada or DO178C-compliant C. Does that make C the best language for reasoning about code? I’d personally put much more faith in the Cleanroom process than any language when it comes to correctness.”

I’ll end on that. If we bring in empirical evidence, we have the Cleanroom studies showing low defects on the first try plus easy learning, studies on Ada showing defect reduction plus good productivity and maintenance, industrial use of SPARK with near-zero-defect software going back over a decade, and a few tools reliably finding concurrency problems. Combining formal and empirical, the strongest evidence currently points to Cleanroom + Ada 2012/SPARK 2014 + a safe concurrency model (only where needed!) = productive, maintainable, and nearly error-free programs, if your requirements and specs are right. That’s why Cleanroom is iterative, with verification and testing phases to spot screwups.

I guess I’m a little unclear on what this post was supposed to have shown. The original premise seems to have been that “people claim FP is more intuitive/more natural/better than IP, which is equivalent to saying it’s easier to make formal proofs in FP than in IP.” Whether FP is better or worse than IP, I have a hard time seeing these two claims as related. Something being easy or difficult to prove doesn’t make it easy or difficult to write simply, to the standard of a normal software engineer.

To me, this post provides pretty clear evidence that making a claim like: “Writing Haskell means that if it compiles, I know it’s correct,” is wrong. It’s shown that types alone aren’t enough to prove correctness and that proving correctness is laborious and hard, whether you have a heap to deal with, or not.
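As a minimal illustration (mine, not from the post): a function can carry exactly the type leftpad should have and still be wrong, because the type relates the shapes of inputs and outputs, not their values.

```python
def leftpad(c: str, n: int, s: str) -> str:
    # Accepted by any type checker, yet obviously not leftpad:
    # nothing in the signature str x int x str -> str forces the
    # result to be padded, or even to be longer than s.
    return s

# No type error anywhere, wrong behavior everywhere:
print(leftpad("0", 5, "42"))  # prints "42", not "00042"
```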

I agree that the jump from “intuition” to “provably correct” is not an obvious one. But by comparing the implementation of a formally verified proof in an FP language vs. one in an IP language, it becomes pretty clear that formal verification is hard no matter what. The FP folks had just as much trouble as the IP folks. The FP folks wrote just as many pages of code as the IP folks did, i.e. FP didn’t provide more intuition and wasn’t obviously better.

One thing I completely forgot to say in the article was how hard it was for me to write these functions. The first two took about an hour each; the last one took about three. And this was with no prior experience writing code proofs.

In this case, what we’re comparing is “difficulty in writing a proof for an imperative algorithm” with “difficulty in writing a proof for a purely functional algorithm.”

Congratulations on not only doing your first proven programs but making waves as you did. I’m curious, though, what you read or watched to learn how to write the formal specs, especially if you think other beginners would find it helpful.

The FP folks had just as much trouble as the IP folks.

I agree; I think that’s the part that violates a widely held belief: many people intuitively believe that FP code should be easier to formally verify than imperative code, because of referential transparency etc. I’m not sure this is a widespread view among people who do verification, but it’s a fairly common view among regular FP programmers.

This was great to see unfold and pretty funny at times.

Leftpad. Takes a padding character, a string, and a total length, returns the string padded with that length with that character. If length is less than string, does nothing.

Shouldn’t this be “string padded to that length” and “length is less than the length of the string”? The second isn’t that important, but the first seems like it could have meant `c*l + s` (instead of `c*(max(0, l - len(s))) + s`).
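Under the corrected wording, a hypothetical reference implementation (mine, in Python; not one of the verified submissions) would be:

```python
def leftpad(c: str, l: int, s: str) -> str:
    # Pad s on the left with c until the total length is l;
    # if l is less than (or equal to) len(s), return s unchanged.
    return c * max(0, l - len(s)) + s

assert leftpad("0", 5, "42") == "00042"  # padded *to* length 5
assert leftpad("0", 1, "42") == "42"     # target shorter: no-op
```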

Here’s the catch: I formally proved all three functions are correct. You have to do the same. And by “formally prove”, I mean “if there are any bugs it will not compile”. Informal arguments don’t count. Quickcheck doesn’t count. Partial proofs (“it typechecks”) don’t count.

I’ve never really understood the distinction being drawn here. If I want to prove that e.g. the sum of two even naturals is always even, and I write something like:

// definitions
trait Nat
trait Eq[A <: Nat, B <: Nat]
trait Sum[A <: Nat, B <: Nat] {
  type AB <: Nat
}
object Sum {
  type Witness[A <: Nat, B <: Nat, AB0 <: Nat] = Sum[A, B] { type AB = AB0 }
  def apply[A <: Nat, B <: Nat]: Sum[A, B] = ??? // could be implemented inductively if need be
}
trait Product[A <: Nat, B <: Nat] {
  type AB <: Nat
}
object Product {
  type Witness[A <: Nat, B <: Nat, AB0 <: Nat] = Product[A, B] { type AB = AB0 }
}
trait _2 extends Nat
trait Even[N <: Nat] {
  type M <: Nat
  val witness: Product.Witness[_2, M, N]
}
// axioms (type parameters need the Nat bound for the Witness aliases to apply)
def additionDistributive[A <: Nat, B <: Nat, C <: Nat, AB <: Nat, AC <: Nat](
    witnessAB: Product.Witness[A, B, AB],
    witnessAC: Product.Witness[A, C, AC],
    sum: Sum[B, C],
    sumA: Sum[AB, AC]
): Product.Witness[A, sum.AB, sumA.AB] = ??? // could be implemented inductively
// proof
def evenAPlusB[A <: Nat, B <: Nat, AB <: Nat](
    evenA: Even[A],
    evenB: Even[B],
    aPlusB: Sum.Witness[A, B, AB]
): Even[AB] = new Even[AB] {
  val sum = Sum[evenA.M, evenB.M]
  type M = sum.AB
  val witness = additionDistributive(evenA.witness, evenB.witness, sum, aPlusB)
}

and say “it typechecks”, how is that different from a formal proof (indeed better, since if my proof is invalid in the sense of not following the rules for proofs, it won’t compile)? Of course there could be errors in my encoding of my axioms as function signatures, but that seems equally true for any form of formal proof.

Encoding proofs in type systems is totally valid. I added that because some people I was arguing with claimed that “it typechecks” always counts as a full specification of the behavior, even though that’s clearly not true for most types of most functions. As I cover later in the article, several people did solve these problems via type systems.

Also, as covered by the article, all of the proofs, regardless of style, prevented compilation if incorrect.

Also, as covered by the article, all of the proofs, regardless of style, prevented compilation if incorrect.

Well, they prevented compilation if the proof was invalid as a proof. They didn’t prevent compilation if the premises / claims were incorrect. You say in the article this became an actual issue in terms of the specification of unique.

Any typechecked program (in a language fragment where we disallow casts, unbounded recursion etc.) is a proof; the extent to which it is a proof that your proposition follows from your premises is the extent to which your types accurately encode that proposition and premises. So if I build (for example) a SortedList type whose constructors only allow it to be instantiated in ways that guarantee that it is sorted, then I’d argue that a function that returns a SortedList is formally proven to return a sorted list. Of course it’s completely possible for me to make a mistake and accidentally write a constructor that allows me to construct a SortedList that isn’t sorted - but this seems to be in the same category of errors as making a mistake in your formal specification of “sorted”.
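The SortedList idea can be sketched at runtime in Python (my sketch; runtime checks stand in for what a dependent or refinement type system would enforce statically). As long as every constructor upholds the invariant, holding a SortedList is evidence of sortedness.

```python
class SortedList:
    """Can only be constructed holding a sorted sequence."""

    def __init__(self, items):
        items = list(items)
        # The invariant every constructor must uphold; in a typed
        # encoding this check would be a compile-time proof obligation.
        if any(a > b for a, b in zip(items, items[1:])):
            raise ValueError("not sorted")
        self._items = items

    @classmethod
    def from_unsorted(cls, items):
        return cls(sorted(items))

    def to_list(self):
        return list(self._items)

# Any function returning a SortedList "returns a sorted list" --
# unless a buggy constructor breaks the invariant, which is the same
# failure category as a mistake in a formal spec of "sorted".
xs = SortedList.from_unsorted([3, 1, 2])
assert xs.to_list() == [1, 2, 3]
```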

Given that the article is trying to generalise a measure of how expensive/effective different verification techniques are from small examples, it’s really important to define clear and consistent criteria for the “boundary conditions” of what constitutes verification, as any difference in those will swamp the effect you’re trying to measure. Concretely I’d say you need a defined line between the parts where mistakes could lead to a program that produces incorrect output in some cases and the parts where no such mistakes are possible, and then a rule about what kind of statements a given entrant may include in the former (the premises/TCB/…). For the imperative-analysis style where you draw a sharp distinction between a program and its proof that line is obvious, but for other approaches it isn’t necessarily so, and differing interpretations can easily lead to applying a double standard to those approaches.

@hwayne you can call people bulldogs but I’m still super confused why you used leftpad as a “challenge” for functional programmers after seeing my functional version. Obviously it can be done. What’s the challenge?

But the first paragraph of this blog post shows your main problem:

Functional programming and immutability are hot right now. On one hand, this is pretty great as there’s lots of nice things about functional programming. On the other hand, people get a little overzealous and start claiming that imperative code is unnatural or that purity is always preferable to mutation.

Functional programming allows:

Mutation AND purity

Imperative programming AND purity

These aren’t antonyms; they’re complementary. That’s why I am always so confident that things can be done using FP with no downsides. In the worst case we can just take your code and embed it as pure functions. That holds regardless of whether I complete your challenge.

I’m still reimplementing Sonic 2 btw.

@hwayne you can call people bulldogs but I’m still super confused why you used leftpad as a “challenge” for functional programmers after seeing my functional version. Obviously it can be done. What’s the challenge?

As I said in both the Twitter thread and the full blog post, the problem with your submission is that you `assume` all your stdlib functions conform to your spec. That means the prover takes it as true without verifying it. It’s the equivalent of mocking everything out in a unit test.

I told you this, and your response was basically “it would be easy to fix, but I’m not going to do it.” The people who actually tried to fix it found it much harder than you thought.

In the worst case we can just take your code and embed it as pure functions. That holds regardless of whether I complete your challenge.

As everybody who completed Fulcrum admitted, “embedding it as pure functions” was incredibly difficult and took several days to prove. It took one of the core developers of Liquid Haskell almost four days to complete functionally. I hammered out my imperative proof in an afternoon.

I never submitted. I wrote some code to learn Liquid Haskell, you saw it and turned it into a challenge to functional programmers. But it obviously can be done, you know the assumptions aren’t hard to fix.

It took one of the core developers of Liquid Haskell almost four days to complete functionally.

I am unsurprised, since a lot of Haskell’s standard library functions aren’t specified yet. Was the difficulty that part, and not the actual verification?

But it obviously can be done, you know the assumptions aren’t hard to fix.

But it was hard for people to fix. Your claim that it’s easy is not one I can accept without good evidence.

For the record, as I also make clear in the post, these were supposed to be in ascending order of difficulty. It was supposed to be the easiest. I even admitted, in the post, that I overestimated how hard unique was!

I am unsurprised, since a lot of Haskell’s standard library functions aren’t specified yet. Was the difficulty that part, and not the actual verification?

No, the difficulty was the actual verification. Rhanjit said that, artnz said that, Dave - who was using Isabelle, which has full specifications - said that. It was, in their experiences, a fundamentally hard problem.

No, the difficulty was the actual verification. Rhanjit said that, artnz said that, Dave - who was using Isabelle, which has full specifications - said that. It was, in their experiences, a fundamentally hard problem.

It’s good to know that Dafny makes this easier. It doesn’t demonstrate that this can’t be done in FP as easily, just that our current tools lack the additional tooling to make it easy.

“why not a verified APL?”

Yeah, it looked relatively easy. :)

I made left-pad in Why3! (imperative version)

I’m sure I’ve said it before, but keep up the good work @hwayne, you’re a force for good.

Shouldn’t this be “string padded to that length”?

Good point, fixed.

Do you mean “prove the FP programs correct”?