Thanks for posting! The technical side of things is quite interesting, but the writing style of the author is rather pompous.
“What if there were a way to automatically generate C-code, complete with formal proofs of correctness, removing any need to trust fallible human developers?” .. and what about the fallible humans writing the specs?
Here’s a link directly to their paper (he also provides other references at the bottom): https://gopiandcode.uk/pdfs/CySuSLik-icfp21.pdf
I’ll add that there have been many specification errors in verified programs. Mutation testing was finding issues in them, too. We’ll need many eyeballs on them. Checking specs/proofs in general math is a social process, but mathematicians don’t like computer-focused math. So, in a way, we’re moving the problems from code nobody was reviewing or analyzing to an esoteric language fewer people will check.
It’s still great work due to the ubiquity of C, good tools for separation logic, removing attack surface in the spec-to-code gap, and pushing synthesis forward. Might as well keep putting good stuff together to see what happens.
As far as the general problem goes, I’m a bigger fan of doing stuff like this with notations like ASM’s that a wider audience could use and review. Or languages like ML or SPARK with C generated from them. If the work is algorithmic, a side benefit is that one can retarget static analyzers and automated provers to them to confirm some properties without really getting into formal logic.
There are so many memory unsafety bugs. Most of these are caused by selecting memory-unsafe languages, and C is the worst offender by far.
That linked list copy function is a complete straw man; it’s just a bad function. You could write some crappy list copy function with terrible variable names in Java and it would be almost as incomprehensible. vx22 and y13 look like variable names auto-generated by a decompiler. Edit: oh, it was auto-generated, no wonder. Still seems a bad way to make your point.
It also uses pointer arithmetic to lay out the data structure, rather than structs, as any sane person would. This confirms your suspicion that it was generated by a decompiler, rather than by a programmer. There’s definitely some horrible C code out there, but this isn’t realistic.
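To make the contrast concrete, here is a rough sketch of the same node written both ways; this is my own illustration (names invented), not code from the article, and the raw-offset version just mirrors the layout the generated code appears to use:

#include <stdlib.h>

/* Generator-style layout: the payload int lives at offset 0 and the next
 * pointer one void* past the start, all addressed through casts and
 * pointer arithmetic. */
void *make_node_raw(int value, void *next) {
    void *node = malloc(2 * sizeof(void *));
    if (!node) return NULL;
    *(int *)node = value;
    *((void **)node + 1) = next;
    return node;
}

/* The same node as a struct: the compiler handles layout, alignment,
 * and field access. */
struct node {
    int data;
    struct node *next;
};

struct node *make_node(int value, struct node *next) {
    struct node *n = malloc(sizeof *n);
    if (!n) return NULL;
    n->data = value;
    n->next = next;
    return n;
}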
I don’t think this article is aimed at C programmers, but at their managers and other people who want to see their fears about C confirmed.
In other words, given a specification to copy a list, our synthesis tool was able to produce a complete self-contained C program with a separate independent proof certificate, formally guaranteeing, without a shadow of a doubt, that it fully satisfies the specification, thereby removing any need to trust a fallible human developer.
“Beware of bugs in the above code; I have only proved it correct, not tried it.” - Donald Knuth
Someone please set me straight if I’m wrong, but it looks to me like that generated code has a recursive call in a non-tail position. Pass it a large linked list and its stack will overflow and it will crash, despite its certified correctness and its guarantee that there’s no need for it to ever be double-checked by a fallible human developer.
The F* project has been producing C code as the output from a verification toolchain for a while and this includes use in EverCrypt, where the formal proofs include security proofs of the protocols being implemented. The examples from this post look simpler than the demos of F* from 5-10 years ago.
There are a few problems with any of the things in this space:
C is a really bad target for anything like this. C compilers all have a subtly different interpretation of the C semantics and your verified code is only correct with respect to one concrete instantiation of these semantics. Our paper from a few years ago shows the space of C semantics. Even CompCert, which is a formally verified C compiler, only guarantees correct lowering if your input C is free from undefined behaviour (there’s a small sketch of this kind of divergence after this list).
Getting the properties that you prove correct is really hard. For example, an earlier version of some of the EverCrypt code was proven to be free of temporal safety errors. The theorem was that no object may be accessed after it is freed. The proofs succeeded trivially: nothing was ever freed. Changing this to specify lifetimes and guarantee that everything was deallocated in a timely fashion was a lot of work.
To the best of my knowledge, none of the proof tools work on non-trivial examples of concurrent code. CppMem is probably the best of these and it can’t handle more than a few dozen lines of code. When you can formally verify something like this multi-producer, single-consumer queue I’ll be really excited. And if your name is Mark Batty then I might even believe you can do it.
Anything that the surrounding code does is an axiom. Shared-everything languages are incredibly bad for partial proofs: a memory-safety bug or a concurrency bug anywhere in the rest of your program can invalidate one of the axioms that your proof code depends on.
None of the tools are good enough for large programs. This means that you invariably have to mix verified and non-verified code in the same program. Most of the synthesis tools that I’ve seen generate interfaces that are really hard to use correctly. Part of this comes from targeting C. There are a load of properties that are trivial to enforce with C++ smart pointers or type-safe enums, that the C programmer has to just remember to get right. This feeds back into the compositional correctness problems: If your interfaces are hard to use, you’ll get bugs at the boundary of the proof code. In real-world uses I’ve seen, you get more of these than you fixed in the verified part.
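To make that first point about C semantics concrete, here’s a textbook example (my sketch, not from the paper) where standard-conforming compilers are entitled to disagree, so any proof about the code is only a proof with respect to one reading of it:

#include <limits.h>
#include <stdio.h>

/* Signed overflow is undefined behaviour, so an optimiser may assume it
 * cannot happen and fold the comparison below to 1; a compiler that simply
 * wraps two's-complement arithmetic would return 0 for INT_MAX. */
int increment_is_bigger(int x) {
    return x + 1 > x;
}

int main(void) {
    /* Commonly prints 1 at higher optimisation levels, but nothing
     * obliges a conforming compiler to do that. */
    printf("%d\n", increment_is_bigger(INT_MAX));
    return 0;
}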
TL;DR: Verification and program synthesis are super interesting but they’re nowhere near ready for prime time.
The NATS iFACTS traffic control system (released in 2011) has over half a million lines of verified “crash-proof” SPARK code.
Is there any good info on how well it has stood up since then, how much work it was to produce and verify, etc?
I haven’t found any cost amounts from poking around online. The Eurofighter Typhoon also uses SPARK in some of its systems. Yes, verification is hard, but there are real systems out there which do it. I think people assume it’s some unreachable goal, but with piecemeal approaches like Ada/SPARK, I think it’s a lot closer than people realize and the future is very bright.
I don’t think we’ll see a big bang of “Our thing is verified!”; it’s going to be a slow rollout of “this tiny lib we’re using is verified”, “we have some components which are verified”, and “we use verification tools” moving forward. Something along the lines of “functional core, imperative shell”, but “verified core, un-verified wrappers.”
I’ll add the Infer tool that Facebook acquired to that. It also uses separation logic for temporal analysis of C/C++ programs. It handles massive codebases. They apply it to Java, too.
The difference is that these tools narrow the problem down, aim for high automation, make tradeoffs that facilitate real-world use, and have people doing non-glamorous work supporting that (e.g. standard library, build integration). Academics often don’t do all this given they’re just proving a concept, doing more ambitious things, or sometimes just doing what they enjoy.
He wrote this before we had theorem provers. He proved it by hand, which is a lot more error prone.
Doesn’t mean I agree with you, just that I don’t think that quote is really as damning as people say.
That’s an interesting comment. I never took that quote to mean that he wasn’t confident in the correctness of his handcrafted proof. I always interpreted it as a warning that proofs of code can have non-obvious limitations: they can be completely correct and internally consistent, but if the person doing the proving didn’t think to include some aspect of the desired behavior in the specification, the code can still do unexpected things.
The code sample in the article is a good example. “Uses a constant amount of stack space regardless of input size” and “terminates in a well-defined way if memory allocation fails” weren’t part of the specification, so the code can bomb out even though it can be proven to perform the correct sequence of steps to copy a linked list as described by its spec.
Of course, I know a fraction of a fraction of a fraction as much about formal proofs of code as you do, so take this as the mad ramblings of a neophyte; it’s just what I took away from that quote when I first read it.
Why use C as a target at all? The code output is unreadable and useless. It’s also subject to interpretation by the various compilers.
LLVM IR seems a better idea.
Worse, it invites edits that could invalidate the proofs.
This seems like a misconception. If your C output follows the C standard, it is not open for interpretation; if a compiler doesn’t follow the standard, that is a bug.
The example code has at least two bugs:
The recursive call could blow the stack on a large list.
The call to malloc is not checked for NULL.
Furthermore, as has been pointed out, it’s completely incomprehensible and not something a human would ever write. Here’s a less straw-man version that fixes one of the bugs:
#include <stddef.h>
#include <stdlib.h>

typedef struct List List;

struct List {
    int data;
    List *next;
};

// Wraps malloc(), aborts if allocation fails.
void *errmalloc(size_t sz) {
    void *ret = malloc(sz);
    if (!ret) {
        abort();
    }
    return ret;
}

List *listcopy(List *src) {
    if (!src) {
        return NULL;
    }
    List *dest = (List *)errmalloc(sizeof(List));
    dest->data = src->data;
    dest->next = listcopy(src->next);
    return dest;
}
I see the value in formal verification, but (1) it doesn’t save you from mistakes in your model (in this case, ignoring stack overflow and getting the contract for malloc() wrong), and (2) given the quality of the output it’s not clear why you’d bother with C as a target in the first place, instead of say LLVM.
I’m not affiliated with this project, but I’m studying synthesis and can offer opinions on both (1) and (2):
(1) Program synthesis and verification can’t save you from correcting mistakes in your model, because humans still develop the model in order to write the synthesizer. The thing is, programming without a synthesizer can’t save you from mistakes either. The advantage of synthesis is that you can provably fix the stack overflow and malloc issues in the synthesizer (i.e. for all future synthesized programs), as opposed to having to prove and fix each buggy program in order to declare it verified.
Obviously, this is all brand new, and so there are a lot of bugs in the synthesizer, but this post seems to be looking towards a future vision where the big bugs have been (provably) squashed.
(2) The first section answers it – there’s a lot of interest in verifying C, perhaps due to the existence of famous bugs in C, and (my guess:) more interest means nonzero grant money. This doesn’t mean synthesizing LLVM isn’t an attractive problem to solve, but if you had to choose one…
(edit) Another reason might be the irreplaceability of C as opposed to LLVM – like sure Rust exists, and so every C memory bug can be ‘solved’ by being in Rust instead, but you can’t just turn all C infrastructure into Rust, so having a way to synthesize provably verified C is the next best thing.
I have my doubts, FWIW. Programming at the level of specifications is still programming, and is quite hard to get right. For example, try coming up with a specification for a hash table which does not end up being linear search…
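To illustrate (my own toy sketch, not anything from the article): the natural specification for a lookup structure is usually “behaves like this obviously-correct reference implementation”, and that reference is a linear search; nothing in it forces hashing, buckets, or O(1) lookups:

/* A map "specification" written as the simplest correct implementation:
 * an association list searched linearly. A hash table verified against
 * this is only shown to agree with a linear search; the spec says nothing
 * about performance. */
struct entry {
    int key;
    int value;
    struct entry *next;
};

const struct entry *spec_lookup(const struct entry *m, int key) {
    for (; m; m = m->next) {
        if (m->key == key) {
            return m;
        }
    }
    return NULL;  /* key absent */
}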
There’s also a different project with similar goals called Cogent by Data61.
The code snippet was entirely synthesized? Ouch…
It has undefined behavior, by the way. Dereferencing a void** is UB because pointers can be different sizes on the same architecture, depending on which type of pointer is underneath. Doesn’t bode well for the rest of the article.
That doesn’t matter. void* has a single size (though this is implementation defined), and so dereferencing a void** that is the address of a void* is fine. Taking the address of a T* and casting it to void** then dereferencing it is unspecified and may be invalid if the representation of T* and void* are different. In particular, void* and char* must be able to store any kind of data pointer (to handle word-addressable machines where a char* may include a mask to indicate which offset within the word it refers to) and so these types may be wider than pointers to other types.
Their code is taking the address of a void* and then dereferencing the resulting void**, which is well defined.
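To spell the distinction out with a toy example (mine, not the article’s code): the first function below is the well-defined pattern the generated code is said to be using; the second is the pattern that would actually be a problem:

#include <stdlib.h>

void well_defined(void) {
    void *p = malloc(16);
    void **pp = &p;   /* pp really points at a void*, so *pp is fine */
    free(*pp);
}

void not_portable(int **ip) {
    void **pp = (void **)ip;  /* ip points at an int*, not a void* */
    *pp = NULL;               /* int* and void* may differ in size and
                                 representation, and this write also runs
                                 afoul of the aliasing rules */
}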
The function is recursive, and r is created for those recursive calls by taking the address of an int*. So it’s still undefined.
Their code is very difficult to follow, but the only things I see that look like int* appear to be the payload. The next pointers are void*. They do have an assumption that sizeof(void*) >= sizeof(int*) in the offset calculation, but that’s correct according to the standard.
If I’m reading it correctly, the assumption is sizeof(void*) >= sizeof(int) (not sizeof(int*)). The list nodes store an int, followed by a void* which points to the next node in the list. In:
*((int *)y2) = vx22;
*((void **)y2+1) = y12
The first assignment (from vx22) is an int (the stored value in the node, copied from the original node), the second is void * (the copy of the list tail).
I can’t recall off the top of my head whether the standard requires that sizeof(void*) >= sizeof(int) but it does typically hold.
(edit: Incidentally, C doesn’t really allow for storing separate objects into a malloc’d area in this way. The language used in the standard re malloc is: “The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object with a fundamental alignment requirement and then used to access such an object or an array of such objects in the space allocated (until the space is explicitly deallocated)”. Here what is being stored is two objects of distinct type, which is not an array nor a single non-array object. I consider this an oversight in the standard though, not a deliberate restriction.)
It’s convoluted and hard to follow, but the code doesn’t at any point take the address of an int *. I think you’re incorrect.
It stores an int* into a void**. No, it doesn’t take the address, but an int* is not guaranteed to be the same size as a void**.
I don’t think you’ve read it correctly. At no point does it store an int *.
The closest it comes to what you’re saying is this line:
*((int *)y2) = vx22;
That’s storing an int, not an int *, and it’s storing it by casting a void * which is the result of a call to malloc into an int *, then storing via that pointer. There’s nothing wrong with that.
It wasn’t. That’s an example he provided of “by hand” code.
It was.
In fact, the code written above was actually entirely automatically synthesised by our tool, along with a proof certificate, formally guaranteeing its correctness.
From the “Certified Program Synthesis to the rescue!” section, right under the void listcopy() picture.
Also, the fact that a certificate exists for that code makes me doubt their tool.
Also, big fan. I follow your blog as of recently.
Mea culpa!
No problem! I understand that it is a little buried, and like you, I want things like this to work.
Also like you, I would love to see formal methods used more in the industry. I am currently learning TLA+, helped by your intro to it, and I am also learning Alloy. I would like to write a version control system that “doesn’t randomly delete chapters of a user’s book.” :)