It would also pass property testing: if you randomly picked two 16-bit integers, you’d have to run something like 500 million tests to have a 10% chance of triggering the bug.
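Fuzzing pops this though:
https://github.com/matklad/repros/tree/master/fuzz-add
Since you used bolero for your test harness interface, you actually get both a fuzzer and a proof. We enabled Kani to run the proofs. (See https://model-checking.github.io/kani-verifier-blog/2022/10/27/using-kani-with-the-bolero-property-testing-framework.html)
Looks like it’s now the bolero quickstart 😄
Well, what is fuzzing if not unbounded property testing?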
The trick is not to run this for an unbounded number of iterations; the trick is to instrument the binary under test, to observe how it reacts to input, to be able to find interesting inputs much more efficiently than by random trial.
the trick is to instrument the binary under test, to observe how it reacts to input, to be able to find interesting inputs much more efficiently than by random trial
Interesting. I wonder, how long would fuzzing take to find the bug in the following implementation?
pub fn buggy_add(x: u64, y: u64) -> u64 {
    if x == 37215885475012864 {
        return x.wrapping_sub(y);
    }
    return x.wrapping_add(y);
}
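1.04 seconds
Wow, that’s awesome! How does fuzzing figure out the 64-bit value that causes the bug?
Does it actually look at the binary constants in the executable binary?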
With gray-box fuzzing (e.g. libFuzzer, AFL, honggfuzz, etc.) the program is instrumented with sanitizers (see https://clang.llvm.org/docs/SanitizerCoverage.html, https://clang.llvm.org/docs/AddressSanitizer.html). These feed information back into the input generator about how effective each input is at covering new code. These fuzzers usually use evolutionary techniques to find a corpus that is representative of all branches reachable through the harness inputs.
So, yes, the fuzzer will know there’s something funny about that magic number and will try to figure out the coverage relationship between the inputs and it.
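This blog post shows AFL pulling jpegs out of thin air with these techniques: https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-thin-air.html. Naive, uninstrumented random input generation would not be able to do something like this.
Thanks! I suspected something like that.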
I was able to reproduce this result, in Rust with bolero and in C with AFL and honggfuzz (both in persistent mode, the latter achieving more than 500,000 iterations/s).
Unfortunately, if the x input is given as a decimal string rather than 8 raw bytes and then passed through scanf() for conversion, then neither AFL nor honggfuzz seems to be able to find the buggy input.
In the case of honggfuzz I tried more than 1 billion iterations before I gave up.
I would expect the fuzzers to need more detailed feedback to figure out whether an input is getting closer to making the conditional true. I wonder if an increase in the number of matching bits in a comparison (such as A == B) could be used as a feedback mechanism.
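As a rough illustration of that idea (a toy sketch of mine, not how AFL, honggfuzz or libFuzzer actually implement their feedback, though libFuzzer’s value profile and laf-intel-style comparison splitting work in a similar spirit): if the mutator is given the number of matching bits as a score and keeps only mutations that improve it, the 64-bit magic constant falls in a few thousand iterations rather than by blind guessing over a 64-bit space.

const MAGIC: u64 = 37215885475012864;

// Feedback signal: how many bits of `x` already agree with the constant in the comparison.
fn matching_bits(x: u64) -> u32 {
    (x ^ MAGIC).count_zeros()
}

// Tiny xorshift generator so the sketch has no dependencies.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    let mut state = 0x9E37_79B9_7F4A_7C15u64;
    let mut best = xorshift(&mut state);
    let mut best_score = matching_bits(best);
    let mut iterations = 0u64;
    while best != MAGIC {
        iterations += 1;
        // Mutate: flip one random bit; keep the candidate only if the feedback improves.
        let candidate = best ^ (1u64 << (xorshift(&mut state) % 64));
        let score = matching_bits(candidate);
        if score > best_score {
            best = candidate;
            best_score = score;
        }
    }
    println!("hit the magic value after {iterations} mutations");
}

Note that bolero uses libFuzzer by default but can be changed to use AFL, honggfuzz, and Kani as engines https://camshaft.github.io/bolero/features/unified-interface.html
Interestingly, this blog post says: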
[This] also means that certain types of atomically executed checks with a large search space may pose an insurmountable obstacle to the fuzzer; a good example of this may be:
if (strcmp(header.magic_password, "h4ck3d by p1gZ")) goto terminate_now;
I tested this and honggfuzz is able to find the matching string in 1 second, without any starting seed inputs.
I’m guessing it’s able to succeed either because it handles strcmp() specially or simply because strcmp() is not actually atomic like the post says, but composed of successive matches.
Although I would guess finding valid checksum values would pose a much bigger challenge.
And of course, reversing cryptographic hashes would definitely pose an insurmountable challenge.
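The “successive matches” point is easy to see if you write the comparison out by hand (a small sketch of mine, not from the post): each extra matched prefix byte changes how far the loop runs before the early exit, which hit-count-style coverage (as in AFL) can distinguish, whereas a single 64-bit equality is one all-or-nothing branch.

fn check_password(input: &[u8]) -> bool {
    const PASSWORD: &[u8] = b"h4ck3d by p1gZ";
    if input.len() != PASSWORD.len() {
        return false;
    }
    for i in 0..PASSWORD.len() {
        // One branch per position: guessing one more correct byte changes how
        // often this branch is taken, which coverage feedback can observe.
        if input[i] != PASSWORD[i] {
            return false;
        }
    }
    true
}

fn main() {
    assert!(check_password(b"h4ck3d by p1gZ"));
    assert!(!check_password(b"h4ck3d by p1gz"));
}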
If the function is not instrumented then it will be very difficult to distinguish any coverage effects that any given input has on the code. I think the more modern engines try to replace the libc functions with instrumented versions for this reason. And yes, checksums and crypto are a massive blocker. But the difficulty also applies to formal methods as well. Like if you ask Z3 to give you an input that produces a specific SHA256 hash, it’s going to spin for just as long as a fuzzer would. And that’s a good thing! Otherwise your hash function would be broken :)
If the function is not instrumented then it will be very difficult to distinguish any coverage effects any given input has on the code.
honggfuzz also supports hardware counters as feedback (number of instructions executed, number of branches executed, etc). This makes non-instrumented code easier to fuzz. But in this case, I wasn’t using hardware counters so it probably has some kind of libc instrumentation like you said.
And yes, checksums and crypto are a massive blocker. But the difficulty also applies to formal methods as well. Like if you ask Z3 to give you an input that equals a hash, it’s going to spin for just as long as a fuzzer would.
I’m not sure if I agree with that. It would be true if we were trying to find a concrete input that leads to a bug, but even then this can be much easier to do with formal methods.
For example, there are many cases where formal methods allow checking something trivially while fuzzers would normally have a really hard time.
Consider a pseudo-code fragment such as this, for example:
z = 0;
if (SHA512sum(body) != header.checksum) {
    goto error;
}
// The following line is buggy, because `z` is zero
x = y / z;
For fuzzers to reach the buggy code they would have to find an input whose header contains a valid SHA-512 hash of the body. Only then could the bug be found. This would be very hard for them to achieve, if not almost impossible.
Most formal methods tools don’t actually require you to find a valid input that successfully passes through the “if” statement. By default they just assume that there will be one (although you can prove them wrong if you want to).
They just ask you to prove that “z” is not zero, assuming the “if” statement is successfully passed through. It’s trivial to prove that “z” would be zero in the “x = y / z” statement and that the code is therefore buggy.
Z3 would do this instantly (in Why3 or in F*, Dafny, SPARK, Frama-C, Microsoft VCC, Viper/Carbon, etc) and it would be similarly trivial to prove in ITPs such as HOL4, Coq, Isabelle/HOL, etc, and also in dependently-typed or refinement-typed languages (Liquid Haskell, Idris, ATS, …).
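To make the “assume the check passed, then prove the property” idea concrete in the thread’s own tooling, here is a rough Kani-flavoured sketch (my construction, not from the thread): rather than constructing a body with a valid SHA-512, the harness leaves the outcome of the comparison unconstrained, and the division by zero behind it should be reported immediately.

// Rust rendering of the pseudo-code above, with the hash comparison replaced
// by an unconstrained boolean: we never invert SHA-512, we just let the
// verifier assume that an input passing the check may exist.
fn process(y: u64, z: u64, checksum_ok: bool) -> Option<u64> {
    if !checksum_ok {
        return None; // corresponds to `goto error`
    }
    Some(y / z) // buggy when z == 0, exactly as in the fragment above
}

#[cfg(kani)]
#[kani::proof]
fn division_behind_checksum() {
    let y: u64 = kani::any();
    let checksum_ok: bool = kani::any(); // models "some body with a valid hash"
    // z is 0 on this path, as in the original fragment.
    let _ = process(y, 0, checksum_ok); // Kani flags the potential division by zero
}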
Furthermore, even if you had to provide such a valid input (with a correct checksum) to prove something, it would be very easy for a human to provide one to an interactive theorem prover (as opposed to automated solvers such as SMT solvers, at least not without additional help), as one would only have to model the SHA512sum() function and use it to calculate the correct hash for any given input.
That is, while doing a proof it would be very easy to compute the SHA512 checksum of a given body (even if you’re just doing it symbolically, not necessarily with concrete values) and provide that input to some theorem or check, while fuzzers would have to work backwards and try to sort of “reverse engineer” the hash function to actually discover a valid concrete checksum for some concrete body. This would be almost impossible, of course.
To be clear, I’m not saying formal methods are a panacea, as like you said, finding a concrete body such that SHA512sum(body) = <some concrete checksum value> would also be practically impossible, regardless of which formal methods tool one is using.
Anyway, I also found this paper, which tries to make fuzzers bypass such checks for valid checksums by simply skipping the check (which could be considered cheating, but still):
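https://ieeexplore.ieee.org/document/5504701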
It’s not exactly a reliable way to verify that a program is correct, as the approach is not sound (i.e. it can generate false bug diagnostics), but it’s an interesting heuristic which would surely allow more targets to be fuzzed.
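Fuzzing definitely has pros and cons vs other data generation strategies, but in general it’s still not a replacement for proof.
I would argue you really need both concrete and symbolic tests for high assurance software. So either one is not a replacement for the other.
Ultimately, with proofs you’re working with an abstract, mathematical machine that may not completely model the environment you end up running in.
We tried to touch on this in the Kani/Bolero integration blog post: https://model-checking.github.io/kani-verifier-blog/2022/10/27/using-kani-with-the-bolero-property-testing-framework.html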
One question that might arise is: is there any point in using fuzzing alongside Kani? In other words, if we can prove a harness for all inputs with Kani, would we still want to use fuzzing? The answer is yes, because fuzzing and Kani provide complementary benefits.
Kani verifies a Rust program by symbolically analyzing its code. This allows Kani to make mathematical statements about the expected semantics of the Rust code being verified. Kani allows you to prove that, for all possible inputs, the code under verification follows its specification, assuming everything else functions correctly (e.g. the underlying hardware, the OS, etc.).
Fuzzing concretely executes the program under test. This gives you end-to-end confidence that for the set of inputs generated by the fuzzer, the code under verification follows its specification, under real-world conditions. Fuzzing and Kani fit together to give more assurance than either provides on its own.
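For readers who haven’t seen that integration, such a harness is, roughly, an ordinary bolero property test over the buggy_add example from earlier in the thread; a sketch along the lines of the linked post (exact bolero method names may vary between versions):

#[test]
fn buggy_add_matches_wrapping_add() {
    bolero::check!()
        .with_type()
        .cloned()
        .for_each(|(x, y): (u64, u64)| {
            // The same harness runs under libFuzzer/AFL/honggfuzz via `cargo bolero test`,
            // or is checked for all u64 pairs when the Kani engine is selected.
            assert_eq!(buggy_add(x, y), x.wrapping_add(y));
        });
}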
Maybe this’ll be a good excuse to finally learn some more ATS! I’ve tried and bounced off of it multiple times in the past, but I’ve learnt that these kinds of things take time and lots of failures to make their way into my head :)
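TIL: Why3. Looks dope.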
Hopefully this shows that proving things isn’t as hard as people think. Ok, it’s leftpad, but I would hope that proving something as simple as leftpad would be easy, and in many of the provers it is. In Isabelle, the theorem prover that I’m most familiar with, the two specification properties are proved in one line each.
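I’ve got a little data structure that I keep meaning to formally specify and prove.
Do it! If it’s in Isabelle, I can try and help.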
I’ve tried and failed a couple of times writing the leftPad in mcrl2. It’s just awkward to perform this sort of proof when the entire concept is communicating processes. All that said, reading this yesterday motivated me to write a small spec of a single process that I think can fit all the criteria. It made me realize some limitations of mcrl2 I don’t often hit.
Modeling a function as a map with rewrite rules works well, but I’m stumped at how to prove it.
The usual workflow is process definition -> linear process spec -> labeled transition system (LTS) -> boolean equation system. Generally speaking, LTSs fail with infinite values, so it’s difficult to represent this sort of problem. I have yet to figure out how to write the boolean equation systems, but I suppose with the right effort I could.
This is a problem where mcrl2 is certainly the wrong tool, and I’m not convinced it’s even capable of it without accepting that certain concepts are equal without providing proof. That doesn’t fit the criteria outlined.
I’ve tried to understand the Lean 4 version and I couldn’t. The Dafny version was a bit more bearable. All in all, I think those systems require a lot of predefined “blocks” to be ergonomic enough, and that’s a problem.
I find that Isabelle is simpler than a lot of the dependently-typed theorem provers, of which Lean is one. The Isabelle solution provided here actually has 2 different solutions, one of which is meant to be instructive. But the basic solution is tiny:
definition isPadded where
  "isPadded padChar unpadded padded ≡ ∃ n. set (take n padded) ⊆ { padChar } ∧ drop n padded = unpadded"

definition leftPad where
  "leftPad padChar targetLength s ≡ replicate (targetLength - length s) padChar @ s"

lemma isPadded_leftPad: "isPadded padChar s (leftPad padChar targetLength s)"
  unfolding isPadded_def leftPad_def by (intro exI [where x = "targetLength - length s"], auto)

lemma length_leftPad: "length (leftPad padChar targetLength s) = max targetLength (length s)"
  unfolding leftPad_def by auto
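For contrast with the proofs, the same two specification properties can also be written as executable checks in Rust (a sketch of mine, over bytes rather than chars); the difference is that a test or fuzzer only samples the input space, whereas the lemmas above cover every input.

fn left_pad(pad: u8, target_len: usize, s: &[u8]) -> Vec<u8> {
    let pad_len = target_len.saturating_sub(s.len());
    let mut out = vec![pad; pad_len];
    out.extend_from_slice(s);
    out
}

// Executable version of isPadded: some prefix is all pad bytes and the rest is the original.
fn is_padded(pad: u8, unpadded: &[u8], padded: &[u8]) -> bool {
    (0..=padded.len()).any(|n| padded[..n].iter().all(|&c| c == pad) && &padded[n..] == unpadded)
}

#[test]
fn left_pad_spec_small_inputs() {
    // Exhaustively check short inputs; a fuzzer or bolero harness would widen this.
    for target_len in 0..8usize {
        for s_len in 0..6usize {
            let s: Vec<u8> = (0..s_len as u8).map(|i| b'a' + i).collect();
            let padded = left_pad(b'!', target_len, &s);
            assert!(is_padded(b'!', &s, &padded));
            assert_eq!(padded.len(), target_len.max(s.len()));
        }
    }
}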
ACL2 is not a “tool to verify code in an existing language”. It is its own language, which happens to look a lot like Common Lisp and is implemented in Common Lisp.
From Wikipedia:
ACL2 (“A Computational Logic for Applicative Common Lisp”) is a software system consisting of a programming language, an extensible theory in a first-order logic, and an automated theorem prover. ACL2 is designed to support automated reasoning in inductive logical theories, mostly for the purpose of software and hardware verification.