Let’s Prove Leftpad is super cool! It’s pretty easy to check it in CrossHair like this!
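A minimal sketch of what that can look like (my own illustration with assumed post-conditions, not the actual snippet or pull request):

```python
def leftpad(s: str, n: int, fill: str = ' ') -> str:
    '''
    pre: len(fill) == 1
    post: len(__return__) == max(n, len(s))
    post: __return__.endswith(s)
    post: __return__[:max(n - len(s), 0)] == fill * max(n - len(s), 0)
    '''
    # Pad on the left until the result is at least n characters long.
    return fill * (n - len(s)) + s
```

The three post-conditions are the classic leftpad properties (length, padding prefix, original string as suffix); running something like crosshair check over the file asks CrossHair to search for inputs that violate them.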
Note that CrossHair isn’t a real verification tool, and falls short of the guarantees that such tools provide. Do you think it’s worth a pull request anyway? With the appropriate caveats made clear?
Contracts is interesting. It’s also interesting to note that CrossHair has similar challenges with consumable iterators. Early on, I’d looked at the vaguely similar pyContracts library, but ended up latching on to the only thing I could find that had been proposed as a standard (PEP316). Also, because CrossHair works by proxying specific python types, I was a little biased towards a setup that presumes typing annotations. But I’d love to work with folks to come up with a standard that could reasonably support all the tooling options: symbolic execution, concrete (fuzzed) execution, and runtime enforcement.
Neat!!!
It’s not foolproof, of course. I was able to defeat it with large numbers.
I guess that’s expected though. An SMT solver can only do so much. And that’s probably not a huge issue in practice.
Thank you for playing around with it!! I think your example falls into the “nonlinear integer arithmetic” space where Z3 cannot always find an answer.
That said, I think perhaps there isn’t a counterexample in your example: (n+333333) will be either positive or negative, and if it’s negative, the product still comes out positive. If we try something like (n+333333)*(n+333) + 1 instead, we can sometimes get the result to be negative!
Ha! I am going to have to correct myself! Your condition does have a counterexample: make_bigger(-333333) gives a result of exactly 1, which I think is the one and only input that fails your post-condition!
Perhaps, all the more reason that I need a computer to help me think about problems like this. 😀
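For context, a hypothetical reconstruction of the function under discussion (the body and the post-condition here are my assumptions, not the commenter’s actual code):

```python
def make_bigger(n: int) -> int:
    '''
    post: __return__ > 1
    '''
    # (n + 333333) squared is at least 1 for every integer except n == -333333,
    # where it is 0, so the result equals exactly 1 only at that single input.
    return (n + 333333) * (n + 333333) + 1
```

With this assumed post-condition, n = -333333 is the one and only failing input, and the squared term makes the constraint nonlinear (quadratic), which is exactly the territory where Z3 can struggle.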
Good job :) That was the point of the test: to have only one failing input, at a very big number, so it’s hard to find if you don’t know how to reason about quadratic equations, which I doubted the solver would be able to do.
Thank you for the excellent tool. I am also interested in the same space using Python. We found that Z3 is still fairly bad when it comes to string constraints. In particular, even simple things like lower(), upper(), etc. need to be defined by the implementors, and Z3 tends to give up pretty fast with relatively small expressions. What was your experience with string functions?
CrossHair uses the standard string solver in Z3 and doesn’t employ any special Z3 tactics, so I imagine it has the same weaknesses. (to be honest, I’ve only tried very simple test cases so far) I’d love it if you could send me some more details on your challenging use cases to see what happens. Additionally, I think that over the last year or so, the Z3 string solver has been in a state of flux; I suspect you’ll get a variety of behaviors depending on what build you’re using.
I’d love to collaborate in any way that makes sense: at a minimum, sharing challenges and solutions would be great. If you’re looking to develop something into a practical tool along these lines, I’d love to collaborate more directly too. The illusion that a tool like this needs to create is really difficult to get right, and too big of a task for one person.
Thank you for the response. I was trying out simple string functions in CrossHair (lower, upper, capitalize), and it seems string functions are not supported yet, which from your reply I suppose is not unexpected – they are not directly supported by SMT-LIB (or Z3) and need to be implemented separately. This is what we did.
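To make that concrete, here is a rough sketch (my own, not the commenter’s implementation) of what “defining it yourself” can look like: an ASCII-only lower() over Z3 strings, built as a recursive function that rewrites one character at a time. Definitions like this are exactly the kind of thing the string solver tends to choke on, and real Unicode case mapping is far messier.

```python
import z3

# Hypothetical sketch: a recursive, ASCII-only lower() over Z3 strings.
lower = z3.RecFunction('lower', z3.StringSort(), z3.StringSort())
s = z3.String('s')
head = z3.SubString(s, 0, 1)
tail = z3.SubString(s, 1, z3.Length(s) - 1)

# Map 'A'..'Z' to 'a'..'z'; leave every other character unchanged.
lowered_head = head
for upper_c, lower_c in zip('ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                            'abcdefghijklmnopqrstuvwxyz'):
    lowered_head = z3.If(head == z3.StringVal(upper_c),
                         z3.StringVal(lower_c), lowered_head)

z3.RecAddDefinition(lower, [s],
                    z3.If(z3.Length(s) == 0,
                          z3.StringVal(''),
                          z3.Concat(lowered_head, lower(tail))))
```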
We are working to make the fuzzing book into a usable suite of tools in Python. The idea is to explain in detail what goes into each tool, and the trade-offs thereof in a comprehensive Jupyter notebook – essentially the idea is to get the reader started on a given advanced topic over the course of a weekend (where possible). This means keeping the basic implementation simple, understandable and well explained (which means advanced optimizations that may be needed to make these algorithms scalable and practical are given lower priority if they make the basic idea difficult to understand – this may be counter to your goal). The users can then build their tools on top of this suite of tools by replacing what they need with advanced versions.
Cross-fertilization of ideas and code would indeed be great, and I will certainly keep track of where CrossHair goes.
Yeah - your prompting about string methods here made me notice a pretty bad regression where those methods were failing. I’ve fixed it in HEAD and will cut a new release soon. That said, even when “fixed,” CrossHair now just falls back to concrete execution - a real fuzz tester would be much better in these cases.
Your solution for str.upper/str.lower looks great. I might try to come up with a unicode-aware variant and share it back with you. (though perhaps that’s quite a bit more complex now that I think about it!)
Agreed that I think some of our goals are different, as I would like to get to a more practical, usable product. Excited about chatting in the future though!
Optimize with Confidence. Using post-conditions, CrossHair ensures that optimized code continues to behave like equivalent naive code:
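A minimal sketch of the idea (my own illustrative example, not the one from the project docs): the post-condition ties an optimized function to a naive reference implementation, and CrossHair searches for inputs where the two disagree.

```python
def is_power_of_two_naive(n: int) -> bool:
    # Naive reference: keep doubling until we reach or pass n.
    i = 1
    while i < n:
        i *= 2
    return i == n

def is_power_of_two_fast(n: int) -> bool:
    '''
    pre: n >= 1
    post: __return__ == is_power_of_two_naive(n)
    '''
    # Bit-trick "optimization"; the post-condition keeps it honest
    # against the naive version above.
    return n & (n - 1) == 0
```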
This is very interesting, but it may be out of reach for an automated theorem prover, right? What are the limits on when this can be used? I see lists are supported, so I guess the builtins are covered?
There are a lot of limits. Some of those limits will get better with time and effort; others will never get better. I’ll try to quickly go through them:
Proving arbitrary properties of programs is known to be an undecidable problem. (no matter what, there will be some programs with bugs that CrossHair can’t detect)
CrossHair uses the theorem prover only in the context of a single path of execution. Most interesting functions have an infinite number of paths: CrossHair can only verify a finite number of them (see the small example below). The Fuzzing Book that @vrtha mentions above actually has a wonderful description of how this works.
Yes, most of the builtins (lists, dictionaries, sets, etc) should work, but I have to say that the way we implement this is pretty delicate, and I need people to use it in order to find the cases that don’t work properly. (please help me! 😁)
As for the standard library and 3rd party modules, pure python implementations should work (since they can use the symbolic builtins), but modules implemented in C will generally not be analyzable. To fix this, we can sometimes re-implement such modules in pure Python and tell CrossHair to use those implementations instead.
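As a small example of the path issue (illustrative only, not from the CrossHair docs): every element of the input adds a branch, so a list of length k produces on the order of 2**k execution paths, and CrossHair can only sample finitely many of them per condition.

```python
from typing import List

def count_positives(nums: List[int]) -> int:
    '''
    post: 0 <= __return__ <= len(nums)
    '''
    total = 0
    for x in nums:
        # Each element introduces an independent branch, so the number of
        # distinct execution paths grows exponentially with len(nums).
        if x > 0:
            total += 1
    return total
```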
Primary author here. I am too fresh a lobster to post myself, but randomly stumbled on this post today: thanks! Ask me anything!
Love it! All the work on types has brought me back to Python after a long, tooling-induced hiatus, and I’m having so much fun!