Why should I care about Hypothesis from a research point of view?
Was going to write a comment asking @DRMacIver if he intended to combine coverage-driven analysis with the property-based testing of Hypothesis and then scrolled down to see the “glass box testing” section.
I think that unguided fuzzing will inherently require more time than property-based testing. Have you considered having a semi-guided mode? If you can find a roughly robust way to partition inputs by the program flow that they cause AND infer additional input constraints from those partitions (neither of these is easy, I understand), a semi-guided tool sounds great: “hmm, spend the next 5 minutes only on inputs that produce traces at least this long”.
Is there prior art in partitioning execution traces by similarity or inferring properties from input space partitions?
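To make the “partition inputs by the program flow they cause” idea concrete, here is a minimal sketch in plain Python. It uses trace length (number of executed line events) as a crude proxy for program flow, buckets inputs by it, and picks the deepest bucket to focus on. All names (`trace_length`, `partition_by_trace`, the `collatz_steps` target, the bucket width) are invented for illustration; this is not part of Hypothesis or any fuzzer.

```python
# Sketch of semi-guided input selection: bucket inputs by the length of the
# execution trace they produce, then spend further fuzzing budget only on
# the bucket with the longest traces. Hypothetical helper names throughout.
import sys
from collections import defaultdict

def trace_length(fn, arg):
    """Count 'line' trace events executed while fn(arg) runs."""
    count = 0
    def tracer(frame, event, _arg):
        nonlocal count
        if event == "line":
            count += 1
        return tracer
    old = sys.gettrace()          # preserve any existing tracer
    sys.settrace(tracer)
    try:
        fn(arg)
    finally:
        sys.settrace(old)
    return count

def partition_by_trace(fn, inputs, bucket_width=5):
    """Group inputs into coarse buckets of similar trace length."""
    buckets = defaultdict(list)
    for x in inputs:
        buckets[trace_length(fn, x) // bucket_width].append(x)
    return buckets

def collatz_steps(n):
    """Toy target: trace length varies a lot with the input."""
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

buckets = partition_by_trace(collatz_steps, range(1, 50))
deepest = buckets[max(buckets)]   # inputs producing the longest traces
```

The hard parts the comment mentions (making the partition robust, and inferring input constraints from a bucket rather than just replaying its members) are exactly what this sketch does not attempt.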
Could you elaborate on what exactly you mean by guided?
“Guidance” as in human eyes looking at some representation of intermediate fuzzing results and human input specifying to the fuzzer things like, “spend your time only on inputs (that produce traces) like these” or “omit inputs like these; the crash is (upon my insightful human inspection) trivial”.
Depends on your goals. If you want high impact, it looks like it could have it. Reading it made me think it could benefit from IDE assistance, where certain analyses run on the code every time it’s introduced or changed. Then those guided test cases that took hours might take seconds to minutes… at least a subset of them. Might also be a fun project to try to formally verify in the K framework or Coq. We need more verified tooling.
I’d also like to see more testing and analysis tools applied to CakeML to show it maintains its invariants on every path of execution. That would increase the odds it could be used in DO-178B-certified tools without the source-to-object code inspections required for SCADE, which is done in OCaml. Just a pile of equivalence tests on each transform instead, which can be batched. It would also be good for reference implementations used to detect compiler errors in mainstream compilers.
“I’d also like to see more testing and analysis tools applied to CakeML to show it maintains its invariants in every path of execution. That would increase odds it could be used in DO-178B certified tools without the source-to-object code inspections like required for SCADE done in Ocaml.”
I don’t think those kinds of improvements really capture what makes Hypothesis so magical. Hypothesis is accessible. You can slide it into any Python project and using it doesn’t require much learning. The implementation is, of course, amazingly complex, but on the surface none of that appears. Just put a @given on your test and it’s fundamentally changed, giving you something unittests don’t - can’t - give you. It’s hard to overstate how incredible this is.
In startupland, everyone thinks in terms of unit tests. Even the common ‘alternatives’, like integration tests and feature specs, are really larger-scale unit tests: each is just a single point in the state space, even if the size of the subdimension is larger. Anything more sophisticated than single-state tests has a steep learning curve. It took me two months to finally feel comfortable with even simple formal specifications. And convincing someone that type systems are good is a running gag at this point. “A monad is just a monoid in the category of endofunctors, what’s the problem?”
The thing with steep learning curves is it’s hard to convince people that it’s worth learning, or even that there’s something there to be learned at all. When your options are “unit tests” and “learning category theory”, the default becomes believing that single-state tests are the only useful test, maybe even the only practical test.
Then you spin up Hypothesis and property tests are possible, then trivial. And that opens up an entire world of testing, because now you know that single-state tests aren’t everything and that there are really amazing discoveries just waiting out there.
I’m a huge advocate of formal methods, to the point of writing an entire beginner’s guide to TLA+. The only reason I started, though, was because I learned Hypothesis. Without that formal methods would have just been another weird academia thing.
I just wanted to say that getting comments like this about Hypothesis is delightful, thank you. :-)
“I don’t think those kinds of improvements really capture what makes Hypothesis so magical.”
Of course not. That’s what your comment did pretty well. Mine was about extra ways to target or improve it that were interesting for researchers. That’s what OP asked about in the text of the submission. If aiming for max deployment, I’d say port the methodology to every mainstream language you can with prebuilt integrations for all kinds of build systems. Also apply it on common projects. Write guides for new people.
That’s all just boring as hell to most CompSci people wanting to extend cutting-edge projects. Integrating this with a verified stack to produce full coverage or spec testing that might be applied in domains like aerospace… less boring. ;)
“That’s all just boring as hell to most CompSci people wanting to extend cutting-edge projects.”
So there’s actually at least one major reason why the widespread deployment of Hypothesis isn’t boring from a research point of view: It’s a great source of data for whether things done to improve its implementation actually find more bugs in practice. If and when I want to test an exciting new algorithm I can just fire up all the open source code that’s tested with Hypothesis and run it and see if it finds interesting new bugs.
There’s another for the list. Yet another, building on that, would be tooling that seeds bugs of specific types into codebases, like moyix on Hacker News builds. The algorithm introduced into Hypothesis might catch certain types of bugs, or might only do so in certain contexts. Such limitations aren’t always obvious from one or two runs. Seeding more of that same kind of bug in many random places might reveal them. This continues with the bug-finding and bug-seeding tools running opposite each other, testing a variety of algorithms on large codebases: a hostile co-evolution much like Danny Hillis used for sorting algorithms on Thinking Machines’ hardware.