1. 18

I am doing master's research in generative testing and I would be very grateful to have the Lobsters community's opinions on it. Do you use it? In which kind of application? What framework do you use? Are you happy with the results? What would you like to have improved in generative testing in general?

Thanks! All help appreciated.


  2. 10

    Yes, I use generative testing everywhere. In Haskell I use Hedgehog and in Scala I use ScalaCheck.

    At work we use generative testing for things like JSON/MongoDB encoders/decoders, functional optics (e.g. lenses) where we want to test laws, and for testing application logic in general.

    I’ve blogged about some of my generative testing work here:


    1. 6

      People don’t go nearly far enough with generative testing IMO. I’ve had really great success using it to test distributed systems: I use a generator to seed clusters with client requests at specific time steps, and I also generate a schedule of network weather. For the client requests that receive responses, it’s often useful when testing consensus algorithms etc. to check that they linearize.

      For concurrent systems, I generate small sets (2-4) of operations against the full concurrent API, run them on multiple threads, and record their return values. If it’s then impossible to find a single sequential schedule that yields the same return values for the same requests by running different permutations of the concurrent ops on a single thread, I’ve just found a violation of atomicity.
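A minimal sketch of that sequential-permutation check, using a toy counter as the sequential model (the names and the two-op history here are invented for illustration):

```python
from itertools import permutations

class Counter:
    """Toy sequential model of the concurrent API under test."""
    def __init__(self):
        self.n = 0
    def incr(self):
        self.n += 1
        return self.n

def serializable(history):
    """history: (op_name, observed_result) pairs recorded from a
    concurrent run. True if some sequential order of the same ops
    reproduces every observed result."""
    for order in permutations(history):
        model = Counter()
        if all(getattr(model, op)() == res for op, res in order):
            return True
    return False

# two incr() calls that both observed 1 cannot happen under any
# sequential schedule -> an atomicity violation
print(serializable([("incr", 1), ("incr", 2)]))  # True
print(serializable([("incr", 1), ("incr", 1)]))  # False
```

In the real setup the results come from actual threads; the permutation search is factorial in the number of ops, which is one reason keeping the generated op sets small (2-4) matters.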

      For systems that write to disk, I have a file interface that, when run in testing mode, records a log of writes and fsync calls. I’ll generate operations against the API that persists data, plus a crash time at which writes since the last fsync call are either partially applied or dropped entirely. If the system is in an inconsistent state upon restart, I’ve just found incorrect usage of the filesystem (so many databases fail to do this correctly).
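In miniature, the logging file layer and crash replay can look like this (a sketch with invented names; a real harness would sit underneath the system's actual file API):

```python
import random

class SimDisk:
    """Hypothetical file layer for test mode: records writes and
    fsyncs so crashes can be replayed at arbitrary points."""
    def __init__(self):
        self.log = []
    def write(self, data):
        self.log.append(("write", data))
    def fsync(self):
        self.log.append(("fsync", None))
    def crash_states(self, rng):
        """Possible on-disk contents after crashing at each point in
        the log: writes before the last fsync survive; later writes
        survive only if the simulated hardware got to them."""
        for crash_at in range(len(self.log) + 1):
            durable, pending = [], []
            for op, data in self.log[:crash_at]:
                if op == "write":
                    pending.append(data)
                else:                      # fsync: pending becomes durable
                    durable += pending
                    pending = []
            durable += [d for d in pending if rng.random() < 0.5]
            yield durable

disk = SimDisk()
disk.write("a"); disk.fsync(); disk.write("b")
states = list(disk.crash_states(random.Random(0)))
# invariant: once the fsync has returned, "a" must never be lost
assert all("a" in s for s in states[2:])
```

A generator then drives the persistence API and asserts the recovery invariant over every crash state.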

      People tend to just test things that adhere to a function signature, and you get great results for very little effort by doing this, but there are mountains of gold if you get more creative with bigger and messier systems.

      1. 5

        What would you like to have improved in generative testing in general?

        • Smart ways to generate recursive data structures with low risk of them blowing up exponentially as the size parameter increases

        • Smart ways to direct generator distributions to problematic inputs.

        1. 3

          Smart ways to generate recursive data structures with low risk of them blowing up exponentially as the size parameter increases

          Most automated implementations suffer from this (e.g. generic deriving libraries in Haskell). The blow-up comes down to the expected number of recursive calls: for example, if a binary tree generator has a 50/50 chance of picking a leaf or a node, the leaf makes no recursive calls and the node makes two, so we expect 0.5 * 0 + 0.5 * 2 = 1 recursive call per step, and hence the expected size of our data is unbounded. When a constructor can have many sub-expressions, like a list, this number grows really large.

          The naive way to tackle this is to adjust the probabilities so that a leaf is chosen more often and the expected number of recursive calls is < 1. Unfortunately this causes exponential decay in the amount of data generated, so we may never see values more than a few levels deep.

          I tend to avoid this by passing a “fuel” parameter through the generator. This is conserved, so if we want to generate multiple pieces of data (e.g. elements in a list) we must divide it up. The original QuickCheck paper mentions this, but says it’s undesirable since it couples together different parts of the generated data (if some values are large, the others will be small).
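A sketch of the fuel idea for a binary tree (plain illustrative Python, not any particular library's API):

```python
import random

def gen_tree(fuel, rng):
    """Generate a binary tree, spending conserved 'fuel': each node
    burns one unit and splits the rest between its children, so the
    total size is bounded by the fuel rather than exploding."""
    if fuel <= 0 or rng.random() < 0.5:
        return "leaf"
    left_fuel = rng.randint(0, fuel - 1)
    right_fuel = fuel - 1 - left_fuel   # shares sum to fuel - 1
    return ("node", gen_tree(left_fuel, rng), gen_tree(right_fuel, rng))

def size(t):
    return 1 if t == "leaf" else 1 + size(t[1]) + size(t[2])

rng = random.Random(42)
trees = [gen_tree(20, rng) for _ in range(100)]
# at most 'fuel' nodes, hence at most 2 * fuel + 1 values in total
assert all(size(t) <= 2 * 20 + 1 for t in trees)
```

The coupling the QuickCheck paper complains about is visible here: a large left subtree forces a small right one.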

          There are some smarter approaches too although I’ve not used them.

        2. 5
          • The learning curve is steep. Most tutorials do a great job of explaining it with super basic examples (e.g. if my function is sum(x, y) I should be able to swap the order of x and y with no change), but there’s a big jump from that to testing actual business logic with actual input.
          • I’ve used both ScalaCheck and JsVerify - in both I seem to spend a lot of time writing boilerplate for generators… e.g. there’ll be a generator for taking a subset of a list, but not an in-order sublist, or it can generate a random object but not one that conforms to a type definition. There’s gotta be room for a more user-friendly way of generating input.
          • Shrinking is always a massive source of surprises - often you’ll hit a failing case, have the input shrunk to something that doesn’t satisfy the original parameters you set, then puzzle as to how it managed to generate a value that you specifically told it not to generate. This is especially bad in ScalaCheck because by default it shrinks lists of tuples down in a way that swaps the values inside the tuple.
          1. 4

            Yes, generative/property/fuzz tests are awesome. I use them in Elm whenever I can. I’ve also written a library that uses them for the `Msg` / `Model` / `update` triplets of The Elm Architecture: http://package.elm-lang.org/packages/Janiczek/elm-architecture-test/latest

            I used them, e.g., on my implementation of a text editor, and they found so many edge cases you wouldn’t believe. https://github.com/Janiczek/elm-editor/blob/master/tests/CodeEditor.elm

            They effectively allow you to pin down the behaviour of a function to a certain degree, giving you counter-examples like “You said the cursor wouldn’t go to a negative column, but if you do [Insert ‘a’, Left, Backspace], the cursor is at -1!”, with the Msg list minimized and all that (it probably found the bug with ~100 Msgs in the list, and threw out those that didn’t matter).

            EDIT: What I would love in generative testing is some scheme that allows it to be a monad (to have an `andThen : (a -> Fuzzer b) -> Fuzzer a -> Fuzzer b` function) and still shrink well.

            1. 3

              EDIT: What I would love in generative testing is some scheme that allows it to be a monad (to have an `andThen : (a -> Fuzzer b) -> Fuzzer a -> Fuzzer b` function) and still shrink well.

              FWIW this is literally the main innovation in Hypothesis. :-)
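For concreteness, here is what that bind looks like in a hand-rolled mini-generator (a sketch with no shrinking; Hypothesis gets shrinking to compose through bind by shrinking the underlying random byte stream rather than the generated values):

```python
import random

class Gen:
    """Minimal generator type with monadic bind (a hand-rolled
    sketch, not Hypothesis's actual internals)."""
    def __init__(self, run):
        self.run = run                      # run: rng -> value
    def and_then(self, f):
        # bind: use one generated value to decide the next generator
        return Gen(lambda rng: f(self.run(rng)).run(rng))

def ints(lo, hi):
    return Gen(lambda rng: rng.randint(lo, hi))

def list_of(g, n):
    return Gen(lambda rng: [g.run(rng) for _ in range(n)])

# choose a length first, then a list of exactly that length
sized_lists = ints(1, 5).and_then(lambda n: list_of(ints(0, 9), n))

rng = random.Random(0)
samples = [sized_lists.run(rng) for _ in range(100)]
assert all(1 <= len(xs) <= 5 for xs in samples)
```

The pain point is exactly here: a naive bind like this cannot shrink the list without invalidating the already-chosen length.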

            2. 3

              I’ve used ScalaCheck in the past but I’m not actively using it presently out of expediency.

              I’m really interested right now in finding a tool that can look at a database table (MySQL currently) and generate records for it. I’m slowly tracking down a bug that seems to affect how some data is stored by a third-party app whose database I’m having to access directly (long story). Finding a tool that could just generate some rows would save me a lot of time.
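A quick-and-dirty version of that is not much code if you hand it the schema yourself (everything below is invented for illustration; a real tool would read column types from INFORMATION_SCHEMA and emit INSERT statements):

```python
import random, string

# hypothetical type -> generator table for a handful of MySQL types
GENS = {
    "int":     lambda rng: rng.randint(-2**31, 2**31 - 1),
    "varchar": lambda rng: "".join(
        rng.choices(string.ascii_letters, k=rng.randint(0, 20))),
    "bool":    lambda rng: rng.choice([0, 1]),
}

def gen_rows(schema, n, seed=0):
    """schema: list of (column_name, type_key) pairs."""
    rng = random.Random(seed)
    return [{col: GENS[ty](rng) for col, ty in schema} for _ in range(n)]

rows = gen_rows([("id", "int"), ("name", "varchar"), ("active", "bool")], 5)
```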

              1. 2

                I have begun using PropEr for my Erlang/OTP code, largely because I feel unit testing of routines – routines that tend to be around four lines tall – in a language with single assignment to be quite the Kabuki theatre. The hurdles with this new way of testing are mostly related to my slow adoption of something quite new. So, not sure if this feedback helps in any shape or form. I can’t stress enough how little unit testing has done for me with Erlang. So, I’ll probably stick with PropEr for a good long while. Perhaps a resource with a focus along the lines of ‘Property-based testing for people coming from unit testing’ would have helped with some of the growing pains.

                1. 2

                  Write a simulated ‘world’ by abstracting all IO of your program. Write a generative test to generate user actions, and run the code against the simulated world. Add assertions everywhere. Worked for FoundationDB.
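In miniature, with all names invented for illustration, that recipe looks like this:

```python
import random

class SimWorld:
    """Deterministic stand-in for the outside world: a clock and a
    flaky key-value 'network service' (hypothetical names)."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.time = 0
        self.store = {}
    def send(self, key, value):
        self.time += self.rng.randint(1, 10)   # simulated latency
        if self.rng.random() < 0.1:            # simulated packet loss
            return False                       # not acknowledged
        self.store[key] = value
        return True                            # acknowledged

def run_scenario(seed, n_actions=50):
    world = SimWorld(seed)
    acked = {}
    for i in range(n_actions):                 # generated user actions
        key = world.rng.randint(0, 5)
        if world.send(key, i):
            acked[key] = i
    # assertion 'everywhere': every acknowledged write is visible
    assert all(world.store[k] == v for k, v in acked.items())

for seed in range(100):                        # same seed, same run:
    run_scenario(seed)                         # failures reproduce
```

Because the whole world is driven from one seeded RNG, any failing run replays exactly.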

                  1. 2

                    Interesting answers, but still no mentions of Clojure’s Spec. I really wonder why?

                    1. 2

                      Yes, I use Clojure Spec and the Python library Hypothesis.

                      One thing that I really miss from Spec when using Hypothesis is that Hypothesis’s data models are only useful for testing, unlike Spec’s, which can also be used for data conformance and transformation in the running program. So if you want to do something in the Python world, you could explore the possibility of using Marshmallow schemas as data models for Hypothesis.

                      1. 2

                        I would love to see research on using an AFL-style genetic algorithm based on (branch) coverage feedback for generating test cases in a QuickCheck-style property testing framework. You could do that with clang’s -fsanitize-coverage options, similar to what libFuzzer does but with type-aware input generation, shrinking, etc.
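A toy version of that loop, with the coverage signal faked via a hand-instrumented function (AFL and libFuzzer get it from compiler instrumentation; all names here are invented):

```python
import random

def coverage_of(f, x):
    """Toy coverage signal: which branches of f fired for input x."""
    hits = set()
    f(x, hits)
    return hits

def target(x, hits):
    # contrived nested branches for the fuzzer to discover
    if len(x) > 2:
        hits.add("len>2")
        if x[0] == 7:
            hits.add("x0==7")
            if x[1] == 3:
                hits.add("bug")      # pretend this branch crashes

def fuzz(rounds=20000, seed=0):
    rng = random.Random(seed)
    corpus, seen = [[0]], set()
    for _ in range(rounds):
        parent = rng.choice(corpus)[:]
        for _ in range(rng.randint(1, 4)):   # a few type-aware mutations
            op = rng.random()
            if op < 0.5 and parent:
                parent[rng.randrange(len(parent))] = rng.randint(0, 9)
            elif op < 0.8:
                parent.append(rng.randint(0, 9))
            elif parent:
                parent.pop()
        cov = coverage_of(target, parent)
        if cov - seen:               # input reached a new branch: keep it
            seen |= cov
            corpus.append(parent)
    return seen

found = fuzz()   # the corpus climbs the nested branches step by step
```

Random generation alone would almost never stack three specific choices; keeping any input that reaches a new branch lets the search climb one condition at a time.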

                        1. 1

                          This is something I’ve been wanting to do as a side project for a very long time: instrument the output of the Elm compiler with code coverage stats at runtime, and use these stats from within the test runner for some kind of coverage-maximizing AFL-style fuzzer.

                        2. 2

                          I use StreamData (soon to become ExUnit.Properties in a future release) for testing Elixir/OTP applications. I find that it helps isolate bugs which would otherwise be subtle as it’s so easy to forget an edge case.

                          1. 2

                            One thing some of us have a consensus on is to remember to combine whatever you do with fuzz testing of the same program, using the specs/contracts/properties as runtime checks. That way, your generated tests catch the kinds of problems you’re looking for. The fuzzing catches those you aren’t, but will often take you right to at least the point of failure thanks to the runtime checks.

                            If a property is expensive to check at runtime, you can set it up to log the output of the modules or app so you can batch the expensive checks later. Doing it that way can be more cache- and multicore-friendly. With one process per analysis, you can even keep the analysis single-threaded but use multiple cores overall. That has benefits when you want the analysis itself to be more trustworthy (i.e. no concurrency errors) or your dependency doesn’t support multicore (e.g. CakeML or OCaml).

                            As far as generation methods go, here’s a survey paper showing different ways people are doing it. I have quite a few papers on model-based and combinatorial testing too if you need any.

                            1. 2

                              Haskell’s QuickCheck is my test framework of choice. There’s a really nice library called LazySmallCheck2012 which I’ve used a few times, but it seems mostly unmaintained (I patched it to work with GHC 7.10, but the pull request has sat unmerged for years). Maybe some ideas in that direction would be useful?

                              Let’s say we need to generate some data of type (Bool, Either String Colour). With QuickCheck and SmallCheck we would choose a value (at random or systematically, respectively), like (True, Right Blue), and pass this to a test.

                              The idea of LSC and LSC2012 is that we only make choices when we have to. We do this by effectively running our test on the value undefined. If the test passes, we know it will work for all possible inputs. If it fails, we’ve found a (minimal) counterexample. If it throws an exception, it must have tried branching on the value we gave it; in which case we re-run the test on all those values which make one more choice than last time, e.g. (undefined, undefined). If we get another exception, say from the second element, we would re-run with both (undefined, Left undefined) and (undefined, Right undefined). We keep doing this, adding specificity if there’s an exception, until either the test passes, we find a counterexample, or we reach some depth limit.

                              The effect seems to be rather like using logic programming.
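The mechanism can be mimicked in a few lines (a Python caricature of the idea, nothing like the real library's implementation):

```python
class NeedChoice(Exception):
    """Raised when the property inspects a part we haven't chosen."""
    def __init__(self, slot):
        self.slot = slot

class Hole:
    """Stand-in for an unevaluated value, like LSC's 'undefined'."""
    def __init__(self, slot):
        self.slot = slot
    def __bool__(self):
        raise NeedChoice(self.slot)
    def __eq__(self, other):
        raise NeedChoice(self.slot)

def check(prop, candidates):
    """Breadth-first refinement: only fill in a slot when the
    property actually branches on it."""
    queue = [tuple(Hole(i) for i in range(len(candidates)))]
    while queue:
        val = queue.pop(0)
        try:
            if not prop(val):
                return val              # minimal counterexample
        except NeedChoice as e:
            i = e.slot                  # expand just the forced slot
            for c in candidates[i]:
                queue.append(val[:i] + (c,) + val[i + 1:])
    return None                         # property holds at this depth

# claim: the pair is never (True, 3)
prop = lambda p: not (p[0] and p[1] == 3)
print(check(prop, [[False, True], range(5)]))  # (True, 3)
```

Note how (False, Hole) passes without the second component ever being chosen: that pruning is what makes it feel like logic programming.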

                              1. 1

                                Today I had (once again) the pleasure of using Go’s quick.CheckEqual. It’s very simple (for example, there is no shrinking step for inputs) but it is also very easy to use and is always there as part of the standard library.

                                Here’s an example that verifies equivalence of naive implementation with real one.

                                1. 1

                                  I just ripped out a lot of ScalaCheck tests from my codebase. It was not worth the hassle, and I dislike non-determinism.

                                  1. 1

                                    Do you use it?
                                    I used combinatorial testing in a project for work last year.

                                    In which kind of application?
                                    AuditAssistant is a Windows Forms application in C# that generates software inventory reports and files them to a document management system. My employer used it for our migration to Windows 10.

                                    What framework do you use?
                                    I used the XUnit.Combinatorial library for xUnit.

                                    Are you happy with the results?
                                    Generative testing played a small role in this project. I used XUnit.Combinatorial for testing two DTOs. Both classes check their preconditions but neither of them contains any other logic. The results are a mixed bag. I didn’t write enough generative tests for them to really pay off. They would be more valuable in a larger project with more complex invariants to enforce.

                                    What would you like to have improved in generative testing in general?
                                    I would like a book on generative testing along the lines of Growing Object-Oriented Software Guided by Tests.