1. 24

  2. 33

    I think this disadvantage:

    Debugging is harder

    hides an important problem of this approach: the results aren’t reproducible. This is the main reason why you want to start from a clean state – not because it makes it a fair test (real-world usage is rarely fair, after all) but because it guarantees that the entire system state, and thus any subsequent behaviour that depends on it, is under your control.

    (“The entire system state” is, of course, limited by what your test suite includes – I’ve certainly seen test suites that worked on one system and failed on another and it eventually boiled down to different libsomething versions or whatever).

    Otherwise, if a test crashes on Monday night’s automated test run, but not on Tuesday night’s, you don’t know if that’s because the bug fix you attempted on Tuesday morning really fixed it, or because someone else pushed another feature whose test runs before yours and it just happens to make whatever garbage was in the database go away.

    This also guarantees that, if a test crashes on the Monday night CI run, whoever has to fix it on Tuesday morning only needs to run that one test, not the whole test suite up to the failing one (when the database isn’t cleared between tests, there’s no reason to assume that whatever you’re left with at the end of the whole run is even relevant).

    I’ve been bitten by this once – I got a bug report about a crashing test, and it turned out that a previous test didn’t clear the system state (it didn’t clear some packet counters, which caused a whole series of improbable interactions and eventually caused a pointer in an array of pointers (not the same one each time) to point into garbage land. Sometimes. Depending on some counter values, and there were a few hundred counters. A subsequent test would, much later, cause the code to dereference one of those pointers and boom. Fortunately it dereferenced all of them, since it looped over the whole array, so at least it was the same test failing every time, but that was just a happy accident.)

    On the one hand, that did reveal a nasty bug. On the other hand, the test suite had several thousand test cases, it took almost 14 hours to run, and about 8 hours to see the crash. It took me almost two weeks just to get it to happen without waiting for 8 hours, so that I could actually confirm it was fixed.

    Edit: oh yeah, got sucked into my own story and nearly forgot.

    There is definitely an argument to be made that you should test your code against bad data. Lots and lots of automated test suites verify functionality by building up a valid system state, and then verifying that running the system with that state produces the expected (“valid”) outcome. That’s great, but getting code to run correctly on correct data isn’t that hard; it’s that pesky invalid data that makes it choke.

    But in order for this to be valuable in terms of debugging and validation, it has to be reproducible bad data. Early in the development process you can probably crash anything by piping /dev/urandom into it, but getting /dev/urandom to produce the same data on two machines is hopefully very difficult.

    There are plenty of ways to systematically produce a useful stream of bad data. At the end of the whole thing, you can certainly top it off with a stress test from /dev/urandom – sure, it’ll be hard to reproduce, but it’s better to know that there’s a hidden bug somewhere than not to know it. But it’s not that hard to get bad data that covers most of the frequent failure modes.
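
    To make that concrete (the record fields and the FUZZ_SEED variable below are made up for illustration): seed the generator and log the seed, and the same “bad data” stream can be replayed on any machine – which is exactly what /dev/urandom can’t give you.

    ```python
    import os
    import random
    import string

    def make_bad_record(rng: random.Random) -> dict:
        """Build one deliberately malformed record from a seeded RNG."""
        return {
            "last_name": rng.choice([
                "Null", "", "\x00",  # the classic troublemakers
                "".join(rng.choices(string.printable, k=rng.randint(1, 4096))),
            ]),
            "age": rng.choice([-1, 0, 2**31, None]),
        }

    # Log the seed so a failing run can be replayed exactly elsewhere.
    seed = int(os.environ.get("FUZZ_SEED", random.randrange(2**32)))
    print(f"FUZZ_SEED={seed}")
    rng = random.Random(seed)
    bad_records = [make_bad_record(rng) for _ in range(1000)]
    ```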

    1. -1

      I suppose my response to this is a) if you are having flaky tests perhaps you need to go back to starting each test with a clean state until you resolve it (easily done - as mentioned in the post) and b) perhaps it’s worth looking into whether you can store the state at the start of the test with the aim of aiding reproducibility. Also I’m not sure I would leave dirty data in CI, I’m more thinking about local environments here.

      I think it’s rarely as simple as piping /dev/urandom to your application to generate bad data. If you look at libraries for generative testing you’ll see they have lots of tricks for generation, shrinking, etc. to help generate “interesting” random data with low(ish) computational cost. Interesting tidbit: I use generative testing libraries a lot and occasionally I find that they can go for a while (weeks, and millions of iterations) before finding a counterexample in some cases.
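
      As a rough illustration (Python’s hypothesis here, with a toy round-trip property – the encode/decode functions are stand-ins, not anything from the article): the library generates the “interesting” inputs for you and shrinks any failure to a minimal counterexample.

      ```python
      import json
      from hypothesis import given, strategies as st

      # Stand-ins for the code under test.
      def encode(record: dict) -> str:
          return json.dumps(record)

      def decode(blob: str) -> dict:
          return json.loads(blob)

      @given(st.dictionaries(st.text(), st.one_of(st.none(), st.integers(), st.text())))
      def test_round_trip(record):
          # hypothesis feeds in empty strings, NULs, huge ints, odd unicode, ...
          # and shrinks any failing input to a minimal counterexample.
          assert decode(encode(record)) == record
      ```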

      1. 5

        a) if you are having flaky tests perhaps you need to go back to starting each test with a clean state until you resolve it (easily done - as mentioned in the post)

        So… how can you tell if a test is flaky then? Just because it triggers a crash? That’s what tests are supposed to do.

        and b) perhaps it’s worth looking into whether you can store the state at the start of the test with the aim of aiding reproducibility.

        Isn’t that equivalent to clearing the database between tests and populating it with the (invalid) data that causes the crash? It just takes way longer to get there – plus you get to deal with all sorts of additional problems, like storing and restoring state, which may not be limited to database records.

        1. 4

          So… how can you tell if a test is flaky then

          A flaky test can pass and fail when run on the same code.

          This can be due to state from previous tests (self-poisoning) or ‘hidden inputs’ (eg OS threads getting scheduled/interrupted in a different order, networking etc).

          1. 3

            Sure, if a test passes on the first run and fails on the next, there’s no question that it needs attention. That’s not the problem.

            Suppose you have the proverbial bug about handling people whose last name somehow evaluates to NULL – so your code crashes when it encounters someone named James Null. You do the fix on your branch, looks okay. You run your test suite on your branch, and it seems to work fine. Nothing crashes anymore, not today, not this week – in fact not for a whole year.

            Twelve months later the test suite still runs fine. Is that because:

            a) Your fix was correct and no regression has been introduced, or b) Your fix was accidentally reverted at some point, or it was sloppily ported during a refactoring, or was lost in a complete rewrite of that module – but someone also implemented a “wildcard remove” feature, and their test removes every record in the database that contains the word “James” – including “James Null” – prior to your test?

            Suppose you did the smart thing, though, and instead of relying on state from previous cases, you’ve now integrated the “James Null” case in your integration test, and you add a “James Null” record as part of the setup. Better yet, to avoid the kind of mishap that resulted in your input being removed just because it had “James” in it, your setup case adds a record with 48 random characters and the word Null. So twelve months later, if your test suite doesn’t crash, is that because:

            a) You correctly handle the “James Null” case, or b) One of the previous test cases left a trigger function that automatically sanitizes invalid inputs, so “James Null” (or whatever random input is there) magically gets rewritten to “James \0”, and now there’s no problem anymore, ‘cause that’s just a string like any other?

            1. 1

              Yeah, but… how does cleaning the database help with these questions? I’m not seeing it.

    2. 17

      In my experience, tests should be as reproducible as possible. If there’s any interaction between tests, you just made your life more difficult. This even includes using random(ized) fixture data. It seems like a good idea to smoke-test your code for resilience against data that’s never the same, but if a test fails in a particular case, it can be extremely hard to figure out exactly what test or what data caused the issue.

      When a test fails on your CI, and you want to fix it, ideally you can just run the single test which failed. But you can’t do that anymore when you have all this extra state. You could run the entire suite, but if there’s another test which clears out the table on which that other test fails, you have to somehow interrupt the tests at exactly the right moment. Sounds very painful to me.

      Also, having zero records in a table is an edge case you probably should be testing anyway, and by leaving state from previous tests around, you’re not testing that situation. Unless of course you’re deleting the records, but then you have no guarantee that the tests you want to be running against a populated db are not run just after this “clean out” test.

      Regarding the setting up and tearing down of the db: I would just set it up at the start of the tests and then run every test in a transaction, which is rolled back. This is relatively cheap and avoids lots of annoying situations where you forget to clear out a certain table.
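
      Something along these lines (pytest + psycopg2 as an example stack; the table and connection string are made up):

      ```python
      import psycopg2
      import pytest

      @pytest.fixture(scope="session")
      def db():
          # Schema is created once per test session.
          conn = psycopg2.connect("dbname=app_test")  # connection details assumed
          conn.autocommit = True
          with conn.cursor() as cur:
              cur.execute("CREATE TABLE IF NOT EXISTS people (id serial PRIMARY KEY, last_name text)")
          yield conn
          conn.close()

      @pytest.fixture
      def tx(db):
          # Every test runs inside a transaction that is rolled back afterwards,
          # so nothing it writes is ever visible to another test.
          db.autocommit = False
          cur = db.cursor()
          yield cur
          db.rollback()
          db.autocommit = True

      def test_insert_person(tx):
          tx.execute("INSERT INTO people (last_name) VALUES (%s) RETURNING id", ("Null",))
          assert tx.fetchone()[0] is not None
      ```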

      The post does conclude with:

      I haven’t used this technique in many places and have only used it myself for a short time.

      I think this is a reason to take this entire post with a big grain of salt, especially since my experience is exactly the opposite. That’s many instances of hitting my head against the wall due to flaky test suites in different projects across different tech stacks, versus “this seems like a good idea and works in this one project I did”.

      1. 4

        There are ways to be quicker; here are a few:

        • avoid tearing down/recreating static data such as lookup tables
        • track which tables have been touched and don’t clean untouched ones (see the sketch after this list)
        • turn off crash safety while testing
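
        The second point might look roughly like this (a hypothetical helper, Postgres syntax):

        ```python
        # Track tables written to during a test and truncate only those;
        # static lookup tables never end up in TOUCHED, so they survive.
        TOUCHED: set[str] = set()

        def record_write(table: str) -> None:
            TOUCHED.add(table)

        def clean_touched(cur) -> None:
            if TOUCHED:
                cur.execute("TRUNCATE " + ", ".join(sorted(TOUCHED)) + " RESTART IDENTITY")
                TOUCHED.clear()
        ```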

        I wonder how much of this article is back-justification to get around this technical issue.

        The technical issue can be addressed with schema isolation or other namespacing techniques. This way you can run independent tests in parallel without any risk of conflict.
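
        For instance (a sketch only – the worker variable, schema name, and table are assumptions, not anything from the article), giving each parallel test worker its own schema keeps them from ever seeing each other’s rows:

        ```python
        import os
        import psycopg2

        # One schema per parallel test worker; tables created after SET search_path
        # live only in that worker's schema, so workers can't conflict.
        worker = os.environ.get("PYTEST_XDIST_WORKER", "gw0")
        schema = f"test_{worker}"

        conn = psycopg2.connect("dbname=app_test")  # connection details assumed
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute(f"CREATE SCHEMA IF NOT EXISTS {schema}")
            cur.execute(f"SET search_path TO {schema}")
            cur.execute("CREATE TABLE IF NOT EXISTS people (id serial PRIMARY KEY, last_name text)")
        ```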

        1. 7

          That’s not what unit tests are. Once you’re testing something else, you need a different, specific test for that – one that tests that other thing properly. If you’re testing for B or ‘other stuff’ while you’re testing for A, you decrease the accuracy of testing A, and you’re not explicitly and thoroughly testing B, and definitely not the ‘other stuff’.

          1. 2

            Different philosophies of unit testing, I’m afraid! I don’t subscribe to that view at all. My problem with the approach you describe is that a) it’s incredibly laborious to write tests function-by-function in this way, and b) the benefit of tests in this style as an aid to correctness is much reduced, because testing a function without reference to what it interacts with strips most of the value from the test. As a result I’m happy to do database work in a unit test, I’m happy to involve multiple system components, etc, etc – I’ll touch pretty much anything except third-party services.

            I do try to address this in the first paragraph of the article – mostly because I want to quickly move on from debates on what constitutes a “unit” test vs an “integration” test, as I don’t think that ends up being very valuable. But since we’re on it anyway: my opinion is that taking a reductive view of the word “unit” in unit test is not helpful.

            1. 7

              Different philosophies of unit testing I’m afraid

              When people talk about the differences between unit and integration testing being partly philosophical, usually they are referring to the fact that an aggregation of several functions or classes can still be considered a “unit” from a testing point of view, so there is no clear cut-off at which a complex unit is no longer a unit.

              Unit testing is not the only type of testing, and of course there is no agreed definition of what exactly unit testing is. However, what you are doing (deliberately integrating complex systems and also introducing unknown state) is contrary to the idea of what many people consider to be a unit test. Common ideas about what a unit test is include:

              • “small” - doesn’t try to test everything, just a single “unit”
              • “isolated” - we try to remove the unit from any external influences besides the test input.

              This is not to say that your testing methods are necessarily wrong: I think ultimately, this is a language issue. You are free to define words however you like, but if your definition flies in the face of conventional usage, you’re just going to make things difficult for yourself. If you talk about what you are doing as an integration test or some sort of ad hoc fuzzing, I think you might have an easier job discussing the ideas themselves, rather than having people get hung up on whether they are really unit tests.

              1. 3

                Including dirty data is definitely well outside most people’s definition of a unit test, and I’m fine with that and perhaps should have been more explicit about it. However, I think including first-party collaborators is not – though since unit testing gained popularity there are now more microservices about, and I think what was previously a collaborator class/object is now often a collaborator microservice.

                I disagree that the (IMO) over-reductive view that a unit test should only cover a single function or class (analogous to the “Mockist” approach described in Martin Fowler’s article) is universal in common parlance, and I think that’s why “that’s not a (unit|integration|system) test, that’s a (unit|integration|system) test!” is such a common complaint. I think a large chunk of people would accept that a unit test should include at least some collaborators – and most of those people would include the database in that, including the creator of Ruby on Rails.

                1. 5

                  There is no authoritative definition of a unit test, but I’m confident that very few definitions have a scope large enough to include actual database access.

                  The definition I’m most fond of is: cloning your repository and running the unit tests should always work, and shouldn’t require anything other than the language tool chain. No internet access, no implicit dependencies on local services, no Docker commands, etc.

                  1. 3

                    This.

                    Additionally, the “who has claim to the term ‘unit test’” argument is beside the point. Whatever terms you use, there is an important conceptual distinction between the kinds of tests that have no outside dependencies (unless they’re mocked) and the kinds of tests that do. Just stamping your foot and calling the latter “unit tests” doesn’t magically give them the same design and behavioral properties as the former, and those properties are the reason we write those kinds of tests.

                    Of course, you are free to make the case against the value of those kinds of tests, but that has not been done persuasively in this article.

                    1. 2

                      While it’s rare for unit tests to include the database, it’s not unheard of. Rails devs often consider tests that involve models, which save to and read from the database, as unit tests.

            2. 4

              These are not unit tests, no matter how much you’d like them to be. There are good reasons we have different names for things.

              1. 2

                I’ve got some parallel database tests by using temporary tables & views. Each connection (test) can only see its own data. This isn’t possible for every setup but works pretty well for me.
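
                Roughly like this, in Python/psycopg2 terms (the table name is made up) – a TEMPORARY table is only visible to the connection that created it:

                ```python
                import psycopg2

                conn = psycopg2.connect("dbname=app_test")  # connection details assumed
                with conn, conn.cursor() as cur:
                    # Only this connection can see this table (it even shadows a real
                    # "people" table on the search_path), so parallel tests don't collide.
                    cur.execute("CREATE TEMPORARY TABLE people (id serial, last_name text)")
                    cur.execute("INSERT INTO people (last_name) VALUES ('Null')")
                    cur.execute("SELECT count(*) FROM people")
                    assert cur.fetchone()[0] == 1
                conn.close()
                ```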

                1. 1

                  Tearing down data between tests or schemas between test runs is not free of (computational) charge.

                  This is an important point, but also something you can mitigate. For example, in Postgres DELETE can be significantly faster than TRUNCATE for small tables. I wrote more on the topic: https://lob.com/blog/truncate-vs-delete-efficiently-clearing-data-from-a-postgres-table/
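
                  In a cleanup helper that can be as simple as picking the statement per table (a sketch; the table names are whatever your suite uses):

                  ```python
                  def clear_table(cur, table: str, small: bool) -> None:
                      if small:
                          # DELETE just removes the few rows; cheap for tiny test tables.
                          cur.execute(f"DELETE FROM {table}")
                      else:
                          # TRUNCATE's fixed per-table overhead only pays off on bigger tables.
                          cur.execute(f"TRUNCATE {table} RESTART IDENTITY CASCADE")
                  ```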

                  1. 1

                    Yes that’s mentioned in the article too :)

                    1. 1

                      Ahh, didn’t see it in the footnotes. Very good.

                  2. 1

                    A test that I commonly write is for filtering / sorting, to see if complex combinations produce the desired results. For this, the state of the db is quite important. I’m curious, how would you approach such a test?

                    1. 1

                      So far I’ve handled that by limiting the scope to a single user. Another way would be to write the assertion as a property rather than checking for entries x, y and z - a bit more like a generative test. Maybe there are other ways depending on context. If nothing works you could fall back to a second backing store that was cleaned for that test run.
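
                      By “write the assertion as a property” I mean something like this (field names made up) – the checks hold regardless of what other rows happen to be sitting in the table:

                      ```python
                      def check_filtered_and_sorted(rows, min_age):
                          ages = [r["age"] for r in rows]
                          # Every returned row satisfies the filter...
                          assert all(a >= min_age for a in ages)
                          # ...and rows come back in the requested (ascending) order.
                          assert ages == sorted(ages)
                      ```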