1. 27
  1.  

  2. 11

    Mostly agree with all of these. I’ll add one more: people should be using snapshot/approval/golden master testing a lot more than they are now, they make writing tests pretty painless and almost a fun experience. People don’t even realize that they’re not just for UI tests, they can be for any transformation which takes data as input and produces data as output.

    1. 8

      Golden master tests are also a great way to add tests to an existing untested project.

      1. 4

        Very much this! Both ideas are immensely valuable:

        • formulating the test explicitly as “feed data in, get data out, assert properties of output”, as opposed to “some code with assertions”
        • making the “assert” step auto-updatable, such that the refactors changing the shape of output data don’t need to be blocked on manually touching all the tests.
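
        A minimal sketch of both ideas in Python. `summarize`, `assert_snapshot` and the `update` flag are hypothetical names for illustration, not taken from any particular snapshot library: the test is plain data in, data out, and the expected output lives in a file that can be rewritten on request.

```python
import json
import tempfile
from pathlib import Path


def summarize(orders):
    """Code under test (hypothetical): total quantity per item."""
    totals = {}
    for order in orders:
        totals[order["item"]] = totals.get(order["item"], 0) + order["qty"]
    return totals


def assert_snapshot(path: Path, actual, update: bool = False) -> None:
    """Compare `actual` with the stored snapshot, or (re)write the snapshot."""
    rendered = json.dumps(actual, indent=2, sort_keys=True) + "\n"
    if update or not path.exists():
        path.write_text(rendered)  # record a new golden output
        return
    expected = path.read_text()
    assert rendered == expected, f"snapshot mismatch for {path.name}"


with tempfile.TemporaryDirectory() as tmp:
    snap = Path(tmp) / "summarize.json"
    orders = [{"item": "a", "qty": 1}, {"item": "a", "qty": 2}]
    assert_snapshot(snap, summarize(orders))  # first run records the output
    assert_snapshot(snap, summarize(orders))  # later runs compare against it
```

        When a refactor changes the shape of the output on purpose, rerunning with `update=True` rewrites the file, and the change shows up as a diff to review.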
        1. 2

          Sorry if I fail to understand your comment. I think you are saying that there is ‘testing for enhancing the quality of developed code’ and there is ‘testing for protecting the reputation of released code’. The latter was once called Quality Assurance.

          The whole game of testing and other quality enhancement games is to make money by spending less on errors than they cost. If you have a bug that blows up a $10M rocket, assume it is worth $60M to fix the bug before it blows up the rocket. If the bug has a one in a hundred chance of blowing it up, one should spend up to, but no more than, $0.6M to prevent it. You can now evaluate what is worth fixing.
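
          The same arithmetic as a tiny Python sketch, using the numbers above; the function name is made up for illustration.

```python
def max_prevention_budget(cost_of_failure: float, probability: float) -> float:
    """Expected loss: the probability of the failure times what it would cost."""
    return cost_of_failure * probability


# A failure valued at $60M with a 1-in-100 chance of happening.
budget = max_prevention_budget(60_000_000, 1 / 100)
print(f"spend at most ${budget:,.0f}")
```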

          The value of running some tests and sanity checks on the golden master of infrequently released software almost always dwarfs the cost of doing so.

          1. 1

            That’s definitely not what I’m talking about, I’m responding in the context of unit tests which the OP is about.

          2. 1

            Personally, I haven’t seen “golden master” tests used a lot, and where I have, they seemed like an anti-pattern: they asserted way too many unimportant details (formatting, order of stuff, etc.), which resulted in regular false positives in those unimportant details (often due to changes in code not tightly related to the tests), which then resulted in important details being lost in the noise, and the tests eventually being blindly overwritten with a dumped output of the most recent run of the actual test whenever “red”. Honestly, I don’t really know how to avoid this outcome with “golden master” tests; do you have some hints for that? Have you seen “golden master” tests work well and not degrade along this path?

            In other words, as much as they make writing tests “pretty painless and almost a fun experience”, in the few places I saw them used, this cost seemed offset by making maintaining the tests harder and more annoying. Which seems a net negative to me. Do you know of ways to avoid this?

            1. 1

              rust-lang/rust is, I think, a great example of this - you code-review the diffs to the golden output along with the code.

              1. 1

                Yes, it requires a little work, but you need to format the snapshots in such a way that the unimportant details are not captured. E.g. say you need to snapshot this JSON:

                {
                  "timestamp": "2022-09-12T20:13:00Z",
                  "a": 1,
                  "b": 2
                }
                

                The timestamp will change with every call, so it can’t be part of the test. You’ll first need to ensure that it’s left out of the rendered snapshot:

                {
                  "a": 1,
                  "b": 2
                }
                

                It depends on your language but it should be possible to override the formatting of the snapshot, with varying degrees of automation. In languages with type classes (e.g. Haskell, Scala, Rust), it would look like:

                assertSnapshot(Json.Obj(
                  "timestamp" -> Json.String("2022-09-12T20:13:00Z"),
                  "a" -> Json.Number(1),
                  "b" -> Json.Number(2)))
                

                And behind the scenes the assertSnapshot method would find a FormatSnapshot[Json] instance that knows it’s supposed to omit the timestamp from the rendered snapshot.
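
                In a language without type classes, a rough equivalent is a render step that drops volatile keys before the snapshot is compared. A minimal Python sketch; `VOLATILE_KEYS` and `render_snapshot` are names invented for this example, not part of any library.

```python
import json

# Keys whose values change on every run (an assumption for this example).
VOLATILE_KEYS = {"timestamp"}


def render_snapshot(value):
    """Render JSON-like data deterministically, omitting volatile keys."""

    def scrub(node):
        if isinstance(node, dict):
            return {k: scrub(v) for k, v in node.items() if k not in VOLATILE_KEYS}
        if isinstance(node, list):
            return [scrub(item) for item in node]
        return node

    return json.dumps(scrub(value), indent=2, sort_keys=True)


payload = {"timestamp": "2022-09-12T20:13:00Z", "a": 1, "b": 2}
print(render_snapshot(payload))
```

                The printed text contains only `a` and `b`, so two calls made at different times render to the same snapshot.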

            2. 4

              I don’t agree with point 5. Even if you know how to implement something fairly simple, the implementation is not going to be the simplest possible unless you use TDD. And anything but the simplest possible implementation is technical debt you’ll pay for every time you touch that code in the future. I’ve tried both, and writing the tests first makes all the difference.

              One caveat: don’t use TDD while prototyping though. If you just need to prove an idea before implementing it in earnest, just cobble something together. The sloppier the better, to encourage re-implementing rather than releasing the prototype.

              1. 3

                I think unit testing, like some other areas in IT that are written about a lot, suffers from the problem where you can get philosophical really quickly and it often depends on context, scenarios, situations, etc. Then dogmas, “best practices”, and religious fights over them are created.

                On top of that, everyone has different experiences: some saw software that wasn’t tested at all before it shipped; others saw people do something like test_true = true in the setup and later assert(test_true, true); others again saw that people around them never wrote negative tests, or that testing stalled at some point, or that the prototype had proper tests and the finished product none, etc.

                More than that, there is no golden rule on how best to test in every scenario, yet many people have strong opinions on how to test properly. There’s a lot of bikeshedding going on.

                So in the end it probably boils down to “yes, test and use your experience and the current context/scenario to tell you how”. I think if there was an actually perfect rule, we’d not discuss it and probably even have automated it away somehow.

                And on the article itself. I feel like for some claims it would be good to do it both ways. Running tests in order vs not running them in order. If the tests aren’t too expensive, do both. That might give you a better picture of what’s going wrong and why. “You should not be running your unit tests in isolation.” is another situation. This might be a viable, easy to achieve option and not so hard in some scenarios.

                1. 2

                  If the test runner finishes without running any tests, it should return a non-zero error code.

                  That will only catch test scripts which die before running any tests, not those which die partway through. Better - use the same approach taken by Perl’s TAP (Test Anything Protocol).

                  Have each test emit a pass/fail (or “todo”) and also declare at the start a “plan” of how many tests should occur. If the number of emitted tests doesn’t match the plan (too many or too few) the suite as a whole has failed.

                  https://perldoc.perl.org/5.8.7/Test::Harness::TAP#The-plan
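
                  A minimal sketch of the plan check in Python. It is deliberately simplified: it only understands the `1..N` plan line, `ok` / `not ok` result lines and the `# TODO` directive, and `tap_suite_passed` is a made-up name, not a real TAP consumer.

```python
import re

PLAN = re.compile(r"^1\.\.(\d+)")
RESULT = re.compile(r"^(not ok|ok)\b(.*)$")
TODO = re.compile(r"#\s*TODO\b", re.IGNORECASE)


def tap_suite_passed(output: str) -> bool:
    """True only if no test failed and the number of results matches the plan."""
    planned = None
    seen = 0
    failed = 0
    for line in output.splitlines():
        plan = PLAN.match(line)
        if plan:
            planned = int(plan.group(1))
            continue
        result = RESULT.match(line)
        if result:
            seen += 1
            # A failing test marked TODO does not fail the suite.
            if result.group(1) == "not ok" and not TODO.search(result.group(2)):
                failed += 1
    return planned is not None and seen == planned and failed == 0


complete = "1..2\nok 1 - parses input\nok 2 - writes output\n"
died_early = "1..3\nok 1 - parses input\nok 2 - writes output\n"
print(tap_suite_passed(complete), tap_suite_passed(died_early))
```

                  The second stream contains only passing results, yet it is rejected because the script stopped before the third planned test.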

                  I’d love the whole testing infrastructure in all languages to get behind TAP. It’s a simple, well thought out way for tests and test runners to interoperate. One downside of opinionated tooling (e.g. in golang) - which I otherwise love - is that it is difficult to slot in an alternate approach - the test runner is defined and can’t be replaced.

                  1. 3

                    If the test runner dies for some reason, you’d hope that something else would catch that. Most things will, for example, interpret “exited due to a signal” as a failure. The only realistic situation I can think of where the test runner would die halfway through and it wouldn’t be reported as an error would be if there’s a bug in the test runner itself.

                    The situation where no tests execute is more likely to be a misconfiguration though, such as specifying a filter which matches no tests or accidentally #ifdef’ing out your tests when you didn’t intend to. Those cases wouldn’t be caught by anything else if the test runner treated it as a success.
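
                    A sketch of that rule using Python’s `unittest`. `exit_code_for` is a hypothetical helper and the exit code 2 is an arbitrary choice for illustration; the only point is that an empty run is not reported as success.

```python
import io
import unittest


def exit_code_for(result: unittest.TestResult) -> int:
    """Map a finished run to a process exit code; an empty run is an error."""
    if result.testsRun == 0:
        return 2  # nothing ran: probably a bad filter or a misconfiguration
    return 0 if result.wasSuccessful() else 1


class Example(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(1 + 1, 2)


def run(suite: unittest.TestSuite) -> int:
    """Run a suite quietly and return the exit code it should produce."""
    runner = unittest.TextTestRunner(stream=io.StringIO())
    return exit_code_for(runner.run(suite))


empty = unittest.TestSuite()  # e.g. a filter that matched no tests
real = unittest.defaultTestLoader.loadTestsFromTestCase(Example)
print(run(empty), run(real))
```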

                    1. 1

                      The process doesn’t have to die to not run all the tests you want. A test suite may contain logic to calculate the number of tests based on external data, which might be partially absent in some environments (e.g. your CI pipeline). Sure - if someone looks they’ll see this.

                      I’ve definitely (multiple times) had conversations along the lines of “why didn’t the tests catch that?”… and then found that the tests weren’t running correctly, in a way which didn’t flag a “fail” to the test runner. A plan doesn’t prevent all such failures, but it is a simple and cheap measure to improve things.

                  2. 2

                    The first one is to run them in parallel. The second one is to avoid positive interference between tests, where one test changes the global state in a way that causes a later test to pass.

                    If you are worried about this, they are integration or e2e tests, not unit tests.

                    1. 2

                      Whether you call them integration, e2e or unit tests, some functions use global variables. A function can even change from being pure to using a global variable without changing any part of its interface. I’m not sure that being super picky about the details here is productive. The intention of a test can be to test one unit, even if that unit happens to, say, transparently cache a resource or whatever in a way which means it isn’t pure in its implementation.

                      Hell, taken to the extreme, nothing which ever allocates memory is a “unit test” to you, since the memory allocator is one giant piece of global mutable state.
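
                      To make the “positive interference” case concrete, here is a small Python sketch. `load_config` and the module-level `_cache` are hypothetical stand-ins for any function that transparently caches into global state.

```python
_cache = {}  # module-level state shared by every test


def load_config():
    """Hypothetical function under test: fills a global cache as a side effect."""
    _cache["mode"] = "fast"
    return dict(_cache)


def test_load_config():
    assert load_config() == {"mode": "fast"}


def test_mode_is_cached():
    # Passes only if something earlier already populated the cache.
    assert _cache.get("mode") == "fast"


def run_in_order(tests):
    """Run tests against a fresh cache and report which ones passed."""
    _cache.clear()
    outcome = {}
    for test in tests:
        try:
            test()
            outcome[test.__name__] = True
        except AssertionError:
            outcome[test.__name__] = False
    return outcome


print(run_in_order([test_load_config, test_mode_is_cached]))
print(run_in_order([test_mode_is_cached, test_load_config]))
```

                      In the first order both tests pass; in the second, `test_mode_is_cached` fails, because it was only ever passing thanks to the state left behind by the other test.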

                      1. 2

                        I’m not sure that being super picky about the details here is productive.

                        Not just being disagreeable, but I think pickiness about this point is essential.

                        If you think “it’s all just tests, and sometimes they use global variables” it means you aren’t decomposing your system into pieces of pure logic and pieces with side-effects, and I’d argue that’s the single most important axis to be thinking about. Testing pure functions is an entirely different activity – and much easier to do – than testing anything else.

                        You can say it’s just a linguistic battle and people use “unit test” to mean different things, and empirically this is true, but it’s also common to reserve the term “unit testing” for pure functions, and the reason it’s often associated with mocking is that mocking is just a way to remove side effects. Regardless of how broadly you want to use that term, the distinction that the narrower meaning makes should inform your testing strategy and your code design.

                        1. 2

                          Alright, let’s make a hypothetical example to talk about something concrete.

                          You make a sorting function. You decide to use merge sort, so to sort an array, you need to allocate a temporary array. No problem, you consider memory allocation and freeing to be “pure”, so your merge sort is pure. You write a bunch of unit tests which show that your merge sort is correct, you build a bunch of functions and classes which use your sort function somehow, and you write unit tests for those functions and classes. So far, everything is pure and you have unit tests.

                          But at some point, you find that allocating and freeing memory in your sort function is a performance bottleneck. You change the implementation of your sort algorithm to keep around a threadlocal buffer, so you only need to allocate if your cached buffer is too small, and this new implementation resolves your performance issues.

                          What do you do in this situation? Do you redefine all your tests to be e2e or integration tests because they now rely on global state? Do you rewrite all tests which transitively or directly use your sort function with a different test runner whose focus is on integration tests rather than unit tests? Or do you recognize that your tests are still unit tests “in spirit”? Or do you do something else entirely? This question isn’t rhetorical, I’m genuinely curious.
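
                          For concreteness, a Python sketch of the “after” version of that hypothetical. Python is only used to show the shape (allocation is not the real bottleneck there), and `_scratch` and `_buffer` are invented names. The signature of `merge_sort` is the same as it would be for the pure version; only the implementation now touches thread-local state.

```python
import threading

_scratch = threading.local()  # per-thread cached buffer: hidden global state


def _buffer(n):
    """Return this thread's scratch list, growing it only when it is too small."""
    buf = getattr(_scratch, "buf", None)
    if buf is None or len(buf) < n:
        buf = [None] * n
        _scratch.buf = buf
    return buf


def _sort(a, lo, hi, buf):
    """Sort a[lo:hi] in place, merging through buf."""
    if hi - lo < 2:
        return
    mid = (lo + hi) // 2
    _sort(a, lo, mid, buf)
    _sort(a, mid, hi, buf)
    i, j, k = lo, mid, lo
    while i < mid and j < hi:
        if a[j] < a[i]:
            buf[k] = a[j]
            j += 1
        else:  # take from the left half on ties, which keeps the sort stable
            buf[k] = a[i]
            i += 1
        k += 1
    while i < mid:
        buf[k] = a[i]
        i += 1
        k += 1
    while j < hi:
        buf[k] = a[j]
        j += 1
        k += 1
    a[lo:hi] = buf[lo:hi]


def merge_sort(items):
    """Sort a list in place; callers cannot tell that a buffer is cached."""
    _sort(items, 0, len(items), _buffer(len(items)))


data = [5, 3, 8, 1]
merge_sort(data)
print(data)  # [1, 3, 5, 8]
```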

                          1. 1

                            It’s a good question, and really clarifies why we missed each other.

                            My first honest answer is that this problem simply does not arise in the kind of programming I typically do, which uses high-level memory-managed languages.

                            With that caveat to my qualifications, I think I’d do something like this…

                            1. Make a judgment about how likely a memory leak was in this situation, how tricky the management code was, and if it needed to be tested “as if it were global state” or not. If I could responsibly recognize they were still unit tests “in spirit”, I would treat them as such.
                            2. If I determined that was not responsible, I would treat the situation the same way I would treat verifying that a web-server or database could handle the load I needed. I’d set up “load tests” using multiple threads (or whatever made sense) and then make assertions on the memory usage and so on. These would very much not be unit tests any more, and I would not call them that.
                      2. 2

                        This is a good example of how confusing and useless the “unit” terminology has become. It seems at this point it’s better to “inline” the definition of unit, rather than use the word. I see three completely orthogonal definitions in this discussion:

                        • For you, it is pure-vs-impure distinction, which I 100% agree is the most practically important lens to view the tests through
                        • For mort, it is the size of unit-under-test: if this is a single function, then that’s a unit test (even if it uses a global variable); if it is several interacting subsystems, that’s an integration test. I would argue that this is the more traditional definition. Unit-testing had become popular before functionally-inspired thinking reached the mainstream; xunit frameworks are quite side-effect happy.
                        • For the article itself, I think it is just “automated testing”, as catch2 is a framework for that.

                        So:

                        • unit-test terminology is poison, avoid it (ranted about that here)
                        • perhaps we should seriously try to make “pure tests” an actual term people use? This should de-confuse the terminology a bit and also put the actually important thing into the spotlight?
                        1. 1

                          Great point. Completely agree with the post and your linked article.

                      3. 2

                        Tests should be run in random order by default.

                        Random and parallel by default!

                        1. 3

                          Parallel, yes, random, not so certain.

                          For me, the issue is reproducibility. One thing that has been the bane of my life are flaky tests (for whatever reason). So something that deliberately adds to this flakiness doesn’t help.

                          Whenever I do tests which include randomness (for instance Property based tests https://scalacheck.org/) I always end up specifying the seed so I can reproduce any issues.

                          1. 12

                            RSpec (and I suspect most test frameworks) lets you specify a seed for a run. When not specified it will pick one at random. So by default you have your tests run in random order, but when you have a flaky test you can specify the seed and run tests in that particular order to debug it. Best of both worlds.
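
                            The mechanism is small enough to sketch in a few lines of Python. This is not how RSpec implements it; `ordered_tests` is a hypothetical helper that only shows the idea of “random by default, reproducible on demand”.

```python
import random


def ordered_tests(names, seed=None):
    """Shuffle test names reproducibly; pick and report a seed if none is given."""
    if seed is None:
        seed = random.randrange(2**32)
    order = sorted(names)  # start from a stable order before shuffling
    random.Random(seed).shuffle(order)
    return seed, order


tests = ["test_login", "test_logout", "test_signup", "test_reset"]
seed, order = ordered_tests(tests)
print(f"seed={seed} order={order}")

# Re-running with the reported seed reproduces exactly the same order.
assert ordered_tests(tests, seed) == (seed, order)
```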

                            1. 7

                              If you’re able to reproduce test failures (by failures returning the seed and runs able to take the seed as a parameter), and if you treat failures as actual bugs to fix (whether that be a bug in the application or a bug in the tests), then I don’t have a problem with randomised tests that pass sometimes and fail other times.

                              After all, the point of a testsuite is to find bugs, not to be deterministic. So if randomness is an effective way to find bugs, then I’m all for it!

                              1. 1

                                After all, the point of a testsuite is to find bugs, not to be deterministic. So if randomness is an effective way to find bugs, then I’m all for it!

                                The issue I would bring up in the case of random test order is that while it does find bugs, the bugs it finds are often not in the application code under test – instead it tends to turn up bugs in the tests. And from there it gets into the debate about cost/benefit tradeoffs. If putting in the time and effort (which aren’t free) to do testing “the right way” offers only small marginal gains in actual application code quality over doing a bare-minimum testing setup – and often, the further you go into “the right way” the more you encounter diminishing returns on code quality – then should you be putting in the time and effort to do it “the right way”? Or do you get a better overall return from a simpler testing setup and spending that time/effort elsewhere?

                                1. 1

                                  Tests are code. Code can have bugs. Therefore, tests can have bugs.

                                  But as with any bug, it’s hard to say in the general case what the consequences are. Maybe it’s just a flaky test. Maybe it’s a bug that masks a bug in the application code.

                                  You’re also right that software quality is on a spectrum. A one-off script in bash probably doesn’t need any tests. And a formal correctness proof is very time/money-expensive. Another blog engine probably doesn’t need that level of correctness.

                                  Of all the things one can do to improve software quality (including test code quality) running tests in random order is not that expensive.

                              2. 2

                                For me, the issue is reproducibility. One thing that has been the bane of my life are flaky tests (for whatever reason). So something that deliberately adds to this flakiness doesn’t help.

                                Admittedly a truism, but: If the “test” is flaky, then it is not a test.

                                Whenever I do tests which include randomness (for instance Property based tests https://scalacheck.org/) I always end up specifying the seed so I can reproduce any issues.

                                Exactly. This is very good practice.

                                I have taken it a step further: When I build randomized tests, I have the failing test print out the source code of the new test that reproduces the problem.

                                1. 2

                                  Catch2 always prints its seed, so you can let it generate a random seed for your normal test runs, but if a test fails in CI or something, you can always look back at the log to see the seed and use that for debugging.

                                  If you can’t reproduce a failing test because of randomness, that’s a failure of the test runner to give you the information you need.

                                  1. 1

                                    Have you tried rr? Especially with its chaos mode enabled it’s really helpful for fixing nondeterministic faults (aka bugs).

                                    1. 1

                                      Do you mean this url?

                                      1. 1

                                        I do! (oops)

                                2. 1

                                  Unit tests aren’t good because they provide correctness. In fact, they are bad at providing correctness.

                                  Yep!

                                  This doesn’t mean they’re bad, it just means that unit tests are not a great proxy for actual correctness. When your unit tests also are really poor documentation, and prevent you from refactoring regularly, then I ask: why does this not deeply bother you?