Counterintuitively, I think (some schools of) TDD are actually on the same page as this. By using tests to define interfaces we basically define the parts that “rarely” change, though it might be more often than what the author considers “rarely”.
Can’t remember where, but some sources on TDD explicitly encouraged deleting tests if they a) are covered by other tests as well and b) do not specify behavior that you care about. That would leave us with tests similar to what the author would have, again ignoring different ideas of “won’t change”.
The major difference is that TDD still encourages writing these tests to drive out the API for the API handler or repository. It forces us to be deliberate about the component’s API; if we don’t care about that, then we don’t need to design it.
(To clarify, the author doesn’t mention TDD. These are my thoughts, and since the topics are not well differentiated I thought I’d share them.)
Yes, very much this. The way I like to conceptualize this is “test features, not code”. Another good one is “test at the system boundary”: https://www.tedinski.com/2019/03/19/testing-at-the-boundaries.html
One neat tactical trick here is that, if you want to unit test something deep in the guts of the system to exercise all the corner cases which are hard to reach from the outside, you can create the “thing that won’t change” yourself. Rather than writing each test directly against the API of the internal component, write a check driver function, which takes some data structure as input, feeds it to the component, and then checks against expected results. Each test then just calls this check with specific inputs.
As a result, tests are shielded from the component’s API changes. The API has only two usages: one in production, and one in the check function. This is in contrast to the typical situation, where you have one prod and ten test usages.
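(A minimal sketch of the idea in C++ with GoogleTest; parse_key_value is a made-up internal component, not something from this thread. The point is that check is the only test-side code that touches the component’s API.)
#include <gtest/gtest.h>
#include <optional>
#include <string>
#include <utility>

// Internal component under test (illustrative only).
std::optional<std::pair<std::string, std::string>>
parse_key_value(const std::string &line) {
  auto eq = line.find('=');
  if (eq == std::string::npos || eq == 0) return std::nullopt;
  return std::make_pair(line.substr(0, eq), line.substr(eq + 1));
}

// The single test-side usage of the component's API. If the signature changes,
// only this function changes; the tests below keep calling check().
static void check(const std::string &input, const std::string &expected) {
  auto parsed = parse_key_value(input);
  std::string got =
      parsed ? "key=" + parsed->first + ", value=" + parsed->second : "error";
  EXPECT_EQ(got, expected) << "input: " << input;
}

TEST(ParseKeyValue, Simple) { check("a=b", "key=a, value=b"); }
TEST(ParseKeyValue, EmptyValue) { check("a=", "key=a, value="); }
TEST(ParseKeyValue, MissingEquals) { check("ab", "error"); }
TEST(ParseKeyValue, MissingKey) { check("=b", "error"); }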
Can you elaborate on your check function approach? I’m having trouble imagining how that changes anything about testing an internal component.
If the API of your internal component changes, don’t you have to change the way you call this check function? Either the input data structure or the expected result values?
See an illustrative example here: https://matklad.github.io/2021/05/31/how-to-test.html#test-driven-design-ossification. And here are tests for tree-diffing functionality in rust-analyzer as a real-world example: https://github.com/rust-lang/rust-analyzer/blob/e9d3fe04844a8f335dec4e40f3ed8e5c4af90c32/crates/syntax/src/algo.rs#L255. Note how all the tests just call check-diff.
The example in the blog post seems very similar to https://github.com/golang/go/wiki/TableDrivenTests
This is great - I think I actually read this blog a while ago but couldn’t find it in my history to put in the further reading section, which explains how mine ended up starting with a similar anecdote about testing as a junior dev 😂.
Ah! Thank you. That clarifies it. So, this check function approach is more of a “buffer” against API changes: it’s going to reduce the tedium of mechanically updating many tests after somewhat-trivial API signature changes.
That’s clever. I assume that you must still use this approach, since you just posted here about it. Have you developed any kind of “rules of thumb” in your experience for when this approach works best or might not be worth it (e.g., “I know that any change to this signature will mean I need to actually rethink all my tests anyway”)?
That’s just the default approach I use. The driving rule of thumb is perhaps “the number of usages of an API in tests should not exceed the number of usages in prod”.
If I expect an “I need to rethink all my tests anyway” situation, I add expectation testing into the mix: https://matklad.github.io/2021/05/31/how-to-test.html#expect-tests
(Great to see a random post from a former colleague pop up!)
TDD and similar ideas have frustrated me for a long time, because they do tend to push people towards testing their implementation much more than their interfaces; “how can you know if a larger part works if you don’t know if its components do?” seems to be the default position people fall into.
As a Haskell developer, I’ve spent a lot of time looking at unit test suites for other languages and thinking “why are you testing that? your type system should make that impossible”, and coming to the realisation that for many languages, their test suite is their type system - but it is an incomplete, sometimes buggy one, limited by what the developer was able to predict could go wrong.
I’ve always tried to make programs where only valid business logic is possible, and doing this relies heavily on the use of sum types to be precise about what is allowable - I was talking earlier today to someone on IRC about the difference between validating (a.k.a. writing code that works with bools to gate progress to other code) and parsing (writing code that produces valid values - perfectly summed up in Alexis King’s Parse, don’t validate).
At a previous job, we did this religiously: data coming into the program was parsed (in this case from SQS queues), such that we knew in the rest of the system that all our preconditions had been satisfied, and only valid data was allowed within the inner shell of the app. This meant that most of our tests were actually around serialisation, to ensure that a) the given inputs produced the expected internal representation (unit test), b) a given output type produced the expected serialisation (JSON) (unit test) and c) that if a type was used for both input and output, serialisation in both directions round-tripped (property test: decode . encode == identity).
There were more tests than this, specifically for cases where precisely encoding what was allowed would have become impractical, but there was only a small number of those.
The rest of the program was basically implemented in a way that it was very difficult to write incorrect code, all alternatives were represented as subtypes, and invariants were encoded this way too. All functions were then made total, and generally any change that was needed was 95% mechanical - make the code part of the change, and follow the compiler errors until there weren’t any. The maintainability of the project was fantastic, and we routinely made large refactors fearlessly.
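(A rough sketch of that parse-at-the-boundary shape, in C++ rather than Haskell to match the other snippets in this thread; OrderMessage and parse_order are invented for illustration.)
#include <optional>
#include <string>
#include <utility>

// The only way to obtain an OrderMessage is through parse_order, so any code
// that receives one can rely on its invariants without re-checking them.
class OrderMessage {
 public:
  const std::string &customer_id() const { return customer_id_; }  // never empty
  int quantity() const { return quantity_; }                       // always > 0

 private:
  OrderMessage(std::string id, int qty)
      : customer_id_(std::move(id)), quantity_(qty) {}
  friend std::optional<OrderMessage> parse_order(const std::string &id, int qty);

  std::string customer_id_;
  int quantity_;
};

// Parse (produce a valid value or nothing) rather than validate (return a bool
// and hope every caller remembered to check it).
std::optional<OrderMessage> parse_order(const std::string &id, int qty) {
  if (id.empty() || qty <= 0) return std::nullopt;
  return OrderMessage(id, qty);
}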
LLVM’s testing infrastructure has gradually nudged me in the opposite direction: tests tell you what you’ve changed. Tests may check some internal dependencies, but when they fail you’re the best person to judge whether you actually meant to change that behaviour. For example, I once made a change that made a bunch of tests fail because register allocation happened in a different order. That’s the sort of thing the article says a test shouldn’t check (as long as it’s ABI compliant, I don’t care which registers are being used). But it turned out that I’d also changed something in how alignment was calculated, which triggered some failures on other targets and was a real bug in my code - one I was able to fix before pushing.
I think the main lesson that I’ve learned about testing is to treat your tests as a software engineering artefact just like the rest of your code. They should be well documented and easy to modify. If they test an implementation detail then changing the tests should be as easy as changing that aspect of the implementation. When I change the implementation and I get test failures, I should be able to look at that test and see a comment telling me precisely what it’s testing and why. I can then decide if I want to update it for the new implementation or just delete it.
If nothing else, changing the implementation in such a way that a test can observe it probably needs to end up in my release notes because some other consumer may be relying on that behaviour. For example, if I previously returned an array that happened to be sorted by one attribute and now return something sorted with a different criterion or not sorted at all, that may not be part of my documented API contract, but it might be something that other code depends on.
There’s totally room for such tests, e.g. if you really just want to lock in some super specific functionality for whatever reason. But I am against tests like these being your primary kind of test.
The article mentions the “testing trophy”, which is an alternative to the pyramid. Basically unit tests are still used, just not as much. That’s my happy medium.
This is in line with what Vladimir Khorikov advises in his book “Unit Testing Principles, Practices, and Patterns” (and blog).
Mocks tend to produce brittle tests (i.e. tests that break due to refactorings: changes to implementation details that don’t change the behavior of a piece of code), so the recommendation is to only mock what he calls “Shared, unmanaged out-of-process dependencies”, mainly because these tend not to change much (they have a strong requirement of backwards compatibility), making it less likely that the mocks will need to be updated. By doing this, I normally end up mostly with integration tests.
By that author, this is one of the best posts ever: https://enterprisecraftsmanship.com/posts/growing-object-oriented-software-guided-by-tests-without-mocks/
In it, he takes a real design posed by the authors of the Growing Object-Oriented Software Guided by Tests book (pretty much the book that really proposed mocking as a primary testing technique), and proposes what he feels is a design which eliminates most of the mocks. The result is much simpler (subjective, of course).
Now, GooS is an amazing book. I’m not knocking the book, since they have many amazing points in it, and there is a wealth of knowledge to be taken from it. But what I love about this response post is that it’s not theoretical. We need non-trivial projects like this to compare multiple different designs for the same realistic project.
Some of this just comes down to design sensibility too, which is relatively subjective. Not everyone clicks with the immutable architecture, but for me it’s fantastic and aligns with my sensibilities.
I just started that book last night! So far I’m digging it.
Highly recommended! I was trying to find some answers as to what constitutes good tests and when it is and isn’t appropriate to mock, and found one of the author’s blog posts tackling a bit of the second, so eventually I gave the book a try. I really liked how he establishes some principles early on which are easy to agree on, and then from those all the rest follows.
That’s exactly what I’m looking for! After years of reading bits and pieces and just general development work I felt like it was time to read a more complete account of how to test well. Glad to get another positive review.
You won’t discover leaks and other internal problems.
If we were to unit test this according to the pyramid, maybe we’d end up with:
Testing the POST handler against a mocked API client and database repository
Testing our API client’s methods against a mocked http client
Testing the database repository against a mocked ORM library
It seems to me that the bulk of his problem with heavy unit-tests comes from over-reliance on mocks. Mocks have their place (though I prefer to use other kinds of test doubles wherever possible), but they have the effect of multiplying the cost of an interface change by 2 or more. (With a mock, an interface change means changing both the client of the interface, and the test code for that client. IME this is often more than 2x the cost of the change to production code alone, because test code has a tendency to grow larger than production code. Other kinds of test doubles don’t have this issue as often, since they can be more easily shared across tests.)
It seems to me that the bulk of his problem with heavy unit-tests comes from over-reliance on mocks.
I’ve pretty much soured on test doubles. They’re often a pain to set up (especially when dealing with nested class structures), requiring a lot of boilerplate, and typically they only implement the part of the API that is actually being called. Many libraries have such a big surface API that faking or stubbing them completely and correctly would be a project unto itself. And of course, there lies one problem - if you’re only making them respond to the method calls that the implementation is using, you’ve just created an extremely tightly coupled system. If you change the implementation to use different method calls, you have to go back into your tests and update all the mocks as well.
And even if you did manage to fix all the method fakes to correctly respond to your usage, especially in dynamic languages you are quite likely to end up with a mock or a stub that returns data in a different shape to the actual API. Or because it is necessarily simpler than the real thing, it will be more permissive and accept some inputs that the real system would raise an exception for. I’ve seen projects where the large and very extensive test suite passed with flying colours, but when using it in a real project against the real API, it would just fail, because the test suite was subtly incorrect in the implementation of its test doubles.
So, if at all possible, use the real library. Of course, if it does external network requests or other side-effects that are not desirable in a test situation, you’ll have to make some sort of test double. Perhaps you can set up a fake server, or an in-memory database so that you’re at least drawing the line at the inter-system boundary. But, I have to admit, in some situations, a mock or stub is really the quickest way to test something. But then you’ll have to bear in mind that often the integration point where the calls end in a mock (and would go out to the real system in production) is where the bugs are at.
For what it’s worth, I’ve noticed that these sorts of issues crop up more in systems with elaborate usages of types (think Java-esque class hierarchies and interfaces). There you have to jump through more hoops to accept and return exactly the right type of object in each context. In general, designs in a functional style make testing easier because there are fewer side-effects. It’s mostly the side-effects that you have to mock at all, so that your test suite doesn’t make external network calls. When designing an API, it’s always good to keep this in mind. If the system is more functional, it’s easier to test all the moving parts in isolation, and you’ll only have to mock the “edges” of the system.
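(A tiny sketch of that “functional core, mock only the edges” split, in C++; the names are invented for illustration.)
#include <vector>

// Pure "core" logic: trivially testable with plain assertions, no doubles needed.
int total_price_cents(const std::vector<int> &item_prices_cents, int discount_percent) {
  int total = 0;
  for (int p : item_prices_cents) total += p;
  return total - (total * discount_percent) / 100;
}

// The side-effecting "edge", kept behind a narrow interface; this is the only
// part that ever needs a test double.
struct PaymentGateway {
  virtual ~PaymentGateway() = default;
  virtual void charge(int amount_cents) = 0;
};

void checkout(const std::vector<int> &item_prices_cents, int discount_percent,
              PaymentGateway &gateway) {
  gateway.charge(total_price_cents(item_prices_cents, discount_percent));
}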
I’ve pretty much soured on test doubles. They’re often a pain to set up (especially when dealing with nested class structures), requiring a lot of boilerplate, and typically they only implement the part of the API that is actually being called.
It’s possible that I’m misusing the term “test doubles”, but I’m thinking of a particular project in which we nearly eliminated all these problems by moving most of the boilerplate into a common location. For example, we weren’t religious about the Law of Demeter (I’m still not a fan), so we’d have APIs like:
class Foo {
 public:
  virtual Bar &get_bar();
};
where the obvious approach to mocking would lead to boilerplate like this, repeated through the test code:
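(A sketch of the kind of per-test setup being described, building on the Foo/Bar declarations above and assuming a gMock-style mocking API with a Bar that declares a virtual do_thing(int); MockFoo, MockBar, do_thing and run_client_code are illustrative names, not from the original comment.)
#include <gmock/gmock.h>

class MockBar : public Bar {
 public:
  MOCK_METHOD(void, do_thing, (int value), (override));
};

class MockFoo : public Foo {
 public:
  MOCK_METHOD(Bar &, get_bar, (), (override));
};

TEST(Client, DoesSomething) {
  MockFoo foo;
  MockBar bar;
  // This wiring gets repeated in every test that needs to reach bar through foo.
  EXPECT_CALL(foo, get_bar()).WillRepeatedly(::testing::ReturnRef(bar));
  EXPECT_CALL(bar, do_thing(42));
  run_client_code(foo);  // the code under test
}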
But this is easy to avoid by factoring that out into a test support class, and instantiating that instead of instantiating the mock objects directly.
And now an interface change gets 1 additional place to modify outside the production code (the TestFoo implementation), not N. (I realize this example is using a mocking library, but a) that’s not what we actually did in the project I’m thinking of, but I’m already conscious of spending too much text on this, so I’m trying to minimize the context I need to provide; and b) it’s doing so as an implementation detail of a different kind of test double.)
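(One possible shape for such a test support class, sticking with the gMock-style syntax from the sketch above purely for illustration; the commenter’s real project reportedly did something different.)
// Owns the doubles and wires up the default plumbing once, so individual tests
// only have to state the expectations they actually care about.
class TestFoo {
 public:
  TestFoo() {
    ON_CALL(foo, get_bar()).WillByDefault(::testing::ReturnRef(bar));
  }

  MockFoo foo;
  MockBar bar;
};

TEST(Client, DoesSomethingWithSupportClass) {
  TestFoo tf;
  EXPECT_CALL(tf.bar, do_thing(42));
  run_client_code(tf.foo);
}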
The biggest issue I’ve seen with test double usage is picking their seams based on what’s convenient for the implementation, rather than finding a more natural seam (or joint) for the problem at hand.
So the examples you gave above are interesting, because it sounds like they’re interfaces defined in terms of the implementation specifics (e.g. the “mocked ORM library”) rather than saying “a thing that can store the information I care about, and read it back later”.
I agree wholeheartedly. The value prop against super isolated unit testing just isn’t there, especially not in large systems. Every time I want to actually change a design, it involves complete test suite surgery. Unit testing zealots will reply to that: “well you’re not doing it right. If your tests have to change, you’re not listening to the coupling that they were telling you about! And your design is bad!”
It’s purely mythical. Maybe, maybe if you get your interfaces designed incredibly well from the beginning, they can support evolution over time. But all of the large refactors / redesigns I ever worked on were so far past what unit tests can support that it’s silly. Example: switching from querying data on the fly to precomputing it, or moving logic from in-memory into a query for performance reasons. Anytime you do this, you’ll be modifying 10 different test files and changing the specification of internal components that don’t affect the external behavior of the system.
Then, there’s the practical side. I feel like a big reason unit testing has been so popular is simply speed. Unit tests are easy for the developer to write, and they’re easy to execute quickly. So we feel like we’re getting work done. But years and years have passed: machines have gotten quicker, and where single cores aren’t much quicker we have more opportunities for parallelism. Now, think about property-based testing, which is embarrassingly parallelizable. We already see things like Jepsen which apply property-based testing at the system level. You can scale it up by simply running the same test on more workers. That alleviates a huge part of the penalty of integration tests, and it’s where I’m banking on the future heading.