These rules make perfect sense in a closed, Google-like ecosystem. Many of them don’t make sense outside of that context, at least not without serious qualifications. The danger with articles like this is that they don’t acknowledge the contextual requirements that motivate each practice, which leaves them liable to be cargo-culted into situations where they do more harm than good.
Automate common tasks
Absolutely — unless building and maintaining that automation takes more time than just doing it manually. Which tends to happen, especially when you don’t have a team dedicated to infrastructure, and spending time on automation necessarily means not spending time on product development. Programmers love to overestimate the cost of toil, and the benefit of avoiding it; and to underestimate the cost of building and running new software.
Stubs and mocks make bad tests
Stubs and mocks are tools for unit testing, just one part of a complete testing breakfast. Without them, it’s more difficult to achieve encapsulation, build strong abstractions, and keep complex systems coherent. You need integration tests, absolutely! But if you just have integration tests, you’re stacking the deck against yourself architecturally.
Small frequent releases
No objection.
Upgrade dependencies early, fast, and often
Big and complex dependencies, subject to CVEs, and especially if they interface with out-of-process stuff that may not retain a static API? Absolutely. Smaller dependencies, stuff that just serves a single purpose? It’s make-work, and adds a small amount of continuous risk to your deployments — even small changes can introduce big bugs that skirt past your test processes — which may not be the best choice in all environments.
Expert makes everyone’s update
(Basically: update your consumers for them.) This one in particular is so pernicious. The relationship between author and consumer is one to many, with no upper bound on the many. Authors always owe some degree of care and responsibility to their consumers, but not, like, total fealty. That’s literally impossible in open ecosystems, and even in closed ones, taking it to this extreme rarely makes sense on a cost/benefit basis. Software is always an explorative process, and needs to change to stay healthy; extending authors’ domain of responsibility literally into the codebases of their consumers makes change just enormously difficult. That’s appropriate in some circumstances, where the cost of change is very high! But the cost of change is not always very high. Sometimes, often, it’s more important to let authors evolve their software relatively unconstrained than to bind them to Hyrum’s Law.
Stubs and mocks are tools for unit testing, just one part of a complete testing breakfast. Without them, it’s more difficult to achieve encapsulation, build strong abstractions, and keep complex systems coherent.
extending authors’ domain of responsibility […] into the codebases of their consumers […] is appropriate in some circumstances […]
The second bullet here rebuts the first if you squint a little. When subsystems have few consumers (the predominant case for integration tests), occasionally modifying a large number of tests is better than constantly relying on stubs and mocks.
You can’t just dream up strong abstractions on a schedule. Sometimes they take time to coalesce. Overly rigid mocking can prematurely freeze interfaces.
I’m afraid I don’t really understand what you’re getting at here. I want to! Do you maybe have an example?
You can’t just dream up strong abstractions on a schedule. Sometimes they take time to coalesce. Overly rigid mocking can prematurely freeze interfaces.
I totally agree! But mocking at component encapsulation boundaries isn’t a priori rigid, I don’t think?
When subsystems have few consumers (the predominant case for integration tests), occasionally modifying a large number of tests is better than constantly relying on stubs and mocks.
I understand integration tests as whole-system, not subsystem. Not for you?
I need to test one subsystem. I could either do that in isolation using mocks to simulate its environment, or in a real environment. That’s the trade-off we’re talking about, right? When you say “integration tests make it difficult to achieve encapsulation” I’m not sure what you mean. My best guess is that you’re saying mocks force you to think about cross-subsystem interfaces. Does this help?
What is a subsystem? Is it a single structure with state and methods? A collection of them? An entire process?
edit:
I need to test one subsystem. I could either do that in isolation using mocks to simulate its environment, or in a real environment. That’s the trade-off we’re talking about, right? When you say “integration tests make it difficult to achieve encapsulation” I’m not sure what you mean.
Programs are a collection of components that provide capabilities and depend on other components. So in the boxes-and-lines architecture diagram sense, the boxes. They encapsulate the stuff they need to do their jobs, and provide their capabilities as methods (or whatever) to their consumers. This is what I’m saying should be testable in isolation, with mocks (fakes, whatever) provided as dependencies. Treating them independently in this way encourages you to think about their APIs, avoid exposing internal details, etc. etc. — all necessary stuff. I’m not saying integration tests make that difficult, I’m saying if all you have is integration tests, then there’s no incentive to think about componentwise APIs, or to avoid breaking encapsulation, or whatever else. You’re treating the whole collection of components as a single thing. That’s bad.
If you mean subsystem as a subset of inter-related components within a single application, well, I wouldn’t test anything like that explicitly.
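To make “testable in isolation, with fakes provided as dependencies” concrete, here’s a minimal sketch in Go; every name in it (Mailer, Notifier, fakeMailer) is invented for illustration, not anything from the article:

```go
package notify

import "fmt"

// Mailer is the capability this component depends on. Production
// code would pass in an SMTP-backed implementation.
type Mailer interface {
	Send(to, body string) error
}

// Notifier is the "box": it encapsulates its own logic and receives
// its dependencies explicitly, so tests can swap them out.
type Notifier struct {
	mailer Mailer
}

func NewNotifier(m Mailer) *Notifier {
	return &Notifier{mailer: m}
}

func (n *Notifier) Welcome(user string) error {
	return n.mailer.Send(user, fmt.Sprintf("Welcome, %s!", user))
}
```

And the corresponding test file, exercising the component in isolation:

```go
package notify

import "testing"

// fakeMailer is a fake: a tiny working implementation that records
// what was sent instead of talking to a real server.
type fakeMailer struct{ sent []string }

func (f *fakeMailer) Send(to, body string) error {
	f.sent = append(f.sent, to+": "+body)
	return nil
}

func TestWelcome(t *testing.T) {
	f := &fakeMailer{}
	n := NewNotifier(f)
	if err := n.Welcome("alice"); err != nil {
		t.Fatal(err)
	}
	if len(f.sent) != 1 {
		t.Fatalf("want 1 message sent, got %d", len(f.sent))
	}
}
```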
All I mean by it is something a certain kind of architecture astronaut would use as a signal to start mocking :) I’ll happily switch to “component” if you prefer that. In general conversations like this I find all these nouns to be fairly fungible.
More broadly, I question your implicit premise that encapsulation and whatnot is something to pursue as an end in itself. When I program I try to gradually make the program match its domain. My programs tend to start out as featureless blobs and gradually acquire structure as I understand a domain and refactor. I don’t need artificial pressures to progress on this trajectory. Even in a team context, I don’t find teams that use them to be better than teams that don’t.
I wholeheartedly believe that tests help inexperienced programmers learn to progress on this trajectory. But unit vs integration is in the noise next to tests vs no tests.
But unit vs integration is in the noise next to tests vs no tests.
My current company is a strong counterpoint to this.
Lots of integration tests, which have become sprawling, slow, and flaky.
Very few unit tests – not coincidentally, the component boundaries are not crisp, how things relate is hard to follow, and dependencies are not explicitly passed in (so you can’t use fakes). Hence unit tests are difficult to write. It’s a case study in the phenomenon @peterbourgon is describing.
I’ve experienced it as well. I’ve also experienced the opposite, codebases with egregious mocking that were improved by switching to integration tests. So I consider these categories to be red herrings. What matters is that someone owns the whole, and takes ownership of the whole by constantly adjusting boundaries when that’s needed.
So I consider these categories to be red herrings.
I don’t think this follows though. IME, the egregious mocking always results from improper application code design or improper test design. That is, any time I’ve seen a component like that, the design (of the component, of the tests themselves, or of higher parts of the system in which the component is embedded) has always been faulty, and the hard-to-understand mocks would melt away naturally once that was fixed.
What matters is that someone owns the whole, and takes ownership of the whole by constantly adjusting boundaries when that’s needed.
Per the previous point, ownership alone won’t help if the owner’s design skills aren’t good enough. I see no way around this, though I wish there were.
More broadly, I question your implicit premise that encapsulation and whatnot is something to pursue as an end in itself. When I program I try to gradually make the program match its domain. My programs tend to start out as featureless blobs and gradually acquire structure as I understand a domain and refactor. I don’t need artificial pressures to progress on this trajectory. Even in a team context, I don’t find teams that use them to be better than teams that don’t.
This is a fine process! Follow it. But when you put your PR up for review or whatever, this process needs to be finished, and I need to be looking at well-thought-out, coherent, isolated, and, yes, encapsulated components. So I think it is actually a goal in itself. Technically it’s meant to motivate coherence and maintainability, but I think it’s an essential part of those things, not just a proxy for them.
Stubs and mocks are tools for unit testing, just one part of a complete testing breakfast. Without them, it’s more difficult to achieve encapsulation, build strong abstractions, and keep complex systems coherent. You need integration tests, absolutely! But if you just have integration tests, you’re stacking the deck against yourself architecturally.
Traditional OO methodology encourages you to think of your program as loosely coupled boxes calling into each other, where a unit test should focus on exactly one box and stub out all the other boxes. But it’s not a suitable model for everything.
Consider a simple function for calculating factorial of n: when you write a unit test for it, you wouldn’t stub out the * operation, you take it for granted. But in a pure OO sense, the * operation is a distinct “box” that the factorial function is calling into, so a unit test that doesn’t stub out * is technically an integration test, and a “real” unit test should stub it out too. But we know that the latter is just meaningless (you’ll essentially be re-implementing *, but for a small set of operands in the stubs) and we still happily call the former a unit test.
A more suitable model for this scenario is to think of some dependencies as implementation details, and instead of stubbing them out, use either the real thing or something that replicates its behavior (called “fakes” at Google). These boxes might still be dependencies in a technical sense (e.g. subject to dependency injection), but they should be considered “hidden” in an architectural sense. The * operation in the former example is one such dependency. If you are unit testing some web backend, databases often fall into this category too.
Still, the real world is quite complex, and there are often cases that straddle the line between a loosely-coupled-box dependency and a mostly-implementation-detail dependency. Choosing between them is a constant tradeoff and requires evaluation of usage patterns. Even the * operation could cross over from the latter category to the former, if you are implementing a generic function that supports both real number multiplications and matrix multiplications, for example.
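A sketch of the factorial case in Go, just to pin the argument down (the code is mine, not the commenter’s): the test takes * for granted, and everyone happily calls it a unit test.

```go
package mathx

import "testing"

// Factorial computes n! iteratively. The built-in * operation is an
// implementation detail, not an injected dependency.
func Factorial(n int) int {
	result := 1
	for i := 2; i <= n; i++ {
		result *= i
	}
	return result
}

// A unit test by any reasonable definition, even though it
// "integrates" with the multiplication operator.
func TestFactorial(t *testing.T) {
	for n, want := range map[int]int{0: 1, 1: 1, 5: 120} {
		if got := Factorial(n); got != want {
			t.Errorf("Factorial(%d) = %d, want %d", n, got, want)
		}
	}
}
```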
Consider a simple function for calculating factorial of n: when you write a unit test for it, you wouldn’t stub out the * operation, you take it for granted. But in a pure OO sense, the * operation is a distinct “box” that the factorial function is calling into, so a unit test that doesn’t stub out * is technically an integration test, and a “real” unit test should stub it out too.
Imo this is a misunderstanding (or maybe that’s what you’re arguing?). You should only stub out (and use DI for) dependencies with side effects (DB calls, network calls, File I/O, etc). Potentially if you had some really slow, computationally expensive pure function, you could stub that too. I have never actually run into this use-case but can imagine reasons for it.
But in a pure OO sense, the * operation is a distinct “box” that the factorial function is calling into, so a unit test that doesn’t stub out * is technically an integration test
Well, these terms aren’t well defined, and I don’t think this is a particularly useful definition. The distinct boxes are the things that exist in the domain of the program (i.e. probably not language constructs) and act as dependencies to other boxes (i.e. parameters to constructors). So if factorial took multiply as a dependency, sure.
instead of stubbing them out, use either the real thing or something that replicates its behavior
Names, details, it’s all fine. The only thing I’m claiming is important is that you’re able to exercise your code, at some reasonably granular level of encapsulation, in isolation.
If you have a component that’s tightly coupled to the database with bespoke SQL, then consider it part of the database, and use “the real thing” in tests. Sure. Makes sense. But otherwise, mocks (fakes, whatever) are a great tool to get to this form of testability, which is in my experience the best proxy for “code quality” we’ve got.
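As a sketch of the “consider it part of the database” option, in Go, assuming SQLite via the mattn/go-sqlite3 driver purely for illustration: the component’s bespoke SQL runs against a real in-memory database instead of being mocked away.

```go
package store

import (
	"database/sql"
	"testing"

	_ "github.com/mattn/go-sqlite3" // illustrative driver choice
)

// UserStore is tightly coupled to the database with bespoke SQL,
// so tests treat the database as part of the component.
type UserStore struct{ db *sql.DB }

func (s *UserStore) Add(name string) error {
	_, err := s.db.Exec(`INSERT INTO users (name) VALUES (?)`, name)
	return err
}

func (s *UserStore) Count() (int, error) {
	var n int
	err := s.db.QueryRow(`SELECT COUNT(*) FROM users`).Scan(&n)
	return n, err
}

func TestUserStore(t *testing.T) {
	// "The real thing": an in-memory SQLite database, so the SQL
	// actually executes rather than being stubbed.
	db, err := sql.Open("sqlite3", ":memory:")
	if err != nil {
		t.Fatal(err)
	}
	defer db.Close()
	if _, err := db.Exec(`CREATE TABLE users (name TEXT)`); err != nil {
		t.Fatal(err)
	}

	s := &UserStore{db: db}
	if err := s.Add("alice"); err != nil {
		t.Fatal(err)
	}
	n, err := s.Count()
	if err != nil {
		t.Fatal(err)
	}
	if n != 1 {
		t.Fatalf("got %d users, want 1", n)
	}
}
```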
I find this top-level comment a compelling reply and want to expand on it a little bit.
Automate common tasks
In light of @peterbourgon’s remarks, all I’d say is that you automate what you can. If it’s common but not a time sink and the infra to automate it will be a burden, maybe it’s fine to not automate it.
Small frequent releases. […] Weekly at slowest. Daily is great. Hourly is dreamy.
This is highly dependent on the product. Here’s the thing: our customers don’t even want releases anywhere near this fast. Also, our product takes about 10 hours just to build and test. Full regression tests take days. If a build is considered to be a “release”, then we can do daily at best. The number of commits per week to the code base is very low, so frequent releases don’t do much for us. Add to this that some of the changes we make depend on slow upstream sources, and frequent releases become basically pointless.
So, like pretty much everything in software, small frequent releases? Maybe.
I’ve also experienced the opposite, codebases with egregious mocking that were improved by switching to integration tests.
Agreed, I’ve seen this too.
Imo this is a misunderstanding (or maybe that’s what you’re arguing?). You should only stub out (and use DI for) dependencies with side effects (DB calls, network calls, File I/O, etc).
I think we’re broadly in agreement.
Obligatory relevant XKCDs:
Nope.
Why I don’t mock
That mocks are tools for unit testing is a statement of fact?
I don’t think we’re talking about the same thing.
Mocks are tools for unit testing the same way hammers are tools for putting in screws.
A great way to make pilot holes so you don’t split your board while putting the screw in?
A great way to split hairs without actually putting a screw in? ¯\_(ツ)_/¯
You seem way more interested in dropping zingers than actually talking about your position.
I already spelled out my position in detail in my linked article, which echoes the experience that the Google book from TFA talks about.
Should I copy-paste it here?
Mocks are largely a unit-testing anti-pattern, they can easily make your tests worse than useless, because you believe you have real tests, but you actually do not. This is worse than not having tests and at least knowing you don’t have tests. (It is also more work). Stubs have the same structural problems, but are not quite as bad as mocks, because they are more transparent/straightforward.
Fakes are OK.
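A minimal sketch of the failure mode being claimed here, with invented names: the mock encodes the test author’s assumption about the dependency, so the test passes no matter how the real dependency behaves.

```go
package billing

import "testing"

type Gateway interface {
	Charge(cents int) error
}

func Checkout(g Gateway, cents int) error {
	return g.Charge(cents)
}

// mockGateway encodes the test author's assumption: Charge always
// succeeds. If the real gateway rejects, say, zero amounts, this
// test still passes -- it verifies the assumption, not the behavior.
type mockGateway struct{ calls []int }

func (m *mockGateway) Charge(cents int) error {
	m.calls = append(m.calls, cents)
	return nil
}

func TestCheckout(t *testing.T) {
	m := &mockGateway{}
	if err := Checkout(m, 0); err != nil { // a real gateway might reject 0
		t.Fatal(err)
	}
	if len(m.calls) != 1 || m.calls[0] != 0 {
		t.Fatalf("unexpected calls: %v", m.calls)
	}
}
```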
Mocks, stubs, fakes — substitutes for the real thing. Whatever. They play the same role.
They are not the same thing and do not play the same role.
I recommend that you learn why and how they are different.
I understand the difference, it’s just that it’s too subtle to be useful.
I humbly submit that if you think the difference is too subtle to be useful, then you might not actually understand it.
Because the difference is huge. And it seems that Google Engineering agrees with me. Now, the fact that Google Engineering believes something doesn’t automatically make it right; they can mess up like anyone else. On the other hand, they have a lot of really, really smart engineers, and a lot of experience building a huge variety of complex systems. So it seems at least conceivable that all of us (“Google and Me”, LOL) might have, in the tens or hundreds of thousands of engineer-years, figured out that a distinction that may seem very subtle on the surface is, in fact, profound.
Make of that what you will.
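For anyone following along, here is one common way the distinction gets drawn, sketched in Go with invented names. Treat these definitions as one school of thought (roughly Martin Fowler’s “Mocks Aren’t Stubs” taxonomy), not as settled vocabulary:

```go
package kv

// Store is the dependency under discussion.
type Store interface {
	Get(key string) (string, bool)
}

// Stub: hands back canned answers; no logic, no assertions. The test
// only cares that something plausible comes back.
type stubStore struct{}

func (stubStore) Get(string) (string, bool) { return "canned", true }

// Mock: records interactions so the test can assert on *how* the
// dependency was called, not just on the final result.
type mockStore struct{ gets []string }

func (m *mockStore) Get(key string) (string, bool) {
	m.gets = append(m.gets, key)
	return "", false
}

// Fake: a real, working implementation, just lightweight -- an
// in-memory map standing in for a database.
type fakeStore map[string]string

func (f fakeStore) Get(key string) (string, bool) {
	v, ok := f[key]
	return v, ok
}
```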
I’m sure we’re not talking about the same thing.
I shall summon @hwayne - weren’t you into the empirical side of computer science and software engineering? Where would you look? :)
Welp I’m officially nerd sniped, gonna try to do a very broad writeup tomorrow. tl;dr most of the research is pretty rotten and I don’t think there’s strong evidence either way.
Don’t feel like you need to come up with an answer. I appreciate you digging in, but I really just wondered where to start looking.
Here ya go: I ****ing Hate Science