1. 7
  1. 12

    Let’s talk about the cost of unit tests. In my experience, unit test suites with good coverage end up 3-5x the size of the production code. Also in my experience, a medium-sized company about 10 years old will have 50,000 - 100,000 individual, handwritten test cases. At 50,000, that means one person has to add ~14 test cases a day (50,000 / 3,650 ≈ 13.7), 365 days a year, for those 10 years. As test coverage grows, the probability that a design change requires substantial test suite updates goes through the roof, so the test suite also increases development time.

    Unit test suites are not free, and in practice they are actually quite costly. They also rarely catch bugs related to data combinations, since that requires even more elbow grease in covering branches multiple times over. And we know that branch coverage does not equal correctness because of data combinations.
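
    To make the last point concrete, here is a minimal sketch (the `units_per_shard` function and its flags are invented for illustration, not taken from any real codebase). The first two tests take every branch in both directions, so a coverage tool would report 100% branch coverage, yet the one input combination they never try still crashes:

    ```rust
    /// Hypothetical example: splits 100 units across a number of shards.
    /// Each flag removes one shard from the default of two.
    pub fn units_per_shard(drop_first: bool, drop_second: bool) -> u32 {
        let mut shards = 2;
        if drop_first {
            shards -= 1;
        }
        if drop_second {
            shards -= 1;
        }
        // Bug: nothing guards against `shards == 0`.
        100 / shards
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        // Together these two tests take both `if`s in both directions,
        // which already counts as full branch coverage.
        #[test]
        fn drops_only_first() {
            assert_eq!(units_per_shard(true, false), 100);
        }

        #[test]
        fn drops_only_second() {
            assert_eq!(units_per_shard(false, true), 100);
        }

        // Only testing the combination exposes the divide-by-zero panic.
        #[test]
        #[should_panic]
        fn drops_both() {
            units_per_shard(true, true);
        }
    }
    ```

    With n independent flags there are 2^n combinations, which is where the extra elbow grease comes from.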


    Once again, I question whether it was a problem due to mocking or misusing mocks. 

    This is oft-repeated, but unfortunately is just condescending at this point. If everyone is “misusing mocks,” maybe they are too sharp a tool to be practically useful? If a technique requires everyone to be a genius, it’s not a robust or widely applicable technique. This is maximalism at its finest.

    On the surface that looks like an argument in favor of integration tests, since integration tests focus on the outer layer where many of these implementation details don’t leak. 

    Once again, I ask is it a problem with unit tests, or is it a problem with the way we are implementing unit tests?

    The author is almost seeing the light here - there’s a logical impossibility with unit tests. How can you perfectly choose boundaries ahead of time such that they never need to change and they also allow covering all edge cases? The answer is only to abstract more so that the underlying implementation can change more freely, but instead of going that way they appeal to the magical savior of “implementing unit tests right.”

    It is snake oil.

    Based on all of this, I believe the test diamond is correct. Unit tests have their place, I just don’t believe they should be the largest portion of your test suite. Instead of arguments about only the benefits of unit testing, I would like to see more acknowledgement of their costs as well.

    1. 1

      Great response. Test pragmatism is vital. In the wild you rarely see such cleanly hewn code that “the best” way to test it is obvious. There really is no best; just a bunch of tradeoffs to inaccurately weigh. Hopefully you come to a pragmatic decision about what mix of tradeoffs between time/cost, long term durability, and overall coverage you and your business can abide.

    2. 5

      If you redefine “unit” to mean basically what is commonly understood to be “integration”, then yeah, unit tests are great. Test larger areas of code, at the boundaries, without coupling to implementation details, and without misusing mocks. It’s still the testing diamond, just with some labels changed.

      1. 2

        This article would benefit tremendously from not using “unit” and “integration” words. The two words are hopelessly confused (https://matklad.github.io/2022/07/04/unit-and-integration-tests.html) and the following

        What does the unit in unit tests mean? It means a unit of behavior.

        in the middle of the article does add quite a bit to that.

        1. 1

          This is one of those hopeless fights unfortunately. The terms are accurate enough that people know what they mean, and their definitions are relatively well accepted. Once terms get embedded into a community like that, they’re very hard to unwind - actually I would say impossible in practice.

          Yes, if you get down to details they don’t have fully clear boundaries (because that’s not possible without formal language), but almost everyone will agree on specific examples.

          1. 3

            I mean, this very article is a counterexample? It says

            What does the unit in unit tests mean? It means a unit of behavior. There’s nothing in that definition dictating that a test has to focus on a single file, object, or function.

            While, for example, a random search result from Google gives:

            A unit test is a way of testing a unit - the smallest piece of code that can be logically isolated in a system. In most programming languages, that is a function, a subroutine, a method or property.

            This seems like the perfect opposite of the tacit agreement on meaning you are describing.

            It can be argued that the article is just wrong and the Google result is correct, but that’s still a reason not to use this terminology in this specific article.

            I am also willing to state, but not defend ( :) ), a stronger claim that this is not article-specific confusion, but a reflection of the general state of discourse.

            1. 1

              I know what you’re saying, I’ve made this same argument myself for years. But the examples you gave aren’t contradictions of one another, because language is fuzzy. Note the usage of the word “most.” The Google result says:

              the smallest piece of code that can be logically isolated in a system

              The article clearly says that integration tests are something “larger”:

              we can say that integration tests ensure the correct behavior of the Adapters, Gateways, and Clients that mediate the relationship with other development units (such as APIs, Plugins, Databases, and Modules).

              What you’re highlighting is also the classic distinction between solitary and sociable unit tests. Many people think only solitary tests, where all dependencies are mocked, are unit tests. In that case you’re only testing a single function / method etc. But many people (myself included) have no problem unit testing a set of collaborating objects / functions, so long as they’re accomplishing a common goal.
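
              As a rough sketch of that distinction (all names here, `TaxPolicy`, `FlatTax`, `FixedTax`, and `gross_cents`, are hypothetical), the same function can be tested both ways:

              ```rust
              /// Hypothetical collaborator, invented for this sketch.
              pub trait TaxPolicy {
                  fn tax_cents(&self, net_cents: u64) -> u64;
              }

              /// The real collaborator: a flat 20% tax, rounded down.
              pub struct FlatTax;

              impl TaxPolicy for FlatTax {
                  fn tax_cents(&self, net_cents: u64) -> u64 {
                      net_cents * 20 / 100
                  }
              }

              /// The code under test: adds tax to a net price.
              pub fn gross_cents(net_cents: u64, policy: &dyn TaxPolicy) -> u64 {
                  net_cents + policy.tax_cents(net_cents)
              }

              #[cfg(test)]
              mod tests {
                  use super::*;

                  /// Test double that ignores its input and returns a fixed tax.
                  struct FixedTax(u64);

                  impl TaxPolicy for FixedTax {
                      fn tax_cents(&self, _net_cents: u64) -> u64 {
                          self.0
                      }
                  }

                  /// Solitary: the collaborator is replaced, so only `gross_cents` runs.
                  #[test]
                  fn solitary_adds_whatever_the_policy_returns() {
                      assert_eq!(gross_cents(1_000, &FixedTax(5)), 1_005);
                  }

                  /// Sociable: the real `FlatTax` runs too, so both pieces are exercised.
                  #[test]
                  fn sociable_applies_flat_tax() {
                      assert_eq!(gross_cents(1_000, &FlatTax), 1_200);
                  }
              }
              ```

              Both tests are small and fast; they differ only in whether the collaborator is real.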

              The fact that there’s a blurry line there does not mean that there is confusion about unit vs. integration tests. Integration tests are surely about testing much larger pieces of code, often with external services. System vs. code boundaries, etc. There’s not much disagreement there.

              1. 4

                Ok, I think I am willing to make a stronger statement: I bet that you and the author really do use “unit test” for mostly disjoint blobs of things, and that this results in misunderstanding.

                As a litmus test, the test suite for rust-analyzer would work. 99% of the tests look like this:

                fn completes_trait_method() {
                    check(r#"
                struct S {}
                pub trait T {
                    fn f(&self)
                }
                impl T for S {}
                fn main(s: S) {
                    s.$0
                }
                "#, expect![[r#"
                            me f() (as T) fn(&self)
                        "#]]);
                }

                Here, we check that completion of a particular identifier (method f) works by invoking the whole compiler pipeline, from the lexer to the typechecker, and verifying that each piece deals correctly with incomplete code. Crucially, the code also does zero syscalls (apart from mmap for malloc): everything happens in memory, in a virtual file system, etc. We just feed strings to a check function (common to a class of tests, like all completion tests), and that takes care of setting up the required environment.

                Separately, there are a dozen slow tests, which invoke rust-analyzer as a process, have that spawn cargo internally, and talk to the combo through stdio.

                I think the author of the article would describe it as a perfect test pyramid, which exactly matches prescriptions from the article, with 99% unit-tests testing the boundary outside-in and a very small number of integration smoke tests testing the boundary inside-out.

                At the same time, it seems that you’d say that it’s a testing diamond, because all those 99% are integration tests stressing the compiler’s pipeline as a whole (e.g., that single test would execute something like 95% of the code volume in rust-analyzer).

                On my side, I’d refuse to name these tests as all terminology in the area is poisoned, and instead say that I care about two properties:

                • that the suite of 4000 tests takes 4 seconds to execute because the tests don’t do IO
                • that I can completely replace the guts of rust-analyzer with, eg, a neural network, and still get to re-use all the tests, because they don’t exercise internal APIs directly.
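
                A toy sketch of the second property, assuming nothing about rust-analyzer’s real code: `complete`, its handling of a `$0` cursor marker, and the `check` helper below are invented stand-ins. Because the tests only pass a string through `check`, the body of `complete` could be rewritten wholesale without touching them, and nothing here does IO:

                ```rust
                fn is_ident(c: char) -> bool {
                    c.is_alphanumeric() || c == '_'
                }

                /// Hypothetical public boundary: given source text containing a `$0`
                /// cursor marker, return the declared function names that start with
                /// the identifier prefix right before the cursor.
                pub fn complete(source: &str) -> Vec<String> {
                    let cursor = match source.find("$0") {
                        Some(position) => position,
                        None => return Vec::new(),
                    };
                    // The prefix is the run of identifier characters before the cursor.
                    let prefix = source[..cursor]
                        .rsplit(|c: char| !is_ident(c))
                        .next()
                        .unwrap_or("");

                    // Deliberately naive: treat whatever follows an `fn` token as a name.
                    let mut names = Vec::new();
                    let mut tokens = source.split_whitespace();
                    while let Some(token) = tokens.next() {
                        if token == "fn" {
                            if let Some(declaration) = tokens.next() {
                                let name: String =
                                    declaration.chars().take_while(|c| is_ident(*c)).collect();
                                if !name.is_empty() && name.starts_with(prefix) {
                                    names.push(name);
                                }
                            }
                        }
                    }
                    names.sort();
                    names.dedup();
                    names
                }

                #[cfg(test)]
                mod tests {
                    use super::*;

                    /// Shared helper: every test is an input string plus the expected names.
                    fn check(source: &str, expected: &[&str]) {
                        assert_eq!(complete(source), expected);
                    }

                    #[test]
                    fn completes_by_prefix() {
                        check("fn parse() {} fn print() {} fn main() { pr$0 }", &["print"]);
                    }

                    #[test]
                    fn empty_prefix_lists_everything() {
                        check("fn a() {} fn b() {} $0", &["a", "b"]);
                    }
                }
                ```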
                1. 2

                  The test that you presented is an integration test, and the author of the article would agree with that. I know this because:

                  A common misunderstanding is thinking that opaque-box testing can only be applied to the outer boundaries of our system. That is wrong. Our system is built with many boundaries. Some may be accessible through a communication protocol, while others may be extended with in-process adapters. Each adapter has its own boundaries and can be tested in a behavioral-driven approach.

                  The test you presented begins at the outermost boundary of rust-analyzer - the entry point that accepts source code. Your test stubs the filesystem, which sounds like it’s for performance and / or determinism reasons, but everything in between runs through the real system (based on your description). Many boundaries are crossed in that test, which is why it’s an integration test.

                  Presumably there are internal boundaries within rust-analyzer that you could test, as independent units. As the behavior within the boundaries becomes more focused, those would be unit tests.

                  I know the terminology is bad, but they have very well-accepted meanings, and we can’t say that those don’t exist just because we don’t like them.

                  that I can completely replace the guts of rust-analyzer with, eg, a neural network, and still get to re-use all the tests, because they don’t exercise internal APIs directly.

                  This by definition isn’t possible with unit testing. You seem to like testing from the outermost boundary of the system so that all of the internals can be modified at will. I’m completely with you on this! That’s why I prefer the test diamond, so that the testing of internal boundaries is limited. As you’ve discovered, it’s even possible to make these tests fast with careful placement of test doubles.

                  1. 1

                    Aha, thanks! This is not exactly the outermost boundary: the outermost one is an IPC layer that talks LSP with the world. That layer parses the protocol, and calls into an internal API which is agnostic of any particular protocol’s details. That’s exactly the functional core / imperative shell, ports-and-adapters style of thing.

                    The completes_trait_method hits this internal boundary, and verifies behavior of the system. The “slow tests” hit the LSP, and check that the system is glued to the outside world correctly.
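
                    A minimal sketch of that split, with a made-up line protocol standing in for LSP (the `Request` type, `handle`, `parse_line`, and `serve` are all hypothetical names, not rust-analyzer APIs):

                    ```rust
                    use std::io::{self, BufRead, Write};

                    /// Protocol-agnostic request type (hypothetical).
                    #[derive(Debug, PartialEq)]
                    pub enum Request {
                        Upper(String),
                        Len(String),
                    }

                    /// Functional core: pure, no IO, knows nothing about the wire format.
                    pub fn handle(request: &Request) -> String {
                        match request {
                            Request::Upper(text) => text.to_uppercase(),
                            Request::Len(text) => text.chars().count().to_string(),
                        }
                    }

                    /// Adapter: parses one line of the made-up protocol, e.g. `upper hello`.
                    pub fn parse_line(line: &str) -> Option<Request> {
                        let (command, argument) = line.trim_end().split_once(' ')?;
                        match command {
                            "upper" => Some(Request::Upper(argument.to_string())),
                            "len" => Some(Request::Len(argument.to_string())),
                            _ => None,
                        }
                    }

                    /// Imperative shell: reads lines, calls the core, writes replies.
                    pub fn serve<R: BufRead, W: Write>(input: R, mut output: W) -> io::Result<()> {
                        for line in input.lines() {
                            let line = line?;
                            match parse_line(&line) {
                                Some(request) => writeln!(output, "{}", handle(&request))?,
                                None => writeln!(output, "error: unrecognized request")?,
                            }
                        }
                        Ok(())
                    }
                    ```

                    Most tests can call `handle` directly with plain values; only a handful need to go through `serve` with real input and output.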

                    In my reading, the article tries to draw a boundary between exactly these two types of tests. In particular, only “slow tests” are real integration tests:

                    This type of architecture will make unit testing comfortable and lead you to use integration tests for what they should be: testing adapters to the outside world.

                    And the second kind of tests, which ignore the outside world and test the boundary outside-in, are unit tests, because we only have two words.

        2. 1

          What’s the tl;dr? Too many words have been written about this already, and the same arguments have been rehashed a bajillion times on both sides, including how confusing the terms are.

          1. 3

            The tl;dr is:

            All of the complaints that you have about unit testing are because you don’t understand it and you’re not doing it correctly.

            1. 1

              Sounds about right. I was lucky enough to learn that lesson (I was doing it wrong, and how to do it right) a few years ago, but I don’t know of any resource which teaches it in a manner which can be grokked in a short time. It was basically a whole huge lot of different things, with red-green-refactor being the most obviously teachable one.