Maybe this is specific to the type of projects that I work on, but to me the distinction between unit and integration tests is not superfluous. Specifically, in my world, unit tests test the implementation details directly while integration tests only test via the public interface.
As an example, say I am writing a C or C++ library. In order to access the implementation details (which may include symbols that are not even exported), the library is built from a utility library (a type of static library) and the unit tests link directly to this utility library. This requires the library and its unit tests to reside in the same project.
In contrast, integration tests link to the library itself and only access it via the public interface, just like the real users would. In the build system that I use (build2) we go a step further and put integration tests into a subproject which can be built against an installed version of the library (helpful to make sure the installation actually works). And if the integration tests have any third-party dependencies (say one of the numerous C++ testing frameworks), then they can even be placed into a separate package.
Testing public API vs internal implementation details is indeed an important physical distinction, and this is exactly the semantics Cargo ascribes to the integration/unit terms (unit tests reside in the same translation unit; integration tests link against the crate under test).
But I wish we had better terminology here; I don’t think everyone would agree that unit/integration is about visibility of the API. E.g., there’s the “unit tests should only use the public API of the class” school of thought, and there “unit” clearly doesn’t refer to “has access to private stuff”.
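Concretely, in Cargo terms that physical split looks something like the sketch below (the crate and function names are invented for illustration):

    // src/lib.rs -- the crate under test
    pub fn normalize(path: &str) -> String {
        strip_trailing_slash(path).to_string()
    }

    // Private helper, not part of the public API.
    fn strip_trailing_slash(path: &str) -> &str {
        path.strip_suffix('/').unwrap_or(path)
    }

    // "Unit" test: lives in the same compilation unit, so private items are visible.
    #[cfg(test)]
    mod tests {
        #[test]
        fn strips_trailing_slash() {
            assert_eq!(super::strip_trailing_slash("a/b/"), "a/b");
        }
    }

    // tests/public_api.rs -- "integration" test: compiled as a separate crate that
    // links against the library, so only the public API is reachable.
    #[test]
    fn normalizes_path() {
        assert_eq!(mycrate::normalize("a/b/"), "a/b");
    }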
I agree the terminology is not very intuitive. The way I think about it is along these lines: one should be able to test every aspect of the “integrated implementation” via the public interface, but sometimes things are complex enough to warrant testing “units of the implementation” individually, which most likely will require access to implementation details not exposed in the public API.
I think this is also the correct distinction.
For libraries, integration tests test from the user perspective. Unit tests test from within.
For applications, integration tests test code you don’t own. Unit testing a database abstraction typically involves mocking/whatever, but integration tests of the same thing would actually spin up Postgres or similar.
Rust/Cargo encourages breaking projects down into smaller crates, which blurs this line, because now you have public APIs of private components.
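A minimal sketch of that database-abstraction split, with all names invented for illustration: the unit test exercises the logic against an in-memory fake, while the integration-test counterpart would run the same scenario against a real Postgres.

    use std::collections::HashMap;

    // The abstraction the application codes against.
    trait UserStore {
        fn insert(&mut self, id: u64, name: &str);
        fn get(&self, id: u64) -> Option<String>;
    }

    // In-memory fake for unit tests: no Postgres, no network, no disk.
    #[derive(Default)]
    struct FakeStore {
        rows: HashMap<u64, String>,
    }

    impl UserStore for FakeStore {
        fn insert(&mut self, id: u64, name: &str) {
            self.rows.insert(id, name.to_string());
        }
        fn get(&self, id: u64) -> Option<String> {
            self.rows.get(&id).cloned()
        }
    }

    // The code under test only sees the abstraction.
    fn rename_user(store: &mut dyn UserStore, id: u64, new_name: &str) -> bool {
        if store.get(id).is_some() {
            store.insert(id, new_name);
            true
        } else {
            false
        }
    }

    #[cfg(test)]
    mod tests {
        use super::*;

        #[test]
        fn rename_existing_user() {
            let mut store = FakeStore::default();
            store.insert(1, "alice");
            assert!(rename_user(&mut store, 1, "alicia"));
            assert_eq!(store.get(1).as_deref(), Some("alicia"));
        }
        // An integration test would implement UserStore on top of a real Postgres
        // connection and run the same scenario against a live database.
    }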
Yeah, the fact that this blog itself didn’t understand the distinction tells me that the terms are hopelessly polluted. Lately I’ve tried saying “internal (logic) test” and “external (black-box) test” instead. My intuition about it is that a logic test helps reassure me that an algorithm or data flow is correct, and an external test verifies that it behaves the way I promised in the readme.
Yeah, the fact that this blog itself didn’t understand the distinction
I wouldn’t agree with that: this bullet point from the post describes exactly the situation in the top-level comment:
Cargo uses “unit” and “integration” terminology to describe Rust-specific properties of the compilation model, which is orthogonal to the traditional, however fuzzy, meaning of these terms.
“Testing via the public interface / via the internal interface” does seem pretty orthogonal to “testing a single unit in isolation / testing interactions of multiple units”, which I think is the traditional meaning.
Right. The Rust definitions and the ones used in the post are both different from the “old” ones, which doesn’t mean anyone is wrong, just that the terms are hopelessly polluted and we need to find more specific terminology to describe what we mean.
Can you elaborate on “implementation details”? I can understand the desire to test internal utility APIs (such as a sort function that is only used internally and has specific ordering requirements), but I never feel comfortable with testing “implementation details” (such as whether this public API call will trigger a network call). I guess you probably have a different definition of “implementation details” and I would love to see some clarification.
The best argument against unit testing (or, as the article calls it, the purity dimension of testing) is from Jim Coplien in his “Why Most Unit Testing is Waste” (https://rbcs-us.com/site/assets/files/1187/why-most-unit-testing-is-waste.pdf).
He provides several profound arguments against unit testing that I agree with from my experience. If you actually want a powerful way to find and destroy bugs, employ Design by Contract (https://en.wikipedia.org/wiki/Design_by_contract), which has research from Microsoft showing that it works (https://www.microsoft.com/en-us/research/publication/assessing-the-relationship-between-software-assertions-and-code-qualityan-empirical-investigation/).
Unfortunately most engineers haven’t heard of Design by Contract. It’s rare to find people with experience in it. I always have to teach engineers I work with to employ it. But to give you an idea of how powerful it is: we employed it in a computer vision system that processed petabytes of data, and if a contract failed, the processing would stop. In any given year we had one or two bugs found in the live production environment. That’s a bad year.
Another excellent technique to keep bugs low is to have a zero bug policy (which I also always employ). If you find a bug, you drop what you are doing and fix it. If it takes more than 4 hours, then you put it on the backlog (not a bug list). Don’t keep a bug list. Bug lists are a great place for bugs to hide.
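Not the system described above, just an invented example of the flavor of Design by Contract: preconditions and postconditions checked at run time, so a violated contract stops processing instead of letting garbage flow downstream.

    /// Contract-style wrapper around a resize step in an image pipeline.
    ///
    /// Precondition:  pixels.len() == width * height, both dimensions non-zero.
    /// Postcondition: the output has exactly new_width * new_height pixels.
    fn resize_nearest(pixels: &[u8], width: usize, height: usize,
                      new_width: usize, new_height: usize) -> Vec<u8> {
        // Preconditions: fail loudly instead of producing garbage downstream.
        assert!(width > 0 && height > 0, "contract: input dimensions must be non-zero");
        assert_eq!(pixels.len(), width * height, "contract: pixel buffer size mismatch");
        assert!(new_width > 0 && new_height > 0, "contract: output dimensions must be non-zero");

        let mut out = Vec::with_capacity(new_width * new_height);
        for y in 0..new_height {
            for x in 0..new_width {
                let src_x = x * width / new_width;
                let src_y = y * height / new_height;
                out.push(pixels[src_y * width + src_x]);
            }
        }

        // Postcondition: check what we promised to the caller.
        assert_eq!(out.len(), new_width * new_height, "contract: wrong output size");
        out
    }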
Yup, Jim’s paper is golden! Might be time to re-submit it to lobsters!
Though, I would say it argues against minimizing extent, rather than against maximizing purity. As a litmus, ahem, test, the test from the post would, I think, be considered OK from the perspective of Jim’s paper: it’s a system test which directly checks a business requirement (that a particular completion is shown to the user). And that is a wide-extent, very pure test.
I like it. *throws beer on the floor* Another!
Something missing is simple labels. We went from unit vs integration; now it seems like there should be four labels.
I hate the naming but am somewhat OK with the concepts. Though I think this classification can only show value when it’s applied in certain scenarios, not always.
I definitely don’t like the bottom-up thinking behind the technical design considerations of the tests (i.e. performance driven). I think tests should be business driven. What you are testing should reflect the business’s risk tolerance and the resources the business is willing to spend to set up tests to mitigate those risk vectors. Poorly designed tests should be acceptable as long as they contribute to business value. However, more code means more technical debt, and that debt incurs interest over time without proper care. So it’s quite a job to balance that equation.
Another thought is that “purity” is relative and should be coupled with a frame of reference. Depending on the testing setup, even a distributed computation could be pure. Some of the recent advances in hermetic build tools such as Bazel and high-speed micro-VMs such as Firecracker could easily make this definition obsolete if, in the end, we are using ‘performance’ (or speed) as the business vector to prioritize for.
The most useful model I’ve found to categorize tests is based on runtime dependencies. A unit test is a test which can be successfully executed without any interaction outside of the process itself: no disk, no network, no subprocesses, etc. This roughly maps to syscalls. An integration test is then a test which needs any external resource.
I find this model useful not only because it is reasonably well-defined, but also because it reflects meaningful differences to users. These definitions allow me to clone a repo and run the unit tests without anything more than the language toolchain. Anything that doesn’t fit this description — requiring Docker, or a DB, or a filesystem even! — is an Integration test, and thus should be opt-in.
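A hedged sketch of what that opt-in split can look like in Rust (the test names are hypothetical): integration tests are marked ignored by default and only run when explicitly requested, so a fresh clone can run the unit tests with nothing but the toolchain.

    // Integration test by this definition: needs an external resource, so it is
    // skipped by default and run explicitly with `cargo test -- --ignored`.
    #[test]
    #[ignore] // needs Docker / a database / the filesystem
    fn syncs_against_real_database() {
        // ...connect to the external resource and exercise it here...
    }

    // Unit test by this definition: no interaction outside the process.
    #[test]
    fn parses_key_value_pair() {
        assert_eq!("key=value".split_once('='), Some(("key", "value")));
    }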
—
Single-threaded pure computation
Multi-threaded parallel computation
Multi-threaded concurrent computation with time-based synchronization and access to disk
Multi-process computation
Distributed computation
Each step of this ladder adds half an order of magnitude to a test’s runtime.
Hopefully not!! Execution speed of code, tests included, is broadly a function of CPU utilization and syscall waits. If the number of threads and/or processes has a categorical effect on runtime, something is probably wrong.
So something like:
Tests that don’t make any syscalls — should be basically instantaneous
Tests that make syscalls for local resources e.g. disk, time — limited by IO speed
Tests that make syscalls for remote resources e.g. network — limited by third-parties and/or timeouts
A unit test is a test which can be successfully executed without any interaction outside of the process itself: no disk, no network, no subprocesses, etc.
Yup! And an important thing to realize is that such a unit test sometimes can exercise pretty much the entirety of the application.
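For instance, if the application is structured so that all I/O stays in a thin shell around a pure core (a hypothetical sketch below), the “whole program” is just a function from input to output and can be driven end-to-end without a single syscall:

    // Hypothetical app: the binary's main() only reads stdin and writes stdout;
    // everything else lives in this pure core.
    pub fn run(input: &str) -> String {
        input
            .lines()
            .filter(|line| !line.trim().is_empty())
            .map(|line| line.to_uppercase())
            .collect::<Vec<_>>()
            .join("\n")
    }

    #[cfg(test)]
    mod tests {
        // A "unit" test by the no-external-resources definition, yet it exercises
        // the application end-to-end, minus the thin I/O shell.
        #[test]
        fn end_to_end_in_process() {
            let output = super::run("hello\n\nworld\n");
            assert_eq!(output, "HELLO\nWORLD");
        }
    }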
Hopefully not!! Execution speed of code, tests included, is broadly a function of CPU utilization and syscall waits. If the number of threads and/or processes has a categorical effect on runtime, something is probably wrong.
The conclusion does not match my experience: some syscalls tend to be rather slow. In particular, stuff that spawns a process per test is way slower than stuff that runs many tests in a single process. Several instances where I observed this:
In the Kotlin/Native compiler, rewriting the test suite from compiling each test program via a separate execution of the compiler to using a single compiler process to build the whole test corpus led to an order-of-magnitude perf improvement
Cargo’s test suite is quite slow, as each test typically execs at least two processes (cargo, which then executes rustc)
Rust’s doc-tests are notoriously slow, because each test is compiled as a separate executable
rustc’s test suite is notoriously slow, as each test is compiled as a separate executable (but folks don’t perceive this slowness, as it is dominated by the absolutely atrocious time to build the compiler itself).
I don’t have such vivid anecdotes about threads, so I wouldn’t be surprised if there’s little difference between the stuff that generally has some concurrency & synchronization in tests vs the stuff that doesn’t. Though, my prior is that the difference there would be meaningful, and indeed the SWE book does call out the threads/no-threads distinction. Some overhead benchmarks are here: https://github.com/matklad/benchmarks/tree/e29a260182d154ab1feb6a919ea88696aee8c69a/fn-thread-proc
It has also been my experience that test-suite performance is often a function of the amount of manually inserted sleeps and serializations. A lot of stuff I see in the wild utilizes less than a single CPU in the test suite.
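To get a rough feel for the process-spawning overhead discussed above, a toy measurement (not the linked benchmarks; it assumes a Unix system with the true binary on PATH) could look like this:

    use std::process::Command;
    use std::time::Instant;

    fn main() {
        let n = 200;

        // In-process "test": just a function call per iteration.
        let start = Instant::now();
        for i in 0..n {
            std::hint::black_box(i * 2);
        }
        println!("in-process: {:?} total", start.elapsed());

        // Process-per-test: spawn a trivial child process per iteration.
        let start = Instant::now();
        for _ in 0..n {
            Command::new("true").status().expect("failed to spawn true");
        }
        println!("process-per-test: {:?} total", start.elapsed());
    }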