What does fibonacci(15) equal? If you already know, terrific—but what are you meant to do if you don’t?
Yeah, it’s going to be kind of problematic to test a function when you don’t know what the correct results are. The advice here seems to be to run it once, take for granted that the result is correct, and enshrine that as the expected result. But that does nothing to help you determine that you implemented the Fibonacci series correctly!
Even if you do know the correct result, this approach seems prone to laziness: just hit OK without really thinking when the program fills in the blanks for you.
For me, the annoying part of writing tests isn’t figuring out the answers; it’s all the boilerplate around expressing assertions. Some languages / test frameworks make it pleasant (I like Catch and Jest) and some are painful (like Apple’s XCTest, or Go’s lack of any built-in assertions at all).
You are correct that blindly accepting the output at a particular point in time does not mean anything about its correctness. However there’s still some value in these kinds of tests as they tend to alert you about unintended or accidental changes.
In some cases what is “correct” is not even that obvious or rigidly defined. So it makes sense to take a “known good” state of something and treat it as correct and then look for unexpected changes to it.
For example, when it comes to UI components or other visual things generated by code, you may not have a rigid definition of correctness, but you can look at the output, say that you are happy with it as it is at that point, and treat unexpected changes as test failures.
These sorts of tests have value, I’m just not sure they belong alongside actual domain-specific unit tests. Usually something like Percy is good at catching that sort of thing, and since it gives you images to review, the chance of false approvals is lower too, IMO.
It depends a bit on your workflow. For example, imagine if you start by writing the trivial naïve implementation of fibonacci (recursive, no memoisation). Now you write your test. The test then captures the output. Now you realise it’s slow and rewrite it to do a linear scan, possibly caching some larger values. Now your tests probably tell you if you implemented it correctly. If they fail, then one of your implementations is wrong and you can spend some time figuring out which.
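To make that workflow concrete, here is a minimal sketch of such a test using Jane Street’s ppx_expect (assuming a dune project with inline tests enabled; the naïve fib and the captured value are illustrative). The first run fills in the [%expect] block, and it then acts as a regression check when the implementation is rewritten.

(* Sketch: start with the naïve implementation and let the first run
   populate the [%expect] block; keep it as a regression check when a
   faster implementation is swapped in. *)
let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

let%expect_test "fib 15" =
  Printf.printf "%d" (fib 15);
  [%expect {| 610 |}]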
Most of the LLVM tests are regression tests of this rough shape. They’re created by running the tool, visually inspecting the output to make sure it looks plausible, and then flagging some of the features of the output as important. It’s fairly common for a change to cause a test to spuriously fail, but that at least makes you go and check that your change to the output really doesn’t break anything in the output that other people depend on.
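Sounds like more of an argument for defining a simplified model and doing model-based testing.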
True, if your simple model can execute at a useful speed. The Fibonacci example is nice because you don’t need very large numbers before it will start taking minutes to execute.
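That’s fair. No free lunch.

What’s an example of a function you wrote for which you don’t know the “correct” output?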
Fibonacci(15) isn’t a bad example. There are various other math formulas, like, say, the centroid of a polygon. Or how about a hash function of some sample string?
Obviously I can work the answers out by hand or run the input through someone else’s implementation. My point is that filling in the expected value from a value the function generated, without checking, is a bad idea.
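Couple of other emphasis points: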
expect tests trivialize the time it takes to upgrade the test when the code changes. This is just as important as saving time while writing the test (well, upgrades actually save you more time, but those savings are a bit less valuable than the initial ones).
expect tests force you to actually have good string representations of things! That’s one of the pre-requisites for debugging systems.
So, ultimately, expect is equivalent to assert (String.equal stdout_string expected_string)? It’s very surprising to hear this kind of idea coming from OCaml people - I’d have expected them to prefer keeping the safety of their strong typing rather than go with a stringly-typed testing framework. I am not certain this is a good idea even if you have strong confidence in your pretty printers; I can imagine it resulting in weird code contortions when attempting to test something hard to print/that doesn’t have a readily-available printer.
It’s possible to get a false negative if you don’t design your pretty-printers well. For example, if you print a string directly, without quotes, then you might not notice when it’s missing/empty. But in practice, 1) debug representations are often hardened against that anyways, because precision here is also useful for debugging, and 2) overall, I think the productivity gains and low friction of snapshot testing outweigh the risks.
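As an illustration of that first point (a sketch, not from the thread, using the same ppx_expect setup as above): a quoted debug representation keeps an empty string visible in the captured output.

(* Sketch: %S prints an OCaml-quoted string, so an empty value shows up
   as "" in the snapshot instead of silently disappearing. *)
let%expect_test "empty strings stay visible" =
  let show s = Printf.printf "%S\n" s in
  show "";
  show "hello";
  [%expect {|
    ""
    "hello"
  |}]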
Jest, I believe, will actually embed the data structure into the test when possible, rather than just a string representation. Besides type-safety, the practical question is whether you have a good diff tool for structured objects. If you don’t, then diffing against the string representation will produce better error messages when fixing a broken test. Pytest, for example, will let you assert equality of various objects, but it’ll show you the string diff in addition to a structured message (like “item 1 in list did not match”) because it’s often higher-quality.
Well you need to know the type of something in order to print it, so the type system is still working for you in test code. But I think you’re right that this doesn’t make sense without a reasonable debug representation for your types.
I can imagine it resulting in weird code contortions when attempting to test something hard to print/that doesn’t have a readily-available printer.
But since this isn’t the only way to write tests, you don’t have to contort anything if this isn’t a good fit for the particular thing you’re testing. From the article:
Classical assertion-style unit tests still have their place—just a much smaller one.
In practice though I’d usually define a new type that contains all of the relevant data that I want to assert on and then print that out to keep the ergonomic benefits of writing expect tests. (If an assertion is anything more complicated than an equality check, a property test is often a better fit anyway.)
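A small sketch of that pattern (the record, printer, and values are made up for illustration): gather just the fields you want to assert on into one type, give it a printer, and snapshot that.

(* Sketch: a purpose-built type keeps the expect test focused on the
   relevant data rather than the full structure it came from. *)
type summary = { count : int; total : float }

let print_summary { count; total } =
  Printf.printf "count=%d total=%.2f\n" count total

let%expect_test "summary of the interesting fields" =
  print_summary { count = 3; total = 4.5 };
  [%expect {| count=3 total=4.50 |}]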
I feel like the Hardcaml example is a pretty compelling argument for spending the time to write a decent pretty-printer, though – imagine what that test would look like with assertions!
Writing tests is a joyful experience to me. I love to see code passing a complex set of inputs. I can push the code to its limit, giving it unexpected inputs. This is especially true if we are testing complex state machines.
But updating tests as requirements change is just heartbreakingly boring. All the previous work is for naught.
I started using this for my source-to-source compilers & it’s amazing.
I first wrote manual unit tests everywhere & I was doing TDD, which was amazing, but at a certain point I stopped caring about the specifics of my AST & just wanted to make sure my output isn’t changing; everything else rarely causes bugs.
I wrote a state machine DSL we use at work & use this approach, & writing a test is as simple as clj -X main/snapshot src test-file.state name verify-this-property. It then generates 2 files, test/snapshots/verify-this-property.state & test/snapshots/verify-this-property.ts, and the test runs through all the .state files in the snapshots directory & verifies the compile output is the same as what’s stored. This makes writing a test a matter of seconds after I have the code snippet.
IME this is extremely high value, it catches way more bugs than regular unit tests & is dead simple to expand on.
I want to bring this approach to most of my job (svelte), but I’m not sold on snapshot testing as I understand it right now. If I understand this blog post correctly - I describe a set of commands & store the results of each command, & when it reruns it verifies those commands haven’t changed the expected output? I’m not sure if the current JS snapshot libraries can do something like that… if they can, I’m sold.
Before I tried expect tests, I was very skeptical of the workflow. However, it really is the most productive workflow I’ve found. It’s very close to a REPL-driven workflow that just happens to be repeatable.
That said, I find that I often replace my expect tests over time with model or property testing. I do that because I’ve found lots of small simple assert tests are much harder to maintain over the life of a project.
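A previous post on this blog had a few more examples of expect tests in practice: https://blog.janestreet.com/computations-that-differentiate-debug-and-document-themselves/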
At work I replaced a bunch of custom-written assertions with an in-house snapshot (expect) test implementation. Each test now looks like this (Scala):
validateSnapshot("input.file")(processFile)
This method writes a snapshot file input.file.snapshot to the same directory as input.file, containing the processed output from the processFile method. We compare them side-by-side, then commit the snapshots into the repo.
It has saved many hours of tediously writing manual assertions.
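Data-driven tests are similar. The difference is that the test cases are separated from the test harness code. https://github.com/cockroachdb/datadriven

They’re widely used in CockroachDB tests. Here’s an example: https://github.com/cockroachdb/cockroach/blob/446bf3058ec0006ce3ddfe16f171ca3b51d63e4a/pkg/sql/sem/eval/testdata/eval/in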
I do think this is a nice workflow, but it doesn’t solve the root of the problem for me. The root being, interesting input combinations are still hard to come up with and write down, and dependencies still have to get wired up somewhere. It’s not like expect tests are immune to code changes.
It’s definitely a step in the right direction though, by letting the computer do lots of work for you. That I agree with 100%.
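Figuring out interesting combinations can probably be delegated to an MC/DC code coverage tool.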
From what I know of them they figure out the coverage by pretty arcane means and aren’t always reliable. Are there good tools that you know of?
Also, the downside of white-box coverage is that it can only cover the code that’s actually written, not tell you that you are missing branches or conditions. Bugs also arise from missing branches and conditions.
For the Ada programming language, GNATcoverage is pretty solid, although I might be biased since I work at the company that makes it (but not on GNATcoverage itself) :).
Also, the downside of white-box coverage is that it can only cover the code that’s actually written, not tell you that you are missing branches or conditions.
Most code coverage tools will report if/when an “else” condition is tested or not even if there’s no syntactic “else” block construct. If you notice that an else condition is never tested, then you write a test for it. If it’s tested but there’s a bug that isn’t noticed in there, the problem is in the test.
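That’s not what I’m saying - say your code looks like:

There’s no coverage tool that can tell you that the correct code is:

Oh, right. For such totally disjoint conditions, there’s really no tooling that would be able to catch the problem indeed :).

(The snippets from that exchange weren’t preserved; the following is only an illustrative sketch, with made-up functions, of the kind of gap being described.)

(* Sketch: branch coverage can reach 100% on the code as written... *)
let written_classify n = if n > 0 then "positive" else "non-positive"

(* ...but no tool can report a branch that was never written, such as a
   separate case for zero. *)
let correct_classify n =
  if n > 0 then "positive"
  else if n = 0 then "zero"
  else "negative"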
I’ll put my hand up, and say I’m struggling to wrap my head around this.
I can understand the appeal of having an auto-generated set of test cases to minimise the chance of things breaking. But isn’t assuming that your outputs are correct and then encoding those a bit ass backwards?
All I can think of when reading the article are “property based tests”.
(I also have a pretty negative experience of frontend snapshot tests. But I’m willing to concede that I might be missing something and it’s a me issue).
You don’t assume that the output is correct, you very much check it manually the first time it is generated.
The technique here is a dual of property based testing: if property based testing doesn’t work, testing with expectations is often the best alternative.
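For contrast, here is a sketch of the property-based side of that duality (the QCheck library and the reverse-twice property are illustrative choices, not from the thread): instead of pinning one captured output, you state an invariant over generated inputs.

(* Sketch: a property test asserts an invariant over many generated
   inputs rather than comparing against one recorded snapshot. *)
let rev_involutive =
  QCheck.Test.make ~count:1000 ~name:"rev (rev l) = l"
    QCheck.(list int)
    (fun l -> List.rev (List.rev l) = l)

let () = exit (QCheck_runner.run_tests [ rev_involutive ])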
I also have a pretty negative experience of frontend snapshot tests
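Would be interesting to hear what went wrong :)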
You don’t assume the test case expectations are correct. You have the outputs automatically generated and then manually verify their correctness. If behavior changes in the future, you regenerate the output and the diff of the output shows you the change in behavior.