1. 16
  1. 19

    I have a problem with how some of the results in this paper are interpreted (and that is not considering the statistical problem of how test suite size is accounted for).

    The essential trouble here is that the authors are using normalized effectiveness score. What is a normalized effectiveness score? “The normalized effectiveness measurement is the number of mutants a test suite detected divided by the number of non-equivalent mutants it covers”. It is only when the authors compare the correlation of normalized effectiveness score to statement coverage that they find the correlation dropping. It is also from this result that they get the title of the paper.

    Now, why is this problematic? Consider, for argument’s sake, that you have perfect test cases. That is, each test case kills every non-equivalent mutant it covers. Should we then expect a high correlation with coverage? If what they are measuring is a proxy for the effectiveness of the test suite, and you have perfect test cases, then you should see a high correlation, right?

    If a test case kills every mutant it covers, then the normalized score for every test suite would be 1, regardless of its coverage. That is, no correlation at all.
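
    To see this concretely, here’s a quick sketch of the argument in Python (the numbers are invented, not from the paper): give every suite “perfect” test cases, and the normalized score becomes a constant 1.0 that can’t correlate with coverage or anything else.

```python
import random
import statistics

random.seed(0)

# Hypothetical suites: (statement coverage, non-equivalent mutants covered).
suites = [(random.uniform(0.2, 1.0), random.randint(10, 100)) for _ in range(50)]

# "Perfect" test cases: every covered non-equivalent mutant is killed.
killed = [covered for _, covered in suites]

# Normalized effectiveness = mutants killed / mutants covered -> always 1.0.
normalized = [k / c for k, (_, c) in zip(killed, suites)]
assert all(score == 1.0 for score in normalized)

# A constant has zero variance, so its Pearson correlation with coverage
# is undefined: "no correlation at all".
print(statistics.stdev(normalized))  # 0.0
```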

    My take is that the results have been misinterpreted. You can’t conclude from this paper that coverage is not a good measure of test suite effectiveness.

    1. 5

      I am a big fan of coverage, but feel that a lot of the debate around the practice largely misses the point(s). So, while I agree that complete or high coverage does not automatically mean that a test suite or the software is good… of course it doesn’t? In the extreme, it’s pretty trivial to reach 100% coverage without testing the actual behaviour at all.

      Coverage is useful for other reasons, for example the one this article ends with:

      Our results suggest that coverage, while useful for identifying under-tested parts of a program, should not be used as a quality target because it is not a good indicator of test suite effectiveness.

      Identifying under-tested parts of a program seems like a pretty important part of a testing strategy to me. Like many advantages of coverage, though, you have to have pretty high coverage for it to be useful. There are other “flavours” of this advantage that I find useful all the time, most obviously dead code elimination. High test coverage at the very least signals that the developers are putting effort into testing, and checking that their testing is actually hitting important pieces of the code. Maybe their test suite is in fact nearly useless, but that seems pretty unlikely, and it could be nearly useless without coverage, too. That said, like any metric, it can be gamed, and pursuing the metric itself can easily go wrong. Test coverage is a means to many useful ends, not an end unto itself.

      The quest for 100% may be a bit of a wank, but I’ve tried it in a few projects before and actually found it quite useful. In particular, it highlights issues with code changes that affect the coverage of the test suite in a very simple way. Day-to-day, this means that you don’t need to meticulously pore over the test suite every time any change is made to make sure that some dead code or dead/redundant branches weren’t added. If you don’t have total coverage, doing that is a chore. If you do, it’s trivial: “oh, the number is not 100% anymore, I should look into why”. I regularly end up significantly improving the code during this process. It’s undeniably a lot of work to get there (depending on the sort of project), but once you do, there are a lot of efficiency benefits to be had. If the project has platform-specific or -dependent aspects, then this is even more useful in conjunction with a decent CI system.

      As to the article itself, the methodology here seems rather… convenient to me:

      • Programs are mutated by replacing a conditional operator with a different one. This mutation does not affect coverage (except perhaps branch coverage, in exactly one case, if you’re replacing > with >= as they are here). It also hardly seems like a common case.

      • The effectiveness of the test suite as a whole is determined by running random subsets of the tests and seeing if they catch the bug. This is absurd. Test suites are called test suites for a reason. The instant you remove arbitrary tests, you are no longer evaluating the effectiveness of the test suite, full stop. You are - obviously - evaluating the effectiveness of a random subset of the test suite. Who cares about that?
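
      To make the first point concrete, here’s a toy `>` → `>=` mutant of my own (not an example from the paper): every statement executes either way, so statement coverage is identical for both versions, and only a test with the exact boundary input kills the mutant.

```python
def is_adult(age):
    # Original predicate.
    return age > 18

def is_adult_mutant(age):
    # Mutant: > replaced with >= (the conditional-operator class used here).
    return age >= 18

# Both versions execute the same single statement on any input, so
# statement coverage cannot tell them apart. Only the boundary input
# distinguishes them:
assert is_adult(30) == is_adult_mutant(30)   # mutant survives: both True
assert is_adult(5) == is_adult_mutant(5)     # mutant survives: both False
assert is_adult(18) != is_adult_mutant(18)   # mutant killed: False vs True
```

      A suite without the `18` case reports full statement coverage of `is_adult` and still lets the mutant live.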

      Am I missing something? In short, given this methodology, the only things these results seem to say to me is: “running a random subset of a test suite is not a reliable way to detect random mutations that change one conditional operator to another”. I don’t think this is at all an indicator of overall test suite effectiveness.

      That said, I have not read the actual paper (paywall), and am assuming that the summary in the article is accurate.

      1. 4

        I also find coverage extremely valuable for finding dead or unreachable code.

        I frequently find that unreachable code should be unreachable, e.g. error-handling for a function that doesn’t error when provided with certain inputs; this unreachable-by-design error handling should be replaced with panics since reaching them implies a critical bug. Doing so combines well with fuzz-testing.

        It’s also useful for discovering properties of inputs. Say I run a function isOdd that never returns true and thus never allows a certain branch to be covered. I therefore know that somehow all inputs are even; I can then investigate why this is and perhaps learn more about the algorithms or validation the program uses.
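
        A toy version of that scenario (names and numbers are mine): an uncovered branch reveals that something upstream only ever produces even inputs.

```python
def is_odd(n):
    return n % 2 == 1

def process(values):
    results = []
    for v in values:
        if is_odd(v):
            # Coverage reports this branch as never taken...
            results.append(v * 3 + 1)
        else:
            results.append(v // 2)
    return results

# ...which tells us something upstream guarantees even inputs --
# here, the caller doubles everything before processing.
inputs = [x * 2 for x in range(5)]
print(process(inputs))  # [0, 1, 2, 3, 4]
```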

        In other words, good coverage helps me design better programs; it’s not just a bug-finding tool.

        This only holds true if I have a plethora of test cases (esp if I employ something like property testing) and if tests lean a little towards integration on the (contrived) “unit -> integration” test spectrum. I.e. only test user-facing parts and see what gets covered, and see how much code gets covered for each user-facing component.

        1. 1

          This matches my experience very well. Good point that the sort of test suite is relevant here. I get the impression that the article is coming from more of a purist unit-testing perspective, but this dead code elimination thing is mostly useful when you have a pretty integrated test suite (I agree that this axis is largely contrived).

          I find it particularly nice for non-user-facing things with well-defined inputs and outputs like parsers, servers, and so on. If you have a test suite that mostly does the thing the software actually has to do (e.g. read this file with these options and output this file), in my experience, coverage exposes dead code a lot more often than you expect.

          This has the interesting side-effect that unit tests which only exist to cover internal code are actually harmful in a way, because something useless will still be covered.

          1. 1

            I find it particularly nice for non-user-facing things with well-defined inputs and outputs like parsers, servers, and so on. If you have a test suite that mostly does the thing the software actually has to do (e.g. read this file with these options and output this file), in my experience, coverage exposes dead code a lot more often than you expect.

            I think it’s just fine, as long as it’s possible to turn them off and just run the subset of tests for public functions or user-facing code. I typically have a portable Makefile that includes make test-cov, make test, and make test-quick; if applicable, only make test needs to touch all test files.

        2. 2

          I have not read the actual paper (paywall)

          The PDF is on the linked ACM site: https://dl.acm.org/doi/pdf/10.1145/2568225.2568271 – I think you must have misinterpreted something or took a wrong turn somewhere(?)

          Otherwise there is always that certain site run by a certain Kazakhstani :-)

          1. 1

            Paywalled in the typical ACM fashion as far as I can tell?

            That said, sure, there are… ways (and someone’s found an author copy on the open web now). I’m just lazy :)

            1. 1

              Skimmed the paper. It seems the methodology summary in the article is accurate, and I stand by my critique of it. To be fair, doing studies like this is incredibly hard, but I don’t think the suggested conclusions follow from the data. The constructed “suites” are essentially synthetic, and so don’t really say anything about how useful of a quality metric or target coverage is in a real-world project.

              1. 1

                Huh, I can just access it. I don’t know, ACM is weird at times; for a while they blocked my IP because it was “infiltrated by Sci-Hub” 🤷 Don’t ask me what that means exactly, quoting their support department.

                1. 1

                  Hm. Out of curiosity, do you have a lingering academic account, or are you accessing it via some institution’s network? I know I was surprised and dismayed when my magical “free” access to all papers got taken away :)

                  1. 1

                    I only barely finished high school, and that was a long time ago. So no 🙃

                    Maybe they’re providing free access to some developing countries (Indonesia in my case), or they just haven’t fully understood carrier grade NAT (my IP address is shared by hundreds or thousands of people, as is common in many countries). Or maybe both. Or maybe it’s one of those “free access to the first n articles, paywall afterwards” things? I don’t store cookies by default (only whitelisted sites), so that could be a factor too.

            2. 1

              Identifying under-tested parts of a program seems like a pretty important part of a testing strategy to me.

              My interpretation is that test coverage reports can be useful if you look at them in detail to identify specific areas in the code where you thought you were testing it but you were wrong.

              But test coverage reports are completely useless if you just look at a percentage number on its own and say “the tests for project X are better than project Y because their number is higher”. We have a codebase at work with the coverage number around 80%, and having looked at it in detail, I can tell you that we could raise that number to 90% and get absolutely no actual benefit from it.

            3. 3

              The paper is here.

              1. 3

                I think there’s another issue with the paper (unless I missed how it’s addressed). They don’t seem to differentiate tests that fail on assertions and uncaught exceptions. The problem with that is that mixing up binary operators may result in a large number of loops exploding due to broken stop conditions or null checks being inverted. In those cases, there’s no difference between that change and sprinkling around “throw new Exception”.

                So the impact would be that any test touching that execution path at all would fail. That alone would reproduce their result: (randomly selected tests) × (randomly injected exceptions) gives you a proportion of crashes that isn’t correlated with coverage.

                I’d love to see this test repeated with those trivial crashes ignored (or separated out). They’re uninteresting both for the theoretical result and in practice (if the whole module just doesn’t work at all, you likely don’t need the CI to tell you that).
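
                A sketch of that failure mode (toy code, not from the paper): invert one null check and every test that merely executes the path dies with an uncaught exception before any assertion runs, so assertion quality stops mattering.

```python
def normalize(name):
    # Mutant: the original `if name is None:` guard has been inverted.
    if name is not None:
        raise ValueError("name must not be None")
    return name.strip().lower()

# Any test touching this path now fails with an uncaught exception,
# regardless of what it was actually asserting:
try:
    normalize("  Alice ")
except ValueError:
    print("crashed before any assertion ran")
```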

                1. 1

                  I can’t quite make this precise, but I have a worry about this study.

                  I think programmers are typically targeting coverage when they produce test cases. I suspect most test cases are written for one of three reasons:

                  1. Writing tests to cover new functionality or help refactor existing functionality (usually as prelude to introducing new functionality)
                  2. Writing tests to ensure that overall coverage becomes/remains high (in orgs that mandate some level of coverage).
                  3. Writing tests in response to specific bugs (fix the issue, write a test so it can’t happen again). This may or may not increase coverage.

                  2 of the 3 reasons explicitly mention the idea that tests should increase coverage.[1]

                  If your test cases are written in a way that targets coverage, then wouldn’t you expect that the fraction of the test suite you throw away would be highly correlated with coverage?

                  If that’s right, then perhaps you can replace coverage with size only because programmers mostly write tests that increase coverage; they’re less often adding tests that keep coverage the same but improve overall reliability. In that case, it’s still reasonable to look at a test suite and focus on adding coverage to the places that are not already covered.
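
                  A rough simulation of that intuition (all numbers invented): if each test was written to cover some new lines, then the coverage of a random subset is essentially a function of the subset’s size, so the two measures are bound to be correlated.

```python
import random

random.seed(1)
NUM_LINES = 1000

# Hypothetical suite in which each test was written to cover new lines
# (reasons 1 and 2 above), plus whatever it touches incidentally.
tests = []
uncovered = set(range(NUM_LINES))
while uncovered:
    new = set(random.sample(sorted(uncovered), k=min(25, len(uncovered))))
    incidental = set(random.sample(range(NUM_LINES), k=50))
    tests.append(new | incidental)
    uncovered -= new

def coverage(subset):
    return len(set().union(*subset)) / NUM_LINES if subset else 0.0

# Coverage of a random subset climbs with the subset's size.
for frac in (0.25, 0.5, 0.75, 1.0):
    sub = random.sample(tests, max(1, int(len(tests) * frac)))
    print(f"{frac:.0%} of suite -> {coverage(sub):.0%} coverage")
```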

                  [1] This is true whether or not you’re actually using a code coverage tool. At $DAYJOB$ CI doesn’t flag code coverage, but development still emphasizes it. The way this will work is that you’ll open a PR, and if it doesn’t have tests, the first question is likely to be “why doesn’t this have tests?” Actual measurement of coverage would be better, but this is the primitive approach using human brains to do a machine’s job.

                  1. 1

                    In my experience, projects with 100% test coverage tend to have a lot of trivial tests – for example, the function sets foo=1, and the test verifies that foo==1. Higher order tests that actually verify that nontrivial logic is correct are much more valuable, but they are also harder to measure.

                    1. 1

                      Yeah, that part is tricky. If you want 100% you’re probably going to need at least some of those. I think keeping the number of them down, and trying to make them as non-pointless as possible is just part of the art of coverage-directed testing.

                      That said, I see the phenomenon you’re mentioning mostly in the context of very unit-testing focused projects (and especially exclusively unit-testing focused projects). It works much better with high level tests. For example, if you’re really testing your application well and setting foo to 1 is not covered (both by line, and in the higher level sense of the value “1” having some impact)… why is that code even there? Is 1 even correct? Why?