1. 20

  2. 18

    Disclaimer: I read the article you are responding to (the one by Eugen Kiss) and I agree it’s pretty garbage. I do want to make some responses to your response, less because I’m opposed to TDD (I’m a pretty big fan) and more because doing an academic survey is really, really hard.


    First the Microsoft Research article (not a paper) was mentioned in the context of code-coverage. … The Microsoft article quotes one of the researchers, Nagappan as saying (emphasis added): “What the research team found was that the TDD teams produced code that was 60 to 90 percent better in terms of defect density than non-TDD teams.”

    The original paper is linked in the article: “Realizing quality improvement through test-driven development”. You then say:

    So by [Kiss] cherry-picking the coupling and cohesion metrics comment (which isn’t quite right either, but I’ll get to that), it seems as if TDD isn’t worth it.

    • The original paper found it’s 40-90% better, not 60-90% better. The article got 60-90% by dropping the IBM case, since it’s a Microsoft PR piece. So that result is cherry-picking.
    • Nagappan also found that the TDD projects took 15-35% longer than the control projects. Suggesting there’s a benefit without a downside is also arguably cherry-picking.
    • Kiss didn’t cite Nagappan with respect to cohesion. He only cites it with respect to code coverage. You’re misrepresenting how he formed the argument.


    Switching to the second article:

    They did say that some benefits in terms of higher cohesion and lower coupling were difficult to confirm. But with respect to coupling, the authors state that “some coupling can be good,”

    They said “it appears that test-first programmers might actually tend to write more highly coupled smaller units”, and that they couldn’t determine whether or not it was “good” coupling.

    The problem here is that the metric (LCOM5 in this case) counts accessors (getter/setter), which I’m strongly against in Java anyway

    They specifically raised this exact problem in that section: “The use of accessor methods is common in Java software, and all the study projects involved Java.” That’s why they calculate the accessor methods independently.

    So if nothing else, the Test-First project had many fewer one-line accessors, which in my mind is a good thing!

    They explicitly said that ITF-TL was an anomaly in the opposite direction: “The test-first projects had an average of 10 percent more accessors in all but the ITF-TL study.” Under your argument, test-first has more accessors, which in your mind is a bad thing.


    Going on to the survey paper. Note that your link is paywalled, and I was unable to find a preprint. I’m not able to confirm anything you say here.

    Ignoring productivity (which I partially attribute to inexperience or poor training, and otherwise to how it’s measured – a whole other topic, for sure)

    You can’t just dismiss a metric like that, especially when one of the studies (64) is the Nagappan study you held up as beneficial earlier.

    That table is for papers found that the survey classified as “high rigor” … and “high relevance” (where “relevance considers whether the results apply to an industrial context”, i.e., is more like a real-world setting).

    Relevance is a pretty big deal for these kinds of studies. Consider all of the times something killed cancer cells “in mice”. Citing the “low relevance” studies listed below that table doesn’t give us much information, and they should not be used to support “TDD is good”.

    Even the “low relevance” (but still “high rigor”) studies indicate that you are pretty likely to do better, and possibly no difference, so why not do #TDD?

    Far more papers in this category show “no difference” than “in favor”. In which case there’s a pretty good reason not to do it: why retrain your entire team on a completely different practice when there aren’t any differences?


    The final paper doesn’t have a preprint, either, so I’m following along with the abstract.

    This paper mentions that most empirical studies “tend to neglect whether the subjects possess the necessary skills to apply TDD.” So with more/better training in this complex skill, folks could do even better.

    According to the abstract, “We did not observe a statistically significant difference between the clusters either for external software quality or productivity”. So it’s claiming the opposite: it’s not clear that doing “proper” TDD makes it any more effective.

    if you’ve already fully mapped out and planned the implementation in your head, you’re not really doing TDD, you’re just doing Test-First. And if you’ve ever heard me talk about TDD, you know that TDD != Test-First.

    This pretty much scuttles your entire case: if TDD != Test-First, then all of the previous papers that showed any benefits are now irrelevant: they’re all Test-First! You’d have to go through every study and confirm it’s talking about “proper” TDD. The Nagappan study, for example, explicitly mentions they do a hybrid “plan-first” approach.

    1. 1

      At this point, with my researcher hat firmly in place (I really do need to get such a hat)…

      Can we pause for a moment to research what such a hat may look like?

      I feel sure it would be related to a Jägermonster hat.

      1. 1

        Merlin’s Wizard Hat

        1. 2

          Merlin’s Wizard Hat

          Hmm. No.

          A Wizard’s Hat suggests a person that makes magic happen.

          A researcher should be the person that finds out why and how magic happens and whether it does any good.

          A researcher should be the person (dog?) looking behind curtains and explicitly paying attention to the man behind the curtain.

          1. 1

            Damn, you got me there. Good points.