1. 32
  1. 24

    I’m sympathetic to the goal of making reasoning about software defects more insightful to management, but I feel that ‘technical debt’ as a concept is very problematic. Software defects don’t behave in any way like debt.

    Debt has a predictable cost. Software defects can have zero costs for decades, until a single small error or design oversight creates millions in liabilities.

    Debt can be balanced against assets. ‘Good’ software (if it exists!) doesn’t cancel out ‘Bad’ software; in fact, it often amplifies the effects of bad software. Faulty retry logic on top of a great TCP/IP stack can turn into a very damaging DoS attack.

    Additive metrics like microdefects or bugs per line of code might be useful for internal QA processes, but especially when talking to people with a financial background, I’d avoid them (and words like ‘debt’) like the plague. Those people need to understand the software used by their organization as a collection of potential liabilities.

    1. 11

      Debt has a predictable cost. Software defects can have zero costs for decades, until a single small error or design oversight creates millions in liabilities.

      I think you’ve nailed the key flaw of the “technical debt” metaphor here. It strongly supports the piece’s “microdefect” concept, which is coined explicitly by analogy to microCOVID (the piece doesn’t mention that microCOVID is itself named for the micromort). The analogy works really well for your point: these issues cost almost nothing for a long time, and then there is sudden, potentially catastrophic failure. Maybe “microcrash” or “microoutage” would be a clearer term; I’ve seen “defect” used for pretty harmless issues like UI typos.

      The piece is a bit confusing in that it relies on the phrase ‘technical debt’ while trying to supplant it; it’d be stronger if it used the term only once or twice, to argue its limitations.

      We’ve seen papers doing large-scale analyses of bugfixes on GitHub. That kind of large-scale analysis could provide some empirical justification for assigning values to different microdefects.

      1. 1

        I’m very surprised that the microcovid.org website doesn’t mention taking its inspiration from the micromort.

        1. 1

          It’s quite possible they invented the term “microCOVID” independently. “micro-” is a well-known prefix in science.

        2. 1

          One thing I think focusing on defects fails to capture is the way “tech debt” can slow down development, even if it isn’t actually resulting in more defects. If a developer wastes a few days flailing because they didn’t understand something crucial about a system, e.g. because it was undocumented, then that’s a cost even if it doesn’t result in them shipping bugs.

          On a tangentially related note, the defect model also implicitly assumes that a particular behavior of the system is either a bug or not a bug. Often that’s subjective, or at least a question of degree; performance problems often fall into this category, as do UX issues. But I think things which cause maintenance problems (lack of docs, code that is structured in a way that is hard to reason about, etc.) often work similarly, even if they don’t directly manifest in the runtime behavior of the system.

          1. 1

            Microcovids and micromorts at least work out in the aggregate; the catastrophic failure happens to the individual, i.e. there’s no joy in knowing the chance of death is one in a million if you happen to be that fatality.

            Knowing the number of code defects might give us a handle on the likelihood of one having an impact, but not on the size of its impact.

          2. 3

            Actually, upon re-reading, it seems the author defines technical debt purely in terms of code beautification. In that case the additive logic probably holds up well enough. But since beautiful code isn’t a customer-visible ‘defect’, I don’t understand how monetary value could be attached to it.

            1. 3

              I usually see “tech debt” used to describe following the “no design” line on https://www.sandimetz.com/s/012-designStaminaGraph.gif past the crossing point. The idea is that the longer you stay on this part of the curve, the harder it becomes to create or implement any design, and maintaining the code gets slower and slower.

              1. 1

                I think this is the key:

                For example, your code might violate naming conventions. This makes the code slightly harder to read and understand which increases the risk to introduce bugs or miss them during a code review.

                Tech debt so often leads to defects that the two become interchangeable.

                1. 1

                  To me, this sounds like a case of the streetlight effect. Violated naming conventions are a lot easier to find than actual defects, so we pretend fixing one helps with the other.

              2. 3

                I think it’s even simpler than that: All software is a liability. The more you have of it and the more critical it is to your business, the bigger the liability. As you say, it might be many years before a catastrophic error occurs that causes actual monetary damage, but a sensible management should have amortized that cost over all the preceding years.

                1. 1

                  I think it was Dijkstra who said something like “If you want to count lines of code, at least put them on the right side of the balance sheet.”

                2. 2

                  Debt has a predictable cost

                  Only within certain bounds. Interest rates fluctuate and the interest rate that you can actually get on any given loan depends on the amount of debt that you’re already carrying. That feels like quite a good analogy for technical debt:

                  • It has a certain cost now.
                  • That cost may unexpectedly jump to a significantly higher cost as a result of factors outside your control.
                  • The more of it you have, the more expensive the next bit is.
                  1. 1

                    especially when talking to people with a financial background, I’d avoid them, and words like ‘debt’, like the plague

                    Interesting because Ward Cunningham invented the term when he worked as a consultant for people with a financial background to explain why code needs to be cleaned up. He explicitly chose a term they knew.

                    1. 1

                      And he didn’t choose very wisely. Or maybe it worked at the time if it got people to listen to him.

                  2. 2

                    This doesn’t actually solve the hard problem, which is estimating the financial cost of a less-than-optimal implementation.

                    1. 3

                      I think the framework is there: you could calculate a probability distribution of financial cost based on the number of distinct issues and their microdefect amounts. Optimally you’d also use a probability distribution for the cost of a single defect instead of just using an average.

                      For an organization well-versed in risk management this might just work. But without understanding the concept of probabilistic risk I don’t believe the tradeoffs in implementation (and design) can be managed.

                      The article seems to focus on just the expected value of microdefects. This might be enough for some decisions, but it’s not a good way to conceptualize “technical debt”.
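
                      As a minimal sketch of what that could look like (the microdefect ratings, the lognormal cost parameters, and all other numbers below are invented for illustration):

                          import random

                          # Invented microdefect ratings per known issue; by analogy to
                          # microCOVID/micromort, 1 microdefect = a one-in-a-million chance
                          # of the issue turning into a costly defect over some period.
                          issue_microdefects = [120, 4500, 30, 900]

                          def sample_total_cost():
                              """One Monte Carlo draw of the total financial cost."""
                              total = 0.0
                              for md in issue_microdefects:
                                  if random.random() < md / 1_000_000:
                                      # Cost of a single defect: drawn from a lognormal
                                      # rather than a flat average, so rare catastrophic
                                      # outcomes show up in the tail.
                                      total += random.lognormvariate(9.0, 2.0)
                              return total

                          draws = sorted(sample_total_cost() for _ in range(100_000))
                          print("expected cost:  ", sum(draws) / len(draws))
                          print("95th percentile:", draws[int(0.95 * len(draws))])

                      The gap between the mean and the tail percentiles is exactly the information a bare expected value throws away.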

                      1. 3

                        One interesting implication is that if we can estimate the costs of different violations, we can estimate the cost-saving of tools that prevent them.

                        For example, if “if without an else” is $0.01, then a linter that prevents that or a language where conditionals are expressions rather than statements automatically saves you a dollar per 100 conditionals.
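
                        As a toy break-even calculation (the $0.01 figure is the hypothetical from above; the adoption cost is made up):

                            cost_per_violation = 0.01  # hypothetical expected cost of one "if without an else"
                            adoption_cost = 500.00     # assumed one-off cost of rolling out the linter

                            # The tool pays for itself once it has prevented this many violations.
                            break_even = adoption_cost / cost_per_violation
                            print(f"break-even after {break_even:,.0f} prevented violations")  # 50,000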

                        1. 2

                          you could calculate a probability distribution of financial cost based on the number of distinct issues and their microdefect amounts

                          My point is, we can’t do that because we don’t know what the average cost of a defect is, and we have no way of finding out.

                          1. 2

                            I think we do (certainly I have some internal numbers for some of these things); the thing that we don’t know is the cost distribution of defects. For example, the cost of a security vulnerability that allows arbitrary code execution is significantly higher than the cost of a bug that causes occasional, non-reproducible crashes on 0.01% of installs; a bug that causes non-recoverable data corruption for 0.01% of users is somewhere in the middle. We also don’t have a good way of mapping the probability of any kind of bug to something in the source code at any useful granularity (we can say, for example, that the probability of a critical vulnerability in a C codebase is higher than in a modern C++ one, but that doesn’t help us target the things to fix in the C codebase, and rewriting it entirely is prohibitively expensive in the common case).

                            1. 1

                              What sorts of things do you have numbers for, if you can share? I have heard of people estimating costs, but only for performance issues when you can map it to machine usage costs pretty easily, so I’d be interested in other examples.

                            2. 1

                              It’s true we can’t know the distribution or the average exactly. But if you measured the cost of each found defect after it’s fixed, you could make a reasonable statistical model after N=1000 or so. And note that we do know lower and upper bounds for the financial cost of a defect: the cost must typically be between zero and the cost of bankruptcy.
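
                              A sketch of that kind of model, assuming a log of measured per-defect costs (the stand-in data and the bankruptcy bound are invented):

                                  import math
                                  import random
                                  import statistics

                                  # Stand-in for ~1000 measured costs of fixed defects, in dollars.
                                  measured_costs = [random.lognormvariate(7.0, 1.5) for _ in range(1000)]

                                  # Defect costs are heavy-tailed, so fit a lognormal on the log
                                  # scale rather than trusting the raw sample mean alone.
                                  logs = [math.log(c) for c in measured_costs]
                                  mu, sigma = statistics.fmean(logs), statistics.stdev(logs)

                                  # Implied mean cost of the fitted lognormal: exp(mu + sigma^2 / 2).
                                  print("fitted mean cost:", math.exp(mu + sigma**2 / 2))

                                  # Per the bounds above, cap any single draw at the (invented)
                                  # cost of going bankrupt.
                                  BANKRUPTCY = 50_000_000
                                  sample = min(random.lognormvariate(mu, sigma), BANKRUPTCY)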

                              1. 4

                                if you measured the cost of each found defect after it’s fixed, you could make a reasonable statistical model after N=1000 or so

                                You are also assuming the hard part. How are you measuring the cost of a defect?

                                1. 1

                                  It depends a lot on the business you are in. For Open Source it is hopeless because you don’t know how many users you even have. My work is in automotive, where we can count the cost for customer defects quite well. Probably better than our engineering costs in general.

                                  1. 1

                                    we can count the cost for customer defects quite well

                                    Are these software defects or hardware defects? As a followup, if they are software defects, are they the sort of defects that would be described as “tech debt” or as outright bugs?

                                    1. 1

                                      Yes, the classification is still tricky. Assume we have a defect. We trace it down to a simple one-line change in the software and fix it. The customer is happy again, and they get a price reduction for the hassle. That amount, plus the effort invested in debugging and fixing, is the cost of the defect.

                                      Now we need to consider what technical debt could have encouraged writing that bug: Maybe a variable involved violated the naming convention so the bug was missed during code review? Maybe the cyclomatic complexity of the function is too high? Maybe the Doxygen comment was incomplete? Maybe the line was not covered by a unit test? For all such possible causes, you can now adapt the microdefect cost slightly upwards.
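
                                      A sketch of that bookkeeping; the violation names, the weighting scheme, and the numbers are all invented:

                                          from collections import defaultdict

                                          # Running estimate of the expected cost (in dollars) that each
                                          # kind of violation contributes to defects; it starts at zero
                                          # and is nudged upwards whenever a fixed defect is traced to it.
                                          microdefect_cost = defaultdict(float)

                                          def attribute(total_cost, suspected_causes, weight=0.1):
                                              """Spread a defect's measured cost (price reduction plus
                                              debugging and fixing effort) across the violations suspected
                                              of encouraging it, as an exponential moving average."""
                                              share = total_cost / len(suspected_causes)
                                              for cause in suspected_causes:
                                                  microdefect_cost[cause] += weight * (share - microdefect_cost[cause])

                                          # The one-line fix described above, with invented cost and causes.
                                          attribute(12_000.00, [
                                              "naming-convention-violation",
                                              "cyclomatic-complexity-too-high",
                                              "incomplete-doxygen-comment",
                                              "line-not-covered-by-unit-test",
                                          ])
                                          print(dict(microdefect_cost))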

                                      1. 1

                                        That’s an interesting idea. And then microdefects would work well, because you average out differences (like how much it costs to make a customer happy again) that don’t have much to do with the bug itself.

                                        Do you have a similar process for bugs that don’t affect customers, or correct but inefficient code implementations?

                                        1. 1

                                          You are thinking of those “phew, glad we found that before anyone noticed” incidents, I assume. The cost is only the effort here.

                                          We have something similar. Sometimes we find a defect which has already shipped, but apparently neither the customer (OEM) nor the users have noticed. Then there is a risk assessment, where tradeoffs are considered:

                                          • How many users do we expect to notice it? Mostly depends on how many users there are and how often the symptoms occur.
                                          • How severe is the impact? If it is a safety risk, fixing it is mandatory.
                                          • How much will it cost to fix it? Again, the more users there are the higher the cost.
                                          • How visible is the fix? If you bring a modern car to the yearly inspection, chances are that quite a few bugfixes are installed to various controllers without you noticing it.

                                          You can estimate anything, but of course the accuracy and precision can get out of hand.
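
                                          A back-of-the-envelope version of that risk assessment, with every number invented:

                                              # Hypothetical shipped defect that nobody seems to have noticed yet.
                                              users = 2_000_000
                                              p_user_notices = 0.0001     # depends on how often the symptoms occur
                                              cost_per_notice = 40.00     # goodwill/support cost per affected user
                                              fix_cost_per_user = 0.50    # rollout cost scales with the fleet size
                                              safety_risk = False         # if True, fixing is mandatory regardless

                                              expected_cost_of_waiting = users * p_user_notices * cost_per_notice
                                              cost_of_fixing = users * fix_cost_per_user

                                              if safety_risk or cost_of_fixing < expected_cost_of_waiting:
                                                  print("fix now")
                                              else:
                                                  print("fold the fix into the next scheduled update")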