1. 35

  2. 11

    Which is way too complicated a question to answer in a Tweetstorm. It’s also not something I really have the capacity right now to write a fully researched essay on, so you will get my thoughts as a newsletter. Hooray!

    As of one hour ago I am one beautiful, beautiful step closer to having the capacity for a fully researched essay on this.

    1. 1

      Looking forward to it!

    2. 9

      I subscribe to Deming’s viewpoint that we need to look at the whole system which produces software, and that top management is responsible for designing that system.

      Does that mean the programmer is never to blame and it is always “the system”? Not completely. A system assigns its elements certain responsibilities. What exactly the duty of a programmer is varies from system to system. A system might place the full blame on the programmers. A system might define only fine-grained guidelines like “80% test coverage”.

      So, if a defect can be traced back to programmers who did not fulfill their responsibilities, they are to blame. Usually you can trace it back to others as well, though. For example, the programmers might not know what exactly their responsibility is, and that is not their fault.

      Whether a system is good is a very different question. The number of defects is only one aspect. Programmer happiness and morale are another, for example. It seems we only have the long-term success of companies/projects as a measure, and that is confounded by lots of other factors.

      1. 7

        One more thought:

        Robert Martin’s argument can be seen as an application of Occam’s Razor: the simplest answer is that the programmers put the defect in, thus they are to blame.

        Well, I recently learned about Hickam’s dictum. It comes from the medical profession and says “A man can have as many diseases as he damn well pleases.” A more formal version is “in complex systems there are usually multiple causes for a problem.”

        1. 5

          Robert Martin’s argument can be seen as an application of Occam’s Razor: the simplest answer is that the programmers put the defect in, thus they are to blame.

          You link to the definition of Occam’s Razor, but then you state the incorrect pop-science version of it. Occam’s Razor does not say that the simplest solution is correct. It does not even say that the simplest solution is most likely to be correct. It says that, given two hypotheses, one of which depends on a subset of the factors of the other and which give the same predictions, there is no point in using the more complex one. This is a simple inductive argument: you can always add irrelevant factors, which have no impact on the predictions, to any hypothesis. William of Occam was telling people not to do this.
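
          In symbols (my own paraphrase of the condition just described, not a standard formulation), with hypotheses H_1 and H_2 depending on factor sets F_1 and F_2, and d ranging over possible observations:

            \[
              F_1 \subseteq F_2
              \quad\text{and}\quad
              \forall d:\; P(d \mid H_1) = P(d \mid H_2)
              \;\Longrightarrow\; \text{prefer } H_1 .
            \]

          The extra factors in F_2 that are not in F_1 do no predictive work; those are exactly the irrelevant factors the razor tells you not to add.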

          A simple answer is that the programmers put the defects in, but that is a hypothesis which predicts that the defect rate would remain the same if you kept the requirements and other constraints the same but changed the language. That is not a variant of the other hypothesis with one factor removed and the predictions remaining the same, so Occam’s Razor does not apply.

          If people keep slicing their faces open with a chainsaw when they use it instead of a knife and fork, the simple hypothesis is that these people are the problem. That hypothesis predicts that if you give them a fork or a chopstick they’ll still manage to impale their faces. The hypothesis that cutlery is more appropriate than power tools for conveying food to your face, and that only some of the people would still manage to impale themselves on a spoon, may be more complex, but it gives different predictions.

          1. 1

            Where do you get that “subset constraint” from?

            Even when looking at a more formal version like Solomonoff Induction, it is not about subsets.

          2. 2

            Hickam’s dictum is one of the most underrated principles. Frankly, I don’t see people fail to apply Occam’s razor very often. However, I’ve seen (and done) attempts to trace a bunch of problems back to a single cause, only to find that it’s in fact a set of unrelated problems that just happen to show up together.

            1. 1

              Thank you for introducing me to Hickam’s dictum.

              It also amazes me when people treat “blame it on the programmer” as the simplest solution when individual humans are absurdly complex systems all on their own.

          3. 5

            No, defects are the fault of individual humans and small groups of humans. Defects caused by combining two things that work fine in isolation but not together are the fault of the programmer that integrated them. 95% of the time it’s because they made assumptions along the lines of ‘assume postcondition A implies precondition X’ or just ignored the guarantees and requirements of the two systems.
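
            To make that concrete, here is a minimal sketch (hypothetical Python, names invented for illustration, not from the article): each function satisfies its own contract, and the defect only exists in the glue code that assumed one contract implied the other.

              def fetch_pending(items):
                  # Postcondition: returns a (possibly empty) list of pending items.
                  return [x for x in items if x > 0]

              def average(batch):
                  # Precondition: batch is non-empty.
                  return sum(batch) / len(batch)

              def report(items):
                  # The integration bug: nothing here checks that the batch is non-empty,
                  # so report([]) raises ZeroDivisionError even though both functions
                  # above meet their own specs. The mistaken assumption ("returns a list"
                  # implies "returns a non-empty list") lives in this glue code.
                  return average(fetch_pending(items))

            The fix, and on this view the blame, lives in report(), not in either of the pieces it combines.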

            More generally, stuff like this…

            Just scapegoat the human and let the broader system get off scot-free. Don’t ask if the equipment was getting out of date, or if the training program is underfunded, or if the official processes were disconnected from the ground-level reality. When all defects are the fault of the human, then we don’t have any tools to address defects besides blaming humans. And that’s completely inadequate for our purposes.

            …ignores that systems are set up by and signed off by humans. None of this stuff just spontaneously arises. Depending on random npm packages with no security guarantees or code review isn’t a system problem; it’s the fault of the person that added that dependency or the person that signed off on allowing random unreviewed dependencies.

            Making things “system problems” is just a way to shift blame from individuals to everyone, because “you can’t fire EVERYONE”. But what about the people that sign off on these systems, the people that get paid HUGE amounts of money on the basis that they’re supposed to be held responsible when something goes wrong, but who somehow never are? The blame is on them. If you tell developers not to review the contents of NPM packages because you have a tight schedule and don’t have time for it, then YOU AND ONLY YOU are responsible for the consequences of that decision. Not the system. Not the collective.

            If you don’t have the time or resources to do your job professionally then you need to take a stand and tell your manager so. In writing. “I can’t be sure of the quality of this if you do not give me time to either review the dependencies deeply or write that code myself”. Be a professional.

            The only way the public is going to get software that works, is safe, and isn’t riddled with security holes is if programmers start creating an expectation of professionalism. Nobody says ‘oh well, this bridge doesn’t REALLY matter so we can just cut corners’. Software might not always have the same potentially bad consequences for physical safety (although it often, increasingly often, does), but it has serious consequences for emotional, psychological and privacy safety. What’s more, people should be able to expect software to work, to always work, to work properly and accurately and safely. If that costs more then it costs more. I’m sure unsafe bridges would be cheap too.

            1. 20

              Having, uh, worked on, well, not bridges before, but some physical objects, I really, really hate it when reasoning by analogy is brought up, in the form of bridges or cars or whatever.

              First of all, everybody asks which corners can be cut, all the time. An unsettling amount of regulation is in place precisely in order to prevent people from cutting the wrong corners, and oftentimes precisely because someone did cut the wrong corner in the hope of getting slightly richer, but they ended up killing someone in the process. (Yay for insurance, though – they really did get slightly richer most of the time).

              Second, in other engineering disciplines (electrical, in my case, although I’ve always been an intruder in that field and haven’t really done any EE work in years now) the fact that “the system” sometimes doesn’t work is perfectly understood. Yes, it’s usually one person (or a group of persons) that makes the mistake – the efficient cause of the mistake, if you will – but in a well-led team, that rarely happens in isolation from others, or purely as a personal failure. Most bad decisions I’ve seen happen at the intersection of:

              1. Incomplete understanding of a problem’s implications among management ranks
              2. Incomplete understanding of requirements, deployment constraints, usage patterns etc. among engineers
              3. Lack of patience among management ranks (“I don’t need all these details, just give me, like, an executive summary” – it’s not like a complex problem magically becomes simple once the gremlins realize you’re busy)
              4. Impedance mismatch at the middle-management layer

              In these cases, many mistakes really do end up being “the system’s” fault – in that, although someone’s wrong calculations or wrong prioritization really did end up causing the screw-up, no one, not even that person, can really be blamed for what they did.

              “Defensive” responses to this are absolutely common (and this article is one of them, too), at both ends of the corporate ranks. Upper management & co. insist that personal responsibility is key and that if you think something is wrong, you should speak up and improve the process – gleefully ignoring the fact that doing so usually gets you ignored (see points 1 and 3 above), and sometimes fired if you go over middle management’s head. Middle management and engineers will insist that fault rests solely with the stewards of the system – gleefully ignoring the fact that, indeed, no system is completely impervious to someone doing their job sloppily.

              Of course, the people at the bottom of the hierarchy don’t have any subordinates to shift the blame to, which is why the former reaction ends up getting more street cred, but that’s a whole other story…

              1. 5

                Ok, it’s the fault of individual developers. What then? What are you offering as a solution to that from a practical point of view? Empirically, appeals for professionalism and accountability clearly don’t work.

                Besides, bridges also seem to collapse with some regularity in places like the US and Italy, so I don’t know if that’s a useful counterexample and I don’t know if appealing to professionalism is going to solve that problem either.

                1. 4

                  Depending on random npm packages with no security guarantees or code review isn’t a system problem; it’s the fault of the person that added that dependency or the person that signed off on allowing random unreviewed dependencies.

                  Did you … did you read the thing in the link?

                  1. 2

                    programmers start creating an expectation of professionalism

                    I observe a power imbalance here. Most programmers I know are very quick to shift responsibility away: “It is not my job to prioritize” and “I need to ask my manager”. On the other hand, managers practically never do that: it’s “I will clarify that” or “I will follow up on that”. It does not mean that managers actually solve more issues than programmers, but it creates the perception that the managers are the doers and the programmers are the whiners. If programmers never take responsibility, they will never get the power either.

                    Disclaimer: Of course, taking responsibility comes with a risk. In some circumstances it might even be a trap.

                    1. 1

                      This is both completely true and effectively useless. As a purely practical consideration, @hwayne’s approach to analyzing and addressing issues has a higher likelihood of success than blaming human error.

                    2. 3

                      May I suggest that defects are the fault of Robert Martin because he hasn’t done enough to teach the programmers how to stop them?

                      1. 2

                        I have seen some terrible design work by engineers that cannot be attributed to management or lack of testing. You simply cannot manage your way around bad employees. The more senior the employee, the more responsibility that employee holds.

                        1. 2

                          A lot of the nastiest bugs come from components that are all correct in isolation but interact in a dangerous way.

                          Well, let’s say people want to build a bridge. They start the construction work from two sides at once, so it will be faster. They want to join those sides in the middle. So after a month of construction they finally join the two sides, and they discover that they are off by 1 meter. Whose fault is this? The system’s fault? Earth’s fault? Maybe Isaac Newton is the culprit?

                          I think it’s the fault of some person who was involved in the construction process. Either the designer, or maybe a person whose job was to verify that the requirements were being correctly implemented – I don’t know, but I do know that someone has done a sloppy job.

                          Saying that it is the fault of the system is a manifestation of incompetence, because if you say this, it means that you probably don’t know where exactly the problem is, don’t even know whose fault it was, and don’t know how to prevent this fault from occurring again.

                          1. 6

                            Well, let’s say people want to build a bridge. They start the construction work from two sides at once, so it will be faster. They want to join those sides in the middle. So after a month of construction they finally join the two sides, and they discover that they are off by 1 meter. Whose fault is this? The system’s fault? Earth’s fault? Maybe Isaac Newton is the culprit?

                            Last year I interviewed a bunch of hybrid trad/software engineers, and my main takeaway was that all analogies by software engineers about trad engineers are wrong. One person didn’t have this exact issue, but had a very similar one. On investigation the root cause was something like “the soil under one particular support compacted slightly differently when it was frozen and it rained”, in which case the next questions were 1) why was that enough to throw things off, 2) why wasn’t that detected, 3) is that something they should add to their process, or was the overhead enough to make it unfeasible, 4) …

                            Saying that it is the fault of the system is a manifestation of incompetence, because if you say this, it means that you probably don’t know where exactly the problem is, don’t even know whose fault it was, and don’t know how to prevent this fault from occurring again.

                            It’s the exact opposite. If you say it’s one person’s fault, you leave all of the systemic issues in place to keep causing problems later. Historically, this approach has failed again and again. From Nancy Leveson’s Engineering a Safer World:

                            During and after World War II, the Air Force had serious problems with aircraft accidents: From 1952 to 1966, for example, 7,715 aircraft were lost and 8,547 people killed [79]. Most of these accidents were blamed on pilots. Some aerospace engineers in the 1950s did not believe the cause was so simple and argued that safety must be designed and built into aircraft just as are performance, stability, and structural integrity. Although a few seminars were conducted and papers written about this approach, the Air Force did not take it seriously until they began to develop intercontinental ballistic missiles: there were no pilots to blame for the frequent and devastating explosions of these liquid-propellant missiles. In having to confront factors other than pilot error, the Air Force began to treat safety as a system problem, and System Safety programs were developed to deal with them.

                            1. 5

                              How do you build a bridge from two sides and make them meet in the middle? Imagine it’s never been done before. How do you ensure they actually meet in the middle?

                              Well, if you’re good at trying to build bridges, you sit down, ideally with some other smart people with a broad spectrum of disciplines, and you write down everything that could go wrong with the process. You figure out how wrong each thing could go, and you come up with a plan to mitigate each thing. Then in the real world, something happens that you didn’t anticipate (and often literally couldn’t anticipate), you work to understand it, and figure out how to compensate for it. Then the next time someone builds a bridge starting from both sides at once, they learn from your mistakes.

                              Case study: SpaceX rocket explosion of Sept 1, 2016. A rocket exploded due to a complicated and not super likely series of events that made a composite tank fail. As far as I understand, nobody foresaw this happening… because nobody had ever used composite tanks carrying helium dunked in liquid oxygen at the temperatures that SpaceX was working with. It was a new and unforeseen failure mode. Who did a sloppy job there? The person who wrote the requirements? The person who signed off on them? The person who designed the tank to those requirements? The person who manufactured the tank?

                              That rocket and payload cost, I dunno, something like a couple hundred million dollars. Compare with, say, a major AWS outage or two. Some of those have probably cost about as much, all told. And you know what? Nobody’d ever built AWS before either.

                              Things fail. Engineering is a process of planning to minimize failures, and testing to try to catch failures before they matter, and then, if something fails anyway, figuring out how it happened to ensure that it doesn’t fail the same way twice.

                              1. 2

                                There is a difference between doing something that hasn’t been done before and doing something that has a well-known production methodology. You’ve assumed just the first case, but I had the impression the article talks about both cases.

                                It’s true that people make mistakes, even in cases where the methodology is well known. That’s why, when a guy from my team makes a mistake in the code because he didn’t perform some null check, I always say: people make mistakes, let’s do a fix, release an update, and we’re done. That doesn’t mean the problem wasn’t his fault. It was his fault; we just choose not to make an issue of it, because mistakes happen all the time, and I’m certainly not an exception.

                                Contrary to this situation, where a new technology is being worked on, so that the majority of people are not competent in it (this is a normal situation; it’s impossible to be competent in an area that is new), it will also be someone’s fault. Sometimes the supervisor won’t even know who should be doing additional verification, because the area still hasn’t been researched well enough to know what methodology to use, but this only means that the manager is at fault here, because the manager is trying to do something that he doesn’t know that much about. In the case of companies that want to be pioneers, this fault and the losses generated by the accidents are taken into consideration when building business plans, and sometimes even expensive faults are still acceptable, but I don’t think it’s okay to say that nobody is to blame for any accidents.

                                As for your question, I have no idea who was to blame for the SpaceX explosion, because I don’t know anything about rockets.

                                As far as I understand, nobody foresaw this happening… because nobody had ever used composite tanks carrying helium dunked in liquid oxygen at the temperatures that SpaceX was working with.

                                And this makes the accident okay? I mean, if someone doesn’t think about bad outcomes because those outcomes are outside of their competence, then it’s okay to make mistakes? I still think that there was a person that should think of him/herself as being accountable for the fault. I’m not saying the person should be fired from SpaceX or anything; it might be the case that this person is still the best person for the job and that nobody would do it better. But saying that it was nobody’s fault tells me only that someone doesn’t have enough information to be able to tell what exactly went wrong (e.g. in the production process methodology).

                                Being wrong about something shouldn’t turn anyone into a black sheep. It shouldn’t put any social stigma on that person, because only a person that never did anything in his life can say that he’s always right. I think that any punishments or stigmas should be conditioned on the answers to the questions: 1) ‘what has this person done in order to minimize the risk’ and 2) ‘what was done in order not to make the same mistake again’, and it shouldn’t be about whether it’s the person’s fault or not.

                                1. 3

                                  I think that we’re both conflating two different things: Whose FAULT something is, vs. whose RESPONSIBILITY something is.

                                  I think that any punishments or stigmas should be conditioned on the answers to the questions: 1) ‘what has this person done in order to minimize the risk’ and 2) ‘what was done in order not to make the same mistake again’, and it shouldn’t be about whether it’s the person’s fault or not.

                                  This is exactly what I want, yes. The way I see it, “fault” is worthless for these goals. Saying “it’s my fault” doesn’t really help make anything better. Saying “it’s my responsibility” IS useful: it means someone is going to take action to make sure it doesn’t happen again. “Blame” is pretty worthless for actually getting anything done, IMO; it just makes people afraid of failing. Failure is part of life. Blame and fault, as the original article says, “[do] nothing to help us fix defects.”

                            2. 1

                              Again, building robust software is exactly like building any other robust machine. Every other field of engineering has been struggling with this for about 200 years now, and has methods and processes for designing things that are based on the assumption that the engineer will make mistakes. We’re only human, after all. This takes the right approach: if you want to solve problems, you need to understand them first.

                              Edit: Nevermind, x64k said it better: https://lobste.rs/s/zi1lvy/defects_are_not_fault_programmers#c_xa4w88