FWIW the motivation for this was apparently a comment on a thread about a review of the book “Software Engineering at Google” by Titus Winters, Tom Manshreck, and Hyrum Wright:
https://lobste.rs/s/9n7aic/what_i_learned_from_software_engineering
I meant to comment on that original thread, because I thought the question was misguided. Well now that I look it’s actually been deleted?
Anyway, the point is that the empirical question isn’t really actionable IMO. You could “answer it” and it still wouldn’t tell you what to do.
I think you got this post exactly right – there’s no amount of empiricism that can help you. Software engineering has changed so much in the last 10 or 20 years that you can trivially invalidate any study.
Yaron Minsky has a saying that “there’s no pile of sophomores high enough” that is going to prove anything about writing code. (Ironically he says that in advocacy of static typing, which I view as an extremely domain specific question.) Still I agree with his general point.
This is not meant to be an insult, but when I see the names Titus Winters and Hyrum Wright, I’m less interested in the work. This is because I worked at Google for over a decade and got lots of refactoring and upgrade changelists/patches from them, as maintainer of various parts of the codebase. I think their work is extremely valuable, but it is fairly particular to Google, and in particular it’s done without domain knowledge. They are doing an extremely good job of doing what they can to improve the codebase without domain knowledge, which is inherent in their jobs, because they’re making company-wide changes.
However most working engineers don’t improve code without domain knowledge, and the real improvements to code require such knowledge. You can only nibble at the edges otherwise.
@peterbourgon said basically what I was going to say in the original thread – this advice is generally good in the abstract, but it lacks context:
https://lobste.rs/s/9n7aic/what_i_learned_from_software_engineering
The way I learned things at Google was to look at what people who “got things done” did. They generally “break the rules” a bit. They know what matters and what doesn’t matter.
Jeff Dean and Sanjay Ghemawat indeed write great code, and early in my career I exchanged a few CLs with them and learned a lot. I also referenced a blog post by Paul Buchheit in The Simplest Explanation of Oil.
For those who don’t know, he was the creator of Gmail, working on it for 3 years as a side project (and Gmail was amazing back then, faster than desktop MS Outlook, even though it’s rotted now). He mentions in that post how he prototyped some ads with the aid of some Unix shell. (Again, ads are horrible now, a cancer on the web – back then they were useful and fast. Yes really. It’s hard to convey the difference to someone who wasn’t a web user then.)
As a couple of other anecdotes, I remember people at work complaining that Guido van Rossum’s functions were too long. (Actually I somewhat agreed, but he did it in service of getting something done, and it can be fixed later.)
I also remember Bram Moolenaar’s (author of Vim) Java readability review, where he basically broke all the rules and got angry at the system. (For a brief time I was one of the people who picked the Python readability reviewers, so I’m familiar with this style of engineering. I had to manage some disputes between reviewers and applicants.)
So you have to take all these rules with a grain of salt. These people can obviously get things done, and they all do things a little differently. They don’t always write as many tests as you’d ideally like. One of the things I tried to do as a readability reviewer was to push back against dogma and get people to relax a bit. There is value to global consistency, but there’s also value to local domain-specific knowledge. My pushing back was not really successful and Google engineering has gotten more dogmatic and sclerotic over the years. It was not fun to write code there by the time I left (over 5 years ago).
So basically I think you have to look at what people build and see how they do it. I would rather read a bunch of stories like “Coders at Work” or “Masterminds of Programming” than read any empirical study.
I think there should be a name for this empirical fallacy (or it probably already exists?) Another area where science has roundly failed is nutrition and preventative medicine. Maybe not for the same exact reasons, but the point is that controlled experiments are only one way of obtaining knowledge, and not the best one for many domains. They’re probably better at what Taleb calls “negative knowledge” – i.e. disproving something, which is possible and valuable. Trying to figure out how to act in the world (how to create software) is less possible. All things being equal, more testing is better, but all things aren’t always equal.
Oil is probably the most rigorously tested project I’ve ever worked on, but this is because of the nature of the project, and it isn’t right for all projects as a rule. It’s probably not good if you’re trying to launch a video game platform like Stadia, etc.
Anyway, the point is that the empirical question isn’t really actionable IMO. You could “answer it” and it still wouldn’t tell you what to do.
I think you got this post exactly right – there’s no amount of empiricism that can help you.
This was my exact reaction when I read the original question motivating Hillel’s post.
I even want to take it a step further and say: Outside a specific context, the question doesn’t make sense. You won’t be able to measure it accurately, and even if you could, there would be such huge variance depending on other factors across teams where you measured it that your answer wouldn’t help you win any arguments.
I think there should be a name for this empirical fallacy
It seems especially to afflict the smart and educated. Having absorbed the lessons of science and the benefits of skepticism and self-doubt, you can ask of any claim “But is there a study proving it?”. It’s a powerful debate trick too. But it can often be a category error. The universe of useful knowledge is much larger than the subset that has been (or can be) tested with a random double blind study.
I even want to take it a step further and say: Outside a specific context, the question doesn’t make sense. You won’t be able to measure it accurately, and even if you could, there would be such huge variance depending on other factors across teams where you measured it that your answer wouldn’t help you win any arguments.
It makes a lot of sense to me in my context, which is trying to convince skeptical managers that they should pay for my consulting services. But it’s intended to be used in conjunction with rhetoric, demos, case studies, testimonials, etc.
It seems especially to afflict the smart and educated. Having absorbed the lessons of science and the benefits of skepticism and self-doubt, you can ask of any claim “But is there a study proving it?”. It’s a powerful debate trick too. But it can often be a category error. The universe of useful knowledge is much larger than the subset that has been (or can be) tested with a random double blind study.
I’d say in principle it’s Scientism, in practice it’s often an intentional sabotaging tactic.
It makes a lot of sense to me in my context, which is trying to convince skeptical managers that they should pay for my consulting services. But it’s intended to be used in conjunction with rhetoric, demos, case studies, testimonials, etc.
100%.
I should have said: I don’t think it would help you win any arguments with someone knowledgeable. I completely agree that in the real world, where people are making decisions off rough heuristics and politics is everything, this kind of evidence could be persuasive.
So a study showing that “catching bugs early saves money” functions here like a white lab coat on a doctor: it makes everyone feel safer. But what’s really happening is that they are just trusting that the doctor knows what he’s doing. Imo the other methods for establishing trust you mentioned – rhetoric, demos, case studies, testimonials, etc. – imprecise as they are, are probably more reliable signals.
EDIT: Also, just to be clear, I think the right answer here, the majority of the time, is “well obviously it’s better to catch bugs early than later.”
And in which cases is this false? Is it when the team has lots of senior engineers? Is it when the team controls both the software and the hardware? Is it when OTA updates are trivial? (Here is a knock-on effect: what if OTA updates make this assertion false, but then open up a huge can of security vulnerabilities, which overall negates any benefit that the OTA updates add?) What does a majority here mean? I mean, a majority of 55% means something very different from a majority of 99%.
This is the value of empirical software study. Adding precision to assertions (such as understanding that a 55% majority is a bit pathological but a 99% majority certainly isn’t.) Diving into data and being able to understand and explore trends is also another benefit. Humans are motivated to categorize their experiences around questions they wish to answer but it’s much harder to answer questions that the human hasn’t posed yet. What if it turns out that catching bugs early or late is pretty much immaterial where the real defect rate is simply a function of experience and seniority?
This is the value of empirical software study.
I think empirical software study is great, and has tons of benefits. I just don’t think you can answer all questions of interest with it. The bugs question we’re discussing is one of those.
And in which cases is this false? Is it when the team has lots of senior engineers? Is it when the team controls both the software and the hardware? Is it when OTA updates are trivial? (Here is a knock-on effect: what if OTA updates make this assertion false, but then open up a huge can of security vulnerabilities, which overall negates any benefit that the OTA updates add?)
I mean, this is my point. There are too many factors to consider. I could add 50 more points to your bullet list.
What does a majority here mean?
Something like: “I find it almost impossible to think of examples from my personal experience, but understand the limits of my experience, and can imagine situations where it’s not true.” I think if it is true, it would often indicate a dysfunctional code base where validating changes out of production (via tests or other means) was incredibly expensive.
What if it turns out that catching bugs early or late is pretty much immaterial where the real defect rate is simply a function of experience and seniority?
One of my points is that there is no “turns out”. If you prove it one place, it won’t translate to another. It’s hard even to imagine an experimental design whose results I would give much weight to. All I can offer is my opinion that this strikes me as highly unlikely for most businesses.
Why is software engineering such an outlier when we’ve been able to measure so many other things? We can measure vaccine efficacy and health outcomes (among disparate populations with different genetics, diets, culture, and life experiences), we can measure minerals in soil, we can analyze diets, heat transfer, we can even study government policy, elections, and even personality, though it’s messy. What makes software engineering so much more complex and context dependent than even a person’s personality?
The fallacy I see here is simply that software engineers see this massive complexity in software engineering because they are software experts and believe that other fields are simpler because software engineers are not experts in those fields. Every field has huge amounts of complexity, but what gives us confidence that software engineering is so much more complex than other fields?
Why is software engineering such an outlier when we’ve been able to measure so many other things?
You can measure some things, just not all. Remember the point of discussion here is: Can you empirically investigate the claim “Finding bugs earlier saves overall time and money”? My position is basically: “This is an ill-defined question to ask at a general level.”
We can measure vaccine efficacy and health outcomes (among disparate populations with different genetics, diets, culture, and life experiences)
Yes.
we can measure minerals in soil, we can analyze diets, heat transfer,
Yes.
we can even study government policy
In some way yes, in some ways no. This is a complex situation with tons of confounds, and also a place where policy outcomes in some places won’t translate to other places. This is probably a good analog for what makes the question at hand difficult.
and even personality
Again, in some ways yes, in some ways no. With the big 5, you’re using the power of statistical aggregation to cut through things we can’t answer. Of which there are many. The empirical literature on “code review being generally helpful” seems to have a similar force. You can take disparate measures of quality, disparate studies, and aggregate to arrive at relatively reliable conclusions. It helps that we have an obvious, common sense causal theory that makes it plausible.
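To make the aggregation idea concrete, here is a minimal sketch of fixed-effect, inverse-variance pooling; the effect sizes and standard errors are invented for illustration, not taken from the code-review literature.

```python
# Hypothetical effect sizes (e.g. standardized defect-reduction estimates)
# and standard errors from three imaginary code-review studies.
studies = [(0.30, 0.10), (0.15, 0.08), (0.22, 0.15)]

# Fixed-effect, inverse-variance pooling: noisier studies get less weight.
weights = [1.0 / se ** 2 for _, se in studies]
pooled = sum(w * eff for (eff, _), w in zip(studies, weights)) / sum(weights)
pooled_se = (1.0 / sum(weights)) ** 0.5

print(f"pooled effect ~ {pooled:.2f} +/- {1.96 * pooled_se:.2f}")
```

Each individual study is weak, but the pooled estimate is tighter than any one of them, which is the force the Big 5 and code-review literatures lean on.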
What makes software engineering so much more complex and context dependent than even a person’s personality?
I don’t think it is.
Every field has huge amounts of complexity, but what gives us confidence that software engineering is so much more complex than other fields?
I don’t think it is, and this is not where my argument is coming from. There are many questions in other fields equally unsuited to empirical investigation as: “Does finding bugs earlier save time and money?”
In some way yes, in some ways no. This is a complex situation with tons of confounds, and also a place where policy outcomes in some places won’t translate to other places. This is probably a good analog for what makes the question at hand difficult.
That hasn’t stopped anyone from performing the analysis and using these analyses to implement policy. That analysis of this data is imperfect is beside the point; it still provides some amount of positive value. Software is in the data dark ages in comparison to government policy; what data-driven decisions have been made among software engineering teams? I don’t think we even understand whether Waterfall or Agile reduces defect rates or time to ship compared to the other.
With the big 5, you’re using the power of statistical aggregation to cut through things we can’t answer. Of which there are many. The empirical literature on “code review being generally helpful” seems to have a similar force. You can take disparate measures of quality, disparate studies, and aggregate to arrive at relatively reliable conclusions. It helps that we have an obvious, common sense causal theory that makes it plausible.
What’s stopping us from doing this with software engineering? Is it the lack of a causal theory? There are techniques to try to glean causality from statistical models. Is this not in line with your definition of “empirically”?
That hasn’t stopped anyone from performing the analysis and using these analyses to implement policy. That analysis of this data is imperfect is beside the point; it still provides some amount of positive value.
It’s not clear to me at all that, as a whole, “empirically driven” policy has had positive value? You can point to successful cases and disasters alike. I think in practice the “science” here is at least as often used as a veneer to push through an agenda as it is to implement objectively more effective policy. Just as software methodologies are.
Is it the lack of a causal theory?
I was saying there is a causal theory for why code review is effective.
What’s stopping us from doing this with software engineering?
Again, some parts of it can be studied empirically, and should be. I’m happy to see advances there. But I don’t see the whole thing being tamed by science. The high-order bits in most situations are politics and other human stuff. You mentioned it being young… but here’s an analogy that might help with where I’m coming from. Teaching writing, especially creative writing. It’s equally ad-hoc and unscientific, despite being old. MFA programs use different methodologies and writers subscribe to different philosophies. There is some broad consensus about general things that mostly work and that most people do (workshops), but even within that there’s a lot of variation. And great books are written by people with wildly different approaches. There are some nice efforts to leverage empiricism like Steven Pinker’s book and even software like https://hemingwayapp.com/, but systematization can only go so far.
We can measure vaccine efficacy and health outcomes (among disparate populations with different genetics, diets, culture, and life experiences)
Good vaccine studies are pretty expensive from what I know, but they have statistical power for that reason.
Health studies are all over the map. The “pile of college sophomores” problem very much applies there as well. There are tons of studies done on Caucasians that simply don’t apply in the same way to Asians or Africans, yet some doctors use that knowledge to treat patients.
Good doctors will use local knowledge and rules of thumb, and they don’t believe every published study they see. That would honestly be impossible, as lots of them are in direct contradiction to each other. (Contradiction is a problem that science shares with apprenticeship from experts; for example IIRC we don’t even know if a high fat diet causes heart disease, which was accepted wisdom for a long time.)
I would recommend reading some books by Nassim Taleb if you want to understand the limits of acquiring knowledge through measurement and statistics (Black Swan, Antifragile, etc.). Here is one comment I made about them recently: https://news.ycombinator.com/item?id=27213384
Key point: acting in the world, i.e. decision making under risk, is fundamentally different from scientific knowledge. Tinkering and experimentation are what drive real changes in the world, not planning by academics. He calls the latter “the Soviet-Harvard school”.
The books are not well organized, but he hammers home the difference between acting in the world and knowledge over and over in many different ways. If you have to have scientific knowledge before acting, you will be extremely limited in what you can do. You will probably lose all your money in the markets too :)
Update: after Googling the term I found in my notes, I’d say “Soviet-Harvard delusion” captures the crux of the argument here. One short definition is the (unscientific) overestimation of the reach of scientific knowledge.
This sounds like empiricism. Not in the sense of “we can only know what we can measure” but in the sense of “I can only know what I can experience”. The Royal Society’s motto is “take nobody’s word for it”.
Tinkering and experimentation are what drive real changes in the world, not planning by academics.
I 100% agree but it’s not the whole picture. You need theory to compress and see further. It’s the back and forth between theory and experimentation that drives knowledge. Tinkering alone often ossifies into ritual. In programming, this has already happened.
I wouldn’t agree programming has ossified into ritual. Certainly it has at Google, which has a rigid coding style, toolchain, and set of languages – and it’s probably worse at other large companies.
But I see lots of people on this site doing different things, e.g. running OpenBSD and weird hardware, weird programming languages, etc. There are also tons of smaller newer companies using different languages. Lots of enthusiasm around Rust, Zig, etc. and a notable amount of production use.
Good vaccine studies are pretty expensive from what I know, but they have statistical power for that reason.
Oh sure, I’m not saying this will be cheap. In fact the price of collecting good data is what I feel makes this research so difficult.
Health studies are all over the map. The “pile of college sophomores” problem very much applies there as well. There are tons of studies done on Caucasians that simply don’t apply in the same way to Asians or Africans, yet some doctors use that knowledge to treat patients.
We’ve developed techniques to deal with these issues, though of course, you can’t draw a conclusion with extremely low sample sizes. One technique used frequently to compensate for low statistical power studies in meta studies is called Post-Stratification.
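As a rough sketch of what post-stratification looks like in practice (strata, counts, and fix times are all hypothetical): reweight stratum-level estimates by the composition of the population you care about rather than the composition of the sample.

```python
# Hypothetical study: fix times measured mostly on junior engineers,
# while the team we actually care about is mostly senior.
sample = {               # stratum -> (n observed, mean hours to fix a bug)
    "junior": (80, 6.0),
    "senior": (20, 2.5),
}
target_mix = {"junior": 0.3, "senior": 0.7}  # assumed make-up of our own team

total_n = sum(n for n, _ in sample.values())
naive = sum(n * mean for n, mean in sample.values()) / total_n
poststratified = sum(target_mix[s] * mean for s, (_, mean) in sample.items())

print(f"naive: {naive:.2f}h per bug, post-stratified: {poststratified:.2f}h")
```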
Good doctors will use local knowledge and rules of thumb, and they don’t believe every published study they see. That would honestly be impossible, as lots of them are in direct contradiction to each other. (Contradiction is a problem that science shares with apprenticeship from experts; for example IIRC we don’t even know if a high fat diet causes heart disease, which was accepted wisdom for a long time.)
I think medicine is a good example of empiricism done right. Sure, we can look at modern failures of medicine and nutrition and use these learnings to do better, but medicine is significantly more empirical than software. I still maintain that if we can systematize our understanding of the human body and medicine that we can do the same for software, though like a soft science, definitive answers may stay elusive. Much work over decades went into the medical sciences to define what it even means to have an illness, to feel pain, to see recovery, or to combat an illness.
I would recommend reading some books by Nassim Taleb if you want to understand the limits of acquiring knowledge through measurement and statistics (Black Swan, Antifragile, etc.). Here is one comment I made about them recently: https://news.ycombinator.com/item?id=27213384
Key point: acting in the world, i.e. decision making under risk, is fundamentally different from scientific knowledge. Tinkering and experimentation are what drive real changes in the world, not planning by academics. He calls the latter “the Soviet-Harvard school”.
I’m very familiar with Taleb’s Antifragile thesis and the “Soviet-Harvard delusion”. As someone well versed in statistics, these are theses that are both pedestrian (Antifragile itself being a pop-science look into a field of study called Extreme Value Theory) and old (Maximum Likelihood approaches to decision theory are susceptible to extreme/tail events which is why in recent years Bayesian and Bayesian Causal analyses have become more popular. Pearson was aware of this weakness and explored other branches of statistics such as Fiducial Inference). (Also I don’t mean this as criticism toward you, though it’s hard to make this tone come across over text. I apologize if it felt offensive, I merely wish to draw your eyes to more recent developments.)
To draw the discussion to a close, I’ll try to summarize my position a bit. I don’t think software empiricism will answer all the questions, nor will we get to a point where we can rigorously determine that some function f exists that can model our preferences. However I do think software empiricism together with standardization can offer us a way to confidently produce low-risk, low-defect software. I think modern statistical advances have offered us ways to understand more than statistical approaches in the ‘70s and that we can use many of the newer techniques used in the social and medical sciences (e.g. Bayesian methods) to prove results. I don’t think that, even if we start a concerted approach today to do this, that our understanding will get there in a matter of a few years. To do that would be to undo decades of software practitioners creating systemic analyses from their own experiences and to create a culture shift away from the individual as artisan to a culture of standardization of both communication of results (what is a bug? how does it affect my code? how long did it take to find? how long did it take to resolve? etc) and of team conditions (our team has n engineers, our engineers have x years of experience, etc) that we just don’t have now. I have hope that eventually we will begin to both standardize and understand our industry better but in the near-term this will be difficult.
I’m reading a book right now about 17th century science. The author has some stuff to say about Bacon and Empiricism but I’ll borrow an anecdote from the book. Boyle did an experiment where he grew a pumpkin and measured the dirt before and after. The weight of the dirt hadn’t changed much. The only other ingredient that had been added was water. It was obvious that the pumpkin must be made of only water.
This idea that measurement and observation drive knowledge is Bacon’s legacy. Even in Bacon’s own lifetime, it’s not how science unfolded.
Fun fact: Bacon is often considered the modern founder of the idea that knowledge can be used to create human-directed progress. Before him, while scholars and astronomers used to often study things and invent things, most cultures still viewed life and nature as a generally haphazard process. As with most things in history the reality involves more than just Bacon, and there most-certainly were non-Westerners who had similar ideas, but Bacon still figures prominently in the picture.
Hm interesting anecdote that I didn’t know about (I looked it up). Although I’d say that’s more an error of reasoning within science? I realized what I was getting at could be called the Soviet-Harvard delusion, which is overstating the reach of scientific knowledge (no insult intended, but it is a funny and memorable name): https://lobste.rs/s/v4unx3/i_ing_hate_science#c_nrdasq
To be fair, the vast majority of the mass of the pumpkin is water. So the inference was correct to first order. The second-order correction of “and carbon from the air”, of course, requires being much more careful in the inference step.
So basically I think you have to look at what people build and see how they do it. I would rather read a bunch of stories like “Coders at Work” or “Masterminds of Programming” than read any empirical study.
Perhaps, but this is already what happens, and I think it’s about time we in the profession raise our standards, both of pedagogy and of practice. Right now you can take a casual search on the Web and you can find respected talking-heads talk about how their philosophy is correct, despite being in direct contrast to another person’s philosophy. This behavior is reinforced by the culture wars of our times, of course, but there’s still much more aimless discourse than there is consistency in results. If we want to start taking steps to improve our practice, I think it’s important to understand what we’re doing right and more importantly what we’re doing wrong. I’m more interested here in negative results than positive results. I want to know where as a discipline software engineering is going wrong. There’s also a lot at stake here purely monetarily; corporations often embrace a technology methodology and pay for PR and marketing about their methodology to both bolster their reputations and to try to attract engineers.
I think there should be a name for this empirical fallacy (or it probably already exists?) Another area where science has roundly failed is nutrition and preventative medicine.
I don’t think we’re even at the point in our empirical understanding of software engineering where we can make this fallacy. What do we even definitively understand about our field? I’d argue that psychology and sociology have stronger well-known results than what we have in software engineering even though those are very obviously soft sciences. I also think software engineers are motivated to think the problem is complex and impossible to be empirical for the same reason that anyone holds their work in high esteem; we believe our work is complicated and requires highly contextual expertise to understand. However if psychology and sociology can make empirical progress in their fields, I think software engineers most definitely can.
Do you have an example in mind of the direct contradiction? I don’t see much of a problem if different experts have different opinions. That just means they were building different things and different strategies apply.
Again I say it’s good to “look at what people build” and see if it applies to your situation; not blindly follow advice from authorities (e.g. some study “proved” this, or some guy from Google who may or may not have built things said this was good; therefore it must be good).
I don’t find a huge amount of divergence in the opinions of people who actually build stuff, vs. talking heads. If you look at what John Carmack says about software engineering, it’s generally pretty level-headed, and he explains it well. It’s not going to differ that much from what Jeff Dean says. If you look at their C++ code, there are even similarities, despite drastically different domains.
Again the fallacy is that there’s a single “correct” – it depends on the domain; a little diversity is a good thing.
Do you have an example in mind of the direct contradiction? I don’t see much of a problem if different experts have different opinions. That just means they were building different things and different strategies apply.
I don’t find a huge amount of divergence in the opinions of people who actually build stuff, vs. talking heads. If you look at what John Carmack says about software engineering, it’s generally pretty level-headed, and he explains it well. It’s not going to differ that much from what Jeff Dean says. If you look at their C++ code, there are even similarities, despite drastically different domains.
Who builds things though? Several people build things. While we hear about John Carmack and Jeff Dean, there are folks plugging away at the Linux kernel, on io_uring, on capability object systems, and all sorts of things that many of us will never be aware of. As an example, Sanjay Ghemawat is someone who I wasn’t familiar with until you talked about him. I’ve also interacted with folks in my career who I presume you’ve never interacted with and yet have been an invaluable source of learnings for my own code. Moreover these experience reports are biased by their reputations; I mean of course we’re more likely to listen to John Carmack than some Vijay Foo (not a real person, as far as I’m aware) because he’s known for his work at iD, even if this Vijay Foo may end up having as many or more actionable insights than John Carmack. Overcoming reputation bias and lack of information about “builders” is another side effect I see of empirical research. Aggregating learnings across individuals can help surface lessons that otherwise would have been lost due to structural issues of acclaim and money.
Again the fallacy is that there’s a single “correct” – it depends on the domain; a little diversity is a good thing.
This seems to be a sentiment I’ve read elsewhere, so I want to emphasize: I don’t think there’s anything wrong with diversity and I don’t think Empirical Software Engineering does anything to harm diversity. Creating complicated probabilistic models of spaces necessarily involves many factors. We can create a probability space which has all of the features we care about. Just condition against your “domain” (e.g. kernel work, distributed systems, etc) and slot your result into that domain. I don’t doubt that a truly descriptive probability space will be very high dimensional here but I’m confident we have the analytical and computational power to perform this work nonetheless.
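A toy version of “condition against your domain”, with entirely made-up numbers: instead of quoting one global defect-rate estimate, keep per-domain estimates and read off the one matching your context.

```python
from collections import defaultdict

# Hypothetical per-project records: (domain, defects per KLOC).
records = [
    ("kernel", 0.8), ("kernel", 1.1),
    ("distributed", 2.3), ("distributed", 1.9),
    ("web", 3.0), ("web", 2.6), ("web", 3.4),
]

by_domain = defaultdict(list)
for domain, rate in records:
    by_domain[domain].append(rate)

# A single global estimate vs. the estimate conditioned on our own domain.
global_rate = sum(rate for _, rate in records) / len(records)
our_domain = "distributed"
domain_rate = sum(by_domain[our_domain]) / len(by_domain[our_domain])

print(f"global: {global_rate:.2f}, {our_domain}: {domain_rate:.2f}")
```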
The real challenge I suspect will be to gather the data. FOSS developers are time and money strapped as it is, and excluding some exceptional cases such as curl’s codebase statistics, they’re rarely going to have the time to take the detailed notes it would take to drive this research forward. Corporations which develop proprietary software have almost no incentive to release this data to the general public given how much it could expose about their internal organizational structure and coding practices, so rather than open themselves up to scrutiny they keep the data internal if they measure it at all. Combating this will be a tough problem.
Yeah I don’t see any conflict there (and I’ve watched the first one before). I use both static and dynamic languages and there are advantages and disadvantages to each. I think any programmer should be comfortable using both styles.
I think that the notion that a study is going to change anyone’s mind is silly, like “I am very productive in statically typed languages. But a study said that they are not more productive; therefore I will switch to dynamically typed”. That is very silly.
It’s also not a question that’s ever actionable in reality. Nobody says “Should I use a static or dynamic language for this project?” More likely you are working on existing codebase, OR you have a choice between say Python and Go. The difference between Python and Go would be a more interesting and accurate study, not static vs. dynamic. But you can’t do an “all pairs” comparison via scientific studies.
If there WERE a study definitively proving that, say, dynamic languages are “better” (whatever that means), and you chose Python over Go for that reason, that would be a huge mistake. It’s just not enough evidence; the languages are different for other reasons.
I think there is value to scientific studies on software engineering, but I think the field just moves very fast, and if you wait for science, you’ll be missing out on a lot of stuff. I try things based on what people who get things done do (e.g. OCaml), and incorporate it into my own work, and that seems like a good way of obtaining knowledge.
Likewise, I think “Is catching bugs earlier less expensive” is a pretty bad question. A better scientific question might be “is unit testing in Python more effective than integration testing Python with shell” or something like that. Even that’s sort of silly because the answer is “both”.
But my point is that these vague and general questions simply leave out a lot of subtlety of any particular situation, and can’t be answered in any useful way.
I think that the notion that a study is going to change anyone’s mind is silly, like “I am very productive in statically typed languages. But a study said that they are not more productive; therefore I will switch to dynamically typed”. That is very silly.
While the example of static and dynamic typing is probably too broad to be meaningful, I don’t actually think this would be true. It’s a bit like saying “Well I believe that Python is the best language and even though research shows that Go has properties <x, y, and z> that are beneficial to my problem domain, I’m going to ignore them and put a huge prior on my past experience.” It’s the state of the art right now; trust your gut and the guts of those you respect, not the other guts. If we can’t progress from here I would indeed be sad.
It’s also not a question that’s ever actionable in reality. Nobody says “Should I use a static or dynamic language for this project?” More likely you are working on existing codebase, OR you have a choice between say Python and Go. The difference between Python and Go would be a more interesting and accurate study, not static vs. dynamic. But you can’t do an “all pairs” comparison via scientific studies.
Sure, as you say, static vs dynamic languages isn’t very actionable but Python vs Go would be. And if I’m starting a new codebase, a new project, or a new company, it might be meaningful to have research that shows that, say, Python has a higher defect rate but an overall lower mean time to resolution of these defects. Prior experience with Go may trump benefits that Python has (in this synthetic example) if project time horizons are short, but if time horizons are long Go (again in the synthetic example) might look better. I think this sort of comparative analysis in defect rates, mean time to resolution, defect severity, and other attributes can be very useful.
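To sketch how defect rate, mean time to resolution, and project horizon could trade off, here is a back-of-the-envelope model; every parameter is invented for the example and is not real Python or Go data.

```python
# Invented parameters only: monthly defect load, fix cost, and a one-time
# ramp-up cost for the language the team doesn't already know.
options = {
    "familiar_lang":   {"defects_per_month": 12, "mttr_hours": 3, "ramp_up_hours": 0},
    "unfamiliar_lang": {"defects_per_month": 7,  "mttr_hours": 3, "ramp_up_hours": 200},
}

def total_cost_hours(name, months):
    p = options[name]
    return p["ramp_up_hours"] + p["defects_per_month"] * p["mttr_hours"] * months

for months in (3, 36):  # short vs. long project horizon
    costs = {name: total_cost_hours(name, months) for name in options}
    print(f"{months} months: {costs}")
```

With these made-up numbers the familiar language wins on a short horizon and the lower-defect one wins on a long horizon, which is the kind of decision empirical inputs could sharpen.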
Personally, I’m not satisfied by the state of the art of looking at builders. I think the industry really needs a more rigorous look at its assumptions and even if we never truly systematize and Fordify the field (which fwiw I don’t think is possible), I certainly think there’s a lot of progress for us to make yet and many pedestrian questions that we can answer that have no answers yet.
Sure, as you say, static vs dynamic languages isn’t very actionable but Python vs Go would be. And if I’m starting a new codebase, a new project, or a new company, it might be meaningful to have research that shows that, say, Python has a higher defect rate but an overall lower mean time to resolution of these defects.
Python vs Go defect rates also seem to me to be far too general for an empirical study to produce actionable data.
How do you quantify a “defect rate” in a way that’s relevant to my problem, for example? There are a ton of confounds: genre of software, timescale of development, size of team, composition of team, goals of the project, etc. How do I know that some empirical study comparing defect rates of Python vs. Go in, I dunno, the giant Google monorepo, is applicable to my context? Let’s say I’m trying to pick a language to write some AI research software, which will have a 2-person team, no monorepo or formalized code-review processes, a target 1-year timeframe to completion, and a primary metric of producing figures for a paper. Why would I expect the Google study to produce valid data for my decision-making?
Nobody says “Should I use a static or dynamic language for this project?”
Somebody does. Somebody writes the first code on a new project and chose the language. Somebody sets the corporate policy on permissible languages. Would be amazing if even a tiny input to these choices were real instead of just perceived popularity and personal familiarity.
Tangentially. I did my MS work on debugging. Because I am fascinated by both computers and history, I dug through essentially all the papers written on the matter between, um, 1949 and 1980, and a fair sampling thereafter (because the paper count got too large). I finished that work in 2011ish (I had to cut a great deal of the historical analysis out of my finished MS work, so it’s not available to read, sorry). What was fascinating is that after a certain point, the academic work on software engineering work became navel gazing. Initially it was done by people who were actively programming in practice. Then the work migrated away from the active programming and more and more into sort of generic or ultra-specialized systems designed for studying debugging, which only had, in passing, a relation to the practical systems. Too, fad work started to show up in the early 90s.
Much like OP, I keep an eye on empirical software engineering, particularly around codebase metrics. My broad take is that it’s useful bunkum. Much like Red Riding Hood is about opsec but never actually happened, metrics tell a story, and it’s a useful story, but maybe don’t believe them too literally.
One more wrinkle: “are late bugs more expensive to fix” isn’t even the question we need to answer. It’s an intellectually interesting question, but the question we actually need to answer is “how should we develop software?”
If bugs we discover late are more expensive, that suggests that we could save money by finding them earlier. But that’s not a guarantee–perhaps they’re more expensive to fix because they’re just harder to find, not because they’re found later? I’m happy to assume it’s some of the second, even without research, but they’re two different explanations of the hypothesized phenomenon.
Agreed. Oftentimes the better question wasn’t “When should we fix these bugs?” but rather “How might this new feature impact the bug profile of our codebase” or even “Should we spend code on this feature in the first place?”
That’s a great point, and to add to it, I think looking at whether a specific bug would have been cheaper to fix earlier discounts the possibility that the cost of finding the bug earlier would have exceeded the cost of fixing it later. My intuition is that that must be true at least some of the time.
Over time you slowly build out a list of good “node papers” (mostly literature reviews) and useful terms to speed up this process, but it’s always gonna be super time consuming.
Although it doesn’t really replace the manual thoroughness, I’ve found Connected Papers and Citation Tree to be handy tools for augmenting this process – if only to get a quick lay of the land!
I really appreciate your focus on history and historiography in the industry. So much wisdom and context has been lost and distorted due to ignorance of it. This is a problem everywhere but seems particularly egregious in software.
Let me chime in on this. I enjoyed your article as I always do. Minor comments first:
re formal methods making it cheaper. I’d scratch this claim since it might give a wrong impression of the field. Maybe you’ve seen a ton of papers claiming this in your research. It could be popular these days. In most of mine (ranging four decades min.), they usually just said formal methods either reduced or (false claim here) totally eliminated defects in software. They focus on how the software will have stunning quality, maybe even perfection. If they mention cost, it’s all over the place: built barely anything with massive labor; cost a lot more to make something better; imply savings in long-term since it needed no maintenance. The lighterweight methods, like Cleanroom, sometimes said it’s cheaper due to (other claim) prevention being cheaper. They didn’t always, though.
re common knowledge is wrong. I got bit by a lot of that. Thanks for digging into it, cutting through the nonsense, and helping us think straight!
Now for the big stuff.
re stuff we can agree on. That’s what I’m going to focus on. The examples you’re using all support your point that nobody can figure anything out. You’re not using examples we’ve discussed in the past that have more weight with high, practical value. I’ll also note that some of these are logical in nature: eliminating the root causes means we don’t need a study to prove their benefit except that they correctly eliminate the root causes. Unless I skimmed past it, it looks like you don’t bring up those at all. Maybe you found problems in them, too. Maybe just forgot. So, I’ll list them in case any of you find them relevant.
Memory safety. Most studies show that a massive number of bugs are memory-safety bugs. Languages that block them by default and tools that catch most of them will, if they work (prevent/catch), block a massive number of bugs. Lesser claims: many exploits (a) also use those kinds of bugs and (b) more easily use them. That lets us extrapolate the first claim to reduction of exploits and/or their severity. An easy test would be to compare (minus web apps) both the number of memory-safety errors and code injections found in C/C++ apps vs Java, C#, and Rust. If Java or C#, count only the safe parts rather than the unsafe parts. I predict a serious drop since their root cause is logically impossible most of the time. I believe field evidence already supports this.
Garbage collection (temporal safety, single-threaded). The industry largely switched to languages using GC, esp Java and .NET, from C/C++. That’s because temporal errors are more difficult to prevent, were harder to find at one point (maybe today), and might cost more to fix (action at a distance). These kinds of errors turn up even in high-quality codebases like OpenBSD. I predict a huge drop by garbage collection making them impossible or less-probable. If it’s not performance-prohibitive, a GC-by-default approach reduces development time, totally eliminates cost of fixing these, and reduces crashes they cause. What I don’t have proof of is that I used to see more app crashes before managed languages due to a combo of spatial and temporal memory errors. Might warrant a study.
Concurrency (temporal safety, multi-threaded). Like what GC’s address, concurrency errors are more difficult to prevent and can be extremely hard to debug. One Lobster was even hired for a long period of time specifically to fix one. There’s both safe methods for concurrency (eg SCOOP, Rust) and tools that detect the errors (esp in Java). If the methods work, what we’d actually be studying is if not introducing concurrency errors led to less QA issues than introducing and fixing concurrency errors. No brainer. :)
Static analysis. They push the button, the bugs appear, and they fix the ones that are real. Every good tool had this effect on real-world, OSS codebases. That proves using them will help you get rid of more bugs than not using them. If no false positives (eg RV-Match, Astree), your argument is basically done. If false positives (eg PVS-Studio, Coverity), then one has to study if the cost to go through the false positives is worth it vs methods without them. I’d initially compare to no-false-positive methods. Then, maybe to methods that are alternatives to static analysis.
Test generators. Path-based, combinatorial, fuzzing… many methods. Like static analysis, they each find errors. The argument that one is good is easy. If many are available (esp free), then other empirical data says each will spot things others miss. That has a quality benefit without a study necessary if the administrative and hardware cost of using them is low enough.
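As one cheap, concrete example of a test generator, here is a property-based test using Python’s Hypothesis library; dedupe_sorted is just a stand-in function for illustration.

```python
# Requires the Hypothesis library (pip install hypothesis); run with pytest.
from hypothesis import given, strategies as st

def dedupe_sorted(xs):
    """Toy function under test: return the sorted unique elements of xs."""
    out = []
    for x in sorted(xs):
        if not out or out[-1] != x:
            out.append(x)
    return out

# The generator produces many random lists, including edge cases (empty
# lists, duplicates, negative numbers) we'd be unlikely to hand-write.
@given(st.lists(st.integers()))
def test_dedupe_sorted(xs):
    assert dedupe_sorted(xs) == sorted(set(xs))
```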
My persistent claim on Lobsters for code-level quality: use a combination of FOSS languages and tools with memory-safety, GC-by-default, safe concurrency, static analysis, and test generation. To take the claim down, you just need empirical evidence against one or more of the above since my claim just straight-forward builds on them. Alternatively, show that how they interact introduces new problems that are costlier to fix than what they fixed. That is a possibility.
Drivers. Microsoft Windows used to blue screen all the time for all kinds of users. The reports were through the roof. It was mostly drivers. They built a formal methods tool to make drivers more reliable. Now, Windows rarely blue screens from driver failures. There you go.
Design flaws. You’ve personally shown how TLA+ mockups catch in minutes problems that alternative methods either don’t catch or take roughly forever to. Any research like that makes an argument for using those tools to catch those kinds of problems. I’ll limit the claim to that. It becomes a tiny, tiny claim compared to silver bullets most are looking for. Yet, it already allows one to save time on UI’s, distributed systems, modeling legacy issues, etc.
Functional and immutable over imperative and mutable. In a nutshell, what happens as we transition from one to the other is that the number of potential interactions increases. Managing them is more complex. Piles of research across fields show that both complexity and interactions drive all kinds of problems. Interestingly, this is true for both human brains and machine analysis. You’ve also linked to a formal methods article where the author illustrated that mathematically for verifying functional vs imperative software. Any technique reducing both complexity and potential interactions for developers reduces their labor so long as the technique itself is understandable. That gives us root cause analysis, cross-disciplinary evidence, and mathematical proofs that it will be easier in the long term to verify programs that are more functional and use less mutable state.
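A tiny Python illustration of the “fewer possible interactions” point, using a made-up running-totals example: the mutating version lets any holder of the dict change shared state, while the pure version confines what each call can touch.

```python
# Imperative/mutable: any code holding a reference to `totals` can change it,
# so reasoning about its value means knowing every call site.
totals = {}

def record_mutating(user, amount):
    totals[user] = totals.get(user, 0) + amount

# Functional/immutable: the result depends only on the arguments, so each
# call can be understood and tested in isolation.
def record_pure(totals, user, amount):
    return {**totals, user: totals.get(user, 0) + amount}

t1 = record_pure({}, "alice", 5)
t2 = record_pure(t1, "alice", 3)
assert t1 == {"alice": 5} and t2 == {"alice": 8}  # t1 is unchanged
```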
Things we’ll be stuck with, like Internet protocols. Certain systems are set up to be hard to change or never change. We’re stuck with them. OSS, industry, and government keep deploying band-aids or extra layers to fix problems in such systems without fixing the root. When they do fix a root cause, it’s enormously expensive (see the Y2K bug). I’d argue for getting them right up-front to reduce long-term costs. I’m not sure what empirical studies have gone into detail about these systems. I just think it’s the best thing to apply that kind of thinking to.
I’d say those are the only things I’d mention in an article like yours. They’re things that are easy to argue for using visible, real-world results.
The biggest issue with a lot of software engineering research that I encounter is that it is unsurprising. It is information that practitioners can acquire talking to a wide range of other practitioners or via first-principles thinking.
The recurring pattern seems to be that organizations seeking better ways of working try things out and gain some advantage. Others hear about their success and attempt the same thing. The ones that don’t (laggards) often wait for confirmation from various sources of practice authority (academia in this case) before trying something that appears risky to them.
Since the primary learning loop happens through market experimentation, this leaves research the task of putting a stamp of approval on what is often common knowledge in some pockets of practice, leading to wider adoption. Or, more rarely, showing that some common knowledge may not work in particular circumstances, and that is more of a negative agenda. Practice discovery (positive agenda) will always be easier in industry.
This makes academic software engineering research very hard and, maybe, demoralizing for those who do it. It’s relatively easy to show that some things don’t work in particular circumstances, but the range of circumstances is huge and every company / open source project has incentive to discover niche places where a particular method, tool or practice provides value.
At one point, there was a decent amount of research on software engineering practices. For example, I have this book on my shelf (I needed it for a class I took in grad school). However, it seems to have gone out of style around roughly 2000 with the mass switch to agile/scrum. Part of the problem is a lot of this work was funded by the US DoD and by IBM, neither of which embraced agile/scrum along with the rest of the industry. The newer crop of tech superstar companies seems to be much more secretive about its engineering practices, and also less interested in systematically studying them.
FWIW the motivation for this was apparently a comment on a thread about a review of the book “Software Engineering at Google” by Titus Winters, Tom Manshreck, and Hyrum Wright.
https://lobste.rs/s/9n7aic/what_i_learned_from_software_engineering
I meant to comment on that original thread, because I thought the question was misguided. Well now that I look it’s actually been deleted?
Anyway the point is that is that the empirical question isn’t really actionable IMO. You could “answer it” and it still wouldn’t tell you what to do.
I think you got this post exactly right – there’s no amount of empiricism that can help you. Software engineering has changed so much in the last 10 or 20 years that you can trivially invalidate any study.
Yaron Minsky has a saying that “there’s no pile of sophomores high enough” that is going to prove anything about writing code. (Ironically he says that in advocacy of static typing, which I view as an extremely domain specific question.) Still I agree with his general point.
This is not meant to be an insult, but when I see the names Titus Winters and Hyrum Wright, I’m less interested in the work. This is because I worked at Google for over a decade and got lots of refactoring and upgrade changelists/patches from them, as maintainer of various parts of the codebase. I think their work is extremely valuable, but it is fairly particular to Google, and in particular it’s done without domain knowledge. They are doing an extremely good job of doing what they can to improve the codebase without domain knowledge, which is inherent in their jobs, because they’re making company-wide changes.
However most working engineers don’t improve code without domain knowledge, and the real improvements to code require such knowledge. You can only nibble at the edges otherwise.
@peterbourgon said basically what I was going to say in the original thread – this is advice is generally good in the abstract, but it lacks context.
https://lobste.rs/s/9n7aic/what_i_learned_from_software_engineering
The way I learned things at Google was to look at what people who “got things done” did. They generally “break the rules” a bit. They know what matters and what doesn’t matter.
Jeff Dean and Sanjay Ghewamat indeed write great code and early in my career I exchanged a few CLs with them and learned a lot. I also referenced a blog post by Paul Bucheit in The Simplest Explanation of Oil.
For those who don’t know, he was creator of GMail, working on it for 3 years as a side project (and Gmail was amazing back then, faster than desktop MS Outlook, even though it’s rotted now.) He mentions in that post how he prototyped some ads with the aid of some Unix shell. (Again, ads are horrible now, a cancer on the web – back then they were useful and fast. Yes really. It’s hard to convey the difference to someone who wasn’t a web user then.)
As a couple other anecdotes, I remember people a worker complaining that Guido van Rossum’s functions were too long. (Actually I somewhat agreed, but he did it in service of getting something done, and it can be fixed later.)
I also remember Bram Moolenaar’s (author of Vim) Java readability review, where he basically broke all the rules and got angry at the system (for a brief time I was one of the people who picked the Python readability reviewers, so I’m familiar with this style of engineering. I had to manage some disputes between reviewers and applicants.).
So you have to take all these rules with a grain of salt. These people can obviously get things done, and they all do things a little differently. They don’t always write as many tests as you’d ideally like. One of the things I tried to do as the readability reviewer was to push back against dogma and get people to relax a bit. There is value to global consistency, but there’s also value to local domain-specific knowledge. My pushing back was not really successful and Google engineering has gotten more dogmatic and sclerotic over the years. It was not fun to write code there by the time I left (over 5 years ago)
So basically I think you have to look at what people build and see how they do it. I would rather read a bunch of stories like “Coders at Work” or “Masterminds of Programming” than read any empirical study.
I think there should be a name for this empirical fallacy (or it probably already exists?) Another area where science has roundly failed is nutrition and preventative medicine. Maybe not for the same exact reasons, but the point is that controlled experiments are only one way of obtaining knowledge, and not the best one for many domains. They’re probably better at what Taleb calls “negative knowledge” – i.e. disproving something, which is possible and valuable. Trying to figure out how to act in the world (how to create software) is less possible. All things being equal, more testing is better, but all things aren’t always equal.
Oil is probably the most rigorously tested project I’ve ever worked on, but this is because of the nature of the project, and it isn’t right for all projects as a rule. It’s probably not good if you’re trying to launch a video game platform like Stadia, etc.
This was my exact reaction when I read the original question motivating Hillel’s post.
I even want to take it a step further and say: Outside a specific context, the question doesn’t make sense. You won’t be able to measure it accurately, and even if you could, there would such huge variance depending on other factors across teams where you measured it that your answer wouldn’t help you win any arguments.
It seems especially to afflict the smart and educated. Having absorbed the lessons of science and the benefits of skepticism and self-doubt, you can ask of any claim “But is there a study proving it?”. It’s a powerful debate trick too. But it can often be a category error. The universe of useful knowledge is much larger than the subset that has been (or can be) tested with a random double blind study.
It makes a lot of sense to me in my context, which is trying to convince skeptical managers that they should pay for my consulting services. But it’s intended to be used in conjunction with rhetoric, demos, case studies, testimonials, etc.
I’d say in principle it’s Scientism, in practice it’s often an intentional sabotaging tactic.
100%.
I should have said: I don’t think it would help you win any arguments with someone knowledgeable. I completely agree that in the real world, where people are making decisions off rough heuristics and politics is everything, this kind of evidence could be persuasive.
So a study showing that “catching bugs early saves money” functions here like a white lab coat on a doctor: it makes everyone feel safer. But what’s really happening is that they are just trusting that the doctor knows what he’s doing. Imo the other methods for establishing trust you mentioned – rhetoric, demos, case studies, testimonials, etc. – imprecise as they are, are probably more reliable signals.
EDIT: Also, just to be clear, I think the right answer here, the majority of the time, is “well obviously it’s better to catch bugs early than later.”
And in which cases is this false? Is it when the team has lots of senior engineers? Is it when the team controls both the software and the hardware? Is it when OTA updates are trivial? (Here is a knock-on effect: what if OTA updates make this assertion false, but then open up a huge can of security vulnerabilities, which overall negates any benefit that the OTA updates add?) What does a majority here mean? I mean, a majority of 55% means something very different from a majority of 99%.
This is the value of empirical software study. Adding precision to assertions (such as understanding that a 55% majority is a bit pathological but a 99% majority certainly isn’t.) Diving into data and being able to understand and explore trends is also another benefit. Humans are motivated to categorize their experiences around questions they wish to answer but it’s much harder to answer questions that the human hasn’t posed yet. What if it turns out that catching bugs early or late is pretty much immaterial where the real defect rate is simply a function of experience and seniority?
I mean, this is my point. There are too many factors to consider. I could add 50 more points to your bullet list.
Something like: “I find it almost impossible to think of examples from my personal experience, but understand the limits of my experience, and can imagine situations where it’s not true.” I think if it is true, it would often indicate a dysfunctional code base where validating changes out of production (via tests or other means) was incredibly expensive.
One of my points is that there is no “turns out”. If you prove it one place, it won’t translate to another. It’s hard even to imagine an experimental design whose results I would give much weight to. All I can offer is my opinion that this strikes me as highly unlikely for most businesses.
Why is software engineering such an outlier when we’ve been able to measure so many other things? We can measure vaccine efficacy and health outcomes (among disparate populations with different genetics, diets, culture, and life experiences), we can measure minerals in soil, we can analyze diets and heat transfer, and we can even study government policy, elections, and personality, though it’s messy. What makes software engineering so much more complex and context-dependent than even a person’s personality?
The fallacy I see here is simply that software engineers see this massive complexity in software engineering because they are software experts and believe that other fields are simpler because software engineers are not experts in those fields. Every field has huge amounts of complexity, but what gives us confidence that software engineering is so much more complex than other fields?
You can measure some things, just not all. Remember the point of discussion here is: Can you empirically investigate the claim “Finding bugs earlier saves overall time and money”? My position is basically: “This is an ill-defined question to ask at a general level.”
Yes.
Yes.
In some way yes, in some ways no. This is a complex situation with tons of confounds, and also a place where policy outcomes in some places won’t translate to other places. This is probably a good analog for what makes the question at hand difficult.
Again, in some ways yes, in some ways no. With the big 5, you’re using the power of statistical aggregation to cut through things we can’t answer. Of which there are many. The empirical literature on “code review being generally helpful” seems to have a similar force. You can take disparate measures of quality, disparate studies, and aggregate to arrive at relatively reliable conclusions. It helps that we have an obvious, common sense causal theory that makes it plausible.
I don’t think it is.
I don’t think it is, and this is not where my argument is coming from. There are many questions in other fields equally unsuited to empirical investigation as: “Does finding bugs earlier save time and money?”
That hasn’t stopped anyone from performing the analysis and using these analyses to implement policy. That the analysis of this data is imperfect is beside the point; it still provides some amount of positive value. Software is in the data dark ages compared to government policy; what data-driven decisions have been made among software engineering teams? I don’t think we even understand whether Waterfall or Agile reduces defect rates or time to ship compared to the other.
What’s stopping us from doing this with software engineering? Is it the lack of a causal theory? There are techniques to try to glean causality from statistical models. Is this not in line with your definition of “empirically”?
It’s not clear to me at all that, as a whole, “empirically driven” policy has had positive value. You can point to successful cases and disasters alike. I think in practice the “science” here is at least as often used as a veneer to push through an agenda as it is to implement objectively more effective policy. Just as software methodologies are.
I was saying there is a causal theory for why code review is effective.
Again, some parts of it can be studied empirically, and should be. I’m happy to see advances there. But I don’t see the whole thing being tamed by science. The high-order bits in most situations are politics and other human stuff. You mentioned it being young… but here’s an analogy that might help with where I’m coming from: teaching writing, especially creative writing. It’s equally ad-hoc and unscientific, despite being old. MFA programs use different methodologies and writers subscribe to different philosophies. There is some broad consensus about general things that mostly work and that most people do (workshops), but even within that there’s a lot of variation. And great books are written by people with wildly different approaches. There are some nice efforts to leverage empiricism, like Steven Pinker’s book and even software like https://hemingwayapp.com/, but systematization can only go so far.
Good vaccine studies are pretty expensive from what I know, but they have statistical power for that reason.
Health studies are all over the map. The “pile of college sophomores” problem very much applies there as well. There are tons of studies done on Caucasians that simply don’t apply in the same way to Asians or Africans, yet some doctors use that knowledge to treat patients.
Good doctors will use local knowledge and rules of thumb, and they don’t believe every published study they see. That would honestly be impossible, as lots of them are in direct contradiction to each other. (Contradiction is a problem that science shares with apprenticeship from experts; for example IIRC we don’t even know if a high fat diet causes heart disease, which was accepted wisdom for a long time.)
https://www.nytimes.com/2016/09/13/well/eat/how-the-sugar-industry-shifted-blame-to-fat.html
I would recommend reading some books by Nassim Taleb if you want to understand the limits of acquiring knowledge through measurement and statistics (Black Swan, Antifragile, etc.). Here is one comment I made about them recently: https://news.ycombinator.com/item?id=27213384
Key point: acting in the world, i.e. decision-making under risk, is fundamentally different from scientific knowledge. Tinkering and experimentation are what drive real changes in the world, not planning by academics. He calls the latter “the Soviet-Harvard school”.
The books are not well organized, but he hammers home the difference between acting in the world and knowledge over and over in many different ways. If you have to have scientific knowledge before acting, you will be extremely limited in what you can do. You will probably lose all your money in the markets too :)
Update: after Googling the term I found in my notes, I’d say “Soviet-Harvard delusion” captures the crux of the argument here. One short definition is the (unscientific) overestimation of the reach of scientific knowledge.
https://www.grahammann.net/book-notes/antifragile-nassim-nicholas-taleb
https://medium.com/the-many/the-right-way-to-be-wrong-bc1199dbc667
https://taylorpearson.me/antifragile-book-notes/
This sounds like empiricism. Not in the sense of “we can only know what we can measure” but in the sense of “I can only know what I can experience”. The Royal Society’s motto is “take nobody’s word for it”.
I 100% agree but it’s not the whole picture. You need theory to compress and see further. It’s the back and forth between theory and experimentation that drives knowledge. Tinkering alone often ossifies into ritual. In programming, this has already happened.
I agree about the back and forth, of course.
I wouldn’t agree programming has ossified into ritual. Certainly it has at Google, which has a rigid coding style, toolchain, and set of languages – and it’s probably worse at other large companies.
But I see lots of people on this site doing different things, e.g. running OpenBSD and weird hardware, weird programming languages, etc. There are also tons of smaller newer companies using different languages. Lots of enthusiasm around Rust, Zig, etc. and a notable amount of production use.
My bad, I didn’t mean all programming has become ritual. I meant that we’ve seen instances of it.
Oh sure, I’m not saying this will be cheap. In fact the price of collecting good data is what I feel makes this research so difficult.
We’ve developed techniques to deal with these issues, though of course you can’t draw a conclusion from extremely low sample sizes. One technique used frequently to compensate for low-statistical-power studies in meta-studies is called post-stratification.
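In case the term is unfamiliar, the reweighting idea behind post-stratification is small enough to sketch in a few lines of Python; the strata and numbers below are invented purely for illustration:

```python
# Toy post-stratification: reweight per-stratum sample means by known
# population proportions so an unrepresentative sample is less misleading.
# All numbers are invented for illustration.

sample_means = {"junior-heavy teams": 12.0, "senior-heavy teams": 5.0}     # mean fix time (hours)
sample_share = {"junior-heavy teams": 0.8, "senior-heavy teams": 0.2}      # what we happened to sample
population_share = {"junior-heavy teams": 0.5, "senior-heavy teams": 0.5}  # what we care about

naive = sum(sample_means[s] * sample_share[s] for s in sample_means)
poststratified = sum(sample_means[s] * population_share[s] for s in sample_means)

print(naive)           # 10.6 -- dominated by the over-sampled stratum
print(poststratified)  # 8.5  -- reweighted toward the population of interest
```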
I think medicine is a good example of empiricism done right. Sure, we can look at modern failures of medicine and nutrition and use those learnings to do better, but medicine is significantly more empirical than software. I still maintain that if we can systematize our understanding of the human body and medicine, we can do the same for software, though as with a soft science, definitive answers may stay elusive. Much work over decades went into the medical sciences to define what it even means to have an illness, to feel pain, to see recovery, or to combat an illness.
I’m very familiar with Taleb’s Antifragile thesis and the “Soviet-Harvard delusion”. As someone well versed in statistics, these are theses that are both pedestrian (Antifragile itself being a pop-science look into a field of study called Extreme Value Theory) and old (Maximum Likelihood approaches to decision theory are susceptible to extreme/tail events which is why in recent years Bayesian and Bayesian Causal analyses have become more popular. Pearson was aware of this weakness and explored other branches of statistics such as Fiducial Inference). (Also I don’t mean this as criticism toward you, though it’s hard to make this tone come across over text. I apologize if it felt offensive, I merely wish to draw your eyes to more recent developments.)
To draw the discussion to a close, I’ll try to summarize my position a bit. I don’t think software empiricism will answer all the questions, nor will we get to a point where we can rigorously determine that some function f exists that can model our preferences. However, I do think software empiricism together with standardization can offer us a way to confidently produce low-risk, low-defect software. I think modern statistical advances have offered us ways to understand more than the statistical approaches of the ’70s did, and that we can use many of the newer techniques from the social and medical sciences (e.g. Bayesian methods) to prove results. I don’t think that, even if we start a concerted effort today, our understanding will get there in a matter of a few years. To do that we would have to undo decades of software practitioners creating systemic analyses from their own experiences, and create a culture shift away from the individual as artisan toward standardization of both the communication of results (what is a bug? how does it affect my code? how long did it take to find? how long did it take to resolve? etc.) and of team conditions (our team has n engineers, our engineers have x years of experience, etc.) that we just don’t have now. I have hope that eventually we will begin to both standardize and understand our industry better, but in the near term this will be difficult.

Here’s a published paper that purposefully illustrates the point you’re trying to make: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC300808/. It’s an entertaining read.
Yup I remember that from debates on whether to wear masks or not! :) It’s a nice pithy illustration of the problem.
Actually I found a (condescending but funny/memorable) name for the fallacy – the “Soviet-Harvard delusion” :)
I found it in my personal wiki, in 2012 notes on the book Antifragile.
Original comment: https://lobste.rs/s/v4unx3/i_ing_hate_science#c_nrdasq
I’m reading a book right now about 17th century science. The author has some stuff to say about Bacon and Empiricism but I’ll borrow an anecdote from the book. Boyle did an experiment where he grew a pumpkin and measured the dirt before and after. The weight of the dirt hadn’t changed much. The only other ingredient that had been added was water. It was obvious that the pumpkin must be made of only water.
This idea that measurement and observation drive knowledge is Bacon’s legacy. Even in Bacon’s own lifetime, it’s not how science unfolded.
Fun fact: Bacon is often considered the modern founder of the idea that knowledge can be used to create human-directed progress. Before him, while scholars and astronomers used to often study things and invent things, most cultures still viewed life and nature as a generally haphazard process. As with most things in history the reality involves more than just Bacon, and there most-certainly were non-Westerners who had similar ideas, but Bacon still figures prominently in the picture.
Hm interesting anecdote that I didn’t know about (I looked it up). Although I’d say that’s more an error of reasoning within science? I realized what I was getting at could be called the Soviet-Harvard delusion, which is overstating the reach of scientific knowledge (no insult intended, but it is a funny and memorable name): https://lobste.rs/s/v4unx3/i_ing_hate_science#c_nrdasq
To be fair, the vast majority of the mass of the pumpkin is water. So the inference was correct to first order. The second-order correction of “and carbon from the air”, of course, requires being much more careful in the inference step.
Perhaps, but this is already what happens, and I think it’s about time we in the profession raise our standards, both of pedagogy and of practice. Right now you can do a casual search on the Web and find respected talking heads arguing that their philosophy is correct, despite it being in direct contrast to another person’s philosophy. This behavior is reinforced by the culture wars of our times, of course, but there’s still much more aimless discourse than there is consistency in results. If we want to start taking steps to improve our practice, I think it’s important to understand what we’re doing right and, more importantly, what we’re doing wrong. I’m more interested here in negative results than positive results. I want to know where software engineering as a discipline is going wrong. There’s also a lot at stake here purely monetarily; corporations often embrace a technology methodology and pay for PR and marketing about their methodology to both bolster their reputations and try to attract engineers.
I don’t think we’re even at the point in our empirical understanding of software engineering where we can make this fallacy. What do we even definitively understand about our field? I’d argue that psychology and sociology have stronger well-known results than what we have in software engineering even though those are very obviously soft sciences. I also think software engineers are motivated to think the problem is complex and impossible to be empirical for the same reason that anyone holds their work in high esteem; we believe our work is complicated and requires highly contextual expertise to understand. However if psychology and sociology can make empirical progress in their fields, I think software engineers most definitely can.
Do you have an example in mind of the direct contradiction? I don’t see much of a problem if different experts have different opinions. That just means they were building different things and different strategies apply.
Again I say it’s good to “look at what people build” and see if it applies to your situation; not blindly follow advice from authorities (e.g. some study “proved” this, or some guy from Google who may or may not have built things said this was good; therefore it must be good).
I don’t find a huge amount of divergence in the opinions of people who actually build stuff, vs. talking heads. If you look at what John Carmack says about software engineering, it’s generally pretty level-headed, and he explains it well. It’s not going to differ that much from what Jeff Dean says. If you look at their C++ code, there are even similarities, despite drastically different domains.
Again the fallacy is that there’s a single “correct” – it depends on the domain; a little diversity is a good thing.
Here’s two fun ones I like to contrast: The Unreasonable Effectiveness of Dynamic Typing for Practical Programs (Vimeo) and The advantages of static typing, simply stated. Two separate authors that came to different conclusions from similar evidence. While yes their lived experience is undoubtedly different, these are folks who are espousing (mostly, not completely) contradictory viewpoints.
Who builds things though? Several people build things. While we hear about John Carmack and Jeff Dean, there are folks plugging away at the Linux kernel, on io_uring, on capability object systems, and all sorts of things that many of us will never be aware of. As an example, Sanjay Ghemawat is someone I wasn’t familiar with until you talked about him. I’ve also interacted with folks in my career who I presume you’ve never interacted with, and yet they have been an invaluable source of learnings for my own code. Moreover, these experience reports are biased by their authors’ reputations; of course we’re more likely to listen to John Carmack than to some Vijay Foo (not a real person, as far as I’m aware) because he’s known for his work at id Software, even if this Vijay Foo ends up having as many or more actionable insights than John Carmack. Overcoming reputation bias and the lack of information about “builders” is another side effect I see of empirical research. Aggregating learnings across individuals can help surface lessons that otherwise would have been lost to structural issues of acclaim and money.

This seems to be a sentiment I’ve read elsewhere, so I want to emphasize: I don’t think there’s anything wrong with diversity, and I don’t think Empirical Software Engineering does anything to diminish it. Creating complicated probabilistic models of spaces necessarily involves many factors. We can create a probability space which has all of the features we care about. Just condition against your “domain” (e.g. kernel work, distributed systems, etc.) and slot your result into that domain. I don’t doubt that a truly descriptive probability space will be very high-dimensional here, but I’m confident we have the analytical and computational power to perform this work nonetheless.
The real challenge I suspect will be to gather the data. FOSS developers are time and money strapped as it is, and excluding some exceptional cases such as curl’s codebase statistics, they’re rarely going to have the time to take the detailed notes it would take to drive this research forward. Corporations which develop proprietary software have almost no incentive to release this data to the general public given how much it could expose about their internal organizational structure and coding practices, so rather than open themselves up to scrutiny they keep the data internal if they measure it at all. Combating this will be a tough problem.
Yeah I don’t see any conflict there (and I’ve watched the first one before). I use both static and dynamic languages and there are advantages and disadvantages to each. I think any programmer should be comfortable using both styles.
I think that the notion that a study is going to change anyone’s mind is silly, like “I am very productive in statically typed languages. But a study said that they are not more productive; therefore I will switch to dynamically typed”. That is very silly.
It’s also not a question that’s ever actionable in reality. Nobody says “Should I use a static or dynamic language for this project?” More likely you are working on existing codebase, OR you have a choice between say Python and Go. The difference between Python and Go would be a more interesting and accurate study, not static vs. dynamic. But you can’t do an “all pairs” comparison via scientific studies.
If there WERE a study definitively proving that, say, dynamic languages are “better” (whatever that means), and you chose Python over Go for that reason, that would be a huge mistake. It’s just not enough evidence; the languages differ in too many other ways.
I think there is value to scientific studies on software engineering, but I think the field just moves very fast, and if you wait for science, you’ll be missing out on a lot of stuff. I try things based on what people who get things done do (e.g. OCaml), and incorporate it into my own work, and that seems like a good way of obtaining knowledge.
Likewise, I think “Is catching bugs earlier less expensive” is a pretty bad question. A better scientific question might be “is unit testing in Python more effective than integration testing Python with shell” or something like that. Even that’s sort of silly because the answer is “both”.
But my point is that these vague and general questions simply leave out a lot of subtlety of any particular situation, and can’t be answered in any useful way.
While the example of static vs. dynamic typing is probably too broad to be meaningful, I don’t actually think this would be true. It’s a bit like saying “Well, I believe that Python is the best language, and even though research shows that Go has properties <x, y, and z> that are beneficial to my problem domain, I’m going to ignore them and put a huge prior on my past experience.” It’s the state of the art right now; trust your gut and the guts of those you respect, not the other guts. If we can’t progress from here I would indeed be sad.
Sure, as you say, static vs dynamic languages isn’t very actionable but Python vs Go would be. And if I’m starting a new codebase, a new project, or a new company, it might be meaningful to have research that shows that, say, Python has a higher defect rate but an overall lower mean time to resolution of these defects. Prior experience with Go may trump benefits that Python has (in this synthetic example) if project time horizons are short, but if time horizons are long Go (again in the synthetic example) might look better. I think this sort of comparative analysis in defect rates, mean time to resolution, defect severity, and other attributes can be very useful.
Personally, I’m not satisfied by the state of the art of looking at builders. I think the industry really needs a more rigorous look at its assumptions and even if we never truly systematize and Fordify the field (which fwiw I don’t think is possible), I certainly think there’s a lot of progress for us to make yet and many pedestrian questions that we can answer that have no answers yet.
Python vs Go defect rates also seem to me to be far too general for an empirical study to produce actionable data.
How do you quantify a “defect rate” in a way that’s relevant to my problem, for example? There are a ton of confounds: genre of software, timescale of development, size of team, composition of team, goals of the project, etc. How do I know that some empirical study comparing defect rates of Python vs. Go in, I dunno, the giant Google monorepo, is applicable to my context? Let’s say I’m trying to pick a language to write some AI research software, which will have a 2-person team, no monorepo or formalized code-review processes, a target 1-year timeframe to completion, and a primary metric of producing figures for a paper. Why would I expect the Google study to produce valid data for my decision-making?
Somebody does. Somebody writes the first code on a new project and chose the language. Somebody sets the corporate policy on permissible languages. Would be amazing if even a tiny input to these choices were real instead of just perceived popularity and personal familiarity.
Too many downvotes this month. ¯\_(ツ)_/¯
This situation is not ideal :(
Tangentially. I did my MS work on debugging. Because I am fascinated by both computers and history, I dug through essentially all the papers written on the matter between, um, 1949 and 1980, and a fair sampling thereafter (because the paper count got too large). I finished that work in 2011ish (I had to cut a great deal of the historical analysis out of my finished MS work, so it’s not available to read, sorry). What was fascinating is that after a certain point, the academic work on software engineering work became navel gazing. Initially it was done by people who were actively programming in practice. Then the work migrated away from the active programming and more and more into sort of generic or ultra-specialized systems designed for studying debugging, which only had, in passing, a relation to the practical systems. Too, fad work started to show up in the early 90s.
Much like OP, I keep an eye on empirical software engineering, particularly around codebase metrics. My broad take is that it’s useful bunkum. Much like Red Riding Hood is about opsec but never actually happened, metrics tell a story, and it’s a useful story, but maybe don’t believe them too literally.
One of the most useful journals was https://onlinelibrary.wiley.com/journal/1097024x .
One more wrinkle: “are late bugs more expensive to fix” isn’t even the question we need to answer. It’s an intellectually interesting question, but the question we actually need to answer is “how should we develop software?”
If bugs we discover late are more expensive, that suggests that we could save money by finding them earlier. But that’s not a guarantee–perhaps they’re more expensive to fix because they’re just harder to find, not because they’re found later? I’m happy to assume it’s some of the second, even without research, but they’re two different explanations of the hypothesized phenomenon.
Agreed. Oftentimes the better question wasn’t “When should we fix these bugs?” but rather “How might this new feature impact the bug profile of our codebase” or even “Should we spend code on this feature in the first place?”
That’s a great point, and to add to it, I think looking at whether a specific bug would have been cheaper to fix earlier discounts the possibility that the cost of finding the bug earlier would have exceeded the cost of fixing it later. My intuition is that that must be true at least some of the time.
Although it doesn’t really replace the manual thoroughness, I’ve found Connected Papers and Citation Tree to be handy tools for augmenting this process – if only to get a quick lay of the land!
I really appreciate your focus on history and historiography in the industry. So much wisdom and context has been lost and distorted due to ignorance of it. This is a problem everywhere but seems particularly egregious in software
Let me chime in on this. I enjoyed your article as I always do. Minor comments first:
re formal methods making it cheaper. I’d scratch this claim since it might give a wrong impression of the field. Maybe you’ve seen a ton of papers claiming this in your research. It could be popular these days. In most of mine (spanning at least four decades), they usually just said formal methods either reduced or (false claim here) totally eliminated defects in software. They focus on how the software will have stunning quality, maybe even perfection. If they mention cost, it’s all over the place: built barely anything with massive labor; cost a lot more to make something better; implied savings in the long term since it needed no maintenance. The lighter-weight methods, like Cleanroom, sometimes said it’s cheaper due to (another claim) prevention being cheaper. They didn’t always, though.
re common knowledge is wrong. I got bit by a lot of that. Thanks for digging into it, cutting through the nonsense, and helping us think straight!
Now for the big stuff.
re stuff we can agree on. That’s what I’m going to focus on. The examples you’re using all support your point that nobody can figure anything out. You’re not using examples we’ve discussed in the past that have more weight and high practical value. I’ll also note that some of these are logical in nature: eliminating the root causes means we don’t need a study to prove their benefit beyond showing that they really do eliminate those root causes. Unless I skimmed past it, it looks like you don’t bring those up at all. Maybe you found problems in them, too. Maybe you just forgot. So, I’ll list them in case any of you find them relevant.
Memory safety. Most studies show that a massive number of bugs are memory-safety bugs. Languages that block them by default and tools that catch most of them will, if they work (prevent/catch), block a massive number of bugs. Lesser claims: many exploits (a) also use those kinds of bugs and (b) more easily use them. That lets us extrapolate the first claim to a reduction of exploits and/or their severity. An easy test would be to compare (minus web apps) both the number of memory-safety errors and code injections found in C/C++ apps vs Java, C#, and Rust. If Java or C#, count the safe parts rather than the unsafe parts. I predict a serious drop since their root cause is logically impossible most of the time. I believe field evidence already supports this.
Garbage collection (temporal safety, single-threaded). The industry largely switched to languages using GC, esp Java and .NET, from C/C++. That’s because temporal errors are more difficult to prevent, were harder to find at one point (maybe today), and might cost more to fix (action at a distance). These kinds of errors turn up even in high-quality codebases like OpenBSD. I predict a huge drop by garbage collection making them impossible or less-probable. If it’s not performance-prohibitive, a GC-by-default approach reduces development time, totally eliminates cost of fixing these, and reduces crashes they cause. What I don’t have proof of is that I used to see more app crashes before managed languages due to a combo of spatial and temporal memory errors. Might warrant a study.
Concurrency (temporal safety, multi-threaded). Like what GC’s address, concurrency errors are more difficult to prevent and can be extremely hard to debug. One Lobster was even hired for a long period of time specifically to fix one. There’s both safe methods for concurrency (eg SCOOP, Rust) and tools that detect the errors (esp in Java). If the methods work, what we’d actually be studying is if not introducing concurrency errors led to less QA issues than introducing and fixing concurrency errors. No brainer. :)
Static analysis. They push the button, the bugs appear, and they fix the ones that are real. Every good tool had this effect on real-world, OSS codebases. That proves using them will help you get rid of more bugs than not using them. If no false positives (eg RV-Match, Astrée), your argument is basically done. If false positives (eg PVS-Studio, Coverity), then one has to study whether the cost of going through the false positives is worth it vs methods without them. I’d initially compare to no-false-positive methods. Then, maybe to methods that are alternatives to static analysis.
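As a concrete (if toy) illustration of the push-the-button workflow, here’s the kind of thing any analyzer surfaces; I’m using mypy on Python only as a stand-in for whatever tool fits your stack, and the names are made up:

```python
# users.py -- toy example of the "push the button, the bugs appear" workflow.
# Running `mypy users.py` flags the None case below before this path is ever
# executed; mypy stands in here for any static analyzer.
from typing import Optional

def find_user(user_id: int) -> Optional[str]:
    users = {1: "alice", 2: "bob"}
    return users.get(user_id)        # None when the id is unknown

def greeting(user_id: int) -> str:
    name = find_user(user_id)
    # mypy: Item "None" of "Optional[str]" has no attribute "upper"
    return "Hello, " + name.upper()
```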
Test generators. Path-based, combinatorial, fuzzing… many methods. Like static analysis, they each find errors. The argument that one is good is easy. If many are available (esp free), then other empirical data says each will spot things the others miss. That has a quality benefit without a study necessary, if the administrative and hardware cost of using them is low enough.
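For a small example of what a test generator looks like in practice, here’s a property-based test with the Hypothesis library; the truncate function and its intended property are invented for illustration:

```python
# Property-based test generation with Hypothesis: the tool generates inputs
# and searches for a counterexample to the stated property.
from hypothesis import given, strategies as st

def truncate(s: str, limit: int) -> str:
    """Intended: the result never exceeds `limit` characters (assumes limit >= 0)."""
    return s[:limit] if len(s) > limit else s

@given(st.text(), st.integers(min_value=-5, max_value=50))
def test_truncate_never_exceeds_limit(s: str, limit: int) -> None:
    # Fails for negative limits (Python slicing trims from the end instead),
    # an edge case a hand-written example-based test would likely miss.
    assert len(truncate(s, limit)) <= max(limit, 0)
```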
My persistent claim on Lobsters for code-level quality: use a combination of FOSS languages and tools with memory safety, GC by default, safe concurrency, static analysis, and test generation. To take the claim down, you just need empirical evidence against one or more of the above, since my claim straightforwardly builds on them. Alternatively, show that how they interact introduces new problems that are costlier to fix than what they fixed. That is a possibility.
Drivers. Microsoft Windows used to blue screen all the time for all kinds of users. The reports were through the roof. It was mostly drivers. They built a formal methods tool to make drivers more reliable. Now, Windows rarely blue screens from driver failures. There you go.
Design flaws. You’ve personally shown how TLA+ mockups catch in minutes problems that alternative methods either don’t catch or take roughly forever to. Any research like that makes an argument for using those tools to catch those kinds of problems. I’ll limit the claim to that. It becomes a tiny, tiny claim compared to silver bullets most are looking for. Yet, it already allows one to save time on UI’s, distributed systems, modeling legacy issues, etc.
Functional and immutable over imperative and mutable. In a nutshell, what happens as we transition from one to the other is that the number of potential interactions increases. Managing them is more complex. Piles of research across fields show that both complexity and interactions drive all kinds of problems. Interestingly, this is true for both human brains and machine analysis. You’ve also linked to a formal methods article where the author illustrated that mathematically for verifying functional vs imperative software. Any technique reducing both complexity and potential interactions for developers reduces their labor, so long as the technique itself is understandable. That gives us root cause analysis, cross-disciplinary evidence, and mathematical proofs that it will be easier in the long term to verify programs that are more functional and use less mutable state.
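A tiny Python sketch of the interaction argument (names invented): the mutable version couples every caller through shared state, while the pure version keeps all inputs and outputs explicit:

```python
# Imperative/mutable: every caller shares and mutates `totals`, so reasoning
# about `record` means reasoning about everyone else who touches that dict.
totals = {}

def record(user: str, amount: int) -> None:
    totals[user] = totals.get(user, 0) + amount   # hidden dependency on shared state

# Functional/immutable: inputs and outputs are explicit and old values are
# never modified, so there are fewer interactions to keep in your head.
def record_pure(prev: dict, user: str, amount: int) -> dict:
    return {**prev, user: prev.get(user, 0) + amount}

t0 = {}
t1 = record_pure(t0, "alice", 5)
assert t0 == {}                 # previous state untouched
assert t1 == {"alice": 5}
```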
Things we’ll be stuck with, like Internet protocols. Certain systems are set up to be hard to change or never change. We’re stuck with them. OSS, industry, and government keeps deploying bandaids or extra layers to fix problems in such systems without fixing the root. When they do fix a root cause, it’s enormously expensive (see Y2K bug). I’d argue for getting them right up-front to reduce long-term costs. I’m not sure what empirical studies have gone into detail about these systems. I just think it’s the best thing to apply that kind of thinking to.
I’d say those are the only things I’d mention in an article like yours. They’re things that are easy to argue for using visible, real-world results.
The biggest issue with a lot of software engineering research that I encounter is that it is unsurprising. It is information that practitioners can acquire talking to a wide range of other practitioners or via first-principles thinking.
The recurring pattern seems to be that organizations seeking better ways of working try things out and gain some advantage. Others hear about their success and attempt the same thing. The ones that don’t (laggards) often wait for confirmation from various sources of practice authority (academia in this case) before trying something that appears risky to them.
Since the primary learning loop happens through market experimentation, this leaves research the task of putting a stamp of approval on what is often common knowledge in some pockets of practice, leading to wider adoption. Or, more rarely, showing that some common knowledge may not work in particular circumstances, and that is more of a negative agenda. Practice discovery (positive agenda) will always be easier in industry.
This makes academic software engineering research very hard and, maybe, demoralizing for those who do it. It’s relatively easy to show that some things don’t work in particular circumstances, but the range of circumstances is huge and every company / open source project has incentive to discover niche places where a particular method, tool or practice provides value.
At one point, there was a decent amount of research on software engineering practices. For example, I have this book on my shelf (I needed it for a class I took in grad school). However, it seems to have gone out of style around roughly 2000 with the mass switch to agile/scrum. Part of the problem is a lot of this work was funded by the US DoD and by IBM, neither of which embraced agile/scrum along with the rest of the industry. The newer crop of tech superstar companies seems to be much more secretive about its engineering practices, and also less interested in systematically studying them.