Oh dang another essay on empirical software engineering! I wonder if they read the same sources I did
Reads blog
You watched the conference talk “What We Know We Don’t Know”, by Hillel Wayne, who, also disturbed by software’s apparent lack of scientific foundation, found and read as many scholarly papers as he could find. His conclusions are grim.
I think I’m now officially internet famous. I feel like I crossed a threshold or something :D
So I’m not sure how much of this is frustration with ESE in general or with me in particular, but a lot of quotes are about my talk, and so I’m not sure if I should be defending myself? I’m gonna err on the side of defending myself, mostly because it’s an excuse to excitedly talk about why I’m so fascinated by empirical engineering.
One thing I want to open with. I’ve mentioned a couple of times on Lobsters that I’m working on a long term journalism project. I’m interviewing people who worked as “traditional” engineers, then switched to software, and what they see as the similarities and differences. I’ve learned a lot from this project, but one thing in particular stands out: we are not special. Almost everything we think is unique about software, from the rapid iteration to clients changing the requirements after we’ve released, happens all the time in other fields.
So, if we can’t empirically study software engineering, it would follow that we can’t empirically study any kind of engineering. If “you can’t study it” only applied to software, that would make software Special. And everything else people say about how software is Special turns out to be wrong, so I suspect this claim is wrong too.
I haven’t interviewed people outside of engineering, but I believe it goes even further: engineering isn’t special. If we can’t study engineers, then we can’t study lawyers or nurses or teachers or librarians. Human endeavor is incredibly complex, and every argument we can make about why studying software is impossible extends to any other job. I fundamentally reject that. I think we can usefully study people, and so we can usefully study software engineers.
Okay so now for individual points. There’s some jank here, because I didn’t edit this a whole lot and didn’t polish it at all.
You were disappointed with Accelerate: The Science of Lean Software and DevOps. You agreed with most of its prescriptions. It made liberal use of descriptive statistics.
Accelerate’s research is exclusively done by surveying people. This doesn’t mean it’s not empirical- as I say in the talk, qualitative information is really helpful. And one of my favorite examples of qualitative research, the Gamasutra Study on Crunch Mode, uses a similar method. But it’s far from being settled, and it bothers me that people use Accelerate as “scientifically proven!!!”
Controlled experiments are typically nothing like professional programming environments […] So far as I know, no researcher has ever gathered treatment and control groups of ten five-developer teams each, put them to work M-F, 9-5, for even a single month, in order to realistically simulate the conditions of a stable, familiar team and codebase.
You’d be surprised. “Two comparisons of programming languages”, in Making Software, does this with nine teams (but only for one day). Some labs specialize in this, like the SIMULA lab. Companies do internal investigations on this- Microsoft and IBM especially have a lot of great work in this style.
But regardless of that, controlled experiments aren’t supposed to be holistic. They test what we can, in a small context, to get solid data on a specific thing. Like VM Warmup Blows Hot and Cold: in a controlled environment, how consistent are VM benchmarks? Turns out, not very! This goes against all of our logic and intuition, and shows the power of controlled studies. Ultimately, though, controlled studies are a relatively small portion of the field, just as they’re a small portion of most social sciences.
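To give a concrete feel for what a controlled measurement like that looks like, here’s a tiny sketch (my own toy example, not the paper’s actual methodology) that times one workload over many in-process iterations and compares the variability early and late in the run. If the late iterations still jump around, the benchmark never reached the steady state we usually assume.

```python
# Toy benchmark-consistency check, loosely in the spirit of the VM warmup
# question: do repeated in-process timings ever settle into a steady state?
# Plain CPython here (no JIT warmup), so this is only an illustrative sketch.
import statistics
import time

def workload():
    # Hypothetical stand-in for the code under benchmark.
    return sum(i * i for i in range(10_000))

def time_iterations(n=2000):
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return timings

if __name__ == "__main__":
    timings = time_iterations()
    # Compare variability in the first and last chunks of the run.
    for label, sample in (("first 500", timings[:500]), ("last 500", timings[-500:])):
        mean = statistics.mean(sample)
        cv = statistics.stdev(sample) / mean
        print(f"{label}: mean={mean * 1e6:8.1f}us  cv={cv:.2%}")
    # A large coefficient of variation late in the run suggests the benchmark
    # never reached the steady state we usually assume it has.
```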
For that matter, using students is great for studies on how students learn. There’s a ton of amazing research on what makes CS concepts easier to learn, and you have to use students for that.
The unpredictable dynamics of human decision-making obscure the effects of software practices in field data. […] This doesn’t hold for field data, because real-life software teams don’t adopt software practices in a random manner, independent from all other factors that might potentially affect outcomes.
This is true for every form of human undertaking, not just software. Can we study teachers? Can we study doctors and nurses? Their world is just as chaotic and dependent as ours is. Yet we have tons of research on how educators and healthcare professionals do their jobs, because we collectively agree that it’s important to understand those jobs better.
One technique we can use is cross-correlating among many different studies on many different groups. Take the question “does Continuous Delivery help”. Okay, we see that companies that practice it have better outcomes, for whatever definition of “outcomes” we’re using. Is that correlation or causation? Next we can look at “interventions” where a company moved to CD and see how it changed their outcomes. We can see what practices all of the companies share and what they do differently, to see what cluster of other explanations we have. We can examine companies where some teams use CD and some teams do not, and correlate their performance. We can look at what happens when people move between the different teams. We can look at companies that moved away from CD.
We’re not basing our worldview off a single study. We’re doing many of them, in many different contexts, to get different facets of what the answer might actually be. This isn’t easy! But it’s worth doing.
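Here’s roughly what I mean by pulling studies together, as a toy sketch with made-up numbers: a simple fixed-effect (inverse-variance) pooling of effect estimates from several hypothetical CD studies. Real meta-analysis is far more careful about heterogeneity and bias, but the shape of the idea is the same.

```python
# Toy fixed-effect meta-analysis: pool effect estimates from several
# (entirely made-up) studies of "does CD improve outcome Y?".
# Each study reports an effect size and its standard error.
import math

studies = [
    # (label, effect_estimate, standard_error) -- hypothetical numbers
    ("cross-company survey",      0.30, 0.15),
    ("before/after intervention", 0.10, 0.08),
    ("within-company comparison", 0.20, 0.12),
]

# Inverse-variance weights: more precise studies count for more.
weights = [1.0 / se ** 2 for _, _, se in studies]
pooled = sum(w * eff for (_, eff, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

print(f"pooled effect = {pooled:.3f} +/- {1.96 * pooled_se:.3f} (95% CI)")
# The point is not the arithmetic but the idea: no single study settles
# the question; the combined picture (and its disagreements) does.
```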
The outcomes that can be measured aren’t always the outcomes that matter. […] So in order to effectively inform practice, research needs to ask a slightly different, more sophisticated question – not e.g. “what is the effect software practice X has on ‘defect rate’”, but “what is the effect software practice X has on ‘defect rate per unit effort’”. While it might be feasible to ask this question in the controlled experiment setting, it is difficult or impossible to ask of field data.
Pretty much all studies take this as a given. When we study things like “defect rate”, we’re always studying it in the context of unit time or unit cost. Otherwise we’d obviously just use formal verification for everything. And it’s totally feasible to ask this of field data. In some cases, companies are willing to instrument themselves- see TSP or the NASA data sets. In other cases, the data is computable- see research on defect rates due to organizational structure and code churn. Finally, we can cross-correlate between different projects, as is often done with repo mining.
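As a small illustration of “the data is computable”, here’s a rough sketch that pulls per-file code churn out of a git log- the kind of raw signal repo-mining studies start from. It only assumes you run it inside a git checkout; everything else is generic.

```python
# Rough sketch: per-file code churn (lines added + deleted) from git history.
# Run inside a git checkout. Binary files are skipped.
import subprocess
from collections import Counter

def churn_per_file():
    out = subprocess.run(
        ["git", "log", "--numstat", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    churn = Counter()
    for line in out.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # blank separators between commits
        added, deleted, path = parts
        if added == "-":  # binary files report "-" instead of line counts
            continue
        churn[path] += int(added) + int(deleted)
    return churn

if __name__ == "__main__":
    for path, lines_changed in churn_per_file().most_common(10):
        print(f"{lines_changed:8d}  {path}")
```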
These are hard problems, certainly. But lots of things are “hard problems”. It’s literally scientists’ jobs to figure out how to solve these problems. Just because we, as layfolk, can’t figure out how to solve these problems doesn’t mean they’re impossible to solve.
Software practices and the conditions which modify them are varied, which limits the generality and authority of any tested hypothesis
This is why we do a lot of different studies and test a lot of different hypotheses. Again, this is an accepted fact in empirical research. We know it’s hard. We do it anyway.
But if you’re holding your breath for the day when empirical science will produce a comprehensive framework for software development – like it does for, say, medicine – you will die of hypoxia.
A better analogue is healthcare, the actual system of how we run hospitals and such. That’s in the same boat as software development: there’s a lot we don’t know, but we’re trying to learn more. The difference is that most people believe studying healthcare is important, but that studying software is not.
Is this cause for despair? If science-based software development is off the table, what remains? Is it really true as Hillel suggests, that in the absence of science “we just don’t know” anything, and we are doomed to an era of “charisma-driven development” where the loudest opinion wins, and where superstition, ideology, and dogmatism reign supreme?
The lack of empirical evidence for most things doesn’t mean we’re “doomed to charisma-driven development.” Rather it’s the opposite: I find the lack of evidence immensely freeing. When someone says “you are unprofessional if you don’t use TDD” or “Dynamic types are immoral”, I know, with scientific certainty, that they don’t actually know. They just believe it. And maybe it’s true! But if they want to be honest with themselves, they have to accept that doubt. Nobody has the secret knowledge. Nobody actually knows, and we all gotta be humble and honest about how little we know.
Of course not. Scientific knowledge is not the only kind of knowledge, and scientific arguments are not the only type of arguments. Disciplines like history and philosophy, for instance, seem to do rather well, despite seldom subjecting their hypotheses to statistical tests.
Of course science isn’t the only kind of knowledge! I just gave a talk at Deconstruct on the importance of studying software history. My favorite software book is Data and Reality, which is a philosophical investigation into the nature of information representation. My claim is that science is a very powerful form of knowledge that we as software folk not only neglect, but take pride in our neglecting. It’s like, yes, we don’t just have science, we have history and philosophy. But why not use all three?
Your decision to accept or reject the argument might be mistaken – you might overlook some major inconsistency, or your judgement might be skewed by your own personal biases, or you might be fooled by some clever rhetorical trick. But all in all, your judgement will be based in part on the objective merit of the argument
Of course we can do that. Most of our knowledge will be accumulated this way, and that’s fine. But I think it’s a mistake to be satisfied with that. For any argument in software, I can find two experts, giants in their fields, who have rigorous arguments and beautiful narratives… that contradict each other. Science is about admitting that we are going to make mistakes, that we’re going to naturally believe things that aren’t true, no matter how mentally rigorous we try to be. That’s what makes it so important and so valuable. It gives us a way to say “well you believe X and I believe not X, so which is it?”
Science – or at least a mysticized version of it – can be a threat to this sort of inquiry. Lazy thinkers and ideologues don’t use science merely as a tool for critical thinking and reasoned argument, but as a substitute. Science appears to offer easy answers. Code review works. Continuous delivery works. TDD probably doesn’t. Why bother sifting through your experiences and piecing together your own narrative about these matters, when you can just read studies – outsource the reasoning to the researchers? […] We can simply dismiss them as “anti-science” and compare them to anti-vaxxers. […] I witnessed it play out among industry leaders in my Twitter feed, the day after I started drafting this post.
I think I know what you’re referencing here, and if it’s what I think it is, yeah that got ugly fast.
Regardless of how Thought Leaders use science, my experience has been the opposite of this. Being empirical is the opposite of easy. If I wanted to not think, I’d say “LOGICALLY I’m right” or something. But I’m an idiot and want to be empirical, which means reading dozens of papers that are all maddeningly contradictory. It means going through papers agonizingly carefully because the entire thing might be invalidated by an offhand remark.[1] It means reading papers’ references, and the references’ references, and trawling for followup papers, and reading the followup papers’ other references. It means spending hours hunting down preprints and emailing authors because most of the good stuff is locked away by the academic paper hoarders.
Being empirical means being painfully aware of the cognitive dissonance in your head. I love TDD. I recommend it to beginners all the time. I think it makes me a better programmer. At the same time, I know the evidence for it is… iffy. I have to accept that something I believe is mostly unfounded, and yet I still believe in it. That’s not the easy way out, that’s for sure!
And even when the evidence is in your favor, the final claim is infuriatingly nuanced. Take code review! “Code Review works”. By works, I mean “in most controlled studies and field studies, code review finds a large portion of the extant bugs in reviewed code in a reasonable timeframe. But most of the comments in code review are not bug-finding, but code quality things, about 3 code improvements per 1 bug usually. Certain things make CR better, and certain things make it a lot worse, and developers often complain that most of the code review comments are nitpicks. Often CRs are assigned to people who don’t actually know that area of the codebase well, which is a waste of time for everyone. There’s a limit to how much people can CR at a time, meaning it can easily become a bottleneck if you opt for 100% review coverage.”
That’s a way more nuanced claim than just “code review works!” And it’s way, way more nuanced than about 99% of the Code Review takes I see online that don’t talk about the evidence. Empiricism means being more diligent and putting in more work to understand, not less.
So one last thought to close this out. Studying software is hard. People bring up how expensive it is. And it is expensive, just as it’s expensive to study people in general. But here’s the thing. We are one of the richest industries in the history of the world. Apple’s revenue last year was a quarter trillion dollars. That’s not something we should leave to folklore and feelings. We’re worth studying.
[1]: I recently read one paper that looked solid and had some really good results… and one sentence in the methodology was “oh yeah and we didn’t bother normalizing it”
Hi Hillel! I’m glad you found this, and thank you for taking the time to respond.
I’m not sure you necessarily need to mount a defense, either. I didn’t consciously intend to set your talk up as the antagonist in my post, but I realize this is sort of what I did. The attitude I’m trying to refute (that empirical science is the only source of objective knowledge about software) is somewhat more extreme than the position you advocate. And the attitude you object to (that software “can’t be studied” empirically, and nothing can be learned this way) is certainly more extreme than the position I hoped to express. I think in the grand scheme of things we largely share the same values, and our difference of opinion is rather esoteric and mostly superficial. That doesn’t mean it’s not interesting to debate, though.
Re: Omitted variable bias
You seemed to suggest that research could account for omitted variable bias by “cross-correlating” studies
across different companies
within the same company before and after adopting/disadopting the practice
across different teams within the same company.
I submit to you this is not the case. Continuing with the CD example, suppose CD doesn’t improve outcomes but the “trendiness” that leads to it does. It is completely plausible for
trendy companies to be more likely to adopt CD than non-trendy companies
trendy teams within a company to be more likely to adopt CD than non-trendy teams
a company that is becoming more trendy is more likely to adopt CD and be trendier after the adoption than before
a company that is becoming less trendy is more likely to disadopt CD and be trendier before the disadoption than after
If these hold, then all of the studies in the “cross-correlation” you describe will still misattribute an effect to CD.
You can’t escape omitted variable bias just by collecting more data from more types of studies. In order to legitimately address it, you need to do one of:
Find some sort of data that captures “trendiness” and include it as a statistical control.
Find an instrumental variable.
Find data on teams within a company that were randomly assigned to CD (so that trendiness no longer correlates with the decision to adopt).
If you don’t address a plausible omitted variable bias in one of these ways, then basically you have no guarantee that the effect (or lack of effect) you measured was actually the effect of the practice and not the effect of whatever social conditions or ideology led to the adoption of your practice (or something else that those social conditions caused). This is a huge threat to validity, especially to “code mining” studies whose only dataset is a git log and therefore have no possible hope of capturing or controlling the social or human drivers behind the practice. To be totally honest, I assign basically zero credibility to the empirical argument of any “code mining” study for this reason.
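To make the trendiness story concrete, here’s a toy simulation (all numbers invented) in which CD has zero real effect: “trendiness” drives both adoption and outcomes. A naive regression of outcome on CD adoption finds a healthy-looking “effect”; adding trendiness as a statistical control makes it vanish, which is exactly the first remedy listed above.

```python
# Toy simulation of the omitted-variable problem: "trendiness" drives both
# CD adoption and outcomes; CD itself does nothing. All numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 5000

trendiness = rng.normal(size=n)
# Trendier teams are more likely to adopt CD...
adopts_cd = (trendiness + rng.normal(size=n) > 0).astype(float)
# ...and trendiness (not CD) improves the outcome.
outcome = trendiness + rng.normal(size=n)

def ols_coef(columns, y):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive regression: outcome ~ CD. The CD coefficient looks substantial.
naive = ols_coef([adopts_cd], outcome)
# Controlled regression: outcome ~ CD + trendiness. The CD effect vanishes.
controlled = ols_coef([adopts_cd, trendiness], outcome)

print(f"naive CD 'effect':    {naive[1]:.2f}")       # spuriously large
print(f"CD effect w/ control: {controlled[1]:.2f}")  # close to zero
```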
Re: The analogy to medicine
As @notriddle seemed to be hinting at, professions comprehensively guided by science are the exception, not the rule. Science-based lawyering seems… unlikely. Science-based education is not widely practiced, and is controversial in any case. Medicine seems to be the major exception. It’s worth exploring the analogy/disanalogy between software and medicine in greater detail. Is software somehow inherently more difficult to study than medicine?
Maybe not. You brought up two good points about avenues of software research.
Companies do internal investigations on this- Microsoft and IBM especially have a lot of great work in this style.
and
In some cases, companies are willing to instrument themselves- see TSP or the NASA data sets.
I think analysis of this form is miles more persuasive than computer lab studies or code mining. If a company randomly selects certain teams to adopt a certain practice and certain teams not to, this solves the realism problem because they are, in fact, real software teams. And it solves the omitted variable bias problem because the practice was guaranteed to have been adopted randomly. I think much of the reason medicine has been able to incorporate empirical studies so successfully is because hospitals are so heavily “instrumented” (as you put it) and willing to conduct “clinical trials” where the treatment is randomly assigned. I’m quite willing to admit that we could learn a lot from empirical research if software shops were willing to instrument themselves as heavily as hospitals, and begin randomly designating teams to adopt practices they want to study. I think it’s quite reasonable to advocate for a movement in that direction.
But whether or not we should advocate for more/better data and research is orthogonal to the main concern of my post: in the meantime, while we are clamoring for better data, how ought we evaluate software practices? Do we surrender to nihilism because the data doesn’t (yet) paint a complete picture? Do we make wild extrapolations from the faint picture the data does paint? Or should we explore and improve the body of “philosophical” ideas about programming, developed by programmers through storytelling and reflection on experience?
It is very important to do that last thing. I wrote my post because, for a time, my own preoccupation with the idea that only scientific inquiry had an admissible claim to objective truth prevented me from enjoying and taking e.g. “A Philosophy of Software Design” seriously (because it was not empirical), and realizing what a mistake this was was somewhat of a personal revelation.
Re: Epistemology
Science is about admitting that we are going to make mistakes, that we’re going to naturally believe things that aren’t true, no matter how mentally rigorous we try to be. That’s what makes it so important and so valuable. It gives us a way to say “well you believe X and I believe not X, so which is it?”
Science won’t rescue you from the fact that you’re going to believe things that aren’t true, no matter how mentally rigorous you try to be. Science is part of the attempt to be mentally rigorous. If you aren’t mentally rigorous and you do science, your statistical model will probably be wrong, and omitted variable bias will lead you to conclude something that isn’t true.
Science, to me, is merely a toolbox for generating persuasive empirical arguments based on data. It can help settle the debate between “X” and “not X” if there are persuasive scientific arguments to be found for X, and there are not persuasive scientific arguments to be found for “not X” – but just as frequently, there turn out to be persuasive scientific arguments for both “X” and “not X” that cannot be resolved empirically and must be resolved theoretically/philosophically. (Or – as I think describes the state of software research so far – there turn out to be persuasive scientific arguments for neither “X” nor “not X”, and again, the difference must be resolved theoretically/philosophically).
[Being empirical]… means reading dozens of papers that are all maddeningly contradictory. It means going through papers agonizingly carefully because the entire thing might be invalidated by an offhand remark.[1] It means reading papers’ references, and the references’ references, and trawling for followup papers, and reading the followup papers’ other references.
That’s a way more nuanced claim than just “code review works!” And it’s way, way more nuanced than about 99% of the Code Review takes I see online that don’t talk about the evidence. Empiricism means being more diligent and putting in more work to understand, not less.
I value this sort of disciplined thinking – but I think it’s a mistake to brand this as “science” or “being empirical”. After all, historians and philosophers also agonize through papers, crawling the reference tree, and develop highly nuanced, qualified claims. There’s nothing unique to science about this.
I think we should call for something broader than merely disciplined empirical thinking. We want disciplined empirical and philosophical/anecdotal thinking.
My ideal is that software developers accept or reject ideas based on the strength or weakness of the argument behind them, rather than whims, popularity of the idea, or the perceived authority or “charisma” of their advocates. For empirical arguments, this means doing what you described – reading a bunch of studies, paying attention to the methodology and the data description, following the reference trail when warranted. For philosophical/anecdotal arguments, this means doing what I described – mentally searching for inconsistencies, evaluating the argument against your own experiences and other evidence you are aware of.
Occasionally, this means the strength of a scientific argument must be weighed against a philosophical/anecdotal argument. The essence of my thesis is that, sometimes, a thoughtful, well-explained story by a practitioner can be a stronger argument than an empirical study (or more than one) with limited data and generality. “X worked for us at Dropbox and here is my analysis of why” can be more persuasive to a practitioner than “X didn’t appear to work for undergrad projects at 12 institutions, and there is not a correlation between X and good outcome Y in a sampling of Github Repos”.
Hi, thanks for responding! I think we’re mostly on the same page, too, and have the same values. We’re mostly debating degrees and methods here. I also agree that the issues you raise make things much more difficult. My stance is just that while they do make things more difficult, they don’t make it impossible, nor do they make it not worth doing.
Ultimately, while scientific research is really important, it’s only one means of getting knowledge about something. I personally believe it’s an incredibly strong form- if philosophy makes one objective claim and science makes another, then we should be inclined to look for flaws in the philosophy before looking for flaws in the science. But more than anything else, I want defence in depth. I want people to learn the science, and the history, and the philosophy, and the anthropology, and the economics, and the sociology, and the ethics. It seems to me that most engineers either ignore them all, or care about only one or two of these.
(Anthro/econ/soc are also sciences, but I’m leaving them separate because they usually make different claims and use different ((scientific!)) methods than what we think of as “scientific research” on software.)
One thing neither of us has brought up, that is also important here: we should know the failure modes of all our knowledge. The failure modes of science are really well known: we covered them in the article and our two responses. If we want to more heavily lean on history/philosophy/anthropology, we need to know the problems with using those, too. And I honestly don’t know them as well as I do the problems with scientific knowledge, which is one reason I don’t push it as hard- I can’t tell as easily when I should be suspicious.
When doctors get involved in fields such as medical education or quality improvement and patient safety, they often have a similar reaction to Richard’s. The problem is in thinking that the only valid way to understand a complex system is to study each of its parts in isolation, and if you can’t isolate them, then you should just give up.
As Hillel illustrated nicely here, you can in fact draw valid conclusions from studying “complex systems in the wild”. While this is a “messier” problem, it is much more interesting. It requires a lot of creativity but also more rigor in justifying and selecting the methodology, conducting the study, and interpreting the results. It is very easy to do a subpar study in those fields, which feeds the perception of those fields being “unscientific”.
Can we study teachers? Can we study doctors and nurses?
The answer to that question might be “no”.
When you’re replying to an article that’s titled “The False Promise of Science”, with a bunch of arguments against empirical software engineering that seem applicable to other fields as well, and your whole argument is basically an analogy, you should probably consider the possibility that Science is Just Wrong and we should all go back to praying to the sun.
The education field is at least as fad- and ideology-driven as software, and the medical field has cultural problems and studies that don’t reproduce. Many of the arguments given in this essay are clearly applicable to education and medicine (though not all of them obviously are, I can easily come up with new arguments for both fields). The fundamental problem with applying science to any field of endeavor is that it’s anti-situational at the core. The whole point of The Scientific Method is to average over all but a few variables, but people operating in the real world aren’t working with averages, they’re working with specifics.
The argument that software isn’t special cuts both ways, after all.
I’m not sure if I actually believe that, though.
The annoying part about this is that, as reasonably compelling as it’s possible to make the “science sucks” argument sound, it’s not very conducive to software engineering, where the whole point of the practice is to write generalized algorithms that deal with many slight variants of the same problem, so that humans don’t have to be involved in every little decision. Full-blown primitivism, where you reject Scalable Solutions(R) entirely, has well-established downsides like heightened individual risk; one of the defining characteristics of modernism is risk diffusion, after all.
Adopting hard-and-fast rules is just a trade-off. You make the common case simpler, and you lose out in the special cases. This is true both within the software itself (it’s way easier to write elegant code if you don’t have weird edge cases) and with the practice. The alternative, where you allow for exceptions to the rules, is decried as bad for different reasons.
I don’t know very much about classroom teaching or nursing, so I can’t deep-dive into that research as easily as I can software… but there are many widespread and important studies in both fields that give us actionable results. If we can do that with nursing, why not software?
To be honest, I think you’re overselling what empirical science tells us in some of these domains, too. Take the flipped classroom one, since it’s an example I’ve seen discussed elsewhere. The state of the literature summarized in that post is closer to: there is some evidence that this might be promising, but confidence is not that high, particularly in how broadly this can be interpreted. Taking that post on its own terms (I have not read the studies it cites independently), it suggests not much more than that overall reported studies are mainly either positive or inconclusive. But it doesn’t say anything about these studies’ generalizability (e.g. whether outcomes are mediated by subject matter, socioeconomic status, country, type of institution, etc.), suggests they’re smallish in number, suggests they’ve not had many replication attempts, and pretty much outright says that many studies are poorly designed and not well controlled. It also mentions that the proxies for “learning” used in the studies are mostly very short-term proxies chosen for convenience, like changes in immediate test scores, rather than the actual goal of longer-term mastery of material.
Of course that’s all understandable. Gold-standard studies like those done in medicine, with (in the ideal case) some mix of preregistration, randomized controlled trials, carefully designed placebos, and longitudinal follow-up across multi-demographic, carefully characterized populations, etc., are logistically massive undertakings, and expensive, so basically not done outside of medicine.
Seems like a pretty thin rod on which to hang strong claims about how we ought to reform education, though. As one input to qualitative decision-making, sure, but one input given only its proper weight, in my opinion significantly less than we’d weight the much better empirical data in medicine.
My favorite software book is Data and Reality, which is a philosophical investigation into the nature of information representation.
A beautiful book, one of my favorites as well.
rest of post….
While I thought the article articulated something important which I agree with, its conclusion felt a bit lazy and too optimistic for my taste – I’m more persuaded by the POV you’ve articulated above.
While we’re making analogies, “writing software is like writing prose” seems like a decent one to explore, despite some obvious differences. Specifically relevant is the wide variety of different and successful processes you’ll find among professional writers.
And I think this explains why you might be completely right that something like TDD is valuable for you, even though empirical studies don’t back up that claim in general. And I don’t mean that in a soggy “everyone has their own method and they’re all equally valid” way. I mean that all of your knowledge, the way you think about programming, your tastes, your knowledge of how to practice TDD in particular, and on and on, are all inputs into the value TDD provides you.
Which is to say: I find it far more likely that TDD (or similar practices with many knowledgeable, experienced supporters) has highly context-sensitive empirical value than none at all. I don’t foresee them being one day unmasked by science as the sacred cows of religious zealots (though they may be that in some specific cases too).
For something like TDD, the “treatment” group would really need to be something like “people who have all been taught how to do it by the same expert over a long enough time frame and whose knowledge that expert has verified and signed off on.”
I’m not shilling for TDD, btw – just using it as a convenient example.
The broader point is that effects can be real but extremely hard to show experimentally.
“We’re not basing our worldview off a single study. We’re doing many of them, in many different contexts, to get different facets of what the answer might actually be.”
That’s exactly what I do for the sub-fields I study. Especially formal proof which I don’t understand at all. Just constantly looking at what specialists did… system type/size, properties, level of automation, labor required… tells me a lot about what’s achievable and allows mix n’ matching ideas for new, high-level designs. That’s without even needing to build anything which takes a lot longer. That specialists find the resulting ideas worthwhile proves the surveys and integration strategy work.
So, I strongly encourage people to do a variety of focused studies followed by integrated studies on them. They’ll learn plenty. We’ll also have more interesting submissions on Lobsters. :)
“When someone says “you are unprofessional if you don’t use TDD” or “Dynamic types are immoral”, I know, with scientific certainty, that they don’t actually know. “
I didn’t think about that angle. Actually, you got me thinking maybe we can all start telling that to new programmers. They get warned the field is full of hype, trends, etc. that usually don’t pan out over time. We tell them there’s little data to back most practices. Then, experienced people cutting them down or getting them onto a new trend might have less effect. Esp on their self-confidence. Just thinking aloud here rather than committed to the idea.
“Science is about admitting that we are going to make mistakes”
I used to believe science was about finding the truth. Now I’d go further than you. Science assumes we’re wrong by default, will screw up constantly, and are too biased or dishonest to review the work alone. The scientific method basically filters bad ideas to let us arrive at beliefs that are justifiable and still might be wrong. Failure is both normal and necessary if that’s the setup.
The cognitive dissonance makes it really hard, like you said. I find it a bit easier to do development and review separately. One can be in go mode iterating on stuff. At another time, in skeptical mode critiquing the stuff. The go mode also gives a mental break and/or refreshes the mind, too.
You’d be surprised. “Two comparisons of programming languages”, in Making Software, does this with nine teams (but only for one day).
My reading (which is congruent with my experiences) indicates a newly-put-together team takes 3-6 months before productivity stabilizes. Some schools of management view this as ‘stability=groupthink, shuffle the teams every 6 months’ and some view it as ‘stability=predictability, keep them together’. However, IMO you might not be able to infer much from one day of data.
To clarify, that specific study was about nine existing software teams- they came to the project as a team already. It’s a very narrow study and definitely has limits, but it shows that researchers can do studies on teams of professionals.
People bring up how expensive it is. And it is expensive, just as it’s expensive to study people in general. But here’s the thing. We are one of the richest industries in the history of the world. Apple’s revenue last year was a quarter trillion dollars. That’s not something we should leave to folklore and feelings. We’re worth studying.
I don’t think I understand what you’re saying. Software is expensive, and for some companies, very profitable. But would it really be more profitable if it were better studied? And what exactly does that have to do with the kinds of things that the software engineering field likes to study, such as defect rates and feature velocities? I think that in many cases, even relatively uncontroversial practices like code review are just not implemented because the people making business decisions don’t think the prospective benefit is worth the prospective cost. For many products or services, code quality (however operationalized) makes a poor experimental proxy for profitability.
Inasmuch as software development is a form of industrial production, there’s a huge body of “scientific management” literature that could potentially apply, from Frederick Taylor on forward. And I would argue it generally is being applied too: just in service of profit. Not for some abstract idea of “quality”, let alone the questionable ideal of pure disinterested scientific knowledge.
Mistakes are becoming increasingly costly (e.g., commercial jets falling from the sky) so understanding the process of software-making with the goal of reducing defects could save a lot of money. If software is going to “eat the world”, then the software industry needs to grow up and become more self-aware.
Aviation equipment and medical devices are already highly regulated, with quality control processes in place that produce defect rates orders of magnitude less than your average desktop or business software. We already know some things about how to make high-assurance systems. I think the real question is how much of that reasonably applies to the kind of software that’s actually eating the world now: near-disposable IoT devices and gimmicky ad-supported mobile apps, for example.
There’s a chapter in Alasdair MacIntyre’s After Virtue (1981) called “The Character of Generalizations in Social Science and their Lack of Predictive Power” which makes the point that the weakness of prediction in social science is a necessary result of human activity’s essential complexity and novelty. He gives four “sources of systematic unpredictability in human affairs”:
Human activity involves radical conceptual innovation.
Humans cannot predict themselves, and this chaos spreads socially.
Human affairs have a game-theoretic character of extreme complexity.
Pure contingency.
Interestingly he gives one argument for (1) that invokes Alonzo Church and the general undecidability of mathematical propositions to indicate that it is provably impossible to accurately predict the future of mathematical innovation.
I think these arguments are quite applicable to software development, for example concerning the question of why it’s so hard to plan a project accurately.
He then goes on to a corresponding list of sources of systematic predictability. These are things like clocks and schedules, seasons, institutions, and so on. These are environmental support systems. Part of his point is that if we want some predictability in human affairs, we can’t just rely on scientific observation—we have to actually support the conditions that enable prediction, introduce stable sources of regularity in our personal and social lives.
Taken too far that can lead to totalitarianism as the only way to eliminate the sources of unpredictability, by eliminating innovation, freedom, and our precious ability to “remain to some degree opaque and unpredictable.”
“Software developers are domain experts. We know what we’re doing. We have rich internal narratives, and nuanced mental models of what it is we’re about, …”
For a large proportion of software undertakings surely this is not true. Much of software development is outside the domains of the computing and data sciences, and computing infrastructure. While it’s popular to consider that these are the only endeavors of importance to today’s developers, modeling systems from other domains into code represents the majority of software running in the world today.
In these, we don’t know how to model, reliably and predictably, the systems that stakeholders (think they) want, and that external domain experts know. How can one consider applying scientific rigor to that?
Great point. You are certainly correct that software developers are not always experts in the domains of their products. They are still experts in the domain of their tools and practices though, so they should be considered “domain experts” from the perspective of researchers.
I agree with the thrust of this, but I don’t think it’s quite as bleak as he claims.
It is indeed difficult, perhaps impossible, for academic researchers to arrange the kinds of studies of professional software developers that would answer important questions about development practices. But large tech companies can do it, and have done so on occasion.
The bigger challenge is getting companies to pay attention to the results when they’re available.
Adequate sleep maximizes delivery speed and minimizes defects? Cool, but it’s crunch time, so you can sleep a bit less just for this one project, right?
Single-person offices with doors maximize productivity? Awesome, welcome to our huge open floor plan.
Code reviews produce better designs and increase maintainability? Eh, your coworker is on a deadline and doesn’t have time to read your code, so just commit it and we’ll let QA find the bugs.
That last one is funny because so many cos did take note of the fact that code review and automated testing are tremendously effective… by firing entire QA depts and assuming code review and testing would make up the difference, even if they’re not given any priority and don’t get done.
The problem is that upfront cost savings are easier to justify than long-term cost savings. You see a similar effect with companies cramming products in to make quarterly numbers, and shipping less complete, buggier things that have weaker earnings legs underneath them.
My favorite cost-saving measure of companies is not providing up-to-date hardware. Apparently wasting each developer’s time at $50-100/hr on compiles for months at a time is cheaper than spending a few thousand dollars on decent hardware.
I am sympathetic to the company perspective on that in some circumstances, though not in all circumstances. If you’re still in the product-market fit, “Will we even be able to sell this new thing we’re building to any customers?” phase, spending more time or money optimizing for the long term is probably a bad call because of the chance there won’t be a long term. That can cause the expected value of additional short-term work to drop below zero.
Past the exploratory phases of a project, it gets a lot harder to justify, though once you’ve gotten in the habit of focusing exclusively on banging out user stories as quickly as possible, it’s hard to change course.
If I remember correctly, a while ago I heard a podcast by some Facebook manager. She said one of Zuckerberg’s skills is to identify which phase a project is in and to get the right people for the phase in there.
In my experience, there has been a distinct lack of critical thinking in software engineering. Whether it’s citing scientific studies as the end-all-be-all, or saying “hey, it works for company X, therefore it will work for us,” it’s the same problem. I think software needs to be more scientific and data-driven, but on a smaller scale (within your team/organization).
As an anecdote, I’ve had an experience where a full rewrite was proposed for performance issues without any measurement as to what those issues actually were. Experiences from the past were put forth to justify areas to focus on for performance issues. The rewrite went ahead and a lot of effort was put into optimizing something that brought very little improvement. In this case, measuring first would have helped identify the more significant issues.
Agree on everything. Great article. Nonetheless it’s just scratching the surface. Like History and Philosophy (but also some scientific fields) have done in the last few centuries, we should transform our discipline around the idea of challenging ideological narratives that shape the way in which we build software. Even in open source and free software there are many assumptions on what it means to develop, publish and maintain software that are still heavily ideological in nature and are left unquestioned.
Clearly this is due to the discipline being young and because it existed only under Capitalism (be it wild Californian capitalism, state capitalism in the USSR, or the most recent flavor of Chinese centralized hyper-capitalism). Nonetheless, after decades of software development, the time has come to start exploring new forms and escape the ideological cages in which software is bound and possibly escape into bigger and more diverse cages.
Yes. As enkiv2 here has pointed out on a few occasions, one of the ideologies we need to question is scalability - the idea that any software service ought to be able to scale up to the point where it is the only one of its kind serving the whole world. It turns out that this is a recipe for disaster, and that we ought to be making our software aggressively unscalable.
Since empirical and quantitative measurement of software development is hard, I think the only way to scientifically get to the bottom of software development is an extremely thorough qualitative study with a large sample size (n=10000 or more). Doing such research would require an extreme amount of time and resources, however.
I’m missing a few points, otherwise a really good piece.
I do think we’re often using parts of real science[tm]. Our building blocks, mostly algorithms and data structures, O notation, networking. Much of this stuff is researched and backed up by science.
The problem to me seems to be that as an average software developer, to use some building metaphor, you are: the architect, the project leader, the mason, the electrician, the guy carrying 50kg sacks of cement, the cleaning crew, the plumber, … all in one person. (Well, except if you are an Enterprise Architect and hand in a PDF of how the common people shall henceforth create the software from your design…) Last I checked there’s also not much science on the stuff electricians and masons do in everyday work of laying brick and cables. You learn it, you get experience, you do it. Much of software development is like this.
Last I checked there’s also not much science on the stuff electricians and masons do in everyday work of laying brick and cables. You learn it, you get experience, you do it.
Maybe ask a structural, mechanical, electrical, or civil engineer about that? Or have a look at the International Building Code? Or OSHA? Construction is a mature field, with many rigorous standards in place.
I never said it’s the wild west and everyone does random things. I just think “following safety practices” which are probably based on some sort of science and studies a few layers down the line does not mean “I am actively applying science in my daily job”. Also let me stress that I said mason and electrician and not electrical or civil engineer. Also that quote was kinda out of context as I specifically said I feel like software engineers are doing the jobs of several people on several layers at the same time. And I certainly don’t think actively about what task I am performing right now. You are the sum of your knowledge and experience.
I think that some building inspectors might disagree. But if construction workers can just follow best practices (as dictated by building and safety codes) and not have to “do science” themselves, that’s only because those practices have already been established. Plenty of science was done along the way.
Yeah - I guess my point is that it’s useful to distinguish between ‘doing science’ and, say, reading papers and using what you learn to make something. Aside from anything else, the funding models (and appropriate times to say ‘this is not working, stop it’) are drastically different.
I do think we’re often using parts of real science[tm]. Our building blocks, mostly algorithms and data structures, O notation, networking. Much of this stuff is researched and backed up by science.
That’s all math, a completely different form of knowledge. Also it’s debatable if Software Engineering is about that: we take those building blocks and encase them in a very complex, very different structure built in a completely different way.
Oh dang another essay on empirical software engineering! I wonder if they read the same sources I did
Reads blog
I think I’m now officially internet famous. I feel like I crossed a threshold or something :D
So I’m not sure how much of this is frustration with ESE in general or with me in particular, but a lot of quotes are about my talk, and so I’m not sure if I should be defending myself? I’m gonna err on the side of defending myself, mostly because it’s an excuse to excitedly talk about why I’m so fascinated by empirical engineering.
One thing I want to open with. I’ve mentioned a couple of times on Lobsters that I’m working on a long term journalism project. I’m interviewing people who worked as “traditional” engineers, then switched to software, and what they see as the similarities and differences. I’ve learned a lot from this project, but one thing in particular stands out: we are not special. Almost everything we think is unique about software, from the rapid iteration to clients changing the requirements after we’ve released, happens all the time in other fields.
So, if we can’t empirically study software engineering, it would follow that we can’t empirically study any kind of engineering. If “you can’t study it” only applied to software, that would make software Special. And everything else people say about how software is Special turns out to be wrong, so I think it’s the case here.
I haven’t interviewed people outside of engineering, but I believe it goes even further: engineering isn’t special. If we can’t study engineers, then we can’t study lawyers or nurses or teachers or librarians. Human endeavor is incredibly complex, and every argument we can make about why studying software is impossible extends to any other job. I fundamentally reject that. I think we can usefully study people, and so we can usefully study software engineers.
Okay so now for individual points. There’s some jank here, because I didn’t edit this a whole lot and didn’t polish it at all.
Accelerate’s research is exclusively done by surveying people. This doesn’t mean it’s not empirical- as I say in the talk, qualitative information is really helpful. And one of my favorite examples of qualitative research, the Gamasutra Study on Crunch Mode, uses a similar method. But it’s far from being settled, and it bothers me that people use Accelerate as “scientifically proven!!!”
You’d be surprised. “Two comparisons of programming languages”, in “making software”, does this with nine teams (but only for one day). Some labs specialize in this, like SIMULA lab. Companies do internal investigations on this- Microsoft and IBM especially has a lot of great work in this style.
But regardless of that, controlled experiments aren’t supposed to be holistic. They test what we can, in a small context, to get solid data on a specific thing. Like VM Warmup Blows Hot and Cold: in a controlled environment, how consistent are VM benchmarks? Turns out, not very! This goes against all of our logic and intuition, and shows the power of controlled studies. Ultimately, though, controlled studies are a relatively small portion of the field, just as they’re a small portion of most social sciences.
For that matter, using students is great for studies on how students learn. There’s a ton of amazing research on what makes CS concepts easier to learn, and you have to use students for that.
This is true for every form of human undertaking, not just software. Can we study teachers? Can we study doctors and nurses? Their world is just as chaotic and dependent as ours is. Yet we have tons of research on how educators and healthcare professionals do their jobs, because we collectively agree that it’s important to understand those jobs better.
One technique we can use cross-correlating among many different studies on many different groups. Take the question “does Continuous Delivery help”. Okay, we see that companies that practice it have better outcomes, for whatever definiton of “outcomes” we’re using. Is that correlation or causation? Next we can look at “interventions” where a company moved to CD and see how it changed their outcomes. We can see what practices all of the companies share and what things they have different, to see what cluster of other explanations we have. We can examine companies where some teams use CD and some teams do not, and correlate their performance. We can look at what happens when people move between the different teams. We can look at companies that moved away from CD.
We’re not basing our worldview off a single study. We’re doing many of them, in many different contexts, to get different facets of what the answer might actually be. This isn’t easy! But it’s worth doing.
Pretty much all studies take this as a given. When we study things like “defect rate”, we’re always studying it in the context of unit time or unit cost. Otherwise we’d obviously just use formal verification for everything. And it’s totally feasible to ask this of field data. In some cases, companies are willing to instrument themselves- see TSP or the NASA data sets. In other cases, the data is computable- see research on defect rates due to organizational structure and code churn. Finally, we can cross-correlate between different projects, as is often done with repo mining.
These are hard problems, certaintly. But lots of things are “hard problems”. It’s literally scientists’ jobs to figure out how to solve these problems. Just because we, as layfolk, can’t figure out how to solve these problems doesn’t they’re impossible to solve.
This is why we do a lot of different studies and test a lot of different hypothesis. Again, this is an accepted fact in empiricial research. We know it’s hard. We do it anyway.
A better analogue is healthcare, the actual system of how we run hospitals and such. Thats in the same boat as software development: there’s a lot we don’t know, but we’re trying to learn more. The difference is that most people believe studying healthcare is important, but that studying software is not.
The lack of empirical evidence for most things doesn’t mean we’re “doomed to charisma-driven development.” Rather it’s the opposite: I find the lack of evidence immensely freeing. When someone says “you are unprofessional if you don’t use TDD” or “Dynamic types are immoral”, I know, with scientific certainty, that they don’t actually know. They just believe it. And maybe it’s true! But if they want to be honest with themselves, they have to accept that doubt. Nobody has the secret knowledge. Nobody actually knows, and we all gotta be humble and honest about how little we know.
Of course science isn’t the only kind of knowledge! I just gave a talk at Deconstruct on the importance of studying software history. My favorite software book is Data and Reality, which is a philosophical investigation into the nature of information representation. My claim is that science is a very powerful form of knowledge that we as software folk not only neglect, but take pride in our neglecting. It’s like, yes, we don’t just have science, we have history and philosophy. But why not use all three?
Of course we can do that. Most of our knowledge will be accumulated this way, and that’s fine. But I think it’s a mistake to be satisfied with that. For any argument in software, I can find two experts, giants in their fields, who have rigorous arguments and beautiful narratives… that contradict each other. Science is about admitting that we are going to make mistakes, that we’re going to naturally believe things that aren’t true, no matter how mentally rigorous we try to be. That’s what makes it so important and so valuable. It gives us a way to say “well you believe X and I believe not X, so which is it?”
I think I know what you’re referencing here, and if it’s what I think it is, yeah that got ugly fast.
Regardless of how Thought Leaders use science, my experience has been the opposite of this. Being empirical is the opposite of easy. If I wanted to not think, I’d say “LOGICALLY I’m right” or something. But I’m an idiot and want to be empirical, which means reading dozens of papers that are all maddeningly contradictory. It means going through papers agonizingly carefully because the entire thing might be invalidated by an offhand remark.[1] It means reading paper’s references, and the references’ references, and trawling for followup papers, and reading the followup paper’s other references. It means spending hours hunting down preprints and emailing authors because most of the good stuff is locked away by the academic paper hoarders.
Being empirical means being painfully aware of the cognitive dissonance in your head. I love TDD. I recommend it to beginners all the time. I think it makes me a better programmer. At the same time, I know the evidence for it is… iffy. I have to accept that something I believe is mostly unfounded, and yet I still believe in it. That’s not the easy way out, that’s for sure!
And even when the evidence is in your favor, the final claim is infuriatingly nuanced. Take code review! “Code Review works”. By works, I mean “in most controlled studies and field studies, code review finds a large portion of the extant bugs in reviewed code in a reasonable timeframe. But most of the comments in code review are not bug-finding, but code quality things, about 3 code improvements per 1 bug usually. Certain things make CR better, and certain things make it a lot worse, and developers often complain that most of the code review comments are nitpicks. Often CRs are assigned to people who don’t actually know that area of the codebase well, which is a waste of time for everyone. There’s a limit to how much people can CR at a time, meaning it can easily become a bottleneck if you opt for 100% review coverage.”
That’s a way more nuanced claim than just “code review works!” And it’s way, way more nuanced than about 99% of the Code Review takes I see online that don’t talk about the evidence. Empiricism means being more diligent and putting in more work to understand, not less.
So one last thought to close this out. Studying software is hard. People bring up how expensive it is. And it is expensive, just as it’s expensive to study people in general. But here’s the thing. We are one of the richest industries in the history of the world. Apple’s revenue last year was a quarter trillion dollars. That’s not something we should leave to folklore and feelings. We’re worth studying.
[1]: I recently read one paper that looked solid and had some really good results… and one sentence in the methodology was “oh yeah and we didn’t bother normalizing it”
Hi Hillel! I’m glad you found this, and thank you for taking the time to respond.
I’m not sure you necessarily need to mount a defense, either. I didn’t consciously intend to set your talk up as the antagonist in my post, but I realize this is sort of what I did. The attitude I’m trying to refute (that empirical science is the only source of objective knowledge about software) is somewhat more extreme than the position you advocate. And the attitude you object to (that software “can’t be studied” empirically, and nothing can be learned this way) is certainly more extreme than the position I hoped to express. I think in the grand scheme of things we largely share the same values, and our difference of opinion is rather esoteric and mostly superficial. That doesn’t mean it’s not interesting to debate, though.
Re: Omitted variable bias
You seemed to suggest that research could account for omitted variable bias by “cross-correlating” studies.
I submit to you this is not the case. Continuing with the CD example, suppose CD doesn’t improve outcomes but the “trendiness” that leads to it does. It is completely plausible for
If these hold, then all of the studies in the “cross-correlation” you describe will still misattribute an effect to CD.
You can’t escape omitted variable bias just by collecting more data from more types of studies. In order to legitimately address it, you need to do one of:
If you don’t address a plausible omitted variable bias in one of these ways, then you basically have no guarantee that the effect (or lack of effect) you measured was actually the effect of the practice, and not the effect of whatever social conditions or ideology led to the adoption of the practice (or something else those social conditions caused). This is a huge threat to validity, especially to “code mining” studies whose only dataset is a git log and which therefore have no hope of capturing or controlling for the social or human drivers behind the practice. To be totally honest, I assign basically zero credibility to the empirical argument of any “code mining” study for this reason.
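To make that concrete, here is a minimal simulation sketch, purely illustrative and carrying on the hypothetical CD/“trendiness” example above: CD has zero true effect in this model, yet a naive observational comparison still favors the CD teams, because the hidden confounder drives both adoption and outcomes.

```python
import random

random.seed(0)

teams = []
for _ in range(10_000):
    trendiness = random.random()                 # latent confounder, never observed
    adopts_cd = random.random() < trendiness     # trendier teams adopt CD more often
    outcome = trendiness + random.gauss(0, 0.1)  # outcome driven only by trendiness
    teams.append((adopts_cd, outcome))

with_cd = [o for adopted, o in teams if adopted]
without_cd = [o for adopted, o in teams if not adopted]

print("mean outcome with CD:   ", sum(with_cd) / len(with_cd))
print("mean outcome without CD:", sum(without_cd) / len(without_cd))
# The CD group scores noticeably higher even though CD does nothing in this
# model: adoption and outcome share the hidden "trendiness" cause, so the
# comparison misattributes the effect to the practice.
```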
Re: The analogy to medicine
As @notriddle seemed to be hinting at, professions comprehensively guided by science are the exception, not the rule. Science-based lawyering seems… unlikely. Science-based education is not widely practiced, and is controversial in any case. Medicine seems to be the major exception. It’s worth exploring the analogy/disanalogy between software and medicine in greater detail. Is software somehow inherently more difficult to study than medicine?
Maybe not. You brought up two good points about avenues of software research.
and
I think analysis of this form is miles more persuasive than computer lab studies or code mining. If a company randomly selects certain teams to adopt a certain practice and certain teams not to, this solves the realism problem because they are, in fact, real software teams. And it solves the omitted variable bias problem because the practice was guaranteed to have been adopted randomly. I think much of the reason medicine has been able to incorporate empirical studies so successfully is because hospitals are so heavily “instrumented” (as you put it) and willing to conduct “clinical trials” where the treatment is randomly assigned. I’m quite willing to admit that we could learn a lot from empirical research if software shops were willing to instrument themselves as heavily as hospitals, and begin randomly designating teams to adopt practices they want to study. I think it’s quite reasonable to advocate for a movement in that direction.
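Continuing the same illustrative sketch under the same assumptions: if the practice is instead assigned by coin flip, the way a company-run trial could assign it to teams, the spurious effect disappears and the estimate lands near the true (zero) effect.

```python
import random

random.seed(1)

estimates = []
for _ in range(1_000):                       # repeat the hypothetical trial many times
    treated, control = [], []
    for _ in range(200):                     # 200 hypothetical teams per trial
        trendiness = random.random()         # still drives the outcome...
        adopts_cd = random.random() < 0.5    # ...but assignment is now a coin flip
        outcome = trendiness + random.gauss(0, 0.1)
        (treated if adopts_cd else control).append(outcome)
    estimates.append(sum(treated) / len(treated) - sum(control) / len(control))

print("average estimated CD effect:", sum(estimates) / len(estimates))
# With random assignment the estimate hovers around zero, matching the true
# effect in this model; the confounder no longer correlates with who got the
# practice.
```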
But whether or not we should advocate for better data and more research is orthogonal to the main concern of my post: in the meantime, while we are clamoring for better data, how ought we to evaluate software practices? Do we surrender to nihilism because the data doesn’t (yet) paint a complete picture? Do we make wild extrapolations from the faint picture the data does paint? Or should we explore and improve the body of “philosophical” ideas about programming, developed by programmers through storytelling and reflection on experience?
It is very important to do that last thing. I wrote my post because, for a time, my own preoccupation with the idea that only scientific inquiry had an admissible claim to objective truth prevented me from enjoying and taking e.g. “A Philosophy of Software Design” seriously (because it was not empirical), and realizing what a mistake this was was somewhat of a personal revelation.
Re: Epistemology
Science won’t rescue you from the fact that you’re going to believe things that aren’t true, no matter how mentally rigorous you try to be. Science is part of the attempt to be mentally rigorous. If you aren’t mentally rigorous and you do science, your statistical model will probably be wrong, and omitted variable bias will lead you to conclude something that isn’t true.
Science, to me, is merely a toolbox for generating persuasive empirical arguments based on data. It can help settle the debate between “X” and “not X” if there are persuasive scientific arguments to be found for X, and there are not persuasive scientific arguments to be found for “not X” – but just as frequently, there turn out to be persuasive scientific arguments for both “X” and “not X” that cannot be resolved empirically and must instead be resolved theoretically/philosophically. (Or – as I think describes the state of software research so far – there turn out to be persuasive scientific arguments for neither “X” nor “not X”, and again, the difference must be resolved theoretically/philosophically.)
I value this sort of disciplined thinking – but I think it’s a mistake to brand this as “science” or “being empirical”. After all, historians and philosophers also agonize through papers, crawling the reference tree, and develop highly nuanced, qualified claims. There’s nothing unique to science about this.
I think we should call for something broader than merely disciplined empirical thinking. We want disciplined empirical and philosophical/anecdotal thinking.
My ideal is that software developers accept or reject ideas based on the strength or weakness of the argument behind them, rather than whims, popularity of the idea, or the perceived authority or “charisma” of their advocates. For empirical arguments, this means doing what you described – reading a bunch of studies, paying attention to the methodology and the data description, following the reference trail when warranted. For philosophical/anecdotal arguments, this means doing what I described – mentally searching for inconsistencies, evaluating the argument against your own experiences and other evidence you are aware of.
Occasionally, this means the strength of a scientific argument must be weighed against a philosophical/anecdotal argument. The essence of my thesis is that, sometimes, a thoughtful, well-explained story by a practitioner can be a stronger argument than an empirical study (or more than one) with limited data and generality. “X worked for us at Dropbox and here is my analysis of why” can be more persuasive to a practitioner than “X didn’t appear to work for undergrad projects at 12 institutions, and there is not a correlation between X and good outcome Y in a sampling of Github Repos”.
Hi, thanks for responding! I think we’re mostly on the same page, too, and have the same values. We’re mostly debating degrees and methods here. I also agree that the issues you raise make things much more difficult. My stance is just that while they do make things more difficult, they don’t make it impossible, nor do they make it not worth doing.
Ultimately, while scientific research is really important, it’s only one means of getting knowledge about something. I personally believe it’s an incredibly strong form- if philosophy makes one objective claim and science makes another, then we should be inclined to look for flaws in the philosophy before looking for flaws in the science. But more than anything else, I want defence in depth. I want people to learn the science, and the history, and the philosophy, and the anthropology, and the economics, and the sociology, and the ethics. It seems to me that most engineers either ignore them all, or care about only one or two of these.
(Anthro/econ/soc are also sciences, but I’m leaving them separate because they usually make different claims and use different ((scientific!)) methods than what we think of as “scientific research” on software.)
One thing neither of us has brought up, that is also important here: we should know the failure modes of all our knowledge. The failure modes of science are really well known: we covered them in the article and our two responses. If we want to lean more heavily on history/philosophy/anthropology, we need to know the problems with using those, too. And I honestly don’t know them as well as I do the problems with scientific knowledge, which is one reason I don’t push them as hard- I can’t tell as easily when I should be suspicious.
What a fantastic response.
When doctors get involved in fields such as medical education or quality improvement and patient safety, they often have a similar reaction to Richard’s. The problem is in thinking that the only valid way to understand a complex system is to study each of its parts in isolation, and that if you can’t isolate them, then you should just give up.
As Hillel illustrated nicely here, you can in fact draw valid conclusions from studying “complex systems in the wild”. While this is a “messier” problem, it is much more interesting. It requires a lot of creativity, but also more rigor in justifying and selecting the methodology, conducting the study, and interpreting the results. It is very easy to do a subpar study in those fields, which reinforces the perception that the fields are “unscientific”.
A paper by D. C. Phillips, Research in the Hard Sciences, and in Very Hard “Softer” Domains, discusses this issue. Unfortunately, it’s behind a paywall.
The answer to that question might be “no”.
When you’re replying to an article that’s titled “The False Promise of Science”, with a bunch of arguments against empirical software engineering that seem applicable to other fields as well, and your whole argument is basically an analogy, you should probably consider the possibility that Science is Just Wrong and we should all go back to praying to the sun.
The education field is at least as fad- and ideology-driven as software, and the medical field has cultural problems and studies that don’t reproduce. Many of the arguments given in this essay are clearly applicable to education and medicine (though not all of them obviously are; I can easily come up with new arguments for both fields). The fundamental problem with applying science to any field of endeavor is that it’s anti-situational at the core. The whole point of The Scientific Method is to average over all but a few variables, but people operating in the real world aren’t working with averages, they’re working with specifics.
The argument that software isn’t special cuts both ways, after all.
I’m not sure if I actually believe that, though.
The annoying part about this is that, as reasonably compelling as it’s possible to make the “science sucks” argument sound, it’s not very conducive to software engineering, where the whole point of the practice is to write generalized algorithms that deal with many slight variants of the same problem, so that humans don’t have to be involved in every little decision. Full-blown primitivism, where you reject Scalable Solutions(R) entirely, has well-established downsides like heightened individual risk; one of the defining characteristics of modernism is risk diffusion, after all.
Adopting hard-and-fast rules is just a trade-off. You make the common case simpler, and you lose out in the special cases. This is true both within the software itself (it’s way easier to write elegant code if you don’t have weird edge cases) and with the practice. The alternative, where you allow for exceptions to the rules, is decried as bad for different reasons.
That is absolutely a valid counterargument! In response, I’d like to point out that we have learned a lot about those fields! Just a few examples:
I don’t know very much about classroom teaching or nursing, so I can’t deep-dive into that research as easily as I can software… but there are many widespread and important studies in both fields that give us actionable results. If we can do that with nursing, why not software?
To be honest, I think you’re overselling what empirical science tells us in some of these domains, too. Take the flipped classroom one, since it’s an example I’ve seen discussed elsewhere. The state of the literature summarized in that post is closer to: there is some evidence that this might be promising, but confidence is not that high, particularly in how broadly this can be interpreted. Taking that post on its own terms (I have not read the studies it cites independently), it suggests not much more than that overall reported studies are mainly either positive or inconclusive. But it doesn’t say anything about these studies’ generalizability (e.g. whether outcomes are mediated by subject matter, socioeconomic status, country, type of institution, etc.), suggests they’re smallish in number, suggests they’ve not had many replication attempts, and pretty much outright says that many studies are poorly designed and not well controlled. It also mentions that the proxies for “learning” used in the studies are mostly very short-term proxies chosen for convenience, like changes in immediate test scores, rather than the actual goal of longer-term mastery of material.
Of course that’s all understandable. Gold-standard studies like those done in medicine, with (in the ideal case) some mix of preregistration, randomized controlled trials, carefully designed placebos, and longitudinal follow-up across multi-demographic, carefully characterized populations, etc., are logistically massive undertakings, and expensive, so basically not done outside of medicine.
Seems like a pretty thin rod on which to hang strong claims about how we ought to reform education, though. As one input to qualitative decision-making, sure, but one input given only its proper weight, in my opinion significantly less than we’d weight the much better empirical data in medicine.
Dammit, man. That was a great response. I don’t think I’ll ever comment anything anywhere just so my comment won’t be compared to this.
A beautiful book, one of my favorites as well.
While I thought the article articulated something important which I agree with, its conclusion felt a bit lazy and too optimistic for my taste – I’m more persuaded by the POV you’ve articulated above.
While we’re making analogies, “writing software is like writing prose” seems like a decent one to explore, despite some obvious differences. Specifically relevant is the wide variety of different and successful processes you’ll find among professional writers.
And I think this explains why you might be completely right that something like TDD is valuable for you, even though empirical studies don’t back up that claim in general. And I don’t mean that in a soggy “everyone has their own method and they’re all equally valid” way. I mean that all of your knowledge, the way you think about programming, your tastes, your knowledge of how to practice TDD in particular, and on and on, are all inputs into the value TDD provides you.
Which is to say: I find it far more likely that TDD (and similar practices with many knowledgeable, experienced supporters) has highly context-sensitive empirical value than no value at all. I don’t foresee such practices being one day unmasked by science as the sacred cows of religious zealots (though they may be that in some specific cases too).
For something like TDD, the “treatment” group would really need to be something like “people who have all been taught how to do it by the same expert over a long enough time frame and whose knowledge that expert has verified and signed off on.”
I’m not shilling for TDD, btw – just using it as a convenient example.
The broader point is that effects can be real but extremely hard to show experimentally.
“We’re not basing our worldview off a single study. We’re doing many of them, in many different contexts, to get different facets of what the answer might actually be.”
That’s exactly what I do for the sub-fields I study. Especially formal proof which I don’t understand at all. Just constantly looking at what specialists did… system type/size, properties, level of automation, labor required… tells me a lot about what’s achievable and allows mix n’ matching ideas for new, high-level designs. That’s without even needing to build anything which takes a lot longer. That specialists find the resulting ideas worthwhile proves the surveys and integration strategy work.
So, I strongly encourage people to do a variety of focused studies followed by integrated studies on them. They’ll learn plenty. We’ll also have more interesting submissions on Lobsters. :)
“When someone says “you are unprofessional if you don’t use TDD” or “Dynamic types are immoral”, I know, with scientific certainty, that they don’t actually know. “
I didn’t think about that angle. Actually, you got me thinking maybe we can all start telling that to new programmers. They get warned the field is full of hype, trends, etc. that usually don’t pan out over time. We tell them there’s little data to back most practices. Then, experienced people cutting them down or pushing them onto the new trend might have less effect, especially on their self-confidence. Just thinking aloud here rather than committed to the idea.
“Science is about admitting that we are going to make mistakes”
I used to believe science was about finding the truth. Now I’d go further than you. Science assumes we’re wrong by default, will screw up constantly, and are too biased or dishonest to review the work alone. The scientific method basically filters bad ideas to let us arrive at beliefs that are justifiable and still might be wrong. Failure is both normal and necessary if that’s the setup.
The cognitive dissonance makes it really hard, like you said. I find it a bit easier to do development and review separately. One can be in go mode, iterating on stuff. At another time, in skeptical mode, critiquing the stuff. The go mode also gives a mental break and/or refreshes the mind, too.
My reading (which is congruent with my experiences) indicates a newly-put-together team takes 3-6 months before productivity stabilizes. Some schools of management view this as ‘stability=groupthink, shuffle the teams every 6 months’ and some view it as ‘stability=predictability, keep them together’. Either way, this indicates to me that you might not be able to infer much from one day of data.
To clarify, that specific study was about nine existing software teams- they came to the project as a team already. It’s a very narrow study and definitely has limits, but it shows that researchers can do studies on teams of professionals.
I don’t think I understand what you’re saying. Software is expensive, and for some companies, very profitable. But would it really be more profitable if it were better studied? And what exactly does that have to do with the kinds of things that the software engineering field likes to study, such as defect rates and feature velocities? I think that in many cases, even relatively uncontroversial practices like code review are just not implemented because the people making business decisions don’t think the prospective benefit is worth the prospective cost. For many products or services, code quality (however operationalized) makes a poor experimental proxy for profitability.
Inasmuch as software development is a form of industrial production, there’s a huge body of “scientific management” literature that could potentially apply, from Frederick Taylor on forward. And I would argue it generally is being applied too: just in service of profit. Not for some abstract idea of “quality”, let alone the questionable ideal of pure disinterested scientific knowledge.
Mistakes are becoming increasingly costly (e.g., commercial jets falling from the sky) so understanding the process of software-making with the goal of reducing defects could save a lot of money. If software is going to “eat the world”, then the software industry needs to grow up and become more self-aware.
Aviation equipment and medical devices are already highly regulated, with quality control processes in place that produce defect rates orders of magnitude less than your average desktop or business software. We already know some things about how to make high-assurance systems. I think the real question is how much of that reasonably applies to the kind of software that’s actually eating the world now: near-disposable IoT devices and gimmicky ad-supported mobile apps, for example.
There’s a chapter in Alasdair MacIntyre’s After Virtue (1981) called “The Character of Generalizations in Social Science and their Lack of Predictive Power” which makes the point that the weakness of prediction in social science is a necessary result of human activity’s essential complexity and novelty. He gives four “sources of systematic unpredictability in human affairs”:
Interestingly he gives one argument for (1) that invokes Alonzo Church and the general undecidability of mathematical propositions to indicate that it is provably impossible to accurately predict the future of mathematical innovation.
I think these arguments are quite applicable to software development, for example concerning the question of why it’s so hard to plan a project accurately.
He then goes on to a corresponding list of sources of systematic predictability. These are things like clocks and schedules, seasons, institutions, and so on. These are environmental support systems. Part of his point is that if we want some predictability in human affairs, we can’t just rely on scientific observation—we have to actually support the conditions that enable prediction, introduce stable sources of regularity in our personal and social lives.
Taken too far that can lead to totalitarianism as the only way to eliminate the sources of unpredictability, by eliminating innovation, freedom, and our precious ability to “remain to some degree opaque and unpredictable.”
For a large proportion of software undertakings, surely this is not true. Much of software development lies outside the domains of the computing and data sciences and computing infrastructure. While it’s popular to consider these the only endeavors of importance to today’s developers, modeling systems from other domains into code represents the majority of software running in the world today.
In these domains, we don’t know how to reliably and predictably model the systems that stakeholders (think they) want, and that external domain experts know. How can one consider applying scientific rigor to that?
Great point. You are certainly correct that software developers are not always experts in the domains of their products. They are still experts in the domain of their tools and practices though, so they should be considered “domain experts” from the perspective of researchers.
I like this degree of optimism, and hope one day to overcome my experience enough to share it!
I agree with the thrust of this, but I don’t think it’s quite as bleak as he claims.
It is indeed difficult, perhaps impossible, for academic researchers to arrange the kinds of studies of professional software developers that would answer important questions about development practices. But large tech companies can do it, and have done so on occasion.
The bigger challenge is getting companies to pay attention to the results when they’re available.
Adequate sleep maximizes delivery speed and minimizes defects? Cool, but it’s crunch time, so you can sleep a bit less just for this one project, right?
Single-person offices with doors maximize productivity? Awesome, welcome to our huge open floor plan.
Code reviews produce better designs and increase maintainability? Eh, your coworker is on a deadline and doesn’t have time to read your code, so just commit it and we’ll let QA find the bugs.
That last one is funny because so many companies did take note of the fact that code review and automated testing are tremendously effective… by firing entire QA departments and assuming code review and testing would make up the difference, even if they’re not given any priority and don’t get done.
The problem is that upfront cost savings are easier to justify than long-term cost savings. You see a similar effect with companies cramming products in to make quarterly numbers, shipping less complete things with more bugs that have weaker earnings legs underneath them.
My favorite cost-saving measure of companies is not providing up-to-date hardware. Apparently wasting each developer’s time at $50-100/hr on compiles for months at a time is cheaper than spending a few thousand dollars on decent hardware.
I am sympathetic to the company perspective on that in some circumstances, though not in all circumstances. If you’re still in the product-market fit, “Will we even be able to sell this new thing we’re building to any customers?” phase, spending more time or money optimizing for the long term is probably a bad call because of the chance there won’t be a long term. That can cause the expected value of additional short-term work to drop below zero.
Past the exploratory phases of a project, it gets a lot harder to justify, though once you’ve gotten in the habit of focusing exclusively on banging out user stories as quickly as possible, it’s hard to change course.
If I remember correctly, a while ago I heard a podcast by some Facebook manager. She said one of Zuckerberg’s skills is to identify which phase a project is in and to get the right people for that phase in there.
In my experience, there has been a distinct lack of critical thinking in software engineering. Whether it’s citing scientific studies as the end-all-be-all, or saying “hey, it works for company X, therefore it will work for us,” it’s the same problem. I think software needs to be more scientific and data-driven, but on a smaller scale (within your team/organization).
As an anecdote, I’ve had an experience where a full rewrite was proposed for performance issues without any measurement as to what those issues actually were. Experiences from the past were put forth to justify areas to focus on for performance issues. The rewrite went ahead and a lot of effort was put into optimizing something that brought very little improvement. In this case, measuring first would have helped identify the more significant issues.
Agree on everything. Great article. Nonetheless it’s just scratching the surface. Like History and Philosophy (but also some scientific fields) have done in the last few centuries, we should transform our discipline around the idea of challenging ideological narratives that shape the way in which we build software. Even in open source and free software there are many assumptions on what it means to develop, publish and maintain software that are still heavily ideological in nature and are left unquestioned.
Clearly this is due to the discipline being young and to it having existed only under capitalism (be it wild Californian capitalism, state capitalism in the USSR, or the most recent flavor of Chinese centralized hyper-capitalism). Nonetheless, after decades of software development, the time has come to start exploring new forms and escape the ideological cages in which software is bound, and possibly escape into bigger and more diverse cages.
Yes. As enkiv2 here has pointed out on a few occasions, one of the ideologies we need to question is scalability - the idea that any software service ought to be able to scale up to the point where it is the only one of its kind serving the whole world. It turns out that this is a recipe for disaster, and that we ought to be making our software aggressively unscalable.
Since empirical and quantitative measurement of software development is hard, I think the only way to scientifically get to the bottom of software development is an extremely thorough qualitative study with a large sample size (n=10,000 or more). Doing such research would require an extreme amount of time and resources, however.
I’m missing a few points, otherwise a really good piece.
I do think we’re often using parts of real science[tm]. Our building blocks, mostly algorithms and data structures, O notation, networking. Much of this stuff is researched and backed up by science.
The problem to me seems to be that as an average software developer, to use some building metaphor, you are: the architect, the project leader, the mason, the electrician, the guy carrying 50kg sacks of cement, the cleaning crew, the plumber, … all in one person. (Well, except if you are an Enterprise Architect and hand in a PDF of how the common people shall henceforth create the software from your design…) Last I checked there’s also not much science on the stuff electricians and masons do in everyday work of laying brick and cables. You learn it, you get experience, you do it. Much of software development is like this.
It’s a weird profession, really.
Maybe ask a structural, mechanical, electrical, or civil engineer about that? Or have a look at the International Building Code? Or OSHA? Construction is a mature field, with many rigorous standards in place.
I never said it’s the wild west and everyone does random things. I just think “following safety practices” which are probably based on some sort of science and studies a few layers down the line does not mean “I am actively applying science in my daily job”. Also let me stress that I said mason and electrician and not electrical or civil engineer. Also that quote was kinda out of context as I specifically said I feel like software engineers are doing the jobs of several people on several layers at the same time. And I certainly don’t think actively about what task I am performing right now. You are the sum of your knowledge and experience.
That’s… sort of the point.
Construction involves very little “come up with a hypothesis and test it”, which is step 0 of doing science.
I think that some building inspectors might disagree. But if construction workers can just follow best practices (as dictated by building and safety codes) and not have to “do science” themselves, that’s only because those practices have already been established. Plenty of science was done along the way.
Yeah - I guess my point is that it’s useful to distinguish between ‘doing science’ and, say, reading papers and using what you learn to make something. Aside from anything else, the funding models (and appropriate times to say ‘this is not working, stop it’) are drastically different.
That’s all math, a completely different form of knowledge. Also it’s debatable if Software Engineering is about that: we use those blocks to encase them in a very complex and different structure built in a completely different way.
[Comment removed by author]
[Comment removed by author]
Oof thanks! I wrote it out in a separate app in the fear of losing my progress and then apparently posted it into the wrong tab! /0\