1. 3

    But another sort of test is possible. Consider a test that asserts that every type of component in your app supports dark mode, or a test that iterates through the schema of every data field your app writes to any database, and asserts that it has an appropriate GDPR annotation. These tests aren’t “unit” or “integration”. I’m not sure what they are. Are they “system tests”? Whatever their name is, this technique is not as common as perhaps it should be. There aren’t a million blog posts written about this sort of test.

    Challenge Accepted
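
    For anyone wondering what such a test could look like, here is a minimal sketch. The schema registry, the `gdpr` annotation key, and the allowed values are all made up for illustration; a real app would load its actual schemas instead.

    ```javascript
    // Hypothetical schema registry: every table, field, and annotation here is invented.
    const schemas = {
      users: {
        email: { type: 'string', gdpr: 'personal' },
        createdAt: { type: 'date', gdpr: 'none' },
      },
      orders: {
        shippingAddress: { type: 'string', gdpr: 'personal' },
        total: { type: 'number', gdpr: 'none' },
      },
    }

    // Assumed set of valid annotation values.
    const ALLOWED_GDPR = new Set(['personal', 'sensitive', 'none'])

    // Walk every field of every schema and collect the ones lacking a valid annotation.
    function findUnannotatedFields(allSchemas) {
      const missing = []
      for (const [table, fields] of Object.entries(allSchemas)) {
        for (const [field, spec] of Object.entries(fields)) {
          if (!ALLOWED_GDPR.has(spec.gdpr)) missing.push(`${table}.${field}`)
        }
      }
      return missing
    }

    // The "test": throws as soon as any field is missing an annotation.
    const missing = findUnannotatedFields(schemas)
    if (missing.length > 0) throw new Error(`Missing GDPR annotation: ${missing.join(', ')}`)
    ```

    The point is that the test never names an individual field, so it keeps covering fields that are added later.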

    1. 1

      A million posts is a tall order.

    1. 13

      Even for IRs, this article seems to say “tagged unions are better and more ergonomic, but you can live without them” (including the exhaustiveness checking mentioned at the end). So it’s not really “overrated”, just “you can write code without them, with more pain” which holds for almost all PL constructs.

      1. 8

        Yeah alternative title: “tagged unions are great and you should emulate them in languages that don’t natively support them”

        1. 2

          I’ve mentioned this a bunch of times, but for those who haven’t seen it, Oil uses the same DSL as Python to describe “tagged unions”, called Zephyr ASDL.



          Right after I put it into the codebase I saw how much better the code was structured (even without pattern matching). And it continues to grow more useful after 4 years.

          It can now even express relationships that Rust can’t, i.e. what I’d call “first class variants”: https://github.com/rust-lang/rfcs/pull/2593

          (and it doesn’t have the ergonomic issues with boxing, etc.)

          In Oil’s case it compiles to BOTH statically typed Python and C++. It works in both cases. Objects and subtyping are sort of “dual” to sum types.

          So the bottom line is that you can code in an imperative language (Python, Java, C++), and still structure your compiler nicely. You can have the best of both worlds!
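
          To make the “emulate them” idea concrete, here is a rough sketch in JavaScript. It is not Zephyr ASDL and not how Oil does it; the `Num`/`Add`/`Neg` variants and the `tag` field are names I made up for illustration.

          ```javascript
          // Hypothetical expression IR: each node is a plain object carrying a `tag` field.
          const Num = (value) => ({ tag: 'Num', value })
          const Add = (left, right) => ({ tag: 'Add', left, right })
          const Neg = (operand) => ({ tag: 'Neg', operand })

          // One switch per operation. The default branch is a runtime stand-in for
          // the exhaustiveness check a compiler would otherwise perform.
          function evaluate(node) {
            switch (node.tag) {
              case 'Num': return node.value
              case 'Add': return evaluate(node.left) + evaluate(node.right)
              case 'Neg': return -evaluate(node.operand)
              default: throw new Error(`unhandled variant: ${node.tag}`)
            }
          }

          console.log(evaluate(Add(Num(2), Neg(Num(5))))) // -3
          ```

          Without static types, a forgotten variant only shows up at runtime, which is part of the “more pain” mentioned upthread.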

        1. 1

          The rocket emojis cracked. me. up.

          1. 1

            I’m really struggling with this “winner’s game” vs. “loser’s game” distinction. I don’t understand how the outcome of an amateur tennis game is “determined” by the loser hitting the ball into the net more times than the winner when you can just as easily say it was determined by the winner successfully hitting the ball in the court more times than the loser… Or why is a pro game determined by the winner acing the ball more times than the loser, and not by the loser failing to ace it as many times as the winner?

            1. 8

              In a “loser’s game”, you win by never dropping below some threshold for performance.

              In a “winner’s game”, you win by performing above some threshold.

              For most software, the difference in value delivered between “buggy” and “works” is greater than the difference between “works” and “works, and it’s amazing”; thus, it’s a “loser’s game” in that the key is to consistently not ship bugs.

              For tennis:

              Amateurs win points by repeatedly returning shots until their opponent messes up. Dropping below ‘minimum acceptable’ means hitting the ball out when they could have hit it in (an ‘unforced error’).

              In a pro game, ‘unforced errors’ are rare enough that the commentators spend a minute talking about them when they happen. The professionals simply don’t make that kind of mistake often, so points are won by actively creating a situation where the other player can’t respond (eg rapidly switching the ball to the opposite side of the court so their opponent can’t get across in time).

              1. 1

                Yes! This is a much better explanation than most of the comments in this thread.

            1. 2

              This is a defense of OOP by redefining OOP. Essentially, “Yes, inheritance and trying to model the world with classes are bad, but that’s not what OOP means; OOP means DRY and YAGNI and the law of Demeter and continuous refactoring and valuing simplicity above all else” seems like sophistry to me. The OOP moniker has no more claim to DRY and YAGNI and refactoring and simplicity than any other paradigm.

              1. 1

                I agree that those things you listed are not especially related to OOP, but the author doesn’t say they are:

                Unfortunately, design patterns can easily become a way to smuggle in overly complex OOP design under a veneer of respectability. […] How do you avoid this trap? Focus on the rock-solid principles of good programming [that you listed].

                The author says those principles are ways to guide yourself to apply OOP in a good way instead of in a bad way. They don’t say OOP consists of those principles.

                What does the author say OOP consists of, then? I think their view is represented by these quotes:

                An object is a programming construct that lets you pack together data and functionality in a somewhat reusable package.

                Object-oriented languages give you a set of tools for using objects (formalizing their interactions with interfaces, extending them with inheritance, and so on). But they don’t say much about how you should apply these objects to a problem.

              1. 7

                the coup de gras!

                This translates to the blow of fat, which could be a valid pun in French. I think that you meant coup de grâce. :)

                1. 4

                  coup de grâce

                  Oof! Thank you, I fixed it. I’ll never turn down a chance to put hats on my vowels!

                  1. 4

                    It turns out actually what I meant is “pièce de résistance”. Apparently there’s something wrong with my brain’s hash table when resolving hash collisions in the “french phrase” bucket.

                1. 1

                  It would’ve been nice to see more of the do notation examples in whatever the other style is called. Bind style?

                  1. 2

                    There’s not too much of a difference, because in the bookMeeting examples the operations don’t depend on values from the previous operations. You just do >> at the end of every operation, which means “>>=, but ignore the result”

                    So instead of

                    bookMeeting participants room details = do
                      addMeeting details (calendarOf room)
                      for_ participants ((addMeeting details) . calendarOf)
                      for_ participants ((sendEmail details) . emailOf)

                    it is

                    bookMeeting participants room details =
                      (addMeeting details (calendarOf room)) >>
                      (for_ participants ((addMeeting details) . calendarOf)) >>
                      (for_ participants ((sendEmail details) . emailOf))
                    1. 1

                      whatever the other style is called. Bind style?

                      Maybe “point-free style”?

                      1. 1

                        I just want nine day weeks

                        1. 3

                          I’m afraid to ask why

                          1. 1

                            6 day week, 3 day weekend – it’d work well for a system where not everybody took their weekend at exactly the same time.

                            Also it would lend a completely new meaning to the hit Beatles song “eight days a week”

                          2. 2

                            Since 365 x 4 + 1 = 1461 = 3 x 487 I always thought that 3 day weeks with 487 week years made the most sense.

                          1. 12

                            This article seems to assume that “process” is nothing but an abbreviation for “by-the-book Scrum.”

                            You can have an engineering process that doesn’t involve story points or constant prioritizing of a backlog. Does your team require code review before changes are merged? You have a process. Have a QA team that sanity-checks your changes before they’re released to customers? You have a process. Are you expected to discuss the design of a significant change with your teammates before diving into implementation? You have a process.

                            1. 2

                              I think review/qa/code design are subject to similar arguments. You can force your reviewers and testers to review according to a predetermined rubric, or you can allow them to review/test in a freeform fashion. You can try and formalize the design review process or make it informal.

                              Lightweight processes are good because they prompt people to ask “is this change a good idea” at many stages. Process for the sake of process is bad because it displaces “is this change a good idea?” with “did I follow the correct process with respect to this change?”

                              1. 5

                                I think having some kind of checklist (or “process”) is not necessarily a bad idea, as it ensures that things actually get done. Humans are rather prone to forgetting things and the like. If you look at avionics then there are a huge number of checklists for everything, ranging from standard procedures to emergencies. This is good, because it’s just too easy to miss something, with potentially disastrous consequences; quite a few crashes could have been prevented if the pilots had followed the checklist.

                                Writing software is not avionics, but it’s still interesting to look at it, as it does highlight the value of checklists/processes.

                                I think having some sort of review process is a good idea. My own rather anti-authoritarian nature hates processes, but I’m also not blind to the limited capabilities of humans, and having a vague “is this a good idea?”-kind of review makes it much easier to make “oops, didn’t think of that!” mistakes. There is some overhead, yes, but it also comes with some advantages.

                                I don’t disagree that following process for the sake of it is not a good idea, or that disallowing any and all deviation from it is bad, but for some things at least, there are some advantages.

                                A similar example is the issue templates that many projects have; for some issues, such a template just doesn’t make much sense (“Description”: “Copy/paste doesn’t work”; “What happened?”: “It didn’t work”; “What did you expect instead?”: “That it works”), and deviating from that should be okay in those cases. But it’s still a good idea to have the template/process in place, as it’s a good default that works well for many issues, and it prevents things like people just posting “X doesn’t work”, or forgetting to post stuff like the full error message or version (well, most of the time anyway; some people seem unteachable in this regard).

                                Treating process as a law is a bad idea because you keep running into edge cases where it doesn’t 100% fit, but if you treat it as a default template that works well for most cases then usually it works quite well.

                                I think the biggest problems with processes come when people (usually managers or lead devs) treat them as law because they don’t trust developers, but the issue there isn’t really the process but the lack of trust (which, sometimes with some devs, is not entirely undeserved); the rigid process is just a symptom of that lack of trust.

                            1. 1

                              What functional programmers do

                              1. 1

                                I think Make is a poor choice for a simple wrapper script, and a disastrous choice for a sophisticated one. Your make tasks are shell code. If you don’t need the dependency invalidation, it is better to write a shell script so you only have to deal with one layer of arcane syntax, instead of two. You can e.g. run a linter on your Bash inside a bash script, but you don’t get any static analysis of the little snippets of bash you embed in your Makefile. Bash will let you break your code up into functions. The closest thing Make gives you to that is macros – which are an absolute nightmare. If you want to experiment, you can copy and paste directly from your script into the bash prompt. With Make, there is no REPL.

                                If you do need the dependency tracking, Make is also a bad tool. It has two modes: fail silently without any explanation of what tasks it performed and why – or extremely verbose mode where it will emit millions of lines and it’s up to you to weed out the interesting stuff. Again, you can’t write functions so there is no capacity for abstraction or code re-use. Also, using timestamps instead of checksums for invalidation will break spectacularly if you are trying to have any sort of build cache over a network.

                                Make is fine until it gets bigger than say 200 lines or until you have to reach for the manual. After that, I would rewrite in Shake, Bazel, or… anything else.
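
                                To illustrate the timestamp-versus-checksum point, here is a minimal sketch of content-based invalidation. It is not how Shake or Bazel actually implement it; `needsRebuild`, `recordBuild`, and the in-memory `Map` cache are hypothetical names, and inputs are passed as strings rather than read from disk to keep it self-contained.

                                ```javascript
                                const { createHash } = require('crypto')

                                // Hash the content of an input (a string or Buffer).
                                const checksum = (content) => createHash('sha256').update(content).digest('hex')

                                // Hypothetical cache: maps target name -> checksum of the input it was last built from.
                                // A target is stale when its input content changed, regardless of any timestamp.
                                function needsRebuild(cache, target, inputContent) {
                                  return cache.get(target) !== checksum(inputContent)
                                }

                                // Remember which input content the target was built from.
                                function recordBuild(cache, target, inputContent) {
                                  cache.set(target, checksum(inputContent))
                                }
                                ```

                                Because the decision depends only on content, a copy of the same input fetched from a network cache with a fresh mtime does not trigger a rebuild.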

                                1. 1

                                  Your points are valid. In my experience, having a dependency tree and a tool that is already available on all systems is worth the trade-offs. Absolutely, there are poor Makefiles, and hugely complex Makefiles, but I think a Makefile as a wrapper for the common case of node / ruby / elixir projects, where you are just calling out to docker and/or the build tool, maybe with some ad-hoc commands, has the right amount of simplicity to be productive.

                                  1. 1

                                    Yeah there’s definitely a sweet spot. I just broke my rule and introduced a 50-line Makefile for a 48-hour hackathon project that relied on a dependent series of JSON files of data scraped from Wikipedia.

                                    (I would have used shake but 48 hour project is not the time to thrust Haskell upon 2 teammates)

                                    But I think the overlap between “has dependency graph, so Bash won’t do” and “not complicated/important enough that it’s worth sacrificing ubiquity and familiarity for a more disciplined tool” is a relatively small space.

                                    1. 1

                                      And I mean, I wasted probably 20 min total dealing with stupid mistakes putting together/working with even this 50-liner Makefile that uses nothing advanced – just because Make has no linter, and doesn’t allow the introduction of intelligent programming practices like functions and variable

                                1. 3

                                  If your middlewares represent simple, independent operations, I still think middleware is a poor way of expressing these operations, but it is mostly benign. The trouble begins when the operations become complex and interdependent.

                                  I would disagree, and I think what you are struggling with is language/framework choices not the middleware concept. Just to expand on your example with 100% admin request; over time your main dispatch middleware will be so complicated nobody will understand what is going on. I’ve seen complicated projects with these “grand central dispatchers” bloating up to the point of unmanageable code. I would still keep an auth middleware that reads the user on top and adds it as a “context” to request (however express.js might do it), and let the chain continue. That one dead straight list of middleware will save you nightmares of testing and keep your code simple. Your central dispatch middleware is a bad smell to me.

                                  1. 1

                                    I’ve seen non-trivial middleware stacks in Ruby, PHP, and Node.js and would apply my analysis to all of them. I don’t think it’s language specific.

                                    “Context” is a better name than “request” for an arbitrary grab bag of properties, but req.context.isAdmin = … or req.context.account = ... isn’t really any morally different than req.isAdmin = ... or req.account = ....

                                    If your “grand central dispatcher” is complicated, that means that the operations you are performing on your requests are complicated. Breaking your dispatcher up into middleware won’t make the complexity go away – it will just make the complexity implicit in the assumptions that each middleware makes on the structure and meaning of the universal “context” object, rather than having it be expressed explicitly through the parameters and return types of functions, and the control flow of your dispatcher.

                                    But I don’t necessarily advocate one big “grand central dispatcher”. You can break it up. But if you break it up, I just advocate against decomposing it into multiple middleware. Instead, decompose it into functions with meaningful return values, whose parameters reflect their actual dependencies, where the control flow and interdependencies between these functions are explicit, instead of into crippled “middleware” functions that are not allowed to have a meaningful return value and can only communicate via implicit interactions inside an ill-typed, arbitrary grab bag of properties.
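
                                    As a rough sketch of what I mean (all names here are hypothetical, and the user store is a plain object standing in for a real lookup):

                                    ```javascript
                                    // Hypothetical helpers: each takes only what it needs and returns a plain value.
                                    const parseToken = (headers) => headers.authorization || null
                                    const lookupUser = (users, token) => (token ? users[token] || null : null)
                                    const canAccessAdmin = (user) => Boolean(user && user.isAdmin)

                                    // The dispatcher spells out the control flow; nothing is smuggled through `req`.
                                    function handleAdminRequest(headers, users) {
                                      const user = lookupUser(users, parseToken(headers))
                                      if (!user) return { status: 401, body: 'unauthenticated' }
                                      if (!canAccessAdmin(user)) return { status: 403, body: 'forbidden' }
                                      return { status: 200, body: `hello ${user.name}` }
                                    }
                                    ```

                                    Each helper can be called in a test with a literal argument, and the order in which they run is visible in one place.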

                                    Such functions, I would argue, are easier to test than middlewares. In order to test a middleware, you must artificially construct an http request and response, when likely the operation your middleware performs only cares about some parts of the request, and affects some parts of the request (or response).

                                    In order to test e.g.

                                    const rateLimitingMiddleware = async (req, res) => {
                                      const ip = req.headers['ip']
                                      if (await db.nRequestsSince(Date.now() - 60000, ip) > 100) {
                                        return res.send(423)
                                      }
                                    }

                                    You have to

                                    const req = {headers: {ip: ''}}
                                    db.incrementNRequests = sinon.stub()
                                    db.nRequestsSince = sinon.stub().returns(101)
                                    const res = { send : sinon.stub() }
                                    await rateLimitingMiddleware(req, res)
                                    sinon.assert.calledWith(res.send, 423)

                                    whereas for

                                    const shouldRateLimit = async (ip) => {
                                      return await db.nRequestsSince(Date.now() - 60000, ip) > 100
                                    }

                                    the test has one less mock, at least, and doesn’t require you to construct these nested request and response data structures:

                                    db.incrementNRequests = sinon.stub()
                                    db.nRequestsSince = sinon.stub().returns(101)
                                    const result = await shouldRateLimit('')
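
                                    For a version that runs without sinon, here is a sketch with a hand-rolled fake. Passing `db` in as a parameter is my assumption (the snippets above use a global `db`), the IP is a documentation address, and the predicate mirrors the middleware’s `> 100` check.

                                    ```javascript
                                    // Rate-limit predicate with `db` injected (an assumption) so no global needs patching.
                                    const shouldRateLimit = async (db, ip) => {
                                      return (await db.nRequestsSince(Date.now() - 60000, ip)) > 100
                                    }

                                    // Hand-rolled fake standing in for sinon: always reports a fixed request count.
                                    const fakeDb = (count) => ({ nRequestsSince: async () => count })

                                    shouldRateLimit(fakeDb(101), '203.0.113.7').then((limited) => console.log(limited)) // true
                                    shouldRateLimit(fakeDb(100), '203.0.113.7').then((limited) => console.log(limited)) // false
                                    ```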
                                  1. 4

                                    I don’t think the world of classical music is necessarily all that happy-go-lucky: e.g. the existence/importance of music “critics” https://www.gramophone.co.uk/reviews

                                    And let me tell you, with any sort of group artistic performance “I don’t want to argue about artistic vision, I just want to express myself” might be a fine attitude to have for like, low-key community performances and such, but I highly doubt it’s going to put you on good terms with your colleagues in a professional troupe.

                                    1. 23

                                      Oh dang another essay on empirical software engineering! I wonder if they read the same sources I did

                                      Reads blog

                                      You watched the conference talk “What We Know We Don’t Know”, by Hillel Wayne, who, also disturbed by software’s apparent lack of scientific foundation, found and read as many scholarly papers as he could find. His conclusions are grim.

                                      I think I’m now officially internet famous. I feel like I crossed a threshold or something :D

                                      So I’m not sure how much of this is frustration with ESE in general or with me in particular, but a lot of quotes are about my talk, and so I’m not sure if I should be defending myself? I’m gonna err on the side of defending myself, mostly because it’s an excuse to excitedly talk about why I’m so fascinated by empirical engineering.

                                      One thing I want to open with. I’ve mentioned a couple of times on Lobsters that I’m working on a long term journalism project. I’m interviewing people who worked as “traditional” engineers, then switched to software, and what they see as the similarities and differences. I’ve learned a lot from this project, but one thing in particular stands out: we are not special. Almost everything we think is unique about software, from the rapid iteration to clients changing the requirements after we’ve released, happens all the time in other fields.

                                      So, if we can’t empirically study software engineering, it would follow that we can’t empirically study any kind of engineering. If “you can’t study it” only applied to software, that would make software Special. And everything else people say about how software is Special turns out to be wrong, so I think it’s the case here.

                                      I haven’t interviewed people outside of engineering, but I believe it goes even further: engineering isn’t special. If we can’t study engineers, then we can’t study lawyers or nurses or teachers or librarians. Human endeavor is incredibly complex, and every argument we can make about why studying software is impossible extends to any other job. I fundamentally reject that. I think we can usefully study people, and so we can usefully study software engineers.

                                      Okay so now for individual points. There’s some jank here, because I didn’t edit this a whole lot and didn’t polish it at all.

                                      You were disappointed with Accelerate: The Science of Lean Software and DevOps. You agreed with most of its prescriptions. It made liberal use of descriptive statistics.

                                      Accelerate’s research is exclusively done by surveying people. This doesn’t mean it’s not empirical- as I say in the talk, qualitative information is really helpful. And one of my favorite examples of qualitative research, the Gamasutra Study on Crunch Mode, uses a similar method. But it’s far from being settled, and it bothers me that people use Accelerate as “scientifically proven!!!”

                                      1. Controlled experiments are typically nothing like professional programming environments […] So far as I know, no researcher has ever gathered treatment and control groups of ten five-developer teams each, put them to work M-F, 9-5, for even a single month, in order to realistically simulate the conditions of a stable, familiar team and codebase.

                                      You’d be surprised. “Two Comparisons of Programming Languages”, in Making Software, does this with nine teams (but only for one day). Some labs specialize in this, like SIMULA lab. Companies do internal investigations on this- Microsoft and IBM especially have a lot of great work in this style.

                                      But regardless of that, controlled experiments aren’t supposed to be holistic. They test what we can, in a small context, to get solid data on a specific thing. Like VM Warmup Blows Hot and Cold: in a controlled environment, how consistent are VM benchmarks? Turns out, not very! This goes against all of our logic and intuition, and shows the power of controlled studies. Ultimately, though, controlled studies are a relatively small portion of the field, just as they’re a small portion of most social sciences.

                                      For that matter, using students is great for studies on how students learn. There’s a ton of amazing research on what makes CS concepts easier to learn, and you have to use students for that.

                                      1. The unpredictable dynamics of human decision-making obscure the effects of software practices in field data. […] This doesn’t hold for field data, because real-life software teams don’t adopt software practices in a random manner, independent from all other factors that might potentially affect outcomes.

                                      This is true for every form of human undertaking, not just software. Can we study teachers? Can we study doctors and nurses? Their world is just as chaotic and dependent as ours is. Yet we have tons of research on how educators and healthcare professionals do their jobs, because we collectively agree that it’s important to understand those jobs better.

                                      One technique we can use is cross-correlating among many different studies on many different groups. Take the question “does Continuous Delivery help”. Okay, we see that companies that practice it have better outcomes, for whatever definition of “outcomes” we’re using. Is that correlation or causation? Next we can look at “interventions” where a company moved to CD and see how it changed their outcomes. We can see what practices all of the companies share and what things they have different, to see what cluster of other explanations we have. We can examine companies where some teams use CD and some teams do not, and correlate their performance. We can look at what happens when people move between the different teams. We can look at companies that moved away from CD.

                                      We’re not basing our worldview off a single study. We’re doing many of them, in many different contexts, to get different facets of what the answer might actually be. This isn’t easy! But it’s worth doing.

                                      1. The outcomes that can be measured aren’t always the outcomes that matter. […] So in order to effectively inform practice, research needs to ask a slightly different, more sophisticated question – not e.g. “what is the effect software practice X has on ‘defect rate’”, but “what is the effect software practice X has on ‘defect rate per unit effort’”. While it might be feasible to ask this question in the controlled experiment setting, it is difficult or impossible to ask of field data.

                                      Pretty much all studies take this as a given. When we study things like “defect rate”, we’re always studying it in the context of unit time or unit cost. Otherwise we’d obviously just use formal verification for everything. And it’s totally feasible to ask this of field data. In some cases, companies are willing to instrument themselves- see TSP or the NASA data sets. In other cases, the data is computable- see research on defect rates due to organizational structure and code churn. Finally, we can cross-correlate between different projects, as is often done with repo mining.

                                      These are hard problems, certainly. But lots of things are “hard problems”. It’s literally scientists’ jobs to figure out how to solve these problems. Just because we, as layfolk, can’t figure out how to solve these problems doesn’t mean they’re impossible to solve.

                                      1. Software practices and the conditions which modify them are varied, which limits the generality and authority of any tested hypothesis

                                      This is why we do a lot of different studies and test a lot of different hypotheses. Again, this is an accepted fact in empirical research. We know it’s hard. We do it anyway.

                                      But if you’re holding your breath for the day when empirical science will produce a comprehensive framework for software development – like it does for, say, medicine – you will die of hypoxia.

                                      A better analogue is healthcare, the actual system of how we run hospitals and such. That’s in the same boat as software development: there’s a lot we don’t know, but we’re trying to learn more. The difference is that most people believe studying healthcare is important, but that studying software is not.

                                      Is this cause for despair? If science-based software development is off the table, what remains? Is it really true as Hillel suggests, that in the absence of science “we just don’t know” anything, and we are doomed to an era of “charisma-driven development” where the loudest opinion wins, and where superstition, ideology, and dogmatism reign supreme?

                                      The lack of empirical evidence for most things doesn’t mean we’re “doomed to charisma-driven development.” Rather it’s the opposite: I find the lack of evidence immensely freeing. When someone says “you are unprofessional if you don’t use TDD” or “Dynamic types are immoral”, I know, with scientific certainty, that they don’t actually know. They just believe it. And maybe it’s true! But if they want to be honest with themselves, they have to accept that doubt. Nobody has the secret knowledge. Nobody actually knows, and we all gotta be humble and honest about how little we know.

                                      Of course not. Scientific knowledge is not the only kind of knowledge, and scientific arguments are not the only type of arguments. Disciplines like history and philosophy, for instance, seem to do rather well, despite seldom subjecting their hypotheses to statistical tests.

                                      Of course science isn’t the only kind of knowledge! I just gave a talk at Deconstruct on the importance of studying software history. My favorite software book is Data and Reality, which is a philosophical investigation into the nature of information representation. My claim is that science is a very powerful form of knowledge that we as software folk not only neglect, but take pride in our neglecting. It’s like, yes, we don’t just have science, we have history and philosophy. But why not use all three?

                                      Your decision to accept or reject the argument might be mistaken – you might overlook some major inconsistency, or your judgement might be skewed by your own personal biases, or you might be fooled by some clever rhetorical trick. But all in all, your judgement will be based in part on the objective merit of the argument

                                      Of course we can do that. Most of our knowledge will be accumulated this way, and that’s fine. But I think it’s a mistake to be satisfied with that. For any argument in software, I can find two experts, giants in their fields, who have rigorous arguments and beautiful narratives… that contradict each other. Science is about admitting that we are going to make mistakes, that we’re going to naturally believe things that aren’t true, no matter how mentally rigorous we try to be. That’s what makes it so important and so valuable. It gives us a way to say “well you believe X and I believe not X, so which is it?”

                                      Science – or at least a mysticized version of it – can be a threat to this sort of inquiry. Lazy thinkers and ideologues don’t use science merely as a tool for critical thinking and reasoned argument, but as a substitute. Science appears to offer easy answers. Code review works. Continuous delivery works. TDD probably doesn’t. Why bother sifting through your experiences and piecing together your own narrative about these matters, when you can just read studies – outsource the reasoning to the researchers? […] We can simply dismiss them as “anti-science” and compare them to anti-vaxxers. […] I witnessed it play out among industry leaders in my Twitter feed, the day after I started drafting this post.

                                      I think I know what you’re referencing here, and if it’s what I think it is, yeah that got ugly fast.

                                      Regardless of how Thought Leaders use science, my experience has been the opposite of this. Being empirical is the opposite of easy. If I wanted to not think, I’d say “LOGICALLY I’m right” or something. But I’m an idiot and want to be empirical, which means reading dozens of papers that are all maddeningly contradictory. It means going through papers agonizingly carefully because the entire thing might be invalidated by an offhand remark.[1] It means reading paper’s references, and the references’ references, and trawling for followup papers, and reading the followup paper’s other references. It means spending hours hunting down preprints and emailing authors because most of the good stuff is locked away by the academic paper hoarders.

                                      Being empirical means being painfully aware of the cognitive dissonance in your head. I love TDD. I recommend it to beginners all the time. I think it makes me a better programmer. At the same time, I know the evidence for it is… iffy. I have to accept that something I believe is mostly unfounded, and yet I still believe in it. That’s not the easy way out, that’s for sure!

                                      And even when the evidence is in your favor, the final claim is infuriatingly nuanced. Take code review! “Code Review works”. By works, I mean “in most controlled studies and field studies, code review finds a large portion of the extant bugs in reviewed code in a reasonable timeframe. But most of the comments in code review are not bug-finding, but code quality things, about 3 code improvements per 1 bug usually. Certain things make CR better, and certain things make it a lot worse, and developers often complain that most of the code review comments are nitpicks. Often CRs are assigned to people who don’t actually know that area of the codebase well, which is a waste of time for everyone. There’s a limit to how much people can CR at a time, meaning it can easily become a bottleneck if you opt for 100% review coverage.”

                                      That’s a way more nuanced claim than just “code review works!” And it’s way, way more nuanced than about 99% of the Code Review takes I see online that don’t talk about the evidence. Empiricism means being more diligent and putting in more work to understand, not less.

                                      So one last thought to close this out. Studying software is hard. People bring up how expensive it is. And it is expensive, just as it’s expensive to study people in general. But here’s the thing. We are one of the richest industries in the history of the world. Apple’s revenue last year was a quarter trillion dollars. That’s not something we should leave to folklore and feelings. We’re worth studying.

                                      [1]: I recently read one paper that looked solid and had some really good results… and one sentence in the methodology was “oh yeah and we didn’t bother normalizing it”

                                      1. 3

                                        What a fantastic response.

                                        When doctors get involved in fields such as medical education or quality improvement and patient safety, they often have a similar reaction to Richard’s. The problem is in thinking that the only valid way to understand a complex system is to study each of its parts in isolation, and if you can’t isolate them, then you should just give up.

                                        As Hillel illustrated nicely here, you can in fact draw valid conclusions from studying “complex systems in the wild”. While this is a “messier” problem, it is much more interesting. It requires a lot of creativity but also more rigor in justifying and selecting the methodology, conducting the study, and interpreting the results. It is very easy to do a subpar study in those fields, which feeds the perception that these fields are “unscientific”.

                                        A paper titled Research in the Hard Sciences, and in Very Hard “Softer” Domains by Phillips, D. C. discusses this issue. Unfortunately, it’s behind a paywall.

                                        1. 3

                                          Hi Hillel! I’m glad you found this, and thank you for taking the time to respond.

                                          I’m not sure you necessarily need to mount a defense, either. I didn’t consciously intend to set your talk up as the antagonist in my post, but I realize this is sort of what I did. The attitude I’m trying to refute (that empirical science is the only source of objective knowledge about software) is somewhat more extreme than the position you advocate. And the attitude you object to (that software “can’t be studied” empirically, and nothing can be learned this way) is certainly more extreme than the position I hoped to express. I think in the grand scheme of things we largely share the same values, and our difference of opinion is rather esoteric and mostly superficial. That doesn’t mean it’s not interesting to debate, though.

                                          Re: Omitted variable bias

                                          You seemed to suggest that research could account for omitted variable bias by “cross-correlating” studies

                                          • across different companies
                                          • within the same company before and after adopting/disadopting the practice
                                          • across different teams within the same company.

                                          I submit to you this is not the case. Continuing with the CD example, suppose CD doesn’t improve outcomes but the “trendiness” that leads to it does. It is completely plausible for

                                          • trendy companies to be more likely to adopt CD than non-trendy companies
                                          • trendy teams within a company to be more likely to adopt CD than non-trendy teams
                                          • a company that is becoming more trendy to be more likely to adopt CD, and to be trendier after the adoption than before
                                          • a company that is becoming less trendy to be more likely to disadopt CD, and to be trendier before the disadoption than after

                                          If these hold, then all of the studies in the “cross-correlation” you describe will still misattribute an effect to CD.

                                          You can’t escape omitted variable bias just by collecting more data from more types of studies. In order to legitimately address it, you need to do one of:

                                          • Find some sort of data that captures “trendiness” and include it as a statistical control.
                                          • Find an instrumental variable
                                          • Find data on teams within a company that were randomly assigned to CD (so that trendiness no longer correlates with the decision to adopt).

                                          If you don’t address a plausible omitted variable bias in one of these ways, then basically you have no guarantee that the effect (or lack of effect) you measured was actually the effect of the practice and not the effect of whatever social conditions or ideology led to the adoption of your practice (or something else that those social conditions caused). This is a huge threat to validity, especially to “code mining” studies whose only dataset is a git log and therefore have no possible hope of capturing or controlling the social or human drivers behind the practice. To be totally honest, I assign basically zero credibility to the empirical argument of any “code mining” study for this reason.
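
                                          As a rough illustration of the argument above, here is a minimal simulation sketch in Python (standard library only). Everything in it is made up for illustration: the function names, the 80%/20% adoption rates, and the size of the “trendiness” effect are hypothetical and not taken from any study. Outcomes depend only on trendiness and never on CD, yet the naive comparison still reports a CD “effect”. Controlling for trendiness, or randomly assigning CD, makes it go away.

```python
import random
import statistics


def simulate(n_teams=20000, randomized=False, seed=0):
    """Generate hypothetical teams as (trendy, uses_cd, outcome) tuples.

    By construction, the outcome depends ONLY on trendiness, never on CD.
    """
    rng = random.Random(seed)
    teams = []
    for _ in range(n_teams):
        trendy = rng.random() < 0.5
        if randomized:
            # Random assignment: adoption is independent of trendiness.
            uses_cd = rng.random() < 0.5
        else:
            # Observational data: trendy teams adopt CD far more often
            # (assumed rates of 80% vs. 20%).
            uses_cd = rng.random() < (0.8 if trendy else 0.2)
        outcome = (1.0 if trendy else 0.0) + rng.gauss(0, 1)
        teams.append((trendy, uses_cd, outcome))
    return teams


def naive_effect(teams):
    """Mean outcome of CD teams minus mean outcome of non-CD teams."""
    with_cd = [o for _, cd, o in teams if cd]
    without_cd = [o for _, cd, o in teams if not cd]
    return statistics.mean(with_cd) - statistics.mean(without_cd)


def stratified_effect(teams):
    """Compare CD vs. non-CD within each trendiness level, then average.

    This is a crude statistical control. The two strata are weighted
    equally, which is reasonable here because they are about the same size.
    """
    diffs = []
    for level in (True, False):
        stratum = [t for t in teams if t[0] == level]
        diffs.append(naive_effect(stratum))
    return statistics.mean(diffs)


if __name__ == "__main__":
    observational = simulate(randomized=False)
    trial = simulate(randomized=True)
    print("naive, observational:     ", round(naive_effect(observational), 3))
    print("stratified, observational:", round(stratified_effect(observational), 3))
    print("naive, randomized:        ", round(naive_effect(trial), 3))
```

                                          Under these assumptions, the naive observational estimate should land well above zero, while the stratified and randomized estimates should sit close to zero. The stratified fix only works because trendiness was recorded at all, which a git log alone would not give you.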

                                          Re: The analogy to medicine

                                          As @notriddle seemed to be hinting at, professions comprehensively guided by science are the exception, not the rule. Science-based lawyering seems… unlikely. Science-based education is not widely practiced, and is controversial in any case. Medicine seems to be the major exception. It’s worth exploring the analogy/disanalogy between software and medicine in greater detail. Is software somehow inherently more difficult to study than medicine?

                                          Maybe not. You brought up two good points about avenues of software research.

                                          Companies do internal investigations on this- Microsoft and IBM especially have a lot of great work in this style.


                                          In some cases, companies are willing to instrument themselves- see TSP or the NASA data sets.

                                          I think analysis of this form is miles more persuasive than computer lab studies or code mining. If a company randomly selects certain teams to adopt a certain practice and certain teams not to, this solves the realism problem because they are, in fact, real software teams. And it solves the omitted variable bias problem because the practice was guaranteed to have been adopted randomly. I think much of the reason medicine has been able to incorporate empirical studies so successfully is because hospitals are so heavily “instrumented” (as you put it) and willing to conduct “clinical trials” where the treatment is randomly assigned. I’m quite willing to admit that we could learn a lot from empirical research if software shops were willing to instrument themselves as heavily as hospitals, and begin randomly designating teams to adopt practices they want to study. I think it’s quite reasonable to advocate for a movement in that direction.

                                          But whether or not we should advocate for better data/more research is orthogonal to the main concern of my post: in the meantime, while we are clamoring for better data, how ought we evaluate software practices? Do we surrender to nihilism because the data doesn’t (yet) paint a complete picture? Do we make wild extrapolations from the faint picture the data does paint? Or should we explore and improve the body of “philosophical” ideas about programming, developed by programmers through storytelling and reflection on experience?

                                          It is very important to do that last thing. I wrote my post because, for a time, my own preoccupation with the idea that only scientific inquiry had an admissible claim to objective truth prevented me from enjoying and taking e.g. “A Philosophy of Software Design” seriously (because it was not empirical), and realizing what a mistake this was was somewhat of a personal revelation.

                                          Re: Epistemology

                                          Science is about admitting that we are going to make mistakes, that we’re going to naturally believe things that aren’t true, no matter how mentally rigorous we try to be. That’s what makes it so important and so valuable. It gives us a way to say “well you believe X and I believe not X, so which is it?”

                                          Science won’t rescue you from the fact that you’re going to believe things that aren’t true, no matter how mentally rigorous you try to be. Science is part of the attempt to be mentally rigorous. If you aren’t mentally rigorous and you do science, your statistical model will probably be wrong, and omitted variable bias will lead you to conclude something that isn’t true.

                                          Science, to me, is merely a toolbox for generating persuasive empirical arguments based on data. It can help settle the debate between “X” and “not X” if there are persuasive scientific arguments to be found for X, and there are not persuasive scientific arguments to be found for “not X” – but just as frequently, there turn out to be persuasive scientific arguments for both “X” and “not X” that cannot be resolved empirically and must be resolved theoretically/philosophically. (Or – as I think describes the state of software research so far – there turn out to be persuasive scientific arguments for neither “X” nor “not X”, and again, the difference must be resolved theoretically/philosophically).

                                          [Being empirical]… means reading dozens of papers that are all maddeningly contradictory. It means going through papers agonizingly carefully because the entire thing might be invalidated by an offhand remark.[1] It means reading paper’s references, and the references’ references, and trawling for followup papers, and reading the followup paper’s other references.

                                          That’s a way more nuanced claim than just “code review works!” And it’s way, way more nuanced than about 99% of the Code Review takes I see online that don’t talk about the evidence. Empiricism means being more diligent and putting in more work to understand, not less.

                                          I value this sort of disciplined thinking – but I think it’s a mistake to brand this as “science” or “being empirical”. After all, historians and philosophers also agonize through papers, crawling the reference tree, and develop highly nuanced, qualified claims. There’s nothing unique to science about this.

                                          I think we should call for something broader than merely disciplined empirical thinking. We want disciplined empirical and philosophical/anecdotal thinking.

                                          My ideal is that software developers accept or reject ideas based on the strength or weakness of the argument behind them, rather than whims, popularity of the idea, or the perceived authority or “charisma” of their advocates. For empirical arguments, this means doing what you described – reading a bunch of studies, paying attention to the methodology and the data description, following the reference trail when warranted. For philosophical/anecdotal arguments, this means doing what I described – mentally searching for inconsistencies, evaluating the argument against your own experiences and other evidence you are aware of.

                                          Occasionally, this means the strength of a scientific argument must be weighed against a philosophical/anecdotal argument. The essence of my thesis is that, sometimes, a thoughtful, well-explained story by a practitioner can be a stronger argument than an empirical study (or more than one) with limited data and generality. “X worked for us at Dropbox and here is my analysis of why” can be more persuasive to a practitioner than “X didn’t appear to work for undergrad projects at 12 institutions, and there is not a correlation between X and good outcome Y in a sampling of Github Repos”.

                                          1. 2

                                            Hi, thanks for responding! I think we’re mostly on the same page, too, and have the same values. We’re mostly debating degrees and methods here. I also agree that the issues you raise make things much more difficult. My stance is just that while they do make things more difficult, they don’t make it impossible, nor do they make it not worth doing.

                                            Ultimately, while scientific research is really important, it’s only one means of getting knowledge about something. I personally believe it’s an incredibly strong form- if philosophy makes one objective claim and science makes another, then we should be inclined to look for flaws in the philosophy before looking for flaws in the science. But more than anything else, I want defence in depth. I want people to learn the science, and the history, and the philosophy, and the anthropology, and the economics, and the sociology, and the ethics. It seems to me that most engineers either ignore them all, or care about only one or two of these.

                                            (Anthro/econ/soc are also sciences, but I’m leaving them separate because they usually make different claims and use different ((scientific!)) methods than what we think of as “scientific research” on software.)

                                            One thing neither of us have brought up, that is also important here: we should know the failure modes of all our knowledge. The failure modes of science are really well known: we covered them in the article and our two responses. If we want to more heavily lean on history/philosophy/anthropology, we need to know the problems with using those, too. And I honestly don’t know them as well as I do the problems with scientific knowledge, which is one reason I don’t push it as hard- I can’t tell as easily when I should be suspicious.

                                          2. 3

                                            Can we study teachers? Can we study doctors and nurses?

                                            The answer to that question might be “no”.

                                            When you’re replying to an article that’s titled “The False Promise of Science”, with a bunch of arguments against empirical software engineering that seem applicable to other fields as well, and your whole argument is basically an analogy, you should probably consider the possibility that Science is Just Wrong and we should all go back to praying to the sun.

                                            The education field is at least as fad- and ideology-driven as software, and the medical field has cultural problems and studies that don’t reproduce. Many of the arguments given in this essay are clearly applicable to education and medicine (though not all of them obviously are, I can easily come up with new arguments for both fields). The fundamental problem with applying science to any field of endeavor is that it’s anti-situational at the core. The whole point of The Scientific Method is to average over all but a few variables, but people operating in the real world aren’t working with averages, they’re working with specifics.

                                            The argument that software isn’t special cuts both ways, after all.

                                            I’m not sure if I actually believe that, though.

                                            The annoying part about this is that, as reasonably compelling as it’s possible to make the “science sucks” argument sound, it’s not very conducive to software engineering, where the whole point of the practice is to write generalized algorithms that deal with many slight variants of the same problem, so that humans don’t have to be involved in every little decision. Full-blown primitivism, where you reject Scalable Solutions(R) entirely, has well-established downsides like heightened individual risk; one of the defining characteristics of modernism is risk diffusion, after all.

                                            Adopting hard-and-fast rules is just a trade-off. You make the common case simpler, and you lose out in the special cases. This is true both within the software itself (it’s way easier to write elegant code if you don’t have weird edge cases) and with the practice. The alternative, where you allow for exceptions to the rules, is decried as bad for different reasons.

                                            1. 6

                                              That is absolutely a valid counterargument! In response, I’d like to point out that we have learned a lot about those fields! Just a few examples:

                                              I don’t know very much about classroom teaching or nursing, so I can’t deep-dive into that research as easily as I can software… but there are many widespread and important studies in both fields that give us actionable results. If we can do that with nursing, why not software?

                                              1. 1

                                                To be honest, I think you’re overselling what empirical science tells us in some of these domains, too. Take the flipped classroom one, since it’s an example I’ve seen discussed elsewhere. The state of the literature summarized in that post is closer to: there is some evidence that this might be promising, but confidence is not that high, particularly in how broadly this can be interpreted. Taking that post on its own terms (I have not read the studies it cites independently), it suggests not much more than that overall reported studies are mainly either positive or inconclusive. But it doesn’t say anything about these studies’ generalizability (e.g. whether outcomes are mediated by subject matter, socioeconomic status, country, type of institution, etc.), suggests they’re smallish in number, suggests they’ve not had many replication attempts, and pretty much outright says that many studies are poorly designed and not well controlled. It also mentions that the proxies for “learning” used in the studies are mostly very short-term proxies chosen for convenience, like changes in immediate test scores, rather than the actual goal of longer-term mastery of material.

                                                Of course that’s all understandable. Gold-standard studies like those done in medicine, with (in the ideal case) some mix of preregistration, randomized controlled trials, carefully designed placebos, and longitudinal follow-up across multi-demographic, carefully characterized populations, etc., are logistically massive undertakings, and expensive, so basically not done outside of medicine.

                                                Seems like a pretty thin rod on which to hang strong claims about how we ought to reform education, though. As one input to qualitative decision-making, sure, but one input given only its proper weight, in my opinion significantly less than we’d weight the much better empirical data in medicine.

                                            2. 2

                                              Dammit, man. That was a great response. I don’t think I’ll ever comment anything anywhere just so my comment won’t be compared to this.

                                              1. 1

                                                “We’re not basing our worldview off a single study. We’re doing many of them, in many different contexts, to get different facets of what the answer might actually be.”

                                                That’s exactly what I do for the sub-fields I study. Especially formal proof which I don’t understand at all. Just constantly looking at what specialists did… system type/size, properties, level of automation, labor required… tells me a lot about what’s achievable and allows mix n’ matching ideas for new, high-level designs. That’s without even needing to build anything which takes a lot longer. That specialists find the resulting ideas worthwhile proves the surveys and integration strategy work.

                                                So, I strongly encourage people to do a variety of focused studies followed by integrated studies on them. They’ll learn plenty. We’ll also have more interesting submissions on Lobsters. :)

                                                “When someone says “you are unprofessional if you don’t use TDD” or “Dynamic types are immoral”, I know, with scientific certainty, that they don’t actually know.”

                                                I didn’t think about that angle. Actually, you got me thinking maybe we can all start telling that to new programmers. They get warned the field is full of hype, trends, etc that usually don’t pan out over time. We tell them there’s little data to back most practices. Then, experienced people cutting them down or getting them onto a new trend might have less effect. Esp on their self-confidence. Just thinking aloud here rather than committed to the idea.

                                                “Science is about admitting that we are going to make mistakes”

                                                I used to believe science was about finding the truth. Now I’d go further than you. Science assumes we’re wrong by default, will screw up constantly, and are too biased or dishonest to review the work alone. The scientific method basically filters bad ideas to let us arrive at beliefs that are justifiable and still might be wrong. Failure is both normal and necessary if that’s the setup.

                                                The cognitive dissonance makes it really hard, like you said. I find it a bit easier to do development and review separately. One can be in go mode iterating stuff. At another time, in skeptical mode critiquing the stuff. The go mode also gives a mental break and/or refreshes the mind, too.

                                                1. 1

                                                  You’d be surprised. “Two comparisons of programming languages”, in “Making Software”, does this with nine teams (but only for one day).

                                                  My reading (which is congruent with my experiences) indicates a newly-put-together team takes 3-6 months before productivity stabilizes. Some schools of management view this as ‘stability=groupthink, shuffle the teams every 6 months’ and some view it as ‘stability=predictability, keep them together’. However, IMO this indicates that you might not be able to infer much from one day of data.

                                                  1. 2

                                                    To clarify, that specific study was about nine existing software teams- they came to the project as a team already. It’s a very narrow study and definitely has limits, but it shows that researchers can do studies on teams of professionals.

                                                  2. 1

                                                    People bring up how expensive it is. And it is expensive, just as it’s expensive to study people in general. But here’s the thing. We are one of the richest industries in the history of the world. Apple’s revenue last year was a quarter trillion dollars. That’s not something we should leave to folklore and feelings. We’re worth studying.

                                                    I don’t think I understand what you’re saying. Software is expensive, and for some companies, very profitable. But would it really be more profitable if it were better studied? And what exactly does that have to do with the kinds of things that the software engineering field likes to study, such as defect rates and feature velocities? I think that in many cases, even relatively uncontroversial practices like code review are just not implemented because the people making business decisions don’t think the prospective benefit is worth the prospective cost. For many products or services, code quality (however operationalized) makes a poor experimental proxy for profitability.

                                                    Inasmuch as software development is a form of industrial production, there’s a huge body of “scientific management” literature that could potentially apply, from Frederick Taylor on forward. And I would argue it generally is being applied too: just in service of profit. Not for some abstract idea of “quality”, let alone the questionable ideal of pure disinterested scientific knowledge.

                                                    1. 1

                                                      Mistakes are becoming increasingly costly (e.g., commercial jets falling from the sky) so understanding the process of software-making with the goal of reducing defects could save a lot of money. If software is going to “eat the world”, then the software industry needs to grow up and become more self-aware.

                                                      1. 1

                                                        Aviation equipment and medical devices are already highly regulated, with quality control processes in place that produce defect rates orders of magnitude less than your average desktop or business software. We already know some things about how to make high-assurance systems. I think the real question is how much of that reasonably applies to the kind of software that’s actually eating the world now: near-disposable IoT devices and gimmicky ad-supported mobile apps, for example.

                                                    2. 1

                                                      My favorite software book is Data and Reality, which is a philosophical investigation into the nature of information representation.

                                                      A beautiful book, one of my favorites as well.

                                                      rest of post….

                                                      While I thought the article articulated something important which I agree with, its conclusion felt a bit lazy and too optimistic for my taste – I’m more persuaded by the POV you’ve articulated above.

                                                      While we’re making analogies, “writing software is like writing prose” seems like a decent one to explore, despite some obvious differences. Specifically relevant is the wide variety of different and successful processes you’ll find among professional writers.

                                                      And I think this explains why you might be completely right that something like TDD is valuable for you, even though empirical studies don’t back up that claim in general. And I don’t mean that in a soggy “everyone has their own method and they’re all equally valid” way. I mean that all of your knowledge, the way you think about programming, your tastes, your knowledge of how to practice TDD in particular, and on and on, are all inputs into the value TDD provides you.

                                                      Which is to say: I find it far more likely that TDD (or similar practices with many knowledgeable, experienced supporters) have highly context sensitive empirical value than none at all. I don’t foresee them being one day unmasked by science as the sacred cows of religious zealots (though they may be that in some specific cases too).

                                                      For something like TDD, the “treatment” group would really need to be something like “people who have all been taught how to do it by the same expert over a long enough time frame and whose knowledge that expert has verified and signed off on.”

                                                      I’m not shilling for TDD, btw – just using it as a convenient example.

                                                      The broader point is that effects can be real but extremely hard to show experimentally.

                                                    1. 7

                                                      “Software developers are domain experts. We know what we’re doing. We have rich internal narratives, and nuanced mental models of what it is we’re about, …”

                                                      For a large proportion of software undertakings surely this is not true. Much of software development is outside the domains of the computing and data sciences, and computing infrastructure. While popular to consider that these are the only endeavors of importance to today’s developers, the modeling of systems in domains other than these into code represents the majority of software running in the world today.

                                                      In these, we don’t know how to model, reliably and predictably, the systems that stakeholders (think they) want, and that external domain experts know. How can one consider applying scientific rigor to that?

                                                      1. 4

                                                        Great point. You are certainly correct that software developers are not always experts in the domains of their products. They are still experts in the domain of their tools and practices though, so they should be considered “domain experts” from the perspective of researchers.

                                                        1. 2

                                                          They are still experts in the domain of their tools and practices though

                                                          I like this degree of optimism, and hope one day to overcome my experience enough to share it!