1. 12

  2. 2

    To construct a list of bugs we could uniformly sample, we took a snapshot of all publicly available JavaScript projects on GitHub, with their closed issue reports. We uniformly selected a closed and linked issue, using the procedure described above and stopped sampling when we reached 400 bugs. The resulting corpus contains bugs from 398 projects, because two projects happened to have two bugs included in the corpus.

    One threat to validity they didn’t address is that they didn’t control for testing practices, so we don’t know whether the bugs came from well-tested projects or not.

    1. 3

      That is irrelevant to the method they used: take actual bugfix commits and see whether the bug would have been caught by static typing. Whether or not the project has tests does not matter; it had the bug, and it needed a commit to fix it. Static analysis is run before one commits.

      1. 2

        It’s irrelevant to their method but not their thesis.

        Evaluating static type systems against public bugs, which have survived testing and review, is conservative: it understates their effectiveness at detecting bugs during private development, not to mention their other benefits such as facilitating code search/completion and serving as documentation. Despite this uneven playing field, our central finding is that both static type systems find an important percentage of public bugs: both Flow 0.30 and TypeScript 2.0 successfully detect 15%!

        They haven’t demonstrated that these bugs survived testing, because they haven’t controlled for the prevalence of unit testing in JavaScript projects. Consider the extreme case where 0% of their 400 bugs came from tested projects. Then a viable objection would be “You don’t know whether or not static typing is that effective, because for all we know, all of those bugs and more would have been caught with the same investment of unit testing!”

        Don’t get me wrong, I was incredibly excited about this study until I noticed this. I want to finally have an answer to this debate, and I think studying gradual typing is the best way to find it. But I’m also a huge curmudgeon about this sort of thing and want to keep the argument absolutely watertight.

        Fortunately, the reviewers were wonderful enough to put their full list online. I clicked on four projects at random: two had some form of unit tests, two did not. I’m thinking of digging in deeper later.

        EDIT: just emailed them to get their cleaned-up data.

        1. 3

          … all of those bugs and more would have been caught with the same investment of unit testing!

          Surely not the same investment? These are all bugs they detected with a typechecker plus at most ten minutes of coding each. It’s fair to say that adding those few judicious annotations and typechecking in the first place would have prevented those bugs in less than half the time each. In fact, it’s hard to believe a single unit test could provide even the same level of guarantee that one static type would provide. So I can’t bring myself to believe that this basic typechecking would have been the same or greater investment.
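          To make that concrete, here’s a hypothetical example of the sort of bug the study describes (the names and interface are made up, not taken from their corpus). One annotation on the parameter guards every call site:

          ```typescript
          // Hypothetical example: a single annotation catches a misspelled
          // property at compile time, at every call site.
          interface RetryConfig {
            timeout: number;
            retries: number;
          }

          function describeRetries(config: RetryConfig): string {
            return `retrying ${config.retries} times`;
          }

          // In plain JavaScript the typo below runs and yields "undefined";
          // with the annotation, the typechecker rejects it before the commit:
          // describeRetries({ timeout: 500, retreis: 3 }); // error: 'retreis' does not exist
          console.log(describeRetries({ timeout: 500, retries: 3 }));
          ```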

          1. 2

            To be clear, I personally find this pretty exciting evidence, but a lot of the research on this topic misses some critical objection that muddles its conclusions. And if there’s one thing physics has taught me, it’s that common sense is a lying bastard ;) That’s why I like research as hardened as possible.

            Re testing, there are a couple of arguments you could make here. One is that unit tests can catch logic problems in addition to type errors, so maybe you could get the type-checking “free” in your unit tests. Another is that while unit tests are pretty high investment per test, they’re also among the least powerful testing tools you have! But what about something like QuickCheck?
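            For anyone unfamiliar, the QuickCheck idea can be sketched in a few lines without any library: run a property against many random inputs instead of hand-writing one case per test. (The generator and property here are my own toy examples.)

            ```typescript
            // Minimal hand-rolled property-based testing sketch: check a
            // property on many random inputs, return a counterexample if found.
            function checkProperty<T>(
              gen: () => T,
              property: (input: T) => boolean,
              runs: number = 100
            ): T | null {
              for (let i = 0; i < runs; i++) {
                const input = gen();
                if (!property(input)) return input; // counterexample
              }
              return null; // property held on every run
            }

            // Toy property: sorting with a numeric comparator is idempotent.
            const randomArray = (): number[] =>
              Array.from({ length: Math.floor(Math.random() * 10) }, () =>
                Math.floor(Math.random() * 100)
              );

            const sortIsIdempotent = (xs: number[]): boolean => {
              const once = [...xs].sort((a, b) => a - b);
              const twice = [...once].sort((a, b) => a - b);
              return once.every((x, i) => x === twice[i]);
            };

            console.log(checkProperty(randomArray, sortIsIdempotent)); // null => property held
            ```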

            That’s why I think it’s worth, at the very least, reviewing the testing infra these projects had. If it turns out that they all had extensive testing, that’s a slam-dunk for typechecking. If none of them did, it’d be worth doing another experiment to see how much time you need to cover the same bugs with tests.

            There are a couple of tests (haha) I’m planning on running:

            1. Of the files with bugs, how many were imported by files with ‘test’ or ‘spec’ in their name? This should give us a rough lower bound of test coverage.
            2. Of the commits that fixed bugs, how many also added a covering unit test? This should give us a rough upper bound of test coverage.
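            A sketch of what those two heuristics might look like, assuming the per-commit file lists and the import graph have already been extracted from the repos (the data shapes and function names here are my own guesses, nothing from the paper):

            ```typescript
            // Heuristic sketches for estimating test coverage of buggy files.
            const TEST_FILE = /test|spec/i;

            // Upper-bound heuristic (test 2): did the bug-fixing commit also
            // touch a file that looks like a test?
            function fixTouchedTest(changedFiles: string[]): boolean {
              return changedFiles.some((f) => TEST_FILE.test(f));
            }

            // Lower-bound heuristic (test 1): is the buggy file imported by
            // any file with 'test' or 'spec' in its name?
            function importedByTest(
              buggyFile: string,
              importsByFile: Map<string, string[]>
            ): boolean {
              for (const [importer, imported] of importsByFile) {
                if (TEST_FILE.test(importer) && imported.includes(buggyFile)) {
                  return true;
                }
              }
              return false;
            }

            // Hypothetical data for one project:
            const imports = new Map([
              ["test/parser.spec.js", ["src/parser.js"]],
              ["src/app.js", ["src/parser.js", "src/util.js"]],
            ]);
            console.log(importedByTest("src/parser.js", imports)); // true
            console.log(fixTouchedTest(["src/util.js"])); // false
            ```

            The changed-file lists per commit could come from something like `git log --name-only`; the import edges would need a light parse of `import`/`require` statements.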

            Another fun idea might be to look at projects with low and high test coverage and see how many intra-project issues could have been caught with a type-checker.