See also their video on the paper, the CACM rebuttal, and the TOPLAS rebuttal rebuttal. I wrote a brief overview of this all here which immediately went out of date (it was before rebuttal²). I’m working on a longer analysis, which is currently topping 5,000 words and still has a ways to go.
I will be very interested to see where this all ends up
FWIW I followed this argument a bit a few weeks ago, but got exhausted pretty quickly. I think it’s valuable but I was doing other stuff.
Here’s my shortcut, which is essentially argument by authority: if Jan Vitek and Emery Berger do a replication study, and rebut the rebuttal – i.e. put their reputations on the line – then I believe it.
Vitek has written a couple great papers on R which I learned a lot from. Several years ago I was amazed by the methodology and thoroughness in this paper, which answered a lot of questions:
Evaluating the design of the R language
https://scholar.google.com/scholar?cluster=7289073113769360932&hl=en&as_sdt=0,5&sciodt=0,5
And I also learned a lot from reading Berger’s work on allocation going back many years. The very cool mesh allocator is the most recent example.
I’m not familiar with the work of the other authors. I guess I could be convinced otherwise, but it all sounded pretty sloppy to me.
So yeah it’s a shortcut, but that’s what I’m going with for now…
Also I have a bunch of programming experience that tells me that the person / team matters a lot more than the language. Bugs and performance problems bunch up at architectural boundaries. Conway’s law says that team boundaries are architectural boundaries.
That said, shorter code has fewer bugs, regardless of language, but languages can be closer to or further from the problem domain, which matters a lot. So 10K SLOC has fewer bugs than 100K SLOC, which has fewer bugs than 1M SLOC (and sure, replace SLOC with a different metric, but we’re talking orders of magnitude here).
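The Ray et al. work aimed to provide evidence for one of the fundamental assumptions in programming language research, which is that language design matters. For decades, paper after paper was published based on this very assumption, but the assumption itself still has not been validated.
[…]
we have shown that the conclusions of the FSE and CACM papers do not hold. It is not the case that eleven programming languages have statistically significant associations with bugs. An association can be observed for only four languages, and even then, that association is exceedingly small.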
It’s this and the way they’ve handled the replication that persuaded me. After the rebuttal² one of the FSE authors linked Berger’s Facebook account, which is absolutely Not Okay.
An earlier lobsters thread on this topic.
One thing that (I believe) is very difficult to measure when all you have is the code, is how much time and effort went into it.
Without that data, I have this feeling that bug hunting simply stopped when things were deemed “good enough”. And I mean that at every stage of development. If something sorta makes sense, I fire up the compiler. If it sorta works, I test a few cases. If it looks like it works, I may write some unit tests or something to increase confidence. Then I just push the commit. And before publishing an official release, I simply make sure the code and tests are up to snuff, where “snuff” here is determined almost entirely by the problem domain: a simple one-off script is often usable even if it’s buggy, but a crypto library better be hardened up to 10-point steel.
Assuming I work correctly, whether I use C or Haskell will have little bearing on the end result. It may however have a significant effect on how much time I spend on the whole project. (And that effect may depend on the domain.)
I’m not sure how much evidence can be derived from GitHub repositories. My guess is not much.
Someone asked this in Jan Vitek’s talk on the paper. His response was that he thinks the entire project was doomed from the start, but nobody would believe them. They had to first prove the original paper was internally flawed, “beat them at their own game”, before people would accept the more extreme claim that “comparing language bugs by github repos is a bad idea.”
Excellent - I wish more effort was spent in general on trying to replicate widely-cited papers.
Agreed. This is also a topic covered in the talk Intro to Empirical Software Engineering. It’s amazing how much we think we know, vs how much we actually do.
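Uncontrolled influences
Additional sources of bias and confounding should be appropriately controlled. The bug rate (number of bug-fixing commits divided by total commits) in a project can be influenced by the project’s culture, the age of commits, or the individual developers working on it.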
Emphasis mine. I would expect an experienced developer working in Haskell to create fewer bugs than someone fresh out of college hacking on some PHP code.
For sure. Maybe even an experienced developer hacking on some PHP code!
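To make that quoted metric concrete, here is a minimal sketch of how such a bug rate could be computed, assuming bug-fixing commits are identified by keyword-matching commit messages. This is a naive heuristic in the spirit of the original study, not its actual tooling; the keyword list and the git invocation below are illustrative assumptions.

```python
# Rough sketch: estimate a repository's "bug rate" as bug-fixing commits
# divided by total commits, using naive keyword matching on commit
# messages. The keyword list and the matching rule are illustrative
# assumptions, not the actual classifier from the FSE/CACM study or the
# TOPLAS reproduction.
import subprocess
import sys

BUG_KEYWORDS = ("fix", "bug", "error", "fault", "defect")

def bug_rate(repo_path: str) -> float:
    # "--pretty=%s" prints one commit subject line per commit.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    subjects = [line for line in log.splitlines() if line.strip()]
    if not subjects:
        return 0.0
    bugfix = sum(
        any(kw in subject.lower() for kw in BUG_KEYWORDS)
        for subject in subjects
    )
    return bugfix / len(subjects)

if __name__ == "__main__":
    print(f"bug rate: {bug_rate(sys.argv[1]):.3f}")
```

Even this toy version shows why the quoted caveat matters: commit-message conventions vary with project culture, so two codebases of identical quality can report very different bug rates.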