I fear that these problems occur very often and that a lot of research is non-repeatable. More than once I have tried to reproduce papers in natural language processing and did not get the claimed results. Everyone makes these mistakes, even trained programmers. The consequences can range from small embarrassments, through paper retractions, to danger to human life.
Within our research project we have introduced the rule that all code directly used in research papers must be peer-reviewed (we use GitHub PRs). We also try to cover most code with unit tests; for more expensive tests (e.g. training models that can take several days), we check that there are no regressions between releases.
I have also started a repository of Nix derivations of software + models, so that in principle some work could be reproduced with exactly the same set of dependencies.
I agree with basically everything in these slides, and have seen or experienced first-hand a lot of what he mentions. I do have a question, though.
Let’s say you have an experiment that generates a large amount of data, say 50GB. Where do you put that so that it’s freely available? It doesn’t seem like putting it in git along with the code is a good solution. I’m also not going to throw it on my personal Dropbox and host it for the rest of eternity.
A quick search reveals that my university provides a Dropbox-like service, but it appears that you need some kind of permissions for access, i.e. a random stranger can’t access the data without wading through the BS of getting access. (Note: I’m not saying that you shouldn’t be able to have private data, just that lots of research data has no reason to be private and there should be an easily accessible depot of this data.)
On top of that, I’ve never heard of this service that my university provides. I suspect that a top-down effort from the university would make a big difference here e.g. all data used for publications, theses, etc must be uploaded to a university-run research data storage depot. I also suspect that more people would use these open-data services if the university actually advertised them.
Now, time for an anecdote to drive home the point of the slides.
I do experiments that coordinate several pieces of equipment. At one point I was starting on a new experiment, picking up where someone else had left off years ago. I got the experiment back up and running to the point where I could collect data, but I noticed that the data looked wrong (the decay I was measuring had a lifetime larger than expected). I discovered that someone had missed a factor of 2 in the program that runs the experiment. In a moment of panic I started to wonder whether our group had published data based on this program. It turns out this particular experiment/program was never used (the lead on it had graduated before using it), so the crisis was averted in this case.
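A cheap defense against this kind of bug is a sanity check that pins the analysis output to a physically plausible range. A hypothetical sketch (the two-point exponential fit and the numbers are purely illustrative, not the actual experiment code): a missing factor of 2 in the acquisition path shows up directly as a doubled lifetime, which an assertion like this catches immediately.

```python
# Illustrative sanity check for a decay measurement. A stray factor
# of 2 in the data pipeline doubles the fitted lifetime, which this
# range assertion flags before anyone publishes with it.
import math


def fit_lifetime(times, counts):
    """Crude two-point estimate of tau assuming N(t) = N0 * exp(-t / tau)."""
    t0, t1 = times[0], times[-1]
    n0, n1 = counts[0], counts[-1]
    return (t1 - t0) / math.log(n0 / n1)


def check_lifetime(tau, expected, rel_tol=0.2):
    """Fail loudly if the fitted lifetime is outside the expected window."""
    if abs(tau - expected) / expected >= rel_tol:
        raise ValueError(
            f"fitted lifetime {tau:.3f} outside expected range around {expected}"
        )
```

It wouldn’t prove the program is correct, but it would have turned a years-later discovery into an immediate test failure.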
bittorrent would be pretty okay…
Hm I wonder what the seeding rate would be here. I imagine in most cases there’s less than one concurrent downloader, and if they don’t seed, it reduces to hosting it on a web server.
http://academictorrents.com/ is something I’ve used in the past when I’ve had a need, but strictly speaking I’m not an academic, and it does kind of depend on hitting a critical mass of people willing to seed.
This is what Dat aims to solve, but yes bittorrent is also nice.