1. 29

  2. 21

    Hearing stories from folks doing academic HPC, the answer is you can’t. Even with source code, there are often a bajillion stupid scripts, vendor-specific compiler fuckups, scheduling issues, hardware failures, silent data corruption, and everything else. And that source code, the grad who wrote it is long gone and it’s probably impenetrable templated C++ with some Boost and MPI in there somewhere. Maybe some badly-ported Fortran code–if it isn’t linked directly. If they’re in a really progressive lab, they might have git. Otherwise, it’s subversion and divergent branches and general clusterfuckery.

    Oh, and some of them, when asked about this, will snarf and talk about how they are special academic HPC snowflakes that somehow decades of software engineering practice don’t apply to. “We’re too large to log errors properly, debugging can’t be done like you’re suggesting, everything takes too long, we are very smart PhDs, etc.”

    As long as the paper gets published, though, it’s fine. How many people have the machine hours to prove you wrong? And at what risk to their own equally-shaky research? Most people can’t even understand the abstracts of the papers they work on–who, exactly, is going to call bullshit?

    1. 9

      Having done academic HPC, I can’t really disagree with anything you’ve said, at least not as of 7-8 years ago. I can only hope it’s gotten better. Also, there are some fields in which the potential impact of results makes it easier to get resources for professional software engineering practices, independent reproductions, and validation runs. In my experience, weather, climate, and computational fluid dynamics can get those resources.

      If anyone’s interested in a readable overview of what the practice of software engineering in computational science looked like as of ‘08, this article by Victor Basili et al: Understanding the High-Performance Computing Community: A Software Engineer’s Perspective is a good start - I worked with all of those authors, and they’re all great. In particular Jeff Carver seems to be continuing this line of work with regular workshops on “Software Engineering for High Performance Computing…” at the Supercomputing and ICSE annual conferences.

      If you’re wondering how seriously the scientific software quality problem is/was taken, much of that work on understanding HPC development processes was funded by part of a DARPA program for “High Productivity” computing systems, and the software engineering parts of it were de-funded unexpectedly in the last round of that program… The public face of that effort was at highproductivity.org (wayback link), but now it’s so dead I couldn’t even find enough references to its workshops to fact-check my dates.

      1. 2

        If anyone’s interested in a readable overview of what the practice of software engineering in computational science looked like as of ‘08, this article by Victor Basili et al: Understanding the High-Performance Computing Community: A Software Engineer’s Perspective is a good start.

        That’s a good article, thanks!

    2. 8

      I followed up the austerity paper story and it’s not clear how large a factor the excel spreadsheet thing was. There were several factors in the paper that were dodgy

      1. Mysteriously excluded data on some high-debt countries with decent growth immediately after World War II, which would have greatly weakened their result;
      2. They used an eccentric weighting scheme in which a single year of bad growth in one high-debt country counts as much as multiple years of good growth in another high-debt country;
      3. They dropped a whole bunch of additional data through a simple coding error. (which is what is referenced in the article)
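
To make the weighting complaint in item 2 concrete, here is a small arithmetic sketch. The growth figures are invented purely for illustration and are not taken from the paper, and the two function names are hypothetical. It assumes the scheme averages within each country first and then across countries, which is one way to read the description above.

```python
def mean(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


def country_weighted_mean(growth_by_country):
    """Average each country first, then average the country means.

    A country with one year of data counts as much as one with ten.
    """
    return mean([mean(years) for years in growth_by_country.values()])


def year_weighted_mean(growth_by_country):
    """Pool every country-year and average them all together."""
    pooled = [g for years in growth_by_country.values() for g in years]
    return mean(pooled)


# Invented numbers: one bad year in country A, ten decent years in country B.
example = {
    "A": [-5.0],
    "B": [2.0] * 10,
}

if __name__ == "__main__":
    print("country-weighted:", country_weighted_mean(example))
    print("year-weighted:   ", year_weighted_mean(example))
```

With these made-up inputs the two schemes disagree even on the sign of the average, which is why the choice of weighting matters as much as any spreadsheet bug.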

      Reference: Paul Krugman’s summary in NYT

      The original paper itself was not in a peer reviewed journal but rather a “working paper” put out by a Washington D.C. based “think tank” i.e. a machine to manufacture public opinion.

      In the context of bad scientific coding the x-ray machine fiasco still is at the top.

      1. 4

        I followed up the austerity paper story and it’s not clear how large a factor the excel spreadsheet thing was.

        Yeah, the paper on the whole was pretty sketchy. I’m focusing just on the software bug because that’s the area I have the most expertise in, and the area that most scientists have the least expertise in.

        Note that that’s not the only high-profile paper that had serious software bugs. For example, the famous Levitt paper linking abortion to crime rates accidentally left out the control groups.

        In the context of bad scientific coding the x-ray machine fiasco still is at the top.

        I dislike the Therac-25 example for two reasons. First of all, it’s software, not science code. It’s tragic and a cautionary tale for programmers, but it doesn’t have much relevance for scientists and researchers. They’re not putting code into production, ‘just’ using it for their own purposes. The problem is that bad science code can warp what those purposes are.

        Second, it’s 30 years old, and leads to a sense of “this happened back in the day but things are better now.” Whereas just seven years ago the FDA recalled a bunch of eye surgery devices that, due to software bugs, could blind people. This was a problem back in the day and it remains a problem now.

      2. 6

        Bioinformatics seems to, at least ostensibly, care about reproducibility. Tracking which versions of tools were used, including the preparation scripts that turn raw data into processable data, gets a lot of lip service. The papers are also all located (generally for free) on PubMed, which makes discovery pretty straightforward. There are multiple websites that exist to host datasets. Of course, finding people interested in actually reproducing results is a different problem.

        1. 5

          Automated tests and code reviews are fundamentally different: the former enforce stability. Stability may involve staying incorrect! The latter can catch whether your code matches the spec. The spec may be wrong! See https://www.youtube.com/watch?v=Vaq_e7qUA-4&feature=youtu.be&t=63s (I really need to make a prose version of that).

          What’s missing is a third kind of testing: comparing against reality, to the extent possible.

          I’m going to start a job working on a gene sequencing data processing pipeline, so this is the problem I will be wrestling with over the next few years. I asked about current approaches, and the answers are suggestive:

          1. Do the answers make sense? E.g. if you know that 80% of RNA is a thingummie, but only 5% of sequenced RNA data is thingummies… probably you screwed up the sequencing.

          2. Do the answers match a previous, reasonable model? E.g. humans can find cell borders on a photo. Do the software’s borders match what the human did? There’s some tricks you can do to make this faster, e.g. visual diffs are a neat way to combine computer processing with human image processing abilities.
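
The two checks above could be sketched roughly like this. Everything here is hypothetical: the function names, the category label, the tolerance, and the choice of intersection-over-union as the agreement score are assumptions for illustration, not anything a real pipeline is known to use.

```python
def fraction_check(counts, category, expected, tolerance):
    """Check 1: does the observed share of `category` look plausible?

    Returns (ok, observed_fraction). `expected` and `tolerance` are
    fractions between 0 and 1 supplied by someone who knows the domain.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no data to check")
    observed = counts.get(category, 0) / total
    return abs(observed - expected) <= tolerance, observed


def overlap_score(software_pixels, human_pixels):
    """Check 2: agreement between software output and a human annotation.

    Both arguments are sets of (row, col) pixels; the score is their
    intersection-over-union, from 0.0 (disjoint) to 1.0 (identical).
    """
    union = software_pixels | human_pixels
    if not union:
        return 1.0  # both empty: trivially in agreement
    return len(software_pixels & human_pixels) / len(union)


if __name__ == "__main__":
    # Made-up counts mirroring the "expected 80%, saw 5%" example.
    ok, observed = fraction_check(
        {"thingummie": 5, "other": 95}, "thingummie", expected=0.80, tolerance=0.10
    )
    print("plausible:", ok, "observed fraction:", observed)

    software = {(0, 0), (0, 1)}
    human = {(0, 1), (1, 1)}
    print("overlap:", overlap_score(software, human))
```

Neither check proves the pipeline is right; they only flag results that disagree with what is already known, which is the "comparing against reality" idea in miniature.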

          1. 4

            Testing was originally used to prove correctness. One can also test against known spec for stability. That’s regression testing. One can even generate tests directly from formal specs. That’s just one goal, though. Another set of strategies for automated tests assume the spec or intuitive understanding of the algorithm might be wrong. They typically look for wrong outputs or crashes (esp with fuzzers). Testing can also be used to help the developer better understand the domain or black-box implementations. Finally, testing can be layered in combination with other assurance techniques on the same properties but just as a check against the failure of an individual technique.

            The Wikipedia article on Software Testing lists many different types of testing with their goals if you want more information on the topic.


            1. 2

              I think automated testing against “known good” should be able to enforce correctness as well as stability. As I understand, Genome in a Bottle is an initiative in that direction.

              1. 1

                Automated tests and code reviews are fundamentally different: the former enforce stability. Stability may involve staying incorrect!

                Particularly important for scientific code. Segfaults are way better than code that runs to completion and gives you the wrong floating point numbers.
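
One way to turn "runs to completion with wrong numbers" into a loud failure is to compare outputs against a trusted reference with an explicit tolerance. This is only a sketch: `check_against_reference` is a hypothetical helper, the default tolerance is arbitrary, and where the reference values come from is the hard part it does not solve.

```python
import math


def check_against_reference(computed, reference, rel_tol=1e-9, abs_tol=0.0):
    """Return a list of (index, got, want) for every value that disagrees.

    NaN never compares close to anything, so a NaN in `computed` is
    always reported rather than silently passed along.
    """
    if len(computed) != len(reference):
        raise ValueError("computed and reference have different lengths")
    mismatches = []
    for i, (got, want) in enumerate(zip(computed, reference)):
        if not math.isclose(got, want, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((i, got, want))
    return mismatches


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]
    computed = [1.0, 2.1, float("nan")]
    for index, got, want in check_against_reference(computed, reference):
        print(f"value {index}: got {got!r}, expected {want!r}")
```

If the reference itself came from the same buggy code, this only enforces stability, which is exactly the caveat quoted above.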

              2. 5

                Your opening is a mind-blowing example of bad science and damage from a program that I’m just hearing about. Although normally I use medical or exploding rockets, I think I’ll add this one to the list of costly bugs I use when arguing for engineered software. Does anyone know an estimate of how much austerity cost Greece on top of whatever else was going on? And has anyone checked the specs and source for what produced that estimate? ;)

                As far as adoption goes, there are a lot of people using Matlab, Python, and R. I think investing heavily in fire-and-forget tooling that spots common mistakes in those languages might be a start. Then, an easy way for them to write specs of what they’re doing that can feed into other checkers, especially ones that catch data, interface, consistency, or ordering errors. There used to be CASE tools for a lot of stuff like this with some success in the market. They mostly failed due to overpromising and being greedy. Honest, FOSS tooling might fare better.

                1. 8

                  Greece is a little different, insofar as it had austerity forced on it pour l’encouragement des autres.

                  And it’s not helped by the fact that mainstream economics and its love of austerity are basically a form of religion as advocated by Hobbes in Leviathan: a means of keeping the hierarchy in control. It’s certainly not a science (Debunking Economics is a good book on the subject, because it debunks economics using its own intellectual assumptions; other attacks on mainstream economics are usually along the lines “but it forgets about X or Y”.)

                  1. 9

                    Mainstream economics is involved to some extent, but there’s a good deal of domestic EU politics exacerbating the Greece debacle (partly to make an example as you note, but even beyond that). Even the IMF, which is pretty well known for a certain kind of by-the-books mainstream economics and support of austerity measures, has been distancing itself from the approach taken in Greece, because IMF in-house economists don’t believe the current program will work. They attempted to push for either some kind of debt forgiveness or rolling over debt into long-dated, low-interest bonds (which is a sort of “soft” debt forgiveness), but the EU wouldn’t go for either option, because EU finance ministers from countries like Germany and Finland didn’t think they could sell it to their domestic voters. Part of the problem there is that a number of countries have invented a kind of national-chauvinist mythos where they’re hard-working, responsible northern Europeans, in contrast to those layabout, profligate southern Europeans. Once you get that kind of national superiority and morality-play in the mix it doesn’t matter what the economics say.

                    1. 3

                      The IMF would have been bankrupted if Greece had defaulted - which is part of the problem…

                2. 4

                  During college I got a job at the research computing group, optimizing a researcher’s Fortran code from single-core to parallel and distributed. The code was some kind of field propagation across a 3D grid, and the main loop was what you might expect: each cell takes its new value from its own and its neighbors’ old values. In parallelizing this I found something interesting, though: this was being done in-place, so three of the point’s six neighbors would be from the current iteration and three would be from the previous, and the graph would end up skewing towards one quadrant. This was pretty obviously broken, and the grad student who wrote it hadn’t noticed. I passed word and the optimized/fixed version back to him, so I hope they updated the results for whatever project they were working on.

                  The moral of the story being yes, code reviews for scientific codes would be super helpful. I would wager this sort of problem isn’t uncommon, and they don’t all get caught before release.
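
The in-place bug described above is easy to reproduce in miniature. This is a simplified 1D sketch in Python, not the original 3D Fortran code; the three-point averaging rule, the grid size, and all function names are assumptions chosen only to show the effect.

```python
def step_in_place(grid):
    """Buggy variant: overwrites cells that later cells still need to read."""
    for i in range(1, len(grid) - 1):
        # grid[i - 1] already holds this sweep's new value,
        # while grid[i + 1] still holds the old one.
        grid[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return grid


def step_double_buffered(grid):
    """Fixed variant: reads only old values and writes into a copy."""
    new = list(grid)
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new


def run(step, n=11, iterations=3):
    """Start from a single pulse in the middle and apply `step` repeatedly."""
    grid = [0.0] * n
    grid[n // 2] = 1.0
    for _ in range(iterations):
        grid = step(grid)
    return grid


def asymmetry(grid):
    """Largest difference between a cell and its mirror image."""
    return max(abs(a - b) for a, b in zip(grid, reversed(grid)))


if __name__ == "__main__":
    print("in-place asymmetry:       ", asymmetry(run(step_in_place)))
    print("double-buffered asymmetry:", asymmetry(run(step_double_buffered)))
```

A symmetric pulse should spread symmetrically, so a check like `asymmetry` is the kind of cheap physical sanity test that could have caught the skew without anyone reading the loop.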

                  1. 4

                    Academics should come to expect published papers to be in the form of Jupyter Notebooks, with the data embedded (or otherwise available for re-calculation).

                    The notebook tech doesn’t have to be that Jupyter brand. Org-mode with babel is another option.

                    The point is that during peer review, the peer should be able to change a datapoint and re-render the rest of the document to see how that change cascades through the work.

                    Academics seem to use the term ‘reproducible research’ when talking about this.

                    1. 7

                      That does seem to be the term catching on, but I’m wary of using “reproducible” for this kind of thing where you’re literally just re-running the original author’s code, in their original setup (sometimes even down to a whole VM with gigabytes of gunk in it), rather than independently reproducing the results. With the classic idea of reproducibility in the natural sciences, independently constructing your own experimental setup, using your own completely separate materials, is critical, because part of the point of reproduction is to try to tease out if there was any hidden dependence on specifics of the original equipment, or confounding factors like minor impurities or miscalibrated equipment. So you don’t reproduce research by going into the original lab, pushing the same buttons on the same equipment, and observing the same results; you do it by reading a published description of the experimental protocol, and then, working independently and only from that, seeing if you can reproduce the results on your own completely separate equipment, with different staff, different sample suppliers, in a different location, etc.

                      For computational work, I do see benefit to sharing the data specifically, especially in the case of data that’s expensive to collect. But for good reproducibility the person trying to reproduce a result really shouldn’t be working directly from the original author’s Jupyter notebooks, VMs, scripts, etc.; ideally they’d write their own scripts to avoid inadvertently copying some quirk that seemed irrelevant but turns out to matter.

                      1. 2

                        Hell yeah!

                        So, one problem to solve is the Greece debt paper issue: that paper itself wasn’t ‘reproducible’ using their own materials. At least the authors released their excel sheet, without which the problems wouldn’t have been identified. Documents-as-code or ‘reproducible papers’ or ‘literate programming’ addresses this problem.

                        The problem you describe is outside of my area of expertise. It’s important, thank you!

                        1. 1

                          Would the characteristic be better named “public method” or something?

                          In science, your code frequently is your method; if you don’t publish your method it should be hard to take your results seriously.

                        2. 1

                          Speaking to angersock’s ‘clusterfuckery’ comment re: terrible code and practices… What you do is author your paper using the notebook. So if you want your fancy cluster to crunch some numbers, you put the shell commands to make that happen into the notebook. Right down to the first command:

                          for i in `seq 1 1000`; do ssh node-$i.cluster wget http://the-local-shared-drive/20gb-list-of-numbers.csv; done

                          For that one, you’d want to tell your readers where to get that data. The point is, the first metric peers will use during review is “can I reproduce this?”.

                          1. 1

                            Hasn’t this been somewhat of a solved issue for a while now, with technologies like Wolfram’s Computable Document Format in wide use for 5+ years? In the FOSS world there are alternatives too, albeit more geared to number theory or computation, like SageMath, which has been around for more than a decade now.

                            Maybe the question becomes how to get people to adopt new practices? As the original article states, this is not a technical issue - that part of the puzzle is already solved.

                            This is a problem of making people change their ways and use new tools and adopt new practices.

                          2. 4

                            Being an active academic HPC practitioner, at least in our research field, we are very careful in deploying our production code. In the US alone, we have multiple groups managing multiple production code bases in multiple languages to extensively cross check our results. All of our codes are on Github.

                            That being said, some bad practices do exist. Not to speak for all of the academic HPC fields, but I guess it is generally true. Most of the codes are written by grad students and postdocs, and there are just not enough people working on the same thing to call it a team. Some amount of heroic programming and publication-oriented development schedules are unfortunate consequences of career pressure and lack of funding. Those two, career pressure and lack of funding, are the true reasons that we cannot compete with the highest standards out of Silicon Valley.

                            Nevertheless we do our absolute best to maintain the correctness of our research.

                            1. 2

                              “[A] well-known fintech company accidentally introduced hundreds of mock applicants into their production database” Does anyone have a link to this story (if it’s publicly known)?