Really interesting perspective on the differences between the ‘traditional HPC’ and machine learning communities' approaches to problems of large-scale computation.
He notes the somewhat paranoid mindset of HPC folks about errors: they worry about everything from numerical stability to silent data corruption and whole-node failure. As he says, some of this is a property of the computations they’re doing: lots of repeated calculations on the same input data are more susceptible to stability problems. Some of it is also historically bred into the HPC programmer, as access to those machines is precious, and HPC folklore has it that the first Cray at Los Alamos suffered from single-event-upset problems (apparently one every 6 hours).
Edit: Because I’m a sucker for supercomputing history, here’s more info about the Cray-1’s memory reliability, from the LANL report on the initial evaluation of the Cray-1 (which also has a nice description of the Cray’s architecture at the end, with classic diagrams).
They experienced an MTTF of roughly 2.5 to 7 hours during testing, and 89% of the failures were memory parity errors.
This post also makes it sound like HPC codes use system-level full-memory checkpoint & restart facilities. Maybe that’s more common in industry HPC, but in the DOE and academic scientific computing world I was familiar with, most applications handled checkpointing themselves, persisting much less than the total contents of node memory. The general point does stand, though: it’s still not really scalable to write out enough data often enough to guarantee good progress.
BTW, why did he link to a video of Modern English’s “I Melt with You” in his paragraph about checkpoint restart?
As an example of blind full-memory checkpointing being overkill, consider stencil computations, where you have a regular 2D or 3D grid and break it up into subgrids to distribute over processors. These often have an optimization where each node stores some duplicate data from its neighbors in “ghost cells”. These ghost cells can be used as input for several local iterations before requiring slow network IO to update them.
However, because they’re redundant copies of data that other nodes already own, saving them to persistent storage as part of a blind checkpoint wastes time and space.
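To make the ghost-cell point concrete, here’s a minimal sketch in plain Python. It assumes a 1D subgrid and a ghost layer two cells wide, and it is not taken from any real HPC code: the function names, the `GHOST` width, and the JSON file format are all made up for illustration. It contrasts a blind dump of the whole local array with an application-level checkpoint that persists only the cells the node owns.

```python
import json
import os
import tempfile

GHOST = 2  # hypothetical ghost-layer width on each side of a 1D subgrid


def make_subgrid(interior, left_ghost, right_ghost):
    """Local array = ghost copies from neighbors + the cells this node owns."""
    return list(left_ghost) + list(interior) + list(right_ghost)


def interior_of(local):
    """Strip the ghost cells, leaving only the data this node owns."""
    return local[GHOST:len(local) - GHOST]


def checkpoint_blind(local, path):
    """Blind checkpoint: dump the whole local array, ghosts included."""
    with open(path, "w") as f:
        json.dump(local, f)


def checkpoint_interior(local, path):
    """Application-level checkpoint: persist only the owned cells."""
    with open(path, "w") as f:
        json.dump(interior_of(local), f)


def restore(path, left_ghost, right_ghost):
    """Rebuild the local array. Ghost values are passed in, standing in for
    a fresh exchange with the neighbors; they are not read from disk."""
    with open(path) as f:
        interior = json.load(f)
    return make_subgrid(interior, left_ghost, right_ghost)


if __name__ == "__main__":
    local = make_subgrid([1.0, 2.0, 3.0, 4.0], [0.0, 0.5], [4.5, 5.0])
    with tempfile.TemporaryDirectory() as d:
        blind = os.path.join(d, "blind.json")
        lean = os.path.join(d, "lean.json")
        checkpoint_blind(local, blind)
        checkpoint_interior(local, lean)
        print(os.path.getsize(blind), os.path.getsize(lean))
        assert restore(lean, [0.0, 0.5], [4.5, 5.0]) == local
```

On a toy 1D array the savings are small, but in 3D the ghost layers can be a sizable fraction of a small subgrid, and a restart has to re-exchange them with the neighbors anyway.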
This post also makes it sound like HPC codes use system-level full-memory checkpoint & restart facilities. Maybe that’s more common in industry HPC, but in the DOE and academic scientific computing world I was familiar with, most applications handled checkpointing themselves, persisting much less than the total contents of node memory.
Application-level checkpointing is a lot more common these days, but I’ve definitely seen system-level checkpointing used in both academia and industry. I’ve mostly seen it on smaller systems with a lot of parallel storage — on bigger systems, the admins won’t let you do anything nearly so wasteful. :)
I also thought the OP made interesting points about restricting the computation model to improve fault-tolerance, a la MapReduce. I’d be really interested to see more “framework-style” distributed computing models come into being, because I also think they’d help improve ease-of-use and adoption for parallel programming in general. The trick would be finding ways to port existing applications to those models (or at least find ways to solve the same problems).
From the author, via twitter:
@mikemccracken The “melt with you” video was a (bad?) joke emphasizing global checkpoint inefficiency, as in “i stop the world …”
I guess I should’ve listened to the song; I didn’t remember that part. :)