1. 42

  2. 13

    I’m upvoting this mostly in the hope that HPC programmers will comment on it. I’m quite curious about how HPC programmers actually see the world, but they appear quite elusive online, or maybe I just can’t find their meeting places. I only ever meet them at conferences and such.

    1. 15

      I have a master's degree in CS with a focus in HPC. This article is mostly correct for commodity HPC. The lack of fault-tolerance primitives in MPI was a pain: you'd start a job, checkpoint as often as possible, and hope for no serious errors, and if something went wrong (hardware, network, etc.) you'd have to restart from the last checkpoint. HPC for me was molecular dynamics simulations and things like that, and the control MPI gives you was needed if you were going to run your systems on large supercomputer setups like the USG has. Even then it would often require porting and compiler fun to make things work.
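
      That checkpoint-and-restart discipline is simple enough to sketch. A minimal single-process version in Python (file name and state layout invented for illustration; a real MPI job would have every rank write its own shard, and the "step" would be an actual MD integration step):

      ```python
      import os
      import pickle

      CHECKPOINT = "state.ckpt"  # hypothetical path; real jobs write per-rank shards

      def load_or_init():
          """Resume from the last surviving checkpoint, or start fresh."""
          if os.path.exists(CHECKPOINT):
              with open(CHECKPOINT, "rb") as f:
                  return pickle.load(f)
          return {"step": 0, "positions": [0.0] * 8}

      def save(state):
          # Write-then-rename so a crash mid-write can't corrupt the checkpoint.
          tmp = CHECKPOINT + ".tmp"
          with open(tmp, "wb") as f:
              pickle.dump(state, f)
          os.replace(tmp, CHECKPOINT)

      state = load_or_init()
      while state["step"] < 100:
          state["positions"] = [x + 0.1 for x in state["positions"]]  # stand-in for one MD step
          state["step"] += 1
          if state["step"] % 10 == 0:  # checkpoint "as often as possible"
              save(state)
      ```

      If the job dies, rerunning the same script picks up from the last saved step instead of from zero, which is exactly the restart dance described above.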

      I wouldn’t say HPC is dying; it’s just diffusing from “hard physics” (mostly rote floating-point vectorized calculations) into the worlds of bio and data science, which have different needs and are often just as much about data processing as anything. Fields like astronomy and particle physics have been dealing with scads of data already and have their own 20-30-year-old data formats and processes.

      The article is correct: the sort of big 1000-core simulation groups are limited to maybe 10-25 research groups worldwide, if that (in the multi-disciplinary world of materials science), and in my time in grad school I met most of the big names. That’s not a market of people, that’s a niche user group with their own needs, and they can do what they want with the primitives available. I don’t know much about large-scale simulation (i.e., work that isn’t ‘embarrassingly parallel’ and requires near-lockstep execution across tons of machines) in other fields like civil engineering or molecular bio, but I’m sure their user bases are small as well.

      In the end, the needs of a handful of users won’t sway the direction of the market. See, for instance, how the bet on the Cell processor in Roadrunner didn’t keep Sony/Toshiba/IBM pursuing the design (even though it has influenced CPU/GPU designs to this day). There’s your ramble. :)

      1. 2

        I’m here to agree with this. HPC traditionalists are largely struggling to achieve performance in worlds like deep learning, where their tools and architectures are designed for the wrong problem (e.g. Lustre is great for huge sequential checkpoint writes, not so great for AI workloads, which do a whole lot more reading, much of it small and random, than writing).

        Meanwhile, cloudy novelty fans struggle to achieve performance in areas where traditional HPC both performs well and has been optimised over decades. I remember a fluid simulation demo, though not the domain, where some Apache-stack people wanted to show off how “performant” Apache-stack was. The MPI code was done before the MapReduce thing had finished launching.

    2. 7

      This article caused a huge flap in the HPC community when it was published in 2015, and Jonathan has posted a number of direct and indirect followups since.

      I agree with most of Jonathan’s opinions on MPI as a programming model, and I think other parallel programming models are going to gain ground and eventually dominate within scientific computing. However, I think that MPI will be around for a long time, will continue to improve, and will continue to make sense for new and continuing development. For example, I suspect that large physics simulations and other national-lab-type workloads will stick with MPI for the foreseeable future, just because it’s so well tailored to those workloads. It’s a narrow niche, but it’s a well-funded one.

      1. 5

        > However, I think that MPI will be around a long time

        In case anyone doubts this, just look at Fortran. I firmly believe most (all?) Fortran developers now are involved in HPC, and there are no signs that it is going away.

        1. 2

          The most impressive MPI application I’ve been involved with was Fortran-based: a hybrid of a British Meteorological Office weather model (ocean-based) and a U.S. Army Corps of Engineers weather model (land-based), covering the Mississippi Gulf Coast. At the time (2004), it ran on what would have been considered a 64-processor Beowulf cluster (16 dual-processor hyper-threaded Xeons), and it actually divided the processing time by 64. Rarely can you just add processors and get the predicted time-savings. My hat is off to the author of this book.

          1. 3

            > Rarely can you just add processors and get the predicted time-savings

            This is Amdahl’s Law, but weather and climate simulations are well placed to take advantage of all the parallelisation. They’ve got a cellular architecture, where each cell takes inputs from its neighbours, works out its update, and tells its neighbours. Up to a very large number of threads, the speed of the simulation is governed by how many cells can be running independently, which is just how many threads you have; beyond that point, the messaging overhead takes over.
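
            For anyone who wants to play with the numbers, Amdahl’s Law fits in one line (the 5% serial fraction below is made up):

            ```python
            def amdahl_speedup(serial_fraction, n_procs):
                """Best-case speedup on n_procs when serial_fraction of the work can't parallelise."""
                return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

            # A perfectly parallel code scales linearly, like the 64-way run above...
            print(amdahl_speedup(0.0, 64))               # 64.0
            # ...while even 5% serial work caps 64 processors well below that.
            print(round(amdahl_speedup(0.05, 64), 1))    # 15.4
            ```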

      2. 4

        HPC and MPI are both great solutions when applied correctly to an appropriate problem. I’ve seen great practical applications of both approaches, but I’ve seen a lot of abuses of them as well. There are too many cases out there of naive approaches to problem solving that are covered up by running on an HPC cluster or adding MPI support.

        Case in point, the genomics toolkit I’ve relied on since 2009 recently removed MPI support from their code base because:

        > The MPI version is slower and requires dual maintenance. No advantage to keeping it.

        I tested this, and it was indeed true (for that software).

        In another case, I’ve seen a genome pipeline running on a Windows-based HPC cluster that launched 24GB containers that merely acted as IDL wrappers around a Fortran program that would otherwise take less than 10MB of memory to run pairwise distance calculations. For each pairwise comparison. Sheer insanity.

        Since then, I’ve implemented my own parallel code in C++ to replace that software, and I plan to employ MPI with GPU kernels to run it on my home-brew cluster. Even running on a single node, my implementation is 25x faster than the software I’ve been using for the past 10 years. I’ve reduced a 4-hour process to less than 10 minutes, and my ultimate goal is to take a genome comparison problem that takes “48 wall hours” on an HPC cluster down to less than 2 hours on $2,500 of used cash registers.
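
        What makes pairwise distance calculations so parallel-friendly is that every pair is an independent task, so the pair list can simply be carved into chunks with no communication until the final gather. A toy sketch in Python (Hamming distance over invented sequences; the chunks stand in for MPI ranks or GPU blocks):

        ```python
        from itertools import combinations

        def hamming(a, b):
            """Distance between two equal-length sequences; each pair is independent."""
            return sum(x != y for x, y in zip(a, b))

        seqs = ["GATTACA", "GATTGCA", "CATTACA", "GACTACA"]  # toy data
        pairs = list(combinations(range(len(seqs)), 2))      # 6 independent tasks

        # "Embarrassingly parallel": one round-robin chunk of pairs per worker.
        n_workers = 3
        chunks = [pairs[i::n_workers] for i in range(n_workers)]

        results = {}
        for chunk in chunks:  # each chunk could run on its own rank or GPU
            for i, j in chunk:
                results[(i, j)] = hamming(seqs[i], seqs[j])

        print(results[(0, 1)])  # 1: GATTACA vs GATTGCA differ at one position
        ```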

        1. 4

          It is very odd, this ‘cluster == fast’ assumption, which has made folks like Frank McSherry quite the star in some circles (mine included) for showing how many graph computations can be done on a laptop faster than on a cluster. In the world of HPC I was in (materials science), it was usually something like MD, where you are simulating X atoms: you could ‘fit’ only so many in RAM at a time (with position, velocity, and possibly other information), and then you’d do the boundary-area calculation to work out which atoms need to move to another CPU or system, and where (possibly with buffering). You’d fill up a single CPU in testing, then go multicore, then move to HPC.
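
          The decomposition described above is easy to sketch in one dimension (numbers invented; a real code would also exchange ghost/halo regions around each boundary):

          ```python
          # 1-D spatial decomposition: rank r owns the slab [r*slab, (r+1)*slab).
          L, n_ranks = 100.0, 4
          slab = L / n_ranks

          def owner(x):
              """Which rank's slab a position falls in."""
              return min(int(x // slab), n_ranks - 1)

          # Positions after an integration step; one atom has drifted across a boundary.
          atoms = {"a": 3.0, "b": 24.9, "c": 25.2, "d": 99.5}
          home = {"a": 0, "b": 0, "c": 0, "d": 3}  # ranks that held them last step

          # The boundary calculation: who needs to move where.
          migrations = {name: (home[name], owner(x))
                        for name, x in atoms.items() if owner(x) != home[name]}
          print(migrations)  # {'c': (0, 1)}
          ```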

          Perhaps the ‘big data’ aspect of genomics has led to folks trying to do more Hadoop-style distributed calculations that don’t necessarily redline the chips they run on (due to network/disk IO overwhelming the calculation). Even stranger are the EC2 etc. offerings, where your network and disk are ??? speed depending on how much you’re willing to pay, at which point you really should think about getting your own rack or 20.

          That said, I haven’t seen a lot of huge (i.e. 1000-core) genomics problems. With mat sci you could always multiply your system size, but DNA has a limited length, and it lends itself to serial computation or embarrassingly parallel computation (to my naive mind). Good luck!

          1. 2

            May I know which executor you were using for your workflow?

            1. 3

              Are you referring to the HPC pipeline? If so, they had developed a Python 2.6-based IDL architecture from scratch, to run on their Windows cluster. They actually threw away the first attempt, which amounted to nearly 5 years of development effort deleted. I’m not certain which container they used for bringing up their IDL components - I have a feeling they changed that several times as well. I can’t explain their design decisions, except that the person who spec’d out the cluster left prior to the project kicking off.

              If you are instead referring to my genome analysis pipeline, I use Java for managing the workflow, and I have a pure Java implementation of my bioinformatics library as well as my faster C++ implementation (same design and class hierarchy). I had looked into using Jenkins to manage the workflow, but it looked like it would add unnecessary complexity.

              1. 2

                Thanks. It’s interesting to see how different people set up pipelines. I’m part of an effort to standardize how bioinformatics workflows are represented, and we use a scheme where the tools are put into docker images. Each tool is described using a YAML schema that describes the interface (inputs, outputs, and how the command line is formed). Tools can be hooked together into a workflow, also described in YAML, and workflows can be recursively nested. The idea is that having the binary in a docker image, and specifying the interface in a standard manner, makes the pipeline reproducible and portable.
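
                A hedged sketch of what one of those YAML tool descriptors might look like (field names invented for illustration, not the actual schema):

                ```yaml
                # Hypothetical descriptor: wraps one binary shipped in a docker image.
                id: pairwise-distance
                docker_image: example/pairwise:1.0   # made-up image name
                inputs:
                  - id: sequences
                    type: File
                  - id: threads
                    type: int
                    default: 4
                outputs:
                  - id: distances
                    type: File
                    glob: "*.dist"
                command: [pairwise, --threads, $(threads), $(sequences)]
                ```

                Nesting a workflow then just means a step that points at another workflow file rather than at a tool descriptor.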

          2. 2

            Ongoing US DOE effort for your reference: https://www.ecpannualmeeting.com

            1. 1

              Yeah, it was the Exascale Computing Project. They were trying to fund tools to boost productivity and handle errors on insanely large clusters.