Obligatory Taco Bell programming.
Awesome link. I didn’t know xargs could do that.
Obligatory: “Scalability! But at what COST?”
Hah, I am just cleaning/tokenizing/tagging/parsing/lemmatizing a 27.3B word corpus (German text from the Common Crawl). The problem is embarrassingly parallel and just fits on a single higher-end machine (40 cores, plenty of RAM). Each annotation layer is produced by a separate program (mostly in Rust, plus some Go, Python and Java). It’s all driven by some shell scripts and xargs -P for processing chunks in parallel, so there is virtually no overhead for the actual work distribution. UNIX pipelines are extremely flexible and still fit many job sizes well.
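For anyone who hasn’t used it: the xargs -P pattern is just a few lines of shell. A minimal sketch below, with made-up chunk files and a stand-in `tr` step in place of the real annotation programs (which aren’t shown in the comment):

```shell
#!/bin/sh
# Hypothetical setup: a directory of corpus chunks to process independently.
mkdir -p chunks out
printf 'hello world\n' > chunks/part-000.txt
printf 'taco bell\n'   > chunks/part-001.txt

# Run up to 4 chunk jobs in parallel; each job is its own UNIX pipeline,
# so xargs handles all the work distribution. "tr" stands in for a real
# annotator here.
ls chunks/*.txt | xargs -P 4 -I {} sh -c \
  'tr "[:lower:]" "[:upper:]" < "$1" > "out/$(basename "$1")"' _ {}
```

Swapping `-P 4` for `-P "$(nproc)"` saturates all cores, which is essentially the whole scheduler.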
It makes me think back to a prior job where they would deploy Hadoop clusters for tasks that needed only a tiny fraction of that processing power. But hey, throwing Hadoop at the problem sells (it must be big data then).
(Of course, there are many scenarios where you actually want a Hadoop cluster, but I agree that people often overestimate the computational requirements of their problem.)
In my day job, I work with patent data. I’m part of a unix shop that is in the middle of a much larger java-oriented company. We regularly do the same job on the same data faster & easier on a single beefy workstation with awk than our colleagues with a reserved 400 node hadoop cluster. So, OP’s experience & yours resonate with me a lot.
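The awk-on-one-box approach is worth making concrete. This is only an illustrative sketch (the file and fields are invented, not the actual patent data): a word-frequency count, the classic shape of job people reach for a cluster to do:

```shell
#!/bin/sh
# Hypothetical tiny "corpus"; in practice this would be a large flat file
# streamed straight off local disk.
printf 'patent data patent\n' > corpus.txt

# Count word frequencies in a single awk pass (an associative-array
# map-reduce on one machine), then rank by count.
awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
     END { for (w in count) print count[w], w }' corpus.txt \
  | sort -rn > counts.txt
```

No serialization, no shuffle, no JVM startup; the pipeline is the whole job, which is usually why it wins on data that fits one machine.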
I’m not sure it’s even a matter of overestimating computational requirements. If you’ve got something that can easily be modeled as a pipeline, map-reduce is likely to be a poor fit, since too much emphasis will be on the reduce part. So it’s a matter of understanding your own problem and your own algorithm for solving it, and understanding whether or not the hard bits can be parallelized.
People seem to think Hadoop is about processing power. It’s not, it never was. Google only used MapReduce for processing datasets much too large to fit on a single machine. If your data can physically fit in the drives of one machine, and your processing pipeline’s working memory can physically fit in the RAM of one machine, you certainly should not use Hadoop.
Sure - but how many people does that apply to? AWS will rent you a machine with 128 cores and 4tb ram by the hour, and a truck to carry your backup tapes to it - and if that’s not enough you can buy bigger iron.
Exactly, it barely applies to anyone.
Knuth vs McIlroy, round 2: now with parallelism.
I feel like this isn’t quite fair to Knuth…
This reminded me of R17 http://www.rseventeen.com/, which is sadly not maintained anymore.