The blog post this blog post is blog posting about is found here: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html
All of this sounds really nice if you think big data is a sham, but the problem is that the dataset they show this on is only a few gigs. It fits effortlessly in the RAM of a laptop and, more importantly, on the disk of a modern laptop. So, yes, I don’t think it’s surprising at all that the single-core version grossly outperforms the distributed version. But that seems to be missing the point: what do you do when you can no longer run something on a laptop? A lot of the big data work I see is simple, low-memory processing over massive amounts of data.
The other aspect that @apc has pointed out elsewhere is that, for an organisation, there is a lot of value in standardising and centralising the processing of the data. Sharing the results and sharing new data become much harder when everyone has it on their laptops.
Personally, I think there is a bit of truth to this. It’s clear Hadoop is massively inefficient, even at scale. Spark claims to be much more efficient, but I’d wager it still carries a lot of overhead. I’d love to see where it fits in, but perhaps Joyent’s Manta would be a viable alternative. It mixes the feel of running something on your laptop (you can “log in” to your data) with the scalability of larger systems. It’s also object-based, which means you pay less of an overhead tax if you know you can process the dataset on one core or one machine. Perhaps @bcantrill could weigh in.
So, I kind of had the same thought: you probably don’t really need a cluster to process three gigs of data, and if your startup overhead for the cluster is longer than the time to process three gigs of data on your laptop, then you have “unbounded COST”.
On the other hand, the timing numbers they’re showing seem pretty big compared to any plausible startup overhead. There’s just no way that most of the 1759 seconds for the 128-core PageRank calculation in Spark is startup overhead: it would have to exceed the entire 651-second single-thread time, and the same calculation on a smaller dataset still took 857 seconds. And the label-propagation numbers are even more damning.
So, to me, this suggests that even if you daisy-chain a couple of levels of USB3 hubs and hook up 16 terabytes of disk to your laptop, some of these parallel systems will still probably have large or unbounded COST: if it takes 1759 seconds to PageRank the 980MB uk-2007-05 dataset on your 128 cores, your entire cluster is only processing half a megabyte of input per second (processing it 20 times over), or maybe a megabyte or two if most of that 1759s is startup overhead. But the cluster isn’t going to magically start processing 10 or 100 megabytes per second once your dataset is 10 terabytes, is it?
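To put numbers on the half-a-megabyte claim, here’s the back-of-envelope arithmetic as a quick Python sketch. The 980MB and 1759s figures are the ones quoted above, and the factor of 20 is the 20 PageRank iterations in the benchmark:

```python
# Throughput implied by the 128-core Spark PageRank numbers quoted above.
dataset_mb = 980    # compressed uk-2007-05 input size, as quoted
seconds = 1759      # 128-core Spark PageRank time
iterations = 20     # PageRank iterations in the benchmark

# MB of *input* consumed per second (counting the dataset once).
input_rate = dataset_mb / seconds

# MB actually scanned per second (the input is traversed 20 times).
work_rate = dataset_mb * iterations / seconds

print(f"input rate: {input_rate:.2f} MB/s")  # -> input rate: 0.56 MB/s
print(f"work rate:  {work_rate:.1f} MB/s")   # -> work rate:  11.1 MB/s
```

So even crediting the cluster with all 20 passes over the data, it’s scanning roughly 11 MB/s across 128 cores, which is the heart of the complaint.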
> there is a lot of value in standardising and centralising the processing of the data. Sharing the results and sharing new data becomes much harder when everyone has it on their laptops.
Can you maybe just use git-annex or something?