This is probably the biggest issue I have with the big data fad: people who don’t have big data think they do.
And it’s perfectly understandable and sometimes our fault.
Here’s an illustration: I consulted for a search project at a larger publisher. I made the error of asking “do you have lots of data?” to which they replied: “yes”.
When I arrived, I asked about the actual size: 100,000 documents, about 1GB. That’s… well… not a lot for a modern machine. It turned out we were talking on different scales: 100,000 articles is “a lot” for a publishing house; it’s roughly a decade’s output of a larger magazine. In technical terms, it’s nothing. And people, developers included, get caught by that and suddenly start setting up Hadoop clusters.
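A quick back-of-envelope check, using just the numbers from the anecdote, makes the point:

```shell
# ~1 GB spread across 100,000 documents: how big is each one on average?
bytes_total=$((1024 * 1024 * 1024))   # 1 GB
docs=100000
echo $((bytes_total / docs))          # ~10 KB per document
```

About 10KB per document: the entire corpus fits comfortably in the RAM of a laptop, never mind a cluster.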
So, my first job was telling them that, on the technical scale, they had almost no data. Apparently I was the first to do so.
Here’s a pretty good article from a few years ago making the same point.
I worked for a few years at a company doing almost all its analytics with shell scripts, which gives me a healthy respect for how fast the standard shell tools are. (There are a wide variety of non-performance issues with such a setup and I wholeheartedly recommend that you not follow their example, but the single-thread performance is very difficult to beat.) Eventually, you’ll need to develop parallelism and break out into a compute cluster, but that point happens much later than a lot of people think.
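To sketch what that looks like (the company’s actual scripts aren’t shown here, so the file name and field layout below are made up for illustration): a single awk process streams through a file once and aggregates it, no cluster required.

```shell
# Hypothetical example: sum per-category view counts from a
# tab-separated log, the kind of job shell tools chew through quickly.
printf 'news\t10\nsports\t5\nnews\t3\n' > views.tsv

# One streaming pass: awk accumulates a sum per category.
awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' views.tsv | sort
# prints:
# news 13
# sports 5
```

On data that fits on one disk, a pipeline like this routinely outruns a distributed setup, because it pays none of the serialization and coordination overhead.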
Sounds like a case of premature technical scaling; it’s a pretty easy trap to fall into.