1. 5
  2. [Comment removed by author]

    1. 6

      Hadoop isn’t a database. Hadoop at its core is a map reduce framework. It’s not about “scaling to 10 TB”, it’s about processing data.

      Here’s a simple example:

      You generate 100GB of log data a day. For some people, that is a ton of data; for others, not very much. I need to be able to find information in that 100GB within a couple of hours. A Python script or a Java app on a single box won’t cut it. They won’t get me the answer I need in the amount of time that I need it. So I spread the load across many machines. I’m getting the answer that I need, but I’m in the painful scenario of dealing with my homegrown cluster of machines.

      Hadoop is just a standard framework for running map reduce jobs on a cluster. A nice chunk of hard work has been done for you. Lots of rough edges have been shaved off.
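
      To make the “map reduce job” idea concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code: the log format (`<timestamp> <path> <status>`), the function names, and the thread pool standing in for cluster workers are all assumptions made up for illustration.

      ```python
      from collections import Counter
      from concurrent.futures import ThreadPoolExecutor
      from functools import reduce


      def map_chunk(lines):
          """Map step: count status codes in one chunk of log lines."""
          counts = Counter()
          for line in lines:
              fields = line.split()
              # Hypothetical log format: "<timestamp> <path> <status>".
              if len(fields) >= 3:
                  counts[fields[2]] += 1
          return counts


      def reduce_counts(left, right):
          """Reduce step: merge two partial counts into one."""
          return left + right


      def count_statuses(chunks, workers=2):
          """Run the map step over every chunk, then reduce the partial results.

          The thread pool is only a stand-in for the cluster workers; a real
          framework would ship each chunk to a different machine.
          """
          with ThreadPoolExecutor(max_workers=workers) as pool:
              partials = list(pool.map(map_chunk, chunks))
          return reduce(reduce_counts, partials, Counter())
      ```

      The logic is the easy part. What a framework takes off your hands is everything around it: splitting the input, shipping chunks to machines, retrying failed workers, and collecting the partial results.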

      It doesn’t matter what you think as an outsider about someone’s choice to use Hadoop or other technologies unless you understand their problem. I have data, I need to get an answer in X period of time, and to do that I need a cluster. You don’t get to decide what someone else’s X is. Are there people who could use something other than Hadoop? Sure. Maybe, as some “look at me” blog posts and articles like to point out, “that problem could be solved in Excel”. Maybe it could; maybe a single Python script could handle it. But if that one script dies? What then? No answer. Maybe that person went with Hadoop after looking at the problem and deciding that a job tracker as a SPOF was way more likely to fail them than just a single python script.

      How about we stop assuming our colleagues are idiots and try to understand why people make the tradeoffs that they do? Yes, Hadoop is a beast in many ways. Yeah, operationally, it can be a giant pain. But there are plenty of reasons people want or need to use it that cynics sniping from the sidelines simply won’t see.

      1. 3

        > How about we stop assuming our colleagues are idiots and try to understand why people make the tradeoffs that they do? Yes, Hadoop is a beast in many ways. Yeah, operationally, it can be a giant pain.

        I have met cases where people want to go with Hadoop based on hype (or to gain career experience) when they have a dataset < 10GB. These people are not “idiots” per se, but they don’t always consider that traditional single-node technologies are the most suitable choice for many (if not most) projects. Articles like this can help keep perspectives in the right place, where sometimes a “boring old-fashioned RDBMS” is a valid approach.

        > You generate 100GB of log data a day. For some people, that is a ton of data; for others, not very much. I need to be able to find information in that 100GB within a couple of hours. A Python script or a Java app on a single box won’t cut it. They won’t get me the answer I need in the amount of time that I need it. So I spread the load across many machines. I’m getting the answer that I need, but I’m in the painful scenario of dealing with my homegrown cluster of machines.

        Curious what type of operations you perform that take that much time for < 5GB / hour of logging?

        1. 1

          I had a hard time listening to a lot of the sessions at Strata in Santa Clara this year. The enterprise vendors definitely smell money in the Hadoop ecosystem, and this has created a feedback loop which has resulted in a lot of “you need a Hadoop” type cargo cult behavior.

          I use a small Hortonworks cluster to process about 200 GB/day of video viewing and ad data from log files. We had a homebrewed distributed log processing system that was less functional than what we got for free (development-wise) by using Hadoop. Using Hadoop as a big dumb parallel hammer to apply functions to large data sets without having to write custom code is the sweet spot for my needs.

          1. 1

            > Maybe that person went with Hadoop after looking at the problem and deciding that a job tracker as a SPOF was way more likely to fail them than just a single python script.

            Did you mean “less likely”?

            1. 1

              I did.

              Sadly can’t edit now. :/