
  2. 25

    This really struck home. I’ve been on both sides of the fence and made a career specializing in collecting shitloads of data.

    Here are the real-world tiers of big data:

    First you get to the “this hundred-megabyte Excel spreadsheet loads slowly” scale. The next level up is the “I have to run a couple of awk/grep processes and split my file” scale. You might be lucky enough to reach the “a $100 hard drive could hold our company’s entire history worth of data” scale.

    Somewhere, several orders of magnitude later, you get to “statistically a piece of hardware often fails while I’m processing data on thousands of nodes and I need to schedule/retry idempotent jobs”.

    Most people put on their big data hats and shove their overgrown spreadsheet into a Cassandra cluster with R=N=3, pat themselves on the back and update their resume.

    However, the article does miss one thing: success at one order of magnitude doesn’t always preclude usefulness at a lower order of magnitude. Some solutions scale down quite well.

    1. 15

      Somewhere in between is “this is uncomfortably large and I need more I/O devices to load it and query it at a reasonable rate.” 4TB isn’t a lot of data, but it is a lot of data to query quickly and to load in under 6 hours. At that point you already need a few spindles/SSDs, and that’s before replication.

      I’m not accusing your comment of this (or even the article), but one thing that bothers me is the argument that goes along the lines of “Well, if your data can fit onto a 1TB disk…” No! Data density is THE ENEMY in non-archival data storage. A 1TB disk at 100MB/s takes about 2.8 hours to read end to end. Retrieval speeds aren’t increasing on spinning rust, and SSDs, while better, don’t (yet) fundamentally change the equation. SSDs are light-years better for random retrieval, but at the end of the day throughput is a scarce resource.

      If you need to store lots of data and only ever touch a subset of it (e.g. a well-indexed table, infrequently accessed data), great. But as soon as you’re pushing the terabyte level, analytics get pretty painful, as does normal OLTP – even with your index in RAM it’s still ~1 IO per query (at least). Even a 100GB full table scan takes 16–17 minutes assuming 100% of 100MB/s. Of course, if you can load it once and cache it in RAM it’ll be plenty fast.
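
      Back-of-the-envelope, those figures work out like this (a rough sketch in Python; the 100MB/s sequential rate is the same round number assumed above, and the helper is purely illustrative):

      ```python
      # Sequential-scan times at an assumed 100 MB/s per spindle.
      DISK_MBPS = 100

      def full_read_minutes(size_gb, spindles=1):
          """Minutes to stream size_gb gigabytes off `spindles` disks."""
          return size_gb * 1000 / (DISK_MBPS * spindles) / 60

      print(full_read_minutes(1000))     # 1 TB on one disk     -> ~167 min (~2.8 h)
      print(full_read_minutes(100))      # 100 GB table scan    -> ~16.7 min
      print(full_read_minutes(4000, 4))  # 4 TB over 4 spindles -> still ~167 min
      ```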

      1. 8

        I thought the point was that you can keep the whole 4TB in RAM and only reload from disk on total system failure.

        1. 4

          Exactly. Instead of engineering a crazy solution with lots of I/O, etc., just buy a bigger server with more RAM. You can get servers with ~6TiB of DDR4 and NVMe drives.

          Then you just need to worry about reboots, and even then the reload is fast. Lowball it at 4 NVMe x4 drives at 2GiB/s read each and you can read 4TiB back in about 8 minutes. Let’s round up to 10 minutes. Hell, add a factor of 3 on top to be a jerk; the Xeons that take this much hardware should do 8GiB/s easily.
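
          The arithmetic, spelled out (a quick sketch using the lowball numbers above):

          ```python
          # Reload time for 4 TiB from four x4 NVMe drives at ~2 GiB/s sequential read each.
          TIB_IN_GIB = 1024
          drives, gib_per_sec = 4, 2
          seconds = 4 * TIB_IN_GIB / (drives * gib_per_sec)  # 4096 GiB / 8 GiB/s
          print(seconds / 60)  # ~8.5 minutes; a 3x fudge factor still lands around 26 minutes
          ```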

          Buy two of the large servers and I’d say you’re pretty much set: throw hardware at the problem instead of getting cute with a crazy solution that scales horizontally.

        2. 7

          Excellent points, but you’re also reminding me that there’s another common syndrome in this world. It’s basically “your data is only big data because you’re storing it in a brain-dead way.”

          Compression and smart binary encoding can often reduce a big data problem to a laptop-sized problem. I’ve seen a ton of companies that only have big data because they write out extremely verbose JSON to long-term storage. The alternative is usually something like protobufs + gzip/snappy/etc.
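
          As a toy illustration (a sketch with made-up event records; a real pipeline would reach for protobuf/Avro plus snappy or zstd rather than hand-rolled struct packing):

          ```python
          import gzip, json, random, struct

          # A million fake "events": (user_id, timestamp, value).
          events = [(random.randrange(1_000_000), 1_500_000_000 + i, random.random())
                    for i in range(1_000_000)]

          # Verbose JSON, one object per line -- the "dump everything as JSON" habit.
          verbose = "\n".join(
              json.dumps({"user_id": u, "timestamp": t, "value": v}) for u, t, v in events
          ).encode()

          # The same records as fixed-width binary, standing in for a protobuf/Avro encoding.
          compact = b"".join(struct.pack("<IId", u, t, v) for u, t, v in events)

          for label, blob in [("json", verbose), ("binary", compact),
                              ("json+gzip", gzip.compress(verbose)),
                              ("binary+gzip", gzip.compress(compact))]:
              print(f"{label:12s}{len(blob) / 1e6:8.1f} MB")
          ```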

          [edit]

          I seem to have gotten a little off track with my rambling. I guess the overall point is that big data frameworks and many of the Google-sized solutions solve real problems that, at a certain scale, do eventually have to be dealt with. I just recommend that people try to smartly avoid that added complexity for as long as they can, not jump into it head first when their Excel instance starts to bog down.

          1. 2

            This is a great point. It’s one of the reasons things like Netezza took off: they actually split the data across many disks attached to small computers to reduce how much data has to move through any one disk, or the system as a whole, to produce a result.

        3. 19

          I feel like you could work a pithy quote out of this. Something like “Before you copy Google, remember that they have more devops interns than you have employees.”

          1. 3

            …from great schools willing to throw everything they have at the problems just because the employer has “Google” written on or near the building.

          2. 12

            If you go even smaller, do you really need a client/server DBMS, or is SQLite enough?

            1. 1

              When is SQLite better, though? Even for a trivial thing like indexing my music collection I’d far rather use Postgres – I’m already running it on my home machine, I already have backups set up, and I have a bunch of existing tools I can use to inspect it…

              1. 2

                Well, if you already have a nice Postgres setup like that, not a lot of reasons. But scp my.db myserver:my.db.backup is a lot more straightforward as a backup system.

                Here’s a concrete example for you: I was on a remote system and had a list of a few thousand files distributed across a cluster, along with things like hostname and other metadata. I wrote a quick schema in SQLite, loaded the data, and was immediately able to issue queries to find which files were not redundantly replicated, were not in the right location, were out of sync with other replicas, and so on. I then used SQL queries to generate shell commands that would repair the situation. I essentially used SQLite over awk when things got too complicated to do easily in awk.
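
                Roughly the shape of it (a sketch via Python’s sqlite3 module rather than the shell; the table, columns, and hosts are invented for illustration):

                ```python
                import sqlite3

                con = sqlite3.connect("scratch.db")
                con.execute("CREATE TABLE IF NOT EXISTS replicas (path TEXT, host TEXT, mtime REAL)")
                # ... bulk-insert the per-host file listings here ...

                # Paths that exist on fewer than three hosts.
                under_replicated = con.execute("""
                    SELECT path, COUNT(*) FROM replicas
                    GROUP BY path HAVING COUNT(*) < 3
                """).fetchall()

                # Turn query results straight into repair commands.
                for path, copies in under_replicated:
                    src = con.execute("SELECT host FROM replicas WHERE path = ?", (path,)).fetchone()[0]
                    print(f"scp {src}:{path} spare-host:{path}")
                ```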

                Could I have used Postgres? Sure. Was it installed? Who knows? My guess is no. But sqlite3 was installed, there was no password on the newly created ./scratch.db, and I could just get started.

                If you like your setup, that’s great. No reason to switch. I personally use SQLite for any local use case that doesn’t need the most sophisticated features of full engines like Postgres.

                Regarding tools to inspect it, SQLite actually has excellent interactive commands for this sort of thing, all starting with a dot. For me, .tables is easier to remember than the backslash command in Postgres (I think it’s \dt?). It’s also crazy easy to .import data, change the import .separator, set the .output file for a query, or change the output .mode.

            2. 9

              The statement “Did you know you can buy a terabyte of RAM for around $10,000?” led me to try to figure out whether you can buy systems with such huge amounts of RAM. It led me down the following rabbit holes:

              1. How Do You Program A Computer With 10 Terabytes Of RAM? (2015)
              2. Memory driven computing (2017, slides + audio)
              3. ZDNet’s article on HPE’s 160 TB system

              1. 10

                Well, there are a lot of high-memory systems out there. The DL580 alone can go up to 6TiB, and there are larger machines you can buy today without succumbing to “The Machine” memory marketing.

                https://www.hpe.com/us/en/product-catalog/servers/proliant-servers/pip.specifications.hpe-proliant-dl580-gen9-server.8090149.html

                But yes, in-memory databases like TimesTen are pretty common.

                1. 5

                  When dreaming of big hardware, just type three letters into the URL before the .com. You always see something impressive:

                  https://www.sgi.com/products/servers/uv/

                  1. 5

                    You can actually get a quad-CPU Opteron machine with 1TB of RAM from SuperMicro without breaking the bank (for certain values of “without breaking the bank,” of course) – around $30,000.

                  2. 5

                    The COST paper[0] discusses how frameworks like Hadoop are much more expensive for smaller datasets than they need to be, which seems to relate to this.

                    https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

                    1. 3

                      Generally agree, but there are operational costs to vertical scaling. That single DB is also a single point of failure, and achieving high availability is often just as hard as scaling horizontally. (Master/slave failover and backups may seem mundane, but there are plenty of examples of companies screwing them up.)

                      Something like Cassandra, Elasticsearch, or Kafka has redundancy built in, which is a big win. I think Spanner-style SQL databases could hit a real sweet spot.

                      As for SOA, I think it depends on what you’re working on. Sometimes breaking applications up into separate processes with well-defined interfaces can make them easier to reason about and work on.

                      As an application evolves over time, the complexity can grow out of control until anyone who touches the code breaks it. How often have new developers thrown up their hands, scrapped the product, started over, and wasted 6 months rebuilding what they had in the first place?

                      Maybe SOA could help with that by limiting the scope? (Though maybe better code discipline would achieve the same result?)

                      I guess all I’m saying is that good engineering practices can help smaller software too.