1. 21
  1.  

  2. 9

    Second, no; NoSQL does not scale better than SQL. At least not in the manner you think. MySQL is used by top-traffic websites like Facebook and Twitter, by high-traffic websites like GitHub and Basecamp, and by almost every PHP installation out there. If this is not scaling, then what is?

    Facebook’s and Google’s (pre-Spanner) MySQL setups, as far as I am aware, look NOTHING like your standard Rails/PHP CRUD app setup. They look a lot more like NoSQL schemas.

    1. 11

      Which suggests that it is flexible enough to obviate the need for many use-cases supposedly geared to NoSQL.

      1. 8

        Precisely. You only need to build a NoSQL database if you need a different storage backend. For example, Google built BigTable on top of GFS because they needed to control the size of SSTables, and merge and split them based on size. Building a NoSQL API is faster and easier, so that’s where they started. Notice they then added back a full SQL query interface with Dremel.

        If they hadn’t cared about storing SSTables in GFS for MapReduce, they could have easily implemented the whole thing on top of MySQL. You can trivially implement the SSTable interface like so:

        create table sstable (
            k blob,
            v blob,
            primary key (k)
        );
        

        Or you could use varbinary instead of blob, whatever fits your needs. Or bytea in PostgreSQL.
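        To make the mapping concrete, here is a minimal sketch of that interface, using Python’s sqlite3 as a stand-in for MySQL (the helper names are made up for illustration):

```python
import sqlite3

# SSTable-style key-value interface over a plain SQL table; sqlite3
# stands in for MySQL here, and the helper names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("create table sstable (k blob primary key, v blob)")

def put(k, v):
    # Upsert, analogous to MySQL's INSERT ... ON DUPLICATE KEY UPDATE
    db.execute("insert or replace into sstable (k, v) values (?, ?)", (k, v))

def get(k):
    row = db.execute("select v from sstable where k = ?", (k,)).fetchone()
    return row[0] if row else None

def scan(start, end):
    # Ordered range scan, the core SSTable read pattern, served
    # directly by the primary-key index
    return db.execute(
        "select k, v from sstable where k >= ? and k < ? order by k",
        (start, end)).fetchall()

put(b"apple", b"1")
put(b"banana", b"2")
assert get(b"apple") == b"1"
assert [k for k, _ in scan(b"a", b"c")] == [b"apple", b"banana"]
```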

        1. 2

          Notice they then added back a full sql query interface with Dremel.

          But this is just the query interface. It’s not like they wrote Dremel on top of MySQL because MySQL == Scale.

          If they hadn’t cared about storing SSTables in GFS for MapReduce, they could have easily implemented the whole thing on top of MySQL. You can trivially implement the SSTable interface like so:

          What? Just because you’ve recognized that a key-value store has this schema doesn’t mean it’s easy to implement something that fills the same niche as BigTable on MySQL. BigTable has a lot more semantics around it than just being a key-value store.

          1. 3

            BigTable DOES have a lot more semantics around it than just being a key-value store. Let’s examine them.

            Routing. BigTable key ranges are managed in a central database that maps key ranges to storage nodes. The client then queries the storage nodes directly for data, avoiding going through this central sharding database for every query. The sharding database is updated when SSTables are split or merged, or shuffled around in general. Storage nodes also keep a history of their recent sharding operations, so they can redirect clients who have out-of-date information. Clients that are very out of date are redirected back to the central sharding database to get fresh info.

            • the sharding and storage nodes could easily be implemented on top of MySQL
            • they aren’t purely because Google also wanted the SSTables in GFS for MapReduce
              • originally MapReduce jobs created the BigTable SSTables for the search index
              • until Google developed Percolator to incrementally update search indexes in BigTable
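            As a rough illustration, that routing layer is little more than an ordered map from split keys to nodes. The node names and split points below are invented; in a real system the map would live in its own (SQL) database and carry a version for redirecting stale clients:

```python
import bisect

# Toy range-to-node routing table, as described above. splits[i] is the
# first key of range i+1; a real system would persist and version this.
class ShardMap:
    def __init__(self, splits, nodes):
        assert len(nodes) == len(splits) + 1
        self.splits = splits
        self.nodes = nodes

    def node_for(self, key):
        # Binary search over the split points to find the owning node
        return self.nodes[bisect.bisect_right(self.splits, key)]

shard_map = ShardMap(splits=[b"g", b"p"], nodes=["db-1", "db-2", "db-3"])
assert shard_map.node_for(b"apple") == "db-1"
assert shard_map.node_for(b"grape") == "db-2"
assert shard_map.node_for(b"pear") == "db-3"
```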

            Consistency groups. BigTable tables can have certain columns grouped together so they can be updated atomically as a group.

            • consistent updates to multiple columns could easily be implemented with transactions
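            In SQL terms that is just an ordinary transaction. A small sketch, with an invented table (sqlite3 standing in for MySQL):

```python
import sqlite3

# "Consistency group" semantics via a plain transaction: the two
# columns below either update together or not at all. Table and
# column names are invented for illustration.
db = sqlite3.connect(":memory:")
db.execute("create table profile (id integer primary key, name text, name_len integer)")
db.execute("insert into profile values (1, 'ada', 3)")

with db:  # one atomic transaction
    db.execute("update profile set name = 'grace' where id = 1")
    db.execute("update profile set name_len = 5 where id = 1")

assert db.execute("select name, name_len from profile").fetchone() == ("grace", 5)
```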

            Compression. BigTable SSTables are compressed on disk.

            • MySQL supports compressed tables
            • even if it didn’t, it would not be hard to run snappy on a blob before you inserted it into MySQL
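            A sketch of that application-side compression, with zlib standing in for snappy since it ships with Python:

```python
import sqlite3
import zlib

# Compress values before insert and decompress on read; zlib is a
# stand-in for snappy here.
db = sqlite3.connect(":memory:")
db.execute("create table sstable (k blob primary key, v blob)")

def put(k, v):
    db.execute("insert or replace into sstable values (?, ?)", (k, zlib.compress(v)))

def get(k):
    row = db.execute("select v from sstable where k = ?", (k,)).fetchone()
    return zlib.decompress(row[0]) if row else None

put(b"key", b"value" * 100)
assert get(b"key") == b"value" * 100
# The stored blob is far smaller than the 500-byte original
stored = db.execute("select v from sstable where k = ?", (b"key",)).fetchone()[0]
assert len(stored) < 100
```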

            Timestamp dimension and immutable storage. BigTable is append only, every “update” is actually an insert at the current timestamp. This makes it way easier to manage SSTables, because they can then be made immutable, and then you don’t have to write and test a mutable storage engine. But of course Google wrote LevelDB so they don’t exactly have a problem doing that. Really, the biggest reason SSTables are immutable is because, again, Google wanted to store them in GFS.

            • in MySQL just add a timestamp column to the table, at the tail of the index
            • it wouldn’t be necessary to manage immutable SSTables with mysql
              • seriously, they are only like that for GFS
              • it was silly that Cassandra did that, since they don’t store in GFS
                • it’s been a pain point for them since the beginning
                • it’s because they copied Google blindly
                • then they moved to leveled storage when they realized their mistake
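            A sketch of that layout: the timestamp sits at the tail of the primary key, every “update” is an insert, and a read takes the newest version (table name invented, sqlite3 standing in for MySQL):

```python
import sqlite3

# Append-only, timestamp-versioned key-value table; the (k, ts)
# primary key keeps all versions of a key adjacent and ordered.
db = sqlite3.connect(":memory:")
db.execute("create table kv (k blob, ts integer, v blob, primary key (k, ts))")

def put(k, ts, v):
    db.execute("insert into kv values (?, ?, ?)", (k, ts, v))

def get_latest(k):
    # Served by the (k, ts) index: seek to k, read the highest ts
    row = db.execute(
        "select v from kv where k = ? order by ts desc limit 1", (k,)).fetchone()
    return row[0] if row else None

put(b"k", 1, b"old")
put(b"k", 2, b"new")
assert get_latest(b"k") == b"new"
```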

            So you’re right, it doesn’t make it easy to implement BigTable on MySQL. But it’s definitely easier because a lot of the storage layer functionality is there for you. All of the extra features of BigTable can be implemented on MySQL, except immutable storage compatible with GFS and MapReduce. If you have other semantics you want me to address, I will happily explain how they could be implemented on MySQL and why they aren’t because of the GFS requirement.

            1. 1

              If you have other semantics you want me to address, I will happily explain how they could be implemented on MySQL and why they aren’t because of the GFS requirement.

              But I’m not arguing about whether something could be done. Writing that something is possible in a comment on the internet is quite different from the reality. For example, at the read and write load of a BigTable cluster, I am not convinced that getting the time dimension is just a matter of adding a timestamp column and an index. BigTable doesn’t have to deal with the generic case and can optimize just this case for better performance.

              One interesting thing you left out was resharding and splitting, which is where some of the dynamic scalability benefits come from. MySQL offers nothing for that, AFAIK.

              (A small note on your compression point: BigTable offers block-level compression, which means multiple rows that are similar will compress together, and compress well, which is not the same as compressing each row individually, at petabyte tables that can make a big difference).

              Google could have modified the underlying MySQL storage format to do all of the SSTable things as well, but they didn’t. I don’t know why they didn’t. Maybe it says something about MySQL, maybe not. I think you’re being far too optimistic in your belief that because you’ve found a mapping of concepts, the performance and scalability come with it.

              1. 3

                But writing that something is possible in a comment on the internet is quite different than the reality.

                True. Luckily for me, Google, Facebook, Uber, and others have all done this.

                For example, at the read and write load of a BigTable cluster I am not convinced that getting the time dimension is just adding a timestamp column and an index.

                It is. You overestimate the per-node throughput of BigTable. The whole point is massive scalability of storage volume, not massive throughput on individual keys.

                You’re also not adding an index, you’re adding the timestamp column to your existing index. Joining the primary index and timestamp index would be ridiculous.

                One interesting thing you left out was resharding and splitting, which is where some of the dynamic scalability benefits come from. MySQL offers nothing for that, AFAIK.

                No, you’d have to implement it. My point is, it’s easier to implement BigTable-like routing on top of MySQL than implementing all of BigTable. You claimed it wasn’t as easy to build BigTable on MySQL, that’s what I’m addressing here.

                (A small note on your compression point: BigTable offers block-level compression, which means multiple rows that are similar will compress together, and compress well, which is not the same as compressing each row individually, at petabyte tables that can make a big difference).

                Yeah, but the uncompressed tablet is present on disk on the node that hosts the tablet, and any new keys are stored in memory and a log file until they’re large enough to be stored as a tablet.

                Google could have modified the underlying MySQL storage format to do all of the SSTable things as well, but they didn’t. I don’t know why they didn’t. Maybe it says something about MySQL, maybe not.

                MySQL’s execution engine requires a mutable storage engine. If you’re rewriting the storage engine and the execution engine, then all that’s really left is the query engine, which you don’t need in BigTable. That’s why they didn’t.

                I think you’re being far too optimistic in your belief that because you’ve found a mapping of concepts, the performance and scalability come with it.

                Why didn’t ad tech use BigTable? Why did they stick to their MySQL based solution? Performance. BigTable has pretty terrible latency. I’m not being “far too optimistic” about anything, I’m talking from experience working at Google with BigTable, and with other distributed systems elsewhere. MySQL’s InnoDB is highly performant, and scalability comes purely with the distributed layer. There are a lot of other things to optimize before InnoDB is your bottleneck, and BigTable SSTables are not an optimization on that.

        2. 4

          By this logic, since NoSQL and SQL DBs are implemented on flat files, we should be using flat files instead because they are even more flexible.

          This argument is nonsense. If you look at the amount of engineering that went into getting MySQL to work at Facebook scale (given all of the public information), they end up with something that, unsurprisingly, looks a lot like a NoSQL DB. So if that’s what you need, just go with a NoSQL DB that has been battle-tested more than an ad-hoc MySQL setup. If you read the Spanner paper, Google was using Spanner to replace the MySQL cluster for their ad system. One of the reasons they were replacing it with Spanner is that resharding the MySQL cluster took years. The companies using MySQL like this pre-date many NoSQL DBs, so the fact that they use MySQL may have more to do with those DBs not existing at the time than with a choice not to use them.

          To be clear: I think many people should use a SQL database. But precisely because they probably do not have scale and uptime issues that necessitate something else.

          1. 3

            This argument is nonsense, if you look at the amount of engineering that went into getting MySQL to work at Facebook scale (given all of the public information), they end up with something that, unsurprisingly, looks a lot like a NoSQL DB. So if that’s what you need, just go with a NoSQL DB that has been battle tested more than an ad-hoc MySQL setup.

            This presumes that you wouldn’t end up fighting a lot with the NoSQL solution to get it to work at scale. That has not been my experience with NoSQL solutions: you will end up fighting them just as much as you fought MySQL to get them up to something like Facebook scale. Hell, I’ve fought them just to get them to behave reliably at 1/1000th of Google or Facebook scale. Facebook scale is just plain hard, and no silver bullet will get you there without that fight.

            It’s not the query interface that’s the problem at Facebook scale, but the lack of one is a hell of a problem at any smaller scale. In my experience, the only real difference between building on an RDBMS vs building on a NoSQL is the flexibility to work well across a range of use-cases. Start with an RDBMS because there’s a 99.99XXXX% chance you’ll never hit Facebook scale. When you do hit that scale, yeah, you’ll fight it. You’ll fight anything at that scale.

            1. 1

              I agree with you that getting to Facebook scale is a technological marvel no matter how you cut it. Which I think is an even stronger argument for why holding it up as an example of what the average company should be using is ridiculous.

              There is a lot of space between your first CRUD app and Facebook though, and I don’t think it’s quite so easy to just claim that SQL fills that entire space. I don’t know what NoSQL DBs you’ve used; however, I have used Riak and Cassandra in anger. I have found both of them easier to manage than a SQL database when their strengths, namely availability and scalability, are what you need. Making MySQL operate like Cassandra is simply not worth it, and Cassandra has been battle-tested enough that scaling it up is well understood. A custom MySQL setup, not so much.

            2. 3

              By this logic, since NoSQL and SQL DBs are implemented on flat files, we should be using flat files instead because they are even more flexible.

              What? Flat files are less flexible, because they don’t have the huge number of features present in SQL databases.

              This argument is nonsense, if you look at the amount of engineering that went into getting MySQL to work at Facebook scale (given all of the public information), they end up with something that, unsurprisingly, looks a lot like a NoSQL DB.

              Yeah, so it’s perfectly viable to build a large distributed database on top of MySQL. All the extra code to do that—mostly sharding and replication logic—is necessary in a NoSQL database too. So why write an entire storage and network layer? All the distributed sharding logic can be implemented in the client (see BigTable), so just do that and keep MySQL. This is exactly what Google and Facebook did.

              It only looks like a NoSQL database because the operations you can do at scale are fairly limited, so when NoSQL later came around they didn’t bother with a SQL front end to implement select from where. Well, except CQL in Cassandra. And even then, SQL came back as the technology to do more complex operations developed, leaving NoSQL databases with a crippled query interface.

              If you read the Spanner paper, Google was using Spanner to replace the MySQL cluster for their ad system. One of the reasons they were replacing it with Spanner is because resharding the MySQL cluster took years.

              This is not a shot against MySQL. This is a testament to the flexibility of SQL databases: Google used this MySQL based system for many years before the engineering effort of building a new system was worth it.

              The companies using MySQL like this pre-date many NoSQL DBs so the fact that they use them might have more to do with the fact that they did not exist rather than they chose not to.

              Exactly, SQL databases were flexible enough to use to implement a distributed database. You’ll notice that Google didn’t move ad tech to BigTable once that was available. Instead they continued to use their MySQL based system. Did you ever stop to think why?

              To be clear: I think many people should use a SQL database. But precisely because they probably do not have scale and uptime issues that necessitate something else.

              Yeah, but your bar is probably way off. You need a really huge database before SQL becomes unviable. Operating a NoSQL cluster is only worth it cost-wise if you truly have a tremendous amount of data. Seriously, an x1.32xlarge instance on AWS is $35k a year (128 cores, 2TB RAM, 4TB SSD). That’s cents on the dollar compared to the cost of even a single additional operations engineer to manage your NoSQL cluster. And that doesn’t include the time saved by engineers writing SQL queries instead of manually joining and grouping data in their app.

              As for uptime, you can only get perfect uptime on a NoSQL database with a full operations team dedicated to managing that cluster. This is worth it if you’re Apple, and you run 10k Cassandra nodes. But it takes a lot to really get to a scale where it’s actually cost-effective.

              1. 3

                I have some responses to specific comments you made below, but they are more because I can’t help myself than because they are super important. I read your post several times, and I believe that, rather than making a complete narrative, you’re responding to my specific claims and pointing out that what I describe is possible on MySQL, rather than addressing whether one should do it and what it would cost. I don’t mean to imply you’re doing this on purpose.

                If a company came to me asking for recommendations, here is what I would tell them:

                Start with a SQL database with replication unless it is immediately clear that your problem does not fit one. If you get to the point where write-availability is extremely important, or you’ve outgrown the amount of disk that’s reasonable on a single machine, or you’re hitting write-throughput limits on a single machine, migrate the necessary data over to a distributed database such as Riak or Cassandra, or maybe Cloud Bigtable. If global consistency is really important, maybe consider Spanner. I would not try to turn my SQL database into something with those semantics, because I believe it is expensive in terms of engineering time and operations: there are a lot of edge cases that are not immediately obvious, and those distributed databases have already resolved them.

                The above is based on my experiences, so for a reader of this it depends on who they trust more. I would urge any reader to deeply investigate what it takes to make a SQL database look and operate like a Riak or a Cassandra. Very likely one can get a 10% solution up and going in a week or two. But the remaining 90%, my experience tells me, will be a lot of work.

                Here are some specific comments I had but I don’t think they are as important as the complete narrative I gave above.

                What? Flat files are less flexible, because they don’t have the huge number of features present in SQL databases.

                Oh, but all it takes is a little bit of engineering to turn a file into a SQL database. Just like it only takes a little bit of engineering to turn MySQL into a highly-available, sharded, multi-master (? depending on your NoSQL flavor) database.

                That’s cents on the dollar compared to the cost of even a single additional operations engineer to manage your NoSQL cluster.

                I’m not sure exactly what you mean here. It’s not as if a SQL database, especially the sharded MySQL thing you’re proposing, doesn’t need an operations team either. Your comparison seems to be SQL = free, NoSQL = cost, but they both cost something, and a bespoke SQL setup can cost quite a bit.

                And that doesn’t include the time saved by engineers writing SQL queries instead of manually joining and grouping data in their app.

                But if you have a sharded, multi-master, highly scaled MySQL setup, you will have to write those joins by hand anyway, unless you have chosen a very particular sharding scheme to make them easier; and if you ever change that scheme, you’re in for a long resharding project. You also have to write and manage the sharded, multi-master, bespoke MySQL setup itself.

                As for uptime, you can only get perfect uptime on a NoSQL database with a full operations team dedicated to managing that cluster.

                How is that any different than a SQL database?

                This is worth it if you’re Apple, and you run 10k Cassandra nodes. But it takes a lot to really get to a scale where it’s actually cost-effective.

                Apple has 100k+ Cassandra nodes, actually. I have been paid to manage Cassandra nodes, and the problem we actually ran into was that it was so stable that teams managing their own clusters forgot how to operate them between incidents. In general, operations costs were quite low. We ended up consolidating operations in a single team in order to keep the knowledge there and fresh. I’m not saying it’s easy, just that you are not painting a picture that matches my own experience. The scale I worked at was around 1k clusters and around 7k nodes.

                1. 2

                  As for uptime, you can only get perfect uptime on a NoSQL database with a full operations team dedicated to managing that cluster. This is worth it if you’re Apple, and you run 10k Cassandra nodes. But it takes a lot to really get to a scale where it’s actually cost-effective.

                  If you are thinking of availability, my experience is exactly the opposite. Making a consistent data store highly available requires a lot of engineering, and is ultimately limited by how widely you can spread replicas, which is typically not that widely, given the challenges posed by reaching distributed consensus. A good example of that is Spanner, which is by far not an easy thing to implement.

                  A valid question is how many use cases are there for small or mid size companies to consider dropping consistency to achieve high levels of availability, which I presume are not that many.

                  1. 2

                    challenges posed by reaching distributed consensus

                    Sigh. Yes, I agree with you, it is tremendously difficult to do, and only a few teams have done it successfully in a meaningful way that scales. But I think that’s missing the point.

                    When do you actually need fully automated distributed consensus and failover? Same as anything else you’d consider automating: whenever it’s more efficient to automate than perform manually. For 25k node clusters, definitely necessary. For a 12 node cluster? Give an ops engineer a pager and spend a week doing practice scenarios. It’ll be easier than testing a distributed system.

                    As for architecture, you can get a lot of mileage out of simple sharding and 3-node replica sets. See AWS DynamoDB; that’s literally how it works. Most people don’t need distributed databases, and of the people who do, most don’t need Spanner. With 4 shards of 3 replicas each, you can pretty easily make it to 40TB of capacity these days. And don’t forget, 40TB goes a lot farther in a SQL database with packed rows than in a NoSQL database with string keys embedded in every row.
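                    A back-of-the-envelope sketch of that layout (the shard count, node names, and the ~10TB-per-shard figure are all illustrative):

```python
import hashlib

# Hash a key to one of 4 shards; each shard is a 3-node replica set.
SHARDS = 4
REPLICA_SETS = [[f"shard{s}-replica{r}" for r in range(3)] for s in range(SHARDS)]

def replicas_for(key):
    shard = int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % SHARDS
    return REPLICA_SETS[shard]

# 4 shards at roughly 10TB of usable disk each is ~40TB of capacity
assert 4 * 10 == 40
assert len(replicas_for(b"user:42")) == 3
```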

                    A valid question is how many use cases are there for small or mid size companies to consider dropping consistency to achieve high levels of availability, which I presume are not that many.

                    I definitely agree there. I always see the pattern of engineers thinking they need to scale massively, so they set up a Cassandra cluster or something. On 5 machines roughly as powerful as my laptop. And then spend ages trying to build an application that actually works on top of an inconsistent system. It’s not even frustrating anymore, I just accept the whole sad state of it. Maybe I spend too much time with tech hipsters.

          2. 6

            This post is no more than a rant with little (if any) effort in providing reasonable support for any claim:

            To “support” the claim that SQL is easier than NoSQL (whatever that is, I didn’t know it was a language) it provides one example of a SQL query vs the alleged equivalent in … MongoDB.

            It claims that NoSQL (all of them) doesn’t scale better than MySQL (maybe just this one?) because … Facebook, Twitter and some others use MySQL??

            It claims that ACID is universally what you want. Without any support for this. ACID is very complex to implement and inevitably, the abstraction leaks in certain situations. A common example is a careless administrative query that creates lots of locks and affects production queries increasing their latency; it turns out that using ACID effectively at scale is not easy either. And this:

            with SQL you can be 100% certain that either your exact intent gets reflected in the database or nothing happens at all.

            Is simply not true

            The truth is that which database you choose depends on your use case. If you choose blindly, your chances of being wrong are higher. This post encourages you to blindly choose MySQL. Posts like this don’t help you choose wisely.

            1. 0

              Scalability is not such a big concern for many, but high availability often is. This is something NoSQL databases typically do much better out of the box than a SQL database, and is a very valid reason for picking NoSQL.

              1. 2

                While fully distributed systems are easier in NoSQL environments (often by giving up things like ACID), I’d argue that high availability is easier with SQL: a log-shipping, primary/secondary-type solution can be rigged for transparent failover, without the complexity of managing a distributed system.

                If you need a distributed system (for throughput), then the high-availability of something like a Cassandra cluster is certainly preferable. But managing even a small Cassandra cluster is much more complicated than managing a SQL database with failover clones.

                1. 1

                  What is the state of the art for fail-over in MySQL or Postgresql?

                  1. 5

                    I like the approach that Joyent’s Manatee takes with Postgres HA. It offloads the job of tracking who the master is to Zookeeper and will automatically reconfigure the replication relationships accordingly.

                    I merely have a test cluster set up, but I know that Joyent uses it in anger. It underpins much (all?) of the persistent storage within the Triton system.

                    Manatee targets SmartOS and depends on ZFS, and this is how I run it, but I have seen a Linux port too.

                    (Edited for clarity)

                    1. 1

                      Great, that is about the best I came up with as well.