1. 37

I’ve been noticing a pattern: there’s always one production database server. It’s always a special box. The details change, but it’s mostly like this at every shop. In addition, certain ideas persist and work on many different tiers or layers but then stop or die at the database.

I list out some of the ideas that die and talk about each one. It’s a huge subject that can probably be picked apart, but I like the overall theme: the exception, the special case.

  1.  

  2. 47

    The things the article complains you can’t do in the database are only possible in stateless components, which is actually the special case. I mean, systems without state are of limited utility—you have to put it somewhere. So the database is the only “normal” part of the system. It just doesn’t look that way because there are all these stateless boxes on the diagram. It’s actually a big improvement that all the state has been pushed down into one component where you can keep an eye on it.

    1. 34

      The database IS the good idea that lets the team partition risk and coordination into a system that handles the freakishly hard exponentially-compounding problems of performant durable concurrent consistency.

      1. 1

        Having state doesn’t mean you can’t make the database a distributed component. You just need good mechanisms for sharding, replication, consensus, etc. This is a solved problem and the way most large databases work nowadays.

        1. 12

          I think calling it a “solved problem” downplays a lot of the consistency problems we have as an industry still.

          There are solutions, but they are largely incredible feats of engineering and toil.

          One need only look at the Jepsen tests to understand just how dire the situation can be, especially for tools that look to tackle hard problems in replicated database spaces.

          1. 10

            FoundationDB showed how building distributed systems with a simulator can push race discovery much earlier into the dev process, practically solving the issues that Jepsen finds in low-quality (yet ubiquitous) distributed systems that are produced today. A few other industries that create safety-critical software tend to lean heavily into things like discrete event simulation to find their bugs with fewer deaths/headaches/etc… And the hyperscalers are employing these techniques to a greater and greater extent over time because they are able to better understand the benefits of doing so. Open source databases will generally be crappier and crappier compared to the systems offered only as services through the hyperscalers over time. As flashy as any of the NewSQL systems seem, they are really incredibly naive, unreliable, and low-performance compared to the functionality of their closed-source inspirations running internally at cloud providers today.
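
            As a tiny illustration of the simulation idea (this is not FoundationDB’s simulator, just the flavor, with all names invented): deterministically enumerate the interleavings of two read-modify-write clients and you find the lost-update race long before anything reaches production.

            ```python
            from itertools import permutations

            def run(schedule):
                """Replay one deterministic interleaving of two read-modify-write clients."""
                store = {"counter": 0}
                local = {}
                for client, step in schedule:
                    if step == "read":
                        local[client] = store["counter"]
                    else:  # write back the incremented local copy
                        store["counter"] = local[client] + 1
                return store["counter"]

            steps = [("a", "read"), ("a", "write"), ("b", "read"), ("b", "write")]
            # Enumerate every ordering that keeps each client's read before its write.
            schedules = {s for s in permutations(steps)
                         if s.index(("a", "read")) < s.index(("a", "write"))
                         and s.index(("b", "read")) < s.index(("b", "write"))}
            results = {run(list(s)) for s in schedules}
            print(results)  # {1, 2}: some interleavings lose an update -- caught in simulation
            ```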

            Choosing how to shard without hitting perf bottlenecks will also eventually become increasingly automated and driven by basic control theory, but this will happen more from the systems offered as services through hyperscalers rather than the simpler systems that are available in open source.

            Data gravity will continue to cause state to be consolidated in hyperscaler datacenters. Cloud APIs will continue to give preferential interoperability to data that sits in the more expensive hosted databases. The profits of the cloud providers will continue to be focused on hosted storage, rather than its commodified compute complement.

            So, yeah, the database product industry will keep cranking out garbage, not because it’s impossible to avoid, but because their management doesn’t prioritize correctness generally. Open source database products generally don’t have to care about correctness because they are already the system of last resort for people who don’t want to pay for cloud providers or expensive licenses. Meanwhile the database service industry will keep eating the customer base that is fed up with being burned and is willing to pay a premium to think about other problems.

            1. 4

              You’re right, I was being a bit flippant. My point was that there are real-world products that do this, not just research systems.

          2. 1

            I think just by dissecting the database (data storage, query engine, transactor) you can gain a lot of knowledge about how to build optimal systems. For example, things that do not require transactional mutations or complex SQL queries can live in a simple k:v store.

          3. 7

            I like the article, but it seems mired in yesteryear. Today we’ve got Percona XtraDB Cluster, TimescaleDB, Citus for distributed PostgreSQL, CockroachDB – just to name a few.

            I’d be interested in hearing the war stories of folks running any of today’s cluster extensions for MySQL/PostgreSQL/etc.

            1. 24

              Percona Cluster and Citus require substantial operational expertise and hand-holding from the application code to work properly. Vitess is probably the least tricky of the database sharding layers, and even it requires lots of in-depth knowledge. You can’t just put these databases in Kubernetes / AWS Auto Scaling Groups / whatever and let ’em rip. Even NoSQL databases like Cassandra—designed from the ground up to be active-active—require dedicated provisioning and special care by ops teams.

              As a database person, I think the author is basically right. But I would instead say that good ideas have to face reality at some point. Reality usually includes concepts like “users don’t like data loss” and “distributed consensus is hard.”

              1. 4

                could someone take a moment to elaborate briefly on what kind of experience and special care these sharded sql systems need? i have seen this work happen from a distance but don’t quite know what the pitfalls and pain points are. how much of this is avoided when using “sharded database in a box” services, and how much can’t be insulated from the application layer?

                1. 3

                  I’m also curious about real-world stories. I looked around a bit at Vitess. Slack and Square Cash shared info about their migrations. Alas, incentives are probably against sharing all the bad parts, but e.g. Square talked about deadlocks in their migration.

                  There are nearly-inherent tradeoffs to sharding. Your choice of shard key for each table (user ID, product ID, etc.) has performance implications; queries spanning shards or using cross-shard indexing will be slower (more hops), and cross-shard transactions are much slower if supported, because of the coordination. You tend to need management infrastructure, e.g. metadata servers and something like etcd or zookeeper. Having more stateful boxes means you need smooth automated spin-up, monitoring, backups, and so on.
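
                  As a toy illustration of why the shard-key choice matters (everything below is hypothetical, not any particular system): a query whose predicate includes the shard key can be routed to a single node, while anything else has to scatter to every shard and gather the results.

                  ```python
                  import hashlib

                  # Hypothetical 4-shard setup; in-memory lists stand in for real database nodes.
                  SHARDS = {f"db-shard-{i}": [] for i in range(4)}

                  def shard_for(user_id: str) -> str:
                      """Route by hashing the shard key (user_id here)."""
                      digest = int(hashlib.sha1(user_id.encode()).hexdigest(), 16)
                      return f"db-shard-{digest % len(SHARDS)}"

                  def insert_order(user_id: str, product_id: str) -> None:
                      SHARDS[shard_for(user_id)].append({"user_id": user_id, "product_id": product_id})

                  def orders_for_user(user_id: str) -> list:
                      # Shard key in the predicate: one hop to one node.
                      return [o for o in SHARDS[shard_for(user_id)] if o["user_id"] == user_id]

                  def orders_for_product(product_id: str) -> list:
                      # No shard key: scatter to every shard and gather (more hops, slower, and
                      # cross-shard transactions/joins would need coordination on top of this).
                      return [o for shard in SHARDS.values() for o in shard if o["product_id"] == product_id]

                  insert_order("alice", "p1")
                  insert_order("bob", "p1")
                  print(orders_for_user("alice"))    # single-shard lookup
                  print(orders_for_product("p1"))    # cross-shard scatter/gather
                  ```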

                  Seems like there’s work/complexity/resource overhead even for well-packaged sharding, but 🤷‍♀️. When you need it, you need it, and running a larger monolithic database isn’t worry free either. The growth of mature, relational distributed database tools (and some experience at the edge of when a single DB makes sense) has moved me a bit from “stay monolithic until you can’t do otherwise” (that’s a long time!) to thinking you might look at options when you’re feeling real growing pains even if you could stretch it out longer.

                  1. 10

                    I work at Cash and have built multiple applications on top of Vitess. Everything you said matches my experience.

                    • Sharding is hard. Query pattern lock-in.
                    • Sharding is hard. Cross-shard anything is strongly NOT recommended.
                    • Sharding is hard. The application will have a lot of special code for it.

                    For Vitess in particular:

                    • Running Vitess and MySQL more than doubles operational complexity.
                    • Vitess is not popular. Good luck 🤞🏾
                    • I found the admin tooling for Vitess very poor.

                    We’ve deprecated Vitess for all new services. Most teams have moved off it … though a couple critical services linger. We’ve got dedicated teams working on them.

                    We’re increasingly an Aurora and DynamoDB shop now.

                    1. 5

                      Find an alternative to DynamoDB before you outgrow it. Once you hit that wall, it is insanely expensive in engineering effort to work around.

                      I helped run one of the largest DynamoDB tables in all of AWS and we ran into such fun problems as “our backup restore was cancelled because they didn’t have cap” and “autoscaling is failing because it’s taking so long to scale it’s timing out.”

                      The cloud is fun until it runs out.

                      We just use S3 today; it works much better.

                      1. 2

                        How do you use S3 as a replacement for DynamoDB?

                        1. 1

                          You treat the entire thing as a KV store. Your bucket key path is whatever you were using for dynamo. Cache the heck out of data in a redis cluster and you should be good to go.

                          I know this sounds simple, and it is, but only because I don’t have all the details; another team did it. It took them a few months, but the outcome was really good.
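
                          For the curious, a minimal sketch of how that could look (the bucket name, key layout, TTL, and Redis endpoint are all made up; the team’s actual implementation surely differs):

                          ```python
                          import json
                          import boto3
                          import redis

                          s3 = boto3.client("s3")
                          cache = redis.Redis(host="my-redis", port=6379)  # hypothetical cache cluster endpoint
                          BUCKET = "my-kv-bucket"                          # hypothetical bucket
                          TTL_SECONDS = 300

                          def put_item(key: str, value: dict) -> None:
                              # The S3 object key plays the role of the old DynamoDB partition/sort key.
                              body = json.dumps(value).encode()
                              s3.put_object(Bucket=BUCKET, Key=key, Body=body)
                              cache.setex(key, TTL_SECONDS, body)

                          def get_item(key: str):
                              cached = cache.get(key)
                              if cached is not None:
                                  return json.loads(cached)
                              try:
                                  body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
                              except s3.exceptions.NoSuchKey:
                                  return None
                              cache.setex(key, TTL_SECONDS, body)
                              return json.loads(body)
                          ```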

                        2. 1

                          To share a bit of “hidden knowledge”, this is true for everything AWS (and I suspect most cloud providers) has, not just DynamoDB. We’re one of AWS’s largest customers at $WORK and we frequently run into all sorts of scaling issues with their services. But we also do ensure that data that sits in DynamoDB is small and doesn’t run into hot shard issues. Larger data usually sits in dbs that we manage which are plagued by all the problems everyone in this thread is discussing.

                        3. 1

                          How’s Aurora working out for you? We’re considering using the PostgreSQL version.

                          1. 2

                            We use the MySQL version and not the serverless offering. It’s given me no new pain when contrasted against MySQL instances hosted in our data centres. It’s mostly taken away vertical scaling pain.

                            I’m waiting for the other shoe to drop.

                          2. 1

                            Ah, wow, that changed my outlook on Vitess substantially, even having known parts of it in theory. I wish info like this–not the launch post type stuff but what people learn from day-to-day use–was public more often, though I get the reasons it’s hard for that to happen.

                      2. 1

                        One of my clients uses a relatively small PXC (Percona XtraDb Cluster, for those unaware of the “official” name) setup, with three nodes.

                        It has some operational overhead in terms of the service itself obviously, but I’m struggling to think of many instances where application code has to allow for the clustered nature (or side effects thereof) specifically.

                        Could you elaborate on what you meant here?

                      3. 2

                        You didn’t mention it, but TiDB gets my vote.

                        1. 2

                          We’re experimenting with TiDB. What’s your experience?

                          1. 1

                            Overall positive. Easy to use for sure. I need more experience with stability and reliability.

                      4. 6

                        This is way out of date, at least from the perspective of large enterprises. This was the pain point they ran into 10-15 years ago and is where “NoSQL” came from — or more accurately, non-relational databases.

                        Big systems use heavily distributed databases with dozens or hundreds of server nodes, often across multiple data centers. I don’t just mean Google or Amazon, but pretty much every Fortune 500 company and their equivalents worldwide.

                        There was a thread last week where some of y’all were heaping disdain on “NoSQL” systems, like “most people who use them don’t even know what schema migration means.” Uh huh. My $dayJob is at a big NoSQL vendor and that’s the complete opposite of what our customers are like.

                        1. 3

                          All true. But those systems also have more application-side constraints than using a single MySQL or Postgres instance. I have used BigTable and writing application code for it absolutely sucks ass. Spanner is much better, but still has more pain points and complexity than a single instance SQL database.

                          Either way, I believe this post primarily addresses conventional SQL databases:

                          There are alternate daemons and methods (NewSQL) but here I mean common relational SQL databases.

                          Even commercial SQL databases with good operational tools can’t eliminate the sharding-is-hard problem. At some point you have to pick a shard key, write queries against sharded tables, and reshard when your dataset grows. Many perfectly acceptable single-node queries become n^2 shuffles when moving to a sharded schema.

                          Big companies tackle these problems with ops teams and engineering expertise. I think that’s what the author means: you have to implement database-specific solutions for all these problems.

                          1. 1

                            BigTable and Spanner are themselves pretty old tech in this space. Not to shill for my employer, but Couchbase Server supports SQL queries (it’s actually a close cousin called N1QL that supports nested JSON data) while remaining highly scalable. Most of the time when you need more capacity you can just plug in some more server nodes and edit a few settings in the admin UI to hook them up.

                            1. 3

                              BigTable is definitely old, but Spanner isn’t. Spanner supports the full gamut of SQL.

                              Most of the time when you need more capacity you can just plug in some more server nodes and edit a few settings in the admin UI to hook them up.

                              BigTable and Spanner both work this way, but that’s just how everything at Google works.

                              But what you’ve described isn’t the same as throwing all the Couchbase Server instances into an AWS auto scaling group and having it Just Work™️ like a stateless app server would work. You have to do special management just for Couchbase Server. That’s the author’s point.

                              Even though Spanner is a cutting edge, highly scalable “NewSQL” engine, it’s still more complex to use and understand than a single MySQL or Postgres instance. And the easy parts aren’t fundamentally easy, they’re easy because Google engineers have spent substantial time making Spanner easy to operate. Same goes for Couchbase engineers working on the admin UI and corresponding server features.

                        2. 5

                          It is not the databases that ruin good ideas; it is the fear of changing a vital component of the production system that does. Sometimes even a database migration feels like a heart transplant, and the powers that be require zero downtime. Therein lies the fear, the anxiety, and the killer of any idea.

                          1. 4

                            Zero (intentional) downtime is a much more expensive requirement than I think people give it credit for, mostly because it’s usually more of a death-by-a-thousand-cuts kind of expensive than a put-a-big-project-on-the-roadmap kind of expensive.

                            In some cases it’s a legit requirement, but I suspect if the ongoing cost were more obvious, it’d be a much less frequent requirement.

                            1. 3

                              Zero (intentional) downtime is a much more expensive requirement than I think people give it credit for

                              I think if folks don’t understand this, they should look at the uptime of their own systems as a start. While I don’t have data to back this up, I feel comfortable saying that the uptime/effort curve is logarithmic, such that past a certain point only an exponential increase in effort will get you a linear increase in uptime. Knowing that, you should be able to look at a system you run and come to a rough idea of how much effort it would take to scale its uptime.
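
                              A rough back-of-the-envelope to make the downtime side of that concrete: each additional nine cuts the allowed downtime by 10x, while the effort to get there tends to grow much faster than that.

                              ```python
                              HOURS_PER_YEAR = 24 * 365

                              for nines in range(1, 6):
                                  availability = 1 - 10 ** -nines                  # 90%, 99%, 99.9%, ...
                                  downtime_hours = HOURS_PER_YEAR * (1 - availability)
                                  print(f"{availability:.3%} uptime -> {downtime_hours:8.2f} hours of downtime per year")
                              ```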

                              1. 3

                                I think that is totally true if you’re talking about minimizing unintentional outages, but it seems much less clear to me that it applies to intentional ones, if only because it’s possible to actually get all the way to zero intentional downtime without infinite effort.

                                One can argue that there’s no good reason to distinguish the two, that downtime is downtime whether it’s scheduled or not. But I think in some contexts there’s a meaningful difference between, “Our service will be unavailable from 2AM to 4AM on Thursday of next week,” and, “Oops, something broke all of a sudden and it took us two hours to recover.”

                                1. 1

                                  Fair enough. I work in a context where we have guarantees around uptime, planned or unplanned, which is where my comment came from. But yes, I do think it’s a lot easier to drive intentional downtime to zero.

                              2. 1

                                I think there are definitely ways to mitigate issues with database change in particular:

                                1. Just use your error budget. It’s there to be used.
                                2. Make a new database with the new schema, deploy the system writing to both, run a migration process behind the scenes then turn off the original database. Have some error budget in the bank for a rollback.
                                3. Build tolerance into the system interfaces to be able to deal with the old and new scenario. This is what we do at Google constantly, we just build out a protocol buffer with more and more fields and mark the deprecated fields as such, then file bugs to get rid of those fields from the code when we can.

                                State doesn’t have to be scary, especially with databases. I worry about things like corrupt files, cascading failures and security holes. Basically anything that requires specialist knowledge and foresight, which no one can have all of.
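
                                Going back to option 2, a minimal sketch with all names invented for illustration (a real migration also needs a backfill job, verification, and idempotent writes):

                                ```python
                                # Dual-write during a zero-downtime schema migration (illustrative only).
                                class UserRepository:
                                    def __init__(self, old_db, new_db, read_from_new: bool = False):
                                        self.old_db = old_db              # existing schema, current source of truth
                                        self.new_db = new_db              # new schema, backfilled in the background
                                        self.read_from_new = read_from_new

                                    def save(self, user: dict) -> None:
                                        # Phase 1: write to both stores so either can become the source of truth.
                                        self.old_db.upsert(user)
                                        self.new_db.upsert(self._to_new_schema(user))

                                    def load(self, user_id: str) -> dict:
                                        # Phase 2: flip reads to the new store once the backfill has caught up;
                                        # keep the old store warm so a rollback only costs some error budget.
                                        db = self.new_db if self.read_from_new else self.old_db
                                        return db.get(user_id)

                                    @staticmethod
                                    def _to_new_schema(user: dict) -> dict:
                                        # Whatever shape the new schema needs, e.g. splitting a name column.
                                        first, _, last = user["name"].partition(" ")
                                        return {**user, "first_name": first, "last_name": last}
                                ```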

                                1. 3

                                  Oh, it is absolutely possible to mitigate the risks. But everything you mentioned requires additional work compared to a “shut the whole system down, upgrade all of the components, start it up again” model.

                                  Google easily falls into the “sometimes it’s worth the cost” category, of course. Taking Google down for a couple hours on a Saturday night is obviously totally out of the question.

                                  But I’ve worked at B2B shops that insisted on zero-downtime code deployment and were willing to have the engineering team do the extra work to make it happen, even though none of the customers would have noticed or cared if we’d taken the system offline in the middle of the night. In one case, our system was a front end for an underlying third-party service that had regular scheduled downtime, but we had to stay online 24x7 even though our system was completely useless during the other system’s outages. “Our system never goes down!” was apparently a valuable enough talking point for our sales team that management was willing to pay the price, even though it added no actual value for customers.

                            2. 4

                              The complexity of getting the semantics of Active-Active right is out of reach for most tech shops. People should learn to work with a sharded database, but even that is some kind of forbidden knowledge depressingly often.

                              1. 3

                                It’s always a special box.

                                That’s, broadly, operationally immature. You can do much more than a pet DB. Also, you’re thinking of SQL as a simple input/output box; it’s not. A DBMS, used by an expert, is a full-fledged data management framework. Arguably you could push far more of many applications into an RDBMS and have much thinner server tiers.

                                1. 3

                                  Vitess looks really interesting (but see below) for scaling out a lot of projects targeting MySQL. Notably, it’s mostly MySQL-compatible, so you can just run some framework targeting MySQL on it.

                                  (EDIT: see quad’s comment, though. Significantly less sunny picture than I paint below and they’ve actually used it!)

                                  It supports sharded keyspaces, and can sometimes execute queries/joins within one shard. It can make trickier queries Just Work with scatter/gather and other execution strategies. It has some operational tools for moving tables, doing backups, etc. It has some large users and there seems to be a developed operational approach around it (e.g. shards of 256GB instead of as big as the server can support). There are some neat primitives like VReplication that I suspect you could do useful stuff with.

                                  There can be blockers–it has its own SQL parser so it does not support quite 100% of what MySQL does (there are examples in docs and the test suite). The cost/benefit depends a lot on whether each of those features gives you something new, replaces something you already have in another form, or simply isn’t useful. [And adding in the real-world experience quad mentioned, I’m a lot more hesitant than before! Interesting tech, at least.]

                                  1. 3

                                    Even if you’re just looking at it from the limited perspective of an N-tier web stack, maybe a less frustrating way to visualize it is that the database is the box where you’ve shoved all the ugly stateful code that would otherwise clutter up the rest of your system and prevent you from scaling anything without a ton of headaches.

                                    If you got rid of the database, state would have to move to your app tier, and then the app tier would have all the same stateful behaviors the database does.

                                    1. 2

                                      There are so many interesting problems in software engineering. Stateless systems are interesting in their own right, but introduce state and, for me, that’s where things get really interesting.

                                      Rather than “the database ruins all good ideas”, I would instead say “scaling state exposes all bad ideas”.

                                      FTA:

                                      Horizontal scaling doesn’t work on the database tier.

                                      I disagree; horizontal scaling is the only sane way to scale both the compute and storage portions of a database tier. Scaling either vertically is usually a losing game. This is how AWS DynamoDB and the myriad internal services that power it scale internally: horizontally, not vertically. Same with AWS Aurora.

                                      So your project might have fake adapters but not for mysql/postgres. Or maybe you use sqlite in dev/tests and something bigger in prod. But you don’t change entire products for your caches/queues based on the environment do you? See what I mean?

                                      Why can’t you use the exact same database and mimic whatever you need to in order to do testing? Unit tests and mocks that layer over state can be very brittle, so some sort of functional test or end-to-end test is better.
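
                                      One way to do that, sketched under assumptions (pytest plus psycopg2 against a throwaway copy of the same Postgres engine you run in prod; the connection string and table are invented):

                                      ```python
                                      import os
                                      import psycopg2
                                      import pytest

                                      # Points at a disposable instance of the *same* engine used in prod.
                                      TEST_DSN = os.environ.get("TEST_DATABASE_URL", "postgresql://postgres@localhost:5432/app_test")

                                      @pytest.fixture
                                      def conn():
                                          connection = psycopg2.connect(TEST_DSN)
                                          try:
                                              with connection, connection.cursor() as cur:
                                                  cur.execute("CREATE TABLE IF NOT EXISTS users (id serial PRIMARY KEY, name text)")
                                                  cur.execute("TRUNCATE users")
                                              yield connection
                                          finally:
                                              connection.close()

                                      def test_user_roundtrip(conn):
                                          # Functional test against real SQL semantics, not a mock layered over state.
                                          with conn, conn.cursor() as cur:
                                              cur.execute("INSERT INTO users (name) VALUES (%s) RETURNING id", ("ada",))
                                              user_id = cur.fetchone()[0]
                                              cur.execute("SELECT name FROM users WHERE id = %s", (user_id,))
                                              assert cur.fetchone()[0] == "ada"
                                      ```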

                                      Having said all of this, the problems that companies like AWS face at scale with respect to state are not the same as those of smaller applications at smaller scales. I still think state is an interesting problem at all scales, but if you can get away with not having state, go for it.

                                      1. 2

                                        Among the ‘good ideas’ listed in the article, I did not see one describing ‘functional buffers’.

                                        Meaning that reads and writes to/from the central database happen ‘occasionally’, not every time a client moves data. Writes are batched and then ‘synched up’ in micro-batches (at intervals of several seconds). Reads are cached in local data stores (something like LMDB) or other ‘smart disk caching’ alternatives, and most reads are served from there. In-memory caches on the ‘backend servers’ cache within limits as well.

                                        This particular approach works when read data can be ‘eventually consistent’ for some operations, and writes do not need to be immediate.

                                        Perhaps this idea is ‘somewhat’ reflected in the ‘active/active’ paragraph in the article. But my preference in designs/architectures is to build a ‘custom’ buffering layer rather than asking an active/active distributed system to sync up the data, because not all ‘business functions’ fit into that model.

                                        I tend to employ the above approach in the mobile-apps world, where not all user actions are ‘sent immediately’ to the backend; instead they are stored locally and then synched up to the backend periodically. This also helps if a particular user action cancels an earlier one.
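
                                        A toy sketch of that write-side buffer (the names and the flush interval are invented; a real one would also need a durable local queue, retries, and ordering rules):

                                        ```python
                                        import threading
                                        import time

                                        class WriteBuffer:
                                            """Collects client-side mutations and flushes them to the central DB in micro-batches."""

                                            def __init__(self, flush_to_db, interval_seconds: float = 5.0):
                                                self.flush_to_db = flush_to_db    # callable that takes a list of pending writes
                                                self.interval = interval_seconds
                                                self.pending = []
                                                self.lock = threading.Lock()

                                            def write(self, op: dict) -> None:
                                                with self.lock:
                                                    # A later action can cancel an earlier one before it ever hits the backend.
                                                    if "cancels" in op:
                                                        self.pending = [p for p in self.pending if p["id"] != op["cancels"]]
                                                    else:
                                                        self.pending.append(op)

                                            def run(self) -> None:
                                                while True:
                                                    time.sleep(self.interval)
                                                    with self.lock:
                                                        batch, self.pending = self.pending, []
                                                    if batch:
                                                        self.flush_to_db(batch)   # one round trip instead of one per user action

                                        buffer = WriteBuffer(flush_to_db=lambda batch: print("flushing", batch), interval_seconds=2.0)
                                        threading.Thread(target=buffer.run, daemon=True).start()
                                        buffer.write({"id": 1, "action": "add_to_cart", "item": "book"})
                                        time.sleep(3)  # give the background flush a chance to run before exiting
                                        ```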


                                        Furthermore,

                                        To optimize ‘hardware affinity to the database engine’, I gravitate (for large systems) toward never using database triggers, ‘percolations’, or anything like that. Along the same lines, if possible, I prefer a design where, at any particular point in time, only one particular type of database access is happening across all the clients (although this is really hard or impossible to maintain across the whole 24-hour period).


                                        Finally, I tend to avoid designing a system where a write-heavy database is deployed in any form of container (Docker/BSD jails/etc.). My understanding of the I/O scheduling and hardware-sharing challenges could be outdated, though, so I could be wrong on this.

                                        1. 4

                                          Yeah, I like this in-memory trick and the other stuff you said. :) A lot of people are chiming in that this info is out of date. I’m really interested in “modern tooling”, but then the suggestions have trade-offs (eventual consistency, complexity, systems of systems, NoSQL). They aren’t wrong. The active-active Sun horror story is from 2004, and that was peak context for me before switching to dev. So that’s good insight from those who are saying this is old.

                                          But I think the CAP theorem survives forever. Alternatives have trade-offs. You can’t have it all. Traditional RDBMSs are everywhere and have many benefits. This is probably a good default for juniors, interns, and early projects.

                                          It’s hard to have a pithy title that isn’t click bait but still represents my chunking/motto. I’m not dunking on databases. I’m just saying it’s really weird. This is a great way to explain this to interns and juniors. We have to be CAREFUL because of state etc.

                                          1. 5

                                            You are doing the hard part: writing a cohesive article, opening it up for criticism, collecting feedback, and asking follow-up questions. This type of process is an essential part of advancement in any field.

                                            You are right about trade-offs. There are databases (I forget which) that recognize the different trade-offs and let developers specify their trade-off choices on a per-connection basis.

                                            You are also right that the decisions around data consistency, change-propagation velocity, etc. have a significant (often deciding) impact on the overall architecture of a solution. Splitting database solution engineering out into a separate ‘isolated’ task rarely works (it works for cookie-cutter projects with little to no variability in data volume and access patterns).

                                        2. 2

                                          It reads like everyone’s first real system. Yeah, you didn’t know what you were doing. I’ve been there. Hopefully you know now.

                                          1. 1

                                            To add to this, another anti-pattern that I’ve seen is simply querying or writing to the DB in the middle of application code. Everyone agrees that global state singletons are bad, but when I ask them why they’re accessing the DB in the middle of code (that is now un-reusable and can’t compose because it has a flipping database call in the middle of it), they look at me like I’m telling them to rewrite it in COBOL.

                                            An even more advanced anti-pattern that I’ve seen is taking incoming data, processing it, storing the processed result in the DB, and then, in the same method, querying the data you just wrote “cause it’s easier than accessing deeply nested objects” and then processing that before outputting it as the result and clearing the DB.
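
                                            To make the first anti-pattern concrete, a hypothetical before/after (function and table names invented): keep the I/O at the edges and pass plain data into the middle, and the core logic becomes composable and testable without a database.

                                            ```python
                                            # Anti-pattern: business logic reaches into the DB halfway through,
                                            # so it can't be reused or unit-tested without a live connection.
                                            def apply_discount_bad(order_id, db):
                                                order = db.query("SELECT * FROM orders WHERE id = %s", order_id)  # hidden dependency
                                                if order["total"] > 100:
                                                    order["total"] *= 0.9
                                                db.execute("UPDATE orders SET total = %s WHERE id = %s", order["total"], order_id)
                                                return order

                                            # Refactor: pure function in the middle, reads and writes pushed to the caller.
                                            def apply_discount(order: dict) -> dict:
                                                total = order["total"] * 0.9 if order["total"] > 100 else order["total"]
                                                return {**order, "total": total}

                                            def apply_discount_for(order_id, db):
                                                order = db.fetch_order(order_id)   # read at the edge
                                                updated = apply_discount(order)    # composable, unit-testable core
                                                db.save_order(updated)             # write at the edge
                                                return updated
                                            ```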