This is a good retrospective on the Slack outage. The initial Slack response privately sent to customers within a week was extremely disappointing and reduced my confidence in Slack. As a customer, I would ask Slack to just skip the inadequate initial response and send out a deep dive of this quality regardless of how long it takes.
That being said, in my opinion, there are issues with Slack’s architecture and data schema that are not addressed by the short-term or long-term actions at the bottom. So…please accept my personal two cents.
If data isn’t in the cache, or if an update operation occurs, the Slack application reads it from the Vitess datastore and then inserts the data into Memcached…Furthermore, membership of GDMs is immutable under the current application requirements, so there is a long cache TTL, and therefore the data is almost always available via the cache…Other more long-term projects are exploring other ways of increasing the resilience of our caching tier.
You have created a bi-modal system. In one mode, everything is peachy: the cache is full, latencies are low, and you only occasionally hit the datastore. In the second mode, a surge of unexpectedly cold traffic, an unexpected deployment, or a cold restart of your system makes your cache useless, and you hammer your datastore. Modes are bad. You want one mode. On top of that you’re suffering from scale inversion: the request volume and the size of the cache fleet exceed your datastore’s capacity and can wipe it out.
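One way to collapse the two modes is to make the datastore path carry a hard concurrency cap, so a cold cache can never generate more load than the datastore is provisioned for; in the worst case you shed load instead of tipping over. A minimal sketch of that idea, where the cache/Vitess helpers and the limit are hypothetical, not Slack’s actual code:

    import threading

    # Hypothetical cap: the most concurrent datastore reads we believe
    # the datastore can sustain, cold cache or not.
    DATASTORE_CONCURRENCY_LIMIT = 64
    _datastore_slots = threading.BoundedSemaphore(DATASTORE_CONCURRENCY_LIMIT)

    class DatastoreOverloaded(Exception):
        """Raised when we shed load instead of exceeding the cap."""

    def get_membership(gdm_id, cache, fetch_from_vitess):
        """Cache-aside read whose miss path is bounded, so a cold cache
        degrades into load shedding rather than a datastore stampede."""
        value = cache.get(gdm_id)
        if value is not None:
            return value
        # Only go to the datastore if a slot is free; otherwise fail fast.
        if not _datastore_slots.acquire(blocking=False):
            raise DatastoreOverloaded("datastore read cap reached, shedding load")
        try:
            value = fetch_from_vitess(gdm_id)
            cache.set(gdm_id, value)
            return value
        finally:
            _datastore_slots.release()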
As per https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems/:
Falling back to direct database queries was an intuitive solution that did work for a number of months. But eventually the caches all failed around the same time, which meant that every web server hit the database directly. This created enough load to completely lock up the database…The thinking behind our fallback strategy in this case was illogical. If hitting the database directly was more reliable than going through the cache, why bother with the cache in the first place? We were afraid that not using the cache would result in overloading the database, but why bother having the fallback code if it was potentially so harmful? We might have noticed our error early on, but the bug was a latent one, and the situation that caused the outage showed up months after launch.
So questions to ask yourself:
Vitess is magical, scales horizontally, and supports read replicas. If it’s so magical, why can’t you scale it horizontally and replicate it globally to reduce latency? Why do you need a caching layer?
If you need a caching layer, you are caching immutable, stable data. Why are you pulling it rather than pushing it? If it is immutable, why can’t you reload it from disk or from an object store instead of refreshing it? (See the sketch after this list.)
If I have designed a system with modes, how do I regularly test the system in all of its modes? The modes are orthogonal axes in a space of system configurations…how do I find the corners and test them? Can I test the corners? If I can’t test the corners, maybe I should eliminate some axes?
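On the push-versus-pull question above: because GDM membership is immutable, one option is to populate the cache at write time and keep a durable copy to bulk-load from on a cold start, instead of faulting keys in through Vitess on cache misses. A rough sketch under those assumptions; the cache and object-store interfaces and the key names are hypothetical:

    def on_gdm_created(gdm_id, member_ids, cache, object_store):
        """Push-based population: membership is written once, at creation
        time, to both the cache and a durable warm source."""
        membership = sorted(member_ids)
        # Membership is immutable, so no TTL and no refresh is needed.
        cache.set(f"gdm:{gdm_id}:members", membership)
        # Durable copy that a cold cache node can bulk-load from, instead
        # of faulting every key in through the datastore.
        object_store.put(f"gdm-members/{gdm_id}", membership)

    def warm_cache_node(cache, object_store, gdm_ids):
        """Cold start: reload from the object store, not from Vitess."""
        for gdm_id in gdm_ids:
            membership = object_store.get(f"gdm-members/{gdm_id}")
            if membership is not None:
                cache.set(f"gdm:{gdm_id}:members", membership)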
Client retries are often a contributor to cascading failures, and this scenario was no exception. Clients have limited information about the state of the overall system. When a client request fails or times out, the client does not know whether it was a local or transient event such as a network failure or local hot-spotting, or whether there is ongoing global overload. For transient failures, prompt retrying is the best approach, as it avoids user impact. However, when the whole system is in overload, retries increase load and make recovery less likely, so clients should ideally avoid sending requests until the system has recovered. Slack’s clients use retries with exponentially increasing backoff periods and jitter, to reduce the impact of retries in an overload situation. However, automated retries still contributed to the load on the system.
It does not necessarily follow that limited local information prevents appropriate inference about global state. You can use local circuit breakers for an automated approach, as per https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/:
Load. Even with a single layer of retries, traffic still significantly increases when errors start. Circuit breakers, where calls to a downstream service are stopped entirely when an error threshold is exceeded, are widely promoted to solve this problem. Unfortunately, circuit breakers introduce modal behavior into systems that can be difficult to test, and can introduce significant additional time to recovery. We have found that we can mitigate this risk by limiting retries locally using a token bucket. This allows all calls to retry as long as there are tokens, and then retry at a fixed rate when the tokens are exhausted. AWS added this behavior to the AWS SDK in 2016. So customers using the SDK have this throttling behavior built in.
So questions to ask yourself:
Can you update the library you use to hit the datastore to automatically implement a token bucket for retries and then circuit-break? (See the sketch after these questions.)
What additional metrics and alarming would you add to be able to monitor this?
What runtime configuration could you implement to automatically trip all circuit breakers?
How could you regularly test all of this?
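On the first question: the token-bucket retry throttling the AWS article describes looks roughly like the following. This is a hedged sketch, not the SDK’s actual implementation; for simplicity it fails fast when the budget is empty rather than retrying at a fixed rate, and the capacities and costs are illustrative only:

    import random
    import time

    class RetryTokenBucket:
        """Local retry budget in the spirit of the AWS SDK's retry
        throttling: retries spend tokens, successes earn a little back,
        so an overloaded dependency starves its retries quickly."""

        def __init__(self, capacity=100, retry_cost=5, success_refill=1):
            self.capacity = capacity
            self.tokens = capacity
            self.retry_cost = retry_cost
            self.success_refill = success_refill

        def try_spend(self):
            if self.tokens >= self.retry_cost:
                self.tokens -= self.retry_cost
                return True
            return False

        def record_success(self):
            self.tokens = min(self.capacity, self.tokens + self.success_refill)

    def call_with_retries(op, bucket, max_attempts=3, base_delay=0.05):
        """Exponential backoff with jitter, gated by the retry budget."""
        for attempt in range(max_attempts):
            try:
                result = op()
                bucket.record_success()
                return result
            except Exception:
                out_of_attempts = attempt == max_attempts - 1
                if out_of_attempts or not bucket.try_spend():
                    raise  # no budget left: fail instead of piling on load
                time.sleep(random.uniform(0, base_delay * (2 ** attempt)))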
RE: data schema:
An alternative is to dual-write the data under a different sharding strategy. We also have a keyspace that has membership of a channel sharded by channel, which would have been more efficient for this query. However, the sharded-by-user keyspace was used because it contained needed columns that were not present in the sharded-by-channel table…We have modified the problematic scatter query to read from a table that is sharded by channel. We have also analyzed our other database queries which are fronted by the caching tier to see if there are any other high-volume scatter queries which might pose similar risks
This is always the trouble with NoSQL database schemas and denormalization: sometimes there’s just a little something missing that means you can’t use that way and now you have to keep using this way. Regular operational reviews and good design reviews are needed to catch the bigger, subtler issues with data schemas. I also try to keep a rule in my mind, “Never let them touch every row”: always ask whether I have created a query or pattern with O(n) behavior. If I have, I’ve just boxed myself in, and n only has to get large enough to tip me over. Queries need to be O(1); anything that genuinely needs O(n) access can, for example, use a changelog stream to publish the data to an alternative data store, like an object store, for access. The caching layer is your crutch/mode that is hiding this time bomb and paving the road to the next outage.
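The changelog-stream idea might look roughly like this: consume membership change events and maintain a per-channel membership object in an alternative store, so the “who is in this channel” question becomes an O(1) fetch instead of a scatter query. The event fields, key names, and object-store interface here are hypothetical:

    def handle_membership_change(event, object_store):
        """Consume one membership changelog event and maintain a per-channel
        membership object, so 'who is in this channel' is a single O(1)
        fetch rather than a scatter across user shards. (Read-modify-write
        races between concurrent consumers are ignored for brevity.)"""
        key = f"channel-members/{event['channel_id']}"
        members = set(object_store.get(key) or [])
        if event["type"] == "join":
            members.add(event["user_id"])
        elif event["type"] == "leave":
            members.discard(event["user_id"])
        object_store.put(key, sorted(members))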
The PBR step on February 22 updated Consul on 25% of the fleet. It followed two previous 25% steps the prior week, both of which had occurred without any incident. However, on February 22, we hit a tipping point and entered a cascading failure scenario as we hit peak traffic for the day. As cache nodes were removed sequentially from the fleet, Mcrib continued to update the mcrouter configuration, promoting spare nodes to serving state and flushing nodes that left and rejoined. This caused the cache hit rate to drop.
Can you close the loop here: why are you continuing deployments as your system is degrading? You know your system is bi-modal; as it veers into the other mode, you need to slam on the big red button. Some may say it is impossible to avoid the other mode…but at least you can try.
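One way to wire up that big red button automatically is to gate every rollout step on the health of the cache tier itself, for example the fleet-wide hit rate. A sketch only, with made-up thresholds and function names:

    import time

    # Made-up thresholds: tune to what the cache tier can actually absorb.
    HIT_RATE_FLOOR = 0.90      # halt if the fleet-wide hit rate drops below this
    SETTLE_SECONDS = 300       # let the cache re-warm between rollout steps

    def run_staged_rollout(steps, apply_step, fleet_cache_hit_rate):
        """Apply rollout steps one at a time, halting if the cache tier
        starts to degrade; recovering the cache comes before finishing
        the deploy."""
        for step in steps:
            if fleet_cache_hit_rate() < HIT_RATE_FLOOR:
                raise RuntimeError("rollout halted: cache hit rate below floor")
            apply_step(step)
            time.sleep(SETTLE_SECONDS)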
“Complex systems contain changing mixtures of failures latent within them. The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure they are regarded as minor factors during operations. Eradication of all latent failures is limited primarily by economic cost but also because it is difficult before the fact to see how such failures might contribute to an accident. The failures change constantly because of changing technology, work organization, and efforts to eradicate failures.”
A pithy quote, and an interesting paper. But after incidents, rather than pithy quotes, I prefer sharp, controversial questions that try to go to the heart of the issue. I am not saying there are right or wrong answers, but asking such questions, and the journey of answering them, is more enlightening than seeking generic yogi-like advice. I am no yogi, and I do not have magical answers, so I like to ask myself sharp questions.
So, maybe:
Get rid of caching? Vitess is awesome, let’s just nuke the memcached layer entirely.
Get rid of Vitess? Vitess is not awesome, we couldn’t horizontally scale it during the incident, we forgot to denormalize columns, let’s nuke Vitess.
Test all the modes? Is it even possible to test a system’s modes fully, or should we try to eliminate modes?
Test recovering once a system is in catastrophic failure and requires severe throttling and cold starting, vs. test the system’s breaking point and avoid it? Ah, the eternal controversial question. Ask both questions. What works for your organization?
Better deployments? Why do Consul deployments empty the cache, especially for stable items with infinite TTL? What needs to change during a deployment and what can stay the same? How do I know a deployment is going sideways?
Thanks for Slack, and good luck!
Vitess is magical, scales horizontally, and supports read replicas. If it’s so magical, why can’t you scale it horizontally and replicate it globally to reduce latency? Why do you need a caching layer?
This is a very good question.
I don’t work at Slack, but at Booking.com we have pushed MySQL (without Vitess) quite close to its limits.
Those who are interested can check out the details in https://blog.koehntopp.info/2021/03/12/memory-saturated-mysql.html, written by Kristian, our principal DBE. Essentially, you can serve the data from MySQL’s in-memory pages, which makes the read performance of a fully memory-saturated instance roughly on the same level as a cache store like Redis or Memcached. For this reason, there is no ‘cache’ layer at Booking.com, and the read-heavy workloads are served from MySQL directly, heavily sharded.
“Mcrib” is the best code name I’ve heard in a while.
In a way it’s comforting that Slack, with these brilliant engineers, has these issues too. I mean, I know everyone does, but when a product I designed or worked on has an issue, I always beat myself up about it. Realizing I can make mistakes too and not be an imposter is a lesson I still haven’t fully learned after all these years.
Let me tell you about some internal only post mortems (We call them COEs - Correction Of Error) that made my hair stand up :)
And I guarantee you that every other BigCorp in existence has them too. Solving problems at crazy pants scale means that sometimes despite everyone’s best efforts you end up with disasters at said scale :)
I work for AWS, my opinions are my own.
It’s appropriate for a service called “mcrib” to be only intermittently available…
Bravo.
This is the greatest comment I have ever seen here.
EVERYONE does!
Haha! Not just Slack … literally every place I’ve worked at has had something like this.
Worst date notation to date. Why would you do that??