1. 15

I am collating ideas on building resiliency into distributed systems at scale. I had previously written an article on this here: https://blog.gojekengineering.com/resiliency-in-distributed-systems-efd30f74baf4

The above article covers:

  1. Timeouts
  2. Retries
  3. Circuit breakers
  4. Fallbacks
  5. Resiliency Testing

More patterns I can think of include:

  1. Rate limiting and Throttling
  2. Bulkheading
  3. Queuing to decouple tasks from consumers
  4. Monitoring/alerting (Observability?)
  5. Redundancies
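Rate limiting is probably the simplest of these to sketch concretely. Here is a minimal token-bucket limiter in Python (names and parameters are my own illustration, not from the article): tokens refill continuously at a fixed rate up to a capacity, and each request spends one token or is rejected.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: refills at `rate` tokens/sec
    up to `capacity`; allow() spends one token if one is available."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # refill rate, tokens per second
        self.capacity = capacity    # burst size
        self.tokens = capacity      # start full
        self.clock = clock          # injectable clock for testing
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Throttling is then just what you do on a `False`: reject with a 429, queue the request, or shed load. The injectable clock makes the refill logic deterministic under test.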

Please let me know your experiences with these resiliency patterns. Also, please feel free to pitch in any other patterns you have encountered.

Thanks for this.


  2. 7

    It depends what you mean by resiliency. I tend to work on things with strong consistency requirements, and to be honest I think the way most people build and talk about distributed systems engineering is pretty gross and unprincipled.

    Why is Jepsen so successful against most systems, despite it being so incredibly slow to actually exercise communication interleaving patterns? People are building systems from a fundamentally broken perspective where they are not actually considering the realistic conditions that their systems will be running in.

    In my opinion, the proper response to this should be to ask how we can simulate realistic networks (and filesystems for that matter) on our laptops as quickly as possible, without requiring engineers to work with new tools.

    My approach is to use quickcheck to generate partitions + client requests and implement participants as things that implement an interface that usually looks like:

    • receive(at, from, msg) -> [(to, msg)]
    • tick(at) -> [(to, msg)]

    And this way a distributed algorithm can be single-stepped in accelerated time, and for each outbound message we use the current state of network weather to assign an arrival time / drop it. Stick it in a priority queue and iterate over this until no messages are in flight.
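The event loop described above can be sketched in a few lines. This is my own illustrative Python (the commenter's setup uses quickcheck-generated inputs; here the "nodes" just record what they receive, and `drop_rate`/`max_delay` stand in for generated network weather):

```python
import heapq
import random

class Node:
    """A trivial participant: receive(at, frm, msg) returns outbound
    (to, msg) pairs. A real algorithm would emit protocol messages."""
    def __init__(self, name):
        self.name = name
        self.seen = []

    def receive(self, at, frm, msg):
        self.seen.append((at, frm, msg))
        return []

def simulate(nodes, initial, seed=0, drop_rate=0.1, max_delay=5):
    """Single-step the cluster in accelerated time: each outbound
    message is assigned an arrival time or dropped, held in a
    priority queue, iterated until no messages are in flight."""
    rng = random.Random(seed)
    q = []        # entries: (arrival_time, seq, to, frm, msg)
    seq = 0       # tie-breaker so heap never compares messages
    for at, to, frm, msg in initial:
        heapq.heappush(q, (at, seq, to, frm, msg))
        seq += 1
    while q:
        at, _, to, frm, msg = heapq.heappop(q)
        for nxt, out in nodes[to].receive(at, frm, msg):
            if rng.random() < drop_rate:
                continue  # simulated network drops the message
            delay = rng.randint(1, max_delay)
            heapq.heappush(q, (at + delay, seq, nxt, to, out))
            seq += 1
```

In a property-based test you would let the generator produce the initial client requests and the partition/drop schedule, then assert the linearizability property over the completed history.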

    Then as the “property” in property testing, ensure that linearizability holds for all client requests.

    With something like this, every engineer can get a few thousand Jepsen-like runs in a couple of seconds before even opening a pull request. They don’t have to use any tools other than their language’s standard test support. You can write the simulator once and it has very high reuse value, since everything just implements the interface you chose. Way higher bug:cpu-cycle ratio than Jepsen.

    This does not replace Jepsen, as Jepsen is still important for catching end-to-end issues in an “as-deployed” configuration. But I really do think Jepsen is being totally misused.

    We should build things in a way that allows us to quickly, cheaply measure whether they will be successful in the actual environments that we expect them to perform in.

    Maximize introspectability. Everything is broken to some extent, so be sympathetic to future selves that have to debug it in production while failing spectacularly and causing everyone to freak out.

    One kind of concurrency that few seem to consider until it’s time to upgrade: multiple versions of your code may be live in some situations. Did you build your system in a way that ensures this is safe? Did you build your system in a way that allows you to un-deploy a new version if it fails unexpectedly? Or did you build in points of no return?

    One reason why I don’t do distributed systems fault injection consulting anymore is because of the egos of people who can’t accept their babies have problems. That got tiring really quickly. The #1 most important thing to building a reliable system is being humble. Everything we do is broken. That’s OK. So many engineers who learn how to open sockets begin to think of themselves as infallible rockstars. It’s really hard to build systems that work with these people.

    1. 6

      A lot of that looks like what’s in Michael Nygard’s book “Release It!”. Anyone wanting to learn about that stuff will find in there both patterns to use and reasons why they should strongly consider using them.

      1. 2

        I came to post the same link. This book should be required reading for scaling distributed systems. Ask Ajey, I’m sure he’s read it!

        1. 1

          Agreed. I skimmed a lot of IT books, but it was one of the few I actually bought. Great writing on how to apply the patterns, too.

          1. 1

            I have read the book; I wanted to know more about the experiences you all have had building distributed systems, and which patterns from the above list you think are relevant and which are not.

          2. 2

            I’m not a fan. I remember it as being what felt like “common sense” combined with some ideas that I think are bad. Of course, common sense means “lessons that took me years to learn”. No book is all good advice. Part of the learning process is implementing ideas that aren’t yours and learning when they are applicable and when they aren’t.

            My biggest problem with “Release It!” was it felt more like “all good engineers do X”. Best practices never are. They are good ideas in a certain context. Sadly, most tech thought leaders teach them as absolutes.

            1. 2

              It was kind of like a cookbook. Context is also important. He usually said what they were good for doing, though, along with examples of why. I figured people could make contextual decisions that way.

          3. 3

            There are some good descriptions of some patterns on the Azure Cloud Design Patterns site.

            1. 2

              You should submit this as a link on its own!