1.
  1.

    Speculating: The thing about testing in production is that it is maximally resistant to excuses. You can, as a group, decide to do tests like this, and you can do the tests, and the tests cannot fail to happen. And, assuming you have buy-in from the top, any failures will be the fault of the people or group who failed to implement the mitigation/error-handling setup, not of whoever ran the test. As such, chaos testing is a way to bypass internal resistance to the overhead/effort of fault tolerance. (The same concept applies to pentesting.)

    Specifically, this sentence:

    > But, presumably we should be testing this in development, when we’re writing the code to contact that service.

    Note that the “we” doing the chaos test and the “we” who should be testing this in development may not be the same “we”!
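
    As a rough sketch of what the development-time half of that could look like (the client function, URL, and fallback behaviour here are all hypothetical, and this assumes Python with the requests library), the “we” writing the code to contact the service can exercise the error-handling path with an injected fault, no production involved:

    ```python
    import unittest
    from unittest import mock

    import requests


    # Hypothetical client under test: call a downstream service and fall back
    # to an empty result if it is unreachable or slow.
    def fetch_recommendations(user_id, session=requests):
        try:
            resp = session.get(f"https://recs.internal/users/{user_id}", timeout=0.5)
            resp.raise_for_status()
            return resp.json()["items"]
        except requests.RequestException:
            # The error-handling path a chaos test would later exercise for real.
            return []


    class FetchRecommendationsTest(unittest.TestCase):
        def test_falls_back_when_downstream_times_out(self):
            # Development-time fault injection: make the dependency "fail"
            # by stubbing it out, then check the fallback behaviour.
            broken = mock.Mock()
            broken.get.side_effect = requests.Timeout("injected fault")
            self.assertEqual(fetch_recommendations("u123", session=broken), [])


    if __name__ == "__main__":
        unittest.main()
    ```

    That covers the handling logic itself; whether that path actually gets taken in the real environment, and whether the metrics around it fire, is what the production-side chaos test checks.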

    1.

      Yeah, and I feel that chaos engineering leans on the same social friction-bypassing move as writing services in the first place. It’s a messy technique for a messy world. It’s not a particularly fast way to find bugs in distributed systems, and it can incur heavy reproduction costs: bisecting a git commit log for a big batch of commits under test takes a lot longer when you have to run highly non-deterministic fault injection for long enough on each commit to gain confidence about whether the bug is present at that point. But it lets whoever is writing the bugs decouple themselves more from whoever is fixing them :P (and it often lets social credit accumulate with the bug producers rather than the bug fixers).
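
      To put a rough number on that reproduction cost: if a single fault-injection run reproduces the bug with probability p, you need about ln(eps)/ln(1-p) runs per commit to keep the chance of wrongly marking a bad commit as good below eps. A sketch of a `git bisect run` wrapper along those lines (the probabilities are made up, and ./fault_injection_test is a hypothetical script that exits non-zero whenever the bug shows up):

      ```python
      #!/usr/bin/env python3
      # Sketch of a `git bisect run` helper for a non-deterministic failure.
      import math
      import subprocess
      import sys

      REPRO_PROBABILITY = 0.1   # assumed chance that one run reproduces the bug
      FALSE_GOOD_BUDGET = 0.01  # acceptable chance of mislabeling a bad commit

      # (1 - p)^n <= eps  =>  n >= ln(eps) / ln(1 - p); ~44 runs with these numbers.
      runs = math.ceil(math.log(FALSE_GOOD_BUDGET) / math.log(1 - REPRO_PROBABILITY))

      for _ in range(runs):
          if subprocess.run(["./fault_injection_test"]).returncode != 0:
              sys.exit(1)  # bug reproduced: tell `git bisect` this commit is bad
      sys.exit(0)          # never reproduced: call it good, within the error budget
      ```

      At roughly 44 runs per commit, even a ten-step bisect is on the order of 400 test runs, which is where the “takes a lot longer” comes from.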

    2.

      Disclaimer: I don’t know that much about chaos engineering.

      If I read this correctly, the author speculates that chaos engineering is done mainly to test error-handling code (and, incidentally, the metrics). But I think the main point of chaos engineering is to discover “unknown unknowns” (as the author mentions) in production. Yes, you can test for resiliency in staging, or even on a developer’s machine, but you will never have the same configuration/behaviour in production as in staging. I see chaos engineering more as a way to discover those differences between the environments than as a way to test resiliency per se. Put differently: by testing resiliency in production, you can uncover unknown unknowns (like misconfigurations) about the production environment.
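
      For what it’s worth, a minimal sketch of what such a production experiment can look like (get_error_rate, list_instances, terminate, and the service name are all hypothetical stand-ins for your own monitoring and infrastructure APIs, not any particular chaos tool):

      ```python
      import random
      import time

      # Minimal chaos-experiment loop: state a steady-state hypothesis, inject
      # one fault in production, and check whether the hypothesis survives.

      def steady_state_ok(get_error_rate, threshold=0.01):
          return get_error_rate() < threshold

      def run_experiment(list_instances, terminate, get_error_rate):
          assert steady_state_ok(get_error_rate), "system already unhealthy; don't start"

          victim = random.choice(list_instances("recommendation-service"))
          terminate(victim)  # the injected fault

          time.sleep(300)    # let traffic, retries, and failover play out
          if not steady_state_ok(get_error_rate):
              # The interesting outcome: production behaved differently than
              # staging did, i.e. an unknown unknown (misconfiguration,
              # missing fallback, surprise dependency, ...).
              raise RuntimeError(f"steady state violated after terminating {victim}")
      ```

      The experiment’s value is less the kill itself than the comparison: if the same experiment passes in staging and fails in production, you have located one of those differences between the environments.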

      Also, with something like Netflix, there are just so many micro-services that meaningfully reproducing the setup on a dev’s machine is almost wishful thinking.

      P.S. Very nice article.