This all sounds so nice and falls so damn flat if you’re not operating at scale.
Simplest example: for Chaos Monkey to be useful you first have to have at least n+1 instances running (with n being what you need to handle your regular load).
Sure, if you’re hosting a web app it’s easy to just put another frontend node or another load balancer there, but for some databases especially it gets hairy real quick (master/master vs master/slave vs cluster). And getting all your infrastructure to HA (whatever number of nines is your target) is so often just not worth it, especially at a not-yet-profitable startup.
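To put rough numbers on the "nines": a quick sketch of the yearly downtime budget each target allows. It assumes a plain 365-day year, and the function name is just something I made up for illustration.

```python
def downtime_budget_hours(nines: int, hours_per_year: float = 365 * 24) -> float:
    """Allowed downtime per year for an availability target with
    `nines` nines, e.g. 3 -> 99.9%. Assumes a 365-day year by default."""
    unavailability = 10 ** (-nines)  # 3 nines -> 0.001
    return hours_per_year * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {downtime_budget_hours(n):.2f} hours/year")
```

Three nines leaves you about 8.76 hours a year, and every extra nine cuts that budget by a factor of ten, which is roughly where the cost argument above starts to bite.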
This reminded me of "Experience with some Principles for Building an Internet-Scale Reliable System" (PDF) from 2006.