1. 22

  2. 6

    Personally. Attacked. :|

    More seriously, I think one way to prevent this is being able to articulate, at each step, the probability and cost of both incurring and preventing the various failure modes. It doesn’t make sense (yet some still do it!) to spend 100k a year on a full-time SRE’s salary if an outage, for example due to app instance version mismatches, only occurs once a year and only loses you 5K in revenue.
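    To make that comparison concrete, here is a rough sketch of the arithmetic in Python. The function names are made up for illustration, and it assumes (generously) that the prevention measure removes the failure mode entirely:

    ```python
    def expected_annual_loss(outages_per_year: float, loss_per_outage: float) -> float:
        """Expected yearly cost of a failure mode: frequency times impact."""
        return outages_per_year * loss_per_outage


    def worth_preventing(outages_per_year: float,
                         loss_per_outage: float,
                         prevention_cost_per_year: float) -> bool:
        """True only if prevention costs less than the loss it avoids.

        Assumption: prevention eliminates the failure mode completely.
        """
        return prevention_cost_per_year < expected_annual_loss(
            outages_per_year, loss_per_outage
        )


    # Figures from the example: one outage a year, 5K lost, 100k SRE salary.
    print(expected_annual_loss(1, 5_000))
    print(worth_preventing(1, 5_000, 100_000))
    ```

    A real analysis would sum over many failure modes and account for partial mitigation, but even this crude version makes the trade-off explicit.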

    I assert that it is the inability to capture business needs and do sane reliability analyses (and given the complexity of our stacks, understandably so) that makes this descent into complication so common.

    1. 2

      Very good one :)

      Oh, look at that storage server! It is definitely not going to fail at any time…

      A single point of failure, until it gets managed by “the cloud”, where it all suddenly becomes magic. Who knows how to build a failproof storage system these days? Mr. “the cloud”?

      1. 1

        I see two kinds of moves through it:

        1. Adding redundancy which permits one to scale things up.
        2. Making it practical to keep everything in sync while being split across multiple servers.
        3. I said 2, not 3!

        A deploymaster always happens. Even on one server, the local shell script that fetches an archive from git into a directory and switches a symlink to that dir is already an ad-hoc “deploymaster”. It is always a thing.
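        For illustration, the symlink-switch half of such an ad-hoc deploymaster could look roughly like this in Python. The `activate_release` name and the `releases/` + `current` layout are assumptions, not anything prescribed here, and the release directory is assumed to be already unpacked:

        ```python
        import os
        import tempfile
        from pathlib import Path


        def activate_release(releases_dir: Path, release_name: str, current_link: Path) -> None:
            """Point `current_link` at releases_dir/release_name.

            The new symlink is created under a temporary name and then renamed
            over the old one, so readers never see a missing link (POSIX rename).
            """
            target = releases_dir / release_name
            if not target.is_dir():
                raise FileNotFoundError(target)
            tmp_link = current_link.with_name(current_link.name + ".tmp")
            if tmp_link.is_symlink() or tmp_link.exists():
                tmp_link.unlink()  # leftover from an interrupted deploy
            os.symlink(target, tmp_link)
            os.replace(tmp_link, current_link)


        # Demo in a throwaway directory: deploy v1, then switch to v2.
        with tempfile.TemporaryDirectory() as root_name:
            root = Path(root_name)
            releases = root / "releases"
            (releases / "v1").mkdir(parents=True)
            (releases / "v2").mkdir()
            current = root / "current"
            activate_release(releases, "v1", current)
            activate_release(releases, "v2", current)
            print(Path(os.readlink(current)).name)
        ```

        Rolling back is the same call with the previous release name.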

        A reverse proxy is just an application router. It always happens once you have more than one server. Putting it on the same machine or on another machine is not much more complex…
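        To show how little there is to the routing decision itself, here is a toy round-robin picker in Python. Round-robin is just one possible policy, the backend addresses are hypothetical, and the actual forwarding of HTTP traffic is left out:

        ```python
        from itertools import cycle


        class RoundRobinRouter:
            """Hands out backends in turn; the core decision of a simple reverse proxy."""

            def __init__(self, backends):
                backends = list(backends)
                if not backends:
                    raise ValueError("at least one backend is required")
                self._backends = cycle(backends)

            def pick(self) -> str:
                """Return the backend that should receive the next request."""
                return next(self._backends)


        # Hypothetical app servers behind the proxy.
        router = RoundRobinRouter(["10.0.0.1:8080", "10.0.0.2:8080"])
        print([router.pick() for _ in range(4)])
        ```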

        Multiple IPs in DNS do not make it more complex either. DNS was already there; it is just shipping 2 or 3 IPs instead of 1…

        Creating more app servers easily is just a shell script of the commands you were running on the host to configure it as an app server. Is there much more to it than installing the server with its dependencies (libraries & co)?

        Asynchronous workers make the scene more diverse, but it is a simpler kind of service. Furthermore, it can also make the inside of the app simpler. Also, it can be managed just like an app server, with just a different release.

        And then RilkefCorp wanted to have it interact with FooStore in the cloud:

        Whoops! Here comes the marketing / finance team! Get off our admin lawn! No, moving to the cloud is also a bad idea: extra tools to manage the stack are extra pieces of infrastructure, making the stack taller… It’s over, you went past the right spot and kept going in a straight line. There is no more benefit in adding tools at this point. Keep deploying your infrastructure like you did in the previous steps, but do x1000 instead of x10. This scales. I don’t know of a workload that can make multiple reverse proxies with multiple back-ends die… unless…

        That goddamn storage back-end! We let our whole scalable infrastructure live off a single NFS store!

        That’s bad, very bad. Even with a distributed filesystem it will be our bottleneck. We must rethink the application so that we can split the storage into:

        • what is static (easy, copy the data between multiple servers with that deploymaster: no extra complexity)
        • what is dynamic (the app writes to it, maybe new static content; not a big deal, sanitize it and pipe it to the deploymaster)
        • what is dynamic and synchronous (the app writes to it and the others need to acknowledge it right away; that’s the hot stuff).

        So instead of thinking about the entire big-goofy-goose-of-an-automated-infrastructure-deployment-foolproof-manager-web-interface, let’s have a look at how to solve the one actual problem we face: distributed synchronous storage.

        It is not that hard:

        • transform what is dynamic and changes at run-time into static content (hash the content and distribute it, maybe?), until only one tiny piece of information is left: which static content you point at.
        • turn something that needs to be distributed world-wide into something that only needs to stay in sync across a few servers, like local variants deployed only in the regions that care about them.
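        A minimal in-memory sketch of the first idea in Python: content is stored under its hash and never changes, and the only mutable state is one pointer. `ContentStore` and its methods are hypothetical names; in a real setup the blobs would be files pushed around by the deploymaster, and only the pointer would need synchronous storage:

        ```python
        import hashlib


        class ContentStore:
            """Immutable blobs keyed by their SHA-256, plus one mutable pointer."""

            def __init__(self):
                self.blobs = {}      # digest -> bytes; entries never change once written
                self.pointer = None  # the only piece that must be kept in sync

            def put(self, data: bytes) -> str:
                """Store `data` under its hash; identical content yields the same key."""
                digest = hashlib.sha256(data).hexdigest()
                self.blobs[digest] = data
                return digest

            def publish(self, digest: str) -> None:
                """Make an already-stored blob the current version."""
                if digest not in self.blobs:
                    raise KeyError(digest)
                self.pointer = digest

            def current(self) -> bytes:
                """Return the content the pointer currently designates."""
                if self.pointer is None:
                    raise LookupError("nothing published yet")
                return self.blobs[self.pointer]


        store = ContentStore()
        store.publish(store.put(b"homepage v1"))
        store.publish(store.put(b"homepage v2"))
        print(store.current())
        ```

        Because blobs are immutable, copying them between servers can be lazy and eventually consistent; only the tiny pointer update is the “hot stuff”.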

        Unfortunately, this requires the app to be designed for it: splitting the localized content from the rest, not expecting a MySQL database at hand (160-node Percona Cluster across the oceans, anyone?), identifying what may write back to the storage, and what is just fetching static files that are not expected to change at run-time.

        Taking an application from a developer and “making it scale like crazy” is the American dream of the “Cloud”, but because of this, the bottleneck becomes the developer himself, who fails to understand this over-complex infrastructure.

        My bet is: a simpler infrastructure that requires the developer to understand it would beat, in terms of effort/result, a sliced-up stack where each layer sells a “platform” for $$$$ so that the layer above no longer has to worry about the layer below.

        Are CTOs getting afraid of the big bad technicalities (words like “storage array failure”, “bad BGP announcement”, “blacklisted IP block”, or “kernel panic in loop after kernel upgrade”)? I understand: it is letting other companies take the risks. But that kind of “outsourcing the risks” service is letting a few giants, with the whole redundant server stack and all the bells and whistles, take over everything: Hello Cloudflare, Amazon, Google, Azure and friends…