1. 14

  2. [Comment removed by author]

    1. 2

      I might be misunderstanding, but on the last point, aren’t you testing durability, basically making sure that you have backups that work, rather than availability? It’s pretty common in finance, at least consumer-facing, to be very interested in durability, but somewhat “meh” about availability. My bank’s ebanking website goes down for routine maintenance about 6 hours a week on Saturday night, occasionally with other outages on weekday nights if they have more things to do. I’d be surprised if overall availability is much higher than ~95%, i.e. not even two nines. That seems to work ok for them. But people would be very angry if they lost transactions.

      1. 3

        For testing backups, a quote I stand by: nobody cares about backups, they only care about restores. If you aren’t testing your restores, you don’t have a backup.

        Also note, the consumer side of finance is a lot less hectic than the day trader side where money is made.

        1. 1

          Availability is very rarely (even in academic literature) defined in an explicit manner. The “A” in CAP and the “A” in ACID are different, and often conflated when they’re precisely considered at all. If you have “ACI”, it’s questionable that “D” is really orthogonal rather than just the completion of a cool acronym. In the end, we want to be able to rely on our systems to meet our objectives. Reliability involves the availability of the things that we need to accomplish the objective. Nines are sort of an economic pseudo-quantification. If you don’t account for the value of the thing that is available, a description of availability stands naked and useless. Without backups, you can keep nine nines, but your responses are trivial and sometimes severely disappointing. It’s not super valuable to consider any of this stuff in isolation.

      2. 7

        I wonder if shooting for the mythical five 9’s of availability is a poor proxy for “try to make the service available enough to piss off the fewest people”, and far too many people try for the former and end up wasting time/effort.
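        For a sense of what the nines actually buy, here’s a back-of-the-envelope sketch (pure arithmetic, assuming nothing beyond a 365.25-day year):

        ```python
        def downtime_per_year(availability: float) -> float:
            """Minutes of allowed downtime per year at a given availability."""
            minutes_per_year = 365.25 * 24 * 60
            return (1 - availability) * minutes_per_year

        for nines, a in [("two", 0.99), ("three", 0.999), ("five", 0.99999)]:
            print(f"{nines} nines: {downtime_per_year(a):.1f} min/year of downtime budget")
        ```

        Five nines leaves you roughly five minutes a year; the ~95% bank site above gets about eighteen days. The gap between those budgets is where the wasted effort goes.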

        Techno-wankery around metrics seems to happen depressingly often. A coworker once hung his hat on a ~90% average reduction in time for an API call - sounds amazing at first! But then we asked:

        What were the average times before/after your work?

        ~80 ms, ~13 ms (not too shabby after all!)

        What level of work was involved?

        n weeks, introduction of a new technology to the stack (redis), …

        How often is the API hit?

        tens of times an hour

        What is the maximum acceptable response time for that service?

        (sheepishly) 150 ms.

        The problem came up, it turns out, after we introduced monitoring to make it easy to see times for API calls. Just measuring this meant everyone immediately tried to reduce the numbers… and in this case, it meant spending a significant amount of time and complexity solving a problem that just didn’t exist.

        I don’t know what a good solution is. We experimented with alternative monitoring techniques - at first we hid the raw times and just exposed the top five slowest calls, but that didn’t give enough actionable information. We also tried to come up with an “api score” that heavily penalized calls that were close to their max timeout or had significant response time variance, but getting the numbers to work out was tricky.
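        For the curious, the score looked roughly like this - a hypothetical reconstruction, not the real code; the penalty shapes (headroom below timeout, coefficient of variation for jitter) are illustrative:

        ```python
        import statistics

        def api_score(samples_ms: list[float], timeout_ms: float) -> float:
            """Hypothetical health score in [0, 1]: penalize calls whose mean
            latency approaches the timeout, and calls with jittery latency."""
            mean = statistics.mean(samples_ms)
            # 1.0 = far from the timeout, 0.0 = at or past it
            headroom = max(0.0, 1.0 - mean / timeout_ms)
            # coefficient of variation: spread relative to the mean
            spread = statistics.pstdev(samples_ms) / mean
            stability = 1.0 / (1.0 + spread)  # 1.0 = perfectly steady
            return headroom * stability
        ```

        A steady 10 ms call against a 150 ms timeout scores near 1.0; a call bouncing between 60 ms and 140 ms scores much lower, even though both are “green”. The tricky part, as above, is weighting the two penalties so the ranking matches intuition.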

        I guess what I saw is just another manifestation of the same issue - throwing cycles at a problem without actually understanding what you’re trying to solve.

        So has anyone measured “availability” in a way that is actually meaningful/actionable?

        1. 4

          Rather than looking at the 5 slowest calls, why not sort your calls by usage, and then work through the list until you find something that is worth devoting energy to?
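          Concretely, something like this (the endpoints and numbers are made up; the point is to rank by aggregate time spent, not per-call latency):

          ```python
          # Hypothetical call stats: (endpoint, calls_per_day, mean_latency_ms)
          stats = [
              ("/search",  50_000,  40),
              ("/login",  200_000,   5),
              ("/report",      20, 900),
          ]

          # Rank by total time per day: a "slow" endpoint hit 20 times a day
          # matters far less than a fast one hit 200k times.
          ranked = sorted(stats, key=lambda s: s[1] * s[2], reverse=True)
          for endpoint, calls, latency in ranked:
              print(f"{endpoint}: {calls * latency / 1000:.0f} s/day total")
          ```

          The 900 ms report endpoint looks alarming on a latency dashboard but lands at the bottom of this list - which is exactly the 80 ms/tens-of-times-an-hour trap above.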

          1. 3

            Newrelic colours your response times green/yellow/red, which I found was pretty psychologically effective even without hiding the raw numbers - a call may be taking 80ms, but as long as it’s in the soothing green I don’t worry about it, even when I know there’s no way that task should be taking 80ms.

          2. 4

            If the availability of your product doesn’t matter at all, it might mean that your customers don’t care that much about your product.

            With that said, it is a tradeoff, and if you can create more value for your business by spending time elsewhere, you do you.

            1. 4

              This post is counterproductive. Availability is an important discussion between engineering and business that needs to take place. This post (the first half, at least) basically says you’re an idiot for even wanting to address the question. The actual technical suggestions are fine and probably what most people should be doing, but approaching it in a way that, IMO, demeans people who just want to have the conversation about what availability is important is not helping. In most cases, we need more cooperation between technical decisions and business needs, and this post is something a technically illiterate business person can throw down to strong-arm engineers into doing what they want, risks be damned.

              I also disagree with a number of the author’s claims. Specifically, much of the argument against caring about availability is a strawman of it being binary. It’s a dial one turns. And depending on where they want that dial, it’s cheaper or more expensive. But, again, it’s a discussion that needs to happen between engineering and business, not a blanket “shut up and do what I want”.

              There is a great article here by @dl, posted over the last two days, that is worth reading as, in some ways, the alternative view. A culture of not making informed decisions catches up with you after a while.

              1. 1

                The specific steps being advocated here sound like they’d result in pretty high availability. Four 9s is none too shabby.