
  2. 1

    This whole issue gives me tummy aches… I live this stuff and so many get it wrong.

    • Failing hard is better than failing flakily. ie. If the only fix is a software upgrade, you want to know as soon as possible, as deterministically as possible, so you can find the root cause and fix it as soon as possible. ie. Continuing to sort of work, sort of not work, in an undefined manner is not degrading gracefully, and is not a good thing.
    • If the value of the entire system was hard dependent on the stock reordering thing working, halting to maintain the consistency of the stock accounting would be The Right Thing. But it wasn’t. “Re-ordering was not automatic, but staff would be alerted when stock was low on any items and decisions could be made.” ie. The value of the system was it operating as a till. In selling things. If they ran out of stock… because they sold it all, that actually would be “A Good Thing”. Having stock but being unable to sell it… not so good. Not running out of stock and selling as fast as possible would be better. ie. I would have designed as “Best Effort”. Maintain stock info if I could, drop info and continue selling if I couldn’t.
    • A better solution would be push instead of poll. No doubt some tills would be quiet. They should have a high tide mark and start pushing their data to the server when it hits it… and then drop info if the server is still too slow. (The server could still poll when it was quiescent.)
    • Dropping info is OK, just log that you are doing so. (A rough sketch of this best-effort, push-at-a-high-tide-mark idea follows this list.)
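
    Here is that rough sketch: a minimal, hypothetical C version of the best-effort, high-tide-mark idea. The names (stock_event, server_push, till_record_sale), the queue size, and the mock server are all invented for illustration; they are not from the real system, which polled rather than pushed.

    ```c
    /* Hypothetical sketch of the suggested best-effort design: keep selling no
     * matter what, buffer stock movements locally, push them at a high tide
     * mark, and drop (but log) the data if even that fails. */
    #include <stdbool.h>
    #include <stdio.h>

    #define QUEUE_SIZE     32   /* per-till buffer of stock movements (invented)  */
    #define HIGH_TIDE_MARK 24   /* start pushing to the server at this fill level */

    struct stock_event { int item_id; int qty; };

    static struct stock_event queue[QUEUE_SIZE];
    static int queue_len = 0;
    static int dropped   = 0;

    /* Stand-in for the real network send; returns false when the server is
     * too busy to accept the batch. */
    static bool server_push(const struct stock_event *batch, int n)
    {
        (void)batch; (void)n;
        static int calls = 0;
        return ++calls > 40;    /* pretend the server is swamped for a while, then catches up */
    }

    static void try_push(void)
    {
        /* Below the tide mark the till stays quiet and lets the server poll it
         * when the server is quiescent; above it, the till pushes. */
        if (queue_len >= HIGH_TIDE_MARK && server_push(queue, queue_len))
            queue_len = 0;
    }

    /* The sale itself always completes; the stock accounting is best effort. */
    static void till_record_sale(int item_id, int qty)
    {
        printf("SOLD item %d x%d\n", item_id, qty);
        if (queue_len == QUEUE_SIZE) {
            dropped++;          /* dropping is OK, but log that we did */
            fprintf(stderr, "stock log full, dropping event for item %d\n", item_id);
        } else {
            queue[queue_len++] = (struct stock_event){ item_id, qty };
        }
        try_push();
    }

    int main(void)
    {
        for (int i = 0; i < 100; i++)
            till_record_sale(i % 5, 1);
        fprintf(stderr, "total stock events dropped: %d\n", dropped);
        return 0;
    }
    ```

    The ordering is the point: the sale completes before any stock bookkeeping is attempted, so a slow or unreachable server can never stop the till selling.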

    In a previous life a CEO told me a war story of a multi user real time system that had complaints that it was “slow in the morning”.

    Investigation showed it performed to spec, but it was slower than at other times, and human typists occasionally screwed up because they were used to the normal response times.

    Solution: A tiny “sleep” between characters…. enough to make the response time consistent across the day and let the cpu catch up during the busy period. Result: Happy humans.
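
    For what it’s worth, here is a tiny C sketch of one way to read that fix. The 50 ms budget, the POSIX clock_gettime/nanosleep calls, and the function names are all assumptions for illustration; I don’t know how the original system actually did it.

    ```c
    #include <stdio.h>
    #include <time.h>

    #define TARGET_NS (50L * 1000 * 1000)   /* invented 50 ms budget per character */

    /* Stand-in for the real per-character work: fast when idle, slow when busy. */
    static void process_and_echo(char c)
    {
        putchar(c);
        fflush(stdout);
    }

    static long elapsed_ns(const struct timespec *start)
    {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - start->tv_sec) * 1000000000L
             + (now.tv_nsec - start->tv_nsec);
    }

    /* Pad each character's response up to the budget.  On a quiet system most
     * of the budget is spent sleeping; on a busy one little or no padding is
     * left, so the felt response time stays the same all day. */
    static void echo_char(char c)
    {
        struct timespec start;
        clock_gettime(CLOCK_MONOTONIC, &start);

        process_and_echo(c);

        long spare = TARGET_NS - elapsed_ns(&start);
        if (spare > 0) {
            struct timespec pad = { 0, spare };
            nanosleep(&pad, NULL);
        }
    }

    int main(void)
    {
        const char *msg = "happy humans\n";
        for (const char *p = msg; *p; p++)
            echo_char(*p);
        return 0;
    }
    ```

    The original may equally well have been an unconditional tiny delay between characters; the point is simply that a small, deliberate slowdown bought consistency, and with it happy humans.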

    1. 1

      This is a long reply because I wanted to address all your points.

      This whole issue gives me tummy aches…

      Sorry.

      I live this stuff and so many get it wrong.

      I agree that many get it wrong, which is one reason, I think, why stories like this should be shared more widely and become part of the culture, including - perhaps especially - when people have got things wrong.

      Failing hard is better than failing flakily.

      Not always. Yes, if you have control over the software, and if total loss of service is acceptable, then failing hard is the fastest way to find problems and have the chance to fix them. I often work in situations where non-service is unacceptable, and we’re working with unstable software. In those cases failing hard isn’t an option, and definitely is not “better”.

      What’s needed in all cases is full and proper analysis, followed in the fullness of time by a full and proper post-mortem, with some way of integrating the lessons learned into future development processes.

      If the value of the entire system was hard dependent on the stock reordering thing working, halting to maintain the consistency of the stock accounting would be The Right Thing. But it wasn’t.

      “Hard dependent” isn’t the only criterion. In this case there was a balance of concerns, and one’s own objective function as a software developer might not be the same as that of one’s customer.

      “Re-ordering was not automatic, but staff would be alerted when stock was low on any items and decisions could be made.” ie. The value of the system was it operating as a till. In selling things.

      Not the only value. In general operation the system worked perfectly, and the flags for ordering stock were an important part of their operation. The only problem was when demand exceeded capacity. Since capacity couldn’t be increased, demand had to be decreased. There was no option.

      So demand could be decreased by abandoning the stock logging. That was deemed less acceptable than slowing down the ringing up of sales by a few percent.

      If they ran out of stock… because they sold it all, that actually would be “A Good Thing”.

      That’s not the case for every vendor. For some vendors, part of their reputation is not being out of stock. Yes, for some vendors running out of stock isn’t a problem, but that’s not universally true. Sales and marketing interact with psychology, and people are weird. I’ve seen many times that apparently water-tight analyses by programmers turn out to be less good than those done by others with more “soft” skills.

      “No plan survives contact with the enemy” – Field Marshal Helmuth Karl Bernhard Graf von Moltke

      Having stock but being unable to sell it… not so good.

      Before the “graceful degradation” change that was a problem … they had stock and couldn’t sell it. After the change they could sell the stock, albeit at a slightly reduced rate … one that was deemed acceptable by the vendor.

      Not running out of stock and selling as fast as possible would be better.

      Yes.

      I would have designed as “Best Effort”. Maintain stock info if I could, drop info and continue selling if I couldn’t.

      That was presented as an option, it was discussed, all the ramifications considered, and it was rejected.

      A better solution would be push instead of poll.

      Possibly, but not viable given the hardware.

      Dropping info is OK, just log that you are doing so.

      Again, not universally true.

      1. 1

        Failing hard is better than failing flakily.

        Not always. Yes, if you have control over the software, and if total loss of service is acceptable, then failing hard is the fastest way to find problems and have the chance to fix them. I often work in situations where non-service is unacceptable, and we’re working with unstable software.

        The problem is Nasal Daemons.

        This is asm code (C code suffers from this a lot; all languages do to some degree).

        So what would happen if instead of checking for the storage bound and halting… it just didn’t check and kept on storing?

        For some undefined amount of time, it might continue working.

        For some undefined amount of time it might start behaving more and more crazily. Including sending crazy results to the server with undefined consequences of doing that.

        Possibly months of corrupted and unrecoverable data. (Yes, I have seen that happen.)

        At some undefined point in time it would STILL cease working.

        As they put it in the C world… venture into “undefined” behaviour and the compiler is entitled to make daemons fly out of your nose….

        And all this would depend on exact version of firmware, hardware, device usage history, system load, phase of moon, server state, server version….. A complete and horrible nightmare to debug, and recover from.

        No. No. NO! Failing HARD IS the right thing to do rather than venture into undefined and undefinable behaviour.

        Of course, handling sensibly is better than failing hard, but failing hard is better than going flaky and unreliable and undefined.

        You can’t argue, “The customer prefers it to stopping working”, because you can’t tell the customer what the consequences are, as you cannot know them yourself!

        If you could, that would be defined behaviour!
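
        To make the contrast concrete, here is a hypothetical C sketch of the two choices. The storage bound, the names and the halting behaviour are invented; the real code was assembler, but the argument is the same.

        ```c
        #include <stdio.h>
        #include <stdlib.h>

        #define STORE_SIZE 256              /* invented storage bound */

        static int store[STORE_SIZE];
        static int used = 0;

        /* Choice 1: check the bound and fail HARD.  Deterministic, visible,
         * easy to root-cause: the 257th record stops the till. */
        void record_checked(int value)
        {
            if (used == STORE_SIZE) {
                fprintf(stderr, "storage full: halting rather than going undefined\n");
                abort();
            }
            store[used++] = value;
        }

        /* Choice 2: don't check.  The 257th call writes past the end of `store`,
         * which is undefined behaviour in C: it may seem to keep working, it may
         * quietly corrupt neighbouring data (perhaps `used` itself), it may send
         * garbage to the server, and the compiler is entitled to assume it never
         * happens at all.  Nasal daemons. */
        void record_unchecked(int value)
        {
            store[used++] = value;
        }

        int main(void)
        {
            for (int i = 0; ; i++)
                record_checked(i);          /* stops, predictably, at record 257 */
        }
        ```

        record_checked fails at a known, reproducible point; record_unchecked hands the failure mode over to whatever the compiler, the hardware and the phase of the moon decide.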

        1. 1

          Hah! We’re arguing past each other.

          You’re talking about bugs, and I’m talking about choosing between types of behaviour under limiting circumstances.

          The situation here wasn’t a bug, it was a question of how to behave when certain circumstances arose.

          1. 1

            Ok. Agreed.

            The normal way in the world is that when the programmer codes that limit, he asks, “So what should we do if we hit that limit?”

            I bet his marketing folk said, “Nah, probably won’t happen. Handling that is a version 2 feature; we’ll discuss it with the customer when and if it arises.”

            And that’s what you get.

            The programmer did the Right Thing: he made the till stop working instead of going undefined.

            Graceful degradation in the real world is a V2 (or v10) feature.

            1. 1

              Yes.

              V1: if you hit the limit, stop the tills so we know it’s happening. (It probably won’t)

              Ah. It does. OK, then …

              V2: Stopping the tills is unacceptable, and it’s happening. So instead, when they hit a lower tide mark, start slowing the tills down.

              So V2 is degraded, but gracefully, rather than catastrophically, and that was the point of the original post.
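
              For concreteness, here is a hypothetical C sketch of that V1-to-V2 change. The hard limit, the tide mark, the delay and the names are all invented for illustration.

              ```c
              #include <stdio.h>
              #include <stdlib.h>
              #include <time.h>

              #define HARD_LIMIT    256   /* V1: hitting this stopped the tills         */
              #define LOW_TIDE_MARK 192   /* V2: start slowing the tills down from here */

              static int backlog = 0;     /* stock records waiting for the server */

              void ring_up_sale(void)
              {
                  if (backlog >= HARD_LIMIT) {
                      /* V1 behaviour: stop the tills so we know it's happening. */
                      fprintf(stderr, "stock backlog full: stopping the till\n");
                      abort();
                  }
                  if (backlog >= LOW_TIDE_MARK) {
                      /* V2 behaviour: degrade gracefully.  Slow each sale a little,
                       * more the closer we get to the hard limit, so the server can
                       * catch up and the tills (almost) never have to stop. */
                      struct timespec pause = { 0, (backlog - LOW_TIDE_MARK) * 10000000L };
                      nanosleep(&pause, NULL);
                  }
                  backlog++;              /* the sale itself still goes through */
                  printf("sale rung up (backlog %d)\n", backlog);
              }

              int main(void)
              {
                  for (int i = 0; i < 300; i++) {
                      ring_up_sale();
                      if (i % 3 == 0 && backlog > 0)
                          backlog--;      /* the overloaded server drains only one sale in three */
                  }
                  return 0;
              }
              ```

              The only behavioural change from V1 is the middle branch: instead of running straight into the hard stop, the till trades a little speed for staying up. Degraded, but gracefully.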