  2. 4

    I think this article misses a critically important point about load testing – you need it when your service has value to customers and either there’s an upcoming spike in traffic or a change to your service’s capacity to handle requests.

    If your service doesn’t have value to customers, then downtime during peaks is acceptable, because nobody cares.

    If your service is operating at a constant, well-forecasted traffic plateau and is not changing, then you’re fine.

    But if downtime would significantly erode customer experience and trust, and your traffic is variable and threatens to be an outlier (e.g., Black Friday, Prime Day, Super Bowl commercials, Christmas morning), then you need it.

    And if customers care, then even at a steady state of traffic you need it when you are making significant architectural changes (e.g., switching from one database to another) that could expose surprise bottlenecks.

    You also need it in certain other special circumstances that don’t impact everyone, though you should think about this one regardless: if your service has a deep dependency graph (for example, my teams at Amazon have this in Alexa), then being able to run a load test even during non-peak periods can help flush out emergent surprise bottlenecks. Sometimes teams make changes to solve a problem, and their reasoning is solid but not complete, and the resulting system bakes in an assumption which impacts performance, among other emergent problems.

    1. 3

      This comment might sound very critical, but that’s not my point. I agree with the author that load testing is very hard and if you’re going to do it wrong, you might as well not do it at all. This comment is meant to further illustrate how many nuances there are that need to be considered.

      Some other things that are hard and not explicitly mentioned in the article:

      • actually generating load the way you think you do;
      • reporting something more informative than just a mean value, or a percentile, or something equally over-simplified; and
      • gathering enough samples to back up your data.

      This has been my pet peeve for the last few months and I’m planning on expanding it into an article, but haven’t gotten that far yet.
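
      To make the last two bullets a bit more concrete, here is a rough sketch (the latency numbers are made up) of putting a bootstrap confidence interval around a tail percentile instead of reporting a single number:

      ```python
      import random

      def p99(samples):
          """99th percentile of a list of latency samples."""
          s = sorted(samples)
          return s[int(0.99 * (len(s) - 1))]

      def bootstrap_ci(samples, stat=p99, n_resamples=1000, alpha=0.05):
          """Resample with replacement and take the middle (1 - alpha) of the
          resulting statistics as a rough confidence interval."""
          estimates = sorted(
              stat([random.choice(samples) for _ in samples])
              for _ in range(n_resamples)
          )
          lo = estimates[int((alpha / 2) * n_resamples)]
          hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
          return lo, hi

      # Made-up latencies (ms): mostly fast, with a heavy tail.
      latencies = [random.lognormvariate(3, 0.8) for _ in range(500)]
      low, high = bootstrap_ci(latencies)
      print(f"p99 = {p99(latencies):.1f} ms, 95% bootstrap CI [{low:.1f}, {high:.1f}] ms")
      ```

      If the interval comes out embarrassingly wide, the run simply didn’t produce enough samples to say much about the tail.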

      You don’t have continuous load, i.e. your traffic is bursty […]

      Essentially all open systems have bursty traffic. Some would argue Pareto distributed, others lognormal, but the point is that when your requests come from independent clients, they will sometimes bunch up and arrive really close to each other by chance. (Note that lognormal is more bursty than nice, simple Poisson arrivals. This could vary from system to system though.)

      In other words, this point, which the article presents almost as a rare occurrence, is probably the common case for many people.
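
      As a toy illustration of how much more the requests bunch up under a heavier-tailed inter-arrival distribution, here is a small sketch comparing a Poisson process (exponential inter-arrivals) against a lognormal with the same mean rate; the rate and sigma are arbitrary:

      ```python
      import math
      import random

      def arrivals(next_gap, mean_gap=0.01, horizon=60.0):
          """Arrival timestamps over `horizon` seconds for a given
          inter-arrival-time sampler (mean_gap seconds between requests)."""
          t, times = 0.0, []
          while t < horizon:
              t += next_gap(mean_gap)
              times.append(t)
          return times

      def busiest_second(times):
          """Number of arrivals in the busiest whole-second bin."""
          counts = {}
          for t in times:
              counts[int(t)] = counts.get(int(t), 0) + 1
          return max(counts.values())

      exponential = lambda m: random.expovariate(1.0 / m)  # Poisson process
      sigma = 1.5  # arbitrary; bigger sigma means burstier arrivals
      lognormal = lambda m: random.lognormvariate(math.log(m) - sigma ** 2 / 2, sigma)

      print("busiest second, Poisson:  ", busiest_second(arrivals(exponential)))
      print("busiest second, lognormal:", busiest_second(arrivals(lognormal)))
      ```

      Both samplers average about 100 requests per second, but the lognormal one routinely produces individual seconds several times busier than anything the Poisson process shows.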

      If you don’t have good metrics, you won’t be able to prove that the LT workload is similar to prod and interpreting LT results will be an uphill battle.

      I agree with this in spirit, but in practice I’ve found the uphill battle to be interpreting results when the load generation tries to mimic production traffic. It is, in a sense, easier to interpret what happens at sustained loads, because that gives a hint about how the system will need to recover after peaks in traffic.
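
      One way to get those sustained loads (and to actually generate the load you think you do) is an open-loop driver that schedules requests off a fixed clock rather than off the previous response. A minimal sketch, where `send_request` is just a placeholder for the real call to the system under test:

      ```python
      import asyncio
      import time

      async def send_request(i):
          """Placeholder for the real request to the system under test."""
          await asyncio.sleep(0.05)  # stand-in for network/service time

      async def open_loop(rate_per_s, duration_s):
          """Issue requests on a fixed schedule, regardless of how long earlier
          requests take. A closed loop (send, wait for reply, send again) backs
          off exactly when the system slows down, which is usually not the load
          you think you are generating."""
          interval = 1.0 / rate_per_s
          start = time.monotonic()
          tasks, n = [], 0
          while time.monotonic() - start < duration_s:
              tasks.append(asyncio.create_task(send_request(n)))
              n += 1
              # Sleep until the scheduled time of the next request, not
              # "interval after now", so lateness does not accumulate.
              await asyncio.sleep(max(0.0, start + n * interval - time.monotonic()))
          await asyncio.gather(*tasks)

      asyncio.run(open_loop(rate_per_s=200, duration_s=5))
      ```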

      The way systems tend to react to high loads is by building up some sort of computational debt, be it through buffering/queueing, memory allocations, algorithms with good amortized but awful worst-case performance, and so on. The system does a bunch of things cheaply and normally has time to recover and repay this debt, but under sustained high loads further requests will be stalled while the computational debt is repaid. This is also how you get large performance variance in production when the system becomes loaded, but it’s much, much easier to spot with sustained high loads.
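
      A toy single-queue model makes the debt-repayment point visible; the capacity and arrival numbers below are arbitrary, but the shape is the same: a brief burst drains quickly, while a sustained one keeps stalling later requests.

      ```python
      def simulate(arrivals_per_tick, capacity=100):
          """Toy single-queue model: each tick the system serves up to `capacity`
          units of work; anything beyond that carries over as backlog (the debt)."""
          backlog = 0
          for tick, arriving in enumerate(arrivals_per_tick):
              backlog = max(0, backlog + arriving - capacity)
              delay = backlog / capacity  # queueing delay, in ticks, for new work
              print(f"tick {tick:2d}: arriving={arriving:3d} backlog={backlog:4d} delay={delay:4.1f} ticks")

      # Brief peak: the debt is repaid within a few ticks.
      simulate([80] * 3 + [150] * 2 + [80] * 5)
      print("---")
      # Sustained peak: debt keeps growing and every later request pays for it.
      simulate([80] * 2 + [150] * 8 + [80] * 5)
      ```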

      It’s also worth mentioning that local metrics are a good diagnostic tool, but they are often deceptive when it comes to system performance, because of interdependent operations, queuing effects, and whatnot. The author touches on this in the next section, though.

      What does it mean for your LT to pass?

      I don’t agree that load tests have to have go/no-go requirements. It might be even more useful to run them to keep an eye on how the performance of your system changes over time. If your system performs predictably at given loads, you know a lot about what will happen in prod.

      You might be trying for a fail, i.e. instead of saying yes/no can I handle traffic T, trying to solve for the T at which your system falls over.

      You might, but this is rarely a useful metric. So what if the system falls over at a load of 3751? a) you’ll never operate anywhere near that in production, because that’s where even a tiny hiccup will break everything, and b) the specific number depends on so many other things that it might just as well have been 3524 had you measured a day earlier or later.

      Knowing that the LT system is in a bad state is more art than science

      This is very true, and a good reason to keep an XmR control chart over the results. Significant deviations should be explainable, or they likely indicate a bad load test.
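
      For reference, the XmR limits are cheap to compute from a series of load-test results. A minimal sketch (the run values are invented):

      ```python
      def xmr_limits(values):
          """Natural process limits for an individuals/moving-range (XmR) chart,
          using the standard constants 2.66 and 3.268 for n=2 moving ranges."""
          moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
          x_bar = sum(values) / len(values)
          mr_bar = sum(moving_ranges) / len(moving_ranges)
          return {
              "center": x_bar,
              "lower": x_bar - 2.66 * mr_bar,
              "upper": x_bar + 2.66 * mr_bar,
              "mr_upper": 3.268 * mr_bar,  # limit for the moving ranges themselves
          }

      # Invented p99 latencies (ms) from successive load-test runs.
      runs = [212, 220, 208, 215, 231, 219, 224, 290, 218, 222]
      limits = xmr_limits(runs)
      outliers = [v for v in runs if not limits["lower"] <= v <= limits["upper"]]
      print(limits)
      print("runs outside the natural process limits:", outliers)
      ```

      Anything flagged this way deserves an explanation before you trust the run.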

      1. 1

        I do think there’s a difference between types of bursty traffic. In my particular corner of the enterprise world, our traffic comes from customers that we have a contractual relationship with, executing business processes that don’t double overnight. Of course, traffic is bursty at a micro-level, but I don’t need to think about “can I handle 5x as many users without provisioning a new database?”

        That’s in contrast to a lot of areas where going viral could, in principle, lead to traffic increasing by an order of magnitude from one day to the next.

        1. 1

          Here are some examples of what you’re describing – applications that can expect to see bursts far outside normal traffic patterns:

          • e-commerce site on Black Friday
          • MMO login server on the day a new patch drops that adds rideable ponies
          • survey website which gets iframed on the front page of a national newspaper’s website

          Any of the above would be far outside the 99% confidence interval for arrivals drawn from a Poisson distribution at those applications’ normal day-to-day traffic levels.
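
          To put a rough number on that, assuming the baseline traffic really is Poisson (the figures below are invented): the 99th-percentile minute sits only modestly above the average, nowhere near the order-of-magnitude jump a front-page feature or Black Friday brings.

          ```python
          import math

          def poisson_quantile(lam, q):
              """Smallest k such that P(X <= k) >= q for X ~ Poisson(lam)."""
              k, p = 0, math.exp(-lam)
              cumulative = p
              while cumulative < q:
                  k += 1
                  p *= lam / k
                  cumulative += p
              return k

          baseline = 300  # invented: ordinary-day requests per minute
          print("99th-percentile minute at normal traffic:", poisson_quantile(baseline, 0.99))
          print("a front-page / Black Friday style burst: ", 10 * baseline)
          ```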

          1. 1

            Yes, what you are describing does not sound to me like a completely open system. I don’t know if closed is the right word either, so semi-open? I think Poisson arrivals could be a meaningful model there – the same as for the classical phone-line subscribers.