  1.

    This is the most concerning paragraph:

    Transient resource or network failures can completely disable Chronos. Most systems tested with Jepsen return to some sort of normal operation within a few seconds to minutes after a failure is resolved. In no Jepsen test has Chronos ever recovered completely from a network failure. As an operator, this fragility does not inspire confidence.

    1.

      This is really neat! Thanks for this. The Mesos team’s responses in the GitHub issue have been a little baffling.

      “Instrument your jobs to identify whether they ran or not”

      This is something we’re struggling with; an operation like “did we schedule a job to charge the user’s credit card” might happen in a 5-minute window, but it might not, and it definitely won’t at 2am, when we’re not open. Any alerting that’s part of a job, by definition, won’t alert if the job doesn’t run. Are there any best practices here?

      1.

        Standard distributed systems instrumentation:

        • processes should submit success/failure metrics to a TSDB
        • if a process is creating work for another, it can submit a metric about the expectation that a request will complete
        • unanticipated errors should be logged and aggregated, and should feed a separate metric with a much more sensitive alerting threshold
        • automated alerting should be external to these systems, and should assert expectations based on metrics like “is the current counter of successful tasks within X% of the counter for expected tasks 5 minutes ago?” etc…

        There are many ways to ensure systems are behaving according to your assumptions. In general, over-report metrics, make logs tell a really good story (otherwise they should have been metrics), and encode your assumptions into alerting thresholds. Start with over-sensitive thresholds and ease them back once you learn for a fact that they are safe. It’s better for your system to over-communicate before you have learned what its limits actually are, but it’s vital to tune it back before operators become desensitized. There’s a rough sketch of the job-side metrics and the external check below.
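
        Roughly, in Python, that split might look like the sketch below. The TSDB endpoint, the push_metric helper, and the metric names are placeholders, not any particular TSDB’s API; the jobs only emit facts, and the assertion lives in an external checker:

            import time

            import requests  # any HTTP client works; used here just for illustration

            TSDB_URL = "http://tsdb.example.com/api/put"  # hypothetical push endpoint

            def push_metric(name, value, tags=None):
                """Over-report: emit a single counter point to the TSDB."""
                point = {"metric": name, "value": value,
                         "timestamp": int(time.time()), "tags": tags or {}}
                requests.post(TSDB_URL, json=point, timeout=5)

            def enqueue_charges(user_ids):
                """Producer side: record the expectation that work will complete."""
                for uid in user_ids:
                    push_metric("billing.charge.expected", 1, {"user": uid})
                    # ... actually enqueue the charge job here ...

            def run_charge_job(user_id):
                """Worker side: record success/failure for every attempt."""
                try:
                    # ... charge the card ...
                    push_metric("billing.charge.success", 1, {"user": user_id})
                except Exception:
                    push_metric("billing.charge.failure", 1, {"user": user_id})
                    raise  # unanticipated errors also get logged and aggregated

            def throughput_ok(successes_now, expected_5m_ago, tolerance=0.10):
                """External alerter: is the current success counter within X% of
                the expected counter from five minutes ago?  Both values come from
                a TSDB query made by the monitoring system, not by the jobs."""
                if expected_5m_ago == 0:
                    return True  # nothing was expected, so nothing to assert
                return successes_now >= expected_5m_ago * (1 - tolerance)

        Because the checker runs somewhere other than the scheduler, a job that never fires still trips the “expected vs. successful” comparison.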

        1.

          I think it’s interesting to consider that reporting to the time-series database (TSDB) you mention can, itself, become part of the problem under conditions of adverse system health.

          It is, as far as I’m aware, not really possible to ensure exactly-once delivery of a message that has side effects – e.g., the example above of “charge the user’s credit card”. You need some way of making the process idempotent, like being able to ask the merchant whether you just charged the user under a particular pre-generated transaction reference, or, if you don’t believe you have charged, whether you made any charges within a particular time period.
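
          Something like the sketch below, where the reference is generated and durably stored before the charge is attempted. merchant.get_charge, merchant.create_charge, and the db helpers are made-up stand-ins for whatever your payment provider and datastore actually expose:

              import uuid

              def charge_idempotently(merchant, db, user_id, amount_cents):
                  # Generate and persist the reference *before* any side effect,
                  # so a crash after this point still leaves something to look up.
                  ref = db.get_pending_reference(user_id) or str(uuid.uuid4())
                  db.save_pending_reference(user_id, ref)

                  # Ask the merchant whether this reference was already charged
                  # (e.g. we crashed earlier, or the job was retried).
                  existing = merchant.get_charge(reference=ref)
                  if existing is not None:
                      return existing  # already done; do not charge twice

                  # Only now perform the side effect, tagged with the same reference.
                  return merchant.create_charge(reference=ref, user=user_id,
                                                amount_cents=amount_cents)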

          I think that in the same vein, you want to build your monitoring to look at the (hopefully robust) storage of the actual effects in the primary system, rather than some adjunct system. For example, if you intend to run a metering/billing roll-up every hour of every day, you should periodically check for the results of the last 24 or 48 (or N) billing runs.
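
          As a rough sketch, a periodic check against the primary store could look like the following; db.billing_runs_since and the row shape are hypothetical, the point being that the check reads the actual roll-up results rather than a separate metrics pipeline:

              from datetime import datetime, timedelta, timezone

              def missing_billing_runs(db, lookback_hours=48):
                  """Return the hours in the lookback window with no recorded roll-up."""
                  now = datetime.now(timezone.utc)
                  start = (now - timedelta(hours=lookback_hours)).replace(
                      minute=0, second=0, microsecond=0)
                  runs = db.billing_runs_since(start)  # rows from the primary store
                  seen = {r.completed_at.replace(minute=0, second=0, microsecond=0)
                          for r in runs}
                  missing, hour = [], start
                  while hour < now - timedelta(hours=1):  # skip the still-open hour
                      if hour not in seen:
                          missing.append(hour)
                      hour += timedelta(hours=1)
                  return missing  # alert if this is ever non-empty

          If that list is ever non-empty, either the roll-up didn’t run or it ran and failed to persist its results, and both are worth an alert.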