1. 4

I will admit that I haven’t faced this issue in previous jobs, but we have a hell of a time with flaky QA servers here. They go down at random, sometimes for days, without warning. That interrupts testing (though, to be perfectly honest, they should be mocking these services) and delays releases way too much.

Currently we have a few shell scripts that call endpoints and record the responses, but the setup isn’t very robust: there’s no alerting or trending.
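
A minimal sketch of how scripts like these could gain alerting and a history for trending, assuming Python is available. The endpoint URL, the threshold of three consecutive failures, and the `alert` callback are all placeholders chosen for illustration, not details of the actual setup.

```python
import http.client
import time
import urllib.request
from collections import defaultdict

# Hypothetical endpoint; replace with the QA services you actually depend on.
ENDPOINTS = {"orders-qa": "https://qa.example.com/orders/health"}
FAILURE_THRESHOLD = 3  # assumption: alert after 3 consecutive failed checks


def http_check(url, timeout=5.0):
    """Return (ok, latency_seconds) for a single GET request."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers DNS failures, refused connections, timeouts and 4xx/5xx replies.
        ok = False
    return ok, time.monotonic() - start


class Monitor:
    def __init__(self, endpoints, check=http_check, alert=print,
                 threshold=FAILURE_THRESHOLD):
        self.endpoints = endpoints
        self.check = check
        self.alert = alert  # placeholder: swap in email, chat webhook, pager
        self.threshold = threshold
        self.failures = defaultdict(int)  # consecutive failures per endpoint
        self.history = []  # (timestamp, name, ok, latency) rows for trending

    def run_once(self):
        for name, url in self.endpoints.items():
            ok, latency = self.check(url)
            self.history.append((time.time(), name, ok, latency))
            if ok:
                self.failures[name] = 0
                continue
            self.failures[name] += 1
            # Alert once, when the threshold is first reached, to avoid spam.
            if self.failures[name] == self.threshold:
                self.alert(f"{name} failed {self.threshold} checks in a row: {url}")
```

Calling `run_once()` in a loop with a sleep between passes keeps the failure counters in memory. Requiring several consecutive failures before alerting is one simple way to avoid being notified about a single dropped request.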

We use AWS for hosting, which should provide satisfactory failover. I’m not sure of the exact setup though. We have an extremely narrowly scoped set of teams.

Aside from mocking those services, does anyone have a sort of monitoring and alerting solution for this scenario? Keep in mind that I don’t have access to most of these services (HUGE company).

  2. 4

    The same way I monitor prod. For us that’s Prometheus (https://prometheus.io/), but there are many, many OSS options.

    1. 1

      This looks pretty interesting. I’ll check it out tomorrow. Thanks!

    2. 3

      We also use the same services to monitor non-production environments as we do production, Nagios in this case. However, we only use Pingdom to monitor the production services externally. (Mostly so it doesn’t alert us out of hours about outages we don’t care about until the next day.)

      1. 2

        You might have a different level of “service”/SLA for different environments (we don’t get paged on weekends or at night for non-production/tooling), but during the day we treat them like production. If an environment is treated as something you don’t care about, or that can wait, then probably either its outages will slow someone else down or you don’t really need it.

        1. 2

          IMO, you should monitor them using the same tools that you use to monitor your prod servers, because you’re testing the behavior of those tools too. If a lower env service goes down and your monitoring system doesn’t let you know about it, then that system might not be letting you know when prod goes down either. Of course, you should be able to set up the urgency of the alerts appropriately.

          Our department uses a custom in-house monitoring service that periodically queries each service and runs a set of smoke tests that check the connections to its various dependencies, including DB, cache, etc. It produces emails that you can set filters and alerts on as needed.
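
A minimal Python sketch of the smoke-test approach described in the last comment. This is not that in-house service: the service name, hosts, ports, and subject format are hypothetical, and a bare TCP connect is a much weaker check than a real query against the database or cache. The report is built as an email message but not sent.

```python
import socket
from email.message import EmailMessage

# Hypothetical dependencies per service; hosts and ports are placeholders.
SERVICES = {
    "orders-qa": {
        "db": ("orders-db.qa.internal", 5432),
        "cache": ("orders-cache.qa.internal", 6379),
    },
}


def tcp_check(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_smoke_tests(services, check=tcp_check):
    """Return {service: {dependency: bool}} for every configured dependency."""
    return {
        service: {dep: check(host, port) for dep, (host, port) in deps.items()}
        for service, deps in services.items()
    }


def build_report(results, sender, recipient):
    """Build (but do not send) an email summarising the smoke-test results."""
    failed = [
        f"{service}/{dep}"
        for service, deps in results.items()
        for dep, ok in deps.items()
        if not ok
    ]
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    # A predictable subject prefix makes mail filters easy to write.
    status = "FAIL" if failed else "OK"
    msg["Subject"] = f"[smoke][{status}] {len(failed)} failing dependency checks"
    body = [f"FAILED {name}" for name in failed] or ["All dependency checks passed."]
    msg.set_content("\n".join(body))
    return msg
```

The returned message could be handed to `smtplib.SMTP(...).send_message(msg)` on a schedule. Filtering on the `[smoke][FAIL]` prefix is one way to set the urgency per environment.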