1. 3

  2. 3

    One of the GoCardless SREs here.

    Happy to discuss anything and answer questions, though I’m going to bed in the next hour!

    1. 2
      1. What’s one lesson you learned from this incident that would be useful to share with developers who are not SREs?
      2. What’s one assumption you had challenged?
      1. 5
        1. Cold code is broken code. Code that only exists to handle failure is most susceptible to this. A more common example is an infrequently run (e.g. monthly) cron job. In many cases I’d prefer it to run daily, even if it has to no-op at the last second, so that more of the code is exercised more often. Better still, in some cases it could do its work incrementally! Either way is better than having the job fail on the day it really has to run.
        2. Our ability to manually carry out actions that automation has handled for a long time. It turns out we're not so good at that, which prolonged the incident even after we'd decided to bring Postgres up by hand.
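
To make the "run it daily, no-op at the last second" idea from the first answer concrete, here is a minimal sketch in Python. It is not GoCardless code. The invoice job, `build_invoices`, `run_daily`, `RunResult`, and the `send` callback are all hypothetical names made up for illustration. The assumption is that the risky part of the job (building the work) can safely run every day, and only the final side effect has to wait for the last day of the month.

```python
import calendar
from dataclasses import dataclass
from datetime import date


@dataclass
class RunResult:
    prepared: int    # how many items the job built on this run
    committed: bool  # whether the final side effect was actually performed


def is_last_day_of_month(today: date) -> bool:
    # calendar.monthrange returns (weekday of the 1st, number of days in the month)
    return today.day == calendar.monthrange(today.year, today.month)[1]


def build_invoices(customers):
    # Hypothetical stand-in for the expensive, failure-prone part of the job.
    return [f"invoice for {name}" for name in customers]


def run_daily(today: date, customers, send) -> RunResult:
    # This part runs every day, so breakage shows up long before month end.
    invoices = build_invoices(customers)

    if not is_last_day_of_month(today):
        # No-op at the last step: everything above was still exercised.
        return RunResult(prepared=len(invoices), committed=False)

    for invoice in invoices:
        send(invoice)
    return RunResult(prepared=len(invoices), committed=True)
```

Scheduled daily, a call such as `run_daily(date.today(), customers, send)` sends nothing on most days but still runs the build step, so a failure there surfaces the day it is introduced and not on the one day the job really has to run.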