This is a well-written post-mortem for public reading. I encourage people to read through it.
As someone who currently works in the payments space and relies on gateways, I have been through several similar outages: we detected a gateway issue causing an outage, notified the gateway, who ack’d… and then we waited. More than once, like Monzo, we built a workaround on our end before the gateway provider could even mitigate the outage.
Hats off to the Monzo team, who clearly have a solid on-call and incident mitigation strategy in place. They detected the outage within 4 minutes, then built the best workaround they could and deployed it within 2 hours, while it took the gateway provider 9 hours to mitigate the change that caused the issue.
Unfortunately, in cases like this, the best one can do is make sure there is a clear SLA in place with the third party, backed by a contract that states their financial liability if they fail to meet it. Monzo does not tell us much about this part, but I suspect the gateway will have to pay a hefty penalty to Monzo, as a 9-hour outage puts the gateway's availability under 99% for the month, which is extremely poor. It is good to see Monzo pushing the third party to do a proper post-mortem and take preventive action, as well as holding them accountable in this post.
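
For anyone who wants to check the "under 99%" figure, here is a rough back-of-the-envelope sketch. It assumes the gateway counts as fully unavailable for the whole 9 hours and that the SLA is measured over a 30-day calendar month. Both are my assumptions, not details from the post-mortem or from any actual contract, and the function names are made up for illustration.

```python
HOURS_PER_DAY = 24


def availability_percent(downtime_hours: float, days_in_month: int = 30) -> float:
    """Monthly availability in percent, treating downtime as a full outage (assumption)."""
    total_hours = days_in_month * HOURS_PER_DAY
    return 100.0 * (total_hours - downtime_hours) / total_hours


def downtime_budget_hours(target_percent: float, days_in_month: int = 30) -> float:
    """Hours of downtime a given availability target allows in one month."""
    total_hours = days_in_month * HOURS_PER_DAY
    return total_hours * (100.0 - target_percent) / 100.0


if __name__ == "__main__":
    # 9 hours of downtime in a 30-day (720-hour) month
    print(f"{availability_percent(9):.2f}%")        # 98.75%
    # Downtime a 99% monthly target would allow
    print(f"{downtime_budget_hours(99.0):.1f} h")   # 7.2 h
```

Under these assumptions, a 99% monthly target leaves room for about 7.2 hours of downtime, so a single 9-hour incident is enough to miss it on its own.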