Twilio is currently having a problem double-billing customers and then automatically shutting them off because their balances are too low. My own account had its credit card charged three times this morning for the same amount.
Earlier this week Google sent out duplicate payments to a lot of Android developers. My company’s bank account had last month’s payout from Google credited to it three times the other day, and then taken back this morning.
While I’m sure the two incidents are unrelated, they make me wonder how these sorts of things happen and how the code in those particular systems is designed. Whenever I have to build automated systems that do “serious” things like charge credit cards, transfer money, or shut down accounts, I am always extremely paranoid about things going haywire and ending up in the situation Twilio and Google are in right now (lots of unhappy customers, lots of money being moved around without authorization). I can’t believe some internal report or process at Google didn’t notice that its Android payment system was suddenly sending out many hundreds of thousands of dollars more than it normally does, and stop the process part-way through.
I’ve written billing systems, credit card processing scripts, and even a system at my old ISP that took automated shutdown of delinquent customers so far that it would telnet into our routers and switches and shut off T1 interfaces. That is scary shit to be running on its own. Shutting off the wrong interface could take down huge parts of our network or, at the very least, the wrong customer.
Being defensive isn’t writing a unit test to handle Twitter’s API fail-whaling; it’s writing code along the path of your nightly customer deactivation script so that if the number of accounts it’s about to deactivate is more than some percentage higher than yesterday’s or last week’s number, it fails, and fails loudly. Worst case, someone has to manually double-check the output and re-run it, but that’s better than continuing and deactivating a bunch of accounts that should have been left alone because of some weird external condition.
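To make that concrete, here is a minimal sketch of that kind of guard in Python. Everything in it is made up for illustration: the function names, the 50% threshold, the `force` override for the manual re-run, and the `deactivate` callback that stands in for whatever actually shuts an account off. It is not how Twilio, Google, or my old ISP did it, just the shape of the check.

```python
class DeactivationGuardError(RuntimeError):
    """Raised when a run looks abnormal and a human should review it."""


def check_deactivation_count(planned, baseline, max_increase_pct=50.0, force=False):
    """Fail loudly if `planned` is more than `max_increase_pct` percent above `baseline`.

    `baseline` is a previous run's count (yesterday's or last week's).
    `force=True` is the hypothetical override an operator passes after
    double-checking the list by hand.
    """
    if force:
        return
    # A baseline of 0 gives a limit of 0, so any deactivation needs review.
    limit = baseline * (1 + max_increase_pct / 100.0)
    if planned > limit:
        raise DeactivationGuardError(
            f"refusing to deactivate {planned} accounts: baseline was "
            f"{baseline}, limit is {limit:.0f}; review the list and re-run with force"
        )


def deactivate_delinquent(accounts, baseline, deactivate, **guard_options):
    """Run the guard first, then deactivate each account. Returns the count."""
    # The check happens before anything destructive is done.
    check_deactivation_count(len(accounts), baseline, **guard_options)
    for account in accounts:
        deactivate(account)
    return len(accounts)
```

The important part is the ordering: the whole list is built and counted before the first account is touched, so an abnormal run stops with nothing deactivated rather than half-way through.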