1. 9

Twilio is currently having a problem double-billing customers and then automatically shutting them off because their balances are too low. My own account had its credit card charged 3 times this morning for the same amount.

Earlier this week Google sent out duplicate payments to a lot of Android developers. My company’s bank account had last month’s payout from Google credited to it three times the other day, and then taken back this morning.

While I’m sure the two incidents are unrelated, it makes me wonder how these sorts of things happen and how the code in those particular systems is designed. Whenever I have to make automated systems that do “serious” things like charge credit cards, transfer money, or shut down accounts, I am always extremely paranoid about things going haywire and getting into situations like the ones Twilio and Google are in right now (lots of unhappy customers, lots of money being moved around without authorization). I can’t believe some internal report or process at Google didn’t notice that their Android payment systems were suddenly sending out many hundreds of thousands of dollars more than they normally do, and stop the process part-way through.

I’ve written billing systems, credit card processing scripts, and even a system at my old ISP that went so far in automating the shutdown of delinquent customers that it would telnet into our routers and switches and shut off T1 interfaces. That is scary shit to be running on its own. Shutting off the wrong interface could take down huge parts of our network or, at the very least, the wrong customer.

Being defensive isn’t writing a unit test to handle Twitter’s API failwhale'ing, it’s writing code along the path of your nightly customer deactivation script so that if the number of accounts it’s about to deactivate is more than some percentage higher than yesterday or last week, it fails, and fails loudly. Worst case, someone has to manually double-check the output and re-run it, but that’s better than continuing and deactivating a bunch of accounts that shouldn’t have been touched because of some weird external condition.
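
As a rough illustration of that kind of guard, here is a minimal Python sketch. Every name in it (`check_batch_size`, `run_deactivation`, `TooManyDeactivations`) and the 25% threshold are made up for the example, not taken from any real billing system; the baseline would be whatever yesterday's or last week's run recorded.

```python
class TooManyDeactivations(RuntimeError):
    """Raised when tonight's batch looks abnormally large (hypothetical name)."""


def check_batch_size(todays_count, baseline_count, max_increase=0.25):
    """Fail loudly if todays_count exceeds the baseline by more than max_increase.

    max_increase=0.25 means "at most 25% more than the baseline"; the value is
    an arbitrary assumption. A baseline of 0 makes any non-empty batch fail.
    """
    limit = baseline_count * (1 + max_increase)
    if todays_count > limit:
        raise TooManyDeactivations(
            f"refusing to deactivate {todays_count} accounts: "
            f"baseline was {baseline_count}, limit is {limit:.0f}"
        )


def run_deactivation(accounts, baseline_count, deactivate, max_increase=0.25):
    """Check the whole batch before touching any account, then deactivate."""
    accounts = list(accounts)
    # The guard runs before the loop, so a suspicious batch changes nothing.
    check_batch_size(len(accounts), baseline_count, max_increase)
    for account in accounts:
        deactivate(account)
    return len(accounts)
```

The check happens before the first side effect, so when it trips, the worst outcome is a human looking at the list and re-running the job with a higher limit.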


  2. 4

    I think this is a general problem where people consider operations with (major) side effects to be idempotent, or at least idempotent in the case of failure.

    E.g., iterate through the database and bill each customer. If you receive success from the payment service, mark the customer paid. Else leave the customer marked unpaid. This doesn’t consider the case where you were disconnected mid-payment. A more robust system (from the customer’s view) would be to mark them paid first, then talk to the payment service. If the payment service says denied/overcharged/no funds, then revert to unpaid. But for other unknown errors, leave them in the paid state until somebody figures it out.

    I think this flies in the face of conventional wisdom that you only mark things done after you know that they’re done, but there are a lot of things you really don’t want to try twice until you’re certain the first time really didn’t work. Assuming web services, a good rule would be to only POST things once unless you get back a specific error you know is safe to repeat. Don’t assume the correct response to unknown errors is to retry.
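
    The mark-first flow above could look something like this in Python. It is only a sketch: `Customer`, `ChargeResult`, `bill_customer` and the `charge` callable are all invented for the example, and a real system would persist the paid flag (commit it) before calling the payment service rather than set an in-memory attribute.

```python
from enum import Enum


class ChargeResult(Enum):
    """Outcomes the (hypothetical) payment service can report definitively."""
    OK = "ok"
    DECLINED = "declined"  # denied / no funds: we know no money moved


class Customer:
    def __init__(self, name):
        self.name = name
        self.paid = False
        self.needs_review = False


def bill_customer(customer, charge):
    """Mark the customer paid first; revert only on a definite refusal."""
    if customer.paid:
        return  # already marked paid: never attempt a second charge
    customer.paid = True  # recorded before talking to the payment service
    try:
        result = charge(customer)
    except Exception:
        # Unknown outcome (timeout, disconnect, ...): the charge may have gone
        # through, so stay in the paid state and flag it for a human.
        customer.needs_review = True
        return
    if result is ChargeResult.DECLINED:
        customer.paid = False  # safe to revert: the service said no
```

    The trade-off is deliberate: an unknown error can leave someone unbilled until a person looks at it, but it can never bill them twice.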

    We had a minor incident once wherein we discovered that postfix would only deliver a message with a maximum of 200 recipients. We were trying to send an email to 1000 (bcc) recipients. postfix dutifully accepted the message for relay, sent it to the first 200, then returned an error. Our software dutifully tried again. And again. And again. Those first 200 people were not pleased. The sysadmins noticed it pretty quickly when the traffic monitor alert went off and we shut things off. After that, we changed things to only try mail submission twice; after that it gives up and begs for help. I think the root cause bug was postfix stupidly returning a message rejected error code and then relaying the mail anyway, but this is hardly the only time a bug in somebody else’s software requires safeguards in your own.
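
    The "try twice, then beg for help" rule is small enough to sketch in Python. This is not our actual code; `submit_with_limit`, `send` and `alert` are placeholder names, and `send` stands in for whatever hands the message to the mail server.

```python
def submit_with_limit(send, message, max_attempts=2, alert=print):
    """Try send(message) at most max_attempts times, then give up and alert.

    Returns True on the first successful attempt, False after giving up.
    """
    last_error = None
    for _attempt in range(max_attempts):
        try:
            send(message)
            return True
        except Exception as exc:
            last_error = exc  # remember why it failed, then maybe try again
    # Out of attempts: stop resending and ask a human to look at it.
    alert(f"giving up after {max_attempts} attempts: {last_error!r}")
    return False
```

    With a cap like this, the worst case in our incident would have been two copies to those first 200 people instead of an open-ended stream.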

    1. 1

      A winner is me! The new fixed billing system won’t charge people until after it verifies the database is writable to record the charge. http://www.twilio.com/blog/2013/07/billing-incident-post-mortem.html