1. 29

  2. 8

    I think the takeaway here is a) don’t confuse all kind of errors with a http request with invalid tokens (I’m not familiar with the Github API, but I suppose it returns 503 unauthorized correctly) and b) don’t delete important data, but flag it somehow.

    1. 5

      It returns a 404 which is a bit annoying since if you fat finger your URL you’ll get the same response as if a token doesn’t exist.


      Invalid tokens will return 404 NOT FOUND

      I’ve since moved to using a pattern of wrapping all external requests in objects that we can explicitly check their state instead of relying on native exceptions coming from underlying HTTP libraries. It makes things like checking explicit status code in the face of non 200 status easier.

      I might write on that pattern in the future. Here’s the initial issue with some more links https://github.com/codetriage/codetriage/issues/578

      1. 3

        Why not try to get issues, and if it fails with a 401, you know the token is bad? You can double check with the auth_is_valid method you’re using now…

        1. 2

          That’s a valid strategy.

          Edit: I like it, I think this is the most technically correct way to move forwards.

        2. 1

          Did the Github API return a 404 Not Found instead of a 5xx during the outage?

          1. 1

            No clue.

            1. 1

              Then there’s your problem. Your request class throws RequestError on every non-2xx response, and auth_is_valid? thinks any RequestError means the token is invalid. In reality you should only take 4xx responses to mean the token is invalid – not 5xx responses, network layer errors, etc.

              1. 1

                Yep, that’s what OP in the thread said. I mention it in the post as well.

        3. 2

          I think the takeaway is that programmers are stupid.

          Programs shouldn’t delete/update anything, only insert. Views/triggers can update reconciled views so that if there’s a problem in the program (2) you can simply fix it and re-run the procedure.

          If you do it this way, you can also get an audit trail for free.

          If you do it this way, you can also scale horizontally for free if you can survive a certain amount of split/brain.

          If you do it this way, you can also scale vertically cheaply, because inserts can be sharded/distributed.

          If you don’t do it this way – this way which is obviously less work, faster and simpler and better engineered in every way, then you should know it’s because you don’t know how to solve this basic CRUD problem.

          Of course, the stupid programmer responds with some kind of made up justification, like saving disk space in an era where disk is basically free, or enterprise, or maybe this is something to do with unit tests or some other garbage. I’ve even heard a stupid programmer defend this crap because the the unit tests need to be idempotent and all I can think is this fucking nerd ate a dictionary and is taking it out on me.

          I mean, look: I get it, everyone is stupid about something, but to believe that this is a specific, critical problem like having to do with 503 errors instead of a systemic chronic problem that boils down to a failure to actually think really makes it hard to discuss the kinds of solutions that might actually help.

          With a 503 error, the solution is “try harder” or “create extra update columns” or whatever. But we can’t try harder all the time, so there’ll always be mistakes. Is this inevitable? Can business truly not figure out when software is going to be done?

          On the other hand, if we’re just too fucking stupid to program, maybe we can work on trying to protect ourselves from ourselves. Write-only-data is a massive part of my mantra, and I’m not so arrogant to pretend it’s always been that way, but I know the only reason I do it is because I deleted a shit-tonne of customer data on accident and had the insight that I’m a fucking idiot.

          1. 4

            I agree with the general sentiment. It took me a bout 3 read throughs to parse through all the “fucks” and “stupids”. I think there’s perhaps a more positive and less hyperbolic way to frame this way.

            Append only data is a good option, and basically what I ended up doing in this case. It pays to know what data is critical and what isn’t. I referenced the acts_as_paranoid and it pretty much does what you’re talking about. It makes a table append only, when you modify a record it saves an older copy of that record. Tables can get HUGE, like really huge, as in the largest tables i’ve ever heard of.

            /u/kyrias pointed out that large tables have a number of downsides such as being able to perform maintenance and making backups.

            1. 2

              you can do periodic data warehousing though to keep the tables as arbitrarily small as you’d like but that introduces the possibility of programmer error when doing the data warehousing. it’s an easier problem to solve than making sure every destructive write is correct in every scenario though.

              1. 1

                Tables can get HUGE, like really huge, as in the largest tables i’ve ever heard of

                I have tables with trillions of rows in them, and while I don’t use MySQL most of the time, even MySQL can cope with that.

                Some people try to do indexes, or they read a blog that told them to 1NF everything, and this gets them nowhere fast, so they’ll think it’s impossible to have multi-trillion-row tables, but if we instead invert our thinking and assume we have the wrong architecture, maybe we can find a better one.

                /u/kyrias pointed out that large tables have a number of downsides such as being able to perform maintenance and making backups.

                And as I responded: /u/kyrias probably has the wrong architecture.

              2. 2

                Of course, the stupid programmer responds with some kind of made up justification, like saving disk space in an era where disk is basically free

                It’s not just about storage costs though. For instance at $WORK we have backups for all our databases, but if we for some reason would need to restore the biggest one from a backup it would take days where all our user-facing systems would be down, which would be catastrophic for the company.

                1. 1

                  You must have the wrong architecture:

                  I fill about 3.5 TB of data every day, and it absolutely would not take days to recover my backups (I have to test this periodically due to audit).

                  Without knowing what you’re doing I can’t say, but something I might do differently: Insert-only data means it’s trivial to replicate my data into multiple (even geographically disparate) hot-hot systems.

                  If you do insert-only data from multiple split brains, it’s usually possible to get hot/cold easily, with the risk of losing (perhaps only temporarily) a few minutes of data in the event of catastrophe.

                2. 0

                  Unfortunately, if you hold any EU user data, you will have to perform an actual delete if the EU user wants you to delete their stuff if you want to be compliant with their stuff. I like the idea of the persistence being an event log and then you construct views as necessary. I’ve heard that it’s possible to use this for almost everything and store an association of random-id to person, and then just delete that association when asked to in order to be compliant, but I haven’t actually looked into that carefully myself.

                  1. 2

                    That’s not true. The ICO recognises there are technological reasons why “actual deletion” might not be performed (see page 4). Having a flag that blinds the business from using the data is sufficient.

                    1. 1

                      Very cool. Thank you for sharing that. I was under the misconception that having someone in the company being capable of obtaining the data was sufficient to be a violation. It looks like the condition to be compliant is weaker than that.

                      1. 2

                        No problem. A big part of my day is GDPR-related at the moment, so I’m unexpectedly versed with this stuff.

                  2. 0

                    There’s actually a database out there that enforces the never-delete approach (together with some other very nice paradigms/features). Sadly it isn’t open source:


                3. 6

                  Some more possible things that could be learnt from this event.

                  • Don’t do things that can have a large scale effect without thinking about the worst that can happen. The for loop in that code is entirely to blame for the magnitude of the problem (17,000). You could have caught this by precomputing the list of emails to go out and failing if this exceeded the normal amount by a drastic amount.
                  • Perhaps you could have spread the checks across 24 hours instead of at a specific point in time to avoid being dependent on an external call that might not be available during an outage.
                  • External service calls that you rely on will fail. Consider using a retry strategies that has exponential backoff with jitter. https://www.awsarchitectureblog.com/2015/03/backoff.html
                  1. 3

                    The problem is these are all arguably over-defensive measures without the benefit of hindsight. In an “agile” setting, you won’t be able to justify writing code this defensively all the time. Even less so, since if you code like that, there won’t be any incidents to justify writing code like that :)

                    An exception could be if you managed to capture these patterns in abstract, reusable libraries, toolchains etc., then the up-front cost could be justifiable.

                    1. 6

                      I wouldn’t put that on “agile”. I worked in quite a lot of agile organisations where safe and defensive practice was exercised around all mission-critical and safety-related data. They just didn’t bother too much if something was released with a button two pixels too far left (which is an issue to be fixed, I don’t want to downplay frontend, but it can easily be fixed with another deployment and has no lasting repercussions).

                      “pressure” is the killer here.

                      1. 2

                        Or you just make all that irrelevent/impossible by picking append-only data structures?

                    2. 3

                      On the bright side of things, at least you had 17k auth tokens to lose. :)

                      1. 2

                        The pink sidebar burns my eyes.