1. 14
  1.  

  2. 8

    The API used to perform the deletion accepted both site and app identifiers and assumed the input was correct – this meant that if a site ID is passed, a site would be deleted; if an app ID was passed, an app would be deleted. There was no warning signal to confirm the type of deletion (site or app) being requested.

    So technically a type error then.

    EDIT: they claim there were two problems, but I’d like to add a third: not doing a dry run of the script and printing out what would be deleted.
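
    A minimal sketch of how both points could look in practice: distinct ID types so a site ID can't slip into an app deletion, plus dry-run as the default. All names here (`SiteId`, `AppId`, `delete_apps`) are hypothetical illustrations, not Atlassian's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteId:
    """Hypothetical wrapper type for a site identifier."""
    value: str


@dataclass(frozen=True)
class AppId:
    """Hypothetical wrapper type for an app identifier."""
    value: str


def delete_apps(ids, dry_run=True):
    """Delete apps only; reject the whole batch if anything is not an AppId."""
    ids = list(ids)
    # Validate everything up front so a bad ID fails before any deletion runs.
    for item in ids:
        if not isinstance(item, AppId):
            raise TypeError(
                f"expected AppId, got {type(item).__name__}: {item!r}"
            )
    if dry_run:
        # Default path: only report what would be deleted.
        return [f"would delete app {item.value}" for item in ids]
    # A real script would call the deletion backend here (assumed, not shown).
    return [f"deleted app {item.value}" for item in ids]
```

    Passing a `SiteId` raises a `TypeError` before anything is touched, and a real deletion has to be requested explicitly with `dry_run=False`.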

    1. 5

      I agree it should be a type error. Let’s see if we can get there.

      Fourth problem: actually deleting instead of deactivating and marking for deletion. We should have been able to say “oh whoops extremely sorry, let me undo that” and had customers running within minutes of getting a support ticket.

      In the same way that it’s safer to generate garbage and GC rather than free() something you need and SEGFAULT.
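
      Roughly what deactivate-and-mark-for-deletion could look like, as a sketch only. The `Site` record, the function names and the 30-day grace period are all assumptions made for illustration; nothing here comes from the PIR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Site:
    """Hypothetical site record with soft-delete bookkeeping."""
    site_id: str
    active: bool = True
    purge_after: Optional[datetime] = None


def mark_for_deletion(site, now, grace=timedelta(days=30)):
    """Deactivate the site and schedule the hard delete (grace period assumed)."""
    site.active = False
    site.purge_after = now + grace


def restore(site):
    """Undo a soft delete: the data was never removed, so this is cheap."""
    site.active = True
    site.purge_after = None


def purge_due(sites, now):
    """Return the sites that survive; anything past purge_after is dropped."""
    return [s for s in sites if s.purge_after is None or now < s.purge_after]
```

      The "oh whoops" path is then just `restore(site)` at any point before the purge job passes `purge_after`.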

    2. 3

      The detail on this is decent, and the lessons look useful.

      Though this only impacted a small number of customers, total unavailability for 14 days is a big deal.

      This sounds like another example of “Your nines are not my nines”: https://rachelbythebay.com/w/2019/07/15/giant/

      1. 2

        We want to acknowledge the outage…

        An actual apology would have been more appropriate.

        1. 8

          I’m an Atlassian employee.

          One of the co-founders, Scott Farquhar, did send out an apology to affected customers, but it was tricky to even get a list of people to contact, since the contact information was deleted as part of the script. The list was definitely incomplete initially.

          I probably can’t say much, but there was confusion during the incident over who had been contacted, who should be contacted, how to contact people, etc. due to the initially missing data. Some huge changes will be implemented around contacting customers in the future - we have to do better.

        2. 1

          I remember reading a blog post about it (link below): affected people just got generic error messages from the status page or support, and were not considered worthy of having a real human talk to them. So this PIR is clearly a load of bullshit. Just because you write “one of our core values is ‘Open company, no bullshit’” doesn’t make it true. Not only was it NOT open, it was also full of bullshit.

          From the blog post: For most of this outage, Atlassian has gone silent in communications across their main channels such as Twitter or the community forums. It took until Day 9 for executives at the company to acknowledge the outage. […] Impacted companies received templated emails and no answers to their questions.

          The post: https://newsletter.pragmaticengineer.com/p/scoop-atlassian