1. 25

  2. 1

    If I read this correctly, an unexpected reboot of 25% of the servers in one datacenter (they have a main datacenter???) caused a 6-hour outage. This is really terrible risk mitigation for something that has been positioning itself as a core piece of internet software development.

    To put this another way, in one day GitHub has given itself an availability of under 99.95% for the year of 2016.

    EDIT: Whoops, I flipped the time; I thought it was 6 hours. That gives an availability somewhere between 99.95% and 99.99%, so they have already lost 4 9’s (quick math below).
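
    A quick back-of-the-envelope check, assuming a 365-day year and the “two hours and six minutes” figure from the reply below (a sketch of the arithmetic, not GitHub’s official numbers):

    ```python
    # Availability for roughly 2h06m of downtime over one year (365 days assumed).
    MINUTES_PER_YEAR = 365 * 24 * 60      # 525,600
    downtime_minutes = 2 * 60 + 6         # 126

    availability = 1 - downtime_minutes / MINUTES_PER_YEAR
    print(f"{availability:.5%}")          # ~99.97603%: above 99.95%, below 99.99%
    ```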

    1. 4

      I read ‘two hours and six minutes’, not six hours, but yes.

      (edit: well, actually, no, because that does bring it above 99.95%, but the actual number is obviously not the important thing here.)

      1. 2

        Thank you for the correction! I fixed it in my post above. My sentiment still stands: I would have expected something much worse and less predictable (machines going down in a datacenter is pretty predictable) to be the cause of a multi-hour outage.

        1. 3

          Yep; I’m totally with you there. Reading the post-mortem, it’s clear they still have a few parts of their automated disaster recovery process to iron out!