If I read this correctly, an unexpected reboot of 25% of the servers in 1 datacenter (they have a main datacenter???) caused a 6 hour outage. This is really terrible risk mitigation for something that has been positioning itself as a core piece of internet software development.
But to put this another way: in one day GitHub has given itself an availability of under 99.95% for the year of 2016.
EDIT: Whoops, I flipped the time; I thought it was 6 hours. The actual outage gives an availability somewhere between 99.95% and 99.99%, so they have already lost four nines.
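For anyone checking the arithmetic, here's a quick sketch (assuming a single 2h06m outage, per the correction below, and that 2016 is a leap year with 366 days):

```python
# Availability for 2016 given one 2h06m outage (assumed figures).
outage_minutes = 2 * 60 + 6           # 126 minutes of downtime
year_minutes = 366 * 24 * 60          # 527,040 minutes in a leap year
availability = 1 - outage_minutes / year_minutes
print(f"{availability:.4%}")          # roughly 99.976%: above 99.95%, below 99.99%
```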
I read ‘two hours and six minutes’, not six hours, but yes.
(edit: well, actually, no, because that does bring it above 99.95%, but the actual number is obviously not the important thing here.)
Thank you for the correction! I fixed it in my post above. My sentiment still stands: I would have expected something much worse and less predictable (machines going down in a datacenter is pretty predictable) to be the cause of a multi-hour outage.
Yep; I’m totally with you there. Reading the post-mortem, it’s clear there are a few parts of their automated disaster-recovery process that still need to be ironed out!