This is one of the least informative postmortems ever. I knew more details about the outage from 3rd party observations before it was even over.
This is not a postmortem. That will likely come later once there has been a thorough investigation.
It’s Facebook, so I doubt there will ever be a real engineering postmortem.
My point exactly, I think this is what we get. I suspect there will be another post in a few days or weeks with about 10× as many words and some investor-calming action plan “so this doesn’t happen again”, but no more technical details than what we have.
Mod stuff: I removed a couple submitted stories about the big Facebook outage that had no info. I merged the third-party guess about what happened into the current official writeup. This event doesn’t have much to discuss yet and is mostly business news, but I’m leaving up this useless press release just to keep things together and easily hidden.
Please post any future links to third-party speculation as comments; I’m just going to delete more stories. If a proper postmortem comes along I’m going to merge it into this story, even if it doesn’t arrive within the next week (our usual window).
In general: “is X down?” or “X is down!” stories are off-topic and shouldn’t be submitted. I’m making an exception here, partly because this one is so big and famous and partly as an experiment, but so far it hasn’t prompted any topical discussion and is acting as more proof that this kind of story isn’t worth permitting.
Why were these BGP updates being made in the first place? Must these updates be made periodically?
We’ll have to wait and see if FB releases a postmortem. I’d suspect, based on outages I’ve seen at much smaller companies, that someone made the wrong update and then got locked out of the networking infrastructure because it was done remotely. This change might have passed review because it seemed unrelated to their primary BGP advertisements, but had a bug or fall-through that affected way more than it should have.
There’s also the possibility it was intentional malice (I’ve heard some lead network engineers quit last week; might be rumors/complete BS or unrelated. I’m sure FB wouldn’t admit if it was).
I suspect if this was just a user/developer failure, FB will probably put additional safety procedures in place. They might add additional LTE connections to the data centers just for network engineers as backups. They might also require a network engineer with credentials to be in the physical data center and logged in, prior to major route changes.
Surprised FB does not plonk someone down in the same city as any of their DCs so they don’t have to make the six-hour drive to the router.
If you acquire a new set of IP addresses, or change the allocation of your existing IP addresses to different parts of your network so that the outside world needs to take a different route to get to them, then you’ll update your network peers via a BGP update.
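To make that concrete, here is a toy sketch of what an announce or withdraw does to a peer's view of the world. Everything in it is an assumption for illustration: `ToyPeer`, the `edge-a`/`edge-b` labels, and the prefixes (taken from the reserved documentation ranges) are made up, and real BGP also involves sessions, path attributes, and policy that this ignores. It only shows the basic idea that peers can reach an address only while some announced prefix covers it, with the most specific prefix winning.

```python
import ipaddress


class ToyPeer:
    """Hypothetical, heavily simplified model of one peer's routing table."""

    def __init__(self):
        # Maps an announced prefix to an opaque next-hop label.
        self.routes = {}

    def announce(self, prefix, next_hop):
        """Record that `prefix` is reachable via `next_hop`."""
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        """Forget a previously announced prefix (no-op if unknown)."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the next hop for `address` using longest-prefix match."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no covering route: the address is unreachable
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


peer = ToyPeer()
peer.announce("198.51.100.0/24", "edge-a")
before = peer.lookup("198.51.100.7")   # routed via the /24

# Moving part of the range elsewhere is just a more specific announcement.
peer.announce("198.51.100.0/25", "edge-b")
moved = peer.lookup("198.51.100.7")    # the /25 now wins

# Withdrawing every covering prefix leaves peers with nowhere to send traffic.
peer.withdraw("198.51.100.0/25")
peer.withdraw("198.51.100.0/24")
after = peer.lookup("198.51.100.7")
```

The last step is the failure mode people are speculating about: once the covering prefixes are gone, the servers can be perfectly healthy and still be unreachable from outside.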
I really want to read their postmortem once available. I wonder how much detail they’ll provide about how they manage their network infrastructure. It could be pretty helpful to learn from them despite the eventual hiccups like this one.
I got really excited by this headline, until I realised it was only temporary.
I mean, the sheer momentum of all of the salaries and bodies that are paid to keep Facebook up and running means it won’t stay down forever.
This thinking applies to RadioShack and Sears too, though on a longer timeline.
I was more referring to this specific incident, not to Facebook in general, though my phrasing didn’t make that clear enough.
This is essentially a press release and off-topic. Please wait and post something that is actually a post-mortem.
It’s from the engineering blog, I guess as a good example of how engineering blogs are just branches of PR departments.
It’s Facebook, not Cloudflare. I doubt we’ll ever get a real breakdown of what happened.