How do any of these points prove they have a quality problem?
The last point seems particularly stretched. Is there any organization anywhere that doesn’t experience fewer introduced defects when nobody is working? In other news, the Coca Cola bottling plant broke fewer bottles when everybody was on vacation too.
The Coca Cola bottling plant had zero activity when no one was working. Facebook’s activity remains the same on weekends and holidays.
It’s just a bad analogy - presumably the same number of people were drinking Coca Cola on the weekends.
Nor do people stop drinking Coke when the plant that makes it is closed.
Bugs and breakages and defects have a power law distribution in software and gross number is not a good measure of total cost. Most bugs are annoying and minor; occasionally, there’s a major bug that is a serious threat to the business (or worse).
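To make the power-law point concrete, here's a quick illustrative sketch (not from the thread, and all numbers are made up): sampling bug "costs" from a heavy-tailed Pareto distribution shows how a small fraction of defects can dominate total cost, which is why gross defect count is a poor proxy.

```python
# Hypothetical illustration: Pareto-distributed bug costs.
# Shape alpha close to 1 gives a heavier tail (a few huge bugs).
import random

random.seed(42)

def pareto_cost(alpha=1.2):
    # Minimum cost 1, heavy right tail.
    return random.paretovariate(alpha)

bugs = [pareto_cost() for _ in range(10_000)]
bugs.sort(reverse=True)

total = sum(bugs)
top_1_percent = sum(bugs[:100])  # the worst 1% of bugs

print(f"share of total cost from worst 1% of bugs: {top_1_percent / total:.0%}")
```

Under these made-up parameters, the worst 1% of bugs typically account for a large share of the total cost, while a raw defect count would weight them the same as the annoying minor ones.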
Defect/crisis count is going to go up when people are working, but a lot of that is issue detection.
So yeah, I’m underwhelmed by this finding. It doesn’t prove that Facebook has a worse code quality problem than any other organization, nor (although I don’t assert this to be true) does it refute a potential claim that it has better code quality.
Defect/crisis count is going to go up when people are working
I love that we, as a community, stress the importance of distributed systems and fault tolerance, but statements like this indicate that human error is still by far the leading cause of outages. It’s not just your statement; this article says the same thing, and so does my experience.
If you have more bugs when people are working, what do you think the cause is? Probably the humans. How do you fix the humans? You make it harder for them to make mistakes. Code quality could be one thing to fix; so could deployment procedures, design reviews, or company culture. From my experience, fixing code quality goes a long way toward reliable software. It’s not certain, but I’d also bet that Facebook is running into a code quality issue.