It takes some courage to write a descriptive incident report after the worst happens and I appreciate it - not all companies describe what went wrong in so much detail.
For sure. There are many companies that could describe part of their system as super fragile, prone to error, reliant on a handful of random shell scripts, and badly documented, but the vast majority of them would never admit it.
This gives me more confidence in GitLab, but at the same time another reason to set up something to back my private repos up to BitBucket.
Agreed - I like how GitLab do so much in the open, particularly as things unfold and not just in a post-mortem.
It makes for quite a read, though - seems like too much of their architecture is brown paper and string (not uncommon, I guess). Even the fact that someone could accidentally delete data in an interactive session sounds like something that could’ve been prevented (although I guess database replication is not the easiest thing to manage with a configuration management system). Also, LVM snapshots - I’ve never been a fan of those, particularly in this day of ZFS' almost zero-cost snapshots (snapshots are not at all zero cost with LVM).
Of course, it’s so easy for me to say that as an outsider.
"too much of their architecture is brown paper and string (not uncommon, I guess)"
Since their main audience is developers, perhaps they figure developers already assume everyone is in this same boat with them, and pretending otherwise wouldn’t benefit them like it would a company whose customers don’t know how software startups work.
Am I the only one who thinks they should spend more time fixing their issues / backups / infrastructure than writing those large post-mortems? It sure takes a lot of time to write a document like this…
You can’t fix broken process without understanding it and writing a post-mortem about it.
Fixing for the sake of fixing things is a quick way to have more things needing to get fixed.
Post-mortems are expected by some customers (even if they’re never seen by those customers, the process is expected). And post-mortems are important for actually discovering and documenting those issues with backups and infrastructure they’re actually vulnerable to. Having a post-mortem also lets more people work on the solution than just the one person who screwed things up and has the problem set in their head.
Basically, post-mortems are good, and they should have written one regardless. Sharing them with the public is also good.
I’m guessing the point is to get a discussion going on social media about their product.
I can understand never performing a test restore because that sounds like work, but never even listing the S3 bucket to see if there’s anything in it?
As someone who works in backups, I think this is unfortunately probably industry average. Risk mitigation is hard to get behind for an early company.
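For what it's worth, the "just list the bucket" check is only a few lines. Here is a minimal sketch in Python, assuming the backups land as objects in an S3 bucket and that the listing comes from something like boto3's `list_objects_v2`, whose `Contents` entries carry `Key`, `Size`, and `LastModified`. The function name `check_backup_listing`, the 24-hour threshold, and the bucket name in the comment are made up for illustration; this is not how GitLab does it, and it is no substitute for an actual test restore.

```python
from datetime import datetime, timedelta, timezone


def check_backup_listing(objects, now, max_age=timedelta(hours=24), min_size=1):
    """Return a list of problems found in a bucket listing (empty list = looks sane).

    `objects` is a list of dicts shaped like the `Contents` entries of an
    S3 listing: {"Key": str, "Size": int, "LastModified": datetime}.
    The thresholds are illustrative assumptions, not recommendations.
    """
    if not objects:
        return ["bucket listing is empty: no backups found"]

    problems = []
    # Only the most recent object is inspected in this sketch.
    newest = max(objects, key=lambda o: o["LastModified"])
    age = now - newest["LastModified"]
    if age > max_age:
        problems.append(f"newest backup {newest['Key']} is {age} old")
    if newest["Size"] < min_size:
        problems.append(f"newest backup {newest['Key']} is only {newest['Size']} bytes")
    return problems


# Possible wiring (not run here; the bucket name is hypothetical):
#   import boto3
#   resp = boto3.client("s3").list_objects_v2(Bucket="example-db-backups")
#   problems = check_backup_listing(resp.get("Contents", []),
#                                   datetime.now(timezone.utc))
```

Run from cron and alert whenever the returned list is non-empty. That would catch an empty bucket or a zero-byte dump, though it says nothing about whether the dump can actually be restored.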
Poor guys, but GitHub did it once too! I remember they lost a production database at some point.
I was cutting GitLab some slack when people were complaining about their seat-of-the-pants data center migration a few years ago, but I think we should expect them to have the engineering maturity to verify their backups at this point.
EDIT: But providing us these details is something they deserve credit for. Good point, kb.
It certainly doesn’t give me any confidence that their plans to switch from the cloud to bare metal are going to work out well.
They gave up on that plan after the blog post you linked generated many insightful comments advising them not to switch to bare metal.
Oh, I did not realize that, I thought it was just in-progress. They even had a post about the hardware they were planning on buying… I didn’t see a post about abandoning the plan, I just thought it’s one of those things that takes time to play out. Do you have a source to point me to on that decision?
I wish I had a source. I might have misunderstood, but I’m pretty sure that’s what one of the GitLab engineers on the live YouTube feed said during the Q&A, which ran while they were waiting for the backup rsync to finish.
I just went to their team webpage but couldn’t find the face of the engineer who said it. Hope I got it right, apologies otherwise.
[EDIT] Oh but the stream is recorded! It might be around this time: https://youtu.be/nc0hPGerSd4?t=3782 Nope that’s not it. Not sure when it was.
They are preparing a blog post explaining their decision to stay in the cloud. The original issue has some more details.
Wow, I missed that. I’ll have to go back and re-read the comments (I read some of them at the time, but obviously not all of them!).
I always figured, and still do, that GitLab users are happiest when hosting their own setup. The CE feels like a very good solution for that, and why not the paid editions as well.
Just the fact that they are live-streaming their response… awesome.
https://www.youtube.com/watch?v=nc0hPGerSd4 Link for those who want it