Removing a data directory by hand is dangerous, as this incident proved.
When I was reading the log of what happened, I was fairly incredulous that this action was taken at all.
I think this was a good write-up from the perspective of the database software's behavior - both what (probably) happened, and what the expected behavior was, from a person who has an intimate understanding of the guts of the system. Unfortunately most of us don't have as deep an understanding, which is why we make these errors that - in retrospect - seem avoidable. For whatever it's worth, I never considered Postgres to be culpable in the mistake, but this helps me better understand the perspective of YP as well.
Some solid, practical advice around backups in there, though.
It pays to look at what happened vs blaming. Great post.
“For whatever it’s worth, I never considered Postgres to be culpable in the mistake”
I do. I’m not too harsh, given the issues major databases have to begin with. The fault I find is that high-integrity systems should be designed so that stuff like this can’t happen, is corrected as it’s happening, or monitoring code reports the anomaly to admins in a very visible way. At the very least, they should see major problems coming from broken invariants. Not enough software is designed to do that.
You cannot design a durable data store which will survive its underlying storage being mutated outside its management (e.g. by rm -r /var/opt/gitlab/postgresql/data). What would you propose Postgres do here? Replicate all data in memory so it can be restored if the disk screws up? This is obviously not tenable outside toy datasets. Manage block devices directly? That way lie dragons, and it doesn’t actually solve the problem. Stricter permissions? Can root create a directory so locked down it cannot edit it? Ultimately, there are some failure scenarios that cannot be recovered from, and blaming the tool for that is counterproductive.
I am quite sure, by the way, that Postgres complained loudly and clearly in logs when it discovered its data directory had been wiped. Everything else was, as this blog post describes, normal operation (which certainly doesn’t merit squawking in logs) and human error.
“I am quite sure, by the way, that Postgres complained loudly and clearly in logs when it discovered its data directory had been wiped.”
That’s the main safety measure I expected, along with some kind of notification that reaches an operations person. Others I’ll have to think on. Some designs use append-only storage or versioned files on disk specifically to recover from unwanted mutations. The PostgreSQL author also mentioned some settings that shouldn’t have been set the way they were. Makes me wonder if the defaults, documentation, or setup messages could be improved. Or the user just did something dumb.
Notification of deletion and some standard way to roll back changes to filesystem storage are the main areas for improvement; with them, many scenarios go from an integrity problem to an availability problem.
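The "versioned files on disk" idea can be sketched in a few lines of shell: each backup run writes a new, never-rewritten generation directory, so wiping the live copy becomes an availability problem (restore the newest generation) rather than a permanent loss. All paths and names here are made up for illustration, not anything GitLab actually ran:

```shell
#!/bin/sh
# Sketch of generation-based (append-only) backups: each run writes a NEW
# directory and never touches earlier generations, so a bad rm or a corrupted
# live copy can be rolled back to the last good generation.
set -eu

DATA=$(mktemp -d)      # stands in for the live data directory
BACKUPS=$(mktemp -d)   # append-only backup area: write-once directories

echo "generation 1" > "$DATA/records"
cp -a "$DATA" "$BACKUPS/gen-001"   # first generation, never rewritten

echo "generation 2" >> "$DATA/records"
cp -a "$DATA" "$BACKUPS/gen-002"   # second generation: a new directory

# Simulate the accident: the live data directory is wiped.
rm -rf "$DATA"

# Recovery: restore the newest generation instead of losing data outright.
LATEST=$(ls "$BACKUPS" | sort | tail -n 1)
cp -a "$BACKUPS/$LATEST" "$DATA"

cat "$DATA/records"
```

Real systems layer retention policies and integrity checks on top, but the core property is the same: old generations are never mutated in place.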
Hard to fault PostgreSQL for not providing a file system to the user that is robust against operators doing rm -rf .... Now, if they were on ZFS and took ZFS snapshots, they at least would have had an easy way to roll back from that class of operator mistake. But then hopefully they wouldn’t do zfs destroy ... of the snapshot by accident :)
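The snapshot workflow being suggested might look like the following. The pool/dataset name tank/pgdata is invented for illustration, and snapshot/rollback require root plus a real ZFS pool, so the sketch only runs the commands when zfs is actually present:

```shell
#!/bin/sh
# Illustrative ZFS snapshot workflow; tank/pgdata is a made-up dataset name.
# Guarded so it degrades gracefully on machines without ZFS.
DATASET="tank/pgdata"
if command -v zfs >/dev/null 2>&1; then
    zfs snapshot "$DATASET@pre-maintenance"   # instant, read-only copy
    # ...accidental rm -rf on the mounted filesystem would happen here...
    zfs rollback "$DATASET@pre-maintenance"   # dataset back to the snapshot
    zfs destroy "$DATASET@pre-maintenance"    # removing a snapshot is `destroy`
    echo "snapshot cycle complete"
else
    echo "zfs not available; commands shown are illustrative"
fi
```

Note that the subcommand for removing a snapshot is `zfs destroy`, not `zfs delete` - one more reason destructive commands deserve a second look before you press enter.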
Postgres itself might do something like that, if not the filesystem. OpenCM by Jonathan Shapiro et al. did it for a high-assurance repository: the bottom layer was append-only storage, to allow spotting bad changes and rolling back.
How exactly does that work? No matter where the database stores its information, if the operator does rm -rf on it, it’s gone.
That’s true for that specific vector. But deletion or corruption of a single file won’t kill everything, especially if the data is backed up regularly to append-only storage. I’ve used DVDs or append-only filesystems with plenty of extra storage for that in the past.
A user straight-up killing everything can’t be helped in a live setup, though. That’s just bad administration.
In the case of GitLab, their backups were not working, so I don’t think your suggestion helps much :)
It looks to be a bad admin decision all the way. After all this discussion, I looked again at the original story to see if I had overlooked a problem in my brief assessment. GitLab said the replication was lagging too much, while the other side says that was normal; there was also a known bug in it that was being worked on. The GitLab people should’ve known that. So the root cause was in the documentation, in GitLab not reading it, and/or in GitLab not having experienced Postgres HA pros who know its quirks. That they went straight to deleting the raw files instead of looking for a restart command also jumped out at me. That messes up a lot of things.