1. 15
  1. 27

    We’ve learned that having a function that deletes your database is too dangerous to have lying around. The problem is, you can never really test the safety mechanisms properly, because testing it would mean pointing a gun at the production database.

    I’d argue that you should have this function “lying around” for development use – and it should be the only way you drop and create tables. Otherwise, you’ll fall back to dropping tables by hand, where it is impossible to add more safety. At least with this function you can add checks like “don’t do it if any table has more than 1,000 records” or something. This is called poka yoke.
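    As a minimal sketch of that idea (hypothetical names, using SQLite for illustration; the row-count threshold is the poka-yoke):

```python
import sqlite3

def reset_dev_db(conn: sqlite3.Connection, max_rows: int = 1000) -> None:
    """Drop every table, but refuse if any table looks like real data.

    Hypothetical poka-yoke guard: intended as the only sanctioned way
    to wipe a development database.
    """
    cur = conn.cursor()
    tables = [row[0] for row in cur.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    # Check every table before dropping anything.
    for table in tables:
        (count,) = cur.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
        if count >= max_rows:
            raise RuntimeError(
                f"refusing to reset: {table} has {count} rows; "
                "this does not look like a dev database")
    for table in tables:
        cur.execute(f"DROP TABLE {table}")
    conn.commit()
```

    The point is that every extra safety check lives in one place, instead of being re-invented (or forgotten) each time someone drops tables by hand.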

    1. 2

      Thank you for sharing the link to poka yoke! I had never heard this expressed in such formal terms, but it makes a lot of sense now that it has been.

      1. 1

        This is interesting. I’m always surprised at how frequently software development leans on learnings from manufacturing. I guess it’s just another form of manufacturing.

      2. 12

        I’m a little annoyed at the “after a couple of glasses of red wine” business. It’s late at night and you’re doing things in production; coke or Adderall would be kinda understandable, but depressants?

        More generally, I think that there is this weird notion that writing code while not sober or doing ops that way is somehow different from showing up drunk to any other job. I disagree with this. Save the booze for after you deploy successfully.

        1. 9

          More generally, I think that there is this weird notion that writing code while not sober or doing ops that way is somehow different from showing up drunk to any other job.

          I would extend that to writing code when sleep deprived. If you get paged in the middle of the night you should do the least invasive thing possible.

          1. 6

            Agreed wholeheartedly. Have done migrations where we backed down because we were just too tired to see a runbook through correctly.

          2. 1

            It sounded like “I did some local development”; I don’t see how you get the connection to production. “Still figuring out what happened” is written right there.

          3. 9

            This post has nothing of interest; it’s just staged thanks for some DB hoster that gets mentioned twice too often. No reveal of why it happened or how it was even possible. How weak.

            1. 5

              I think this is a good marketing idea. At least now some people have heard about you. :) If it was intentional, my applause 👏 If not, meh 😒

              1. 5

                If it is marketing it should be elsewhere.

                1. 3

                  I feel it is more of a marketing post, since it says nothing of importance. There’s no mention of how it happened; instead, a piece of code is thrown into the post showing that tables are being dropped. Even the “What have we learned? Why won’t this happen again?” section does not really state why it wouldn’t happen again. The entire post feels poorly written, stating obvious points like “have backups”, “do not hardcode”, “do not use the same passwords everywhere”; advice that, even if valid, is common knowledge and provides no significant context in this case.

                  1. 5

                    I would agree; after reading the article, I felt the subject line gave about as much information. Of particular note is the lack of root-cause understanding, which makes many of those mitigations questionable. He says they don’t share passwords, for example, and yet somehow the credentials worked. I think a far more informative post for many of us would be the real answer as to why this happened: “Don’t share passwords, and also, if you use package Y, be aware that it replicates the password to all environments,” or some such.

                  2. 2

                    I dunno. Having heard of them is one thing, but hearing that they appear to be at least somewhat incompetent, and possibly alcoholics is not really a positive thing for me.

                    Not all publicity is good.

                  3. 3

                    I think the worst variant of this I witnessed was a new/junior-ish teammate who tried to solve it silently themselves for 2 hours (probably from shame) until it came to light that an important database was missing.

                    Mistakes happen. If you drop a prod db, don’t be embarrassed. Blast that shit loud, like a foghorn, on every single Slack channel (or whatever communication channel) you can find. You never know who has a magic script or backup hidden up their sleeve.

                    (less related to the article than just general advice)

                    1. 3

                      Given the “localhost” hard-coded there, my guesses would be:

                      • someone has the tcp port forwarded to production and all the configuration variables overridden to match production credentials in order that they can use their WIP working copy code to view live production data. I think this is most plausible.
                      • that script was run on production instead of on someone’s isolated dev machine
                      • disgruntled insider left a logic bomb
                      • maybe it was actually something else unrelated? coincidental ransomware hit at the same time, which happened to succeed in deleting the data but failed to deliver the ransom note? Eh kinda far fetched.

                      A safe replacement for this would be to put your local dev DB in a sandbox (VM or container) that isn’t even orchestrated via the same mechanism as prod. Recreate it by deleting & recreating the sandbox.
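                      A sketch of that sandbox approach (hypothetical compose file; the port only binds to loopback, so nothing here can reach prod):

```yaml
# docker-compose.yml for a throwaway dev database (illustrative)
services:
  devdb:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: dev-only
    ports:
      - "127.0.0.1:5433:5432"   # loopback only, non-standard port
```

                      To “drop and recreate”, you run `docker compose down -v && docker compose up -d` instead of ever issuing destructive SQL that could conceivably be pointed elsewhere.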

                      Also: much sympathy. Btdt, never want to again.

                      1. 1

                        I’d agree with your most-plausible prediction, and hence will probably take something valuable away from the article: programmatically make sure localhost is really localhost before you do something destructive, and only keep code that’s designed for mass destruction if it’s absolutely necessary. You could imagine this mistake ending in “Error: refusing to run on production system” or “Access denied: invalid credentials” and everyone sleeping well.
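                        A minimal sketch of such a check (illustrative, not from the article; `gethostbyname` only covers IPv4, so `::1` is left out for brevity):

```python
import socket

def assert_really_local(host: str) -> None:
    """Refuse to continue unless `host` resolves to a loopback address.

    Hypothetical guard: run this before anything destructive that is
    only ever meant to touch a local dev database.
    """
    addr = socket.gethostbyname(host)
    if not addr.startswith("127."):
        raise RuntimeError(
            f"refusing to run: {host} resolves to {addr}, not loopback")
```

                        With a port-forward in place, “localhost” still resolves to loopback, so this alone wouldn’t catch the forwarded-port scenario – but it would catch a config file with a production hostname pasted in.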

                        1. 1

                          someone has the tcp port forwarded to production and all the configuration variables overridden to match production credentials in order that they can use their WIP working copy code to view live production data. I think this is most plausible.

                          take something valuable away from the article: programmatically make sure localhost is really localhost before you do something destructive

                          There’s a much simpler “take away” from this: Don’t ever do that. Like, ever. I mean ever.

                          There is literally zero reason to connect your local, untested, in-dev code, to a production database. None. Zip. Zilch. Nada. No matter what “but..” you can think of, there is a better solution.

                          To clarify something, I didn’t say no one should have access to prod. I’m talking about processes, not access rights. To use a git analogy, if prod is the main/default/stable branch of your codebase, the take away is, “don’t do dev on the main branch”.

                          1. 1

                            Agreed that it’s an absolute footgun of an idea, but it’s definitely possible to do by accident in disorganized or overly permissive environments.

                            1. 1

                              Anything is possible “by accident” if you have zero idea what you’re doing.

                              1. 1

                                Plenty of accidents are caused by people who thought they knew damn well what they were doing.

                      2. 2

                        I’ve dropped a live database by accident as well, and seen a few colleagues do the same. It happens to the best of us, but mistakes like this let you grow quickly.

                        Enjoy the journey in figuring out what went wrong!

                        1. 1

                          This is a tangent, but is there a good work on human factors in SRE? I’ve always done some informal things (prompt coloring, aliases to warn, trying to actually sleep, etc.) – but has anyone presented a more systematic approach?

                          1. 2

                            Fine-grained ACLs. Only let one user, from one location, do drops. Getting a database dropped should be like a ceremony.
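                            In Postgres terms, a sketch of that ceremony might look like this (hypothetical role names; only the owner of an object can drop it, so the everyday app role can never issue a DROP):

```sql
-- ordinary app role: reads and writes, owns nothing, cannot DROP
CREATE ROLE app_rw LOGIN PASSWORD '...';
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_rw;

-- owner role used only for migrations and drops
CREATE ROLE db_owner LOGIN PASSWORD '...';
ALTER TABLE accounts OWNER TO db_owner;
```

                            Pair it with a `pg_hba.conf` rule such as `host mydb db_owner 192.0.2.10/32 scram-sha-256` so that drops can only even be attempted from one admin workstation.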

                            1. 1

                              I think the molly-guard (apparently named after the plastic cover that guards a physical flip switch) package is a good example. It prevents you from accidentally running shutdown or reboot on remote hosts by requiring you to type in the name of the host before actually running the command. Very useful when you forget that you’re ssh’d into a host on a certain xterm and absentmindedly want to reboot your own computer.

                              Would be nice if there were more things like this!
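                              In the same spirit, a tiny sketch of such a guard (hypothetical, not the actual molly-guard package):

```python
import socket

def confirm_host(action: str = "reboot") -> bool:
    """molly-guard-style check (illustrative): make the operator type
    this machine's hostname before a destructive command proceeds."""
    expected = socket.gethostname()
    answer = input(f"About to {action} {expected!r}; type the hostname to proceed: ")
    return answer.strip() == expected
```

                              A wrapper around `shutdown` or `reboot` would call this and abort unless it returns `True`; the typing requirement is what breaks the absent-minded muscle memory.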

                          2. 1

                            My own ‘prod DB by accident’ story: years ago (hell, nearly two decades now!) working on a custom CMS written in PHP. I was making some experimental changes to the beta site, but they weren’t being rendered in my browser. Cleared the cache, restarted the browser, restarted the server … then while doing that, realised I was working on the prod server, not beta.

                            A few years later, also working on a CMS (this time, an offline-capable SPA for creating and editing books online, back when SPAs were a very new thing), a colleague ran a query that, instead of updating a single paragraph from a single book, updated every paragraph of every book to the same block of text.

                            I think it happens to everyone at some point in their career. Although I’d note that in both cases a) there was a way to roll back without losing any data, and b) everyone involved was sober.

                            1. 1

                              Sigh. Never deploy directly to production; this is exactly what staging is for. And no backup solution! Oh my goodness! When you have other people’s data, it’s wisest to take hourly diffs and 12-hour fulls if possible. And test your restore.

                              The article is a bit teehee, with “we” being the royal we, but it’s really about the failure to keep dev, stage, and prod isolated, and about why things need testing.

                              I get that it’s not a “big deal” to the author, and I don’t know if the 7 hours lost to a panicked restore was any hit to the users, but I hope the author is setting up a far more robust backup solution along with offsites. And code snapshots as well, as I suspect they are live-developing anyway. Also, don’t do everything as root!! There is a reason you have dbo/data reader/writer roles, and transaction logs.

                              Perhaps this is a good time to learn more about the systems it all stands on and about segregation of deployment roles, or some magical k8s setup of predefined database, middleware, front end, and load balancing.

                              At the same time, I host stuff on a garbage-tier host, but I have 8-hour full DB dumps and 12-hour content rsyncs.

                              Sometimes you don’t do anything wrong; your hosting just disappears or catches fire. Offsite has saved me more than once.

                              1. 1

                                Having looked at the service, it doesn’t look as though enough revenue is being generated, or could be generated, to fund a better backup solution than the one bundled with the managed database product they use. That said, what you use is a great suggestion and likely quite affordable at the level this service operates at.

                                I do wonder if DO Managed Databases allow more frequent backup snapshots than once every 24 hours? I know with AWS we have hourly snapshots kept for 7 days, as well as hot redundancy in place. The cost is reasonable for a business, but not really something a small operation like this could afford.

                                1. 2

                                  It’s all in how you configure it. Cron is great for periodic dumps to a location, and rsync for grabbing those offsite.
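                                  For example, a crontab along those lines (hypothetical paths and hostnames; note that `%` must be escaped in cron entries):

```
# dump the DB every 8 hours, one file per hour-of-day slot
0 */8 * * *   pg_dump mydb | gzip > /var/backups/mydb-$(date +\%H).sql.gz
# push the dump directory offsite every 12 hours
30 */12 * * * rsync -az /var/backups/ offsite:/srv/backups/mydb/
```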

                                  Also, even things like cPanel can be gamed into doing backups, by grabbing the URLs it uses for file downloads and fetching them with wget or curl.

                                  You shouldn’t rely on your hosting provider to do your backups; that should be the last line of defense, for when you can’t undo a change and your own backup strategy has somehow failed.

                                  Falling back to the hosting provider is like using VMware’s 24-hour rolling snapshots as your backup. Sure, if things are totally hosed it’s better than nothing and quickly gets you back to a last known good state, but it also means both your app safety and your backups have failed.

                                  It doesn’t cost anything either; again, I’m on an el-cheapo sub-$5-a-month hosting plan and I’ve set up my backups both locally and offsite.