1. 65
  1. 44

    The stories that could be told about the machines that run manufacturing plants. It’s great because everything is prod and everything will do horrible things if you fuck up. Sometimes you are fixing something on a kiln 3x the size of your house and it breaks and you melt a couple thousand bricks, and sometimes you get told “yeah, don’t worry about the puddles of acid on the ground, we’ve neutralized them”.

    1. 25

      Please write a newsletter, and subscribe me to it.

      1. 2

        Same

    2. 21

      I appreciate posts like this. So honest and reasonable. Real life description of a real life business and the software that is running it.

      1. 6

        I agree. I’m at the end of a five-year stint at a large company, paying attention to tech Twitter and co, and I very much just want people to tell me what terrifying things they’re doing and for what bad reasons, and not lie to me about how Best Practice N+1 is totally practiced by their whole org rather than being a thing someone considered doing once and then dropped.

        1. 1

          War stories like this are exciting and macho and cool. But definitely not suitable for juniors ;)

      2. 16

        Editing stuff in prod is something you should ideally never have to do; unfortunately, when ideals fall short of reality, it’s good to be able to.

        I’ve touched many production systems, including one terrible hot-potato contract where I was asked to do some major work with nothing but the root login and the code deployed on the server (no repos or docs or anything), and it’s always cause for concern, but it also shouldn’t be something you never do.

        My major complaints around it are:

        • There are places that expect you to log into prod now and again, and places that don’t. Both kinds of place will eventually require you to log into prod, but only the former will likely have the signage and organization to keep you from making easy, bone-headed mistakes.
        • There’s kind of a skillset around safely working inside an active production system; by continually scaring new developers into thinking this is some kind of dark art that should never be attempted, we’re kinda shortchanging the next generation.
        • Sometimes companies encourage production hot work but don’t have a backup or disaster recovery plan in place. This can have negative long-term outcomes.
        • People, while trying to avoid making systems that one might have to log into, often complicate things so much that you end up nearly unable to reason about a system when it does need the laying on of hands.
        • If there isn’t a good lock-out tag-out procedure on prod systems, something like a CI/CD deploy might trash a system halfway through your work or worse.

        I’m of the opinion that we’re never going to be able to completely remove the need for prod hot work, and that instead of whatever we seem to be slouching towards now we should make it easy to do and safe to recover from when we inevitably get stuck doing it.

        I am of course in favor of automating as much as we can to reduce the need for it, but we shouldn’t do so in such a way as to make the occasional hot work harder or less safe.

        1. 3

          Sometimes companies encourage production hot work but don’t have a backup or disaster recovery plan in place. This can have negative long-term outcomes.

          Anecdotally, I believe that many of these places are the same ones that will blame “engineer error” and possibly even reach for disciplinary measures when production hot work goes wrong.

          1. 3

            Perhaps a good rule would be to only log into production in pairs, so that there’s always a second person double-checking what the first one does. Always open a transaction when doing destructive database work, and commit after double checking that your change only touched the records you wanted to touch by doing some basic stat queries.
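
            A minimal sketch of that transaction-plus-sanity-check habit, using Python’s built-in sqlite3 module. The table, columns, and expected row count here are all hypothetical; the same pattern works at a psql or mysql prompt with an explicit BEGIN:

            ```python
            import sqlite3

            # Hypothetical: we expect this cleanup to touch exactly 3 rows.
            EXPECTED_ROWS = 3

            conn = sqlite3.connect("app.db")
            try:
                cur = conn.cursor()
                # sqlite3 opens a transaction implicitly before this UPDATE; nothing is
                # permanent until conn.commit() below.
                cur.execute(
                    "UPDATE orders SET status = 'cancelled' WHERE customer_id = ?", (42,)
                )

                if cur.rowcount == EXPECTED_ROWS:
                    conn.commit()    # the change touched exactly what we expected
                else:
                    conn.rollback()  # anything surprising: back out and investigate
                    print(f"touched {cur.rowcount} rows, expected {EXPECTED_ROWS}; rolled back")
            finally:
                conn.close()
            ```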

          2. 2

            There’s kind of a skillset around safely working inside an active production system; by continually scaring new developers into thinking this is some kind of dark art that should never be attempted, we’re kinda shortchanging the next generation.

            I think people should be Sufficiently Intimidated into behaving carefully, which this kind of scaring off may do. And yet, even if you know better, once working on prod becomes sort of routine you will make terrible mistakes. I once dropped an entire database which had no backup. Another colleague once caused a massive outage simply by accidentally leaving some logging on after they were done (causing the disk to fill up). And so on.

            So I think you’re spot on in saying that working on production should have enough safety nets that any random screwup can be recovered from with minimal damage. But I still think it should be avoided as much as possible, and having “We Don’t Edit In Production” deeply ingrained into a company’s culture as a general rule is a good thing, just like “In Tests We Trust” should be ingrained in every new hire.

          3. 10

            I have seen this and unfortunately had to do something similar (the editing on production part). What’s more scary is that you get used to it. It stops feeling like you are doing said haircuts with chainsaws. At first it definitely feels that way, later on … not so much. I even had the unfortunate pleasure of having to actually sometimes go in and manually edit fields in the live production database because there was literally no other way to fix entire sites.

            1. 20

              What’s more scary is that you get used to it.

              For anyone interested in learning more, this concept is called normalization of deviance, and applies in a variety of low-probability, high-consequence activities.

              1. 6

                I did a few fixes like this for a different reason: I didn’t know better. The boys said, “fix this ASAP”, so I ssh’d in, yadda, yadda, yadda, stuff fixed. In my “defence”, I also worked in support at a major hosting company, where ssh’ing into prod, finding out which client was hammering the shared database, and acting on it (restarting the DB server or Apache if it was stuck, blocking some script kiddie’s IP if they were trying to “hack” the site, or just suspending the client and letting them deal with the troubleshooting) was part of the job.

                Since then, I’ve started doing both things in better ways :)

                1. 2

                  A place I used to work had no way to reverse a common accidental click in our admin backend… except by opening a Rails console on production. And this was done frequently, with no checklists, guessing which fields to edit. Needless to say, the poor guy doing this spent just as much time dealing with the resulting data corruption.

                2. 10

                  I’m happy to see more writing on this subject. I’ve seen way bigger fights fought over way smaller violations of “best practices”.

                  My most recent example was submitting a pull request with two commits: a documentation change and gitignoring a single file generated by the documentation. I was forced to create a separate pull request for the .gitignore change, in blind pursuit of the best practice that “unrelated” changes should always be in separate pull requests.

                  1. 3

                    What benefit do people think they get out of putting every change in a separate PR? I could understand wanting them in different commits (though that’s still tenuous at best), but PRs?

                    1. 4

                      Probably their merge-to-main workflow squashes by default, so each PR becomes one commit.

                      1. 5

                        That was the case. They mentioned wanting the ability to revert.

                        However: 1) they had never reverted a commit in the repo’s 5+ year history, 2) it’s extremely unlikely that change would need to be reverted, and 3) it was a single-line change; changing one line is arguably easier than finding the commit SHA to pass to “git revert”.

                        But the really frustrating thing is that arguing against this kind of purism is a negative-sum game. Since the time savings we’re arguing over are so tiny, even winning the argument (not having to create a separate PR) isn’t worth the ~half hour it would take to convince a purist to lighten up. I have to imagine there are other “best practices” out there that are also pointless, but just too inconsequential to argue against.

                        1. 2

                          This is often the case when someone with little real-world experience starts writing down policy and misses what is important and what is not. Some things matter a lot (keeping your build system easy to set up and simple to reason about is worth a lot of effort). Some things don’t (who cares about code formatting style, so long as someone picks a vaguely sane style and everyone runs an autoformatter with it).

                          1. 2

                            I believe the big benefit of autoformatters is that they move arguments about formatting style out of individual PRs and into PRs against the autoformatter config. In individual PRs those arguments can invisibly add up to a lot of wasted time. Endless arguments on the autoformatter’s config are easier to see and shut down.

                    2. 1

                      But these changes are related??

                    3. 5

                      I feel pretty lucky that from very early on we were pretty far from this at $WORK. But we do still have “operational stuff that gets pretty dangerous”, and it’s all kinda scary (like most toil honestly).

                      One recent strategy has been to build runscripts based off “do-nothing scripts” (see this post).

                      It lays out all the steps, allows for admonitions, and establishes ways to layer in good automation (rough sketch below).

                      One thing that has been pretty nice is hooking this up with our cloud provider’s CLI tooling to allow for more targeted workflows (much more assurance that you are actually in prod or in staging, etc.), and then turning our “do-nothing script” into a proper runbook that just does the thing that needs to happen.

                      One trick here for people who have dog-slow CI or the like is to keep this in a repo that is more amenable to fast git merges. That, plus having a staging environment that is allowed to be broken temporarily, gets rid of a lot of excuses. Along with giving people the space to actually do this kind of work!
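
                      In case it helps anyone picture it, here’s a rough sketch of the do-nothing-script shape in Python. Everything in it (the step names, the staging-only check) is made up for illustration; the point is just that every step starts out as a printed instruction plus a pause, and gets automated one at a time:

                      ```python
                      """Do-nothing runbook sketch: each step only tells the operator what to do,
                      until someone finds the time to automate it for real."""
                      import sys

                      def confirm_environment():
                          # Hypothetical safety check; wiring this up to real cloud CLI output is
                          # an obvious first step to automate.
                          env = input("Which environment are you pointed at (prod/staging)? ").strip()
                          if env != "staging":
                              print("Refusing to continue outside staging in this sketch.")
                              sys.exit(1)

                      def step_snapshot_database():
                          print("1. Take a snapshot of the orders database and note its ID.")
                          input("   Press Enter when done... ")

                      def step_apply_config_change():
                          print("2. Apply the config change from the ticket, then reload the service.")
                          input("   Press Enter when done... ")

                      def step_verify():
                          print("3. Watch the dashboard; error rates should stay flat for 10 minutes.")
                          input("   Press Enter when done... ")

                      if __name__ == "__main__":
                          confirm_environment()
                          for step in (step_snapshot_database, step_apply_config_change, step_verify):
                              step()
                          print("All steps complete.")
                      ```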

                      1. 2

                        do-nothing scripts look awfully like checklists (which are a great idea!)

                        1. 1

                          Yeah, but then they provide a way to automate some of the steps incrementally.

                        2. 2

                          Thank you for showing me “do nothing scripts”. That’s a really interesting, actionable idea.

                          1. 1

                            If you have anything outside your own project that relies on the staging environment, you will also need some other environment that is not allowed to break.

                            Working as a front end developer against a staging environment that is always breaking is really bad for morale.

                          2. 5

                            That person’s hands were tied, and there was no useful way to do anything about it. The “infra” which had accreted at this company forced us into any number of terrible patterns, and this was the path of least resistance that still worked. He was biasing for useful outcomes and to his credit was being very careful about it.

                            It was either “edit in prod” or endure DAY LONG development cycles: make a change, wait 24 hours, come back tomorrow, and try it again. That’s how broken it was.

                            Does he not get paid either way? Letting costs expose themselves by doing things the way bad systems force us to is how you get companies to fix bad systems. Show management what their bad infra really costs, and they might be motivated to fix it. It’s not your job to work around your boss’s bad decisions.

                            1. 7

                              I would love to know how people get these things fixed in practice. Maybe it’s my own experience that is out of the ordinary in this regard, but I’ve yet to see an instance where so-called bad practices are replaced with so-called good practices in a way that satisfies everyone and that makes things actually better (where better means simpler, faster, safer, etc.). My experience has been that trying to go from a lackluster infra to a by-the-book infra is extremely difficult and more than likely to fail.

                              One approach is the Hero Approach. An engineer notices a problem in the infrastructure and has both the knowledge of all the parts (code, build system, CI, cloud infra, secret management, deployment, orchestration) and all the necessary access to go and fix the situation by themselves and impose it. This usually has the downside of creating a situation where only one person really understands how the whole system works, and they might have blind spots when it comes to the needs of certain groups of users.

                              Another approach is the Committee Approach. Engineers from every relevant group (devs, ops, SRE, etc.) are put on a team to discuss and improve the infra situation. After everyone has stated their particular desiderata (devs want to be able to connect to live systems to gather data on issues; ops wants every action to be automated with no human in the loop; SREs want every part of the system to be elastic; etc.), it feels like there is no solution that will satisfy everyone. Meetings are planned to discuss options, but really the whole effort just peters out.

                              I know it’s hard to give general advice when most of the problems are going to be specific, but I’d like to hear stories of teams that went from, for example, “we modify a config file in prod” to “we have a configuration distribution system” (or whatever), where all parties (devs, ops, SREs, management, customers) recognized that the new solution was objectively better in all regards.

                              1. 5

                                The Phoenix Project is a fun work of fiction that made me feel like it would be possible to complete a transition like this.

                                1. 8

                                  I read The Phoenix Project last year, and though it was fun and cathartic at moments, I finished reading with more unanswered questions than I had before I started.

                                  Going back to the example of a config file, what I got from The Phoenix Project was that we wouldn’t want a human to go and modify the file in prod; instead, what we (should) want is an automated, repeatable, computerized process that will apply config changes.

                                  But doesn’t that automated process necessarily entail a big bump in complexity? Where we had no code before, now we need an entire system that must manipulate configuration files, has its own set of challenges (how do we detect and handle errors while writing a new config file?), and has to deal with all the not-so-fun stuff of writing and maintaining software. What about authentication? I’m guessing we’d want a secrets management solution, but that’s more complexity, more possible points of failure. How will our config modification service get the right permissions? A new layer of administrivia? And what’s the actual process for making a config change? Is it an obtuse process that requires a PR in GitHub with at least two approvals and feels like trying to do surgery with blacksmith gloves on?

                                  So if we recognize that editing config files in production is a form of normalization of deviance and we want to address it, it seems we are immediately faced with a long and complex project that would take engineering resources away from other endeavours; the number of hard tasks is intimidating, and there are plenty of questions that don’t seem to have a good answer.

                                  So going back to my original question: how do people manage to get it done? Are there ways to introduce the changes incrementally and gradually in a way that does not require the whole world to change how they do things? Would seriously love to hear successful war stories here.

                              2. 4

                                does he not get paid either way?

                                Not for much longer, no. Unless he wants to lie and say “this can’t be done” instead of “I refuse to do this.”

                                1. 1

                                  It’s not your job to work around your boss’s bad decisions

                                  I agree with your sentiment here, but sometimes you do what’s best for the business (or at least what you think is best) to keep it running, since it is how you make money. I think secretly working around a problem just to avoid conflict or having to be assertive is likely the real problem in many of these scenarios, though, and in that case I might say what you said in a different way: it is your job to fix (not work around) your boss’s or colleagues’ bad decisions. It is what will make you a leader. It might be thankless, but it will be the right thing.