1. 41

  2. 13

    I’m not sure who this is written for. I’ve worked on-call at several companies and talk to people who are on-call for others and except for 1 or 2, the norm is to use on-call as a tool to automate away on-call stuff. Maybe this is an old-guard type thing or maybe the circle I keep myself in is special.

    1. 3

      I think it is an old guard thing but it is definitely still out there. I’m in one of those places right now.

    2. 10

      “Put your developers on-call. You’ll be surprised what stops breaking.”

      I love that. Suckers will suddenly get an interest in Erlang, Haskell, or Ada 2012. Bonus technique is giving them copies of those plus the Release It! book. A few others on OS’s and tools that automatically catch/fix things below the applications. You’re covered for a lot of things at that point.

      The top tier will have a better solution, though. It comes from the old guard who solved the problem a long time ago. They just deployed simple, transactional apps on top of clusters of mainframes, AS/400, OpenVMS, and NonStop. I’ve often run across two of these professionally but never seen one go down. They just drop a transaction on occasion and tell the user they did. Redo it and it works, unless there’s a network problem or something. Systems designed to stay up usually do. They cost money & have weird interfaces from the 80’s and shit. They work though.

      One legitimate complaint I hear is how they cost a lot of money. Running all your stuff on them might be prohibitive for a lot of businesses. Not to mention lack of support for (software thingy here). Easiest method, similar to how high-assurance security handles trust, is to split your infrastructure between critical stuff whose failure is catastrophic (gets you paged) and less-critical stuff that’s not. Put critical stuff on expensive, robust systems. The other stuff can go on COTS BSD/Linux servers. Even the Web front ends might go on them, with requests transformed into simple formats and sent over simple protocols to the robust backends. That makes the front end code either stateless or minimally stateful, which increases its robustness. One can put input validation, firewalls, etc. there to benefit from 3GHz multicores, too.

      1. 16

        I love that. Suckers will suddenly get an interest in Erlang, Haskell, or Ada 2012. Bonus technique is giving them copies of those plus the Release It! book. A few others on OS’s and tools that automatically catch/fix things below the applications. You’re covered for a lot of things at that point.

        I would love to believe this, but I don’t think that it’s true. There are plenty of companies where developers do on-call rotations just so management can understaff and pay itself bonuses. What you actually get are resentful engineers doing work they don’t like and aren’t especially good at.

        Besides, it’s usually not engineers but management that makes technical choices, and the typical executive objections to Haskell and Erlang (“No one uses it!”, “$IDIOT_I_TRUST says it’s not Ready for Production, and I have to pretend to know what that means!”) are both common and shrill enough that very few of the people who are seasoned enough to appreciate the values of those languages and systems (which are excellent, I agree) feel like dressing up for the fight.

        It would be a better world if you were right. My experience, mostly in corporate/startup programming, tells me that engineers would just adapt to the new political environment, and being the guy who’s always agitating for doing things right (especially if that involves invoking Haskell) is politically maladaptive.

        1. [Comment removed by author]

          1. 7

            So while I’m currently at a Java shop, I’ve been shipping lots of Go based glue services. I think that’s a good way to get something else in the door.

            I used to have faith in this, but now that I’m older, I generally think it’s better either to get management buy-in or not to do something (like deploy a new language). Tech managers love to give the encouragement-that’s-not-really-permission. If you succeed, they can take credit. If you fail, they can blame you for going off the reservation.

            The other problem around language adoption in particular is that you end up having to teach the language to others. Now, I enjoy that kind of work. There are two problems with it. First, you can end up in unexpected and fully unwitting political battles with the managers of the people taking your classes, even if you teach after hours, because the zero-sum MBA types who employ programmers see that time/energy as something “their people” ought to be rendering unto them. Second, and I absolutely hate that it is this way and this is one of the reasons why I’d love to see the corporate software industry burn to the ground and then scorch the earth under it, but it’s a reality that one has to deal with: teaching is seen by most corporations as “female work” and you will be seen as a nurturer rather than a manager/alpha-prick who “gets shit done”.

        2. 6

          In my experience, nothing stops breaking. However, the political shenanigans to weaken SLAs, delay response, shift responsibility and redirect blame become downright outlandish.

          I don’t have personal philosophical objections to taking call. Not all problems are self-healing or unimportant, so somebody’s got to do it, and as partial author of the software in question, I’m probably in a better position than most. However, I categorically reject it in reality because I categorically don’t get paid for it. Write me a software engineer employment contract that pays hourly, and I will gladly collect my time-and-a-half at 3AM fixing whatever stupid thing broke. (Also, I suspect problems actually showing up on the business’s bottom line will have a much bigger impact on them getting fixed than any attempt to manipulate programmers will.) I’ve never even heard of such a thing in the US, though.

          1. 5

            On my teams we don’t coerce any engineers into being in the on-call rotation but we pay them 33% more for weeks that they’re on-call, which seems to do the trick.

            1. 3

              Some people want money more than time or rest. The OP probably goes too far in the critique of the problem by forgetting these people exist. They’re always worth remembering. There are also some who like the status of being the single most important link in the ops chain. Management doesn’t even have to promote it to get the sacrifice: this type will do it anyway for a feeling of importance.

              The author’s point remains that it’s work that should be minimized and shown for (slash rewarded for) the hard thing it is. There are just some people, like those on your teams, who would happily (or semi-grudgingly) do it for what motivates them.

        3. 11

          A few cultural notes.

          Taking up the pager is an ops rite-of-passage, a sign that you are needed and competent.

          I always took it, in most companies, to mean the opposite. The VPs come in and go as they please, check out on errands as they wish, and write their own performance reviews. On-call is for the plebs who have to prove themselves and for the people who are inexperienced enough to compete on suffering rather than perceived competence. I’m not saying that it should be that way. It probably shouldn’t. A permanent underclass is never a good thing to have, and when operations is actually important, it shouldn’t be relegated to an underclass at all.

          Being on-call is the best way to ensure your infrastructure maintains its integrity and stability. […] Except that’s bullshit.

          Yes, absolutely. It is. On-call usually exists, like any other form of technical or organizational debt, because the people at the top get the upsides of certain risks (such as pushing product to market quickly) and the people of low status get to deal with the tailings: bad/breaking software, on-call rotations, lots of blame for things that go wrong.

          The best way to ensure the stability and safety of your systems is to make them self-healing. Machines are best cared for by other machines, and humans are only human. Why waste time with a late night page and the fumblings of a sleep-deprived person when a failure could be corrected automatically?

          This is what “DevOps” was supposed to be: a progressive elimination of operational grunt work by replacing ad-hoc issue fixing with automation. Unfortunately, it devolved into either (a) an MBA buzzword that just means “ops”, or (b) an excuse for understaffing by adding to programmers the expectation of carrying a pager.

          Because system integrity is only important when it impacts the bottom line. If a single engineer works herself half-to-death but keeps the lights on, everything is fine.

          Right. I don’t think we can change the culture, for two reasons. First, bad code and practices are seen as a limited risk by business types, whereas doing things right (and taking longer to do them) is seen as an intolerable risk. They’d rather deliver garbage on time than do the job right and risk blowing through the time budget. Of course, the long term result is delivering garbage and being late/slow. Second, most of what we do is just making business processes more efficient, which is a polite way of saying that we unemploy other people to make businessmen rich. That by definition is an ugly thing to be doing, and the hideous, sleazy culture of business programming would come from that if nothing else.

          Every on-call sysadmin has war stories to share. Calling them war stories is part of the pride. […] This is the language of the disenfranchised. This is the reaction of the unappreciated.

          Yes. This is absolutely correct. (This is the Red Pill of corporate programming.) The term that I use is “macho-subordinate”. It is bullshit. Competition to suffer is one of the oldest divide-and-conquer games in the book. However, so long as there’s an army of 22-year-old boot camp grads who haven’t figured out that they’re actually not going to be CEOs inside of 3 years, it will never go away.

          1. 5

            I don’t really agree with a lot of this post and your response. As someone who has been on-call, I’ve been given responsibility to make the system not need on-call. I also live in a country where a company is legally required to give a day off from work for every 7 days of on-call.

            But one thing that I think many developers don’t realize, and need to, is that they are in charge of their own output. The only thing a manager can do is try to coerce you to do work, but a manager is incapable of solving the problems they are asking their subordinates to solve, whether for lack of knowledge or of person-power. Once one realizes that, it shifts the balance of power.

          2. -2

            Font size for ants.