1. 50
  1.  

  2. 27

    So far, I’m of the opinion that system management tools that require confirmation every step of the way lead to warning fatigue, and you wind up with the same destructive-by-default behavior, just with more typing.

    I’d rather see systems set up where the cost of recovering from these changes is near zero.

    It’s a common pattern in container orchestration, at least. I’d like to see it applied even further: automatic snapshots at the filesystem level that make it easy to undelete.

    1. 4

      I definitely agree w/r/t the major advantages of change recovery (git has saved me so many times in the past, I wish every system worked like it). Things like containers, or at a lower level, having self-contained programs can reduce error-prone operational difficulties.

      Many tools that have these sorts of prompts offer a “non-interactive” mode for when you want to set up an automated process, for example. Though generally they still output what kind of work they’re doing.

      I don’t use Ansible but I do use Salt, and generally you’re building up one final command to execute many changes at once. So it’s not like you’re hitting “Y” tens of times in a row, and it will definitely be faster than SSH’ing into boxes one by one to roll out changes. I think this post is about major changes of that kind, not “confirm that you really want to commit to git”. And having a final confirmation is nice (though many tools instead opt for “run a dry run, then rerun everything”).

      If this change is important enough to roll out on all your prod machines, but not important enough for you to check the final operations (checklists are what keep planes in the air), is it actually important enough to do in the first place?

    2. 11

      prgmr has several tools with no-op flags like --dry-run. The tools we interact with the most do work the way this blog post suggests, using something like --force, --really, or --yes and defaulting to a dry run. It is a safer design.
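
      Roughly, the shape of that pattern looks like this (a simplified sketch with made-up actions, not our actual code):

          # Simplified sketch: dry run by default, --force to actually apply.
          import argparse
          import subprocess

          def planned_commands():
              # Stand-in for whatever the real tool computes.
              return [["echo", "restart example service"], ["echo", "rotate example key"]]

          def main():
              parser = argparse.ArgumentParser()
              parser.add_argument("--force", action="store_true",
                                  help="actually apply the changes (default: dry run)")
              args = parser.parse_args()
              for cmd in planned_commands():
                  prefix = "running: " if args.force else "would run: "
                  print(prefix + " ".join(cmd))
                  if args.force:
                      subprocess.run(cmd, check=True)

          if __name__ == "__main__":
              main()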

      We also separate our idempotent playbooks from those unsafe to run more than once. For particularly complex or sensitive work we’ll manually review the commands an operator plans to run, either via IRC or email depending on whether conversation or annotation is more appropriate to the work. Here it’s typical for us to run the command dry and then force it, and both runs show up in the runbook that gets reviewed and signed off on.

      As a further line of defense, you can completely unplug your hardware key and, if necessary, flush your cache, completely removing the ability to access production.

      1. 11

        Making --dry-run the default is very counterintuitive for those already used to how most, if not all, command-line utilities currently work. In Unix and Unix-like operating systems the assumption has, AFAICT, always been to execute the primary function of the utility, assuming the user knows what they want to do. Any deviation from the default is controlled by flags. Many utilities will check an rc file to override the defaults. For those that don’t, there are aliases and shell functions.

        I’m hard pressed to think of many cli utilities that display what they do first. The exceptions are those with an interactive mode, usually invoked with -i. The only utility I can think of that backs off before making a catastrophic change is git push, but that’s because the user’s repository is out of sync with the target repo. The default is always to push.

        That’s not to say that the user community couldn’t swap the default to safety-first for functions and utilities that make changes to the system, but it’s a massive shift in terms of training, expectations, and productivity for the broader community. If I had to type mv --no-dry-run every time I wanted to rename a file, I’d pretty quickly write a bunch of aliases to invoke them the traditional, and correct (IMHO), way. :D

        That said, I’ve always appreciated rsync’s --dry-run flag. I use it often and many of my utilities offer the same feature. My counter-suggestion to the author (@moshez) would be to write utilities the “traditional way” so the default case is to do whatever the utility was designed to do, and add a -n or --dry-run flag for pre-flight verification, and consider balking only in cases of real conflict a la git.

        1. 3

          I think the trade-off here is about the frequency of tool use, and the scope of change a tool can make, or how many people it can affect. With core utils like cp or common tools like tar, which you may execute hundreds of times a day, requiring --no-dry-run by default would be expensive and annoying for no upside, because the annoyance cost would be extremely high and the potential damage is isolated to a single system. But imagine a tool that someone on your team may run weekly or monthly, and that could destroy or re-provision hundreds of machines running mission-critical systems: with that sort of tool, it’s well worth it to be extremely verbose, print debugging info by default, and require special action to enact changes.

          In that case the downside to writing tools “the traditional way” is the potential for huge damage. You can argue that one shouldn’t give the untrained such permission or responsibility to use a tool like that, but I’d rather have the systemic oops protection if I can get it.

          An example of a tool at my job that requires a --no-dry-run flag is a mass refactoring tool that runs a user-defined job and could open hundreds of PRs on your behalf. Much better to have dry run be the default than to have a new user of the tool spam every project with a nonsense PR. We have other tools that require the env var I_AM_AN_EXPERT_AND_ALSO_RECKLESS to be set before performing actions against production without confirmation.
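
          A stripped-down sketch of that env-var gate (the variable name is from above; everything else is invented):

              # Sketch: refuse to act against production without an explicit opt-in.
              import os
              import sys

              def require_explicit_opt_in():
                  if os.environ.get("I_AM_AN_EXPERT_AND_ALSO_RECKLESS") != "1":
                      sys.exit("refusing to act against production without "
                               "I_AM_AN_EXPERT_AND_ALSO_RECKLESS=1 set")

              require_explicit_opt_in()
              print("performing production actions without further confirmation...")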

          1. 8

            I’d prefer all tools be dangerous or safe, but not a mixture. It’s inconsistency that causes confusion:

            “Hmm…does rsync default to dry-run or no-dry-run?” man rsync

            “Ansible, that’s --no-dry-run, right?” man ansible

            Consistency leads to confidence. Are all command-line utilities consistent? No. But that typically comes from folks deciding their approach is more important than the user interface.

            One of the great insights of Apple’s Human Interface Guidelines in the 80s was the emphasis on consistency and predictability (not that Apple cares about that as much anymore). Users knew where to look and what to expect. The same should be true of cli tools.

            I worked at a shop where one of the sysadmins set “sane defaults” via aliases for all users on the systems he managed. Unfortunately, they were often diametrically opposed to the real defaults. Users would log in to systems not under his control and find out the hard way that mv -i wasn’t the aliased default on those systems. It led to false confidence because they thought they knew how the utilities worked.

            You can argue that utilities like rm are safer on a single machine than a utility that touches hundreds of machines, but rm -rf on a control server might be worse. I want my users to be as conscientious on a single box as they are on a thousand. The fact of the matter is that, broadly speaking, cli utilities do what you ask of them without question or hesitation. Users should be taught to expect that and execute a dry run if that’s what’s needed.

            I’ll also echo @KevinMGranger’s comment elsewhere in this thread about warning fatigue. Great point!

            1. 1

              I agree with you that changing the defaults for existing tools like mv sounds like a nightmare. Even if you think mv should act differently than it does, silently changing behavior on some machines is begging for accidents on the others…

              But while it may be common for many Unix utilities to offer no warning messages, I agree with @jitl’s point about frequency and scope — tools that you use frequently with limited scope (such as moving files on a single machine) shouldn’t have warnings, because that leads to warning fatigue. But, tools that are both potentially very dangerous and used infrequently probably should require warnings, because an error is painful to recover from and it’s unlikely that you’ve built safe habits around the tool since it’s rarely used.

              Off the top of my head, here are some real-world examples of tools / CLI-based workflows that distinguish between dangerous and regular operations, or prompt you for potentially dangerous ones:

              • Protected branches with git, either by disabling force-pushes to master, or potentially disabling pushes to master entirely. This is common enough that Github built a UI for branch protection; previously you’d need access to the git server itself to install pre-receive hooks. Pushing or force pushing to non-master branches is fine; doing it to master isn’t, since it’s rarely needed or desired (usually you’ll only do that if… someone else accidentally force-pushed to master and you’re trying to clean up).
              • Running git clean requires a -f to actually delete any files (or special configuration in your git config).
              • Even regular old rm has built-in controls. For example, deleting a single file is fine, but to delete an entire directory you must pass -r. And on current versions of GNU rm, it will refuse to delete the root directory unless you pass the special flag --no-preserve-root.
              • Some commands check for UID 0 and fail with an error message if you’re running them as root, and require special configuration or flags to run as root. For example, brew does this, and IMO this is really good practice: running brew as root will likely mess up whatever you’re trying to install by making the files owned by root (and thus unmanageable by anyone other than root, e.g. brew itself unless invoked with sudo again); it’s an easy mistake for novices to make; and intentionally running brew as root should almost never happen. (See the sketch just after this list.)
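
              A rough sketch of that kind of UID-0 check (nothing like brew’s actual implementation; ALLOW_ROOT is invented):

                  # Sketch: bail out when running as root unless explicitly overridden.
                  import os
                  import sys

                  if os.geteuid() == 0 and os.environ.get("ALLOW_ROOT") != "1":
                      sys.exit("error: don't run this as root; installed files would end "
                               "up owned by root (set ALLOW_ROOT=1 to override)")

                  print("running as an unprivileged user, continuing...")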

              That being said it’s probably no surprise I agree with @jitl, since we sit next to each other at our office and work on many of the same internal tools :)

        2. 5

          I see your point but I don’t want to have another “Do you really want to do this?” dialog-ish behaviour. If I decide to run your application with the permissions and keys to erase all and everything, I’m doing it on purpose. You can’t protect people against themselves. So I expect a command to run. At least for the normal tools I use every day.

          1. 5

            We make commands that can cause harm require a captcha (i.e. “please type this random number and press enter”) when run in a production environment, but no captcha when run in a development environment.

            The idea is that if devops were conditioned to enter the captcha in a development environment, they would be likely to enter the captcha without thinking when running the command in a production environment.

            A similar argument can be made for --no-dry-run; if it is required even in a development environment, devops will be conditioned to always provide it, and the added safety of --no-dry-run will be negated.

            The captcha provides additional safety: because the confirmation is typed interactively rather than passed as a flag, it prevents a potentially unsafe command from being saved in shell history and re-run by accident.
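
            The shape of that captcha check, as a simplified sketch (made-up function and environment handling, not our real tooling):

                # Sketch: in production, require retyping a random token before acting.
                import random
                import sys

                def confirm_if_production(environment):
                    if environment != "production":
                        return  # no captcha outside production, to avoid conditioning
                    token = str(random.randint(1000, 9999))
                    answer = input("production environment: type %s to continue: " % token)
                    if answer.strip() != token:
                        sys.exit("confirmation failed; aborting")

                confirm_if_production("production")
                print("running the potentially destructive command...")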

            1. 1

              Do you also have a way to bypass the captcha in order to automate things? If not, this seems like a very poorly scaling solution.

              1. 1

                Yes, there is a way to bypass the captcha for automated tools. This is particularly important when building tools on top of existing tools (in the spirit of the so-called “unix philosophy”).

            2. 3

              I think you bring up a good point, but if you don’t want an effect by default, I would prefer the default behavior to be an error, requiring either --dry-run or --force. I would be annoyed if I thought the tool did the thing and it had only done a dry run; at the opposite end, I might not understand the full effects of my actions and do a very bad thing.
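
              In argparse terms, that “no default, make me choose” behaviour might look something like this (a sketch, not any particular tool):

                  # Sketch: no default mode at all; the user must pick --dry-run or --force.
                  import argparse

                  parser = argparse.ArgumentParser()
                  mode = parser.add_mutually_exclusive_group(required=True)
                  mode.add_argument("--dry-run", action="store_true",
                                    help="only print what would be done")
                  mode.add_argument("--force", action="store_true",
                                    help="actually apply the changes")
                  args = parser.parse_args()

                  print("dry run only" if args.dry_run else "applying changes")

              Run it with neither flag and argparse exits with an error instead of silently picking a mode.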

              1. 1

                Agreed, and the error message should be displayed prominently, otherwise there’s little difference between what you suggest and a no-op, apart from the process exit code.

              2. 3

                It’d be good in general for commands that have an effect on the state of your filesystem to have a way to declare their inputs and their outputs for a given set of options. This way you’d be able to analyze e.g. install scripts to review which files would be read and written before even running it, and check if you’re missing anything the script depends on.
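
                There’s no standard interface for this today, but one could imagine a convention like a --describe-io flag (entirely hypothetical) that prints the declared reads and writes for the given options without touching anything:

                    # Hypothetical sketch: a copy-like tool that can describe its file I/O
                    # for the given options; --describe-io is an invented convention.
                    import argparse
                    import json
                    import shutil

                    parser = argparse.ArgumentParser()
                    parser.add_argument("source")
                    parser.add_argument("dest")
                    parser.add_argument("--describe-io", action="store_true",
                                        help="print declared reads/writes as JSON and exit")
                    args = parser.parse_args()

                    io_plan = {"reads": [args.source], "writes": [args.dest]}
                    if args.describe_io:
                        print(json.dumps(io_plan, indent=2))
                    else:
                        shutil.copy(args.source, args.dest)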

                1. 4

                  They are unfortunately very chatty though. Alias rm to rm -i and it’s just an obnoxious amount of y y y for any non-trivial removal. I wish someone would fix this with something more like a table output showing a reasonable summary of what’s going to happen, letting me confirm just once.
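
                  A small wrapper along those lines (a sketch, not an existing rm feature) could collect everything first, print one summary, and ask a single time:

                      # Sketch: list everything that would be removed, confirm once, then act.
                      import os
                      import shutil
                      import sys

                      targets = [p for p in sys.argv[1:] if os.path.lexists(p)]
                      if not targets:
                          sys.exit("nothing to remove")

                      print("about to remove %d item(s):" % len(targets))
                      for path in targets:
                          print("  " + path)

                      if input("proceed? [y/N] ").strip().lower() == "y":
                          for path in targets:
                              if os.path.isdir(path) and not os.path.islink(path):
                                  shutil.rmtree(path)
                              else:
                                  os.remove(path)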

                  1. 4

                    This is a powerful paradigm.

                    Incremental confirmation can result in a state where the system has a half-performed operation when the operator decides to abort.

                    Displaying a plan – and requiring confirmation of the entire plan – ensures that the operator intends to do everything, or intends to do nothing.

                    1. 3

                      The book Unix Power Tools, originally published in 1993, includes a recipe for the rm behavior you’re describing. It is surprising this feature hasn’t made it into coreutils sometime in the intervening two and a half decades.

                      1. 2

                        I’ve worked on systems that enforce -i and it just made me develop a very bad -f habit.