1. 7

  2. 5

    I found the post vexing and the title mostly incorrect. Not only can you have a rollback button, you almost certainly must have one, and I have found this to be increasingly true over the 27 years of my software development career. YMMV.

    It is true that skyliner.io can’t have a rollback button, because as the OP correctly points out, state will have mutated, and there’s no way for skyliner to find an automatic generic solution in the event of schema change, given arbitrary applications deployed on the platform.

    But it is also true that if you-the-application-owner have an application that is interesting, then you will want it to continue working after a failed deployment, rather than limping along or crashing to the floor until you are able to fix the problem and develop it forward and redeploy.

    The easiest way to do that is to design your application, or the changing part of your application at least, to be stateless. This of course is usually impossible, and so you have to roll up your sleeves and do the work.

    So then you have to manually implement an explicit rollout and rollback strategy. In Erlang, this is part of the culture and there is tooling. You describe the way your schema is changing, you describe the way to change it back, and you decide what process you want to follow to conform any ‘new schema’ data to the old schema. This could be as simple as throwing all the new data away, or as complex as implementing an arbitrary transform on the data.
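
    A minimal sketch of what “describe the change, describe the way back” could look like, in Python rather than Erlang. Everything here is hypothetical and only for illustration: the `Migration` type, the `name` to `first_name`/`last_name` schema change, and the helper names. The `down` step also shows the “conform new-schema data to the old schema” decision: records written after the rollout still get folded back into the old shape.

    ```python
    from dataclasses import dataclass
    from typing import Callable

    Record = dict  # hypothetical: one row of application data


    @dataclass
    class Migration:
        name: str
        up: Callable[[Record], Record]    # old schema -> new schema
        down: Callable[[Record], Record]  # new schema -> old schema


    def split_name(rec: Record) -> Record:
        """Roll out: replace 'name' with 'first_name' and 'last_name'."""
        first, _, last = rec["name"].partition(" ")
        out = {k: v for k, v in rec.items() if k != "name"}
        out["first_name"] = first
        out["last_name"] = last
        return out


    def join_name(rec: Record) -> Record:
        """Roll back: fold the two fields into a single 'name' again."""
        out = {k: v for k, v in rec.items() if k not in ("first_name", "last_name")}
        out["name"] = f'{rec["first_name"]} {rec["last_name"]}'.strip()
        return out


    split_name_migration = Migration("split_name", up=split_name, down=join_name)


    def roll_out(records: list, m: Migration) -> list:
        return [m.up(r) for r in records]


    def roll_back(records: list, m: Migration) -> list:
        return [m.down(r) for r in records]
    ```

    Rolling in and out a few times in dev then amounts to checking that `roll_back(roll_out(data, m), m)` gives you back data the old code can still read.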

    Then, you come up with a change management plan, which could be arbitrarily complex, using dark deploys, canary deploys, whatever fits your use case, and a runbook, which tells you what to do in the event that there’s a failure necessitating rollback. You probably roll in and roll out a few times in dev, and ensure the tests are still passing before, during and after.

    Finally, you’re ready, and with your magic rollback button in place, you can deploy (relatively) fearlessly.

    Doing anything less than this on any exciting application is a recipe for total disaster, and the community is not served by saying that such a thing is not possible. It’s annoying and it’s hard, and it’s frustrating, and it’s definitely not error free, and you cannot have it automatically unless you design for it in advance (with, e.g., a fully rewindable and replayable event bus upon which all commands are placed) or are truly stateless. But consider the alternative: you roll out, you made a mistake, and now your customers’ data is possibly being incorrectly mutated or destroyed with every passing second, while your coders desperately try to rectify the situation under pressure and execute an ad hoc rollback plan.
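
    To make the “rewindable and replayable event bus” idea concrete, here is a toy sketch under stated assumptions: an in-memory append-only log, events as hypothetical `(account, amount)` tuples, and state that is only ever derived by replaying the log. None of these names come from the post. Because the log itself is never mutated, a bad deploy of the handler can be undone by replaying the same events through the old handler.

    ```python
    from dataclasses import dataclass, field


    @dataclass
    class EventLog:
        """Append-only log of commands; state is derived, never stored."""
        events: list = field(default_factory=list)

        def append(self, event) -> int:
            self.events.append(event)
            return len(self.events)  # offset just past this event

        def replay(self, apply, upto=None) -> dict:
            """Rebuild state from scratch, optionally stopping at an offset."""
            state = {}
            for event in self.events[:upto]:
                state = apply(state, event)
            return state


    def apply_v1(state: dict, event) -> dict:
        """Known-good handler: add the amount to the account balance."""
        account, amount = event
        new = dict(state)
        new[account] = new.get(account, 0) + amount
        return new


    def apply_v2_buggy(state: dict, event) -> dict:
        """Hypothetical bad deploy: counts every amount twice."""
        account, amount = event
        new = dict(state)
        new[account] = new.get(account, 0) + 2 * amount
        return new
    ```

    Rollback here is `log.replay(apply_v1)`, and `log.replay(apply_v1, upto=n)` rewinds to the state as of offset `n`. A real system would need durable storage and snapshots; this only shows why the design makes rollback mechanical.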

    1. 4

      I found the post vexing because not everything is a web application where you can move fast and break things. Updating the “application” I work on in production requires at least ten days’ notice to our customer (service level agreements between us and the Monopolistic Phone Company), which they can veto. Once we get past that hurdle, we can do the upgrade, and because our service runs as part of the phone network, if there’s the slightest problem, we roll back (which I’ve personally called when code I wrote didn’t handle a particular class of phone numbers correctly [1]). There is no “quick fix” where I work.

      [1] I was not told I would receive non-North American Numbering Plan numbers. I was not told I would receive total garbage numbers that couldn’t possibly be real phone numbers (like all zeros, or all ones, or 25 digits long).

      1. 3

        The fundamental problem with a rollback button is that it has very low information density: 1 bit. You can’t roll back the universe, so there’s some implicit scope associated with that button. For a simple, stateless web application, in the absence of data migrations, a rollback is a meaningful operation. However, for full-system deployments, it’s simply not meaningful without parameterization: What do you want to roll back?

        Once you parameterize rollback, you’ll discover that there are actually a number of reasonable choices, of which “revert to previous version” is only one option and not even a particularly common one outside of the stateless use case. And when you do have a truly stateless system, you probably want more isolation anyway, in the form of independent versioning. Rather than roll back the entire application, can you roll back a single screen? Consider revisions of independent blog posts or wiki pages. That can be modeled as a “roll forward” by simply deploying a revert change. Or, you can just fix the bug or repair the faulted data and not “roll back” at all.

        The argument is basically: You can’t rewind time, but you can simulate it by performing a carefully selected and partial inverse operation. Or you can simply take an alternative corrective action.
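
        A small sketch of that “scoped revert as a roll forward” idea, using the wiki-page example. The `PageHistory` class and its method names are made up for illustration. The point is that `revert` never deletes history: it appends a new revision whose content happens to match an older one, and it only touches the page you name.

        ```python
        class PageHistory:
            """Per-page revision lists; each page is versioned independently."""

            def __init__(self):
                self._revisions = {}  # page name -> list of contents, oldest first

            def publish(self, page: str, content: str) -> None:
                self._revisions.setdefault(page, []).append(content)

            def current(self, page: str) -> str:
                return self._revisions[page][-1]

            def revision_count(self, page: str) -> int:
                return len(self._revisions[page])

            def revert(self, page: str, steps: int = 1) -> None:
                """Roll forward to older content: append a copy, erase nothing."""
                history = self._revisions[page]
                if steps < 1 or steps >= len(history):
                    raise ValueError("no such earlier revision")
                self.publish(page, history[-1 - steps])
        ```

        Here the “What do you want to roll back?” question is answered by the `page` and `steps` parameters, which is exactly the information a single global button cannot carry.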

        1. 2

          I would submit that the last thing you want when you are doing a rollout is an affordance with high information density. Ideally you have very limited choices and decisions to make, which may have arbitrary richness and complexity behind them but which you have pre-baked to be utterly simplistic.

          1. 3

            I think the article section titled “A sharp knife, whose handle is also a knife” covers what I’ve observed in practice quite well.

      2. 2

        I very much agree.

        This is also in the same vein as disaster recovery processes and procedures, insofar as it won’t be used often and therefore when broken won’t be noticed, until an emergency happens and you’re left stranded.

        1. 1

          Rollbacks are better thought of as rollforwards… to an old version. If I deploy code that breaks, I roll forward with code that works, which might be an old version of that code. If the old version of that code can’t deal with bad state in datastores, then my old code is also bad code, and I should feel bad about it.

          1. 1

            I’ve also heard the term “negative diffs”: a git revert is a positive increment with a negative diff.