1. 10

Since the autoscaler reads from Zookeeper, we shut it off manually during the migration so it wouldn't get confused about which servers should be available. It unexpectedly turned back on at 15:23 PDT because our package management system noticed the manual change and reverted it. The autoscaler then read the partially migrated Zookeeper data and, within 16 seconds, terminated many of our application servers (which serve our website and API) as well as our caching servers.
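
For illustration only (hypothetical paths and helper names, not the actual system): a minimal sketch of how a desired-state autoscaler loop that trusts Zookeeper can do this, since a half-migrated registry looks the same to it as "most of these instances should not exist".

```python
# Hypothetical sketch of a desired-state autoscaler loop; not the real system.
from kazoo.client import KazooClient  # real client library, illustrative usage


def reconcile(zk_hosts, running_instances, terminate):
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    try:
        # "/services/app" is a made-up path; mid-migration, only some of the
        # server znodes exist yet, so this "desired" set is a partial one.
        desired = set(zk.get_children("/services/app"))
    finally:
        zk.stop()

    # Anything running but absent from the (partial) desired set gets killed.
    for instance_id in set(running_instances) - desired:
        terminate(instance_id)
```

The missing guard is the whole point of the sketch: nothing here distinguishes "these servers were deregistered on purpose" from "the registry is mid-migration", and nothing caps how many terminations a single pass may issue.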

  2. 4

    Interesting analysis. I’ve been bitten multiple times by the config management rolling back my local changes because it doesn’t know it’s a maintenance window.

    My lesson has been to do maintenance using the automation, rather than working around it and trying to make it stop. Way more work, but usually safer and more incremental.

    1. 1

      Why not disable salt/chef/puppet clients on the host you’re performing maintenance on?

      That is how we do it in production and it works well – otherwise this exact thing happens. Config management returns the system to the state you told it to maintain :)
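
      A minimal sketch of that pattern for the Puppet case; the --disable/--enable agent flags are real, while the wrapper and the maintenance task are placeholders:

      ```python
      # Disable the config management agent for the duration of a maintenance
      # task and re-enable it afterwards, even if the task fails.
      # (Puppet shown; for Salt or Chef you would stop/start salt-minion or
      # chef-client instead.)

      import subprocess
      from contextlib import contextmanager


      @contextmanager
      def puppet_disabled(reason):
          subprocess.run(["puppet", "agent", "--disable", reason], check=True)
          try:
              yield
          finally:
              subprocess.run(["puppet", "agent", "--enable"], check=True)

      # Usage, with do_migration standing in for the actual maintenance work:
      # with puppet_disabled("zookeeper migration"):
      #     do_migration()
      ```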

      1. 5

        Doing it via config management lets you test your automation against a staging cluster before rolling out.
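
        Roughly what that looks like as a helper, purely as an illustration (the environment names and the apply/verify hooks are assumptions, not any specific tool):

        ```python
        # Illustrative rollout helper: the same reviewed change is applied per
        # environment, so staging exercises the exact path production will see.

        def roll_out(change, apply_change, verify, environments=("staging", "production")):
            for env in environments:
                apply_change(env, change)   # e.g. trigger the config management run
                if not verify(env):         # smoke checks, assumed to exist
                    raise RuntimeError(f"{env} failed verification, stopping rollout")
        ```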

        1. 4

          Works, but painfully, for n in [1, 10). Impossible when n is in [25, …).

          But the real fun happens somewhere in [10, 25]. The odds favor you forgetting a machine, so your environment suddenly becomes a hybrid of old and new, which has led to insanely disastrous results.

          If, instead, all changes go through your tools, you can guarantee uniform application, you won't forget to turn config management off at the beginning and back on at the end, and you get free rollback in the face of issues. Plus there's little chance of a well-meaning ops person hammering on a bad change to get it going, hence less divergence.

          The pro move, IMO, is to make your deployments completely immutable - we do this to great success at $dayjob. Deploying a patch means building and testing a new container (or fleet of containers), driven through a CI process. Then deployment is just a push to the Docker cluster, a smoke test, and a DNS cutover. It's so much easier to work in the large when you get away from the concept of changing apps in situ and instead think about wholesale replacement.
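
          A sketch of that flow under stated assumptions: a plain Docker registry, an HTTP health endpoint, and start_fleet/update_dns callables standing in for the orchestration and DNS pieces.

          ```python
          # Illustrative immutable deploy: build and push a new image, stand up a
          # new fleet next to the old one, smoke-test it, then cut DNS over.
          # Nothing changes in situ; rollback is pointing DNS back at the old fleet.

          import subprocess
          import urllib.request


          def deploy(version, start_fleet, update_dns):
              image = f"registry.example.com/app:{version}"   # hypothetical registry/tag

              subprocess.run(["docker", "build", "-t", image, "."], check=True)
              subprocess.run(["docker", "push", image], check=True)

              # start_fleet is assumed to launch containers for this image and
              # return the base URL of the new, not-yet-live fleet.
              new_fleet_url = start_fleet(image)

              with urllib.request.urlopen(f"{new_fleet_url}/healthz", timeout=10) as resp:
                  if resp.status != 200:
                      raise RuntimeError("smoke test failed, old fleet stays live")

              update_dns("app.example.com", new_fleet_url)    # the actual cutover
          ```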

          1. 4

            This generally works - the main issue is that, for it to be reliable, you need to ensure that the human never forgets to turn off the config management (it mostly works as expected, except during emergency changes, when that step gets forgotten).

            Another issue is that when you restart the config management, you then have a period where it starts applying changes again and you essentially have to hope that it will maintain the correct state and not undo or damage your work.

            If instead you do the work through the config management tool, you get to have it along for the ride the whole way, so the outcome is more deterministic. You also get the added benefit of being able to flow your entire change through a deploy process (assuming your config management goes through PR review, CI, canaries, etc.).

            How much all this sounds like it’s worth it probably depends on the number of machines you have running :)
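
            For what it's worth, a rough sketch of that kind of staged flow; the apply_change and healthy hooks are assumptions rather than any particular tool's API, and review/CI are presumed to have happened before this ever runs:

            ```python
            # Illustrative canary-style rollout: the change reaches the full fleet
            # only after a small canary slice has taken it and stayed healthy.

            def staged_rollout(change, fleet, apply_change, healthy, canary_size=2):
                canary, rest = fleet[:canary_size], fleet[canary_size:]

                for host in canary:
                    apply_change(host, change)  # e.g. a targeted config management run
                if not all(healthy(host) for host in canary):
                    raise RuntimeError("canary unhealthy, change never reaches the fleet")

                for host in rest:
                    apply_change(host, change)
            ```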