1. 39
  1. 26

    If you want to impress me, set up a system at your company that will reimage a box within 48 hours of someone logging in as root and/or doing something privileged with sudo (or its local equivalent). If you can do that and make it stick, it will keep randos from leaving experiments on boxes which persist for months (or years…) and make things unnecessarily interesting for others down the road.

    Man, yes. At a previous company I set up the whole company on immutable deployments. Part of this was that you could still log in and change stuff, but doing so marked the box as “tainted” and it would be terminated and replaced after 24hrs. This let you log in, fix a breakage and go back to bed … but made sure “port it back to the config management tool” was the #1 task for the morning.

    A second policy was that no machine existed for more than 90 days.

    These two policies instilled in us a hard-line attitude of “if it isn’t managed, it isn’t real” and were resoundingly successful in pushing us toward solid deployment mechanisms that worked and survived instances being replaced regularly.
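
    For the curious, here is a rough sketch of how this kind of taint-and-replace setup can be wired together. It assumes AWS/EC2, IMDSv1, cron, and an autoscaling group doing the replacing; the paths, tag names and timings are illustrative, not the exact scripts we ran:

    # Login hook, e.g. dropped into /etc/profile.d/, marks the instance as
    # tainted the moment anyone gets an interactive shell (the instance role
    # needs ec2:CreateTags for this to work).
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    aws ec2 create-tags --resources "$INSTANCE_ID" \
        --tags "Key=tainted,Value=$(date -u +%s)"

    # Cron'd reaper: terminate anything still marked as tainted and let the
    # autoscaling group bring up a clean replacement. A stricter version would
    # compare the tag's timestamp against a 24hr cutoff, and the 90-day rule
    # works the same way keyed off each instance's LaunchTime.
    aws ec2 describe-instances \
        --filters "Name=tag-key,Values=tainted" "Name=instance-state-name,Values=running" \
        --query "Reservations[].Instances[].InstanceId" --output text |
      xargs -r aws ec2 terminate-instances --instance-ids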

    I can’t recommend this approach enough. Thank you Rachel, for writing about this.

    1. 8

      A second policy was that no machine existed for more than 90 days.

      I’m curious how you managed the stateful machines (assuming you had some). I’m a DBA, and, well, I often find myself pointing out to our sysads that stateful stuff is just harder to manage (and maintain uptime) than stateless stuff. Did you just exercise the failover mechanism automatically? How did that work downstream?

      1. 7

        Great catch! Our MySQL database cluster was excluded from the rule because of the inherent challenges of making that work; our caching and ElasticSearch clusters, however, were not. Caching because it is a cache, ElasticSearch because its replication and failure handling are batteries-included. Note this was with a fairly modest amount of data; if our data grew to $lots we would likely stop giving ES the same treatment.

        We worked hard to architect our systems in such a way that data was not on random machines, but in very specific places.
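
        To give a flavour of why ES could be cycled like that: before replacing a node, the rotation only needs to ask the standard cluster health API whether every shard still has a copy elsewhere. An illustrative sketch (assumes jq and at least one replica per index), not our exact tooling:

        # Green means every primary and replica shard is allocated, so with
        # replicas >= 1 there is still a copy of each shard on another node
        # and this one can be terminated; ES re-replicates onto the replacement.
        status=$(curl -s "http://localhost:9200/_cluster/health" | jq -r '.status')
        if [ "$status" = "green" ]; then
            echo "safe to rotate this node"
        else
            echo "cluster is $status, skipping rotation for now"
        fi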

        1. 5

          Ah, good, okay. That makes more sense.

          Currently we’re in a private cloud, so nothing’s batteries-included. Plus, we’re using a virtual storage system in a way that would make traditional replica/failover structures too expensive. The result is that our production DB VMs go for a very long time between reboots, let alone rebuilds.

          I agree, though, that isolation is a great way to limit that impact. Combine that with some decent data-purpose division (e.g. move the sessions out of the DB into a redis store that can be rebuilt, move the reporting data to a separate DB so we can flip between replicas during reboots, etc), and you can really cut down on the SPOFs.
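
          (The session part really can be that lightweight; something shaped like the line below, with a hypothetical host name, is enough, because every key expires on its own and the whole store is disposable.)

          # Illustrative: sessions live in a throwaway redis with a TTL, so the
          # store can be rebuilt, or simply lost, without touching the real DB.
          redis-cli -h sessions.internal SET "session:$SESSION_ID" "$SESSION_JSON" EX 86400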

      2. 1

        I’ve been in 2 different orgs where they reimaged the machine as soon as each user logged out!

        1. 1

          Aggressive! I wonder if there were escape hatches for emergencies?

          1. 1

            What sort of emergencies are you envisioning?

      3. 6

        Sometimes, I get on the box, and it’s just me. That is, there’s just one user on board, that user is me, and I’m running my “w”. Nothing else is there. Many times, I’ve gone off and looked at the part of syslog which captures login information. This might be /var/log/secure depending on how the system is setup. I’ve found that grepping for “accepted cert” is a great way to look for prior ssh connections (possibly for interactive logins) while discarding a bunch of other stuff that’s relatively uninteresting.

        On a recent Linux box, equivalent commands I ran to get roughly the same output were

        $ w --from --ip-addr
        $ sudo journalctl -b0 -u sshd.service --grep="Accepted publickey"