1. 1

  2. 1

    tl;dr: keep a list of the machines you have running, the ones that aren’t running, and what they should be running.

    1. 1

      Except, don’t keep a manual list, which will inevitably go out of date or otherwise desync from reality – instead, leverage your chosen system’s API (whether AWS, Google Cloud, or a dedicated VMware cluster) and find a way to treat your infrastructure as your database.

      The OP talks about this a little, but dismisses it without thinking much. Yes, AWS shows you a list of all active EC2 instances ‘for free’, but it doesn’t show you your auto-scale groups and so on. That’s true if all you use is their interface, but you should use their API and query for that. It’s not that hard to use aws-cli and see what’s running across all of AWS. You should do this. Trying to maintain a separate database, especially for large-scale systems that grow and shrink automatically, isn’t just nightmarish, it’s often straight-up impossible.

      Create tools to query your infrastructure – don’t rely on static, hand-maintained databases.
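
      A minimal sketch of such a tool, assuming you feed it the JSON that `aws ec2 describe-instances --output json` prints. The function name `running_instances` and the shape of the returned records are made up for illustration:

```python
import json


def running_instances(describe_instances_json: str) -> list[dict]:
    """Flatten `aws ec2 describe-instances` JSON into one record per running instance."""
    data = json.loads(describe_instances_json)
    found = []
    for reservation in data.get("Reservations", []):
        for inst in reservation.get("Instances", []):
            # Skip anything stopped, terminated, pending, etc.
            if inst.get("State", {}).get("Name") != "running":
                continue
            # Tags arrive as a list of {"Key": ..., "Value": ...} pairs.
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            found.append({
                "id": inst["InstanceId"],
                "name": tags.get("Name", ""),
                # AWS sets this tag on instances launched by an auto-scaling group.
                "asg": tags.get("aws:autoscaling:groupName", ""),
            })
    return found
```

      The `asg` field covers the "it doesn’t show you your auto-scale groups" complaint: every running instance comes back with the group it belongs to, if any. The EC2 API is per-region, so you would run the query once per region you use and concatenate the results.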

      1. 0

        I do believe you absolutely need a hand-maintained database, but depending on your infrastructure that doesn’t have to be a database of machines (and on cloud providers it mostly shouldn’t be). For EC2, for example, you should have a database from which you can derive what your ASGs should look like, without referring to exact instance IDs.

        Tools to query your infrastructure help but aren’t sufficient, as that only tells you what’s running right now. It doesn’t tell you what is meant to be running. You need some form of database that acts as a blueprint for your infrastructure which you can compare against.

        Without that, your live AWS state is your hand-maintained database. With no outside reference you won’t be able to tell if it’s correct. Having a database that ties into your configuration management makes it easier to make changes, recover from disaster and discover discrepancies.
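
        To make the comparison concrete, here is a minimal sketch. It assumes both the blueprint and the live state have already been reduced to a mapping of ASG name to instance count; that shape and the name `diff_asgs` are hypothetical, not taken from any particular tool:

```python
def diff_asgs(blueprint: dict[str, int], live: dict[str, int]) -> dict[str, list]:
    """Compare desired ASG sizes (blueprint) with what is actually running (live)."""
    # In the blueprint but not running at all.
    missing = sorted(name for name in blueprint if name not in live)
    # Running, but nobody wrote down that it should be.
    unexpected = sorted(name for name in live if name not in blueprint)
    # Present on both sides, but with a different instance count:
    # (name, desired, actual).
    wrong_size = sorted(
        (name, blueprint[name], live[name])
        for name in blueprint
        if name in live and blueprint[name] != live[name]
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_size": wrong_size}
```

        Querying alone can only ever produce the `live` side. Every discrepancy reported here exists only because there is a separate `blueprint` to compare against.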

        1. 2

          You’re not arguing for a database, then, you’re arguing for a by-hand design / blueprints.

          You absolutely need an automatic, QYI-style database of actual machines. You also absolutely need a hand-maintained blueprint of what your infrastructure should be. But these are unfathomably different things. An outside reference that tells you “We have an application that’s deployed in this way, with these supporting systems, and so on” is different from “What do I have deployed right now”. That’s not a database, that’s a blueprint, and the wording matters.

          I’m not just speaking from my ass either: I maintain a system which contains dozens of deployments of the same application. Each deployment is identical to the others, and each is composed of approximately 25 components (LBs, ‘ASGs’, VMs, DNS configuration, etc.[1]). These 25 components are built up via Ansible scripts. Those scripts, along with the design of an arbitrary component (including the naming scheme used, and what metadata is associated with it within VMware[2]), are static and hand-maintained, but they’re also not a database: they’re a half-dozen plaintext files which describe how we build it, and a couple of OmniGraffle graphs to support those files.

          An infrastructure database is not your blueprint; it’s something very different. I think you’re conflating those two ideas. You absolutely need to have a separate, QYI-style ‘database’, or it will go out of sync, and you will lose machines. (When I inherited this infrastructure, the blueprints did not exist and we had a hand-maintained database, and that was fucking terrible. The first thing we did was build a QYI-style ‘database’, which enabled us to develop as-built blueprints that we could then iterate on.) A blueprint is a different story entirely, and I think your OP is, perhaps, causing confusion because of that (at least with me).

          Now, if you want to talk about how that QYI is implemented – specifically with respect to caching, that becomes a more interesting discussion – because now you have an automatically maintained cache of your infrastructure, and that’s valuable for various reasons (mostly query speed). I have some fun stories about when those things go wrong too; but that’s a story for another time, I think.
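
          For illustration only, the simplest form of such a cache is a time-to-live wrapper around whatever function actually queries the infrastructure. This is a sketch under that assumption, not how our system does it; `CachedInventory` and its parameters are hypothetical names:

```python
import time
from typing import Callable


class CachedInventory:
    """Cache the result of an expensive infrastructure query for `ttl` seconds."""

    def __init__(self, query: Callable[[], list], ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        self._query = query      # the slow call that hits the real API
        self._ttl = ttl
        self._clock = clock      # injectable so the cache can be tested
        self._value = None
        self._fetched_at = None

    def get(self) -> list:
        now = self._clock()
        # Re-query on first use, or once the cached copy is older than ttl.
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._value = self._query()
            self._fetched_at = now
        return self._value
```

          The trade-off is the usual one: answers come back fast, but they can be up to `ttl` seconds out of date.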

          [1] It’s deployed on self-hosted VMware, not AWS, so the ‘ASG’ bits are handmade scripts that link into the VMware APIs.

          [2] Well, within Consul now. The VMs used to just have flat text files living in /meta with various information you would have to get at by doing, essentially, ssh foo@bar cat /meta/data.

          1. 1

            When I say database, I mean it merely as the place you store data. I’d consider your Ansible configuration and inventory to be your database. That’s normal and a good way to do it.

            What’s QYI? I can’t find any relevant references online.

            1. 2

              QYI = Query-your-infrastructure – I just abbreviated.