1. 4
  1. 4

    I typically need to manage infrastructure at three levels at least: pre-provisioning to acquire static resources like domain names and static IPs, provisioning to allocate and set up the baseline resources, and then runtime provisioning to create things like log groups or buckets, as well as scaling up and down. Step 1 is typically done manually, step 2 through Pulumi, and step 3 through Boto. To me that is where the gap is, and how I interpret this article: managing infrastructure is in practice a continuous process, even though our attention has been mostly focused on step 2. I do think that using a general-purpose language is best, as you can leverage abstraction and composition primitives, and it is more likely to result in a lightweight implementation (as opposed to Pulumi or CDK, for instance). As soon as the infrastructure is dynamic, it needs to be managed actively, and the best way to do that IMO is to use code to define the behaviour and operators to reconcile the current state with the target state. I feel that we’re missing simpler solutions than what is mentioned in the article.
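
To make the "operators that bring the current state to the target state" idea concrete, here is a minimal sketch of a reconcile step in plain Python. Everything in it is hypothetical and for illustration only: `Action`, `plan`, `reconcile`, and the in-memory `FakeCloud` stand in for real calls (for example through Boto) and are not part of any library mentioned in the thread.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str  # "create" or "delete"
    name: str  # resource name, e.g. a bucket or log group


def plan(target: set[str], current: set[str]) -> list[Action]:
    """Compute the actions that move `current` to `target`."""
    creates = [Action("create", n) for n in sorted(target - current)]
    deletes = [Action("delete", n) for n in sorted(current - target)]
    return creates + deletes


def reconcile(target, list_current, apply) -> list[Action]:
    """One reconcile pass: observe, diff, act. Returns the actions taken."""
    actions = plan(set(target), list_current())
    for action in actions:
        apply(action)
    return actions


class FakeCloud:
    """In-memory stand-in for a provider API (assumption, not a real client)."""

    def __init__(self, names):
        self.names = set(names)

    def list(self) -> set[str]:
        return set(self.names)

    def apply(self, action: Action) -> None:
        if action.kind == "create":
            self.names.add(action.name)
        else:
            self.names.discard(action.name)


cloud = FakeCloud({"logs-old", "bucket-a"})
target = {"bucket-a", "logs-new"}

first = reconcile(target, cloud.list, cloud.apply)
second = reconcile(target, cloud.list, cloud.apply)  # nothing left to do
```

Running the same pass repeatedly is safe because the second call finds no difference and takes no action, which is the property that makes this style suitable for continuous, runtime management.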

    1. 3

      Why do you do the first step manually rather than through Terraform (or CDKTF if you’re already using CDK)?

      1. 1

        I guess because we’re using Pulumi and not Terraform directly, and we don’t want Pulumi to accidentally deprovision EIPs that we’ve whitelisted to access legacy services. It’s a clear gap, and I’m thinking about better ways to manage and automate the provisioning of static resources and runtime resources. That will need to be API-driven, as having to call Terraform from within a service would be quite inefficient.

        1. 2

          For the accidental removal protection, you could add some IAM policies which protect the specific resources, either by tag or by specific ID. Then you have one “normal” infra role with the protection in place and one “power user” role without it. We do that for database instances.
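
As a rough sketch of the tag-based variant described above, the snippet below builds an IAM policy document that denies destructive actions on any resource carrying a given tag. The idea is that it would be attached to the "normal" infra role and left off the "power user" role. The function name `protection_policy`, the tag `protected=true`, and the choice of actions are assumptions for illustration; whether a given action honours the `aws:ResourceTag` condition key should be checked against the AWS documentation for that service.

```python
import json


def protection_policy(actions, tag_key="protected", tag_value="true"):
    """Build a policy that denies `actions` on resources tagged tag_key=tag_value.

    Hypothetical helper: the tag name and the action list are illustrative.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDestructiveActionsOnProtectedResources",
                "Effect": "Deny",
                "Action": list(actions),
                "Resource": "*",
                "Condition": {
                    # Only matches resources that carry the protection tag.
                    "StringEquals": {f"aws:ResourceTag/{tag_key}": tag_value}
                },
            }
        ],
    }


# Example: protect tagged Elastic IPs and database instances from removal.
policy = protection_policy(["ec2:ReleaseAddress", "rds:DeleteDBInstance"])
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Because an explicit deny takes precedence over any allow in IAM policy evaluation, the normal role cannot remove tagged resources even if a broader allow is granted elsewhere.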

          1. 1

            This seems like a good idea! That said, I work in finance, where the risk appetite is quite low and the pain of an outage is very high. Having to reprovision an interface resource because the permission or role was misconfigured is not something I’d like to go through ;)