1. 34
  1. 23

    Interesting, thanks for writing.

    The problem you run into with Ansible (as an example of a stateless solution) is that removing resources from your config doesn’t ensure they’re removed from the cloud without some care. So say I create a VPC in the YAML, run Ansible and it gets built, then I remove it from my YAML and run Ansible again, the VPC will continue to exist indefinitely without adding some other steps.

    By contrast, Terraform with state typically ensures that the resources will be cleaned up once removed from the HCL.

    In theory you’ll end up with much more cruft over time the stateless way. Whether or not that is more painful than working with terraform state is a compelling question. I think it depends on the team size and apply process.
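
    To make the difference concrete, here is a minimal Python sketch of the two planning strategies. The resource names and the `plan_stateless` / `plan_stateful` helpers are hypothetical, and real tools do far more than compare sets of names; this only illustrates why a removed resource is invisible without recorded state.

    ```python
    def plan_stateless(config: set[str]) -> dict[str, set[str]]:
        """Without recorded state, the tool only sees what the config asks for."""
        # Anything dropped from the config is invisible, so nothing is deleted.
        return {"create_or_update": set(config), "delete": set()}


    def plan_stateful(config: set[str], state: set[str]) -> dict[str, set[str]]:
        """With recorded state, a resource missing from the config becomes a delete."""
        return {"create": config - state, "delete": state - config}


    # First run: the VPC was in the config and got built (hypothetical names).
    state = {"vpc-main", "subnet-a"}
    # Second run: the VPC has been removed from the config.
    config = {"subnet-a"}

    stateless = plan_stateless(config)
    stateful = plan_stateful(config, state)
    ```

    In this sketch only the stateful plan schedules `vpc-main` for deletion; the stateless plan has no way to know it ever existed.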

    1. 4

      This is exactly correct. When Terraform was very early on (I think 0.3 or some such, well before data sources), I did an on-prem to AWS migration solely using Terraform.

      Unfortunately, it wasn’t quite up to snuff so after a few months I rewrote it all in Ansible, which ended up being far, far more verbose and had all the problems you listed. From an operator pov it ‘felt good,’ though.

      Had I to do it again, I would likely use Ansible to ‘bootstrap’ all the Terraform stuff (s3 buckets and iam users/roles/keys) and do the rest with TF. Shooting for 100% Terraform (or really 100% only one tool) is usually the wrong path.

      1. 4

        At a previous company we had a bootstrap terraform repo that would set up the basics for the rest of the automation, like backend buckets and some roles. The bootstrap terraform used local state, committed to the repository. It worked well enough.

        1. 2

          My approach is generally to use a bootstrap Terraform stack which creates the buckets and so on that Terraform needs, then change the bootstrap stack to use the bucket it created as its state backend. Having some backups is useful (as with all statefiles) but it’s generally an easy and safe approach.

        2. 2

          That’s thought provoking. I wonder if it would be reasonable to run Ansible (or some other stateless tool) in a mode where it went looking for things that didn’t exist and removed them. The flaw there would be that no Ansible config sets up an entire system from first principles, but assumes there are some tools already in place.

          Maybe git or other source control could be used to take on the state burden and detect removal.

          1. 7

            The downside of that idea is that it is extremely common to have multiple Terraform workspaces managing resources at the same provider. If you did “destroy all resources that aren’t in your configuration” you’d end up removing, say, all the QA infrastructure every time someone did an update to production if the two are managed separately.
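
            A small Python sketch of that failure mode, with made-up resource names and a hypothetical `naive_purge` helper. It assumes the tool can only compare "everything at the provider" against "everything in this one configuration":

            ```python
            def naive_purge(provider_resources: set[str], config: set[str]) -> set[str]:
                """Delete everything at the provider that this config does not declare."""
                return provider_resources - config


            # Hypothetical account holding two separately managed environments.
            provider_resources = {"prod-vpc", "prod-db", "qa-vpc", "qa-db"}
            prod_config = {"prod-vpc", "prod-db"}

            # Resources the production run would destroy.
            doomed = naive_purge(provider_resources, prod_config)
            ```

            Here the production run marks both QA resources for destruction, because nothing tells it they belong to someone else.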

            1. 2

              terraform et al are great in theory. it’s in practice where they often fall apart. it’s one of those domains where sanity must be enforced through disciplined conventions, which is hard.

              1. 1

                If such a culture existed, then ansible, terraform et al would probably never have sprung into existence. This usage is very much motivated by a mindset of throwing a flashy tool at a problem, rather than understanding the fundamental problem and how to control it.

                For example, using yet another AWS service out of their lineup of hundreds is a choice that is rarely questioned. Then that service has its own challenges… Ok, AWS offers yet another service to ‘solve’ them, and so on. There is no time for discipline in this reality… Just keep adding stuff and hiring engineers to cope with the system until the next company mass layoff.

            2. 2

              This is a classical problem with stateless tools, going as far back as make.

              1. 2

                This is also a thing in Puppet (another stateless system). The answer is that you have to put the absence of the resource in the config (in Puppet terms, ensure => absent). Then when the config has been applied everywhere it needs to be, you can delete that “tombstone” resource. For some resources there is also a purge attribute that means to delete all unmanaged resources found in a scope (like a directory).
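
                The tombstone idea can be sketched in Python roughly as follows. This is not Puppet code; the `apply` function and the resource names are hypothetical stand-ins for a stateless tool that only acts on entries it is told about.

                ```python
                def apply(desired: dict[str, str], host: set[str]) -> set[str]:
                    """Apply a config mapping resource name -> "present" or "absent" to one host."""
                    result = set(host)
                    for name, ensure in desired.items():
                        if ensure == "present":
                            result.add(name)
                        elif ensure == "absent":
                            result.discard(name)  # the tombstone actively removes the resource
                    return result


                host = {"old-cron-job", "nginx"}

                # Simply dropping the entry leaves the old resource in place.
                dropped = apply({"nginx": "present"}, host)

                # A tombstone entry removes it; the entry itself can be deleted
                # once every host has applied the config.
                tombstoned = apply({"nginx": "present", "old-cron-job": "absent"}, host)
                ```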

              2. 12

                One more thing that state enables in my case is multiple repositories dealing with one provider. For example I’ve got 20 different projects which have resources in the same account of some provider. Think something like pingdom checks. I’d rather keep those defined in the project repository than some global “all-Terraform-stuff”.

                This allows each project to apply Terraform without going “there’s lots of resources here that I don’t know about”. It also allows those resources to be removed automatically, without cleaning up anything else that may not belong to the project.

                1. 10

                  What I took away from this article is that the Terraform docs should do a better job of explaining why state is an absolutely critical feature of the tool. The article breaks down a couple of the (IMO) least important reasons Terraform maintains state and then, not unreasonably, concludes that since those are the only ones the docs mentioned, there aren’t any other reasons and the listed reasons aren’t enough.

                  Being able to reliably remove previously-created infrastructure and being able to control different subsets of infrastructure from separate Terraform projects are the two biggies. I’ll add a variant on the second one: being able to support a mix of Terraform-managed and ad-hoc hand-created resources.

                  For example, we have an AWS account (separate from our production one) where developers have permission to create certain kinds of resources by hand in the AWS console. They use it to spin up temporary EC2 instances and so on. Ideally they’d do that stuff using Terraform, but letting them use the console means one less piece of training to worry about. The network infrastructure and some other components in that account, though, are Terraform-managed.

                  It would suck if the developer-created EC2 instances got blown away every time we ran terraform apply! But because Terraform’s state file only has information about Terraform-managed resources, it completely ignores those other resources when it applies changes.

                  1. 1

                    The stuff about ad-hoc resources mixed in with terraform resources is mitigated with metadata. If an item doesn’t have ‘controlled by terraform’ in the metadata, it wouldn’t get deleted.
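
                    Roughly, in Python, assuming a hypothetical tag name and that every managed resource reliably carries it:

                    ```python
                    MANAGED_TAG = "controlled-by-terraform"  # hypothetical tag name


                    def deletable(resources: dict[str, set[str]], config: set[str]) -> set[str]:
                        """Return resources that carry the ownership tag but are no longer in the config."""
                        return {
                            name
                            for name, tags in resources.items()
                            if MANAGED_TAG in tags and name not in config
                        }


                    resources = {
                        "network-core": {MANAGED_TAG},
                        "old-nat-gateway": {MANAGED_TAG},
                        "dev-scratch-ec2": set(),  # created by hand in the console, no tag
                    }
                    config = {"network-core"}

                    to_delete = deletable(resources, config)
                    ```

                    Only the tagged resource that has left the config is selected; the hand-created instance is never considered.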

                    As for multiple projects, that seems like something a design could enable.

                  2. 9

                    there is only 1 hard problem in cs: state.

                    naming things and cache invalidation are state problems.

                    1. 1

                      Haha, I love this. You simplified one of my favorite quotes.

                    2. 8

                      We use terraform and ansible for our operations automation. Terraform is so so so much better precisely because of the state. Others mentioned the deletion problem. But also Ansible has all kinds of side-effects that aren’t tracked. Like, lineinfile often ends up being a thing that’s not actually declarative in practice.

                      Ansible is often a big headache. I’d love for there to be a terraform that can manage resources on the remote machine, more like what Ansible does.

                      1. 5

                        whether a stateful solution has more benefit than cost, will depend. i think in the typical case it will not.

                        for sprawling aws accounts managed by many people far removed from knowledge of the deployed systems, stateful deploys are undoubtedly best. this isn’t even a bad situation. this sprawling aws account somehow became very valuable. that’s a good thing.

                        if you can, via greenfield or refactoring, there are lots of ways to shift the cost calculation towards stateless. use one aws account per system. avoid sprawl. deploy lots of environments all the time, the more the better. test changes in every environment. have a zero tolerance policy to performance or stability regressions in deploys.

                        if you find yourself with sprawl, and the deployed systems aren’t valuable yet, stop and do better now. a nimble system is always better than a rigid system, even more so when it’s valuable. it may not be possible to do later. it is definitely possible now, and needn’t be that much work!

                        there is a time to move slowly and make change difficult. there is also a time to be fast and nimble. i find fast and nimble to be more fun, and usually more effective. a slow, brittle iteration loop has a high cost. it’s not worth paying unless it’s really needed.

                        shameless plug, how i do stateless on aws:


                        1. 5

                          Yep, well, as a NixOS fan I’m of course in full agreement with all of this. (Not that NixOS or nixops deal with automated deployments as elegantly as I’d wish, or anything like that, but it’s the philosophy…)

                          I think these things are often clearer from outside the relevant community. From inside it, it’s easy to see the way things have been, and harder to see how they might be.

                          1. 7

                            I agree for the case of provisioning a single machine; I have used Terraform to declare the state of a machine, and NixOS is preferable for a variety of reasons. At the same time, a Nix store includes a small database which records the state of the store, both because a scan of the store is expensive and also because it would otherwise be difficult to tell if a package were corrupt/missing or intentionally deleted. The database of a Nix store is analogous to the state file of a Terraform environment.

                            1. 2

                              Yes, that’s very true. (Sorry for the delay in responding!)

                          2. 3

                            My reaction when I saw terraform was:

                            1. Why does this need a new language instead of a schema on top of an existing language?

                            2. What? Add this S3 config to store the state? How is this kind of thing not exactly what supposedly terraform tried to solve? Yet another piece of information, that has to be stored somewhere. I am surprised they didn’t come up with a configuration to store the configurations of where state is stored. While they’re at it.

                            3. If you use, say AWS, is it really that much easier to use than rolling your own resource creation script using awscli?

                            The answer to all this scepticism was that I was a naysayer and that terraform had to be very good because big companies are using it. I thought I was in an engineering position, so after a few instances of such nonsense, I left for greener pastures.

                            I do agree with the author. They hit the nail on the head with the remark that the sync problem is not solved by external state, but rather created by it. It’s just that no one stops to question a popular tool… Until a sexier one comes along… THEN, the herd will parrot these things ad nauseam.

                            1. 3

                              What? Add this S3 config to store the state? How is this kind of thing not exactly what supposedly terraform tried to solve? Yet another piece of information, that has to be stored somewhere. I am surprised they didn’t come up with a configuration to store the configurations of where state is stored. While they’re at it.

                              What’s even funnier is that Terraform Cloud stores all state in another instance of Terraform Cloud, which stores its state in the original Terraform Cloud, which is actually just a sqlite db on Mitchell’s desktop backed up to s3, nightly. (Yes, I am joking)

                              1. 2

                                If you use, say AWS, is it really that much easier to use than rolling your own resource creation script using awscli?

                                A million times yes. You can get your own script completely wrong in so many ways, and fail to keep track of everything you’re creating for future reference. If you trust your employer’s entire infrastructure in the hands of a (hopefully well audited) shell script, where accidental deletions are somewhat irrecoverable, you have bigger balls than most of us.

                                1. 1

                                  I think you are putting too much trust in a tool, when what it does is not black magic. It just calls aws http API much the same way you can do it too.

                                  The question I raised was whether terraform was easier to use. Not whether it was less error-prone, which in my experience it is not. There is nothing esoteric in creating resources via aws cli or HTTP API. Checking for their existence and their state is also trivial. Have you tried it? I did it multiple times before terraform even existed. With no issue whatsoever. It doesn’t require bravery, but rather confidence. Confidence is achieved by knowing how the system works under the hood. Such knowledge moves a step away each time a new layer of complexity is added.