Terraform
What is it?
Terraform is an open-source, free-to-use, mature Infrastructure as Code (IaC) tool that can be used to deploy infrastructure into many commonly-used cloud providers and other platforms.
Who is it for?
Terraform is designed for software engineers who want to build infrastructure in a reliable and repeatable manner.
When should you use it?
When you have anything but a very simple infrastructure to manage.
For relatively simple serverless deployments, you may want to consider a simpler approach with a framework such as Serverless or AWS SAM. If your serverless deployment requires capabilities that those products do not support, Terraform is worth considering.
Terraform has an architecture based on plugins. This makes it possible to deploy to a wide variety of cloud platforms (even the niche ones) and on-premises. The available plugins, which are called "providers" in Terraform terminology, are held in the provider registry.
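As a minimal sketch of how a provider is selected, the configuration below declares the AWS provider from the registry and configures it. The version constraint and region are assumptions chosen for illustration, not recommendations.

```hcl
terraform {
  required_providers {
    # "hashicorp/aws" is the registry address of the AWS provider.
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumed constraint; pin to whatever you have tested
    }
  }
}

# Provider-level settings apply to every resource that uses this provider.
provider "aws" {
  region = "eu-west-2" # assumed region
}
```

Running terraform init downloads the declared providers before any plan or apply.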
How do you use it?
Terraform files are declarative. They describe the infrastructure you want to deploy, and Terraform then makes API calls to the infrastructure provider to bring the real infrastructure in line with that desired state. The documentation for the Terraform providers is generally very good, and is available on the Terraform website.
Infrastructure provider primitives such as virtual machines, Lambda functions and API gateways are called "resources" in Terraform. Terraform also has the concept of a "module", which is a reusable set of related resources that you can use as a template to produce many copies. For example, if you have a fleet of virtual machines you might define a parameterised "vm" module that you can use to create new instances with consistent configurations. In AWS, such a "vm" module might contain an EC2 instance resource, an EBS volume resource, and a user data script that initialises the instance when it boots.
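The "vm" module described above might look something like the following sketch. The file layout, variable names, volume size, device name and AMI ID are all hypothetical, and the files are shown together in one listing for brevity.

```hcl
# --- modules/vm/variables.tf (hypothetical module inputs) ---
variable "name" {
  type = string
}

variable "ami_id" {
  type = string
}

variable "instance_type" {
  type    = string
  default = "t3.micro"
}

# --- modules/vm/main.tf ---
resource "aws_instance" "this" {
  ami           = var.ami_id
  instance_type = var.instance_type

  # Script that runs when the instance first boots (assumed to sit next to the module).
  user_data = file("${path.module}/init.sh")

  tags = {
    Name = var.name
  }
}

resource "aws_ebs_volume" "data" {
  availability_zone = aws_instance.this.availability_zone
  size              = 20 # GiB, assumed
}

resource "aws_volume_attachment" "data" {
  device_name = "/dev/sdf"
  volume_id   = aws_ebs_volume.data.id
  instance_id = aws_instance.this.id
}

# --- main.tf in the root configuration: one module block per VM ---
module "web_1" {
  source = "./modules/vm"
  name   = "web-1"
  ami_id = "ami-0123456789abcdef0" # placeholder AMI ID
}
```

Each additional module block with the same source produces another consistently configured instance.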
Terraform also has the concept of data sources, which allow Terraform read-only access to files and documents that it does not manage itself. These include templates, local files on disk and IAM policies.
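A short sketch of two data sources follows: one that renders an IAM policy document and one that reads a local file. The bucket name, policy name and file name are assumptions made up for the example.

```hcl
# Builds an IAM policy as JSON without creating anything in AWS.
data "aws_iam_policy_document" "read_bucket" {
  statement {
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::example-bucket/*"] # hypothetical bucket
  }
}

# A managed resource that consumes the rendered document.
resource "aws_iam_policy" "read_bucket" {
  name   = "read-example-bucket"
  policy = data.aws_iam_policy_document.read_bucket.json
}

# Reads a file from disk; its text is available as data.local_file.motd.content.
data "local_file" "motd" {
  filename = "${path.module}/motd.txt" # hypothetical file
}
```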
Terraform files are written in a domain-specific language, HashiCorp Configuration Language (HCL), which is implemented in Go. It supports some constructs that will be familiar from Go, such as maps and list slicing.
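As a small illustration of maps and slicing, the local values below use a map lookup and the built-in slice function. The names and values are invented for the example.

```hcl
locals {
  # A map from environment name to instance type.
  instance_types = {
    dev  = "t3.micro"
    prod = "m5.large"
  }

  zones = ["eu-west-2a", "eu-west-2b", "eu-west-2c"]

  # Map lookup by key.
  dev_instance_type = local.instance_types["dev"]

  # slice(list, start, end) returns elements from start up to, but not including, end.
  first_two_zones = slice(local.zones, 0, 2)
}
```

Note that slicing is done through a function call here rather than through Go's bracket syntax.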
Terraform stores the current state of the infrastructure in a "state file", a JSON document usually stored in a file store such as Amazon S3 or Azure Blob Storage. This file is refreshed at the start of each deployment, so that Terraform can internally compute a graph of the changes that it needs to make, and then apply those changes in the right order. It already knows about many common resource dependencies, and you can explicitly add your own. The state file can be backed up to protect against loss of the state information.
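The sketch below shows a state file kept in S3, plus one implicit and one explicit dependency. Bucket names, the state key, the region and the AMI ID are placeholders.

```hcl
terraform {
  # Store the state file remotely rather than on the local disk.
  backend "s3" {
    bucket = "example-terraform-state" # hypothetical bucket
    key    = "app/terraform.tfstate"
    region = "eu-west-2"
  }
}

resource "aws_s3_bucket" "uploads" {
  bucket = "example-uploads"
}

# Implicit dependency: referencing the bucket's id tells Terraform
# that the bucket must exist before versioning is configured.
resource "aws_s3_bucket_versioning" "uploads" {
  bucket = aws_s3_bucket.uploads.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Explicit dependency: nothing in this resource refers to the bucket,
# so the ordering is stated with depends_on.
resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.micro"

  depends_on = [aws_s3_bucket.uploads]
}
```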
Having control over how Terraform changes the state of the system is one of its most powerful features. If you would like resources to be started or stopped in a specific order, you can instruct Terraform to do that. For example, if you have a resource that takes a long time to initialise, like an EMR cluster, you can tell Terraform to bring a new one up before destroying the old one to minimise disruption. You can also import existing infrastructure into Terraform's state so that Terraform can manage it in future.
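Both ideas can be sketched in configuration. The example uses an EC2 instance rather than an EMR cluster to keep it short, and the resource names, AMI ID and bucket name are placeholders. The import block assumes Terraform 1.5 or later; older versions use the terraform import command instead.

```hcl
resource "aws_instance" "worker" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.micro"

  lifecycle {
    # When a change forces replacement, create the new instance first
    # and only then destroy the old one.
    create_before_destroy = true
  }
}

# Adopt a bucket that was created outside Terraform.
import {
  to = aws_s3_bucket.legacy
  id = "example-legacy-bucket" # name of the existing bucket (hypothetical)
}

resource "aws_s3_bucket" "legacy" {
  bucket = "example-legacy-bucket"
}
```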
Terraform supports a "plan" mode that produces a report of the changes that it intends to make, so a human can manually inspect it. The plan can then be "applied", which gives the operator reassurance that they know exactly what will be changed. Alternatively, configuration updates can simply be applied without the separate planning step, which is often used in an automated deployment by a continuous integration system.
A useful technique for organising Terraform code is in "layers". This arrangement separates concerns into functional areas and allows parts of the infrastructure to be deployed and tested independently. Terraform manages a separate state file for each layer, and these state files can be read by other layers. For example, a "base" layer might contain foundational and rarely-changing infrastructure such as VPCs, which other parts of the infrastructure need to refer to in their configurations. When a virtual machine is created in another layer, it may need the ID of its security group, which it can find by reading the "remote state" of the "base" layer.
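A sketch of that arrangement follows, assuming both layers keep their state in S3. The output name, bucket, key, region and AMI ID are hypothetical, and the two layers are shown in one listing for brevity.

```hcl
# --- base layer: publish the value that other layers need ---
# (assumes the base layer defines aws_security_group.app elsewhere)
output "app_security_group_id" {
  value = aws_security_group.app.id
}

# --- app layer: read the base layer's state, read-only ---
data "terraform_remote_state" "base" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"
    key    = "base/terraform.tfstate"
    region = "eu-west-2"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.micro"

  # Only values declared as outputs in the base layer are visible here.
  vpc_security_group_ids = [
    data.terraform_remote_state.base.outputs.app_security_group_id,
  ]
}
```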
There is a managed Terraform platform called Terraform Cloud (and a self-hosted version, Terraform Enterprise) that provides compute infrastructure and shared secret support for deployments where multiple teams are using Terraform. This may offer some advantages for larger deployments, but it also means granting a third-party tool considerable permissions to modify your environments, which could be a source of concern for some.
Why should you use it?
For managing AWS infrastructure, Terraform's main competitor is CloudFormation. There are pros and cons to both products, and it is certainly worth looking at both to see which is right for your use case. It used to be the case that CloudFormation, as a product maintained by AWS, supported new AWS features sooner than Terraform, but recently this gap has narrowed significantly. Additionally, drift handling in Terraform is superior to CloudFormation's. For example, if resources are deleted or changed manually using the AWS console, CloudFormation will not recreate or update them, whereas Terraform will detect the difference on its next run and bring them back into line with the desired state.
CDK is a capable alternative to Terraform, and CDK can use Terraform as a backend instead of outputting CloudFormation code. When using the Terraform backend it's debatable whether the additional layer of abstraction is a good thing, because Terraform code isn't particularly difficult to read or write. CDK conveniently wraps up common architectural patterns into higher-level constructs in a way that is analogous to Terraform modules. Numerous modules are available in the Terraform Registry. CDK enables developers to write infrastructure code in a language that is most familiar to them, rather than having to learn the HCL domain-specific language.
Pulumi is an interesting alternative infrastructure-as-code tool. Like CDK, Pulumi doesn't have a domain-specific language like Terraform does, and instead allows you to define your infrastructure in TypeScript, C#, Python, Go and other languages. There's a comprehensive discussion on the similarities and differences on Pulumi's website.
Terraform doesn't modify any infrastructure that isn't described in its configuration files. That is helpful because it won't unexpectedly destroy any infrastructure that it didn't deploy. For additional peace of mind, resources that Terraform manages can be explicitly protected from accidental deletion by adding a prevent_destroy lifecycle policy.
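A minimal sketch of that protection, using a hypothetical bucket:

```hcl
resource "aws_s3_bucket" "audit_logs" {
  bucket = "example-audit-logs" # hypothetical bucket name

  lifecycle {
    # Any plan that would destroy this resource fails with an error instead.
    prevent_destroy = true
  }
}
```

The protection only applies while the resource block remains in the configuration; removing the block removes the safeguard with it.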
If you want to use a single "infrastructure as code" tool to deploy to multiple target platforms, Terraform is worth considering, because the same idioms, syntax and overall structure are common to all Terraform deployments. Note that each target platform has its own plugin and set of resources, so you can't just use the same code to deploy equivalent infrastructure into different cloud providers.
Are there any common gotchas?
Terraform records the current state of the infrastructure in a "state file", which is usually stored in a file store such as Amazon S3 or Azure Blob Storage. Of course this introduces a little "chicken and egg" problem, because you can't Terraform the file storage until you have somewhere to store the state file! The usual workaround for this issue is to have a small bootstrapping script that creates just the resources that are needed to enable Terraform to function.
Prior to the 1.0 release, each Terraform release tended to come with breaking changes, which for large estates could require manual intervention to resolve. Also, if the state file was modified by a new version of Terraform, it could no longer be read by earlier Terraform versions. This became particularly problematic if an estate that made use of remote state for accessing data from other layers was migrated to a newer version a layer at a time. These two issues are less relevant since the 1.0 release. From that point, HashiCorp committed to incrementing the major version whenever they need to make breaking changes, in line with standard semantic versioning practices. Migrations involving major version increments may therefore require some manual intervention to fix any incompatible code. It's important to ensure that every process or person that can modify the state uses exactly the same version of the Terraform command line tool.
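One way to enforce a single version is to pin it in the configuration itself, as in this sketch. The version number is only an example.

```hcl
terraform {
  # A bare version string is an exact match, so any other
  # Terraform CLI version refuses to run this configuration.
  required_version = "1.5.7" # example version
}
```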
An issue that isn't unique to Terraform is that the thing that deploys the infrastructure tends to accumulate a lot of permissions. These will need to be managed carefully.
If you are managing multiple environments, you can create specific resources in only a subset of them using the count argument. The count can be set to 0 when deploying environments that don't need the resource, and 1 for those that do. Generally speaking, if you end up using a lot of count arguments it's a sign that your environments have significant differences, which is usually sub-optimal. It may be worth revisiting your account structure so that you can make your deployment environments more similar.
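The count technique might be sketched as follows. The variable name, the choice of "dev" as the only environment with a bastion host, and the AMI ID are assumptions for illustration.

```hcl
variable "environment" {
  type = string # e.g. "dev" or "prod"
}

resource "aws_instance" "bastion" {
  # 1 creates the resource, 0 omits it entirely.
  count = var.environment == "dev" ? 1 : 0

  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.micro"
}

output "bastion_id" {
  # With count set, the resource is a list; one() yields the single
  # element when it exists and null when the list is empty.
  value = one(aws_instance.bastion[*].id)
}
```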
Terraform isn't well-suited to rolling out updates of resources such as virtual machines. If you need to manage a fleet of virtual machines, AWS Systems Manager, Ansible or Puppet may be worth considering to complement a Terraform deployment that manages all of the other infrastructure.
Further reading / watching
Terraform: Up & Running is a great book that explains why and how Terraform helps software engineering teams deliver at greater pace.
A Comprehensive Guide to Terraform by Gruntwork is a series of posts that teaches best practices for writing and managing Terraform code.
5 Lessons Learned From Writing Over 300,000 Lines of Infrastructure Code is an insightful conference talk by Yevgeniy "Jim" Brikman that describes best practices for organising and testing Terraform code.