How to structure and size Terraform Stacks

When starting with Terraform or OpenTofu, you typically manage all your infrastructure resources in a single Terraform root module. The root module is the top-level module in a Terraform configuration and is the main directory where Terraform configuration files, or .tf files, are located. It contains the state of the resources defined in those files and is the configuration applied when the terraform apply (or tofu apply) command is run.

However, as you scale up, managing all your infrastructure resources and environments in a single root module often becomes problematic. Long-running and blocking CI/CD pipelines, large blast radii, and the lack of ownership and governance models are just some of the symptoms that explain why it’s recommended to split large, monolithic root modules and state files into multiple smaller units, often called stacks, that can be deployed and managed independently.

“Vanilla” Terraform and OpenTofu have very limited capabilities for creating, managing, and orchestrating multiple stacks. That is why we have created Terramate CLI, an open-source CLI that helps teams horizontally scale native Terraform/OpenTofu. With code generation and orchestration, teams can easily run commands such as terraform apply across multiple stacks. And with change detection scanning for changes in the current commit, branch, or pull request, you can drastically reduce CI/CD runtimes by only executing changed stacks.
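To make the idea of change detection concrete, here is a minimal sketch in Python. It is not how Terramate is implemented. It only shows the core idea of mapping changed files to the stack directories that contain them. The function name, the directory layout, and the file list are assumptions made up for this illustration. Real tools compare Git revisions and may take more than file paths into account.

```python
from pathlib import PurePosixPath


def changed_stacks(stack_dirs, changed_files):
    """Return the stack directories that contain at least one changed file."""
    stacks = [PurePosixPath(s) for s in stack_dirs]
    hit = set()
    for changed in changed_files:
        path = PurePosixPath(changed)
        # A file belongs to every stack directory found among its parents.
        owners = [s for s in stacks if s in path.parents]
        if owners:
            # With nested stacks, attribute the change to the deepest one.
            hit.add(max(owners, key=lambda s: len(s.parts)))
    return sorted(str(s) for s in hit)


# Hypothetical repository layout and change set.
stacks = ["stacks/prod/network", "stacks/prod/app", "stacks/staging/app"]
files = ["stacks/prod/app/main.tf", "README.md"]
print(changed_stacks(stacks, files))  # ['stacks/prod/app']
```

Only one of the three stacks would need a plan or apply in this run. The change to README.md sits outside every stack and triggers nothing.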

Terramate gives you wide latitude to structure stacks in whatever way you like. This article explores key considerations when architecting your stacks, provides pros and cons for different patterns, and concludes with our own opinionated recommendation.

Why size matters

Infrastructure as Code tools such as Terraform and OpenTofu allow us to manage the lifecycle of cloud infrastructure with code using a declarative approach. Whenever you deploy a set of infrastructure resources, the result of this deployment is stored in a JSON file that we call state. The state contains information about your infrastructure and configuration and represents the result of your last deployment triggered with the terraform apply (or tofu apply) command. The state is used to:

Track resource state: Accurately account for the current state of your infrastructure
Track resource dependencies: Understand the relationships between resources
Store bindings: Map objects in a remote system to resource instances declared in your configuration

A stack in Terraform and OpenTofu is another name for what is commonly known as a root module. In a nutshell, a stack is a combination of:

Infrastructure code which declares a set of infrastructure resources.
State that describes the status of the resources according to the latest deployment (e.g., Terraform state, which is usually stored in a remote location such as an AWS S3 bucket).
Configuration used to configure the stack and its managed infrastructure resources (e.g., variables, stack configuration, etc.)

Stacks should not be mistaken for Terraform modules, often referred to as child modules. A Terraform module is a collection of configuration files in a directory containing multiple resource declarations used together. Modules are the primary way to package and reuse resource configurations with Terraform and OpenTofu and usually come with a set of opinionated configuration variables that consumers can optionally adjust and override.

While a stack can use multiple child modules, a single stack always results in a single state file, except when using Terraform CLI workspaces or approaches such as partial backend configurations. So, while you may use modules to provide and consume abstraction layers around multiple resource declarations used to deploy services such as VPCs, Kubernetes clusters, databases, and others, modules are merely a way of structuring your code better. They won’t help you overcome the issue of a single, monolithic state file: even if you use multiple child modules in a single stack, their resources all end up in the same state.
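The following Python sketch illustrates that point. It reads a small, hand-written JSON document shaped like a simplified Terraform state file and lists the resource addresses it tracks. The sample state, its resource names, and the helper function are assumptions made up for this illustration. A real state file carries many more fields.

```python
import json

# Hand-written, heavily simplified sample in the shape of a state file.
# All resource and module names below are hypothetical.
SAMPLE_STATE = """
{
  "version": 4,
  "resources": [
    {"mode": "managed", "type": "aws_s3_bucket", "name": "logs", "instances": [{}]},
    {"module": "module.vpc", "mode": "managed", "type": "aws_vpc", "name": "main", "instances": [{}]},
    {"module": "module.db", "mode": "managed", "type": "aws_db_instance", "name": "primary", "instances": [{}]}
  ]
}
"""


def resource_addresses(state_json):
    """List the addresses of all managed resources found in one state document."""
    state = json.loads(state_json)
    addresses = []
    for res in state.get("resources", []):
        if res.get("mode") != "managed":
            continue  # skip data sources
        # Resources declared in a child module carry a module prefix.
        prefix = res["module"] + "." if "module" in res else ""
        addresses.append(f"{prefix}{res['type']}.{res['name']}")
    return addresses


for address in resource_addresses(SAMPLE_STATE):
    print(address)
```

The resources declared in the root module and in both child modules appear in the same document. This is why splitting code into modules alone does not reduce the size of the state.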

How to identify what resources belong to a Stack

Choosing the right size for your Terraform stacks can significantly influence your infrastructure's efficiency and manageability. While there are no one-size-fits-all rules, several guiding principles can help you determine the optimal size for your stacks. Here are a few characteristics to identify resources that should be grouped as a stack:

Resource Scope: A stack should encompass a logical grouping of resources that work together to serve a specific purpose. For example, a stack might include all the resources needed for a web application, such as servers, databases, and networking infrastructure.
Frequency of Changes: Consider how often you'll change various parts of your infrastructure. Stacks that change frequently could benefit from being smaller, making it easier and quicker to apply those changes without affecting other resources.
Blast Radius: The total effect of an apply (or destroy) on your infrastructure is the deployment’s “blast radius.” The blast radius includes resource modifications, infrastructure costs, security implications, and any dependencies that may cause problems in the future. Keeping your blast radius to a minimum is a worthy goal, but it’s not pragmatic to break everything into such small chunks that it’s challenging to manage. The blast radius may increase massively when undetected drift (unwanted changes) is deployed together with wanted changes, and deployments may be blocked until the drift is resolved. Choosing the right blast radius is a crucial engineering decision you should not take lightly.
Security and Access Control: While security is a significant part of the blast radius, it’s also important to consider when choosing a deployment model due to how you manage the responsibilities of the team members maintaining the code. It’s crucial that you maintain AAA (authentication, authorization, and accounting) policies throughout your deployment. Creating separate stacks for each team may make sense if different teams or individuals are responsible for different parts of your infrastructure. This allows for better control over permissions and can prevent unauthorized access or resource changes.
Ownership: Often, we can identify owners, such as individuals or teams, for specific resources or groups of resources. For example, while a team working on a single service might own all backing infrastructure used to run this service in multiple environments, the networks used to deploy those resources might be managed by an entirely different team. Defining and documenting clear ownership levels can be very helpful when deciding what resources to group into a single stack and helps us move and reassign ownership to other teams as our organization’s needs change along the way.
Execution Time: Smaller stacks often mean faster execution times, which can be especially important in a CI/CD environment. This is due to the stateful nature of Terraform and OpenTofu, which means that multiple deployments that affect the same stack can only be done sequentially. With hundreds or even thousands of resources, simple operations become very painful. Consider a stack that has a runtime of 10 minutes or more. Let’s assume that 8 minutes into a deployment, you run into an error. You now need to fix this error and trigger another deployment, which will take at least another 8 minutes to get clarity on whether or not you managed to fix the issue. Even worse, all other pull requests affecting the same stack will be blocked, because the CI has to wait for the current run to finish before the next one can be invoked. By reducing the number of resources managed in a single run, you can reduce the time spent applying changes and reduce associated costs for CI/CD build minute consumption models used in, e.g., GitHub Actions.
Maintainability: The “squishiest,” or least precise, consideration is the maintainability of your stacks. This aspect highly depends on your team and their comfort level with your infrastructure. While I generally recommend having many smaller, well-defined stacks, this can get quite frustrating for teams unfamiliar with the infrastructure and decrease onboarding efficiency. Many engineers will try to overengineer the stack and end up introducing problems with security, version control, updates, and basic readability. While some highly effective teams can handle complicated stacks with many dependencies, it’s still a cognitive load that must be accounted for when troubleshooting and modifying them. This is more important at a smaller scale when pragmatism is vital. You should deprioritize maintainability once blast radius, security, and deployment efficiency suffer. Of course, this means you may need to invest in upgrading your internal documentation and training once the infrastructure begins to burst at the seams.
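The queueing effect described under Execution Time can be put into numbers with a small Python sketch. It assumes that every run against a stack takes the same time and that runs against the same stack execute strictly one after another. The function name and the figures are made up for this illustration.

```python
def queue_finish_minutes(runtime_minutes, queued_runs):
    """Finish time of each queued run when one stack processes runs sequentially."""
    return [runtime_minutes * (position + 1) for position in range(queued_runs)]


# Three pull requests waiting on one stack with a 10-minute runtime.
print(queue_finish_minutes(10, 3))  # [10, 20, 30]

# The same three changes spread over three independent stacks that each
# take 4 minutes can run in parallel, so each finishes after 4 minutes.
print([queue_finish_minutes(4, 1)[0] for _ in range(3)])  # [4, 4, 4]
```

The last pull request in the single-stack queue waits 30 minutes for feedback, even if its own change is trivial.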

Remember, the goal of sizing your stacks is to strike a balance between manageability, performance, and control. Aim for a structure that supports your operational needs, ensures security, and promotes efficiency in your infrastructure management.

Strategies for designing and structuring Stacks

Let’s talk about when and how to structure your stacks. We’ll illustrate this using two three-tier applications with compute, networking, and data resources.

Monolith

[Diagram: Monolithic Stack Design Pattern]

A monolithic stack is sometimes called a “terralith” when used in the context of a Terraform deployment. Monoliths are where most IaC journeys begin. A monolith is the entirety of your infrastructure code in one stack. While monoliths are the least scalable, they do have their benefits.

Monolith Pros:

Lower cognitive load for developers and faster troubleshooting while building
Easier to manage dependencies
More transparent security footprint

Monolith Cons:

Larger blast radius and blocking CI/CD pipeline runs
Tight coupling between resources
Teams encounter dependency issues and can cause problems with other teams’ deployments
Difficulty applying least privilege security principles
Much slower to deploy as it scales

Use When:

Your infrastructure is small and still under heavy development.
Your team is small or still ramping up.
Your team has few security and skill silos.

Ditch When:

Your infrastructure is changing less than it’s growing.
Your modules are mostly static, and changes are less frequent.
You have strict security controls between your teams.
You have highly specialized teams.

As you can see in the diagram, there is only one state for the entire deployment.

Application Stack

[Diagram: Application Group Stack Design Pattern]

An application stack, also called an application group stack (AGS), groups each application with all of its dependencies. This is excellent for a predictable blast radius and for full-stack teams that prefer a smaller feedback loop when making modifications.

AGS Pros:

App teams get full control over their application
Blast radius scoped to a single application
Security concerns scoped to application teams
Easier to troubleshoot

AGS Cons:

Teams responsible need to understand full-stack requirements
To run effectively, teams must have access to the entire stack, including security
Larger mix of resources means more entropy
Tighter coupling than the service stack pattern
Duplicated resources
Duplicated teams or teams that have to context switch between applications

Use When:

Your teams need full control over the lifecycle of their application.
Your applications need tight internal coupling.
Your infrastructure does not require multitenancy, such as multiple apps in one Kubernetes cluster.
You have enough teams to manage the applications independently or your teams are capable of context switching between them.

Ditch When:

Multitenancy within services is important due to cost, management, or other reasons.
Security is siloed to a single team.
Your teams are highly specialized and are unable to manage the full stack.
Your teams are unable to context switch to multiple applications.

Service Stack

[Diagram: Service Stack Design Pattern]

A service stack splits the resources needed by an application into separate stacks, one per service. The applications all run in their own stack. This is great in that it allows for multitenancy within your services. For example, you can run multiple deployments in the same EKS cluster. This saves cost but introduces possible security issues. For highly sensitive applications, such as those in healthcare or financial services, running multiple services in the same cluster can raise flags.

Service Stack Pros:

Allows multitenancy and deduplication of resources.
The Infra team is responsible for security. Smaller feedback loop for security modifications.

Service Stack Cons:

The blast radius is large in the service stack.
Security resources are not decoupled.

Use When:

Cost controls and multitenancy are important to your org.
Your infrastructure team is familiar with Terraform and can handle the blast radius.
Your application team does not specialize in infrastructure.

Ditch When:

Your environment is highly regulated.
You need to separate security from infrastructure.
Blast radius management is critical.

Microservice Stack

[Diagram: Microservice Stack Design Pattern]

This is probably an overly extreme example of a microservice stack pattern, but it illustrates the idea. Every piece of the infrastructure puzzle gets its own state and its own configuration. This gives the illusion of control, but you’re stuck with a similar blast radius as the service and application patterns. If a service fails, your application can still fail. You have fewer connected parts that can trigger this failure, but troubleshooting can be a nightmare when a failure does occur. Securing a deployment like this is also difficult because the large number of dependencies between services means many state files may need to be accessed. Like microservice patterns in traditional programming, it can be great if done carefully, but overall, it should be reserved for specific cases.
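To give a feeling for the dependency bookkeeping this pattern requires, here is a minimal Python sketch that derives a deployment order from declared dependencies between stacks. The stack names and the dependency map are hypothetical, and the sketch uses Python's standard graphlib module rather than any Terraform or Terramate feature.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical stacks, each mapped to the stacks it depends on.
DEPENDS_ON = {
    "network": set(),
    "database": {"network"},
    "cluster": {"network"},
    "app": {"database", "cluster"},
}


def deployment_order(depends_on):
    """Return an order in which every stack comes after its dependencies."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # Stacks that depend on each other cannot be ordered at all.
        raise ValueError(f"circular stack dependency: {err.args[1]}") from err


order = deployment_order(DEPENDS_ON)
print(order)
```

With only four stacks the order is easy to see by eye. With one stack per resource group, the map grows quickly, and a circular dependency between two teams' stacks becomes a real risk.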

Micro Stack Pros:

Teams can be scoped to their expertise
Extremely loose coupling of resources
Blast radius can be small
Security resources are isolated.

Micro Stack Cons:

The blast radius is more opaque
Security footprint is high due to sharing attributes.
Longer feedback loop if separate teams are managing
Dependencies are more challenging to manage
No multitenancy

Use When:

You have highly specialized teams.
You have highly regulated workloads.
The blast radius and security benefits outweigh the cost benefits of multitenancy.

Ditch When:

The complexity adds friction.
You do not have an expert infrastructure team.
Multitenancy and cost controls are a concern.

Rule of Thumb: One Stack per Service and Environment

The best approach for most cases is the service stack pattern. We recommend using one stack per service and environment, where a service can comprise multiple applications and backing infrastructure.
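As a rough illustration of this rule of thumb, the Python sketch below enumerates one stack directory per service and environment. The root directory, the service names, the environment names, and the path scheme are all assumptions. Any layout that keeps one stack per combination works equally well.

```python
from itertools import product


def stack_paths(services, environments, root="stacks"):
    """Return one stack directory per (environment, service) combination."""
    return [f"{root}/{env}/{svc}" for env, svc in product(environments, services)]


# Hypothetical services and environments.
paths = stack_paths(["network", "payments"], ["staging", "prod"])
for path in paths:
    print(path)
# stacks/staging/network
# stacks/staging/payments
# stacks/prod/network
# stacks/prod/payments
```

Each of these directories would hold its own root module and its own state, so a change to payments in staging cannot block or break network in prod.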

Based on the specifics of your project and your team setup, there may be alternatives. We are happy to review your IaC structuring considerations. Feel free to share them in our Discord Community. Otherwise, we look forward to seeing how you decide to structure your stacks.