why it’s recommended to split large, monolithic root modules and state files into multiple smaller units, often called stacks, that can be deployed and managed independently.
“Vanilla” Terraform and OpenTofu have very limited capabilities for creating, managing, and orchestrating multiple stacks. That is why we created Terramate CLI, an open-source CLI that helps teams horizontally scale native Terraform/OpenTofu. Its code generation and orchestration features let teams run commands such as terraform apply across multiple stacks. And with change detection, which scans for changes in the current commit, branch, or pull request, you can drastically reduce CI/CD runtimes by executing only the changed stacks.
Terramate gives you wide latitude to structure stacks in whatever way you like. This article explores key considerations when architecting your stacks, provides pros and cons for different patterns, and concludes with our own opinionated recommendation.
Infrastructure as Code tools such as Terraform and OpenTofu allow us to manage the lifecycle of cloud infrastructure with code using a declarative approach. Whenever you deploy infrastructure resources, the result of the deployment is stored in a JSON file called the state. The state contains information about your infrastructure and configuration and represents the outcome of your last deployment triggered with the terraform apply (or tofu apply) command. Terraform and OpenTofu use the state to:
Track resource state: Accurately account for the current state of your infrastructure
Resolve resource dependencies: Understand the relationships between resources
Store bindings: Map objects in a remote system to resource instances declared in your configuration
A stack in Terraform and OpenTofu is another name for what is commonly known as a root module. In a nutshell, a stack is a combination of:
Infrastructure code which declares a set of infrastructure resources.
State that describes the status of the resources according to the latest deployment (e.g., Terraform state - which is usually stored in a remote location such as an AWS S3 Bucket).
Configuration used to configure the stack and its managed infrastructure resources (e.g., variables, stack configuration, etc.)
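As a minimal sketch of those three parts, a stack could look like the following root module. The bucket name, state key, region, and resource names are hypothetical placeholders, not values from a real setup.

```hcl
# main.tf of a hypothetical stack (root module)

terraform {
  # State: where the result of the last apply is stored.
  # Bucket, key, and region are placeholder values.
  backend "s3" {
    bucket = "example-terraform-state"
    key    = "stacks/network/terraform.tfstate"
    region = "eu-central-1"
  }
}

# Configuration: an input that adjusts the stack.
variable "vpc_cidr" {
  type        = string
  description = "CIDR range of the VPC managed by this stack."
  default     = "10.0.0.0/16"
}

# Infrastructure code: the resources this stack declares.
resource "aws_vpc" "main" {
  cidr_block = var.vpc_cidr
}
```

Running terraform apply (or tofu apply) in this directory would create the VPC and record it in the one state file the backend block points to.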
Stacks should not be mistaken for Terraform modules, often referred to as child modules. A Terraform module is a collection of configuration files in a directory containing multiple resource declarations used together. Modules are the primary way to package and reuse resource configurations with Terraform and OpenTofu and usually come with a set of opinionated configuration variables that consumers can optionally adjust and override.
While a stack can use multiple child modules, it’s important to understand that a single stack will always result in a single state file, except when using Terraform CLI workspaces or approaches such as partial backend configurations. You may use modules to provide and consume abstraction layers around multiple resource declarations used to deploy services such as VPCs, Kubernetes clusters, databases, and others, but modules are merely a way of structuring your code better. They won’t help you overcome the problem of a single, monolithic state file: no matter how many child modules a stack uses, all of their resources end up in that stack’s one state file.
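To illustrate the point, here is a sketch of one stack that calls two child modules. The local module paths, the subnet_ids input, and the private_subnet_ids output are hypothetical and only stand in for whatever your modules expose.

```hcl
# One stack, one backend, one state file.
terraform {
  backend "s3" {
    bucket = "example-terraform-state"   # placeholder
    key    = "monolith/terraform.tfstate" # placeholder
    region = "eu-central-1"               # placeholder
  }
}

# Child module 1: hypothetical local module wrapping VPC resources.
module "network" {
  source = "./modules/network"
}

# Child module 2: hypothetical local module wrapping database resources.
# It consumes an output of the network module.
module "database" {
  source     = "./modules/database"
  subnet_ids = module.network.private_subnet_ids
}
```

Every resource declared inside both modules is recorded in monolith/terraform.tfstate. The modules organize the code, but they do not split the state.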
Choosing the right size for your Terraform stacks can significantly influence your infrastructure's efficiency and manageability. While there are no one-size-fits-all rules, several guiding principles can help you determine the optimal size for your stacks. Here are a few characteristics to identify resources that should be grouped as a stack:
Resource Scope: A stack should encompass a logical grouping of resources that work together to serve a specific purpose. For example, a stack might include all the resources needed for a web application, such as servers, databases, and networking infrastructure.
Frequency of Changes: Consider how often you'll change various parts of your infrastructure. Stacks that change frequently could benefit from being smaller, making it easier and quicker to apply those changes without affecting other resources.
Blast Radius: The total effect of an apply (or destroy) on your infrastructure is the deployment’s “blast radius.” The blast radius includes resource modifications, infrastructure costs, security implications, and any dependencies that may cause problems in the future. Keeping your blast radius to a minimum is a worthy goal, but it’s not pragmatic to break everything into chunks so small that they become challenging to manage. The blast radius may increase massively when undetected drift (unwanted changes) is deployed together with wanted changes, and deployments may be blocked until the drift is resolved. Choosing the right blast radius is a crucial engineering decision you should not take lightly.
Security and Access Control: While security is a significant part of the blast radius, it also matters when choosing a deployment model because it shapes how you divide responsibilities among the team members maintaining the code. It’s crucial that you maintain AAA (authentication, authorization, and accounting) policies throughout your deployment. Creating separate stacks for each team may make sense if different teams or individuals are responsible for different parts of your infrastructure. This allows for better control over permissions and can prevent unauthorized access or resource changes.
Ownership: Often, we can identify owners, such as individuals or teams, for specific resources or groups of resources. For example, while a team working on a single service might own all backing infrastructure used to run this service in multiple environments, the networks used to deploy those resources might be managed by an entirely different team. Defining and documenting clear ownership can be very helpful when deciding what resources to group into a single stack, and it helps us move and reassign ownership to other teams as our organization’s needs change along the way.
Execution Time: Smaller stacks often mean faster execution times, which can be especially important in a CI/CD environment. This is due to the stateful nature of Terraform and OpenTofu, which means that multiple deployments that affect the same stack can only be done sequentially. With hundreds or even thousands of resources, simple operations become very painful. Consider a stack that has a runtime of 10 minutes or more. Let’s assume that 8 minutes into a deployment, you run into an error. You now need to fix this error and trigger another deployment, which will take at least another 8 minutes before you know whether you managed to fix the issue. Even worse, all other pull requests affecting the same stack are blocked in the meantime, because the CI must wait for the current job to finish before the next one can be invoked. By reducing the number of resources managed in a single run, you can reduce the time spent applying changes and reduce the associated costs under CI/CD build-minute consumption models such as the one used by GitHub Actions.
Maintainability: The “squishiest,” or least precise, consideration is the maintainability of your stacks. This aspect highly depends on your team and their comfort level with your infrastructure. While I generally recommend having many smaller, well-defined stacks, this can get quite frustrating for teams unfamiliar with the infrastructure and decrease onboarding efficiency. Many engineers will try to overengineer the stack and end up introducing problems with security, version control, updates, and basic readability. While some highly effective teams can handle complicated stacks with many dependencies, it’s still a cognitive load that must be accounted for when troubleshooting and modifying them. This is more important at a smaller scale when pragmatism is vital. You should deprioritize maintainability once blast radius, security, and deployment efficiency suffer. Of course, this means you may need to invest in upgrading your internal documentation and training once the infrastructure begins to burst at the seams.
Remember, the goal of sizing your stacks is to strike a balance between manageability, performance, and control. Aim for a structure that supports your operational needs, ensures security, and promotes efficiency in your infrastructure management.
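One way to document scope and ownership next to the code is Terramate’s stack block, which sits in a file inside the stack directory. The following is only a sketch: the stack name, description, and tag values are hypothetical, and which tags you use (team, environment, tier) is up to you.

```hcl
# stack.tm.hcl inside the stack directory
stack {
  name        = "payments-network-prod"
  description = "Networking for the payments service in production"

  # Hypothetical tags recording the owner, environment, and scope.
  tags = ["team-platform", "prod", "network"]
}
```

Keeping this metadata in the stack itself means the grouping decisions described above stay visible to whoever opens the directory, and it can be updated in the same pull request when ownership moves to another team.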
Let’s talk about when and how to structure your stacks. We’ll illustrate this using two three-tier applications with compute, networking, and data resources.
A monolithic stack is sometimes called a “terralith” when used in the context of a Terraform deployment. Monoliths are where most IaC journeys begin. A monolith is the entirety of your infrastructure code in one stack. While monoliths are the least scalable, they do have their benefits.
An application stack groups the applications with all of their dependencies. This is excellent for a predictable blast radius and full-stack teams that prefer a smaller feedback loop when making modifications.
A service stack splits the resources needed by an application into stacks of their own, one per service, and the applications run in their own stack as well. This allows for multitenancy within your services: for example, you can run multiple deployments in the same EKS cluster. This saves cost but introduces possible security issues. In highly sensitive domains, such as healthcare or financial services, running multiple services in the same cluster can raise red flags.
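When stacks are split this way, an application stack usually needs values from a service stack, such as the name of the shared cluster. One common approach is the terraform_remote_state data source. The sketch below assumes that a separate cluster stack stores its state under the placeholder bucket and key shown and exports an output named cluster_name. All of these names are hypothetical.

```hcl
# In the application stack: read outputs of the (hypothetical) cluster stack.
data "terraform_remote_state" "cluster" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"              # placeholder
    key    = "stacks/eks-cluster/terraform.tfstate" # placeholder
    region = "eu-central-1"                         # placeholder
  }
}

locals {
  # Assumes the cluster stack declares an output called "cluster_name".
  cluster_name = data.terraform_remote_state.cluster.outputs.cluster_name
}
```

Note that this requires the application stack to have read access to the cluster stack’s state file, which is worth weighing against your access control requirements.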
This is probably an overly extreme example of a microservice stack pattern, but it illustrates the idea. Every piece of the infrastructure puzzle gets its own state and its own configuration. This gives the illusion of control, but you’re stuck with a similar blast radius as the service and application patterns. If a service fails, your application can still fail. You have fewer connected parts that can trigger this failure, but when one does occur, troubleshooting can be a nightmare. Securing a deployment like this is also difficult, because the large number of dependencies between services means that many state files need to be accessible. Like microservice patterns in traditional programming, it can be great if done carefully, but overall, it should be reserved for specific cases.
The best approach for most cases is the service stack pattern. We recommend using one stack per service and environment, where a service can comprise multiple applications and backing infrastructure.
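With one stack per service and environment, each stack also needs its own state location. As a sketch, Terramate’s code generation can derive the state key from the stack’s path so that no two stacks share a state file. This assumes a directory layout such as stacks/prod/payments. The file names, bucket, and region are hypothetical.

```hcl
# backend.tm.hcl at the repository root (hypothetical file name)
generate_hcl "_backend.tf" {
  content {
    terraform {
      backend "s3" {
        bucket = "example-terraform-state" # placeholder
        region = "eu-central-1"            # placeholder

        # For a stack at stacks/prod/payments this would resolve to
        # stacks/prod/payments/terraform.tfstate
        key = "${terramate.stack.path.relative}/terraform.tfstate"
      }
    }
  }
}
```

After running terramate generate, every stack below this file would receive its own _backend.tf, which keeps the service and environment boundaries visible in both the directory tree and the state bucket.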
Based on the specifics of your project and your team setup, there may be alternatives. We are happy to review your IaC structuring considerations, so feel free to share them in our Discord Community. Either way, we look forward to seeing how you decide to structure your stacks.