Why you should break down your Terraform into Stacks

Introduction

This article introduces one of the most common Terraform problems that Terramate was designed to solve: How to decompose code into more manageable components while maintaining system integrity and reliability.

We are not concerned here with Terraform modules, for a while they do allow us to break out code into reusable, versionable units that can be maintained separately, the modules must still be called from somewhere: the Terraform “root module” — i.e., the directory where you run terraform apply , which for this article we will refer to as a "stack.”

About Large-State Terraform Projects

The Terraform state is a crucial component in the Terraform architecture. It is a JSON file that holds the current state of the infrastructure as known by Terraform. The state file contains a mapping between the resources in your Terraform configurations and the real-world resources in your cloud provider. It allows Terraform to keep track of the infrastructure it manages, enabling it to detect any drift between the desired state (as defined in your code) and the actual state of the infrastructure.

Some facts about the Terraform state:

Synchronization: The state helps Terraform synchronize the infrastructure with the configuration, ensuring that the real-world resources match the desired state defined in your code.
Dependency Resolution: The state file contains information about the dependencies between resources, which Terraform uses to determine the order in which resources should be created, updated, or deleted.
Output Values: Terraform state also holds output values, which can be queried to extract information about the infrastructure, such as IP addresses or DNS names.
Resource Mapping: The state maps the resources in your code to the corresponding real-world resources in your cloud provider.

We consider a Terraform project large-state whenever we manage all our infrastructure in a single, or multiple large Terraform state files.

Disadvantages Of Large-State Projects

Slow Execution Times

Large state files can take a long time to load and process, which can slow down Terraform operations such as plan , apply and refresh . The more resources you have in a state file, the more time it takes to refresh the state and perform computations. API rate limits compound the problem and when not handled gracefully, can sometimes break things badly.

Monolithic Terraform Project Development velocity relies upon a short feedback loop that allows the developer to immediately see the effects of changes, which is impossible with large-state projects. Moreover, most teams also have strict SLAs and the slowness of fixing things correctly through code means they will resort to “ClickOps” to resolve production issues, and the changes may never be reflected into the codebase. A good rule of thumb is that executing terraform plan should never take significantly longer than making the change in code.

Limited Collaboration

In a large-state project, it can be challenging for multiple team members to work simultaneously, as they might face conflicts and state-locking issues. It’s harder to isolate changes, leading to unintentional impacts on other parts of the infrastructure. Slow execution times are often compounded because the developer must wait for other applies to finish, leading to wasted time. Another common issue with large-state projects is that the developer runs a plan and notices that some resources they didn’t modify have planned changes, which begins a frustrating and time-consuming series of chat threads to try to identify who changed what, and whether it’s safe to continue with the apply.

Dangerous

Large-state projects are also dangerous for a variety of reasons. For one thing, the plan output can often become unreadably large, even for trivial changes. "Flapping" resources (those planned changes that can safely be ignored, often due to provider bugs) can add to the noise. If developers aren't carefully reading the plan and are just trusting the changes, they made to the code are correct, then dangerous and unwanted changes can occur in production. Worse still, large-state projects allow no separation of responsibilities or principle of least privilege. As soon as someone is permitted to approve PRs and run an apply, they can potentially change things they have little understanding of, and that may be maintained by developers in other teams. Similarly, the permissions required for the CI/CD to run must, by necessity, be very broad with a large-state project because they are required to modify potentially anything. Finally, the larger the state, the larger the blast radius. Incorrect changes can be devastating and state corruption or conflicts can be more challenging to resolve.

Complexity

Large-state projects can become complex and hard to manage as the number of resources and dependencies grows. The complexity can make it difficult to understand the entire infrastructure and can lead to errors.

Difficulties in Testing and Debugging

Testing and debugging can be more challenging in a large-state project due to the sheer size and interconnectedness of the resources.

Conclusion

So, as you can see, there are a lot of disadvantages and challenges when grouping a large deployment into a single state file. Stay tuned for the next installment where we help you mitigate these issues. Also, if you liked this, and are interested in more like it, be sure to follow us on X (Formerly Twitter), and join our Discord channel!