At Terramate, we have always been proponents of splitting your state files into independent units of deployment with small state: the Terramate Stack. If stacks are sized in the “Goldilocks” zone, you can address a host of problems caused by large state files, commonly referred to as Terraliths.
But in the context of DORA metrics, other key qualities of smaller stacks emerge:
Better Reviews - when stacks are smaller, there is far less context, and thus less ground to cover, in a review. Especially for engineers who don’t know the whole codebase by heart, being able to make sense of a Terraform plan quickly is a huge benefit. And the faster the review, the shorter the Change Lead Time.
Clear Ownership Model - the old adage “divide and conquer” (attributed to Julius Caesar) also applies to complex infrastructure. Small stacks make it much simpler to establish an ownership model, so that the team can specialize and reach a high level of competency in “their” part of the codebase faster. This in turn leads to a better Deployment Rate as well as a shorter Change Lead Time.
And this is in addition to faster deployments (which help with Change Lead Time, Deployment Rate, and MTTR) as well as smaller blast radii (which help with Change Failure Rate).
So small is beautiful… at least when it comes to improving DORA metrics.
Granular Orchestration - building on the state splitting above: once you have a plethora of stacks and an (implicit or explicit) relationship graph between those stacks, you can use all kinds of filters to make more precise deployments. For example, you can filter by directory, by tag, by environment, or by changes. With Terramate Cloud you can even do stateful orchestration and filter by drifted and failed stacks. Ideally, you can also combine multiple dimensions: think “all changed stacks in staging with the tag networking”. The more granular your orchestration, the simpler your pull request, the faster your preview, the faster your review, and the faster and less risky your deployments.
GitOps - following a GitOps flow may initially appear to add to the Change Lead Time. But if you have fast pipelines, that added cost may be trivial (seconds to minutes) compared to the benefits. With GitOps, you obtain auditability, versioning, reproducibility, and a single source of truth. All of these help the Change Lead Time, and even more so the Change Failure Rate and MTTR.
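To make the idea of combining filter dimensions concrete, here is a minimal sketch in Python. It is not Terramate's implementation or API: the `Stack` record, the `select_stacks` helper, and the example paths and tags are all hypothetical and only illustrate how several filters narrow down a set of stacks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    # Hypothetical stack record; the field names are assumptions for illustration.
    path: str
    environment: str
    tags: frozenset
    changed: bool = False


def select_stacks(stacks, *, environment=None, tag=None, changed_only=False):
    """Return the stacks that match every filter that was supplied."""
    selected = []
    for stack in stacks:
        if environment is not None and stack.environment != environment:
            continue
        if tag is not None and tag not in stack.tags:
            continue
        if changed_only and not stack.changed:
            continue
        selected.append(stack)
    return selected


# Made-up example data.
STACKS = [
    Stack("stacks/staging/vpc", "staging", frozenset({"networking"}), changed=True),
    Stack("stacks/staging/db", "staging", frozenset({"database"}), changed=True),
    Stack("stacks/staging/dns", "staging", frozenset({"networking"}), changed=False),
    Stack("stacks/prod/vpc", "production", frozenset({"networking"}), changed=True),
]

# "All changed stacks in staging with the tag networking".
targets = select_stacks(
    STACKS, environment="staging", tag="networking", changed_only=True
)
```

Each additional filter can only shrink the selection, which is why combining dimensions leads to smaller plans and smaller pull requests.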
Change Detection - at Terramate, we are big on only running plans and applies on changes. That way, the amount of infra that has to go through the pipeline is greatly reduced, massively speeding up the deployments and also enabling the team to collaborate in parallel, especially if you have already established an ownership model (see above).
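As a rough illustration of change detection, the sketch below maps a list of changed files (for example, the output of a Git diff) to the stacks that own them. It is a simplification under stated assumptions: the `changed_stacks` helper and the directory layout are hypothetical, and indirect changes such as a shared module referenced by a stack are deliberately ignored here.

```python
from pathlib import PurePosixPath


def changed_stacks(stack_dirs, changed_files):
    """Return the stack directories that contain at least one changed file.

    A file belongs to the deepest stack directory that is one of its ancestors,
    so nested stacks are not attributed to their parent stack.
    """
    stacks = [PurePosixPath(d) for d in stack_dirs]
    hit = set()
    for file in changed_files:
        file_path = PurePosixPath(file)
        owners = [s for s in stacks if s in file_path.parents]
        if owners:
            hit.add(max(owners, key=lambda s: len(s.parts)))
    return sorted(str(s) for s in hit)


# Made-up repository layout and change set.
stack_dirs = ["stacks/vpc", "stacks/db", "stacks/vpc/peering"]
changed_files = [
    "stacks/vpc/main.tf",
    "stacks/vpc/peering/main.tf",
    "README.md",  # not inside any stack, so it triggers nothing
]
to_plan = changed_stacks(stack_dirs, changed_files)
```

Only the stacks in `to_plan` would need to go through plan and apply; `stacks/db` is skipped entirely.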
Parallelism - the icing on the cake is to add massive parallelism to your pipelines, which again requires a sound dependency graph as well as independently deployable stacks.
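The link between a dependency graph and parallelism can be sketched with Python's standard `graphlib` module. The stack names and the `deployment_waves` helper are hypothetical; the point is only that stacks whose dependencies are already deployed can run side by side.

```python
from graphlib import TopologicalSorter


def deployment_waves(dependencies):
    """Group stacks into waves; all stacks within one wave can run in parallel.

    `dependencies` maps each stack to the set of stacks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Made-up dependency graph: app needs db and vpc, db and dns need vpc.
graph = {
    "app": {"db", "vpc"},
    "db": {"vpc"},
    "dns": {"vpc"},
    "vpc": set(),
}
waves = deployment_waves(graph)
```

In this example `db` and `dns` land in the same wave and could be deployed concurrently, while `app` has to wait for `db`.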
The above laundry list helps you improve all four DORA metrics.
In order to improve the reliability metrics MTTR and Change Failure Rate, by definition, you need to be able to quickly detect that something went wrong in the first place. Only then can you take action.
Failure Detection - the crux with the plan and apply phases of Terraform and OpenTofu is that sometimes plans are smooth sailing, and then errors emerge at the apply phase. Even worse, sometimes you end up with partially applied plans, leaving you in a state of limbo and unsure of what is actually going on. For that reason, at Terramate we always urge you to run a health check, which means running another plan after an apply. This detects any partially applied plans or drift and validates the integrity of your previously applied changes.
Pinpoint Visibility - once you know that a deployment has (partially) failed, it helps to have tooling in place that shows you exactly which resources of the plan have been applied and what the remaining plan looks like. That way, the detective work of finding the remaining needle in the plan file “haystack” is reduced to a trivial lookup.
Alerting - the best monitoring is worth little if the people causing the trouble or the people best placed to apply the fix don’t know about the issues in the first place. Alerts are often routed to shared Slack channels, which many teams we talk to consider too noisy. Alternatively, if you have an implicit ownership model from your Pull Requests, you can route individual Slack alerts directly to the collaborators on a PR.
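One way to automate such a health check is to run a plan with the `-detailed-exitcode` flag, which Terraform and OpenTofu both support: exit code 0 means the plan succeeded with no changes, 2 means it succeeded with changes still pending, and 1 means it failed. The sketch below is a minimal wrapper under those assumptions; the function names and the status labels are made up for illustration.

```python
import subprocess


def classify_plan_exit_code(code):
    """Translate the exit code of `plan -detailed-exitcode` into a status label."""
    if code == 0:
        return "healthy"  # plan succeeded and found nothing left to change
    if code == 2:
        return "pending-changes"  # partial apply or drift: changes remain
    return "error"  # the plan itself failed


def post_apply_health_check(stack_dir, binary="terraform"):
    """Run a follow-up plan in `stack_dir` and classify the result.

    `binary` may be set to "tofu" for OpenTofu. The status labels are
    hypothetical; adapt them to whatever your pipeline reports.
    """
    result = subprocess.run(
        [binary, "plan", "-detailed-exitcode", "-input=false"],
        cwd=stack_dir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A pipeline step could call `post_apply_health_check` right after each apply and flag the stack whenever the result is anything other than `"healthy"`.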
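The routing rule described above can be sketched in a few lines. Everything here is hypothetical, including the `alert_recipients` helper, the mapping from stack to PR collaborators, and the fallback channel name; a real setup would look these up from your Git hosting and chat tooling.

```python
def alert_recipients(stack_path, pr_collaborators, fallback_channel="#infra-alerts"):
    """Return who should be notified about a failed or drifted stack.

    `pr_collaborators` maps a stack path to the people who collaborated on the
    pull request that last touched it. If nobody is known, fall back to the
    shared channel so the alert is not lost.
    """
    people = pr_collaborators.get(stack_path, [])
    if people:
        return [f"@{name}" for name in people]
    return [fallback_channel]


# Made-up ownership data derived from pull requests.
collaborators = {"stacks/staging/vpc": ["alice", "bob"]}

direct = alert_recipients("stacks/staging/vpc", collaborators)
shared = alert_recipients("stacks/staging/db", collaborators)
```

The shared channel is only used as a fallback, which keeps it quiet enough to remain useful.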
The faster your team learns about an issue, and where the issue is, the easier it is to improve the MTTR.
Interestingly, one phenomenon we see is that once your whole pipeline is really rapid and the MTTR is blazingly fast, teams are okay with tolerating a little more risk and accepting a higher Change Failure Rate, knowing that eventual failures are essentially transient and very short-lived.
While in our everyday life we tend to encounter a lot of sprawling codebases with artisanal “units of one”, the true superpower of Infrastructure-as-Code is variant generation. The idea is that the more standardized a codebase is, the easier it is to comprehend and act on, improving your Change Failure Rate, Change Lead Time, and overall Deployment Rate.
Use Modules - modules are reusable configurations that can be used to manage a collection of related infrastructure resources. Think of them like functions in a programming language, but for infrastructure. Instead of writing the same block of Terraform code repeatedly for similar resources, you can define it once as a module and then call that module whenever you need to deploy that specific set of resources. Be careful, though, not to create a cascade of dependencies between modules. Keep things simple, as module overuse can come at the expense of readability. Also, module consumption can be quite challenging for non-infra experts.
Code Generation - with code generation, you can strike a balance between readability of your codebase and DRYness. At Terramate, we are naturally big believers in code generation, as it allows you to generate your IaC code from templates and customize those with shared variables we call “Globals”. The upshot is that some problems - e.g. backend generation - can be completely abstracted away with code generation.
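To illustrate the idea behind backend generation, the sketch below renders a Terraform S3 backend block from a template and a set of shared values. This is plain Python with `string.Template`, not Terramate's own code generation syntax; the `render_backend` helper, the bucket name, and the state key layout are assumptions made up for the example.

```python
from string import Template

# Template for a backend block; bucket, key, and region are standard
# arguments of Terraform's S3 backend.
BACKEND_TEMPLATE = Template('''terraform {
  backend "s3" {
    bucket = "$bucket"
    key    = "$stack_path/terraform.tfstate"
    region = "$region"
  }
}
''')


def render_backend(shared_values, stack_path):
    """Render the backend configuration for one stack from shared values."""
    return BACKEND_TEMPLATE.substitute(shared_values, stack_path=stack_path)


# Hypothetical shared values, defined once and reused by every stack.
shared_values = {"bucket": "acme-tfstate", "region": "eu-central-1"}

backend_hcl = render_backend(shared_values, "stacks/staging/vpc")
```

Because the state key is derived from the stack path, every stack gets its own state location without anyone writing a backend block by hand.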
Scaffolding - with scaffolding, you can separate the concerns of infrastructure experts in your platform teams from those that are non-expert infrastructure consumers, i.e., your developers who are building apps. Experts pre-configure outcome-oriented “Terramate Bundles” that are tested, governed, compliant, and structured in a way that the platform team always retains full control. Infrastructure consumers, however, don’t need to know anything about Terraform or OpenTofu to just scaffold their database or message queue; they just get the job done.
With standardization, managing large-scale IaC codebases becomes a lot easier, making failures a much less frequent concern while also accelerating Change Lead Time and Deployment Rate.
But even the best-tuned automation, the most elegant standardization, and the most sophisticated architecture are all for naught if your team is poorly trained.
Train them so they are familiar with the codebase. Show them how the pipelines work. Introduce them to your tooling. And especially provide them with context around all the things that can and will go wrong, whatever errors and failures may occur. Where helpful, enrich error output with AI to make it more accessible.
As the old adage attributed to Evan Kirshenbaum says: “An employer once said, ‘What if I train my people and they leave?’ I say, what if you don’t train them... and they stay...”
At Terramate, our mission is to help organizations massively improve their infrastructure practices. It all starts with measuring things and creating transparency, because what you measure will get done.
Starting from DORA metrics, we strive to help you improve along all five vectors above, providing you with tooling to assist you on your journey to elite DORA metrics and the highest infrastructure maturity.
Ready to supercharge your IaC?
Explore how Terramate can uplift your IaC projects with a free trial or personalized demo.