After managing Terraform across more than 50 cloud environments on AWS, GCP, and Azure, I’ve learned a lot, often the hard way. Along the way I’ve developed a set of best practices that have become my personal playbook for writing maintainable, scalable, and collaborative infrastructure as code. These are the rules I live by, and most of them I learned by breaking production at least once.
The Lesson That Cost Me a Weekend
Let me start with the mistake that taught me the most important lesson about Terraform. It was a Friday afternoon (because of course it was), and I was deploying a simple networking change to our production environment. I ran terraform apply locally, watched the green text scroll by, and closed my laptop feeling accomplished. What I didn’t know was that my colleague in our Berlin office was also making changes to the same state file at the exact same moment.
By Monday morning, our production state file was corrupted. We spent the entire weekend manually reconciling the actual infrastructure with what Terraform thought existed. Some resources had been created twice. Others had been deleted when they shouldn’t have been. It was a nightmare, and it was completely avoidable.
That’s when I learned my most fundamental rule: remote state with locking isn’t just a best practice. It’s the first thing you set up before writing a single resource. There’s no excuse for skipping it, and I haven’t touched a Terraform project without it since that weekend.
Building the Foundation
These days, the first thing I do on any new project is set up a solid directory structure. I create an environments directory with subdirectories for dev, staging, and prod, and a modules directory for all my reusable components. This seems obvious now, but I learned this structure after managing a sprawling Terraform codebase where everything lived in one giant directory. Finding the right file took longer than making the actual change.
terraform/
├── environments/        # One directory per environment
│   ├── dev/
│   ├── staging/
│   └── prod/
└── modules/             # Reusable modules
    ├── networking/
    └── compute/
For the remote backend, I always use S3 or GCS with state locking and versioning enabled. The state locking part is what would have saved me that weekend incident. With a DynamoDB table holding the lock for our S3 state files, Terraform automatically prevents two people from making changes simultaneously. It’s such a simple safeguard, and yet I see teams skip it all the time because they think they’re too small to need it. Trust me, you’re not.
# I put this in my backend.tf file
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/networking/terraform.tfstate" # Separate state per env/component
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks" # Use DynamoDB for state locking
  }
}
The versioning part saved me once when someone accidentally ran a destructive terraform apply that wiped out half our infrastructure. We were able to restore the previous state file version from S3 and import the resources back into Terraform’s management. Without versioning, we would have been rebuilding everything from scratch using old documentation that was probably out of date anyway.
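If you’re setting this up from scratch, the bucket and lock table are only a handful of resources themselves. Here’s a minimal sketch of that bootstrap, assuming the names from the backend block above; the billing mode and encryption settings are illustrative rather than a copy of our exact configuration.
# Bootstrap for the state bucket and lock table (a sketch; names match the backend above)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "company-terraform-state"
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled" # This is what lets you roll back to a previous state file
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST" # Illustrative; on-demand capacity is plenty for lock traffic
  hash_key     = "LockID"          # The S3 backend expects this exact attribute name
  attribute {
    name = "LockID"
    type = "S"
  }
}
Resources like these live in a tiny bootstrap configuration of their own, because the backend has to exist before anything else can store state in it.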
Module Everything (I Learned This One Slowly)
I used to think modules were overkill for small projects. I’d copy and paste the same resource blocks across different environments, making tiny tweaks here and there. Then one day, we needed to roll out a security patch that affected a configuration setting in our EC2 instances. I had to make the same change in 17 different places. I missed three of them. Those three instances became the entry point for a security incident two months later.
Now I have a simple rule: if I’m going to create more than one of something, it becomes a module. Actually, that’s not quite true. Even if I’m only creating one of something complex, I still make it a module. Why? Because six months later when I need to deploy a second one, I’ll have forgotten all the intricate dependencies and configurations. The module documents all of that for me.
I design my modules to be reusable and composable, with clear inputs (variables) and outputs. Each module should do one thing well. My networking module handles VPCs, subnets, and routing. My compute module handles EC2 instances and their security groups. They compose together cleanly, and I can test them independently. This approach has prevented so many late-night debugging sessions where I’m trying to untangle spaghetti code to figure out why staging works but production doesn’t.
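To make that concrete, here’s roughly what the interface of a small networking module looks like and how an environment consumes it. Treat it as a sketch: the variable names, the CIDR value, and the file layout are placeholders for the idea, not an excerpt from our real modules.
# modules/networking/variables.tf -- the module's inputs (illustrative)
variable "environment" {
  description = "Environment name, e.g. dev, staging, prod"
  type        = string
}

variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
}

# modules/networking/main.tf -- one thing done well: the VPC and its plumbing
resource "aws_vpc" "main" {
  cidr_block = var.vpc_cidr
  tags = {
    Name = "${var.environment}-vpc"
  }
}

# modules/networking/outputs.tf -- the only things other code may depend on
output "vpc_id" {
  value = aws_vpc.main.id
}

# environments/prod/main.tf -- composing the module
module "networking" {
  source      = "../../modules/networking"
  environment = "prod"
  vpc_cidr    = "10.0.0.0/16"
}
The compute module then takes something like module.networking.vpc_id as an input, which is exactly what lets the two compose cleanly and be tested independently.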
The Security Incident I Still Feel Guilty About
Here’s a story I’m not proud of. Early in my Terraform journey, I committed a .tfvars file to our Git repository. It contained database passwords, API keys, and other secrets. I knew it was wrong, but I was in a hurry, and I told myself I’d fix it later. That file sat in our commit history for three months before someone on the security team found it during an audit.
We had to rotate every single secret in that file. We locked down the repository access. We scanned all our systems for signs of compromise. Thankfully, nothing had been exploited, but it could have been catastrophic. And it was completely my fault for taking a shortcut I knew was wrong.
Now I have a zero-tolerance policy for secrets in code. I never hardcode them, and I never put them in .tfvars files that might accidentally get committed. Instead, I use data sources to fetch secrets at runtime from AWS Secrets Manager or Google Secret Manager. The secret never appears in the Terraform code itself (it does still end up in the state file, which is yet another reason state file encryption is critical).
# This is how I fetch a database password securely
data "google_secret_manager_secret_version" "db_password" {
  secret = "db-password"
}

resource "google_sql_user" "user" {
  name     = "app-user"                             # illustrative user name
  instance = google_sql_database_instance.main.name # the Cloud SQL instance, defined elsewhere
  password = data.google_secret_manager_secret_version.db_password.secret_data
}
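On the AWS side, the equivalent uses the Secrets Manager data source. Here’s a sketch; the secret name and database settings are placeholders rather than our real values.
# The same idea with AWS Secrets Manager (names and sizes are illustrative)
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "db-password"
}

resource "aws_db_instance" "main" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "app"
  password          = data.aws_secretsmanager_secret_version.db_password.secret_string
}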
Getting the team to adopt this pattern took some effort. Developers were used to just dropping secrets into config files. But after I shared the story of our near-miss security incident, everyone understood why this was non-negotiable. We even set up pre-commit hooks to scan for common secret patterns, just to add another layer of protection against human mistakes.
Infrastructure as Code Means Treating It Like Code
I learned this lesson from watching a junior engineer manually apply Terraform changes to production without any review process. The changes looked fine in isolation, but they broke an assumption that another part of our infrastructure depended on. We had about 30 minutes of downtime before we could roll back.
That’s when we implemented a proper CI/CD pipeline for infrastructure. Now, every commit goes through automated checks before a human even looks at it. The pipeline runs terraform fmt -check so unformatted code fails the build, terraform validate to catch syntax errors, and tfsec to scan for security issues. Then it generates a plan file and saves it as an artifact.
The key thing about the plan file is that it becomes the source of truth for what will change. In our code review process, we don’t just look at the Terraform code changes. We look at the plan output. We ask questions like “Why is this resource being replaced instead of updated in-place?” or “Do we really want to delete that security group?” These reviews have caught so many unintended consequences before they reached production.
For production deploys, the apply step is always manual. I don’t care how good your testing is; you need a human to review the plan and click the button. We’ve been saved multiple times by someone in the approval chain saying “Wait, that’s going to recreate the database, isn’t it?” and stopping a destructive change that passed all the automated checks.
The Silent Killer: Configuration Drift
Here’s something that bothered me for years: you spend all this effort building infrastructure as code, and then someone on the operations team logs into the AWS console and changes a security group rule manually. Suddenly, your Terraform state doesn’t match reality. You run terraform apply a few weeks later, and it either reverts their change (causing an incident) or tries to reconcile the drift in unexpected ways (also causing an incident).
I used to find out about drift when it was already causing problems. Now, I have a scheduled job that runs terraform plan -detailed-exitcode every night against all our environments. If the plan detects any changes (with that flag, the exit code is 2), it sends an alert to our team’s Slack channel with the diff. This tells us immediately if someone made a manual change outside of our IaC workflow.
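The job itself is nothing fancy. Here’s roughly its shape as a shell sketch; the environment paths and the Slack webhook variable are stand-ins, not our actual pipeline.
#!/usr/bin/env bash
# Nightly drift check (sketch): with -detailed-exitcode, exit status 2 means "changes present"
set -euo pipefail

for env in environments/dev environments/staging environments/prod; do
  terraform -chdir="$env" init -input=false > /dev/null

  set +e
  plan_output=$(terraform -chdir="$env" plan -detailed-exitcode -no-color -input=false 2>&1)
  exit_code=$?
  set -e

  if [ "$exit_code" -eq 2 ]; then
    # Drift detected: post the diff to Slack (SLACK_WEBHOOK_URL is a placeholder)
    payload=$(printf 'Drift detected in %s:\n%s' "$env" "$plan_output" | jq -Rs '{text: .}')
    curl -s -X POST -H 'Content-type: application/json' --data "$payload" "$SLACK_WEBHOOK_URL"
  elif [ "$exit_code" -ne 0 ]; then
    echo "terraform plan failed for $env" >&2
    exit 1
  fi
done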
The first week we turned this on, we got 23 alerts. It turned out our monitoring team had been tweaking CloudWatch alarms directly in the console for months. They didn’t realize those alarms were managed by Terraform, and we’d been silently reverting their changes every time we deployed. Once we saw the alerts, we had a conversation, updated our Terraform code to include their changes, and set up a better process for them to request alarm modifications through pull requests.
What I’d Tell My Past Self
If I could go back to when I first started using Terraform, I’d tell myself that every shortcut I take will come back to bite me later. Using local state instead of remote state will cause a conflict at the worst possible time. Skipping modules because “this is just a small project” will turn into hours of repetitive work. Hardcoding secrets will create a security risk that keeps me up at night.
I’d also tell myself that getting a team to adopt these practices takes patience and communication. You can’t just declare “we’re doing infrastructure as code now” and expect everyone to immediately understand why state locking matters or why we need CI/CD for Terraform. You need to share the war stories, explain the near-misses, and sometimes let people make small mistakes in dev environments so they understand why the guardrails exist.
The most important thing I’ve learned is that Terraform best practices aren’t about following rules for the sake of rules. They’re about protecting yourself and your team from the inevitable mistakes that happen when you’re managing complex infrastructure. Remote state with locking prevents the Friday afternoon disaster. Modules prevent the security patch nightmare. Secret management prevents the audit crisis. Drift detection prevents the silent divergence that causes mysterious bugs.
These practices have become second nature to me now, but each one was learned through a painful experience. I hope by sharing these stories, you can skip some of the pain and build more reliable infrastructure from the start. Just remember: when you’re tempted to take a shortcut, there’s probably someone like me who learned the hard way why that shortcut leads to an incident.