Beyond 'terraform plan': A Guide to Unit, Integration, and Chaos Testing for Infrastructure

I used to think of infrastructure code differently than application code. That changed one Friday afternoon when a misconfigured security group rule exposed our staging database to the public internet. We caught it during a routine audit, but it had been sitting there for three days. The worst part? A simple test would have caught it before it ever got deployed.

That incident was my wake-up call. Over the past two years, I’ve built a comprehensive testing strategy that has prevented more than 50 misconfigurations from reaching production. Some were minor (wrong instance types), but others could have been catastrophic (publicly accessible databases, misconfigured IAM roles with admin access). Now, automated testing is the foundation of everything my team does with infrastructure.

Here’s what I’ve learned the hard way about testing infrastructure as code.

My Testing Pyramid

I approach infrastructure testing with a classic testing pyramid, though it took me a while to figure out the right balance at each layer.

        /\
       /  \          End-to-End & Chaos Tests
      /____\         (Kitchen, AWS FIS)
     /      \
    /________\       Integration & Unit Tests
   /          \      (Terratest)
  /____________\
 /              \    Static Analysis & Policy Checks
/________________\   (TFLint, Checkov, OPA)

The Foundation: Static Analysis and Policy as Code

This is where I started, and honestly, where you should start too. It’s the fastest and cheapest layer of testing, and it runs on every single commit in my CI/CD pipeline. When I first proposed adding these checks to our pipeline, the team was skeptical. “More gates to slow us down,” someone said. But within the first week, we caught a developer accidentally trying to create an S3 bucket with public read access. That skepticism turned into enthusiasm pretty quickly.

I use tflint for basic linting to catch common errors and enforce conventions. Requiring tags on every resource might seem pedantic, but when you’re managing hundreds of resources across multiple environments, consistent tagging is the difference between finding a resource in 30 seconds and finding it in 30 minutes.
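To make the tagging rule concrete, here’s a minimal Go sketch of the kind of check involved. It isn’t how tflint works internally, and the requiredTags list and missingTags helper are names I made up for illustration; your own convention will differ.

```go
package main

import "sort"

// requiredTags is a hypothetical tagging convention used only for this sketch.
var requiredTags = []string{"Environment", "Owner", "CostCenter"}

// missingTags returns, in sorted order, every required tag key that is
// absent from tags or present with an empty value.
func missingTags(tags map[string]string, required []string) []string {
	var missing []string
	for _, key := range required {
		// A missing key and an empty value both read as "" here.
		if tags[key] == "" {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing)
	return missing
}
```

Calling missingTags with a resource’s tag map and requiredTags gives you a list you can print in a build failure message, which is about as much feedback as a developer needs to fix it.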

For security scanning, I rely on checkov. It scans for common security misconfigurations before they ever leave a developer’s machine: public S3 buckets, overly permissive security groups, unencrypted storage, and all the other classics that make for embarrassing incident reports.

The most powerful tool in this layer, though, is Open Policy Agent (OPA). This is where I write custom policies in Rego that are specific to our organization’s requirements. For example, we have a strict policy that no EC2 instance in our production environment can have a public IP address. Everything goes through load balancers or bastion hosts. My CI/CD pipeline exports the Terraform plan as JSON (terraform show -json), runs opa eval against it, and fails the build immediately if any policy is violated.

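# An OPA policy to deny public EC2 instances.
# Input is the JSON plan produced by `terraform show -json`.
package terraform.analysis

# rego.v1 enables the `contains` and `if` keywords that OPA 1.0+ requires
import rego.v1

deny contains msg if {
    resource := input.resource_changes[_]
    resource.type == "aws_instance"
    # change.after is null for deletions, so the rule simply does not match them
    resource.change.after.associate_public_ip_address == true
    msg := sprintf("EC2 instance %s should not have a public IP", [resource.address])
}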

This policy saved us just last month when a junior developer was setting up a new application server and didn’t realize they had set associate_public_ip_address = true in their module. The build failed with a clear error message, they fixed it in 30 seconds, and we avoided a potential security audit finding.
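If the shape of the plan JSON is unfamiliar, it can help to see the same rule written in a general-purpose language. The Go sketch below is not part of my pipeline and is not a replacement for OPA; the plan struct and publicInstances function are hypothetical names, and the struct models only the fields this one check reads. One caveat that applies to the Rego version too: values Terraform can only compute at apply time are reported separately as unknown, so neither version sees them.

```go
package main

import "encoding/json"

// plan mirrors only the parts of `terraform show -json` output used here.
type plan struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Type    string `json:"type"`
		Change  struct {
			After map[string]any `json:"after"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// publicInstances returns the addresses of aws_instance resources whose
// planned state sets associate_public_ip_address to true.
func publicInstances(planJSON []byte) ([]string, error) {
	var p plan
	if err := json.Unmarshal(planJSON, &p); err != nil {
		return nil, err
	}
	var offenders []string
	for _, rc := range p.ResourceChanges {
		if rc.Type != "aws_instance" {
			continue
		}
		// After is nil for resources being destroyed; reading a nil map is safe.
		if public, ok := rc.Change.After["associate_public_ip_address"].(bool); ok && public {
			offenders = append(offenders, rc.Address)
		}
	}
	return offenders, nil
}
```

Feeding this function a few hand-written plan fragments is also a cheap way to sanity-check what you expect a policy to flag before you encode it in Rego.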

The Middle Layer: Unit and Integration Tests

Static analysis is great, but it can’t tell you if your infrastructure actually works. That’s where unit and integration tests come in, and this is where I initially met the most resistance from my team. “You want us to write tests… for Terraform?” Yes. Yes, I do.

The turning point was when we had an incident where a VPC module update changed the subnet CIDR blocks in a way that broke our existing security group rules. The change looked fine in the Terraform plan, but once deployed, half our microservices couldn’t talk to each other. We spent four hours troubleshooting in production. After that, nobody questioned the value of integration tests.
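The failure mode in that incident boils down to a containment question: does every subnet still fall inside a range the security group rules allow? Here’s a minimal Go sketch of that check using only the standard library. The uncoveredSubnets function and the CIDR values in the usage note are hypothetical; in a real test the inputs would come from the deployed resources.

```go
package main

import "net/netip"

// uncoveredSubnets returns the subnets that are not fully contained in any
// of the allowed CIDR ranges. Both arguments are lists of CIDR strings.
func uncoveredSubnets(subnets, allowed []string) ([]string, error) {
	allowedPrefixes := make([]netip.Prefix, 0, len(allowed))
	for _, a := range allowed {
		p, err := netip.ParsePrefix(a)
		if err != nil {
			return nil, err
		}
		allowedPrefixes = append(allowedPrefixes, p.Masked())
	}

	var uncovered []string
	for _, s := range subnets {
		sub, err := netip.ParsePrefix(s)
		if err != nil {
			return nil, err
		}
		sub = sub.Masked()
		covered := false
		for _, a := range allowedPrefixes {
			// a contains sub when a is no more specific than sub
			// and sub's base address lies inside a.
			if a.Bits() <= sub.Bits() && a.Contains(sub.Addr()) {
				covered = true
				break
			}
		}
		if !covered {
			uncovered = append(uncovered, s)
		}
	}
	return uncovered, nil
}
```

With subnets 10.0.1.0/24 and 10.1.0.0/24 and a single allowed range of 10.0.0.0/16, the function reports 10.1.0.0/24 as uncovered, which is exactly the kind of drift a plan review tends to miss.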

My tool of choice for this layer is Terratest. I write tests in Go that call terraform apply on a specific module, and then use the AWS SDK to make assertions about the created resources. For example, after my VPC module runs, my test verifies that the VPC exists, has the correct CIDR block, has the expected number of subnets in the right availability zones, and that the route tables are configured correctly.

These tests create real resources in a test AWS account, so I run them as a manual or nightly step in my CI pipeline to control costs. The first month we ran these tests, our AWS bill went up by about $150. But considering we prevented at least three production incidents that month, each of which would have cost us far more in engineering time and potential customer impact, it was worth every penny.

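// A simplified Terratest example for my VPC module
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/aws"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVPCCreation(t *testing.T) {
    t.Parallel()
    awsRegion := "us-east-1"
    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        // Example configuration that instantiates the VPC module
        TerraformDir: "../examples/vpc",
    })

    // Deferred before apply so a partially failed apply is still torn down
    defer terraform.Destroy(t, terraformOptions) // Always clean up!
    terraform.InitAndApply(t, terraformOptions)

    // Look up the VPC that was actually created, using the module's vpc_id output
    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    vpc := aws.GetVpcById(t, vpcId, awsRegion)

    assert.Equal(t, "10.0.0.0/16", *vpc.CidrBlock)
}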
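
The example above only asserts the CIDR block. The subnet checks I mentioned (expected count, right availability zones) can be factored into a small pure function, which keeps the assertion logic testable without touching AWS. This is a sketch under assumptions: the subnet struct and checkSubnetLayout are hypothetical names, and in a real test the slice would be built from whatever the AWS lookup returns.

```go
package main

import "fmt"

// subnet is a hypothetical, trimmed-down view of a subnet: just the
// fields this check needs.
type subnet struct {
	ID               string
	AvailabilityZone string
}

// checkSubnetLayout verifies that there are exactly wantCount subnets and
// that every availability zone in wantAZs has at least one of them.
func checkSubnetLayout(subnets []subnet, wantCount int, wantAZs []string) error {
	if len(subnets) != wantCount {
		return fmt.Errorf("expected %d subnets, got %d", wantCount, len(subnets))
	}
	seen := make(map[string]bool)
	for _, s := range subnets {
		seen[s.AvailabilityZone] = true
	}
	for _, az := range wantAZs {
		if !seen[az] {
			return fmt.Errorf("no subnet found in availability zone %s", az)
		}
	}
	return nil
}
```

Inside a Terratest function, the result can go straight into an assertion such as assert.NoError, so a failing layout shows up with a readable message instead of a bare boolean.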

The Peak: Chaos Engineering

This is where I test the resilience of the entire system. I use AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) to intentionally break things in my staging environment. For example, one experiment terminates a random EC2 instance in an Auto Scaling group, then verifies that the group replaces it and that the application stays healthy throughout. Watching the system recover from real failures is what gives me confidence in it.

The Real-World Impact

We’ve prevented over 50 major misconfigurations from reaching production in the last year alone.
Our infrastructure-related incidents have decreased by 75%.
Our deployment time has actually gone down by 40% because my team has so much more confidence in their changes.

My Infrastructure Testing Best Practices

Test at every level: static analysis, unit, integration, and chaos.
Automate everything in your CI/CD pipeline. The static analysis stage (linting, security scanning, policy checks) should run on every commit.
Use Policy as Code (OPA) to enforce custom rules that are specific to your organization.
Use Terratest to test individual modules. It’s widely adopted, and writing tests in Go lets you use the AWS SDK for assertions.
Always clean up test resources to avoid surprise bills. defer terraform.Destroy is your best friend.