How I Led a Cloud Migration That Cut Costs by 56% Annually

At a previous company, I led a cloud migration and modernization project that was one of the most impactful of my career. We moved over 40 legacy applications from an expensive, aging on-premises infrastructure to Google Cloud. The result? We cut our annual operational costs by 56%, a saving of $470,000 a year, while also making deployments roughly 10x faster. Here’s how I did it.

The Challenge I Faced

The VP of Engineering called me into a conference room and showed me a spreadsheet. Our annual infrastructure costs: $840,000. For what we were getting, that number was embarrassing. Our on-premises data center was a mix of aging hardware that we’d been patching together for years. Half the servers were past their warranty period. We were paying for colocation space we barely used. And the operations team spent most of their time babysitting failing hardware instead of building new features.

But the cost wasn’t even the worst part. Our deployment process was completely manual. Someone would SSH into a server, stop the application, copy new files, restart it, and pray nothing broke. This process took hours and failed about 40% of the time. When it failed, someone had to scramble to roll back, which often made things worse.

We also couldn’t scale to meet demand. When traffic spiked, we had to physically provision new hardware, which took weeks. By the time the new servers were racked and configured, the spike was over. Our monolithic applications were tightly coupled nightmares where changing one thing could break three other things. Something had to change.

```mermaid
pie title Our Annual Costs Before I Started ($840K)
    "Server Hardware & Maintenance" : 350
    "Datacenter & Colocation" : 280
    "Network & Bandwidth" : 120
    "Staff Time (Manual Ops)" : 90
```

The Strategy: Not All Apps Are Equal

I spent two weeks analyzing our 40+ applications before proposing a strategy. The key insight was that not all applications are created equal. Some got hit with thousands of requests per second. Others processed a webhook maybe once an hour.

I categorized every application into two buckets based on traffic patterns and criticality. Fifteen applications were “high-duty”: our core API, the web application, real-time services, things that needed to be online 24/7 handling constant traffic. These went to Google Kubernetes Engine (GKE) where I could tune performance, control autoscaling, and have full flexibility.

Twenty-five applications were “low-duty”: batch processors, admin dashboards, webhook handlers, internal tools. These had sporadic or scheduled usage. An admin tool might get used for 30 minutes a day. A batch job might run for an hour at 2 AM. Running these on dedicated servers 24/7 was wasteful. These went to Cloud Run, which could scale to zero when not in use. We’d only pay for compute when they were actually running.
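
The two-bucket split can be sketched in a few lines of Python. This is a minimal illustration, not the actual decision logic from the project: the thresholds, the `AppProfile` fields, and the example apps are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
HIGH_DUTY_MIN_ACTIVE_HOURS = 12.0  # hours per day with real traffic
HIGH_DUTY_MIN_PEAK_RPS = 50.0      # peak requests per second


@dataclass(frozen=True)
class AppProfile:
    name: str
    peak_rps: float
    active_hours_per_day: float
    customer_facing: bool


def choose_platform(app: AppProfile) -> str:
    """Return "GKE" for high-duty apps and "Cloud Run" for low-duty ones."""
    always_on = app.active_hours_per_day >= HIGH_DUTY_MIN_ACTIVE_HOURS
    heavy_traffic = app.peak_rps >= HIGH_DUTY_MIN_PEAK_RPS
    if always_on and (heavy_traffic or app.customer_facing):
        return "GKE"
    return "Cloud Run"


def duty_cycle(app: AppProfile) -> float:
    """Fraction of the day the app is actually doing work."""
    return app.active_hours_per_day / 24.0


if __name__ == "__main__":
    # Made-up example apps.
    apps = [
        AppProfile("core-api", 2000, 24, True),
        AppProfile("admin-dashboard", 2, 0.5, False),
        AppProfile("nightly-batch", 0, 1, False),
    ]
    for app in apps:
        print(f"{app.name}: {choose_platform(app)} (duty cycle {duty_cycle(app):.1%})")
```

An admin tool used for 30 minutes a day has a duty cycle of roughly 2%, which is why paying for it around the clock is hard to justify.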

This hybrid approach was controversial. The ops team wanted everything in one place for simplicity. But the cost difference was too big to ignore. Cloud Run would save us tens of thousands a year on the low-duty apps alone.

```mermaid
graph LR
    OnPrem["On-Premises (Legacy)"] --> Migration["My Migration Path<br/>(Assess, Lift, Modernize)"] --> GCP["Google Cloud Platform"]
    subgraph GCP
        GKE[GKE for High-Duty Apps]
        CloudRun[Cloud Run for Low-Duty Apps]
    end
    style OnPrem fill:#ff6b6b,color:#fff
    style GCP fill:#51cf66,color:#fff
```

How I Actually Executed This

The migration took five months from start to finish. I broke it into four phases, learning from each one to make the next phase smoother.

Month 1: Planning and Assessment

I wrote Python scripts to scrape logs and analyze every application’s traffic patterns, resource usage, and dependencies. This data drove the GKE vs Cloud Run decision for each app. I also interviewed the developers who owned each application to understand any quirks or hidden dependencies. This uncovered several undocumented database connections and API dependencies that would have caused outages if I’d missed them.
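
A minimal sketch of that kind of traffic analysis might look like the following. It is not the actual script from the project: the log format (`timestamp app method path status`), the app names, and the sample lines are made up for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Assumed timestamp format for the hypothetical log lines below.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def hourly_request_counts(lines):
    """Map app name -> Counter of requests per hour of day (0-23)."""
    counts = defaultdict(Counter)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or truncated lines
        try:
            ts = datetime.strptime(parts[0], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        counts[parts[1]][ts.hour] += 1
    return counts


def active_hours(hour_counts):
    """Number of distinct hours of the day in which the app saw traffic."""
    return sum(1 for n in hour_counts.values() if n > 0)


if __name__ == "__main__":
    sample = [
        "2024-01-15T02:00:05 nightly-batch POST /run 200",
        "2024-01-15T02:30:10 nightly-batch POST /run 200",
        "2024-01-15T09:15:00 core-api GET /users 200",
        "2024-01-15T14:45:30 core-api GET /orders 200",
        "not a log line",
    ]
    for app, hours in hourly_request_counts(sample).items():
        print(f"{app}: active in {active_hours(hours)} hour(s), by hour: {dict(hours)}")
```

An app whose traffic clusters in one or two hours of the day is a natural Cloud Run candidate, while one that is busy in nearly every hour points toward GKE.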

Month 2: Building the Foundation

I used Terraform to build our GCP landing zone. VPCs, subnets, firewall rules, the GKE cluster, Cloud SQL instances for our databases, everything as code. This took longer than I wanted because I kept finding edge cases and security requirements, but having it all in Terraform meant I could recreate the entire environment for testing, which saved us later.

Months 3-5: The Migration Waves

I divided the 40 applications into four migration waves. Wave 1 was five low-risk internal tools. If we screwed up, only internal users would be affected. We learned a ton from this wave about containerization gotchas and Cloud Run configuration.

Wave 2 was ten more low-duty applications. We were getting confident now. Wave 3 was fifteen applications including some customer-facing ones. By this point, the process was smooth. Wave 4 was the big ten: our core services. These migrations happened during planned maintenance windows with the entire team on standby. Every one went smoothly because we’d practiced on 30 other apps first.
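
The wave structure is simple enough to express in code. The sketch below assumes the apps have already been sorted from lowest to highest risk; the `plan_waves` helper and the placeholder app names are hypothetical.

```python
def plan_waves(apps_by_risk, wave_sizes):
    """Split apps (sorted lowest risk first) into consecutive migration waves."""
    if sum(wave_sizes) != len(apps_by_risk):
        raise ValueError("wave sizes must cover every app exactly once")
    waves, start = [], 0
    for size in wave_sizes:
        waves.append(apps_by_risk[start:start + size])
        start += size
    return waves


if __name__ == "__main__":
    # Placeholder names standing in for the 40 applications.
    apps = [f"app-{i:02d}" for i in range(1, 41)]
    waves = plan_waves(apps, [5, 10, 15, 10])
    migrated_so_far = 0
    for number, wave in enumerate(waves, start=1):
        print(f"Wave {number}: {len(wave)} apps, {migrated_so_far} migrated before it")
        migrated_so_far += len(wave)
```

With wave sizes of 5, 10, 15, and 10, the final wave only starts after 30 apps have already been through the process.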

Post-Migration: Automation

Once everything was migrated, I implemented GitOps with Cloud Build and ArgoCD. Developers would push code to Git, Cloud Build would run tests and build containers, and ArgoCD would deploy to GKE or Cloud Run automatically. This cut our deployment failure rate from 40% to 4% almost overnight.

What We Achieved

The CFO was skeptical when I proposed this migration. The upfront cost was significant, and she wanted proof it would pay off. Six months after completion, I sent her an updated version of that cost spreadsheet.

Annual costs dropped from $840,000 to $370,000. That’s a 56% reduction, or $470,000 in annual savings. The CFO actually forwarded my email to the CEO with “This is what good infrastructure work looks like.” I saved that email.

The cost breakdown was eye-opening. GKE and Cloud Run compute cost about $220,000 annually. Cloud SQL and other managed services cost $90,000. Networking and data transfer cost $60,000. We eliminated all hardware costs, all colocation costs, and 60% of the operational overhead. The Cloud Run apps that scaled to zero were especially impactful, some costing less than $10 per month.
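
The arithmetic is easy to check. The snippet below uses only figures quoted in this post (the before-migration pie chart and the breakdown above); the variable and function names are just for illustration.

```python
# Annual costs in USD, as quoted in this post.
BEFORE = {
    "Server Hardware & Maintenance": 350_000,
    "Datacenter & Colocation": 280_000,
    "Network & Bandwidth": 120_000,
    "Staff Time (Manual Ops)": 90_000,
}
AFTER = {
    "GKE and Cloud Run compute": 220_000,
    "Cloud SQL and other managed services": 90_000,
    "Networking and data transfer": 60_000,
}


def summarize(before, after):
    """Return (total_before, total_after, savings, savings_fraction)."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    savings = total_before - total_after
    return total_before, total_after, savings, savings / total_before


if __name__ == "__main__":
    total_before, total_after, savings, fraction = summarize(BEFORE, AFTER)
    print(f"Before:  ${total_before:,}")
    print(f"After:   ${total_after:,}")
    print(f"Savings: ${savings:,} ({fraction:.1%})")
```

The categories sum to $840,000 before and $370,000 after, which gives the $470,000 saving and a reduction of about 56%.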

Deployment speed improved dramatically. Our deployment time went from 2-4 hours of manual work to 10-15 minutes of automated deployment. The error rate dropped from 40% to 4%. We went from deploying a few times per week (because deployments were painful) to deploying dozens of times per day (because deployments became trivial).
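
To put a number on "dramatically", the ranges above can be compared directly. This is back-of-the-envelope arithmetic on the figures in this post; the helper names are hypothetical.

```python
def speedup_range(before_minutes, after_minutes):
    """Return (conservative, optimistic) speedup for (low, high) duration ranges."""
    before_low, before_high = before_minutes
    after_low, after_high = after_minutes
    conservative = before_low / after_high  # fastest old deploy vs slowest new one
    optimistic = before_high / after_low    # slowest old deploy vs fastest new one
    return conservative, optimistic


def expected_failures(deploys, failure_percent):
    """Expected number of failed deployments out of `deploys` attempts."""
    return deploys * failure_percent / 100


if __name__ == "__main__":
    low, high = speedup_range((120, 240), (10, 15))
    print(f"Speedup: between {low:.0f}x and {high:.0f}x")
    print(f"Failures per 100 deploys: {expected_failures(100, 40):.0f} before, "
          f"{expected_failures(100, 4):.0f} after")
```

Going from 2-4 hours to 10-15 minutes works out to somewhere between 8x and 24x, so the headline "10x" sits at the cautious end of that range.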

Performance exceeded expectations. The GKE cluster could handle 5x our peak traffic with room to spare. Latency dropped by 60% because we weren’t running on aging hardware anymore. The autoscaling meant we could handle traffic spikes without manual intervention.

```mermaid
graph LR
    subgraph Before["On-Prem ($840K/year)"]
        direction LR
        B1[Deployment: 2-4 hours]
        B2[Error Rate: 40%]
    end
    subgraph After["GCP ($370K/year)"]
        direction LR
        A1[Deployment: 10-15 min]
        A2[Error Rate: 4%]
    end
    Before -->|My Project| After
    style Before fill:#ffe0e0
    style After fill:#e0ffe0
```

What I Learned

Application categorization made or broke this project. Putting everything on GKE would have been simpler but cost us an extra $100,000+ annually. The Cloud Run apps that scaled to zero saved a fortune. A one-size-fits-all approach would have failed.

Database migration was the hardest part, just like everyone said it would be. We used GCP’s Database Migration Service with a read-replica strategy, but we still needed planned maintenance windows for the final cutover. One database migration hit an unexpected incompatibility issue and took 3 hours longer than planned. We should have tested the migration process more thoroughly in staging.
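
One check worth rehearsing in staging is a gate on replication lag before the final cutover. The sketch below is a generic illustration, not part of Database Migration Service or any GCP API: the `ready_for_cutover` helper, the thresholds, and the lag samples are all hypothetical, and the lag values would have to come from whatever monitoring you already have.

```python
def ready_for_cutover(lag_samples, max_lag_seconds=1.0, required_consecutive=3):
    """Return True if the latest samples show the replica has caught up.

    lag_samples: replication lag readings in seconds, oldest first.
    The most recent `required_consecutive` readings must all be at or
    below `max_lag_seconds` before it is considered safe to stop writes
    on the source and switch traffic to the replica.
    """
    if required_consecutive < 1:
        raise ValueError("required_consecutive must be at least 1")
    if len(lag_samples) < required_consecutive:
        return False  # not enough data to decide yet
    recent = lag_samples[-required_consecutive:]
    return all(lag <= max_lag_seconds for lag in recent)


if __name__ == "__main__":
    # Made-up lag readings, in seconds.
    print(ready_for_cutover([30.0, 5.0, 0.8, 0.4, 0.2]))  # replica has caught up
    print(ready_for_cutover([0.5, 0.3, 4.0]))             # lag spiked on the last reading
```

Requiring several consecutive low readings, rather than a single one, avoids cutting over during a brief lull in write traffic.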

Team training was essential but easy to skip under deadline pressure. I carved out 40 hours for workshops on Kubernetes, GitOps, and GCP fundamentals. Some people grumbled about taking time away from “real work,” but two months later, the team was self-sufficient. They could deploy, troubleshoot, and optimize without constantly asking me for help. That ROI was massive.

Start with low-risk applications. The Wave 1 migrations taught us about container networking issues, environment variable gotchas, and Cloud Run cold start behavior. If we’d started with our core services, we would have had outages. Practice on the apps that don’t matter before touching the apps that do.

Infrastructure as code is non-negotiable for migrations of this scale. Having everything in Terraform meant I could tear down and rebuild environments for testing. It also meant we had documentation that couldn’t drift from reality because the code was the reality.