Multi-Cloud Migration: Moving Production Workloads from AWS to GCP

Jan 20, 2025

I spent 18 months leading one of the most nerve-wracking projects of my career: migrating an entire production infrastructure from AWS to Google Cloud Platform. This wasn’t a greenfield project where you get to design everything perfectly from scratch. This was moving a living, breathing production system serving millions of users while keeping everything running.

The CTO walked into my office one Monday morning and said, “We’re going all-in on GCP. I need you to move everything off AWS in the next 12 months.” My first thought was “that’s insane.” My second thought was “I’m going to need a lot of coffee.”

Why We Migrated (And Why It Was Painful)

The company had grown through acquisitions. We’d bought three startups in two years, and each one ran on a different cloud provider. We had services on AWS, GCP, and Azure. The operational complexity was crushing us. Every incident took longer to resolve because our team had to be experts in three different cloud platforms. Our cloud bills were higher than they should have been because we couldn’t negotiate volume discounts with our spend split across three providers.

The CTO made the call to consolidate on GCP. The reasoning was sound: better Kubernetes integration, stronger data analytics tools, and our data science team was already heavily invested in BigQuery. But knowing why doesn’t make the migration any less terrifying.

The Journey: Four Major Migrations

This wasn’t one migration. It was four distinct, high-risk projects that had to be executed in sequence. Each one taught me lessons that made the next one possible.

Migration 1: The 8.5TB Database (The Scariest One)

The Problem: Our core PostgreSQL database was 8.5TB on AWS RDS. This database was the heart of our platform, handling 12,000 writes per second at peak. Every transaction mattered because of compliance requirements. The requirements were brutal: zero data loss and less than 30 minutes of downtime.

The Wake-Up Call: During planning, I did a test pg_dump of our staging database (much smaller, only 200GB). It took 6 hours over the public internet. At that effective throughput, migrating 8.5TB would take more than ten days. We couldn’t have the database offline for anything close to that long.

The Solution: I used a combination of Megaport for a dedicated 10Gbps network link between AWS and GCP, and Striim for real-time Change Data Capture replication. Megaport gave us the fat network pipe to transfer terabytes quickly. Striim let us keep the target database in sync with the source while the source stayed online serving production traffic.
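The arithmetic behind that decision is worth spelling out. Here’s a rough sketch of the estimate, assuming idealized sustained throughput and ignoring protocol overhead, retries, and compression:

```python
# Back-of-the-envelope transfer-time estimates; treat these as optimistic lower bounds.

def transfer_hours(size_gb: float, throughput_mbps: float) -> float:
    """Hours to move size_gb at a sustained throughput in megabits per second."""
    return size_gb * 8 * 1000 / throughput_mbps / 3600

# Observed: 200 GB over the public internet took ~6 hours, i.e. roughly 74 Mbps effective.
public_mbps = 200 * 8 * 1000 / (6 * 3600)

print(f"8.5 TB over the public internet: ~{transfer_hours(8500, public_mbps) / 24:.0f} days")
print(f"8.5 TB over a dedicated 10 Gbps link: ~{transfer_hours(8500, 10_000):.1f} hours")
```

Roughly eleven days versus a couple of hours of raw transfer time, which is why the dedicated link plus continuous CDC replication was the only plan that fit inside a 30-minute downtime budget.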

The migration took 5 weeks of preparation and testing. We rehearsed the cutover procedure five times in staging. By the fifth rehearsal, we had it down to 12 minutes. The actual production cutover happened on a Sunday at 2 AM with the entire engineering team watching on Slack. Fifteen minutes of downtime. Zero data loss. The Monday morning after, the CTO forwarded my summary email to the entire company with the subject line “This is how you execute.”

Read the full guide: 8.5TB Database Migration from AWS to GCP

Migration 2: Legacy VMs (The Messy One)

The Problem: We had 87 legacy VMs on AWS EC2 running everything from old Java services to PostgreSQL databases to Redis caches. Some of these VMs were running applications nobody fully understood anymore. The original developers had left the company. Documentation was sparse or non-existent.

The Reality Check: I started by doing a full inventory. What I found was terrifying. We had VMs running ancient versions of Java with hardcoded credentials. We had databases with no backup strategy. We had applications that couldn’t be containerized because they relied on specific kernel modules or hardware configurations.
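For anyone facing a similar inventory, the first pass doesn’t need to be fancy. Here’s a minimal sketch of the kind of script I mean, using boto3 (the regions, tag names, and CSV layout are illustrative, not our exact tooling):

```python
# Dump every EC2 instance into a CSV so the unknowns become visible.
# Assumes AWS credentials are already configured for boto3.
import csv
import boto3

def inventory_region(region, writer):
    ec2 = boto3.client("ec2", region_name=region)
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
                writer.writerow([
                    inst["InstanceId"],
                    inst["InstanceType"],
                    inst.get("ImageId", ""),
                    inst.get("LaunchTime", ""),
                    tags.get("Name", ""),
                    tags.get("Owner", "unknown"),  # no owner tag -> straight onto the risk list
                    region,
                ])

with open("vm_inventory.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["instance_id", "type", "ami", "launched", "name", "owner", "region"])
    for region in ["us-east-1", "us-west-2"]:  # whatever regions you actually run in
        inventory_region(region, writer)
```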

The Solution: I couldn’t just lift-and-shift these VMs blindly, so I took a three-tier approach. For modern services that could be containerized, I rewrote them to run on GKE. For databases, I moved them into Cloud SQL and used the Cloud SQL Auth Proxy for secure connectivity. For the truly legacy stuff that couldn’t be modernized, I used Velostrata (now called Migrate for Compute Engine) to migrate the VMs as-is, then tackled modernization later.
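The Cloud SQL piece is worth a concrete illustration: the proxy runs next to the application and handles TLS and authorization, so the app keeps talking to what looks like a local Postgres. A sketch, with placeholder project, instance, and credential names:

```python
# The Cloud SQL Auth Proxy runs as a local process or sidecar, for example:
#   ./cloud-sql-proxy --port 5432 my-project:us-central1:core-db
# The application then connects to 127.0.0.1 as if the database were local.
import os
import psycopg2

conn = psycopg2.connect(
    host="127.0.0.1",                    # the proxy handles TLS and authorization
    port=5432,
    dbname="core",
    user="app_user",
    password=os.environ["DB_PASSWORD"],  # placeholder; IAM database auth avoids this entirely
)
with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()
```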

The whole process took 6 months. We migrated VMs in waves, 10-15 at a time, with validation after each wave. We had some failures. One Java application refused to start on GCP because it was trying to call an AWS-specific metadata endpoint. It took us three days to track that down.
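In hindsight, a dumb scan of each VM’s application and config directories for AWS-only assumptions would have saved us those three days. Something along these lines (the patterns and paths are just examples):

```python
# Flag files that reference the EC2 metadata service or other AWS-only assumptions.
# GCE also answers on 169.254.169.254, but not with the /latest/meta-data/ paths
# that AWS SDKs and hand-rolled scripts expect.
import pathlib
import re

AWS_HINTS = re.compile(r"169\.254\.169\.254|/latest/meta-data|AWS_DEFAULT_REGION")

def scan(root):
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file() or path.stat().st_size > 5_000_000:
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            if AWS_HINTS.search(line):
                print(f"{path}:{lineno}: {line.strip()[:120]}")

scan("/srv/app")  # point it at the application's deploy directory
```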

Read the full guide: Legacy VM Migration to GCP

Migration 3: Kubernetes Clusters (The Complex One)

The Problem: We were running production Kubernetes workloads on AWS EKS. Not just a few services. We had 50+ microservices, stateful applications with persistent volumes, complex networking with VPCs and peering, and service mesh configurations. Moving all of this to GKE while maintaining uptime was going to be incredibly complex.

The Learning Curve: I initially planned to just redeploy everything to GKE from scratch. Bad idea. I tried this approach with our first small cluster (a non-critical staging environment); two weeks later, the networking was still broken. The problem was that we didn’t have all the Kubernetes manifests in source control. Some services had been deployed manually. Some had configurations that only existed in the running cluster.
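Before we could migrate anything sensibly, we had to capture what was actually running. A rough sketch of that export step (the resource kinds are examples; it assumes kubectl is pointed at the EKS cluster):

```python
# Export live Kubernetes objects namespace by namespace so nothing exists only
# in the running cluster. The dumps still need cleanup (status, resourceVersion,
# cloud-specific annotations) before they belong in Git.
import pathlib
import subprocess

KINDS = ["deployments", "services", "configmaps", "ingresses"]
OUT = pathlib.Path("exported-manifests")

namespaces = subprocess.run(
    ["kubectl", "get", "namespaces", "-o", "jsonpath={.items[*].metadata.name}"],
    capture_output=True, text=True, check=True,
).stdout.split()

for ns in namespaces:
    out_dir = OUT / ns
    out_dir.mkdir(parents=True, exist_ok=True)
    for kind in KINDS:
        dump = subprocess.run(
            ["kubectl", "get", kind, "-n", ns, "-o", "yaml"],
            capture_output=True, text=True, check=True,
        ).stdout
        (out_dir / f"{kind}.yaml").write_text(dump)
```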

The Solution: I adopted a dual-cluster strategy. Keep both EKS and GKE running simultaneously. Use DNS weighted routing to gradually shift traffic from EKS to GKE. This gave us a safe rollback path and let us validate each service migration independently.
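In practice the traffic shift was just rewriting a pair of weighted records over and over, nudging the GKE weight up and watching error rates before each step. A sketch of the idea, assuming the zone lived in Route 53 at the time (zone ID, record names, and targets are placeholders):

```python
# Shift traffic between EKS and GKE by adjusting Route 53 weighted records.
import boto3

route53 = boto3.client("route53")

def set_weights(zone_id, record, eks_weight, gke_weight):
    changes = []
    for set_id, weight, target in [
        ("eks", eks_weight, "eks-ingress.example.com"),
        ("gke", gke_weight, "gke-ingress.example.com"),
    ]:
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record,
                "Type": "CNAME",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": 60,  # short TTL so each shift takes effect quickly
                "ResourceRecords": [{"Value": target}],
            },
        })
    route53.change_resource_record_sets(
        HostedZoneId=zone_id, ChangeBatch={"Changes": changes}
    )

# e.g. start at 90/10, then 75/25, 50/50, 10/90, 0/100 over several days
set_weights("Z123EXAMPLE", "api.example.com", eks_weight=90, gke_weight=10)
```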

For stateful workloads, I used Velero for backup and restore of persistent volumes. For networking, I set up VPN tunnels between AWS and GCP so services could communicate across clouds during the migration. The whole process took 4 months, but we had zero unplanned downtime.
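The Velero flow itself is only a couple of commands per namespace; the real work is making sure both clusters share the same backup storage location. A minimal sketch of how we drove it (namespace and backup names are examples):

```python
# Per-namespace backup on EKS, then restore on GKE. Assumes Velero is installed
# in both clusters and configured against the same object-storage bucket.
import subprocess

def run(args):
    print("+", " ".join(args))
    subprocess.run(args, check=True)

namespace = "payments"
backup_name = f"{namespace}-migration"

# With kubectl/velero pointed at the EKS cluster: snapshot the namespace and its PVs.
run(["velero", "backup", "create", backup_name,
     "--include-namespaces", namespace, "--wait"])

# After switching context to the GKE cluster: restore from the shared bucket.
run(["velero", "restore", "create", f"{backup_name}-restore",
     "--from-backup", backup_name, "--wait"])
```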

Read the full guide: EKS to GKE Migration Strategy

Migration 4: The Full Platform (Bringing It All Together)

The Problem: Once the database, VMs, and Kubernetes clusters were migrated, we still had to migrate all the supporting infrastructure. Load balancers. DNS. CDN. Monitoring. Logging. CI/CD pipelines. This wasn’t a single migration. It was dozens of small migrations that all had to work together.

The Coordination Challenge: The hardest part wasn’t the technical migration. It was coordinating across teams. The frontend team needed to update API endpoints. The mobile team needed to update app configurations. The DevOps team needed to update CI/CD pipelines. Marketing needed to update tracking pixels and analytics.

The Solution: I created a comprehensive migration plan that broke the platform migration into 12 phases, each with clear entry and exit criteria. We held daily standup meetings during the migration to track progress and blockers. We built a migration dashboard that showed the status of every service, database, and infrastructure component.
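The dashboard wasn’t sophisticated; the value was in having one agreed-upon model of what “done” meant for every component. Stripped down, it looked something like this (field names and phases are illustrative):

```python
# A stripped-down version of the data model behind the migration dashboard.
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    NOT_STARTED = "not started"
    IN_PROGRESS = "in progress"
    VALIDATING = "validating"
    DONE = "done"
    ROLLED_BACK = "rolled back"

@dataclass
class Component:
    name: str
    phase: int            # 1-12, matching the migration plan
    owner_team: str
    status: Status
    rollback_tested: bool

def phase_summary(components, phase):
    in_phase = [c for c in components if c.phase == phase]
    return {
        "total": len(in_phase),
        "done": sum(c.status is Status.DONE for c in in_phase),
        "untested_rollback": [c.name for c in in_phase if not c.rollback_tested],
    }
```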

The full platform migration took 3 months and involved the entire engineering organization. We migrated components in waves, validated each wave thoroughly, then moved to the next. We had a few incidents along the way (DNS propagation delays caused some API calls to hit old endpoints), but nothing catastrophic.

Read the full guide: Complete Cloud Migration AWS to GCP

The Complete Picture: What We Actually Achieved

After 18 months of intense work, we completed the migration. Here’s what that actually meant in numbers:

Migrated infrastructure:

  - One 8.5TB PostgreSQL database handling 12,000 writes per second at peak
  - 87 legacy VMs, from old Java services to self-managed databases and Redis caches
  - 50+ microservices and their stateful workloads, moved from EKS to GKE
  - All supporting infrastructure: load balancers, DNS, CDN, monitoring, logging, and CI/CD pipelines

Business impact:

  - One cloud bill instead of three, with enough concentrated spend to negotiate volume discounts
  - Faster incident resolution, because on-call engineers only need to know one platform
  - Data workloads sitting next to BigQuery, where the data science team already worked

The real win: Operational simplicity. Before the migration, every incident required expertise in three cloud platforms. Now we’re all GCP experts. Our on-call burden dropped significantly because there’s less surface area to understand.

What I Learned About Large-Scale Migrations

Start with the scariest thing first. We migrated the database first because it was the highest risk. If we’d failed at the database migration, we would have learned those lessons early when we still had time to adjust the overall strategy. Doing easy migrations first is tempting but teaches you nothing about the hard problems.

Over-communicate during migrations. I sent daily status emails during active migration windows. Some people thought this was excessive. But when something went wrong (and things always go wrong), everyone already knew the context and we could respond faster.

Build rollback procedures for everything. Every migration had a tested rollback plan. We used those rollback plans three times during the 18-month project. Without them, those incidents would have been outages.

Test in production. I know that sounds crazy, but we did dark launches for critical services. We’d deploy to GCP, send a copy of production traffic to both AWS and GCP (but only return the AWS response to users), then compare results. This caught so many subtle bugs before they could affect users.
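A dark launch here just means fanning the same request out to both stacks and only ever serving the old one. A simplified sketch (hostnames and the comparison logic are placeholders; the real thing sampled traffic and fanned out asynchronously):

```python
# Shadow the GCP stack with a copy of each request, serve the AWS response,
# and log any divergence for investigation.
import requests

AWS_BASE = "https://api-aws.internal.example.com"
GCP_BASE = "https://api-gcp.internal.example.com"

def handle(path, params):
    aws_resp = requests.get(f"{AWS_BASE}{path}", params=params, timeout=2)
    try:
        gcp_resp = requests.get(f"{GCP_BASE}{path}", params=params, timeout=2)
        if (gcp_resp.status_code != aws_resp.status_code
                or gcp_resp.json() != aws_resp.json()):
            print(f"MISMATCH {path}: aws={aws_resp.status_code} gcp={gcp_resp.status_code}")
    except requests.RequestException as exc:
        print(f"GCP shadow call failed for {path}: {exc}")
    return aws_resp.json()  # users only ever see the AWS response
```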

Validate obsessively. For the database migration, we ran data validation scripts twice daily for a month before cutover. This paranoia caught a table partition mismatch during rehearsal that would have caused data loss in production.
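The validation scripts were nothing exotic. A simplified version, comparing per-table counts and the latest primary key between source and target (connection strings and table names are placeholders; run it during low-write windows so CDC lag doesn’t produce false alarms):

```python
# Compare row counts and max primary key per table between RDS and Cloud SQL.
import psycopg2

TABLES = ["orders", "payments", "users"]  # the compliance-critical set

def snapshot(dsn):
    results = {}
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for table in TABLES:
            cur.execute(f"SELECT count(*), coalesce(max(id), 0) FROM {table}")
            results[table] = cur.fetchone()
    return results

source = snapshot("postgresql://validator@rds-host/core")
target = snapshot("postgresql://validator@cloudsql-host/core")

for table in TABLES:
    marker = "OK   " if source[table] == target[table] else "DRIFT"
    print(f"{marker} {table}: source={source[table]} target={target[table]}")
```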

Budget more time than you think. We estimated 12 months. It took 18. Migrations always take longer than planned because you discover problems you didn’t know existed. Legacy systems have surprises buried in them.

The Cost of Migration

The migration itself wasn’t cheap. Here’s the brutal honesty about costs:

One-time migration costs:

  - The dedicated 10Gbps Megaport link between AWS and GCP
  - Striim licensing for the CDC replication window
  - 18 months of engineering time across the organization, plus paying for parallel infrastructure in both clouds during each cutover

Ongoing cost savings:

  - A single provider bill large enough to negotiate real volume discounts
  - No more funding three sets of operational tooling and platform expertise

The ROI was clearly positive, but you need to go into these migrations with realistic budget expectations. The tooling costs money. The engineering time costs money. But if you’re consolidating infrastructure, the long-term savings are worth it.

Where to Start

If you’re planning a similar multi-cloud migration, follow this sequence:

  1. Start with a comprehensive inventory of everything you need to migrate
  2. Migrate the database first if it’s the most critical component (learn the hard lessons early)
  3. Tackle legacy VMs next (these take longer than you expect)
  4. Migrate Kubernetes workloads using dual-cluster strategies
  5. Finally migrate supporting infrastructure and optimize

Don’t try to do everything at once. We did four major migrations sequentially over 18 months. Each one built on lessons from the previous one.

Read through the linked guides in the order listed above. Each guide includes real code, real configurations, real costs, and real war stories from when things went wrong.

This was the hardest project I’ve ever led, but also the most rewarding. We moved an entire production platform across cloud providers with minimal disruption to users. That’s something worth being proud of.

