How I Migrated an 8.5TB Database from AWS to GCP with Zero Data Loss

Aug 10, 2023

I was tasked with one of the most critical projects at a previous company: migrating our 8.5TB production PostgreSQL database from AWS RDS to Google Cloud SQL. This database served millions of users and was the backbone of our platform. The requirements were daunting: zero data loss and less than 30 minutes of downtime. I’m proud to say we pulled it off with only 15 minutes of planned downtime, and here’s exactly how I did it.

The Challenge I Faced

The CTO made the call to consolidate everything on GCP. Great for architecture simplicity, terrible for me because it meant migrating our massive production database from AWS RDS to Cloud SQL. This wasn’t just any database. It was 8.5TB of critical user data handling 12,000 writes per second at peak. Every transaction mattered because of compliance requirements. Lose a single transaction and we’d fail our next audit.

Zero data loss and under 30 minutes of downtime left almost no margin for error. For context, a simple pg_dump of 8.5TB over the public internet would take days, maybe a week. The database was too big to snapshot and restore quickly, and we couldn't pause writes for hours while data transferred. It seemed impossible until I started researching modern migration techniques.

The Solution: Fast Network Plus Real-Time Replication

After weeks of research and testing, I landed on a two-part solution. First, I needed a dedicated high-speed network connection between AWS and GCP. The public internet wouldn’t cut it for 8.5TB. Second, I needed real-time database replication so we could keep the target database in sync with the source right up until cutover.

Megaport provided the network solution. They offer dedicated 10Gbps links between cloud providers. I set one up connecting our AWS region to our GCP region. The setup cost was about $2,000 and monthly fees were $800, but for a one-time migration, we could eat that cost. Initial testing showed we could sustain 9+ Gbps with under 5ms latency. That meant we could transfer 8.5TB in under a day instead of a week.
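
To put rough numbers on that (my own back-of-the-envelope, not part of the original capacity planning): at a sustained 9 Gbps the raw wire time for 8.5TB is only about two hours, so the practical limit becomes how fast you can dump and restore, not the link itself.

```python
# Back-of-the-envelope transfer math (illustrative figures, not a benchmark).
db_size_bytes = 8.5 * 10**12        # 8.5 TB
link_bits_per_sec = 9 * 10**9       # sustained throughput measured on the link

wire_hours = db_size_bytes * 8 / link_bits_per_sec / 3600
print(f"raw wire time: {wire_hours:.1f} hours")            # ~2.1 hours

# The eventual 18-hour bulk load (described below) implies the dump/restore
# pipeline, not the network, was the effective bottleneck:
effective_mb_per_sec = db_size_bytes / (18 * 3600) / 10**6
print(f"effective end-to-end throughput: {effective_mb_per_sec:.0f} MB/s")  # ~131 MB/s
```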

Striim provided the replication solution. Striim does Change Data Capture (CDC) by reading the PostgreSQL write-ahead log and replaying those changes to a target database in real time. This meant I could do the initial bulk load, then Striim would keep the databases in sync as transactions continued to flow. The replication lag was typically under 2 seconds, which was perfect for our needs.
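
Striim's pipeline is configured through its own tooling, but the primitive it builds on is PostgreSQL logical decoding: create a replication slot and stream decoded changes out of the WAL. Here is a minimal illustration of that idea, not Striim's actual API; the slot name and connection string are placeholders, and on RDS you also need the rds.logical_replication parameter enabled.

```python
# Minimal illustration of WAL-based change capture via logical decoding.
# This is NOT Striim's API, just the PostgreSQL primitive the approach rests on.
# Slot name and DSN are placeholders; RDS needs rds.logical_replication = 1.
import psycopg2

SOURCE_DSN = "host=source-rds.example.com dbname=app user=replicator password=..."

conn = psycopg2.connect(SOURCE_DSN)
conn.autocommit = True
cur = conn.cursor()

# One-time setup: a slot that decodes WAL into readable change records.
cur.execute(
    "SELECT pg_create_logical_replication_slot(%s, %s)",
    ("migration_demo_slot", "test_decoding"),
)

# Pull whatever changes have accumulated since the last call. A real CDC
# pipeline parses these records and applies them transactionally on the target.
cur.execute(
    "SELECT lsn, xid, data FROM pg_logical_slot_get_changes(%s, NULL, NULL)",
    ("migration_demo_slot",),
)
for lsn, xid, change in cur.fetchall():
    print(lsn, xid, change)  # e.g. "table public.users: UPDATE: id[integer]:42 ..."
```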

```mermaid
graph LR
    subgraph AWS
        RDS[(RDS PostgreSQL<br/>Source DB)]
    end
    subgraph Network["Megaport 10Gbps Link"]
        Striim[Striim CDC Replication]
    end
    subgraph GCP
        CloudSQL[(Cloud SQL<br/>Target DB)]
    end

    RDS -->|Real-time Sync| Striim
    Striim --> CloudSQL

    style AWS fill:#ff9900,color:#fff
    style Network fill:#34a853,color:#fff
    style GCP fill:#4285f4,color:#fff
```

How The Migration Actually Went

Week 1: Setting Up The Network

Getting the Megaport connection established took longer than I hoped: paperwork, cross-cloud coordination, BGP configuration. But once it was up, the bandwidth tests were incredible. We were pushing 9.5 Gbps sustained with 4ms latency. At that point the link was no longer the bottleneck; even with dump and restore overhead on top of the raw transfer, I estimated the full load at about 20 hours.

Week 2-3: The Initial Bulk Load

I wrote Python scripts to parallelize pg_dump across our 450 tables. I split them into 16 batches based on size and kicked off all dumps simultaneously. Each dump streamed over the Megaport link directly into Cloud SQL. The parallelization plus the fat network pipe meant the entire 8.5TB initial load completed in 18 hours. I watched the progress obsessively, expecting something to fail. Nothing did.
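
Those scripts are long gone, but the shape of them was roughly this. It's a simplified sketch: hosts, credentials, and table names are placeholders, and the real version also handled schema, sequences, and retries.

```python
# Simplified sketch of the parallel bulk load: each worker streams one table's
# data from RDS straight into Cloud SQL over the Megaport link with pg_dump | psql.
# Hosts, credentials, and table names are placeholders; the schema was loaded
# onto the target ahead of time.
import subprocess
from concurrent.futures import ThreadPoolExecutor

SOURCE = "postgresql://migrator:secret@source-rds.example.com/app"
TARGET = "postgresql://migrator:secret@cloudsql-proxy.internal/app"
PARALLELISM = 16  # matched the number of batches we ran at once

def copy_table(table: str) -> str:
    # --data-only because the schema already exists on the target.
    cmd = f"pg_dump --data-only --table={table} '{SOURCE}' | psql '{TARGET}'"
    subprocess.run(cmd, shell=True, check=True)
    return table

# In reality: all 450 tables, ordered largest-first so the big ones start early.
tables = ["users", "orders", "events"]

with ThreadPoolExecutor(max_workers=PARALLELISM) as pool:
    for done in pool.map(copy_table, tables):
        print(f"finished {done}")
```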

Week 4: Real-Time Sync and Validation

Once the initial data was loaded, I started the Striim CDC pipeline. Every write to the AWS RDS database got replicated to Cloud SQL within 1-2 seconds. The replication lag dashboard became my new obsession. I checked it a hundred times a day.
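
Striim exposes its own lag metrics, but I also liked watching the raw view on the source: how many WAL bytes the slot still had to consume. A rough version of that check (connection details are placeholders):

```python
# Rough source-side lag check: how many WAL bytes the CDC slot has yet to
# confirm. Connection details are placeholders; the CDC tool's own dashboard
# was the primary view, this was just a sanity check against the source.
import time
import psycopg2

SOURCE_DSN = "host=source-rds.example.com dbname=app user=monitor password=..."

LAG_SQL = """
SELECT slot_name,
       coalesce(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn), 0) AS lag_bytes
FROM pg_replication_slots
WHERE slot_type = 'logical';
"""

with psycopg2.connect(SOURCE_DSN) as conn, conn.cursor() as cur:
    while True:
        cur.execute(LAG_SQL)
        for slot, lag_bytes in cur.fetchall():
            print(f"{slot}: {lag_bytes / 1e6:.1f} MB behind")
        time.sleep(10)
```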

I also ran validation scripts twice daily. Row counts per table, random data sampling, checksum comparisons. Everything matched perfectly. After a week of this, I was confident the replication was solid.
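
The validation scripts followed a simple pattern: compare a row count and an order-independent checksum for every table on both sides. A trimmed-down sketch, with placeholder tables and credentials (the real run covered all 450 tables):

```python
# Sketch of the validation pass: row count plus an order-independent checksum
# per table, compared between source and target. Tables and DSNs are placeholders.
import psycopg2

SOURCE_DSN = "host=source-rds.example.com dbname=app user=validator password=..."
TARGET_DSN = "host=cloudsql-proxy.internal dbname=app user=validator password=..."

# count(*) plus a sum of per-row hashes, so physical row order doesn't matter.
CHECK_SQL = """
SELECT count(*),
       coalesce(sum(('x' || substr(md5(t::text), 1, 16))::bit(64)::bigint), 0)
FROM {table} t
"""

def fingerprint(dsn: str, table: str):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(CHECK_SQL.format(table=table))
        return cur.fetchone()  # (row_count, checksum)

tables = ["users", "orders", "events"]  # in reality, every one of the 450 tables

for table in tables:
    src, dst = fingerprint(SOURCE_DSN, table), fingerprint(TARGET_DSN, table)
    status = "OK" if src == dst else "MISMATCH"
    print(f"{table}: source={src} target={dst} -> {status}")
```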

Week 5: Cutover Rehearsals

I rehearsed the cutover five times in our staging environment. Each rehearsal taught us something. The first time took 45 minutes because I hadn’t automated enough steps. By the fifth rehearsal, we had it down to 12 minutes. The cutover script handled everything: stop writes, wait for CDC lag to hit zero, run final validation, update Kubernetes configs to point to Cloud SQL, resume writes.
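
The skeleton of that cutover script looked roughly like this; every helper below is a placeholder for a step described above, not the real implementation.

```python
# Skeleton of the cutover script. Every helper is a placeholder for a step
# described above, not the real implementation.
import time

def stop_application_writes():
    """Placeholder: put the app into maintenance mode so no new writes land."""

def cdc_lag_bytes() -> int:
    """Placeholder: remaining CDC lag, e.g. from the slot query shown earlier."""
    return 0

def run_final_validation() -> bool:
    """Placeholder: re-run the row-count/checksum comparison one last time."""
    return True

def point_kubernetes_at_cloudsql():
    """Placeholder: update the DB connection secrets/config and roll the pods."""

def resume_application_writes():
    """Placeholder: take the app out of maintenance mode."""

def cutover():
    stop_application_writes()              # 1. quiesce the source
    while cdc_lag_bytes() > 0:             # 2. let the CDC pipeline drain
        time.sleep(1)
    if not run_final_validation():         # 3. abort while the source is still primary
        raise RuntimeError("validation failed, aborting cutover")
    point_kubernetes_at_cloudsql()         # 4. flip traffic to Cloud SQL
    resume_application_writes()            # 5. writes now land on the new database

if __name__ == "__main__":
    cutover()
```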

The Actual Cutover

We scheduled a maintenance window on Sunday at 2 AM. The entire engineering team was on Slack watching. I kicked off the cutover script and watched the logs scroll by. Stop writes. CDC lag dropping. Zero lag reached. Validation passed. Kubernetes updated. Resume writes.

Fifteen minutes. Zero errors. The new database was serving production traffic and nobody even noticed the switch.

What We Achieved

The Monday morning after the migration, I sent the CTO a summary email with the results. He forwarded it to the entire company with the subject line “This is how you execute.”

Zero data loss. We verified 100% data integrity through multiple validation passes. Every single row, every single transaction, perfectly replicated.

Fifteen minutes of downtime. We beat our 30-minute target by half. Most users didn’t even notice because it happened at 2 AM.

The entire 8.5TB transferred in 18 hours for the initial load. Without Megaport, this would have taken a week or more.

Database costs dropped by 25% monthly. Cloud SQL was cheaper than our equivalent RDS configuration, and we got better performance as a bonus.

Query performance improved by about 30% on average. Cloud SQL's newer hardware and better disk I/O made a noticeable difference.

What I Learned

The dedicated network wasn’t optional; it was essential. The $2,800 we spent on Megaport saved us days of transfer time and made the whole project feasible. Sometimes you have to spend money to execute properly.

Validate obsessively at every step. We ran checksums and row counts before, during, and after the migration. This paranoia caught a table partition mismatch during rehearsal number three that would have caused data loss in production. Better to find it in staging.

Rehearse until you’re bored. Five cutover rehearsals felt excessive at the time, but they were the reason the actual cutover went perfectly. By rehearsal five, we had muscle memory for every step and had automated away all the manual error-prone parts.

CDC replication is amazing for database migrations. Striim let us keep the databases in sync with only a couple of seconds of lag while the source stayed online. This is how you migrate multi-terabyte databases with minimal downtime. The technology exists; you just have to use it.
