A Case Study of a 6-Month Migration of 1,000+ Workloads from AKS to GKE

Jan 01, 2024

At Aliz, a Google Cloud Premier Partner, I had the opportunity to lead a massive project as Cloud Architect: migrating over 1,000 production workloads from Azure Kubernetes Service (AKS) to Google Kubernetes Engine (GKE) across five Kubernetes clusters. It was a complex, high-stakes migration, and we completed it within the 6-month timeline, 20% under budget, and with minimal downtime.

Our client, a major enterprise, came to us with a significant challenge: they needed to consolidate their multi-cloud Kubernetes infrastructure. This was no small task. We were looking at a migration of over 1,000 production workloads spread across 5 clusters, with immense complexity arising from a mix of stateful and stateless applications. On top of that, we had to operate with minimal disruption to their production services, under a tight 6-month deadline with strict budget constraints, and with a zero-tolerance policy for data loss, all while maintaining security and compliance. Coordinating our cross-functional teams across multiple time zones added another layer to the challenge.

Migration Overview

```mermaid
graph LR
    subgraph Phase1["Phase 1: Assessment (Month 1)"]
        A1[Workload Inventory]
        A2[Risk Assessment]
        A3[Pilot Migration]
    end

    subgraph Phase2["Phase 2: Infrastructure (Month 2)"]
        B1[GKE Provisioning]
        B2[Network Setup]
        B3[Security Config]
    end

    subgraph Phase3["Phase 3: Migration (Months 3-5)"]
        C1[Wave 1: Non-Critical]
        C2[Wave 2: Stateless]
        C3[Wave 3: Stateful]
        C4[Wave 4: Critical Services]
    end

    subgraph Phase4["Phase 4: Optimization (Month 6)"]
        D1[Performance Tuning]
        D2[Cost Optimization]
        D3[AKS Decommission]
    end

    Phase1 --> Phase2
    Phase2 --> Phase3
    Phase3 --> Phase4

    style Phase1 fill:#e1f5ff
    style Phase2 fill:#fff4e6
    style Phase3 fill:#f3e5f5
    style Phase4 fill:#e8f5e9
```

To tackle this, we devised a phased approach. We started with a month of assessment and planning, where we took a full inventory of all AKS workloads and their dependencies, conducted a risk assessment to create a migration priority matrix, and ran a pilot migration with non-critical workloads to test our strategy. The second month was dedicated to infrastructure setup: using Terraform, we provisioned the new GKE clusters, established network connectivity between Azure and GCP, and configured all the necessary security and compliance settings.

Then came the core migration phase, which we spread over three months in carefully planned waves. We started with non-critical workloads, moved to stateless applications, then tackled the complex stateful ones, and finally migrated the most critical services using a blue-green deployment strategy for safety. Throughout this, we were continuously testing and validating. The final month was for optimization and cleanup, where we fine-tuned performance, optimized costs, decommissioned the old AKS clusters, and handed everything over with comprehensive documentation.
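
To give a sense of the assessment phase, here is a minimal sketch of the kind of script that can build a workload inventory from a source cluster. The kubeconfig context name and CSV layout are illustrative assumptions, not our actual tooling.

```bash
#!/usr/bin/env bash
# Illustrative inventory sketch: list every Deployment and StatefulSet
# in one AKS cluster as CSV rows. The context name is a placeholder.
set -euo pipefail

CONTEXT="aks-prod-1"  # assumed kubeconfig context for a source cluster

echo "kind,namespace,name,replicas,image"
for KIND in deployment statefulset; do
  kubectl --context "$CONTEXT" get "$KIND" --all-namespaces -o json |
    jq -r --arg kind "$KIND" '
      .items[]
      | [$kind, .metadata.namespace, .metadata.name,
         (.spec.replicas // 0), .spec.template.spec.containers[0].image]
      | @csv'
done
```

Output like this can feed directly into a migration priority matrix, sorting workloads by statefulness and blast radius.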

For the technical implementation, we heavily relied on automation. We implemented automated deployment pipelines using a combination of Helm and Jenkins, which accelerated our deployment speed by a staggering 60%. By standardizing our application packaging with Helm charts, we could create reusable Jenkins pipelines for different workload types. We also embraced Infrastructure as Code, using Terraform for all GKE cluster provisioning. This allowed us to version-control our infrastructure definitions, automate environment replication, and ensure consistent configuration across all five clusters.
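
As a hedged sketch of what one such pipeline stage might boil down to (the chart path, registry URL, and release name below are placeholders, not the client's actual setup):

```bash
#!/usr/bin/env bash
# Sketch of a package-and-deploy step a Jenkins stage could shell out to.
# Chart path, registry URL, and release name are placeholders.
set -euo pipefail

CHART_DIR="charts/web-service"     # standardized Helm chart template
VERSION="1.4.${BUILD_NUMBER:-0}"   # Jenkins injects BUILD_NUMBER
REGISTRY="oci://europe-docker.pkg.dev/my-project/helm-charts"

helm lint "$CHART_DIR"
helm package "$CHART_DIR" --version "$VERSION"
helm push "web-service-${VERSION}.tgz" "$REGISTRY"

# Deploy with environment-specific value overrides; --atomic rolls the
# release back automatically if it fails to become healthy.
helm upgrade --install web-service "${REGISTRY}/web-service" \
  --version "$VERSION" \
  --namespace web --create-namespace \
  -f values/gke-prod.yaml \
  --atomic --timeout 10m
```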

Migration Execution

Cluster Design

GKE Architecture

- 5 Production Clusters:
  - Regional clusters for high availability
  - Autoscaling node pools (CPU + GPU workloads)
  - Workload separation: Frontend, Backend, Data Processing, ML, Batch

- Network Configuration:
  - VPC peering for inter-cluster communication
  - Cloud Interconnect for hybrid connectivity
  - Private clusters with authorized networks

- Security:
  - Workload Identity for service authentication
  - Binary Authorization for image verification
  - Network policies for pod-to-pod security
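
To make the design above concrete, here is a rough gcloud equivalent of one cluster from our Terraform modules; the project, region, and CIDR values are placeholders, and the actual provisioning was done entirely in Terraform:

```bash
# Illustrative only: roughly what one Terraform-managed cluster amounts to.
# Regional control plane for HA, private nodes behind authorized networks,
# Workload Identity, and network policy enforcement.
gcloud container clusters create gke-frontend \
  --project my-project \
  --region europe-west1 \
  --enable-ip-alias \
  --enable-private-nodes \
  --master-ipv4-cidr 172.16.0.0/28 \
  --enable-master-authorized-networks \
  --master-authorized-networks 10.0.0.0/8 \
  --workload-pool my-project.svc.id.goog \
  --enable-network-policy \
  --num-nodes 3 \
  --enable-autoscaling --min-nodes 1 --max-nodes 10
# Binary Authorization is enabled separately; the exact flag depends on
# the gcloud version (e.g. --binauthz-evaluation-mode on recent releases).
```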

For stateless applications, the process was relatively straightforward. We used DNS-based traffic shifting: the apps ran in parallel on both clouds during the cutover, and we gradually shifted traffic while monitoring performance, with a clear rollback path at every stage. Stateful applications were more complex. We set up database replication between Azure and GCP and used cloud-native tools for storage migration, validating data consistency before each coordinated cutover, which we scheduled during low-traffic periods to minimize downtime. We also took the opportunity to design a modern microservices architecture around a service mesh. This greatly improved observability and traffic management, added mTLS between services, and set the client up for canary deployments and A/B testing in the future.
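
One way to express that gradual shift, assuming Cloud DNS weighted round-robin routing policies (our exact DNS tooling may have differed; the zone, hostname, and IPs are placeholders, and flag syntax varies by gcloud version):

```bash
# Sketch of a weighted DNS cutover: start with ~10% of traffic on GKE.
# Zone name, hostname, and the TEST-NET IPs below are placeholders.
gcloud dns record-sets create app.example.com. \
  --zone prod-zone \
  --type A \
  --ttl 60 \
  --routing-policy-type WRR \
  --routing-policy-data "90.0=203.0.113.10;10.0=198.51.100.20"

# As GKE metrics hold steady, re-run with 50/50 and then 100/0 weights
# via `gcloud dns record-sets update`, keeping the AKS record as rollback.
```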

Automation & CI/CD

To accelerate the migration, we enhanced our Jenkins pipelines to automate workload discovery and migration, enabling parallel deployments across clusters with built-in health checks and validation gates for automated rollbacks on failure. We also created organization-wide Helm chart templates to simplify application packaging and enable versioned deployments with easy rollbacks through environment-specific value overrides. For monitoring and validation, we built real-time migration dashboards to track progress, monitor workload health across both clouds, compare performance between AKS and GKE, and keep an eye on cost metrics.
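
A simplified version of one such validation gate, with hypothetical release and namespace names: deploy, require the rollout to become healthy within a deadline, and revert to the previous Helm revision otherwise.

```bash
#!/usr/bin/env bash
# Simplified health-check gate of the kind our pipelines ran per workload.
# Release, chart, and namespace names are placeholders.
set -euo pipefail

RELEASE="orders-api"
NAMESPACE="backend"

helm upgrade --install "$RELEASE" charts/orders-api \
  --namespace "$NAMESPACE" --create-namespace -f values/gke-prod.yaml

# Validation gate: the rollout must go healthy within five minutes,
# otherwise roll back to the previous Helm revision and fail the stage.
if ! kubectl rollout status "deployment/$RELEASE" \
      --namespace "$NAMESPACE" --timeout 300s; then
  echo "Health check failed; rolling back ${RELEASE}" >&2
  helm rollback "$RELEASE" --namespace "$NAMESPACE"
  exit 1
fi
```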

Key Achievements

The results speak for themselves. We successfully migrated more than 1,000 production applications to 5 new GKE clusters, all within the 6-month deadline and 20% under budget. We achieved this with minimal disruption to production services and accelerated future deployment speed by 60% thanks to our new Helm and Jenkins pipelines.

On the technical side, we saw significant improvements in scalability with GKE’s autoscaling, enhanced security with GCP-native features, better observability through Cloud Operations, and a 35% reduction in operational costs. For the business, this translated to improved SLA compliance, faster team velocity, significant cost savings, and reduced risk by eliminating the complexity of their previous multi-cloud setup.

Technical Challenges & Solutions

Of course, a project of this scale came with its share of technical hurdles. One of the biggest was migrating databases and persistent storage without downtime. We solved this by implementing continuous data replication between clouds and using read replicas for a gradual traffic shift, all within carefully coordinated low-traffic cutover windows. Establishing secure and reliable network connectivity between Azure and GCP was another challenge, which we addressed by deploying Cloud Interconnect for a dedicated link, backed up by a VPN for redundancy. The sheer diversity of more than 1,000 workloads required a methodical approach; we created a workload categorization matrix and developed specific migration playbooks for each category. Finally, managing our cross-functional teams across different geographies and time zones demanded clear communication protocols, which we facilitated with daily standups and our detailed migration dashboards.
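
As one concrete illustration of those cutover safeguards, assuming a PostgreSQL source (the actual data stores varied), a pre-cutover gate could verify replication lag before writes are frozen:

```bash
#!/usr/bin/env bash
# Hedged sketch of a pre-cutover replication check, assuming PostgreSQL.
# Host, credentials, and the lag threshold are placeholders.
set -euo pipefail

REPLICA_HOST="gke-replica.example.internal"
MAX_LAG_SECONDS=5

# Seconds since the replica last replayed a transaction from the primary.
LAG=$(psql -h "$REPLICA_HOST" -U migrator -d appdb -tA -c \
  "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0);")

if (( $(echo "$LAG < $MAX_LAG_SECONDS" | bc -l) )); then
  echo "Replica lag ${LAG}s: safe to begin cutover."
else
  echo "Replica lag ${LAG}s exceeds ${MAX_LAG_SECONDS}s: aborting." >&2
  exit 1
fi
```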

Technology Stack

- Container Orchestration: Kubernetes on GKE (target) and AKS (source)
- Infrastructure as Code: Terraform
- CI/CD: Jenkins pipelines with Helm packaging
- Service Mesh & Networking: Istio/Anthos Service Mesh, VPC peering, Cloud Interconnect, VPN
- Observability: Google Cloud Operations, real-time migration dashboards
- Security: Workload Identity, Binary Authorization, Kubernetes network policies, mTLS

Architecture Diagrams

Multi-Cloud Migration Architecture

```mermaid
graph LR
    subgraph Azure["Azure Cloud (Source)"]
        AKS1[AKS Cluster 1<br/>Production]
        AKS2[AKS Cluster 2<br/>Services]
        AKS3[AKS Cluster 3<br/>Data]
        AKS4[AKS Cluster 4<br/>...]
        AKS5[AKS Cluster 5<br/>ML]
    end

    subgraph Network["Migration Network"]
        CI[Cloud Interconnect<br/>Dedicated Connection]
        VPN[VPN Backup<br/>Redundancy]
    end

    subgraph GCP["Google Cloud Platform (Target)"]
        GKE1[GKE Cluster 1<br/>Frontend]
        GKE2[GKE Cluster 2<br/>Backend]
        GKE3[GKE Cluster 3<br/>Data Processing]
        GKE4[GKE Cluster 4<br/>ML Workloads]
        GKE5[GKE Cluster 5<br/>Batch Jobs]

        subgraph ServiceMesh["Service Mesh Layer (Istio/ASM)"]
            SM1[Traffic Management]
            SM2[mTLS Security]
            SM3[Observability]
        end
    end

    AKS1 & AKS2 & AKS3 & AKS4 & AKS5 --> CI
    AKS1 & AKS2 & AKS3 & AKS4 & AKS5 --> VPN
    CI --> GKE1 & GKE2 & GKE3 & GKE4 & GKE5
    VPN -.Backup.-> GKE1 & GKE2 & GKE3 & GKE4 & GKE5

    ServiceMesh -.Manages.-> GKE1 & GKE2 & GKE3 & GKE4 & GKE5

    style Azure fill:#0078d4,color:#fff
    style GCP fill:#4285f4,color:#fff
    style Network fill:#34a853,color:#fff
    style ServiceMesh fill:#fbbc04,color:#000
```

CI/CD Pipeline Flow

```mermaid
graph LR
    subgraph Source["Source Code"]
        Git[Git Repository]
    end

    subgraph CICD["CI/CD Pipeline (Jenkins)"]
        Build[Build & Test]
        Helm[Helm Package]
        Push[Push to Registry]
    end

    subgraph Deploy["Deployment"]
        Dev[Dev Cluster]
        Staging[Staging Cluster]
        Prod[Production Cluster]
    end

    Git -->|Webhook| Build
    Build -->|Success| Helm
    Helm --> Push
    Push -->|Auto Deploy| Dev
    Dev -->|Tests Pass| Staging
    Staging -->|Manual Approve| Prod

    style Source fill:#e1f5ff
    style CICD fill:#fff4e6
    style Deploy fill:#e8f5e9
```

Traffic Migration Strategy

```mermaid
graph LR
    Start[Start Migration]

    subgraph Phase1["Phase 1: Deploy to GKE"]
        Deploy[Deploy Blue Environment<br/>on GKE]
        Health[Health Checks Pass]
    end

    subgraph Phase2["Phase 2: Gradual Traffic Shift"]
        T10[10% Traffic to GKE<br/>90% to AKS]
        Monitor1[Monitor Metrics]
        T50[50% Traffic to GKE<br/>50% to AKS]
        Monitor2[Compare Performance]
    end

    subgraph Phase3["Phase 3: Complete Migration"]
        T100[100% Traffic to GKE]
        Verify[Verify Stability]
        Decom[Decommission AKS]
    end

    Start --> Deploy
    Deploy --> Health
    Health --> T10
    T10 --> Monitor1
    Monitor1 --> T50
    T50 --> Monitor2
    Monitor2 --> T100
    T100 --> Verify
    Verify --> Decom

    style Phase1 fill:#e1f5ff
    style Phase2 fill:#fff4e6
    style Phase3 fill:#e8f5e9
```

Impact & Results

Ultimately, this migration was a huge success. We eliminated the operational complexity of a multi-cloud setup, reduced costs by 35%, and established a modern, scalable Kubernetes foundation for the client. By implementing standardized Helm and Jenkins pipelines, we accelerated their future deployment velocity by 60%, empowering them to move faster while maintaining top-tier reliability and security.

Lessons Learned

This project reinforced several key lessons for me. First, plan extensively, but execute incrementally; our detailed planning and wave-based execution gave us confidence. Second, automate everything possible—our Helm + Jenkins automation was the only way to manage 1,000+ workloads efficiently. Communication is paramount, especially with distributed teams. Always maintain a tested rollback plan for every stage. Starting with non-critical workloads is a great way to validate the approach and build team confidence. Infrastructure as Code is non-negotiable for consistency, and finally, real-time monitoring provides the visibility needed to make confident decisions under pressure. It was a powerful reminder that successful migrations require a blend of technical expertise, project management discipline, and a methodical approach to risk management.
