At Aliz, a Google Cloud Premier Partner, I had the opportunity to serve as Cloud Architect on a massive project: migrating more than 1,000 production workloads from Azure Kubernetes Service (AKS) to Google Kubernetes Engine (GKE) across five Kubernetes clusters. It was a complex, high-stakes migration that we completed within the 6-month timeline, 20% under budget, and with minimal downtime.
Our client, a major enterprise, came to us with a significant challenge: they needed to consolidate their multi-cloud Kubernetes infrastructure. This was no small task. We were looking at a migration of more than 1,000 production workloads spread across five clusters, made more complex by a mix of stateful and stateless applications. On top of that, we had to operate with minimal disruption to production services, meet a tight 6-month deadline under strict budget constraints, and uphold a zero-tolerance policy for data loss while maintaining security and compliance. Coordinating cross-functional teams across multiple time zones added yet another layer to the challenge.
Migration Overview
graph LR
subgraph Phase1["Phase 1: Assessment (Month 1)"]
A1[Workload Inventory]
A2[Risk Assessment]
A3[Pilot Migration]
end
subgraph Phase2["Phase 2: Infrastructure (Month 2)"]
B1[GKE Provisioning]
B2[Network Setup]
B3[Security Config]
end
subgraph Phase3["Phase 3: Migration (Months 3-5)"]
C1[Wave 1: Non-Critical]
C2[Wave 2: Stateless]
C3[Wave 3: Stateful]
C4[Wave 4: Critical Services]
end
subgraph Phase4["Phase 4: Optimization (Month 6)"]
D1[Performance Tuning]
D2[Cost Optimization]
D3[AKS Decommission]
end
Phase1 --> Phase2
Phase2 --> Phase3
Phase3 --> Phase4
style Phase1 fill:#e1f5ff
style Phase2 fill:#fff4e6
style Phase3 fill:#f3e5f5
style Phase4 fill:#e8f5e9
To tackle this, we devised a phased approach. We started with a month of assessment and planning, where we took a full inventory of all AKS workloads and their dependencies, conducted a risk assessment to create a migration priority matrix, and ran a pilot migration with non-critical workloads to test our strategy. The second month was dedicated to infrastructure setup. Using Terraform, we provisioned the new GKE clusters, established network connectivity between Azure and GCP, and configured all the necessary security and compliance settings. Then came the core migration phase, which we spread over three months in carefully planned waves. We started with non-critical workloads, moved to stateless applications, then tackled the complex stateful ones, and finally migrated the most critical services using a blue-green deployment strategy for safety. Throughout this, we were continuously testing and validating. The final month was for optimization and cleanup, where we fine-tuned performance, optimized costs, decommissioned the old AKS clusters, and handed everything over with comprehensive documentation.
For the technical implementation, we heavily relied on automation. We implemented automated deployment pipelines using a combination of Helm and Jenkins, which accelerated our deployment speed by a staggering 60%. By standardizing our application packaging with Helm charts, we could create reusable Jenkins pipelines for different workload types. We also embraced Infrastructure as Code, using Terraform for all GKE cluster provisioning. This allowed us to version-control our infrastructure definitions, automate environment replication, and ensure consistent configuration across all five clusters.
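To give a sense of what that standardization looked like, here is a minimal sketch of the kind of shared values.yaml interface such a common Helm chart might expose; the registry path, chart keys, and defaults below are illustrative rather than the client's actual configuration.

```yaml
# values.yaml -- illustrative defaults for a shared application chart.
# Every team fills in the same small set of knobs; the chart templates
# render the Deployment, Service, and HPA from these values.
image:
  repository: europe-docker.pkg.dev/example-project/apps/my-service  # hypothetical registry path
  tag: "1.0.0"
  pullPolicy: IfNotPresent

replicaCount: 2

service:
  type: ClusterIP
  port: 80
  targetPort: 8080

resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

autoscaling:
  enabled: false            # turned on per environment via value overrides
  minReplicas: 2
  maxReplicas: 4
  targetCPUUtilizationPercentage: 70

probes:
  readinessPath: /healthz   # the chart turns these into readiness/liveness probes
  livenessPath: /healthz
```

Because every workload exposed the same small interface, a single parameterized Jenkins pipeline could package, version, and deploy any of them.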
Migration Execution
Cluster Design
GKE Architecture
- 5 Production Clusters:
  - Regional clusters for high availability
  - Autoscaling node pools (CPU + GPU workloads)
  - Workload separation: Frontend, Backend, Data Processing, ML, Batch
- Network Configuration:
  - VPC peering for inter-cluster communication
  - Cloud Interconnect for hybrid connectivity
  - Private clusters with authorized networks
- Security:
  - Workload Identity for service authentication
  - Binary Authorization for image verification
  - Network policies for pod-to-pod security (see the sketch below)
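To illustrate the last point, here is a minimal sketch of the kind of Kubernetes NetworkPolicy used for pod-to-pod restrictions; the namespaces, labels, and port are hypothetical.

```yaml
# Allow the backend API pods to accept traffic only from frontend pods,
# and only on the application port; all other ingress is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend-only   # hypothetical name
  namespace: backend                  # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: backend-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              team: frontend
          podSelector:
            matchLabels:
              app: frontend-web
      ports:
        - protocol: TCP
          port: 8080
```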
For stateless applications, the process was relatively straightforward. We used DNS-based traffic shifting: the apps ran in parallel on both clouds during the cutover, and we gradually moved traffic while monitoring performance, with a clear rollback path at every stage. Stateful applications were more complex. We set up database replication between Azure and GCP, used cloud-native tools for storage migration, and validated data consistency before each coordinated cutover, which we scheduled during low-traffic periods to minimize downtime. We also took the opportunity to adopt a modern microservices architecture built on a service mesh, which greatly improved observability, traffic management, and security with mTLS between services, and laid the groundwork for canary deployments and A/B testing in the future.
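To show what that mesh-level traffic control looks like, here is a minimal sketch of an Istio/ASM weighted routing rule of the sort used for gradual shifts and canary releases; the service name, namespace, and version labels are illustrative.

```yaml
# Route 90% of in-mesh traffic to the stable version and 10% to the canary.
# Shifting traffic is just a matter of adjusting the weights.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: orders-traffic-split          # hypothetical service
  namespace: backend
spec:
  hosts:
    - orders.backend.svc.cluster.local
  http:
    - route:
        - destination:
            host: orders.backend.svc.cluster.local
            subset: stable
          weight: 90
        - destination:
            host: orders.backend.svc.cluster.local
            subset: canary
          weight: 10
---
# Subsets map to pod labels, so "stable" and "canary" are simply two Deployments.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: orders-subsets
  namespace: backend
spec:
  host: orders.backend.svc.cluster.local
  subsets:
    - name: stable
      labels:
        version: v1
    - name: canary
      labels:
        version: v2
```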
Automation & CI/CD
To accelerate the migration, we enhanced our Jenkins pipelines to automate workload discovery and migration, enabling parallel deployments across clusters with built-in health checks and validation gates that triggered automated rollbacks on failure. We also created organization-wide Helm chart templates that simplified application packaging, supported versioned deployments with easy rollbacks, and handled per-environment configuration through value overrides. For monitoring and validation, we built real-time migration dashboards to track progress, monitor workload health across both clouds, compare performance between AKS and GKE, and keep an eye on cost metrics.
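As a simplified, hypothetical example of the override pattern, a production values file would layer only the environment-specific settings on top of the shared chart defaults sketched earlier:

```yaml
# values-prod.yaml -- production-only overrides; everything else comes from
# the chart defaults. Applied with something like:
#   helm upgrade --install orders ./charts/app -f values-prod.yaml
replicaCount: 6

image:
  tag: "1.14.2"            # pinned version, promoted by the pipeline after staging passes

resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: 1Gi

autoscaling:
  enabled: true
  minReplicas: 6
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
```

Rolling back is then a matter of reverting to the previous Helm release revision (for example with `helm rollback`), which pairs naturally with the automated rollback gates described above.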
Key Achievements
The results speak for themselves. We successfully migrated more than 1,000 production applications to five new GKE clusters, all within the 6-month deadline and 20% under budget. We achieved this with minimal disruption to production services and accelerated future deployment speed by 60% thanks to our new Helm and Jenkins pipelines.
On the technical side, we saw significant improvements in scalability with GKE’s autoscaling, enhanced security with GCP-native features, better observability through Cloud Operations, and a 35% reduction in operational costs. For the business, this translated to improved SLA compliance, faster team velocity, significant cost savings, and reduced risk by eliminating the complexity of their previous multi-cloud setup.
Technical Challenges & Solutions
Of course, a project of this scale came with its share of technical hurdles. One of the biggest was migrating databases and persistent storage without downtime. We solved this by implementing continuous data replication between clouds and using read replicas for a gradual traffic shift, all within carefully coordinated low-traffic cutover windows. Establishing secure and reliable network connectivity between Azure and GCP was another challenge, which we addressed by deploying Cloud Interconnect for a dedicated link, backed by a VPN for redundancy. The sheer diversity of more than 1,000 workloads demanded a methodical approach: we created a workload categorization matrix and developed a specific migration playbook for each category. Finally, managing cross-functional teams across different geographies and time zones required clear communication protocols, which we reinforced with daily standups and our detailed migration dashboards.
Technology Stack
Container Orchestration
- Google Kubernetes Engine (GKE)
- Azure Kubernetes Service (AKS - source)
- Kubernetes 1.27+
Infrastructure as Code
- Terraform
- Helm 3
- Kustomize
CI/CD
- Jenkins
- GitHub / GitLab
- ArgoCD for GitOps
Service Mesh & Networking
- Istio / Anthos Service Mesh
- Cloud Load Balancing
- Cloud Interconnect
Observability
- Cloud Monitoring (formerly Stackdriver)
- Cloud Logging
- Prometheus & Grafana
Security
- Workload Identity
- Binary Authorization
- Security Command Center
- KMS for secrets management
Architecture Diagrams
Multi-Cloud Migration Architecture
graph LR
subgraph Azure["Azure Cloud (Source)"]
AKS1[AKS Cluster 1<br/>Production]
AKS2[AKS Cluster 2<br/>Services]
AKS3[AKS Cluster 3<br/>Data]
AKS4[AKS Cluster 4<br/>...]
AKS5[AKS Cluster 5<br/>ML]
end
subgraph Network["Migration Network"]
CI[Cloud Interconnect<br/>Dedicated Connection]
VPN[VPN Backup<br/>Redundancy]
end
subgraph GCP["Google Cloud Platform (Target)"]
GKE1[GKE Cluster 1<br/>Frontend]
GKE2[GKE Cluster 2<br/>Backend]
GKE3[GKE Cluster 3<br/>Data Processing]
GKE4[GKE Cluster 4<br/>ML Workloads]
GKE5[GKE Cluster 5<br/>Batch Jobs]
subgraph ServiceMesh["Service Mesh Layer (Istio/ASM)"]
SM1[Traffic Management]
SM2[mTLS Security]
SM3[Observability]
end
end
AKS1 & AKS2 & AKS3 & AKS4 & AKS5 --> CI
AKS1 & AKS2 & AKS3 & AKS4 & AKS5 --> VPN
CI --> GKE1 & GKE2 & GKE3 & GKE4 & GKE5
VPN -.Backup.-> GKE1 & GKE2 & GKE3 & GKE4 & GKE5
ServiceMesh -.Manages.-> GKE1 & GKE2 & GKE3 & GKE4 & GKE5
style Azure fill:#0078d4,color:#fff
style GCP fill:#4285f4,color:#fff
style Network fill:#34a853,color:#fff
style ServiceMesh fill:#fbbc04,color:#000
CI/CD Pipeline Flow
graph LR
subgraph Source["Source Code"]
Git[Git Repository]
end
subgraph CICD["CI/CD Pipeline (Jenkins)"]
Build[Build & Test]
Helm[Helm Package]
Push[Push to Registry]
end
subgraph Deploy["Deployment"]
Dev[Dev Cluster]
Staging[Staging Cluster]
Prod[Production Cluster]
end
Git -->|Webhook| Build
Build -->|Success| Helm
Helm --> Push
Push -->|Auto Deploy| Dev
Dev -->|Tests Pass| Staging
Staging -->|Manual Approve| Prod
style Source fill:#e1f5ff
style CICD fill:#fff4e6
style Deploy fill:#e8f5e9
Traffic Migration Strategy
graph LR
Start[Start Migration]
subgraph Phase1["Phase 1: Deploy to GKE"]
Deploy[Deploy Blue Environment<br/>on GKE]
Health[Health Checks Pass]
end
subgraph Phase2["Phase 2: Gradual Traffic Shift"]
T10[10% Traffic to GKE<br/>90% to AKS]
Monitor1[Monitor Metrics]
T50[50% Traffic to GKE<br/>50% to AKS]
Monitor2[Compare Performance]
end
subgraph Phase3["Phase 3: Complete Migration"]
T100[100% Traffic to GKE]
Verify[Verify Stability]
Decom[Decommission AKS]
end
Start --> Deploy
Deploy --> Health
Health --> T10
T10 --> Monitor1
Monitor1 --> T50
T50 --> Monitor2
Monitor2 --> T100
T100 --> Verify
Verify --> Decom
style Phase1 fill:#e1f5ff
style Phase2 fill:#fff4e6
style Phase3 fill:#e8f5e9
Impact & Results
Ultimately, this migration was a huge success. We eliminated the operational complexity of a multi-cloud setup, reduced costs by 35%, and established a modern, scalable Kubernetes foundation for the client. By implementing standardized Helm and Jenkins pipelines, we accelerated their future deployment velocity by 60%, empowering them to move faster while maintaining top-tier reliability and security.
Lessons Learned
This project reinforced several key lessons for me. First, plan extensively but execute incrementally; our detailed planning and wave-based execution gave us confidence at every step. Second, automate everything possible: our Helm and Jenkins automation was the only way to manage 1,000+ workloads efficiently. Third, communication is paramount, especially with distributed teams. Beyond that, always maintain a tested rollback plan for every stage, start with non-critical workloads to validate the approach and build team confidence, treat Infrastructure as Code as non-negotiable for consistency, and rely on real-time monitoring for the visibility needed to make confident decisions under pressure. It was a powerful reminder that successful migrations require a blend of technical expertise, project management discipline, and a methodical approach to risk management.