How I Built a Scalable E-Commerce Platform on Kubernetes with 99.99% Uptime

Nov 12, 2022

I was given the challenge of building a highly available e-commerce platform from the ground up on AWS. The goal was ambitious: 99.99% uptime, the ability to handle 10x traffic spikes during sales events, and sub-15-minute deployment cycles. I designed and deployed a solution on Amazon EKS (Kubernetes) that not only met but exceeded all these targets. Here’s how I built it.

The Challenge I Faced

The CTO dropped this bomb in our weekly one-on-one: “We’re launching a new e-commerce platform. Black Friday is in six months. We need 99.99% uptime. Can you build it?”

I said yes before I’d really thought it through. Then I spent that night doing math. 99.99% uptime means less than 5 minutes of downtime per month. For an e-commerce platform where every minute of downtime costs thousands in lost revenue, that’s not a soft target. That’s a hard requirement.

The scariest part wasn’t the uptime requirement. It was the traffic spikes. Based on projections from the business team, we needed to handle 10x our normal traffic during sales events. If we provisioned for peak load 24/7, we’d burn money. If we under-provisioned, we’d crash during the one time we absolutely couldn’t afford to crash.

And then there was the deployment velocity issue. The existing platform took hours to deploy manually. Developers would submit tickets, ops would schedule maintenance windows, and we’d do deployments at 2 AM to minimize user impact. For a modern e-commerce platform competing with the big players, that wasn’t going to cut it.

The Architecture I Designed

I spent two weeks whiteboarding different architectures. I considered ECS, I considered Lambda, I even briefly considered just running VMs because that’s what I knew best. But I kept coming back to Kubernetes. EKS specifically, because I didn’t want to manage the control plane myself.

The architecture I landed on had three layers, each designed for a specific failure mode.

The Edge Layer was all about protection and performance. CloudFront CDN to cache static assets and reduce load on the origin. WAF (Web Application Firewall) to block common attacks. Route53 for DNS with health checks. If our entire region went down, Route53 could failover to a backup region. (We never built that backup region due to cost, but the option was there.)

The Compute Layer was where the magic happened. An Application Load Balancer sat in front of our EKS cluster, distributing traffic across multiple availability zones. Inside the cluster, I broke the monolithic e-commerce application into microservices: Product Catalog, Shopping Cart, Order Processing, Payment, User Authentication. Each service was independently scalable and deployable.

The Data Layer used managed services wherever possible because I didn’t want to wake up at 3 AM to deal with database failovers. RDS PostgreSQL in multi-AZ configuration for transactional data. ElastiCache Redis, also multi-AZ, for session storage and caching. S3 for product images and static assets.

The high-level request flow, as a Mermaid diagram:

graph LR
    Users --> Edge[Edge: Route53, CloudFront, WAF]
    Edge --> ALB[ALB]
    ALB --> EKS[EKS Cluster: Microservices]
    EKS --> Data[Data Layer: RDS, ElastiCache]

    style EKS fill:#4285f4,color:#fff
    style Data fill:#34a853,color:#fff

How I Actually Built This

The architecture looked great on the whiteboard. Making it work in production was a different story.

Infrastructure as Code Was Non-Negotiable

I wrote everything in Terraform from day one. The entire EKS cluster, node groups, VPCs, security groups, RDS instances, all of it. This took longer upfront than clicking through the AWS console, but it saved me countless hours later when I needed to reproduce the staging environment or troubleshoot configuration drift.

The node groups were a balancing act. I used on-demand instances for the critical services (payment processing, order management) because I couldn’t afford to have those pods evicted. For the less critical stuff (product recommendation engine, search indexing), I used spot instances, which were 60-70% cheaper. The first time AWS reclaimed a spot instance and I watched Kubernetes seamlessly reschedule the pod to another node, I felt like I’d made the right choice.
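
On the Kubernetes side, keeping critical services off spot capacity came down to scheduling constraints. A minimal sketch of the idea, assuming EKS managed node groups (which label their nodes with eks.amazonaws.com/capacityType); the service name and image here are just placeholders:

# Illustrative Deployment fragment: pin a critical service to on-demand nodes.
# Assumes EKS managed node groups, which set the eks.amazonaws.com/capacityType label.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      nodeSelector:
        eks.amazonaws.com/capacityType: ON_DEMAND   # never schedule onto spot nodes
      containers:
        - name: payment-service
          image: payment-service:latest             # placeholder image reference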

Helm Charts Saved My Sanity

I built a Helm chart template for microservices and used it for all 12 services. Each chart had the same structure: deployment, service, HPA, ingress, ConfigMap. This consistency meant developers could understand and modify any service’s deployment without learning a new pattern every time.
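
To give a sense of the shape, here is a trimmed-down sketch of that shared values.yaml; the repository, hostname, and resource numbers are placeholders rather than the real production values:

# Illustrative values.yaml for the shared microservice chart (placeholder values)
replicaCount: 3

image:
  repository: <aws-account-id>.dkr.ecr.us-east-1.amazonaws.com/product-service
  tag: "1.0.0"

service:
  type: ClusterIP
  port: 80

ingress:
  enabled: true
  host: products.example.com

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    memory: 1Gi

config:                # rendered into the service's ConfigMap
  LOG_LEVEL: info

# autoscaling block shown in the HPA example below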

The HPA (Horizontal Pod Autoscaler) configuration was critical for handling traffic spikes. I started with a conservative target of 70% CPU utilization and let the system scale up aggressively.

# Example HPA config from my Helm chart
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
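
Rendered by the chart, those values become a HorizontalPodAutoscaler object. A minimal sketch of roughly what that looks like on the autoscaling/v2 API, with an optional behavior block added here to illustrate the “scale up aggressively” part (the behavior numbers are for illustration, not from my chart):

# Sketch of the rendered HPA (autoscaling/v2); behavior values are illustrative
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: product-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: product-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react as soon as the metric crosses the target
      policies:
        - type: Percent
          value: 100                    # allow doubling the replica count per minute
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before scaling back down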

CI/CD Pipeline Took Three Tries to Get Right

My first attempt at a CI/CD pipeline was overly ambitious. I tried to build everything in Jenkins because that’s what I knew. It was a disaster. The pipelines were slow, flaky, and required constant maintenance.

I scrapped it and rebuilt using AWS CodePipeline and CodeBuild. The integration with AWS services was seamless. A developer pushes code to GitHub, the pipeline triggers, CodeBuild builds the Docker image, runs tests, scans for vulnerabilities with Trivy, pushes to ECR, and deploys to dev automatically. Staging and production deployments required manual approval, a safety net that saved us from deploying bad code multiple times.
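
A sketch of the build stage of that flow as a CodeBuild buildspec; the environment variable names, test command, and Trivy flags are illustrative rather than a copy of the real pipeline:

# Illustrative buildspec.yml: build, test, scan with Trivy, push to ECR.
# Assumes the build image has trivy preinstalled and that ECR_REGISTRY and
# SERVICE_NAME are set as pipeline environment variables.
version: 0.2

phases:
  pre_build:
    commands:
      - aws ecr get-login-password --region "$AWS_REGION" | docker login --username AWS --password-stdin "$ECR_REGISTRY"
      - IMAGE_TAG=$(echo "$CODEBUILD_RESOLVED_SOURCE_VERSION" | cut -c 1-7)
  build:
    commands:
      - docker build -t "$ECR_REGISTRY/$SERVICE_NAME:$IMAGE_TAG" .
      - docker run --rm "$ECR_REGISTRY/$SERVICE_NAME:$IMAGE_TAG" npm test   # placeholder test command
      - trivy image --exit-code 1 --severity HIGH,CRITICAL "$ECR_REGISTRY/$SERVICE_NAME:$IMAGE_TAG"
  post_build:
    commands:
      - docker push "$ECR_REGISTRY/$SERVICE_NAME:$IMAGE_TAG"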

The first time the entire pipeline ran end-to-end in under 12 minutes, I actually celebrated. We’d gone from hours of manual work to automated deployments that finished in less time than it takes to make coffee.

High Availability Through Pod Disruption Budgets

The 99.99% uptime requirement kept me up at night. I knew the architecture was solid, but Kubernetes does routine maintenance. Nodes get upgraded. Pods get evicted. How do you maintain availability through all that?

Pod Disruption Budgets (PDBs) were the answer. I configured them for every critical service to ensure a minimum number of replicas stayed running during voluntary disruptions. Combined with pod anti-affinity rules that spread replicas across different availability zones, this meant we could lose an entire AZ and stay online.

# Example PDB to ensure at least 2 replicas are always available
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: product-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: product-service
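
The anti-affinity half of that combination sat in each Deployment’s pod template. A minimal sketch, using the product service as the example; it uses the soft (preferred) form because the HPA can scale well past three replicas, more than one per zone:

# Prefer spreading product-service replicas across availability zones so that
# losing one AZ never takes out every copy (sketch; labels are illustrative).
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: product-service
          topologyKey: topology.kubernetes.io/zone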

Black Friday: The Real Test

Six months later, Black Friday arrived. I’d been dreading this day since the project started. All the architecture, all the testing, all the planning came down to whether the system could handle real traffic.

The sale started at midnight. I was in the office with the ops team, watching dashboards. Traffic started ramping up. The cluster autoscaler kicked in, spinning up new nodes. The HPAs started scaling pods. We went from our baseline of 12 nodes to 48 nodes in about 20 minutes.

Peak traffic hit 25,000 requests per second, more than 10x our normal load. The system didn’t even flinch. Latency stayed under 200ms. Error rate stayed below 0.01%. The CloudFront cache hit ratio was 87%, which meant most requests never even reached our cluster.

We ran that traffic level for 14 hours straight. Zero downtime. Zero incidents. Zero pages.

When the traffic finally tapered off and the cluster started scaling back down, I actually took a screenshot of the Grafana dashboard. We’d done it. 99.99% uptime for the month, well within our target.

What I Learned the Hard Way

I initially over-provisioned everything because I was scared of running out of resources. The Product Catalog service requested 2Gi of memory per pod. After watching Grafana for a few weeks, I realized it peaked at 600Mi. I cut the request to 800Mi and immediately freed up 60% of our memory allocation. That let us run more pods on the same nodes and saved thousands in EC2 costs.
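
In the Helm values, that tuning was a small diff. A sketch of the result for the Product Catalog service; the CPU request and memory limit shown here are assumptions, since only the memory request is described above:

# Product Catalog resources after tuning (CPU and memory limit are illustrative)
resources:
  requests:
    memory: 800Mi   # was 2Gi; observed peak usage in Grafana was ~600Mi
    cpu: 250m       # illustrative; CPU was not the constraint
  limits:
    memory: 1Gi     # illustrative headroom above the observed peak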

Health checks were the secret to zero-downtime deployments. I spent a lot of time tuning the liveness and readiness probes. A pod had to pass readiness checks before receiving traffic, and it had to pass liveness checks to avoid being restarted. Getting these right was fiddly, but it was the difference between smooth rolling updates and brief service disruptions.
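
For reference, a minimal sketch of the kind of probe configuration I mean; the endpoints, port, and timings are illustrative rather than the exact production settings:

# Illustrative readiness/liveness probes for a service container
readinessProbe:              # gate traffic until the pod can actually serve it
  httpGet:
    path: /healthz/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:               # restart the container only if it is truly wedged
  httpGet:
    path: /healthz/live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3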

Pod Disruption Budgets saved us during a scary incident where AWS did emergency maintenance on our underlying hardware. They drained nodes aggressively, but the PDBs ensured we always had enough replicas online. Without them, we would have had an outage. With them, users never noticed.
