How I Built a Distributed ML Pipeline to Cut Training Time by 44%

Jan 01, 2025

At a previous company, I took on one of the most challenging projects of my career: building a distributed ML training pipeline for a real-time computer vision system. Our existing process was a huge bottleneck, taking 18 days to train a single model, which was crippling our ability to innovate. I designed and deployed a new system on GKE using Ray that cut this training time down to just 7 days.

The Challenge I Faced

When I started, the team was struggling with several critical issues. The 18-day training time was the most obvious, but GPU utilization was also poor, and our globally distributed team struggled to collaborate. My goal was to solve all of these problems at once by building a scalable, efficient, and automated platform.

The Solution I Designed

I designed a hybrid-cloud solution. For the core training, I built a scalable platform on Google Kubernetes Engine (GKE). I chose Ray for distributed training orchestration because it simplified what would have been a very complex custom setup. This allowed me to create elastic GPU and CPU node pools that would scale based on the training workload. For inference, we used on-premise edge nodes to meet our sub-500ms latency requirement. A GitOps-based CI/CD pipeline automated the deployment of new models from our cloud registry to the edge.

graph LR
    subgraph Global["Global Team Access"]
        Team[Development Team]
    end

    subgraph Cloud["My GKE Training Infrastructure"]
        Head[Ray Head<br/>Orchestration]
        GPU[GPU Pool<br/>Training]
        CPU[CPU Pool<br/>Preprocessing]
        Registry[Model<br/>Registry]
    end

    subgraph Edge["On-Premise Edge"]
        Inference[Edge Nodes<br/>Inference<br/>&lt;500ms]
    end

    Global -->|GitOps| Head
    Head --> GPU & CPU
    Registry -->|Deploy| Inference

    style Cloud fill:#4285f4,color:#fff
    style Edge fill:#34a853,color:#fff
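
To give a sense of how Ray fits in, here is a minimal sketch of the pattern: a per-worker training loop handed to Ray Train's TorchTrainer, which fans it out across the GPU pool. This assumes a PyTorch model; the toy model, dataset, and worker count below are placeholders standing in for the actual training code.

```python
# Sketch only: the real pipeline trained a computer-vision model on our own
# dataset; the toy model, data, and worker count here are illustrative.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

import ray
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_data_loader, prepare_model


def train_loop_per_worker(config):
    # Each Ray worker runs this loop; prepare_model wraps the model in
    # DistributedDataParallel and moves it to that worker's GPU.
    model = prepare_model(nn.Linear(128, 10))

    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    loader = prepare_data_loader(DataLoader(dataset, batch_size=config["batch_size"]))

    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(config["epochs"]):
        for features, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(features), labels)
            loss.backward()
            optimizer.step()
        # Metrics reported here surface in the trainer's results on the driver.
        train.report({"epoch": epoch, "loss": loss.item()})


if __name__ == "__main__":
    ray.init()  # inside the cluster this attaches to the Ray head on GKE
    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"batch_size": 64, "lr": 1e-3, "epochs": 3},
        scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    )
    print(trainer.fit().metrics)
```

Because the ScalingConfig only names a worker count and `use_gpu=True`, the same script runs unchanged whether the elastic GPU pool has scaled to two nodes or twenty.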

On the infrastructure side, I used Terraform to provision the GKE clusters, with specific node pools for GPU and CPU workloads. I then used Helm to deploy Ray and our ML applications. The entire process was managed through GitOps, which was essential for our global team.
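
Once Terraform and Helm had the cluster and Ray running, kicking off a training run was just a job submission against the Ray head. The snippet below is a hypothetical example of what a pipeline step (or a developer anywhere in the world) might run; the service address, entrypoint, and dependency list are made up for illustration.

```python
# Hypothetical CI/CD step: submit a training job to the Ray cluster on GKE.
# The head-node address, entrypoint, and dependencies are placeholders.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://ray-head.ml-training.svc:8265")

job_id = client.submit_job(
    entrypoint="python train.py --config configs/vision.yaml",
    runtime_env={
        "working_dir": "./",              # ship the current checkout to the cluster
        "pip": ["torch", "torchvision"],  # per-job Python dependencies
    },
)
print(f"Submitted Ray job {job_id}")

# A real pipeline step would poll this until the job reaches a terminal state.
print(f"Status: {client.get_job_status(job_id)}")
```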

My Key Achievements

The project delivered measurable wins for the company. The key achievements:

- Cut the training time for a single model from 18 days to 7 days.
- Improved GPU utilization by running training on elastic GPU and CPU node pools that scale with the workload.
- Met the sub-500ms latency requirement for real-time inference by serving models on on-premise edge nodes.
- Automated model rollout from the cloud registry to the edge with a GitOps-based CI/CD pipeline.
- Gave our globally distributed team a shared, version-controlled workflow for managing the training infrastructure.

Technical Challenges I Overcame

- Distributing training across many GPUs without writing a custom orchestration layer; Ray handled this with far less complexity.
- Keeping GPU nodes busy as workloads grew and shrank, which the elastic node pools addressed by scaling with demand.
- Hitting the sub-500ms inference latency target, which ruled out cloud-only serving and pushed inference to on-premise edge nodes.
- Coordinating infrastructure changes across a globally distributed team, which is why everything went through GitOps.

The Technology Stack I Used

- Google Kubernetes Engine (GKE) with dedicated GPU and CPU node pools
- Ray for distributed training orchestration
- Terraform for provisioning the clusters and node pools
- Helm for deploying Ray and the ML applications
- A GitOps-based CI/CD pipeline for moving models from the cloud registry to the edge
- On-premise edge nodes for sub-500ms inference

The Impact

This distributed ML training infrastructure enabled the company to dramatically accelerate its AI development cycle. By reducing training time by 44% and establishing reliable CI/CD pipelines, the team could iterate faster, experiment more, and deliver features more rapidly. The hybrid-cloud architecture I designed balanced the need for powerful cloud-based training against low-latency edge inference, creating a scalable foundation for the company’s core product.

Lessons I Learned

- Don't build distributed training orchestration from scratch when a framework like Ray will do it; it turned what would have been a very complex custom setup into a manageable one.
- GitOps is not optional for a globally distributed team; a single declarative source of truth was what made collaboration work.
- Hybrid architectures are worth the operational overhead when requirements pull in opposite directions: cloud scale for training, on-premise edge for sub-500ms inference.

This ML pipeline is part of my broader production MLOps work.
