When I joined the team, our data scientists were struggling. They were trying to train massive machine learning models on single GPUs, which was slow and inefficient. GPU availability was a constant bottleneck, and cloud costs were spiraling out of control. I was brought in to build a scalable, distributed training platform. Using Ray on Kubernetes, I built a system that made our training 10x faster and cut our costs by 65%.
The Problem I Was Solving
I remember my first week on the job. I sat down with Sarah, one of our senior data scientists, and she showed me her workflow. She’d submit a training job at 9 AM and then… wait. Sometimes for hours, sometimes for days, just to get access to a single GPU. When she finally got one, if her job crashed halfway through (which happened more often than anyone wanted to admit), she’d have to start from scratch. The frustration in her voice was palpable.
The real kicker came during a team meeting when our engineering director pulled up the cloud bill. We were spending over $40,000 a month on GPU compute, and our data scientists were still bottlenecked. Something had to change. The team needed a self-service platform that would let them focus on building models instead of fighting with infrastructure. That’s where I came in.
The Solution I Built: Ray on Kubernetes
After researching options, I settled on Ray running on Google Kubernetes Engine. I’ll be honest: my first instinct was to build something custom on plain Kubernetes Jobs. Big mistake. I spent three days fighting with pod affinity rules and custom schedulers before I realized I was reinventing the wheel. Ray had already solved these problems elegantly, and it gave our data scientists a Python API they could actually use.
The architecture I ended up with mixed several GKE node pools. I set up on-demand GPU nodes for critical production training jobs that absolutely couldn’t fail. Then I added preemptible GPU nodes (about 70% cheaper) for experimentation. I also created a separate CPU node pool that could autoscale independently to handle data preprocessing. This separation turned out to be crucial because we had been wasting expensive GPU time on tasks that didn’t need GPUs at all.
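To make that concrete, here’s roughly what the separation looked like from a data scientist’s side of the Ray API. This is a minimal sketch: the resource numbers and the custom "preemptible" label are illustrative, and the label only works if the cheap node pool is configured to advertise it.

```python
import ray

ray.init(address="auto")  # connect to the Ray cluster running on GKE

# Preprocessing declares CPU-only requirements, so Ray schedules it onto
# the autoscaling CPU pool and never ties up a GPU node.
@ray.remote(num_cpus=4)
def preprocess(shard_path):
    ...  # load, clean, and transform one shard of training data

# Training declares a GPU requirement; the custom "preemptible" resource
# (assumed to be advertised by the cheap GPU pool) steers experiments
# onto the roughly-70%-cheaper preemptible nodes.
@ray.remote(num_gpus=1, resources={"preemptible": 1})
def train_experiment(config):
    ...  # run one training trial on a single GPU
```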
The Hard Parts: What I Actually Had to Build
Getting autoscaling to work properly was harder than I expected. The platform needed to watch the Ray job queue and spin up GPU nodes before jobs started waiting. My first implementation was too aggressive, spinning up 10 GPU nodes the moment someone queued a job, even if that job only needed 2 GPUs. I learned this the hard way when I woke up to a Slack message from our finance team asking why we’d burned through $3,000 overnight. I ended up writing a custom controller that looked at both the number of queued jobs and their actual resource requirements. Much better.
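The sizing logic itself is simple once you feed it the right signal. Here’s a stripped-down sketch of the core calculation; the per-node GPU count, the node cap, and the stubbed-out queue and GKE plumbing are placeholders, not the real controller.

```python
import math

# Assumptions for this sketch: each queued job reports how many GPUs it
# actually needs. In the real controller those numbers came from the Ray
# job queue, and the resize call went to the GKE API; both are omitted here.
GPUS_PER_NODE = 4          # assumed GPUs per worker node
MAX_GPU_NODES = 10         # hard cap so a burst of jobs can't blow the budget

def desired_gpu_nodes(queued_gpu_requests: list[int], running_nodes: int) -> int:
    """Scale to the GPUs jobs actually asked for, not one node per job."""
    gpus_needed = sum(queued_gpu_requests)
    nodes_needed = math.ceil(gpus_needed / GPUS_PER_NODE)
    # Never scale below what's already running work, never above the cap.
    return min(max(nodes_needed, running_nodes), MAX_GPU_NODES)

# Example: three queued jobs asking for 2, 1, and 8 GPUs -> 3 nodes, not 10.
print(desired_gpu_nodes([2, 1, 8], running_nodes=1))  # -> 3
```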
The fault tolerance piece was probably the most technically satisfying work. I built automatic checkpointing into our training scripts so that if a preemptible node got yanked away mid-training (Google gives you barely 30 seconds’ notice), Ray would catch the failure and resume from the last checkpoint on a new node. The first time I tested this, I manually killed a node in the middle of a 6-hour training run. My heart was pounding as I watched the logs. Five minutes later, the job resumed on a new node like nothing had happened. That feeling never gets old.
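For readers curious about the mechanics, the pattern looks roughly like the sketch below, written against Ray Train’s 2.x checkpoint API (exact imports vary by Ray version). The toy model and dummy training step are stand-ins for the real training code.

```python
import os
import tempfile

import torch
from ray import train
from ray.train import Checkpoint, FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop(config):
    model = torch.nn.Linear(10, 1)  # toy model standing in for the real one
    start_epoch = 0

    # If Ray restarted this worker after a preemption, resume from the last checkpoint.
    ckpt = train.get_checkpoint()
    if ckpt:
        with ckpt.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "state.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, config["epochs"]):
        loss = model(torch.randn(8, 10)).mean()  # dummy step standing in for real training
        with tempfile.TemporaryDirectory() as tmp:
            torch.save({"model": model.state_dict(), "epoch": epoch},
                       os.path.join(tmp, "state.pt"))
            train.report({"epoch": epoch, "loss": float(loss)},
                         checkpoint=Checkpoint.from_directory(tmp))


trainer = TorchTrainer(
    train_loop,
    train_loop_config={"epochs": 50},
    scaling_config=ScalingConfig(num_workers=8, use_gpu=True),
    # Let Ray retry the run on fresh nodes if a preemption kills a worker.
    run_config=RunConfig(failure_config=FailureConfig(max_failures=3)),
)
trainer.fit()
```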
For the developer experience, I knew that if this platform was hard to use, nobody would adopt it. I spent a solid week building a CLI tool that reduced job submission to a single command. I also integrated everything with Jupyter notebooks because that’s where our data scientists actually lived. The goal was to make distributed training feel no different from running code locally. When Sarah trained her first model on 8 GPUs simultaneously and told me it felt like magic, I knew I’d nailed it.
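I can’t share the CLI itself here, but to give a feel for what a single command can wrap, here’s a minimal sketch built on Ray’s job submission client. The head-node address, entrypoint, and dependency list are placeholders.

```python
# A sketch of what a one-command submit wrapper can do under the hood,
# using Ray's job submission client. Address and entrypoint are placeholders.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://ray-head.ml-platform.svc:8265")

job_id = client.submit_job(
    # The command the data scientist would otherwise run by hand.
    entrypoint="python train.py --config configs/resnet.yaml",
    runtime_env={
        "working_dir": ".",               # ship the local project to the cluster
        "pip": ["torch", "torchvision"],  # per-job dependencies
    },
)
print(f"Submitted {job_id}; follow logs with `ray job logs {job_id} --follow`")
```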
I also built Grafana dashboards that showed real-time training progress, GPU utilization, and (the metric that got leadership excited) cost per experiment. Being able to see exactly how much each training run cost changed the conversation entirely. Suddenly, people were asking “how can we optimize this?” instead of “why is our cloud bill so high?”
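The metric behind that view is just arithmetic: GPU-hours consumed multiplied by the hourly node price, exported somewhere Grafana can chart it. A minimal sketch, with made-up prices and an illustrative Prometheus gauge name:

```python
from prometheus_client import Gauge, start_http_server

# Illustrative hourly prices; real values come from the GCP price list.
HOURLY_PRICE = {"on_demand_gpu": 2.48, "preemptible_gpu": 0.74}

cost_gauge = Gauge(
    "experiment_cost_dollars",
    "Estimated cost of a training run",
    ["experiment"],
)

def record_cost(experiment: str, gpu_hours: float, node_kind: str) -> None:
    """Export cost = GPU-hours x hourly price so Grafana can chart it."""
    cost_gauge.labels(experiment=experiment).set(gpu_hours * HOURLY_PRICE[node_kind])

start_http_server(9090)  # scrape target for Prometheus
record_cost("resnet-baseline", gpu_hours=6.5, node_kind="preemptible_gpu")
```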
The Results & Impact
Three months after launch, I sat down with the team to review the metrics. The numbers honestly exceeded my expectations. We were seeing a 10x speedup in training time for models that previously ran on a single GPU. What used to take 2 days now finished in 4 hours. Our average GPU utilization jumped from a wasteful 40% to an efficient 85%.
But the number that made our CFO smile was the 65% reduction in overall training costs. We went from that $40,000 monthly bill down to about $14,000, primarily because of preemptible nodes and smarter autoscaling. The data science team was running 3x more experiments per week, shipping ML features faster than ever.
The platform became fully self-service. No more tickets to the infrastructure team asking for GPU access. The data scientists could spin up distributed training jobs themselves in minutes. We hit a 99.2% successful job completion rate, even with heavy use of preemptible nodes.
What I’d Do Differently Next Time
Looking back, there are a few things I wish I’d known from the start. First, I should have built with preemptible nodes and checkpointing from day one instead of treating them as an optimization to add later. I wasted two weeks building the initial version with only on-demand nodes, then had to refactor everything to support fault tolerance. The cost savings are just too significant to treat as an afterthought.
Second, observability isn’t optional. You can’t optimize what you can’t measure. I initially thought I could skip the monitoring dashboards and add them later, but that was naive. The moment people started asking questions about costs and performance, I realized I needed those dashboards yesterday. The cost-per-experiment view turned out to be the most popular feature of the whole platform.
Finally, developer experience is everything. I spent weeks optimizing the backend for performance, but what actually drove adoption was making the platform dead simple to use. That one-command job submission got people to switch from their old workflows. If your platform requires users to read a 50-page manual, you’ve already lost.
The platform is still running in production today, handling hundreds of training jobs every week. Data scientists who used to wait days for GPU access can now experiment freely. Models that used to take weeks to train now finish in days. And our finance team is a lot happier too.
Related Reading
This distributed training platform is part of my broader MLOps work:
- Building Production MLOps Infrastructure - Comprehensive guide to production ML systems, including this Ray platform, edge ML deployment, and hybrid architectures
- AI-Powered Drowning Detection System - Edge ML deployment using models trained on this platform
- Distributed ML Pipeline for 3D Reconstruction - Another production ML system using Ray and Kubernetes