I’ve been running Kubernetes in production for over five years now. I’ve managed clusters serving millions of users, handled incidents at 3 AM, optimized costs from $60,000/month to $25,000/month, and hardened security after seeing what attackers actually try. This is everything I’ve learned about operating Kubernetes in production, organized into the four areas that matter most.
When I first started with Kubernetes, I thought it was just about deploying containers. I was so naive. Kubernetes is easy to get started with. You can spin up a cluster in 20 minutes and deploy your first app. But running Kubernetes in production, at scale, serving real users, with real security requirements and real budget constraints? That’s a completely different challenge.
Let me walk you through the four critical areas I had to master to run production Kubernetes successfully.
Area 1: Security Hardening (Because Kubernetes Is Not Secure by Default)
The Problem: I learned this the hard way. Two years ago, I was called in to help with an incident response. A company had been breached through their Kubernetes cluster. Attackers exploited an overly permissive RBAC policy to gain cluster-admin access, then used that to deploy cryptominers across every node. The monthly cloud bill jumped from $15,000 to $87,000 before anyone noticed.
That incident could have been prevented with basic security hygiene. It terrified me because I realized our own clusters had similar vulnerabilities. I spent the next three months implementing defense-in-depth security controls.
The Wake-Up Call: I did a security audit of our production clusters and found problems everywhere. Default service accounts with automatic token mounting. Pods running as root. No network policies (any pod could talk to any other pod). Container images pulled from untrusted registries. No runtime security monitoring. We were one exploit away from a major breach.
The Solution: I implemented a defense-in-depth security strategy with multiple layers of controls. RBAC with least privilege for everything. Pod Security Standards enforcing restricted mode (no root containers, read-only filesystems). Zero-trust networking with default-deny network policies. Image scanning in CI/CD blocking vulnerable images. Runtime security monitoring with Falco to detect suspicious behavior.
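To make two of those layers concrete, here’s a minimal sketch, assuming an illustrative namespace called payments: Pod Security Standards enforced through namespace labels, and a default-deny NetworkPolicy so no pod can talk to anything until an explicit policy allows it.

```yaml
# Enforce the "restricted" Pod Security Standard on a namespace.
# Pods that run as root or allow privilege escalation are rejected at admission.
apiVersion: v1
kind: Namespace
metadata:
  name: payments  # illustrative name
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
---
# Default deny: selects every pod in the namespace and allows nothing,
# so all traffic must be opened up by more specific NetworkPolicies.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```

From there, each service gets a narrowly scoped policy that allows only the traffic it actually needs.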
The security controls prevented 12 potential breaches over two years. Most involved compromised developer credentials. A few were actual attackers who had exploited application vulnerabilities but couldn’t escalate because of our pod security controls. I’m not exaggerating when I say these controls saved us from catastrophic incidents.
Read the full guide: Kubernetes Security Hardening
What you’ll learn:
- RBAC configuration for least privilege access (see the sketch after this list)
- Pod Security Standards implementation
- Zero-trust network policies (default deny all)
- Container image scanning and trusted registries
- Runtime security monitoring with Falco
- Real metrics: 85% reduction in attack surface, zero privilege escalation incidents in 2 years
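Since the breach above started with an overly permissive RBAC policy, here’s a hedged sketch of what least privilege looks like in practice: a Role granting read-only access to pods and their logs in one namespace, bound to a single user. The namespace and user are illustrative.

```yaml
# Least-privilege RBAC: read-only access to pods and their logs,
# scoped to one namespace instead of handing out cluster-admin.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: payments  # illustrative
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-pod-reader
  namespace: payments
subjects:
  - kind: User
    name: dev@example.com  # illustrative subject
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```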
Area 2: Cost Optimization (Because Cloud Bills Get Out of Control Fast)
The Problem: Six months into running our main production cluster, I got called into a meeting with the CFO. Our GKE costs had ballooned to $60,000 per month and were growing 15% month-over-month. The CFO wanted to know why our Kubernetes bill was higher than our entire AWS bill had been before we migrated to GCP.
I didn’t have a good answer. I’d been focused on reliability and features, not costs. Big mistake. The CFO gave me 90 days to cut the Kubernetes bill in half or we’d have to seriously reconsider our infrastructure choices.
The Analysis: I spent two weeks analyzing where money was going. What I found was embarrassing. We were massively over-provisioned. Developers requested 4 CPU and 8GB RAM for services that used 0.5 CPU and 2GB RAM. We had 30% cluster utilization, meaning we were wasting $18,000 per month on idle capacity. We were running expensive n1-standard-16 nodes when n1-standard-8 would have been fine. We had no autoscaling, so the cluster was sized for peak load 24/7 even though peak was only 4 hours per day.
The Solution: I implemented a comprehensive cost optimization strategy. Right-sized all resource requests based on actual usage (4 CPU requests became 0.5 CPU). Implemented cluster autoscaling and horizontal pod autoscaling (the cluster scales down during off-peak hours). Switched to preemptible nodes for non-critical workloads (60% discount). Set up pod disruption budgets so we could safely use preemptible nodes. Implemented resource quotas per namespace to prevent runaway usage.
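To make that concrete, here’s a sketch for a hypothetical service (names and numbers are illustrative, not our actual configs): requests set near observed usage instead of 8x above it, and an HPA that scales on CPU so the cluster autoscaler can shed nodes off-peak.

```yaml
# Right-sized requests: set near observed usage (0.5 CPU / 2Gi),
# with limits leaving headroom instead of 8x over-provisioning.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api  # illustrative
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example.com/api:1.0  # illustrative
          resources:
            requests:
              cpu: 500m
              memory: 2Gi
            limits:
              cpu: "1"
              memory: 3Gi
---
# Scale between 2 and 10 replicas to hold ~70% average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```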
The results were dramatic. Monthly costs went from $60,000 to $25,000, a 58% reduction. Cluster utilization improved from 30% to 70%. We were running the same workloads, serving the same traffic, but paying less than half as much.
Read the full guide: Kubernetes Cost Optimization
What you’ll learn:
- Resource request right-sizing based on actual usage
- Cluster autoscaling and horizontal pod autoscaling
- Preemptible nodes for cost reduction (when safe to use)
- Pod disruption budgets for safe node evictions
- Resource quotas and limit ranges per namespace (see the sketch after this list)
- Real metrics: $60k to $25k monthly costs, 58% reduction
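The quota piece is the simplest to sketch: a per-namespace ResourceQuota that caps total requests and limits, so a runaway deployment hits its team’s ceiling instead of the cluster’s. The numbers are illustrative.

```yaml
# Cap the total compute a namespace can request; anything beyond this
# is rejected at admission rather than silently inflating the bill.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-compute
  namespace: payments  # illustrative
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
```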
Area 3: Production Readiness (Because Uptime Matters)
The Problem: We were deploying services to production that weren’t ready. Health checks that didn’t actually test health. Resource limits that caused OOM kills. Missing monitoring and alerting. No graceful shutdown handling. Every deploy was a gamble.
The Incident: A developer deployed a new version of our API service one afternoon. It passed CI, looked good in staging, so we deployed to production. Within 5 minutes, the service was getting OOMKilled and restarting in a loop. Every restart caused active requests to fail. Error rates spiked to 15%. We were getting paged, users were complaining on social media, and nobody knew what was wrong.
It turned out the service had a memory leak that only manifested under production traffic patterns. It also had no graceful shutdown handling, so when Kubernetes killed the pod, active requests failed. And the health check was just checking if the HTTP server responded, not if the service could actually talk to the database.
I spent the next weekend building a production readiness checklist.
The Solution: I created a comprehensive production readiness checklist that every service must pass before deploying to production. Proper health checks (liveness probes the HTTP server, readiness verifies database connectivity). Resource requests and limits based on load testing. Graceful shutdown with preStop hooks. Monitoring and alerting configured. Circuit breakers for external dependencies. Documentation for on-call engineers.
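A hedged sketch of the probe and shutdown pieces for a hypothetical HTTP service (endpoints and timings are illustrative): the liveness probe only checks that the process serves HTTP, the readiness probe exercises the database dependency, and a preStop hook gives load balancers time to drain before the container gets SIGTERM.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api  # illustrative
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: api
          image: example.com/api:1.0  # illustrative
          ports:
            - containerPort: 8080
          # Liveness: "is the process alive?" Keep it cheap and dependency-free,
          # or a database blip will restart otherwise healthy pods.
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
          # Readiness: "can this pod serve traffic?" This endpoint is assumed
          # to check database connectivity, so broken pods are pulled from
          # the Service without being restarted.
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
          lifecycle:
            # Sleep briefly so endpoint removal propagates to load balancers
            # before SIGTERM arrives; in-flight requests get to finish.
            # (Assumes the image has a sleep binary.)
            preStop:
              exec:
                command: ["sleep", "5"]
```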
This checklist became a gate in our CI/CD pipeline. If your service doesn’t meet the criteria, it can’t deploy to production. Developers initially complained this was bureaucratic overhead. Then we went three months with zero deployment-related incidents and they stopped complaining.
Read the full guide: Kubernetes Production Readiness Checklist
What you’ll learn:
- Health check patterns (liveness vs readiness)
- Resource request and limit sizing strategies
- Graceful shutdown implementation with preStop hooks
- Monitoring and alerting setup for services
- Circuit breakers and retry patterns
- Real metrics: 75% reduction in deployment-related incidents
Area 4: Cluster Operations (Because Kubernetes Doesn’t Run Itself)
The Problem: We had GKE clusters running in production, but I didn’t really understand how to operate them. How do you safely upgrade a cluster? How do you handle node maintenance? How do you debug networking issues? How do you optimize etcd performance? I was learning these things through painful trial and error, usually during incidents.
The Learning Curve: Our first cluster upgrade went terribly. I clicked the upgrade button in the GCP console without really understanding what would happen. Nodes started draining and pods started getting evicted. But some pods didn’t have pod disruption budgets, so critical services went down. The upgrade took 4 hours and we had multiple partial outages. I got paged at 2 AM when the database pod got evicted and didn’t come back up on the new node because of persistent volume mount failures.
I realized I needed to actually understand Kubernetes operations at a deep level.
The Solution: I systematized cluster operations with runbooks, automation, and deep technical understanding. I learned how to safely perform cluster upgrades (check pod disruption budgets, cordon and drain nodes manually for critical services, validate each node before moving to the next). I set up monitoring for cluster-level metrics (etcd performance, API server latency, node conditions). I built automation for common operations (node pool rotations, cluster backups, disaster recovery).
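The single highest-leverage piece of that, in my experience, is the PodDisruptionBudget: it’s what makes a drain refuse to take a service below its safe replica count instead of silently killing it. A minimal sketch, with illustrative names and counts:

```yaml
# During a node drain, the eviction API will refuse to evict a pod
# if doing so would leave fewer than 2 "api" pods running.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: payments  # illustrative
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```

With PDBs in place, a drain blocks and retries rather than taking the service down, which is exactly the failure mode that paged me at 2 AM.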
I also established a regular maintenance schedule. Monthly node pool rotations to pick up OS patches. Quarterly cluster version upgrades (never skip versions, always test in staging first). Regular disaster recovery drills (I delete the entire cluster and see how long it takes to restore from backups).
Read the full guide: GKE Cluster Operations Guide
What you’ll learn:
- Safe cluster upgrade procedures with minimal disruption
- Node pool management and rotation strategies
- Cluster monitoring and alerting (etcd, API server, node health)
- Disaster recovery and backup strategies
- Common operational troubleshooting patterns
- How to safely handle node maintenance
The Complete Production Operations Framework
Here’s how all these pieces fit together in a production Kubernetes environment:
Security (Defense in Depth):
- RBAC with least privilege (no cluster-admin for developers)
- Pod Security Standards enforcing restricted mode
- Network policies with default deny
- Image scanning blocking vulnerable images
- Runtime monitoring with Falco
Cost Management:
- Resource requests right-sized to actual usage
- Cluster autoscaling and HPA
- Preemptible nodes for non-critical workloads
- Resource quotas preventing runaway costs
Reliability:
- Production readiness checklist for all services
- Proper health checks and graceful shutdown
- Pod disruption budgets for safe node operations
- Monitoring and alerting for everything
Operations:
- Regular cluster maintenance schedule
- Safe upgrade procedures with validation
- Disaster recovery drills every quarter
- Runbooks for common operations
This framework has kept our production clusters running reliably for years. We achieve 99.9% uptime consistently. Our security posture is strong enough to pass SOC 2 audits. Our costs are optimized. Our operations are predictable.
What This Actually Achieved
The numbers from three years of production Kubernetes operations:
Security:
- Prevented 12 potential security incidents through controls
- Zero privilege escalation incidents in 2+ years
- Passed SOC 2 audit with no findings related to Kubernetes security
- 85% reduction in attack surface measured by penetration testing
Cost:
- Reduced monthly cluster costs from $60k to $25k (58% reduction)
- Improved cluster utilization from 30% to 70%
- Prevented budget overruns through resource quotas
- ROI on optimization work: 3 months
Reliability:
- 99.9% uptime across all production clusters
- 75% reduction in deployment-related incidents
- Zero data loss incidents
- Mean time to recovery under 10 minutes for most incidents
Operations:
- Cluster upgrades from 4+ hour ordeals to 60-minute routine operations
- Disaster recovery time from “unknown” to a measured 4 hours
- On-call pages reduced by 60% through better health checks and monitoring
Lessons Learned From Production
Security is not optional. I’ve seen the damage that security breaches cause. The defense-in-depth approach works. Every layer has failed at some point, but the other layers caught what got through.
Cost optimization requires ongoing attention. Cloud costs will grow if you don’t actively manage them. Set up alerts when costs spike unexpectedly. Review resource usage monthly. Challenge developers on why they need 4 CPU for a service that uses 0.5 CPU.
Production readiness checklists prevent incidents. The majority of deployment-related incidents I’ve seen were preventable with proper health checks, resource limits, and graceful shutdown handling. Make these requirements mandatory.
Operations expertise matters. Kubernetes is complex. You can’t just click upgrade and hope for the best. You need to understand how upgrades work, how to safely drain nodes, how to handle persistent volumes, how to debug networking issues. This knowledge comes from experience and study.
Automate the repetitive, manual work. I’ve automated cluster backups, node pool rotations, resource usage reports, cost analysis, and more. This frees up time to work on more valuable things than manual operations.
Test your disaster recovery procedures regularly. I delete entire clusters every quarter just to prove we can restore them. Every DR drill has revealed something that needed fixing. Better to find those issues during a drill than during a real disaster.
Where to Start
If you’re running Kubernetes in production or planning to, here’s the path I’d recommend:
- Start with security hardening (prevent breaches before they happen)
- Implement cost optimization (prevent budget overruns)
- Enforce production readiness standards (prevent deployment incidents)
- Build operational expertise (handle cluster operations safely)
Don’t try to do everything at once. I spent three years building this operational framework. Start with the area that’s causing you the most pain right now.
Read through the linked guides in the order that makes sense for your priorities. Each guide includes real code, real configurations, real metrics, and real war stories from production incidents.
The Reality of Production Kubernetes
Running Kubernetes in production is hard. It requires expertise in security, cost management, reliability engineering, and operations. The learning curve is steep. You will have incidents. You will make mistakes. I certainly did.
But with the right practices, the right tooling, and the right operational discipline, Kubernetes becomes a powerful platform for running production workloads at scale. We’re running 100+ services on Kubernetes serving millions of users with 99.9% uptime and reasonable costs.
It took years to build this expertise and put these systems in place, but it was worth it. This is how you run Kubernetes in production successfully.
Related Reading
For deeper dives into specific operational areas:
- Vault Secret Management - Secure secret management for Kubernetes
- Deployment Strategies - Safe deployment practices
- GitOps and CI/CD - Deployment automation
- MLOps Infrastructure - Running ML workloads on Kubernetes at scale
- Prometheus Monitoring Best Practices - Monitoring setup
- Infrastructure Testing Strategies - Testing your infrastructure
This is the knowledge I wish I had when I started running Kubernetes in production. I hope it helps you avoid some of the mistakes I made and get to production-grade operations faster.