Moving a Kubernetes cluster to production is a big step. After taking dozens of clusters live, I’ve learned that a successful launch depends on careful preparation. To make sure I don’t miss anything, I’ve developed a comprehensive checklist that I run through before any new cluster handles production traffic. This is my personal production readiness playbook, built on hard-won experience.
My Core Readiness Pillars
I learned these pillars the hard way. Each one represents a mistake I or someone on my team made that caused an incident.
Security Hardening (Or How I Learned to Stop Trusting Default Settings)
My first production Kubernetes cluster had almost no security hardening. We went live, and three weeks later our security team did a penetration test. They owned the cluster in about 20 minutes. That was humiliating. I never made that mistake again.
Now security is the first thing I configure, not the last. I start with network policies. Every production namespace gets a default-deny-all policy. Nothing can talk to anything unless I explicitly allow it. This feels restrictive at first, but it’s caught multiple incidents where compromised pods tried to phone home or scan the internal network.
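For reference, this is roughly what that default-deny policy looks like (the namespace name here is just a placeholder):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production       # placeholder; applied to every production namespace
spec:
  podSelector: {}             # an empty selector matches every pod in the namespace
  policyTypes:                # with no ingress or egress rules listed, both directions are denied
    - Ingress
    - Egress
```

Every allowed flow then gets its own narrowly scoped allow policy layered on top of this.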
RBAC (Role-Based Access Control) is where I see the most dangerous shortcuts. Developers will ask for cluster-admin because it’s easier than figuring out what permissions they actually need. I say no every time. I create fine-grained Roles that grant exactly the minimum permissions required. It takes more time upfront, but it’s saved us from credential compromise incidents where an attacker got hold of a service account token.
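To illustrate, a minimal Role of the kind I hand out instead of cluster-admin might look like the sketch below; the role name, namespace, group, and exact verbs are placeholders that depend on what the team actually needs:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-deployer            # placeholder role name
  namespace: production
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-deployer-binding
  namespace: production
subjects:
  - kind: Group
    name: payments-team          # placeholder group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-deployer
  apiGroup: rbac.authorization.k8s.io
```

Scoped to a single namespace, read-only on pods, and no access to Secrets at all.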
Pod Security Standards are non-negotiable. I enforce the restricted standard on all production namespaces. Every pod runs as non-root. Every pod has a read-only root filesystem. No privilege escalation allowed. This breaks some badly-written applications, but that’s their problem. If your app needs root access in a container, it’s not production-ready.
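Concretely, the enforcement is a pair of labels on the namespace, and every workload then has to declare a securityContext that passes the standard. A sketch, with placeholder names and image:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production                      # placeholder namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
---
apiVersion: v1
kind: Pod
metadata:
  name: example-app                     # placeholder pod; in practice this lives in a Deployment
  namespace: production
spec:
  securityContext:
    runAsNonRoot: true                  # refuse to start if the image runs as root
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: registry.example.com/app:1.0.0   # placeholder image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
```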
Monitoring (Because Flying Blind is Terrifying)
I launched a cluster once without proper monitoring. Not my proudest moment. We went live, everything seemed fine, and then users started complaining about slow response times. I had no idea what was wrong. No metrics. No visibility. I spent two hours SSHing into nodes and running top manually before I found the problem. A pod was leaking memory and bringing down its node repeatedly.
Never again. Now I deploy Prometheus before anything else touches the cluster. I set up dashboards for nodes (CPU, memory, disk), pods (restarts, resource usage), and the critical control plane components (API server, etcd, scheduler). If I can’t see it in Grafana, I can’t debug it.
Alerts are where most people screw up. They configure alerts but never test them. Then when an incident happens, they discover the alerts never fired, or they fired but got routed to an old Slack channel nobody monitors. I test every single alert. I deliberately crash a pod to verify the crash loop alert fires. I fill up disk space to verify the disk pressure alert works. An untested alert might as well not exist.
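As one example of the kind of alert I test this way, here's a sketch of a crash-loop rule. It assumes the Prometheus Operator's PrometheusRule CRD and kube-state-metrics are in place, and the names, labels, and threshold are placeholders to tune for your own cluster:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-health-rules
  namespace: monitoring                  # placeholder; wherever Prometheus runs
  labels:
    release: kube-prometheus-stack       # placeholder; must match your Prometheus rule selector
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodCrashLooping
          # fires when a container has restarted more than 3 times in 15 minutes
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```

Deliberately crashing a pod in staging should make this fire and land in the on-call channel; if it doesn't, the routing is broken.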
High Availability and Disaster Recovery (Prepare for the Worst)
Here’s an uncomfortable truth: your cluster will fail. Maybe AWS will have a regional outage. Maybe someone will accidentally delete a critical namespace. Maybe etcd will get corrupted. If you’re not prepared, you’re going to have a very bad day.
I run a minimum of three control plane nodes spread across different availability zones. This means the cluster can survive losing an entire AZ. I configure automated etcd snapshots every six hours and store them in S3 with versioning enabled. That’s saved me twice when etcd got into a weird state and I needed to restore from backup.
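There's more than one way to automate those snapshots; one sketch is a CronJob that runs etcdctl on a control plane node every six hours. This assumes a kubeadm-style cluster with etcd certificates under /etc/kubernetes/pki/etcd, and it leaves out the upload to S3:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-snapshot
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"                # every six hours
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true              # reach etcd on the node's loopback interface
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          restartPolicy: OnFailure
          containers:
            - name: snapshot
              image: registry.k8s.io/etcd:3.5.12-0   # placeholder tag; match your cluster's etcd version
              command:
                - etcdctl
                - --endpoints=https://127.0.0.1:2379
                - --cacert=/etc/kubernetes/pki/etcd/ca.crt
                - --cert=/etc/kubernetes/pki/etcd/server.crt
                - --key=/etc/kubernetes/pki/etcd/server.key
                - snapshot
                - save
                - /backup/etcd-snapshot.db
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup
              hostPath:
                path: /var/backups/etcd      # a follow-up step ships this to versioned S3
```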
But having backups isn’t enough. You need to test restoring from them. I schedule DR drills quarterly where we actually destroy a staging cluster and restore it from backups. Every single time we do this, we find something that doesn’t work quite right. Better to find that in a drill than during a real incident.
Resource Management (Stop the Noisy Neighbors)
Early in my Kubernetes journey, I let developers deploy pods without resource requests or limits. Big mistake. One service had a memory leak and consumed 90% of a node’s RAM. Other pods on that node started getting OOM-killed. It was a cascading failure that took down half our services.
Now every single container must have CPU and memory requests and limits defined. No exceptions. Developers push back on this because it requires them to actually understand their application’s resource usage, but that’s a good thing. If you don’t know how much memory your app needs, you’re not ready for production.
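In practice that just means every container spec ships with a resources block along these lines; the deployment name, image, and numbers are placeholders, and the real values come out of load testing and observed usage:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api                    # placeholder name
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.4.2   # placeholder image
          resources:
            requests:                  # what the scheduler reserves on a node
              cpu: 250m
              memory: 256Mi
            limits:                    # the hard ceiling before throttling or an OOM kill
              cpu: "1"
              memory: 512Mi
```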
I also enforce ResourceQuotas and LimitRanges on every namespace. This prevents any single team from accidentally consuming all the cluster’s resources. One team tried to deploy a development environment with 50 replicas because they misconfigured their Helm values. The quota stopped them before they brought down the cluster.
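A sketch of that pair, sized per team (the namespace and every number here are placeholders):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments              # placeholder namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"                         # caps runaway replica counts from bad Helm values
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-payments
spec:
  limits:
    - type: Container
      defaultRequest:                   # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:                          # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
```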
GitOps (Because Manual Changes Are Technical Debt)
The worst production incidents I’ve dealt with all had one thing in common: someone made a manual change to the cluster that wasn’t tracked anywhere. A quick kubectl apply to fix something, with the intention of documenting it later. Spoiler: they never documented it later.
I mandate that all configuration lives in Git and is deployed through ArgoCD or Flux. No one, including me, makes manual changes to production. Need to update a deployment? Open a pull request, get it reviewed, merge it, and let GitOps deploy it. This gives us a full audit trail, makes rollbacks trivial (just revert the Git commit), and ensures the cluster state always matches what’s in Git.
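With ArgoCD, that wiring is an Application resource pointing at the Git repo. A sketch, with placeholder repo URL, paths, and names:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-production             # placeholder
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/k8s-config.git   # placeholder repo
    targetRevision: main
    path: apps/payments/overlays/production                  # placeholder path
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true        # delete resources that were removed from Git
      selfHeal: true     # revert manual changes back to the state in Git
```

With selfHeal on, even a well-intentioned manual kubectl edit gets reverted, which enforces the no-manual-changes rule even when someone forgets it.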
The discipline this requires is hard at first. During an incident, your instinct is to jump in and fix things manually. But I’ve seen too many incidents where the manual fix made things worse because it conflicted with what GitOps expected. Trust the process.
The Final Walk-Through
The night before I take a cluster live, I go through this checklist one more time. I’ve learned that rushing this step is how things go wrong.
I verify all pods have resource requests and limits. I actually run a script that scans the cluster for any pod missing these. If I find one, the launch gets delayed. No exceptions.
I check that network policies are in place. I test them by trying to curl between pods that shouldn’t be able to communicate. If the curl succeeds, something is misconfigured.
I audit RBAC bindings. I grep for cluster-admin and verify it’s only bound to actual cluster administrators, not service accounts or developers.
I trigger my Prometheus alerts manually to make sure they fire and route to the right places. I crash a pod, I fill up disk space, I let a certificate approach expiration. Every alert better work.
I verify the most recent etcd backup is actually restorable. I don’t trust backups I haven’t tested. I spin up a test cluster and restore from the backup. If it doesn’t work, the launch gets delayed.
I confirm all application config is in Git and managed by GitOps. I check for any configuration drift between Git and the cluster. If there’s drift, I fix it before launch.
This checklist has saved me more times than I can count. It feels tedious, but launching a cluster that crashes on day one because you skipped a step feels way worse.