My Prometheus Monitoring Playbook for Production Kubernetes

Nov 05, 2021

I’ll never forget the 3 AM page that taught me everything I needed to know about monitoring. It was about six months into my first Kubernetes deployment, and I was woken up by what felt like a dozen alerts firing simultaneously. Half asleep, I fumbled for my laptop only to discover that every single one of those alerts was essentially telling me the same thing: a single node had run out of memory. The redundancy was maddening, and worse, none of the alerts actually told me what I needed to know. Were users affected? Which services were impacted? I spent 20 minutes digging through logs before I even understood what was actually broken.

That night changed how I think about monitoring entirely. I’ve learned that effective monitoring isn’t about collecting every metric possible or creating elaborate alert chains. It’s about collecting the right metrics and creating actionable alerts that actually help you solve problems. After setting up Prometheus for dozens of production Kubernetes clusters over the past few years, I’ve developed a playbook that helps me focus on what matters, avoid alert fatigue, and keep my monitoring costs reasonable. Let me walk you through what I’ve learned.

My Monitoring Philosophy

My philosophy boils down to something simple: alert on symptoms, not causes, and make everything as efficient as possible. I didn’t start out this way, though. In my early days, I was guilty of what I now call “monitoring maximalism.” I thought more metrics meant better visibility. I was wrong. More metrics just meant more noise and a much higher AWS bill at the end of the month.

The Shift to Symptom-Based Alerting

Here’s what I realized after that 3 AM disaster: my primary goal should be getting alerted to user-facing problems, not infrastructure hiccups. I started focusing on high-level symptoms like error rates, latency spikes, and saturation. That’s essentially the RED method (Rate, Errors, Duration), with saturation borrowed from Google’s Four Golden Signals, and it works because it mirrors what your users actually experience.
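
To make that concrete, here’s the shape of a symptom-level alert I’d start from. The metric name http_requests_total and the 5% threshold are placeholders, not something from my actual config; you’d swap in whatever your services expose.

# A sketch of a symptom-based alert on user-facing error rate
# (http_requests_total and the 5% threshold are placeholders)
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "More than 5% of requests have been failing for 5 minutes"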

I do keep a few critical alerts for node-level issues, but only the ones that could lead to widespread outages. Memory pressure is one of them, because I’ve seen firsthand how a single node running out of memory can cascade into a cluster-wide problem. Here’s the alert I use now:

# An example of a critical, symptom-based alert
- alert: NodeMemoryPressure
  expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
  for: 5m
  labels:
    severity: critical # This will wake me up
  annotations:
    summary: "Node {{ $labels.instance }} is out of memory!"

The key detail here is that for: 5m clause. I learned this lesson the hard way too. Without it, I was getting paged for transient memory spikes that resolved themselves within seconds. Adding that five-minute buffer eliminated probably 80% of my false positives overnight.

Recording Rules: My Secret Weapon

About a year into running Prometheus at scale, I started noticing something troubling. Our Grafana dashboards were taking forever to load, sometimes timing out completely. The problem was that I was running these massive, complex PromQL queries with high cardinality across 30-day time ranges. Every time someone opened a dashboard, Prometheus had to crunch through millions of data points.

The solution turned out to be recording rules, and honestly, I wish someone had told me about these earlier. Recording rules let you pre-calculate expensive queries and store the results as new, lower-cardinality time series. Instead of running something like sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) on the fly every time, you evaluate it once per evaluation interval and store the result under a new metric name. Your dashboards and alerts then query that pre-aggregated series, which is dramatically faster.

# A recording rule to pre-calculate the CPU usage ratio per instance
- record: instance:node_cpu:ratio
  expr: sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) / count(node_cpu_seconds_total{mode="idle"}) by (instance)
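
Once that series exists, the expensive part happens exactly once per evaluation interval, and everything downstream queries the cheap result. Here’s a sketch of an alert built on top of it; the alert name and 90% threshold are placeholders, not my production values.

# A sketch of an alert that queries the pre-aggregated series
- alert: HighNodeCpuUsage
  expr: instance:node_cpu:ratio > 0.9
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.instance }} has been above 90% CPU for 10 minutes"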

After implementing recording rules across our most expensive queries, our dashboard load times dropped from 15-20 seconds to under 2 seconds. The difference was night and day. It also reduced the load on our Prometheus instances significantly, which meant we could scale back some of our over-provisioned resources.

Running Prometheus in High Availability

I made another rookie mistake early on: running a single Prometheus instance in production. Everything was fine until the day that instance crashed due to an out-of-memory error (oh, the irony). We were blind for about 10 minutes while I scrambled to get it back up. That’s 10 minutes where we had no idea if our production systems were healthy or on fire.

Since then, I never run a single Prometheus instance for production workloads. I always deploy at least two replicas using a StatefulSet. Yes, this means you’re storing metrics twice, which increases storage costs. But the peace of mind knowing that your monitoring system is as resilient as the applications it’s monitoring? Totally worth it. Both replicas run the same scrape configuration and differ only by a replica external label, so each one collects the full picture independently and Alertmanager deduplicates the alerts they both fire. I learned this approach after reading about how larger companies handle Prometheus, and it’s saved me multiple times when one instance has gone down.
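
For reference, here’s roughly what that looks like if you’re running the Prometheus Operator, which creates and manages the StatefulSet for you; that’s an assumption on my part, and the name, retention, and resource numbers below are placeholders.

# A minimal HA sketch using the Prometheus Operator CRD (assumes the operator is installed)
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  replicas: 2          # two identical replicas scraping the same targets
  retention: 15d       # placeholder; tune for your storage budget
  resources:
    requests:
      memory: 2Gi
  # the operator adds a prometheus_replica external label by default,
  # which is what lets downstream tooling deduplicate the two copies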

Intelligent Alert Routing

After my alert fatigue experience, I spent a lot of time thinking about how to route alerts intelligently. The breakthrough came when I realized that not all alerts are created equal. Some things genuinely need immediate attention at 3 AM. Most things don’t.

I configure my Alertmanager to route alerts based on their severity labels. Critical alerts, the ones marked severity: critical, go directly to PagerDuty and will wake someone up. These are reserved for situations where users are actively being impacted right now. Think service outages, cascading failures, or data loss scenarios.

Warning alerts, marked severity: warning, go to a dedicated Slack channel that the on-call engineer monitors during business hours. These are things like elevated error rates that haven’t crossed the critical threshold yet, or resources trending toward saturation. They’re important, but they don’t require someone to roll out of bed.
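
A typical example of what lands in that channel is a “disk trending toward full” alert. The thresholds here are placeholders, but the shape is what matters: it predicts a problem rather than paging on one.

# A sketch of a warning-level alert (thresholds are placeholders)
- alert: NodeDiskFillingUp
  expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) < 0
  for: 1h
  labels:
    severity: warning   # routed to Slack, not PagerDuty
  annotations:
    summary: "Disk {{ $labels.device }} on {{ $labels.instance }} is predicted to fill within 24 hours"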

# My basic Alertmanager routing configuration
route:
  group_by: ['alertname', 'cluster']
  receiver: 'slack-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty' # Critical alerts go to PagerDuty

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    # assumes slack_api_url is set under global:, or add api_url here

- name: 'pagerduty'
  pagerduty_configs:
  - service_key: 'your-service-key'

This routing strategy cut our after-hours pages by about 70%. More importantly, it meant that when we did get paged at night, we knew it was serious. The alert fatigue that had been plaguing our on-call rotation basically disappeared.

The Dashboards That Actually Matter

I used to build elaborate Grafana dashboards with every metric I could think of. Beautiful, complex visualizations that took hours to create. Know how often they got used? Almost never. People would glance at them during incidents, get overwhelmed by the information density, and go dig through logs instead.

These days, I keep my dashboard strategy simple. I have a standard set that I deploy for every cluster, and each one serves a specific purpose. The high-level Cluster Overview dashboard shows CPU, memory, and network at a glance. This is the first place I look when something seems off. It answers the question: “Is this a cluster-wide problem or localized to specific workloads?”

The Pod Resource Usage dashboard has probably saved me more money than any other tool in my arsenal. It shows me which applications are over-provisioned (requesting way more resources than they actually use) and which are under-provisioned (constantly hitting their limits). I run a monthly review using this dashboard and adjust resource requests accordingly. Last quarter alone, this helped us reduce our compute costs by about 15%.
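
Under the hood, those panels boil down to queries comparing actual usage against requests, something along these lines; this assumes kube-state-metrics and cAdvisor metrics are being scraped, and the exact label names vary between versions.

# Memory in use vs. memory requested, per namespace
# (assumes kube-state-metrics v2 and cAdvisor metrics; label names vary by version)
sum by (namespace) (container_memory_working_set_bytes{container!=""})
  /
sum by (namespace) (kube_pod_container_resource_requests{resource="memory"})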

I also maintain an API Server Performance dashboard, because I learned the hard way that when the API server is slow, everything feels slow. Debugging kubectl timeouts without knowing that the API server is struggling is an exercise in frustration. The etcd Health dashboard serves a similar purpose. etcd is the heart of your cluster, and if it’s unhealthy, you’re going to have a bad time.
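
When I suspect the API server, the first thing I look at is its own latency histogram, with roughly this kind of query; the metric name is standard in recent Kubernetes versions, while older clusters expose different names.

# 99th percentile API server request latency by verb, excluding long-running requests
histogram_quantile(0.99,
  sum by (le, verb) (
    rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
  )
)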

Finally, I keep a detailed Node Exporter dashboard for deep dives. This is where I go when I need to understand what’s actually happening on a specific node. Disk I/O, network throughput, individual CPU core usage, all the nitty-gritty details. I don’t look at this one daily, but when I need it, I really need it.
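
The panels there are mostly thin wrappers around raw node_exporter metrics, roughly along these lines:

# The kinds of raw node_exporter queries behind the deep-dive panels
rate(node_disk_written_bytes_total[5m])          # disk write throughput per device
rate(node_network_receive_bytes_total[5m])       # network ingress per interface
rate(node_cpu_seconds_total{mode!="idle"}[5m])   # per-core, per-mode CPU usage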

What I’d Tell My Past Self

If I could go back and give myself advice before that first 3 AM incident, I’d say this: treat your monitoring system with the same care and intentionality as your production applications. Don’t just throw metrics at a wall and hope something sticks. Think carefully about what actually matters, what decisions those metrics will help you make, and what alerts will genuinely help you solve problems faster.

I’d also tell myself to resist the urge to alert on everything. Every alert you create is a promise to whoever is on call that this thing is worth their immediate attention. If you create alerts that cry wolf, people will start ignoring them. I’ve seen this happen on multiple teams, and it’s incredibly dangerous.

By following this playbook, I’ve been able to build monitoring systems that are reliable, efficient, and most importantly, provide actionable insights without drowning my team in noise. The key is focusing on what matters, automating the expensive parts with recording rules, and always remembering that someone will eventually get paged because of the decisions you’re making right now. Make sure those pages are worth it.
