Zero Certificate Expiry Incidents with My cert-manager Playbook for Kubernetes

I’ll never forget the panic of a production certificate expiring. It was 2 AM on a Saturday when my phone started buzzing with PagerDuty alerts. Our main customer-facing API was throwing SSL errors, and angry support tickets were piling up. I fumbled around in the dark, laptop screen glaring, trying to manually generate a new certificate while half-asleep. By the time I got everything working again, we’d been down for 47 minutes. The post-mortem wasn’t fun.

Before that incident, I thought I had certificate management under control. I kept a spreadsheet with expiry dates. I set calendar reminders. I was “on top of it.” But spreadsheets don’t wake you up at 2 AM, and calendar reminders are easy to dismiss when you’re drowning in other work. I needed something foolproof, something that didn’t rely on me remembering to do the right thing at the right time.

That’s when I started getting serious about cert-manager. After managing certificates for over 50 Kubernetes clusters and not having a single expiry incident in over two years, I’ve learned that good certificate management isn’t about being diligent. It’s about building a system where the right thing is impossible to screw up.

Building a Certificate System You Can Actually Trust

Starting With a Foundation That Won’t Let You Down

The first mistake I made with cert-manager was treating it like any other deployment. Single replica, minimal resources, “it’s just certificate management, how hard can it be?” Then during a routine cluster upgrade, the cert-manager pod got evicted. During those few minutes, three new services deployed that needed certificates. Those certificates never got issued. I only found out a week later.

Now I install cert-manager with at least two replicas of each component. Certificate management is too critical to be a single point of failure.
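Here’s a minimal sketch of what that looks like as Helm values. It assumes the official cert-manager chart and its replicaCount, webhook.replicaCount, and cainjector.replicaCount keys; check the key names against the chart version you install. The helper name ha_values is made up for illustration.

```python
import json


def ha_values(replicas: int = 2) -> dict:
    """Build Helm values that run each cert-manager component with `replicas` pods."""
    if replicas < 2:
        raise ValueError("high availability needs at least two replicas")
    return {
        "replicaCount": replicas,                  # controller
        "webhook": {"replicaCount": replicas},     # admission webhook
        "cainjector": {"replicaCount": replicas},  # CA injector
    }


# JSON is valid YAML, so this output can be saved and passed to `helm install -f`.
print(json.dumps(ha_values(2), indent=2))
```

Only one controller replica is active at a time (the others wait on leader election), so the extra replicas are there to shorten the gap after an eviction, not to add throughput.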

Let’s Encrypt Is Great, But You Need More

When I first discovered Let’s Encrypt, I thought all my problems were solved. Then I hit their rate limit during a major migration. We were moving 30 domains to a new cluster, and I pointed all of them at the production Let’s Encrypt endpoint at once. Suddenly half my domains were stuck without valid certificates.

That’s why I now create two ClusterIssuer resources: one for staging and one for production. I test everything against the staging endpoint first. It has much higher rate limits, and although its certificates aren’t trusted by browsers, they let me verify that my configuration works before I switch to production.
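A sketch of the two issuers, generated as JSON that kubectl apply -f accepts. The issuer names, the email address, the account-key Secret names, and the nginx ingress class for the HTTP-01 solver are all assumptions for illustration; the two server URLs are the Let’s Encrypt ACME v2 directories.

```python
import json

# Let's Encrypt ACME v2 directory URLs.
STAGING = "https://acme-staging-v02.api.letsencrypt.org/directory"
PRODUCTION = "https://acme-v02.api.letsencrypt.org/directory"


def acme_cluster_issuer(name: str, server: str, email: str) -> dict:
    """Return a ClusterIssuer manifest for one ACME endpoint."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": name},
        "spec": {
            "acme": {
                "server": server,
                "email": email,
                # Secret where cert-manager stores the ACME account key.
                "privateKeySecretRef": {"name": f"{name}-account-key"},
                # Assumption: HTTP-01 challenges served through an nginx ingress.
                "solvers": [{"http01": {"ingress": {"class": "nginx"}}}],
            }
        },
    }


issuers = [
    acme_cluster_issuer("letsencrypt-staging", STAGING, "ops@example.com"),
    acme_cluster_issuer("letsencrypt-prod", PRODUCTION, "ops@example.com"),
]
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": issuers}, indent=2))
```

Keeping both issuers in the cluster permanently means switching a service from staging to production is a one-word change in its issuer reference.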

For internal services, I set up a private CA using cert-manager itself. This was a game-changer for mTLS. Before this, we were using self-signed certificates, and managing the trust chain was a nightmare. Now I can issue internal certificates just as easily as public ones.
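One way to bootstrap a private CA inside cert-manager is a three-step chain: a self-signed issuer, a root certificate marked as a CA, and a CA issuer that signs with that root. The sketch below follows that pattern; the resource names (selfsigned-bootstrap, internal-ca) and the ECDSA key choice are assumptions, and it assumes cert-manager runs in the cert-manager namespace, which is where a ClusterIssuer looks for its Secret by default.

```python
import json

# Assumption: cert-manager is installed in this namespace.
CERT_MANAGER_NAMESPACE = "cert-manager"


def private_ca_manifests(ca_name: str = "internal-ca") -> list:
    """Return the three manifests that bootstrap a private CA ClusterIssuer."""
    secret_name = f"{ca_name}-key-pair"
    bootstrap = {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": "selfsigned-bootstrap"},
        "spec": {"selfSigned": {}},
    }
    root_certificate = {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": ca_name, "namespace": CERT_MANAGER_NAMESPACE},
        "spec": {
            "isCA": True,  # this certificate may sign other certificates
            "commonName": ca_name,
            "secretName": secret_name,
            "privateKey": {"algorithm": "ECDSA", "size": 256},
            "issuerRef": {
                "name": "selfsigned-bootstrap",
                "kind": "ClusterIssuer",
                "group": "cert-manager.io",
            },
        },
    }
    ca_issuer = {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": ca_name},
        "spec": {"ca": {"secretName": secret_name}},  # sign with the root key pair
    }
    return [bootstrap, root_certificate, ca_issuer]


manifests = private_ca_manifests()
print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

Internal workloads then request certificates from the internal-ca issuer exactly as public ones do from Let’s Encrypt, and clients only need to trust the single root certificate.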

The Magic of Automation: Annotate and Forget

For public-facing services, I don’t create Certificate resources manually anymore. I just annotate my Ingress resources with cert-manager.io/cluster-issuer: "letsencrypt-prod", and cert-manager does everything else. As long as the Ingress has a tls section naming the hosts and the target Secret, it creates the Certificate resource, handles the ACME challenge, gets the certificate, and stores it in that Secret.
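As a sketch, an annotated Ingress looks roughly like this. The host, service name, port, Secret name, and nginx ingress class are placeholders; the parts cert-manager acts on are the annotation and the tls entry, whose secretName tells it where to store the issued certificate.

```python
import json


def tls_ingress(name: str, host: str, service: str, port: int,
                issuer: str = "letsencrypt-prod") -> dict:
    """Return an Ingress manifest that asks cert-manager for a certificate."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            # The annotation that triggers automatic certificate creation.
            "annotations": {"cert-manager.io/cluster-issuer": issuer},
        },
        "spec": {
            "ingressClassName": "nginx",  # assumption: nginx ingress controller
            # cert-manager writes the key pair into this Secret.
            "tls": [{"hosts": [host], "secretName": f"{name}-tls"}],
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": service, "port": {"number": port}}},
                }]},
            }],
        },
    }


ingress = tls_ingress("api", "api.example.com", "api", 8080)
print(json.dumps(ingress, indent=2))
```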

I used to write separate YAML manifests for every certificate, apply them manually, check the status, and troubleshoot failures. Now I add one annotation to my Ingress, and I’m done. New service? Add the annotation. Certificate about to expire? cert-manager renews it automatically, by default once a third of its lifetime remains, which works out to 30 days before expiry for a 90-day Let’s Encrypt certificate. No spreadsheet, no calendar reminder, no 2 AM wake-up calls.
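The 30-day figure falls out of simple arithmetic. When a Certificate doesn’t set renewBefore, cert-manager schedules renewal for when one third of the certificate’s lifetime is left. The function below is my own illustration of that calculation, not cert-manager code, and the dates are made up.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, not_after, renew_before=None):
    """Return the moment renewal is due for a certificate's validity window."""
    lifetime = not_after - not_before
    if renew_before is None:
        # Default: renew when one third of the lifetime remains.
        renew_before = lifetime / 3
    return not_after - renew_before


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)  # typical Let's Encrypt lifetime

due = renewal_time(issued, expires)
print(due)                   # 2024-03-01 00:00:00+00:00
print((expires - due).days)  # 30
```

For a private CA issuing one-year certificates, the same default would start renewal roughly four months before expiry, which is worth knowing before choosing alert thresholds.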

Trust, But Verify: Why Monitoring Matters

You know what’s worse than getting paged at 2 AM for an expired certificate? Not getting paged at all and finding out from a customer three days later. That almost happened to me. We had a certificate that cert-manager couldn’t renew because of a DNS issue. The renewal kept failing silently. I had no idea until I checked the logs for something else. The certificate was five days from expiry.

After that scare, I got serious about monitoring. I set up Prometheus alerts that fire 30 days before a certificate expires. This gives me plenty of time to investigate. Most of the time, these alerts resolve themselves as cert-manager successfully renews the certificate.

I also have a more aggressive alert at 7 days that pages whoever is on-call. And a critical alert at 3 days that wakes up everyone on the infrastructure team. We’ve never hit that 3-day mark, but I sleep better knowing the safety net is there.
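The three tiers can be sketched as Prometheus alerting rules built on certmanager_certificate_expiration_timestamp_seconds, the expiry metric cert-manager exports. The alert names, severity labels, and the one-hour "for" window are my assumptions; routing each severity to a dashboard, the on-call pager, or the whole team would be configured separately in Alertmanager.

```python
import json

EXPIRY_METRIC = "certmanager_certificate_expiration_timestamp_seconds"

# (days before expiry, severity label), most urgent first.
TIERS = [(3, "critical"), (7, "page"), (30, "warning")]


def alert_tier(days_remaining: float):
    """Return the most urgent severity that applies, or None if all is well."""
    for threshold, severity in TIERS:
        if days_remaining < threshold:
            return severity
    return None


def prometheus_rules() -> dict:
    """Build a Prometheus rule group with one alert per tier."""
    rules = []
    for days, severity in TIERS:
        rules.append({
            "alert": f"CertificateExpiresIn{days}Days",
            # Seconds until expiry, compared against the tier threshold.
            "expr": f"{EXPIRY_METRIC} - time() < {days} * 86400",
            "for": "1h",
            "labels": {"severity": severity},
            "annotations": {
                "summary": "Certificate {{ $labels.namespace }}/{{ $labels.name }} "
                           + f"expires in under {days} days",
            },
        })
    return {"groups": [{"name": "cert-manager-expiry", "rules": rules}]}


print(json.dumps(prometheus_rules(), indent=2))
print(alert_tier(45), alert_tier(20), alert_tier(5), alert_tier(1))
# None warning page critical
```

A certificate under three days from expiry matches all three expressions at once, so inhibition or routing rules are what keep the lower tiers quiet when a higher one fires.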

The Backup Plan Nobody Thinks About Until It’s Too Late

Even with all this automation working perfectly, I learned to back up everything. Why? Because one day, someone accidentally deleted our production namespace while cleaning up a test environment. With that namespace went all the certificate secrets.

Now I run a simple script nightly that dumps all certificate secrets and cert-manager resources to a tarball and uploads it to S3. More importantly, I tested the restore process. Not just once, but regularly. Because a backup you’ve never restored is just a file that makes you feel better.
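A backup along those lines might look like the sketch below. This is not my exact script: the bucket name and file names are placeholders, and it assumes kubectl and the AWS CLI are installed and authenticated on the machine that runs it. Nothing executes on import; run_backup() is what a nightly job would call.

```python
import io
import subprocess
import tarfile
from datetime import datetime, timezone

# Archive member name -> arguments for `kubectl get`.
RESOURCES = {
    "tls-secrets.json": ["secrets", "--field-selector", "type=kubernetes.io/tls"],
    "certificates.json": ["certificates.cert-manager.io"],
    "issuers.json": ["issuers.cert-manager.io"],
    "clusterissuers.json": ["clusterissuers.cert-manager.io"],
}


def kubectl_get(args):
    """Dump one resource kind from every namespace as JSON bytes."""
    cmd = ["kubectl", "get", *args, "--all-namespaces", "-o", "json"]
    return subprocess.run(cmd, check=True, capture_output=True).stdout


def build_archive(dumps, path):
    """Write each named payload into a gzip-compressed tarball at `path`."""
    with tarfile.open(path, "w:gz") as tar:
        for name, payload in dumps.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def run_backup(bucket="s3://example-cert-backups"):
    """Dump all resources, archive them, and upload the archive to S3."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = f"/tmp/cert-backup-{stamp}.tar.gz"
    build_archive({name: kubectl_get(args) for name, args in RESOURCES.items()}, path)
    # Hypothetical bucket; the trailing slash keeps the local file name.
    subprocess.run(["aws", "s3", "cp", path, f"{bucket}/"], check=True)
    return path
```

The archive contains private keys, so the bucket deserves the same access controls as the cluster itself. A restore test can be as simple as extracting the tarball and running kubectl apply -f on each file against a scratch cluster.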

What Changed After I Got This Right

The impact has been dramatic. Zero certificate expiry incidents in over two years. Before cert-manager, we dealt with certificate issues almost monthly. Certificate work that used to take me half a day every week now takes maybe 10 minutes a month. Provisioning a new certificate went from two days to two minutes. We’ve saved over $5,000 a year by using Let’s Encrypt instead of commercial CA certificates.

But the biggest impact isn’t in the numbers. It’s in the peace of mind. I don’t think about certificates anymore. That 2 AM phone call doesn’t happen. The post-mortem meetings don’t happen. The emergency scrambles don’t happen.

The Lessons That Stuck

If I could go back and talk to myself before that first certificate expiry incident, I’d say this: automate everything from day one. Don’t wait for disaster to force you into doing the right thing. Let cert-manager handle it, because you will forget, or you’ll be on vacation, or you’ll be dealing with six other fires.

Set up monitoring that alerts early and often. My 30-day, 7-day, and 3-day alert tiers have saved me multiple times. Use a private CA for internal services. The security improvement is real, and the management overhead drops to almost nothing. And back up your certificates, then actually test the restore process. Not someday, but right after you set everything up.

Certificate management used to keep me up at night. Now it’s one of the most boring, reliable parts of our infrastructure. And boring is exactly what you want your infrastructure to be.