Production-Ready Vault Secret Management on Kubernetes

Jan 20, 2025

I’ve spent the last two years building and operating a production-grade secret management system using HashiCorp Vault on Kubernetes. This wasn’t a quick weekend project. It was a journey from manually unsealing Vault at 3 AM to a fully automated, self-healing system that manages thousands of secrets across multiple clusters.

Let me walk you through everything I learned, broken down into the key phases of implementation. Each section links to detailed guides where I dive deep into the specific challenges and solutions.

The Journey: From Manual Hell to Automated Paradise

When I first deployed Vault to production, I didn’t know what I was getting into. The default Vault setup requires manual unsealing every time it restarts. That means a human with unseal keys needs to be available 24/7. I learned this the hard way during my first on-call rotation when a Kubernetes node upgrade restarted Vault at 3 AM. I had to wake up, VPN in, and manually unseal the cluster while every application that depended on Vault for secrets sat broken.

That incident led me down the path of auto-unseal, which led to better operations practices, which led to automated secret synchronization. Each improvement built on the last. Here’s how it all fits together.

Phase 1: Auto-Unseal with Google Cloud KMS

The Problem: Manual unsealing was killing us. Every Vault restart required human intervention. Every Kubernetes upgrade became a potential incident.

The Solution: I implemented auto-unseal using Google Cloud KMS. This allows Vault to unseal itself automatically using encryption keys stored in Google’s Hardware Security Modules.

The setup involves creating KMS keys, configuring service accounts, and updating Vault’s configuration. It sounds simple, but the devil is in the details. Get the IAM permissions wrong and Vault fails to start with cryptic error messages. Forget to configure the Vault pods correctly and auto-unseal won’t work.
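
To make the Vault-side piece of that setup concrete, here is a minimal sketch in Python that renders the seal "gcpckms" stanza for Vault's configuration file. The helper name render_gcpckms_seal and the project, key ring, and key names in the usage line are hypothetical placeholders, not real values; the four parameters are the ones the gcpckms seal expects.

```python
def render_gcpckms_seal(project: str, region: str, key_ring: str, crypto_key: str) -> str:
    """Render a Vault `seal "gcpckms"` stanza as HCL text.

    All argument values are supplied by the caller; nothing here talks to GCP.
    """
    params = {
        "project": project,
        "region": region,
        "key_ring": key_ring,
        "crypto_key": crypto_key,
    }
    for name, value in params.items():
        # Reject empty values and embedded quotes, which would produce broken HCL.
        if not value or '"' in value:
            raise ValueError(f"invalid value for {name}: {value!r}")
    width = max(len(name) for name in params)
    body = "\n".join(f'  {name.ljust(width)} = "{value}"' for name, value in params.items())
    return f'seal "gcpckms" {{\n{body}\n}}\n'


if __name__ == "__main__":
    # Placeholder names for illustration only.
    print(render_gcpckms_seal("my-project", "global", "vault-keyring", "vault-unseal-key"))
```

The stanza is only half of the job. The service account Vault runs as still needs IAM permission to encrypt and decrypt with that key, and that is where the cryptic startup errors tend to come from.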

After implementing this, Vault restarts became invisible. Kubernetes could upgrade nodes at 3 AM and I’d sleep through it. No pages. No manual intervention. Just automatic unsealing.

Read the full guide: GCP KMS Auto-Unseal Setup

Phase 2: Deploying Vault on Kubernetes

The Problem: Running Vault in production on Kubernetes requires careful configuration. High availability, persistent storage, proper resource limits, monitoring, backup strategies. Miss any of these and you’re setting yourself up for an outage.

The Solution: I built a production-ready Vault deployment on GKE with HA configuration, automated backups, and comprehensive monitoring.

This phase taught me about Vault’s Raft storage backend, how to configure proper health checks, how to size Vault pods, and how to handle storage for Vault data. I made mistakes here. I initially under-provisioned CPU and watched Vault struggle during high request loads. I learned about Vault’s audit log performance impact the hard way when audit logs filled a disk and took down the cluster.
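
Health checks are easier to reason about once you know what Vault's health endpoint reports. The sketch below, in Python, maps the default HTTP status codes of GET /v1/sys/health to node states. The readiness rule at the end (treat any unsealed node as ready) is an assumption for an HA setup, not necessarily how your probes should behave.

```python
from enum import Enum


class VaultState(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    DR_SECONDARY = "dr_secondary"
    PERF_STANDBY = "performance_standby"
    UNINITIALIZED = "uninitialized"
    SEALED = "sealed"
    UNKNOWN = "unknown"


# Default status codes returned by GET /v1/sys/health.
_HEALTH_CODES = {
    200: VaultState.ACTIVE,         # initialized, unsealed, active
    429: VaultState.STANDBY,        # unsealed, standby
    472: VaultState.DR_SECONDARY,   # disaster recovery secondary
    473: VaultState.PERF_STANDBY,   # performance standby
    501: VaultState.UNINITIALIZED,  # not initialized
    503: VaultState.SEALED,         # sealed
}


def classify_health(status_code: int) -> VaultState:
    """Translate a /v1/sys/health status code into a node state."""
    return _HEALTH_CODES.get(status_code, VaultState.UNKNOWN)


def is_ready(status_code: int) -> bool:
    """Assumed readiness rule: unsealed nodes (active or standby) count as ready."""
    return classify_health(status_code) in (
        VaultState.ACTIVE,
        VaultState.STANDBY,
        VaultState.PERF_STANDBY,
    )


if __name__ == "__main__":
    for code in (200, 429, 503):
        print(code, classify_health(code).value, is_ready(code))
```

The useful distinction is between 429 and 503: a standby node answers with a non-200 code but is perfectly healthy, while a sealed node is not.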

Read the full guide: Vault on Kubernetes Auto-Unseal with GCP KMS

Phase 3: Day-to-Day Operations

The Problem: Once Vault is deployed, you need to actually use it. Creating secrets, managing policies, granting access to applications, rotating credentials. The Vault CLI is powerful but doing everything manually doesn’t scale.

The Solution: I developed operational patterns and automation for common Vault tasks. Policy management, secret creation workflows, access control patterns, credential rotation strategies.

This is where theory meets reality. You can read Vault documentation all day, but until you’re actually managing hundreds of secrets for dozens of applications, you don’t understand the operational complexity. I learned to template Vault policies, automate secret creation through GitOps, and build self-service workflows for developers.
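
As an illustration of what templating a policy can look like, here is a small Python sketch that renders a read-only policy for one application. It assumes a KV v2 secrets engine and a per-application path layout under the mount; the mount name "secret", the layout, and the helper name are hypothetical choices, not the exact templates from my setup.

```python
import re
from string import Template

# Conservative name check so a typo or "../" cannot widen the policy path.
_NAME = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")

# KV v2 serves secret values under data/ and key listings under metadata/.
_READ_ONLY = Template("""\
# Read-only access to the KV v2 secrets of $app
path "$mount/data/$app/*" {
  capabilities = ["read"]
}

path "$mount/metadata/$app/*" {
  capabilities = ["read", "list"]
}
""")


def render_read_only_policy(app: str, mount: str = "secret") -> str:
    """Render an HCL policy granting read-only access to one app's secrets."""
    for label, value in (("application", app), ("mount", mount)):
        if not _NAME.match(value):
            raise ValueError(f"invalid {label} name: {value!r}")
    return _READ_ONLY.substitute(app=app, mount=mount)


if __name__ == "__main__":
    print(render_read_only_policy("billing-api"))
```

Generating policies from one reviewed template means a new application gets the same path shape as every other one, and the rendered output can be committed and diffed like any other code.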

Read the full guide: Vault Kubernetes Operations Guide

Phase 4: Automated Secret Synchronization

The Problem: Having secrets in Vault is great, but applications need those secrets as Kubernetes Secrets. Manually creating Kubernetes Secrets from Vault secrets is error-prone and doesn’t scale.

The Solution: I implemented the Vault Secrets Operator, which automatically synchronizes secrets from Vault to Kubernetes. Define a VaultStaticSecret custom resource, and the operator creates and maintains the corresponding Kubernetes Secret.

This was the final piece that made the system truly self-service. Developers can create secrets in Vault through our GitOps workflow, define a VaultStaticSecret custom resource, and their application automatically gets a Kubernetes Secret with the data. No manual steps. No tickets to the ops team. Just automation.

The operator also propagates rotated secrets automatically. When a secret changes in Vault, the operator updates the Kubernetes Secret, and depending on how your application is configured, the application can pick up the new value without a manual redeploy.
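
To show the shape of such a resource, here is a hedged Python sketch that builds a VaultStaticSecret manifest as a dictionary and prints it as JSON, which kubectl accepts just like YAML. The field names follow the operator's v1beta1 API as I understand it, so check them against the CRD version installed in your cluster. The secret names, namespace, paths, and the "default" VaultAuth reference are hypothetical.

```python
import json
from typing import Optional


def vault_static_secret(
    name: str,
    namespace: str,
    mount: str,
    path: str,
    auth_ref: str = "default",
    refresh_after: str = "30s",
    restart_deployment: Optional[str] = None,
) -> dict:
    """Build a VaultStaticSecret manifest for a KV v2 secret."""
    spec = {
        "vaultAuthRef": auth_ref,        # name of a VaultAuth resource (assumed to exist)
        "mount": mount,                  # KV v2 mount in Vault
        "type": "kv-v2",
        "path": path,                    # secret path below the mount
        "refreshAfter": refresh_after,   # how often the operator re-reads Vault
        "destination": {"name": name, "create": True},
    }
    if restart_deployment:
        # Ask the operator to restart this Deployment when the secret changes.
        spec["rolloutRestartTargets"] = [{"kind": "Deployment", "name": restart_deployment}]
    return {
        "apiVersion": "secrets.hashicorp.com/v1beta1",
        "kind": "VaultStaticSecret",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


if __name__ == "__main__":
    manifest = vault_static_secret(
        "db-credentials", "payments", "secret", "payments/db",
        restart_deployment="payments-api",
    )
    print(json.dumps(manifest, indent=2))
```

The optional restart target is one way to handle applications that only read secrets at startup. Applications that re-read mounted secret files on their own do not need it.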

Read the full guide: Vault CRD Kubernetes Secret Sync

The Complete Picture

Here’s how all these pieces fit together in production:

  1. Vault runs on Kubernetes with auto-unseal enabled via GCP KMS
  2. Vault automatically unseals when pods restart (no human intervention)
  3. Developers create secrets in Vault through GitOps workflows
  4. Vault Secrets Operator automatically syncs those secrets to Kubernetes
  5. Applications consume standard Kubernetes Secrets (no Vault-specific code needed)
  6. Secrets updated in Vault propagate to Kubernetes automatically
  7. Everything is audited through Vault’s audit logs

The system is self-healing. Pods can restart, nodes can fail, Kubernetes can upgrade, and secrets keep flowing to applications. The operational burden dropped from “wake up at 3 AM to unseal Vault” to “check dashboards once a week.”

What I Learned Along the Way

Start with auto-unseal. Don’t even consider running Vault in production without it. Manual unsealing is operational hell disguised as security.

Test disaster recovery before you need it. I run quarterly DR drills where we completely destroy Vault and restore from backups. Every single drill has uncovered something that needed fixing. Better to find those issues during a drill than during a real incident.
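
For a drill like this, it helps to have the backup and restore commands scripted rather than typed from memory. The Python sketch below assumes backups are Raft snapshots taken with the vault operator raft snapshot CLI; the backup directory, file naming scheme, and dry-run wrapper are hypothetical conventions for illustration.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def snapshot_path(backup_dir: str, now: datetime) -> Path:
    """Return a timestamped snapshot file path (UTC, sortable by name)."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return Path(backup_dir) / f"vault-raft-{stamp}.snap"


def save_command(path: Path) -> list[str]:
    """Command that writes a Raft snapshot to `path`."""
    return ["vault", "operator", "raft", "snapshot", "save", str(path)]


def restore_command(path: Path, force: bool = False) -> list[str]:
    """Command that restores a Raft snapshot from `path`."""
    cmd = ["vault", "operator", "raft", "snapshot", "restore"]
    if force:
        # -force is needed when restoring into a cluster with different keys.
        cmd.append("-force")
    cmd.append(str(path))
    return cmd


def run(cmd: list[str], dry_run: bool = True) -> None:
    """Print the command in dry-run mode, otherwise execute it and fail loudly."""
    if dry_run:
        print("would run:", " ".join(cmd))
        return
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    target = snapshot_path("/backups", datetime.now(timezone.utc))
    run(save_command(target))
    run(restore_command(target, force=True))
```

Keeping the default as a dry run is deliberate: during a drill you want to read the exact restore command before it touches a cluster.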

The Vault Secrets Operator is a game-changer for Kubernetes integration. It eliminates an entire class of operational work (manually creating and updating Kubernetes Secrets) and makes the system self-service for developers.

Vault policies are code. Treat them like code. Version control them. Review them. Test them. Template them. I’ve seen too many production incidents caused by typos in manually edited Vault policies.

Monitoring and alerting are essential. I monitor Vault’s sealed/unsealed status, Raft cluster health, secret sync lag, and audit log delivery. If any of these fail, I want to know immediately, not when an application can’t start because it can’t get secrets.
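
As a rough sketch of two of those checks, the Python below turns a seal-status response and a sync lag measurement into a list of alerts. The "initialized" and "sealed" fields come from the JSON body of Vault's /v1/sys/seal-status endpoint. How the sync lag is measured and the five-minute threshold are assumptions made up for this example.

```python
import json


def evaluate_vault(
    seal_status_json: str,
    sync_lag_seconds: float,
    max_sync_lag_seconds: float = 300.0,
) -> list[str]:
    """Return human-readable alerts; an empty list means nothing to page on."""
    status = json.loads(seal_status_json)
    alerts = []
    # Missing fields are treated as the unsafe case so a bad response still alerts.
    if not status.get("initialized", False):
        alerts.append("vault is not initialized")
    if status.get("sealed", True):
        alerts.append("vault is sealed")
    if sync_lag_seconds > max_sync_lag_seconds:
        alerts.append(
            f"secret sync lag {sync_lag_seconds:.0f}s exceeds {max_sync_lag_seconds:.0f}s"
        )
    return alerts


if __name__ == "__main__":
    healthy = json.dumps({"initialized": True, "sealed": False})
    print(evaluate_vault(healthy, sync_lag_seconds=12))
    sealed = json.dumps({"initialized": True, "sealed": True})
    print(evaluate_vault(sealed, sync_lag_seconds=900))
```

In a real setup the same conditions would more likely live as alert rules in your monitoring system. The point here is only which conditions should page someone.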

Cost and Complexity

The total cost for running this system in production is surprisingly low: about $170/month for a system that manages thousands of secrets for our entire infrastructure.

The complexity is front-loaded. Getting everything set up correctly takes time and careful attention to detail. But once it’s running, the operational burden is minimal. I spend maybe 2-3 hours per month on Vault operations, mostly updating versions and reviewing audit logs.

Where to Start

If you’re building this from scratch, follow the phases in order:

  1. Start with auto-unseal. Get that working in a test cluster first.
  2. Deploy Vault to Kubernetes with HA configuration.
  3. Practice common operations until you’re comfortable.
  4. Add the Vault Secrets Operator for automatic secret sync.

Each phase builds on the previous one. Don’t skip ahead. I tried to shortcut this process and ended up with a Vault deployment that worked in staging but failed spectacularly in production.

Read through the linked guides in order. Each one assumes you’ve completed the previous phase. The guides include real code, real configurations, and real troubleshooting advice based on problems I actually encountered.

This is a robust, production-ready secret management system. It’s not perfect, but it’s been running in production for two years handling thousands of secrets across multiple clusters with minimal operational overhead. That’s good enough for me.

