My Container Security Playbook: From Build to Runtime

Apr 30, 2022

I learned about container security the hard way. About three years ago, I was on call when our security team discovered we were running containers vulnerable to Log4Shell in production. That 2 AM phone call still gives me nightmares. We spent the entire weekend scrambling to patch and redeploy dozens of services, wondering if we’d already been compromised. Spoiler: we got lucky; there was no breach. But that incident completely changed how I think about container security.

Since then, I’ve implemented security scanning across over 50 microservices and built a system that’s prevented hundreds of vulnerable images from ever seeing production traffic. The biggest lesson I learned? Container security isn’t something you bolt on at the end. It’s a continuous process that needs to be baked into every stage of your development workflow, or you’re just playing security theater.

My Container Security Lifecycle

I’ve broken down my security approach into five stages. This didn’t happen overnight. I built this system incrementally over two years, making mistakes and learning from each security audit. Here’s what actually works.

Stage 1: I Start with a Secure Dockerfile

I used to think Dockerfiles were just recipes for building containers. Then I got a security audit report that ripped apart our images. We had containers running as root (ouch), images bloated with build tools in production (double ouch), and base images we hadn’t updated in over a year (triple ouch). That audit was humbling, but it taught me that security starts at the Dockerfile.

Now I obsess over minimal images. I use multi-stage builds religiously to strip out everything that doesn’t need to run in production. My base images are either distroless or alpine, and every container runs as a non-root user with a read-only root filesystem. These aren’t just security best practices; they’re the things that saved us during the next audit.

# A secure, multi-stage Dockerfile for a Go application
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -ldflags="-w -s" -o /app/server .

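# Final stage: distroless base, non-root user, nothing but the compiled binary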
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=builder --chown=nonroot:nonroot /app/server /app/server
USER nonroot:nonroot
EXPOSE 8080
ENTRYPOINT ["/app/server"]

Stage 2: I Scan Everything in My CI/CD Pipeline

Here’s where I learned another painful lesson. Early on, we had Trivy running in our pipeline, but it was only generating reports. We’d get a nice JSON file with vulnerabilities listed, and developers would… ignore it. Completely. I watched vulnerable images sail into production because nobody actually read those reports.

So I made the pipeline fail. Hard stop. If Trivy or any of our scanners find a CRITICAL or HIGH severity vulnerability, the build dies right there. No image gets pushed to the registry. This made me temporarily unpopular with the dev team (they called me the “pipeline Nazi” for a week), but you know what? It worked. Within a month, our developers were proactively checking for vulnerabilities before pushing code.

Now on every single commit, my GitLab CI pipeline runs three tools: hadolint to lint the Dockerfile for security issues, Trivy to scan for OS and application vulnerabilities, and Grype as a second opinion. I learned the hard way that no single scanner catches everything. Last year, Grype caught a critical vulnerability in a Python package that Trivy completely missed. Running both takes an extra 2 minutes in the pipeline, but it’s saved us at least three times that I know of.

# A simplified CI job for Trivy
trivy-scan:
  stage: scan
  image:
    name: aquasec/trivy:latest
    entrypoint: [""]  # clear the image's trivy entrypoint so the runner can execute the script
  script:
    # Fail the build if any critical or high vulnerabilities are found
    - trivy image --severity HIGH,CRITICAL --exit-code 1 ${IMAGE_NAME}:${IMAGE_TAG}
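
The hadolint and Grype jobs follow the same pattern. Here’s a sketch of how they could look, assuming the Dockerfile sits at the repository root and the same ${IMAGE_NAME} and ${IMAGE_TAG} variables are available; the thresholds are just where I’d start, not gospel.

# Companion scan jobs for hadolint and Grype (sketch)
hadolint:
  stage: scan
  image: hadolint/hadolint:latest-alpine
  script:
    # Fail on Dockerfile findings of severity "warning" or worse
    - hadolint --failure-threshold warning Dockerfile

grype-scan:
  stage: scan
  image:
    name: anchore/grype:latest
    entrypoint: [""]  # clear the entrypoint so the runner can execute the script
  script:
    # Second opinion: fail the build on anything of high severity or above
    - grype ${IMAGE_NAME}:${IMAGE_TAG} --fail-on high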

Stage 3: I Generate an SBOM for Every Image

The Log4Shell incident taught me something crucial: when a new vulnerability drops, you need to know instantly which of your services are affected. During that crisis, we wasted hours manually digging through Dockerfiles and package manifests trying to figure out which containers had Log4j. It was chaos.

Now I generate a Software Bill of Materials (SBOM) for every single image using Syft. Think of it as an ingredient list for your container. When the next big vulnerability hits (and there will be a next one), I can grep through our SBOMs and know within minutes exactly which images are vulnerable. I attach the SBOM directly to the image metadata, so it travels with the container through the entire deployment pipeline. This isn’t theoretical anymore. It’s saved us twice this year alone.
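
Mechanically, this is only a couple of commands. Here’s a rough sketch of the Syft step; attaching the SBOM with cosign is one way to keep it alongside the image in the registry (assuming the registry accepts OCI artifacts), and the filenames here are illustrative.

# Generating an SBOM with Syft and keeping it with the image (sketch)
syft ${IMAGE_NAME}:${IMAGE_TAG} -o spdx-json > sbom.spdx.json

# Attach the SBOM to the image so it travels through the pipeline with it
cosign attach sbom --sbom sbom.spdx.json ${IMAGE_NAME}:${IMAGE_TAG}

# When the next big CVE drops, scan the stored SBOM instead of re-pulling every image
grype sbom:./sbom.spdx.json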

Stage 4: I Sign Every Production Image

I’ll be honest, I resisted implementing image signing for months. It felt like overkill, like I was being paranoid. Then we had an incident where a developer accidentally pushed an unsigned test image to production. The image had debug tools and elevated permissions that absolutely should not have been there. That’s when I stopped arguing with our security team and implemented cosign.

Now every image that passes our security scans gets cryptographically signed. In our Kubernetes cluster, we run an admission controller that acts like a bouncer. No signature? No entry. Invalid signature? Rejected. It adds maybe 15 seconds to our deployment pipeline, but it guarantees that only vetted, scanned images make it to production. The developer who pushed that test image actually thanked me later, because now the system prevents that kind of mistake automatically.
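
The signing step itself is tiny. This is roughly what it looks like after the scans pass; the key handling is illustrative (I’m assuming the private key and its password live in protected CI variables), and the verify command is the same check the admission controller performs.

# Signing the scanned image with cosign (sketch; variable names are illustrative)
cosign sign --yes --key env://COSIGN_PRIVATE_KEY ${IMAGE_NAME}:${IMAGE_TAG}

# The same check the admission controller runs before letting a pod in
cosign verify --key cosign.pub ${IMAGE_NAME}:${IMAGE_TAG}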

Stage 5: I Monitor at Runtime

Here’s the thing about security: scanning before deployment only protects you from known vulnerabilities. Zero-days exist, and they’re terrifying because you can’t scan for something that’s not in the database yet. That’s why I assume every container could potentially be compromised and monitor everything at runtime.

I run Falco with custom rules tuned specifically for our workloads. It watches for suspicious behavior like someone spawning a shell inside a container (why would a production API need bash?), unexpected network connections to weird IPs, or attempts to write to system directories that should be read-only. Last month, Falco caught what turned out to be a compromised dependency trying to phone home to a command and control server. We killed that pod within seconds of the alert. Without runtime monitoring, we might not have noticed for days.
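
To make that concrete, here’s a stripped-down version of the kind of rule I’m talking about. It’s a sketch that leans on macros like spawned_process and container from Falco’s default ruleset, and in practice the condition gets tuned per workload with exceptions for legitimate tooling.

# A simplified Falco rule for shells spawned in containers (sketch)
- rule: Shell Spawned in Production Container
  desc: Detect an interactive shell starting inside any container
  condition: >
    spawned_process and container
    and proc.name in (bash, sh, zsh, ash)
  output: >
    Shell spawned in container (user=%user.name container=%container.name
    image=%container.image.repository command=%proc.cmdline)
  priority: WARNING
  tags: [container, shell]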

The Real-World Impact

My Key Takeaways for Container Security
