Deploy with Confidence: My Automated Canary Releases with Flagger and Traefik

Mar 05, 2024

I used to get a knot in my stomach every time I had to deploy a new version of a critical service. A single bug could cause a major outage. That’s why I moved to automated canary deployments. This strategy lets me gradually expose a new version to a small percentage of users, automatically measure its performance, and roll back instantly if something goes wrong. The tool that makes this all possible for me is Flagger, which I use with Traefik for traffic shifting and Prometheus for metrics. It has completely changed how I approach releases.

How My Canary Process Works

My canary process is fully automated. When I deploy a new version, Flagger takes over. It starts by sending a tiny fraction of traffic—say, 5%—to the new “canary” version. Then, it watches Prometheus for key metrics like request success rate and latency. If the metrics are healthy (e.g., success rate > 99% and latency < 500ms), it gradually increases the traffic to the canary—10%, 15%, and so on. If at any point the metrics degrade, Flagger immediately rolls back the deployment by shifting all traffic back to the stable version. It’s like having an automated QA engineer and SRE watching every single deployment.

graph LR
    Start[Deploy New Version] --> Canary[Canary Phase: 5% Traffic]
    Canary --> Check{Metrics OK?}
    Check -->|Success > 99%| Increase[Increase to 10%]
    Check -->|Failure| Rollback[Automatic Rollback]
    Increase --> Eventually[...]
    Eventually --> Complete[100% New Version]

    style Rollback fill:#faa,stroke:#c00
    style Complete fill:#afa,stroke:#0a0

My Setup Workflow

Here’s how I set up this automated workflow.

Step 1: Install Flagger and Its Load Tester

First, I install Flagger into my cluster using its Helm chart. I make sure to use the prometheus.install=true and meshProvider=traefik flags to get the whole stack set up correctly.

helm repo add flagger https://flagger.app
helm upgrade -i flagger flagger/flagger \
  --namespace flagger-system --create-namespace \
  --set prometheus.install=true \
  --set meshProvider=traefik

Next, I install Flagger’s loadtester component. This is a crucial piece that generates synthetic traffic during the analysis, ensuring the canary is tested even on low-traffic services.
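Installing it is one more Helm release from the same repo. This is a minimal sketch; I put it in the same flagger-system namespace that the load-test webhook in my Canary resource points at later on:

helm upgrade -i flagger-loadtester flagger/loadtester \
  --namespace flagger-system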

Step 2: Deploy the Application

Then, I deploy the initial “primary” version of my application and create a standard IngressRoute for it.
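As a rough sketch of what that route looks like for the podinfo demo app (modeled on Flagger's Traefik tutorial; the app.example.com host and the traefik.containo.us API version are placeholders that depend on my DNS and Traefik version, and the Deployment itself is just the stock podinfo Deployment listening on port 9898), the IngressRoute points at a TraefikService named after the app, which Flagger creates and uses to shift weight between the primary and canary:

# ingressroute.yaml (sketch; host and apiVersion are placeholders)
apiVersion: traefik.containo.us/v1alpha1  # traefik.io/v1alpha1 on newer Traefik releases
kind: IngressRoute
metadata:
  name: podinfo
  namespace: demo-canary
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`app.example.com`)
      kind: Rule
      services:
        - name: podinfo
          kind: TraefikService  # created by Flagger from the Canary resource below
          port: 80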

Step 3: Configure the Canary Resource

Now for the magic: the Canary resource. This is where I define the entire automated analysis. I point it to my application’s Deployment. In the analysis block, I define my interval (how often to check), threshold (how many failures before rollback), stepWeight (how much to increase traffic by), and maxWeight (the ceiling for the canary traffic).

The most important part is the metrics section. I define my golden signals here: a request-success-rate that must be above 99%, and a request-duration (P99 latency) that must be below 500ms.

# canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: demo-canary
spec:
  provider: traefik
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  progressDeadlineSeconds: 60
  service:
    port: 80
    targetPort: 9898
  analysis:
    interval: 10s
    threshold: 10
    maxWeight: 50
    stepWeight: 5
    metrics:
    - name: request-success-rate
      interval: 1m
      thresholdRange:
        min: 99
    - name: request-duration
      interval: 1m
      thresholdRange:
        max: 500
    webhooks:
    - name: load-test
      type: rollout
      url: http://flagger-loadtester.flagger-system/
      timeout: 5s
      metadata:
        type: cmd
        cmd: "hey -z 10m -q 10 -c 2 http://podinfo.demo-canary:9898/"

Step 4: Trigger and Watch the Canary

To trigger a canary deployment, all I have to do is update the container image on my Deployment. As soon as I run kubectl set image, Flagger detects the change and kicks off the process.

kubectl -n demo-canary set image deployment/podinfo podinfo=ghcr.io/stefanprodan/podinfo:6.1.6

I can watch the whole thing unfold with kubectl -n demo-canary get canaries -w. I see the WEIGHT column slowly increase from 5 to 10, to 15, and so on. If everything goes well, the status eventually changes to Succeeded, and the new version is promoted to primary. If not, it changes to Failed, and I know Flagger has already rolled it back safely.
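When I want more detail than the weight column, these are the commands I usually reach for (a quick sketch; the flagger-system namespace matches my install above):

# event-by-event analysis trail: weight increases, failed checks, promotion or rollback
kubectl -n demo-canary describe canary/podinfo

# Flagger controller logs, useful when a canary stalls or fails unexpectedly
kubectl -n flagger-system logs deploy/flagger -f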

My Final Thoughts

My key takeaway is that Flagger completely automates the process of progressive delivery, significantly reducing the risk of deployments. My best practice is to start with conservative thresholds (longer intervals, smaller steps) and then tune them based on historical data. I also make sure to test my rollback scenarios in staging by intentionally deploying a broken version. The combination of Flagger, Traefik, and Prometheus gives me the confidence to deploy frequently and safely, knowing that any issues will be caught and reverted automatically before they can impact our users.
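For the rollback test I mention above, here is a sketch of what I do with the podinfo demo app: it exposes /status/{code} and /delay/{seconds} endpoints for fault injection, so after triggering a new canary run with another kubectl set image, I generate errors against the canary from inside the cluster and watch Flagger abort the rollout:

# exec into the load tester and hammer the canary with HTTP 500s; the success-rate
# check drops below 99% and Flagger shifts all traffic back to the primary
kubectl -n flagger-system exec -it deploy/flagger-loadtester -- \
  sh -c "watch curl -s http://podinfo-canary.demo-canary/status/500"

If the success-rate metric is scraped from Traefik rather than the app itself, I send these requests through the ingress host instead so the errors register in Traefik's metrics.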
