I used to get a knot in my stomach every time I had to deploy a new version of a critical service. A single bug could cause a major outage. That’s why I moved to automated canary deployments. This strategy lets me gradually expose a new version to a small percentage of users, automatically measure its performance, and roll back instantly if something goes wrong. The tool that makes this all possible for me is Flagger, which I use with Traefik for traffic shifting and Prometheus for metrics. It has completely changed how I approach releases.
How My Canary Process Works
My canary process is fully automated. When I deploy a new version, Flagger takes over. It starts by sending a tiny fraction of traffic—say, 5%—to the new “canary” version. Then, it watches Prometheus for key metrics like request success rate and latency. If the metrics are healthy (e.g., success rate > 99% and latency < 500ms), it gradually increases the traffic to the canary—10%, 15%, and so on. If at any point the metrics degrade, Flagger immediately rolls back the deployment by shifting all traffic back to the stable version. It’s like having an automated QA engineer and SRE watching every single deployment.
graph LR
Start[Deploy New Version] --> Canary[Canary Phase: 5% Traffic]
Canary --> Check{Metrics OK?}
Check -->|Success > 99%| Increase[Increase to 10%]
Check -->|Failure| Rollback[Automatic Rollback]
Increase --> Eventually[...]
Eventually --> Complete[100% New Version]
style Rollback fill:#faa,stroke:#c00
style Complete fill:#afa,stroke:#0a0
My Setup Workflow
Here’s how I set up this automated workflow.
Step 1: Install Flagger and its Tools
First, I install Flagger into my cluster using its Helm chart. I set the prometheus.install=true and meshProvider=traefik values so the chart installs its bundled Prometheus and Flagger knows to drive Traefik for traffic shifting.
helm repo add flagger https://flagger.app
helm upgrade -i flagger flagger/flagger \
  --namespace flagger-system --create-namespace \
  --set prometheus.install=true \
  --set meshProvider=traefik
Next, I install Flagger’s loadtester component. This is a crucial piece that generates synthetic traffic during the analysis, ensuring the canary is tested even on low-traffic services.
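The loadtester ships as its own chart in the same Helm repo. A minimal sketch of how I install it, assuming it lives in the flagger-system namespace so it matches the webhook URL used in the Canary spec further down:
helm upgrade -i flagger-loadtester flagger/loadtester \
  --namespace flagger-system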
Step 2: Deploy the Application
Then, I deploy the initial “primary” version of my application and create a standard IngressRoute for it.
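Here is a minimal sketch of that step, modeled on the Flagger Traefik tutorial: a plain Deployment for podinfo plus an IngressRoute that targets the TraefikService Flagger will manage for it. The app.example.com host, the web entrypoint, and the 6.1.5 starting tag are assumptions for illustration, and the Traefik CRD API version may differ depending on your Traefik release.
# deployment.yaml -- the initial "primary" workload
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
  namespace: demo-canary
spec:
  replicas: 2
  selector:
    matchLabels:
      app: podinfo
  template:
    metadata:
      labels:
        app: podinfo
    spec:
      containers:
        - name: podinfo
          image: ghcr.io/stefanprodan/podinfo:6.1.5  # assumed starting tag
          ports:
            - containerPort: 9898
---
# ingressroute.yaml -- public entry point; routes to the TraefikService Flagger creates
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: podinfo
  namespace: demo-canary
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`app.example.com`)   # assumed hostname
      kind: Rule
      services:
        - name: podinfo
          kind: TraefikService
          port: 80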
Step 3: Configure the Canary Resource
Now for the magic: the Canary resource. This is where I define the entire automated analysis. I point it to my application’s Deployment. In the analysis block, I define my interval (how often to check), threshold (how many failures before rollback), stepWeight (how much to increase traffic by), and maxWeight (the ceiling for the canary traffic).
The most important part is the metrics section. I define my golden signals here: a request-success-rate that must be above 99%, and a request-duration (P99 latency) that must be below 500ms.
# canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: demo-canary
spec:
  provider: traefik
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  progressDeadlineSeconds: 60
  service:
    port: 80
    targetPort: 9898
  analysis:
    interval: 10s
    threshold: 10
    maxWeight: 50
    stepWeight: 5
    metrics:
      - name: request-success-rate
        interval: 1m
        thresholdRange:
          min: 99
      - name: request-duration
        interval: 1m
        thresholdRange:
          max: 500
    webhooks:
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.flagger-system/
        timeout: 5s
        metadata:
          type: cmd
          cmd: "hey -z 10m -q 10 -c 2 http://podinfo.demo-canary:9898/"
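Once I apply this manifest, Flagger bootstraps the canary: it creates the podinfo-primary Deployment and the services it needs, then waits for a new image to analyze. Applying and verifying is just two commands:
kubectl apply -f canary.yaml

# The canary should report Initialized before any rollout is triggered
kubectl -n demo-canary get canary podinfo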
Step 4: Trigger and Watch the Canary
To trigger a canary deployment, all I have to do is update the image tag in my Deployment manifest. As soon as I run kubectl set image, Flagger detects the change and kicks off the process.
kubectl -n demo-canary set image deployment/podinfo podinfo=ghcr.io/stefanprodan/podinfo:6.1.6
I can watch the whole thing unfold with kubectl get canaries -w. I see the WEIGHT column slowly increase from 5 to 10, to 15, and so on. If everything goes well, the status eventually changes to Succeeded, and the new version is promoted to primary. If not, it changes to Failed, and I know Flagger has already rolled it back safely.
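In practice I keep two commands handy during a rollout; the describe output includes the analysis events, which is where a rollback reason shows up:
# Watch the weight and status change in real time
kubectl -n demo-canary get canaries -w

# Inspect analysis events, including why a canary was rolled back
kubectl -n demo-canary describe canary/podinfo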
My Final Thoughts
My key takeaway is that Flagger completely automates the process of progressive delivery, significantly reducing the risk of deployments. My best practice is to start with conservative thresholds (longer intervals, smaller steps) and then tune them based on historical data. I also make sure to test my rollback scenarios in staging by intentionally deploying a broken version. The combination of Flagger, Traefik, and Prometheus gives me the confidence to deploy frequently and safely, knowing that any issues will be caught and reverted automatically before they can impact our users.
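For the intentional-failure drill, one approach that works with podinfo specifically (its /status/{code} endpoint is a feature of that demo app, not of Flagger) is to trigger a new rollout and then hammer the canary with errors from the loadtester pod, so the success-rate check fails and Flagger reverts. A sketch, with the tag purely illustrative and the target address mirroring the load-test webhook above:
# Kick off a canary run with the next tag (tag is illustrative)
kubectl -n demo-canary set image deployment/podinfo podinfo=ghcr.io/stefanprodan/podinfo:6.1.8

# Flood the canary with HTTP 500s so request-success-rate drops below 99%
kubectl -n flagger-system exec deploy/flagger-loadtester -- \
  hey -z 1m -q 10 -c 2 http://podinfo-canary.demo-canary:9898/status/500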