Our log volume was exploding. We were generating over 500GB of logs every day, and our existing system (basically just grepping through files stored on S3) couldn’t keep up. Debugging was nearly impossible, and our cloud bills were getting out of control. I was tasked with finding a scalable and cost-effective solution.
The situation had gotten ridiculous. An engineer would ping me saying “I need to find logs from service X between 2 PM and 3 PM yesterday,” and I’d have to tell them it would take 20 minutes just to download and decompress the right files. Meanwhile, our DevOps budget was getting hammered because we were paying S3 retrieval costs every time someone needed to debug something. The CTO finally pulled me aside after a particularly bad incident and said “fix this, I don’t care how.”
So I did what any good engineer does when given an impossible problem: I built a bake-off. I set up production-grade versions of all three main contenders and threw real traffic at them for a month. This wasn’t some toy evaluation with sample data. I wanted to know how these systems would actually perform under our specific, messy, high-volume workload.
My Evaluation Process
Contender #1: The ELK Stack
I started with ELK (Elasticsearch, Logstash, Kibana) because it’s the industry standard. Everyone and their dog uses ELK, so how hard could it be?
Turns out, pretty hard. I spent three days just getting Elasticsearch configured properly. The number of knobs you can turn in Elasticsearch is overwhelming. Shard count, replica settings, heap sizes, JVM garbage collection tuning… it felt like I was back in college optimizing database indexes for my comp sci homework.
But once I got it running, I have to admit, the search capabilities were incredible. I could run fuzzy text searches across our entire log corpus. Regular expressions, wildcard queries, complex boolean logic, you name it. During the testing period, an engineer asked me to find all instances where a specific user ID appeared in any log line containing the word “payment” but not “success”. I built that query in Kibana in about 30 seconds, and it worked perfectly.
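To give a sense of what that looks like, here’s roughly the kind of bool query it translates to in Elasticsearch’s query DSL (an illustrative sketch, not the exact query I ran; the message field name and user ID are placeholders):

{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "message": "user-12345" } },
        { "match": { "message": "payment" } }
      ],
      "must_not": [
        { "match": { "message": "success" } }
      ]
    }
  }
}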
The problem was everything else. The Elasticsearch cluster needed constant babysitting. One node would run out of heap space, and the whole cluster would start rejecting writes. I’d tune the heap settings, and then a different node would have shard allocation problems. It felt like playing whack-a-mole. And the costs? We were running six m5.2xlarge instances just for the data nodes, plus separate master nodes, plus Logstash instances. The bill was adding up fast.
Contender #2: AWS CloudWatch Logs
After the ELK headaches, CloudWatch looked appealing for one simple reason: someone else’s problem. No infrastructure to manage. No clusters to babysit. No JVM tuning at 2 AM. Just point your logs at an API endpoint and let AWS handle it.
Setup was embarrassingly easy. I had our entire test environment shipping logs to CloudWatch in about an hour. The Fluent Bit config was straightforward, and CloudWatch Logs Insights (the built-in query language) was actually pretty nice. I could write queries that felt natural, and the results came back fast.
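A typical query looked something like this (the @-prefixed fields are the standard ones Logs Insights exposes; the filter itself is just an illustration):

fields @timestamp, @message
| filter @message like /error/
| sort @timestamp desc
| limit 100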
Then I got the bill estimate.
At our scale of 500GB per day, CloudWatch was going to cost us over $8,000 per month just for ingestion. That didn’t even include storage or query costs. I ran the numbers three times because I was sure I’d made a mistake. Nope. AWS’s pricing model charges per GB ingested, and when you’re pushing half a terabyte daily, those pennies per GB add up to real money fast.
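The back-of-envelope math isn’t complicated. At the time, ingestion in our region was priced at roughly $0.50 per GB (treat that as approximate; rates vary by region and tier and have changed since): 500 GB/day × 30 days × $0.50/GB ≈ $7,500 per month. Since our real volume ran a bit over 500GB a day, the estimate landed north of $8,000 before a single byte of storage or a single query.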
I brought the estimate to our finance team. Our CFO literally laughed and said “absolutely not.” So much for the managed solution.
Contender #3: Grafana Loki
Loki was the dark horse candidate. I’d heard about it but never used it in production. The pitch sounded almost too good to be true: “Like Prometheus, but for logs.” Given that we were already heavy Prometheus and Grafana users, I figured it was worth a shot.
The fundamental difference with Loki is that it doesn’t try to index everything. ELK indexes every single word in every single log line, which is why it’s so powerful but also so expensive. Loki takes a different approach. It only indexes metadata labels like app, namespace, environment, level. The actual log content just gets compressed and stored in cheap object storage like S3.
When I first heard this, I was skeptical. “How can you search logs if you don’t index them?” The answer is that you can; it’s just different. Instead of searching across everything, you first filter by labels to narrow down which log streams you want, then Loki downloads and searches just those streams. For 90% of our debugging use cases (find errors in service X during this time range), this works perfectly fine.
Setup was surprisingly smooth. I deployed Loki to our Kubernetes cluster using their Helm chart, configured it to use S3 for storage, and set up Promtail (their log shipper) on all our nodes. The whole thing took maybe half a day. Compare that to the three days I spent wrestling with Elasticsearch.
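For a flavor of what that setup involves, here’s an abridged sketch of the storage side of the config (keys and schema versions vary across Loki releases and Helm chart versions, and the bucket name is made up, so treat this as illustrative rather than something to copy-paste):

# Abridged, illustrative Loki config: index shipped via boltdb-shipper, chunks in S3
schema_config:
  configs:
    - from: "2023-01-01"
      store: boltdb-shipper
      object_store: s3
      schema: v12
      index:
        prefix: index_
        period: 24h
storage_config:
  aws:
    region: us-east-1
    bucketnames: loki-chunks-example   # hypothetical bucket name
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
    shared_store: s3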
The query language, LogQL, felt immediately familiar because it’s basically PromQL for logs. Our engineers already knew PromQL from working with Prometheus, so the learning curve was almost nonexistent.
# A simple LogQL query to find errors in our production backend
{namespace="production", app="backend"} |= "error" | json | level="error"
The Moment of Truth: Cost Analysis
After running all three systems for a month with real production traffic, I sat down to crunch the actual numbers. I built a spreadsheet breaking down compute costs, storage costs, data transfer, everything. The results weren’t even close.
For a 100GB/day workload (I scaled down from our full 500GB to make the comparison cleaner), here’s what it would actually cost us:
ELK Stack came in at around $1,480 per month. That included six data nodes, three master nodes, monitoring, and backup storage. It was the most expensive to operate, but at least I had full control.
CloudWatch was slightly higher at $1,590 per month, and that was just for ingestion and basic storage. Every query cost extra. Every exported log cost extra. The convenience of managed services was costing us a premium.
Loki? $425 per month. I checked my math three times because it seemed too good. The bulk of that cost was just S3 storage and a couple of small Kubernetes pods. The compression was so effective that our actual storage footprint was way smaller than I expected.
I made a nice chart showing the cost breakdown and sent it to the engineering leadership team. The VP of Engineering replied in about five minutes: “Loki it is.”
What Happened Next
We’ve been running Loki in production for over a year now, and here’s the real-world data:
We rolled it out to all our production clusters. The full deployment handles our 500GB per day of logs without breaking a sweat. The monthly cost for everything (compute, storage, bandwidth) is around $2,100. That’s a 70% savings compared to what CloudWatch would have cost us and 65% compared to ELK. Our finance team was thrilled.
Performance-wise, our p95 query response time sits under 3 seconds. That’s not as fast as ELK’s sub-second searches, but it’s more than good enough for debugging. Most importantly, engineers actually use it. The Grafana integration means they can see logs and metrics in the same dashboard, which turned out to be way more valuable than I expected.
The operational overhead is minimal. Loki mostly just works. I’ve had to intervene maybe twice in the past year, both times for routine maintenance. Compare that to ELK, where I was getting paged every other week for cluster issues.
What I Learned
Structured logging isn’t optional anymore. We’d already moved all our services to emit JSON logs, and that decision paid off massively during this project. Being able to parse and filter by structured fields made every solution work better, but it was especially important for Loki’s label-based approach.
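To make that concrete, a structured line from one of our services looks something like the one below (field names are representative, not an actual sample). It’s exactly what lets a LogQL pipeline stage like | json | level="error" pull out fields without a full-text index:

{"timestamp": "2024-03-01T14:02:11Z", "level": "error", "app": "backend", "msg": "payment processing failed", "user_id": "user-12345", "trace_id": "abc123"}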
The biggest insight was about indexing strategy. ELK’s approach of indexing everything is powerful, but it’s overkill for most use cases. Loki’s label-only indexing hits the sweet spot for modern cloud-native apps. You get 90% of the value for 30% of the cost.
I also learned that managed services aren’t always the answer. CloudWatch is great if you’re pushing small log volumes, but AWS’s per-GB ingestion pricing becomes brutal at scale. The operational effort of running Loki ourselves was minimal thanks to Kubernetes, and the cost savings were impossible to ignore.
The final lesson? Always run a proper bake-off with real data. I could have read blog posts and made a decision based on what worked for other companies. But actually running all three systems with our actual logs and our actual query patterns gave me the confidence to make the right choice for us.