Datadog is genuinely great software.
It’s also how platform teams accidentally end up signing six-figure observability contracts.
The problem isn’t Datadog. It’s how observability costs scale - and how few teams see it coming until the bill is already painful.
How the Bill Spirals
The pattern is the same everywhere. You start small. A few services. A few dashboards. The value is obvious and the bill is fine.
Then the platform grows.
- More services get onboarded
- More logs get shipped
- More metrics get collected
- Custom metrics creep in
- APM gets enabled on everything
- Log retention quietly becomes 30 days because someone asked for it
- A new team adds their own dashboards and nobody decommissions the old ones
Nobody made a bad decision. The bill just doubled. Then doubled again.
The Pricing Model Works Against You at Scale
Most commercial observability vendors price on ingest volume - logs per GB ingested, custom metrics per unique time series, traces per span, infrastructure monitoring per host. This model is manageable at small scale but punishing at platform scale.
Here’s what catches teams out:
Custom metrics are expensive. A single Prometheus-style metric with high cardinality (multiple label combinations) can generate thousands of time series. On Datadog, custom metrics are billed per unique time series. A well-intentioned developer adding a few labels to a metric can increase your bill by thousands of pounds a month - without anyone noticing until the invoice arrives.
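The arithmetic behind cardinality blowups is worth making concrete. A rough sketch, assuming (worst case) that every label combination is actually observed - the label names and counts below are hypothetical:

```python
# Illustrative only: how label combinations multiply into billable time series.
# Assumes every combination of label values is observed at least once.
from math import prod

def series_count(label_cardinalities):
    """Unique time series for one metric = product of distinct values per label."""
    return prod(label_cardinalities)

# One HTTP metric with seemingly harmless labels (hypothetical counts):
labels = {
    "method": 5,     # GET, POST, ...
    "status": 10,    # 200, 404, 500, ...
    "endpoint": 50,  # route templates
    "pod": 200,      # one value per pod after an autoscale event
}
print(series_count(labels.values()))  # 500000 series from a single metric
```

Four innocuous labels turn one metric into half a million billable series - which is why a single well-intentioned commit can move the invoice.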
Log volume is hard to predict. Verbose application logging, debug-level logs left on in production, and retry storms can cause log ingest to spike dramatically. Per-GB pricing means those spikes hit your wallet directly.
APM costs scale with traffic. Tracing every request across every service generates enormous volumes of span data. At high throughput, APM alone can become the largest line item on your observability bill.
Retention costs compound. 15 days of log retention costs half as much as 30 days. But once someone asks for 30 days and builds a workflow around it, reducing retention becomes a political problem.
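These three drivers compose into a simple per-ingest cost model. A back-of-the-envelope sketch - all unit prices below are hypothetical placeholders, not vendor quotes:

```python
# Back-of-the-envelope model of per-ingest pricing.
# All unit prices are hypothetical placeholders, not vendor quotes.
def monthly_log_cost(gb_per_day, price_per_gb, retention_multiplier=1.0):
    # Retention tiers scale roughly linearly:
    # 30-day retention costs about 2x the 15-day tier.
    return gb_per_day * 30 * price_per_gb * retention_multiplier

def monthly_apm_cost(requests_per_sec, spans_per_request, price_per_million_spans):
    # Tracing every request: spans scale with traffic x fan-out.
    spans_per_month = requests_per_sec * spans_per_request * 86_400 * 30
    return spans_per_month / 1_000_000 * price_per_million_spans

# 500 GB/day of logs at a hypothetical £0.08/GB, 30-day retention:
print(round(monthly_log_cost(500, 0.08, retention_multiplier=2.0)))  # £2400/month
# 2,000 req/s, 20 spans per request, hypothetical £1.30 per million spans:
print(round(monthly_apm_cost(2_000, 20, 1.30)))  # £134784/month
```

Even with made-up prices, the shape is instructive: at high throughput, APM dwarfs the log bill - exactly the "largest line item" effect described above.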
And by the time finance notices, you’re locked in. Migration feels risky. The dashboards are embedded. The alerts are relied upon.
This is how observability becomes a hostage situation.
What the Open-Source Alternative Actually Looks Like
The LGTM stack - Loki, Grafana, Tempo, Mimir - isn’t a hobby project. It’s a fully production-grade observability platform that large organisations run at scale.
Here’s how the components map:
| Concern | Commercial (Datadog) | Open Source (LGTM) |
|---|---|---|
| Metrics | Datadog Metrics | Mimir (Prometheus-compatible, horizontally scalable) |
| Logs | Datadog Logs | Loki (label-indexed, doesn’t index full text by default - dramatically cheaper storage) |
| Traces | Datadog APM | Tempo (trace storage with no indexing requirement, works with OpenTelemetry) |
| Dashboards | Datadog Dashboards | Grafana (the industry standard, used even by Datadog customers) |
| Collection | Datadog Agent | Alloy (Grafana’s OpenTelemetry-compatible collector) |
Why the Cost Profile Is Different
The LGTM stack isn’t just cheaper because it’s open source. The architecture is fundamentally different:
- Loki doesn’t index log content. It indexes labels only, which means log storage is dramatically cheaper than full-text-indexed alternatives. You pay for object storage (S3, GCS) rather than per-GB ingest pricing.
- Mimir uses Prometheus-compatible storage. No per-custom-metric pricing. You pay for the compute and storage you provision, not for the cardinality of your metrics.
- Tempo stores traces in object storage with minimal indexing. Trace storage costs are a fraction of commercial APM pricing.
- No per-seat licensing. Grafana dashboards don’t cost more when more people use them.
The cost model shifts from pay-per-ingest to pay-for-infrastructure. At scale, that difference is enormous.
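The divergence between the two models can be sketched in a few lines. The prices below are hypothetical; the shape of the curves is the point:

```python
# Sketch of how the two cost models diverge with ingest volume.
# All prices are hypothetical; only the curve shapes matter.
def commercial_monthly(gb_per_day, price_per_gb=0.10):
    # Pay-per-ingest: every GB shipped is billed.
    return gb_per_day * 30 * price_per_gb

def self_hosted_monthly(gb_per_day, storage_per_gb_month=0.02, base_infra=3_000):
    # Pay-for-infrastructure: fixed compute/engineering baseline
    # plus cheap object storage (S3/GCS-class pricing).
    return base_infra + gb_per_day * 30 * storage_per_gb_month

for gb_per_day in (50, 500, 5_000):
    print(gb_per_day, commercial_monthly(gb_per_day), self_hosted_monthly(gb_per_day))
```

At 50 GB/day the commercial model wins comfortably - the fixed infrastructure baseline dominates. At 5,000 GB/day the positions reverse, and every further increase widens the gap. That crossover is the whole argument.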
Real-World Cost Comparison
The numbers vary by organisation, but the pattern is consistent. Teams running the LGTM stack at scale typically see 60-80% cost reductions compared to equivalent commercial tooling.
For a platform running 200+ services:
- A Datadog bill in the range of £300k-600k/year is not unusual
- The equivalent LGTM infrastructure (compute, storage, engineering time for operations) typically lands at £80k-150k/year
- The delta grows as you scale: object storage is orders of magnitude cheaper per GB, and commercial per-ingest bills track ingest volume, which typically grows faster than service count
These aren’t theoretical numbers. We’ve helped organisations make this transition and measured the before and after.
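A quick way to sanity-check your own position is cost per service, using the ranges quoted above:

```python
# Cost-per-service check using the ranges quoted above (200+ services).
def cost_per_service(annual_bill, service_count):
    return annual_bill / service_count

services = 200
for label, bill in (("commercial low", 300_000), ("commercial high", 600_000),
                    ("LGTM low", 80_000), ("LGTM high", 150_000)):
    print(label, round(cost_per_service(bill, services)))
```

That works out to roughly £1,500-£3,000 per service per year on the commercial side versus £400-£750 self-hosted - a useful number to have to hand before any vendor conversation.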
The Migration Is Real Work
Let’s be honest: migrating off a commercial observability platform isn’t trivial. Here’s what’s actually involved:
What You’re Moving
- Dashboards: Grafana is often already in use alongside Datadog, which helps. But Datadog-specific query syntax needs to be rewritten in PromQL/LogQL.
- Alerts: Every alert rule needs to be recreated in Grafana Alerting or Alertmanager. This is also an opportunity to clean up alert sprawl.
- Instrumentation: If your applications emit metrics via the Datadog agent or StatsD, they’ll need to be migrated to Prometheus exposition format or OpenTelemetry.
- Log pipelines: Log collection and parsing pipelines need to move to Alloy or a similar collector. Loki’s label-based approach requires a different mental model for log querying.
- Integrations: Datadog’s 1,000+ integrations are a genuine advantage. You’ll need to replicate the ones you actually use - but most teams use fewer than they think.
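On the instrumentation point above, the target of the migration is the Prometheus text exposition format that a service serves on its `/metrics` endpoint. A minimal hand-rolled sketch of that format - in practice you would use the `prometheus_client` or OpenTelemetry SDK rather than rendering it yourself, and the metric name here is hypothetical:

```python
# Minimal sketch of the Prometheus text exposition format a migrated service
# exposes on /metrics. In practice, use prometheus_client or the OTel SDK;
# this just shows what the push-based StatsD model is replaced with.
def render_counter(name, help_text, samples):
    """samples: list of (labels_dict, value) pairs for one counter."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

print(render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "status": "200"}, 1024),
     ({"method": "POST", "status": "500"}, 3)],
))
```

The key mental shift: instead of pushing StatsD packets to an agent, the application exposes current values and the collector (Alloy, in the LGTM stack) scrapes them.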
A Practical Migration Approach
1. Audit what you actually use. Most teams use 20% of their observability tooling for 80% of their operational decisions. Start there.
2. Run in parallel. Deploy the LGTM stack alongside your existing tooling. Dual-ship metrics and logs for a transition period. This lets teams validate that the new stack works before you cut over.
3. Migrate by team, not all at once. Let one team move fully, work through the rough edges, and document the process. Then scale.
4. Set a decommission date. Without a firm deadline, parallel running becomes permanent. And then you’re paying for both.
5. Invest in the platform. The LGTM stack needs platform engineering to run well. Self-hosted Mimir and Loki need capacity planning, operational runbooks, and upgrade management. Factor this into the cost comparison.
When to Stay Commercial
Open source isn’t always the right answer. Commercial observability makes sense when:
- Your team is small and doesn’t have the capacity to operate observability infrastructure
- You need rapid time-to-value and can’t invest in a migration
- The vendor integrations you rely on are genuinely not available in open-source tooling
- Your observability spend is proportionate to the value it provides and the engineering time it saves
The decision should be economic, not ideological. If your Datadog bill is £30k/year and your team is five engineers, the migration cost probably doesn’t make sense. If it’s £500k/year and growing, the conversation is very different.
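The economics above reduce to a payback-period calculation. A sketch with hypothetical figures, including a one-off migration cost (engineering time) that the prose deliberately leaves out of the headline numbers:

```python
# Payback-period sketch for the migration decision. All figures hypothetical.
def payback_months(annual_bill, annual_lgtm_cost, one_off_migration_cost):
    """Months until cumulative savings cover the migration effort."""
    monthly_saving = (annual_bill - annual_lgtm_cost) / 12
    if monthly_saving <= 0:
        return None  # migration never pays back
    return one_off_migration_cost / monthly_saving

# £500k bill, £150k LGTM run cost, £200k of migration engineering:
print(round(payback_months(500_000, 150_000, 200_000), 1))  # 6.9 months
# £30k bill for a five-engineer team: savings are small, payback is long:
print(round(payback_months(30_000, 20_000, 100_000), 1))  # 120.0 months
```

Under seven months versus ten years - the same calculation, run with your own numbers, usually makes the decision for you.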
The Takeaway
Observability cost isn’t a fixed line item. It’s a function of your architecture, your ingest volume, and your vendor’s pricing model. Left unmanaged, it compounds - quietly, and then suddenly.
The LGTM stack isn’t just a cost play. It gives you more control over how observability is architected, stored, and queried. But it requires investment in platform engineering to run well.
The question isn’t whether open-source observability works at scale. It does. The question is whether your current bill justifies the migration effort.
Have you calculated what your observability stack costs per service? If the number surprises you, let’s talk about what your options look like.