Datadog is genuinely great software.
It’s also how platform teams accidentally end up signing six-figure observability contracts.
The problem isn’t Datadog. It’s how observability costs scale - and how few teams see it coming until the bill is already painful.
How the Bill Spirals
The pattern is the same everywhere. You start small. A few services. A few dashboards. The value is obvious and the bill is fine.
Then the platform grows.
- More services get onboarded
- More logs get shipped
- More metrics get collected
- Custom metrics creep in
- APM gets enabled on everything
- Log retention quietly becomes 30 days because someone asked for it
- A new team adds their own dashboards and nobody decommissions the old ones
Nobody made a bad decision. The bill just doubled. Then doubled again.
The Pricing Model Works Against You at Scale
Most commercial observability vendors price on ingest volume - logs per GB ingested, custom metrics per unique time series, traces per span, infrastructure monitoring per host. This model is manageable at small scale but punishing at platform scale.
Here’s what catches teams out:
Custom metrics are expensive. A single Prometheus-style metric with high cardinality (multiple label combinations) can generate thousands of time series. On Datadog, custom metrics are billed per unique time series. A well-intentioned developer adding a few labels to a metric can increase your bill by thousands of pounds a month - without anyone noticing until the invoice arrives.
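The arithmetic behind cardinality blowups is worth making concrete. A rough sketch, assuming (worst case) that every label combination is actually observed - the label names and counts below are hypothetical:

```python
# Illustrative only: how label combinations multiply into billable time series.
# Assumes every combination of label values is observed at least once.
from math import prod

def series_count(label_cardinalities):
    """Unique time series for one metric = product of distinct values per label."""
    return prod(label_cardinalities)

# One HTTP metric with seemingly harmless labels (hypothetical counts):
labels = {
    "method": 5,     # GET, POST, ...
    "status": 10,    # 200, 404, 500, ...
    "endpoint": 50,  # route templates
    "pod": 200,      # one value per pod after an autoscale event
}
print(series_count(labels.values()))  # 500000 series from a single metric
```

Four innocuous labels turn one metric into half a million billable series - which is why a single well-intentioned commit can move the invoice.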
Log volume is hard to predict. Verbose application logging, debug-level logs left on in production, and retry storms can cause log ingest to spike dramatically. Per-GB pricing means those spikes hit your wallet directly.
APM costs scale with traffic. Tracing every request across every service generates enormous volumes of span data. At high throughput, APM alone can become the largest line item on your observability bill.
Retention costs compound. 15 days of log retention costs half as much as 30 days. But once someone asks for 30 days and builds a workflow around it, reducing retention becomes a political problem.
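These three drivers compose into a simple per-ingest cost model. A back-of-the-envelope sketch - all unit prices below are hypothetical placeholders, not vendor quotes:

```python
# Back-of-the-envelope model of per-ingest pricing.
# All unit prices are hypothetical placeholders, not vendor quotes.
def monthly_log_cost(gb_per_day, price_per_gb, retention_multiplier=1.0):
    # Retention tiers scale roughly linearly:
    # 30-day retention costs about 2x the 15-day tier.
    return gb_per_day * 30 * price_per_gb * retention_multiplier

def monthly_apm_cost(requests_per_sec, spans_per_request, price_per_million_spans):
    # Tracing every request: spans scale with traffic x fan-out.
    spans_per_month = requests_per_sec * spans_per_request * 86_400 * 30
    return spans_per_month / 1_000_000 * price_per_million_spans

# 500 GB/day of logs at a hypothetical £0.08/GB, 30-day retention:
print(round(monthly_log_cost(500, 0.08, retention_multiplier=2.0)))  # £2400/month
# 2,000 req/s, 20 spans per request, hypothetical £1.30 per million spans:
print(round(monthly_apm_cost(2_000, 20, 1.30)))  # £134784/month
```

Even with made-up prices, the shape is instructive: at high throughput, APM dwarfs the log bill - exactly the "largest line item" effect described above.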
And by the time finance notices, you’re locked in. Migration feels risky. The dashboards are embedded. The alerts are relied upon.
This is how observability becomes a hostage situation.
What the Open-Source Alternative Actually Looks Like
The LGTM stack - Loki, Grafana, Tempo, Mimir - isn’t a hobby project. It’s a fully production-grade observability platform that large organisations run at scale.
Here’s how the components map:
| Concern | Commercial (Datadog) | Open Source (LGTM) |
|---|---|---|
| Metrics | Datadog Metrics | Mimir (Prometheus-compatible, horizontally scalable) |
| Logs | Datadog Logs | Loki (label-indexed, doesn’t index full text by default - dramatically cheaper storage) |
| Traces | Datadog APM | Tempo (trace storage with no indexing requirement, works with OpenTelemetry) |
| Dashboards | Datadog Dashboards | Grafana (the industry standard, used even by Datadog customers) |
| Collection | Datadog Agent | Alloy (Grafana’s OpenTelemetry-compatible collector) |
Why the Cost Profile Is Different
The LGTM stack isn’t just cheaper because it’s open source. The architecture is fundamentally different:
- Loki doesn’t index log content. It indexes labels only, which means log storage is dramatically cheaper than full-text-indexed alternatives. You pay for object storage (S3, GCS) rather than per-GB ingest pricing.
- Mimir uses Prometheus-compatible storage. No per-custom-metric pricing. You pay for the compute and storage you provision, not for the cardinality of your metrics.
- Tempo stores traces in object storage with minimal indexing. Trace storage costs are a fraction of commercial APM pricing.
- No per-seat licensing. Grafana dashboards don’t cost more when more people use them.
The cost model shifts from pay-per-ingest to pay-for-infrastructure. At scale, that difference is enormous.
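The divergence between the two models can be sketched in a few lines. The prices below are hypothetical; the shape of the curves is the point:

```python
# Sketch of how the two cost models diverge with ingest volume.
# All prices are hypothetical; only the curve shapes matter.
def commercial_monthly(gb_per_day, price_per_gb=0.10):
    # Pay-per-ingest: every GB shipped is billed.
    return gb_per_day * 30 * price_per_gb

def self_hosted_monthly(gb_per_day, storage_per_gb_month=0.02, base_infra=3_000):
    # Pay-for-infrastructure: fixed compute/engineering baseline
    # plus cheap object storage (S3/GCS-class pricing).
    return base_infra + gb_per_day * 30 * storage_per_gb_month

for gb_per_day in (50, 500, 5_000):
    print(gb_per_day, commercial_monthly(gb_per_day), self_hosted_monthly(gb_per_day))
```

At 50 GB/day the commercial model wins comfortably - the fixed infrastructure baseline dominates. At 5,000 GB/day the positions reverse, and every further increase widens the gap. That crossover is the whole argument.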
Real-World Cost Comparison
The numbers vary by organisation, but the pattern is consistent. Teams running the LGTM stack at scale typically see 60-80% cost reductions compared to equivalent commercial tooling.
For a platform running 200+ services:
- A Datadog bill in the range of £300k-600k/year is not unusual
- The equivalent LGTM infrastructure (compute, storage, engineering time for operations) typically lands at £80k-150k/year
- The delta grows as you scale: object storage is orders of magnitude cheaper per GB, and commercial per-ingest bills track ingest volume, which typically grows faster than service count
These aren’t theoretical numbers. We’ve helped organisations make this transition and measured the before and after.
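A quick way to sanity-check your own position is cost per service, using the ranges quoted above:

```python
# Cost-per-service check using the ranges quoted above (200+ services).
def cost_per_service(annual_bill, service_count):
    return annual_bill / service_count

services = 200
for label, bill in (("commercial low", 300_000), ("commercial high", 600_000),
                    ("LGTM low", 80_000), ("LGTM high", 150_000)):
    print(label, round(cost_per_service(bill, services)))
```

That works out to roughly £1,500-£3,000 per service per year on the commercial side versus £400-£750 self-hosted - a useful number to have to hand before any vendor conversation.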
The Migration Is Real Work
Let’s be honest: migrating off a commercial observability platform isn’t trivial. Here’s what’s actually involved:
What You’re Moving
- Dashboards: Grafana is often already in use alongside Datadog, which helps. But Datadog-specific query syntax needs to be rewritten in PromQL/LogQL.
- Alerts: Every alert rule needs to be recreated in Grafana Alerting or Alertmanager. This is also an opportunity to clean up alert sprawl.
- Instrumentation: If your applications emit metrics via the Datadog agent or StatsD, they’ll need to be migrated to Prometheus exposition format or OpenTelemetry.
- Log pipelines: Log collection and parsing pipelines need to move to Alloy or a similar collector. Loki’s label-based approach requires a different mental model for log querying.
- Integrations: Datadog’s 1,000+ integrations are a genuine advantage. You’ll need to replicate the ones you actually use - but most teams use fewer than they think.
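On the instrumentation point above, the target of the migration is the Prometheus text exposition format that a service serves on its `/metrics` endpoint. A minimal hand-rolled sketch of that format - in practice you would use the `prometheus_client` or OpenTelemetry SDK rather than rendering it yourself, and the metric name here is hypothetical:

```python
# Minimal sketch of the Prometheus text exposition format a migrated service
# exposes on /metrics. In practice, use prometheus_client or the OTel SDK;
# this just shows what the push-based StatsD model is replaced with.
def render_counter(name, help_text, samples):
    """samples: list of (labels_dict, value) pairs for one counter."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

print(render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "status": "200"}, 1024),
     ({"method": "POST", "status": "500"}, 3)],
))
```

The key mental shift: instead of pushing StatsD packets to an agent, the application exposes current values and the collector (Alloy, in the LGTM stack) scrapes them.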
A Practical Migration Approach
1. Audit what you actually use. Most teams use 20% of their observability tooling for 80% of their operational decisions. Start there.
2. Run in parallel. Deploy the LGTM stack alongside your existing tooling. Dual-ship metrics and logs for a transition period. This lets teams validate that the new stack works before you cut over.
3. Migrate by team, not all at once. Let one team move fully, work through the rough edges, and document the process. Then scale.
4. Set a decommission date. Without a firm deadline, parallel running becomes permanent. And then you’re paying for both.
5. Invest in the platform. The LGTM stack needs platform engineering to run well. Self-hosted Mimir and Loki need capacity planning, operational runbooks, and upgrade management. Factor this into the cost comparison.
When to Stay Commercial
Open source isn’t always the right answer. Commercial observability makes sense when:
- Your team is small and doesn’t have the capacity to operate observability infrastructure
- You need rapid time-to-value and can’t invest in a migration
- The vendor integrations you rely on are genuinely not available in open-source tooling
- Your observability spend is proportionate to the value it provides and the engineering time it saves
The decision should be economic, not ideological. If your Datadog bill is £30k/year and your team is five engineers, the migration cost probably doesn’t make sense. If it’s £500k/year and growing, the conversation is very different.
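The economics above reduce to a payback-period calculation. A sketch with hypothetical figures, including a one-off migration cost (engineering time) that the prose deliberately leaves out of the headline numbers:

```python
# Payback-period sketch for the migration decision. All figures hypothetical.
def payback_months(annual_bill, annual_lgtm_cost, one_off_migration_cost):
    """Months until cumulative savings cover the migration effort."""
    monthly_saving = (annual_bill - annual_lgtm_cost) / 12
    if monthly_saving <= 0:
        return None  # migration never pays back
    return one_off_migration_cost / monthly_saving

# £500k bill, £150k LGTM run cost, £200k of migration engineering:
print(round(payback_months(500_000, 150_000, 200_000), 1))  # 6.9 months
# £30k bill for a five-engineer team: savings are small, payback is long:
print(round(payback_months(30_000, 20_000, 100_000), 1))  # 120.0 months
```

Under seven months versus ten years - the same calculation, run with your own numbers, usually makes the decision for you.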
The Takeaway
Observability cost isn’t a fixed line item. It’s a function of your architecture, your ingest volume, and your vendor’s pricing model. Left unmanaged, it compounds - quietly, and then suddenly.
The LGTM stack isn’t just a cost play. It gives you more control over how observability is architected, stored, and queried. But it requires investment in platform engineering to run well.
The question isn’t whether open-source observability works at scale. It does. The question is whether your current bill justifies the migration effort.
Have you calculated what your observability stack costs per service? If the number surprises you, let’s talk about what your options look like.