GPU spend is about to become the next six-figure line item that nobody planned for.
The pattern is familiar. A team needs GPU for an inference workload. They provision a node with the accelerator they think they need. Nobody questions it because AI is a priority. Another team follows. Then another. Six months later, finance asks the platform team why the cloud bill jumped by 40%.
This is the same story platform teams already lived through with general compute, observability tooling, and Kubernetes cluster sprawl. The difference is that GPU pricing makes those problems look cheap.
The Real Reason GPU Costs Spiral
When GPU spend gets out of control, the conversation usually starts in the wrong place. Leadership asks: “Can we use a smaller model?” or “Can we switch to a cheaper provider?” or “Can we do this on CPU instead?”
Those are product questions. They’re valid, but they’re not the root cause.
The root cause is almost always the same: there is no governance layer between “a team wants GPU” and “GPU appears on the bill.”
No quota policy. No right-sizing process. No visibility into what’s allocated versus what’s used. No accountability for idle capacity. No standard for when GPU is justified versus when it isn’t.
That’s a platform problem. It always has been.
The parallel with CPU and memory
Platform teams solved this problem years ago for standard compute (a minimal sketch of the first two guardrails follows the list):
- Resource quotas prevent teams from consuming unbounded CPU and memory
- Limit ranges enforce sensible defaults
- Right-sizing tools identify over-provisioned workloads
- Cost attribution dashboards show spend by team and service
- Capacity planning models predict growth
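As a reminder of what those first two guardrails look like in practice, here is a minimal sketch - a ResourceQuota and a LimitRange built as plain manifests in Python and printed as YAML. The namespace and the numbers are illustrative, not a recommendation.

```python
# Minimal sketch: the CPU/memory guardrails most platform teams already ship.
# Namespace name and numbers are illustrative only.
import yaml  # pip install pyyaml

team_ns = "team-checkout"

resource_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "compute-quota", "namespace": team_ns},
    "spec": {
        "hard": {
            "requests.cpu": "40",
            "requests.memory": "160Gi",
            "limits.cpu": "80",
            "limits.memory": "320Gi",
        }
    },
}

limit_range = {
    "apiVersion": "v1",
    "kind": "LimitRange",
    "metadata": {"name": "sane-defaults", "namespace": team_ns},
    "spec": {
        "limits": [
            {
                "type": "Container",
                # Applied when a container specifies nothing at all.
                "default": {"cpu": "500m", "memory": "512Mi"},
                "defaultRequest": {"cpu": "250m", "memory": "256Mi"},
            }
        ]
    },
}

print(yaml.safe_dump_all([resource_quota, limit_range], sort_keys=False))
```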
GPU governance needs the same discipline, but with characteristics that make it harder:
GPU is expensive. Even after AWS cut H100 pricing by over 40% in mid-2025, a single p5.48xlarge still runs at roughly $55-66/hour on-demand. A Blackwell p6 instance costs even more. An idle GPU node overnight costs more than some teams’ entire monthly compute budget. The penalty for poor governance is an order of magnitude higher than it is for general compute.
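To make that penalty concrete, the arithmetic on one forgotten node is short. The rate and idle hours below are assumptions for illustration, using the low end of the on-demand range above.

```python
# Illustrative only: what one forgotten on-demand GPU node costs.
hourly_rate = 55.0          # USD/hour, low end of the p5.48xlarge range above
idle_hours_per_day = 14     # overnight plus weekend idle time, assumed
days = 30

monthly_idle_cost = hourly_rate * idle_hours_per_day * days
print(f"Idle cost per month: ${monthly_idle_cost:,.0f}")   # -> $23,100
```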
GPU is coarse. CPU can be shared across pods in fine-grained fractions. GPU sharing is improving - Dynamic Resource Allocation is now GA in Kubernetes 1.34 and MIG partitioning is more accessible than it was - but the operational complexity is still higher than CPU. Over-provisioning GPU means paying for an entire accelerator when the workload only needs a fraction of it.
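As an illustration of what sharing looks like today, a workload can request a MIG slice instead of a whole device - provided the platform has configured MIG profiles on the node pool. The resource name below (nvidia.com/mig-1g.5gb is an A100 profile) depends on the GPU model and the device-plugin configuration, so treat it as an assumption rather than a recipe.

```python
# Sketch of a pod that asks for one MIG slice rather than a whole GPU.
# The extended resource name depends on how the NVIDIA device plugin and
# MIG profiles are configured on the node pool - 1g.5gb is an A100 example.
import yaml

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "small-inference"},
    "spec": {
        "containers": [
            {
                "name": "server",
                "image": "registry.example.com/inference:latest",  # placeholder image
                "resources": {"limits": {"nvidia.com/mig-1g.5gb": 1}},
            }
        ]
    },
}

print(yaml.safe_dump(pod, sort_keys=False))
```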
GPU capacity is constrained. You can’t always get more when you need it. Capacity reservations, spot availability, and regional constraints mean that GPU allocation decisions have longer-term consequences than CPU.
GPU utilisation is harder to measure. CPU utilisation metrics are mature and well-understood. GPU utilisation, memory consumption, and compute saturation metrics require different tooling and different baselines. A GPU sitting at 30% utilisation might be right-sized for bursty inference or massively over-provisioned - you need context to tell the difference.
What Goes Wrong Without Governance
Every organisation that runs GPU workloads without a governance model ends up in the same set of failure modes:
Teams over-provision by default
Without guidance, teams request the largest accelerator available because they don’t know what they need and can’t afford to be wrong. An A100 gets provisioned for a workload that would run fine on a T4. Nobody checks because there’s no right-sizing process.
Idle capacity accumulates
Development and staging GPU nodes run 24/7 even though they’re only used during business hours. Production endpoints keep warm capacity for traffic spikes that happen once a day. Nobody tracks idle time because the observability stack doesn’t capture GPU-specific utilisation.
No one owns the optimisation
The model team owns the model. The platform team owns the cluster. Finance owns the budget. Nobody owns the GPU efficiency gap between them. Right-sizing falls into a gap between “that’s a model decision” and “that’s an infrastructure decision.”
Cold-start versus cost becomes a political argument
Keeping GPU capacity warm is expensive. Letting it scale to zero means cold-start latency when traffic arrives. Without a framework for making this trade-off - one that accounts for SLOs, traffic patterns, and cost - the decision becomes a negotiation between whoever shouts loudest.
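One way to take the volume out of that argument is to put the numbers side by side. The sketch below is a toy version of that comparison - every figure in it is assumed - but it shows the shape of the framework: warm cost on one side, SLO-breaching cold starts on the other.

```python
# Toy version of the warm-capacity vs scale-to-zero trade-off.
# Every number here is an assumption - substitute your own rates and traffic.
warm_hourly_cost = 4.10        # one always-on inference node, assumed $/hour
cold_start_seconds = 90        # time to pull and load the model from zero
latency_slo_seconds = 2.0      # what the endpoint promises
bursts_after_idle_per_day = 6  # how often traffic arrives after a quiet spell

daily_warm_cost = warm_hourly_cost * 24
breaches_avoided = bursts_after_idle_per_day if cold_start_seconds > latency_slo_seconds else 0

print(f"Keeping capacity warm costs ${daily_warm_cost:,.2f}/day")
if breaches_avoided:
    print(f"and avoids ~{breaches_avoided} SLO-breaching cold starts/day "
          f"(~${daily_warm_cost / breaches_avoided:,.2f} per avoided breach)")
```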
Spend is invisible until it’s too late
Most cost dashboards show total compute spend. They don’t break down GPU versus CPU, or show GPU spend by team, by workload, or by utilisation level. By the time someone notices the problem, it’s been compounding for months.
What GPU Governance Actually Looks Like
Good GPU governance is not a policy document. It’s an operating model - a set of controls, abstractions, and visibility tools built into the platform.
Accelerator classes
Define a small number of standard accelerator classes with clear use cases:
| Class | Accelerator | Use case |
|---|---|---|
| Small | T4 / L4 | Light inference, batch scoring, development |
| Medium | A10G / L40S | Production inference for mid-sized models |
| Large | A100 / H100 | Large model inference, high-throughput serving |
| XL | H200 / B200 | High-throughput LLM serving, latency-sensitive large models |
Teams select a class, not a specific instance type. The platform handles node provisioning, scheduling, and scaling. This gives the platform team a controllable surface area instead of unbounded instance selection.
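Under the hood, the abstraction can be as simple as a lookup table the platform owns. The label keys, tolerations, and pool names below are assumptions about how one platform team might wire it up, not a standard.

```python
# Sketch: map an accelerator class to the scheduling constraints the platform
# owns. Label/taint keys and pool names are illustrative assumptions.
ACCELERATOR_CLASSES = {
    "small":  {"accelerators": ["T4", "L4"],     "node_pool": "gpu-small"},
    "medium": {"accelerators": ["A10G", "L40S"], "node_pool": "gpu-medium"},
    "large":  {"accelerators": ["A100", "H100"], "node_pool": "gpu-large"},
    "xl":     {"accelerators": ["H200", "B200"], "node_pool": "gpu-xl"},
}

def scheduling_for(accel_class: str, gpus: int = 1) -> dict:
    """Return the pod-spec fragment a team gets when they pick a class."""
    pool = ACCELERATOR_CLASSES[accel_class]["node_pool"]
    return {
        "nodeSelector": {"example.com/gpu-pool": pool},
        "tolerations": [{"key": "example.com/gpu-pool", "operator": "Exists",
                         "effect": "NoSchedule"}],
        "resources": {"limits": {"nvidia.com/gpu": gpus}},
    }

print(scheduling_for("medium"))
```

The point is not the specific mechanism. It’s that teams express intent (“medium”) and the platform translates it into scheduling constraints it can change later without touching every workload.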
Quota and tenancy
Every team that uses GPU gets a quota. The quota is based on their workload requirements, not their requests. Quotas are reviewed quarterly.
This is the same model that works for CPU and memory. The tooling is different - GPU quotas need to account for whole-device allocation, fractional sharing capabilities, and the much higher unit cost - but the principle is identical.
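A minimal sketch of the quota itself, assuming the NVIDIA device plugin’s standard extended resource name; the namespace and the number are placeholders. Note that for extended resources like GPUs, quotas are expressed on requests only.

```python
# Sketch of a per-team GPU quota. Namespace and numbers are illustrative;
# for extended resources like GPUs, quota is expressed on requests only.
import yaml

gpu_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "gpu-quota", "namespace": "team-ranking"},
    "spec": {
        "hard": {
            "requests.nvidia.com/gpu": "4",   # whole devices this team may hold
        }
    },
}

print(yaml.safe_dump(gpu_quota, sort_keys=False))
```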
Right-sizing
Active right-sizing for GPU workloads requires different signals than CPU right-sizing:
- GPU compute utilisation - is the accelerator being used, and how often?
- GPU memory high-water mark - could this workload fit on a smaller accelerator?
- Request-to-capacity ratio - is the workload receiving enough traffic to justify dedicated GPU?
- Inference latency versus accelerator class - would a smaller GPU still meet the SLO?
Build this into a regular review cycle. The savings per right-sizing action are much higher for GPU than for general compute.
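Here is a sketch of pulling two of those signals - average utilisation and the memory high-water mark - from a dcgm-exporter scraped by Prometheus. The endpoint, label names, window, and thresholds are all assumptions; the metric names are the common dcgm-exporter ones, not something this article prescribes.

```python
# Sketch: pull GPU utilisation and memory high-water mark for a team's
# workloads from Prometheus (assuming dcgm-exporter is scraped).
# The endpoint, label names, 7-day window, and thresholds are assumptions.
import requests  # pip install requests

PROM = "http://prometheus.example.com"  # placeholder endpoint

def prom_scalar(query: str) -> float:
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

namespace = "team-ranking"

# Average GPU compute utilisation over the last 7 days (percent).
util = prom_scalar(
    f'avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL{{namespace="{namespace}"}}[7d]))'
)

# GPU memory high-water mark over the same window (MiB).
mem_peak = prom_scalar(
    f'max(max_over_time(DCGM_FI_DEV_FB_USED{{namespace="{namespace}"}}[7d]))'
)

print(f"{namespace}: avg util {util:.0f}%, peak GPU memory {mem_peak:.0f} MiB")
if util < 15 and mem_peak < 20_000:  # rough "fits on a smaller card" heuristic
    print("Candidate for a smaller accelerator class - review with the team.")
```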
Cost attribution
Per-team, per-workload GPU cost visibility is non-negotiable. This means:
- Tagging GPU nodes and workloads consistently
- Separating GPU spend from general compute in cost dashboards
- Showing allocated versus utilised cost (the gap is where the waste lives)
- Reporting GPU spend at the cadence leadership needs to see it - monthly at minimum
If teams can see their own GPU spend, they’ll self-optimise. If they can’t, they won’t.
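The allocated-versus-utilised gap is worth computing explicitly rather than leaving it implicit in a dashboard. A toy version of that report, with every figure assumed, looks like this:

```python
# Toy allocated-vs-utilised report. GPU-hours, rates and utilisation figures
# are assumptions; in practice they come from your billing export and metrics.
teams = [
    # (team, allocated GPU-hours this month, blended $/GPU-hour, avg utilisation)
    ("team-ranking",  1_440, 4.10, 0.62),
    ("team-checkout",   720, 2.05, 0.11),
]

for team, gpu_hours, rate, utilisation in teams:
    allocated = gpu_hours * rate
    idle_cost = allocated * (1 - utilisation)
    print(f"{team:15s} allocated ${allocated:8,.0f}  idle ${idle_cost:8,.0f} "
          f"({1 - utilisation:.0%} of spend)")
```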
Capacity planning
GPU capacity planning is less forgiving than CPU capacity planning because of provisioning lead times and pricing volatility. The platform should maintain:
- A forecast of GPU demand by team and workload class
- A strategy for reserved versus on-demand versus spot capacity
- Regional placement decisions based on availability and latency requirements
- Burst capacity policy - who can burst, under what conditions, and who pays
This is not new work for platform teams. It’s existing work applied to a resource class with higher stakes.
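The forecast does not need to be sophisticated to be useful. A minimal sketch - current demand, growth rates, and reservations all assumed - is enough to show where the shortfalls land:

```python
# Minimal GPU demand forecast by accelerator class. Current demand, growth
# rates and reservations are assumptions; feed in numbers from quota reviews.
current_demand = {"small": 12, "medium": 20, "large": 8, "xl": 2}   # GPUs in use today
quarterly_growth = {"small": 0.10, "medium": 0.25, "large": 0.40, "xl": 0.60}
reserved = {"small": 16, "medium": 24, "large": 8, "xl": 2}         # committed capacity

quarters = 4
for accel_class, demand in current_demand.items():
    projected = demand * (1 + quarterly_growth[accel_class]) ** quarters
    shortfall = max(0, round(projected) - reserved[accel_class])
    print(f"{accel_class:6s} today {demand:3d}  in a year ~{projected:5.1f}  "
          f"shortfall vs reserved: {shortfall}")
```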
The Takeaway
GPU spend doesn’t spiral because models are expensive. It spirals because there’s no platform governance around how GPU is allocated, measured, and optimised.
The fix is not to avoid GPU or to make product teams feel guilty about their cloud bill. The fix is to apply the same discipline to accelerator compute that platform teams already apply to everything else: quotas, right-sizing, visibility, accountability, and standard operating procedures.
If your GPU bill is growing and nobody can explain exactly why, that’s a platform conversation.