GPU spend is about to become the next six-figure line item that nobody planned for.
The pattern is familiar. A team needs GPU for an inference workload. They provision a node with the accelerator they think they need. Nobody questions it because AI is a priority. Another team follows. Then another. Six months later, finance asks the platform team why the cloud bill jumped by 40%.
This is the same story platform teams already lived through with general compute, observability tooling, and Kubernetes cluster sprawl. The difference is that GPU pricing makes those problems look cheap.
The Real Reason GPU Costs Spiral
When GPU spend gets out of control, the conversation usually starts in the wrong place. Leadership asks: “Can we use a smaller model?” or “Can we switch to a cheaper provider?” or “Can we do this on CPU instead?”
Those are product questions. They’re valid, but they’re not the root cause.
The root cause is almost always the same: there is no governance layer between “a team wants GPU” and “GPU appears on the bill.”
No quota policy. No right-sizing process. No visibility into what’s allocated versus what’s used. No accountability for idle capacity. No standard for when GPU is justified versus when it isn’t.
That’s a platform problem. It always has been.
The parallel with CPU and memory
Platform teams solved this problem years ago for standard compute (a minimal sketch of the first two guardrails follows the list):
- Resource quotas prevent teams from consuming unbounded CPU and memory
- Limit ranges enforce sensible defaults
- Right-sizing tools identify over-provisioned workloads
- Cost attribution dashboards show spend by team and service
- Capacity planning models predict growth
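As a reminder of what those first two guardrails look like in practice, here is a minimal sketch - a ResourceQuota and a LimitRange built as plain manifests in Python and printed as YAML. The namespace and the numbers are illustrative, not a recommendation.

```python
# Minimal sketch: the CPU/memory guardrails most platform teams already ship.
# Namespace name and numbers are illustrative only.
import yaml  # pip install pyyaml

team_ns = "team-checkout"

resource_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "compute-quota", "namespace": team_ns},
    "spec": {
        "hard": {
            "requests.cpu": "40",
            "requests.memory": "160Gi",
            "limits.cpu": "80",
            "limits.memory": "320Gi",
        }
    },
}

limit_range = {
    "apiVersion": "v1",
    "kind": "LimitRange",
    "metadata": {"name": "sane-defaults", "namespace": team_ns},
    "spec": {
        "limits": [
            {
                "type": "Container",
                # Applied when a container specifies nothing at all.
                "default": {"cpu": "500m", "memory": "512Mi"},
                "defaultRequest": {"cpu": "250m", "memory": "256Mi"},
            }
        ]
    },
}

print(yaml.safe_dump_all([resource_quota, limit_range], sort_keys=False))
```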
GPU governance needs the same discipline, but with characteristics that make it harder:
GPU is expensive. Even after AWS cut H100 pricing by over 40% in mid-2025, a single p5.48xlarge still runs at roughly $55-66/hour on-demand. A Blackwell p6 instance costs even more. An idle GPU node overnight costs more than some teams’ entire monthly compute budget. The penalty for poor governance is an order of magnitude higher than it is for general compute.
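To make that penalty concrete, the arithmetic on one forgotten node is short. The rate and idle hours below are assumptions for illustration, using the low end of the on-demand range above.

```python
# Illustrative only: what one forgotten on-demand GPU node costs.
hourly_rate = 55.0          # USD/hour, low end of the p5.48xlarge range above
idle_hours_per_day = 14     # overnight plus weekend idle time, assumed
days = 30

monthly_idle_cost = hourly_rate * idle_hours_per_day * days
print(f"Idle cost per month: ${monthly_idle_cost:,.0f}")   # -> $23,100
```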
GPU is coarse. CPU can be shared across pods in fine-grained fractions. GPU sharing is improving - Dynamic Resource Allocation is now GA in Kubernetes 1.34 and MIG partitioning is more accessible than it was - but the operational complexity is still higher than CPU. Over-provisioning GPU means paying for an entire accelerator when the workload only needs a fraction of it.
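As an illustration of what sharing looks like today, a workload can request a MIG slice instead of a whole device - provided the platform has configured MIG profiles on the node pool. The resource name below (nvidia.com/mig-1g.5gb is an A100 profile) depends on the GPU model and the device-plugin configuration, so treat it as an assumption rather than a recipe.

```python
# Sketch of a pod that asks for one MIG slice rather than a whole GPU.
# The extended resource name depends on how the NVIDIA device plugin and
# MIG profiles are configured on the node pool - 1g.5gb is an A100 example.
import yaml

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "small-inference"},
    "spec": {
        "containers": [
            {
                "name": "server",
                "image": "registry.example.com/inference:latest",  # placeholder image
                "resources": {"limits": {"nvidia.com/mig-1g.5gb": 1}},
            }
        ]
    },
}

print(yaml.safe_dump(pod, sort_keys=False))
```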
GPU capacity is constrained. You can’t always get more when you need it. Capacity reservations, spot availability, and regional constraints mean that GPU allocation decisions have longer-term consequences than CPU.
GPU utilisation is harder to measure. CPU utilisation metrics are mature and well-understood. GPU utilisation, memory consumption, and compute saturation metrics require different tooling and different baselines. A GPU sitting at 30% utilisation might be right-sized for bursty inference or massively over-provisioned - you need context to tell the difference.
What Goes Wrong Without Governance
Every organisation that runs GPU workloads without a governance model ends up in the same set of failure modes:
Teams over-provision by default
Without guidance, teams request the largest accelerator available because they don’t know what they need and can’t afford to be wrong. An A100 gets provisioned for a workload that would run fine on a T4. Nobody checks because there’s no right-sizing process.
Idle capacity accumulates
Development and staging GPU nodes run 24/7 even though they’re only used during business hours. Production endpoints keep warm capacity for traffic spikes that happen once a day. Nobody tracks idle time because the observability stack doesn’t capture GPU-specific utilisation.
No one owns the optimisation
The model team owns the model. The platform team owns the cluster. Finance owns the budget. Nobody owns the GPU efficiency gap between them. Right-sizing falls into a gap between “that’s a model decision” and “that’s an infrastructure decision.”
Cold-start versus cost becomes a political argument
Keeping GPU capacity warm is expensive. Letting it scale to zero means cold-start latency when traffic arrives. Without a framework for making this trade-off - one that accounts for SLOs, traffic patterns, and cost - the decision becomes a negotiation between whoever shouts loudest.
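One way to take the volume out of that argument is to put the numbers side by side. The sketch below is a toy version of that comparison - every figure in it is assumed - but it shows the shape of the framework: warm cost on one side, SLO-breaching cold starts on the other.

```python
# Toy version of the warm-capacity vs scale-to-zero trade-off.
# Every number here is an assumption - substitute your own rates and traffic.
warm_hourly_cost = 4.10        # one always-on inference node, assumed $/hour
cold_start_seconds = 90        # time to pull and load the model from zero
latency_slo_seconds = 2.0      # what the endpoint promises
bursts_after_idle_per_day = 6  # how often traffic arrives after a quiet spell

daily_warm_cost = warm_hourly_cost * 24
breaches_avoided = bursts_after_idle_per_day if cold_start_seconds > latency_slo_seconds else 0

print(f"Keeping capacity warm costs ${daily_warm_cost:,.2f}/day")
if breaches_avoided:
    print(f"and avoids ~{breaches_avoided} SLO-breaching cold starts/day "
          f"(~${daily_warm_cost / breaches_avoided:,.2f} per avoided breach)")
```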
Spend is invisible until it’s too late
Most cost dashboards show total compute spend. They don’t break down GPU versus CPU, or show GPU spend by team, by workload, or by utilisation level. By the time someone notices the problem, it’s been compounding for months.
What GPU Governance Actually Looks Like
Good GPU governance is not a policy document. It’s an operating model - a set of controls, abstractions, and visibility tools built into the platform.
Accelerator classes
Define a small number of standard accelerator classes with clear use cases:
| Class | Accelerator | Use case |
|---|---|---|
| Small | T4 / L4 | Light inference, batch scoring, development |
| Medium | A10G / L40S | Production inference for mid-sized models |
| Large | A100 / H100 | Large model inference, high-throughput serving |
| XL | H200 / B200 | High-throughput LLM serving, latency-sensitive large models |
Teams select a class, not a specific instance type. The platform handles node provisioning, scheduling, and scaling. This gives the platform team a controllable surface area instead of unbounded instance selection.
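Under the hood, the abstraction can be as simple as a lookup table the platform owns. The label keys, tolerations, and pool names below are assumptions about how one platform team might wire it up, not a standard.

```python
# Sketch: map an accelerator class to the scheduling constraints the platform
# owns. Label/taint keys and pool names are illustrative assumptions.
ACCELERATOR_CLASSES = {
    "small":  {"accelerators": ["T4", "L4"],     "node_pool": "gpu-small"},
    "medium": {"accelerators": ["A10G", "L40S"], "node_pool": "gpu-medium"},
    "large":  {"accelerators": ["A100", "H100"], "node_pool": "gpu-large"},
    "xl":     {"accelerators": ["H200", "B200"], "node_pool": "gpu-xl"},
}

def scheduling_for(accel_class: str, gpus: int = 1) -> dict:
    """Return the pod-spec fragment a team gets when they pick a class."""
    pool = ACCELERATOR_CLASSES[accel_class]["node_pool"]
    return {
        "nodeSelector": {"example.com/gpu-pool": pool},
        "tolerations": [{"key": "example.com/gpu-pool", "operator": "Exists",
                         "effect": "NoSchedule"}],
        "resources": {"limits": {"nvidia.com/gpu": gpus}},
    }

print(scheduling_for("medium"))
```

The point is not the specific mechanism. It’s that teams express intent (“medium”) and the platform translates it into scheduling constraints it can change later without touching every workload.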
Quota and tenancy
Every team that uses GPU gets a quota. The quota is based on their workload requirements, not their requests. Quotas are reviewed quarterly.
This is the same model that works for CPU and memory. The tooling is different - GPU quotas need to account for whole-device allocation, fractional sharing capabilities, and the much higher unit cost - but the principle is identical.
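A minimal sketch of the quota itself, assuming the NVIDIA device plugin’s standard extended resource name; the namespace and the number are placeholders. Note that for extended resources like GPUs, quotas are expressed on requests only.

```python
# Sketch of a per-team GPU quota. Namespace and numbers are illustrative;
# for extended resources like GPUs, quota is expressed on requests only.
import yaml

gpu_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "gpu-quota", "namespace": "team-ranking"},
    "spec": {
        "hard": {
            "requests.nvidia.com/gpu": "4",   # whole devices this team may hold
        }
    },
}

print(yaml.safe_dump(gpu_quota, sort_keys=False))
```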
Right-sizing
Active right-sizing for GPU workloads requires different signals than CPU right-sizing:
- GPU compute utilisation - is the accelerator being used, and how often?
- GPU memory high-water mark - could this workload fit on a smaller accelerator?
- Request-to-capacity ratio - is the workload receiving enough traffic to justify dedicated GPU?
- Inference latency versus accelerator class - would a smaller GPU still meet the SLO?
Build this into a regular review cycle. The savings per right-sizing action are much higher for GPU than for general compute.
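Here is a sketch of pulling two of those signals - average utilisation and the memory high-water mark - from a dcgm-exporter scraped by Prometheus. The endpoint, label names, window, and thresholds are all assumptions; the metric names are the common dcgm-exporter ones, not something this article prescribes.

```python
# Sketch: pull GPU utilisation and memory high-water mark for a team's
# workloads from Prometheus (assuming dcgm-exporter is scraped).
# The endpoint, label names, 7-day window, and thresholds are assumptions.
import requests  # pip install requests

PROM = "http://prometheus.example.com"  # placeholder endpoint

def prom_scalar(query: str) -> float:
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

namespace = "team-ranking"

# Average GPU compute utilisation over the last 7 days (percent).
util = prom_scalar(
    f'avg(avg_over_time(DCGM_FI_DEV_GPU_UTIL{{namespace="{namespace}"}}[7d]))'
)

# GPU memory high-water mark over the same window (MiB).
mem_peak = prom_scalar(
    f'max(max_over_time(DCGM_FI_DEV_FB_USED{{namespace="{namespace}"}}[7d]))'
)

print(f"{namespace}: avg util {util:.0f}%, peak GPU memory {mem_peak:.0f} MiB")
if util < 15 and mem_peak < 20_000:  # rough "fits on a smaller card" heuristic
    print("Candidate for a smaller accelerator class - review with the team.")
```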
Cost attribution
Per-team, per-workload GPU cost visibility is non-negotiable. This means:
- Tagging GPU nodes and workloads consistently
- Separating GPU spend from general compute in cost dashboards
- Showing allocated versus utilised cost (the gap is where the waste lives)
- Reporting GPU spend at the cadence leadership needs to see it - monthly at minimum
If teams can see their own GPU spend, they’ll self-optimise. If they can’t, they won’t.
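The allocated-versus-utilised gap is worth computing explicitly rather than leaving it implicit in a dashboard. A toy version of that report, with every figure assumed, looks like this:

```python
# Toy allocated-vs-utilised report. GPU-hours, rates and utilisation figures
# are assumptions; in practice they come from your billing export and metrics.
teams = [
    # (team, allocated GPU-hours this month, blended $/GPU-hour, avg utilisation)
    ("team-ranking",  1_440, 4.10, 0.62),
    ("team-checkout",   720, 2.05, 0.11),
]

for team, gpu_hours, rate, utilisation in teams:
    allocated = gpu_hours * rate
    idle_cost = allocated * (1 - utilisation)
    print(f"{team:15s} allocated ${allocated:8,.0f}  idle ${idle_cost:8,.0f} "
          f"({1 - utilisation:.0%} of spend)")
```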
Capacity planning
GPU capacity planning is less forgiving than CPU capacity planning because of provisioning lead times and pricing volatility. The platform should maintain:
- A forecast of GPU demand by team and workload class
- A strategy for reserved versus on-demand versus spot capacity
- Regional placement decisions based on availability and latency requirements
- Burst capacity policy - who can burst, under what conditions, and who pays
This is not new work for platform teams. It’s existing work applied to a resource class with higher stakes.
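The forecast does not need to be sophisticated to be useful. A minimal sketch - current demand, growth rates, and reservations all assumed - is enough to show where the shortfalls land:

```python
# Minimal GPU demand forecast by accelerator class. Current demand, growth
# rates and reservations are assumptions; feed in numbers from quota reviews.
current_demand = {"small": 12, "medium": 20, "large": 8, "xl": 2}   # GPUs in use today
quarterly_growth = {"small": 0.10, "medium": 0.25, "large": 0.40, "xl": 0.60}
reserved = {"small": 16, "medium": 24, "large": 8, "xl": 2}         # committed capacity

quarters = 4
for accel_class, demand in current_demand.items():
    projected = demand * (1 + quarterly_growth[accel_class]) ** quarters
    shortfall = max(0, round(projected) - reserved[accel_class])
    print(f"{accel_class:6s} today {demand:3d}  in a year ~{projected:5.1f}  "
          f"shortfall vs reserved: {shortfall}")
```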
The Takeaway
GPU spend doesn’t spiral because models are expensive. It spirals because there’s no platform governance around how GPU is allocated, measured, and optimised.
The fix is not to avoid GPU or to make product teams feel guilty about their cloud bill. The fix is to apply the same discipline to accelerator compute that platform teams already apply to everything else: quotas, right-sizing, visibility, accountability, and standard operating procedures.
If your GPU bill is growing and nobody can explain exactly why, that’s a platform conversation.