Your Kubernetes platform isn’t ready for AI inference workloads.
You have clusters. You have platform teams. You probably have a GitOps pipeline, an observability stack, and a self-service model that works well enough for stateless microservices.
None of that translates automatically to inference.
The first time a team tries to deploy a model-serving endpoint on your platform, the problems will appear fast. Not because the platform is bad, but because inference workloads have fundamentally different characteristics from the workloads it was designed for.
Why Inference Is Different
Standard application workloads on Kubernetes share a set of assumptions that most platform teams have optimised around: they’re stateless, they scale horizontally on commodity compute, they have predictable resource profiles, and they fail in ways the platform already understands.
Inference workloads break most of those assumptions.
GPU scheduling is not CPU scheduling
When a service needs more CPU, Kubernetes can schedule additional pods onto almost any available node. When a model-serving endpoint needs GPU, the scheduler has to find a node with the right accelerator type, enough free GPU memory, and potentially the right driver version - and there might be only three nodes in the cluster that qualify.
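To make that concrete, here is a minimal sketch of a GPU-constrained pod under the classic device-plugin model. The nvidia.com/gpu resource name assumes the NVIDIA device plugin, and the accelerator node label is a placeholder for whatever convention your provisioning tooling applies.

```yaml
# Sketch: a model-serving pod pinned to a specific accelerator class.
# Assumes the NVIDIA device plugin (nvidia.com/gpu) and an illustrative
# node label - substitute whatever labels your fleet actually carries.
apiVersion: v1
kind: Pod
metadata:
  name: llm-server
spec:
  nodeSelector:
    accelerator: nvidia-a100        # only a handful of nodes may match
  containers:
  - name: server
    image: registry.example.com/llm-server:0.1   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: "1"         # whole-GPU allocation; no sharing here
```

That nodeSelector is doing work the CPU world never needed: it shrinks the schedulable set to the few nodes carrying the right hardware.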
Kubernetes has made progress here. Dynamic Resource Allocation hit GA in 1.34, and MIG partitioning with DRA means GPU sharing is more practical than it was even a year ago. But having the primitives available is not the same as having them operationalised. Most platform teams haven’t configured DeviceClasses, haven’t defined ResourceClaim templates for their accelerator fleet, and haven’t built the self-service layer that makes DRA usable by application teams.
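As a rough illustration of the operationalisation work, the sketch below shows the kind of DeviceClass and ResourceClaimTemplate a platform team would curate. It assumes the resource.k8s.io/v1 API that went GA in 1.34 and an NVIDIA DRA driver publishing devices under the gpu.nvidia.com driver name; verify the exact field layout against your cluster version and driver documentation.

```yaml
# Sketch only: curated DRA objects, assuming resource.k8s.io/v1 (GA in 1.34)
# and an NVIDIA DRA driver whose devices carry the gpu.nvidia.com driver name.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: single-gpu
spec:
  selectors:
  - cel:
      expression: 'device.driver == "gpu.nvidia.com"'
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu-claim
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: single-gpu
          allocationMode: ExactCount
          count: 1
```

Workload pods then reference the template through spec.resourceClaims - and the point of the self-service layer is to make that wiring invisible to application teams.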
Standard horizontal pod autoscaling also doesn’t translate cleanly. Scaling up means waiting for a node with an accelerator to become available - or provisioning one, which can take minutes, not seconds. Scaling down means deciding whether to release expensive GPU capacity that might be needed again shortly.
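One small example of the mismatch: the autoscaling/v2 behaviour block can at least bias an HPA towards holding expensive GPU capacity instead of releasing it at the first dip in traffic. The numbers below are illustrative, and in practice a queue-depth or GPU-utilisation metric is usually a better scaling signal than CPU.

```yaml
# Sketch: an HPA tuned for expensive, slow-to-provision GPU capacity.
# Deployment name, replica bounds, and the CPU target are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-server
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu                         # a queue or GPU metric usually fits better
      target:
        type: Utilization
        averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react as soon as pressure appears
    scaleDown:
      stabilizationWindowSeconds: 900   # hold GPUs for 15 minutes before releasing
```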
Your platform’s autoscaling model was designed for a world where compute is fungible. GPU compute isn’t.
Latency profiles are different
A typical microservice responds in single-digit milliseconds. A model-serving endpoint might take 200ms for a simple prediction or several seconds for a large language model generating a response.
That changes everything downstream: timeout configurations, retry policies, load balancer settings, health check intervals, and capacity planning. Your platform’s defaults - which were tuned for fast, lightweight HTTP services - will cause false positives on health checks, premature timeouts, and misleading latency dashboards.
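A hedged sketch of inference-aware probe settings - the image, paths, ports, and thresholds are placeholders, but the shape is the point: a startupProbe that tolerates multi-minute model loading, plus readiness and liveness settings that don’t treat a busy GPU as a dead pod.

```yaml
# Sketch: probes for a slow-starting, slow-responding model server.
# Image, endpoints, and thresholds are placeholders for your runtime.
apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
  - name: server
    image: registry.example.com/model-server:0.1   # hypothetical image
    ports:
    - containerPort: 8080
    startupProbe:                 # covers model download and load
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 60        # ~10 minutes before the pod is marked failed
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 15
      timeoutSeconds: 5           # the 1s default is too tight for a busy GPU server
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 30
      timeoutSeconds: 5
      failureThreshold: 5         # avoid restart storms on transient latency spikes
```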
Resource consumption is unpredictable
A standard service uses roughly the same resources per request. Inference workloads vary dramatically depending on input size, model complexity, batch size, and whether the model is warm or cold-loaded. A single endpoint can swing from 2GB to 40GB of GPU memory depending on the request pattern.
Your resource quotas, limit ranges, and capacity planning models weren’t built for this kind of variance.
The failure modes are new
Model-serving endpoints don’t fail like web services. They fail because a model artifact didn’t download correctly. Because GPU memory fragmented. Because a driver version was incompatible after a node update. Because the model was too large to fit alongside the other workloads on the node.
Your runbooks, alert definitions, and incident response playbooks don’t cover these scenarios yet.
The Five Things Your Platform Is Missing
Most platforms that work well for standard workloads are missing five specific things for inference:
1. A workload onboarding model for inference
When a team wants to deploy a standard service, they know the path: create a repo from the template, configure the pipeline, deploy to staging, promote to production.
For inference, that path doesn’t exist yet. Teams don’t know what deployment model to use, how to request GPU capacity, what “production-ready” means for a model endpoint, or who to ask for help. Without a defined onboarding model, every inference deployment becomes a bespoke project.
2. GPU governance
This is the most expensive gap. Without a governance model for accelerators, you’ll see:
- Teams requesting more GPU than they need because they don’t understand the options
- No visibility into what’s allocated versus what’s actually used
- No policy for cold-start versus warm-capacity trade-offs
- No right-sizing guidance for different model sizes and traffic patterns
- GPU capacity sitting idle at significant cost because nobody owns the optimisation
The same patterns platform teams use for CPU and memory governance - quotas, right-sizing, capacity planning - need to be extended to accelerators. But the tooling and abstractions are different enough that it’s not a simple extension.
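The quota half of that extension is mechanically simple - a hedged sketch below, assuming GPUs are exposed as the nvidia.com/gpu extended resource, with purely illustrative numbers. The hard half is the visibility, right-sizing guidance, and idle-capacity accountability that has to sit around it.

```yaml
# Sketch: extending namespace quotas to accelerators. Assumes GPUs appear
# as the nvidia.com/gpu extended resource; the numbers are illustrative.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-ml               # hypothetical team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "4"   # ceiling on concurrently allocated GPUs
    requests.cpu: "32"
    requests.memory: 256Gi
```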
3. Inference-specific reliability patterns
SLOs for a model-serving endpoint aren’t the same as SLOs for a typical web API. Latency targets might be 500ms p99 instead of 50ms. Availability might need to account for model loading time. Graceful degradation might mean falling back to a smaller model, not returning a cached response.
Backpressure, load shedding, canary deployments, and rollback all need rethinking for inference workloads. A bad model deployment doesn’t fail the way a bad container image does - it might serve requests that are technically successful but produce wrong results.
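Serving platforms do give you traffic-splitting primitives for that rethink - the sketch below uses a KServe InferenceService with a canary percentage, assuming the serving.kserve.io/v1beta1 API; the model format, storage URI, and split are illustrative. What no manifest can do is judge whether the canary’s answers are any good: that still needs an evaluation step over the canary’s outputs before the split is widened.

```yaml
# Sketch: canary traffic split via KServe, assuming serving.kserve.io/v1beta1.
# Model format, artifact location, and the 10% split are illustrative only.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: recommender
spec:
  predictor:
    canaryTrafficPercent: 10       # latest revision receives 10% of traffic
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/recommender/v7   # hypothetical artifact path
```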
4. Inference observability
Your existing metrics pipeline captures request rate, error rate, and latency. For inference, you also need:
- Per-model and per-endpoint throughput and latency
- GPU utilisation, memory consumption, and saturation
- Queue depth and request waiting time
- Token throughput for language model endpoints
- Model loading and warm-up duration
- Cost per request, per endpoint, per team
Without this, you can’t operate inference workloads - you can only run them and hope.
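As a hedged illustration of the difference, the rules below alert on two of those signals: allocated-but-idle GPUs and a growing request queue. They assume the Prometheus Operator’s PrometheusRule CRD, the NVIDIA DCGM exporter (DCGM_FI_DEV_GPU_UTIL), and vLLM’s built-in metrics (vllm:num_requests_waiting); names and labels will differ with other exporters and engines.

```yaml
# Sketch: inference-specific alerts. Metric names assume the DCGM exporter
# and vLLM; adjust to whatever your exporters and serving engine expose.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-signals
spec:
  groups:
  - name: inference
    rules:
    - alert: GPUAllocatedButIdle
      expr: avg by (Hostname, gpu) (DCGM_FI_DEV_GPU_UTIL) < 10
      for: 6h
      labels:
        severity: info             # a cost signal, not an outage
    - alert: InferenceQueueBacklog
      expr: sum by (namespace, pod) (vllm:num_requests_waiting) > 20
      for: 10m
      labels:
        severity: warning          # requests queueing behind a saturated GPU
```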
5. A clear operating model
Who owns the inference platform layer? Who owns the model endpoints? When a model-serving endpoint goes down at 3am, who gets paged - the platform team or the team that deployed the model?
Most organisations don’t answer these questions before the first inference workload hits production. They end up answering them the hard way, usually during an incident.
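Whatever the answer is, it eventually has to be written down somewhere concrete. One hedged sketch of the paging half, assuming alerts carry a layer label set by whoever defines the rules; the receiver names and keys are placeholders.

```yaml
# Sketch: Alertmanager routing that encodes the ownership boundary.
# The "layer" label convention and receiver names are illustrative.
route:
  receiver: platform-oncall                # default: platform team owns the layer
  routes:
  - matchers:
    - 'layer="model"'                      # endpoint-level alerts page the owning team
    receiver: model-team-oncall
receivers:
- name: platform-oncall
  pagerduty_configs:
  - routing_key: <platform-routing-key>    # placeholder
- name: model-team-oncall
  pagerduty_configs:
  - routing_key: <model-team-routing-key>  # placeholder
```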
What a Production-Ready Foundation Looks Like
A Kubernetes platform that’s ready for inference workloads has the same properties as a well-run platform for any workload class: it’s standardised, governed, observable, and self-service.
The difference is in the specifics:
Standardised deployment patterns - a small number of approved ways to deploy inference workloads, with templates, documentation, and guardrails. Not “figure it out yourself.”
GPU scheduling and governance - quota policies, node class strategies, right-sizing guidance, and visibility into utilisation and cost. Not “request a GPU node and we’ll see.”
Inference-aware observability - dashboards, alerts, and SLOs designed for the latency, throughput, and cost characteristics of model serving. Not your existing application monitoring with GPU metrics bolted on.
Security and compliance controls - network segmentation, artifact provenance, admission policies, and audit trails that account for model endpoints and external model access patterns. Not “we’ll treat it like any other service.”
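One hedged example of what segmentation can mean in practice: a default-deny egress policy for the model-serving namespace that permits only DNS and the internal artifact store. The namespace name and CIDR are placeholders.

```yaml
# Sketch: restrict egress from the inference namespace to DNS and the
# object store holding model artifacts. Namespace and CIDR are placeholders.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-model-egress
  namespace: inference
spec:
  podSelector: {}                  # applies to every pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:                            # allow DNS lookups
    - namespaceSelector: {}
    ports:
    - protocol: UDP
      port: 53
  - to:                            # allow only the internal artifact store
    - ipBlock:
        cidr: 10.20.0.0/16
    ports:
    - protocol: TCP
      port: 443
```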
Self-service with guardrails - teams can deploy inference workloads through a defined path without bypassing governance. The platform team owns the controls; consuming teams interact with abstractions.
This isn’t a new platform. It’s your existing platform, extended thoughtfully to a workload class with different characteristics.
The Takeaway
If your organisation is moving toward AI, the work will eventually land on your Kubernetes platform. The question is whether it lands on a foundation designed for it or crashes into infrastructure that was built for a different world.
The tooling has matured significantly - DRA is GA, KServe has purpose-built CRDs for LLM serving, and vLLM has become a standard inference engine. But tooling availability and platform readiness are different things. The gap isn’t talent or budget. It’s that most Kubernetes environments haven’t operationalised these tools into governed, self-service platform patterns. Filling those gaps before the first production deployment is significantly cheaper than discovering them during an incident.
If your platform team is being asked to support AI inference and you’re not sure where the gaps are, that’s a conversation worth starting now.