If you have a mature Kubernetes platform - clusters, GitOps, self-service, observability - and leadership has just told you to start running inference workloads, you are about to discover a gap in your team.
GPU nodes get provisioned ad hoc because nobody has a governance model. Health checks tuned for stateless microservices kill pods that are still loading multi-gigabyte model weights. Autoscaling is based on CPU, which says nothing about whether inference endpoints are saturated. The cloud bill grows a new six-figure line item that nobody can attribute to a team or workload.
The platform team isn’t underperforming. They’re operating a workload class that breaks every assumption their platform was built on - and nobody on the team has the specific combination of skills needed to bridge Kubernetes platform engineering and production model serving.
That gap is creating a new kind of platform engineer.
The Gap in Current Role Definitions
Platform engineering has matured around stateless, horizontally scalable workloads on commodity compute. Inference breaks those assumptions: GPU scheduling instead of CPU bin-packing, startup times in minutes not seconds, resource consumption that swings dramatically by input size and model architecture, and failure modes that don’t appear in standard HTTP metrics.
Platform engineering supplies the operating model - golden paths, self-service, guardrails - but extending it to GPU-accelerated inference requires different scheduling primitives, scaling signals, cost models, and failure-mode expertise.
MLOps focuses on the model lifecycle - training pipelines, experiment tracking, model versioning. Production inference reliability on Kubernetes is adjacent to MLOps but not central to it.
AI infrastructure engineering is closer, but broad enough to cover everything from training clusters to data pipelines. It doesn’t specifically describe the person who owns the inference platform layer.
The work sits at the intersection of all three.
The Market Already Knows This Role Exists
We’re seeing companies hire for this work under a wide range of titles. Search for inference platform roles on any major job board and you’ll see the same responsibilities described with different labels:
| Company | Title | What the role actually does |
|---|---|---|
| General Motors | Senior ML Infrastructure Engineer, Inference Platform | Kubernetes-based model serving, GPU scheduling, autoscaling |
| Anthropic | Senior Software Engineer, Cloud Inference | Production inference infrastructure, performance, reliability |
| Scale AI | AI Infrastructure Engineer, Model Serving Platform | Model deployment platform, GPU governance, scaling |
| Microsoft | MTS, AI Platform Engineer | Inference serving infrastructure, Kubernetes, accelerator management |
| NVIDIA | AI Inference Performance Engineer | Inference optimisation, GPU utilisation, serving runtime tuning |
| AMD | AI Infrastructure / Platform Engineer - GPU Compute | GPU platform tooling, scheduling, resource management |
| Johnson Controls | AI/ML Platform Engineer | Model serving platform, Kubernetes, MLOps integration |
Different titles, same role: someone who owns the platform layer for production inference, with deep Kubernetes expertise and enough ML systems knowledge to make the serving infrastructure work.
Broader hiring trends point in the same direction. LinkedIn’s Jobs on the Rise 2026 report lists “AI Engineer” as the fastest-growing US job title, but that umbrella covers everything from prompt engineering to training infrastructure. The more specific inference platform work is emerging inside that trend without a settled label.
Gartner estimates that 55% of AI-optimised IaaS spending already supports inference workloads, projected to exceed 65% by 2029. When inference is the majority of your AI spend, the person governing that infrastructure is a critical role.
What Makes This Different
| Concern | Traditional platform engineer | Inference platform engineer |
|---|---|---|
| Compute scheduling | CPU/memory bin-packing, standard scheduler | GPU-aware scheduling, accelerator selection, fractional GPU sharing |
| Scaling signals | Request rate, CPU utilisation | Token throughput, GPU utilisation, queue depth, batch saturation |
| Scaling behaviour | Fast horizontal scale-out | Slow scale-out (GPU provisioning takes minutes), expensive warm capacity |
| Cost unit | Cost per pod-hour | Cost per inference, cost per token, cost per GPU-hour |
| Failure modes | OOM kills, crashloops, network partitions | GPU memory fragmentation, model loading failures, driver incompatibility, silent quality degradation |
| Deployment | Container image rollout, traffic-based canary | Model artifact loading, runtime warm-up, quality-based canary |
| Observability | RED metrics (rate, errors, duration) | RED plus GPU utilisation, VRAM, tokens/sec, queue wait, model load duration |
An inference platform engineer is still a platform engineer - golden paths, self-service, guardrails, operational excellence. But the specific knowledge required for inference is different enough that a platform engineer without GPU scheduling and model-serving experience will spend months ramping up.
What the Role Owns
The platform layer between “the ML team has a model” and “the model is running reliably in production with governance, observability, and cost controls.”
Deployment and serving infrastructure. A standardised, self-service path for inference workloads - serving runtime configuration (KServe, vLLM, Triton), deployment templates, health checks that account for model loading times, rollout strategies. The tooling has matured significantly: Dynamic Resource Allocation (DRA) is GA in Kubernetes 1.34, vLLM has become a common choice for LLM serving, and KServe provides Kubernetes-native model serving with built-in autoscaling and canary support. The challenge is integrating them into a governed platform.
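To make the health-check point concrete, here is a minimal sketch of a serving Deployment whose startup probe tolerates several minutes of model loading before liveness checks take over. It assumes a vLLM OpenAI-compatible server on its default port; the workload name, image tag, and thresholds are illustrative, not prescriptive.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-serving                        # hypothetical workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-serving
  template:
    metadata:
      labels:
        app: llm-serving
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # example image; pin a real tag in practice
          resources:
            limits:
              nvidia.com/gpu: 1            # one whole GPU; fractional sharing needs MIG or time-slicing
          ports:
            - containerPort: 8000
          startupProbe:                    # gates liveness until the model weights are loaded
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
            failureThreshold: 90           # tolerate up to ~15 minutes of model loading
          livenessProbe:                   # only applies once the startup probe has succeeded
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 15
```

The probe settings that work for a stateless microservice would restart this pod long before a multi-gigabyte model finishes loading; the startup probe is what prevents that.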
GPU scheduling and governance. Accelerator class definitions, scheduling policy configuration (DRA, DeviceClasses, MIG partitioning), per-team quotas, and right-sizing reviews. The same governance discipline platform teams apply to CPU and memory, extended to a resource class where the cost of poor governance is an order of magnitude higher.
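As one hedged example of what that governance can look like, a namespace-scoped ResourceQuota capping a team's GPU requests - the namespace name and limit are placeholders for whatever accelerator classes and budgets the platform defines.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a                  # hypothetical team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"     # cap total GPUs requested across the namespace
```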
Autoscaling. Custom-metric-driven scaling (KEDA, custom HPA) using GPU utilisation, queue depth, and token throughput rather than CPU. Owning the scale-to-zero versus warm-capacity trade-off across all endpoints.
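For illustration, a sketch of a KEDA ScaledObject that scales a serving deployment on GPU utilisation scraped from the DCGM exporter via Prometheus. The deployment name, query, label names, threshold, and scale-down window are assumptions to tune per workload; queue depth or token throughput can be swapped in as additional triggers.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-serving-scaler
spec:
  scaleTargetRef:
    name: llm-serving                # the Deployment sketched above
  minReplicaCount: 1                 # keep one warm replica; scale-to-zero trades cost for cold-start latency
  maxReplicaCount: 8
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 600   # scale in slowly; every scale-out pays the model-load penalty again
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090     # assumed Prometheus endpoint
        query: avg(DCGM_FI_DEV_GPU_UTIL{pod=~"llm-serving-.*"})  # label names depend on exporter setup
        threshold: "70"              # target average GPU utilisation (%)
```

The conservative scale-down window is the point: tearing down a replica you will need again in ten minutes costs another multi-minute model load and, on most clouds, another round of GPU node provisioning.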
Observability. GPU utilisation, VRAM consumption, tokens per second, queue depth, model load duration, batch efficiency, cost per inference. Without this, the platform team can tell you pods are running but not whether inference is performing well or what it costs.
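A sketch of what those signals can look like as Prometheus recording rules, assuming the Prometheus Operator, vLLM's built-in metrics, and the DCGM exporter - metric and label names vary by serving runtime and exporter configuration, so treat this as a starting shape rather than a drop-in.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-slis
  namespace: monitoring
spec:
  groups:
    - name: inference.rules
      rules:
        - record: inference:tokens_per_second      # generation throughput per pod
          expr: sum by (namespace, pod) (rate(vllm:generation_tokens_total[5m]))
        - record: inference:queue_depth            # requests waiting for a batch slot
          expr: sum by (namespace, pod) (vllm:num_requests_waiting)
        - record: inference:gpu_utilisation        # averaged across the pod's GPUs
          expr: avg by (namespace, pod) (DCGM_FI_DEV_GPU_UTIL)
```

Cost per inference and cost per token need a cost source joined in as well - typically allocation data that maps GPU-node spend back to namespaces and workloads.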
Reliability. Inference-specific failure modes - GPU memory fragmentation, driver incompatibility, silent quality degradation - and the runbooks to handle them. This is where the gap hurts most: a traditional platform engineer paged for a GPU memory issue will spend an hour determining whether it’s infrastructure or model. An inference platform engineer triages it in minutes.
Why the Label Matters
Titles shape organisations:
Budgets. Without a named role, inference platform work competes with every other platform priority. A name makes the investment visible.
Hiring. Post “Platform Engineer” for work that’s 70% inference platform and you’ll attract candidates who expect standard Kubernetes work. A specific title attracts the right intersection of skills.
Ownership. Without a named role, inference platform work falls into the gap between the platform team (“that’s an ML problem”) and the ML team (“that’s an infrastructure problem”). The result is ML teams running infrastructure poorly or platform teams treating inference like a standard application.
Career paths. Engineers doing this work today are platform engineers doing “AI stuff” or ML engineers doing “infra stuff.” Neither career ladder recognises what they do.
Specialisation, Not Invention
This is platform engineering - in the same way that data platform engineering and security platform engineering are platform engineering. The platform engineering community is already discussing clearer sub-specialisations - infrastructure, developer, security, observability platform engineering - each with its own career ladder. Inference platform engineering fits naturally into that taxonomy.
The expected profile:
Strong in: Kubernetes, cloud infrastructure, autoscaling, observability, reliability engineering, cost governance, platform product thinking.
Comfortable in: GPU operations, model-serving runtimes (vLLM, Triton, KServe), inference latency and throughput optimisation, accelerator scheduling.
Not expected to own: Model research, training, fine-tuning, data science, feature engineering.
This is not a unicorn profile. It’s a platform engineer who has developed inference-specific domain expertise - the same way data platform engineers developed data-specific expertise. The pipeline already exists: experienced platform engineers and SREs working on AI infrastructure, accumulating the specialised knowledge through hands-on work.
What This Means for a Platform Team Today
If you already run a mature Kubernetes platform, this is a specialisation path, not a team reset. You don't need to hire a new team or build parallel infrastructure. You need to recognise that inference workloads require dedicated domain expertise on the platform team - and invest accordingly.
The first gaps are usually predictable:
- GPU governance - no accelerator classes, no quotas, no right-sizing process. Teams request whatever GPU they’ve heard of and nobody has the context to push back.
- Autoscaling signals - HPA is configured on CPU, which tells you nothing about inference saturation. You need custom metrics: GPU utilisation, queue depth, token throughput.
- Observability - the existing stack shows pods are running. It doesn’t show GPU utilisation, inference latency by endpoint, or cost per prediction.
- Ownership boundaries - no documented line between “platform problem” and “model problem.” Every incident becomes a cross-team investigation.
The mistake is treating inference like just another deployment target. It isn’t. The compute is more expensive, the failure modes are different, the scaling behaviour is different, and the cost of getting governance wrong is an order of magnitude higher.
Whether you hire for this role, grow it internally, or bring in external expertise - the important thing is to recognise the gap before the first production inference outage forces the conversation.
If you’re working through that question, we can help.