Most organisations building AI platforms start in the wrong layer.
They start with the model. Which LLM to use. Which framework for serving. Which vector database for RAG. Which orchestration tool for agents.
Then they hand the result to the platform team and say: “Make this run in production.”
That is where the problems begin.
The Layer Problem
There are roughly three layers in an AI inference stack:
The model layer - model selection, fine-tuning, prompt design, evaluation. This is where data scientists and ML engineers work.
The application layer - API design, orchestration, context management, user-facing integration. This is where application engineers work.
The platform layer - how inference workloads are deployed, governed, scaled, observed, secured, and cost-controlled on Kubernetes. This is where platform engineers work.
Most organisations invest heavily in the first two layers and assume the third will sort itself out. After all, they already have Kubernetes. They already have a platform team. How hard could it be?
The CNCF’s 2025 survey found that 66% of organisations hosting generative AI models are already using Kubernetes for inference. The infrastructure is converging. But having Kubernetes handle AI workloads and having a governed, production-grade platform for AI workloads are not the same thing. For most organisations, the platform layer is where things break down.
What Happens When You Skip the Platform Layer
The pattern is consistent across organisations. An ML team builds a proof of concept. It works on their machines or in a managed notebook environment. Leadership gets excited. The mandate comes: get this into production.
The ML team doesn’t have access to production infrastructure. They don’t know how to write a Kubernetes deployment manifest, configure autoscaling for GPU workloads, set up monitoring for inference latency, or integrate with the company’s secrets management system.
So they go to the platform team. And the platform team discovers:
There’s no deployment model for inference
The existing deployment pipeline assumes stateless HTTP services with CPU resource requests. An inference endpoint needs GPU scheduling via DRA and DeviceClasses, model artifact loading, health checks that tolerate startup times measured in minutes, and potentially service mesh timeouts that account for a very different latency profile: seconds per request, not milliseconds. Tools like KServe and vLLM exist for this, but somebody has to integrate them into your platform’s deployment pipeline, observability stack, and governance model.
Nobody built a golden path for this.
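To make that concrete, here is a rough sketch of what one golden-path manifest might contain. The image, model name, and probe settings are illustrative assumptions (a vLLM OpenAI-compatible server, which exposes a /health endpoint), not a recommendation:

```yaml
# Sketch of an inference Deployment - values are illustrative assumptions,
# not a recommended configuration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest                          # assumed serving image
        args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]   # hypothetical model
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1   # extended resource; with DRA this becomes a ResourceClaim reference
        # Model download and weight loading can take minutes, not seconds.
        startupProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 10
          failureThreshold: 90   # allow up to ~15 minutes before giving up
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 15
```

None of this is exotic, but none of it falls out of a template built for stateless CPU services either.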
There’s no GPU governance
The ML team says they need an A100. Maybe they do, maybe they don’t - but there’s no framework for evaluating that. No accelerator classes. No quota policy. No visibility into what’s already allocated. The platform team either says yes to everything (and cost spirals) or says no to everything (and becomes a blocker).
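Even a basic guardrail changes that conversation. As a sketch - assuming the NVIDIA device plugin’s nvidia.com/gpu extended resource and a hypothetical team namespace - a namespace-level ResourceQuota turns “yes or no” into “yes, within limits”:

```yaml
# Sketch: cap GPU requests per team namespace. Names and numbers are
# illustrative assumptions, not a sizing recommendation.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-ml                 # hypothetical team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "2"     # at most 2 GPUs requested in this namespace
```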
There’s no operating model
Who gets paged when the inference endpoint goes down? The platform team doesn’t understand the model. The ML team doesn’t understand the infrastructure. There’s no documented boundary between “platform problem” and “model problem.”
The result: every production AI deployment becomes a cross-team fire drill. Each one is slightly different. Each one requires platform engineers to learn something new under pressure. Each one creates a precedent that makes the next one harder, not easier.
There’s no observability
The existing monitoring stack shows that pods are running and requests are flowing. It doesn’t show GPU utilisation, model loading times, inference latency by endpoint, queue saturation, or cost per prediction. The ML team asks “why is the model slow?” and the platform team can’t answer the question with existing dashboards.
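The baseline doesn’t need to be elaborate. A sketch of Prometheus recording rules, assuming NVIDIA’s dcgm-exporter and vLLM’s built-in metrics are being scraped (metric and label names vary by exporter and version):

```yaml
# Sketch of a Prometheus rules file for an inference observability baseline.
# Metric names assume dcgm-exporter and vLLM's /metrics endpoint; adjust to your stack.
groups:
- name: inference-baseline
  rules:
  # Average GPU utilisation per pod (label names depend on your scrape config).
  - record: platform:gpu_utilisation:avg
    expr: avg by (pod) (DCGM_FI_DEV_GPU_UTIL)
  # p95 end-to-end inference latency per model over 5 minutes.
  - record: platform:inference_latency_p95:5m
    expr: |
      histogram_quantile(0.95,
        sum by (le, model_name) (rate(vllm:e2e_request_latency_seconds_bucket[5m])))
  # Requests waiting in the serving queue - the saturation signal.
  - record: platform:inference_queue_depth
    expr: sum by (model_name) (vllm:num_requests_waiting)
```

Cost per prediction usually comes from a cost-allocation tool layered on top, but the utilisation, latency, and queue signals are the minimum needed to answer “why is the model slow?”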
Why Starting at the Platform Layer Is Better
The counterintuitive move is to build the platform foundation before - or at least alongside - the first production AI workload.
This feels wrong because the model layer is where the visible value lives. A working model demo gets executive attention. A Kubernetes scheduling policy does not.
But the platform layer is what determines whether the demo becomes a reliable production service or an ongoing operational headache.
It’s cheaper to define patterns before production pressure
Building GPU governance, deployment templates, observability baselines, and an operating model before the first production inference workload is straightforward engineering work. Building them after - while a production endpoint is already running, teams are waiting, and leadership is asking for status updates - is expensive, rushed, and results in compromises that become permanent.
It prevents the one-off precedent problem
Without a standard path, every inference deployment becomes a unique snowflake. The first team gets a custom setup. The second team gets a different custom setup because the first one wasn’t designed to be reusable. By the fifth team, the platform has five incompatible patterns and the platform team is permanently firefighting.
Standard patterns established early - even if they evolve - give every team the same starting point and give the platform team a manageable surface area.
It makes GPU cost governable from day one
The window for establishing GPU governance is before spending becomes entrenched. Once teams are running on specific accelerator types with specific capacity reservations, changing the allocation approach is politically difficult. Starting with quotas, classes, and right-sizing from the beginning is radically easier than retrofitting them six months later.
It answers the operating model question before the first incident
“Who owns what?” is a question best answered in a design session, not during a 3am page. Defining the boundary between platform responsibility and model-team responsibility - and documenting it - before inference workloads hit production avoids the confusion and finger-pointing that derails incident response.
What Platform-First AI Infrastructure Looks Like
Starting at the platform layer doesn’t mean ignoring the model or application layers. It means ensuring the foundation exists before production workloads land on it.
Concretely, this means:
Define accelerator classes and scheduling policy before the first GPU node is provisioned. Teams should select from standard options, not request arbitrary instance types - one possible shape for those classes is sketched after this list.
Build a deployment template for inference workloads before the first model-serving endpoint is deployed. This should handle GPU requests, health checks with appropriate timeouts, model artifact loading, and autoscaling configuration.
Establish observability baselines before the first production request is served. GPU utilisation, inference latency, queue depth, and per-endpoint cost should be visible from day one.
Document the operating model before the first on-call rotation. Platform team responsibilities, model-team responsibilities, escalation paths, and incident response procedures should be written down and agreed upon.
Set GPU quotas and cost attribution before the second team asks for GPUs. The first team can be the design partner. The second team should be onboarded through a standard process.
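As one possible shape for the accelerator-class item above: with DRA (Kubernetes 1.32+), a named DeviceClass can act as the standard option teams select rather than requesting arbitrary hardware. The driver name here is an assumption about the DRA driver installed on the cluster:

```yaml
# Sketch of a DRA DeviceClass acting as a named accelerator class.
# The driver name is an assumption about the GPU DRA driver in your cluster.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu-standard   # the "standard option" teams are allowed to pick
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.nvidia.com"
      # Classes can constrain further on driver-exposed attributes
      # (product, memory) to define small / standard / large tiers.
```

Workloads then reference the class through a ResourceClaim, which gives the platform team a single place to evolve scheduling policy without touching every team’s manifests.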
This work is not glamorous. It doesn’t demo well. It doesn’t generate executive excitement. But it’s the difference between an AI capability that scales and one that becomes the next operational burden the platform team gets blamed for.
The Takeaway
The model layer is where the value proposition lives. The platform layer is where the operational reality lives. Most organisations invest heavily in the first and discover the second the hard way.
If your organisation is moving toward production AI workloads, the platform conversation needs to happen alongside the model conversation - not after it. Starting at the right layer is the difference between a capability that scales and a series of one-off fire drills that never converge into a repeatable model.
If you’re trying to figure out what “production-ready” looks like for AI inference on your Kubernetes platform, we can help you work through that.