
Nobody Decided to Have 100 Kubernetes Clusters

Kubernetes cluster sprawl is one of the most expensive problems in platform engineering. A decision framework for multi-cluster management, consolidation, and when a new cluster is actually justified.

Kubernetes, Platform Engineering, FinOps, Architecture

Nobody decided to have 100 Kubernetes clusters.

It just happened.

One per environment. One per team. One because of a compliance requirement. One because the last one got messy and a fresh start felt easier.

Each decision made sense in isolation. The cumulative result is an estate nobody actually designed.

The Hidden Cost of Every Cluster

A Kubernetes cluster isn’t free to exist. Every cluster carries a fixed operational tax, regardless of what runs on it:

  • Control plane costs: On EKS, that’s $0.10/hour ($73/month) per cluster during standard support - and $0.60/hour ($438/month) once you hit extended support. Multiply by 100 clusters and the control plane bill alone becomes significant before a single node runs.
  • Networking surface: Each cluster needs its own ingress controllers, load balancers, DNS entries, and potentially its own VPC or subnet allocation.
  • Duplicated observability: Monitoring agents, log collectors, and telemetry pipelines deployed independently per cluster. That’s not just compute cost - it’s pipeline complexity and storage.
  • Security overhead: Each cluster maintains its own RBAC model, secrets, network policies, and admission controllers. Every one needs auditing.
  • Upgrade lifecycle: Kubernetes releases three minor versions per year. Every cluster you own is another upgrade you need to plan, test, and execute.
  • Operational burden: Each cluster carries its own runbooks, its own on-call blast radius, and its own failure modes. Your platform team doesn’t scale linearly with your cluster count.
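These fixed costs compound quickly. A back-of-the-envelope sketch using the EKS control plane rates above (730 hours per month; the cluster counts are illustrative):

```python
# Back-of-the-envelope control plane cost for an EKS estate.
# The hourly rates are the per-cluster prices quoted above;
# the cluster counts are illustrative.

HOURS_PER_MONTH = 730  # 24 * 365 / 12

STANDARD_RATE = 0.10   # $/hour per cluster, standard support
EXTENDED_RATE = 0.60   # $/hour per cluster, extended support

def monthly_control_plane_cost(clusters: int, rate: float) -> float:
    """Monthly control plane bill for `clusters` clusters at `rate` $/hour."""
    return clusters * rate * HOURS_PER_MONTH

for n in (4, 40, 100):
    std = monthly_control_plane_cost(n, STANDARD_RATE)
    ext = monthly_control_plane_cost(n, EXTENDED_RATE)
    print(f"{n:>3} clusters: ${std:>8,.0f}/mo standard, ${ext:>8,.0f}/mo extended")
```

At 100 clusters on extended support, that is $43,800 a month before a single node runs - and none of it shows up in any one team's budget.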

Clusters multiply faster than platform teams do.

At 40, 80, or 100 clusters, you don’t just have a cluster problem. You have operational debt that compounds every quarter.

How Cluster Sprawl Actually Happens

The pattern is predictable. It usually starts with a reasonable decision and escalates through precedent:

Stage 1 - Sensible separation. Production and non-production get their own clusters. Maybe a separate cluster for CI/CD workloads. Three or four clusters, well-understood boundaries.

Stage 2 - Team autonomy. A team wants more control over their environment. They request their own cluster. It’s easier to say yes than to design a multi-tenant model. Other teams follow.

Stage 3 - Compliance and regulation. An audit requires PCI workloads to be isolated. A new data residency requirement means a cluster in a different region. These are legitimate - but they set a precedent that isolation means a new cluster.

Stage 4 - Escape velocity. New clusters become the default answer to any friction. Noisy neighbour problems, upgrade disagreements, “we just need something clean” - all lead to more clusters. Nobody tracks the total. Nobody owns the decision.

By Stage 4, most teams can’t clearly explain why they have as many clusters as they do.

A Decision Framework: When Is a New Cluster Justified?

A cluster boundary should exist for at least one of these reasons:

Hard compliance isolation

PCI scope, data residency requirements, or regulated tenant separation that cannot be satisfied by namespace-level controls. If your auditor or regulator requires it, this is non-negotiable.

Blast radius control

A failure in one set of workloads must not impact another set. This applies when workloads have fundamentally different availability requirements and namespace-level isolation isn’t sufficient to guarantee that separation.

Lifecycle independence

A workload cannot tolerate the same Kubernetes upgrade cadence as the rest of the estate. This is common with legacy applications or third-party software that certifies against specific Kubernetes versions.

Genuine trust boundary

The tenancy model requires stronger isolation than namespaces, RBAC, and NetworkPolicies can provide. This applies when you’re running workloads for different legal entities or when a compromised namespace could lead to lateral movement across a trust boundary.

If none of those apply, you probably have a namespace problem wearing a cluster costume.
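One way to make the framework operational is to encode it as a checklist every cluster request must pass. A minimal sketch - the field names and helper are illustrative, not an established schema:

```python
from dataclasses import dataclass

@dataclass
class ClusterRequest:
    """Answers a team gives when requesting a new cluster.
    Field names mirror the four criteria above; they are illustrative,
    not a standard schema."""
    hard_compliance_isolation: bool = False  # PCI scope, data residency, regulated tenants
    blast_radius_control: bool = False       # failure domains that must not overlap
    lifecycle_independence: bool = False     # cannot follow the shared upgrade cadence
    genuine_trust_boundary: bool = False     # isolation beyond namespaces/RBAC/NetworkPolicy

def cluster_justified(req: ClusterRequest) -> bool:
    """A cluster boundary needs at least one of the four justifications."""
    return any([
        req.hard_compliance_isolation,
        req.blast_radius_control,
        req.lifecycle_independence,
        req.genuine_trust_boundary,
    ])

# "We just need something clean" satisfies none of the criteria:
print(cluster_justified(ClusterRequest()))  # False
# A regulated-tenant workload does:
print(cluster_justified(ClusterRequest(hard_compliance_isolation=True)))  # True
```

The value isn’t the code - it’s that the default answer becomes "no" unless a named criterion applies, with the justification recorded for the audit later.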

What to Use Instead

Most of the problems teams solve with new clusters can be solved with existing Kubernetes primitives:

  • Namespaces provide logical isolation, resource scoping, and RBAC boundaries. They’re the first tool to reach for.
  • RBAC and admission controllers enforce who can do what within a cluster, down to the namespace and resource level.
  • NetworkPolicies control pod-to-pod communication. Combined with a CNI that supports them properly, they provide strong network isolation without cluster boundaries.
  • ResourceQuotas and LimitRanges prevent noisy neighbour problems by capping resource consumption per namespace.
  • Virtual clusters (tools like vCluster) fill the gap between namespace isolation and full cluster separation. They provide the experience of a dedicated cluster - with its own API server and control plane - without the operational overhead of another physical cluster.
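To see how quotas keep a noisy neighbour quiet, here is a deliberately simplified model of the admission check a ResourceQuota implies: the sum of requests in a namespace must stay within the quota. The real kube-apiserver check covers many more resource kinds; this sketch models CPU requests only, in millicores:

```python
# Simplified model of ResourceQuota admission: a pod is admitted only if
# the namespace's total CPU requests, including the new pod, stay within
# the namespace quota. All values are in millicores (1000m = 1 CPU).
# This is a teaching sketch, not the actual admission plugin.

def fits_quota(existing_requests_m: list[int], new_request_m: int, quota_m: int) -> bool:
    """Would admitting a pod with `new_request_m` millicores exceed the quota?"""
    return sum(existing_requests_m) + new_request_m <= quota_m

# Namespace quota of 4 CPUs (4000m), three pods already requesting 1 CPU each:
print(fits_quota([1000, 1000, 1000], 500, 4000))   # True  - the pod fits
print(fits_quota([1000, 1000, 1000], 2000, 4000))  # False - rejected at admission
```

The greedy pod is rejected inside its own namespace; the neighbouring namespaces never notice. That is the isolation teams are usually reaching for when they ask for a new cluster.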

The right number of clusters isn’t more. It isn’t fewer. It’s the number you can justify.

Running the Exercise

If you want to right-size your cluster estate, start with a simple audit:

  1. Inventory every cluster and document its stated purpose.
  2. Map workloads to clusters - what actually runs where, and why?
  3. Apply the four criteria above. For each cluster, identify which justification applies. If none do, flag it.
  4. Identify consolidation candidates. Clusters that exist for convenience rather than necessity are candidates for consolidation into a multi-tenant model.
  5. Estimate the savings. Control plane costs, networking, observability duplication, and engineering time spent on upgrades all add up.
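Steps 3-5 can be scripted once the inventory exists. A sketch with an illustrative inventory and an assumed $300/month fixed cost per cluster - substitute your own control plane, load balancer, and observability figures:

```python
# Flag clusters with no surviving justification and estimate the fixed
# monthly saving from consolidating them. The inventory and the per-cluster
# fixed cost are illustrative placeholders, not real data.

FIXED_COST_PER_CLUSTER = 300  # $/month: control plane + LBs + duplicated agents (assumed)

inventory = {
    "prod-payments": {"justifications": ["hard_compliance_isolation"]},
    "prod-eu":       {"justifications": ["hard_compliance_isolation"]},
    "team-alpha":    {"justifications": []},  # convenience cluster
    "team-bravo":    {"justifications": []},  # convenience cluster
    "ci-sandbox":    {"justifications": []},
}

candidates = [name for name, c in inventory.items() if not c["justifications"]]
savings = len(candidates) * FIXED_COST_PER_CLUSTER

print(f"Consolidation candidates: {candidates}")
print(f"Estimated fixed saving:  ${savings}/month")
```

Even this toy inventory flags three of five clusters - roughly the 30-50% rate the exercise tends to surface in practice.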

Most organisations that go through this exercise find that 30-50% of their clusters exist for reasons that no longer apply - or never had a strong justification in the first place.

The Takeaway

Cluster sprawl isn’t a technology problem. It’s a decision-making problem.

Every cluster you can consolidate is operational overhead you stop paying - in money, in engineering time, and in cognitive load. The fix isn’t a migration project. It’s a governance model: a clear set of criteria for when a cluster boundary is justified, applied consistently.

How many of your clusters would survive a real architectural review?

If you’re not sure, that’s a conversation worth having.