Enterprise Observability at Scale | UK Telecoms Provider

Results

Greenfield observability capability established across 40+ Kubernetes clusters - the organisation had no unified monitoring before this engagement
864 million metrics per day ingested through a purpose-built network telemetry platform processing ~10,000 datapoints per second, with long-term retention and horizontal scaling
Developer self-service for dashboards and alerts through Kubernetes-native CRDs, removing the platform team as a bottleneck for observability configuration
Resilience and alerting completeness built in from day one rather than retrofitted
Consolidated operating model replacing a fragmented mix of ClickOps, ad-hoc installs, and multiple competing CD tools with a single, standardised GitOps approach
Unblocked self-hosting of critical third-party platforms (e.g. MuleSoft, Camunda) that the organisation had previously been unable to run reliably on its own infrastructure
Technical authority for Kubernetes and cloud infrastructure across the platform team, owning architecture, roadmap, and delivery for the platform and observability domain
Team capability uplift through systematic, hands-on upskilling - materially improving the team’s ability to operate the platform and fulfil service requests independently

The Problem

A major UK telecommunications infrastructure provider had invested in Kubernetes but was experiencing the operational problems that emerge when a multi-cluster estate grows without standardisation:

No unified observability - teams had no visibility across the estate, making troubleshooting slow and incident response reactive rather than structured
Fragmented deployment tooling - clusters had been set up through a mix of manual configuration, ad-hoc installs, and multiple competing CD tools with no consistent operating model
Inconsistent cluster management - no standard approach to how clusters were provisioned, configured, or maintained, creating operational risk and making changes difficult
Large-scale network telemetry requirements - the business needed to ingest and retain very high volumes of network metrics for operational visibility and capacity planning

What We Delivered

Centralised Observability Platform

Designed and operated an organisation-wide observability stack covering metrics, logs, and traces across the entire Kubernetes estate. The architecture was built for resilience and self-service from day one - engineering teams could define their own dashboards and alerts through Kubernetes-native configuration without waiting on the platform team. This gave the organisation production visibility it had never had before.
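As an illustration of what self-service through Kubernetes-native configuration can look like, the sketch below builds and checks a custom resource for an alert. It is a minimal sketch under stated assumptions, not the engagement's actual schema: the API group `observability.example.com`, the kind `AlertRule`, the field names, and the severity values are all hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical values: the real CRD group, kind, and fields are not given in the source.
API_VERSION = "observability.example.com/v1alpha1"
KIND = "AlertRule"
ALLOWED_SEVERITIES = {"info", "warning", "critical"}


def build_alert_resource(team: str, name: str, expression: str, severity: str) -> Dict[str, Any]:
    """Return a manifest-shaped dict that a team could commit to its own repository."""
    return {
        "apiVersion": API_VERSION,
        "kind": KIND,
        "metadata": {"name": name, "namespace": team, "labels": {"team": team}},
        "spec": {"expression": expression, "severity": severity},
    }


def validation_errors(resource: Dict[str, Any]) -> List[str]:
    """Collect problems a platform-side check might reject before the resource is applied."""
    errors: List[str] = []
    if resource.get("kind") != KIND:
        errors.append("unexpected kind")
    spec = resource.get("spec", {})
    if not spec.get("expression"):
        errors.append("spec.expression must not be empty")
    if spec.get("severity") not in ALLOWED_SEVERITIES:
        errors.append("spec.severity must be one of: " + ", ".join(sorted(ALLOWED_SEVERITIES)))
    return errors


if __name__ == "__main__":
    rule = build_alert_resource("payments", "high-error-rate", "error_ratio > 0.05", "critical")
    print(validation_errors(rule))
```

In a model like this, the owning team declares the alert alongside its service and automated validation replaces a manual request to the platform team.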

Network Telemetry at Scale

Built a purpose-designed telemetry platform to ingest approximately 864 million metric datapoints per day, a sustained rate of around 10,000 datapoints per second. The architecture used controlled batching, horizontal scaling, and long-term retention strategies to meet both real-time operational needs and compliance requirements.
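The two throughput figures are consistent with each other: 10,000 datapoints per second over the 86,400 seconds in a day gives 864 million. The sketch below reproduces that arithmetic and shows one simple form of size-capped batching. It is illustrative only; the function names and the batch size are assumptions, and the platform's actual batching strategy is not described in the source.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def datapoints_per_day(rate_per_second: int) -> int:
    """Daily volume implied by a sustained per-second ingest rate."""
    return rate_per_second * SECONDS_PER_DAY


def fixed_size_batches(points: Iterable[T], max_batch_size: int) -> Iterator[List[T]]:
    """Group a stream into batches of at most max_batch_size items.

    Capping batch size bounds the memory held per write and the size of
    each request sent to the storage backend.
    """
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    batch: List[T] = []
    for point in points:
        batch.append(point)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


if __name__ == "__main__":
    print(datapoints_per_day(10_000))
    # Hypothetical batch size of 4, chosen only to keep the example small.
    print([len(b) for b in fixed_size_batches(range(10), 4)])
```

A production pipeline would typically also flush on a time limit so that a quiet stream does not hold datapoints indefinitely; that is omitted here for brevity.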

Platform Consolidation

Standardised the fragmented Kubernetes estate into a single GitOps operating model. This unified how clusters were managed, how services were deployed, and how changes were promoted - replacing a patchwork of manual processes and competing tools. It also unblocked the reliable self-hosting of complex third-party platforms that the organisation had previously been unable to run on its own infrastructure.

Technical Leadership and Knowledge Transfer

Served as the technical authority for Kubernetes and cloud infrastructure across the platform team. Through hands-on guidance and architectural decision-making, materially improved the team’s ability to operate the platform independently and fulfil service requests without external dependencies.