Telecommunications

Enterprise Observability at Scale

CityFibre

Designed and operated a centralised observability platform across 40+ EKS clusters, processing 864 million metrics daily.

  • 864M metrics ingested per day
  • 40+ EKS clusters monitored
  • ~10k/s datapoints processed

Technologies

AWS, EKS, Terraform, Flux, Grafana, Mimir, Loki, Tempo, OpenTelemetry, Karpenter

The Challenge

CityFibre, a major UK telecommunications infrastructure provider, faced several platform challenges:

  • Fragmented infrastructure: Multiple EKS clusters deployed through various methods (ClickOps, ad-hoc Helm, different CD pipelines)
  • No unified observability: Teams lacked visibility across the estate, making troubleshooting difficult
  • Inconsistent deployments: A mix of Bitbucket CD pipelines, Argo CD, and Flux, with no common standard
  • Network telemetry requirements: Need to ingest and retain large-scale network metrics for operational visibility

Our Approach

Centralised Observability Stack

Designed and operated an organisation-wide monitoring stack based on LGTM (Loki, Grafana, Tempo, Mimir):

  • Grafana Mimir for metrics at scale with long-term retention
  • Grafana Loki for centralised log aggregation
  • Grafana Tempo for distributed tracing
  • OpenTelemetry Collector for unified telemetry ingestion across all clusters

The architecture prioritised:

  • Resilience and high availability from day one
  • Developer self-service through Kubernetes CRDs for dashboards and alerts
  • Alerting completeness with well-defined escalation paths

Network Telemetry Platform

Built a large-scale network telemetry platform:

  • OpenTSDB on ECS backed by EMR HBase for time-series storage
  • Controlled batching and horizontal scaling for consistent ingestion
  • Long-term retention strategies for compliance and capacity planning
  • Processing approximately 864 million metrics per day (~10,000 datapoints/second)
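
The throughput figure and the batching idea above can be illustrated with a minimal Python sketch. It is not the platform's actual ingestion code: the metric name, tag, and batch size are hypothetical, and the payload shape follows the JSON array of datapoints that OpenTSDB's HTTP `/api/put` endpoint accepts. Nothing is sent over the network; the sketch only shows how a daily total maps to a per-second rate and how datapoints might be grouped into bounded batches.

```python
import json
from typing import Dict, Iterable, Iterator, List

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second_rate(datapoints_per_day: int) -> float:
    """Average ingestion rate implied by a daily total."""
    return datapoints_per_day / SECONDS_PER_DAY


def batched(points: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield lists of at most batch_size datapoints, preserving order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Dict] = []
    for point in points:
        batch.append(point)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch


def to_put_payload(batch: List[Dict]) -> str:
    """Serialise a batch as a JSON array of OpenTSDB-style datapoints."""
    return json.dumps(batch)


# Hypothetical sample datapoints; metric and tag names are illustrative only.
sample = [
    {
        "metric": "net.if.rx_bytes",
        "timestamp": 1700000000 + i,
        "value": i * 10,
        "tags": {"device": "example-router-1"},
    }
    for i in range(5)
]

rate = per_second_rate(864_000_000)            # 864,000,000 / 86,400
batches = list(batched(sample, batch_size=2))  # batch size chosen for the demo
payloads = [to_put_payload(b) for b in batches]
```

Capping the batch size bounds the size of each write request, which is one common way to keep ingestion latency predictable as additional writer instances are added.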

GitOps Consolidation

Standardised the fragmented EKS estate into a single GitOps operating model:

  • Unified on Flux for declarative cluster management
  • Established patterns for self-hosting internal services
  • Enabled reliable deployment of complex third-party platforms (MuleSoft, Camunda)
  • Documented architecture decisions and operational runbooks
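
As a rough illustration of what a single Flux-based operating model can look like, the sketch below generates one Flux `Kustomization` object per cluster from a shared template. The cluster names, the repository name `platform-config`, the `./clusters/<name>` path layout, and the 10-minute interval are all assumptions made for the example, not details of the CityFibre setup.

```python
import json
from typing import Dict, List


def flux_kustomization(cluster: str, repo: str = "platform-config") -> Dict:
    """Build a Flux Kustomization manifest pointing one cluster at its own path."""
    return {
        "apiVersion": "kustomize.toolkit.fluxcd.io/v1",
        "kind": "Kustomization",
        "metadata": {"name": f"{cluster}-platform", "namespace": "flux-system"},
        "spec": {
            "interval": "10m",                # how often Flux reconciles (assumed)
            "path": f"./clusters/{cluster}",  # hypothetical repository layout
            "prune": True,                    # delete resources removed from Git
            "sourceRef": {"kind": "GitRepository", "name": repo},
        },
    }


# Hypothetical cluster names, used only to show the pattern.
clusters: List[str] = ["example-dev", "example-prod"]
manifests = [flux_kustomization(c) for c in clusters]

# Printed as JSON for inspection.
print(json.dumps(manifests[0], indent=2))
```

Generating every cluster's entry point from the same template is one way to keep a large estate consistent: each cluster differs only in its name and path, and all other behaviour comes from Git.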

Technical Leadership

Served as the technical authority for Kubernetes and AWS across the platform team:

  • Systematic upskilling of engineers through hands-on guidance
  • Architectural decision-making for platform direction
  • Materially improved the team’s ability to operate the platform independently
  • Enabled the team to fulfil service requests without external dependencies

Results

  • Unified visibility across 40+ EKS clusters through a single observability platform
  • Self-service enabled for development teams via Kubernetes-native dashboards and alerts
  • Scalable telemetry processing of 864M metrics/day, with room for growth
  • Standardised operations through consolidated GitOps model
  • Team capability uplift through hands-on guidance and architectural decision-making
  • Third-party hosting enabled for complex platforms such as MuleSoft and Camunda