The Challenge
CityFibre, a major UK telecommunications infrastructure provider, faced several platform challenges:
- Fragmented infrastructure: Multiple EKS clusters deployed through various methods (ClickOps, ad-hoc Helm, different CD pipelines)
- No unified observability: Teams lacked visibility across the estate, making troubleshooting difficult
- Inconsistent deployments: A mix of Bitbucket CD pipelines, Argo CD, and Flux, with no standardisation
- Network telemetry requirements: Need to ingest and retain large-scale network metrics for operational visibility
Our Approach
Centralised Observability Stack
Designed and operated an organisation-wide LGTM-based monitoring stack:
- Grafana Mimir for metrics at scale with long-term retention
- Grafana Loki for centralised log aggregation
- Grafana Tempo for distributed tracing
- OpenTelemetry Collector for unified telemetry ingestion across all clusters
The architecture prioritised:
- Resilience and high availability from day one
- Developer self-service through Kubernetes CRDs for dashboards and alerts
- Alerting completeness with well-defined escalation paths
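As an illustration of the self-service model, the sketch below builds an alert definition as a Kubernetes custom resource that a team could commit alongside its service. The case study does not name the CRDs that were used; the Prometheus Operator `PrometheusRule` kind is assumed here, and the team name, alert name, expression, and severity label are hypothetical.

```python
import json


def alert_rule_manifest(team: str, alert: str, expr: str,
                        for_: str = "5m", severity: str = "warning") -> dict:
    """Build a PrometheusRule manifest for one team-owned alert.

    Assumption: alerts are declared as Prometheus Operator PrometheusRule
    resources. The source only says "Kubernetes CRDs".
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            # Kubernetes object names must be lowercase.
            "name": f"{team}-{alert.lower()}",
            "namespace": team,
            "labels": {"team": team},
        },
        "spec": {
            "groups": [{
                "name": f"{team}.rules",
                "rules": [{
                    "alert": alert,
                    "expr": expr,
                    "for": for_,
                    # The severity label is what an escalation path would
                    # typically route on (hypothetical convention).
                    "labels": {"severity": severity, "team": team},
                    "annotations": {"summary": f"{alert} firing for {team}"},
                }],
            }],
        },
    }


if __name__ == "__main__":
    manifest = alert_rule_manifest(
        team="payments",  # hypothetical team
        alert="HighErrorRate",  # hypothetical alert
        expr='sum(rate(http_requests_total{status=~"5.."}[5m])) > 1',
        severity="critical",
    )
    print(json.dumps(manifest, indent=2))
```

Keeping the alert next to the service's other manifests means it is reviewed and deployed through the same pipeline as the workload it covers.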
Network Telemetry Platform
Built a large-scale network telemetry platform:
- OpenTSDB on ECS backed by EMR HBase for time-series storage
- Controlled batching and horizontal scaling for consistent ingestion
- Long-term retention strategies for compliance and capacity planning
- Sustained ingestion of approximately 864 million datapoints per day (~10,000 datapoints/second)
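The throughput figure and the batching idea can be sketched in a few lines. This is a minimal illustration rather than the production ingestion path: the JSON shape follows the OpenTSDB HTTP `/api/put` datapoint format (`metric`, `timestamp`, `value`, `tags`), while the metric name, tag values, and batch size are hypothetical, and nothing is sent over the network.

```python
import json
from itertools import islice
from typing import Iterable, Iterator

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second_rate(datapoints_per_day: int) -> float:
    """Convert a daily datapoint volume into an average per-second rate."""
    return datapoints_per_day / SECONDS_PER_DAY


def batch_datapoints(points: Iterable[dict], batch_size: int) -> Iterator[str]:
    """Yield JSON array payloads containing at most batch_size datapoints.

    Fixed-size batches keep each write request bounded, which is one way
    to keep ingestion load on the storage tier predictable.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(points)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield json.dumps(batch)


if __name__ == "__main__":
    # 864,000,000 / 86,400 = 10,000 datapoints per second on average.
    print(per_second_rate(864_000_000))

    # Hypothetical datapoints in the OpenTSDB /api/put JSON shape.
    points = (
        {
            "metric": "net.if.in_octets",
            "timestamp": 1_700_000_000 + i,
            "value": i,
            "tags": {"host": "router-1"},
        }
        for i in range(5)
    )
    for payload in batch_datapoints(points, batch_size=2):
        print(len(json.loads(payload)), "datapoints in batch")
```

Because the batcher consumes any iterable lazily, several identical workers can each pull from their own partition of the input, which is the simplest form of the horizontal scaling mentioned above.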
GitOps Consolidation
Standardised the fragmented EKS estate into a single GitOps operating model:
- Unified on Flux for declarative cluster management
- Established patterns for self-hosting internal services
- Enabled reliable deployment of complex third-party platforms (MuleSoft, Camunda)
- Documented architecture decisions and operational runbooks
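To make the single operating model concrete, the sketch below generates one Flux `Kustomization` object per cluster from a shared template. The `kustomize.toolkit.fluxcd.io/v1` fields used (`interval`, `path`, `prune`, `sourceRef`) are part of the Flux API; the cluster names, repository layout, and reconciliation interval are assumptions for illustration and are not taken from the CityFibre estate.

```python
import json


def flux_kustomization(cluster: str, path_root: str = "./clusters") -> dict:
    """Build a Flux Kustomization that reconciles one cluster's directory.

    Assumptions: a repository laid out as <path_root>/<cluster>, and a
    GitRepository source named "flux-system" in the flux-system namespace.
    """
    return {
        "apiVersion": "kustomize.toolkit.fluxcd.io/v1",
        "kind": "Kustomization",
        "metadata": {"name": f"{cluster}-platform", "namespace": "flux-system"},
        "spec": {
            "interval": "10m",  # hypothetical reconciliation interval
            "path": f"{path_root}/{cluster}",
            "prune": True,  # remove resources that were deleted from Git
            "sourceRef": {"kind": "GitRepository", "name": "flux-system"},
        },
    }


if __name__ == "__main__":
    # Hypothetical cluster names.
    for name in ["dev-eu-west-2", "prod-eu-west-2"]:
        print(json.dumps(flux_kustomization(name), indent=2))
```

Generating every cluster's entry point from the same function is one way to keep the estate uniform: a cluster differs only in the directory it reconciles, not in how it is managed.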
Technical Leadership
Served as the technical authority for Kubernetes and AWS across the platform team:
- Systematic upskilling of engineers through hands-on guidance
- Architectural decision-making for platform direction
- Materially improved the team’s ability to operate the platform independently
- Enabled the team to fulfil service requests without external dependencies
Results
- Unified visibility across 40+ EKS clusters through a single observability platform
- Self-service enabled for development teams via Kubernetes-native dashboards and alerts
- Scalable telemetry platform processing 864M datapoints/day (~10,000 per second) with headroom for growth
- Standardised operations through consolidated GitOps model
- Team capability uplift through hands-on guidance and architectural decision-making
- Third-party hosting enabled for complex platforms such as MuleSoft and Camunda