Results
- £450,000/year in cost savings from retiring a legacy logging platform and migrating to a modern, Kubernetes-native stack
- £35,000/year in additional savings from consolidating hundreds of load balancers onto shared infrastructure, completed in 3 weeks
- ~100 Kubernetes clusters managed across a standardised Common Platform
- ~30 platform engineers supported across the team, from junior to lead level, with structured mentoring and hands-on upskilling
- 2000+ monitoring checks migrated from a manually configured legacy system to a standardised, self-service alerting framework
- Internal Developer Platform established with opinionated infrastructure modules, built-in alerting, and an SLO framework, enabling engineering teams to self-serve
- CI/CD modernisation with migration from legacy pipelines to GitHub Actions, including self-hosted Kubernetes-backed runners
- End-to-end tracing rolled out across the platform through OpenTelemetry instrumentation
- DORA metrics and engineering intelligence introduced to give senior stakeholders visibility into delivery performance and platform adoption
- Organisation-wide backup strategy implemented for databases and object storage across the entire AWS estate
- Re-engaged for a second programme of work after the initial engagement - a direct signal of the value delivered
The Problem
The Common Platform powers major streaming services, news platforms, and OTT products. A platform team of approximately 30 engineers was responsible for around 100 Kubernetes clusters, but the platform had accumulated significant technical debt as it grew:
- Legacy logging infrastructure with high licensing costs and operational overhead - no path to scale without increasing spend
- Fragmented monitoring requiring manual configuration for over 2000 individual checks, with no self-service capability for engineering teams
- Inefficient load balancing with hundreds of individual load balancers instead of shared infrastructure
- No developer self-service - engineering teams were dependent on the platform team for common infrastructure tasks, creating bottlenecks and slowing delivery
- Limited visibility into platform health and delivery performance - no SLO framework, no engineering metrics, no way to measure improvement
What We Delivered
Observability Modernisation
Replaced the legacy logging and monitoring stack with a modern, Kubernetes-native observability platform. This eliminated the largest single infrastructure cost on the platform while giving engineering teams better visibility than before. We migrated 2000+ monitoring checks to a self-service alerting framework, allowing developers to define their own metric-based, log-based, and script-based alerts without platform team involvement.
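To illustrate what a self-service alert definition might look like, here is a minimal sketch in Python. The `AlertDefinition` schema, its field names, and the `validate` helper are hypothetical and are not taken from the actual framework; only the three alert kinds (metric, log, script) come from the description above.

```python
from dataclasses import dataclass
from typing import List

# The three alert types named in the text; everything else here is assumed.
ALLOWED_KINDS = {"metric", "log", "script"}
ALLOWED_SEVERITIES = {"info", "warning", "critical"}  # hypothetical levels


@dataclass(frozen=True)
class AlertDefinition:
    """Hypothetical alert definition owned by an engineering team."""
    name: str
    team: str
    kind: str          # "metric", "log", or "script"
    expression: str    # query, log pattern, or script path, depending on kind
    severity: str = "warning"


def validate(alert: AlertDefinition) -> List[str]:
    """Return a list of problems; an empty list means the definition is acceptable."""
    problems = []
    if alert.kind not in ALLOWED_KINDS:
        problems.append(f"unknown kind: {alert.kind!r}")
    if not alert.name.strip():
        problems.append("name must not be empty")
    if not alert.team.strip():
        problems.append("team must not be empty")
    if not alert.expression.strip():
        problems.append("expression must not be empty")
    if alert.severity not in ALLOWED_SEVERITIES:
        problems.append(f"unknown severity: {alert.severity!r}")
    return problems
```

Validating definitions up front like this is one way a framework could let teams own their alerts while still enforcing a standard shape, so that no platform engineer has to review each check by hand.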
Internal Developer Platform
Designed and built an IDP that shifted common infrastructure tasks from the platform team to engineering self-service. This included opinionated infrastructure modules with built-in alerting, a standardised SLO framework, and self-service CI/CD runners. The platform team’s role shifted from fulfilling requests to maintaining and improving the platform itself.
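The arithmetic behind a typical SLO framework can be sketched as an error-budget calculation. The function below is an illustrative assumption, not the framework's actual implementation: it takes a request-based availability target and reports how much of the error budget is left.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget left for the period.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    Assumes a request-based SLO such as "99.9% of requests succeed".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        return 1.0  # no traffic, so no budget has been consumed
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

For example, with a 99.9% target and 1,000,000 requests, 1,000 failures are allowed; 250 failures would leave roughly 75% of the budget.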
Distributed Tracing and Telemetry
Rolled out OpenTelemetry instrumentation across the platform to enable end-to-end tracing, with the goal of unifying metric collection, log shipping, and trace correlation into a single telemetry pipeline.
Engineering Intelligence
Introduced DORA metrics and KPI tracking to give senior stakeholders visibility into platform adoption and delivery performance - enabling data-driven decisions about platform investment.
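As a rough sketch of how the four DORA metrics (deployment frequency, lead time for changes, change failure rate, and time to restore service) can be derived from deployment records, consider the following. The `Deployment` record and `dora_summary` function are hypothetical; the real data sources and tooling used in the engagement are not described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deployment is recovered


def dora_summary(deployments: List[Deployment], window_days: int) -> dict:
    """Compute the four DORA metrics over a reporting window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at is not None]
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restores) if restores else None,
    }
```

Medians are used here rather than means so that a single unusually slow change does not dominate the figure reported to stakeholders; that is a design choice of this sketch, not a documented detail of the programme.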
Backup and Disaster Recovery
Architected and implemented an organisation-wide backup strategy for databases and object storage, ensuring data protection and compliance across the entire AWS estate.