The Challenge
ITV’s Common Platform powers ITVX (formerly ITV Hub), ITV News, and various OTT projects. The platform had accumulated technical debt across several areas:
- Legacy logging: Puppet-managed ELK stack with high operational overhead and licensing costs
- Fragmented monitoring: Sensu/Uchiwa setup requiring manual configuration for over 2000 checks
- Inefficient load balancing: Hundreds of Classic Load Balancers (CLBs) for individual services
- Limited developer self-service: Engineers dependent on platform team for common infrastructure tasks
Our Approach
Logging Infrastructure Overhaul
We designed and implemented a migration from the legacy ELK stack to Loki hosted on EKS. This involved:
- Architecting a scalable Loki deployment with appropriate retention policies
- Developing migration tooling to ensure zero data loss during transition
- Creating Grafana dashboards to maintain feature parity with Kibana
- Training development teams on LogQL and the new observability stack
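For teams moving from Kibana searches to LogQL, a small helper like the one below can make the query shape concrete. It is a minimal sketch, not the tooling used on the project: the label names (`app`, `env`), the base URL, and both function names are hypothetical. The only external detail relied on is Loki's `/loki/api/v1/query_range` HTTP endpoint with its `query`, `start`, `end`, and `limit` parameters.

```python
from typing import Mapping, Optional
from urllib.parse import urlencode


def build_logql(selector: Mapping[str, str], contains: Optional[str] = None) -> str:
    """Build a LogQL stream selector with an optional 'line contains' filter."""
    # Sort labels so the generated query is stable and easy to compare.
    labels = ", ".join(f'{key}="{value}"' for key, value in sorted(selector.items()))
    query = "{" + labels + "}"
    if contains:
        # Escape backslashes and double quotes inside the filter string.
        escaped = contains.replace("\\", "\\\\").replace('"', '\\"')
        query += f' |= "{escaped}"'
    return query


def query_range_url(base_url: str, query: str, start_ns: int, end_ns: int,
                    limit: int = 100) -> str:
    """Build a query_range URL; start and end are Unix epoch nanoseconds."""
    params = urlencode({"query": query, "start": start_ns, "end": end_ns, "limit": limit})
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


if __name__ == "__main__":
    # Hypothetical labels and host, for illustration only.
    q = build_logql({"app": "api", "env": "prod"}, contains="error")
    print(q)
    print(query_range_url("http://loki.example", q, 0, 3_600_000_000_000))
```

The same query string can be pasted directly into Grafana's Explore view, which is usually the quickest way to check a selector before wiring it into a dashboard.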
Monitoring Modernisation
We migrated monitoring from Sensu/Uchiwa to Prometheus and Alertmanager:
- Converted 2000+ legacy checks to Prometheus recording and alerting rules
- Implemented alerts for EKS and AWS services (RDS, Lambda, SQS)
- Established self-service alerting through Kubernetes CRDs
Internal Developer Platform
We architected the Common Platform IDP, including:
- Opinionated Terraform component modules with built-in alerting
- SLO framework using Sloth for standardised service level objectives
- Jenkins pipelines enabling developer self-service for metrics (PrometheusRules), scripts (K8s CronJobs), and log-based alerts (LogQL)
- Actions Runner Controller deployment for GitHub Actions self-hosted runners
- Migration of internal CI/CD from Jenkins pipelines to GitHub Actions workflows
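A standardised SLO ultimately reduces to an error budget and a burn rate. The sketch below shows that arithmetic in isolation; it is not Sloth's implementation, and the function names and the 30-day window are assumptions chosen for illustration. For example, a 99.9% objective over 30 days leaves a budget of about 43.2 minutes of full unavailability.

```python
def error_budget_minutes(objective_percent: float, window_days: int = 30) -> float:
    """Minutes of total unavailability allowed by an SLO over the window."""
    if not 0 < objective_percent < 100:
        raise ValueError("objective_percent must be between 0 and 100 (exclusive)")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - objective_percent) / 100


def burn_rate(error_ratio: float, objective_percent: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    allowed_error_ratio = (100 - objective_percent) / 100
    return error_ratio / allowed_error_ratio


if __name__ == "__main__":
    print(round(error_budget_minutes(99.9), 2))
    # An observed error ratio of 0.2% against a 99.9% objective burns the
    # budget roughly twice as fast as is sustainable.
    print(round(burn_rate(0.002, 99.9), 2))
```

Alerting on burn rate rather than on raw error counts is what lets one alert definition be reused across services with different traffic levels.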
Distributed Tracing
We rolled out the OpenTelemetry Operator and the OpenTelemetry Collector to improve tracing across the platform, with the further goal of simplifying metric scraping and log shipping.
Backup & Disaster Recovery
We architected and implemented an AWS Backup solution covering RDS, S3, and DynamoDB across the entire AWS organisation, ensuring data protection and compliance.
Engineering Intelligence
- Deployed and configured Apache DevLake to visualise platform DORA metrics for senior stakeholders
- Automated collection of KPIs to track developer uptake of the IDP
- Created Python scripts to automate Terraform state migration of services to the new IDP
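Automating a Terraform state migration mostly means generating a reviewed list of `terraform state mv` commands. The following is a minimal sketch of that idea, not the scripts used on the project: the resource addresses, the mapping format, and the function names are hypothetical. It relies only on the `terraform state mv SOURCE DESTINATION` command and its `-dry-run` option.

```python
import shlex
from typing import Dict, List


def state_mv_commands(moves: Dict[str, str], dry_run: bool = True) -> List[List[str]]:
    """Build one 'terraform state mv' argument list per old -> new address."""
    commands = []
    for source, destination in moves.items():
        cmd = ["terraform", "state", "mv"]
        if dry_run:
            # Preview the move without modifying the state.
            cmd.append("-dry-run")
        cmd += [source, destination]
        commands.append(cmd)
    return commands


def render(commands: List[List[str]]) -> str:
    """Render commands as shell-safe lines for review before execution."""
    return "\n".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    # Hypothetical addresses: a standalone queue moved under an IDP module.
    moves = {"aws_sqs_queue.orders": "module.orders.aws_sqs_queue.this"}
    print(render(state_mv_commands(moves)))
    # To execute for real, each argument list could be passed to
    # subprocess.run(cmd, check=True) with dry_run=False.
```

Defaulting to a dry run keeps the generated plan reviewable, which matters because a mistaken state move can cause Terraform to plan the destruction and recreation of live resources.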
Platform Operations
- Performed routine business-as-usual (BAU) work on EKS, including Helm chart upgrades, Kubernetes version upgrades, and vulnerability patching
- Maintained and improved internal tooling in Ruby, Python, and Bash
Results
The platform modernisation delivered measurable business impact:
- £450,000/year savings from retiring the legacy ELK stack
- £35,000/year savings from consolidating CLBs to shared ALBs (completed in 3 weeks)
- Improved reliability through standardised monitoring and alerting
- Faster delivery with self-service infrastructure provisioning
- Reduced toil for the platform team through automation