About
A well-funded frontier AI startup building state-of-the-art agentic systems. The company operates large-scale GPU infrastructure (approximately 2,000 H100 GPUs across cloud providers and partners). As cluster scale grows, infrastructure reliability and observability become critical to enabling both research and production deployment.
This role focuses on ensuring large-scale training systems are reliable, observable, and performant.
What you'll do
- Own infrastructure and observability across large-scale GPU clusters
- Improve reliability, fault tolerance, and debugging workflows as the clusters grow
- Optimise distributed training performance, including networking and scheduling
- Diagnose and resolve failures in large-scale production training jobs
- Partner with research teams to remove infrastructure bottlenecks
- Strengthen monitoring, logging, and performance analysis tooling
What you'll need
- Experience operating and scaling large GPU clusters (ideally 500+ nodes)
- Background in ML infrastructure supporting foundation model training
- Deep understanding of distributed training systems and network optimisation
- Experience debugging complex failures in large-scale environments
- Systems-level thinking and comfort working close to hardware
Shortlisted candidates will be contacted within 48 hours.