ML Infrastructure Engineer – Large-Scale GPU Clusters | Axioma Search

Senior MLOps Engineer (Model Training)

About

Well-funded frontier AI startup building state-of-the-art agentic systems. The company operates large-scale GPU infrastructure (approximately 2,000 H100 GPUs across cloud providers and partners). As cluster scale increases, infrastructure reliability and observability become critical to enabling research and production deployment.

This role focuses on ensuring large-scale training systems are reliable, observable, and performant.

What you'll do

  • Own infrastructure and observability across large-scale GPU clusters
  • Improve reliability, fault tolerance, and debugging workflows as cluster scale increases
  • Optimise distributed training performance, including network performance and scheduling
  • Diagnose and resolve failures in large-scale production training jobs
  • Partner with research teams to remove infrastructure bottlenecks
  • Strengthen monitoring, logging, and performance analysis tooling

What you'll need

  • Experience operating and scaling large GPU clusters (ideally 500+ nodes)
  • Background in ML infrastructure supporting foundation model training
  • Deep understanding of distributed training systems and network optimisation
  • Experience debugging complex failures in large-scale environments
  • Systems-level thinking and comfort working close to hardware

Shortlisted candidates will be contacted within 48 hours.

  • Location: London, Paris
  • Salary / Compensation: Up to £180k + equity
  • Work Setup: Permanent, hybrid
  • Sectors: Cloud Infrastructure, Agentic, Frontier AI / Foundation Models
  • Skills: MLOps, GPU clusters, Distributed training, ML infrastructure, Observability, AWS

Role Contact

Calvin Duffy

Didn't find the right role?

Send us your CV.