About
A well-funded frontier AI startup building state-of-the-art agentic systems. The company operates large-scale GPU infrastructure (approximately 2,000 H100 GPUs across cloud providers and partners). As cluster scale grows, infrastructure reliability and observability become critical to enabling both research and production deployment.
This role focuses on ensuring large-scale training systems are reliable, observable, and performant.
What you'll do
- Own infrastructure and observability across large-scale GPU clusters
- Improve reliability, fault tolerance, and debugging workflows as the clusters grow
- Optimise distributed training performance, including networking and scheduling
- Diagnose and resolve failures in large-scale production training jobs
- Partner with research teams to remove infrastructure bottlenecks
- Strengthen monitoring, logging, and performance analysis tooling
What you'll need
- Experience operating and scaling large GPU clusters (ideally 500+ nodes)
- Background in ML infrastructure supporting foundation model training
- Deep understanding of distributed training systems and network optimisation
- Experience debugging complex failures in large-scale environments
- Systems-level thinking and comfort working close to hardware
Shortlisted candidates will be contacted within 48 hours.