About
This is a well-funded frontier AI company building agentic systems that automate complex, multi-step work. The models team sits close to product, training the foundation models behind those systems – with a focus on instruction following, tool use, multimodal understanding, and reinforcement learning.
The research is ambitious. The bottleneck is training infrastructure. This role exists to make large-scale LLM and VLM training faster, more reliable, and less painful as model size and system complexity increase.
What you’ll do
- Build and improve training infrastructure for large-scale LLMs and VLMs
- Work on distributed training systems using FSDP, DeepSpeed, Megatron, TorchTitan, or similar tools
- Optimise training performance, reliability, and GPU utilisation across large clusters
- Support multimodal model training, including data flow, checkpointing, and experiment workflows
- Partner closely with researchers to turn new training ideas into production-grade systems
- Debug hard failures in long-running jobs and improve observability across the stack
- Contribute to evaluation workflows and help teams ship model improvements faster
What you’ll need
- Strong Python engineering skills
- Hands-on experience with training infrastructure for large models (100B+ parameters)
- Good understanding of distributed training libraries such as FSDP, Megatron, TorchTitan, or DeepSpeed
- Experience supporting or training LLMs (or VLMs) at scale
- Familiarity with at least one major deep learning framework such as PyTorch, JAX, or TensorFlow
- Comfort working closely with researchers in ambiguous, fast-moving environments
- Strong communication skills and a low-ego, collaborative way of working
Shortlisted candidates will be contacted within 48 hours of applying.