Research Engineer, Training Infra | Frontier AI | Axioma Search

Research Engineer (Training Infrastructure)

About

This is a well-funded frontier AI company building agentic systems that automate complex, multi-step work. The models team sits close to product, training the foundation models behind those systems – with a focus on instruction following, tool use, multimodal understanding, and reinforcement learning.

The research is ambitious. The bottleneck is training infrastructure. This role exists to make large-scale LLM and VLM training faster, more reliable, and less painful as model size and system complexity increase.

What you’ll do

  • Build and improve training infrastructure for large-scale LLMs and VLMs
  • Work on distributed training systems using tools such as FSDP, DeepSpeed, Megatron, TorchTitan, or similar
  • Optimise training performance, reliability, and GPU utilisation across large clusters
  • Support multimodal model training, including data flow, checkpointing, and experiment workflows
  • Partner closely with researchers to turn new training ideas into production-grade systems
  • Debug hard failures in long-running jobs and improve observability across the stack
  • Contribute to evaluation workflows and help teams ship model improvements faster

What you’ll need

  • Strong Python engineering skills 
  • Hands-on experience with training infrastructure for large models (100B+ parameters)
  • Good understanding of distributed training libraries such as FSDP, Megatron, TorchTitan, or DeepSpeed
  • Experience supporting or training LLMs (or VLMs) at scale
  • Familiarity with at least one major deep learning framework such as PyTorch, JAX, or TensorFlow
  • Comfort working closely with researchers in ambiguous, fast-moving environments
  • Strong communication skills and a low-ego, collaborative way of working

Shortlisted candidates will be contacted within 48 hours.

  • Location: London, Paris
  • Salary / Compensation: Up to £180k + equity package
  • Sectors: Agentic, Frontier AI / Foundation Models, GenAI
  • Skills: Python, PyTorch, Distributed Training, FSDP/DeepSpeed/Megatron, LLM/VLM Training, Training Infrastructure

Role Contact

Matthieu Derycke
