About
This is a well-funded frontier AI company building agentic systems that automate complex, multi-step work.
This role is about making LLMs respond quickly, cheaply, and reliably for real users – under real traffic, with real latency constraints, and without wasting GPU capacity.
You’d join the inference team focused on serving logic and systems optimisation. The question is not how to train the model. It is how to run it well in production.
What you’ll do
- Build and improve model serving systems for production LLM workloads
- Optimise latency, throughput, batching, and memory use across inference pipelines
- Work on serving architecture, scheduling, caching, and request handling under load
- Improve GPU efficiency through systems-level profiling and performance tuning
- Partner closely with model and product teams to ship fast, reliable inference
- Diagnose bottlenecks in live serving paths and turn them into concrete improvements
- Strengthen observability and debugging around production inference performance
What you’ll need
- Strong experience with LLM serving, ML systems, or high-performance inference
- Solid understanding of low-level GPU performance, memory behaviour, and serving trade-offs
- Strong Python skills; C++, Rust, CUDA, or Triton would be useful
- Experience working on latency-sensitive distributed systems in production
- Ability to profile systems, isolate bottlenecks, and improve performance end-to-end
- Comfort working closely with research teams in a fast-moving, loosely specified environment
Nice to have
- Experience with custom kernels, quantisation, or compiler-level optimisation
- Background in model serving frameworks or large-scale inference platforms
Shortlisted candidates will be contacted within 48 hours.