About
Enterprise operations are still held together by emails, spreadsheets, and systems that don’t talk to each other. The gap isn’t model capability — it’s turning messy, exception-heavy workflows into systems that actually run end-to-end.
You’d join a seed-stage team (~35 people, ~50% engineering) building an agentic platform already used by enterprise retail teams. The product runs live workflows across supplier portals, ERPs, inboxes, and internal tools — often without clean APIs or well-defined processes.
You’ll be the first dedicated SRE. Infrastructure exists, but ownership is fragmented. You’ll take over a live system (Terraform modules, an OpenTelemetry pipeline, Prometheus/Grafana) and build the reliability function from first principles.
What you’ll do
- Own platform reliability across infrastructure, observability, and incident response
- Design and improve systems for long-running, unpredictable workloads
- Build monitoring, alerting, and tracing that surface issues before customers notice them
- Lead incident response, root cause analysis, and long-term fixes
- Strengthen security across multi-tenant environments, including isolation and encryption
- Improve infrastructure automation using Terraform, CI/CD, and container orchestration
- Optimise system performance, capacity planning, and cost trade-offs
- Partner with engineers to improve reliability without slowing product velocity
What you’ll need
- Strong experience operating distributed systems in production environments
- Deep understanding of reliability trade-offs: latency, consistency, availability
- Hands-on experience with cloud infrastructure (GCP preferred) and Kubernetes
- Strong observability experience (Prometheus, Grafana, OpenTelemetry or similar)
- Experience with infrastructure as code and automation (Terraform, CI/CD)
- Security mindset — comfortable working with sensitive enterprise data and isolation models
- High ownership — you build systems, not just maintain them
Optional bonus
- Experience with AI/ML systems or LLM-based applications
- Exposure to sandboxed execution or multi-tenant platforms
- Background in product engineering or full-stack systems
Shortlisted candidates will be contacted within 48 hours.