About
The machines can only be as reliable as the systems around them. Video, control, updates, deployment, security, and incident response all run through the cloud, so once a fleet grows, cloud reliability becomes part of the product.
This company is building teleoperated and autonomous systems for heavy industrial machines. The teleoperation layer is live today and generating data; the autonomy stack is scaling behind it. The team is still small, which means infra decisions made now will shape how the platform behaves at 50 machines and again at 300.
This role owns the cloud side outright: not as support for engineering, but as core engineering.
What you'll do
- Own reliability across the cloud systems behind teleoperation and fleet operations
- Improve deployment, release, and rollback workflows across production environments
- Build observability, alerting, and incident response practices that work under real operational pressure
- Harden infrastructure security across access, networking, secrets, and updates
- Improve uptime and resilience for streaming, control, and operator-facing services
- Partner with platform and application teams to remove infrastructure bottlenecks
- Raise the standard for operational maturity as the fleet scales
What you'll need
- Strong SRE, platform, or infrastructure engineering experience in production systems
- Deep Linux and cloud fundamentals
- Experience with modern deployment tooling, observability, and incident management
- Solid security instincts across IAM, networking, patching, and production access
- Ability to design reliable systems from first principles, not just maintain existing ones
- Comfort operating in a small team with broad ownership
Bonus
- Experience with low-latency streaming or teleoperation systems
- Background in robotics, edge, or machine fleets
Shortlisted candidates will be contacted within 48 hours.