About the Client
Our client is a neocloud building purpose-built GPU infrastructure for AI workloads. We operate large-scale clusters powering training and inference for some of the most demanding AI customers in the market, and we're rapidly expanding. Our infrastructure runs on NVIDIA and AMD accelerators, InfiniBand and high-speed Ethernet fabrics, and a production stack spanning bare-metal provisioning, Kubernetes, and OpenStack.
We're small, senior, and moving fast. The people who do well here own problems end-to-end and make decisions with incomplete information.
The role
We're hiring a Senior Manager, Platform Engineering to lead the team that keeps our GPU clusters running and evolving. You'll own the operational heartbeat of our cloud platform: the people, the systems, and the practices that turn racks of GPUs into reliable customer-facing infrastructure.
This is a hands-on leadership role. You'll start with a roughly 50/50 split between technical work and management as you ramp up, gain context, and build trust with the team, then shift toward 80% management and 20% technical work over your first 12 months as the team grows and your direct reports take on more. We're not looking for a manager who has forgotten how to read a stack trace, and we're not looking for a senior IC with a team tacked on. We're looking for someone who leads by setting technical direction, raising the bar on operational rigor, and growing engineers, and who can still jump into an incident bridge and be useful.
What you'll own
A team of ~9 engineers spanning production engineering, SRE, security, and automation
The reliability, performance, and operability of our GPU cloud platform across multiple clusters and customers
Incident response and post-incident culture: you'll set the standard for how we investigate, communicate, and learn from outages
Operational readiness for new clusters and data center buildouts, in close partnership with our DC Build, Networking, and Program Management functions
Platform automation and infrastructure-as-code maturity: reducing toil, codifying tribal knowledge, and making our environment legible to both engineers and AI tooling
Hiring, coaching, and career development for the team, including an anticipated split of the team into specialized functions as we scale
What we're looking for
Required
8+ years of infrastructure, production engineering, or SRE experience, including 3+ years managing or tech-leading engineers
Deep hands-on experience with Linux production systems at scale: you've debugged kernel, networking, and storage issues in anger, not just read about them
Strong Kubernetes operational experience: you understand what breaks in Kubernetes at scale and why, not just how to write a deployment manifest
Experience running cloud or cloud-adjacent platforms in production (IaaS, bare-metal, or hybrid) with real customers depending on uptime
Fluency with modern automation and IaC tooling (Ansible, Terraform, or equivalent) and a bias toward codifying operational knowledge rather than keeping it in people's heads
Track record of building and running on-call, incident response, and post-mortem practices that engineers actually trust
Clear, direct written communication; much of our team and work is async
Strongly preferred
Experience operating GPU or HPC infrastructure, or a strong appetite to go deep on it quickly
OpenStack or bare-metal provisioning experience
Experience working alongside networking and security engineers as peers, not as tickets to file
Familiarity with observability stacks (Prometheus, Grafana, Checkmk, or similar) and a point of view on what good monitoring looks like
Experience scaling a team through rapid growth: splitting functions, hiring against a plan, and evolving reporting structures without breaking trust
How we work
We're remote-first across US, LATAM, and EU time zones
We write things down: decisions, architecture, runbooks, post-mortems
We use AI t

