About the Client

Our client is a neocloud building purpose-built GPU infrastructure for AI workloads. We operate large-scale clusters powering training and inference for some of the most demanding AI customers in the market, and we're rapidly expanding. Our infrastructure runs on NVIDIA and AMD accelerators, InfiniBand and high-speed Ethernet fabrics, and a production stack spanning bare-metal provisioning, Kubernetes, and OpenStack.

We're small, senior, and moving fast. The people who do well here own problems end-to-end and make decisions with incomplete information.

The role

We're hiring a Senior Manager, Platform Engineering to lead the team that keeps our GPU clusters running and evolving. You'll own the operational heartbeat of the client's cloud platform: the people, the systems, and the practices that turn racks of GPUs into reliable customer-facing infrastructure.

This is a hands-on leadership role. You'll start roughly 50/50 between technical work and management as you ramp up, earn context, and build trust with the team, and scale toward 80/20 management over your first 12 months as the team grows and your direct reports take on more. We're not looking for a manager who has forgotten how to read a stack trace, and we're not looking for a senior IC with a team tacked on. We're looking for someone who leads by setting technical direction, raising the bar on operational rigor, and growing engineers, and who can still jump into an incident bridge and be useful.

What you'll own

A team of ~9 engineers spanning production engineering, SRE, security, and automation

The reliability, performance, and operability of the client's GPU cloud platform across multiple clusters and customers

Incident response and post-incident culture: you'll set the standard for how we investigate, communicate, and learn from outages

Operational readiness for new clusters and data center buildouts, in close partnership with our DC Build, Networking, and Program Management functions

Platform automation and infrastructure-as-code maturity: reducing toil, codifying tribal knowledge, and making our environment legible to both engineers and AI tooling

Hiring, coaching, and career development for the team, including an anticipated split of the team into specialized functions as we scale

What we're looking for

Required

8+ years of infrastructure, production engineering, or SRE experience, including 3+ years managing or tech-leading engineers

Deep hands-on experience with Linux production systems at scale: you've debugged kernel, networking, and storage issues in anger, not just read about them

Strong Kubernetes operational experience: you understand what breaks in Kubernetes at scale and why, not just how to write a deployment manifest

Experience running cloud or cloud-adjacent platforms in production (IaaS, bare-metal, or hybrid), with real customers depending on uptime

Fluency with modern automation and IaC tooling (Ansible, Terraform, or equivalent) and a bias toward codifying operational knowledge rather than keeping it in people's heads

Track record of building and running on-call, incident response, and post-mortem practices that engineers actually trust

Clear, direct written communication; much of our team and work is async

Strongly preferred

Experience operating GPU or HPC infrastructure, or a strong appetite to go deep on it quickly

OpenStack or bare-metal provisioning experience

Experience working alongside networking and security engineers as peers, not as tickets to file

Familiarity with observability stacks (Prometheus, Grafana, Checkmk, or similar) and a point of view on what good monitoring looks like

Experience scaling a team through rapid growth: splitting functions, hiring against a plan, and evolving reporting structures without breaking trust

How we work

We're remote-first across US, LATAM, and EU time zones

We write things down: decisions, architecture, runbooks, post-mortems

We use AI t

