About the client
The client is a neocloud building purpose-built GPU infrastructure for AI workloads. We operate large-scale clusters powering training and inference for some of the most demanding AI customers in the market, and we’re rapidly expanding. Our infrastructure runs on NVIDIA and AMD accelerators, InfiniBand and high-speed Ethernet fabrics, and a production stack spanning bare-metal provisioning, Kubernetes, and OpenStack.
We’re small, senior, and moving fast. The people who do well at the client own problems end-to-end and make decisions with incomplete information.
The role
We are hiring an Incident & Reliability Lead to own the incident and problem management programs at the client. This is the person who runs the call when a customer cluster goes down, drives the post-mortem afterward, and hardens the runbooks so the same failure does not surface twice.
This is not an ITIL paperwork role. You will be technically credible across Kubernetes, Linux, networking, and GPU infrastructure, enough to lead a live incident with platform engineers and a customer on the bridge. But your center of gravity is the process: fast, clean incident response, honest communication, and a relentless feedback loop between customers and engineering.
You will set the standard for how the client behaves during our most challenging hours. Done well, this is one of the most visible and highest-leverage roles on the team.
You will work closely with cloud operations, network teams, and customer-facing counterparts. Expect to be on-call in a shared rotation.
What you’ll own
• Incident command. Run major incidents end-to-end: triage, escalation, comms cadence, timeline, decision log, and drive-to-resolution. You are the conductor, not the fixer.
• Customer communications. Own customer-facing incident messaging: initial acknowledgement, status updates at defined intervals, resolution notice, and written RCA. Clear, honest, no hedging.
• Post-mortem and RCA program. Run blameless post-mortems, write the RCA, track action items to completion, and publish learnings across the org. Own the quality bar for RCA writing at the client.
• Escalation framework. Define and maintain the severity matrix, escalation trees, on-call rotations, and paging policies across platform, network, DC ops, and leadership.
• Runbooks and playbooks. Turn tribal knowledge into durable, testable runbooks. Drive the engineering teams to document before the next 2am page, not after.
• On-call health. Own the on-call experience: page volume, alert quality, rotation fairness, and handoff hygiene. Kill noisy alerts. Fix the ones that matter.
• Reliability metrics. Define and report the metrics that matter: MTTA, MTTR, incident count by class, SLO attainment, repeat-offender systems. Make reliability a number the business trusts.
• Game days and drills. Plan and run incident drills across clusters, regions, and customer scenarios. Surface gaps before customers do.
• Incident tooling. Own the incident tooling stack: paging, status page, incident channel automation, timeline capture, and RCA templates. Evaluate and land the right tools.
What we’re looking for
Required
• 7+ years in SRE, production engineering, or infrastructure operations, with clear ownership of incident management in at least one prior role.
• Proven ability to run major incidents as the commander: keeping a room calm, driving decisions, and managing comms to customers and executives in parallel.
• Strong technical fluency across Linux, Kubernetes, networking (L2/L3, BGP basics), and cloud or bare-metal infrastructure. Deep enough in the stack to challenge assumptions and debug alongside engineers on a bridge call, and senior enough to turn each incident into durable improvements (runbooks, automation, guardrails) so the next incident does not need you.
• Excellent written communication. You can draft a customer-facing RCA that is accurate, c

