Our client is a neocloud building purpose-built GPU infrastructure for AI workloads. They operate large-scale clusters powering training and inference for some of the most demanding AI customers in the market, and they're rapidly expanding. Their infrastructure runs on NVIDIA and AMD accelerators, InfiniBand and high-speed Ethernet fabrics, and a production stack spanning bare-metal provisioning, Kubernetes, and OpenStack.

They're small, senior, and moving fast. The people who do well with our client own problems end-to-end and make decisions with incomplete information.

The role

We are hiring a Senior SRE to own the incident and problem management programs for our client. This is the person who runs the call when a customer cluster goes down, drives the post-mortem afterward, and hardens the runbooks so the same failure does not surface twice.

This is not an ITIL paperwork role. You will be technically credible enough across Kubernetes, Linux, networking, and GPU infrastructure to lead a live incident with platform engineers and a customer on the bridge. But your center of gravity is the process: fast, clean incident response, honest communication, and a relentless feedback loop between customers and engineering.

You will set the standard for how our client behaves during their most challenging hours. Done well, this is one of the most visible and highest-leverage roles on the team. You will work closely with cloud operations, network teams, and customer-facing counterparts. Expect to be on-call in a shared rotation.

What you'll own

Incident command. Run major incidents end-to-end: triage, escalation, comms cadence, timeline, decision log, and drive-to-resolution. You are the conductor, not the fixer.

Customer communications. Own customer-facing incident messaging: initial acknowledgement, status updates at defined intervals, resolution notice, and written RCA. Clear, honest, no hedging.

Post-mortem and RCA program. Run blameless post-mortems, write the RCA, track action items to completion, and publish learnings across the org. Own the quality bar for RCA writing for our client.

Escalation framework. Define and maintain the severity matrix, escalation trees, on-call rotations, and paging policies across platform, network, DC ops, and leadership.

Runbooks and playbooks. Turn tribal knowledge into durable, testable runbooks. Drive the engineering teams to document before the next 2am page, not after.

On-call health. Own the on-call experience: page volume, alert quality, rotation fairness, and handoff hygiene. Kill noisy alerts. Fix the ones that matter.

Reliability metrics. Define and report the metrics that matter: MTTA, MTTR, incident count by class, SLO attainment, repeat-offender systems. Make reliability a number the business trusts.

Game days and drills. Plan and run incident drills across clusters, regions, and customer scenarios. Surface gaps before customers do.

Incident tooling. Own the incident tooling stack: paging, status page, incident channel automation, timeline capture, and RCA templates. Evaluate and land the right tools.

What we're looking for

Required

7+ years in SRE, production engineering, or infrastructure operations, with clear ownership of incident management in at least one prior role.

Proven ability to run major incidents as the commander: keeping a room calm, driving decisions, and managing comms to customers and executives in parallel.

Strong technical fluency across Linux, Kubernetes, networking (L2/L3, BGP basics), and cloud or bare-metal infrastructure.

Deep enough in the stack to challenge assumptions and debug alongside engineers on a bridge call, and senior enough to turn each incident into durable improvements (runbooks, automation, guardrails) so the next incident does not need you.

Excellent written communication. You can draft a customer-facing RCA that is accurate, clear, and does not over- or under-promise, and do it within the SLA window.

Experien

Senior Site Reliability Engineer (SRE)
