Description

About the Client

Our client is a neocloud purpose-built for AI workloads. We design, deploy, and operate GPU clusters at scale: bare-metal compute, high-performance networking (NVLink, InfiniBand, RoCE), and the software platforms that make them usable for training and inference customers. We move fast, we own our infrastructure end-to-end, and we are building the operational foundation for the next generation of AI compute.

The Role

We are hiring a Principal Production Operations Engineer to serve as part of the senior technical backbone of how we run production. This is a generalist role. You will operate horizontally across systems administration, DevOps, SRE, automation, and platform engineering, and vertically from hands-on incident command to multi-quarter architectural strategy and execution.

This is not a people-manager role. It is a senior individual contributor role with org-wide technical influence. You will help set the standards that the rest of the operations org follows, lead the hardest incidents, design the automation and platform investments that determine how we scale, and raise the bar for every engineer around you. We expect you to have already done this work somewhere else: owned production at scale, survived the outages, built the systems that prevented the next ones, and mentored the engineers who now run them.

You will work alongside our network architects, platform engineers, and datacenter delivery team. You will be expected to have strong opinions, defend them with data, and change your mind when the evidence says you should.

What you'll own

Production infrastructure and reliability

End-to-end ownership of reliability, performance, and operational health across the company's production infrastructure: bare-metal GPU clusters, OpenStack, Kubernetes, storage, and the networking and security layers underneath.

Technical leadership during major incidents: incident command, root cause analysis, and the durable fixes that ensure the same class of failure does not recur.

Architectural decisions for how we scale compute, storage, and networking across multiple datacenters and cluster generations.

Lifecycle workflows for bare-metal provisioning, node onboarding, decommissioning, firmware management, and fleet-wide remediation.

Capacity planning, performance engineering, and the operational readiness reviews that gate new clusters into production.

Automation and platform engineering

Set the bar for infrastructure-as-code, configuration management, and operational tooling across the org. Define the patterns; do not just follow them.

Design and build the automation that eliminates classes of toil: deployment, scaling, failover, provisioning, firmware, observability, and security posture.

Own the CI/CD, GitOps, and release engineering primitives that production runs on. Make them boring, reliable, and self-service.

Choose the right tools and frameworks (Terraform, Ansible, Helm, Python, Go, Bash) for the problem in front of us, and know when to build versus buy versus avoid.

Drive the AI-assisted operations strategy: agentic runbooks, MCP-integrated tooling, and the workflows that let a small team operate a very large fleet.

Observability and operational discipline

Evolve the observability stack (OpenTelemetry, Prometheus, Grafana, Alertmanager, Loki, Checkmk) into a platform engineers trust.

Define what good looks like for instrumentation, SLOs, alerting, and on-call hygiene. Drive the org toward signal over noise.

Lead post-incident reviews. Translate findings into concrete engineering work, not action items that die in a doc.

Participate in the on-call rotation.

Technical leadership and force multiplication

Mentor senior and mid-level engineers. Raise the technical ceiling of the team without becoming the bottleneck.

Write the design docs, the runbooks, and the standards that the rest of the org builds on.

Prioritize engineering effort against real o
