
Infrastructure SRE - HPC

Sarvam AI

Other Engineering

Bengaluru, Karnataka, India · Chennai, Tamil Nadu, India

Posted on Jun 26, 2026

About Sarvam

Sarvam is building the bedrock of Sovereign AI for India. The company is developing India’s full-stack sovereign AI platform across research, models, infrastructure, and applications, with a singular focus on making AI genuinely work for India. Sarvam is backed by Lightspeed, Peak XV, and Khosla Ventures. It works with leading enterprises and public institutions and partners with India’s leading brands, including Tata Capital, SBI Life, CRED, IDFC, and LIC.

About the Role

Sarvam runs a large, multi-vendor GPU fleet that serves two demanding workloads on the same physical infrastructure: training jobs that span hundreds of GPUs and must run uninterrupted for weeks, and inference services that must hold a flat p99 under production load. Keeping both healthy at once is a hard, specialized reliability problem, and it is the problem this team exists to solve.

This is not a Kubernetes administration role. We assume Kubernetes fluency as a baseline. The difficulty lies above and below it: in parallel filesystems under heavy checkpoint load, in RDMA fabrics that degrade quietly, in NCCL hangs whose root cause may be the network or the kernel, in driver and firmware drift across heterogeneous hardware, and in distributed training failures that masquerade as infrastructure faults.

We are hiring a team of specialists rather than a set of identical generalists. This posting covers five areas of focus. We expect candidates to bring genuine depth in one and working fluency across the others, because on a shared fleet a storage problem often first appears as a training hang, and the engineer on call must route an incident correctly before anyone can resolve it.

When you apply, please indicate the area of focus that best matches your experience. Strong generalists are welcome; we will place you where your depth is most useful.

What You’ll Do

Operate the GPU fleet end to end across training and serving: provisioning, observability, capacity, and fleet health.
Hold a meaningful on-call rotation, write runbooks that hold up under pressure, and drive postmortems that produce durable fixes.
Build the internal tooling the team relies on, rather than operating off-the-shelf systems alone.
Partner with ML and platform teams to keep large runs alive and serving latency predictable.

What We're Looking For

5+ years in infrastructure or site reliability engineering, including 2+ years operating GPU clusters at scale.*
Demonstrated on-call ownership of infrastructure that mattered, with a track record of postmortems that led to real change.
Proficiency in Python or Go, used to build and maintain internal tooling.
Working fluency across all five areas of focus below - enough to recognize, triage, and route a problem outside your specialty, even if the fix belongs to a teammate.

* For the Storage and Fabric areas of focus, we will weigh deep domain expertise against the GPU-cluster requirement; exceptional specialists with less direct GPU-fleet time are encouraged to apply.

Bring depth in one of the five areas below; expect to be conversational across the rest.

Distributed high-performance storage — operate a parallel filesystem (Lustre, GPFS, WEKA, or BeeGFS) at scale and keep it from stalling under checkpoint-write storms.
Fabric & RDMA networking — InfiniBand or RoCE health, NVLink/NVSwitch topology, and RDMA debugging that catches degradation before the workload feels it.
GPU systems reliability — NCCL debugging, driver and firmware lifecycle across a mixed fleet, and DCGM-based node health at scale.
Kubernetes platform reliability — the GPU operator stack, scheduling (Slurm-on-k8s or pure k8s), multi-tenant isolation, and cost/SLO primitives.
Training & inference workload reliability — hang and straggler detection, checkpoint/restart, and protecting serving p99 with HA and DR.

Bonus Points

Slurm and Kubernetes hybrid environments.
On-premise GPU deployment, including coordination with datacenter operations on power, cooling, and InfiniBand cabling.
Experience with Indian NCPs, DGX SuperPOD, Lambda, CoreWeave, NeevCloud, or similar providers.
Multi-tenant GPU isolation (MIG, MPS, time-slicing) in production.

Why Sarvam?

Sarvam is a fast-moving, high talent-density team building full-stack AI for India, working on problems that push the frontiers of AI with real population-scale impact.

Work alongside researchers, engineers, builders, and business leaders who move fast and hold each other to a very high bar.

High ownership and high impact, from day one.

Everything we do is AI-first, from the way we build and ship to the way we think about problems.

You can work on problems that could change how an entire country learns, works, and communicates.

If you want to work on problems at the frontier of AI in India, Sarvam is the place to be.
