DevOps Engineer - Studio Platform
Sarvam AI
Software Engineering
Bengaluru, Karnataka, India
Location: Bengaluru
Employment Type: Full time
Location Type: On-site
Department: Engineering
About Sarvam
Sarvam is building the bedrock of Sovereign AI for India. The company is developing India’s full-stack sovereign AI platform, building across research, models, infrastructure and applications with a singular focus on making AI genuinely work for India. Sarvam works with leading enterprises and public institutions and is backed by Lightspeed, Peak XV, and Khosla Ventures. Sarvam partners with India’s leading brands, including Tata Capital, SBI Life, CRED, IDFC, and LIC.
About the Role
We are seeking a driven and highly skilled DevOps Engineer to join the Studio engineering team at Sarvam. Studio is our creative media platform — powering AI dubbing, live translation, voice cloning, and more across 12+ Indian languages for enterprise media companies and millions of creators.
You will own the operational reliability, deployment velocity, and infrastructure automation for a production system that orchestrates ML-heavy workloads at scale. The platform runs a multi-service architecture on Kubernetes with GPU inference servers, async task queues, multi-stage ML pipelines, and real-time delivery — across a multi-cloud setup.
You will be the person who keeps this running, makes it faster to ship, and ensures it scales with growing demand.
What You’ll Do
Own and operate production Kubernetes clusters — managing multi-role deployments (API servers, task schedulers, per-stage workers, WebSocket servers), horizontal pod autoscalers, CronJobs, and node pool scheduling
Build and optimize CI/CD pipelines — automated testing, container image builds, registry management, staged rollouts across QA/staging/production, and deployment notifications
Manage Helm-based deployments — maintaining application charts, shared module dependencies, environment-specific value overlays, and promotion workflows
Implement and maintain observability across services — metrics collection, dashboards, error tracking, and distributed tracing to catch issues before customers do
Operate and optimize cloud infrastructure — blob storage with CDN for media delivery, ingress configuration, secrets management, and IAM policies
Manage multi-cloud coordination — ensuring consistent deployments, artifact management (Docker images + internal Python packages), and credential management across cloud providers
Ensure robust secrets management — vault-to-cluster sync pipelines, service account credentials, and CI/CD secret hygiene
Optimize async task queue infrastructure — health monitoring, dead-letter handling, stuck message recovery, and real-time notification delivery
Monitor and optimize database performance — query health, connection pooling, backup/recovery, and coordinating with the platform team on shared infrastructure
Build developer productivity tooling — local dev environments, pre-commit hooks, self-service infrastructure, and documentation
Own incident response — runbooks, alerting rules, post-mortem processes, and reliability improvements for a system where jobs span minutes and failures can cascade across pipeline stages
Drive cost optimization — right-sizing compute, autoscaling policies, storage lifecycle management, and resource utilization across the cluster
What We're Looking For
3+ years of experience in a DevOps, SRE, or Infrastructure Engineering role
Strong hands-on experience with Kubernetes in production — deployments, Helm charts, HPAs, CronJobs, node affinity, resource management, and debugging pod failures (non-negotiable)
Proficiency with at least one major cloud platform (Azure, GCP, or AWS) — managed Kubernetes, container registries, blob storage, secrets vaults, and IAM (non-negotiable)
Experience building and maintaining CI/CD pipelines — container builds, automated testing gates, multi-environment promotion, and deployment automation
Solid understanding of containerization — writing production Dockerfiles, multi-stage builds, image optimization, and managing container registries
Experience with monitoring and observability stacks — Prometheus, Grafana, OpenTelemetry, Sentry, or similar
Strong Linux systems knowledge — networking, process management, storage, and debugging system-level issues
Scripting proficiency in Python or Bash for automation and tooling
Good understanding of networking — DNS, load balancing, ingress controllers, TLS termination, and CDN configuration
Experience with secrets management patterns — vault-to-cluster sync workflows (External Secrets Operator, Sealed Secrets, or similar)
Familiarity with Redis or similar in-memory data stores — queues, pub/sub, monitoring, and troubleshooting
Strong analytical and problem-solving skills; ability to work collaboratively in a fast-paced, small team environment
Bonus Points
Experience with multi-cloud environments — operating across more than one cloud provider for runtime, build, and ML workloads
Experience managing GPU workloads on Kubernetes — NVIDIA device plugins, GPU node scheduling, and resource limits for inference servers
Familiarity with ML model serving infrastructure — Triton Inference Server, TorchServe, vLLM, or similar
Experience with service mesh technologies — Linkerd, Istio, or Consul
Familiarity with GitOps workflows — ArgoCD, Flux, or similar declarative deployment patterns
Experience with event-driven autoscaling — KEDA or similar scale-to-zero/scale-on-event patterns
Background in supporting media/audio/video processing workloads — high I/O, large file handling, and media processing pipelines
Experience with infrastructure-as-code tools like Terraform or Pulumi
Experience with multi-tenant Kubernetes clusters — namespace isolation, resource quotas, and network policies across products
Cloud certifications (Azure, GCP, or AWS)
Why Sarvam?
Sarvam is a fast-moving, high talent-density team building full-stack AI for India, working on problems that push the frontiers of AI with real population-scale impact.
Work alongside researchers, engineers, builders, and business leaders who move fast and hold each other to a very high bar
High ownership and high impact, from day one
Everything we do is AI-first, from the way we build and ship to the way we think about problems
Work on problems that could change how an entire country learns, works, and communicates
If you want to work on problems at the frontier of AI in India, Sarvam is the place to be.