Machine Learning Engineer, Dubbing
Sarvam AI
Software Engineering, Data Science
Bengaluru, Karnataka, India
Location: Bengaluru
Employment Type: Full time
Location Type: On-site
Department: Engineering
About Sarvam
Sarvam is building India’s full-stack sovereign AI platform, spanning research, models, infrastructure, and applications, with a singular focus on making AI genuinely work for India. Sarvam works with leading enterprises and public institutions, including Tata Capital, SBI Life, CRED, IDFC, and LIC, and is backed by Lightspeed, Peak XV, and Khosla Ventures.
About the Role
You will own the ML integration layer for Sarvam’s dubbing and live translation products — building production pipelines that connect ASR, translation, TTS, and voice cloning into seamless end-to-end systems. The scope spans offline video dubbing (batch processing across 12+ Indian languages) and real-time speech-to-speech translation in multi-participant environments where latency budgets are measured in hundreds of milliseconds. The team’s roadmap evolves with the field; we want engineers who are comfortable with that.
What You’ll Do
Build and optimise the real-time speech-to-speech translation pipeline — streaming ASR with server-side VAD, low-latency translation, and TTS synthesis delivered as live audio streams
Design fan-out architectures where a single ASR stream serves multiple concurrent listeners, each receiving personalised translated audio
Implement voice cloning in streaming and batch contexts — reference audio selection heuristics, handling short vs. long utterances, and maintaining speaker identity across a session
Optimise end-to-end latency across the ASR → translation → TTS chain, including transcript buffering, segmentation strategies, and flush timing for continuous speech
Integrate ML pipelines with real-time media infrastructure (WebRTC, RTMP, SRT) for live broadcast and conferencing use cases
Own the automated QC loop — designing multi-stage verification pipelines that catch and correct quality issues before delivery
Build evaluation harnesses for speech quality — WER/CER tracking, tempo analysis, pronunciation verification, and automated QC scoring
Optimise inference pipelines — quantisation, batching strategies, model server configuration, and runtime acceleration for VAD and vocal separation
Design and maintain audio data pipelines — segment extraction, filtering, deduplication, and quality assurance
Build robust integrations across multiple ASR, TTS, and translation backends — managing fallbacks, retries, and quality routing
Debug and improve deployed speech systems — latency, audio artifacts, code-mixed content, regional dialect handling, and edge cases in production
Translate real-world dubbing problems (timing preservation, naturalness, register matching, multi-speaker scenarios) into well-scoped ML tasks with the right data and evaluation strategy
What We’re Looking For
Strong Python and PyTorch — comfortable reading model internals, profiling inference, and debugging production failures
Hands-on experience integrating and optimising speech models (ASR or TTS) in production environments
Experience with real-time/streaming systems — WebSocket pipelines, chunked audio processing, or latency-sensitive async architectures
Solid understanding of modern speech system architectures — sequence-to-sequence models, attention mechanisms, flow-matching or diffusion-based TTS, streaming ASR
Familiarity with model serving infrastructure — Triton, TorchServe, ONNX Runtime, or equivalent
Experience with audio signal processing fundamentals: sample rates, PCM formats, spectrograms, vocoding, time-stretching
Strong async Python skills — asyncio, concurrent pipelines, managing backpressure in streaming systems
Comfort with ambiguity — the roadmap is not fully pre-specified
Undergraduate degree in a technical discipline (CS, EE, statistics, physics, or equivalent)
Bonus Points
Experience with multilingual or Indic speech systems — handling code-mixing, transliteration, tonal variation across Indian languages
Voice cloning or speaker adaptation techniques (zero-shot or few-shot) in production
Experience with real-time media protocols — WebRTC, RTMP, SRT, HLS, or real-time audio agent frameworks
Multi-participant audio systems — speaker diarization, concurrent pipeline management, per-user audio routing
Vocal source separation or speech enhancement techniques
Familiarity with FFmpeg, GStreamer, or media muxing/demuxing
Contributions to open-source speech/audio projects or a solid GitHub portfolio
Why Sarvam?
Sarvam is a fast-moving, high talent-density team building full-stack AI for India, working on problems that push the frontiers of AI with real population-scale impact.
Work alongside researchers, engineers, builders, and business leaders who move fast and hold each other to a very high bar
High ownership and high impact, from day one
Everything we do is AI-first, from the way we build and ship to the way we think about problems
Work on problems that could change how an entire country learns, works, and communicates
If you want to work on problems at the frontier of AI in India, Sarvam is the place to be.