Machine Learning Engineer, Dubbing

Sarvam AI

Software Engineering, Data Science

Bengaluru, Karnataka, India

Posted on Jun 2, 2026

Location

Bengaluru

Employment Type

Full time

Location Type

On-site

Department

Engineering

About Sarvam

Sarvam is building the bedrock of Sovereign AI for India: a full-stack sovereign AI platform spanning research, models, infrastructure, and applications, with a singular focus on making AI genuinely work for India. Backed by Lightspeed, Peak XV, and Khosla Ventures, Sarvam works with leading enterprises and public institutions and partners with leading Indian brands, including Tata Capital, SBI Life, CRED, IDFC, and LIC.

About the Role

You will own the ML integration layer for Sarvam’s dubbing and live translation products — building production pipelines that connect ASR, translation, TTS, and voice cloning into seamless end-to-end systems. The scope spans offline video dubbing (batch processing across 12+ Indian languages) and real-time speech-to-speech translation in multi-participant environments where latency budgets are measured in hundreds of milliseconds. The team’s roadmap evolves with the field; we want engineers who are comfortable with that.

What You’ll Do

  • Build and optimise the real-time speech-to-speech translation pipeline — streaming ASR with server-side VAD, low-latency translation, and TTS synthesis delivered as live audio streams

  • Design fan-out architectures where a single ASR stream serves multiple concurrent listeners, each receiving personalised translated audio

  • Implement voice cloning in streaming and batch contexts — reference audio selection heuristics, handling short vs. long utterances, and maintaining speaker identity across a session

  • Optimise end-to-end latency across the ASR → translation → TTS chain, including transcript buffering, segmentation strategies, and flush timing for continuous speech

  • Integrate ML pipelines with real-time media infrastructure (WebRTC, RTMP, SRT) for live broadcast and conferencing use cases

  • Own the automated QC loop — designing multi-stage verification pipelines that catch and correct quality issues before delivery

  • Build evaluation harnesses for speech quality — WER/CER tracking, tempo analysis, pronunciation verification, and automated QC scoring

  • Optimise inference pipelines — quantisation, batching strategies, model server configuration, and runtime acceleration for VAD and vocal separation

  • Design and maintain audio data pipelines — segment extraction, filtering, deduplication, and quality assurance

  • Build robust integrations across multiple ASR, TTS, and translation backends — managing fallbacks, retries, and quality routing

  • Debug and improve deployed speech systems — latency, audio artifacts, code-mixed content, regional dialect handling, and edge cases in production

  • Translate real-world dubbing problems (timing preservation, naturalness, register matching, multi-speaker scenarios) into well-scoped ML tasks with the right data and evaluation strategy

What We’re Looking For

  • Strong Python and PyTorch — comfortable reading model internals, profiling inference, and debugging production failures

  • Hands-on experience integrating and optimising speech models (ASR or TTS) in production environments

  • Experience with real-time/streaming systems — WebSocket pipelines, chunked audio processing, or latency-sensitive async architectures

  • Solid understanding of modern speech system architectures — sequence-to-sequence models, attention mechanisms, flow-matching or diffusion-based TTS, streaming ASR

  • Familiarity with model serving infrastructure — Triton, TorchServe, ONNX Runtime, or equivalent

  • Experience with audio signal processing fundamentals: sample rates, PCM formats, spectrograms, vocoding, time-stretching

  • Strong async Python skills — asyncio, concurrent pipelines, managing backpressure in streaming systems

  • Comfort with ambiguity — the roadmap is not fully pre-specified

  • Undergraduate degree in a technical discipline (CS, EE, statistics, physics, or equivalent)

Bonus Points

  • Experience with multilingual or Indic speech systems — handling code-mixing, transliteration, tonal variation across Indian languages

  • Voice cloning or speaker adaptation techniques (zero-shot or few-shot) in production

  • Experience with real-time media protocols — WebRTC, RTMP, SRT, HLS, or real-time audio agent frameworks

  • Multi-participant audio systems — speaker diarization, concurrent pipeline management, per-user audio routing

  • Vocal source separation or speech enhancement techniques

  • Familiarity with FFmpeg, GStreamer, or media muxing/demuxing

  • Contributions to open-source speech/audio projects or a solid GitHub portfolio

Why Sarvam?

Sarvam is a fast-moving, high talent-density team building full-stack AI for India, working on problems that push the frontiers of AI with real population-scale impact.

  • Work alongside researchers, engineers, builders, and business leaders who move fast and hold each other to a very high bar

  • High ownership and high impact, from day one

  • Everything we do is AI-first, from the way we build and ship to the way we think about problems

  • Work on problems that could change how an entire country learns, works, and communicates

If you want to work on problems at the frontier of AI in India, Sarvam is the place to be.