ML Engineer (Data), Foundational Models

Sarvam AI

Software Engineering, Data Science

Bengaluru, Karnataka, India

Posted on May 21, 2026

Location

Bengaluru

Employment Type

Full time

Location Type

On-site

Department

Models

About Sarvam

Sarvam is building the bedrock of Sovereign AI for India. The company is developing India’s full-stack sovereign AI platform, building across research, models, infrastructure and applications with a singular focus on making AI genuinely work for India. Sarvam works with leading enterprises and public institutions and is backed by Lightspeed, Peak XV, and Khosla Ventures. Sarvam partners with India’s leading brands, including Tata Capital, SBI Life, CRED, IDFC, and LIC.

About the Role

You will own the data infrastructure that feeds our next family of foundational models. This means building petabyte-scale curation and filtering pipelines, designing the systems that decide what goes into a training run and in what proportion, and treating data quality with the same rigor a research team would apply to an architectural choice.

This is not a glue-code role. The data work at a serious pretraining lab is engineering- and research-heavy: deduplication at scale, quality models, contamination detection, mixture design, curriculum and annealing, attribution and debugging. You should care deeply about all of it.

What You’ll Do

  • Design and build large-scale data pipelines for pre-training and post-training — ingestion, parsing, normalization, filtering, deduplication, tokenization, and packing — at petabyte scale.

  • Develop and continually improve quality filtering systems, including model-based quality classifiers and contamination detection.

  • Own data mixture design, curriculum, and annealing strategies in partnership with the research team. The question "what data did this model see, in what proportion, in what order" should always have a precise answer because of work you did.

  • Build the tooling that lets researchers and engineers analyze, slice, attribute, and debug the data.

  • Scale the pipeline to handle multilingual corpora, code, math, multi-source web data, and licensed datasets, while keeping provenance and licensing tracked end-to-end.

  • Partner with the training infrastructure team so that data is never the bottleneck of a production training run.

What We’re Looking For

  • BS or MS in Computer Science or a closely related technical field (or equivalent demonstrated experience).

  • 3+ years of experience building large-scale data systems — petabyte-scale processing, distributed data pipelines, or comparable. Exceptional early-career candidates with a strong systems background will be considered.

  • Hands-on experience with data curation and filtering for LLM training. You should be able to walk through a pre-training corpus you helped build, end to end, and defend the choices that went into it.

  • Deep familiarity with distributed data processing frameworks — Spark, Ray, Beam, Dask, or equivalent — and the storage systems that sit underneath them.

  • Strong Python; comfort with the low-level pieces of the data path (tokenization, sharding, packing, IO patterns) and the performance tradeoffs they imply.

  • Meaningful open-source contributions in the data tooling ecosystem: datasets, dedup libraries, filtering frameworks, or substantive work on widely used open data releases.

Bonus Points

  • Direct experience building or working with large open pretraining corpora.

  • Work on multilingual data — collection, normalization, quality scoring, and mixing across many languages.

  • Hands-on experience with model-based data quality classifiers, contamination detection, or data attribution research.

  • Familiarity with tokenization research and the practical implications of tokenizer choices on training.

  • First-author papers or technical reports on data curation, quality, or pretraining mixtures.

Why this role?

The frontier is moving towards data being the dominant lever in model quality, and the labs that get this right will define the next generation of models. You will be the person at Sarvam most responsible for that lever.

Why Sarvam?

Sarvam is a fast-moving, high talent-density team building full-stack AI for India, working on problems that push the frontiers of AI with real population-scale impact.

  • Work alongside researchers, engineers, builders, and business leaders who move fast and hold each other to a very high bar

  • High ownership and high impact, from day one

  • Everything we do is AI-first, from the way we build and ship to the way we think about problems

  • You can work on problems that could change how an entire country learns, works, and communicates

If you want to work on problems at the frontier of AI in India, Sarvam is the place to be.