Sanchit Ahuja

Pre-Doctoral Researcher, Microsoft Research

sanchitahuja205[AT]gmail.com

Bio

Hi! I am a Pre-Doctoral Researcher at Microsoft Research in Bengaluru, working with Sunayana Sitaram and Kalika Bali on multilinguality, evaluation methodologies, and cultural dimensions. I also collaborate with the Microsoft Turing Team at Redmond, working alongside Vishrav Chaudhary on synthetic data, instruction tuning, and scaling laws in multilingual systems.
Previously, I worked as a Research Engineer at a voice-tech startup, Skit.ai, where I focused on developing speech solutions, language understanding, and text-to-speech models for Indic languages.
During my undergraduate studies, I was advised by Meriem Beloucif and worked on low-resource NLP and Neural Machine Translation using Reinforcement Learning.

I am looking for PhD opportunities starting Fall 2025. If you are working on multilinguality, evaluations, synthetic data, scaling laws or any other exciting research area, I would love to chat!

Publications

Most recent publications on Google Scholar.
indicates equal contribution.

Megaverse: Benchmarking large language models across languages, modalities, models and tasks

Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (NAACL-HLT 2024)

sPhinX: Sample Efficient Multilingual Instruction Fine-Tuning Through N-shot Guided Prompting

Sanchit Ahuja, Kumar Tanmay, Hardik Hansrajbhai Chauhan, Barun Patra, Kriti Aggarwal, Luciano Del Corro, Arindam Mitra, Tejas Indulal Dhamecha, Ahmed Awadallah, Monojit Choudhary, Vishrav Chaudhary, Sunayana Sitaram

Preprint (Submitted to NAACL-HLT 2025)

DOSA: A Dataset of Social Artifacts from Different Indian Geographical Subcultures

Agrima Seth, Sanchit Ahuja, Kalika Bali, Sunayana Sitaram

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Scaling Laws for Multilingual Language Models

Yifei He, Alon Benhaim, Barun Patra, Praneetha Vaddamanu, Sanchit Ahuja, Parul Chopra, Vishrav Chaudhary, Han Zhao, Xia Song

Preprint (Submitted to ICLR 2025)

Contamination Report for Multilingual Benchmarks

Sanchit Ahuja, Varun Gumma, Sunayana Sitaram

EvalEval Workshop at NeurIPS 2024

SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 14 Languages

Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Abinew Ali Ayele, Pavan Baswani, Meriem Beloucif, Chris Biemann, Sofia Bourhim, Christine De Kock, Genet Shanko Dekebo, Oumaima Hourrane, Gopichand Kanumolu, Lokesh Madasu, Samuel Rutunda, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Hailegnaw Getaneh Tilaye, Krishnapriya Vishnubhotla, Genta Winata, Seid Muhie Yimam, Saif M Mohammad

Findings of the Association for Computational Linguistics ACL 2024

Hyphen: Hyperbolic hawkes attention for text streams

Shivam Agarwal, Ramit Sawhney, Sanchit Ahuja, Ritesh Soun, Sudheer Chava

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

SemEval-2024 task 1: Semantic textual relatedness for african and asian languages

Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Meriem Beloucif, Christine De Kock, Oumaima Hourrane, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Krishnapriya Vishnubhotla, Seid Muhie Yimam, Saif M Mohammad

Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024). Association for Computational Linguistics

CV

Full Resume in PDF.