Hi! I am a Pre-Doctoral Researcher at Microsoft Research in Bengaluru, working with Sunayana Sitaram and Kalika Bali on multilinguality, evaluation methodologies, and cultural dimensions. I also collaborate with the Microsoft Turing Team in Redmond, working alongside Vishrav Chaudhary on synthetic data, instruction tuning, and scaling laws in multilingual systems.
Previously, I worked as a Research Engineer at a voice-tech startup, Skit.ai, where I focused on developing speech solutions, language understanding, and text-to-speech models for Indic languages.
During my undergraduate studies, I was advised by Meriem Beloucif and worked on low-resource NLP and Neural Machine Translation using Reinforcement Learning.
My most recent publications are listed on Google Scholar.
‡ indicates equal contribution.
Megaverse: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (NAACL-HLT 2024)
sPhinX: Sample Efficient Multilingual Instruction Fine-Tuning Through N-shot Guided Prompting
Sanchit Ahuja‡, Kumar Tanmay‡, Hardik Hansrajbhai Chauhan, Barun Patra, Kriti Aggarwal, Luciano Del Corro, Arindam Mitra, Tejas Indulal Dhamecha, Ahmed Awadallah, Monojit Choudhary, Vishrav Chaudhary, Sunayana Sitaram
Preprint (Submitted to NAACL-HLT 2025)
DOSA: A Dataset of Social Artifacts from Different Indian Geographical Subcultures
Agrima Seth, Sanchit Ahuja, Kalika Bali, Sunayana Sitaram
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Scaling Laws for Multilingual Language Models
Yifei He, Alon Benhaim, Barun Patra, Praneetha Vaddamanu, Sanchit Ahuja, Parul Chopra, Vishrav Chaudhary, Han Zhao, Xia Song
Preprint (Submitted to ICLR 2025)
Contamination Report for Multilingual Benchmarks
Sanchit Ahuja‡, Varun Gumma‡, Sunayana Sitaram
EvalEval Workshop at NeurIPS 2024
SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 14 Languages
Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Abinew Ali Ayele, Pavan Baswani, Meriem Beloucif, Chris Biemann, Sofia Bourhim, Christine De Kock, Genet Shanko Dekebo, Oumaima Hourrane, Gopichand Kanumolu, Lokesh Madasu, Samuel Rutunda, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Hailegnaw Getaneh Tilaye, Krishnapriya Vishnubhotla, Genta Winata, Seid Muhie Yimam, Saif M Mohammad
Findings of the Association for Computational Linguistics: ACL 2024
Hyphen: Hyperbolic Hawkes Attention for Text Streams
Shivam Agarwal, Ramit Sawhney, Sanchit Ahuja, Ritesh Soun, Sudheer Chava
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages
Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Meriem Beloucif, Christine De Kock, Oumaima Hourrane, Manish Shrivastava, Thamar Solorio, Nirmal Surange, Krishnapriya Vishnubhotla, Seid Muhie Yimam, Saif M Mohammad
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024). Association for Computational Linguistics
Full Resume in PDF.