Abstract:
This study addresses the challenge of large-scale, multi-label recipe classification using a real-world dataset of over 600,000 recipes collected from heterogeneous sources. The raw data exhibited significant noise, duplication, and label imbalance, motivating a comprehensive, multi-stage cleaning and preprocessing framework. Key steps included ingredient normalization, instruction standardization, multi-label parsing, deduplication, and semantic mapping of categories into hierarchical supercategories. For modeling, we implemented a modular pipeline combining TF-IDF feature extraction, classical classifiers, XGBoost, and fine-tuned BERT models to capture both statistical and contextual signals. By adopting a per-supercategory strategy, we minimized cross-domain interference and achieved strong performance, with the fine-tuned BERT classifier attaining a weighted F1-score of 0.7996 and high accuracy on dominant labels. This work demonstrates how rigorous data preparation and modular modeling enable fine-grained, interpretable recipe classification at scale, providing a robust foundation for downstream culinary applications such as personalized meal planning and intelligent search.