Abstract:
This project, Pixel to Plate, combines computer vision and natural language processing to generate recipes automatically from images of ingredients. The first phase addresses object detection, using state-of-the-art YOLO models to identify ingredients in the AI-Cook dataset. Exploratory data analysis (EDA) was conducted to assess dataset quality, class imbalance, and object co-occurrence patterns. Among the models tested, YOLOv8x performed best, with a precision of 0.970, and was selected for ingredient detection. The second phase evaluates four large language models (LLaMA, Falcon, Gemma, and Phi) for recipe generation from the detected ingredients. The models were assessed in a zero-shot setting for coherence, completeness, and relevance. LLaMA outperformed the others, producing recipes with a logical structure, meaningful use of ingredients, and balanced food combinations. The results highlight the potential of combining computer vision and language models for culinary applications, and point toward automated recipe generation systems for personalized cooking. They also underscore the importance of model selection, data quality, and task-specific evaluation metrics in achieving reliable results.
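As a rough illustration of how the two phases might connect, the sketch below turns a list of detected labels into a zero-shot prompt for a language model. It is not the project's actual implementation: the `Detection` record, the function names, the 0.5 confidence threshold, and the prompt wording are all assumptions made for this example, and the sample detections are invented rather than taken from the AI-Cook dataset.

```python
from typing import Iterable, List, Tuple

# Hypothetical detection record: (class label, confidence score).
Detection = Tuple[str, float]


def unique_ingredients(detections: Iterable[Detection], min_conf: float = 0.5) -> List[str]:
    """Keep labels at or above min_conf, deduplicated, in first-seen order."""
    seen = set()
    result = []
    for label, conf in detections:
        name = label.strip().lower()
        if conf >= min_conf and name not in seen:
            seen.add(name)
            result.append(name)
    return result


def build_zero_shot_prompt(ingredients: List[str]) -> str:
    """Build a prompt that contains no example recipes (zero-shot)."""
    if not ingredients:
        raise ValueError("no ingredients detected")
    listing = ", ".join(ingredients)
    return (
        "You are a helpful cooking assistant.\n"
        f"Available ingredients: {listing}.\n"
        "Write one recipe that uses these ingredients. "
        "Include a title, an ingredient list, and numbered steps."
    )


# Illustrative detections only (assumed values, not real model output).
sample = [("tomato", 0.93), ("onion", 0.88), ("Tomato", 0.71), ("garlic", 0.32)]
prompt = build_zero_shot_prompt(unique_ingredients(sample))
print(prompt)
```

In this sketch the low-confidence "garlic" detection is dropped and the duplicate "Tomato" is merged, so the prompt lists only tomato and onion. The resulting string could then be passed unchanged to each candidate language model so that all of them are compared on the same input.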