Abstract:
Over the course of the semester we developed VertexVQA4k, a comprehensive multimodal dataset for secondary-level geometry education, drawing from Indian curricula. The dataset contains approximately 4,000 geometric image-caption and question-answer pairs and emphasizes Numerical Answer Questions and Theorem Proving Questions, thereby broadening the scope and educational significance of multimodal numerical reasoning in Large Language Models (LLMs). VertexVQA4k distinguishes itself from existing geometry datasets by providing dual solution approaches for each problem, aiming to enhance problem-solving skills and model comprehension. The paper details the dataset extraction and augmentation processes, including diagram description generation and solution regeneration, designed to improve the geometric problem-solving capabilities of multimodal LLMs. It also examines hallucination in Large Vision Language Models (LVLMs) and proposes mitigation strategies, and it addresses image captioning, stressing the importance of generating meaningful visual representations and coherent captions. The study concludes with an evaluation of the dataset and models, underscoring the efficacy of VertexVQA4k in advancing multimodal learning and reasoning in LLMs.