Abstract:
Large Language Models (LLMs) have transformative capabilities, yet their application remains limited in specialized educational and research contexts that demand multimodal reasoning, context-aware processing, and domain-specific understanding. Education and research need tools that can handle the nuanced interplay of text and diagrams as well as context-rich language. This thesis advances LLMs across high school physics reasoning, multimodal problem solving, mathematical reasoning with bilingual understanding, student engagement analysis, grammatical error correction, and citation generation.

The first contribution enhances multimodal reasoning in physics education, where problems combine text and diagrams. By introducing the MM-PhyQA dataset and combining retrieval-augmented methods with Multi-Image Chain-of-Thought (MI-CoT) prompting, this work achieves 71.60% accuracy on complex physics tasks, improving LLM support for physics education.

Next, mathematical problem solving, especially geometry, is addressed. The GeoVQA and GPSM4K datasets enable training of LLaVA-v1.5 and G-LLaVA models, which outperform larger LLMs on geometric reasoning benchmarks, demonstrating the benefit of tailored LLMs for visually and linguistically demanding mathematical tasks.

The thesis then tackles student engagement prediction in online learning, where in-person cues are absent. Using the ECLIPSE dataset to capture virtual attention dynamics, fine-tuning CG-ViT and NeuralGaze models yields a 21.45% improvement in engagement prediction accuracy, supporting adaptive, personalized remote education.

For grammatical error correction (GEC), traditional neural machine translation approaches struggle with long-range context. The Dynamic Context Learner (DCL) enables LLMs to integrate relevant context dynamically, yielding F1-score gains on the CoNLL-2014 and BEA-Dev benchmarks and enhancing grammar correction for academic writing.

Finally, accurate citation generation is vital to academic writing, yet existing models lack the depth to capture complex citation relationships. The multi-source citation text generation (M-CTG) framework combines knowledge graphs and keyphrase embeddings with fine-tuned Vicuna and Alpaca models, achieving a 36.98% improvement in ROUGE-1 and facilitating better citation and source attribution.

Collectively, this thesis demonstrates the potential of multimodal LLMs fine-tuned for domain-specific educational and scientific tasks. By introducing new datasets, refining architectures, and applying novel methods, it bridges gaps in AI application across fields. In physics education, bilingual mathematical reasoning, and engagement analysis, tailored multimodal LLMs strengthen reasoning and context processing. These advances show how domain-specific multimodal AI tools benefit both education and science, paving the way for precise, context-aware, and impactful LLM applications across complex, cross-domain challenges.