Abstract:
Code-mixing presents significant challenges for Automatic Speech Recognition (ASR), especially for Indian languages, due to homophone ambiguity, domain-specific word identification, and data scarcity. Traditional ASR models struggle with these complexities and often fail to differentiate between phonetically similar words in multilingual contexts. To address this, we propose CLEAR, a novel rescoring model that integrates descriptive prompting with LLM-based rescoring, and we analyze the impact of n-best hypotheses across multiple beam widths. CLEAR achieves an S-WER of 26.9, a P-WER of 26.46, and a T-WER of 25.04, improvements of 6.9%, 13.47%, and 4.42%, respectively, over the best baseline, TDNN. It also reduces S-WER by 13.56% compared with fine-tuned Whisper, without extensive pretraining. These findings demonstrate that CLEAR effectively resolves homophone ambiguities and refines transcriptions. Beyond improving transcription accuracy, CLEAR introduces a principled framework for handling ambiguous hypotheses in low-resource, script-mixed speech. The framework is generic and can be adapted to languages other than Hindi. This work lays the foundation for more linguistically aware ASR systems tailored to multilingual societies.