Abstract:
In recent years, researchers have devoted extensive effort to Natural Language Processing
(NLP) for Indic languages such as Hindi. Tasks such as spelling and grammar correction,
which have been thoroughly studied for languages like English, have gained momentum.
With limited data and tools available, however, performing Grammatical
Error Correction (GEC) for Hindi is a challenge. We propose InHerrant, a Grammatical ERRor
ANnotation Toolkit for the Indic language Hindi. This tool is built to automatically extract
edits from a parallel corpus of incorrect and correct sentences and classify them according to a
new, dataset-agnostic, rule-based framework. InHerrant provides Hindi GEC researchers with a
standardised metric for evaluation, reduces annotator workload, and classifies edits by
error type at different levels of granularity. We also seek to improve upon existing models
for Hindi GEC. We create an artificial dataset by introducing errors into a corpus of Hindi
Wikipedia text and use it to train multiple state-of-the-art models, originally developed for
English GEC, on Hindi. We achieve state-of-the-art results for Hindi GEC, surpassing the
previous state of the art by 14.65 per cent in terms of F0.5 score.
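
For readers unfamiliar with the metric, the following is a minimal sketch of how an F0.5 score can be computed from edit-level counts. It assumes system edits have already been matched against reference edits; the function name f_beta, its parameters, and the choice to return 0.0 when a denominator is zero are illustrative assumptions, not part of InHerrant.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Compute the F-beta score from edit-level counts.

    tp: system edits that match a reference edit
    fp: system edits with no matching reference edit
    fn: reference edits the system missed
    Zero denominators yield 0.0 here (an assumption of this sketch;
    other toolkits may adopt a different convention).
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denominator = beta ** 2 * precision + recall
    if denominator == 0.0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical counts, for illustration only.
score = f_beta(tp=6, fp=2, fn=4)
print(f"F0.5 = {score:.4f}")
```

With beta = 0.5, precision is weighted more heavily than recall, so a system that proposes fewer but more reliable corrections scores higher than one that proposes many uncertain ones.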