Abstract:
In this paper, we address the dataset, annotation guidelines, annotation platform, and models needed to build a complete ecosystem for Tamil NLP, with a focus on morpho-syntactic relations in Tamil text. We curated a diverse dataset from five sources to form a treebank of 10,000 sentences annotated in the CoNLL-U format. We developed detailed annotation guidelines for both annotators and users. After testing various available tag sets for the Tamil language, we propose hierarchical tag sets for the POS and NER tasks. To carry out CoNLL-U format annotation efficiently, we introduce CoNLL-U GSheets, an annotation platform that builds on the highly accessible and easy-to-use Google Sheets and equips it with the tools needed for annotation. We also developed the pipeline and the models for each task in the morpho-syntactic analysis, addressing the language-specific issues that arise in each task. Finally, we made design decisions that promote flexibility in applications and assist in later NLP tasks.