Please use this identifier to cite or link to this item: http://repository.iiitd.edu.in/xmlui/handle/123456789/1090

| Title: | TamilNLP: low resource language processing |
| Authors: | Singh, Himanshu; Shah, Rajiv Ratn (Advisor) |
| Keywords: | Low resource language processing; NLP; Dataset; Annotation guidelines; Morpho-syntactic relations; Tamil language |
| Issue Date: | Jul-2022 |
| Publisher: | IIIT-Delhi |
| Abstract: | In this paper, we address several aspects of NLP for the Tamil language, namely the dataset, annotation guidelines, annotation platform, and models, with the aim of building a complete ecosystem. We focus on morpho-syntactic relations in Tamil text. A more diverse dataset was curated from 5 sources to form a treebank of 10,000 sentences annotated in CoNLL-U format. Detailed annotation guidelines were developed to guide both the annotators and the users. After testing various available tag sets for the Tamil language, we propose hierarchical tag sets for the POS and NER tasks. To carry out CoNLL-U annotation efficiently, we introduce CoNLL-U GSheets, an annotation platform that builds on the highly accessible and easy-to-use Google Sheets and equips it with the necessary annotation tools. The research also covers the pipeline and the models for each task in the morpho-syntactic analysis, and addresses the language-specific issues of each task. We also made design decisions that promote flexibility in applications and assist later NLP tasks. |
| URI: | http://repository.iiitd.edu.in/xmlui/handle/123456789/1090 |
| Appears in Collections: | Year-2022 |
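
The abstract describes a treebank of sentences annotated in CoNLL-U format. As a rough illustration of what that format looks like to a program, here is a minimal Python sketch that reads CoNLL-U text into per-sentence token records. It assumes the standard ten tab-separated columns defined by Universal Dependencies. The function name `parse_conllu` and the toy English sample are hypothetical and are not taken from the thesis or its treebank.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order, as defined by Universal Dependencies.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dicts.

    Hypothetical helper for illustration; not the thesis pipeline.
    """
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comment such as "# text = ..."; skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(columns)}")
        current.append(dict(zip(FIELDS, columns)))
    if current:
        # Handle input that does not end with a blank line.
        sentences.append(current)
    return sentences


# Toy English sample (illustrative only, not from the thesis treebank).
SAMPLE = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for token in sentence:
            print(token["id"], token["form"], token["upos"],
                  token["head"], token["deprel"])
```

Each token keeps its values as strings, so a consumer can decide later how to handle multiword-token ranges or empty nodes in the ID column, which this sketch does not treat specially.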
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| MTech_Thesis__Himanshu.pdf | | 6.29 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.