Please use this identifier to cite or link to this item: http://repository.iiitd.edu.in/xmlui/handle/123456789/1063
Full metadata record
DC Field | Value | Language
dc.contributor.author | S, Tharun | -
dc.contributor.author | Akhtar, Md. Shad (Advisor) | -
dc.contributor.author | Chakraborty, Tanmoy (Advisor) | -
dc.date.accessioned | 2023-04-03T08:19:23Z | -
dc.date.available | 2023-04-03T08:19:23Z | -
dc.date.issued | 2022-05 | -
dc.identifier.uri | http://repository.iiitd.edu.in/xmlui/handle/123456789/1063 | -
dc.description.abstract | Being a popular mode of text-based communication in multilingual communities, code-mixing in online social media has become an important subject to study. Learning the semantics and morphology of code-mixed language remains a key challenge, due to the scarcity of data and the unavailability of robust, language-invariant representation learning techniques. Any morphologically rich language can benefit from character, subword, and word-level embeddings, which aid in learning meaningful correlations. In this paper, we explore a hierarchical transformer-based architecture (HIT) to learn the semantics of code-mixed languages. HIT consists of multi-headed self-attention and outer product attention components to simultaneously comprehend the semantic and syntactic structures of code-mixed texts. We evaluate the proposed method across 6 Indian languages (Bengali, Gujarati, Hindi, Tamil, Telugu and Malayalam) and Spanish for 9 NLP tasks on 17 datasets. The HIT model outperforms state-of-the-art code-mixed representation learning and multilingual language models in all tasks. We further demonstrate the generalizability of the HIT architecture using masked language modeling-based pre-training, zero-shot learning, and transfer learning approaches. Our empirical results show that the pre-training objectives significantly improve the performance on downstream tasks. | en_US
dc.language.iso | en_US | en_US
dc.publisher | IIIT-Delhi | en_US
dc.subject | Code-mixing | en_US
dc.subject | HIT | en_US
dc.subject | Bengali | en_US
dc.subject | Gujarati | en_US
dc.title | A comprehensive understanding of code-mixed language semantics using hierarchical transformer | en_US
dc.type | Thesis | en_US
Appears in Collections: Year-2022

Files in This Item:
File | Description | Size | Format
Tharun S MT20119.pdf | - | 986.19 kB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.