A comprehensive understanding of code-mixed language semantics using hierarchical transformer

S, Tharun; Akhtar, Md. Shad (Advisor); Chakraborty, Tanmoy (Advisor)

Please use this identifier to cite or link to this item: http://repository.iiitd.edu.in/xmlui/handle/123456789/1063

Full metadata record

DC Field	Value	Language
dc.contributor.author	S, Tharun	-
dc.contributor.author	Akhtar, Md. Shad (Advisor)	-
dc.contributor.author	Chakraborty, Tanmoy (Advisor)	-
dc.date.accessioned	2023-04-03T08:19:23Z	-
dc.date.available	2023-04-03T08:19:23Z	-
dc.date.issued	2022-05	-
dc.identifier.uri	http://repository.iiitd.edu.in/xmlui/handle/123456789/1063	-
dc.description.abstract	Being a popular mode of text-based communication in multilingual communities, code-mixing in online social media has become an important subject to study. Learning the semantics and morphology of code-mixed language remains a key challenge, due to the scarcity of data and the unavailability of robust, language-invariant representation learning techniques. Any morphologically rich language can benefit from character, subword, and word-level embeddings, which aid in learning meaningful correlations. In this paper, we explore a hierarchical transformer-based architecture (HIT) to learn the semantics of code-mixed languages. HIT consists of multi-headed self-attention and outer product attention components to simultaneously comprehend the semantic and syntactic structures of code-mixed texts. We evaluate the proposed method across 6 Indian languages (Bengali, Gujarati, Hindi, Tamil, Telugu and Malayalam) and Spanish for 9 NLP tasks on 17 datasets. The HIT model outperforms state-of-the-art code-mixed representation learning and multilingual language models in all tasks. We further demonstrate the generalizability of the HIT architecture using masked language modeling-based pre-training, zero-shot learning, and transfer learning approaches. Our empirical results show that the pre-training objectives significantly improve the performance on downstream tasks.	en_US
dc.language.iso	en_US	en_US
dc.publisher	IIIT-Delhi	en_US
dc.subject	Code-mixing	en_US
dc.subject	HIT	en_US
dc.subject	Bengali	en_US
dc.subject	Gujarati	en_US
dc.title	A comprehensive understanding of code-mixed language semantics using hierarchical transformer	en_US
dc.type	Thesis	en_US
Appears in Collections:	Year-2022

Files in This Item:

File	Description	Size	Format
Tharun S MT20119.pdf		986.19 kB	Adobe PDF
