Learning speaker, emotion, age, and gender information through disentanglement of speech pre-trained representations

Please use this identifier to cite or link to this item: http://repository.iiitd.edu.in/xmlui/handle/123456789/1448

Title:	Learning speaker, emotion, age, and gender information through disentanglement of speech pre-trained representations
Authors:	Koshal, Devyani; Buduru, Arun Balaji (Advisor)
Keywords:	Speech Forensics; Self-Supervised Learning; Pre-Trained Models; Multi-Task Learning; Convolutional Neural Networks
Issue Date:	29-Nov-2023
Publisher:	IIIT-Delhi
Abstract:	Forensic speech science, rooted in acoustics, plays a key role in legal investigations. Automatic speaker recognition (ASR) is the primary task in forensic speech analysis, followed by speech emotion recognition (SER), gender recognition (GR), and age estimation (AE). Going beyond conventional identification methods, combining multi-task learning with representations from speech pre-trained models (PTMs) broadens the scope of analysis and is more resource-friendly. This approach allows simultaneous exploration of multiple facets embedded within speech, including speaker information, emotional cues, gender characterization, and age estimation. It also avoids training a separate model for each task, which saves computational resources and time. This multi-dimensional analysis offers insights beyond identification and enriches the depth of investigations through a comprehensive comparison of representations from various PTMs on the aforementioned tasks.
URI:	http://repository.iiitd.edu.in/xmlui/handle/123456789/1448
Appears in Collections:	Year-2023

Files in This Item:

File	Description	Size	Format
BTP_Report_23_Devyani_Koshal_2020055 - Devyani Koshal.pdf (Restricted Access)		5.58 MB	Adobe PDF
