Abstract:
Forensic speech science, rooted in acoustics, plays a key role in legal investigations. Among its diverse applications, automatic speaker recognition (ASR) is the primary task in forensic speech analysis, followed by speech emotion recognition (SER), gender recognition (GR), and age estimation (AE). Moving beyond conventional identification methods, multi-task learning on representations from speech pre-trained models (PTMs) broadens the scope of analysis while being more resource-efficient. This approach allows simultaneous exploration of multiple facets embedded in speech, including speaker information, emotional cues, gender, and age. It also avoids training a separate model for each task, saving both computational resources and time. This multi-dimensional analysis offers insights beyond identification and deepens investigations through a comprehensive comparison of representations from various PTMs across the aforementioned tasks.