| dc.description.abstract |
High-quality synthetic speech has transformative potential for accessibility, education, entertainment, and personalized human–computer interaction. However, it also poses serious risks: synthetic voices can be exploited for audio deepfakes and impersonation attacks. These risks are magnified in multilingual and low-resource settings, where audio deepfake detection (ADD) and speaker verification (SV) systems exhibit pronounced linguistic biases, and the scarcity of large-scale, publicly available datasets limits the development of robust, fair, and inclusive models. Moreover, existing methods for evaluating synthetic speech quality rely primarily on human studies, which are costly, difficult to scale, and often lack reproducibility. Additionally, synthetic speech generation models incur significant carbon emissions, yet environmental sustainability remains largely overlooked. Together, these challenges highlight a critical need for datasets, evaluation frameworks, and bias-mitigation methods that can enable responsible, inclusive, and environmentally conscious speech technologies. To address these gaps, this thesis makes the following key contributions: First, we introduce IndicSynth, a large-scale synthetic speech dataset covering 12 low-resource Indian languages to support multilingual ADD and anti-spoofing research. IndicSynth balances realistic voice mimicry and synthetic diversity. Using IndicSynth, we demonstrate the vulnerability of existing ADD and SV models against synthetic speech attacks. Human evaluation further validates the dataset quality, underscoring the dataset’s utility for security-focused applications. Second, we present Task-Lens, a cross-task profiling framework to mitigate task-resource gaps for underrepresented languages. Using Task-Lens, we profile 34 Indian speech datasets, including IndicSynth, covering 26 languages and eight downstream tasks, based on available metadata. 
Third, we propose FAtNet and EcoSpeak, which are cost-efficient methods for mitigating linguistic biases in speaker verification, addressing fully and partially cross-lingual scenarios while incorporating Green AI principles by reporting carbon emissions. Finally, we introduce GreenVoice, an automated environment-aware evaluation framework for synthetic speech generation models. GreenVoice cost-effectively highlights high-performing and sustainable generation models for large-scale synthetic speech dataset creation, thus enabling multilingual ADD and anti-spoofing research across more underrepresented languages and accents, beyond IndicSynth. Together, these contributions provide the foundations for building and evaluating speech technologies that are robust, equitable, and inclusive across languages and accents, while promoting environmentally responsible practices and supporting their reliable use in real-world applications. |
en_US |