Abstract:
With rapid advances in speech synthesis technology, distinguishing genuine audio from fake audio has become increasingly challenging. This semester, the project focused on benchmarking voice conversion models as a preparatory step toward creating a dataset aimed at cost-bias mitigation in audio deepfake detection; owing to resource constraints and the inefficiency of fine-tuning, only one model was benchmarked successfully. To demonstrate progress and contribute meaningfully, a dataset of 20,596 utterances, named Kalpvani, was produced using the benchmarked model. A user study was conducted in which participants were presented with 6 fake and 6 real audio samples and evaluated the cloned audio through subjective analysis. Participants were also asked to assess whether a given cloned audio sample was closer to the source audio or the target audio. Furthermore, speaker verification systems, ECAPA-TDNN and ResNet-TDNN, were used to compute Equal Error Rates (EER) for target-clone and source-clone pairs, providing an objective evaluation of voice similarity. This benchmarking lays the foundation for future work on cost-bias mitigation in audio deepfake detection.
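As a rough illustration of the objective metric used above: the Equal Error Rate is the operating point at which a verification system's false-accept rate (impostor pairs accepted) equals its false-reject rate (genuine pairs rejected). The sketch below, which assumes NumPy arrays of similarity scores for genuine and impostor trials, sweeps a decision threshold over the observed scores and returns the point where the two error rates meet; it is a simplified estimate, not the exact evaluation pipeline used in the project.

```python
import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    """Estimate the Equal Error Rate (EER) from verification scores.

    genuine_scores:  similarity scores for same-speaker trials
    impostor_scores: similarity scores for different-speaker trials
    Returns (eer, threshold) where FAR and FRR are closest to equal.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)

    # Candidate thresholds: every observed score.
    thresholds = np.sort(np.concatenate([genuine, impostor]))

    # False-accept rate: fraction of impostor trials scoring >= threshold.
    fars = np.array([np.mean(impostor >= t) for t in thresholds])
    # False-reject rate: fraction of genuine trials scoring < threshold.
    frrs = np.array([np.mean(genuine < t) for t in thresholds])

    # EER is taken where the two error curves are closest.
    idx = np.argmin(np.abs(fars - frrs))
    eer = (fars[idx] + frrs[idx]) / 2.0
    return eer, thresholds[idx]

# Example: perfectly separated scores give an EER of 0.
eer, thr = compute_eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
```

In practice the same computation would be applied twice, once to target-clone score pairs and once to source-clone pairs, so the two EERs can be compared as a measure of which voice the clone resembles more.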