Abstract:
Abusive content on social media platforms is undesirable as it impedes healthy and safe social media interactions. While automatic abuse detection has been widely studied in the textual domain, abuse detection from audio remains largely unexplored, mainly because of the lack of audio datasets. Abuse detection in spoken content can be addressed by performing Automatic Speech Recognition (ASR) and leveraging advances in natural language processing. However, ASR models introduce latency and often perform suboptimally on abusive words, as such words are underrepresented in training corpora and are frequently not spoken clearly or in full. This work instead focuses on audio abuse detection from an acoustic-cue perspective in a multilingual social media setting. Our key hypothesis is that abusive behavior gives rise to distinct acoustic cues, which can help detect abuse directly from audio signals without transcribing them. We use ADIMA, a linguistically diverse, ethically sourced, expert-annotated, and well-balanced multilingual audio abuse detection dataset comprising 11,775 audio samples in 10 Indic languages, spanning 65 hours and spoken by 6,446 unique users. We first demonstrate that employing a generic large pre-trained acoustic/language model is suboptimal, which suggests that incorporating the right acoustic cues is the way forward to improve performance and achieve generalization. Our proposed method explicitly focuses on two modalities: the underlying emotions expressed in the audio and its language features. On the recently proposed ADIMA benchmark for this task, our approach achieves state-of-the-art performance of 96% on the test set and outperforms the existing best models by a large margin.
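To make the two-modality idea sketched above concrete, the following is a minimal illustrative sketch (not the paper's actual architecture): it assumes emotion and language-feature embeddings have already been extracted from each audio clip by external pre-trained models, and simply fuses them for binary abuse classification. All class and parameter names here (e.g., TwoStreamAbuseClassifier, emo_dim, lang_dim) are hypothetical.

```python
import torch
import torch.nn as nn

class TwoStreamAbuseClassifier(nn.Module):
    """Illustrative two-stream fusion classifier (assumption, not the paper's model):
    concatenates an emotion embedding and a language-feature embedding from the
    same audio clip and predicts abusive vs. non-abusive."""

    def __init__(self, emo_dim=256, lang_dim=256, hidden_dim=128, num_classes=2):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(emo_dim + lang_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, emo_feat, lang_feat):
        # emo_feat:  (batch, emo_dim)  -- embedding from an emotion recognition model
        # lang_feat: (batch, lang_dim) -- embedding capturing language characteristics
        fused = torch.cat([emo_feat, lang_feat], dim=-1)
        return self.fusion(fused)

# Usage with random placeholder tensors standing in for real embeddings.
model = TwoStreamAbuseClassifier()
emo = torch.randn(4, 256)
lang = torch.randn(4, 256)
logits = model(emo, lang)        # shape: (4, 2)
probs = logits.softmax(dim=-1)   # per-clip abuse probability
```

The late-fusion design shown here is only one plausible way to combine the two cue types; the key point it illustrates is that classification operates on acoustic-derived embeddings rather than on an ASR transcript.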