Abstract:
With the increasing amount of hate speech on online social media platforms, automatic detection of toxic language plays a protective role for online users and content moderators. It therefore becomes important to ensure that these models are safe and unbiased against minority groups defined by gender, religion, caste, etc. For mitigating bias in hate speech detection tasks, data augmentation is not a complete solution, since it is not desirable to equalize the data based on the presence of social group tokens in the dataset: these tokens play an important role in contextualizing derogatory remarks directed at a specific group. In this thesis, I approach the problem of bias removal in hate speech models through robustness using counterfactual generation.