Debiasing Hate Speech Classification Models for Queer Language Through Keyword Analysis


D.S. Yahathugoda
Rupika Wijesinghe
Ruvan Weerasinghe

Abstract

This article contains words or language that some readers may consider profane, vulgar, or offensive.


Detecting hate speech is critical for moderating harmful content on social media and the Internet. However, existing models often struggle to accurately identify hate speech targeting queer communities due to inherent biases in training data and language usage. This research explores debiasing techniques for hate speech classification models, with a focus on queer language via keyword analysis. By analyzing established hate speech datasets and queer-specific linguistic traits, this study identifies the words and phrases the models attend to most and applies debiasing approaches such as reweighting and adversarial debiasing to improve the effectiveness and fairness of hate speech detection for queer communities, without unfairly silencing queer voices. We found that these methods improved accuracy on queer-specific datasets but reduced performance on more general datasets. These findings suggest that more community-specific models are needed to safeguard queer communities from harmful content. This research contributes to advancing the understanding of bias in hate speech detection models and provides practical guidance for devising more inclusive and fair classification systems for online content moderation.
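
To give a concrete sense of the reweighting idea mentioned in the abstract, the sketch below shows one common form of keyword-based sample reweighting. It is not the authors' implementation: the keyword list, the weight value, and the classifier are illustrative assumptions, and a real study would derive the keywords from the keyword analysis described in the paper.

```python
# Illustrative sketch only, not the authors' method.
# Idea: up-weight non-hateful training examples that mention queer identity
# terms, so the classifier stops treating the mere presence of those terms
# as evidence of hate.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical keyword list; in practice it would come from keyword analysis.
QUEER_KEYWORDS = {"queer", "gay", "lesbian", "trans", "bisexual"}

def contains_keyword(text: str) -> bool:
    """Return True if the text mentions any identity keyword."""
    return bool(set(text.lower().split()) & QUEER_KEYWORDS)

def reweight(texts, labels, boost=3.0):
    """Assign a higher weight to non-hateful examples (label 0) that
    contain identity keywords; all other examples keep weight 1."""
    return [
        boost if contains_keyword(t) and y == 0 else 1.0
        for t, y in zip(texts, labels)
    ]

# Toy data for illustration only.
texts = [
    "proud to be queer and happy today",        # non-hateful, mentions keyword
    "i hate those people they should leave",    # hateful, no keyword
    "great weather for the parade",             # non-hateful, no keyword
]
labels = [0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)
clf = LogisticRegression()
clf.fit(X, labels, sample_weight=reweight(texts, labels))
```

The same weights could instead be passed to the loss function of a neural classifier; the key design choice is that the correction targets the spurious association between identity terms and the hate label rather than removing the terms themselves.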
