A growing body of work has focused on text classification methods for detecting the increasing amount of hate speech posted online. This progress has been limited to a select number of popular languages, causing systems to either underperform or not exist for low-resource languages. This is largely caused by a lack of training data, which is expensive to collect and curate in low-resource language settings. In this work, we propose a data augmentation approach that uses a contextual token replacement technique to address the lack of data for online hate speech detection in low-resource language settings. Given a handful of hate speech examples in a high-resource language such as English, our approach synthesizes new examples of hate speech in a target language that retain the hate sentiment of the original examples but transfer the hate targets. We then employ a stratification technique to improve the diversity of the generated examples.
We apply our approach to generate training data for a hate speech classification task in Hindi and Vietnamese. Our findings show that a model trained using this method outperforms simple language translation for all tasks and performs better than a model trained on an original curated dataset when tested on a new dataset. This method can be used to bootstrap hate speech detection models from scratch in low-resource language settings. As the growth of social media within these contexts continues to outstrip response efforts, this work furthers our capacities for detection, understanding, and response to hate speech.
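The two steps described above, swapping the hate target in a seed example and then stratifying the generated examples, can be illustrated with a toy sketch. This is not the authors' implementation: it substitutes a fixed, hypothetical lexicon (`TARGET_LEXICON`) for the contextual replacement model, uses neutral placeholder tokens instead of real text, and omits the transfer into the target language entirely. The function names `replace_targets` and `stratified_sample` are likewise assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical lexicon (an assumption, not from the paper): maps a target
# placeholder in a seed example to candidate replacement targets.
TARGET_LEXICON = {
    "<GROUP>": ["<GROUP_A>", "<GROUP_B>", "<GROUP_C>"],
}

def replace_targets(tokens, lexicon):
    """Yield (replacement, new_tokens) for each possible target swap."""
    for i, tok in enumerate(tokens):
        for candidate in lexicon.get(tok, []):
            yield candidate, tokens[:i] + [candidate] + tokens[i + 1:]

def stratified_sample(examples, per_group, seed=0):
    """Keep at most `per_group` examples for each replacement target."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for group, tokens in examples:
        buckets[group].append(tokens)
    sample = []
    for group in sorted(buckets):
        items = buckets[group]
        rng.shuffle(items)
        sample.extend((group, t) for t in items[:per_group])
    return sample

# Seed examples use placeholders only; real seeds would be labeled text.
seeds = [
    ["<GROUP>", "are", "<NEGATIVE_TERM>"],
    ["I", "cannot", "stand", "<GROUP>"],
]
generated = [pair for s in seeds for pair in replace_targets(s, TARGET_LEXICON)]
balanced = stratified_sample(generated, per_group=1)
```

In this sketch, stratifying by replacement target keeps any single target from dominating the synthetic set, which is one plausible reading of how stratification could improve diversity.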
This work was conducted as part of the Technologies and International Development Lab; more information can be found at https://tid.gatech.edu/
The Technologies and International Development Lab at Georgia Tech researches the practice, the promise, and the peril of information and communication technologies (ICTs) in social, economic, and political development. We study the risks and rewards of ICT systems for people and communities, particularly within Africa and Asia. We explore issues of rights and justice in a digital age. And we examine new forms of inclusive innovation and social entrepreneurship enhanced through digital systems.
The T+ID Lab is an interdisciplinary community bringing together computer and social scientists with design and policy specialists. We collaborate directly with stakeholders outside of the Lab to critique technologies, invent new ones, and research how and why (or why not) ICTs can serve as a tool to empower, enrich, and interconnect.