Master's Thesis
Semi-Automatic Selection and Annotation of Hate Speech from Social Media
2022
Key information
Published in
11/18/2022
Abstract
With the proliferation of hate speech, particularly on social media, there is an urgent need for models that can detect it automatically. Such models typically rely on large-scale annotated data, which are still scarce for languages such as Portuguese, and creating manually annotated corpora is time-consuming, expensive, and demanding. To address this problem, we tested an ensemble of three semi-supervised models that can be used to automatically create a corpus representative of online hate speech in Portuguese: a CNN; a model that combines a GAN with a BERT-based model; and a label propagation model. This work also explores the impact of data augmentation and domain adaptation in addressing data imbalance and linguistic heterogeneity, taking into consideration the geographic context and the targets of hate speech. We used the annotations of three existing Portuguese corpora (CO-Hate, ToLR-BR, and HPHS) to automatically annotate FIGHT, a corpus of geolocated tweets produced in Portuguese territory. To augment the training dataset, English hate speech corpora were automatically translated into Portuguese. An intermediate domain between CO-Hate and FIGHT was also generated to reduce the differences between the two data sources. The models achieved performance in line with the results reported in the literature for the same task and domain. Additional experiments, both from FIGHT to CO-Hate and within the same domain, were also performed to analyze the potential of the proposed models.
Publication details
Authors in the community:
Raquel Bento Santos
ist189533
Supervisors of this institution:
Fields of Science and Technology (FOS)
electrical-engineering-electronic-engineering-information-engineering - Electrical engineering, electronic engineering, information engineering
Publication language (ISO code)
eng - English
Rights type:
Embargo lifted
Date available:
08/23/2023
Institution name
Instituto Superior Técnico