Master's Thesis

Semi-Automatic Selection and Annotation of Hate Speech from Social Media

Raquel Bento Santos, 2022

Key information

Authors:

Raquel Bento Santos

Supervisors:

Fernando Manuel Marques Batista; Paula Cristina Quaresma da Fonseca Carvalho

Published in

11/18/2022

Abstract

With the proliferation of hate speech, particularly on social media, there is an urgent need to develop models that can detect it automatically. Such models typically rely on large-scale annotated data, which are still scarce for languages such as Portuguese. However, creating manually annotated corpora is a very time-consuming, expensive, and demanding task. To address this problem, we tested an ensemble of three semi-supervised models that can be used to automatically create a corpus representative of online hate speech in Portuguese. These models are a CNN; a model that combines a GAN with a BERT-based model; and a label propagation model. Furthermore, this work explores the impact of data augmentation and domain adaptation in addressing data imbalance and linguistic heterogeneity, taking into consideration the geographic context and the targets of hate speech. We used the annotations of three existing Portuguese corpora (CO-Hate, ToLR-BR, and HPHS) to automatically annotate FIGHT, a corpus composed of geolocated tweets produced in Portuguese territory. Additionally, to augment our training dataset, English hate speech corpora were automatically translated into Portuguese. An intermediary domain between CO-Hate and FIGHT was also generated to reduce the differences in the nature of the two data sources. The models achieved performance in line with the results reported in the literature for the same domain task. Additional experiments, from FIGHT to CO-Hate and within the same domain, were also performed to analyze the potential of the proposed models.

Publication details

Fields of Science and Technology (FOS)

Electrical engineering, electronic engineering, information engineering

Publication language (ISO code)

eng - English

Rights type:

Embargo lifted

Date available:

08/23/2023

Institution name

Instituto Superior Técnico