Master's Dissertation
Semi-Automatic Selection and Annotation of Hate Speech from Social Media
2022
Key information
Authors:
Advisors:
Published on
18/11/2022
Abstract
With the proliferation of hate speech, particularly on social media, there is an urgent need to develop models that can detect it automatically. Such models typically rely on large-scale annotated data, which are still scarce for languages such as Portuguese, and creating manually annotated corpora is a time-consuming, expensive, and demanding task. To address this problem, we tested an ensemble of three semi-supervised models that can be used to automatically create a corpus representative of online hate speech in Portuguese. The ensemble comprises a CNN, a model that combines a GAN with a BERT-based model, and a label propagation model. Furthermore, this work explores the impact of data augmentation and domain adaptation in addressing class imbalance and linguistic heterogeneity, taking into consideration the geographic context and the targets of hate speech. We used the annotations of three existing Portuguese corpora (CO-Hate, ToLR-BR, and HPHS) to automatically annotate FIGHT, a corpus composed of geolocated tweets produced in the Portuguese territory. Additionally, to augment our training dataset, English hate speech corpora were automatically translated into Portuguese. An intermediate domain between CO-Hate and FIGHT was also generated to reduce the differences between the two data sources. The models achieved performance in line with the results reported in the literature for the same domain and task. Additional experiments, both from FIGHT to CO-Hate and within a single domain, were also performed to analyze the potential of the proposed models.
Publication details
Authors from the community:
Raquel Bento Santos
ist189533
Advisors from this institution:
Scientific domain (FOS)
electrical-engineering-electronic-engineering-information-engineering - Electrical Engineering, Electronic Engineering, Information Engineering
Publication language (ISO code)
eng - English
Access to publication:
Embargo lifted
Embargo end date:
23/08/2023
Institution name
Instituto Superior Técnico