Master's Dissertation

Automatic Hate Speech Detection in Portuguese Social Media Text

Bernardo Cunha Matos 2022

Key information

Authors:

Bernardo Cunha Matos

Advisors:

Paula Cristina Quaresma da Fonseca Carvalho, Ricardo Daniel Santos Faro Marques Ribeiro

Published on

18/11/2022

Abstract

Online hate speech (HS) has grown dramatically on social media, and its uncontrolled spread has motivated researchers to develop a variety of methods for its automated detection. However, the detection of online HS in Portuguese still merits further research. To fill this gap, we explored models that have proved successful in the literature for this task, in particular models based on the BERT architecture. Beyond single-task models, we also explored multitask models that use information from other related categories to learn HS. To better capture the semantics of this type of text, we developed HateBERTimbau, a version of BERTimbau further pre-trained on social media language, including potential HS targeting African-descent, Roma, and LGBTQI+ communities. The experiments were based on CO-HATE and FIGHT, corpora of social media messages posted by the Portuguese online community and labelled for the presence of HS, among other categories. The results show the importance of considering annotator agreement in the data used to develop HS detection models: comparing models trained on different subsets of the data showed that, in general, higher agreement leads to better results. HateBERTimbau consistently outperformed BERTimbau on both datasets, confirming that further pre-training of BERTimbau was a successful strategy for obtaining a language model better suited to online HS detection in Portuguese. Target-specific models and multitask learning also showed potential for obtaining better results.

Publication details

Scientific domain (FOS)

- Electrical Engineering, Electronic Engineering, Information Engineering

Publication language (ISO code)

- English

Access to the publication:

Embargo lifted

Embargo end date:

01/09/2023

Institution name

Instituto Superior Técnico