Master's Dissertation
Classificação Automática de Páginas Web numa Hierarquia de Tópicos (Automatic Classification of Web Pages in a Topic Hierarchy)
2011
Key information
Published on
28/07/2011
Abstract
The volume of documents currently available on the Internet makes it impossible for humans to manually catalogue all the available information. For this reason, several authors have researched automatic classification techniques capable of assigning a class from a hierarchy of possible classes to new documents. These techniques support the organization of textual contents by topic. My MSc thesis proposes an extension of the traditional top-down approach to hierarchical classification. In this extension, the classifier tries to avoid misclassification at the higher levels by considering an alternative path, which can return a class from the hierarchy that fits a new document better than the class returned by the first path. Besides the extension to top-down classification, my MSc thesis also proposes two methods to reduce the size of the training data, in order to reduce the time spent training the classifier. The first, called naive, sets a limit on the number of documents per child class, regardless of whether documents from all the children of N_i are included. The second method not only tries to include documents from all nodes N_j that are child nodes of N_i, but also tries to include an equal number of documents from each child node. Finally, I also experimented with simple feature selection methods, such as stemming or stopword removal, in order to measure their impact on the results. The results confirmed the expectations about the proposed classification method: improvements were registered not only in accuracy but also in F-measure.
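The two training-data reduction methods mentioned in the abstract could be sketched roughly as follows. This is only an illustrative reading: the abstract does not say how documents are actually selected, so the function names (`naive_sample`, `balanced_sample`), the pooled ordering, and the round-robin strategy are all assumptions rather than the thesis's method.

```python
from itertools import zip_longest


def naive_sample(docs_by_child, limit):
    """Assumed 'naive' variant: pool the documents found under N_i and keep
    the first `limit`, without checking which child nodes N_j they come from."""
    pooled = [doc for docs in docs_by_child.values() for doc in docs]
    return pooled[:limit]


def balanced_sample(docs_by_child, limit):
    """Assumed 'balanced' variant: take documents round-robin from each child
    node N_j, so every child is represented as equally as the data allows.
    Documents are assumed never to be None (None marks an exhausted child)."""
    interleaved = [
        doc
        for one_round in zip_longest(*docs_by_child.values())
        for doc in one_round
        if doc is not None
    ]
    return interleaved[:limit]


# Hypothetical toy data: documents grouped by the child nodes N_j of one class N_i.
docs_by_child = {
    "N_1": ["a1", "a2", "a3", "a4"],
    "N_2": ["b1", "b2"],
    "N_3": ["c1"],
}

naive = naive_sample(docs_by_child, 4)        # all four come from N_1
balanced = balanced_sample(docs_by_child, 4)  # every child node contributes
```

With this toy data, the naive sample is drawn entirely from the first child node, while the balanced sample contains at least one document from each child, which is the contrast the abstract draws between the two methods.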
Publication details
Authors from the community:
Vítor Hugo Fernandes Sequeira
ist152990
Advisors from this institution:
Pável Pereira Calado
ist14497
Bruno Emanuel Da Graça Martins
ist24686
Scientific Domain (FOS)
electrical-engineering-electronic-engineering-information-engineering - Electrical Engineering, Electronic Engineering, Information Engineering
Publication language (ISO code)
eng - English
Access to the publication:
Embargo lifted
Embargo end date:
04/08/2012
Institution name
Instituto Superior Técnico