Master's Thesis

Classificação Automática de Páginas Web numa Hierarquia de Tópicos (Automatic Classification of Web Pages in a Topic Hierarchy)

Vítor Hugo Fernandes 2011

Key information

Authors:

Vítor Hugo Fernandes (Vítor Hugo Fernandes Sequeira)

Supervisors:

Pável Pereira Calado; Bruno Emanuel Da Graça Martins

Published in

07/28/2011

Abstract

The volume of documents currently available on the Internet makes it impossible for humans to catalogue all of the available information manually. For this reason, several authors have researched automatic classification techniques capable of assigning new documents a class from a hierarchy of possible classes. These techniques support the organization of textual content by topic. My MSc thesis proposes an extension of the traditional top-down approach to hierarchical classification. In this extension, the classifier tries to avoid misclassification at the higher levels by considering an alternative path, which can return a class from the hierarchy that fits a new document better than the class returned by the first path. Besides the extension to top-down classification, my MSc thesis also proposes two methods to reduce the size of the training data, and thus the time spent training the classifier. The first, called naive, sets a limit on the number of documents per child class, regardless of whether documents from all the children of N_i are included. The second method not only tries to include documents from all nodes N_j that are child nodes of N_i, but also tries to include an equal number of documents from each child node. Finally, I also experimented with simple feature selection methods, such as stemming or stopword removal, in order to measure their impact on the results. The results confirmed the expectations about the proposed classification method: improvements were registered not only in accuracy but also in F-measure.
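
The two training-data reduction methods can be illustrated with a minimal sketch. It is one possible reading of the abstract, not the thesis implementation: it assumes the training documents for a class N_i are pooled from its child nodes N_j, that the naive method keeps the first documents it meets up to the limit, and that the second method approximates "an equal number per child" by round-robin selection. The function names, the `limit` parameter and the sample data are hypothetical.

```python
from collections import deque

def naive_reduce(docs, limit):
    """Assumed naive method: keep the first `limit` documents of the pool,
    without checking which child nodes N_j they come from."""
    return docs[:limit]

def balanced_reduce(docs, limit):
    """Assumed balanced method: visit the child nodes in round-robin order so
    that every child is represented and counts stay as equal as possible."""
    queues = {}
    for doc_id, child in docs:
        # Group document ids by the child node they belong to.
        queues.setdefault(child, deque()).append(doc_id)
    selected = []
    while len(selected) < limit and any(queues.values()):
        for child, queue in queues.items():
            if len(selected) >= limit:
                break
            if queue:
                selected.append((queue.popleft(), child))
    return selected

# Hypothetical pool for one class N_i: (document id, child node N_j).
pool = [("d1", "A"), ("d2", "A"), ("d3", "A"), ("d4", "A"),
        ("d5", "B"), ("d6", "C")]
naive_sample = naive_reduce(pool, 4)
balanced_sample = balanced_reduce(pool, 4)
```

With this pool and a limit of four, the naive sample contains only documents from child A, while the balanced sample covers A, B and C. That difference in coverage is the contrast the abstract draws between the two methods.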

Publication details

Fields of Science and Technology (FOS)

electrical-engineering-electronic-engineering-information-engineering - Electrical engineering, electronic engineering, information engineering

Publication language (ISO code)

eng - English

Rights type:

Embargo lifted

Date available:

08/04/2012

Institution name

Instituto Superior Técnico