Master's Dissertation

Classificação Automática de Páginas Web numa Hierarquia de Tópicos (Automatic Classification of Web Pages in a Topic Hierarchy)

Vítor Hugo Fernandes 2011

Key information

Authors:

Vítor Hugo Fernandes (Vítor Hugo Fernandes Sequeira)

Advisors:

Pável Pereira Calado; Bruno Emanuel Da Graça Martins

Published on

28/07/2011

Abstract

The volume of documents currently available on the Internet makes it impossible for humans to manually catalogue all of the available information. For this reason, several authors have researched automatic classification techniques capable of assigning new documents a class from a hierarchy of possible classes. These techniques support the organization of textual content by topic. My MSc thesis proposes an extension of the traditional top-down approach to hierarchical classification. In this extension, the classifier tries to avoid misclassification at the higher levels by considering an alternative path, which can return a class from the hierarchy that fits a new document better than the class returned by the first path. Besides the extension to top-down classification, my MSc thesis also proposes two methods for reducing the size of the training data, in order to reduce the time spent training the classifier. The first, called naive, sets a limit on the number of documents per child class, regardless of whether documents from all the children of a node N_i are included. The second method not only tries to include documents from all nodes N_j that are children of N_i, but also tries to include an equal number of documents from each child node. Finally, I also experimented with simple feature selection methods, such as stemming or stopword removal, in order to measure their impact on the results. The results confirmed the expectations for the proposed classification method: improvements were registered not only in accuracy but also in F-measure.
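The alternative-path idea described in the abstract can be illustrated with a minimal sketch. This is one possible reading, not the thesis implementation: the toy hierarchy, the keyword-overlap `score` function (standing in for trained per-node classifiers), the choice to branch only at the root, and the rule of comparing final leaf scores are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy hierarchy: parent -> children. Leaves have no entry.
HIERARCHY: Dict[str, List[str]] = {
    "root": ["science", "sports"],
    "science": ["physics", "biology"],
    "sports": ["football", "tennis"],
}

# Hypothetical keyword sets standing in for trained per-node classifiers.
KEYWORDS: Dict[str, Set[str]] = {
    "science": {"study", "analysis"},
    "sports": {"match", "team", "player"},
    "physics": {"quantum", "analysis"},
    "biology": {"cell", "gene"},
    "football": {"goal", "team", "match"},
    "tennis": {"serve", "racket", "match"},
}


def score(node: str, tokens: Set[str]) -> float:
    """Fraction of the node's keywords that appear in the document."""
    keywords = KEYWORDS[node]
    return len(keywords & tokens) / len(keywords)


def descend(start: str, tokens: Set[str]) -> Tuple[str, float]:
    """Greedy top-down descent from `start`; returns (leaf, leaf score)."""
    node = start
    while node in HIERARCHY:
        node = max(HIERARCHY[node], key=lambda child: score(child, tokens))
    return node, score(node, tokens)


def classify_top_down(tokens: Set[str]) -> str:
    """Traditional top-down classification: follow the best child at each level."""
    return descend("root", tokens)[0]


def classify_with_alternative(tokens: Set[str]) -> str:
    """Assumed variant: also follow the second-best root child, keep the better leaf."""
    ranked = sorted(HIERARCHY["root"], key=lambda c: score(c, tokens), reverse=True)
    candidates = [descend(child, tokens) for child in ranked[:2]]
    return max(candidates, key=lambda pair: pair[1])[0]


# A football document whose vocabulary looks "scientific" at the top level.
DOC: Set[str] = {"study", "analysis", "goal", "team", "match"}
```

With this toy data, `classify_top_down(DOC)` commits to `science` at the first level and ends at `physics`, while `classify_with_alternative(DOC)` also explores `sports` and returns `football`, whose leaf score is higher. How the thesis actually selects and compares paths may differ from this sketch.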

Publication details

Community authors:

Advisors from this institution:

Scientific domain (FOS)

electrical-engineering-electronic-engineering-information-engineering - Electrical Engineering, Electronic Engineering, Information Engineering

Publication language (ISO code)

eng - English

Access to the publication:

Embargo lifted

Embargo end date:

04/08/2012

Institution name

Instituto Superior Técnico