Master's Thesis
Classificação Automática de Páginas Web numa Hierarquia de Tópicos (Automatic Classification of Web Pages in a Topic Hierarchy)
2011
Key information
Published in
07/28/2011
Abstract
The volume of documents currently available on the Internet makes it impossible for humans to manually catalogue all of the available information. Motivated by this problem, several authors have researched automatic classification techniques capable of assigning new documents a class from a hierarchy of possible classes. These techniques support the organization of textual content by topic. My MSc thesis proposes an extension of the traditional top-down approach to hierarchical classification. In this extension, the classifier tries to avoid misclassification at the higher levels by considering an alternative path, which can return a class from the hierarchy that fits a new document better than the class returned by the first path. Besides the extension to top-down classification, my MSc thesis also proposes two methods for reducing the size of the training data, and with it the time spent training the classifier. The first, called naive, sets a limit on the number of documents per child class, regardless of whether there are documents from all the children of a node N_i. The second method not only tries to include documents from all nodes N_j that are child nodes of N_i, but also tries to include an equal number of documents from each child node. Finally, I also experimented with simple feature selection methods, such as stemming or stopword removal, in order to measure their impact on the results. The results confirmed the expectations for the proposed classification method: improvements were registered not only in accuracy but also in F-measure.
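The alternative-path idea described in the abstract can be illustrated with a minimal sketch. This is not the implementation from the thesis: the toy hierarchy, the keyword-overlap scorer, the choice to branch only at the top level, and the rule of comparing the fit scores of the two final classes are all assumptions made for illustration, and every name below is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Tree = Dict[str, List[str]]           # node -> child nodes (leaves map to [])
Scorer = Callable[[str, str], float]  # (node, document) -> fit, higher is better

# Hypothetical toy hierarchy and keyword lists, for illustration only.
TREE: Tree = {
    "root": ["science", "sports"],
    "science": ["physics", "biology"],
    "sports": ["football", "tennis"],
    "physics": [], "biology": [], "football": [], "tennis": [],
}
KEYWORDS = {
    "science": {"energy", "cell", "speed"},
    "sports": {"ball", "match"},
    "physics": {"energy", "speed"},
    "biology": {"cell"},
    "football": {"ball", "goal"},
    "tennis": {"ball", "match", "racket", "serve"},
}


def keyword_score(node: str, doc: str) -> float:
    """Assumed stand-in for a trained classifier: count shared keywords."""
    return float(len(KEYWORDS.get(node, set()) & set(doc.lower().split())))


def descend(tree: Tree, score: Scorer, start: str, doc: str) -> Tuple[str, float]:
    """Plain top-down descent: follow the best-scoring child until a leaf."""
    node, fit = start, score(start, doc)
    while tree.get(node):
        node = max(tree[node], key=lambda child: score(child, doc))
        fit = score(node, doc)
    return node, fit


def classify_with_alternative(tree: Tree, score: Scorer, root: str, doc: str) -> str:
    """Descend from the two best top-level children and keep the better fit."""
    ranked = sorted(tree[root], key=lambda child: score(child, doc), reverse=True)
    candidates = [descend(tree, score, child, doc) for child in ranked[:2]]
    return max(candidates, key=lambda pair: pair[1])[0]


DOC = "the speed and energy of the serve with racket and ball"
plain_choice = descend(TREE, keyword_score, "root", DOC)[0]
alternative_choice = classify_with_alternative(TREE, keyword_score, "root", DOC)
```

In this toy example the plain descent commits to the top-level class that looks best at first and cannot recover from that choice, while the variant with an alternative path also explores the second-best top-level class and returns whichever final class scores higher for the document.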
Publication details
Authors in the community:
Vítor Hugo Fernandes Sequeira
ist152990
Supervisors of this institution:
Pável Pereira Calado
ist14497
Bruno Emanuel Da Graça Martins
ist24686
Fields of Science and Technology (FOS)
Electrical engineering, electronic engineering, information engineering
Publication language (ISO code)
eng - English
Rights type:
Embargo lifted
Date available:
08/04/2012
Institution name
Instituto Superior Técnico