PhD Thesis
Classification of sequences using compression-based dissimilarity measures
2014
Key information
Authors:
Supervisors:
Published in
07/23/2014
Abstract
In the field of machine learning, the classical approach to sequence classification is based on statistical learning. This kind of problem is traditionally posed in a probabilistic framework, for which feature extraction and selection are essential to obtain the information needed to build statistical models. In practice, however, careful feature engineering and sophisticated preprocessing procedures are needed to obtain good features, and those procedures may become prohibitive for massive data collections. Moreover, the preprocessing is often task-specific and thus has to be redesigned and reapplied when the same data is used in a different application. During the last decade, researchers have tried to find alternative methods that implement so-called universal classifiers, in the sense that they do not depend on prior assumptions about the unknown sequences/sources and do not require feature extraction or selection. This thesis addresses compression-based dissimilarity measures and their use for the classification of sequences from different types of sources. We propose information-theoretic measures that exploit the concept of relative entropy, and a supervised classification method that uses this type of measure as features in a dissimilarity space. We apply the developed methods to text classification and electrocardiographic biometrics. Experimental results on public domain datasets show that the proposed dissimilarity measures and classification methods come close to or even outperform, in terms of accuracy, the state-of-the-art competitors in some benchmark problems.
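To make the idea of a dissimilarity space concrete, the following is a minimal illustrative sketch and not the measures or the classifier proposed in the thesis. It assumes a simple cross-parsing count (the number of phrases obtained when one sequence is parsed into the longest substrings found in another) as the dissimilarity, and represents a sequence by its normalized counts against a set of prototype sequences. The function names `cross_parse_count` and `dissimilarity_vector` and the prototype strings are hypothetical.

```python
from typing import List


def cross_parse_count(z: str, x: str) -> int:
    """Count the phrases obtained by parsing z sequentially into the
    longest substrings that also occur somewhere in x.
    A symbol of z that never occurs in x forms a one-symbol phrase."""
    count, i = 0, 0
    while i < len(z):
        length = 1
        # Extend the current phrase while the longer candidate still occurs in x.
        while i + length < len(z) and z[i:i + length + 1] in x:
            length += 1
        i += length
        count += 1
    return count


def dissimilarity_vector(z: str, prototypes: List[str]) -> List[float]:
    """Represent a non-empty sequence z by its phrase count per symbol
    against each prototype (illustrative normalization, an assumption)."""
    return [cross_parse_count(z, p) / len(z) for p in prototypes]


# Hypothetical prototypes standing in for two different sources.
prototypes = ["abababab", "xyzxyzxyz"]
features = dissimilarity_vector("abab", prototypes)
print(features)  # few phrases against the similar source, many against the other
```

Fewer phrases indicate that the sequence is well described by the prototype. The resulting vectors can be passed to any standard supervised classifier, which is the sense in which dissimilarities act as features.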
Publication details
Authors in the community:
Supervisors of this institution:
RENATES TID
101462603
Degree Name
Doutoramento em Engenharia Electrotécnica e de Computadores
Fields of Science and Technology (FOS)
Electrical engineering, electronic engineering, information engineering
Keywords
- Machine Learning
- Sequence Classification
- Data Compression
- Dissimilarity Space
- Dissimilarity Measure
- Relative Entropy
- Ziv-Merhav Method
- Cross-Parsing Algorithm
Publication language (ISO code)
eng - English
Rights type:
Embargo lifted
Date available:
06/10/2015
Institution name
Instituto Superior Técnico