Doctoral Thesis
Classification of sequences using compression-based dissimilarity measures
2014
Key information
Authors:
Advisors:
Published on
23/07/2014
Abstract
In the field of machine learning, the classical approach to sequence classification is based on statistical learning. This kind of problem is traditionally posed in a probabilistic framework, for which feature extraction and selection are essential to obtain the information needed to build statistical models. In practice, however, careful feature engineering and sophisticated preprocessing procedures are needed to obtain good features, and those procedures may become prohibitive for massive data collections. Moreover, the preprocessing is often task-specific and thus has to be redesigned and reapplied when the same data is used in a different application. During the last decade, researchers have tried to find alternative methods that implement so-called universal classifiers, in the sense that they do not depend on prior assumptions about the unknown sequences/sources and do not require feature extraction or selection. This thesis addresses compression-based dissimilarity measures and their use for the classification of sequences from different types of sources. We propose information-theoretic measures that exploit the concept of relative entropy, and a supervised classification method that uses this type of measure as features in a dissimilarity space. We apply the developed methods to text classification and electrocardiographic biometrics. Experimental results on public-domain datasets show that the proposed dissimilarity measures and classification methods approach or even outperform, in terms of accuracy, state-of-the-art competitors on some benchmark problems.
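As a rough illustration of the kind of compression-based measure the abstract refers to, the sketch below counts the phrases produced when one sequence is parsed against another, in the spirit of the cross-parsing idea behind the Ziv-Merhav method listed in the keywords. It then uses those counts as coordinates in a dissimilarity space. The function names, the naive substring search, and the normalization by sequence length are assumptions made for this example only; this is not the measure or the classifier proposed in the thesis.

```python
from typing import List


def cross_parse_count(z: str, x: str) -> int:
    """Count the phrases obtained when z is parsed, left to right,
    into the longest substrings that also occur somewhere in x.

    Naive illustrative implementation (repeated substring search).
    """
    count = 0
    i = 0
    n = len(z)
    while i < n:
        length = 0
        # Extend the current phrase while it still occurs in x.
        while i + length < n and z[i:i + length + 1] in x:
            length += 1
        # A symbol that never occurs in x forms a phrase of length one.
        i += max(length, 1)
        count += 1
    return count


def dissimilarity_vector(z: str, prototypes: List[str]) -> List[float]:
    """Represent z by its phrase counts against each prototype sequence.

    Dividing by len(z) is an assumed normalization so that sequences of
    different lengths are roughly comparable; it is a choice made for
    this sketch, not a formula taken from the thesis.
    """
    n = max(len(z), 1)
    return [cross_parse_count(z, p) / n for p in prototypes]


if __name__ == "__main__":
    # Hypothetical prototype sequences, one per class.
    prototypes = ["abababab", "xyzxyzxyz"]
    print(dissimilarity_vector("ababab", prototypes))
    print(dissimilarity_vector("zxyzxy", prototypes))
```

A sequence that shares long substrings with a prototype is parsed into few phrases, so its coordinate for that prototype is small. Any standard vector classifier could then be trained on these coordinates without hand-crafted feature extraction.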
Publication details
Authors from the community:
Advisors from this institution:
RENATES TID
101462603
Designation
Doctorate in Electrical and Computer Engineering
Scientific Domain (FOS)
electrical-engineering-electronic-engineering-information-engineering - Electrical Engineering, Electronic Engineering, Information Engineering
Keywords
- Machine Learning
- Sequence Classification
- Data Compression
- Dissimilarity Space
- Dissimilarity Measure
- Relative Entropy
- Ziv-Merhav Method
- Cross-Parsing Algorithm
Publication language (ISO code)
eng - English
Access to publication:
Embargo lifted
Embargo end date:
10/06/2015
Institution name
Instituto Superior Técnico