PhD Thesis

Classification of sequences using compression-based dissimilarity measures

José David Pereira Coutinho Gomes Antão, 2014

Key information

Authors:

José David Pereira Coutinho Gomes Antão

Supervisors:

Mário Alexandre Teles de Figueiredo

Published in

07/23/2014

Abstract

In machine learning, the classical approach to sequence classification is based on statistical learning. The problem is traditionally posed in a probabilistic framework, in which feature extraction and selection are essential to obtain the information needed to build statistical models. In practice, however, good features require careful feature engineering and sophisticated preprocessing, which may become prohibitive for massive data collections. Moreover, the preprocessing is often task-specific and thus has to be redesigned and reapplied when the same data is used in a different application. During the last decade, researchers have sought alternative methods that implement so-called universal classifiers, in the sense that they do not depend on prior assumptions about the unknown sequences/sources and do not require feature extraction or selection. This thesis addresses compression-based dissimilarity measures and their use for the classification of sequences from different types of sources. We propose information-theoretic measures that exploit the concept of relative entropy, and a supervised classification method that uses this type of measure as features in a dissimilarity space. We apply the developed methods to text classification and electrocardiographic biometrics. Experimental results on public domain datasets show that the proposed dissimilarity measures and classification methods approach or even outperform, in terms of accuracy, state-of-the-art competitors in some benchmark problems.
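As a rough illustration of the general idea of classifying sequences in a dissimilarity space built from a compression-based measure, the following minimal sketch may help. It is not the method proposed in the thesis: it swaps in the well-known normalized compression distance computed with zlib in place of the relative-entropy measures based on the Ziv-Merhav method, and a plain nearest-neighbour rule in place of the thesis's classifier. The function names (`ncd`, `embed`, `nearest_label`) and the choice of prototypes are assumptions made only for this example.

```python
import math
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte sequences.

    Stand-in dissimilarity for illustration only; the thesis proposes
    different, relative-entropy-based measures.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def embed(seq: bytes, prototypes: list[bytes]) -> list[float]:
    """Represent a sequence by its dissimilarities to a set of prototypes."""
    return [ncd(seq, p) for p in prototypes]


def nearest_label(seq: bytes, train_seqs: list[bytes],
                  train_labels: list[str], prototypes: list[bytes]) -> str:
    """Assign the label of the closest training sequence in dissimilarity space."""
    query = embed(seq, prototypes)
    best = min(
        range(len(train_seqs)),
        key=lambda i: math.dist(query, embed(train_seqs[i], prototypes)),
    )
    return train_labels[best]
```

In this sketch the training sequences themselves can serve as prototypes, so each sequence becomes a vector with one dissimilarity value per training sequence, and no hand-crafted features are extracted at any point.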

Publication details

RENATES TID

101462603

Degree Name

Doutoramento em Engenharia Electrotécnica e de Computadores

Fields of Science and Technology (FOS)

Electrical engineering, electronic engineering, information engineering

Keywords

  • Machine Learning
  • Sequence Classification
  • Data Compression
  • Dissimilarity Space
  • Dissimilarity Measure
  • Relative Entropy
  • Ziv-Merhav Method
  • Cross-Parsing Algorithm

Publication language (ISO code)

eng - English

Rights type:

Embargo lifted

Date available:

06/10/2015

Institution name

Instituto Superior Técnico