Doctoral Thesis

Classification of sequences using compression-based dissimilarity measures

José David Pereira Coutinho Gomes Antão, 2014

Key information

Authors:

José David Pereira Coutinho Gomes Antão

Supervisors:

Mário Alexandre Teles de Figueiredo

Published on

23/07/2014

Abstract

In the field of machine learning, the classical approach to sequence classification is based on statistical learning. This kind of problem is traditionally posed in a probabilistic framework, for which feature extraction and selection are essential to obtain the information needed to build statistical models. However, in practice, careful feature engineering and sophisticated preprocessing procedures are needed to obtain good features. Those procedures may thus become prohibitive for massive data collections. Moreover, the preprocessing is often task-specific and thus has to be redesigned and reapplied when the same data is used in a different application. During the last decade, researchers have tried to find alternative methods that implement so-called universal classifiers, in the sense that they do not depend on prior assumptions about the unknown sequences/sources and do not require feature extraction or selection. This thesis addresses compression-based dissimilarity measures and their use for the classification of sequences from different types of sources. We propose information-theoretic measures that exploit the concept of relative entropy, and a supervised classification method that uses this type of measure as features in a dissimilarity space. We apply the developed methods to text classification and electrocardiographic biometrics. Experimental results on public-domain datasets show that the proposed dissimilarity measures and classification methods approach or even outperform, in terms of accuracy, the state-of-the-art competitors in some benchmark problems.
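The general idea of classifying sequences in a dissimilarity space built from a compression-based measure can be illustrated with a minimal sketch. It is not the method proposed in the thesis: the thesis measures are based on relative entropy, whereas this sketch substitutes the well-known normalized compression distance (NCD) computed with zlib, and a plain 1-nearest-neighbour rule. All function names (compressed_size, ncd, to_dissimilarity_space, classify_1nn) are hypothetical and chosen only for illustration.

```python
import zlib
from typing import List, Sequence


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte sequences.

    Assumption: NCD is used here as a generic compression-based
    dissimilarity; it is not the relative-entropy measure of the thesis.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def to_dissimilarity_space(seq: bytes, prototypes: Sequence[bytes]) -> List[float]:
    """Represent a sequence by its dissimilarities to a fixed prototype set."""
    return [ncd(seq, p) for p in prototypes]


def classify_1nn(query: Sequence[float],
                 train_vectors: Sequence[Sequence[float]],
                 train_labels: Sequence[str]) -> str:
    """Return the label of the training vector closest to the query.

    Closeness is squared Euclidean distance in the dissimilarity space.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    best = min(range(len(train_vectors)),
               key=lambda i: sq_dist(query, train_vectors[i]))
    return train_labels[best]
```

In this sketch each sequence becomes a vector with one coordinate per prototype, so any standard vector-based classifier could replace the 1-nearest-neighbour rule. No feature extraction is performed on the raw sequences themselves.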

Publication details

RENATES TID

101462603

Designation

Doctorate in Electrical and Computer Engineering

Scientific domain (FOS)

electrical-engineering-electronic-engineering-information-engineering - Electrical Engineering, Electronic Engineering, Information Engineering

Keywords

  • Machine Learning
  • Sequence Classification
  • Data Compression
  • Dissimilarity Space
  • Dissimilarity Measure
  • Relative Entropy
  • Ziv-Merhav Method
  • Cross-Parsing Algorithm

Publication language (ISO code)

eng - English

Access to publication:

Embargo lifted

Embargo end date:

10/06/2015

Institution name

Instituto Superior Técnico