Sequential Patterns for Text Categorization

Simon Jaillet; Anne Laurent; Maguelonne Teisseire

doi:10.3233/IDA-2006-10302

Journal article, Intelligent Data Analysis, Year: 2006

Simon Jaillet

Role: Author

LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier

Anne Laurent

Role: Author
PersonId: 21743
IdHAL: anne-laurent
ORCID: 0000-0003-3708-6429
IdRef: 075173735

TATOO - Environmental data mining

Maguelonne Teisseire

Role: Author
PersonId: 8645
IdHAL: maguelonne-teisseire
ORCID: 0000-0001-9313-6414
IdRef: 117436593

TATOO - Environmental data mining

Abstract

Text categorization is a well-known task that relies essentially on statistical approaches such as neural networks, Support Vector Machines and other machine learning algorithms. Texts are generally treated as bags of words, without any order. Although these approaches have proven to be efficient, they do not provide users with comprehensible and reusable rules about their data. Such rules are, however, very important for users who need to describe trends in the data they analyze. In this framework, an association-rule based approach has been proposed by Bing Liu (CBA). In this paper, we propose to extend this approach by using sequential patterns in the SPaC method (Sequential Patterns for Classification) for text categorization. Taking order into account allows us to represent the succession of words through a document without the complex and time-consuming representations and treatments performed in natural language and grammatical methods. The original method we propose here consists of mining sequential patterns in order to build a classifier. We show experimentally that our proposal is relevant and that it compares favorably with other methods. In particular, our method outperforms CBA and provides better results than SVM on some corpora.
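
The abstract does not give the details of how SPaC mines patterns or builds its classifier, so the following is only a minimal sketch of the general idea it describes: ordered word patterns are mined per class and turned into classification rules. Everything in it is an assumption made for illustration, including the function names (`subsequences`, `train`, `classify`), the thresholds `min_support` and `min_conf`, the brute-force enumeration of subsequences of length at most 2, and the toy documents. It should not be read as the authors' algorithm.

```python
from collections import Counter, defaultdict
from itertools import combinations


def subsequences(words, max_len=2):
    """Set of ordered word subsequences (gaps allowed) of length <= max_len."""
    found = set()
    for n in range(1, max_len + 1):
        for idx in combinations(range(len(words)), n):
            found.add(tuple(words[i] for i in idx))
    return found


def train(docs, min_support=0.5, min_conf=0.6, max_len=2):
    """docs: list of (word_list, label). Returns rules, best first.

    Hypothetical rule selection: keep a pattern for a class when it occurs
    in at least min_support of that class's documents and at least min_conf
    of the documents containing it belong to that class.
    """
    per_class = defaultdict(Counter)  # label -> pattern -> document count
    overall = Counter()               # pattern -> document count, all classes
    class_sizes = Counter(label for _, label in docs)
    for words, label in docs:
        for pat in subsequences(words, max_len):
            per_class[label][pat] += 1
            overall[pat] += 1
    rules = []
    for label, counts in per_class.items():
        for pat, n in counts.items():
            support = n / class_sizes[label]
            conf = n / overall[pat]
            if support >= min_support and conf >= min_conf:
                rules.append((conf, len(pat), pat, label))
    # Highest confidence first, then longer patterns, then a stable tie-break.
    rules.sort(key=lambda r: (-r[0], -r[1], r[2], r[3]))
    return rules


def classify(rules, words, default=None, max_len=2):
    """Return the label of the first rule whose pattern occurs in words."""
    present = subsequences(words, max_len)
    for _conf, _length, pat, label in rules:
        if pat in present:
            return label
    return default


# Toy corpus, invented for this example.
docs = [
    ("stock market falls".split(), "finance"),
    ("market rally lifts stock".split(), "finance"),
    ("team wins match".split(), "sport"),
    ("match ends team loses".split(), "sport"),
]
rules = train(docs)
print(classify(rules, "stock market rally".split()))
print(classify(rules, "the team wins".split()))
```

Unlike a bag-of-words representation, the pattern `("stock", "market")` is distinct from `("market", "stock")` here, which is the sense in which order is taken into account. The enumeration is quadratic in document length for `max_len=2`, so a real sequential pattern miner would be needed beyond toy inputs.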

Domains

Databases [cs.DB]

Main file

ida245.PDF (230.9 KB)

https://hal-lirmm.ccsd.cnrs.fr/lirmm-00135010

Submitted on: Tuesday, March 6, 2007, 12:17:17

Last modified on: Tuesday, March 12, 2024, 10:43:52

Long-term archiving on: Wednesday, April 7, 2010, 01:25:15

Dates and versions

lirmm-00135010, version 1 (06-03-2007)

License

HAL authorization

Identifiers

HAL Id: lirmm-00135010, version 1
DOI: 10.3233/IDA-2006-10302

Cite

Simon Jaillet, Anne Laurent, Maguelonne Teisseire. Sequential Patterns for Text Categorization. Intelligent Data Analysis, 2006, 10 (3), pp.16. ⟨10.3233/IDA-2006-10302⟩. ⟨lirmm-00135010⟩
