Sequential Patterns for Text Categorization

Abstract: Text categorization is a well-known task that relies mainly on statistical approaches such as neural networks, Support Vector Machines and other machine learning algorithms. Texts are generally treated as bags of words, with no regard for word order. Although these approaches have proven efficient, they do not provide users with comprehensible and reusable rules about their data. Such rules are, however, very important for users who need to describe trends in the data they analyze. In this framework, an association-rule based approach (CBA) has been proposed by Bing Liu. In this paper, we propose to extend this approach by using sequential patterns in the SPaC method (Sequential Patterns for Classification) for text categorization. Taking order into account allows us to represent the succession of words through a document without the complex and time-consuming representations and treatments performed by natural language and grammatical methods. The original method we propose consists of mining sequential patterns in order to build a classifier. We show experimentally that our proposal is relevant and that it compares favorably with other methods. In particular, our method outperforms CBA and provides better results than SVM on some corpora.
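
The contrast between a bag of words and an ordered pattern can be illustrated with a small sketch. It is not the SPaC algorithm itself, whose details are not given in the abstract; the function names, the toy documents and the per-class support measure are all assumptions made for illustration. The sketch only shows how a pattern such as ("stock", "market") can be required to appear in that order, and how often it does so in each class.

```python
from collections import defaultdict


def contains_in_order(pattern, words):
    """True if the items of `pattern` occur in `words` in the same order (gaps allowed)."""
    remaining = iter(words)
    # `token in remaining` consumes the iterator up to the first match,
    # so each later token must be found after the previous one.
    return all(token in remaining for token in pattern)


def support_by_class(pattern, labeled_docs):
    """Fraction of documents of each class containing `pattern` as an ordered subsequence."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for words, label in labeled_docs:
        totals[label] += 1
        if contains_in_order(pattern, words):
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical toy corpus: (list of words, class label).
docs = [
    ("stock market falls sharply".split(), "finance"),
    ("market rallies as stock rises".split(), "finance"),
    ("team wins the final match".split(), "sport"),
]

support = support_by_class(("stock", "market"), docs)
print(support)
```

Both finance documents contain the words "stock" and "market", so a bag-of-words view cannot tell them apart on this pair, whereas the ordered pattern matches only the first one. A classifier built on such patterns could then keep those whose support is high in one class and low in the others, although the selection criteria actually used by SPaC are not stated here.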
Document type:
Journal article
Intelligent Data Analysis, IOS Press, 2006, 10 (3), pp.16

https://hal-lirmm.ccsd.cnrs.fr/lirmm-00135010
Contributor: Anne Laurent
Submitted on: Tuesday, March 6, 2007 - 12:17:17
Last modified on: Friday, October 19, 2018 - 01:14:09
Document(s) archived on: Wednesday, April 7, 2010 - 01:25:15

Identifiers

  • HAL Id: lirmm-00135010, version 1

Citation

Simon Jaillet, Anne Laurent, Maguelonne Teisseire. Sequential Patterns for Text Categorization. Intelligent Data Analysis, IOS Press, 2006, 10 (3), pp.16. 〈lirmm-00135010〉
