Classifying Words: A Syllables-based Model

Abstract: Text classification has been extensively studied by linguists and computer scientists. However, very little work addresses the classification of individual words into classes or concepts (e.g. those of a thesaurus). In this paper, we consider this problem, especially for names such as brand names or neologisms. The challenge is to provide automated tools that analyze new names by classifying them into concepts, so that, for example, the customer of a naming company can be told which concept a new name is closest to. Because a word can belong to several concepts, we adopt a top-k classification approach. Moreover, we rely on syllables to build the classification model. The word corpus is collected from a French thesaurus, and all labeled words are split into syllables. Feature selection techniques are used to select discriminative syllables: we use syllable frequency (SF) and mutual information (MI), combined with a Naive Bayes classifier and K-nearest neighbor (KNN). Instead of selecting a single class, the model selects the top-k classes, ranked by classifier score. The results show that the top-k classification model helps analyze a new word by showing that it can be related to more than one concept. Moreover, the set of discriminative syllables can be used to explain the classification results, which makes them more meaningful.
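The top-k idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes words are already split into syllables, uses a plain multinomial Naive Bayes with Laplace smoothing, and omits the SF/MI feature selection step and the KNN variant. The function names (`train_naive_bayes`, `top_k_concepts`), the concept labels, and the toy training words are all hypothetical and invented for illustration only.

```python
import math
from collections import Counter, defaultdict

# Toy data, invented for illustration: (syllables of a word, concept label).
TRAINING = [
    (["so", "leil"], "light"),
    (["lu", "miere"], "light"),
    (["lu", "eur"], "light"),
    (["ma", "ree"], "sea"),
    (["ma", "rin"], "sea"),
    (["ri", "vage"], "sea"),
    (["fo", "ret"], "nature"),
    (["fleur"], "nature"),
]


def train_naive_bayes(labeled_words):
    """Count syllable occurrences per concept and words per concept."""
    syllable_counts = defaultdict(Counter)  # concept -> syllable frequencies
    class_counts = Counter()                # concept -> number of training words
    vocabulary = set()                      # all syllables seen in training
    for syllables, concept in labeled_words:
        class_counts[concept] += 1
        syllable_counts[concept].update(syllables)
        vocabulary.update(syllables)
    return syllable_counts, class_counts, vocabulary


def top_k_concepts(syllables, model, k=2):
    """Return the k best (concept, log-score) pairs, highest score first."""
    syllable_counts, class_counts, vocabulary = model
    total_words = sum(class_counts.values())
    scores = {}
    for concept, n_words in class_counts.items():
        counts = syllable_counts[concept]
        # Laplace smoothing: add one pseudo-count per vocabulary syllable.
        denom = sum(counts.values()) + len(vocabulary)
        score = math.log(n_words / total_words)  # log prior of the concept
        for s in syllables:
            if s in vocabulary:  # syllables never seen in training are skipped
                score += math.log((counts[s] + 1) / denom)
        scores[concept] = score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


model = train_naive_bayes(TRAINING)
# A made-up new name split into syllables; it shares syllables with two concepts.
for concept, score in top_k_concepts(["lu", "ma", "rin"], model, k=2):
    print(f"{concept}: {score:.3f}")
```

Returning a ranked list rather than a single label reflects the premise that a name may be related to several concepts; the syllables that contribute most to each score could then serve as an explanation of the ranking.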
Document type:
Conference paper
DEXA'2011: 22nd International Workshop on Database and Expert Systems Applications, Aug 2011, Toulouse, France. pp.208-212, 2011, 〈10.1109/DEXA.2011.21〉

Cited literature [11 references]

https://hal-lirmm.ccsd.cnrs.fr/lirmm-00671499
Contributor: Isabelle Gouat
Submitted on: Friday, February 17, 2012 - 15:29:01
Last modified on: Thursday, January 11, 2018 - 06:26:17
Document(s) archived on: Friday, November 23, 2012 - 16:20:35

File

11_DEXA.PDF
Files produced by the author(s)

Citation

Pattaraporn Warintarawej, Anne Laurent, Pierre Pompidor, Armelle Cassanas, Bénédicte Laurent. Classifying Words: A Syllables-based Model. DEXA'2011: 22nd International Workshop on Database and Expert Systems Applications, Aug 2011, Toulouse, France. pp.208-212, 2011, 〈10.1109/DEXA.2011.21〉. 〈lirmm-00671499〉

Metrics

Record views

637

File downloads

561