Shallow Text Clustering Does Not Mean Weak Topics: How Topic Identification Can Leverage Bigram Features

Abstract: Text clustering and topic learning are two closely related tasks. In this paper, we show that topics can be learnt without strictly requiring an exact categorization of the documents. In particular, experiments on two real case studies, using a vocabulary based on bigram features, extract readable topics that cover most of the documents. Precision at 10 reaches up to 74% on a dataset of scientific abstracts with 10,000 features, which is 4% lower than with unigrams only, but the resulting topics are more interpretable.
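
As an illustration of the two notions the abstract relies on, the following minimal sketch builds a bigram vocabulary capped at a fixed number of features and computes precision at k. It is not the authors' implementation: the whitespace tokenizer, the frequency-based feature selection, the function names, and the use of the standard information-retrieval definition of precision at k are all assumptions made for this example.

```python
from collections import Counter


def bigrams(text):
    """Return adjacent word pairs from a lowercased, whitespace-tokenized text.

    Assumption: whitespace tokenization; the paper's preprocessing may differ.
    """
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def bigram_vocabulary(documents, max_features):
    """Keep the max_features most frequent bigrams across all documents.

    Assumption: features are selected by raw corpus frequency.
    """
    counts = Counter()
    for doc in documents:
        counts.update(bigrams(doc))
    return [bigram for bigram, _ in counts.most_common(max_features)]


def precision_at_k(ranked_items, relevant_items, k=10):
    """Fraction of the top-k ranked items that are relevant.

    Assumption: standard IR definition, dividing by k even when fewer
    than k items were ranked.
    """
    relevant = set(relevant_items)
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / k


# Hypothetical toy data, for illustration only.
docs = [
    "text clustering and topic learning",
    "topic learning with text clustering",
]
vocabulary = bigram_vocabulary(docs, max_features=2)
score = precision_at_k(["a", "b", "c", "d"], ["a", "c"], k=4)
```

In the setting of the abstract, `max_features` would play the role of the 10,000-feature cap and `k` would be 10.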
Document type:
Conference paper
DMNLP: Data Mining and Natural Language Processing, Sep 2016, Riva del Garda, Italy. 3rd Workshop on Interactions between Data Mining and Natural Language Processing 2016 co-located with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2016), 1646, 2016, 〈http://ceur-ws.org/Vol-1646/〉
Cited literature [29 references]

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01362434
Contributor: Pascal Poncelet
Submitted on: Thursday, 8 September 2016 - 17:20:55
Last modified on: Wednesday, 10 October 2018 - 14:28:11
Document(s) archived on: Friday, 9 December 2016 - 13:10:17

File

dmnlp revisedJulienVelcin2016....
Files produced by the author(s)

Identifiers

  • HAL Id : lirmm-01362434, version 1

Citation

Julien Velcin, Mathieu Roche, Pascal Poncelet. Shallow Text Clustering Does Not Mean Weak Topics: How Topic Identification Can Leverage Bigram Features. DMNLP: Data Mining and Natural Language Processing, Sep 2016, Riva del Garda, Italy. 3rd Workshop on Interactions between Data Mining and Natural Language Processing 2016 co-located with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2016), 1646, 2016, 〈http://ceur-ws.org/Vol-1646/〉. 〈lirmm-01362434〉
