Shallow Text Clustering Does Not Mean Weak Topics: How Topic Identification Can Leverage Bigram Features

Abstract : Text clustering and topic learning are two closely related tasks. In this paper, we show that topics can be learnt without requiring an exact categorization. In particular, experiments on two real case studies, using a vocabulary based on bigram features, yield readable topics that cover most of the documents. Precision at 10 reaches up to 74% on a dataset of scientific abstracts with 10,000 features; this is 4% lower than with unigrams only, but the resulting topics are more interpretable.
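
The two notions the abstract relies on, a bigram vocabulary capped at a fixed number of features and precision at 10, can be illustrated with a minimal sketch. This is not the authors' pipeline: the whitespace tokenization, the frequency-based feature selection, the function names, and the standard information-retrieval definition of precision at k are all assumptions made for illustration.

```python
from collections import Counter


def bigrams(tokens):
    """Return the adjacent word pairs of a token list."""
    return list(zip(tokens, tokens[1:]))


def bigram_vocabulary(docs, max_features):
    """Keep the max_features most frequent bigrams over all documents.

    Assumption: lowercase whitespace tokenization and raw frequency as
    the selection criterion; the paper may use a different scheme.
    """
    counts = Counter()
    for doc in docs:
        counts.update(bigrams(doc.lower().split()))
    return [bigram for bigram, _ in counts.most_common(max_features)]


def precision_at_k(ranked_items, relevant, k=10):
    """Fraction of the first k ranked items that belong to `relevant`.

    Assumption: the usual definition, dividing by k even when fewer
    than k items are available.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / k


docs = [
    "text clustering and topic learning",
    "topic learning with text clustering",
]
vocabulary = bigram_vocabulary(docs, max_features=2)
score = precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4)
```

Here `vocabulary` keeps the two bigrams that occur in both toy documents, and `score` is the share of relevant items among the first four ranked ones.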
Document type : Conference paper

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01362434
Contributor : Pascal Poncelet
Submitted on : Thursday, September 8, 2016 - 5:20:55 PM
Last modification on : Wednesday, November 20, 2019 - 3:04:59 AM
Long-term archiving on : Friday, December 9, 2016 - 1:10:17 PM

File

dmnlp revisedJulienVelcin2016....
Files produced by the author(s)

Identifiers

  • HAL Id : lirmm-01362434, version 1

Citation

Julien Velcin, Mathieu Roche, Pascal Poncelet. Shallow Text Clustering Does Not Mean Weak Topics: How Topic Identification Can Leverage Bigram Features. DMNLP: Data Mining and Natural Language Processing, Sep 2016, Riva del Garda, Italy. ⟨lirmm-01362434⟩
