Conference papers

Semantic-Based Multilingual Document Clustering via Tensor Modeling

Abstract: A major challenge in document clustering research arises from the growing amount of text data written in different languages. Previous approaches depend on language-specific solutions (e.g., bilingual dictionaries, sequential machine translation) to evaluate document similarities, and the required transformations may alter the original document semantics. To cope with this issue, we propose a new document clustering approach for multilingual corpora that (i) exploits a large-scale multilingual knowledge base, (ii) takes advantage of the multi-topic nature of the text documents, and (iii) employs a tensor-based model to deal with high dimensionality and sparseness. Results show the significance of our approach and its better performance with respect to classic document clustering approaches, in evaluations on both a balanced and an unbalanced corpus.

Contributor: Dino Ienco
Submitted on: Monday, December 7, 2015 - 3:23:53 PM
Last modification on: Friday, August 5, 2022 - 3:02:49 PM
Long-term archiving on: Tuesday, March 8, 2016 - 2:23:05 PM





Salvatore Romeo, Andrea Tagarelli, Dino Ienco. Semantic-Based Multilingual Document Clustering via Tensor Modeling. EMNLP: Empirical Methods in Natural Language Processing, Oct 2014, Doha, Qatar. pp.600-609, ⟨10.3115/v1/D14-1065⟩. ⟨lirmm-01239231⟩


