Semantic-Based Multilingual Document Clustering via Tensor Modeling

Abstract: A major challenge in document clustering research arises from the growing amount of text data written in different languages. Previous approaches depend on language-specific solutions (e.g., bilingual dictionaries, sequential machine translation) to evaluate document similarities, and the required transformations may alter the original document semantics. To cope with this issue, we propose a new document clustering approach for multilingual corpora that (i) exploits a large-scale multilingual knowledge base, (ii) takes advantage of the multi-topic nature of the text documents, and (iii) employs a tensor-based model to deal with high dimensionality and sparseness. Results show the significance of our approach and its better performance compared with classic document clustering approaches, in both balanced and unbalanced corpus evaluations.
Document type: Conference paper
EMNLP: Empirical Methods in Natural Language Processing, Oct 2014, Doha, Qatar. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), October 25-29, 2014, Doha, Qatar, pp.600-609, 2014, 〈10.3115/v1/D14-1065〉

Cited literature: 31 references

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01239231
Contributor: Dino Ienco
Submitted on: Monday, 7 December 2015 - 15:23:53
Last modified on: Wednesday, 18 April 2018 - 14:24:05
Document(s) archived on: Tuesday, 8 March 2016 - 14:23:05

File

585_Paper.pdf
Files produced by the author(s)

Citation

Salvatore Romeo, Andrea Tagarelli, Dino Ienco. Semantic-Based Multilingual Document Clustering via Tensor Modeling. EMNLP: Empirical Methods in Natural Language Processing, Oct 2014, Doha, Qatar. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), October 25-29, 2014, Doha, Qatar, pp.600-609, 2014, 〈10.3115/v1/D14-1065〉. 〈lirmm-01239231〉

Metrics

Record views: 136
File downloads: 165