The Impact of Corpus Quality and Type on Topic based Text Segmentation Evaluation

Abstract: In this paper, we try to assess the real impact of corpus quality on the performance of methods and on their evaluation. The task considered is topic-based text segmentation, and two very different unsupervised algorithms are compared: C99, a word-based system augmented with LSA, and Transeg, a sentence-based system. Two main characteristics of corpora have been investigated: data quality (clean vs. raw corpora) and corpus manipulation (natural vs. artificial data sets). Corpus size has also been varied, and the experiments reported in this paper show that corpus characteristics strongly affect recall and precision values for both algorithms.
Document type:
Conference paper
CLA'08: Computational Linguistic Association, Oct 2008, pp.7, 2008

https://hal-lirmm.ccsd.cnrs.fr/lirmm-00336165
Contributor: Alexandre Labadié
Submitted on: Monday, 3 November 2008 - 09:18:19
Last modified on: Thursday, 11 January 2018 - 06:26:53
Document(s) archived on: Monday, 7 June 2010 - 22:37:39

File

CLA.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : lirmm-00336165, version 1

Citation

Alexandre Labadié, Violaine Prince. The Impact of Corpus Quality and Type on Topic based Text Segmentation Evaluation. CLA'08: Computational Linguistic Association, Oct 2008, pp.7, 2008. 〈lirmm-00336165〉

Metrics

Record views

116

File downloads

123