Biomedical Terminology Extraction: A new combination of Statistical and Web Mining Approaches

Juan Antonio Lossio-Ventura 1, * Clement Jonquet 2, 3 Mathieu Roche 1, 4 Maguelonne Teisseire 4, 1
* Corresponding author
1 ADVANSE - ADVanced Analytics for data SciencE
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier
2 SMILE - Système Multi-agent, Interaction, Langage, Evolution
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier
Abstract: The objective of this work is to combine statistical and web mining methods for the automatic extraction and ranking of biomedical terms from free text. We present new extraction methods that use linguistic patterns specialized for the biomedical field together with term extraction measures, such as C-value, and keyword extraction measures, such as Okapi BM25 and TFIDF. We propose several combinations of these measures to improve the extraction and ranking process, and we investigate which combinations are most relevant in different cases. Each measure yields a ranked list of candidate terms, which we finally re-rank with a new web-based measure. Our experiments show, first, that an appropriate harmonic mean of C-value and keyword extraction measures yields better precision than these measures used alone, for the extraction of both single-word and multi-word terms; and second, that the best precision is often obtained when we re-rank using the web-based measure. We illustrate our results on the extraction of English and French biomedical terms from a corpus of laboratory tests available online in both languages. The results are validated using UMLS (for English) and MeSH only (for French) as reference dictionaries.
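As an illustration of the combination step mentioned in the abstract, the following is a minimal sketch of ranking candidate terms by the harmonic mean of two scores. It is not the authors' implementation: the function names, the division-by-maximum normalisation, and the assumption that both measures have already been computed for each candidate term are all hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple


def normalise_by_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative scores to [0, 1] by dividing by the maximum.

    Assumption for this sketch: the two measures live on different scales,
    so they are rescaled before being combined.
    """
    top = max(scores.values(), default=0.0)
    if top <= 0:
        return {term: 0.0 for term in scores}
    return {term: value / top for term, value in scores.items()}


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative scores; 0.0 when both are 0."""
    total = a + b
    return 2 * a * b / total if total > 0 else 0.0


def rank_by_harmonic_mean(
    c_value: Dict[str, float], keyword_score: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Rank the terms scored by both measures, best first.

    `c_value` and `keyword_score` (e.g. TFIDF or Okapi BM25) are assumed
    to be precomputed elsewhere; ties are broken alphabetically.
    """
    c_norm = normalise_by_max(c_value)
    k_norm = normalise_by_max(keyword_score)
    shared = c_norm.keys() & k_norm.keys()
    combined = {t: harmonic_mean(c_norm[t], k_norm[t]) for t in shared}
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))
```

Calling `rank_by_harmonic_mean` with two dictionaries that map candidate terms to scores returns a list of `(term, combined_score)` pairs. Because the harmonic mean is dominated by the smaller of its two inputs, a term ranks highly only when both measures score it well.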
Document type:
Conference paper
JADT: Journées d'Analyse statistique des Données Textuelles, Jun 2014, Paris, France. pp.421-432, 2014, 〈http://www.aftal.fr/jadt2014/?page_id=140〉

Cited literature [25 references]

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01056598
Contributor: Juan Antonio Lossio Ventura
Submitted on: Wednesday, September 17, 2014 - 05:42:10
Last modified on: Thursday, January 11, 2018 - 06:27:21
Document(s) archived on: Thursday, December 18, 2014 - 10:16:08

File

JADT2014.pdf
Files produced by the author(s)

Identifiers

  • HAL Id: lirmm-01056598, version 2

Citation

Juan Antonio Lossio-Ventura, Clement Jonquet, Mathieu Roche, Maguelonne Teisseire. Biomedical Terminology Extraction: A new combination of Statistical and Web Mining Approaches. JADT: Journées d'Analyse statistique des Données Textuelles, Jun 2014, Paris, France. pp.421-432, 2014, 〈http://www.aftal.fr/jadt2014/?page_id=140〉. 〈lirmm-01056598v2〉

Metrics

Record views

448

File downloads

341