How to extract unit of measure in scientific documents?

A large amount of quantitative data from experimental results is reported in scientific documents as free text. Each quantitative result is characterized by a numerical value, often followed by a unit of measure. Automatically extracting quantitative data is a painstaking process because units are written in many different ways within documents. In this paper, we focus on extracting and identifying unit variants in order to iteratively enrich the terminological part of an Ontological and Terminological Resource (OTR) and, ultimately, to enable the extraction of quantitative data. Unit extraction involves two main steps. Because we work on unstructured documents, units are buried in textual information. In the first step, our method uses supervised learning to handle the crucial and time-consuming task of locating units. Once the units have been located in the text, the second step extracts and identifies candidate units in order to enrich the OTR. The extracted candidates are compared to units already known in the OTR using a new string distance measure to determine whether they are relevant variants. We ran conclusive experiments with our two-step method on a set of more than 35,000 sentences.
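The second step described above, comparing an extracted candidate to units already known in the OTR, can be illustrated with a minimal sketch. The abstract does not define the paper's new string distance measure, so this sketch substitutes a plain normalized Levenshtein distance as a stand-in. The unit strings in `KNOWN_UNITS`, the function names, and the `max_distance` threshold are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical unit labels standing in for the terminological part of an OTR.
KNOWN_UNITS = ["cm3.m-2.day-1", "kg.m-3", "mol.L-1"]


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, and substitutions cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def closest_known_unit(
    candidate: str,
    known_units: Iterable[str],
    max_distance: float = 0.5,  # assumed threshold, not taken from the paper
) -> Optional[Tuple[str, float]]:
    """Return (unit, distance) for the nearest known unit, or None if too far."""
    best: Optional[Tuple[str, float]] = None
    for unit in known_units:
        distance = normalized_distance(candidate, unit)
        if best is None or distance < best[1]:
            best = (unit, distance)
    if best is not None and best[1] <= max_distance:
        return best
    return None
```

Under these assumptions, `closest_known_unit("mol/L", KNOWN_UNITS)` proposes `"mol.L-1"` as the known unit that the candidate may be a variant of, while a token unrelated to any unit, such as `"sample"`, is rejected with `None`. Case is preserved deliberately, since case is often significant in unit symbols.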

Keywords

Ontological and Terminological Resource; Unit of measure extraction; Machine learning; Information retrieval

Domains

Web

Main file

SSTM_2013_4_CR.pdf (316.49 KB)

Origin: Files produced by the author(s)

Contributor: Patrice Buche

https://hal-lirmm.ccsd.cnrs.fr/lirmm-00903771

Submitted on: Tuesday, November 19, 2019, 13:17:30

Last modified on: Thursday, November 14, 2024, 16:18:04

Long-term archiving on: Thursday, February 20, 2020, 18:34:25

Dates and versions

lirmm-00903771 , version 1 (19-11-2019)

Identifiers

HAL Id: lirmm-00903771, version 1
DOI: 10.5220/0004666302490256
PRODINRA: 279667

Cite

Soumia Lilia Berrahou, Patrice Buche, Juliette Dibie-Barthelemy, Mathieu Roche. How to extract unit of measure in scientific documents?. KDIR: Knowledge Discovery and Information Retrieval, Sep 2013, Vilamoura, Portugal. pp.454-459, ⟨10.5220/0004666302490256⟩. ⟨lirmm-00903771⟩

Collections

CIRAD AGROPARISTECH CNRS INRIA INRA IATE TEXTE GRAPHIK LIRMM MIA-PARIS AGROPOLIS INRIA2 MIPS BA UNIV-MONTPELLIER INSTITUT-AGRO-MONTPELLIER INRAE INRAEOCCITANIEMONTPELLIER MATHNUM
