OryzaGP: rice gene and protein dataset for named-entity recognition

Pierre Larmande; Huy Do; Yue Wang

doi:10.5808/GI.2019.17.2.e17

Journal article, Genomics & Informatics, Year: 2019

Pierre Larmande

Role: Author
PersonId: 753219
IdHAL: pierre-larmande
ORCID: 0000-0002-2923-9790
IdRef: 122767802

Diversité, adaptation, développement des plantes

South Green Bioinformatics Platform [Montpellier]

Huy Do

Role: Author
PersonId: 786136
ORCID: 0000-0003-2588-2858

University of Science and Technology of Hanoi

Yue Wang

Role: Author
PersonId: 767474
ORCID: 0000-0001-6230-3275

Database Center for Life Science [Chiba]

Abstract

Text mining has become an important research method in biology. Its original purpose was to extract biological entities, such as genes, proteins, and phenotypic traits, from scientific papers in order to extend knowledge. However, few thorough studies of text mining and application development have been performed for plant molecular biology data, especially for rice, which has resulted in a lack of datasets available for named-entity recognition tasks in this species. Because benchmarks for rice are scarce, we faced various difficulties in applying advanced machine learning methods to the accurate analysis of the rice literature. To evaluate several approaches to automatically extracting information about gene/protein entities, we built a new dataset for rice as a benchmark. The dataset consists of titles and abstracts of scientific papers focusing on rice, downloaded from PubMed. During the 5th Biomedical Linked Annotation Hackathon, a portion of the dataset was uploaded to PubAnnotation for sharing. Our ultimate goal is to use the dataset to offer a shared task on rice gene/protein name recognition through the BioNLP Open Shared Tasks framework, to facilitate open comparison and evaluation of different approaches to the task.

Keywords

Named-entity recognition; Natural language processing; Oryza sativa; Plant molecular biology; Rice; Text mining

Domains

Bioinformatics [q-bio.QM]

Main file

main.pdf (155.38 KB)

gi-2019-17-2-e17-E1.pdf (196.62 KB)

Origin: Publication funded by an institution

https://hal-lirmm.ccsd.cnrs.fr/lirmm-03615384

Submitted on: Monday, 4 November 2024, 10:56:28

Last modified on: Thursday, 28 November 2024, 03:15:03

Dates and versions

lirmm-03615384, version 1 (21-03-2022)

lirmm-03615384, version 2 (04-11-2024)

License

Attribution

Identifiers

HAL Id: lirmm-03615384, version 2
DOI: 10.5808/GI.2019.17.2.e17
IRD: fdi:010084042
PUBMEDCENTRAL: PMC6808627

Cite

Pierre Larmande, Huy Do, Yue Wang. OryzaGP: rice gene and protein dataset for named-entity recognition. Genomics & Informatics, 2019, 17 (2), pp.e17. ⟨10.5808/GI.2019.17.2.e17⟩. ⟨lirmm-03615384v2⟩

Collections

IRD CIRAD CNRS LIRMM AGROPOLIS BA UNIV-MONTPELLIER

84 views

16 downloads
