Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering

Renan Souza; Leonardo Azevedo; Vítor Lourenço; Elton Soares; Raphael Thiago; Rafael Brandão; Daniel Civitarese; Emilio Vital Brazil; Marcio Moreno; Patrick Valduriez; Marta Mattoso; Renato Cerqueira; Marco Netto

Conference paper. Year: 2019

Renan Souza

Role: Author
PersonId : 1170591
ORCID : 0000-0002-1794-808X

Instituto Alberto Luiz Coimbra de Pós-Graduação e Pesquisa de Engenharia

IBM Research [Rio de Janeiro]

Leonardo Azevedo

Role: Author

IBM Research [Rio de Janeiro]

Vítor Lourenço

Role: Author

IBM Research [Rio de Janeiro]

Elton Soares

Role: Author

IBM Research [Rio de Janeiro]

Raphael Thiago

Role: Author

IBM Research [Rio de Janeiro]

Rafael Brandão

Role: Author

IBM Research [Rio de Janeiro]

Daniel Civitarese

Role: Author

IBM Research [Rio de Janeiro]

Emilio Vital Brazil

Role: Author

IBM Research [Rio de Janeiro]

Marcio Moreno

Role: Author

IBM Research [Rio de Janeiro]

Patrick Valduriez

Role: Author
PersonId : 172604
IdHAL : patrick-valduriez
ORCID : 0000-0001-6506-7538
IdRef : 028314417

Scientific Data Management

Marta Mattoso

Role: Author
PersonId : 863196
ORCID : 0000-0002-0870-3371

Instituto Alberto Luiz Coimbra de Pós-Graduação e Pesquisa de Engenharia

Renato Cerqueira

Role: Author

IBM Research [Rio de Janeiro]

Marco Netto

Role: Author

IBM Research [Rio de Janeiro]

Abstract

Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes unfeasible to recreate an ML model from scratch or to explain to stakeholders how it was created. The main limitation of provenance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows in the lifecycle while keeping the provenance capture overhead low. To handle this problem, this paper contributes a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that tracks provenance from multiple workflows to address the characteristics of ML and CSE and to allow for provenance queries with a standard vocabulary. We show a practical use in a real case in the Oil and Gas (O&G) industry, along with its evaluation using 48 GPUs in parallel.

Keywords

Machine Learning Lifecycle; Workflow Provenance; Computational Science and Engineering

Domains

Databases [cs.DB]

Main file

provlake_ml_preprint_notice.pdf (991.56 KB)

Origin: Files produced by the author(s)

Contributor: Patrick Valduriez

https://hal-lirmm.ccsd.cnrs.fr/lirmm-02335500

Submitted on: Monday, October 28, 2019, 12:36:56

Last modified on: Thursday, February 1, 2024, 10:05:32

Long-term archiving on: Wednesday, January 29, 2020, 15:50:42

Dates and versions

lirmm-02335500 , version 1 (28-10-2019)

Identifiers

HAL Id : lirmm-02335500 , version 1

Cite

Renan Souza, Leonardo Azevedo, Vítor Lourenço, Elton Soares, Raphael Thiago, et al. Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering. WORKS 2019 - Workflows in Support of Large-Scale Science co-located with SC 2019 - ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis, Nov 2019, Denver, United States. pp.10. ⟨lirmm-02335500⟩

Collections

UNIV-RENNES1 CNRS INRIA IRISA ZENITH LIRMM INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC MIPS UNIV-MONTPELLIER UNIV-RENNES UR1-MATH-NUM

97 Views

340 Downloads
