Workflow Provenance in the Lifecycle of Scientific Machine Learning

Renan Souza; Leonardo G Azevedo; Vítor Lourenço; Elton Soares; Raphael Thiago; Rafael Brandão; Daniel Civitarese; Emilio Vital Brazil; Marcio Moreno; Patrick Valduriez; Marta Mattoso; Renato Cerqueira; Marco A S Netto

doi:10.1002/cpe.6544

Journal article, Concurrency and Computation: Practice and Experience, Year: 2022

Renan Souza

Role: Author
PersonId : 1170591
ORCID : 0000-0002-1794-808X

IBM Research [Rio de Janeiro]

Leonardo G Azevedo

Role: Author

IBM Research [Rio de Janeiro]

Vítor Lourenço

Role: Author

IBM Research [Rio de Janeiro]

Elton Soares

Role: Author

IBM Research [Rio de Janeiro]

Raphael Thiago

Role: Author

IBM Research [Rio de Janeiro]

Rafael Brandão

Role: Author

IBM Research [Rio de Janeiro]

Daniel Civitarese

Role: Author

IBM Research [Rio de Janeiro]

Emilio Vital Brazil

Role: Author

IBM Research [Rio de Janeiro]

Marcio Moreno

Role: Author

IBM Research [Rio de Janeiro]

Patrick Valduriez

Role: Author
PersonId : 172604
IdHAL : patrick-valduriez
ORCID : 0000-0001-6506-7538
IdRef : 028314417

Scientific Data Management

Marta Mattoso

Role: Author
PersonId : 863196
ORCID : 0000-0002-0870-3371

Universidade Federal do Rio de Janeiro (Federal University of Rio de Janeiro) [Brazil]

Renato Cerqueira

Role: Author

IBM Research [Rio de Janeiro]

Marco A S Netto

Role: Author

IBM Research [Rio de Janeiro]

Abstract

Machine Learning (ML) has already fundamentally changed several businesses. More recently, it has also been profoundly impacting computational science and engineering domains such as geoscience, climate science, and health science. In these domains, users need to perform comprehensive data analyses that combine scientific data and ML models to meet critical requirements such as reproducibility, model explainability, and experiment data understanding. However, scientific ML is multidisciplinary, heterogeneous, and affected by the physical constraints of the domain, making such analyses even more challenging. In this work, we leverage workflow provenance techniques to build a holistic view to support the lifecycle of scientific ML. We contribute (i) a characterization of the lifecycle and a taxonomy for data analyses; (ii) design decisions to build this view, with a W3C PROV compliant data representation and a reference system architecture; and (iii) lessons learned after an evaluation in an Oil & Gas case using an HPC cluster with 393 nodes and 946 GPUs. The experiments show that these decisions enable queries that integrate domain semantics with ML models while keeping low overhead (<1%), high scalability, and an order of magnitude of query acceleration under certain workloads compared with not using our representation.
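
To make the idea of a W3C PROV compliant representation more concrete, the following is a minimal sketch in Python. It is not the authors' system or data model: the `ProvStore` class, the entity and activity names, and all attribute values are hypothetical, chosen only to show how PROV-style `used` and `wasGeneratedBy` relations let a query trace a trained model back to the scientific data it came from.

```python
# Illustrative sketch only: a tiny in-memory provenance store that borrows
# W3C PROV relation names. All identifiers and values are hypothetical.
from collections import defaultdict


class ProvStore:
    """Stores PROV-style statements: entities, activities, and relations."""

    def __init__(self):
        self.entities = {}                # entity id -> attributes
        self.activities = {}              # activity id -> attributes
        self.used = defaultdict(set)      # activity id -> input entity ids
        self.was_generated_by = {}        # entity id -> producing activity id

    def entity(self, eid, **attrs):
        self.entities[eid] = attrs

    def activity(self, aid, inputs=(), outputs=(), **attrs):
        self.activities[aid] = attrs
        for eid in inputs:
            self.used[aid].add(eid)            # prov:used
        for eid in outputs:
            self.was_generated_by[eid] = aid   # prov:wasGeneratedBy

    def lineage(self, eid):
        """Return every entity that `eid` was transitively derived from."""
        seen, stack = set(), [eid]
        while stack:
            current = stack.pop()
            producer = self.was_generated_by.get(current)
            if producer is None:
                continue  # raw input: no recorded producing activity
            for source in self.used[producer]:
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


store = ProvStore()
store.entity("seismic_raw", domain="geoscience")
store.entity("training_set", n_samples=1000)
store.entity("model_v1", val_accuracy=0.9)
store.activity("preprocess", inputs=["seismic_raw"], outputs=["training_set"])
store.activity("train", inputs=["training_set"], outputs=["model_v1"],
               learning_rate=0.001)

# Which data does the trained model depend on?
print(sorted(store.lineage("model_v1")))
```

Because domain attributes (here, `domain="geoscience"`) and ML attributes (here, `learning_rate`) live in the same graph, a single traversal can relate a model to both, which is the kind of integrated query the abstract describes.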

Keywords

Scientific Machine Learning; Machine Learning Lifecycle; Artificial Intelligence; Data Science; Provenance; Lineage; Reproducibility; Explainability; Scientific Workflow; Data lake; e-Science; Design Principles; Taxonomy

Domains

Databases [cs.DB]

Main file

Workflow_Provenance_in_the_Lifecycle_of_Scientific_Machine_Learning.pdf (664.26 KB)

Origin: Files produced by the author(s)

Contributor: Patrick Valduriez

https://hal-lirmm.ccsd.cnrs.fr/lirmm-03324881

Submitted on: Tuesday, August 24, 2021, 10:13:06

Last modified on: Friday, December 6, 2024, 03:19:58

Long-term archiving on: Friday, November 26, 2021, 09:24:06

Dates and versions

lirmm-03324881 , version 1 (24-08-2021)

Identifiers

HAL Id : lirmm-03324881 , version 1
DOI : 10.1002/cpe.6544

Cite

Renan Souza, Leonardo G Azevedo, Vítor Lourenço, Elton Soares, Raphael Thiago, et al. Workflow Provenance in the Lifecycle of Scientific Machine Learning. Concurrency and Computation: Practice and Experience, 2022, 34 (14), pp.e6544. ⟨10.1002/cpe.6544⟩. ⟨lirmm-03324881⟩

Collections

CNRS INRIA ZENITH LIRMM INRIA2 UNIV-MONTPELLIER INRIA-BRASIL

130 views

224 downloads
