Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering - LIRMM - Laboratoire d’Informatique, de Robotique et de Microélectronique de Montpellier
Conference Papers Year : 2019

Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering

Abstract

Machine Learning (ML) has become essential in several industries. In Computational Science and Engineering (CSE), the complexity of the ML lifecycle comes from the large variety of data, scientists' expertise, tools, and workflows. If data are not tracked properly during the lifecycle, it becomes unfeasible to recreate a ML model from scratch or to explain to stackholders how it was created. The main limitation of prove-nance tracking solutions is that they cannot cope with provenance capture and integration of domain and ML data processed in the multiple workflows in the lifecycle, while keeping the provenance capture overhead low. To handle this problem, in this paper we contribute with a detailed characterization of provenance data in the ML lifecycle in CSE; a new provenance data representation, called PROV-ML, built on top of W3C PROV and ML Schema; and extensions to a system that tracks provenance from multiple workflows to address the characteristics of ML and CSE, and to allow for provenance queries with a standard vocabulary. We show a practical use in a real case in the O&G industry, along with its evaluation using 48 GPUs in parallel.
Fichier principal
Vignette du fichier
provlake_ml_preprint_notice.pdf (991.56 Ko) Télécharger le fichier
Origin Files produced by the author(s)
Loading...

Dates and versions

lirmm-02335500 , version 1 (28-10-2019)

Identifiers

  • HAL Id : lirmm-02335500 , version 1

Cite

Renan Souza, Leonardo Azevedo, Vítor Lourenço, Elton Soares, Raphael Thiago, et al.. Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering. WORKS 2019 - Workflows in Support of Large-Scale Science co-located with SC 2019 - ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis, Nov 2019, Denver, United States. pp.10. ⟨lirmm-02335500⟩
88 View
334 Download

Share

More