Conference papers

Efficient Runtime Capture of Multiworkflow Data Using Provenance

Renan Souza 1, 2 Leonardo Azevedo 2 Raphael Thiago 2 Elton Soares 2 Marcelo Nery 2 Marco Netto 2 Emilio Vital Brazil 2 Renato Cerqueira 2 Patrick Valduriez 3 Marta Mattoso 1 
3 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : Computational Science and Engineering (CSE) projects are typically developed by multidisciplinary teams. Despite being part of the same project, each team manages its own workflows, using specific execution environments and data processing tools. Analyzing the data processed by all workflows globally is a core task in a CSE project. However, this analysis is hard because the data generated by these workflows are not integrated. In addition, since these workflows may take a long time to execute, data analysis needs to be done at runtime to reduce the cost and duration of the CSE project. A typical solution in scientific data analysis is to capture and relate the data in a provenance database while the workflows run, thus allowing for data analysis at runtime. The main problem with this solution is that data capture competes with the running workflows for resources, adding significant overhead to their execution. To mitigate this problem, we introduce a system called ProvLake, which adopts design principles for efficient distributed data capture from the workflows. While capturing the data, ProvLake logically integrates them and ingests them into a provenance database, ready for analysis at runtime. We validated ProvLake in a real use case in the oil and gas (O&G) industry encompassing four workflows that process 5 TB datasets for a deep learning classifier. Compared with Komadu, the closest solution that meets our goals, our approach enables runtime multiworkflow data analysis with much lower overhead (for example, 0.1%).

Contributor : Patrick Valduriez
Submitted on : Monday, August 12, 2019 - 9:23:31 PM
Last modification on : Friday, August 5, 2022 - 3:03:28 PM
Long-term archiving on : Thursday, January 9, 2020 - 7:09:15 PM


Renan Souza, Leonardo Azevedo, Raphael Thiago, Elton Soares, Marcelo Nery, et al. Efficient Runtime Capture of Multiworkflow Data Using Provenance. eScience 2019 - 15th International Conference on eScience, Sep 2019, San Diego, United States. pp. 359-368, ⟨10.1109/eScience.2019.00047⟩. ⟨lirmm-02265932⟩


