TARDIS: Optimal Execution of Scientific Workflows in Apache Spark

Daniel Gaspar (1), Fabio Porto (1), Reza Akbarinia (2), Esther Pacitti (2)
(2) ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract: The success of workflows for modeling large-scale scientific applications has fostered research on the parallel execution of scientific workflows in shared-nothing clusters, in which large volumes of scientific data can be stored and processed in parallel on ordinary machines. However, most current scientific workflow management systems do not handle memory and data locality appropriately. Apache Spark addresses these issues by chaining activities that should be executed on a specific node, along with other optimizations such as in-memory storage of intermediate data in RDDs (Resilient Distributed Datasets). To take advantage of RDDs, however, Spark requires existing workflows to be described using its own API, which forces the activities to be reimplemented in Python, Java, Scala, or R and demands considerable effort from workflow programmers. In this paper, we propose a parallel scientific workflow engine called TARDIS, whose objective is to run existing workflows inside a Spark cluster, using RDDs and smart caching, in a way that is completely transparent to the user, i.e., without the need to reimplement the workflows in the Spark API. We evaluated our system experimentally and compared its performance with that of Swift/K. The results show that TARDIS performs better than Swift/K (up to 138% improvement) for parallel scientific workflow execution.
Document type:
Conference paper
DaWaK 2017: Data Warehousing and Knowledge Discovery, Aug 2017, Lyon, France. 19th International Conference on Big Data Analytics and Knowledge Discovery, pp.74-87, 2017, LNCS. 〈10.1007/978-3-319-64283-3_6〉

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01620060
Contributor: Reza Akbarinia
Submitted on: Friday, October 20, 2017 - 10:16:38
Last modified on: Thursday, January 11, 2018 - 17:01:51

File

tardis-optimal-execution.pdf
Files produced by the author(s)

Citation

Daniel Gaspar, Fabio Porto, Reza Akbarinia, Esther Pacitti. TARDIS: Optimal Execution of Scientific Workflows in Apache Spark. DaWaK 2017: Data Warehousing and Knowledge Discovery, Aug 2017, Lyon, France. 19th International Conference on Big Data Analytics and Knowledge Discovery, pp.74-87, 2017, LNCS. 〈10.1007/978-3-319-64283-3_6〉. 〈lirmm-01620060〉
