FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

Miguel Liroz-Gistau (1), Reza Akbarinia (1), Patrick Valduriez (2, 1)
(1) ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract: Big data parallel frameworks such as MapReduce and Spark have been praised for their high scalability and performance, but they perform poorly in the presence of data skew. In important cases, a high percentage of the reduce-side processing ends up being done by a single node. In this demonstration, we illustrate the use of FP-Hadoop, a system that deals efficiently with data skew in MapReduce jobs. FP-Hadoop introduces a new phase, called intermediate reduce (IR), in which dynamically constructed blocks of intermediate values are processed in parallel by intermediate reduce workers according to a scheduling strategy. In the IR phase, even if all intermediate values belong to a single key, the main part of the reducing work can be done in parallel using the computing resources of all available workers. We implemented a prototype of FP-Hadoop and conducted extensive experiments over synthetic and real datasets. We achieve excellent performance gains compared to native Hadoop, e.g., more than 10 times in reduce time and 5 times in total execution time. During our demonstration, users can execute jobs in both FP-Hadoop and Hadoop and compare the executions. They can retrieve general information about each job and its tasks, along with a summary of the phases. They can also visually compare different configurations to explore the differences between the two approaches.
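The intermediate reduce idea described in the abstract can be illustrated with a minimal Python sketch. This is not FP-Hadoop code and does not use any Hadoop API: the function names (`split_into_blocks`, `intermediate_reduce`, `final_reduce`, `reduce_skewed_key`), the fixed block size, and the thread pool standing in for IR workers are all assumptions made for illustration. The abstract says that blocks are constructed dynamically and assigned by a scheduling strategy, which this sketch simplifies away. It also assumes a reduce function (here, a sum) whose result does not depend on how the values are grouped.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_into_blocks(values: Sequence[int], block_size: int) -> List[Sequence[int]]:
    """Cut the intermediate values of ONE key into blocks.

    Assumption: fixed-size blocks; the real system builds blocks dynamically.
    """
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def intermediate_reduce(block: Sequence[int]) -> int:
    """Hypothetical IR function: partially reduce one block (here, a partial sum)."""
    return sum(block)


def final_reduce(partials: Sequence[int]) -> int:
    """Hypothetical final reduce: combine the partial results of all blocks."""
    return sum(partials)


def reduce_skewed_key(values: Sequence[int], block_size: int = 4, workers: int = 4) -> int:
    """Reduce all values of a single hot key, using several workers for the bulk of the work."""
    blocks = split_into_blocks(values, block_size)
    # The thread pool is a stand-in for the intermediate reduce workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(intermediate_reduce, blocks))
    # Only the small list of partial results is handled by a single worker.
    return final_reduce(partials)


if __name__ == "__main__":
    hot_key_values = list(range(1, 101))  # every value belongs to the same key
    print(reduce_skewed_key(hot_key_values, block_size=10, workers=4))  # prints 5050
```

Even though all 100 values belong to one key, the ten blocks are reduced independently and only ten partial sums reach the final step. The decomposition is valid here because addition is associative and commutative; reduce functions without such a property would need different handling.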
Document type:
Journal article
Proceedings of the VLDB Endowment (PVLDB), VLDB Endowment, 2015, 8 (12), pp.1856-1867

Contributor: Reza Akbarinia
Submitted on: Wednesday, June 10, 2015 - 11:54:58
Last modified on: Wednesday, October 10, 2018 - 14:28:13
Document(s) archived on: Tuesday, April 25, 2017 - 06:06:34


Files produced by the author(s)


  • HAL Id: lirmm-01162362, version 1



Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data. Proceedings of the VLDB Endowment (PVLDB), VLDB Endowment, 2015, 8 (12), pp.1856-1867. 〈lirmm-01162362〉


