An Efficient Solution for Processing Skewed MapReduce Jobs

Reza Akbarinia; Miguel Liroz-Gistau; Divyakant Agrawal; Patrick Valduriez

doi:10.1007/978-3-319-22852-5_35

Conference paper. Year: 2015

Reza Akbarinia

Role: Author
PersonId : 172647
IdHAL : reza-akbarinia
ORCID : 0000-0002-7098-0361
IdRef : 119863421

Scientific Data Management

Miguel Liroz-Gistau

Role: Author
PersonId : 901689

Scientific Data Management

Divyakant Agrawal

Role: Author
PersonId : 947753

University of California [Santa Barbara]

Patrick Valduriez

Role: Author
PersonId : 172604
IdHAL : patrick-valduriez
ORCID : 0000-0001-6506-7538
IdRef : 028314417

Institut de Biologie Computationnelle

Scientific Data Management

Abstract

Although MapReduce has been praised for its high scalability and fault tolerance, it has been criticized on several points, in particular its poor performance in the presence of data skew. There are important cases where a high percentage of the reduce-side processing is done by a few nodes, or even a single node, while the others remain idle. There have been some attempts to address the problem of data skew, but only for specific cases. In particular, no solution has been proposed for the cases where most of the intermediate values correspond to a single key, or where the number of keys is smaller than the number of reduce workers. In this paper, we propose FP-Hadoop, a system that makes the reduce side of MapReduce more parallel and efficiently deals with data skew on the reduce side. FP-Hadoop introduces a new phase, called intermediate reduce (IR), in which dynamically constructed blocks of intermediate values are processed in parallel by intermediate reduce workers, according to a scheduling strategy. With the IR phase, even if all intermediate values belong to a single key, the main part of the reducing work can be done in parallel using the computing resources of all available workers. We implemented a prototype of FP-Hadoop and conducted extensive experiments over synthetic and real datasets. We achieved excellent performance gains compared to native Hadoop, for example speedups of more than 10 times in reduce time and 5 times in total execution time.
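
To illustrate the idea behind the intermediate reduce phase described in the abstract, here is a minimal single-machine sketch in Python. It is not FP-Hadoop's implementation: the function names (`make_blocks`, `intermediate_reduce`, `final_reduce`, `run`), the fixed `block_size`, the use of summation as the reduce operation, and the thread pool standing in for intermediate reduce workers are all assumptions made for illustration. The sketch also cuts blocks per key at a fixed size, whereas the abstract describes blocks constructed dynamically and assigned by a scheduling strategy, which is not modeled here.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def make_blocks(pairs, block_size):
    """Group (key, value) pairs by key, then cut each key's values
    into blocks of at most block_size values (hypothetical policy)."""
    by_key = defaultdict(list)
    for key, value in pairs:
        by_key[key].append(value)
    blocks = []
    for key, values in by_key.items():
        for start in range(0, len(values), block_size):
            blocks.append((key, values[start:start + block_size]))
    return blocks


def intermediate_reduce(block):
    """Reduce one block to a partial result. Summation is an assumed
    example; it works because partial sums can be combined later."""
    key, values = block
    return key, sum(values)


def final_reduce(partials):
    """Combine the partial results of each key into the final value."""
    totals = defaultdict(int)
    for key, partial in partials:
        totals[key] += partial
    return dict(totals)


def run(pairs, block_size=1000, workers=4):
    """Run the intermediate reduce over all blocks with a pool of
    workers (threads here, as a stand-in for cluster workers),
    then finish with a small final reduce."""
    blocks = make_blocks(pairs, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(intermediate_reduce, blocks))
    return final_reduce(partials)


if __name__ == "__main__":
    # Extreme skew: every intermediate value belongs to one key.
    skewed = [("hot", 1)] * 10_000
    print(len(make_blocks(skewed, 1000)), "blocks for a single key")
    print(run(skewed))
```

With a plain hash partitioner, all 10,000 values of the single key in this example would go to one reducer. In the sketch they are split into ten blocks that can be reduced independently, leaving only ten partial results for the final step. This only works when the reduce operation can be applied to partial results and then combined, as summation can.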

Keywords

MapReduce; Load Balancing; Data Skew

Domains

Information Retrieval [cs.IR]

Main file

paper.pdf (158.84 KB)

Origin: Files produced by the author(s)

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01162359

Submitted on: Wednesday, June 10, 2015, 11:47:32

Last modified on: Thursday, December 5, 2024, 03:21:53

Long-term archiving on: Tuesday, April 25, 2017, 06:18:12

Dates and versions

lirmm-01162359 , version 1 (10-06-2015)

Identifiers

HAL Id : lirmm-01162359 , version 1
DOI : 10.1007/978-3-319-22852-5_35

Cite

Reza Akbarinia, Miguel Liroz-Gistau, Divyakant Agrawal, Patrick Valduriez. An Efficient Solution for Processing Skewed MapReduce Jobs. Globe, Sep 2015, Valencia, Spain. pp.417-429, ⟨10.1007/978-3-319-22852-5_35⟩. ⟨lirmm-01162359⟩

Collections

CNRS INRIA INRA GRID5000 INRIA-SILICONVALLEY ZENITH LIRMM INRIA2 MIPS UNIV-MONTPELLIER INRAE SILECS

393 views

808 downloads