Data Partitioning for Minimizing Transferred Data in MapReduce

Miguel Liroz-Gistau; Reza Akbarinia; Divyakant Agrawal; Esther Pacitti; Patrick Valduriez

doi:10.1007/978-3-642-40053-7_1

Conference paper, Year: 2013

Miguel Liroz-Gistau

Role: Author
PersonId : 901689

Scientific Data Management

Reza Akbarinia

Role: Author
PersonId : 172647
IdHAL : reza-akbarinia
ORCID : 0000-0002-7098-0361
IdRef : 119863421

Scientific Data Management

Divyakant Agrawal

Role: Author
PersonId : 947753

Department of Computer Science [Santa Barbara]

Esther Pacitti

Role: Author
PersonId : 3271
IdHAL : esther-pacitti
ORCID : 0000-0003-1370-9943
IdRef : 117946451

Scientific Data Management

Patrick Valduriez

Role: Author
PersonId : 172604
IdHAL : patrick-valduriez
ORCID : 0000-0001-6506-7538
IdRef : 028314417

Scientific Data Management

Institut de Biologie Computationnelle

Abstract

Reducing data transfer in MapReduce's shuffle phase is important because it increases the data locality of reduce tasks and thus decreases the overhead of job execution. Several optimizations have been proposed in the literature to reduce data transfer between mappers and reducers. However, all of these approaches are limited by how intermediate key-value pairs are distributed over map outputs. In this paper, we address the problem of high data transfer in MapReduce and propose a technique that repartitions the tuples of the input datasets, thereby optimizing the distribution of key-value pairs over mappers and increasing data locality in reduce tasks. Our approach captures the relationships between input tuples and intermediate keys by monitoring the execution of a set of MapReduce jobs that are representative of the workload. Then, based on those relationships, it assigns input tuples to the appropriate chunks. We evaluated our approach experimentally in a Hadoop deployment on top of Grid5000 using standard benchmarks. The results show a high reduction in data transfer during the shuffle phase compared to native Hadoop.
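
The two steps described in the abstract (capturing which intermediate keys each input tuple produces, then assigning tuples to chunks based on those relationships) can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify; it is a simple greedy heuristic chosen only for illustration. The names `monitor`, `repartition`, and `key_spread`, the fixed chunk `capacity`, and the use of the number of (chunk, key) pairs as a rough proxy for shuffle volume are all assumptions.

```python
def monitor(tuples, map_fn):
    """Record, per input tuple id, the set of intermediate keys map_fn emits."""
    return {i: {k for k, _ in map_fn(t)} for i, t in enumerate(tuples)}


def repartition(tuple_keys, num_chunks, capacity):
    """Greedy placement (illustrative assumption, not the paper's method):
    put each tuple in the non-full chunk that already shares the most keys
    with it, breaking ties in favor of the emptier chunk."""
    if num_chunks * capacity < len(tuple_keys):
        raise ValueError("not enough chunk capacity for all tuples")
    chunks = [[] for _ in range(num_chunks)]
    chunk_keys = [set() for _ in range(num_chunks)]
    for tid, keys in tuple_keys.items():
        candidates = [c for c in range(num_chunks) if len(chunks[c]) < capacity]
        best = max(
            candidates,
            key=lambda c: (len(keys & chunk_keys[c]), -len(chunks[c])),
        )
        chunks[best].append(tid)
        chunk_keys[best] |= keys
    return chunks


def key_spread(chunks, tuple_keys):
    """Count (chunk, key) pairs: a crude proxy for how many map outputs
    must be brought together across chunks during the shuffle."""
    return sum(
        len(set().union(*(tuple_keys[t] for t in chunk)))
        for chunk in chunks
        if chunk
    )


# Toy workload: each tuple is a word and the map function emits (word, 1).
tuples = ["a", "b", "a", "b"]
tuple_keys = monitor(tuples, lambda t: [(t, 1)])

baseline = [[0, 1], [2, 3]]          # tuples chunked in arrival order
chunks = repartition(tuple_keys, num_chunks=2, capacity=2)

print(key_spread(baseline, tuple_keys))  # 4: both keys appear in both chunks
print(key_spread(chunks, tuple_keys))    # 2: each key is confined to one chunk
```

In the toy workload, arrival-order chunking spreads both keys over both chunks, while the greedy placement confines each key to a single chunk, so a reducer for that key could in principle run where all of its input already resides.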

Domains

Distributed, Parallel, and Cluster Computing [cs.DC]

Main file

globe_2013-paper.pdf (194.91 KB)

Origin: Files produced by the author(s)

Contributor: Miguel Liroz-Gistau

https://hal-lirmm.ccsd.cnrs.fr/lirmm-00879527

Submitted on: Monday, November 4, 2013, 10:48:46

Last modified on: Friday, December 6, 2024, 03:19:36

Long-term archiving on: Friday, April 7, 2017, 20:16:22

Dates and versions

lirmm-00879527 , version 1 (04-11-2013)

Identifiers

HAL Id : lirmm-00879527 , version 1
DOI : 10.1007/978-3-642-40053-7_1

Cite

Miguel Liroz-Gistau, Reza Akbarinia, Divyakant Agrawal, Esther Pacitti, Patrick Valduriez. Data Partitioning for Minimizing Transferred Data in MapReduce. Globe, Aug 2013, Prague, Czech Republic. pp.1-12, ⟨10.1007/978-3-642-40053-7_1⟩. ⟨lirmm-00879527⟩

Collections

CNRS INRIA INRA GRID5000 ZENITH LIRMM INRIA2 MIPS UNIV-MONTPELLIER INRAE SILECS