Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments

Saber Salah; Reza Akbarinia; Florent Masseglia

doi:10.1007/978-3-319-22849-5_21

Conference paper, Year: 2015

Saber Salah

Role: Author
PersonId : 967928

Scientific Data Management

Reza Akbarinia

Role: Author
PersonId : 172647
IdHAL : reza-akbarinia
ORCID : 0000-0002-7098-0361
IdRef : 119863421

Scientific Data Management

Florent Masseglia

Role: Author
PersonId : 172896
IdHAL : florent-masseglia
ORCID : 0000-0002-1149-585X
IdRef : 120528681

Scientific Data Management

Abstract

Frequent itemset mining (FIM) is one of the fundamental cornerstones of data mining. While the problem of FIM has been thoroughly studied, few of the standard or improved solutions scale. This is mainly the case when i) the amount of data tends to be very large and/or ii) the minimum support (MinSup) threshold is very low. In this paper, we propose a highly scalable, parallel frequent itemset mining (PFIM) algorithm, namely Parallel Absolute Top Down (PATD). The PATD algorithm renders the mining process of very large databases (up to terabytes of data) simple and compact. Its mining process is made up of only one parallel job, which dramatically reduces the mining runtime, the communication cost and the energy consumption overhead in a distributed computational platform. Based on a clever and efficient data partitioning strategy, namely Item Based Data Partitioning (IBDP), the PATD algorithm mines each data partition independently, relying on an absolute minimum support (AMinSup) instead of a relative one. PATD has been extensively evaluated using real-world data sets. Our experimental results suggest that the PATD algorithm is significantly more efficient and scalable than alternative approaches.
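
The distinction between a relative and an absolute minimum support can be illustrated with a small sketch. It is not the PATD algorithm and does not implement IBDP. It is a brute-force counter over a toy data set, and it assumes the conventional reading in which an absolute threshold is a transaction count obtained by rounding up the relative fraction times the number of transactions. The function names and the sample transactions are hypothetical, and the abstract does not state how AMinSup is derived.

```python
from itertools import combinations
from math import ceil


def absolute_min_support(relative_min_sup, num_transactions):
    """Convert a relative threshold (a fraction of the transactions)
    into an absolute transaction count (assumed conventional definition)."""
    return ceil(relative_min_sup * num_transactions)


def frequent_itemsets(transactions, abs_min_sup, max_size=2):
    """Count every itemset of up to max_size items and keep those whose
    count reaches abs_min_sup. Brute force, for illustration only."""
    counts = {}
    for transaction in transactions:
        items = sorted(set(transaction))  # deduplicate and fix item order
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] = counts.get(itemset, 0) + 1
    return {itemset: n for itemset, n in counts.items() if n >= abs_min_sup}


# Hypothetical toy database of five transactions.
transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b"],
    ["a", "b", "d"],
]

# A relative MinSup of 50% over 5 transactions gives an absolute count of 3.
abs_min_sup = absolute_min_support(0.5, len(transactions))
result = frequent_itemsets(transactions, abs_min_sup)
print(abs_min_sup, result)
```

With this toy input, only the itemsets {a}, {b} and {a, b} reach the absolute threshold of three transactions.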

Keywords

Data Mining; Machine Learning; Frequent Itemset; MapReduce; Big Data

Domains

Databases [cs.DB]

Main file

dexa_salah.pdf (415.8 KB)

Origin: Files produced by the author(s)

Contributor: Florent Masseglia

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01169603

Submitted on: Monday, 29 June 2015, 18:29:28

Last modified on: Sunday, 24 March 2024, 11:37:22

Long-term archiving on: Tuesday, 25 April 2017, 19:50:16

Dates and versions

lirmm-01169603 , version 1 (29-06-2015)

Identifiers

HAL Id : lirmm-01169603 , version 1
DOI : 10.1007/978-3-319-22849-5_21

Cite

Saber Salah, Reza Akbarinia, Florent Masseglia. Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments. DEXA 2015 - 26th International Conference on Database and Expert Systems Applications, Sep 2015, Valencia, Spain. pp.303-318, ⟨10.1007/978-3-319-22849-5_21⟩. ⟨lirmm-01169603⟩

Collections

UNIV-RENNES1 CNRS INRIA IRISA GRID5000 ZENITH LIRMM INRIA2 UR1-MATH-STIC UR1-UFR-ISTIC MIPS UNIV-MONTPELLIER UNIV-RENNES SILECS UR1-MATH-NUM

745 views

1127 downloads
