Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments

Saber Salah 1 Reza Akbarinia 1 Florent Masseglia 1
1 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : Frequent itemset mining (FIM) is one of the fundamental cornerstones in data mining. While, the problem of FIM has been thoroughly studied, few of both standard and improved solutions scale. This is mainly the case when i) the amount of data tends to be very large and/or ii) the minimum support (M inSup) threshold is very low. In this paper, we propose a highly scalable, parallel frequent itemset mining (PFIM) algorithm, namely Parallel Absolute Top Down (PATD). PATD algorithm renders the mining process of very large databases (up to Ter-abytes of data) simple and compact. Its mining process is made up of only one parallel job, which dramatically reduces the mining runtime, the communication cost and the energy power consumption overhead, in a distributed computational platform. Based on a clever and efficient data partitioning strategy, namely Item Based Data Partitioning (IBDP), PATD algorithm mines each data partition independently , relying on an absolute minimum support (AM inSup) instead of a relative one. PATD has been extensively evaluated using real-world data sets. Our experimental results suggest that PATD algorithm is significantly more efficient and scalable than alternative approaches.
Type de document :
Communication dans un congrès
DEXA: Database and Expert Systems Applications, Sep 2015, Valencia, Spain. 26th International Conference on Database and Expert Systems Applications, 2015, 〈http://www.dexa.org〉
Liste complète des métadonnées

Littérature citée [14 références]  Voir  Masquer  Télécharger

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01169603
Contributeur : Florent Masseglia <>
Soumis le : lundi 29 juin 2015 - 18:29:28
Dernière modification le : jeudi 24 mai 2018 - 15:59:21
Document(s) archivé(s) le : mardi 25 avril 2017 - 19:50:16

Fichier

dexa_salah.pdf
Fichiers produits par l'(les) auteur(s)

Identifiants

  • HAL Id : lirmm-01169603, version 1

Citation

Saber Salah, Reza Akbarinia, Florent Masseglia. Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments. DEXA: Database and Expert Systems Applications, Sep 2015, Valencia, Spain. 26th International Conference on Database and Expert Systems Applications, 2015, 〈http://www.dexa.org〉. 〈lirmm-01169603〉

Partager

Métriques

Consultations de la notice

1262

Téléchargements de fichiers

1138