A Highly Scalable Parallel Algorithm for Maximally Informative k-Itemset Mining

Saber Salah 1 Reza Akbarinia 1 Florent Masseglia 1
1 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : The discovery of informative itemsets is a fundamental building block in data analytics and information retrieval. While the problem has been widely studied, only few solutions scale. This is particularly the case when i) the data set is massive, calling for large-scale distribution, and/or ii) the length k of the informative itemset to be discovered is high. In this paper, we address the problem of parallel mining of maximally informative k-itemsets (miki) based on joint entropy. We propose PHIKS (Parallel Highly Informative K-ItemSet) a highly scalable, parallel miki mining algorithm. PHIKS renders the mining process of large scale databases (up to terabytes of data) succinct and effective. Its mining process is made up of only two efficient parallel jobs. With PHIKS, we provide a set of significant optimizations for calculating the joint entropies of miki having different sizes, which drastically reduces the execution time, the communication cost and the energy consumption, in a distributed computational platform. PHIKS has been extensively evaluated using massive real-world data sets. Our experimental results confirm the effectiveness of our proposal by the significant scale-up obtained with high itemsets length and over very large databases.
Type de document :
Article dans une revue
Knowledge and Information Systems (KAIS), Springer, 2017
Liste complète des métadonnées

Littérature citée [29 références]  Voir  Masquer  Télécharger

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01288571
Contributeur : Florent Masseglia <>
Soumis le : mardi 15 mars 2016 - 12:26:50
Dernière modification le : vendredi 12 janvier 2018 - 11:03:48
Document(s) archivé(s) le : jeudi 16 juin 2016 - 10:38:55

Fichier

KAIS_Salah_2016.pdf
Fichiers produits par l'(les) auteur(s)

Identifiants

  • HAL Id : lirmm-01288571, version 1

Citation

Saber Salah, Reza Akbarinia, Florent Masseglia. A Highly Scalable Parallel Algorithm for Maximally Informative k-Itemset Mining. Knowledge and Information Systems (KAIS), Springer, 2017. 〈lirmm-01288571〉

Partager

Métriques

Consultations de la notice

465

Téléchargements de fichiers

410