Fast Parallel Mining of Maximally Informative k-Itemsets in Big Data

Saber Salah 1, Reza Akbarinia 1, Florent Masseglia 1
1 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract: The discovery of informative itemsets is a fundamental building block in data analytics and information retrieval. While the problem has been widely studied, only a few solutions scale. This is particularly the case when i) the data set is massive, calling for large-scale distribution, and/or ii) the length k of the informative itemset to be discovered is high. In this paper, we address the problem of parallel mining of maximally informative k-itemsets (miki) based on joint entropy. We propose PHIKS (Parallel Highly Informative K-itemSets), a highly scalable, parallel miki mining algorithm. PHIKS makes the mining of large-scale databases (up to terabytes of data) succinct and effective. Its mining process consists of only two compact yet efficient parallel jobs. PHIKS uses a clever heuristic approach to efficiently estimate the joint entropies of miki of different sizes with a very low upper bound on the error rate, which dramatically reduces the runtime. PHIKS has been extensively evaluated using massive, real-world data sets. Our experimental results confirm the effectiveness of our proposal through the significant scale-up obtained with long feature sets and hundreds of millions of objects.
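The abstract does not spell out the joint entropy measure, so the following is a minimal sketch under stated assumptions rather than the PHIKS algorithm itself. It assumes the usual formulation in which each item is a binary presence/absence feature and the joint entropy of an itemset is the Shannon entropy of the resulting presence/absence patterns over all transactions. The function names `joint_entropy` and `brute_force_miki` and the toy data are hypothetical, and the exhaustive search shown here is exactly the kind of approach that does not scale to large k or massive data sets.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(transactions, itemset):
    """Joint entropy (in bits) of the presence/absence features in `itemset`.

    Assumption: each item is treated as a binary feature of a transaction.
    """
    items = sorted(itemset)
    n = len(transactions)
    # Project every transaction onto the itemset and count each
    # presence/absence pattern, e.g. (True, False) for "a present, b absent".
    patterns = Counter(tuple(item in t for item in items) for t in transactions)
    return -sum((c / n) * log2(c / n) for c in patterns.values())


def brute_force_miki(transactions, k):
    """Return one size-k itemset with the highest joint entropy.

    Exhaustive search over all k-combinations: for illustration only.
    """
    universe = sorted(set().union(*transactions))
    return max(combinations(universe, k),
               key=lambda s: joint_entropy(transactions, s))


# Hypothetical toy data: each transaction is the set of items it contains.
toy = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
best = brute_force_miki(toy, 2)
```

On this toy data, the pair of items "a" and "b" splits the four transactions into four distinct patterns, whereas "a" and "c" yield only two, so the former is more informative under this measure.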
Document type:
Conference paper
ICDM: International Conference on Data Mining, Aug 2015, Atlantic City, United States. 15th IEEE International Conference on Data Mining, pp.359-368, 2015, 〈https://icdm2015.stonybrook.edu〉. 〈10.1109/ICDM.2015.86〉

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01187275
Contributor: Florent Masseglia
Submitted on: Monday, November 26, 2018 - 19:58:41
Last modified on: Monday, November 26, 2018 - 20:05:06

File

Miki.pdf
Publisher files allowed on an open archive

Citation

Saber Salah, Reza Akbarinia, Florent Masseglia. Fast Parallel Mining of Maximally Informative k-Itemsets in Big Data. ICDM: International Conference on Data Mining, Aug 2015, Atlantic City, United States. 15th IEEE International Conference on Data Mining, pp.359-368, 2015, 〈https://icdm2015.stonybrook.edu〉. 〈10.1109/ICDM.2015.86〉. 〈lirmm-01187275〉
