Skip to Main content Skip to Navigation

Parallel Itemset Mining in Massively Distributed Environments

Saber Salah 1 
1 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : Since few decades ago, the volume of data has been increasingly growing. The rapid advances that have been made in computer storage have offered a great flexibility in storing very large amounts of data. The processing of these massive volumes of data have opened up new challenges in data mining. In particular, frequent itemset mining (FIM) algorithms have shown poor performances when processing large quantities of data. This is particularly the case when i) the data tends to be very large and/or ii) the minimum support threshold is very low. Despite the answers that frequent itemset mining methods can provide about the data, some hidden relationships cannot be easily driven and detected inside the data. This is specifically the case when the data is very large and massively distributed. To this end, a careful analysis of the informativeness of the itemsets would give more explanation about the existing correlations and relationships inside the data. However, digging through very large amount of data to determine a set of maximally informative itemsets (of a given size k) presents a major challenge in data mining. This is particularly the case when the size k of the informative itemsets to be discovered is very high. In this thesis, first we address the problem of frequent itemset mining in big data. We call for specific data placement techniques in massively distributed environments to improve the performance of parallel frequent itemset mining (PFIM) algorithms. We thoroughly study and investigate the impact of combining such a frequent itemset algorithm with a specific data placement strategy. We show that an adequate placement of the data in a massively distributed environment along with a specific frequent itemset mining algorithm can make a mining process either inoperative or completely significant. We propose ODPR (Optimal Data-Process Relationship) our solution for fast mining of frequent itemsets in MapReduce. Our method allows discovering itemsets from massive data sets, where standard solutions from the literature do not scale. Indeed, in a massively distributed environment, the arrangement of both the data and the different processes can make the global job either completely inoperative or very effective. Our proposal has been evaluated using real-world data sets and the results illustrate a significant scale-up obtained with very minimum support which confirms the effectiveness of our approach. Generally, in a massively distributed environment (e.g., MapReduce or Spark), minimizing the number of jobs results in a significant performance of the process being executed. In the case of frequent itemset mining problem, discovering frequent itemsets in just one simple job would be preferable. To this end, we propose a highly scalable, parallel frequent itemset mining algorithm, namely Parallel Absolute Top Down (PATD). PATD algorithm renders the mining process of very large databases (up to Terabytes of data) simple and compact. Its mining process is made up of only one parallel job, which dramatically reduces the mining runtime, the communication cost and the energy power consumption overhead, in a distributed computational platform. Based on a clever and efficient data partitioning strategy, namely Item Based Data Partitioning (IBDP), the PATD algorithm mines each data partition independently, relying on an absolute minimum support instead of a relative one. PATD has been extensively evaluated using real-world data sets. Our experimental results suggest that PATD algorithm is significantly more efficient and scalable than alternative approaches. The second problem which we address in this thesis is discovering maximally informative k-itemsets (miki) in big data based on joint entropy. We propose PHIKS (Parallel Highly Informative K-ItemSet) a highly scalable, parallel miki mining algorithm that renders the mining process of large scale databases (up to Terabytes of data) succinct and effective. Its mining process is made up of only two efficient parallel jobs. With PHIKS, we provide a set of significant optimizations for calculating the joint entropies of the miki having different sizes, which drastically reduces the execution time of the mining process. PHIKS has been extensively evaluated using massive real-world data sets. Our experimental results confirm the effectiveness of our proposal by the significant scale-up obtained with high itemsets length and over very large databases.
Complete list of metadata

Cited literature [76 references]  Display  Hide  Download
Contributor : Reza Akbarinia Connect in order to contact the contributor
Submitted on : Tuesday, December 13, 2016 - 12:14:53 PM
Last modification on : Friday, August 5, 2022 - 3:03:28 PM
Long-term archiving on: : Tuesday, March 14, 2017 - 12:48:12 PM


  • HAL Id : tel-01415587, version 1


Saber Salah. Parallel Itemset Mining in Massively Distributed Environments. Information Retrieval [cs.IR]. Université de Montpellier, 2016. English. ⟨tel-01415587⟩



Record views


Files downloads