
High Dimensional Data Clustering by means of Distributed Dirichlet Process Mixture Models

Khadidja Meguelati 1, Bénédicte Fontez 2, Nadine Hilgert 2, Florent Masseglia 1
1 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : Clustering is a data mining technique used intensively in data analytics, with applications in marketing, security, text and document analysis, and sciences such as biology and astronomy, among many others. The Dirichlet Process Mixture (DPM) is a model for multivariate clustering that discovers the number of clusters automatically and offers favorable characteristics. However, applying it to high dimensional data is a major challenge, with both numerical and theoretical pitfalls. The advantages of DPM also come at the price of prohibitive running times, which impair its adoption and make centralized DPM approaches inefficient, especially with high dimensional data. We propose HD4C (High Dimensional Data Distributed Dirichlet Clustering), a parallel clustering solution that addresses the curse of dimensionality by two means. First, it scales gracefully to massive datasets through distributed computing while remaining DPM-compliant. Second, it clusters high dimensional data such as time series (as a function of time) and hyperspectral data (as a function of wavelength). Our experiments on both synthetic and real-world data illustrate the high performance of our approach.
Document type : Conference papers

Cited literature [44 references]
Contributor : Florent Masseglia
Submitted on : Saturday, November 16, 2019 - 10:58:26 PM
Last modification on : Friday, August 5, 2022 - 3:03:28 PM
Long-term archiving on : Monday, February 17, 2020 - 12:39:23 PM


Files produced by the author(s)


  • HAL Id : lirmm-02364411, version 1
  • WOS : 000554828700105


Khadidja Meguelati, Bénédicte Fontez, Nadine Hilgert, Florent Masseglia. High Dimensional Data Clustering by means of Distributed Dirichlet Process Mixture Models. IEEE Big Data 2019 - IEEE International Conference on Big Data, Dec 2019, Los Angeles, United States. ⟨lirmm-02364411⟩


