Journal articles

Parallel Computation of PDFs on Big Spatial Data Using Spark

Ji Liu 1 Noel Moreno Lemus 2 Esther Pacitti 1 Fábio Porto 2 Patrick Valduriez 1
1 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract: We consider big spatial data, which is typically produced in scientific areas such as geological or seismic interpretation. The spatial data can be produced by observation (e.g., using sensors or soil instruments) or by numerical simulation programs, and corresponds to points that represent a 3D soil cube area. However, errors in signal processing and modeling create uncertainty, and thus a lack of accuracy in identifying geological or seismic phenomena. Such uncertainty must be carefully analyzed. To analyze uncertainty, the main solution is to compute a Probability Density Function (PDF) of each point in the spatial cube area. However, computing PDFs on big spatial data can be very time-consuming (from several hours to even months on a computer cluster). In this paper, we propose a new solution to efficiently compute such PDFs in parallel using Spark, with three methods: data grouping, machine learning prediction, and sampling. We evaluate our solution through extensive experiments on different computer clusters using big data ranging from hundreds of GB to several TB. The experimental results show that our solution scales up very well and can reduce the execution time by a factor of 33 (in the order of seconds or minutes) compared with a baseline method.
Cited literature: 46 references

https://hal-lirmm.ccsd.cnrs.fr/lirmm-02045144
Contributor: Patrick Valduriez
Submitted on : Thursday, February 21, 2019 - 6:23:53 PM
Last modification on : Tuesday, June 2, 2020 - 12:17:38 PM
Document(s) archived on: Wednesday, May 22, 2019 - 7:21:19 PM

File

DAPDauthorVersion.pdf
Files produced by the author(s)

Citation

Ji Liu, Noel Moreno Lemus, Esther Pacitti, Fábio Porto, Patrick Valduriez. Parallel Computation of PDFs on Big Spatial Data Using Spark. Distributed and Parallel Databases, Springer, 2020, 38, pp.63-100. ⟨10.1007/s10619-019-07260-3⟩. ⟨lirmm-02045144⟩

Metrics

Record views: 185
File downloads: 233