Analyzing Related Raw Data Files through Dataflows

Vitor Silva 1 Daniel De Oliveira 1 Patrick Valduriez 2, 3 Marta Mattoso 1
3 ZENITH - Scientific Data Management
LIRMM - Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier, CRISAM - Inria Sophia Antipolis - Méditerranée
Abstract : Computer simulations may ingest and generate large numbers of raw data files. Most of these files follow a de facto standard format established by the application domain, e.g., FITS for astronomy. Although these formats are supported by a variety of programming languages, libraries and programs, analyzing thousands or millions of files requires developing specific programs. Database Management Systems (DBMS) are not suited for this, because they require loading the raw data and structuring it, which becomes costly at large scale. Systems such as NoDB, RAW and FastBit have been proposed to index and query raw data files without the overhead of using a DBMS. However, these solutions focus on analyzing a single large file rather than several related files. When related files are produced and required for analysis, the relationships among elements within the file contents must be managed manually, with specific programs to access the raw data. Thus, this data management may be time-consuming and error-prone. When computer simulations are managed by a Scientific Workflow Management System (SWfMS), they can take advantage of provenance data to relate and analyze raw data files produced during workflow execution. However, SWfMS register provenance at a coarse grain, with limited analysis of elements from raw data files. When the SWfMS is dataflow-aware, it can register provenance data and the relationships among elements of raw data files together in a database, which is useful for accessing the contents of a large number of files. In this paper, we propose a dataflow approach for analyzing element data from several related raw data files. Our approach is complementary to the existing single raw data file analysis approaches. We use the Montage workflow from astronomy and a workflow from the Oil and Gas domain as I/O intensive case studies.
Our experimental results for the Montage workflow explore different types of raw data flows, such as showing all linear transformations involved in projection simulation programs, considering specific mosaic elements from input repositories. The cost of raw data extraction is approximately 3.7% of the total application execution time.
Document type :
Journal articles

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01181231
Contributor : Patrick Valduriez
Submitted on : Thursday, October 11, 2018 - 4:53:57 PM
Last modification on : Monday, October 19, 2020 - 2:34:02 PM
Long-term archiving on : Saturday, January 12, 2019 - 3:22:49 PM

Citation

Vitor Silva, Daniel De Oliveira, Patrick Valduriez, Marta Mattoso. Analyzing Related Raw Data Files through Dataflows. Concurrency and Computation: Practice and Experience, Wiley, 2016, 28 (8), pp.2528-2545. ⟨10.1002/cpe.3616⟩. ⟨lirmm-01181231⟩
