PORSCHE: Performance ORiented SCHEma Matching

Khalid Saleem; Zohra Bellahsene; Ela Hunt

Report, Year: 2006


Khalid Saleem

Role: Author
PersonId : 836878
ORCID : 0000-0001-5651-4746

Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier

Zohra Bellahsene

Role: Author
PersonId : 169913
IdHAL : zohra-bellahsene
ORCID : 0000-0003-2031-0519
IdRef : 07917857X

Scientific Data Management

Ela Hunt

Role: Author
PersonId : 836879

Global Informations Systems Group

Abstract

Semantic matching of schemas in heterogeneous data sharing systems is time-consuming and error-prone. Existing mapping tools employ semi-automatic techniques that map two schemas at a time. Such techniques are not suitable for a large-scale scenario, where data sharing involves a large number of data sources. In this paper we present a method that creates a mediated schema tree from a large set of input schema trees and defines mappings from the contributing schemas to the mediated schema. It is a two-phase approach. First, we use a set of linguistic matchers that extract the semantics of all distinct node labels present in the input schemas and form clusters of semantically similar labels. Second, we use a tree-mining data structure, combined with the similar-label clusters, to calculate the context of each node, which is then used in mapping. Since the input schemas are trees, our tree-mining algorithm uses node ranks calculated by pre-order traversal. Tree mining combined with semantic label clustering minimizes the target search space and improves performance, making the method suitable for large-scale data sharing. We report on experiments with up to 80 schemas containing 83,770 nodes. PORSCHE took 587 seconds to match and merge them, creating a mediated schema and returning mappings from the input schemas to the mediated schema. We compare the matching quality of PORSCHE with that of COMA++ on standard XML schemas and find PORSCHE's mappings to be very similar to those produced by COMA++.
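The abstract mentions node ranks calculated by pre-order traversal but does not give the data structure itself. As a rough illustration only, the sketch below assigns each node of a small schema tree a pre-order rank and a scope (the rank of its last descendant), which is one common way to answer ancestor queries in constant time. The `Node` class, the `preorder_ranks` and `is_ancestor` helpers, and the example schema are all hypothetical and are not taken from the report.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical schema tree node: a label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def preorder_ranks(root: Node) -> List[Tuple[int, int, str]]:
    """Return (rank, scope, label) triples in pre-order.

    rank  = position of the node in a pre-order traversal (root is 0)
    scope = rank of the last node in the subtree rooted at this node
    """
    out: list = []

    def visit(node: Node) -> None:
        rank = len(out)
        out.append(None)  # reserve the slot; scope is known after the children
        for child in node.children:
            visit(child)
        out[rank] = (rank, len(out) - 1, node.label)

    visit(root)
    return out


def is_ancestor(a: Tuple[int, int, str], b: Tuple[int, int, str]) -> bool:
    """True if a is a proper ancestor of b, using only ranks and scopes."""
    return a[0] < b[0] <= a[1]


# Assumed example schema: book(title, author(name), price)
schema = Node("book", [Node("title"), Node("author", [Node("name")]), Node("price")])
ranks = preorder_ranks(schema)
for triple in ranks:
    print(triple)
```

With ranks and scopes in hand, the structural context of a node (its ancestors and descendants) can be read off by integer comparison instead of repeated tree walks, which is the kind of saving the abstract attributes to tree mining.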

Domains

Databases [cs.DB]

Main file

PORSCHE2006.pdf (298.35 KB)


https://hal-lirmm.ccsd.cnrs.fr/lirmm-00117053

Submitted on: Thursday, January 11, 2007, 14:31:55

Last modified on: Friday, March 24, 2023, 14:52:48

Long-term archiving on: Tuesday, April 6, 2010, 23:39:15

Dates and versions

lirmm-00117053 , version 1 (11-01-2007)

lirmm-00117053 , version 2 (15-01-2008)

Identifiers

HAL Id : lirmm-00117053 , version 1

Cite

Khalid Saleem, Zohra Bellahsene, Ela Hunt. PORSCHE: Performance ORiented SCHEma Matching. RR-06055, 2006. ⟨lirmm-00117053v1⟩


