Presentation - Systèmes d’Informations Généralisées

The SIG team, meaning Generalized Information Systems (“Systèmes d’Informations Généralisées”), has existed since 2003 (the year the IRIT lab was founded). It is one of the largest teams of the lab, with 21 permanent positions (regular researchers) and around thirty non-permanent members, such as postdocs, PhD students, interns, and research engineers. We are spread over the four universities of the western Occitanie region: Toulouse 1 Capitole University, University of Toulouse – Jean Jaurès, Paul Sabatier University, and Jean François Champollion University (via the ISIS school in Castres).

Our research area is “data”, and particularly data management and mass data processing (“Big Data”). We propose methods, models, languages and tools for simple, effective, and efficient access to qualified and relevant information. The final goal is to enhance information usage, ease information analysis, and support decision-making.

Our research is applied to a great variety of datasets: scientific databases, business databases (aeronautics, space, energy, biology, health, etc.), the Web, ambient and mobile applications (including user-generated content), open data, scientific benchmarks (CLEF, OAEI, SSB, TPC-H/DS, TREC, etc.), semantic and knowledge data (such as ontologies), sensors and connected objects (IoT), and more.

Our research covers the whole data processing chain, from raw data to refined data made accessible to users who search for information, visualize it to obtain synthetic views, and perform decisional and predictive analyses.

Figure 1: Data processing chain.

This research is organized into four axes.

Automatic integration of heterogeneous data

The data available today form voluminous datasets (mass data) that are most often heterogeneous, spanning a large diversity of structures (structured, semi-structured, and even unstructured data), and that are widely distributed. Our work addresses different aspects of data heterogeneity: entity heterogeneity, structural heterogeneity, and the syntactic and semantic heterogeneity of data items.

The issue is to propose methods and algorithms that automatically match elements from multiple data or knowledge sources (holistic matching). The targeted matchings may be simple (1-to-1), multiple (1-to-n or n-to-1), or even complex (n-to-m).
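
As an illustration only, the sketch below pairs attributes from two hypothetical sources using a simple string-similarity heuristic to produce 1-to-1 matchings. The attribute names, the threshold, and the greedy strategy are all assumptions for the example, not the team’s actual algorithms.

```python
# Minimal sketch of 1-to-1 schema matching by name similarity.
# Illustrative only; attribute names are invented for the example.
from difflib import SequenceMatcher

source = ["client_name", "birth_date", "postal_code"]
target = ["customerName", "dateOfBirth", "zip"]

def norm(s: str) -> str:
    # Normalize separators and case before comparing.
    return s.replace("_", "").lower()

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

# Greedy 1-to-1 matching: keep the best-scoring pairs above a threshold.
pairs = sorted(((similarity(s, t), s, t) for s in source for t in target),
               reverse=True)
matched_src, matched_tgt, matches = set(), set(), []
for score, s, t in pairs:
    if score >= 0.4 and s not in matched_src and t not in matched_tgt:
        matches.append((s, t, round(score, 2)))
        matched_src.add(s)
        matched_tgt.add(t)

print(matches)  # unmatched attributes (e.g. postal_code/zip) stay unpaired
```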

Non-conventional database management

Modern database management systems are expected to manage huge amounts of data. These data embed an important variety (classical relational data, structured documents such as XML or JSON, textual data, domain ontologies, etc.). Such systems can no longer be based on a single standard and uniform data model (i.e., relational). Instead, they are built over centralized (data warehouses, data lakes) and distributed storage systems based on non-conventional data model paradigms, such as key-value, document-oriented, column-oriented, or graph-oriented models, also called NoSQL (Not Only SQL).
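
To make the contrast concrete, here is an illustrative sketch of the same record under two of these paradigms; the record, keys, and field names are invented for the example.

```python
# Illustrative sketch of one record under two NoSQL paradigms.
import json

# Key-value model: opaque values indexed by a key; structure is up to the client.
kv_store = {
    "user:42": '{"name": "Ada", "city": "Toulouse"}',
}

# Document-oriented model: nested, self-describing documents (here JSON-like).
document = {
    "_id": 42,
    "name": "Ada",
    "address": {"city": "Toulouse", "country": "France"},
    "interests": ["databases", "information retrieval"],
}

print(json.loads(kv_store["user:42"])["name"])  # value must be parsed by the client
print(document["address"]["city"])              # structure is directly queryable
```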

In this context, the issue is, on the one hand, to develop novel design methods promoting explicitly formalized data representation models (concepts and formalisms). On the other hand, such data models require formal languages for manipulating and processing the data. These languages should rest on a closed algebraic core of elementary operators, proven complete so as to ensure coverage of the model, and should guarantee the validity and expressive power of the language.
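
As a toy illustration of what closure means for such an algebraic core (not any of the team’s actual languages), the operators below each take and return collections of documents, so they compose freely like relational algebra operators.

```python
# Minimal sketch of a closed algebra over collections of documents:
# every operator maps collections to collections, so operators nest.
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]
Collection = List[Doc]

def select(coll: Collection, pred: Callable[[Doc], bool]) -> Collection:
    return [d for d in coll if pred(d)]

def project(coll: Collection, keys: List[str]) -> Collection:
    return [{k: d[k] for k in keys if k in d} for d in coll]

def union(a: Collection, b: Collection) -> Collection:
    return a + [d for d in b if d not in a]

# Closure lets expressions nest, as in relational algebra:
people = [{"name": "Ada", "age": 36}, {"name": "Alan", "age": 41}]
result = project(select(people, lambda d: d["age"] > 40), ["name"])
print(result)  # [{'name': 'Alan'}]
```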

User-oriented data

Complex systems that become more efficient by adapting themselves strongly require some knowledge about the user. This knowledge is often stored in a user profile, that is, a set of data characterizing the user (context, habits, and the way the system is used).

In this context, the issue is to define contextual profiles (spatio-temporal, egocentric, etc.) for each user or set of users (group). These profiles are then used to propose new models and algorithms for recommender systems and information filtering systems. Profile construction techniques are also used in the context of social network analysis (community, fraud, influence, or sentiment detection).
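
As a minimal sketch of how a profile can drive recommendation (a plain content-based filter, not the team’s models), the profile and items below are term-weight vectors compared by cosine similarity; all names and weights are invented for the example.

```python
# Content-based filtering sketch: rank items by similarity to a user profile.
from math import sqrt

def cosine(u: dict, v: dict) -> float:
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# A user profile built from past interactions (term -> weight).
profile = {"databases": 0.8, "nosql": 0.6, "toulouse": 0.2}

items = {
    "article_A": {"nosql": 0.9, "graphs": 0.5},
    "article_B": {"cooking": 1.0},
    "article_C": {"databases": 0.7, "warehouses": 0.6},
}

# Most relevant items first.
ranking = sorted(items, key=lambda i: cosine(profile, items[i]), reverse=True)
print(ranking)
```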

Analysis, learning and mining in huge amounts of data

The big data era has profoundly renewed computer science. Today, humanity produces huge amounts of data on the globalized Internet, in the Internet of Things, and in scientific observation and recording facilities (satellites, particle accelerators, DNA sequencers, etc.). Novel algorithms, built to run over clusters of computers, now enable analysis, mining (prediction), and simulation over these data masses.

The SIG team works on the parameterization of analysis and data mining algorithms, as well as on machine learning and deep learning algorithms. “Data intelligence” is both an issue and a challenge of data science, and it greatly depends on the efficiency of the underlying algorithms and models. “Data intelligence” approaches must guarantee the best possible reproducibility of results. This reproducibility is hard to achieve because very large and heterogeneous datasets are most often of poor (or even unpredictable) quality and rest on sparse and imbalanced data distributions. These characteristics require precise tuning of the algorithms used, which makes each approach very specific to a reduced data subset.
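
As a small illustration of two of the levers involved, fixed random seeds (for reproducibility) and class weighting (for imbalanced distributions), here is a sketch using scikit-learn on a synthetic dataset; the dataset, seed, and model choice are assumptions made for the example, not the team’s actual setup.

```python
# Reproducibility and imbalance handling in a minimal supervised pipeline.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Synthetic, deliberately imbalanced dataset (roughly 95% / 5% classes).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)

# Stratified split with a fixed seed so the experiment reruns identically.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight="balanced" reweights examples inversely to class frequency.
model = LogisticRegression(class_weight="balanced", random_state=42, max_iter=1000)
model.fit(X_tr, y_tr)

# Balanced accuracy is less misleading than plain accuracy on skewed classes.
print(balanced_accuracy_score(y_te, model.predict(X_te)))
```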