We present the first prototype of INDUS (Intelligent Data Understanding System), a federated, query-centric system for information integration and knowledge acquisition from distributed, semantically heterogeneous data sources that can be viewed (conceptually) as tables. INDUS employs ontologies and inter-ontology mappings to enable a user to view a collection of such data sources (regardless of location, internal structure, and query interfaces) as though they were a collection of tables structured according to an ontology supplied by the user. This allows INDUS to answer user queries against distributed, semantically heterogeneous data sources without the need for a centralized data warehouse or a common global ontology.
The ongoing transformation of biology from a data-poor science into an increasingly data-rich science, with the attendant increase in the number, size, and diversity of data sources (e.g., protein sequences, structures, expression patterns, interactions), offers unprecedented and, as yet, largely unrealized opportunities for large-scale collaborative discovery in a number of areas, including the characterization of macromolecular sequence-structure-function relationships and the discovery of complex genetic regulatory networks.
Biological data sources developed by autonomous individuals or groups differ with respect to their ontological commitments, that is, assumptions concerning the objects that exist in the world, the properties or attributes of the objects, relationships between objects, the possible values of attributes, and their intended meaning, as well as the granularity or level of abstraction at which objects and their properties are described [12, 11]. Therefore, semantic differences among autonomous data sources are simply unavoidable. Effective use of multiple sources of data in a given context requires reconciliation of such semantic differences, which in fact involves solving a data integration problem.
Driven by the Semantic Web vision, there have been significant community-wide efforts aimed at the construction of ontologies in the life sciences. Examples include the Gene Ontology (www.geneontology.org) in biology and the Unified Medical Language System (www.nlm.nih.gov/research/umls) in health informatics. However, because data sources that are created for use in one context often find use in other contexts or applications (e.g., in collaborative scientific discovery applications involving data-driven construction of classifiers from semantically disparate data sources), and because users often need to analyze data in different contexts from different perspectives, there is no single privileged ontology that can serve all users, or for that matter, even a single user, in every context. Effective use of multiple sources of data in a given context requires flexible approaches to reconciling such semantic differences from the user’s point of view.
Against this background, we have investigated a federated, query-centric approach to information integration and knowledge acquisition from distributed, semantically heterogeneous data sources, from a user’s perspective.
The choice of the federated, query-centric approach was influenced by the large number and diversity of loosely linked, autonomously maintained data repositories involved and the context and user-specific nature of integration tasks that need to be performed. Our work has led to INDUS, a system for information integration and knowledge acquisition.
We associate ontologies with data sources and users and show how to define mappings between them. We exploit the ontologies and the mappings to develop sound methods for flexibly querying (from a user perspective) multiple semantically heterogeneous distributed data sources in a setting where each data source can be viewed (conceptually) as a single table [5, 4].
The rest of the paper is organized as follows: Section 2 introduces the problem that we are addressing more precisely through an example from biology. Section 3 describes the first prototype of INDUS. We end with conclusions, discussion of related work and directions for future work in Section 4.
The problem that we address is best illustrated by an example. Consider two biological laboratories that independently collect information about protein functions based on the protein sequences. The data collected by the first laboratory contains information about human proteins and their functions (see the entry corresponding to D1 in Table 1), whereas the data collected by the second laboratory contains information about yeast proteins and their functions (see the entry corresponding to D2 in Table 1). Suppose that a biologist (user) U wants to assemble a data set based on the two data sources of interest D1 and D2 from his or her own perspective. The representative attributes from the user’s perspective are ID, AA composition (i.e., the number of occurrences of each amino acid in the amino acid sequence corresponding to the protein), and GO Function (see the entry corresponding to DU in Table 1).
However, the attributes in the data sources D1 and D2 differ from the user attributes. To reconcile these differences, the user must observe that the attributes Protein ID in D1 and Accession Number in D2 are similar to the user attribute ID in DU; that the attributes Protein Sequence in D1 and AA Sequence in D2 are also similar, and can be used to derive the attribute AA Composition in DU; and that the attributes EC Number in D1 and MIPS Funcat in D2 are similar to the user attribute GO Function.
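The attribute correspondences above can be sketched in a few lines of Python. This is only an illustration, not the INDUS implementation: the dictionary layout and the function names `aa_composition` and `to_user_view` are assumptions made for this example, and only the attribute names come from the text.

```python
from collections import Counter

# Source attribute -> user attribute, per data source.
# Attribute names follow the example in the text; this layout is hypothetical.
SCHEMA_MAPPINGS = {
    "D1": {
        "Protein ID": "ID",
        "Protein Sequence": "AA Composition",
        "EC Number": "GO Function",
    },
    "D2": {
        "Accession Number": "ID",
        "AA Sequence": "AA Composition",
        "MIPS Funcat": "GO Function",
    },
}


def aa_composition(sequence):
    """Conversion function: count occurrences of each amino acid letter."""
    return dict(Counter(sequence.upper()))


def to_user_view(source, record):
    """Rename a source record's attributes into the user's schema.

    Only the sequence attribute is converted here. Function codes are
    passed through unchanged, because translating them requires mappings
    between attribute value hierarchies, which this sketch does not cover.
    """
    user_record = {}
    for attr, value in record.items():
        user_attr = SCHEMA_MAPPINGS[source][attr]
        if user_attr == "AA Composition":
            value = aa_composition(value)
        user_record[user_attr] = value
    return user_record
```

For instance, `to_user_view("D2", {"Accession Number": "y1", "AA Sequence": "MKKA", "MIPS Funcat": "f"})` yields a record keyed by ID, AA Composition, and GO Function, with the sequence replaced by its per-residue counts.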
To establish the correspondence between values that two similar attributes can take, we need to associate types with attributes and map the domain of the type of an attribute to the domain of the type of the corresponding attribute (e.g., AA Sequence to AA Composition or EC Number to GO Function). We assume that the type of an attribute can be a standard type such as a collection of values (e.g., amino acids, Prosite motifs, etc.), or it can be given by a simple hierarchical ontology (e.g., species taxonomy). Figure 1 shows examples of (simplified) attribute value hierarchies for the attributes EC Number in D1 and GO Function in DU. Examples of semantic correspondences in this case could be: EC 18.104.22.168 in D1 is equivalent to GO 0047696 in DU, EC 22.214.171.124 in D1 is lower than (i.e., hierarchically below) GO 0004672 in DU, or for that matter EC 1.14 is higher than GO 0004597, etc.
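A minimal sketch of how such hierarchical correspondences might be used follows. All term identifiers (`S:1`, `S:2`, `S:2.7`) and the tiny hierarchies are hypothetical placeholders, not real EC or GO codes. The relation names mirror the "equivalent", "lower than", and "higher than" correspondences described above. Excluding "higher" mappings when collecting matching source terms is an assumption of this sketch, not documented INDUS behavior.

```python
# Child -> parent links for two small attribute value hierarchies.
# Every term name below is a hypothetical placeholder.
USER_PARENT = {
    "kinase activity": "catalytic activity",
    "protein kinase activity": "kinase activity",
}
SOURCE_PARENT = {
    "S:2": "S:1",
    "S:2.7": "S:2",
}

# Inter-ontology mappings: (source term, relation, user term).
MAPPINGS = [
    ("S:2", "lower", "catalytic activity"),
    ("S:2.7", "equivalent", "kinase activity"),
    ("S:1", "higher", "kinase activity"),
]


def is_at_or_below(term, ancestor, parent):
    """Return True if term equals ancestor or lies below it in the hierarchy."""
    while term is not None:
        if term == ancestor:
            return True
        term = parent.get(term)
    return False


def source_terms_under(user_term, source_terms):
    """Source terms that the mappings place at or below user_term.

    A source term qualifies if it is, or descends from, a term mapped as
    'equivalent' or 'lower' to user_term or to one of its descendants.
    'higher' mappings are skipped because the source term is then more
    general than the user term (an assumption of this sketch).
    """
    anchors = {
        s for s, relation, u in MAPPINGS
        if relation in ("equivalent", "lower")
        and is_at_or_below(u, user_term, USER_PARENT)
    }
    return {
        t for t in source_terms
        if any(is_at_or_below(t, a, SOURCE_PARENT) for a in anchors)
    }
```

With these toy hierarchies, a request for everything under "kinase activity" selects only `S:2.7`, while a request for "catalytic activity" also selects `S:2`.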
In general, a biologist might want to answer queries over the integrated data (e.g., which proteins are involved in catalytic activity, or how many human proteins are involved in kinase activity). INDUS, the system that we have developed in our lab, can be used to answer such queries against distributed, semantically heterogeneous data sources without the need for a centralized data warehouse or a common global ontology. We describe INDUS in more detail in the next section.
The current prototype of INDUS enables a biologist with some familiarity with the relevant data sources to integrate and analyze them by specifying a user ontology, defining simple mappings between the data source specific ontologies and the user ontology, and executing queries, all without having to write code. The current implementation of INDUS includes support for:
Note that in the current release of the INDUS software, we have assembled two relational databases that contain subsets of the information gathered from SWISSPROT and MIPS to demonstrate how the user can query the two databases flexibly using user-supplied mappings.
We present the first prototype of INDUS, a federated, query-centric approach to answering user queries from distributed, semantically heterogeneous data sources. INDUS assumes a clear separation between data and the semantics of the data (ontologies) and allows users to specify ontologies and mappings between the data source ontologies and the user ontology. These mappings are stored in a mappings repository to ensure their reusability and are made available to a query answering engine. The task of the query answering engine is to decompose a query posed by a user into subqueries according to the distributed data sources and to compose the results into a final answer to the initial user query. An initial version of the INDUS software and its documentation are available at: http://www.cild.iastate.edu/software/indus.html.
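The decompose-and-compose step can be illustrated with a toy count query. This is a hedged sketch, not the INDUS query answering engine: the in-memory records, the term codes (`S:2.7`, `F:01`, and so on), and the names `subquery_count` and `count_proteins` are all invented for illustration. The translation of the user term into source terms is assumed to have been derived beforehand from the stored mappings.

```python
# In-memory stand-ins for two data sources; records and codes are invented.
SOURCES = {
    "D1": [
        {"Protein ID": "h1", "EC Number": "S:2.7"},
        {"Protein ID": "h2", "EC Number": "S:3"},
    ],
    "D2": [
        {"Accession Number": "y1", "MIPS Funcat": "F:01"},
        {"Accession Number": "y2", "MIPS Funcat": "F:01"},
        {"Accession Number": "y3", "MIPS Funcat": "F:02"},
    ],
}

# Attribute holding the function code in each source.
FUNCTION_ATTR = {"D1": "EC Number", "D2": "MIPS Funcat"}

# User term -> source terms per source, assumed precomputed from mappings.
TERM_TRANSLATION = {
    "kinase activity": {"D1": {"S:2.7"}, "D2": {"F:01"}},
}


def subquery_count(source, source_terms):
    """Answer the subquery in the source's own vocabulary."""
    attr = FUNCTION_ATTR[source]
    return sum(1 for record in SOURCES[source] if record[attr] in source_terms)


def count_proteins(user_term, sources=None):
    """Decompose a user-level count query and compose the partial counts."""
    if sources is None:
        sources = list(SOURCES)
    partial = {
        s: subquery_count(s, TERM_TRANSLATION[user_term][s]) for s in sources
    }
    return sum(partial.values()), partial
```

Each source only ever sees terms from its own ontology, and only partial counts travel back to be summed, so no records need to be copied into a central warehouse. Restricting the `sources` argument to `["D1"]` corresponds to asking the question about one source alone.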
There is a large body of literature on information integration and systems for information integration. Davidson et al. and Eckman survey alternative approaches to data integration. Hull summarizes theoretical work on data integration. Several systems have been designed specifically for the integration of biological data sources, including SRS, K2, Kleisli, IBM’s DiscoveryLink, TAMBIS, OPM, and BioMediator, among others.
Systems such as SRS and Kleisli do not assume any data model (or schema). It is the user’s responsibility to specify the integration details and the data source locations when posing queries. DiscoveryLink and OPM rely on schema mappings and the definition of views to perform the integration task. TAMBIS and BioMediator make a clear distinction between data and the semantics of the data (i.e., ontologies) and take into account semantic correspondences between ontologies (both at the schema level and at the attribute level) in the process of data integration.
Most of the above-mentioned systems assume a predefined global schema (e.g., DiscoveryLink, OPM) or ontology (e.g., TAMBIS), with the notable exception of BioMediator, where users can easily tailor the integrating ontology to their own needs. This is highly desirable in a scientific discovery setting where users need the flexibility to specify their own ontologies.
While some of these systems can answer very complex queries (e.g., BioMediator), others have limited query capabilities (e.g., SRS, which is mainly an information retrieval system). Furthermore, for some systems it is very easy to add new data sources (e.g., SRS or Kleisli, where new data source wrappers can be easily developed), while this is not easy for other biological integration systems (e.g., DiscoveryLink or OPM, where the global schema needs to be reconstructed).
On a different note, there has been a great deal of work on ontology development environments. Before developing the INDUS editor, we considered off-the-shelf alternatives such as IBM’s Clio and Protege, but they proved insufficient for our needs. Clio provides support only for schema mapping, not for hierarchical ontology mapping. Protege is purely a knowledge base construction tool (including ontology mappings). It does not provide support for the association of ontologies with data, for data management, or for queries over the data. Furthermore, neither of these systems allows procedural mappings (a.k.a. conversion functions), which are essential for data integration.
Work in progress is aimed at:
This work was funded in part by grants from the National Science Foundation (IIS 0219699) and the National Institutes of Health (GM 066387).