Briefly, we illustrate the abstract architecture of our approach in figure . In our approach, a TCM e-Science system is composed of a client side and a server side. We have designed and developed the server side as a layered structure consisting of a resource layer, a semantic layer and a function layer.
Architecture. The abstract architecture of Semantic Web e-Science for TCM.
• The resource layer mainly supports the typical remote operations on the contents of resources on the Web and querying the meta-information of databases and services. The services in this layer extend some core Grid [12] services from the Globus [14] platform. We build the whole e-Science system on these Grid services, which provide the basic communication and interaction mechanism for TCM e-Science. There are two services in this layer. Resource Access Service supports the typical remote operations on the contents of databases and the execution of services. For relational databases, the operations include query, insertion, deletion, and modification. Resource Information Service supports inquiring about the meta-information of database or service resources, including relational schema definitions, DBMS descriptions, service descriptions, privilege information, and statistics information.
• The semantic layer is mainly designed for semantic-based information manipulation and integration. This layer is composed of two sub-layers. The lower layer contains two services. Process Semantic Service is used to export services as OWL-S descriptions. Database Semantic Service is used to export the relational schemata of databases as RDF/OWL semantic descriptions. The upper layer also contains two services. Ontology Service is used to expose the shared TCM ontology and provide basic operations on the ontology. The ontology is used to mediate and integrate heterogeneous databases and services on the Web. Semantic Mapping Service establishes the mappings from local resources to the mediated ontology, maintains the mapping information, and provides the mechanism for registering and inquiring about that information.
• The function layer delivers a semantically rich experience to users in support of collaborative scientific research and information sharing. Semantic Query Service accepts a semantic query, consults Semantic Mapping Service to determine which databases are capable of providing the answer, and then rewrites the semantic query in terms of the database schemata. A semantic query is ultimately converted into a set of SQL queries. The service wraps the results of the SQL queries with semantics and returns them as triples. Semantic Search Service indexes all databases that have been mapped to the mediated ontology and accepts semantic-based full-text search. The service uses the standard classes and instances from the TCM ontology as the lexicon when establishing indexes. Collaborative Service discovers and coordinates various services in a process workflow to support research activities in a virtual community for TCM scientists.
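The lookup step of the Semantic Query Service described above can be illustrated with a minimal Python sketch. The class `SemanticMappingRegistry`, its methods, and the database, table and column names are all hypothetical. The sketch also makes a simplifying assumption that is not stated in the text: a database is considered capable of answering only if it maps every ontology property used in the query.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticMappingRegistry:
    """Toy stand-in for a mapping registry (hypothetical API, not the Dart one)."""
    # ontology property -> {database name: (table, column)}
    mappings: dict = field(default_factory=dict)

    def register(self, prop, database, table, column):
        """Record that `database` stores `prop` in `table.column`."""
        self.mappings.setdefault(prop, {})[database] = (table, column)

    def databases_for(self, props):
        """Return the databases that map every requested ontology property."""
        candidate_sets = [set(self.mappings.get(p, {})) for p in props]
        if not candidate_sets:
            return []
        return sorted(set.intersection(*candidate_sets))


registry = SemanticMappingRegistry()
# Hypothetical registrations for two local databases.
registry.register("tcm:name", "formula_db", "formula", "fname")
registry.register("tcm:composition", "formula_db", "formula", "ingredients")
registry.register("tcm:name", "clinical_db", "patient_record", "label")

capable = registry.databases_for(["tcm:name", "tcm:composition"])
print(capable)
```

In this toy setup only `formula_db` maps both properties, so it is the only database selected for rewriting.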
Note that we differentiate two kinds of services. The services in this architecture are fundamental services that support the whole e-Science system, whereas there are many common services that are treated as Web resources for e-Science processes. At the client side, the e-Science system provides a set of semantic-based toolkits to assist scientists in performing complex tasks during research. We call this architecture Dart (Dynamic, Adaptive, RDF-mediated and Transparent) [15], which is an abstract model for TCM e-Science. A detailed description of the service-oriented architecture is provided in the Methods section.
TCM domain ontology
The recent advent of the Semantic Web and bioinformatics has facilitated the incorporation of various large-scale online ontologies in biology and medicine, such as UMLS [16] for integrating biomedical terminology, Gene Ontology [17] for gene products and MGED Ontology [18] for microarray experiments. As the backbone of the Semantic Web for TCM, a unified and comprehensive TCM domain ontology is also required to support interoperability in TCM e-Science. To overcome the problem of semantic heterogeneity and encode domain knowledge in a reusable format, we need an integrated approach to developing and applying a large-scale domain ontology for the TCM discipline. In collaboration with the China Academy of Traditional Chinese Medicine (CATCM), we have spent more than 5 years building the world's largest TCM domain ontology [19].
We divide the whole TCM domain into several sub-domains. The TCM ontology is developed collaboratively by several branches of the CATCM as categories. A category is a relatively independent ontology corresponding to a relatively closed sub-domain, as opposed to the ontology corresponding to the whole domain. There are 12 categories in the current knowledge base of the TCM ontology. Each category corresponds to a sub-domain of TCM (Basic Theory of TCM, Formula of Herbal Medicine, Acupuncture, etc.). We list the characterization of the content of each category in table .
Considering medical concepts and their relationships from the perspective of the TCM discipline, we define the knowledge system of the TCM ontology by two components: the concept system and the semantic system (see figure ). The concept system contains content classes that represent the domain knowledge of the TCM discipline and 4 kinds of basic implemental classes (name class, definition class, explanation class, and relation class) that define each content class. The semantic system concerns the basic semantic types and semantic relationships of classes. A class has literal properties and object properties. The range of a literal property is a literal or string, whereas the range of an object property is a class. A content class has 5 object properties (see table ), each of which relates it to a class. The relation class has two properties: the range of the first is a semantic relationship and the range of the second is a content class. Content classes are related to each other through semantic relationships. In this way, all content classes in the TCM ontology share a unified knowledge structure, whereas different instances of a content class have different contents and relationships.
TCM ontology categories. The initial categories defined in the TCM ontology corresponding to the sub-domains of TCM.
Semantic system framework. The semantic system framework of the TCM ontology.
Content class structure. The structure of content class in the TCM ontology.
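To make the content class structure more concrete, the following Python sketch models a content class together with its relation class. It is an illustration under simplifying assumptions, not the ontology's actual encoding: the name, definition and explanation classes are reduced to plain strings, the type names `ContentClass` and `Relation` are hypothetical, and the relationship name `cure` is borrowed from the query example in this paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Relation:
    """Relation class: a semantic relationship whose target is a content class."""
    semantic_relationship: str
    target: "ContentClass"


@dataclass
class ContentClass:
    """Content class described by name, definition, explanation and relations."""
    name: str
    definition: str = ""
    explanation: str = ""
    relations: List[Relation] = field(default_factory=list)

    def relate(self, semantic_relationship, target):
        """Link this content class to another one through a semantic relationship."""
        self.relations.append(Relation(semantic_relationship, target))

    def related(self, semantic_relationship):
        """Names of the content classes reached through the given relationship."""
        return [r.target.name for r in self.relations
                if r.semantic_relationship == semantic_relationship]


formula = ContentClass(name="Formula_of_Herbal_Medicine")
disease = ContentClass(name="Disease")
formula.relate("cure", disease)
print(formula.related("cure"))
```

Every content class has the same shape here, while the instances differ only in their field values and relations, which mirrors the unified knowledge structure described above.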
More than 20,000 classes and 100,000 instances are defined in the TCM ontology, which has become a distributed large-scale knowledge base for TCM domain knowledge. It is now large enough to cover different aspects of the TCM discipline and is used to support semantic e-Science for TCM. As a large-scale domain ontology, it is used to integrate various database resources in a semantic view and to provide formal semantics to support service coordination in TCM e-Science.
TCM database integration
We use ontology semantics to integrate distributed TCM databases as a global virtual database. We have developed a set of semantic-based toolkits for scientists to integrate and use information in distributed TCM databases.
In our approach, the aforementioned domain ontology acts as a semantic mediator for integrating distributed heterogeneous databases. The relational schemata of distributed TCM databases are mapped to the TCM ontology according to their intrinsic relations. To facilitate the process of semantic mapping between the schemata of local databases and the semantics of the mediated ontology, we have developed a visual semantic mapping tool called DartMapping for integrating relational databases in a Semantic Web way (see figure ). The tool provides two major functions: semi-automatically establishing semantic mappings from heterogeneous relational databases to a mediated ontology, especially mappings for composite schemata with complex joins between tables, and converting relational database schemata to ontology statements based on the semantic mapping information.
DartMapping. The default user interface for DartMapping.
Figure depicts how we use DartMapping to establish mappings between the ontology and a database schema. The relational database schema is displayed as a hierarchy including the names of databases, tables and the corresponding fields (1). The class hierarchy and class properties of the mediated ontology are displayed below (2). Classes and properties are displayed as labels in the panel. The user drags tables and classes into the main panel (3) and establishes their mappings directly. One table may be mapped to more than one class. The meta-information about the selected table is shown under the main panel (4). The right panel shows the outline of the mapping definitions (5). A mapping definition can be exported as an XML file and reused by applications. In addition, users are able to query mapping information defined previously in DartMapping. TCM scientists are able to map local databases to the mediated TCM ontology with DartMapping. Distributed and heterogeneous databases, including TCM databases (e.g. herbal medicine formula database), clinical databases (e.g. EHR database) and biology databases (e.g. neuron database), are integrated as knowledge sources for TCM scientists to carry out research. TCM scientists are able to search and query the integrated databases to obtain information useful for their research.
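As a rough illustration of a mapping definition that can be exported as XML and reused, the Python sketch below serializes one table-to-class mapping and reads it back. The XML element and attribute names, as well as the table and column names, are hypothetical; the actual file format of DartMapping is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping of one table to one ontology class.
mapping = {
    "class": "tcm:Formula_of_Herbal_Medicine",
    "table": "formula",
    "columns": {
        "fname": "tcm:name",
        "dosage": "tcm:usage_and_dosage",
        "ingredients": "tcm:composition",
    },
}


def mapping_to_xml(m):
    """Serialize one table-to-class mapping into an XML string."""
    root = ET.Element("mapping", {"table": m["table"], "class": m["class"]})
    for column, prop in m["columns"].items():
        ET.SubElement(root, "column", {"name": column, "property": prop})
    return ET.tostring(root, encoding="unicode")


def xml_to_mapping(text):
    """Parse the XML string back into the dictionary form used above."""
    root = ET.fromstring(text)
    return {
        "class": root.get("class"),
        "table": root.get("table"),
        "columns": {c.get("name"): c.get("property")
                    for c in root.findall("column")},
    }


xml_text = mapping_to_xml(mapping)
restored = xml_to_mapping(xml_text)
print(xml_text)
```

The round trip returns the same dictionary, which is the property an application would rely on when reusing an exported mapping.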
We developed a database search engine called DartSearch to enable full-text search over distributed databases. Scientists are able to search the integrated databases for the information they require, much as we do with search engines like Google [20]. However, search here is different from Google-like search. The search process is performed based on the semantic relations of the ontology. We call it semantic search, which is searching for data objects rather than Web pages. Semantics is presented in two aspects in DartSearch:
• We construct a domain-specific lexicon for segmentation based on the TCM ontology. Each term in the lexicon is a class or instance in the ontology plus its part of speech. When we segment a piece of information from a database, only the words that appear in the TCM ontology are kept, whereas other words are discarded as irrelevant to TCM semantic search.
• Unlike the keyword-based index of a traditional search engine, we construct an index over the classes or instances found in databases. The semantic relations between those classes or instances are encoded as part of the index.
In this way, scientists are able to search with more accurate constraints and get more relevant information from the search results. For example, if a TCM scientist wants to find TCM formulas that cure influenza, he can use influenza as a keyword to perform a semantic search. The search returns TCM-specific information, including information that does not contain the keyword influenza but contains terms related to influenza. We connect directly matched information and relevant information by using the semantic relations in the ontology.
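The two aspects above, lexicon-based segmentation and an index that follows semantic relations, can be sketched in a few lines of Python. This is a toy under stated assumptions rather than the DartSearch implementation: the lexicon, the relations and the records are invented for illustration, segmentation is reduced to whitespace splitting, and the relations are kept in a separate dictionary instead of being encoded in the index itself.

```python
from collections import defaultdict

# Hypothetical lexicon: terms that exist as classes or instances in an ontology.
LEXICON = {"influenza", "fever", "formula", "cough"}
# Hypothetical semantic relations between ontology terms.
RELATED = {"influenza": {"fever", "cough"}}

# Hypothetical database records, keyed by record id.
RECORDS = {
    1: "This formula is used against influenza in winter",
    2: "A formula that relieves fever and cough",
    3: "Storage instructions for the warehouse",
}


def segment(text):
    """Keep only the words that appear in the ontology lexicon."""
    return [w for w in text.lower().split() if w in LEXICON]


def build_index(records):
    """Map each lexicon term to the ids of the records that mention it."""
    index = defaultdict(set)
    for rid, text in records.items():
        for term in segment(text):
            index[term].add(rid)
    return index


def semantic_search(index, keyword):
    """Return (direct hits, hits reached only through related terms)."""
    direct = set(index.get(keyword, set()))
    related = set()
    for term in RELATED.get(keyword, set()):
        related |= index.get(term, set())
    return sorted(direct), sorted(related - direct)


index = build_index(RECORDS)
direct, related = semantic_search(index, "influenza")
print(direct, related)
```

Record 2 never mentions influenza, yet it is returned as relevant because its terms are related to the keyword, which is the behavior described in the influenza example.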
We provide users with a Google-like search interface to perform semantic search (see the bottom left panel in figure ) in DartSearch. The result of a semantic search request is shown in Figure with gene used as the keyword. The statistics information about the search result (e.g. the number of items) is displayed (1). DartSearch lists the items in descending order of their matching degree to the keywords of the search (2). Each item in the list is a piece of information from databases that have been mapped to TCM ontology classes (3). At the bottom of each item are the classes the item is mapped to (4), the classes relevant to the mapped classes (5) and the matching degree to the keyword. The classes and relevant classes are connected by semantic relations in the ontology. The schemata of a database may be mapped to several categories of the TCM ontology. Categories that are relevant to the search result are listed in descending order of their matching degree to the search (6).
DartSearch. The default user interface for DartSearch.
Generally, semantic search only gives us a coarse set of results. If scientists want more exact information, they are able to query instead of search in the semantic layer. A Web-based query tool called DartQuery is provided for scientists to query distributed TCM databases dynamically (see figure ). The relevant categories generated during semantic search indicate the possible scopes within which scientists can perform semantic queries. They are able to select the category with the largest matching degree to construct a semantic query statement. To enable querying in the semantic layer, we use the SPARQL [21] query language. Every query in SPARQL is viewed as an ontology class definition, and processing a query request is reduced to computing the ontology instances that satisfy the class definition [22]. The statement of a semantic query about the properties (name, usage, composition, etc.) of a TCM formula that cures influenza is as follows:
SELECT ?fn ?fu ?fc ?dn ?dp ?ds
WHERE {
  ?y1 rdf:type tcm:Formula_of_Herbal_Medicine .
  ?y1 tcm:name ?fn .
  ?y1 tcm:usage_and_dosage ?fu .
  ?y1 tcm:composition ?fc .
  ?y1 tcm:cure ?y2 .
  ?y2 tcm:name "influenza" .
  ?y2 tcm:pathogenesis ?dp .
  ?y2 tcm:symptom_complex ?ds .
}
DartQuery. The default user interface for DartQuery.
Such a query in SPARQL is constructed dynamically. A form-like query interface helps users construct semantic query statements in a web browser. The user interface incorporates an open-source AJAX framework [23], which enables immediate data updates without refreshing Web pages in the browser. DartQuery generates query forms automatically according to the class definitions in a category. Scientists are able to construct a query statement by selecting classes and properties from the forms in the query interface. Figure depicts how a user constructs a semantic query about traditional patent medicine. Starting from the ontology view panel on the left, the user browses the hierarchy tree and selects the relevant classes (1). A query form corresponding to the property definitions of the selected class is automatically generated and displayed in the middle. The user can select properties of interest and input query constraints such as the efficiency of the medicine (2). An outline of the currently built query, including the current class, is displayed (3). The user can further explore the classes related to the current one (e.g. disease) and construct more complex semantic queries spanning several classes (4). The user is led into the query interface of the related classes and can add more query constraints (5).
The SPARQL query statement is submitted to the system and converted into a SQL query plan according to the mapping information between the database schemata and the mediated ontology. The SQL query plan is then dispatched to specific databases for information retrieval. The query returns all matching records from databases that have been mapped to the ontology. Since the query result from the databases is just a record set without any semantics, the system converts the record set into a data stream in RDF/XML format so that the semantics of the result is fully presented. Figure depicts the situation in which a user is navigating the query result. The statistics information about the query result is displayed (1). The user selects one data object, which is highlighted (2). By following semantic links, the user can get all the data objects semantically related to the current one (3). Note that the relations between the selected object and those discovered by following semantic links are derived from the ontology in the semantic layer. The user can keep navigating through a collection of databases as long as they are semantically connected (4).
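The conversion from a semantic query to SQL and back to triples can be sketched as follows. This is a deliberately small Python illustration, not the system's query planner: it handles patterns on a single subject mapped to a single table, it uses an in-memory SQLite database with placeholder data, it returns plain tuples rather than RDF/XML, and the names `MAPPING`, `rewrite` and `wrap_as_triples` are hypothetical.

```python
import sqlite3

# Hypothetical mapping from ontology properties to the columns of one table.
MAPPING = {
    "table": "formula",
    "key": "id",
    "properties": {"tcm:name": "fname", "tcm:composition": "ingredients"},
}


def rewrite(patterns, mapping):
    """Turn (property, variable-or-literal) patterns on one subject into SQL."""
    select_cols, where, params = [mapping["key"]], [], []
    for prop, obj in patterns:
        column = mapping["properties"][prop]
        if obj.startswith("?"):      # variable: project the column
            select_cols.append(column)
        else:                        # literal: turn it into a constraint
            where.append(f"{column} = ?")
            params.append(obj)
    sql = f"SELECT {', '.join(select_cols)} FROM {mapping['table']}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


def wrap_as_triples(rows, patterns, mapping):
    """Re-attach ontology properties to the plain record set."""
    props = [p for p, o in patterns if o.startswith("?")]
    triples = []
    for row in rows:
        subject = f"{mapping['table']}/{row[0]}"
        for prop, value in zip(props, row[1:]):
            triples.append((subject, prop, value))
    return triples


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE formula (id INTEGER, fname TEXT, ingredients TEXT)")
conn.execute("INSERT INTO formula VALUES (1, 'formula_A', 'drug_X, drug_Y')")
conn.execute("INSERT INTO formula VALUES (2, 'formula_B', 'drug_Z')")

patterns = [("tcm:name", "formula_A"), ("tcm:composition", "?fc")]
sql, params = rewrite(patterns, MAPPING)
rows = conn.execute(sql, params).fetchall()
triples = wrap_as_triples(rows, patterns, MAPPING)
print(sql)
print(triples)
```

Literals become parameterized `WHERE` constraints and variables become projected columns, so the record set can be relabeled with the ontology properties it came from.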
Query result. The query result of DartQuery.
TCM service coordination
Ontology semantics are used to support dynamic and on-demand service coordination in a VO. Scientists are able to discover, retrieve and compose various services to achieve complex research tasks in a visual environment.
Knowledge discovery service
There are various services in a TCM VO, and we mainly recognize 3 kinds: computation services, information services, and knowledge services. Computation services execute computational jobs or analyze scientific data. Information services manipulate and provide specific information. The semantic query service and semantic search service mentioned above are 2 typical information services. Knowledge services apply information to solve domain-specific problems or discover facts. Different services are used to support different kinds of tasks in TCM research.
One of the most important knowledge services for TCM research is the knowledge discovery service. The distributed databases integrated under the ontology contain much implicit domain knowledge that is hard to discover manually, so intelligent methods are required to assist scientists in discovering it. For example, a formula of herbal medicine is composed of several individual drugs. In a database of herbal medicine formulas, we get the components of a formula directly; however, the same individual drug may appear in several formulas, and the correlation between two individual drugs across various formulas cannot be acquired directly by querying or searching. Note that, according to TCM theory, a relatively fixed combination of several individual drugs is called a paired-drug when such a combination is able to strengthen their medical effects, or lessen the toxicity and side effects of some drugs. Implicit knowledge such as "paired-drug" is more likely to be discovered by data mining than by directly querying or searching information resources. Our method integrates several semantic-based data mining algorithms, such as associated and correlated pattern mining [24], to achieve knowledge discovery on distributed databases. Scientists are able to select a knowledge discovery service according to the requirements of the research task and perform knowledge discovery over a selected set of information from distributed databases.
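The idea of finding paired-drug candidates by mining co-occurrence across formulas can be illustrated with a small Python sketch. It is not the cited mining algorithm: it is a plain filter on support and all-confidence (the pair count divided by the larger of the two individual drug counts), the thresholds are arbitrary, and the formulas and drug names are placeholders rather than real TCM data.

```python
from collections import Counter
from itertools import combinations

# Toy formulas with placeholder drug names (not real TCM data).
FORMULAS = {
    "formula_1": {"drug_A", "drug_B", "drug_C"},
    "formula_2": {"drug_A", "drug_B"},
    "formula_3": {"drug_A", "drug_B", "drug_D"},
    "formula_4": {"drug_C", "drug_D"},
}


def candidate_pairs(formulas, min_support=0.5, min_confidence=0.8):
    """Return drug pairs that co-occur often and rarely appear apart."""
    n = len(formulas)
    single = Counter()
    pair = Counter()
    for drugs in formulas.values():
        single.update(drugs)
        pair.update(combinations(sorted(drugs), 2))
    results = []
    for (a, b), count in pair.items():
        support = count / n
        # All-confidence: the weaker of the two directional confidences.
        confidence = count / max(single[a], single[b])
        if support >= min_support and confidence >= min_confidence:
            results.append((a, b, support, confidence))
    return sorted(results)


pairs = candidate_pairs(FORMULAS)
print(pairs)
```

In the toy data, `drug_A` and `drug_B` appear together in three of four formulas and never apart, so they are the only pair reported. No single query on one formula would reveal this, since the pattern exists only across formulas.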
Besides database integration, a sophisticated e-Science system should also support service coordination for scientists, which is a significant part of TCM e-Science. As in bioinformatics, TCM scientific research often requires the coordination and composition of service resources. We have applied semantic techniques to achieve dynamic and on-demand service coordination in a VO and developed a Web-based service coordination tool called DartFlow (see figure ). DartFlow provides a convenient and efficient way for scientists to collaborate with each other in research activities. It offers interfaces that allow researchers to register, query, compose and execute services in the semantic layer.
DartFlow. The user interface of DartFlow.
Service providers register component Web services into the VO before service composition. DartFlow integrates a service registration portal for scientists to register new services. The class hierarchy (1) and class properties (2) of the mediated ontology are displayed graphically. The service description (e.g. the input and output parameters) is displayed as a hierarchy (3). As with semantic mapping in database integration, service providers create mappings between ontology classes and service descriptions (4). The mapping information is stored in the repository of the portal. Automatic service discovery and service matchmaking are achieved based on semantics. So far, DartFlow has been populated with a collection of scientific services, all provided by different TCM research institutes.
Once a VO has been populated with various applied services, scientists are able to build a serviceflow in DartFlow to achieve complex research tasks. Enough services must be retrieved in order to compose a serviceflow. If scientists want to query services, they submit a service profile (e.g. a service to analyze TCM clinical data) to the portal specifying their requirements. The portal invokes a suitable matchmaking agent to retrieve target services for users (5). The agent implements a semantic-based service matchmaking algorithm. Scientists are able to compose the retrieved services (6) into a serviceflow in the workspace (7) to achieve a research task. To enhance the flexibility and usability of serviceflows, DartFlow supports both static activity nodes and dynamic activity nodes in a serviceflow (8): the former are nodes bound to specific applied services at build-time, and the latter are nodes bound to semantic information. After a serviceflow is designed graphically, the corresponding OWL-S file is generated according to the service mapping information. Scientists are able to validate the serviceflow in terms of both logic and syntax with a validator in DartFlow, and the validated serviceflow is then executed.
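The distinction between static and dynamic activity nodes, and the role of semantic matchmaking, can be sketched in Python as follows. Several things here are assumptions made for illustration: the `Service` type, the ontology class names `tcm:ClinicalData`, `tcm:Data` and `tcm:Report`, the subclass-based matching rule, and the reading that a dynamic node is resolved to a concrete service through matchmaking. The matchmaking algorithm actually used by the agent is not specified in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    input_class: str   # ontology class of the input parameter
    output_class: str  # ontology class of the output parameter


# Hypothetical subclass relation taken from a mediated ontology.
SUBCLASS_OF = {"tcm:ClinicalData": "tcm:Data"}

# Hypothetical registered services.
REGISTRY = [
    Service("clinical_analysis", "tcm:Data", "tcm:Report"),
    Service("formula_lookup", "tcm:Disease", "tcm:Formula_of_Herbal_Medicine"),
]


def is_a(cls, ancestor):
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def matchmake(input_class, output_class, registry=REGISTRY):
    """Services that accept the offered input and produce the wanted output."""
    return [s for s in registry
            if is_a(input_class, s.input_class)
            and is_a(s.output_class, output_class)]


def resolve(node, registry=REGISTRY):
    """Static nodes name a service; dynamic nodes carry a semantic profile."""
    if node["kind"] == "static":
        return next(s for s in registry if s.name == node["service"])
    matches = matchmake(node["input"], node["output"], registry)
    if not matches:
        raise LookupError("no service matches the profile")
    return matches[0]


flow = [
    {"kind": "static", "service": "formula_lookup"},
    {"kind": "dynamic", "input": "tcm:ClinicalData", "output": "tcm:Report"},
]
bound = [resolve(node).name for node in flow]
print(bound)
```

The dynamic node finds `clinical_analysis` even though it asks for `tcm:ClinicalData` and the service declares `tcm:Data`, because matching follows the ontology's subclass relation rather than comparing names.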
TCM collaborative research scenario
The proposed semantic-based approach supports TCM scientists in performing research collaboratively in a VO. TCM scientists are able to use the aforementioned semantic-based toolkits in web browsers anywhere to solve problems and complete tasks. We illustrate the application of our approach through the following collaborative research scenario, presented as several steps (see figure ):
Scenario. A scenario about TCM collaborative research based on the TCM e-Science system.
Task-driven information allocation
A research task often draws on several related information resources. Grouping task-related information resources is a precondition for achieving collaborative research. Given a research task, TCM scientists are able to perform semantic search or construct semantic queries to allocate information according to the task context. Suppose that a TCM scientist, Wang, is performing a research task on the impact of herbal medicine formulas on gene expression. As a TCM scientist, Wang is not familiar with biology, especially genes, so he needs to find some initial information about genes before starting to conduct experiments. He is able to perform a semantic search over distributed databases in DartSearch about genes and their relations with TCM. DartSearch returns a general result about genes and their relations with TCM. If Wang wants more exact results about current research progress on herbal medicine formulas and genes, he can perform a semantic query in DartQuery. The semantic search in DartSearch has indicated that the required information is mainly located in the categories Formula of Herbal Medicine and Diseases. Wang is then able to perform a semantic query within the databases that have been integrated under these two categories. Wang constructs semantic query statements dynamically in DartQuery, and the query returns a collection of literature about herbal medicine formulas and genes.
Collaborative information sharing
After reading a batch of relevant papers, Wang decides to perform further research on the relations between herbal medicine formulas and gene expression. However, he finds that the information he has allocated is insufficient for his task, which means that the TCM VO lacks the required information. An individual scientist is able to allocate only a very small subset of the information or services in the field of TCM, and it is impossible for a single scientist to deal with all the domain information. Scientists can instead share information collaboratively in a VO based on the semantic e-Science system. Wang can communicate with other scientists in the VO to ask for the required information. Fortunately, an institute in the VO has developed a new database that contains information about gene expression. The institute registers the database into the VO by creating semantic mappings with DartMapping. Wang is then able to get further information about gene expression by querying the database.
Scientific service coordination
Wang selects suitable services according to his research requirements and designs a serviceflow in DartFlow to achieve his research goal (see figure ). The first knowledge discovery service in the serviceflow is used to discover underlying rules from the allocated information. The result of knowledge discovery, drawn from many research papers, shows that there is an underlying relation between Sini decoction (a kind of herbal medicine formula) and glutathione S-transferase (GST) gene expression. Wang starts to conduct experiments on the impact of Sini decoction on GST gene expression. The experimental data are submitted to the computation service in the serviceflow. He also uses bioinformatics services such as BLAST in the serviceflow to handle the work related to the GST gene. The final result of the serviceflow shows that Sini decoction has a strong impact on GST gene expression. The serviceflow here may involve a recursive process in order to refine the result.