In this section we will describe our approach in terms of ontologies and their application to address information integration challenges in the context of clinical and translational research (italics denote references to concepts defined in the model).
1 – CCTS Environment Model
Typically, research enterprises such as the UTHSC-H can be modeled in terms of their functional units (e.g., Schools, Departments), affiliates (e.g., Collaborating institutions), investigators (as Individuals, Groups, Panels, or Consortiums), and Research Projects or clinical activities organized around some discipline of science (e.g., Rheumatology, Genetics). Each of these entities can be located using its geospatial properties and its directory and contact information. Research projects produce, document, maintain and administer informational and capital resources and a variety of physical or virtual instruments. Investigators can participate in many departments or research projects, and interact with entities such as devices, resources and other investigators.
We developed a generic model of a research enterprise using OWL and instantiated the model with data from the CCTS program at UTHSC-H. The primary purpose of this model is to represent the concepts for a collaborative information-sharing network and to enable governance, communication, information exchange, resource management, audit, and role based security and policy management.
2 – Research Documentation Model
Clinical and translational research processes can be modelled as entities (e.g., Patients) and events (e.g., Patient Visit) that capture and document the status of some Observations (e.g., existence of Signs or Symptoms). These events are frequently associated with some physical form of Documentation (Survey Form, Database, Worksheet, etc.).
A generalizable ontology has been developed to enable description of informational resources produced through a research activity (e.g., Surveys, Databases). This model extends the CCTS environment model to track activities and entities within projects and to identify, locate and authenticate access to documented research results.
3 – Authorization and Control Model
The CCTS Environment Model and the Research Documentation Model together provide the infrastructure for a security, authorization and audit model by describing investigators, their roles and membership in different research projects, and the datasets produced within each project.
The Authorization model extends the Environment and Research Documentation models by asserting access and retrieval rights directly (e.g., all members of project A can view information from document B of project C). Alternatively, a reasoner can infer rights from existing facts and axioms (e.g., user A1 can retrieve all information from form C because he is a member of a group that participates in consortium D, project C is a project of consortium D, and all members of a consortium can access all documents of any project of that consortium).
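The inference example above can be illustrated with a small rule-based sketch. This is not the OWL reasoner used in the described system; it is a naive forward-chaining loop over plain triples, and all identifiers (userA1, groupG, memberOf, canAccess, and so on) are hypothetical names chosen to mirror the example.

```python
from itertools import product

# Hypothetical asserted facts as (subject, predicate, object) triples.
FACTS = {
    ("userA1", "memberOf", "groupG"),
    ("groupG", "participatesIn", "consortiumD"),
    ("projectC", "projectOf", "consortiumD"),
    ("formC", "documentOf", "projectC"),
}


def infer_access(facts):
    """Apply two rules repeatedly until no new fact appears (naive forward chaining)."""
    facts = set(facts)
    while True:
        new = set()
        # Rule 1: a member of a group that participates in a consortium
        # is treated as a member of that consortium.
        for (u, p1, g), (g2, p2, c) in product(facts, repeat=2):
            if p1 == "memberOf" and p2 == "participatesIn" and g == g2:
                new.add((u, "consortiumMember", c))
        # Rule 2: consortium members can access every document
        # of every project of that consortium.
        for (u, p1, c), (pr, p2, c2), (d, p3, pr2) in product(facts, repeat=3):
            if (p1, p2, p3) == ("consortiumMember", "projectOf", "documentOf") \
                    and c == c2 and pr == pr2:
                new.add((u, "canAccess", d))
        if new <= facts:  # fixed point reached
            return facts
        facts |= new


def can_access(user, document, facts=FACTS):
    """Return True if the access right is asserted or can be inferred."""
    return (user, "canAccess", document) in infer_access(facts)
```

Calling `can_access("userA1", "formC")` derives the right in two rule applications, while a user with no group membership gets no access. A production reasoner would index the facts rather than enumerate all combinations.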
A future extension of this model will incorporate concepts required for management of Patient Consent and classes of activities that are authorized or disallowed. Figure represents the relationships between the models (ontologies) described so far.
CCTS Environment Model, Authorization and Consent Model, and Documentation Model and their relationships. Arrows indicate ontology import (reuse).
4 – Medical Information Model
A semantic application intended to operate in a biomedical environment requires a domain model that describes biomedical concepts (e.g., Diseases) and semantic relationships between them (e.g., All Infectious Disease are Caused by some Infectious Agent). The Gene Ontology [8] is an example of such a model that provides a controlled vocabulary to describe gene and gene product attributes in any organism.
Although candidate ontologies have started to appear that claim comprehensive and formal description of biomedical concepts (GALEN [9], FMA [10], NCI Thesaurus [11]), these ontologies are too large and complicated to be effectively used and maintained even by trained human experts [12]. Rather than using an extremely large ontology that describes every biomedical concept, we chose to create a smaller, more tractable model that includes relevant and domain-specific concepts extracted from larger ontologies [12]. This model is used to describe and explicate the meaning of information found in our research databases. This model can be replaced as new, more modular ontologies become available, or extended should new types of data be included in the application, and can be aligned with larger ontologies on demand (e.g., GALEN).
5 – Integrated Vocabulary Models
Knowledge organization systems (KOS) such as the Unified Medical Language System (UMLS) Metathesaurus are important for consistent documentation and interoperability of health information systems. The following functionality is required for a semantic application to identify correspondence between a domain concept from an ontology (e.g., Fever from the Medical Information Model described above) and a related concept 'code' from a vocabulary system (e.g., UMLS CUI:C0015967).
A unified method of explicating biomedical vocabularies and taxonomies (i.e., KOS) using formal information representation frameworks such as RDF. The Simple Knowledge Organization System (SKOS) [13] is an ongoing W3C standardization effort to support the use of KOS, such as thesauri and taxonomies, on the Semantic Web.
A method of representing correspondence between concepts from multiple KOS or between concepts from OWL ontologies and KOS. The method should be able to account for synonymy, hyponymy, and hypernymy and other relationships (such as part-whole relationships, parent-child) frequently found in biomedical KOS.
A method of search and retrieval from a set of existing KOS, to identify relevant concepts, and terms based on a combination of concept names, synonyms, broader/narrower relations, codes, and coding schemes. These methods are traditionally implemented as Vocabulary Services within biomedical applications.
In order to support these features we have developed:
An algorithm to extract medical vocabularies from their source format and to translate to a SKOS ontology. Our current implementation of the method extracts source vocabularies included in the UMLS Knowledge Source. We have also translated all 200 value sets from 15 vocabulary groups pertaining to the Public Health Information Network (PHIN) frameworks to SKOS representation.
A SKOS model to represent the UMLS Metathesaurus and the UMLS Semantic Network (UMLS-SN) (Figure ). This also lets us group and classify domain concepts based on UMLS-SN. We have extended the UMLS-SN SKOS model with properties to assert correspondence of OWL concepts or SKOS concepts from different source vocabularies with UMLS Metathesaurus Concept Unique Identifiers (CUI). A reasoner can then infer correspondence between OWL concepts or SKOS concepts from multiple source vocabularies using transitive and functional attributes of the properties (e.g., if concepts A and B both correspond to C, then A corresponds to B, hence all SKOS:Definition of A also applies to B and vice versa).
Figure 3 Snapshot of the SKOS representation of a biomedical concept. The right side panel represents the SKOS:Concept related to "Plasminogen Activator Inhibitor", synonymous terms associated with it, two distinct SNOMED-CT concepts and a CUI that correspond ...
A Web Service [14] with methods to search and navigate the information space in order to identify correspondence of terms, codes, and names used to describe domain concepts, using underlying vocabulary systems.
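The correspondence inference described above (if A and B both correspond to C, then A corresponds to B and they share definitions) can be sketched with a union-find structure. This is only an illustration: it assumes correspondence behaves as a symmetric and transitive relation, it does not reproduce the OWL property semantics of the actual model, and the identifiers `mim:Fever` and `vocabA:fever` are hypothetical (only the CUI C0015967 comes from the text).

```python
class Correspondence:
    """Groups concept identifiers into equivalence classes (union-find)."""

    def __init__(self):
        self._parent = {}

    def _find(self, concept):
        # Unknown concepts start as their own representative.
        self._parent.setdefault(concept, concept)
        while self._parent[concept] != concept:
            # Path halving keeps lookup chains short.
            self._parent[concept] = self._parent[self._parent[concept]]
            concept = self._parent[concept]
        return concept

    def assert_corresponds(self, a, b):
        """Record an asserted correspondence between two concepts."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_a] = root_b

    def corresponds(self, a, b):
        """True if a correspondence is asserted or follows by closure."""
        return self._find(a) == self._find(b)

    def shared_definitions(self, concept, definitions):
        """Collect definitions attached to any concept in the same class."""
        root = self._find(concept)
        return {
            text
            for other, texts in definitions.items()
            if self._find(other) == root
            for text in texts
        }


index = Correspondence()
index.assert_corresponds("mim:Fever", "umls:C0015967")
index.assert_corresponds("vocabA:fever", "umls:C0015967")
DEFINITIONS = {"vocabA:fever": {"Abnormal elevation of body temperature."}}
```

Here `index.corresponds("mim:Fever", "vocabA:fever")` holds although it was never asserted directly, and the definition attached to `vocabA:fever` becomes available for `mim:Fever`.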
Once the above-mentioned ontologies are aligned (Figure ) and populated with data from research databases, one can navigate from a high-level view of the CCTS, its affiliate departments and research projects to the actual documentation of research and clinical activities concerning specific patients, down to the particular medical observations made for each case and their corresponding vocabulary codes. A concept-based navigation strategy enables exploration of the integrated information. For example, one may start from a concept such as ACE_Inhibitor and navigate to all information related to administration of any of the ACE inhibitor medications known to the system, the complications reported and other associated data from an integrated pool of patients with a documented history of ACE inhibitor use, the research projects and participating investigators that collected such information, with their departmental and personal contact information, as well as pointers to where the actual research data is stored.
Alignment of Integrated Vocabulary Models and Domain Model with CCTS Environment and Documentation Models.
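The concept-based navigation described above can be pictured as a traversal over the aligned models. The sketch below is a minimal breadth-first walk over plain triples; the data, the predicate names, and the `max_hops` parameter are all hypothetical and stand in for the populated ontologies, not for the system's actual query mechanism.

```python
from collections import deque

# Hypothetical instance data spanning the domain, documentation
# and environment models.
TRIPLES = [
    ("Lisinopril", "subClassOf", "ACE_Inhibitor"),
    ("obs1", "administeredDrug", "Lisinopril"),
    ("obs1", "aboutPatient", "patient1"),
    ("obs1", "documentedIn", "surveyForm1"),
    ("surveyForm1", "producedBy", "projectX"),
    ("projectX", "hasInvestigator", "investigator1"),
]


def navigate(start, triples, max_hops=4):
    """Breadth-first walk that follows triples in both directions.

    Returns a dict mapping each reachable node to its hop distance.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for subject, _, obj in triples:
            for source, target in ((subject, obj), (obj, subject)):
                if source == node and target not in distance:
                    distance[target] = distance[node] + 1
                    queue.append(target)
    return distance
```

Starting from `ACE_Inhibitor`, the walk reaches the medication, the observation, the patient, the form and the project within four hops; raising `max_hops` to 5 also reaches the investigator.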
The CCTS models described so far were prepared to identify, integrate and make available data that already exists in a research enterprise, but they do not explain how new data can be collected and integrated through manual entry or automated data feeds. The next sections of the paper introduce the models and services that enable provisioning and integrating new datasets.
6 – Ontology Driven Survey Design Model (Figure )
Data entered into a semantic system needs to be mapped to relevant domain ontologies. To automate mapping of new manually-entered data to domain ontologies by users who are not familiar with the underlying technology, we developed a model that describes a typical questionnaire in terms of questions, answer options and some navigation rules to establish relationships between answers and questions (e.g., if female, ask if pregnant, if pregnant ask about complications and risk factors).
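The questionnaire model above (questions, answer options mapped to ontology concepts, and navigation rules) can be sketched as a small data structure. The class, field names and concept identifiers below are hypothetical and only illustrate the branching example from the text; they are not the schema of the actual survey design model.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    qid: str
    text: str
    options: dict                      # answer label -> ontology concept (hypothetical ids)
    next_by_answer: dict = field(default_factory=dict)  # answer label -> next question id


QUESTIONS = {
    "sex": Question(
        "sex", "What is the patient's sex?",
        {"female": "mim:Female", "male": "mim:Male"},
        {"female": "pregnant"},        # navigation rule: if female, ask if pregnant
    ),
    "pregnant": Question(
        "pregnant", "Is the patient pregnant?",
        {"yes": "mim:Pregnant", "no": "mim:NotPregnant"},
        {"yes": "complications"},      # if pregnant, ask about complications
    ),
    "complications": Question(
        "complications", "Any complications?",
        {"none": "mim:NoComplication", "hypertension": "mim:GestationalHypertension"},
    ),
}


def run_survey(questions, start, answers):
    """Walk the questionnaire and return (question id, mapped concept) pairs in the order asked."""
    asked, qid = [], start
    while qid is not None:
        question = questions[qid]
        answer = answers[qid]
        asked.append((qid, question.options[answer]))
        # Follow the navigation rule for this answer, or stop if there is none.
        qid = question.next_by_answer.get(answer)
    return asked
```

Because every answer option carries a concept identifier, each submission is mapped to the ontology as a side effect of being collected, with no separate mapping step for the respondent or the form designer.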
We developed a semantic application called the Survey On Demand System (SODS) that uses this model, domain ontologies, and the KOS available to the system to facilitate construction of an ontology-driven structured data entry (SDE) tool for questionnaire and survey design. SODS is an interactive tool that helps users design custom questionnaires and automatically maps all answer options to domain ontologies. SODS is designed to enforce and facilitate consistent use and reuse of domain vocabulary and to support longitudinal integration of survey information. Thus, data captured by multiple independent questionnaires will be automatically integrated (e.g., if the same patient has been studied by several independent projects on multiple occasions). Once questionnaire design is complete, identical SDE forms are automatically deployed on a web site and to a specialized PDA application that collects and submits new submissions. All submissions are mapped to the existing knowledgebase consistently and automatically, without user involvement, and are immediately available for semantic querying and integration.
Clinical Text Understanding and SODS Models and overall orchestration of models.
7 – Automated Ontology Learning Model
For semantic integration, all incoming data should be mapped to an ontology. But transforming and maintaining a consistent mapping of all incoming data to an integration ontology (such as the model described in section 4) becomes very difficult, especially when multiple disparate and heterogeneous data sources are involved. To automate the integration process, we have developed an ontology-driven method to learn and produce an ontological representation, a "proto-ontology," of any given XML message based on the system's previous experience with the same or similar data sources and on inputs from human experts. The system can optionally merge and integrate multiple disparate datasets into a single proto-ontology, or create a custom proto-ontology for each dataset, which would then be mapped to domain ontologies individually. Optionally, the proto-ontology can be processed further by vocabulary services (as described in section 5) to suggest potential correspondence with vocabulary systems or alignment with the appropriate ontologies. Once a human expert modifies parts of the proto-ontology, the system automatically applies the same changes to new data (e.g., it persists user changes even if the new data disagrees, infers inheritance for new siblings of a class, cascades deprecation of concepts to subclasses, instantiates appropriately if equivalencies are established or concepts are merged by the human expert, and identifies appropriate properties to assert new facts based on updates made by the human expert). This approach eliminates the programming usually required for converting XML messages (or relational databases) to custom RDF/OWL representations. Instead, conversion becomes an interactive process, regardless of dataset size or complexity.
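A heavily simplified sketch of the idea follows. It derives candidate classes and properties from the structure of an XML message and reapplies renames supplied by an expert to any later message. The mapping heuristic (elements with children or attributes become classes, leaf elements and attributes become datatype properties) and the `overrides` parameter are assumptions made for illustration; the learning method described above is considerably richer.

```python
import xml.etree.ElementTree as ET

# Hypothetical incoming message.
MESSAGE = "<visit id='7'><patient><mrn>123</mrn></patient><temp>38.5</temp></visit>"


def learn_proto_ontology(xml_text, overrides=None):
    """Derive (classes, properties) from an XML message.

    overrides maps source names to names chosen earlier by a human expert,
    so the same edits are applied to every new message.
    """
    overrides = overrides or {}
    classes, properties = set(), set()

    def name(tag):
        return overrides.get(tag, tag)

    def is_complex(elem):
        return bool(list(elem)) or bool(elem.attrib)

    def visit(elem):
        if not is_complex(elem):
            return  # leaf values are handled by the parent
        classes.add(name(elem.tag))
        for attr in elem.attrib:
            properties.add((name(elem.tag), "datatype", name(attr)))
        for child in elem:
            kind = "object" if is_complex(child) else "datatype"
            properties.add((name(elem.tag), kind, name(child.tag)))
            visit(child)

    visit(ET.fromstring(xml_text))
    return classes, properties
```

Passing `overrides={"temp": "BodyTemperature"}` simulates an expert rename that persists across messages: every later message containing `temp` yields the property `BodyTemperature` without further intervention.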
8 – Clinical Text Understanding
We developed a patient-centric model to describe the syntactic and semantic representations found in a typical clinical text, as it appears in an electronic health record or a reporting system (Figure ). The model describes relationships between symbols found in a clinical text and concepts from the Medical Information Ontology and UMLS-KS. The model has enabled implementation of a minimal syntactic and semantic algorithm to extract concepts and their relationships from a clinical text when their representation in the text matches patterns described in the model. The output is a formal and explicit representation of the input clinical text as an instantiation of an OWL ontology (rather than a set of concepts extracted from the text, as in conventional NLP methods). Being a formal and self-descriptive model, the output can be immediately integrated with or mapped to higher-level ontologies or an existing knowledgebase for inference and ad-hoc querying.
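To make the difference between a bag of extracted concepts and a structured representation concrete, the sketch below matches a tiny lexicon and one negation pattern, then emits triples describing each observation. The lexicon, the concept identifiers, the negation cues and the predicate names are all hypothetical, and plain tuples stand in for OWL individuals; this is not the extraction algorithm described above.

```python
import re

# Hypothetical lexicon: surface term -> concept identifier.
LEXICON = {"fever": "mim:Fever", "cough": "mim:Cough"}


def extract_observations(patient_id, text):
    """Return triples describing each lexicon term found in the text.

    A term preceded by a negation cue is recorded as an absent finding.
    """
    triples = []
    count = 0
    for term, concept in LEXICON.items():
        pattern = re.compile(
            r"\b(?:(no|denies|without)\s+)?" + re.escape(term) + r"\b",
            re.IGNORECASE,
        )
        for match in pattern.finditer(text):
            count += 1
            obs = f"obs{count}"
            triples += [
                (obs, "type", "Observation"),
                (obs, "aboutPatient", patient_id),
                (obs, "hasConcept", concept),
                # group(1) holds the negation cue, if one was matched
                (obs, "isPresent", match.group(1) is None),
            ]
    return triples
```

For the sentence "Patient reports fever but denies cough." the sketch yields two observations linked to the patient, one present and one absent, which could then be loaded alongside the other models rather than stored as a flat concept list.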
To maximize interoperability, scalability and component reuse, all methods, functions and processes are implemented, deployed and orchestrated as Web Services in a Service Oriented Architecture [14