A major element of personalized medicine involves the identification of therapeutic regimens that are safe and effective for specific patients. This contrasts with the well-known “one-size-fits-all” concept of “blockbuster” drugs, which are considered safe and effective for the entire population. The concept of targeted patient groups falls between these two extremes: therapeutic regimens are identified that are safe and effective for specific groups of patients with similar characteristics [1]. A number of factors have contributed to a decline in the emphasis on blockbuster therapeutics and a corresponding rise in the quest for tailored therapeutics, or personalized medicine. Essential to the realization of personalized medicine is the development of information systems capable of providing accurate and timely information about potentially complex relationships between individual patients, drugs, and tailored therapeutic options. The demands of personalized medicine include integrating knowledge across data repositories that have been developed for divergent uses and do not normally adhere to a unified schema. This paper demonstrates the integration of such knowledge across multiple heterogeneous datasets. We show the formation of queries that span these datasets, connecting the information required to support the goal of personalized medicine from both the research and the clinical perspectives.
Integration of the patient electronic health record (EHR) with publicly accessible information creates new opportunities and challenges for clinical research and patient care. One challenge is that the complexity of the information provided to the clinician must not impair the clinician’s ability to accurately and rapidly prescribe drugs that are safe and effective for a specific patient and covered by the patient’s insurance provider. One opportunity is that EHRs enable the identification of adverse events, support outbreak awareness, and provide a rich set of longitudinal data from which researchers and clinicians can study disease, co-morbidity, and treatment outcome. Moreover, the rapid translation of drug and gene-based drug therapy to clinical practice depends on the comprehensive integration of the entire breadth of patient data to facilitate and evaluate drug development [2]. Thus, EHR integration could create the ideal conditions under which new or up-to-date evidence-based guidelines for disease diagnosis and treatment can emerge. Although supplying patient data to the scientific community presents both technical and social challenges [3], a comprehensive system that maintains individual privacy but provides a platform for the analysis of the full extent of patient data is vital for personalized treatment and objective prediction of drug response [4]. The impetus to collect and disseminate relevant patient-specific data for use by clinicians, researchers, and drug developers has never been stronger. Simultaneously, the impetus to provide patient-specific data to patients in a manner that is accurate, timely, and understandable has also never been stronger.
This motivation takes specific form in the US, where health care providers who want stimulus-funded reimbursement under recent electronic health funding to implement or expand the use of electronic medical records (EMRs) in care practices must achieve “meaningful use.” An EMR is an electronic record of health-related information on an individual that is created, gathered, managed, and consulted by licensed clinicians and staff from a single organization who are involved in the individual’s health and care. An electronic health record (EHR) is an aggregate electronic record of health-related information on an individual that is created and gathered cumulatively across more than one health care organization and is managed and consulted by licensed clinicians and staff involved in the individual’s health and care. By these definitions, an EHR is an EMR with interoperability (i.e., integration with other providers’ systems). Achieving meaningful use requires both using certified EHR technology and achieving documented objectives that improve the quality, safety, and efficiency of care while simultaneously reducing disparities, engaging patients and families in their care, promoting public and population health, improving care coordination, and promoting the privacy and security of EHRs (CMS 2010) [5]. A “certified” EHR must meet a collection of regulations and technical requirements to perform the required meaningful use functions (ONCHIT 2010) [6]. Minimum meaningful use requirements include fourteen core objectives, five out of ten specific objectives, and fifteen clinical quality measures (CMS 2010). Meeting these criteria, conditions, and metrics is delayed and complicated by the typical data fragmentation that occurs between the research and health care settings, and this will continue until a “translational” ontology is available to bridge activities, transferring data and entities between research and medical systems.
Translational medicine refers to the process by which the results of research done in the laboratory are directly used to develop new ways to treat patients. It depends on the comprehensive integration of the entire breadth of patient data with basic life science data to facilitate and evaluate drug development [2]. In the 1990s, several efforts related to data integration emerged, including the Archimedes Project and the use of heterogeneous data integration, mathematical and computational modeling, and simulation to expose the underlying dynamics and the different individual treatment response patterns that clinicians observed in patients diagnosed with Major Depressive Disorder [7][8]. When information regarding the patient experience (symptoms, pharmacokinetics/pharmacodynamics, outcomes, side effects) can be directly linked to biomedical knowledge (genetics, pathways, enzymes, chemicals, brain region activity), clinical research can gain new insights into causality and potential treatments. Detailed recordings of clinical encounters are a crucial component of this approach [9][10], and devices such as personal electronic diaries aid both patient and clinician in capturing these accounts accurately.
Electronic Medical Records now act as main repositories for patient data. As we continue to explore the intricate relationship between phenotype and genotype, these records become a vital source for monitoring patients’ progression of disease. The presence of a given variation, as it relates to the appearance or absence of disease over time, can be mapped as encounters are recorded by clinicians. Every result, encounter, event, or diagnosis is recorded as a data item and includes a date. These rich longitudinal data provide trends that show improvement or decline in state and the occurrence or absence of diagnostic criteria, and can be used to guide treatment, provide prognosis, or identify patients who are likely to respond to a potential treatment. The following example illustrates the kinds of data we seek to integrate and analyze for clinical research purposes. Carvedilol is prescribed to a given patient, and a number of blood pressure and heart rate recordings are taken sequentially over time. If this patient takes the medication as prescribed, we can easily observe trends and establish alerts to adjust the medication, if necessary. In addition, the simultaneous occurrence of any recorded side effects can be correlated more easily with potential causative agents. Increases or decreases in laboratory parameters can also be viewed graphically and displayed for easy review by clinicians. Rich longitudinal data can also provide the opportunity to validate diagnostic procedures and otherwise catch discrepancies between corresponding clinical reports. This application of longitudinal data is being investigated in the World Wide Web Consortium (W3C) Health Care and Life Science Interest Group (HCLSIG) within the context of breast cancer, where a radiology report is followed by a biopsy and a pathology report. There should be a set of corresponding observations within the two reports, with the pathology report corroborating the findings of the radiology report [11].
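The carvedilol example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Reading` record, the function names, the sample values, and the alert thresholds are all hypothetical and are not taken from any EHR system or clinical guideline. The sketch estimates a trend with a least-squares slope over dated readings and flags readings that fall below a threshold.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass(frozen=True)
class Reading:
    """One dated observation from a patient record (hypothetical schema)."""
    taken_on: date
    systolic: int    # mmHg
    heart_rate: int  # beats per minute


def slope_per_day(readings, attr):
    """Least-squares slope of one attribute against time, in units per day."""
    if len(readings) < 2:
        return 0.0  # a trend needs at least two readings
    ordered = sorted(readings, key=lambda r: r.taken_on)
    xs = [(r.taken_on - ordered[0].taken_on).days for r in ordered]
    ys = [getattr(r, attr) for r in ordered]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        return 0.0  # all readings share one date: no trend can be estimated
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


def alerts(readings, min_systolic=90, min_heart_rate=50):
    """Flag readings below illustrative thresholds (not clinical advice)."""
    flagged = []
    for r in sorted(readings, key=lambda r: r.taken_on):
        if r.systolic < min_systolic:
            flagged.append((r.taken_on, "low systolic pressure"))
        if r.heart_rate < min_heart_rate:
            flagged.append((r.taken_on, "low heart rate"))
    return flagged


# Made-up weekly readings for a single patient.
readings = [
    Reading(date(2010, 1, 1), 150, 80),
    Reading(date(2010, 1, 8), 140, 72),
    Reading(date(2010, 1, 15), 130, 64),
    Reading(date(2010, 1, 22), 120, 48),
]
print(slope_per_day(readings, "systolic"))  # negative: pressure is falling
print(alerts(readings))
```

Because every data item carries a date, the same two functions could in principle be applied to any dated laboratory parameter, which is what makes the longitudinal record useful for both trend display and alerting.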
Semantic Web technologies enable the integration of heterogeneous data using explicit semantics, the expression of rich and well-defined models for data aggregation, and the application of logic to gain new knowledge from the raw data [12]. Semantic technologies can be used to encode metadata such as provenance, i.e., the original source of the data and how it was generated [13][14]. There are four main Semantic Web standards for knowledge representation: the Resource Description Framework (RDF), RDF Schema (RDFS), the Web Ontology Language (OWL), and the SPARQL query language.
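To make the data model concrete: RDF expresses statements as subject-predicate-object triples, and a SPARQL query is built from triple patterns whose variables are bound against those triples. The Python sketch below imitates this with a plain set of tuples and a small pattern matcher. It is a toy stand-in, not an RDF store or a SPARQL engine, and the `http://example.org/tm/` namespace and all terms under it are hypothetical.

```python
# Hypothetical namespace and terms, used for illustration only.
EX = "http://example.org/tm/"

triples = {
    (EX + "patient1", EX + "hasDiagnosis", EX + "AlzheimersDisease"),
    (EX + "patient1", EX + "takes", EX + "Donepezil"),
    (EX + "Donepezil", EX + "indicatedFor", EX + "AlzheimersDisease"),
    (EX + "Donepezil", EX + "hasSideEffect", EX + "Nausea"),
}


def is_var(term):
    """Variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")


def match(patterns, graph, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in graph:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if is_var(term):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, graph, candidate)


# "Which side effects are recorded for drugs that a patient takes?"
patterns = [
    ("?patient", EX + "takes", "?drug"),
    ("?drug", EX + "hasSideEffect", "?effect"),
]
results = list(match(patterns, triples))
print(results)
```

The shared variable `?drug` is what joins the two patterns. The same mechanism lets a query span datasets once they use common identifiers for the same entity.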
Ontologies, which formalize the meaning of terms used in discourse, are expected to play a major role in the automated integration of patient data with relevant information to support basic discovery and clinical research, drug formulation, and drug evaluation through clinical trials. Already, OWL ontologies have been developed to support work on drugs, pharmacogenomics, and clinical trials [15][16][17], provide a mechanism for the integration and exchange of biological pathways [18, 19], and are increasingly being used in health care and life sciences applications [20]. Another W3C standard, Gleaning Resource Descriptions from Dialects of Languages (GRDDL), enables users to obtain RDF triples from XML documents. Collectively, these next-generation Semantic Web technologies provide the resources required to systematically re-engineer both EHR and research data warehouse systems. This will make it easier and more practical to integrate, query, and analyze the full spectrum of relevant laboratory and clinical research data, as well as EHRs, in supporting the development of cost-effective and outcome-oriented systems.
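The idea of obtaining triples from XML can be illustrated with a short Python sketch. It is not an implementation of GRDDL, which works by associating a transformation (typically XSLT) with the XML document. Here a hand-written function plays the role of that transformation. The XML layout, the `http://example.org/ehr/` namespace, the property names, and the lab values are all hypothetical.

```python
import xml.etree.ElementTree as ET

EX = "http://example.org/ehr/"  # hypothetical namespace

# A made-up lab report; real EHR exports will look different.
LAB_XML = """
<labReport patient="patient1">
  <result test="ALT" value="32" unit="U/L" date="2010-03-01"/>
  <result test="Creatinine" value="0.9" unit="mg/dL" date="2010-03-01"/>
</labReport>
"""


def lab_report_to_triples(xml_text):
    """Turn each <result> element into subject-predicate-object triples."""
    root = ET.fromstring(xml_text)
    patient = EX + root.attrib["patient"]
    triples = []
    for index, result in enumerate(root.findall("result"), start=1):
        node = f"{patient}/result{index}"  # one identifier per lab result
        triples.append((patient, EX + "hasLabResult", node))
        for name in ("test", "value", "unit", "date"):
            triples.append((node, EX + name, result.attrib[name]))
    return triples


lab_triples = lab_report_to_triples(LAB_XML)
for triple in lab_triples:
    print(triple)
```

Each lab result becomes its own node, so the date and unit stay attached to the value they describe. This mirrors the dated data items discussed earlier.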
In this paper, participants in the Translational Medicine task force of the World Wide Web Consortium’s Health Care and Life Sciences Interest Group (W3C HCLSIG) present the Translational Medicine Ontology (TMO) and the Translational Medicine Knowledge Base (TMKB). The TMKB consists of the TMO, mappings to other terminologies and ontologies, and data in RDF format spanning discovery research and drug development, which are of therapeutic relevance to clinical research and clinical practice. The TMO provides a foundation for types declared in Linking Open Drug Data (LODD) [21] and EHRs. The TMO captures core, high-level terminology to bridge existing open domain ontologies and provides a framework to relate and integrate patient-centric data across the knowledge gap from bench to bedside. With the TMO and TMKB, we demonstrate how to bridge the gap and how to develop valuable translational knowledge pertinent to clinical research, and therefore to clinical practice.
The remainder of the paper is structured as follows: we describe the use case for the TMKB, which centers around Alzheimer’s Disease (AD), then describe the methods used to build the TMKB, the ontology design process, data sources, and mappings. We then explore pertinent questions that the TMKB can answer in the results, discuss our findings, and conclude with a listing of unsolved problems and possible future directions for this work.
Use case
Alzheimer’s Disease (AD) is an incurable, degenerative, and terminal disease with few therapeutic options [22][23]. It is a complex disease influenced by a range of genetic, environmental, and other factors [23]. Recently, Jack et al. [24] demonstrated the value of shared data in AD biomarker research. A New York Times article on the role of data sharing in the advancement of AD research quotes John Trojanowski at the University of Pennsylvania Medical School: “It’s not science the way most of us have practiced it in our careers. But we all realized that we would never get biomarkers unless all of us parked our egos and intellectual-property noses outside the door and agreed that all of our data would be public immediately.” [25] Efficient aggregation of relevant information improves our understanding of disease and significantly benefits researchers, clinicians, patients, and pharmaceutical companies.
We demonstrate the usefulness of TMO and TMKB in a use case that follows a patient and physician from a first report of symptoms, to diagnosis of AD, selection of an optimal treatment regimen, consideration of alternative treatments following the report of side effects caused by the initial treatment, and finally to the selection of possible appropriate clinical trials for the patient.
The Alzheimer’s Disease patient use case can be summarized in the following way:
1. A patient and family members report symptoms to a physician/clinician. The physician/clinician enters the reported symptoms into an EHR. All concepts are mapped to URIs with the help of TMO.
2. The physician makes a list of differential diagnoses, with a working diagnosis of AD.
3. The physician arranges for the patient to have a basic biochemical, haematological, and SNP profile undertaken. Biochemistry, haematology, and SNP requests are input directly by the respective departments into the patient’s EHR. Preliminary SNP and genetic data will be submitted directly to the NIH Pharmacogenetics Research Network (PGRN).
4. A follow-up meeting is scheduled to perform a set of diagnostic tests that the clinician initially considers most appropriate for the disease presentation.
5. The physician continues to add investigations/lab results to the patient’s EHR and these are combined with the patient’s medical history information. A disease is chosen as the most likely of the listed differential diagnoses based on all of the information provided.
6. The physician confirms the diagnosis with behavioral assessments, cognitive tests, and, if indicated, an appropriate brain scan. The physician now has a refined and widely acceptable diagnosis of AD and enters the diagnosis data into the patient’s EHR.
7. The physician selects the most appropriate AD drug and clinical protocol from the patient’s medical record based on the severity of the disease, the patient’s SNP profile (ADME, efficacy/safety based on presence or absence of receptors), the patient’s BMI, concurrent medication, and drug availability on Medicare Part D. Fundamental questions will be answered by the ontology at this stage by sourcing relevant data sets simultaneously or in a specific order:
• What are the clinically recommended agents?
• What products are available for prescription, and which are legally indicated for AD?
• What is the SNP verdict? These agents are cross-referenced with a pharmacogenomics database to determine:
– Will they be efficacious? Is the disease receptor positive?
– Will they be harmful? Are there toxic metabolites? Is CYP 450 or acetylator status available?
• Are the preceding predictive genetic SNP tests covered by the patient’s insurance company? Are the resulting pharmaceutical agents covered by the patient’s specific insurance?
8. The physician checks with the pharmacist, or consults drug information literature to avoid potential drug interactions.
9. The physician now prescribes Aricept (Donepezil) as it satisfies the criteria listed above. It is indicated, safe, effective, and available; there are no drug interactions or issues with drug delivery; and it is covered by the patient’s insurance.
10. In a follow-up visit, the patient reports nausea from Donepezil. The physician is aware of this common side effect (other reported side effects include bradycardia, diarrhea, anorexia, abdominal pain, and vivid dreams) and re-consults the literature to ensure that it is acceptable, both clinically and to the patient. The physician documents the side effect for post-marketing adverse event pick-up and future study. The physician changes the medication if necessary or adds another medication to alleviate side effects.
11. The physician considers moving the patient to a trial. The physician obtains information on all trials (local, national, and international) for AD. Trials might be listed in data sources from the FDA, WHO, ClinicalTrials.gov, Citeline TrialTrove, etc.; academia or pharma may also solicit patients, or the physician may point the patient to investigators undertaking a trial.
• The physician decides whether
– to enroll the patient in a clinical trial because one of the agents looks very suitable and may benefit the patient, or because the patient is interested in participating in the trial;
– not to enroll the patient because the trial is unsuitable or the patient declines to participate in the trial;
– to obtain information for the patient on a trial appropriate for the patient with potential of future enrollment.
12. The physician checks if the patient meets trial inclusion/exclusion criteria by querying the EHR.
13. The patient has a thorough medical assessment (lifestyle, medical history, genomics, proteomics, metabolomics, images, cognition) to supplement and update existing data.
14. The results of the medical exam influence the arm of the trial in which the patient participates. The patient status is updated.
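The screening questions in step 7 (indication, pharmacogenomic risk, interactions, insurance coverage) can be read as a sequence of filters over candidate drugs. The Python sketch below shows that reading under stated assumptions. The `Drug` and `Patient` records, the `screen` function, and the placeholder names `DrugA`, `MedX`, and `EnzymeQ` are hypothetical and carry no pharmacological meaning. In the paper's setting, the corresponding facts would come from the integrated data sources rather than from hard-coded sets.

```python
from dataclasses import dataclass, field


@dataclass
class Drug:
    """Candidate drug with the facts needed for screening (hypothetical)."""
    name: str
    indications: set
    interacts_with: set = field(default_factory=set)
    risky_for_poor_metabolizers_of: set = field(default_factory=set)


@dataclass
class Patient:
    """The slice of a patient record used for screening (hypothetical)."""
    diagnosis: str
    current_medications: set
    poor_metabolizer_for: set  # enzymes flagged by the SNP profile
    formulary: set             # drug names covered by the patient's insurance


def screen(drug, patient):
    """Return the reasons a drug fails screening; an empty list means it passes."""
    reasons = []
    if patient.diagnosis not in drug.indications:
        reasons.append("not indicated")
    if drug.interacts_with & patient.current_medications:
        reasons.append("drug interaction")
    if drug.risky_for_poor_metabolizers_of & patient.poor_metabolizer_for:
        reasons.append("pharmacogenomic risk")
    if drug.name not in patient.formulary:
        reasons.append("not covered")
    return reasons


# Placeholder data only; none of these names refer to real products.
patient = Patient(
    diagnosis="AD",
    current_medications={"MedX"},
    poor_metabolizer_for={"EnzymeQ"},
    formulary={"DrugA", "DrugB"},
)
candidates = [
    Drug("DrugA", {"AD"}),
    Drug("DrugB", {"AD"}, interacts_with={"MedX"}),
    Drug("DrugC", {"AD"}, risky_for_poor_metabolizers_of={"EnzymeQ"}),
]
report = {drug.name: screen(drug, patient) for drug in candidates}
print(report)
```

Returning the reasons rather than a yes/no answer keeps the clinician informed about why an agent was excluded, in line with the requirement that added complexity must not impair prescribing.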
Questions relevant to this use case scenario are listed in Table 1. Such questions can be formulated as SPARQL queries (see section SPARQL queries, and additional file 1) and answered using the TMKB.
Table 1. Questions and answers using TMO-integrated data sources