In the past decade, there has been a surge of discoveries in the genomic sciences relating single-nucleotide polymorphisms (SNPs) to clinical conditions and measurable traits in complex, non-Mendelian diseases [1]. These discoveries have become feasible thanks to advances in high-throughput genotyping technologies and genome-wide association studies (GWAS), which allow the entire human genome to be studied in thousands of unrelated individuals for genetic associations with different diseases. However, because GWAS impose stringent requirements for statistical significance and replication of findings, a key aspect of conducting such studies is the subject sample size (for cases and controls): the larger the cohort, the higher the probability of attaining genome-wide significant results and discovering genetic variants that influence disease.
To address this need, the U.S. National Institutes of Health initiated the Electronic Medical Records and Genomics (eMERGE) consortium [2, 3], which aims to determine whether patient data stored in electronic health records (EHRs) can be used to identify disease phenotypes for GWAS. Especially in the era of Meaningful Use [4], which promotes wide-scale adoption of EHRs, this approach to identifying disease phenotype cohorts from EHR data, if successful, has the potential to enable and rapidly scale genetic discovery and research. Early results from the eMERGE network, of which Mayo Clinic is a member, have demonstrated the applicability of EHR-derived phenotyping algorithms for cohort identification in genomic studies of several diseases and traits, including peripheral arterial disease [5], red blood cell traits [6], and atrioventricular conduction [7]. A common thread across the library of algorithms [8] is the need for access to different types and modalities of data for algorithm execution, including billing and diagnosis information, laboratory measurements, patient procedure encounters, medication and prescription management data, and co-morbidities (e.g., smoking history, socio-economic status). This naturally raises the problem of representing and integrating data from the EHR and public knowledge bases (e.g., a knowledge base of drug side effects) in a form that allows federated querying, reasoning, and efficient information retrieval across multiple sources of information.
Semantic Web [9] technologies provide a rigorous mechanism for defining and linking heterogeneous data using Web protocols and a simple data model called the Resource Description Framework (RDF) [10]. By representing data as labeled graphs, RDF provides a powerful framework for expressing and integrating any type of data. As of March 2011, under the auspices of the Linked Open Data (LOD) initiative [11, 12], more than 215 public datasets from multiple domains (e.g., gene and disease relationships, drugs and side effects) are available in RDF and have been integrated by specifying approximately 350 million links between the RDF graphs. Such efforts not only provide tremendous opportunities to devise novel approaches for combining private, institution-specific EHR data with public knowledge bases for phenotyping, but also present several challenges: representing EHR data using RDF, creating linkages between multiple disparate RDF graphs, and developing mechanisms for executing federated queries that analyze information spanning genes, proteins, pathways, diseases, drugs, and adverse events.
To this end, this paper describes our efforts to represent real patient data from EHR systems at Mayo Clinic as RDF graphs. In particular, we leverage open-source tooling and infrastructure developed by the Linked Data community to demonstrate Web-scale federated querying to answer questions about Diabetes Mellitus using public knowledge bases. Our tool highlights the potential of combining and integrating private and public information to answer complex queries in a robust, uniform, and scalable way.
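As a rough illustration of the general idea, and not of the system described in this paper, the following minimal Python sketch models two small RDF-style graphs as sets of (subject, predicate, object) triples and answers a question that spans both. Every URI, predicate name, patient, drug, and side effect in it is hypothetical and invented for illustration. A real deployment would use an RDF store and a query language such as SPARQL rather than in-memory tuples.

```python
# Minimal sketch: RDF-style triples as plain tuples (standard library only).
# All namespaces, predicates, and data values are hypothetical placeholders.
EHR = "http://example.org/ehr/"     # stands in for a private, institutional graph
PUB = "http://example.org/public/"  # stands in for a public knowledge base

# Private graph: facts about (made-up) patients.
ehr_graph = {
    (EHR + "patient/1", EHR + "hasDiagnosis", PUB + "disease/DiabetesMellitus"),
    (EHR + "patient/1", EHR + "takesDrug", PUB + "drug/DrugA"),
    (EHR + "patient/2", EHR + "takesDrug", PUB + "drug/DrugB"),
}

# Public graph: (made-up) drug and side-effect relationships.
public_graph = {
    (PUB + "drug/DrugA", PUB + "hasSideEffect", PUB + "effect/EffectX"),
    (PUB + "drug/DrugB", PUB + "hasSideEffect", PUB + "effect/EffectY"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples in `graph` that fit the pattern; None is a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def side_effects_for_diagnosis(disease):
    """Join both graphs: patients with `disease` -> their drugs -> side effects."""
    results = set()
    for patient, _, _ in match(ehr_graph, p=EHR + "hasDiagnosis", o=disease):
        for _, _, drug in match(ehr_graph, s=patient, p=EHR + "takesDrug"):
            # The shared drug URI is the link between the two graphs.
            for _, _, effect in match(public_graph, s=drug, p=PUB + "hasSideEffect"):
                results.add((patient, drug, effect))
    return sorted(results)


if __name__ == "__main__":
    for row in side_effects_for_diagnosis(PUB + "disease/DiabetesMellitus"):
        print(row)
```

The join works only because both graphs refer to the same drug by the same URI. Establishing such shared identifiers, or explicit links between differing ones, is the linkage challenge noted above.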