In the past decade, genomic research on complex, non-Mendelian diseases has produced a large number of discoveries relating single-nucleotide polymorphisms (SNPs) to clinical conditions and measurable traits. This has become feasible due to advances in high-throughput genotyping technologies and genome-wide association studies (GWAS), which make it possible to scan the entire human genome in thousands of unrelated individuals for genetic associations with different diseases. However, unlike variants underlying Mendelian traits, genetic variants associated with common diseases have relatively small effect sizes, and thus large sample sizes are required for discovery.
To address this research need, several academic medical centers are forming biorepositories, or biobanks, that collect and store individual biospecimens from which DNA can be extracted for genetic research. Additionally, these biobanks are often linked to electronic health records (EHRs) that support retrieval and querying of vast amounts of phenotype data [1, 2]. The Electronic Medical Records and Genomics (eMERGE [3]) consortium, a network of ten academic medical centers of which Mayo Clinic is a member, has demonstrated the applicability of “EHR-derived phenotyping algorithms” for cohort identification to conduct GWAS for several diseases and traits, including peripheral arterial disease [4], red blood cell traits [5], and atrioventricular conduction [6]. A common thread across the library of algorithms [7] is access to different types and modalities of clinical data for algorithm execution, including billing and diagnosis information, laboratory measurements, patient procedure encounters, medication and prescription management data, and co-morbidities (e.g., smoking history, socio-economic status). While these approaches with EHR-linked biorepositories have successfully facilitated GWAS, such studies typically focus on a narrow phenotypic domain, such as the presence or absence of a given disease, and ignore the potential power that can be gained by analyzing intermediate and sub-phenotypes and by considering pleiotropic associations. Furthermore, most existing GWAS results are based on populations of European descent, which limits the understanding of the genetic contribution to diseases and traits in other racial and ethnic populations. To this end, there has been emerging interest in mining the human phenome via a “reverse GWAS”, or PheWAS (Phenome Wide Association Scan): for a given genotype, the goal is to identify the set of associated clinical phenotypes. By using clinical data from EHRs, a PheWAS allows systematic study of associations between a number of common genetic variations and a large variety of clinical phenotypes. Recent studies by Denny et al. [8] and Pendergrass et al. [9] demonstrated the potential of PheWAS to replicate previously published genotype-phenotype associations, as well as to identify novel associations, using patient EHR data. However, extracting phenotype data from EHRs poses the challenge of representing and integrating data in a form that allows federated querying, reasoning, and efficient information retrieval across multiple sources of clinical data and information.
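To make the "reverse GWAS" idea concrete, the following is a minimal Python sketch of a phenome-wide scan for a single genotype. It is an illustration only and not the method used in this study: the patient records, the phenotype labels, and the function names (`contingency`, `odds_ratio`, `phewas_scan`) are all hypothetical, and a plain odds ratio stands in for the regression models, covariate adjustment, and multiple-testing correction that a real PheWAS would need.

```python
# Hypothetical toy records: (carries_risk_allele, set of phenotype labels).
# All values are invented purely for illustration.
patients = [
    (True, {"T2DM", "hypertension"}),
    (True, {"T2DM"}),
    (True, {"hypothyroidism"}),
    (True, {"T2DM", "hypothyroidism"}),
    (False, {"hypertension"}),
    (False, set()),
    (False, {"T2DM"}),
    (False, {"hypertension"}),
]


def contingency(patients, phenotype):
    """Return 2x2 counts (a, b, c, d) for carrier status vs. phenotype.

    a: carriers with the phenotype      b: carriers without it
    c: non-carriers with the phenotype  d: non-carriers without it
    """
    a = sum(1 for carrier, ph in patients if carrier and phenotype in ph)
    b = sum(1 for carrier, ph in patients if carrier and phenotype not in ph)
    c = sum(1 for carrier, ph in patients if not carrier and phenotype in ph)
    d = sum(1 for carrier, ph in patients if not carrier and phenotype not in ph)
    return a, b, c, d


def odds_ratio(a, b, c, d):
    """Odds ratio with 0.5 added to every cell so empty cells do not divide by zero."""
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return (a * d) / (b * c)


def phewas_scan(patients):
    """For one genotype, score every observed phenotype and rank by odds ratio."""
    phenotypes = sorted(set().union(*(ph for _, ph in patients)))
    scores = {p: odds_ratio(*contingency(patients, p)) for p in phenotypes}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for phenotype, score in phewas_scan(patients):
        print(f"{phenotype}: OR = {score:.2f}")
```

The point of the sketch is the direction of the loop: a GWAS fixes one phenotype and iterates over genotypes, whereas this scan fixes one genotype and iterates over every phenotype found in the records.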
The work proposed in this study attempts to address this challenge by exploring and experimenting with Semantic Web technologies for enabling a PheWAS. A key aspect of the Semantic Web is a rigorous mechanism for defining and linking heterogeneous data using Web protocols and a simple data model called the Resource Description Framework (RDF). By representing data as labeled graphs, RDF provides a powerful framework for expressing and integrating any type of data. As of March 2012, under the auspices of an initiative called Linked Open Data (LOD [10]), more than 250 public datasets from multiple domains (e.g., gene and disease relationships, drugs and side effects) are available in RDF and have been integrated by specifying approximately 350 million links between the RDF graphs. Such efforts not only provide tremendous opportunities to devise novel approaches for combining private, institution-specific EHR data with public knowledge bases for phenotyping, but also present several challenges in representing EHR data using RDF, creating linkages between multiple disparate RDF graphs, and developing mechanisms for executing federated queries that analyze information spanning genes, proteins, pathways, diseases, drugs, and adverse events.
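As a rough illustration of the RDF data model described above, the sketch below represents two small graphs as sets of (subject, predicate, object) triples in plain Python and follows a link from a private EHR graph into a public one. Everything in it is an assumption made for illustration: the `example.org` URIs, the predicate names, the gene and drug identifiers, and the helper functions `match` and `genes_for_patient` are hypothetical, and a real system would use an RDF store and SPARQL rather than in-memory sets.

```python
# An RDF statement is a (subject, predicate, object) triple; a graph is a set of triples.
# All URIs below are hypothetical placeholders, not real identifiers.
EHR = "http://example.org/ehr/"
PUB = "http://example.org/public/"

# Private, institution-specific graph (toy data).
ehr_graph = {
    (EHR + "patient/1", EHR + "hasDiagnosis", PUB + "disease/T2DM"),
    (EHR + "patient/1", EHR + "hasGenotype", EHR + "snp/rsA"),
    (EHR + "patient/2", EHR + "hasDiagnosis", PUB + "disease/Hypothyroidism"),
}

# Public knowledge-base graph (toy data).
public_graph = {
    (PUB + "disease/T2DM", PUB + "associatedGene", PUB + "gene/G1"),
    (PUB + "disease/T2DM", PUB + "treatedBy", PUB + "drug/D1"),
}


def match(graph, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def genes_for_patient(patient, graphs):
    """Follow patient -> diagnosis -> gene across the merged graphs."""
    # Merging graphs is a set union of their triples (blank nodes are ignored here).
    merged = set().union(*graphs)
    genes = set()
    for _, _, disease in match(merged, s=patient, p=EHR + "hasDiagnosis"):
        for _, _, gene in match(merged, s=disease, p=PUB + "associatedGene"):
            genes.add(gene)
    return genes


if __name__ == "__main__":
    print(genes_for_patient(EHR + "patient/1", [ehr_graph, public_graph]))
```

The link between the two graphs is simply the shared disease URI: because both graphs name the same node, merging them yields a single connected graph that can be traversed from patient to gene.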
In this paper, we describe our efforts in representing real patient data, both clinical and genomic, from Mayo Clinic’s EHR systems [11] and biobank, respectively, as RDF graphs. In particular, we leverage open-source tooling and infrastructure developed within the Semantic Web community to extract phenotype and genotype information on subjects with Type 2 Diabetes Mellitus (T2DM) or Hypothyroidism, and conduct a phenome-wide scan to discover new genetic associations as well as replicate existing ones. As a proof of concept, we present our results on eight SNPs associated with T2DM and Hypothyroidism within an EHR population at the Mayo Clinic biobank. Our approach highlights the potential of using Semantic Web technologies to explore a large and varied range of clinical phenotypes derived from EHRs for genomics research in a high-throughput manner.