Research in human genetics and genetic epidemiology has grown significantly over the previous decade, particularly in the field of pharmacogenomics. Pharmacogenomics presents an opportunity for rapid translation of associated genetic polymorphisms into diagnostic measures or tests to guide therapy as part of a move towards personalized medicine. Expansion in genotyping technology has cleared the way for widespread use of whole-genome genotyping in the effort to identify novel biology and new genetic markers associated with pharmacokinetic and pharmacodynamic endpoints. With new technology and methodology regularly becoming available for use in genetic studies, a discussion on the application of such tools becomes necessary. In particular, quality control criteria have evolved with the use of GWAS as we have come to understand potential systematic errors which can be introduced into the data during genotyping. There have been several replicated pharmacogenomic associations, some of which have moved to the clinic to enact change in treatment decisions. These examples of translation illustrate the strength of evidence necessary to successfully and effectively translate a genetic discovery. In this review, the design of pharmacogenomic association studies is examined with the goal of optimizing the impact and utility of this research. Issues of ascertainment, genotyping, quality control, analysis and interpretation are considered.
Epistasis; genotyping; personalized medicine; pharmacogenomics; quality control; statistics; study design
As genetic epidemiology looks beyond mapping single disease susceptibility loci, interest in detecting epistatic interactions between genes has grown. The dimensionality and comparisons required to search the epistatic space and the inference for a significant result pose challenges for testing epistatic disease models. The Multifactor Dimensionality Reduction Pedigree Disequilibrium Test (MDR-PDT) was developed to test for multilocus models in pedigree data. In the present study we rigorously tested MDR-PDT with new cross-validation (CV) (both 5- and 10-fold) and omnibus model selection algorithms by simulating a range of heritabilities, odds ratios, minor allele frequencies, sample sizes, and numbers of interacting loci. Power was evaluated using 100, 500, and 1000 families, with minor allele frequencies of 0.2 and 0.4 and broad-sense heritabilities of 0.005, 0.01, 0.03, 0.05, and 0.1 for 2- and 3-locus purely epistatic penetrance models. We also compared the prediction error measure of effect with a predicted matched odds ratio for final model selection and testing. We report that the CV procedure is valid with the permutation test, that MDR-PDT performs similarly with 5- and 10-fold CV, and that the matched odds ratio is more powerful than prediction error as the fitness metric for MDR-PDT.
Epistasis; MDR-PDT; complex disease; family-based association; bioinformatics
In genetic studies of complex disease a consideration for the investigator is detection of joint effects. The Multifactor Dimensionality Reduction (MDR) algorithm searches for these effects with an exhaustive approach. Previously unknown aspects of MDR performance were the power to detect interactive effects given large numbers of non-model loci or varying degrees of heterogeneity among multiple epistatic disease models.
To address the performance with many non-model loci, datasets of 500 cases and 500 controls with 100 to 10,000 SNPs were simulated for two-locus models, and one hundred 500-case/500-control datasets with 100 and 500 SNPs were simulated for three-locus models. Multiple levels of locus heterogeneity were simulated in several sample sizes.
These results show MDR is robust to locus heterogeneity when the definition of power is not as conservative as in previous simulation studies where all model loci were required to be found by the method. The results also indicate that MDR performance is related more strongly to broad-sense heritability than sample size and is not greatly affected by non-model loci.
A study in which a population with high heritability estimates is sampled predisposes the MDR study to success more than a larger ascertainment in a population with smaller estimates.
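The exhaustive search that MDR performs can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: the function names and toy data are hypothetical, cells are labelled high-risk by comparing their case:control ratio with the overall ratio, and balanced accuracy is computed on the full dataset without the cross-validation a real analysis would use.

```python
from collections import Counter
from itertools import combinations


def mdr_score(genotypes, status, loci):
    """Balanced accuracy of the high/low-risk labelling for one locus combination.

    genotypes: list of tuples of genotype codes, one tuple per individual.
    status: list of 1 (case) or 0 (control), same order as genotypes.
    Assumes at least one case and one control are present.
    """
    cases, controls = Counter(), Counter()
    for g, s in zip(genotypes, status):
        cell = tuple(g[i] for i in loci)  # multilocus genotype cell
        (cases if s == 1 else controls)[cell] += 1
    n_case, n_ctrl = sum(cases.values()), sum(controls.values())
    threshold = n_case / n_ctrl  # overall case:control ratio
    true_pos = true_neg = 0
    for cell in set(cases) | set(controls):
        if cases[cell] > threshold * controls[cell]:
            true_pos += cases[cell]      # cell labelled high-risk
        else:
            true_neg += controls[cell]   # cell labelled low-risk
    return 0.5 * (true_pos / n_case + true_neg / n_ctrl)


def mdr_search(genotypes, status, order=2):
    """Evaluate every combination of `order` loci and return the best one."""
    n_loci = len(genotypes[0])
    return max(combinations(range(n_loci), order),
               key=lambda loci: mdr_score(genotypes, status, loci))


# Toy data (hypothetical): disease status depends only on loci 0 and 1 jointly;
# locus 2 is a non-model locus.
genotypes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
status = [a ^ b for a, b, c in genotypes]
best = mdr_search(genotypes, status)
```

In this toy example neither model locus has a marginal effect, so only the joint evaluation of loci 0 and 1 separates cases from controls.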
Epistasis; MDR; Heterogeneity
Only one LDL-C GWAS has been reported in African Americans. We performed a GWAS of LDL-C in African Americans using data extracted from electronic medical records (EMR) in the eMERGE network. African Americans were genotyped on the Illumina 1M chip. All LDL-C measurements, prescriptions, and diagnoses of concomitant disease were extracted from EMR. We created two analytic datasets: one with median LDL-C calculated after the exclusion of some lab values based on co-morbidities and medication (n = 618) and another with median LDL-C calculated without any exclusions (n = 1249). Rs7412 in APOE was strongly associated with LDL-C at levels of GWAS significance in both datasets (p < 5 × 10−8). In the dataset with exclusions, a decrease of 20.0 mg/dl per minor allele was observed. The effect size was attenuated (12.3 mg/dl) in the dataset without any lab values excluded. Although other signals in APOE have been detected in previous GWAS, this large and important SNP association has not been well detected in large GWAS because rs7412 was not included on many genotyping arrays. Use of median LDL-C extracted from EMR after exclusions for medications and co-morbidities increased the percentage of trait variance explained by genetic variation.
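The two phenotype definitions can be illustrated with a short sketch. The record fields (`ldl`, `on_medication`, `comorbidity`) and the toy values are hypothetical; the abstract does not specify the exact exclusion rules, so the filter below is only an assumption about their general shape.

```python
from statistics import median


def median_ldl(records, apply_exclusions):
    """Median LDL-C for one patient, optionally dropping flagged lab values.

    records: list of dicts with hypothetical keys 'ldl' (mg/dl),
    'on_medication' and 'comorbidity' (booleans).
    Returns None when no lab value survives the exclusions.
    """
    values = [
        r["ldl"]
        for r in records
        if not (apply_exclusions and (r["on_medication"] or r["comorbidity"]))
    ]
    return median(values) if values else None


# Toy patient (hypothetical): two measurements were taken on lipid-lowering therapy.
patient = [
    {"ldl": 160, "on_medication": False, "comorbidity": False},
    {"ldl": 150, "on_medication": False, "comorbidity": False},
    {"ldl": 155, "on_medication": False, "comorbidity": False},
    {"ldl": 100, "on_medication": True, "comorbidity": False},
    {"ldl": 95, "on_medication": True, "comorbidity": False},
]
with_exclusions = median_ldl(patient, apply_exclusions=True)
without_exclusions = median_ldl(patient, apply_exclusions=False)
```

In the toy patient, treated measurements pull the unfiltered median downward, which is one way an untreated genetic effect could appear attenuated in the dataset without exclusions.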
GWAS; LDL; electronic medical records
Routine integration of genotype data into drug decision-making could improve patient safety, particularly if many relevant genetic variants can be assayed simultaneously before target drug prescribing. The frequency of pharmacogenetic prescribing opportunities and the potential adverse events (AE) mitigated are unknown. We examined the frequency with which 56 medications with known outcomes influenced by variant alleles were prescribed in a cohort of 52,942 medical home patients at Vanderbilt University Medical Center. Within a five-year window, we estimated that 64.8% (95% CI: 64.4%-65.2%) of individuals were exposed to at least one medication with an established pharmacogenetic association. Using previously published results for six medications with well-characterized, severe genetically-linked AEs, we estimated that 398 events (95% CI, 225 - 583) could have been prevented with an effective preemptive genotyping program. Our results suggest that multiplexed, preemptive genotyping may represent an efficient alternative approach to current single use (“reactive”) methods and may improve safety.
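As a worked check on the reported exposure estimate, the sketch below computes a normal-approximation (Wald) interval for a proportion of 64.8% in 52,942 patients. The abstract does not say which interval method was used, so the Wald formula is an assumption; it agrees with the reported bounds to one decimal place.

```python
from math import sqrt


def wald_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion."""
    se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    return p_hat - z * se, p_hat + z * se


# Values taken from the abstract: 64.8% of 52,942 patients exposed.
low, high = wald_ci(0.648, 52942)
print(f"{low:.1%} to {high:.1%}")
```

With a cohort this large the interval is narrow, so the uncertainty in the exposure estimate comes mainly from how exposure was defined rather than from sampling.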
A multi-ethnic study demonstrates that the extrapolation of genetic disease risk models from European populations to other ethnicities is compromised more strongly by genetic structure than by environmental or global genetic background in differential genetic risk associations across ethnicities.
The vast majority of genome-wide association study (GWAS) findings reported to date are from populations with European Ancestry (EA), and it is not yet clear how broadly the genetic associations described will generalize to populations of diverse ancestry. The Population Architecture Using Genomics and Epidemiology (PAGE) study is a consortium of multi-ancestry, population-based studies formed with the objective of refining our understanding of the genetic architecture of common traits emerging from GWAS. In the present analysis of five common diseases and traits, including body mass index, type 2 diabetes, and lipid levels, we compare direction and magnitude of effects for GWAS-identified variants in multiple non-EA populations against EA findings. We demonstrate that, in all populations analyzed, a significant majority of GWAS-identified variants have allelic associations in the same direction as in EA, with none showing a statistically significant effect in the opposite direction, after adjustment for multiple testing. However, 25% of tagSNPs identified in EA GWAS have significantly different effect sizes in at least one non-EA population, and these differential effects were most frequent in African Americans where all differential effects were diluted toward the null. We demonstrate that differential LD between tagSNPs and functional variants within populations contributes significantly to dilute effect sizes in this population. Although most variants identified from GWAS in EA populations generalize to all non-EA populations assessed, genetic models derived from GWAS findings in EA may generate spurious results in non-EA populations due to differential effect sizes. Regardless of the origin of the differential effects, caution should be exercised in applying any genetic risk prediction model based on tagSNPs outside of the ancestry group in which it was derived. 
Models based directly on functional variation may generalize more robustly, but the identification of functional variants remains challenging.
The number of known associations between human diseases and common genetic variants has grown dramatically in the past decade, most being identified in large-scale genetic studies of people of Western European origin. But because the frequencies of genetic variants can differ substantially between continental populations, it's important to assess how well these associations can be extended to populations with different continental ancestry. Are the correlations between genetic variants, disease endpoints, and risk factors consistent enough for genetic risk models to be reliably applied across different ancestries? Here we describe a systematic analysis of disease outcome and risk-factor–associated variants (tagSNPs) identified in European populations, in which we test whether the effect size of a tagSNP is consistent across six populations with significant non-European ancestry. We demonstrate that although nearly all such tagSNPs have effects in the same direction across all ancestries (i.e., variants associated with higher risk in Europeans will also be associated with higher risk in other populations), roughly a quarter of the variants tested have significantly different magnitude of effect (usually lower) in at least one non-European population. We therefore advise caution in the use of tagSNP-based genetic disease risk models in populations that have a different genetic ancestry from the population in which original associations were first made. We then show that this differential strength of association can be attributed to population-dependent variations in the correlation between tagSNPs and the variant that actually determines risk—the so-called functional variant. Risk models based on functional variants are therefore likely to be more robust than tagSNP-based models.
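The dilution argument can be made explicit with a standard result for an additive linear model under Hardy-Weinberg equilibrium. The notation is introduced here for illustration and is not taken from the study: $\beta_{\text{func}}$ is the per-allele effect of the functional variant, $\beta_{\text{tag}}$ the effect estimated at the tagSNP, $p_f$ and $p_t$ their allele frequencies, $D$ the linkage disequilibrium coefficient between them, and $r$ the corresponding correlation.

```latex
\beta_{\text{tag}}
  \;=\; \beta_{\text{func}}\,\frac{D}{p_t(1-p_t)}
  \;=\; \beta_{\text{func}}\; r\,\sqrt{\frac{p_f(1-p_f)}{p_t(1-p_t)}},
\qquad
r \;=\; \frac{D}{\sqrt{p_f(1-p_f)\,p_t(1-p_t)}}
```

Under these assumptions, the same functional effect yields a smaller tagSNP effect in a population where $|r|$ is lower, which is consistent with effects being diluted toward the null where LD between tagSNPs and functional variants is weaker. For binary outcomes the relation holds only approximately on the log-odds scale.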
We genotyped 326 “frequently medicated” individuals of European descent in Vanderbilt’s biorepository linked to de-identified electronic medical records, BioVU, on the ADME Core Panel to assess quality and performance of the assay. We compared quality control metrics and determined the extent of direct and indirect marker overlap between the ADME Core Panel and the Illumina Omni1-Quad. We found the quality of the ADME Core Panel data to be high, with exceptions in select copy number variants (CNVs) and markers in certain genes (notably CYP2D6). Most of the common variants on the ADME panel are genotyped by the Omni1, but absent rare variants and CNVs could not be accurately tagged by single markers. Finally, our frequently medicated study population did not convincingly differ in allele frequency from reference populations, suggesting that heterogeneous clinical samples (with respect to medications) follow similar allele frequency distributions in pharmacogenetic genes as their appropriate reference populations.
ADME; pharmacogenomics; pharmacogenetics; BioVU; biorepository; CYP2D6; personalized medicine; precision medicine
To identify novel genetic loci influencing interindividual variation in red blood cell (RBC) traits in African-Americans, we conducted a genome-wide association study (GWAS) in 2315 individuals, divided into discovery (n = 1904) and replication (n = 411) cohorts. The traits included hemoglobin concentration (HGB), hematocrit (HCT), RBC count, mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), and mean corpuscular hemoglobin concentration (MCHC). Patients were participants in the electronic MEdical Records and GEnomics (eMERGE) network and underwent genotyping of ~1.2 million single-nucleotide polymorphisms on the Illumina Human1M-Duo array. Association analyses were performed adjusting for age, sex, site, and population stratification. Three loci previously associated with resistance to malaria—HBB (11p15.4), HBA1/HBA2 (16p13.3), and G6PD (Xq28)—were associated (P ≤ 1 × 10−6) with RBC traits in the discovery cohort. The loci replicated in the replication cohort (P ≤ 0.02), and were significant at a genome-wide significance level (P < 5 × 10−8) in the combined cohort. The proportions of variance in RBC traits explained by significant variants at these loci were as follows: rs7120391 (near HBB) 1.3% of MCHC, rs9924561 (near HBA1/A2) 5.5% of MCV, 6.9% of MCH and 2.9% of MCHC, and rs1050828 (in G6PD) 2.4% of RBC count, 2.9% of MCV, and 1.4% of MCH, respectively. We were not able to replicate loci identified by a previous GWAS of RBC traits in a European ancestry cohort of similar sample size, suggesting that the genetic architecture of RBC traits differs by race. In conclusion, genetic variants that confer resistance to malaria are associated with RBC traits in African-Americans.
red blood cell (RBC) traits; genome-wide association study; African-Americans; natural selection; informatics; electronic medical record
With the recent decreasing cost of genome sequence data, there has been increasing interest in rare variants and methods to detect their association with disease. We developed BioBin, a flexible collapsing method inspired by biological knowledge that can be used to automate the binning of low frequency variants for association testing. We also built the Library of Knowledge Integration (LOKI), a repository of data assembled from public databases, which contains resources such as dbSNP and gene Entrez database information from the National Center for Biotechnology Information (NCBI), pathway information from Gene Ontology (GO), Protein families database (Pfam), Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, NetPath - signal transduction pathways, Open Regulatory Annotation Database (ORegAnno), Biological General Repository for Interaction Datasets (BioGrid), Pharmacogenomics Knowledge Base (PharmGKB), Molecular INTeraction database (MINT), and evolutionary conserved regions (ECRs) from UCSC Genome Browser. The novelty of BioBin is access to comprehensive knowledge-guided multi-level binning. For example, bin boundaries can be formed using genomic locations from functional regions, evolutionary conserved regions, genes, and/or pathways.
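Knowledge-guided collapsing of this kind can be sketched in a few lines. This is not BioBin's interface: the function names, the variant records, and the choice of gene as the binning feature are hypothetical, and the default threshold simply mirrors a rare-variant cutoff of MAF < 0.03.

```python
from collections import defaultdict


def bin_rare_variants(variants, maf_threshold=0.03):
    """Group rare variant IDs by a knowledge-derived feature (here: gene)."""
    bins = defaultdict(list)
    for v in variants:
        if v["maf"] < maf_threshold:  # keep only low-frequency variants
            bins[v["gene"]].append(v["id"])
    return dict(bins)


def bin_burden(bins, genotypes):
    """Sum minor-allele counts per bin for one individual.

    genotypes: dict mapping variant ID to minor-allele count (0, 1, or 2).
    """
    return {name: sum(genotypes.get(vid, 0) for vid in ids)
            for name, ids in bins.items()}


# Toy annotation (hypothetical variant IDs, genes, and frequencies).
variants = [
    {"id": "v1", "gene": "GENE_A", "maf": 0.010},
    {"id": "v2", "gene": "GENE_A", "maf": 0.020},
    {"id": "v3", "gene": "GENE_A", "maf": 0.200},  # common, so not binned
    {"id": "v4", "gene": "GENE_B", "maf": 0.005},
]
bins = bin_rare_variants(variants)
burden = bin_burden(bins, {"v1": 1, "v2": 0, "v3": 2, "v4": 2})
```

Swapping the `gene` key for a pathway or conserved-region label would give bins at a different level, and the per-bin burden could then be compared between groups with any suitable association test.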
We tested BioBin using simulated data and 1000 Genomes Project low-coverage data, evaluating the method with simulated causative variants and with a pairwise comparison of rare variant (MAF < 0.03) burden differences between Yoruba individuals (YRI) and individuals of European descent (CEU). Lastly, we analyzed the NHLBI GO Exome Sequencing Project Kabuki dataset (Kabuki syndrome is a congenital disorder affecting multiple organs and often causing intellectual disability), contrasted with Complete Genomics data as controls.
The results from our simulation studies indicate that the type I error rate is controlled; however, power falls quickly for small sample sizes using variants with modest effect sizes. Using BioBin, we were able to find simulated variants in genes with fewer than 20 loci, but found the sensitivity to be much lower in large bins. We also highlighted the scale of population stratification between two 1000 Genomes Project populations, CEU and YRI. Lastly, we were able to apply BioBin to natural biological data from dbGaP and identify an interesting candidate gene for further study.
We have established that BioBin will be a very practical and flexible tool to analyze sequence data and potentially uncover novel associations between low frequency variants and complex disease.
A report on the Keystone Symposium 'Complex Traits: Genomics and Computational approaches', Breckenridge, Colorado, USA, 20-25 February 2012.
HIV-associated sensory neuropathy remains an important complication of combination antiretroviral therapy (CART) and HIV infection. Mitochondrial DNA haplogroups and single nucleotide polymorphisms (SNPs) have previously been associated with symptomatic neuropathy in clinical trial participants. We examined associations between mitochondrial DNA variation and HIV-associated sensory neuropathy in CHARTER. CHARTER is a U.S.-based longitudinal observational study of HIV-infected adults who underwent a structured interview and standardized examination. HIV-associated sensory neuropathy was determined by trained examiners as ≥1 sign (diminished vibratory and sharp-dull discrimination or ankle reflexes) bilaterally. Mitochondrial DNA sequencing was performed and haplogroups were assigned by published algorithms. Multivariable logistic regression analyses of associations between mitochondrial DNA SNPs, haplogroups, and HIV-associated sensory neuropathy were performed. In analyses of associations of each mitochondrial DNA SNP with HIV-associated sensory neuropathy, the two most significant SNPs were A12810G (odds ratio [95% confidence interval] = 0.27 [0.11-0.65]; p = 0.004) and T489C (odds ratio [95% confidence interval] = 0.41 [0.21-0.80]; p = 0.009). These synonymous changes are known to define African haplogroup L1c and European haplogroup J, respectively. Both haplogroups are associated with decreased prevalence of HIV-associated sensory neuropathy compared with all other haplogroups (odds ratio [95% confidence interval] = 0.29 [0.12-0.71]; p = 0.007 and odds ratio [95% confidence interval] = 0.42 [0.18-1.0]; p = 0.05, respectively). In conclusion, in this cohort of mostly combination antiretroviral therapy-treated subjects, two common mitochondrial DNA SNPs and their corresponding haplogroups were associated with a markedly decreased prevalence of HIV-associated sensory neuropathy.
genetics; mitochondria; HIV-related neurological diseases; peripheral neuropathy
Simulation studies are useful in various disciplines for a number of reasons including the development and evaluation of new computational and statistical methods. This is particularly true in human genetics and genetic epidemiology where new analytical methods are needed for the detection and characterization of disease susceptibility genes whose effects are complex, nonlinear, and partially or solely dependent on the effects of other genes. Despite this need, the development of complex genetic models that can be used to simulate data is not always intuitive. In fact, only a few such models have been published. In this paper, we present a strategy for identifying complex genetic models for simulation studies that utilizes genetic algorithms. The genetic models used in this study are penetrance functions that define the probability of disease given that a specific DNA sequence variation has been inherited. We demonstrate that the genetic algorithm approach routinely identifies interesting and useful penetrance functions in a human-competitive manner.
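To make the notion of a penetrance function concrete, the sketch below summarizes a two-locus penetrance table: it computes disease prevalence, the marginal penetrance of each single-locus genotype, and broad-sense heritability as the variance of penetrance divided by K(1 - K). The example table, the function names, and the assumption of Hardy-Weinberg proportions with independent loci are illustrative choices, not details from the paper; quantities like these are the kind of criteria a genetic algorithm's fitness function might evaluate.

```python
def genotype_freqs(maf):
    """Hardy-Weinberg genotype frequencies for 0, 1, and 2 minor alleles."""
    q, p = maf, 1 - maf
    return [p * p, 2 * p * q, q * q]


def summarize(penetrance, maf_a, maf_b):
    """Prevalence, marginal penetrances, and heritability of a 3x3 table.

    penetrance[i][j] = P(disease | genotype i at locus A, genotype j at locus B).
    Assumes the two loci are independent (no linkage disequilibrium).
    """
    fa, fb = genotype_freqs(maf_a), genotype_freqs(maf_b)
    cells = [(i, j) for i in range(3) for j in range(3)]
    prevalence = sum(fa[i] * fb[j] * penetrance[i][j] for i, j in cells)
    marginal_a = [sum(fb[j] * penetrance[i][j] for j in range(3)) for i in range(3)]
    marginal_b = [sum(fa[i] * penetrance[i][j] for i in range(3)) for j in range(3)]
    variance = sum(fa[i] * fb[j] * (penetrance[i][j] - prevalence) ** 2
                   for i, j in cells)
    heritability = variance / (prevalence * (1 - prevalence))
    return prevalence, marginal_a, marginal_b, heritability


# Illustrative table: risk is elevated only for certain genotype combinations.
table = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]
K, marg_a, marg_b, h2 = summarize(table, maf_a=0.5, maf_b=0.5)
```

For this table every marginal penetrance equals the prevalence, so neither locus shows an effect on its own and the model is purely epistatic.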
Over the past several years, genome-wide association studies (GWAS) have succeeded in identifying hundreds of genetic markers associated with common diseases. However, most of these markers confer relatively small increments of risk and explain only a small proportion of familial clustering. To identify obstacles to future progress in genetic epidemiology research and provide recommendations to NIH for overcoming these barriers, the National Cancer Institute sponsored a workshop entitled “Next Generation Analytic Tools for Large-Scale Genetic Epidemiology Studies of Complex Diseases” on September 15–16, 2010. The goal of the workshop was to facilitate discussions on (1) statistical strategies and methods to efficiently identify genetic and environmental factors contributing to the risk of complex disease; and (2) how to develop, apply, and evaluate these strategies for the design, analysis, and interpretation of large-scale complex disease association studies in order to guide NIH in setting the future agenda in this area of research. The workshop was organized as a series of short presentations covering scientific (gene-gene and gene-environment interaction, complex phenotypes, and rare variants and next generation sequencing) and methodological (simulation modeling and computational resources and data management) topic areas. Specific needs to advance the field were identified during each session and are summarized.
gene-gene interactions; gene-environment interactions; rare variants; next generation sequencing; complex phenotypes; simulations; computational resources
Tacrolimus, an immunosuppressive drug widely prescribed in kidney transplantation, requires therapeutic drug monitoring due to its marked interindividual pharmacokinetic variability and narrow therapeutic index. Previous studies have established that CYP3A5 rs776746 is associated with tacrolimus clearance, blood concentration, and dose requirement. The importance of other drug absorption, distribution, metabolism, and elimination (ADME) gene variants has not been well characterized.
We used novel DNA biobank and electronic medical record resources to identify ADME variants associated with tacrolimus dose requirement. Broad ADME genotyping was performed on 446 kidney transplant recipients who had been dosed to steady state with tacrolimus. The cohort was obtained from Vanderbilt's DNA biobank, BioVU, which contains linked, de-identified electronic medical record data. Genotyping included Affymetrix DMET Plus (1936 polymorphisms), custom Sequenom MassARRAY iPLEX Gold assay (95 polymorphisms), and ancestry-informative markers. The primary outcome was tacrolimus dose requirement defined as blood concentration-to-dose ratio.
In analyses that adjusted for race and other clinical factors, we replicated the association of tacrolimus blood concentration-to-dose ratio with CYP3A5 rs776746 (p = 7.15 × 10−29), and identified associations with nine variants in linkage disequilibrium with rs776746, including eight CYP3A4 variants. No NR1I2 variants were significantly associated. Age, weight, and hemoglobin were also significantly associated with the outcome. In final models, rs776746 explained 39% of variability in dose requirement, and 46% was explained by the model containing clinical covariates.
This study highlights the utility of DNA biobanks and electronic medical records for tacrolimus pharmacogenomic research.
pharmacogenomics; pharmacokinetics; calcineurin inhibitor; tacrolimus; electronic medical records; kidney transplant; cytochrome P4503A5; genetic polymorphism; dosing
Mitochondrial DNA (mtDNA) variation has been associated with time to progression to AIDS and adverse effects from antiretroviral therapy (ART). In this study, full mtDNA sequence data from U.S.-based adult participants in the AIDS Clinical Trials Group (ACTG) study 384 were used to assess associations between mtDNA variants and CD4 T cell recovery with ART.
Full mtDNA sequence was determined using chip-based array sequencing. Sequence and CD4 cell count data were available at baseline and after ART initiation for 423 subjects with HIV RNA levels <400 copies/mL plasma. The primary outcome was change in CD4 count of ≥100 cells/mm3 from baseline. Analyses were adjusted for baseline age, CD4 cell count, HIV RNA, and naïve:memory CD4 cell ratio.
Race-stratified analysis of mtDNA variants with a minor allele frequency >1% revealed multiple mtDNA variants marginally associated (P < 0.05 before Bonferroni correction) with CD4 cell recovery. The most significant SNP associations were those tagging the African L2 haplogroup, which was associated with a decreased likelihood of ≥100 cells/mm3 CD4 count increase at week 48 in non-Hispanic blacks (adjusted OR=0.17; 95% CI=0.06–0.53; P=0.002).
An African mtDNA haplogroup was associated with CD4 cell recovery after ART in this clinical trial population. These initial findings warrant replication and further investigation in order to confirm the role of mtDNA variation in CD4 cell recovery during ART.
HIV; CD4 count; Mitochondrial DNA; Pharmacogenetics; Antiretroviral Therapy
The current paradigm of human genetics research is to analyze variation of a single data type (i.e., DNA sequence or RNA levels) to detect genes and pathways that underlie complex traits such as disease state or drug response. While these studies have detected thousands of variations that associate with hundreds of complex phenotypes, much of the estimated heritability, or trait variability due to genetic factors, remains unexplained. We may be able to account for a portion of the missing heritability if we incorporate a systems biology approach into these analyses. Rapid technological advances will make it possible for scientists to explore this hypothesis via the generation of high-throughput omics data – transcriptomic, proteomic and methylomic to name a few. Analyzing this ‘meta-dimensional’ data will require clever statistical techniques that allow for the integration of qualitative and quantitative predictor variables. In this article, we examine two major categories of approaches for integrated data analysis, give examples of their use in experimental and in silico datasets, and assess the limitations of each method.
computational methods; data integration; pharmacogenomics; systems biology
To identify common genetic variants influencing red blood cell (RBC) traits.
Patients and Methods
We performed a genomewide association study from June 2008 through July 2011 of hemoglobin, hematocrit, RBC count, mean corpuscular volume, mean corpuscular hemoglobin, and mean corpuscular hemoglobin concentration in 12,486 patients of European ancestry from the electronic MEdical Records and Genomics (eMERGE) network. We developed an electronic medical record–based algorithm that included individuals who had RBC measurements obtained for clinical care and excluded values measured in the setting of hematopoietic disorders, comorbid conditions, or medications known to affect RBC production or a recent history of blood loss.
We identified 4 new genetic loci and replicated 11 loci previously reported to be associated with one or more RBC traits in individuals of European ancestry. Notably, genes present in 3 of the 4 newly identified loci (THRB, PTPLAD1, CDT1) and in 6 of the 11 replicated loci (KLF1, ALDH8A1, CCND3, SPTA1, FBXO7, TFR2/EPO) are implicated in erythroid differentiation and regulation of cell cycle in hematopoietic stem cells.
Genes in the erythroid differentiation and cell cycle regulation pathways influence interindividual variation in RBC indices. Our results provide insights into the molecular basis underlying variation in RBC traits.
eMERGE, electronic MEdical Records and GEnomics; EMMAX, mixed-model association-expedited; EMR, electronic medical record; eQTL, expression quantitative trait locus; GHC, Group Health Cooperative--University of Washington; GWAS, genomewide association study; HCT, hematocrit; HGB, hemoglobin; IBS, identity-by-state; LD, linkage disequilibrium; MC, Marshfield Clinic; MCH, mean corpuscular hemoglobin; MCHC, mean corpuscular hemoglobin concentration; MCV, mean corpuscular volume; MIM, Mendelian Inheritance of Man; NU, Northwestern University; RBC, red blood cell; SNP, single-nucleotide polymorphism; VUMC, Vanderbilt University Medical Center
This study tested the hypothesis that two common polymorphisms in the chromosome 4q25 region that have been associated with atrial fibrillation (AF) contribute to the variable penetrance of familial AF.
Although mutations in ion channels, gap junction proteins, and signaling molecules have been described for Mendelian forms of AF, penetrance is highly variable. Recent studies have consistently identified 2 common single nucleotide polymorphisms (SNPs) in the chromosome 4q25 region as independent AF susceptibility alleles.
We studied 11 families in which AF was present in ≥2 individuals who also shared a candidate gene mutation. These mutations were identified in all subjects with familial lone AF (n=33) as well as apparently unaffected family members (age >50 yrs with no AF; n=17).
Mutations were identified in SCN5A (n=6), NPPA (n=2), KCNQ1 (n=1), KCNA5 (n=1), and NKX2.5 (n=1). In genetic association analyses, un-stratified and stratified according to age of onset of AF and unaffected age >50 yrs, there was a highly statistically significant association between the presence of both common (rs2200733, rs10033464) and rare variants and AF (un-stratified P=1×10−8; stratified [age of onset <50 yrs and unaffected age >50 yrs], P=7.6×10−5) (un-stratified P<0.0001; stratified [age of onset <50 yrs and unaffected age >50 yrs], P<0.0001). Genetic association analyses showed that the presence of common 4q25 risk alleles predicted whether carriers of rare mutations developed AF (P = 2.2×10−4).
Common AF-associated 4q25 polymorphisms modify the clinical expression of latent cardiac ion channel and signaling molecule gene mutations associated with familial AF. These findings support the idea that the genetic architecture of AF is complex and includes both rare and common genetic variants.
Pharmacogenomics has emerged in recent years as a popular type of study in human genetics, primarily because of its many success stories and high potential for translation to clinical practice. In this review, the strengths and limitations of pharmacogenomics are discussed, as well as the primary epidemiologic, clinical trial, and in vitro study designs implemented. Molecular and analytic approaches are briefly reviewed. Finally, several examples of bench-to-bedside clinical implementations of pharmacogenetic traits are described. Pharmacogenomics continues to grow in popularity because of the important genetic associations identified that drive the possibility of precision medicine.
DNA biobanks linked to comprehensive electronic health records systems are potentially powerful resources for pharmacogenetic studies. This study sought to develop natural-language-processing algorithms to extract drug-dose information from clinical text, and to assess the capabilities of such tools to automate the data-extraction process for pharmacogenetic studies.
Materials and methods
A manually validated warfarin pharmacogenetic study identified a cohort of 1125 patients with a stable warfarin dose, of whom 776 were managed by Coumadin Clinic physicians and the remaining 349 were managed by their providers. The authors developed two algorithms to extract weekly warfarin doses from both data sets: a regular expression-based program for semistructured Coumadin Clinic notes; and an advanced weekly dose calculator based on an existing medication information extraction system (MedEx) for narrative providers' notes. The authors then conducted an association analysis between an automatically extracted stable weekly dose of warfarin and four genetic variants of VKORC1 and CYP2C9 genes. The performance of the weekly dose-extraction program was evaluated by comparing it with a gold standard containing manually curated weekly doses. Precision, recall, F-measure, and overall accuracy were reported. Associations between known variants in VKORC1 and CYP2C9 and warfarin stable weekly dose were tested with linear regression adjusted for age, gender, and body mass index.
The authors' evaluation showed that the MedEx-based system could determine patients' warfarin weekly doses with 99.7% recall, 90.8% precision, and 93.8% accuracy. Using the automatically extracted weekly doses of warfarin, the authors successfully replicated the previously known associations between warfarin stable dose and genetic variants in VKORC1 and CYP2C9.
Automated learning; knowledge representations; discovery; text and data-mining methods; other methods of information extraction; natural-language processing; NLP; warfarin; old epass; Genetics; translational research—application of biological knowledge to clinical care; improving the education and skills training of health professionals; linking the genotype and phenotype
Phenome-Wide Association Studies (PheWAS) can be used to investigate the association between single nucleotide polymorphisms (SNPs) and a wide spectrum of phenotypes. This approach is complementary to Genome-Wide Association Studies (GWAS), which calculate the association between hundreds of thousands of SNPs and one or a limited range of phenotypes. The extensive exploration of the association between phenotypic structure and genotypic variation through PheWAS produces a set of complex and comprehensive results. Visualization of the data is integral to fully inspecting, analysing, and interpreting PheWAS results.
We have developed the software PheWAS-View for visually integrating PheWAS results, including information about the SNPs, relevant genes, phenotypes, and the interrelationships between phenotypes. As a result, both the fine-grained detail and the larger trends within PheWAS results can be elucidated.
PheWAS can be used to discover novel relationships between SNPs, phenotypes, and networks of interrelated phenotypes; identify pleiotropy; provide novel mechanistic insights; and foster hypothesis generation. These results can be both explored and presented with PheWAS-View. PheWAS-View is freely available for non-commercial research institutions; for full details see http://ritchielab.psu.edu/ritchielab/software.
PheWAS; Phenome-Wide Association Study; Visualization
Atrial fibrillation is a highly prevalent arrhythmia and a major risk factor for stroke, heart failure and death¹. We conducted a genome-wide association study (GWAS) in individuals of European ancestry, including 6,707 with and 52,426 without atrial fibrillation. Six new atrial fibrillation susceptibility loci were identified and replicated in an additional sample of individuals of European ancestry, including 5,381 subjects with and 10,030 subjects without atrial fibrillation (P < 5 × 10⁻⁸). Four of the loci identified in Europeans were further replicated in silico in a GWAS of Japanese individuals, including 843 individuals with and 3,350 individuals without atrial fibrillation. The identified loci implicate candidate genes that encode transcription factors related to cardiopulmonary development, cardiac-expressed ion channels and cell signaling molecules.
Genome Wide Association Studies (GWAS) are a standard approach for large-scale characterization of common variation and for identification of single loci predisposing to disease. However, because of moderate sample sizes and, in particular, multiple testing correction, many variants of smaller effect size are not detected within a single-allele analysis framework. Thus, small main effects and potential epistatic effects are not consistently observed in GWAS using standard analytical approaches that consider only single SNP alleles. Here we propose a unique methodology that aggregates variants of interest (for example, genes in a biological pathway) using GWAS results. Multiple testing and type I error concerns are minimized by using empirical genomic randomization to estimate significance. Randomization corrects for common pathway-based analysis biases such as SNP coverage and density, linkage disequilibrium, gene size and pathway size. PARIS (Pathway Analysis by Randomization Incorporating Structure) applies this randomization and in doing so directly accounts for linkage disequilibrium effects. PARIS is independent of the association analysis method and is thus applicable to GWAS datasets of all study designs. Using the KEGG database as an example, we apply PARIS to the publicly available Autism Genetic Resource Exchange (AGRE) GWA dataset, revealing pathways with a significant enrichment of positive association results.
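The general idea of estimating pathway significance by randomization can be sketched as follows. This is a simplified illustration and not the PARIS implementation: the feature representation (an identifier mapped to a size bin and a significance flag), the function name `empirical_pathway_p`, sampling with replacement, and the add-one p-value correction are all assumptions made for the example.

```python
import random
from collections import defaultdict


def empirical_pathway_p(pathway, features, n_iter=1000, seed=0):
    """Empirical p-value for enrichment of significant features in a pathway.

    pathway:  list of feature ids belonging to the pathway
    features: dict mapping feature id -> (size_bin, is_significant)
              for every feature in the genome (hypothetical representation)
    """
    rng = random.Random(seed)

    # Group genome-wide features by size bin so that each random draw can
    # match the structure of the pathway feature it stands in for.
    by_bin = defaultdict(list)
    for fid, (size_bin, _) in features.items():
        by_bin[size_bin].append(fid)

    observed = sum(features[f][1] for f in pathway)
    bins_needed = [features[f][0] for f in pathway]

    hits = 0
    for _ in range(n_iter):
        # Draw one random feature per pathway feature from the same size bin
        # (with replacement, for simplicity).
        drawn = [rng.choice(by_bin[b]) for b in bins_needed]
        if sum(features[f][1] for f in drawn) >= observed:
            hits += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_iter + 1)


if __name__ == "__main__":
    # Toy genome: 100 features in one size bin, only the first 3 significant.
    toy = {f"f{i}": (0, i < 3) for i in range(100)}
    print(empirical_pathway_p(["f0", "f1", "f2"], toy, n_iter=999))
```

Because the random sets are drawn from the same size bins as the pathway's own features, a pathway is compared against collections with similar structure rather than against an arbitrary set of features.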
pathway analysis; genomic randomization; gene set; enrichment
Approximately 1 million people in the United States and over 30 million worldwide are living with human immunodeficiency virus type 1 (HIV-1). While mortality from untreated infection approaches 100%, survival improves markedly with the use of contemporary antiretroviral therapies (ART). In the United States, 25 drugs are approved for treating HIV-1, and increasing numbers are available in resource-limited countries. Safe and effective ART is a cornerstone in the global struggle against the acquired immunodeficiency syndrome. Variable responses to ART are due at least in part to human genetic variants that affect drug metabolism, drug disposition, and off-site drug targets. Defining the effects of human genetic variants on HIV treatment toxicity, efficacy, and pharmacokinetics has far-reaching implications. In 2010, the National Institute of Allergy and Infectious Diseases sponsored a workshop entitled "Pharmacogenomics – A Path Towards Personalized HIV Care". This article summarizes the workshop objectives, presentations, discussions, and recommendations derived from this meeting.
HIV therapy; pharmacogenetics; pharmacogenomics; workshop