Research in human genetics and genetic epidemiology has grown significantly over the previous decade, particularly in the field of pharmacogenomics. Pharmacogenomics presents an opportunity for rapid translation of associated genetic polymorphisms into diagnostic measures or tests to guide therapy as part of a move towards personalized medicine. Expansion in genotyping technology has cleared the way for widespread use of whole-genome genotyping in the effort to identify novel biology and new genetic markers associated with pharmacokinetic and pharmacodynamic endpoints. With new technology and methodology regularly becoming available for use in genetic studies, a discussion on the application of such tools becomes necessary. In particular, quality control criteria have evolved with the use of GWAS as we have come to understand potential systematic errors which can be introduced into the data during genotyping. There have been several replicated pharmacogenomic associations, some of which have moved to the clinic to enact change in treatment decisions. These examples of translation illustrate the strength of evidence necessary to successfully and effectively translate a genetic discovery. In this review, the design of pharmacogenomic association studies is examined with the goal of optimizing the impact and utility of this research. Issues of ascertainment, genotyping, quality control, analysis and interpretation are considered.
Epistasis; genotyping; personalized medicine; pharmacogenomics; quality control; statistics; study design
Electrocardiographic QRS duration, a measure of cardiac intraventricular conduction, varies ~2-fold in individuals without cardiac disease. Slow conduction may promote reentrant arrhythmias.
Methods and Results
We performed a genome-wide association study (GWAS) to identify genomic markers of QRS duration in 5,272 individuals without cardiac disease selected from electronic medical record (EMR) algorithms at five sites in the Electronic Medical Records and Genomics (eMERGE) network. The most significant loci were evaluated within the CHARGE consortium QRS GWAS meta-analysis. Twenty-three single nucleotide polymorphisms in 5 loci, previously described by CHARGE, were replicated in the eMERGE samples; 18 SNPs were in the chromosome 3 SCN5A and SCN10A loci, where the most significant SNPs were rs1805126 in SCN5A with p=1.2×10−8 (eMERGE) and p=2.5×10−20 (CHARGE) and rs6795970 in SCN10A with p=6×10−6 (eMERGE) and p=5×10−27 (CHARGE). The other loci were in NFIA, near CDKN1A, and near C6orf204. We then performed phenome-wide association studies (PheWAS) on variants in these five loci in 13,859 European Americans to search for diagnoses associated with these markers. PheWAS identified atrial fibrillation and cardiac arrhythmias as the most common associated diagnoses with SCN10A and SCN5A variants. SCN10A variants were also associated with subsequent development of atrial fibrillation and arrhythmia in the original 5,272 “heart-healthy” study population.
We conclude that DNA biobanks coupled to EMRs provide a platform not only for GWAS but may also allow broad interrogation of the longitudinal incidence of disease associated with genetic variants. The PheWAS approach implicated sodium channel variants modulating QRS duration in subjects without cardiac disease as predictors of subsequent arrhythmias.
cardiac conduction; QRS duration; atrial fibrillation; genome-wide association study; phenome-wide association study; electronic medical records
As genetic epidemiology looks beyond mapping single disease susceptibility loci, interest in detecting epistatic interactions between genes has grown. The dimensionality and comparisons required to search the epistatic space and the inference for a significant result pose challenges for testing epistatic disease models. The Multifactor Dimensionality Reduction Pedigree Disequilibrium Test (MDR-PDT) was developed to test for multilocus models in pedigree data. In the present study we rigorously tested MDR-PDT with new cross-validation (CV) (both 5- and 10-fold) and omnibus model selection algorithms by simulating a range of heritabilities, odds ratios, minor allele frequencies, sample sizes, and numbers of interacting loci. Power was evaluated using 100, 500, and 1000 families, with minor allele frequencies of 0.2 and 0.4 and broad-sense heritabilities of 0.005, 0.01, 0.03, 0.05, and 0.1 for 2- and 3-locus purely epistatic penetrance models. We also compared the prediction error measure of effect with a predicted matched odds ratio for final model selection and testing. We report that the CV procedure is valid with the permutation test, that MDR-PDT performs similarly with 5- and 10-fold CV, and that the matched odds ratio is more powerful than prediction error as the fitness metric for MDR-PDT.
Epistasis; MDR-PDT; complex disease; family-based association; bioinformatics
Candidate gene and genome-wide association studies (GWAS) have identified genetic variants that modulate risk for human disease; many of these associations require further study to replicate the results. Here we report the first large-scale application of the phenome-wide association study (PheWAS) paradigm within electronic medical records (EMRs), an unbiased approach to replication and discovery that interrogates relationships between targeted genotypes and multiple phenotypes. We scanned for associations between 3,144 single-nucleotide polymorphisms (previously implicated by GWAS as mediators of human traits) and 1,358 EMR-derived phenotypes in 13,835 individuals of European ancestry. This PheWAS replicated 66% (51/77) of sufficiently powered prior GWAS associations and revealed 63 potentially pleiotropic associations with P < 4.6 × 10−6 (false discovery rate < 0.1); the strongest of these novel associations were replicated in an independent cohort (n = 7,406). These findings validate PheWAS as a tool to allow unbiased interrogation across multiple phenotypes in EMR-based cohorts and to enhance analysis of the genomic basis of human disease.
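The abstract above reports a P-value threshold tied to a false discovery rate below 0.1 but does not say which FDR procedure was used. As an illustration only, the following minimal sketch assumes the Benjamini-Hochberg step-up procedure; the function name and the toy P values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values, fdr=0.1):
    """Return the indices of tests declared significant at the given FDR.

    Assumes the Benjamini-Hochberg step-up rule: find the largest rank k
    (1-based, in ascending P-value order) with p_(k) <= k / m * fdr and
    reject every hypothesis ranked at or below k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Toy P values (hypothetical, not from the PheWAS).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.045, 0.09, 0.2, 0.5, 0.7, 0.9]
significant = benjamini_hochberg(toy_p, fdr=0.1)
print(significant)
```

In this toy input the third and fourth P values miss their own rank thresholds but are still rejected because the fifth passes, which is the step-up behavior that distinguishes this rule from a fixed per-test cutoff.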
In genetic studies of complex disease a consideration for the investigator is detection of joint effects. The Multifactor Dimensionality Reduction (MDR) algorithm searches for these effects with an exhaustive approach. Previously unknown aspects of MDR performance were the power to detect interactive effects given large numbers of non-model loci or varying degrees of heterogeneity among multiple epistatic disease models.
To address the performance with many non-model loci, datasets of 500 cases and 500 controls with 100 to 10,000 SNPs were simulated for two-locus models, and one hundred 500-case/500-control datasets with 100 and 500 SNPs were simulated for three-locus models. Multiple levels of locus heterogeneity were simulated in several sample sizes.
These results show MDR is robust to locus heterogeneity when the definition of power is not as conservative as in previous simulation studies where all model loci were required to be found by the method. The results also indicate that MDR performance is related more strongly to broad-sense heritability than sample size and is not greatly affected by non-model loci.
A study in which a population with high heritability estimates is sampled predisposes the MDR study to success more than a larger ascertainment in a population with smaller estimates.
Epistasis; MDR; Heterogeneity
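The exhaustive two-locus search described in the MDR abstract above can be sketched roughly as follows. This is a simplified illustration, not the published software: it omits cross-validation and permutation testing, scores models by balanced accuracy on the full data, and uses a hypothetical function name and deterministic toy data.

```python
from collections import Counter
from itertools import combinations


def mdr_best_pair(genotypes, status):
    """Exhaustively score every pair of loci and return (accuracy, pair).

    genotypes: list of per-sample tuples coded 0/1/2.
    status: list of 1 (case) or 0 (control), one entry per sample.
    """
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    threshold = n_cases / n_controls  # overall case:control ratio
    best = None
    for a, b in combinations(range(len(genotypes[0])), 2):
        cases, controls = Counter(), Counter()
        for g, s in zip(genotypes, status):
            (cases if s else controls)[(g[a], g[b])] += 1
        # A genotype cell is "high risk" when its case:control ratio
        # exceeds the overall ratio; this collapses nine cells to one
        # binary attribute.
        high_risk = {cell for cell in set(cases) | set(controls)
                     if cases[cell] > threshold * controls[cell]}
        true_pos = sum(cases[c] for c in high_risk)
        true_neg = sum(n for c, n in controls.items() if c not in high_risk)
        accuracy = 0.5 * (true_pos / n_cases + true_neg / n_controls)
        if best is None or accuracy > best[0]:
            best = (accuracy, (a, b))
    return best


# Toy data: disease status depends only on loci 0 and 1; locus 2 is noise.
genotypes = [(g0, g1, g2) for g0 in range(3) for g1 in range(3) for g2 in range(3)]
status = [int((g0 == 1) != (g1 == 1)) for g0, g1, _ in genotypes]
best = mdr_best_pair(genotypes, status)
print(best)
```

Adding non-model loci to this sketch increases only the number of pairs scanned, which is the scaling question the simulations in the abstract address.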
Genetic variants of the enzyme that metabolizes warfarin, cytochrome P-450 2C9 (CYP2C9), and of a key pharmacologic target of warfarin, vitamin K epoxide reductase (VKORC1), contribute to differences in patients’ responses to various warfarin doses, but the role of these variants during initial anticoagulation is not clear.
In 297 patients starting warfarin therapy, we assessed CYP2C9 genotypes (CYP2C9 *1, *2, and *3), VKORC1 haplotypes (designated A and non-A), clinical characteristics, response to therapy (as determined by the international normalized ratio [INR]), and bleeding events. The study outcomes were the time to the first INR within the therapeutic range, the time to the first INR of more than 4, the time above the therapeutic INR range, the INR response over time, and the warfarin dose requirement.
As compared with patients with the non-A/non-A haplotype, patients with the A/A haplotype of VKORC1 had a decreased time to the first INR within the therapeutic range (P = 0.02) and to the first INR of more than 4 (P = 0.003). In contrast, the CYP2C9 genotype was not a significant predictor of the time to the first INR within the therapeutic range (P = 0.57) but was a significant predictor of the time to the first INR of more than 4 (P = 0.03). Both the CYP2C9 genotype and VKORC1 haplotype had a significant influence on the required warfarin dose after the first 2 weeks of therapy.
Initial variability in the INR response to warfarin was more strongly associated with genetic variability in the pharmacologic target of warfarin, VKORC1, than with CYP2C9.
The ever-growing wealth of biological information available through multiple comprehensive database repositories can be leveraged for advanced analysis of data. We have now extensively revised and updated the multi-purpose software tool Biofilter that allows researchers to annotate and/or filter data as well as generate gene-gene interaction models based on existing biological knowledge. Biofilter now has the Library of Knowledge Integration (LOKI), for accessing and integrating existing comprehensive database information, including more flexibility in how ambiguity of gene identifiers is handled. We have also updated the way importance scores for interaction models are generated. In addition, Biofilter 2.0 now works with a range of types and formats of data, including single nucleotide polymorphism (SNP) identifiers, rare variant identifiers, base pair positions, gene symbols, genetic regions, and copy number variant (CNV) location information.
Biofilter provides a convenient single interface for accessing multiple publicly available human genetic data sources that have been compiled in the supporting database of LOKI. Information within LOKI includes genomic locations of SNPs and genes, as well as known relationships among genes and proteins such as interaction pairs, pathways and ontological categories.
Via Biofilter 2.0 researchers can:
Annotate genomic location or region based data, such as results from association studies or CNV analyses, with relevant biological knowledge for deeper interpretation
Filter genomic location or region based data on biological criteria, such as filtering a series of SNPs to retain only SNPs present in specific genes within specific pathways of interest
Generate predictive models for gene-gene, SNP-SNP, or CNV-CNV interactions based on biological information, with priority for models to be tested based on biological relevance, thus narrowing the search space and reducing multiple hypothesis-testing.
Biofilter is a software tool that provides a flexible way to use the ever-expanding expert biological knowledge that exists to direct filtering, annotation, and complex predictive model development for elucidating the etiology of complex phenotypic outcomes.
Data mining; Bioinformatics; Expert knowledge; Modeling; Pathway analyses; Epistasis
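The filtering and model-generation ideas in the Biofilter abstract above can be illustrated with a small, generic sketch. It does not use Biofilter's or LOKI's actual interface, and the scoring rule (counting how many pathways contain both genes) is an assumption made for illustration; the abstract does not describe how importance scores are computed. All identifiers and data below are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical knowledge source: pathway name -> member genes.
pathways = {
    "pathway_1": {"GENE_A", "GENE_B", "GENE_C"},
    "pathway_2": {"GENE_B", "GENE_C"},
}
# Hypothetical SNP-to-gene annotation.
snp_to_gene = {"snp_1": "GENE_A", "snp_2": "GENE_B", "snp_3": "GENE_D"}


def filter_snps(snps, snp_to_gene, pathway_genes):
    """Keep only SNPs annotated to a gene in the pathway of interest."""
    return [s for s in snps if snp_to_gene.get(s) in pathway_genes]


def gene_gene_models(pathways):
    """List gene pairs sharing a pathway, ordered by supporting pathways."""
    support = Counter()
    for genes in pathways.values():
        for pair in combinations(sorted(genes), 2):
            support[pair] += 1
    # Highest support first; ties broken alphabetically for reproducibility.
    return sorted(support.items(), key=lambda item: (-item[1], item[0]))


kept = filter_snps(["snp_1", "snp_2", "snp_3"], snp_to_gene, pathways["pathway_1"])
models = gene_gene_models(pathways)
print(kept)
print(models)
```

Testing only pairs with prior support, most supported first, is one way to narrow the search space as the abstract describes.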
Gene expression profiles have been broadly used in cancer research as a diagnostic or prognostic signature for the clinical outcome prediction such as stage, grade, metastatic status, recurrence, and patient survival, as well as to potentially improve patient management. However, emerging evidence shows that gene expression-based prediction varies between independent data sets. One possible explanation of this effect is that previous studies were focused on identifying genes with large main effects associated with clinical outcomes. Thus, non-linear interactions without large individual main effects would be missed. The other possible explanation is that gene expression as a single level of genomic data is insufficient to explain the clinical outcomes of interest since cancer can be dysregulated by multiple alterations through genome, epigenome, transcriptome, and proteome levels. In order to overcome the variability of diagnostic or prognostic predictors from gene expression alone and to increase its predictive power, we need to integrate multi-levels of genomic data and identify interactions between them associated with clinical outcomes.
Here, we proposed an integrative framework for identifying interactions within/between multi-levels of genomic data associated with cancer clinical outcomes using the Grammatical Evolution Neural Networks (GENN). In order to demonstrate the validity of the proposed framework, ovarian cancer data from TCGA was used as a pilot task. We found not only interactions within a single genomic level but also interactions between multi-levels of genomic data associated with survival in ovarian cancer. Notably, the integration model from different levels of genomic data achieved 72.89% balanced accuracy and outperformed the top models with any single level of genomic data.
Understanding the underlying tumorigenesis and progression in ovarian cancer through the global view of interactions within/between different levels of genomic data is expected to provide guidance for improved prognostic biomarkers and individualized therapies.
Integrative analysis; Multi-omics data; Grammatical evolution neural network; Ovarian cancer
Prior candidate gene studies have associated CYP2B6 516G→T [rs3745274] and 983T→C [rs28399499] with increased plasma efavirenz exposure. We sought to identify novel variants associated with efavirenz pharmacokinetics.
Materials and methods
Antiretroviral therapy-naive AIDS Clinical Trials Group studies A5202, A5095, and ACTG 384 included plasma sampling for efavirenz pharmacokinetics. Log-transformed trough efavirenz concentrations (Cmin) were previously estimated by population pharmacokinetic modeling. Stored DNA was genotyped with Illumina HumanHap 650Y or 1MDuo platforms, complemented by additional targeted genotyping of CYP2B6 and CYP2A6 with MassARRAY iPLEX Gold. Associations were identified by linear regression, which included principal component vectors to adjust for genetic ancestry.
Among 856 individuals, CYP2B6 516G→T was associated with efavirenz estimated Cmin (P = 8.5 × 10−41). After adjusting for CYP2B6 516G→T, CYP2B6 983T→C was associated (P = 9.9 × 10−11). After adjusting for both CYP2B6 516G→T and 983T→C, a CYP2B6 variant (rs4803419) in intron 3 was associated (P = 4.4 × 10−15). After adjusting for all three variants, non-CYP2B6 polymorphisms were associated at P-value less than 5 × 10−8. In a separate cohort of 240 individuals, only the three CYP2B6 polymorphisms replicated. These three polymorphisms explained 34% of interindividual variability in efavirenz estimated Cmin. The extensive metabolizer phenotype was best defined by the absence of all three polymorphisms.
Three CYP2B6 polymorphisms were independently associated with efavirenz estimated Cmin at genome-wide significance, and explained one-third of interindividual variability. These data will inform continued efforts to translate pharmacogenomic knowledge into optimal efavirenz utilization.
CYP2B6; efavirenz; HIV; pharmacogenomics; pharmacokinetics
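The efavirenz abstract above describes linear regression with principal component vectors included to adjust for genetic ancestry. The sketch below illustrates why that adjustment matters, using simulated data only: a single ancestry component (hypothetical `pc1`) shifts both allele frequency and the outcome, so the unadjusted genotype effect is confounded. The effect sizes, sample size, and helper `ols` are assumptions for illustration, not values or code from the study.

```python
import random


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [x - factor * p for x, p in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


rng = random.Random(0)
true_effect = 0.5  # assumed per-allele effect on the log-scale outcome
rows = []
for _ in range(2000):
    pc1 = rng.gauss(0, 1)
    # Allele frequency drifts with ancestry, creating confounding.
    freq = min(max(0.3 + 0.1 * pc1, 0.05), 0.95)
    genotype = (rng.random() < freq) + (rng.random() < freq)
    outcome = 1.0 + true_effect * genotype + 0.8 * pc1 + rng.gauss(0, 0.3)
    rows.append((genotype, pc1, outcome))

y = [r[2] for r in rows]
unadjusted = ols([[1.0, g] for g, _, _ in rows], y)[1]
adjusted = ols([[1.0, g, pc] for g, pc, _ in rows], y)[1]
print(f"unadjusted={unadjusted:.3f} adjusted={adjusted:.3f}")
```

In practice several principal components would be included as covariates; one is used here only to keep the example short.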
Marked prolongation of the QT interval on the electrocardiogram associated with the polymorphic ventricular tachycardia Torsades de Pointes is a serious adverse event during treatment with antiarrhythmic drugs and other culprit medications, and is a common cause for drug relabeling and withdrawal. Although clinical risk factors have been identified, the syndrome remains unpredictable in an individual patient. Here we used genome-wide association analysis to search for common predisposing genetic variants. Cases of drug-induced Torsades de Pointes (diTdP), treatment-tolerant controls, and general population controls were ascertained across multiple sites using common definitions, and genotyped on the Illumina 610k or 1M-Duo BeadChips. Principal Components Analysis was used to select 216 Northwestern European diTdP cases and 771 ancestry-matched controls, including treatment-tolerant and general population subjects. With these sample sizes, there is 80% power to detect a variant at genome-wide significance with minor allele frequency of 10% and conferring an odds ratio of ≥2.7. Tests of association were carried out for each single nucleotide polymorphism (SNP) by logistic regression adjusting for gender and population structure. No SNP reached genome-wide significance; the variant with the lowest P value was rs2276314, a non-synonymous coding variant in C18orf21 (p = 3×10−7, odds ratio = 2, 95% confidence interval: 1.5–2.6). The haplotype formed by rs2276314 and a second SNP, rs767531, was significantly more frequent in controls than cases (p = 3×10−9). Expanding the number of controls and a gene-based analysis did not yield significant associations. This study argues that common genomic variants do not contribute importantly to risk for drug-induced Torsades de Pointes across multiple drugs.
Only one LDL-C GWAS has been reported in African Americans. We performed a GWAS of LDL-C in African Americans using data extracted from electronic medical records (EMR) in the eMERGE network. African Americans were genotyped on the Illumina 1M chip. All LDL-C measurements, prescriptions, and diagnoses of concomitant disease were extracted from EMR. We created two analytic datasets: one with median LDL-C calculated after the exclusion of some lab values based on co-morbidities and medication (n = 618) and another with median LDL-C calculated without any exclusions (n = 1249). Rs7412 in APOE was strongly associated with LDL-C at levels of GWAS significance in both datasets (p < 5 × 10−8). In the dataset with exclusions, a decrease of 20.0 mg/dl per minor allele was observed. The effect size was attenuated (12.3 mg/dl) in the dataset without any lab values excluded. Although other signals in APOE have been detected in previous GWAS, this large and important SNP association has not been well detected in large GWAS because rs7412 was not included on many genotyping arrays. Use of median LDL-C extracted from EMR after exclusions for medications and co-morbidities increased the percentage of trait variance explained by genetic variation.
GWAS; LDL; electronic medical records
Routine integration of genotype data into drug decision-making could improve patient safety, particularly if many relevant genetic variants can be assayed simultaneously before target drug prescribing. The frequency of pharmacogenetic prescribing opportunities and the potential adverse events (AE) mitigated are unknown. We examined the frequency with which 56 medications with known outcomes influenced by variant alleles were prescribed in a cohort of 52,942 medical home patients at Vanderbilt University Medical Center. Within a five-year window, we estimated that 64.8% (95% CI: 64.4%-65.2%) of individuals were exposed to at least one medication with an established pharmacogenetic association. Using previously published results for six medications with well-characterized, severe genetically-linked AEs, we estimated that 398 events (95% CI, 225 - 583) could have been prevented with an effective preemptive genotyping program. Our results suggest that multiplexed, preemptive genotyping may represent an efficient alternative approach to current single use (“reactive”) methods and may improve safety.
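The exposure estimate in the abstract above (64.8% of 52,942 patients, 95% CI 64.4%-65.2%) can be checked with a short calculation. The abstract does not state which interval method was used; the sketch below assumes a normal-approximation (Wald) interval for a binomial proportion, which reproduces the reported bounds to one decimal place.

```python
import math

n_patients = 52942   # cohort size reported in the abstract
p_exposed = 0.648    # reported proportion exposed to at least one medication

# Wald interval (an assumed method): p +/- 1.96 * sqrt(p * (1 - p) / n)
standard_error = math.sqrt(p_exposed * (1 - p_exposed) / n_patients)
lower = p_exposed - 1.96 * standard_error
upper = p_exposed + 1.96 * standard_error

print(f"{100 * lower:.1f}% to {100 * upper:.1f}%")  # prints "64.4% to 65.2%"
```

The narrow interval reflects the large cohort: the standard error is about 0.2 percentage points.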
A multi-ethnic study demonstrates that the extrapolation of genetic disease risk models from European populations to other ethnicities is compromised more strongly by genetic structure than by environmental or global genetic background in differential genetic risk associations across ethnicities.
The vast majority of genome-wide association study (GWAS) findings reported to date are from populations with European Ancestry (EA), and it is not yet clear how broadly the genetic associations described will generalize to populations of diverse ancestry. The Population Architecture Using Genomics and Epidemiology (PAGE) study is a consortium of multi-ancestry, population-based studies formed with the objective of refining our understanding of the genetic architecture of common traits emerging from GWAS. In the present analysis of five common diseases and traits, including body mass index, type 2 diabetes, and lipid levels, we compare direction and magnitude of effects for GWAS-identified variants in multiple non-EA populations against EA findings. We demonstrate that, in all populations analyzed, a significant majority of GWAS-identified variants have allelic associations in the same direction as in EA, with none showing a statistically significant effect in the opposite direction, after adjustment for multiple testing. However, 25% of tagSNPs identified in EA GWAS have significantly different effect sizes in at least one non-EA population, and these differential effects were most frequent in African Americans where all differential effects were diluted toward the null. We demonstrate that differential LD between tagSNPs and functional variants within populations contributes significantly to dilute effect sizes in this population. Although most variants identified from GWAS in EA populations generalize to all non-EA populations assessed, genetic models derived from GWAS findings in EA may generate spurious results in non-EA populations due to differential effect sizes. Regardless of the origin of the differential effects, caution should be exercised in applying any genetic risk prediction model based on tagSNPs outside of the ancestry group in which it was derived. 
Models based directly on functional variation may generalize more robustly, but the identification of functional variants remains challenging.
The number of known associations between human diseases and common genetic variants has grown dramatically in the past decade, most being identified in large-scale genetic studies of people of Western European origin. But because the frequencies of genetic variants can differ substantially between continental populations, it's important to assess how well these associations can be extended to populations with different continental ancestry. Are the correlations between genetic variants, disease endpoints, and risk factors consistent enough for genetic risk models to be reliably applied across different ancestries? Here we describe a systematic analysis of disease outcome and risk-factor–associated variants (tagSNPs) identified in European populations, in which we test whether the effect size of a tagSNP is consistent across six populations with significant non-European ancestry. We demonstrate that although nearly all such tagSNPs have effects in the same direction across all ancestries (i.e., variants associated with higher risk in Europeans will also be associated with higher risk in other populations), roughly a quarter of the variants tested have significantly different magnitude of effect (usually lower) in at least one non-European population. We therefore advise caution in the use of tagSNP-based genetic disease risk models in populations that have a different genetic ancestry from the population in which original associations were first made. We then show that this differential strength of association can be attributed to population-dependent variations in the correlation between tagSNPs and the variant that actually determines risk—the so-called functional variant. Risk models based on functional variants are therefore likely to be more robust than tagSNP-based models.
We genotyped 326 “frequently medicated” individuals of European descent in Vanderbilt’s biorepository linked to de-identified electronic medical records, BioVU, on the ADME Core Panel to assess quality and performance of the assay. We compared quality control metrics and determined the extent of direct and indirect marker overlap between the ADME Core Panel and the Illumina Omni1-Quad. We found the quality of the ADME Core Panel data to be high, with exceptions in select copy number variants (CNVs) and markers in certain genes (notably CYP2D6). Most of the common variants on the ADME panel are genotyped by the Omni1, but absent rare variants and CNVs could not be accurately tagged by single markers. Finally, our frequently medicated study population did not convincingly differ in allele frequency from reference populations, suggesting that heterogeneous clinical samples (with respect to medications) follow similar allele frequency distributions in pharmacogenetics genes as their appropriate reference populations.
ADME; pharmacogenomics; pharmacogenetics; BioVU; biorepository; CYP2D6; personalized medicine; precision medicine
To identify novel genetic loci influencing interindividual variation in red blood cell (RBC) traits in African-Americans, we conducted a genome-wide association study (GWAS) in 2315 individuals, divided into discovery (n = 1904) and replication (n = 411) cohorts. The traits included hemoglobin concentration (HGB), hematocrit (HCT), RBC count, mean corpuscular volume (MCV), mean corpuscular hemoglobin (MCH), and mean corpuscular hemoglobin concentration (MCHC). Patients were participants in the electronic MEdical Records and GEnomics (eMERGE) network and underwent genotyping of ~1.2 million single-nucleotide polymorphisms on the Illumina Human1M-Duo array. Association analyses were performed adjusting for age, sex, site, and population stratification. Three loci previously associated with resistance to malaria—HBB (11p15.4), HBA1/HBA2 (16p13.3), and G6PD (Xq28)—were associated (P ≤ 1 × 10−6) with RBC traits in the discovery cohort. The loci replicated in the replication cohort (P ≤ 0.02), and were significant at a genome-wide significance level (P < 5 × 10−8) in the combined cohort. The proportions of variance in RBC traits explained by significant variants at these loci were as follows: rs7120391 (near HBB) 1.3% of MCHC, rs9924561 (near HBA1/A2) 5.5% of MCV, 6.9% of MCH and 2.9% of MCHC, and rs1050828 (in G6PD) 2.4% of RBC count, 2.9% of MCV, and 1.4% of MCH, respectively. We were not able to replicate loci identified by a previous GWAS of RBC traits in a European ancestry cohort of similar sample size, suggesting that the genetic architecture of RBC traits differs by race. In conclusion, genetic variants that confer resistance to malaria are associated with RBC traits in African-Americans.
red blood cell (RBC) traits; genome-wide association study; African-Americans; natural selection; informatics; electronic medical record
With the recent decreasing cost of genome sequence data, there has been increasing interest in rare variants and methods to detect their association with disease. We developed BioBin, a flexible collapsing method inspired by biological knowledge that can be used to automate the binning of low frequency variants for association testing. We also built the Library of Knowledge Integration (LOKI), a repository of data assembled from public databases, which contains resources such as: dbSNP and gene Entrez database information from the National Center for Biotechnology Information (NCBI), pathway information from Gene Ontology (GO), Protein families database (Pfam), Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, NetPath - signal transduction pathways, Open Regulatory Annotation Database (ORegAnno), Biological General Repository for Interaction Datasets (BioGrid), Pharmacogenomics Knowledge Base (PharmGKB), Molecular INTeraction database (MINT), and evolutionary conserved regions (ECRs) from UCSC Genome Browser. The novelty of BioBin is access to comprehensive knowledge-guided multi-level binning. For example, bin boundaries can be formed using genomic locations from: functional regions, evolutionary conserved regions, genes, and/or pathways.
We tested BioBin on simulated data and on 1000 Genomes Project low-coverage data, using simulated causative variants and a pairwise comparison of rare variant (MAF < 0.03) burden differences between Yoruba individuals (YRI) and individuals of European descent (CEU). Lastly, we analyzed the NHLBI GO Exome Sequencing Project Kabuki dataset (Kabuki syndrome is a congenital disorder affecting multiple organs and often involving intellectual disability), contrasted with Complete Genomics data as controls.
The results from our simulation studies indicate that the type I error rate is controlled; however, power falls quickly for small sample sizes using variants with modest effect sizes. Using BioBin, we were able to find simulated variants in genes with fewer than 20 loci, but found the sensitivity to be much lower in large bins. We also highlighted the scale of population stratification between two 1000 Genomes Project populations, CEU and YRI. Lastly, we were able to apply BioBin to natural biological data from dbGaP and identify an interesting candidate gene for further study.
We have established that BioBin will be a very practical and flexible tool to analyze sequence data and potentially uncover novel associations between low frequency variants and complex disease.
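The collapsing idea behind the BioBin abstract above can be illustrated with a minimal gene-level burden count. This is a generic sketch, not BioBin's implementation: it bins only by gene (the abstract also mentions functional regions, conserved regions, and pathways), reuses the MAF < 0.03 cutoff quoted in the abstract, and relies on hypothetical variant, gene, and sample names.

```python
from collections import defaultdict


def bin_rare_variants(variants, genotypes, maf_threshold=0.03):
    """Collapse rare variants into per-gene burden counts.

    variants: list of (variant_id, gene, minor_allele_frequency).
    genotypes: {sample: {variant_id: alternate allele count}}.
    Returns {sample: {gene: total rare alternate alleles in that gene}}.
    """
    rare_by_gene = defaultdict(list)
    for variant_id, gene, maf in variants:
        if maf < maf_threshold:  # common variants are left out of the bin
            rare_by_gene[gene].append(variant_id)
    return {
        sample: {gene: sum(calls.get(v, 0) for v in ids)
                 for gene, ids in rare_by_gene.items()}
        for sample, calls in genotypes.items()
    }


# Hypothetical toy data.
variants = [("v1", "GENE_A", 0.01), ("v2", "GENE_A", 0.02),
            ("v3", "GENE_A", 0.20), ("v4", "GENE_B", 0.005)]
genotypes = {"s1": {"v1": 1, "v3": 2},
             "s2": {"v1": 1, "v2": 1, "v4": 1}}
burden = bin_rare_variants(variants, genotypes)
print(burden)
```

The per-sample bin totals would then be compared between groups in an association test; the loss of sensitivity in large bins noted in the abstract corresponds to a few causal alleles being diluted among many neutral ones in the same sum.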
A report on the Keystone Symposium 'Complex Traits: Genomics and Computational approaches', Breckenridge, Colorado, USA, 20-25 February 2012.
HIV-associated sensory neuropathy remains an important complication of combination antiretroviral therapy (CART) and HIV infection. Mitochondrial DNA haplogroups and single nucleotide polymorphisms (SNPs) have previously been associated with symptomatic neuropathy in clinical trial participants. We examined associations between mitochondrial DNA variation and HIV-associated sensory neuropathy in CHARTER. CHARTER is a U.S.-based longitudinal observational study of HIV-infected adults who underwent a structured interview and standardized examination. HIV-associated sensory neuropathy was determined by trained examiners as ≥1 sign (diminished vibratory and sharp-dull discrimination or ankle reflexes) bilaterally. Mitochondrial DNA sequencing was performed and haplogroups were assigned by published algorithms. Multivariable logistic regression analyses of associations between mitochondrial DNA SNPs, haplogroups, and HIV-associated sensory neuropathy were performed. In analyses of associations of each mitochondrial DNA SNP with HIV-associated sensory neuropathy, the two most significant SNPs were at positions A12810G (odds ratio [95% confidence interval] = 0.27 [0.11-0.65]; p = 0.004) and T489C (odds ratio [95% confidence interval] = 0.41 [0.21-0.80]; p = 0.009). These synonymous changes are known to define African haplogroup L1c and European haplogroup J, respectively. Both haplogroups are associated with decreased prevalence of HIV-associated sensory neuropathy compared with all other haplogroups (odds ratio [95% confidence interval] = 0.29 [0.12-0.71]; p = 0.007 and odds ratio [95% confidence interval] = 0.42 [0.18-1.0]; p = 0.05, respectively). In conclusion, in this cohort of mostly combination antiretroviral therapy-treated subjects, two common mitochondrial DNA SNPs and their corresponding haplogroups were associated with a markedly decreased prevalence of HIV-associated sensory neuropathy.
genetics; mitochondria; HIV-related neurological diseases; peripheral neuropathy
Simulation studies are useful in various disciplines for a number of reasons including the development and evaluation of new computational and statistical methods. This is particularly true in human genetics and genetic epidemiology where new analytical methods are needed for the detection and characterization of disease susceptibility genes whose effects are complex, nonlinear, and partially or solely dependent on the effects of other genes. Despite this need, the development of complex genetic models that can be used to simulate data is not always intuitive. In fact, only a few such models have been published. In this paper, we present a strategy for identifying complex genetic models for simulation studies that utilizes genetic algorithms. The genetic models used in this study are penetrance functions that define the probability of disease given that a specific DNA sequence variation has been inherited. We demonstrate that the genetic algorithm approach routinely identifies interesting and useful penetrance functions in a human-competitive manner.
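To make the notion of a penetrance function concrete, the sketch below evaluates one hand-written two-locus table of the kind a search procedure might look for. It is an illustration under stated assumptions (Hardy-Weinberg proportions, independent loci, minor allele frequency 0.5 at both loci), not the authors' genetic algorithm or fitness function; the table values and function names are hypothetical.

```python
def genotype_freqs(maf):
    """Hardy-Weinberg genotype frequencies for 0, 1, 2 minor alleles."""
    q = maf
    p = 1 - q
    return [p * p, 2 * p * q, q * q]


def summarize(penetrance, maf_a, maf_b):
    """Return prevalence, both marginal penetrance vectors, and heritability.

    penetrance[i][j] is P(disease | i minor alleles at A, j at B).
    Loci are assumed independent (linkage equilibrium).
    """
    fa, fb = genotype_freqs(maf_a), genotype_freqs(maf_b)
    cells = [(i, j) for i in range(3) for j in range(3)]
    prevalence = sum(fa[i] * fb[j] * penetrance[i][j] for i, j in cells)
    marginal_a = [sum(fb[j] * penetrance[i][j] for j in range(3)) for i in range(3)]
    marginal_b = [sum(fa[i] * penetrance[i][j] for i in range(3)) for j in range(3)]
    variance = sum(fa[i] * fb[j] * (penetrance[i][j] - prevalence) ** 2
                   for i, j in cells)
    heritability = variance / (prevalence * (1 - prevalence))
    return prevalence, marginal_a, marginal_b, heritability


# Hypothetical table: risk is raised only when exactly one locus is heterozygous.
table = [[0.0, 0.1, 0.0],
         [0.1, 0.0, 0.1],
         [0.0, 0.1, 0.0]]
prevalence, marginal_a, marginal_b, heritability = summarize(table, 0.5, 0.5)
print(prevalence, marginal_a, marginal_b, round(heritability, 4))
```

Every marginal penetrance equals the population prevalence here, so neither locus shows a main effect on its own even though the pair carries a nonzero broad-sense heritability. A search algorithm could score candidate tables by how closely they meet such constraints.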
Over the past several years, genome-wide association studies (GWAS) have succeeded in identifying hundreds of genetic markers associated with common diseases. However, most of these markers confer relatively small increments of risk and explain only a small proportion of familial clustering. To identify obstacles to future progress in genetic epidemiology research and provide recommendations to NIH for overcoming these barriers, the National Cancer Institute sponsored a workshop entitled “Next Generation Analytic Tools for Large-Scale Genetic Epidemiology Studies of Complex Diseases” on September 15–16, 2010. The goal of the workshop was to facilitate discussions on (1) statistical strategies and methods to efficiently identify genetic and environmental factors contributing to the risk of complex disease; and (2) how to develop, apply, and evaluate these strategies for the design, analysis, and interpretation of large-scale complex disease association studies in order to guide NIH in setting the future agenda in this area of research. The workshop was organized as a series of short presentations covering scientific (gene-gene and gene-environment interaction, complex phenotypes, and rare variants and next generation sequencing) and methodological (simulation modeling and computational resources and data management) topic areas. Specific needs to advance the field were identified during each session and are summarized.
gene-gene interactions; gene-environment interactions; rare variants; next generation sequencing; complex phenotypes; simulations; computational resources
Tacrolimus, an immunosuppressive drug widely prescribed in kidney transplantation, requires therapeutic drug monitoring due to its marked interindividual pharmacokinetic variability and narrow therapeutic index. Previous studies have established that CYP3A5 rs776746 is associated with tacrolimus clearance, blood concentration, and dose requirement. The importance of other drug absorption, distribution, metabolism, and elimination (ADME) gene variants has not been well characterized.
We used novel DNA biobank and electronic medical record resources to identify ADME variants associated with tacrolimus dose requirement. Broad ADME genotyping was performed on 446 kidney transplant recipients who had been dosed to steady state with tacrolimus. The cohort was obtained from Vanderbilt's DNA biobank, BioVU, which contains linked, de-identified electronic medical record data. Genotyping included Affymetrix DMET Plus (1936 polymorphisms), custom Sequenom MassARRAY iPLEX Gold assay (95 polymorphisms), and ancestry-informative markers. The primary outcome was tacrolimus dose requirement defined as blood concentration-to-dose ratio.
In analyses that adjusted for race and other clinical factors, we replicated the association of tacrolimus blood concentration-to-dose ratio with CYP3A5 rs776746 (p = 7.15 × 10−29), and identified associations with nine variants in linkage disequilibrium with rs776746, including eight CYP3A4 variants. No NR1I2 variants were significantly associated. Age, weight, and hemoglobin were also significantly associated with the outcome. In final models, rs776746 explained 39% of variability in dose requirement, and 46% was explained by the model containing clinical covariates.
This study highlights the utility of DNA biobanks and electronic medical records for tacrolimus pharmacogenomic research.
pharmacogenomics; pharmacokinetics; calcineurin inhibitor; tacrolimus; electronic medical records; kidney transplant; cytochrome P4503A5; genetic polymorphism; dosing
Mitochondrial DNA (mtDNA) variation has been associated with time to progression to AIDS and adverse effects from antiretroviral therapy (ART). In this study, full mtDNA sequence data from U.S.-based adult participants in the AIDS Clinical Trials Group (ACTG) study 384 were used to assess associations between mtDNA variants and CD4 T cell recovery with ART.
Full mtDNA sequence was determined using chip-based array sequencing. Sequence and CD4 cell count data were available at baseline and after ART initiation for 423 subjects with HIV RNA levels <400 copies/mL plasma. The primary outcome was change in CD4 count of ≥100 cells/mm3 from baseline. Analyses were adjusted for baseline age, CD4 cell count, HIV RNA, and naïve:memory CD4 cell ratio.
Race-stratified analysis of mtDNA variants with a minor allele frequency >1% revealed multiple mtDNA variants marginally associated (P < 0.05 before Bonferroni correction) with CD4 cell recovery. The most significant SNP associations were those tagging the African L2 haplogroup, which was associated with a decreased likelihood of ≥100 cells/mm3 CD4 count increase at week 48 in non-Hispanic blacks (adjusted OR=0.17; 95% CI=0.06–0.53; P=0.002).
An African mtDNA haplogroup was associated with CD4 cell recovery after ART in this clinical trial population. These initial findings warrant replication and further investigation in order to confirm the role of mtDNA variation in CD4 cell recovery during ART.
HIV; CD4 count; Mitochondrial DNA; Pharmacogenetics; Antiretroviral Therapy
The current paradigm of human genetics research is to analyze variation of a single data type (i.e., DNA sequence or RNA levels) to detect genes and pathways that underlie complex traits such as disease state or drug response. While these studies have detected thousands of variations that associate with hundreds of complex phenotypes, much of the estimated heritability, or trait variability due to genetic factors, remains unexplained. We may be able to account for a portion of the missing heritability if we incorporate a systems biology approach into these analyses. Rapid technological advances will make it possible for scientists to explore this hypothesis via the generation of high-throughput omics data – transcriptomic, proteomic and methylomic, to name a few. Analyzing these ‘meta-dimensional’ data will require clever statistical techniques that allow for the integration of qualitative and quantitative predictor variables. In this article, we examine two major categories of approaches for integrated data analysis, give examples of their use in experimental and in silico datasets, and assess the limitations of each method.
computational methods; data integration; pharmacogenomics; systems biology
To identify common genetic variants influencing red blood cell (RBC) traits.
Patients and Methods
We performed a genomewide association study from June 2008 through July 2011 of hemoglobin, hematocrit, RBC count, mean corpuscular volume, mean corpuscular hemoglobin, and mean corpuscular hemoglobin concentration in 12,486 patients of European ancestry from the electronic MEdical Records and Genomics (eMERGE) network. We developed an electronic medical record–based algorithm that included individuals who had RBC measurements obtained for clinical care and excluded values measured in the setting of hematopoietic disorders, comorbid conditions, or medications known to affect RBC production or a recent history of blood loss.
We identified 4 new genetic loci and replicated 11 loci previously reported to be associated with one or more RBC traits in individuals of European ancestry. Notably, genes present in 3 of the 4 newly identified loci (THRB, PTPLAD1, CDT1) and in 6 of the 11 replicated loci (KLF1, ALDH8A1, CCND3, SPTA1, FBXO7, TFR2/EPO) are implicated in erythroid differentiation and regulation of cell cycle in hematopoietic stem cells.
Genes in the erythroid differentiation and cell cycle regulation pathways influence interindividual variation in RBC indices. Our results provide insights into the molecular basis underlying variation in RBC traits.
eMERGE, electronic MEdical Records and GEnomics; EMMAX, mixed-model association-expedited; EMR, electronic medical record; eQTL, expression quantitative trait locus; GHC, Group Health Cooperative–University of Washington; GWAS, genomewide association study; HCT, hematocrit; HGB, hemoglobin; IBS, identity-by-state; LD, linkage disequilibrium; MC, Marshfield Clinic; MCH, mean corpuscular hemoglobin; MCHC, mean corpuscular hemoglobin concentration; MCV, mean corpuscular volume; MIM, Mendelian Inheritance in Man; NU, Northwestern University; RBC, red blood cell; SNP, single-nucleotide polymorphism; VUMC, Vanderbilt University Medical Center