Human induced pluripotent stem cell (iPSC)-derived cardiomyocytes show promise for screening during early drug development. Here, we tested a hypothesis that in vitro assessment of multiple cardiomyocyte physiological parameters enables predictive and mechanistically interpretable evaluation of cardiotoxicity in a high-throughput format. Human iPSC-derived cardiomyocytes were exposed for 30 minutes or 24 hours to 131 drugs, positive (107) and negative (24) for in vivo cardiotoxicity, in up to 6 concentrations (3 nM to 30 μM) in 384-well plates. Fast kinetic imaging was used to monitor changes in cardiomyocyte function using intracellular Ca2+ flux readouts synchronous with beating, and cell viability. A number of physiological parameters of cardiomyocyte beating, such as beat rate, peak shape (amplitude, width, rise, decay, etc.) and regularity were collected using automated data analysis. Concentration-response profiles were evaluated using logistic modeling to derive a benchmark concentration (BMC) point-of-departure value, based on one standard deviation departure from the estimated baseline in vehicle (0.3% dimethylsulfoxide)-treated cells. BMC values were used for cardiotoxicity classification and ranking of compounds. Beat rate and several peak shape parameters were found to be good predictors, while cell viability had poor classification accuracy. In addition, we applied the Toxicological Prioritization Index approach to integrate and display data across many collected parameters, to derive “cardiosafety” ranking of tested compounds. Multi-parameter screening of beating profiles allows for cardiotoxicity risk assessment and identification of specific patterns defining mechanism-specific effects. The data and analysis methods may be used widely for compound screening and early safety evaluation in the drug development process.
Cardiotoxicity; iPSC cardiomyocytes; predictive toxicology; stem cell; calcium transient; ion channel; hERG; fast kinetic fluorescence imaging; screening
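The point-of-departure step described in this abstract can be illustrated with a minimal sketch. It assumes that a four-parameter logistic (Hill) curve has already been fitted to one beating parameter; the function names, the parameterization, and the closed-form inversion are illustrative assumptions, not the authors' implementation.

```python
import math

def hill(conc, bottom, top, ec50, slope):
    """Four-parameter logistic (Hill) curve for conc > 0.

    `bottom` is the response approached at very low concentration
    (the vehicle baseline), `top` the response at saturating concentration.
    """
    return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** slope)

def benchmark_concentration(bottom, top, ec50, slope, baseline_sd):
    """Concentration at which the fitted curve departs from baseline by one SD.

    Works for increasing and decreasing curves. Returns None when the whole
    curve never moves one baseline SD away from `bottom`.
    """
    span = abs(top - bottom)
    if span <= baseline_sd:
        return None
    # Solve span / (1 + (ec50 / c) ** slope) = baseline_sd for c.
    return ec50 / (span / baseline_sd - 1.0) ** (1.0 / slope)
```

With a hypothetical baseline beat rate of 60, a maximal response of 120, an EC50 of 1 μM, unit slope, and a vehicle SD of 6, `benchmark_concentration(60.0, 120.0, 1.0, 1.0, 6.0)` returns the concentration at which the curve reaches 66.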
We assessed gene expression profiles in 2,752 twins, using a classic twin design to quantify expression heritability and quantitative trait loci (eQTL) in peripheral blood. The most highly heritable genes (~777) were grouped into distinct expression clusters, enriched in gene-poor regions, associated with specific gene function/ontology classes, and strongly associated with disease designation. The design enabled a comparison of twin-based heritability to estimates based on dizygotic IBD sharing and distant genetic relatedness. Consideration of sampling variation suggests that previous heritability estimates have been upwardly biased. Genotyping of 2,494 twins enabled powerful identification of eQTLs, which were further examined in a replication set of 1,895 unrelated subjects. A large number of local eQTLs (6,988) met replication criteria, while a relatively small number of distant eQTLs (165) met quality control and replication standards. Our results provide an important new resource toward understanding the genetic control of transcription.
gene expression; peripheral blood; twin study; heritability; expression quantitative trait loci; eQTL
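As a rough illustration of how a classic twin design yields a heritability estimate, the sketch below uses Falconer's formula, h2 = 2(rMZ − rDZ). This is an assumption for illustration only: the abstract does not state which estimator was used, and the function names and the use of a plain Pearson correlation between co-twins are hypothetical choices.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def falconer_heritability(r_mz, r_dz):
    """Falconer's estimate from monozygotic and dizygotic twin correlations."""
    return 2.0 * (r_mz - r_dz)
```

For one gene, `r_mz` would be `pearson_r(twin1_expression, twin2_expression)` over monozygotic pairs and `r_dz` the same quantity over dizygotic pairs.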
Diabetes is a common age-dependent complication of cystic fibrosis (CF) that is strongly influenced by modifier genes. We conducted a genome-wide association study in 3,059 individuals with CF (644 with CF-related diabetes [CFRD]) and identified single nucleotide polymorphisms (SNPs) within and 5′ to the SLC26A9 gene that associated with CFRD (hazard ratio [HR] 1.38; P = 3.6 × 10−8). Replication was demonstrated in 694 individuals (124 with CFRD) (HR, 1.47; P = 0.007), with combined analysis significant at P = 9.8 × 10−10. SLC26A9 is an epithelial chloride/bicarbonate channel that can interact with the CF transmembrane regulator (CFTR), the protein mutated in CF. We also hypothesized that common SNPs associated with type 2 diabetes might affect risk for CFRD. A previous association of CFRD with SNPs in TCF7L2 was replicated in this study (P = 0.004; combined analysis P = 3.8 × 10−6), and type 2 diabetes SNPs at or near CDKAL1, CDKN2A/B, and IGF2BP2 were associated with CFRD (P < 0.004). These five loci accounted for 8.3% of the phenotypic variance in CFRD onset and had a combined population-attributable risk of 68%. Diabetes is a highly prevalent complication of CF, for which susceptibility is determined in part by variants at SLC26A9 (which mediates processes proximate to the CF disease-causing gene) and at four susceptibility loci for type 2 diabetes in the general population.
The Mantel and Knox space-time clustering statistics are popular tools to establish transmissibility of a disease and detect outbreaks. The most commonly used null distributional approximations may provide poor fits, and researchers often resort to direct sampling from the permutation distribution. However, the exact first four moments for these statistics are available, and Pearson distributional approximations are often effective. Thus, our first goal is to clarify the literature and to make these tools more widely available. In addition, by rewriting terms in the statistics we obtain the exact first four permutation moments for the most commonly used quadratic form statistics, which need not be positive definite. The extension of this work to quadratic forms greatly expands the utility of density approximations for these problems, including for high-dimensional applications, where the statistics must be extreme in order to exceed stringent testing thresholds. We demonstrate the methods using examples from the investigation of disease transmission in cattle, the association of a gene expression pathway with breast cancer survival, regional genetic association with cystic fibrosis lung disease, and hypothesis testing for smoothed local linear regression.
Resampling; Exact testing; Statistical Computing
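For reference, the direct permutation sampling that the abstract describes as the common fallback can be sketched as follows for the Knox statistic (the number of case pairs that are close in both space and time). The cutoffs, function names, and the add-one p-value convention are illustrative assumptions; the moment-based Pearson approximations proposed in the abstract are not reproduced here.

```python
import itertools
import math
import random

def knox_statistic(points, times, space_cutoff, time_cutoff):
    """Count pairs of cases that are close in both space and time."""
    count = 0
    for i, j in itertools.combinations(range(len(points)), 2):
        close_in_space = math.dist(points[i], points[j]) <= space_cutoff
        close_in_time = abs(times[i] - times[j]) <= time_cutoff
        if close_in_space and close_in_time:
            count += 1
    return count

def knox_permutation_pvalue(points, times, space_cutoff, time_cutoff,
                            n_perm=999, seed=0):
    """One-sided Monte Carlo p-value obtained by shuffling times over locations."""
    rng = random.Random(seed)
    observed = knox_statistic(points, times, space_cutoff, time_cutoff)
    shuffled = list(times)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if knox_statistic(points, shuffled, space_cutoff, time_cutoff) >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (at_least_as_large + 1) / (n_perm + 1)
```

Each permutation costs a full pass over all pairs, which is the computational burden that an analytic approximation to the permutation distribution avoids.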
The development of high-throughput biomedical technologies has led to increased interest in the analysis of high-dimensional data where the number of features is much larger than the sample size. In this paper, we investigate principal component analysis under the ultra-high dimensional regime, where both the number of features and the sample size increase as the ratio of the two quantities also increases. We bridge the existing results from the finite and the high-dimension low sample size regimes, embedding the two regimes in a more general framework. We also numerically demonstrate the universal application of the results from the finite regime.
High-Dimension Low Sample Size Data; Principal Component Analysis; Random Matrix
In the past few years, a plethora of methods for rare variant association with phenotype have been proposed. These methods aggregate information from multiple rare variants across genomic region(s), but there is little consensus as to which method is most effective. The weighting scheme adopted when aggregating information across variants is one of the primary determinants of effectiveness. Here we present a systematic evaluation of multiple weighting schemes through a series of simulations intended to mimic large sequencing studies of a quantitative trait. We evaluate existing phenotype-independent and -dependent methods, as well as weights estimated by penalized regression approaches including Lasso, Elastic Net and SCAD. We find that the difference in power between phenotype-dependent schemes is negligible when high quality functional annotations are available. When functional annotations are unavailable or incomplete, all methods suffer from power loss; however, the variable selection methods outperform the others at the cost of increased computational time. Therefore, in the absence of good annotation, we recommend variable selection methods (which can be viewed as “statistical annotation”) on top regions implicated by a phenotype independent weighting scheme. Further, once a region is implicated, variable selection can help to identify potential causal SNPs for biological validation. These findings are supported by an analysis of a high coverage targeted sequencing study of 1898 individuals.
rare variants; association; weighting; variable selection; variant annotation
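A phenotype-independent weighting scheme of the kind evaluated here can be sketched as a weighted burden score per individual. The weight 1/sqrt(p(1 − p)), which up-weights rarer variants, is used as an assumed example of such a scheme; the function names are hypothetical and the abstract does not specify which weights were compared.

```python
import math

def maf_weights(minor_allele_freqs):
    """Phenotype-independent weights that up-weight rarer variants.

    Each frequency must lie strictly between 0 and 1.
    """
    return [1.0 / math.sqrt(p * (1.0 - p)) for p in minor_allele_freqs]

def burden_scores(genotypes, weights):
    """Weighted sum of minor-allele counts per individual.

    `genotypes` is a list of rows, one row per individual, with one
    minor-allele count (0, 1 or 2) per variant in the region.
    """
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]
```

The resulting scores would then be tested for association with the quantitative trait, for example by regression; phenotype-dependent schemes instead estimate the weights from the trait itself.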
It has been repeatedly shown that in case-control association studies, analysis of a secondary trait which ignores the original sampling scheme can produce highly biased risk estimates. Although a number of approaches have been proposed to properly analyze secondary traits, most approaches fail to reproduce the marginal logistic model assumed for the original case-control trait and/or do not allow for interaction between the secondary trait and the genotype marker on primary disease risk. In addition, the flexible handling of covariates remains challenging. We present a general retrospective likelihood framework to perform association testing for both binary and continuous secondary traits which respects marginal models and incorporates the interaction term. We provide a computational algorithm, based on a reparameterized approximate profile likelihood, for obtaining the maximum likelihood (ML) estimate and its standard error for the genetic effect on the secondary trait, in the presence of covariates. For completeness we also present an alternative pseudo-likelihood method for handling covariates. We describe extensive simulations to evaluate the performance of the ML estimator in comparison with the pseudo-likelihood and other competing methods.
Editor’s Highlight: Byproducts of constitutive metabolism may themselves be toxic, complicating the risk assessment of the same chemicals encountered from external sources. The application of stable labeled compounds offers insight into the source of chemicals producing biological effects and provides a basis to quantify the contribution of exogenous exposure to biological events. This report describes the concentration dependent contributions of exogenous [13C2]-acetaldehyde and endogenously produced acetaldehyde to adduct formation in human lymphoblastoid cells in vitro. — Jeffrey Fisher
The dose-response relationship for biomarkers of exposure (N2-ethylidene-dG adducts) and effect (cell survival and micronucleus formation) was determined across 4.5 orders of magnitude (50nM–2mM) using [13C2]-acetaldehyde exposures to human lymphoblastoid TK6 cells for 12h. There was a clear increase in exogenous N2-ethylidene-dG formation at exposure concentrations ≥ 1µM, whereas the endogenous adducts remained nearly constant across all exposure concentrations, with an average of 3.0 adducts/10^7 dG. Exogenous adducts were lower than endogenous adducts at concentrations ≤ 10µM and were greater than endogenous adducts at concentrations ≥ 250µM. When the endogenous and exogenous adducts were summed together, statistically significant increases in total adduct formation over the endogenous background occurred at 50µM. Cell survival and micronucleus formation were monitored across the exposure range and statistically significant decreases in cell survival and increases in micronucleus formation occurred at ≥ 1000µM. This research supports the hypothesis that endogenously produced reactive species, including acetaldehyde, are always present and constitute the majority of the observed biological effects following very low exposures to exogenous acetaldehyde. These data can replace default assumptions of linear extrapolation to very low doses of exogenous acetaldehyde for risk prediction.
acetaldehyde; DNA adduct; micronucleus; biomarker of exposure; biomarker of effect; liquid chromatography–mass spectrometry
Background: Benchmark dose (BMD) modeling computes the dose associated with a prespecified response level. While offering advantages over traditional points of departure (PODs), such as no-observed-adverse-effect-levels (NOAELs), BMD methods have lacked consistency and transparency in application, interpretation, and reporting in human health assessments of chemicals.
Objectives: We aimed to apply a standardized process for conducting BMD modeling to reduce inconsistencies in model fitting and selection.
Methods: We evaluated 880 dose–response data sets for 352 environmental chemicals with existing human health assessments. We calculated benchmark doses and their lower limits [10% extra risk, or change in the mean equal to 1 SD (BMD/L10/1SD)] for each chemical in a standardized way with prespecified criteria for model fit acceptance. We identified study design features associated with acceptable model fits.
Results: We derived values for 255 (72%) of the chemicals. Batch-calculated BMD/L10/1SD values were significantly and highly correlated (R2 of 0.95 and 0.83, respectively, n = 42) with PODs previously used in human health assessments, with values similar to reported NOAELs. Specifically, the median ratio of BMDs10/1SD:NOAELs was 1.96, and the median ratio of BMDLs10/1SD:NOAELs was 0.89. We also observed a significant trend of increasing model viability with increasing number of dose groups.
Conclusions: BMD/L10/1SD values can be calculated in a standardized way for use in health assessments on a large number of chemicals and critical effects. This facilitates the exploration of health effects across multiple studies of a given chemical or, when chemicals need to be compared, providing greater transparency and efficiency than current approaches.
Citation: Wignall JA, Shapiro AJ, Wright FA, Woodruff TJ, Chiu WA, Guyton KZ, Rusyn I. 2014. Standardizing benchmark dose calculations to improve science-based decisions in human health assessments. Environ Health Perspect 122:499–505; http://dx.doi.org/10.1289/ehp.1307539
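The 10% extra risk benchmark named in the Methods can be made concrete with a small sketch. Extra risk at dose d is (P(d) − P(0)) / (1 − P(0)), and the BMD10 is the dose at which it equals 0.10. The logistic dose-response model and the function names below are assumptions chosen for illustration; the abstract does not say which models were fitted, and the lower confidence limit (BMDL) is not computed here.

```python
import math

def logistic_response(dose, intercept, slope):
    """Probability of response under an assumed logistic dose-response model."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * dose)))

def extra_risk(dose, intercept, slope):
    """Extra risk relative to background: (P(d) - P(0)) / (1 - P(0))."""
    p0 = logistic_response(0.0, intercept, slope)
    return (logistic_response(dose, intercept, slope) - p0) / (1.0 - p0)

def benchmark_dose(intercept, slope, bmr=0.10):
    """Dose at which extra risk equals `bmr`; requires slope > 0."""
    p0 = logistic_response(0.0, intercept, slope)
    target = p0 + bmr * (1.0 - p0)
    # Invert the logistic curve at the target response probability.
    return (math.log(target / (1.0 - target)) - intercept) / slope
```

With hypothetical fitted values `intercept=-2.0` and `slope=0.5`, `extra_risk(benchmark_dose(-2.0, 0.5), -2.0, 0.5)` recovers 0.10.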
Background: Quantitative estimation of toxicokinetic variability in the human population is a persistent challenge in risk assessment of environmental chemicals. Traditionally, interindividual differences in the population are accounted for by default assumptions or, in rare cases, are based on human toxicokinetic data.
Objectives: We evaluated the utility of genetically diverse mouse strains for estimating toxicokinetic population variability for risk assessment, using trichloroethylene (TCE) metabolism as a case study.
Methods: We used data on oxidative and glutathione conjugation metabolism of TCE in 16 inbred and 1 hybrid mouse strains to calibrate and extend existing physiologically based pharmacokinetic (PBPK) models. We added one-compartment models for glutathione metabolites and a two-compartment model for dichloroacetic acid (DCA). We used a Bayesian population analysis of interstrain variability to quantify variability in TCE metabolism.
Results: Concentration–time profiles for TCE metabolism to oxidative and glutathione conjugation metabolites varied across strains. Median predictions for the metabolic flux through oxidation were less variable (5-fold range) than that through glutathione conjugation (10-fold range). For oxidative metabolites, median predictions of trichloroacetic acid production were less variable (2-fold range) than DCA production (5-fold range), although the uncertainty bounds for DCA exceeded the predicted variability.
Conclusions: Population PBPK modeling of genetically diverse mouse strains can provide useful quantitative estimates of toxicokinetic population variability. When extrapolated to lower doses more relevant to environmental exposures, mouse population-derived variability estimates for TCE metabolism closely matched population variability estimates previously derived from human toxicokinetic studies with TCE, highlighting the utility of mouse interstrain metabolism studies for addressing toxicokinetic variability.
Citation: Chiu WA, Campbell JL Jr, Clewell HJ III, Zhou YH, Wright FA, Guyton KZ, Rusyn I. 2014. Physiologically based pharmacokinetic (PBPK) modeling of interstrain variability in trichloroethylene metabolism in the mouse. Environ Health Perspect 122:456–463; http://dx.doi.org/10.1289/ehp.1307623
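To show what a one-compartment metabolite submodel looks like in its simplest form, the sketch below assumes first-order formation from a parent compound and first-order elimination, with the parent lost only through this one pathway. These assumptions, the rate-constant names, and the closed-form solution are illustrative; the calibrated PBPK model in the paper is considerably richer.

```python
import math

def metabolite_amount(t, parent_dose, k_form, k_elim):
    """Amount of metabolite at time t in a one-compartment model.

    Solves dP/dt = -k_form * P and dM/dt = k_form * P - k_elim * M
    with P(0) = parent_dose and M(0) = 0.
    """
    if math.isclose(k_form, k_elim):
        raise ValueError("closed form used here needs k_form != k_elim")
    scale = parent_dose * k_form / (k_elim - k_form)
    return scale * (math.exp(-k_form * t) - math.exp(-k_elim * t))

def time_of_peak(k_form, k_elim):
    """Time at which the metabolite amount is maximal."""
    return math.log(k_elim / k_form) / (k_elim - k_form)
```

In a population analysis, strain-specific values of rate constants such as `k_form` and `k_elim` would be treated as draws from a population distribution, whose spread is the quantity of interest.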
Little is known for certain about the genetics of schizophrenia. The
advent of genomewide association has been widely anticipated as holding
promise as a means to identify reproducible DNA sequence variation
associated with this important and debilitating disorder.
738 cases with DSM-IV schizophrenia (all participants in the CATIE
study) and 733 group-matched controls were genotyped for 492,900 single
nucleotide polymorphisms (SNPs) using the Affymetrix 500K two chip
genotyping platform plus a custom 164K fill-in chip. Following multiple
quality control steps for both subjects and SNPs, logistic regression
analyses were used to assess the evidence for association of all SNPs with schizophrenia.
We identified a number of promising SNPs for follow-up studies,
although no SNP or multi-marker combination of SNPs achieved genomewide
statistical significance. Although a few signals coincided with genomic
regions previously implicated in schizophrenia, chance could not be excluded.
These data do not provide evidence for the involvement of any genomic
region with schizophrenia detectable with moderate sample size. However,
planned GWAS for response phenotypes and inclusion of individual phenotype
and genotype data from this study in meta-analyses holds promise for the
eventual identification of susceptibility and protective variants.
schizophrenia; genome-wide association; CATIE
Motivation: Scientists and regulators are often faced with complex decisions, where use of scarce resources must be prioritized using collections of diverse information. The Toxicological Prioritization Index (ToxPi™) was developed to enable integration of multiple sources of evidence on exposure and/or safety, transformed into transparent visual rankings to facilitate decision making. The rankings and associated graphical profiles can be used to prioritize resources in various decision contexts, such as testing chemical toxicity or assessing similarity of predicted compound bioactivity profiles. The amount and types of information available to decision makers are increasing exponentially, while the complex decisions must rely on specialized domain knowledge across multiple criteria of varying importance. Thus, the ToxPi bridges a gap, combining rigorous aggregation of evidence with ease of communication to stakeholders.
Results: An interactive ToxPi graphical user interface (GUI) application has been implemented to allow straightforward decision support across a variety of decision-making contexts in environmental health. The GUI allows users to easily import and recombine data, then analyze, visualize, highlight, export and communicate ToxPi results. It also provides a statistical metric of stability for both individual ToxPi scores and relative prioritized ranks.
Availability: The ToxPi GUI application, complete user manual and example data files are freely available from http://comptox.unc.edu/toxpi.php.
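The integration idea behind a prioritization index of this kind can be sketched as a weighted combination of evidence "slices", each scaled to the unit interval across the items being ranked. The min-max scaling, the dictionary layout, and the function names are assumptions for illustration and should not be read as the ToxPi GUI's actual algorithm or file format.

```python
def min_max_scale(values):
    """Scale values to [0, 1]; a constant slice contributes 0 for every item."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def priority_scores(slices, weights):
    """Weighted average of scaled slices, one score per item.

    `slices` maps a slice name to a list of raw values (one per item);
    `weights` maps the same names to non-negative weights.
    """
    total_weight = sum(weights[name] for name in slices)
    n_items = len(next(iter(slices.values())))
    scores = [0.0] * n_items
    for name, raw in slices.items():
        for i, scaled in enumerate(min_max_scale(raw)):
            scores[i] += weights[name] * scaled / total_weight
    return scores
```

Ranking items by the returned scores gives the prioritized order; the per-slice contributions `weights[name] * scaled` are what a pie-style profile would display.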
Genomes of men and women differ in only a limited number of genes located on the sex chromosomes, whereas the transcriptome is far more sex-specific. Identification of sex-biased gene expression will contribute to understanding the molecular basis of sex-differences in complex traits and common diseases.
Sex differences in the human peripheral blood transcriptome were characterized using microarrays in 5,241 subjects, accounting for menopause status and hormonal contraceptive use. Sex-specific expression was observed for 582 autosomal genes, of which 57.7% were upregulated in women (female-biased genes). Female-biased genes were enriched for several immune system GO categories, genes linked to rheumatoid arthritis (16%) and genes regulated by estrogen (18%). Male-biased genes were enriched for genes linked to renal cancer (9%). Sex-differences in gene expression were smaller in postmenopausal women, larger in women using hormonal contraceptives and not caused by sex-specific eQTLs, confirming the role of estrogen in regulating sex-biased genes.
This study indicates that sex-bias in gene expression is extensive and may underlie sex-differences in the prevalence of common diseases.
We introduce the Interactive Decision Committee method for classification when high-dimensional feature variables are grouped into feature categories. The proposed method uses the interactive relationships among feature categories to build base classifiers which are combined using decision committees. A two-stage or a single-stage 5-fold cross-validation technique is utilized to decide the total number of base classifiers to be combined. The proposed procedure is useful for classifying biochemicals on the basis of toxicity activity, where the feature space consists of chemical descriptors and the responses are binary indicators of toxicity activity. Each descriptor belongs to at least one descriptor category. The support vector machine, the random forests, and the tree-based AdaBoost algorithms are utilized as classifier inducers. Forward selection is used to select the best combinations of the base classifiers given the number of base classifiers. Simulation studies demonstrate that the proposed method outperforms a single large, unaggregated classifier in the presence of interactive feature category information. We applied the proposed method to two toxicity data sets associated with chemical compounds. For these data sets, the proposed method improved classification performance for the majority of outcomes compared to a single large, unaggregated classifier.
Chemical toxicity; Decision committee method; Ensemble; Ensemble feature selection; QSAR modeling; Statistical learning
A subset (~3–5%) of patients with cystic fibrosis (CF) develops severe liver disease (CFLD) with portal hypertension.
To assess whether any of 9 polymorphisms in 5 candidate genes (SERPINA1, ACE, GSTP1, MBL2, and TGFB1) are associated with severe liver disease in CF patients.
Design, Setting, and Participants
A 2-stage design was used in this case–control study. CFLD subjects were enrolled from 63 U.S., 32 Canadian, and 18 CF centers outside of North America, with the University of North Carolina at Chapel Hill (UNC) as the coordinating site. In the initial study, we studied 124 CFLD patients (enrolled 1/1999–12/2004) and 843 CF controls (patients without CFLD) by genotyping 9 polymorphisms in 5 genes previously implicated as modifiers of liver disease in CF. In the second stage, the SERPINA1 Z allele and TGFB1 codon 10 genotype were tested in an additional 136 CFLD patients (enrolled 1/2005–2/2007) and 1088 CF controls.
Main Outcome Measures
We compared differences in distribution of genotypes in CF patients with severe liver disease versus CF patients without CFLD.
The initial study showed CFLD to be associated with the SERPINA1 (also known as α1-antiprotease and α1-antitrypsin) Z allele (P value=3.3×10−6; odds ratio (OR) 4.72, 95% confidence interval (CI) 2.31–9.61), and with transforming growth factor β-1 (TGFB1) codon 10 CC genotype (P=2.8×10−3; OR 1.53, CI 1.16–2.03). In the replication study, CFLD was associated with the SERPINA1 Z allele (P=1.4×10−3; OR 3.42, CI 1.54–7.59), but not with TGFB1 codon 10. A combined analysis of the initial and replication studies by logistic regression showed CFLD to be associated with SERPINA1 Z allele (P=1.5×10−8; OR 5.04, CI 2.88–8.83).
The SERPINA1 Z allele is a risk factor for liver disease in CF. Patients who carry the Z allele are at greater odds (OR ~5) to develop severe liver disease with portal hypertension.
Exome sequencing has become a powerful and effective strategy for discovery of genes underlying Mendelian disorders [1]. However, use of exome sequencing to identify variants associated with complex traits has been more challenging, partly because the sample sizes needed for adequate power may be very large [2]. One strategy to increase efficiency is to sequence individuals who are at both ends of a phenotype distribution (i.e., extreme phenotypes). Because the frequency of alleles that contribute to the trait is enriched in one or both extremes of phenotype, a modest sample size can potentially identify novel candidate genes/alleles [3]. As part of the National Heart, Lung, and Blood Institute Exome Sequencing Project (ESP), we used an extreme phenotype design to discover that variants in DCTN4, encoding a dynactin protein, are associated with time to first Pseudomonas aeruginosa (P. aeruginosa) airway infection, chronic P. aeruginosa infection and mucoid P. aeruginosa among individuals with cystic fibrosis (MIM219700).
A shift in toxicity testing from in vivo to in vitro may efficiently prioritize compounds, reveal new mechanisms, and enable predictive modeling. Quantitative high-throughput screening (qHTS) is a major source of data for computational toxicology, and our goal in this study was to aid in the development of predictive in vitro models of chemical-induced toxicity, anchored on interindividual genetic variability. Eighty-one human lymphoblast cell lines from 27 Centre d’Etude du Polymorphisme Humain trios were exposed to 240 chemical substances (12 concentrations, 0.26nM–46.0μM) and evaluated for cytotoxicity and apoptosis. qHTS screening in the genetically defined population produced robust and reproducible results, which allowed for cross-compound, cross-assay, and cross-individual comparisons. Some compounds were cytotoxic to all cell types at similar concentrations, whereas others exhibited interindividual differences in cytotoxicity. Specifically, the qHTS in a population-based human in vitro model system has several unique aspects that are of utility for toxicity testing, chemical prioritization, and high-throughput risk assessment. First, standardized and high-quality concentration-response profiling, with reproducibility confirmed by comparison with previous experiments, enables prioritization of chemicals for variability in interindividual range in cytotoxicity. Second, genome-wide association analysis of cytotoxicity phenotypes allows exploration of the potential genetic determinants of interindividual variability in toxicity. Furthermore, highly significant associations identified through the analysis of population-level correlations between basal gene expression variability and chemical-induced toxicity suggest plausible mode of action hypotheses for follow-up analyses. 
We conclude that as the improved resolution of genetic profiling can now be matched with high-quality in vitro screening data, the evaluation of the toxicity pathways and the effects of genetic diversity are now feasible through the use of human lymphoblast cell lines.
chemical cytotoxicity; apoptosis; HapMap; lymphoblasts; qHTS
Head and neck squamous cell carcinoma (HNSCC) is a frequently fatal heterogeneous disease. Beyond the role of human papilloma virus (HPV), no validated molecular characterization of the disease has been established. Using an integrated genomic analysis and validation methodology we confirm four molecular classes of HNSCC (basal, mesenchymal, atypical, and classical) consistent with signatures established for squamous carcinoma of the lung, including deregulation of the KEAP1/NFE2L2 oxidative stress pathway, differential utilization of the lineage markers SOX2 and TP63, and preference for the oncogenes PIK3CA and EGFR. For potential clinical use the signatures are complementary to classification by HPV infection status as well as the putative high risk marker CCND1 copy number gain. A molecular etiology for the subtypes is suggested by statistically significant chromosomal gains and losses and differential cell of origin expression patterns. Model systems representative of each of the four subtypes are also presented.
Resampling-based expression pathway analysis techniques have been shown to preserve type I error rates, in contrast to simple gene-list approaches that implicitly assume the independence of genes in ranked lists. However, resampling is intensive in computation time and memory requirements. We describe accurate analytic approximations to permutations of score statistics, including novel approaches for Pearson's correlation, and summed score statistics, that have good performance for even relatively small sample sizes. Our approach preserves the essence of permutation pathway analysis, but with greatly reduced computation. Extensions for inclusion of covariates and censored data are described, and we test the performance of our procedures using simulations based on real datasets. These approaches have been implemented in the new R package safeExpress.
Gene sets; Multiple hypothesis testing; Permutation approximation
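The idea of replacing permutations with analytic moments can be checked on a toy case. Under random permutation of one variable, the Pearson correlation has mean 0 and variance 1/(n − 1) exactly; the sketch below verifies this by enumerating every permutation of a small sample. The function names are hypothetical, and this covers only the first two moments, not the specific approximations implemented in the package.

```python
import itertools
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def exact_permutation_moments(x, y):
    """Mean and variance of r over all n! permutations of y (small n only)."""
    rs = [pearson_r(x, perm) for perm in itertools.permutations(y)]
    mean = sum(rs) / len(rs)
    variance = sum((r - mean) ** 2 for r in rs) / len(rs)
    return mean, variance
```

Because these moments are available in closed form, a distribution matched to them can stand in for the permutation distribution without any resampling, which is where the computational saving comes from.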
Expression quantitative trait locus (eQTL) analysis is rapidly moving from a cutting-edge concept in genomics to a mature area of investigation, with important connections to genome-wide association studies for human disease, pharmacogenomics and toxicogenomics. Despite the importance of the topic, many investigators must develop their own code or use tools not specifically suited for eQTL analysis. Convenient computational tools are becoming available, but they are not widely publicized, and investigators who are interested in discovery of eQTLs, or in using them to interpret genome-wide association study results, may have difficulty navigating the available resources. The purpose of this review is to help investigators find appropriate programs for eQTL analysis and interpretation.
bioinformatics; fast linear modeling; gene expression
Motivation: A number of penalization and shrinkage approaches have been proposed for the analysis of microarray gene expression data. Similar techniques are now routinely applied to RNA sequence transcriptional count data, although the value of such shrinkage has not been conclusively established. If penalization is desired, the explicit modeling of mean–variance relationships provides a flexible testing regimen that ‘borrows’ information across genes, while easily incorporating design effects and additional covariates.
Results: We describe BBSeq, which incorporates two approaches: (i) a simple beta-binomial generalized linear model, which has not been extensively tested for RNA-Seq data and (ii) an extension of an expression mean–variance modeling approach to RNA-Seq data, involving modeling of the overdispersion as a function of the mean. Our approaches are flexible, allowing for general handling of discrete experimental factors and continuous covariates. We report comparisons with alternative methods for handling RNA-Seq data. Although penalized methods have advantages for very small sample sizes, the beta-binomial generalized linear model, combined with simple outlier detection and testing approaches, appears to have favorable characteristics in power and flexibility.
Availability: An R package containing examples and sample datasets is available at http://www.bios.unc.edu/research/genomic_software/BBSeq
Supplementary information: Supplementary data are available at Bioinformatics online.
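The beta-binomial distribution at the core of the first approach can be written down directly. The sketch below gives its probability mass function in the standard (alpha, beta) parameterization, computed on the log scale for stability; it is a generic illustration with hypothetical function names and does not reproduce the BBSeq package's interface or its mean-overdispersion parameterization.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    """P(K = k) for k gene reads out of n total reads.

    The success probability is itself Beta(alpha, beta) distributed, which
    adds variance beyond a plain binomial (overdispersion).
    """
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))
```

For n = 20, alpha = 2, beta = 6 the mean is n·alpha/(alpha + beta) = 5, as for a binomial with p = 0.25, while the variance is about three times the binomial value of 3.75.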
In genome-wide association studies, population stratification is recognized as producing inflated type I error due to the inflation of test statistics. Principal component-based methods applied to genotypes provide information about population structure, and have been widely used to control for stratification. Here we explore the precise relationship between genotype principal components and inflation of association test statistics, thereby drawing a connection between principal component-based stratification control and the alternative approach of genomic control. Our results provide an inherent justification for the use of principal components, but call into question the popular practice of selecting principal components based on significance of eigenvalues alone. We propose a new approach, called EigenCorr, which selects principal components based on both their eigenvalues and their correlation with the (disease) phenotype. Our approach tends to select fewer principal components for stratification control than does testing of eigenvalues alone, providing substantial computational savings and improvements in power. Analyses of simulated and real data demonstrate the usefulness of the proposed approach.
Genomic Control; GWAS; PCA; Population Stratification
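The genomic control side of the comparison can be sketched briefly: the inflation factor is the median of the observed 1-df association statistics divided by the median of a chi-square distribution with one degree of freedom (about 0.4549). The function names are hypothetical, and the EigenCorr selection rule itself is not reproduced because the abstract does not give its details.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(chi2_stats):
    """Genomic control inflation factor (lambda) from 1-df test statistics."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

def genomic_control_adjust(chi2_stats):
    """Divide statistics by lambda; no adjustment is made when lambda < 1."""
    lam = max(genomic_inflation(chi2_stats), 1.0)
    return [s / lam for s in chi2_stats]
```

A lambda well above 1 signals inflation of the kind that stratification produces, which is the quantity the abstract relates to genotype principal components.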
Genetic studies of lung disease in Cystic Fibrosis are hampered by the
lack of a severity measure that accounts for chronic disease progression and
mortality attrition. Further, combining analyses across studies requires common
phenotypes that are robust to study design and patient ascertainment.
Using data from the North American Cystic Fibrosis Modifier Consortium
(Canadian Consortium for CF Genetic Studies, Johns Hopkins University CF Twin
and Sibling Study, and University of North Carolina/Case Western Reserve
University Gene Modifier Study), the authors calculated age-specific CF
percentile values of FEV1 which were adjusted for CF age-specific mortality.
The phenotype was computed for 2061 patients representing the Canadian CF
population, 1137 extreme phenotype patients in the UNC/Case Western study, and
1323 patients from multiple CF sib families in the CF Twin and Sibling Study.
Despite differences in ascertainment and median age, our phenotype score was
distributed in all three samples in a manner consistent with ascertainment
differences, reflecting the lung disease severity of each individual in the
underlying population. The new phenotype score was highly correlated with the
previously recommended complex phenotype, but the new phenotype is more robust
for shorter follow-up and for extreme ages.
A disease progression and mortality adjusted phenotype reduces the need
for stratification or additional covariates, increasing statistical power and
avoiding possible distortions. This approach will facilitate large scale genetic
and environmental epidemiological studies which will provide targeted
therapeutic pathways for the clinical benefit of patients with CF.
Forced Expiratory Volume; Age Effects; Severity of Illness Index