Genome-wide association studies can provide novel insights into diseases of interest, as well as to the responsiveness of an individual to specific treatments. In such studies, it is very important to correct for population stratification, which refers to allele frequency differences between cases and controls due to systematic ancestry differences. Population stratification can cause spurious associations if not adjusted properly. The principal component analysis (PCA) method has been relied upon as a highly useful methodology to adjust for population stratification in these types of large-scale studies. Recently, the linear mixed model (LMM) has also been proposed to account for family structure or cryptic relatedness. However, neither of these approaches may be optimal in properly correcting for sample structures in the presence of subject outliers.
We propose to use robust PCA combined with k-medoids clustering to deal with population stratification. This approach can adjust for population stratification for both continuous and discrete populations with subject outliers, and it can be considered as an extension of the PCA method and the multidimensional scaling (MDS) method. Through simulation studies, we compare the performance of our proposed methods with several widely used stratification methods, including PCA and MDS. We show that subject outliers can greatly influence the analysis results from several existing methods, while our proposed robust population stratification methods perform very well for both discrete and admixed populations with subject outliers. We illustrate the new method using data from a rheumatoid arthritis study.
We demonstrate that subject outliers can greatly influence the analysis result in GWA studies, and propose robust methods for dealing with population stratification that outperform existing population stratification methods in the presence of subject outliers.
Population structure; Population stratification; Robust principal component analysis; Resampling by half means; Outlier detection; GWA studies
It has been shown that if genetic relationships among individuals are not taken into account for genome wide association studies, this may lead to false positives. To address this problem, we used Genome-wide Rapid Association using Mixed Model and Regression and principal component stratification analyses. To account for linkage disequilibrium among the significant markers, principal components loadings obtained from top markers can be included as covariates. Estimation of Bayesian networks may also be useful to investigate linkage disequilibrium among SNPs and their relation with environmental variables.
For the quantitative trait we first estimated residuals while taking polygenic effects into account. We then used a single SNP approach to detect the most significant SNPs based on the residuals and applied principal component regression to take linkage disequilibrium among these SNPs into account. For the categorical trait we used principal component stratification methodology to account for background effects. For correction of linkage disequilibrium we used principal component logit regression. Bayesian networks were estimated to investigate relationship among SNPs.
Using the Genome-wide Rapid Association using Mixed Model and Regression and principal component stratification approach we detected around 100 significant SNPs for the quantitative trait (p<0.05 with 1000 permutations) and 109 significant (p<0.0006 with local FDR correction) SNPs for the categorical trait. With additional principal component regression we reduced the list to 16 and 50 SNPs for the quantitative and categorical trait, respectively.
GRAMMAR could efficiently incorporate the information regarding random genetic effects. Principal component stratification should be cautiously used with stringent multiple hypothesis testing correction to correct for ancestral stratification and association analyses for binary traits when there are systematic genetic effects such as half sib family structures. Bayesian networks are useful to investigate relationships among SNPs and environmental variables.
Genetic structure in the European American population reflects waves of migration and recent gene flow among different populations. This complex structure can introduce bias in genetic association studies. Using Principal Components Analysis (PCA), we analyze the structure of two independent European American datasets (1,521 individuals–307,315 autosomal SNPs). Individual variation lies across a continuum with some individuals showing high degrees of admixture with non-European populations, as demonstrated through joint analysis with HapMap data. The CEPH Europeans only represent a small fraction of the variation encountered in the larger European American datasets we studied. We interpret the first eigenvector of this data as correlated with ancestry, and we apply an algorithm that we have previously described to select PCA-informative markers (PCAIMs) that can reproduce this structure. Importantly, we develop a novel method that can remove redundancy from the selected SNP panels and show that we can effectively remove correlated markers, thus increasing genotyping savings. Only 150–200 PCAIMs suffice to accurately predict fine structure in European American datasets, as identified by PCA. Simulating association studies, we couple our method with a PCA-based stratification correction tool and demonstrate that a small number of PCAIMs can efficiently remove false correlations with almost no loss in power. The structure informative SNPs that we propose are an important resource for genetic association studies of European Americans. Furthermore, our redundancy removal algorithm can be applied on sets of ancestry informative markers selected with any method in order to select the most uncorrelated SNPs, and significantly decreases genotyping costs.
Genetic association studies search to identify disease susceptibility genes through the analysis of genetic markers such as single nucleotide polymorphisms (SNPs) in large numbers of cases and controls. In such settings, the existence of sub-structure in the population under study (i.e. differences in ancestry among cases and controls) may lead to spurious results. It is therefore imperative to control for this possible bias. Such biases may arise for example when studying the European American population, which consists of individuals of diverse ancestry proportions from different European countries and to some degree also from African and Native American populations. Here, we study the genetic sub-structure of the European American population, analyzing 1,521 individuals for over 300,000 SNPs across the entire genome. Applying a powerful method that is based on dimensionality reduction (Principal Components Analysis), we are able to identify 200 SNPs that successfully represent the complete dataset. Importantly, we introduce a novel method that effectively removes redundancy from any set of genetic markers, and may prove extremely useful in a variety of different research scenarios, in order to significantly reduce the cost of a study.
The new generation of sequencing platforms opens new horizons in the genetics field. It is possible to exhaustively assay all genetic variants in an individual and search for phenotypic associations. The whole genome sequencing approach, when applied to a large human sample like the San Antonio Family Study, detects a very large number (>25 million) of single nucleotide variants along with other more complex variants. The analytical challenges imposed by this number of variants are formidable, suggesting that methods are needed to reduce the overall number of statistical tests. In this study, we develop a single degree-of-freedom test of variants in a gene pathway employing a random effect model that uses an empirical pathway-specific genetic relationship matrix as the focal covariance kernel. The empirical pathway-specific genetic relationship uses all variants (or a chosen subset) from gene members of a given biological pathway. Using SOLAR's pedigree-based variance components modeling, which also allows for arbitrary fixed effects, such as principal components, to deal with latent population structure, we employ a likelihood ratio test of the pathway-specific genetic relationship matrix model. We examine all gene pathways in KEGG database gene pathways using our method in the first replicate of the Genetic Analysis Workshop 18 simulation of systolic blood pressure. Our random effect approach was able to detect true association signals in causal gene pathways. Those pathways could be easily be further dissected by the independent analysis of all markers.
Principal-component analysis (PCA) has been used for decades to summarize the human genetic variation across geographic regions and to infer population migration history. Reduction of spurious associations due to population structure is crucial for the success of disease association studies. Recently, PCA has also become a popular method for detecting population structure and correction of population stratification in disease association studies. Inspired by manifold learning, we propose a novel method based on spectral graph theory. Regarding each study subject as a node with suitably defined weights for its edges to close neighbors, one can form a weighted graph. We suggest using the spectrum of the associated graph Laplacian operator, namely, Laplacian eigenfunctions, to infer population structures instead of principal components (PCs). For the whole genome-wide association data for the North American Rheumatoid Arthritis Consortium (NARAC) provided by Genetic Workshop Analysis 16, Laplacian eigenfunctions revealed more meaningful structures of the underlying population than PCA. The proposed method has connection to PCA, and it naturally includes PCA as a special case. Our simple method is computationally fast and is suitable for disease studies at the genome-wide scale.
Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in the case–control genome wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-linked phenotypes. Methods such as self-declared ancestry, ancestry informative markers, genomic control, structured association, and principal component analysis are used to assess and correct population stratification but each has limitations. We provide an alternative technique to address population stratification.
We propose a novel machine learning method, ETHNOPRED, which uses the genotype and ethnicity data from the HapMap project to learn ensembles of disjoint decision trees, capable of accurately predicting an individual’s continental and sub-continental ancestry. To predict an individual’s continental ancestry, ETHNOPRED produced an ensemble of 3 decision trees involving a total of 10 SNPs, with 10-fold cross validation accuracy of 100% using HapMap II dataset. We extended this model to involve 29 disjoint decision trees over 149 SNPs, and showed that this ensemble has an accuracy of ≥ 99.9%, even if some of those 149 SNP values were missing. On an independent dataset, predominantly of Caucasian origin, our continental classifier showed 96.8% accuracy and improved genomic control’s λ from 1.22 to 1.11. We next used the HapMap III dataset to learn classifiers to distinguish European subpopulations (North-Western vs. Southern), East Asian subpopulations (Chinese vs. Japanese), African subpopulations (Eastern vs. Western), North American subpopulations (European vs. Chinese vs. African vs. Mexican vs. Indian), and Kenyan subpopulations (Luhya vs. Maasai). In these cases, ETHNOPRED produced ensembles of 3, 39, 21, 11, and 25 disjoint decision trees, respectively involving 31, 502, 526, 242 and 271 SNPs, with 10-fold cross validation accuracy of 86.5% ± 2.4%, 95.6% ± 3.9%, 95.6% ± 2.1%, 98.3% ± 2.0%, and 95.9% ± 1.5%. However, ETHNOPRED was unable to produce a classifier that can accurately distinguish Chinese in Beijing vs. Chinese in Denver.
ETHNOPRED is a novel technique for producing classifiers that can identify an individual’s continental and sub-continental heritage, based on a small number of SNPs. We show that its learned classifiers are simple, cost-efficient, accurate, transparent, flexible, fast, applicable to large scale GWASs, and robust to missing values.
To account for population stratification in association studies, principal-components analysis is often performed on single-nucleotide polymorphisms (SNPs) across the genome. Here, we use Framingham Heart Study (FHS) Genetic Analysis Workshop 16 data to compare the performance of local ancestry adjustment for population stratification based on principal components (PCs) estimated from SNPs in a local chromosomal region with global ancestry adjustment based on PCs estimated from genome-wide SNPs.
Standardized height residuals from unrelated adults from the FHS Offspring Cohort were averaged from longitudinal data. PCs of SNP genotype data were calculated to represent individual's ancestry either 1) globally using all SNPs across the genome or 2) locally using SNPs in adjacent 20-Mbp regions within each chromosome. We assessed the extent to which there were differences in association studies of height depending on whether PCs for global, local, or both global and local ancestry were included as covariates.
The correlations between local and global PCs were low (r < 0.12), suggesting variability between local and global ancestry estimates. Genome-wide association tests without any ancestry adjustment demonstrated an inflated type I error rate that decreased with adjustment for local ancestry, global ancestry, or both. A known spurious association was replicated for SNPs within the lactase gene, and this false-positive association was abolished by adjustment with local or global ancestry PCs.
Population stratification is a potential source of bias in this seemingly homogenous FHS population. However, local and global PCs derived from SNPs appear to provide adequate information about ancestry.
Meta-analysis of genome-wide association studies (GWAS) has become a useful tool to identify genetic variants that are associated with complex human diseases. To control spurious associations between genetic variants and disease that are caused by population stratification, double genomic control (GC) correction for population stratification in meta-analysis for GWAS has been implemented in the software METAL and GWAMA and is widely used by investigators. In this research, we conducted extensive simulation studies to evaluate the double GC correction method in meta-analysis and compared the performance of the double GC correction with that of a principal components analysis (PCA) correction method in meta-analysis. Results show that when the data consist of population stratification, using double GC correction method can have inflated type I error rates at a marker with significant allele frequency differentiation in the subpopulations (such as caused by recent strong selection). On the other hand, the PCA correction method can control type I error rates well and has much higher power in meta-analysis compared to the double GC correction method, even though in the situation that the casual marker does not have significant allele frequency difference between the subpopulations. We applied the double GC correction and PCA correction to meta-analysis of GWAS for two real datasets from the Atherosclerosis Risk in Communities (ARIC) project and the Multi-Ethnic Study of Atherosclerosis (MESA) project. The results also suggest that PCA correction is more effective than the double GC correction in meta-analysis.
genome-wide association studies; meta-analysis; double genomic control correction; principal components analysis; population stratification
A limitation in traditional stepwise population pharmacokinetic model building is the difficulty in handling interactions between model components. To address this issue, a method was previously introduced which couples NONMEM parameter estimation and model fitness evaluation to a single-objective, hybrid genetic algorithm for global optimization of the model structure. In this study, the generalizability of this approach for pharmacokinetic model building is evaluated by comparing (1) correct and spurious covariate relationships in a simulated dataset resulting from automated stepwise covariate modeling, Lasso methods, and single-objective hybrid genetic algorithm approaches to covariate identification and (2) information criteria values, model structures, convergence, and model parameter values resulting from manual stepwise versus single-objective, hybrid genetic algorithm approaches to model building for seven compounds. Both manual stepwise and single-objective, hybrid genetic algorithm approaches to model building were applied, blinded to the results of the other approach, for selection of the compartment structure as well as inclusion and model form of inter-individual and inter-occasion variability, residual error, and covariates from a common set of model options. For the simulated dataset, stepwise covariate modeling identified three of four true covariates and two spurious covariates; Lasso identified two of four true and 0 spurious covariates; and the single-objective, hybrid genetic algorithm identified three of four true covariates and one spurious covariate. For the clinical datasets, the Akaike information criterion was a median of 22.3 points lower (range of 470.5 point decrease to 0.1 point decrease) for the best single-objective hybrid genetic-algorithm candidate model versus the final manual stepwise model: the Akaike information criterion was lower by greater than 10 points for four compounds and differed by less than 10 points for three compounds. The root mean squared error and absolute mean prediction error of the best single-objective hybrid genetic algorithm candidates were a median of 0.2 points higher (range of 38.9 point decrease to 27.3 point increase) and 0.02 points lower (range of 0.98 point decrease to 0.74 point increase), respectively, than that of the final stepwise models. In addition, the best single-objective, hybrid genetic algorithm candidate models had successful convergence and covariance steps for each compound, used the same compartment structure as the manual stepwise approach for 6 of 7 (86 %) compounds, and identified 54 % (7 of 13) of covariates included by the manual stepwise approach and 16 covariate relationships not included by manual stepwise models. The model parameter values between the final manual stepwise and best single-objective, hybrid genetic algorithm models differed by a median of 26.7 % (q1 = 4.9 % and q3 = 57.1 %). Finally, the single-objective, hybrid genetic algorithm approach was able to identify models capable of estimating absorption rate parameters for four compounds that the manual stepwise approach did not identify. The single-objective, hybrid genetic algorithm represents a general pharmacokinetic model building methodology whose ability to rapidly search the feasible solution space leads to nearly equivalent or superior model fits to pharmacokinetic data.
Electronic supplementary material
The online version of this article (doi:10.1007/s10928-012-9258-0) contains supplementary material, which is available to authorized users.
Pharmacokinetics; Model building; Genetic algorithms
Population stratification is an important issue in case–control studies of disease-marker association. Failure to properly account for population structure can lead to spurious association or reduced power. In this article, we compare the performance of six methods correcting for population stratification in case–control association studies. These methods include genomic control (GC), EIGENSTRAT, principal component-based logistic regression (PCA-L), LAPSTRUCT, ROADTRIPS, and EMMAX. We also include the uncorrected Armitage test for comparison. In the simulation studies, we consider a wide range of population structure models for unrelated samples, including admixture. Our simulation results suggest that PCA-L and LAPSTRUCT perform well over all the scenarios studied, whereas GC, ROADTRIPS, and EMMAX fail to correct for population structure at single nucleotide polymorphisms (SNPs) that show strong differentiation across ancestral populations. The Armitage test does not adjust for confounding due to stratification thus has inflated type I error. Among all correction methods, EMMAX has the greatest power, based on the population structure settings considered for samples with unrelated individuals. The three methods, EIGENSTRAT, PCA-L, and LAPSTRUCT, are comparable, and outperform both GC and ROADTRIPS in almost all situations.
Population structure; association testing; type I error; power
Case-control studies are widely used to detect gene-environment interactions in the etiology of complex diseases. Many variables that are of interest to biomedical researchers are difficult to measure on an individual level, e.g. nutrient intake, cigarette smoking exposure, long-term toxic exposure. Measurement error causes bias in parameter estimates, thus masking key features of data and leading to loss of power and spurious/masked associations. We develop a Bayesian methodology for analysis of case-control studies for the case when measurement error is present in an environmental covariate and the genetic variable has missing data. This approach offers several advantages. It allows prior information to enter the model to make estimation and inference more precise. The environmental covariates measured exactly are modeled completely nonparametrically. Further, information about the probability of disease can be incorporated in the estimation procedure to improve quality of parameter estimates, what cannot be done in conventional case-control studies. A unique feature of the procedure under investigation is that the analysis is based on a pseudo-likelihood function therefore conventional Bayesian techniques may not be technically correct. We propose an approach using Markov Chain Monte Carlo sampling as well as a computationally simple method based on an asymptotic posterior distribution. Simulation experiments demonstrated that our method produced parameter estimates that are nearly unbiased even for small sample sizes. An application of our method is illustrated using a population-based case-control study of the association between calcium intake with the risk of colorectal adenoma development.
Bayesian inference; Errors in variables; Gene-environment interactions; Markov Chain Monte Carlo sampling; Missing data; Pseudo-likelihood; Semiparametric methods
It has been well established that theoretical kernel for recently surging genome-wide association study (GWAS) is statistical inference of linkage disequilibrium (LD) between a tested genetic marker and a putative locus affecting a disease trait. However, LD analysis is vulnerable to several confounding factors of which population stratification is the most prominent. Whilst many methods have been proposed to correct for the influence either through predicting the structure parameters or correcting inflation in the test statistic due to the stratification, these may not be feasible or may impose further statistical problems in practical implementation.
We propose here a novel statistical method to control spurious LD in GWAS from population structure by incorporating a control marker into testing for significance of genetic association of a polymorphic marker with phenotypic variation of a complex trait. The method avoids the need of structure prediction which may be infeasible or inadequate in practice and accounts properly for a varying effect of population stratification on different regions of the genome under study. Utility and statistical properties of the new method were tested through an intensive computer simulation study and an association-based genome-wide mapping of expression quantitative trait loci in genetically divergent human populations.
The analyses show that the new method confers an improved statistical power for detecting genuine genetic association in subpopulations and an effective control of spurious associations stemmed from population structure when compared with other two popularly implemented methods in the literature of GWAS.
Artificial insemination and genetic selection are major factors contributing to population stratification in dairy cattle. In this study, we analyzed the effect of sample stratification and the effect of stratification correction on results of a dairy genome-wide association study (GWAS). Three methods for stratification correction were used: the efficient mixed-model association expedited (EMMAX) method accounting for correlation among all individuals, a generalized least squares (GLS) method based on half-sib intraclass correlation, and a principal component analysis (PCA) approach.
Historical pedigree data revealed that the 1,654 contemporary cows in the GWAS were all related when traced through approximately 10–15 generations of ancestors. Genome and phenotype stratifications had a striking overlap with the half-sib structure. A large elite half-sib family of cows contributed to the detection of favorable alleles that had low frequencies in the general population and high frequencies in the elite cows and contributed to the detection of X chromosome effects. All three methods for stratification correction reduced the number of significant effects. EMMAX method had the most severe reduction in the number of significant effects, and the PCA method using 20 principal components and GLS had similar significance levels. Removal of the elite cows from the analysis without using stratification correction removed many effects that were also removed by the three methods for stratification correction, indicating that stratification correction could have removed some true effects due to the elite cows. SNP effects with good consensus between different methods and effect size distributions from USDA’s Holstein genomic evaluation included the DGAT1-NIBP region of BTA14 for production traits, a SNP 45kb upstream from PIGY on BTA6 and two SNPs in NIBP on BTA14 for protein percentage. However, most of these consensus effects had similar frequencies in the elite and average cows.
Genetic selection and extensive use of artificial insemination contributed to overlapped genome, pedigree and phenotype stratifications. The presence of an elite cluster of cows was related to the detection of rare favorable alleles that had high frequencies in the elite cluster and low frequencies in the remaining cows. Methods for stratification correction could have removed some true effects associated with genetic selection.
Understanding the role of genetic variation in human diseases remains an important problem to be solved in genomics. An important component of such variation consist of variations at single sites in DNA, or single nucleotide polymorphisms (SNPs). Typically, the problem of associating particular SNPs to phenotypes has been confounded by hidden factors such as the presence of population structure, family structure or cryptic relatedness in the sample of individuals being analyzed. Such confounding factors lead to a large number of spurious associations and missed associations. Various statistical methods have been proposed to account for such confounding factors such as linear mixed-effect models (LMMs) or methods that adjust data based on a principal components analysis (PCA), but these methods either suffer from low power or cease to be tractable for larger numbers of individuals in the sample. Here we present a statistical model for conducting genome-wide association studies (GWAS) that accounts for such confounding factors. Our method scales in runtime quadratic in the number of individuals being studied with only a modest loss in statistical power as compared to LMM-based and PCA-based methods when testing on synthetic data that was generated from a generalized LMM. Applying our method to both real and synthetic human genotype/phenotype data, we demonstrate the ability of our model to correct for confounding factors while requiring significantly less runtime relative to LMMs. We have implemented methods for fitting these models, which are available at http://www.microsoft.com/science.
The Eigenstrat method, based on principal components analysis (PCA), is commonly used both to quantify population relationships in population genetics and to correct for population stratification in genome-wide association studies. However, it can be difficult to make appropriate inference about population relationships from the principal component (PC) scatter plot. Here, to better understand the working mechanism of the Eigenstrat method, we consider its theoretical or “population” formulation. The eigen-equation for samples from an arbitrary number () of populations is reduced to that of a matrix of dimension , the elements of which are determined by the variance-covariance matrix for the random vector of the allele frequencies. Solving the reduced eigen-equation is numerically trivial and yields eigenvectors that are the axes of variation required for differentiating the populations. Using the reduced eigen-equation, we investigate the within-population fluctuations around the axes of variation on the PC scatter plot for simulated datasets. Specifically, we show that there exists an asymptotically stable pattern of the PC plot for large sample size. Our results provide theoretical guidance for interpreting the pattern of PC plot in terms of population relationships. For applications in genetic association tests, we demonstrate that, as a method of correcting for population stratification, regressing out the theoretical PCs corresponding to the axes of variation is equivalent to simply removing the population mean of allele counts and works as well as or better than the Eigenstrat method.
Traditional transmission disequilibrium test (TDT) based methods for genetic association analyses are robust to population stratification at the cost of a substantial loss of power. We here describe a novel method for family-based association studies that corrects for population stratification with the use of an extension of principal component analysis (PCA). Specifically, we adopt PCA on unrelated parents in each family. We then infer principal components for children from those for their parents through a TDT-like strategy. Two test statistics within variance-components model are proposed for association tests. Simulation results show that the proposed tests have correct type I error rates regardless of population stratification, and have greatly improved power over two popular TDT-based methods: QTDT and FBAT. The application to the Genetic Analysis Workshop 16 (GAW16) data sets attests to the feasibility of the proposed method.
Family Based Association Tests (FBATs); Transmission Disequilibrium Test (TDT); Principal Component Analysis (PCA); Variance-Components
Population stratification can cause spurious associations in population–based association studies. Several statistical methods have been proposed to reduce the impact of population stratification on population–based association studies. We simulated a set of stratified populations based on the real haplotype data from the HapMap ENCODE project, and compared the relative power, type I error rates, accuracy and positive prediction value of four prevailing population–based association study methods: traditional case-control tests, structured association (SA), genomic control (GC) and principal components analysis (PCA) under various population stratification levels. Additionally, we evaluated the effects of sample sizes and frequencies of disease susceptible allele on the performance of the four analytical methods in the presence of population stratification. We found that the performance of PCA was very stable under various scenarios. Our comparison results suggest that SA and PCA have comparable performance, if sufficient ancestral informative markers are used in SA analysis. GC appeared to be strongly conservative in significantly stratified populations. It may be better to apply GC in the stratified populations with low stratification level. Our study intends to provide a practical guideline for researchers to select proper study methods and make appropriate inference of the results in population-based association studies.
The ever increasing sizes of population genetic datasets pose great challenges for population structure analysis. The Tracy-Widom (TW) statistical test is widely used for detecting structure. However, it has not been adequately investigated whether the TW statistic is susceptible to type I error, especially in large, complex datasets. Non-parametric, Principal Component Analysis (PCA) based methods for resolving structure have been developed which rely on the TW test. Although PCA-based methods can resolve structure, they cannot infer ancestry. Model-based methods are still needed for ancestry analysis, but they are not suitable for large datasets. We propose a new structure analysis framework for large datasets. This includes a new heuristic for detecting structure and incorporation of the structure patterns inferred by a PCA method to complement STRUCTURE analysis.
A new heuristic called EigenDev for detecting population structure is presented. When tested on simulated data, this heuristic is robust to sample size. In contrast, the TW statistic was found to be susceptible to type I error, especially for large population samples. EigenDev is thus better-suited for analysis of large datasets containing many individuals, in which spurious patterns are likely to exist and could be incorrectly interpreted as population stratification. EigenDev was applied to the iterative pruning PCA (ipPCA) method, which resolves the underlying subpopulations. This subpopulation information was used to supervise STRUCTURE analysis to infer patterns of ancestry at an unprecedented level of resolution. To validate the new approach, a bovine and a large human genetic dataset (3945 individuals) were analyzed. We found new ancestry patterns consistent with the subpopulations resolved by ipPCA.
The EigenDev heuristic is robust to sampling and is thus superior for detecting structure in large datasets. The application of EigenDev to the ipPCA algorithm improves the estimation of the number of subpopulations and the individual assignment accuracy, especially for very large and complex datasets. Furthermore, we have demonstrated that the structure resolved by this approach complements parametric analysis, allowing a much more comprehensive account of population structure. The new version of the ipPCA software with EigenDev incorporated can be downloaded from http://www4a.biotec.or.th/GI/tools/ippca.
Expression quantitative trait loci (eQTL) studies are an integral tool to investigate the genetic component of gene expression variation. A major challenge in the analysis of such studies are hidden confounding factors, such as unobserved covariates or unknown subtle environmental perturbations. These factors can induce a pronounced artifactual correlation structure in the expression profiles, which may create spurious false associations or mask real genetic association signals. Here, we report PANAMA (Probabilistic ANAlysis of genoMic dAta), a novel probabilistic model to account for confounding factors within an eQTL analysis. In contrast to previous methods, PANAMA learns hidden factors jointly with the effect of prominent genetic regulators. As a result, this new model can more accurately distinguish true genetic association signals from confounding variation. We applied our model and compared it to existing methods on different datasets and biological systems. PANAMA consistently performs better than alternative methods, and finds in particular substantially more trans regulators. Importantly, our approach not only identifies a greater number of associations, but also yields hits that are biologically more plausible and can be better reproduced between independent studies. A software implementation of PANAMA is freely available online at http://ml.sheffield.ac.uk/qtl/.
The computational analysis of genetical genomics studies is challenged by confounding variation that is unrelated to the genetic factors of interest. Several approaches to account for these confounding factors have been proposed, greatly increasing the sensitivity in recovering direct genetic (cis) associations between variable genetic loci and the expression levels of individual genes. Crucially, these existing techniques largely rely on the true association signals being orthogonal to the confounding variation. Here, we show that when studying indirect (trans) genetic effects, for example from master regulators, their association signals can overlap with confounding factors estimated using existing methods. This technical overlap can lead to overcorrection, erroneously explaining away true associations as confounders. To address these shortcomings, we propose PANAMA, a model that jointly learns hidden factors while accounting for the effect of selected genetic regulators. In applications to several studies, PANAMA is more accurate than existing methods in recovering the hidden confounding factors. As a result, we find an increase in the statistical power for direct (cis) and indirect (trans) associations. Most strikingly on yeast, PANAMA not only finds additional associations but also identifies master regulators that can be better reproduced between independent studies.
We propose a two-step model-based approach, with correction for ascertainment, to linkage analysis of a binary trait with variable age of onset and apply it to a set of multiplex pedigrees segregating for adult glioma.
First, we fit segregation models by formulating the likelihood for a person to have a bivariate phenotype, affection status and age of onset, along with other covariates, and from these we estimate population trait allele frequencies and penetrance parameters as a function of age (N=281 multiplex glioma pedigrees). Second, the best fitting models are used as trait models in multipoint linkage analysis (N=74 informative multiplex glioma pedigrees). To correct for ascertainment, a prevalence constraint is used in the likelihood of the segregation models for all 281 pedigrees. Then the trait allele frequencies are re-estimated for the pedigree founders of the subset of 74 pedigrees chosen for linkage analysis.
Using the best fitting segregation models in model-based multipoint linkage analysis, we identified two separate peaks on chromosome 17; the first agreed with a region identified by Shete et al. who used model-free affected-only linkage analysis, but with a narrowed peak: and the second agreed with a second region they found but had a larger maximum log of the odds (LOD).
Our approach has the advantage of not requiring markers to be in linkage equilibrium unless the minor allele frequency is small (markers which tend to be uninformative for linkage), and of using more of the available information for LOD-based linkage analysis.
Glioma; model-based linkage; segregation; age of onset; prevalence constraint
Population structure is known to cause false-positive detection in association studies. We compared the power, precision, and type-I error rates of various association models in analyses of a simulated dataset with structure at the population (admixture from two populations; P) and family (K) levels. We also compared type-I error rates among models in analyses of publicly available human and dog datasets. The models corrected for none, one, or both structure levels. Correction for K was performed with linear mixed models incorporating familial relationships estimated from pedigrees or genetic markers. Linear models that ignored K were also tested. Correction for P was performed using principal component or structured association analysis. In analyses of simulated and real data, linear mixed models that corrected for K were able to control for type-I error, regardless of whether they also corrected for P. In contrast, correction for P alone in linear models was insufficient. The power and precision of linear mixed models with and without correction for P were similar. Furthermore, power, precision, and type-I error rate were comparable in linear mixed models incorporating pedigree and genomic relationships. In summary, in association studies using samples with both P and K, ancestries estimated using principal components or structured assignment were not sufficient to correct type-I errors. In such cases type-I errors may be controlled by use of linear mixed models with relationships derived from either pedigree or from genetic markers.
With the advance of next-generation sequencing technologies in recent years, rare genetic variant data have now become available for genetic epidemiology studies. For family samples however, only a few statistical methods for association analysis of rare genetic variants have been developed. Rare variant approaches are of great interest particularly for family data because samples enriched for trait-relevant variants can be ascertained and rare variants are putatively enriched through segregation. To facilitate the evaluation of existing and new rare variant testing approaches for analyzing family data, Genetic Analysis Workshop 18 (GAW18) provided genotype and next-generation sequencing data and longitudinal blood pressure traits from extended pedigrees of Mexican-American families from the San Antonio Family Study. Our GAW18 group members analyzed real and simulated phenotype data from GAW18 by using generalized linear mixed-effects models or principal components to adjust for familial correlation or by testing binary traits using a correction factor for familial effects. With one exception, approaches dealt with the extended pedigrees in their original state using information based on the kinship matrix or alternative genetic similarity measures. For simulated data, our group demonstrated that the family-based kernel machine score test is superior in power to family-based single-marker or burden tests, except in a few specific scenarios. For real data, three contributions identified significant associations. They substantially reduced the number of tests before performing the association analysis. We conclude from our real data analyses that further development of strategies for targeted testing or more focused screening of genetic variants is strongly desirable.
extended pedigrees; rare variant analysis; family-based association test; linear mixed effects model; kernel machine score test; principal components
Association studies using unrelated individuals have become the most popular design for mapping complex traits. One of the major challenges of association mapping is avoiding spurious association due to population stratification. Principal component analysis (PCA) on genome-wide marker genotypes is one of the most popular population stratification control methods. It implicitly assumes that the markers are in linkage equilibrium, a condition that is rarely satisfied and that we plan to relax.
We carefully examined the impact of linkage disequilibrium (LD) on PCA, and proposed a simple modification of the standard PCA to automatically adjust for the correlations among markers.
We demonstrated that LD patterns in genome-wide association datasets can distort the techniques for stratification control, showing ‘subpopulations’ reflecting localized LD phenomena rather than plausible population structure. We showed that the proposed method effectively removes the artifactual effect of LD patterns, and successfully recovers underlying population structure that is not apparent from standard PCA.
PCA is highly influenced by sets of SNPs with high LD, obscuring the true population substructure. Our shrinkage PCA applies to all available markers, regardless of the LD patterns. The proposed method is easier to implement than most existing LD adjusted PCA methods.
PCA; Loadings; GWAS
Although the importance of selecting cases and controls from the same population has been recognized for decades, the recent advent of genome-wide association studies has heightened awareness of this issue. Because these studies typically deal with large samples, small differences in allele frequencies between cases and controls can easily reach statistical significance. When, unbeknownst to a researcher, cases and controls have different substructures, the number of false-positive findings is inflated. There have been three recent developments of purely statistical approaches to assessing the ancestral comparability of case and control samples: genomic control, structured association, and multivariate reduction analyses. The widespread use of high-throughput technology has allowed the quick and accurate genotyping of the large number of markers required by these methods.
Group 13 dealt with four population stratification issues: single-nucleotide polymorphism marker selection, association testing, non-standard methods, and linkage disequilibrium calculations in stratified or mixed ethnicity samples. We demonstrated that there are continuous axes of ethnic variation in both datasets of Genetic Analysis Workshop 16. Furthermore, ignoring this structure created p-value inflation for a variety of phenotypes. Principal-components analysis (or multidimensional scaling) can control inflation as covariates in a logistic regression. One can weight for local ancestry estimation and allow the use of related individuals. Problems arise in the presence of extremely high association or unusually strong linkage disequilibrium (e.g., in chromosomal inversions). Our group also reported a method for performing an association test controlling for substructure when genome-wide markers are not available to explicitly compute stratification.
genetic association; genome-wide association study; principal components; multidimensional scaling; ethnic substructure
Recent large-scale studies of European populations have demonstrated the existence of population genetic structure within Europe and the potential to accurately infer individual ancestry when information from hundreds of thousands of genetic markers is used. In fact, when genomewide genetic variation of European populations is projected down to a two-dimensional Principal Components Analysis plot, a surprising correlation with actual geographic coordinates of self-reported ancestry has been reported. This substructure can hamper the search of susceptibility genes for common complex disorders leading to spurious correlations. The identification of genetic markers that can correct for population stratification becomes therefore of paramount importance. Analyzing 1,200 individuals from 11 populations genotyped for more than 500,000 SNPs (Population Reference Sample), we present a systematic exploration of the extent to which geographic coordinates of origin within Europe can be predicted, with small panels of SNPs. Markers are selected to correlate with the top principal components of the dataset, as we have previously demonstrated. Performing thorough cross-validation experiments we show that it is indeed possible to predict individual ancestry within Europe down to a few hundred kilometers from actual individual origin, using information from carefully selected panels of 500 or 1,000 SNPs. Furthermore, we show that these panels can be used to correctly assign the HapMap Phase 3 European populations to their geographic origin. The SNPs that we propose can prove extremely useful in a variety of different settings, such as stratification correction or genetic ancestry testing, and the study of the history of European populations.