Understanding the role of genetic variation in human diseases remains an important problem to be solved in genomics. An important component of such variation consist of variations at single sites in DNA, or single nucleotide polymorphisms (SNPs). Typically, the problem of associating particular SNPs to phenotypes has been confounded by hidden factors such as the presence of population structure, family structure or cryptic relatedness in the sample of individuals being analyzed. Such confounding factors lead to a large number of spurious associations and missed associations. Various statistical methods have been proposed to account for such confounding factors such as linear mixed-effect models (LMMs) or methods that adjust data based on a principal components analysis (PCA), but these methods either suffer from low power or cease to be tractable for larger numbers of individuals in the sample. Here we present a statistical model for conducting genome-wide association studies (GWAS) that accounts for such confounding factors. Our method scales in runtime quadratic in the number of individuals being studied with only a modest loss in statistical power as compared to LMM-based and PCA-based methods when testing on synthetic data that was generated from a generalized LMM. Applying our method to both real and synthetic human genotype/phenotype data, we demonstrate the ability of our model to correct for confounding factors while requiring significantly less runtime relative to LMMs. We have implemented methods for fitting these models, which are available at http://www.microsoft.com/science.
Population structure occurs when a sample is composed of individuals with different ancestries and can result in excess type I error in genome-wide association studies. Genome-wide principal-component analysis (PCA) has become a popular method for identifying and adjusting for subtle population structure in association studies. Using the Genetic Analysis Workshop 16 (GAW16) NARAC data, we explore two unresolved issues concerning the use of genome-wide PCA to account for population structure in genetic associations studies: the choice of single-nucleotide polymorphism (SNP) subset and the choice of adjustment model. We computed PCs for subsets of genome-wide SNPs with varying levels of LD. The first two PCs were similar for all subsets and the first three PCs were associated with case status for all subsets. When the PCs associated with case status were included as covariates in an association model, the reduction in genomic inflation factor was similar for all SNP sets. Several models have been proposed to account for structure using PCs, but it is not yet clear whether the different methods will result in substantively different results for association studies with individuals of European descent. We compared genome-wide association p-values and results for two positive-control SNPs previously associated with rheumatoid arthritis using four PC adjustment methods as well as no adjustment and genomic control. We found that in this sample, adjusting for the continuous PCs or adjusting for discrete clusters identified using the PCs adequately accounts for the case-control population structure, but that a recently proposed randomization test performs poorly.
To account for population stratification in association studies, principal-components analysis is often performed on single-nucleotide polymorphisms (SNPs) across the genome. Here, we use Framingham Heart Study (FHS) Genetic Analysis Workshop 16 data to compare the performance of local ancestry adjustment for population stratification based on principal components (PCs) estimated from SNPs in a local chromosomal region with global ancestry adjustment based on PCs estimated from genome-wide SNPs.
Standardized height residuals from unrelated adults from the FHS Offspring Cohort were averaged from longitudinal data. PCs of SNP genotype data were calculated to represent individual's ancestry either 1) globally using all SNPs across the genome or 2) locally using SNPs in adjacent 20-Mbp regions within each chromosome. We assessed the extent to which there were differences in association studies of height depending on whether PCs for global, local, or both global and local ancestry were included as covariates.
The correlations between local and global PCs were low (r < 0.12), suggesting variability between local and global ancestry estimates. Genome-wide association tests without any ancestry adjustment demonstrated an inflated type I error rate that decreased with adjustment for local ancestry, global ancestry, or both. A known spurious association was replicated for SNPs within the lactase gene, and this false-positive association was abolished by adjustment with local or global ancestry PCs.
Population stratification is a potential source of bias in this seemingly homogenous FHS population. However, local and global PCs derived from SNPs appear to provide adequate information about ancestry.
Motivation: Adjustment for population structure is necessary to avoid bias in genetic association studies of susceptibility variants for complex diseases. Population structure may differ from one genomic region to another due to the variability of individual ancestry associated with migration, random genetic drift or natural selection. Current association methods for correcting population stratification usually involve adjustment of global ancestry between study subjects.
Results: We suggest interrogating local population structure for fine mapping to more accurately locate true casual genes by better adjusting the confounding effect due to local ancestry. By extensive simulations on genome-wide datasets, we show that adjusting global ancestry may lead to false positives when local population structure is an important confounding factor. In contrast, adjusting local ancestry can effectively prevent false positives due to local population structure and thus can improve fine mapping for disease gene localization. We applied the local and global adjustments to the analysis of datasets from three genome-wide association studies, including European Americans, African Americans and Nigerians. Both European Americans and African Americans demonstrate greater variability in local ancestry than Nigerians. Adjusting local ancestry successfully eliminated the known spurious association between SNPs in the LCT gene and height due to the population structure existed in European Americans.
Supplementary information: Supplementary data are available at Bioinformatics online.
Population stratification is a systematic difference in allele frequencies between subpopulations. This can lead to spurious association findings in the case–control genome wide association studies (GWASs) used to identify single nucleotide polymorphisms (SNPs) associated with disease-linked phenotypes. Methods such as self-declared ancestry, ancestry informative markers, genomic control, structured association, and principal component analysis are used to assess and correct population stratification but each has limitations. We provide an alternative technique to address population stratification.
We propose a novel machine learning method, ETHNOPRED, which uses the genotype and ethnicity data from the HapMap project to learn ensembles of disjoint decision trees, capable of accurately predicting an individual’s continental and sub-continental ancestry. To predict an individual’s continental ancestry, ETHNOPRED produced an ensemble of 3 decision trees involving a total of 10 SNPs, with 10-fold cross validation accuracy of 100% using HapMap II dataset. We extended this model to involve 29 disjoint decision trees over 149 SNPs, and showed that this ensemble has an accuracy of ≥ 99.9%, even if some of those 149 SNP values were missing. On an independent dataset, predominantly of Caucasian origin, our continental classifier showed 96.8% accuracy and improved genomic control’s λ from 1.22 to 1.11. We next used the HapMap III dataset to learn classifiers to distinguish European subpopulations (North-Western vs. Southern), East Asian subpopulations (Chinese vs. Japanese), African subpopulations (Eastern vs. Western), North American subpopulations (European vs. Chinese vs. African vs. Mexican vs. Indian), and Kenyan subpopulations (Luhya vs. Maasai). In these cases, ETHNOPRED produced ensembles of 3, 39, 21, 11, and 25 disjoint decision trees, respectively involving 31, 502, 526, 242 and 271 SNPs, with 10-fold cross validation accuracy of 86.5% ± 2.4%, 95.6% ± 3.9%, 95.6% ± 2.1%, 98.3% ± 2.0%, and 95.9% ± 1.5%. However, ETHNOPRED was unable to produce a classifier that can accurately distinguish Chinese in Beijing vs. Chinese in Denver.
ETHNOPRED is a novel technique for producing classifiers that can identify an individual’s continental and sub-continental heritage, based on a small number of SNPs. We show that its learned classifiers are simple, cost-efficient, accurate, transparent, flexible, fast, applicable to large scale GWASs, and robust to missing values.
Because of the low frequency of rare genetic variants in observed data, the statistical power of detecting their associations with target traits is usually low. The collapsing test of collective effect of multiple rare variants is an important and useful strategy to increase the power; in addition, family data may be enriched with causal rare variants and therefore provide extra power. However, when family data are used, both population structure and familial relatedness need to be adjusted for the possible inflation of false positives. Using a unified mixed linear model and family data, we compared six methods to detect the association between multiple rare variants and quantitative traits. Through the analysis of 200 replications of the quantitative trait Q2 from the Genetic Analysis Workshop 17 data set simulated for 697 subjects from 8 extended families, and based on quantile-quantile plots under the null and receiver operating characteristic curves, we compared the false-positive rate and power of these methods. We observed that adjusting for pedigree-based kinship gives the best control for false-positive rate, whereas adjusting for marker-based identity by state slightly outperforms in terms of power. An adjustment based on a principal components analysis slightly improves the false-positive rate and power. Taking into account type-1 error, power, and computational efficiency, we find that adjusting for pedigree-based kinship seems to be a good choice for the collective test of association between multiple rare variants and quantitative traits using family data.
This article combines social and genetic epidemiology to examine the influence of self-reported ethnicity on body mass index (BMI) among a sample of adolescents and young adults. We use genetic information from more than 5,000 single nucleotide polymorphisms in combination with principal components analysis to characterize population ancestry of individuals in this study. We show that non-Hispanic white and Mexican-American respondents differ significantly with respect to BMI and differ on the first principal component from the genetic data. This first component is positively associated with BMI and accounts for roughly 3% of the genetic variance in our sample. However, after controlling for this genetic measure, the observed ethnic differences in BMI remain large and statistically significant. This study demonstrates a parsimonious method to adjust for genetic differences among individual respondents that may contribute to observed differences in outcomes. In this case, adjusting for genetic background has no bearing on the influence of self-identified ethnicity.
Accurate genetic association studies are crucial for the detection and the validation of disease determinants. One of the main confounding factors that affect accuracy is population stratification, and great efforts have been extended for the past decade to detect and to adjust for it. We have now efficient solutions for population stratification adjustment for single-SNP (where SNP is single-nucleotide polymorphisms) inference in genome-wide association studies, but it is unclear whether these solutions can be effectively applied to rare variation studies and in particular gene-based (or set-based) association methods that jointly analyze multiple rare and common variants. We examine here, both theoretically and empirically, the performance of two commonly used approaches for population stratification adjustment—genomic control and principal component analysis—when used on gene-based association tests. We show that, different from single-SNP inference, genes with diverse composition of rare and common variants may suffer from population stratification to various extent. The inflation in gene-level statistics could be impacted by the number and the allele frequency spectrum of SNPs in the gene, and by the gene-based testing method used in the analysis. As a consequence, using a universal inflation factor as a genomic control should be avoided in gene-based inference with sequencing data. We also demonstrate that caution needs to be exercised when using principal component adjustment because the accuracy of the adjusted analyses depends on the underlying population substructure, on the way the principal components are constructed, and on the number of principal components used to recover the substructure.
sequencing studies; gene-based association test; genomic control; principal component analysis; C-alpha test; burden test
In genetic association studies, it is necessary to correct for population structure to avoid inference bias. During the past decade, prevailing corrections often only involved adjustments of global ancestry differences between sampled individuals. Nevertheless, population structure may vary across local genomic regions due to the variability of local ancestries associated with natural selection, migration, or random genetic drift. Adjusting for global ancestry alone may be inadequate when local population structure is an important confounding factor. In contrast, adjusting for local ancestry can more effectively prevent false-positives due to local population structure. To more accurately locate disease genes, we recommend adjusting for local ancestries by interrogating local structure. In practice, locus-specific ancestries are usually unknown and cannot be accurately inferred when ancestral population information is not available. For such scenarios, we propose employing local principal components (PC) to represent local ancestries and adjusting for local PCs when testing for genotype–phenotype association. With an acceptable computation burden, the proposed algorithm successfully eliminates the known spurious association between SNPs in the LCT gene and height due to the population structure in European Americans.
Genome-wide association studies; Local ancestries; Local principal components; Migration; Random genetic drift; Natural selection; Genomic inflation factor; Genomic control; Local ancestry principal components correction; Fine mapping
It is generally presumed that the Cystic Fibrosis (CF) population is relatively homogeneous, and predominantly of European origin. The complex ethnic make-up observed in the CF patients collected by the North American CF Modifier Gene Consortium has brought this assumption into question, and suggested the potential for population substructure in the three CF study samples collected from North America. It is well appreciated that population substructure can result in spurious genetic associations.
To understand the ethnic composition of the North American CF population, and to assess the need for population structure adjustment in genetic association studies with North American CF patients.
Genome-wide single-nucleotide polymorphisms on 3076 unrelated North American CF patients were used to perform population structure analyses. We compared self-reported ethnicity to genotype-inferred ancestry, and also examined whether geographic distribution and CFTR mutation type could explain the structure observed.
Although largely Caucasian, our analyses identified a considerable number of CF patients with admixed African-Caucasian, Mexican-Caucasian and Indian-Caucasian ancestries. Population substructure was present and comparable across the three studies of the consortium. Neither geographic distribution nor mutation type explained the population structure.
Given the ethnic diversity of the North American CF population, it is essential to carefully detect, estimate and adjust for population substructure to guard against potential spurious findings in CF genetic association studies. Other Mendelian diseases that are presumed to predominantly affect single ethnic groups may also benefit from careful analysis of population structure.
ethnicity; principal component analysis; population substructure; population stratification
There are many ways to perform adjustment for population structure. It remains unclear what the optimal approach is and whether the optimal approach varies by the type of samples and substructure present. The simplest and most straightforward approach is to adjust for the continuous principal components (PCs) that capture ancestry. Through simulation, we explored the issue of which ancestry informative PCs should be adjusted for in an association model to control for the confounding nature of population structure while maintaining maximum power. A thorough examination of selecting PCs for adjustment in a case-control study across the possible structure scenarios that could occur in a genome-wide association study has not been previously reported.
We found that when the SNP and phenotype frequencies do not vary over the sub-populations, all methods of selection provided similar power and appropriate Type I error for association. When the SNP is not structured and the phenotype has large structure, then selection methods that do not select PCs for inclusion as covariates generally provide the most power. When there is a structured SNP and a non-structured phenotype, selection methods that include PCs in the model have greater power. When both the SNP and the phenotype are structured, all methods of selection have similar power.
Standard practice is to include a fixed number of PCs in genome-wide association studies. Based on our findings, we conclude that if power is not a concern, then selecting the same set of top PCs for adjustment for all SNPs in logistic regression is a strategy that achieves appropriate Type I error. However, standard practice is not optimal in all scenarios and to optimize power for structured SNPs in the presence of unstructured phenotypes, PCs that are associated with the tested SNP should be included in the logistic model.
Population stratification can cause spurious associations in a genome-wide association study (GWAS), and occurs when differences in allele frequencies of single nucleotide polymorphisms (SNPs) are due to ancestral differences between cases and controls rather than the trait of interest. Principal components analysis (PCA) is the established approach to detect population substructure using genome-wide data and to adjust the genetic association for stratification by including the top principal components in the analysis. An alternative solution is genetic matching of cases and controls that requires, however, well defined population strata for appropriate selection of cases and controls.
We developed a novel algorithm to cluster individuals into groups with similar ancestral backgrounds based on the principal components computed by PCA. We demonstrate the effectiveness of our algorithm in real and simulated data, and show that matching cases and controls using the clusters assigned by the algorithm substantially reduces population stratification bias. Through simulation we show that the power of our method is higher than adjustment for PCs in certain situations.
In addition to reducing population stratification bias and improving power, matching creates a clean dataset free of population stratification which can then be used to build prediction models without including variables to adjust for ancestry. The cluster assignments also allow for the estimation of genetic heterogeneity by examining cluster specific effects.
Osteopontin (SPP1) is an important bone matrix mediator found to have key roles in inflammation and immunity. SPP1 genetic polymorphisms and increased osteopontin protein levels have been reported to be associated with SLE in small patient collections. The present study evaluates association between SPP1 polymorphisms and SLE in a large cohort of 1141 unrelated SLE patients [707 European-American (EA) and 434 African-American (AA)], and 2009 unrelated controls (1309 EA and 700 AA). Population-based case-control association analyses were performed. To control for potential population stratification, admixture adjusted logistic regression, genomic control (GC), structured association (STRAT), and principal components analysis (PCA) were applied. Combined analysis of 2 ethnic groups, showed the minor allele of 2 SNPs (rs1126616T and rs9138C) significantly associated with higher risk of SLE in males (P = 0.0005, OR = 1.73, 95% CI = 1.28–2.33), but not in females. Indeed, significant gene-gender interactions in the 2 SNPs, rs1126772 and rs9138, were detected (P = 0.001 and P = 0.0006, respectively). Further, haplotype analysis identified rs1126616T-rs1126772A-rs9138C which demonstrated significant association with SLE in general (P = 0.02, OR = 1.30, 95%CI 1.08–1.57), especially in males (P = 0.0003, OR = 2.42, 95%CI 1.51–3.89). Subgroup analysis with single SNPs and haplotypes also identified a similar pattern of gender-specific association in AA and EA. GC, STRAT, and PCA results within each group showed consistent associations. Our data suggest SPP1 is associated with SLE, and this association is especially stronger in males. To our knowledge, this report serves as the first association of a specific autosomal gene with human male lupus.
C-reactive protein (CRP), a marker of systemic inflammation, is associated with risk of coronary events and sub-clinical measures of atherosclerosis. Evidence in support of this link being causal would include an association robust to adjustments for confounders (multivariable standard regression analysis) and the association of CRP gene polymorphisms with atherosclerosis (Mendelian randomization analysis).
We genotyped 3 tag single nucleotide polymorphisms (SNPs) [+1444T>C (rs1130864); +2303G>A (rs1205) and +4899T>G (rs 3093077)] in the CRP gene and assessed CRP and carotid intima-media thickness (CIMT), a structural marker of atherosclerosis, in 4941 men and women aged 50–74 (mean 61) years (the Whitehall II Study). The 4 major haplotypes from the SNPs were consistently associated with CRP level, but not with other risk factors that might confound the association between CRP and CIMT. CRP, assessed both at mean age 49 and at mean age 61, was associated both with CIMT in age and sex adjusted standard regression analyses and with potential confounding factors. However, the association of CRP with CIMT attenuated to the null with adjustment for confounding factors in both prospective and cross-sectional analyses. When examined using genetic variants as the instrument for serum CRP, there was no inferred association between CRP and CIMT.
Both multivariable standard regression analysis and Mendelian randomization analysis suggest that the association of CRP with carotid atheroma indexed by CIMT may not be causal.
The North American Rheumatoid Arthritis Consortium case-control study collected case participants across the United States and control participants from New York. More than 500,000 single-nucleotide polymorphisms (SNPs) were genotyped in the sample of 2000 cases and controls. Careful adjustment for the confounding effect of population stratification must be conducted when analyzing these data; the variance inflation factor (VIF) without adjustment is 1.44. In the primary analyses of these data, a clustering algorithm in the program PLINK was used to reduce the VIF to 1.14, after which genomic control was used to control residual confounding. Here we use stratification scores to achieve a unified and coherent control for confounding. We used the first 10 principal components, calculated genome-wide using a set of 81,500 loci that had been selected to have low pair-wise linkage disequilibrium, as risk factors in a logistic model to calculate the stratification score. We then divided the data into five strata based on quantiles of the stratification score. The VIF of these stratified data is 1.04, indicating substantial control of stratification. However, after control for stratification, we find that there are no significant loci associated with rheumatoid arthritis outside of the HLA region. In particular, we find no evidence for association of TRAF1-C5 with rheumatoid arthritis.
It is well-known that population substructure may lead to confounding in case-control association studies. Here, we examined genetic structure in a large racially and ethnically diverse sample consisting of 5 ethnic groups of the Multiethnic Cohort study (African Americans, Japanese Americans, Latinos, European Americans and Native Hawaiians) using 2,509 SNPs distributed across the genome. Principal component analysis on 6,213 study participants, 18 Native Americans and 11 HapMap III populations revealed 4 important principal components (PCs): the first two separated Asians, Europeans and Africans, and the third and fourth corresponded to Native American and Native Hawaiian (Polynesian) ancestry, respectively. Individual ethnic composition derived from self-reported parental information matched well to genetic ancestry for Japanese and European Americans. STRUCTURE-estimated individual ancestral proportions for African Americans and Latinos are consistent with previous reports. We quantified the East Asian (mean 27%), European (mean 27%) and Polynesian (mean 46%) ancestral proportions for the first time, to our knowledge, for Native Hawaiians. Simulations based on realistic settings of case-control studies nested in the Multiethnic Cohort found that the effect of population stratification was modest and readily corrected by adjusting for race/ethnicity or by adjusting for top PCs derived from all SNPs or from ancestry informative markers; the power of these approaches was similar when averaged across causal variants simulated based on allele frequencies of the 2,509 genotyped markers. The bias may be large in case-only analysis of gene by gene interactions but it can be corrected by top PCs derived from all SNPs.
AIMs; African American; Native Hawaiian; Latino; admixture; principal component analysis
Motivation: The rapid development of genotyping technology and extensive cataloguing of single nucleotide polymorphisms (SNPs) across the human genome have made genetic association studies the mainstream for gene mapping of complex human diseases. For many diseases, the most practical approach is the population-based design with unrelated individuals. Although having the advantages of easier sample collection and greater power than family-based designs, unrecognized population stratification in the study samples can lead to both false-positive and false-negative findings and might obscure the true association signals if not appropriately corrected.
Methods: We report PHYLOSTRAT, a new method that corrects for population stratification by combining phylogeny constructed from SNP genotypes and principal coordinates from multi-dimensional scaling (MDS) analysis. This hybrid approach efficiently captures both discrete and admixed population structures.
Results: By extensive simulations, the analysis of a synthetic genome-wide association dataset created using data from the Human Genome Diversity Project, and the analysis of a lactase-height dataset, we show that our method can correct for population stratification more efficiently than several existing population stratification correction methods, including EIGENSTRAT, a hybrid approach based on MDS and clustering, and STRATSCORE , in terms of requiring fewer random SNPs for inference of population structure. By combining the flexibility and hierarchical nature of phylogenetic trees with the advantage of representing admixture using MDS, our hybrid approach can capture the complex population structures in human populations effectively.
Software Availability: Codes can be downloaded from http://people.pcbi.upenn.edu/∼lswang/phylostrat/
Contact: firstname.lastname@example.org; email@example.com.
Supplementary information: Supplementary data are available at Bioinformatics online.
Statistical tests on rare variant data may well have type I error rates that differ from their nominal levels. Here, we use the Genetic Analysis Workshop 17 data to estimate type I error rates and powers of three models for identifying rare variants associated with a phenotype: (1) by using the number of minor alleles, age, and smoking status as predictor variables; (2) by using the number of minor alleles, age, smoking status, and the identity of the population of the subject as predictor variables; and (3) by using the number of minor alleles, age, smoking status, and ancestry adjustment using 10 principal component scores. We studied both quantitative phenotype and a dichotomized phenotype. The model with principal component adjustment has type I error rates that are closer to the nominal level of significance of 0.05 for single-nucleotide polymorphisms (SNPs) in noncausal genes for the selected phenotype than the model directly adjusting for population. The principal component adjustment model type I error rates are also closer to the nominal level of 0.05 for noncausal SNPs located in causal genes for the phenotype. The power for causal SNPs with the principal component adjustment model is comparable to the power of the other methods. The power using the underlying quantitative phenotype is greater than the power using the dichotomized phenotype.
The Genetic Analysis Workshop 17 data we used comprise 697 unrelated individuals genotyped at 24,487 single-nucleotide polymorphisms (SNPs) from a mini-exome scan, using real sequence data for 3,205 genes annotated by the 1000 Genomes Project and simulated phenotypes. We studied 200 sets of simulated phenotypes of trait Q2. An important feature of this data set is that most SNPs are rare, with 87% of the SNPs having a minor allele frequency less than 0.05. For rare SNP detection, in this study we performed a least absolute shrinkage and selection operator (LASSO) regression and F tests at the gene level and calculated the generalized degrees of freedom to avoid any selection bias. For comparison, we also carried out linear regression and the collapsing method, which sums the rare SNPs, modified for a quantitative trait and with two different allele frequency thresholds. The aim of this paper is to evaluate these four approaches in this mini-exome data and compare their performance in terms of power and false positive rates. In most situations the LASSO approach is more powerful than linear regression and collapsing methods. We also note the difficulty in determining the optimal threshold for the collapsing method and the significant role that linkage disequilibrium plays in detecting rare causal SNPs. If a rare causal SNP is in strong linkage disequilibrium with a common marker in the same gene, power will be much improved.
The definition of European population genetic substructure and its application to understanding complex phenotypes is becoming increasingly important. In the current study using over 4,000 subjects genotyped for 300,000 single-nucleotide polymorphisms (SNPs), we provide further insight into relationships among European population groups and identify sets of SNP ancestry informative markers (AIMs) for application in genetic studies. In general, the graphical description of these principal components analyses (PCA) of diverse European subjects showed a strong correspondence to the geographical relationships of specific countries or regions of origin. Clearer separation of different ethnic and regional populations was observed when northern and southern European groups were considered separately and the PCA results were influenced by the inclusion or exclusion of different self-identified population groups including Ashkenazi Jewish, Sardinian, and Orcadian ethnic groups. SNP AIM sets were identified that could distinguish the regional and ethnic population groups. Moreover, the studies demonstrated that most allele frequency differences between different European groups could be controlled effectively in analyses using these AIM sets. The European substructure AIMs should be widely applicable to ongoing studies to confirm and delineate specific disease susceptibility candidate regions without the necessity of performing additional genome-wide SNP studies in additional subject sets.
In genetic studies, associations between genotypes and phenotypes may be confounded by unrecognized population structure and/or admixture. Studies have shown that even in European populations, which are thought to be relatively homogeneous, population stratification exists and can affect the validity of association studies. A number of methods have been proposed to address this issue in recent years. Among them, the mixed-model based approach and the principal component-based approach have several advantages over other methods. However, these approaches have not been thoroughly evaluated on large human datasets. The objectives of this study are to (1) evaluate and compare the performance of the mixed-model approach and the principal component-based approach for genetic association mapping using human data consisting of unrelated individuals, and (2) understand the relationship between these two approaches. To achieve these goals, we simulate datasets based on the HapMap data under various scenarios. Our results indicate that the mixed-model approach performs well in controlling for population structure/admixture. It has similar performance as that based on principal component analysis. However, the approach combining mixed-model and principal component analysis does not perform as well as either method itself.
Mixed-effects Model; Principal Component Analysis; Population Structure/Admixture; Genetic Association Analysis
Population-based association studies are used to identify common susceptibility variants for complex genetic traits. These studies are susceptible to confounding from unknown population substructure. Here we apply a model-based clustering approach to our case-control study of stroke among young women to examine if self-reported ethnicity can serve as a proxy for genetic ancestry.
A population-based case-control study of stroke among women aged 15-49 identified 361 cases of first ischemic stroke and 401 age-comparable control subjects. Thirty single nucleotide polymorphisms (SNPs) throughout the genome unrelated to stroke risk and with established ancestry-based allele frequency differences were genotyped in all participants. The Structure program was used to iteratively evaluate for K = 1 to 5 potential genetic-based subpopulations. Evaluating the population as a whole, the Structure output plateaued at K = 2 clusters. 98% of self-reported Caucasians had an estimated probability ≥50% of belonging to Cluster 1, while 94% of self-reported African-Americans had an estimated probability ≥50% of belonging to Cluster 2. Stratifying the participants by self-reported ethnicity and repeating the analyses revealed the presence of two clusters among Caucasians, suggesting that potential substructure may exist.
Among our combined sample of African-American and Caucasian participants there is no large unknown subpopulation and self-reported ethnicity can serve as a proxy for genetic ancestry. Ethnicity-specific analyses indicate that population substructure may exist among the Caucasian participants indicating that further studies are warranted.
We compare the SNP-based and gene-based association studies using 697 unrelated individuals. The Benjamini-Hochberg procedure was applied to control the false discovery rate for all the multiple comparisons. We use a linear model for the single-nucleotide polymorphism (SNP) based association study. For the gene-based study, we consider three methods. The first one is based on a linear model, the second is similarity based, and the third is a new two-step procedure. The results of power using a subset of SNPs show that the SNP-based association test is more powerful than the gene-based ones. However, in some situations, a gene-based study is able to detect the associated variants that were neglected in a SNP-based study. Finally, we apply these methods to a replicate of the quantitative trait Q1 and the binary trait D (D = 1, affected; D = 0, unaffected) for a genome-wide gene search.
Principal components analysis (PCA) has been successfully used to correct for population stratification in genome-wide association studies of common variants. However, rare variants also have a role in common disease etiology. Whether PCA successfully controls population stratification for rare variants has not been addressed. Thus we evaluate the effect of population stratification analysis on false-positive rates for common and rare variants at the single-nucleotide polymorphism (SNP) and gene level. We use the simulation data from Genetic Analysis Workshop 17 and compare false-positive rates with and without PCA at the SNP and gene level. We found that SNPs’ minor allele frequency (MAF) influenced the ability of PCA to effectively control false discovery. Specifically, PCA reduced false-positive rates more effectively in common SNPs (MAF > 0.05) than in rare SNPs (MAF < 0.01). Furthermore, at the gene level, although false-positive rates were reduced, power to detect true associations was also reduced using PCA. Taken together, these results suggest that sequence-level data should be interpreted with caution, because extremely rare SNPs may exhibit sporadic association that is not controlled using PCA.
Methods that can evaluate aggregate effects of rare and common variants are limited. Therefore, we applied a two-stage approach to evaluate aggregate gene effects in the 1000 Genomes Project data, which contain 24,487 single-nucleotide polymorphisms (SNPs) in 697 unrelated individuals from 7 populations. In stage 1, we identified potentially interesting genes (PIGs) as those having at least one SNP meeting Bonferroni correction using univariate, multiple regression models. In stage 2, we evaluate aggregate PIG effects on trait, Q1, by modeling each gene as a latent construct, which is defined by multiple common and rare variants, using the multivariate statistical framework of structural equation modeling (SEM). In stage 1, we found that PIGs varied markedly between a randomly selected replicate (replicate 137) and 100 other replicates, with the exception of FLT1. In stage 1, collapsing rare variants decreased false positives but increased false negatives. In stage 2, we developed a good-fitting SEM model that included all nine genes simulated to affect Q1 (FLT1, KDR, ARNT, ELAV4, FLT4, HIF1A, HIF3A, VEGFA, VEGFC) and found that FLT1 had the largest effect on Q1 (βstd = 0.33 ± 0.05). Using replicate 137 estimates as population values, we found that the mean relative bias in the parameters (loadings, paths, residuals) and their standard errors across 100 replicates was on average, less than 5%. Our latent variable SEM approach provides a viable framework for modeling aggregate effects of rare and common variants in multiple genes, but more elegant methods are needed in stage 1 to minimize type I and type II error.