We used empirical data from two GWAS within the CGEMS project to assess the extent and impact of PS in studies with two distinct control selection strategies. We also evaluated our proposed procedures for choosing structural inference SNPs as well as for selecting the PCs for correction of PS. In the two original GWAS based on the nested case-control design, we observed only a minor confounding effect of PS, with over-dispersion factors of 1.025 and 1.005 for the prostate and breast cancer studies, respectively. These small inflations, which in practice may not raise major concern, can be further reduced by adjustment for a single PC. In the two reconstructed studies, where cases and controls were collected independently using different designs, we observed a more extensive confounding effect of PS, with over-dispersion factors of 1.090 and 1.062. In these studies with external controls, three principal components were required to optimally correct for the confounding effect of PS, reducing the inflation factor to a level comparable to that in the original nested case-control studies. Our conclusions are based on two actual studies of two cancer sites and two hypothetical ones using observed data in European American populations. The impact of PS in other populations, such as African Americans, may be different and thus requires independent assessment.
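As a point of reference for the over-dispersion factors quoted above, the following minimal sketch computes one widely used estimator, the median-based genomic control inflation factor for 1-df test statistics. It is an assumption that this estimator matches the one used in our analyses; the function name `inflation_factor` and the constant name are illustrative only.

```python
import numpy as np

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Median-based inflation factor for 1-df association statistics.

    Assumption: every entry is a 1-df chi-square statistic from a
    single-marker test; non-finite entries are ignored.
    """
    stats = np.asarray(chi2_stats, dtype=float)
    stats = stats[np.isfinite(stats)]  # drop SNPs with undefined tests
    if stats.size == 0:
        raise ValueError("no finite test statistics supplied")
    return float(np.median(stats) / CHI2_1DF_MEDIAN)
```

Under these assumptions, a value close to 1.0 indicates little inflation of the test statistics, whereas values such as 1.09 indicate that the median statistic is about 9% larger than expected under the null.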
Case-control studies nested in prospective cohorts, such as the two original GWAS in the CGEMS project, tend to minimize biases introduced when cases and controls are selected from different populations. We found that cases and controls had comparable genetic backgrounds and that PS had only a minor confounding effect in these two studies. In stand-alone case-control studies, which are not nested within a cohort, the bias is likely to be somewhat greater because of difficulties in control selection when there is no roster of the underlying population producing the cases.
A more extreme but convenient and cost-efficient design alternative, notably taken recently by the WTCCC [20], is the use of external controls that are collected independently with little reference to the population from which cases are selected. The large number of disease-unrelated SNPs measured in a GWAS can be utilized to evaluate and, when applicable, correct for the confounding effect induced by the genetic ancestral disparity between the case and control groups. Therefore, the stringent requirement of control selection imposed according to the classical epidemiology paradigm could be relaxed to some extent. This view is supported by our analyses of two reconstructed studies with independently collected controls. It appears that an appropriate PC adjustment can effectively correct for the elevated confounding effect introduced by the use of less desirable controls.
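To make the idea of PC adjustment concrete, the sketch below extracts the top principal components from a genotype matrix and removes their linear contribution from a variable such as case status or a SNP genotype. This is a simplified linear illustration and not necessarily the model used in our analyses; the function names `top_pcs` and `residualize`, and the samples-by-SNPs layout with genotypes coded 0, 1 or 2, are assumptions made for the example.

```python
import numpy as np


def top_pcs(genotypes, k):
    """Return the top-k principal component scores (samples x k).

    Assumption: `genotypes` is a samples-by-SNPs array of 0/1/2 counts
    with no missing values.
    """
    G = np.asarray(genotypes, dtype=float)
    G = G - G.mean(axis=0)            # center each SNP
    sd = G.std(axis=0)
    G = G[:, sd > 0] / sd[sd > 0]     # drop monomorphic SNPs, then scale
    U, S, _ = np.linalg.svd(G, full_matrices=False)
    return U[:, :k] * S[:k]


def residualize(y, pcs):
    """Remove the linear effect of an intercept and the PCs from y."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), pcs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta
```

In practice the PCs are more often entered as covariates in the association model itself; the residualization shown here is only the simplest way to see what "adjusting for a PC" removes.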
Adjusting for unnecessary covariates incurs the risk of decreasing power [31]. We have presented a simulation to demonstrate that unnecessary adjustment for population substructure (even by one PC) could lead to a significant loss in power (Text S1, Table S1). A permutation procedure is proposed to identify the minimal number of PCs while allowing an effective correction of the confounding effect. By applying this new procedure to the two original GWAS with internal controls and two reconstructed studies with external controls, we documented its advantage over other commonly used PC selection strategies. At the expense of computing time, the new procedure is able to pick fewer PCs while reducing the over-dispersion factor to a similar or even lower level.
The identified set of 12,898 SNPs with low background LD in the European American population and common to both the Illumina and Affymetrix commercial platforms can be used in PCA for evaluation of population structure. We detected similar patterns of population substructure in the original scans even though they were nested within different cohorts. The top three axes from the two independent studies appear to point in similar directions and are likely to be characteristic of the European American population. Further studies are required to correlate differences along the axes of genetic variation with groups defined by self-described ethnic background, geographic location or specific demographic histories. Based on our present experience, we believe that this set of SNPs should be sufficient for the inference and correction of population structure in GWAS conducted using either the Illumina or Affymetrix commercial platforms within European American populations, and that it enables the comparison of population structure between studies performed on different platforms without relying on genotype imputation. The same search algorithm can be used to identify structure inference SNPs suitable for GWAS in other populations, such as African Americans.
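For readers unfamiliar with selecting SNPs in low LD, the following is a minimal sketch of a generic greedy pruning step based on pairwise r². It is not the search algorithm used to derive the 12,898-SNP set; the function name `greedy_ld_prune`, the default threshold `r2_max=0.1`, and the use of all pairwise comparisons rather than sliding genomic windows are assumptions chosen for brevity.

```python
import numpy as np


def greedy_ld_prune(genotypes, r2_max=0.1):
    """Greedily keep SNPs whose r^2 with every kept SNP is <= r2_max.

    Assumption: `genotypes` is a samples-by-SNPs array of 0/1/2 counts
    with no missing values. Returns column indices of retained SNPs.
    """
    G = np.asarray(genotypes, dtype=float)
    candidates = np.flatnonzero(G.std(axis=0) > 0)  # skip monomorphic SNPs
    if candidates.size == 0:
        return []
    # Squared Pearson correlation between genotype columns.
    r2 = np.atleast_2d(np.corrcoef(G[:, candidates], rowvar=False)) ** 2
    kept = []  # positions within `candidates`
    for j in range(candidates.size):
        if all(r2[j, i] <= r2_max for i in kept):
            kept.append(j)
    return [int(candidates[j]) for j in kept]
```

A genome-wide application would restrict comparisons to nearby SNPs and would also need to account for long-range LD regions, which this sketch does not attempt.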
In the replication stages of a multi-stage GWAS, it would be impractical to genotype the entire list of 12,898 SNPs for the correction of PS. In the process of selecting a fixed number of SNPs for the follow-up study, which would typically involve 10,000 to 50,000 SNPs, there is always a trade-off between the number of SNPs allocated for population structure inference and the number of candidate disease-associated SNPs chosen for validation/replication. Recently, Price et al. [32] and Tian et al. [33] identified panels of SNPs that are informative for discerning major European ancestries in European American populations. For example, Price et al. [32] designed a panel of 300 SNPs that aims to distinguish northwest European, southeast European, and Ashkenazi Jewish ancestry. These panels of ancestry-informative SNPs are potentially useful in replication studies with a similar anticipated population substructure, but may not be as robust in studies where the population substructure may be different or unknown. Rapid accumulation of GWAS and their replication studies should provide ample opportunities for designing and validating panels of ancestry-informative markers targeting various stratified or admixed populations.
Our analysis has focused on the confounding effect of PS on single-marker association analyses. There is, however, an increasing emphasis on detecting interactions between genes and between genes and the environment, and Wang et al. [34] recently evaluated the bias resulting from the confounding effect of PS in studies of gene-gene or gene-environment interactions. Based on simulation studies, they showed that bias due to PS could be large for studies of interactions, especially when there is strong correlation between genes (or between genetic and environmental factors). Using data generated from the CGEMS project and tools developed in this paper, we can empirically evaluate the impact of PS on the study of gene-gene interaction under different control selection strategies. However, valid assessment of the effect of PS on gene-environment interaction may require additional assumptions depending on the control selection procedure chosen.
There are several additional issues, other than the type I error inflation arising from PS, to consider when evaluating the appropriateness of convenience controls versus controls selected to reflect the study base that produced the cases. There may be differential genotyping error between cases and controls due to variation in the processing of biological samples. Also, selection bias for non-genetic covariates that cannot be corrected by PCA could lead to misleading estimates of interactions [35]. The selection of cases and controls from a common prospective cohort tends to minimize potential discrepancies.
The analyses of empirical data generated from the CGEMS project suggest that the effect of PS in the GWAS of prostate and breast cancers conducted in European Americans is small when the study is epidemiologically well designed, but can be substantial when controls and cases are drawn from separate studies. The elevated confounding effect of PS due to the use of less desirable controls can be effectively mitigated by methods such as the one proposed here. The impact of using convenience controls on the power to detect disease-related markers needs to be further investigated, especially in recently admixed populations.