Genome-wide association studies (GWAS) have emerged as an effective approach to identify common polymorphisms underlying complex traits [Hunter et al., 2007; The Wellcome Trust Case Control Consortium, 2007; Yeager et al., 2007; Manolio et al., 2008; Pearson and Manolio, 2008]. GWAS frequently employ a case-control design because it allows a large number of common variants to be investigated efficiently in unrelated cases and controls, with sufficient statistical power to detect small-to-moderate effects. However, population stratification (PS) can lead to a high fraction of putative associations that are spurious, particularly when many SNPs are tested in follow-up but, in actuality, only a few are closely related to disease and the alpha level is low. This is particularly problematic in multi-stage GWAS, in which a subset of SNPs is selected from an initial scan that is underpowered to reliably detect low-penetrance alleles (estimated odds ratios of < 1.3), so that only a small percentage of notable variants is carried through subsequent stages [Skol et al., 2006; Yu et al., 2007].

Because the vast majority of single nucleotide polymorphisms (SNPs) genotyped in a GWAS are not associated with the disease under study, it is feasible to use SNPs measured throughout the genome for the detection and correction of PS. Principal component analysis (PCA) [Zhu et al., 2002; Patterson et al., 2006; Price et al., 2006; Li and Yu, 2008] uses SNPs measured throughout the genome to uncover hidden population substructure by detecting axes of large genetic variation and, if necessary, to adjust for ancestral background differences between cases and controls along several major axes. Although adjusting for PS in a GWAS based on PCA is becoming routine, little evidence is available on whether adjusting for a subset of principal components (PCs) adequately corrects for PS, or on how best to select the relevant PCs. One commonly used PCA approach adjusts simultaneously for a fixed number of top-ranked PCs according to the size of their eigenvalues [Price et al., 2006]; another approach selects PCs with significantly large genetic variation according to the Tracy-Widom test [Patterson et al., 2006]. However, both approaches may include unnecessary PCs and could reduce power if adjustment for one or more of the PCs is unnecessary, either because the PC is equally distributed among cases and controls or because certain covariates (such as self-identified ethnicity or recruitment center) that correctly map to major axes of genetic heterogeneity have already been included in the association analyses. Previously, we [Yu et al., 2008] presented an example demonstrating that unnecessary adjustment for population substructure by even one PC could lead to a significant loss in power, and proposed a permutation procedure to identify the minimal number of PCs that still allows an effective correction of the confounding effect. To apply this procedure, two sets of SNPs are required: one for PCA, the other for the evaluation of type I error inflation in the permutation steps. Selection of relevant PCs using this procedure can be computationally intensive if the association test statistic must be calculated on a large number of markers to obtain an accurate estimate of the inflation in type I error. Here, using techniques derived from the distance-based regression model, we propose a computationally efficient procedure to evaluate and, when necessary, correct for PS.
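To make the PCA step concrete, the following is a minimal sketch of how top-ranked PCs might be extracted from a genotype matrix. It is an illustration under stated assumptions, not the implementation used in the cited work: the function name `top_genotype_pcs`, the 0/1/2 minor-allele coding, the use of the empirical standard deviation for scaling (Price et al. [2006] scale by a function of the allele frequency instead), and the simulated two-subpopulation data are all hypothetical choices made for this example.

```python
import numpy as np

def top_genotype_pcs(genotypes, n_pcs=10):
    """Return the top PC scores and eigenvalues of a genotype matrix.

    genotypes: (n_subjects, n_snps) array of minor-allele counts (0, 1, 2).
    Assumption: each SNP is centered and scaled by its empirical SD,
    a simplification of the normalization used in the PCA literature.
    """
    G = np.asarray(genotypes, dtype=float)
    G = G - G.mean(axis=0)                 # center each SNP
    sd = G.std(axis=0)
    keep = sd > 0                          # drop monomorphic SNPs
    Z = G[:, keep] / sd[keep]
    # SVD of the standardized matrix; left singular vectors give the PCs.
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    eigvals = s**2 / Z.shape[1]            # eigenvalues of Z Z^T / n_snps
    return U[:, :n_pcs] * s[:n_pcs], eigvals[:n_pcs]

# Hypothetical toy data: two subpopulations with differing allele frequencies.
rng = np.random.default_rng(0)
n_per_pop, n_snps = 100, 500
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, n_snps), 0.05, 0.95)
geno = np.vstack([
    rng.binomial(2, freq_a, size=(n_per_pop, n_snps)),
    rng.binomial(2, freq_b, size=(n_per_pop, n_snps)),
])
population = np.repeat([0, 1], n_per_pop)

pcs, eigvals = top_genotype_pcs(geno, n_pcs=5)
pc1_corr = np.corrcoef(pcs[:, 0], population)[0, 1]
```

In this toy setting the first PC tracks subpopulation membership closely, which is the behavior that motivates including top PCs as covariates in the association model. Which PCs actually need to be included is the question the rest of the paper addresses.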

The distance-based regression model was originally proposed by McArdle and Anderson [2001] for the analysis of ecological data, and can be thought of as a non-parametric version of the traditional multivariate regression model. The multivariate regression model is commonly used to study the relationship between a set of predictors **X** and a multivariate outcome **Y**. The pseudo *F* statistic [McArdle and Anderson, 2001] can be applied to test the null hypothesis of no effect of **X** on **Y**. Recognizing that the pseudo *F* statistic can be calculated in terms of the Euclidean distance between the outcomes of two subjects, McArdle and Anderson [2001] proposed the distance-based regression model for the analysis of pair-wise distance (or similarity) measured among a group of subjects by a chosen distance (similarity) metric, and suggested using a similar pseudo *F* statistic to assess the effect of predictors **X** on the pair-wise distance (or similarity). Recently, this method has been used for genetic analyses, such as the comparison of microarray gene expression patterns [Zapala and Schork, 2006], a multilocus test for genetic association studies [Wessel and Schork, 2006], and the assessment of genetic background diversity [Nievergelt et al., 2007].
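As an illustration of the idea, the pseudo *F* statistic can be computed directly from a pair-wise distance matrix by Gower-centering the squared distances and partitioning the resulting matrix with the hat matrix of the predictors. The sketch below follows that standard construction; the function names (`gower_center`, `pseudo_f`, `permutation_p_value`), the permutation scheme (shuffling the rows of the predictors), and the simulated data are assumptions made for this example and are not taken from the cited papers.

```python
import numpy as np

def gower_center(dist):
    """Gower-centered matrix G = C (-D^2 / 2) C, with C = I - 11'/n."""
    D = np.asarray(dist, dtype=float)
    n = D.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ (-0.5 * D**2) @ C

def pseudo_f(dist, predictors):
    """Pseudo F statistic for the effect of predictors on pair-wise distances."""
    G = gower_center(dist)
    n = G.shape[0]
    # Design matrix with an intercept column (an assumption of this sketch).
    X = np.column_stack([np.ones(n), np.asarray(predictors, dtype=float)])
    H = X @ np.linalg.pinv(X)              # hat (projection) matrix
    m = np.linalg.matrix_rank(X)
    R = np.eye(n) - H
    explained = np.trace(H @ G @ H) / (m - 1)
    residual = np.trace(R @ G @ R) / (n - m)
    return explained / residual

def permutation_p_value(dist, predictors, n_perm=999, seed=0):
    """Permutation p-value obtained by shuffling subjects' predictor rows."""
    rng = np.random.default_rng(seed)
    X = np.asarray(predictors, dtype=float)
    observed = pseudo_f(dist, X)
    hits = sum(
        pseudo_f(dist, X[rng.permutation(len(X))]) >= observed
        for _ in range(n_perm)
    )
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical toy data: a binary predictor shifts a 5-dimensional outcome.
rng = np.random.default_rng(1)
n = 60
x = rng.integers(0, 2, n)
Y = rng.normal(size=(n, 5)) + 1.0 * x[:, None]
D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)  # Euclidean distances

f_obs, p_value = permutation_p_value(D, x, n_perm=199)
```

With Euclidean distances, as here, the statistic coincides with the pseudo *F* of the ordinary multivariate regression of **Y** on **X**. The appeal of the distance-based form is that `D` can be replaced by any chosen distance or similarity-derived metric without changing the rest of the calculation.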

As suggested by Nievergelt et al. [2007], the pseudo *F* statistic derived from the general distance-based regression model can be used to detect PS in a GWAS if an appropriate metric is used to measure the genetic background similarity between two subjects. Here we extend the pseudo *F* statistic considered by McArdle and Anderson [2001] and Nievergelt et al. [2007] to allow for the adjustment of covariates. The extended pseudo *F* statistic can evaluate the adequacy of PS correction when potential ancestral confounding factors, such as self-identified ethnicity or selected principal components [Patterson et al., 2006; Price et al., 2006], have already been included in the adjustment. Building upon this pseudo *F* statistic, we propose a computationally efficient PC selection procedure, called PC-Finder, to identify relevant PCs for the correction of PS. Empirical data from two GWAS in the Cancer Genetic Markers of Susceptibility (CGEMS) project were used to demonstrate the application of the proposed methods. We also conducted simulation studies to evaluate the performance of the proposed methods.