The basic technique is simple. We assume our markers are biallelic, for example, biallelic single nucleotide polymorphisms (SNPs). Regard the data as a large rectangular matrix

*C,* with rows indexed by individuals, and columns indexed by polymorphic markers. For each marker choose a reference and variant allele. We suppose we have

*n* such markers and

*m* individuals. Let

*C*(

*i,j*) be the number of variant alleles for marker

*j*, individual

*i.* (Thus for autosomal data we have

*C*(

*i,j*) is 0,1 or 2.) For now suppose that there is no missing data. From each column we subtract the column means. So set for column

*j*:

and then the corrected entries are:

Set

*p*(

*j*) =

*μ*(

*j*)/2, an estimate of the underlying allele frequency (autosomal data). Then each entry in the resulting matrix is

Equation 3 is a normalization step, which is motivated by the fact that the frequency change of a SNP due to genetic drift occurs at a rate proportional to

per generation. It also normalizes (at least if the data is in Hardy-Weinberg equilibrium) each data column to have the same variance. We note that Nicholson et al. use the same normalization, and motivate it similarly [

18].

We verified (unpublished data) that the normalization improves results when using simulated genetic data, and that on real data known structure becomes clearer. (However all the results are just as mathematically valid even without the normalizations.)

The methods also are applicable to data such as microsatellites, where there are more than two alleles at a single site. We use a device of Cavalli-Sforza [

2,

15], making a “marker”

*j* out of each allele, and then setting

*C*(

*i*,

*j*) to be the number of occurrences of the allele for sample

*i*. We omit the normalization step of

Equation 3 for microsatellites, merely subtracting the mean. The normalization has no clear justification for microsatellite data, and results on real data (unpublished) show that it produces worse performance in this case.

An alternative, suggested by a referee, is to use the microsatellite allele length as a continuous variable, and carry out PCA directly after a suitable normalization.

Now we carry out a singular value decomposition on the matrix

*M*. (A standard reference for the numerical methods is [

19]. Public domain software is readily available—we used the well-known package LAPACK,

http://www.netlib.org/lapack.) We are chiefly interested here in the case that the number of samples is less than the number of markers:

*m* <

*n*. Computationally we will form

the sample covariance of the columns of

*M*. The resulting matrix is

*m* ×

*m*, with a dimension equal to the number of samples in the dataset. We then compute an eigenvector decomposition of

*X*. Eigenvectors corresponding to “large” eigenvalues are exposing nonrandom population structure. This means that a central issue for this paper is what is “large” here, or, more precisely, what is the distribution of the largest eigenvalues of

*X* at random (when there is no population structure)?

The method is fast. In practice, the running time is dominated by the calculation of the matrix product

*MM′,* which for extremely large problems is readily computed on a parallel architecture. On a fast workstation, the matrix product for a dataset of 100 individuals and 100,000 markers takes just four seconds. For data with

*m* individuals and

*n* markers, the work is proportional to

*m*^{2}*n,* and thus for a set of 2,000 individuals and 500,000 markers would take about 2.5 hours on the same single processor (see

Methods for more details). For many of the problems we have analyzed, reading and storing the data takes longer than the analysis.

Most, though not all, previous applications of PCA to analysis of population structure have taken the data to be a matrix where the rows are indexed by populations not by individuals (e.g., [

2,

15]). We prefer to consider the larger array where the rows are indexed by individuals. Even when we have population labels, it is useful to examine within-population variation, and we also are often able to find outliers in the data. Furthermore, when population labels are available, we can carry out an analysis to check that the labels do correspond to structure that the eigenanalysis has uncovered.

We note that population labels may be socially constructed. This makes us nervous about employing them in an initial data study. On the other hand, the individual samples certainly do not have any such construct, and even if population labels are available, initial analysis at an individual level allows us to check the meaningfulness of the labels [

20].

Cavalli-Sforza [15, pp.

39–

42] gives an explanation of why PCA can be expected to reveal population structure. We give a different explanation, oriented towards analysis at the individual level. If

**e**^{[1]} is the principal eigenvector of the matrix

*X,* this means that the sum of squares

is maximized over all vectors with constant norm. The second eigenvector

**e**^{[2]} maximizes the same expression with the constraint that

**e**^{[1]},

**e**^{[2]} are orthogonal, and so on. Why would we expect this to reveal population structure? Suppose that in our sample, we have just two populations and that each is homogeneous. Choose a vector with coordinates constant and positive for samples from one population, and coordinates constant and negative for samples from the other. Arrange so that the vector coordinates sum to zero. Then, since alleles within a population will tend to agree more than in the sample as a whole, the quantity

*S* of

Equation 4 will tend to be large. This is exactly what we maximize as a function of the vector

**e**. More generally, if we have

*K* distinct populations, there are

*K* − 1 vectors constant on each population, summing to zero and linearly independent. This implies that, if the number of markers is sufficiently large, there will be

*K* − 1 eigenvalues and

*K* − 1 corresponding eigenvectors of our matrix that are significant and meaningful. Vectors orthogonal to these

*K* − 1 vectors are showing within-population variance, and if each population is homogeneous, this is just reflecting sampling noise.

Tracy–Widom Theory

We now turn to some recent theoretical statistics. Consider an *m* × *n* matrix *M*, each entry of which has an independent standard normal random variable. We are interested in the case that *m* < *n*.

(A notational issue is that in our genetic data, if *m* is the number of individuals, then the square matrix for which we calculate eigenvalues has rank *m* − 1 [we lose a dimension by forcing each column to sum to zero]. We will, for the majority of the paper, write *m*′ = *m* − 1 for the number of nonzero eigenvalues. However in this theoretical section, we will assume there are *m* nonzero eigenvalues.)

*X* is a

*Wishart* matrix. Let {

*λ*_{i}}

_{1}_{≤i≤m} be the eigenvalues of

*X*. The probability density of (

*λ*_{1},…

*λ*_{m}) is known [

21] but not directly relevant to our work here, so we omit the details.

Order the eigenvalues so that

Johnstone in a key paper [

22] showed that suitably normalized, and for

*m, n,* large, the distribution of the largest eigenvalue

*λ*_{1} is approximately a distribution discovered by Tracy and Widom [

23], which in this paper we call TW. Set

Now set

Then the distribution of

*x* is approximately TW.

More precisely, if as

*m*,

*n* → ∞,

*n*/

*m* →

*γ* ≥ 1, then Johnstone proves [22, Theorem 1.1] that

*x* tends to TW in distribution. As we show later (Theorem 2), Johnstone's theorem also holds if in the expression for

*x* in

Equation 7, we replace

*x* by:

where

The only difference here is that the *m* eigenvalues have been normalized to have sum *m*.

In [

22] Johnstone proved his theorem for the case that

*n*,

*m* → ∞ with

*m*/

*n* bounded away from 0, but this condition was shown in [

23] not to be necessary. Johnstone [

22] gives convincing evidence that the fit is good even for values as small as

*m* = 5,

*n* = 20.

We show in a plot of the Tracy–Widom density.

The complexity of the TW definition is irrelevant to its application to real data. One computes a statistic, and then looks up a *p*-value in tables or through a computational interface. This is little different from how one uses (say) a conventional chi-squared test.

One concern with applying this approach to genetic data is that the entries in the matrix

*M* do not have the Gaussian distributions expected for a Wishart matrix; instead, they correspond to the three possible genotypes at each SNP. However it is not critical that the entries in the

*m* ×

*n* matrix

*M* be Gaussian. Soshnikov [

25] showed that the same TW limit arose if the cell entries were any distribution with high-order moments no greater than the Gaussian. The matrix

*X* is a sum of

*n* rank 1 matrices, and Soshnikov's result suggests that the same limit would be obtained from any probability distribution in which the columns of

*M* are independent, isotropic (all directions are equiprobable), and such that the column norms have moments no larger than those for a column of independent Gaussian entries. In all our genetic applications, the column norms are in fact bounded, so we can expect the sample covariance matrices to behave well.

This theory, originally developed for the case of Gaussian matrix entries, thus seemed likely to work well with large genetic biallelic data arrays. The remainder of this paper verifies that this is the case.

Applying Eigenanalysis to Datasets with Linked Markers

For genetic applications we cannot necessarily assume that all our markers are unlinked and thus independent. For instance, in the International Haplotype Map project [

26], markers were chosen about 5,000 bases apart (phase 1), or about 1,000 bases apart (phase 2), and so nearby markers will often be in LD. Mathematically this will induce correlation between nearby columns of our matrix

*M*. The effect of this will be that the matrix

should be “Wishart-like,” but the nonindependence of the columns will reduce the effective sample size. We will discuss this further (see Correcting for LD) but now introduce a new idea. This adds robustness to our methods, so that minor deviations from the model become of lesser importance.

Suppose we have *m* individuals. We will analyze *X* = *MM′* as a Wishart matrix. The rank of *X* will be *m* − 1 (assuming we have many SNPs compared with *m*). We will think of the eigenvalues of *X* as coming from a (*m* − 1) × (*m* − 1) Wishart, and write *m*′ = *m* − 1.

There are two unknowns that we can regard as parameters to the Wishart: 1) *σ*^{2}: the variance of the normal distribution used for the cells of the rectangular matrix *M*; 2) *n*′: The number of columns of *M*.

We want to carefully distinguish here between

*n,* the

*actual* number of columns of our data array, and

*n′,* a theoretical statistical parameter, modeling the approximate Wishart distribution of the square matrix

*X*. We originally fit

*σ*,

*n′* by maximum likelihood. The likelihood, as a function of the two parameters, has two sufficient statistics, which are

, and

. Maximum likelihood did not always work well, in our genetic applications, probably because

is sensitive to

*small* eigenvalues, while we are only interested in large. We recommend a moments estimator:

which is justified later. Note that

*n′* is invariant to scaling of the matrix

*M* as it should be. We estimate

*σ* by:

A Test for Population Structure

This leads immediately to a formal test for the presence of population structure in a biallelic dataset.

1. Compute the matrix

*M* as in

Equations 1,

2 and

3.

*M* has

*m* rows,

*n* columns.

2. Compute *X* = *MM′*. *X* is *m* × *m*.

3. Order the eigenvalues of

*X* so that

where

*m′* =

*m* − 1. (On a large dataset

*X* will always have rank

*m′*.)

4. Using the eigenvalues

*λ*_{i} (1 ≤

*i* ≤

*m*′), estimate

*n*′ from

Equation 10.

5. The largest eigenvalue of

*M* is

*λ*_{1}. Set

6. Normalize

*l* with

Equations 5–

7, where the effective number of markers

*n*′ replaces

*n*. This yields a test statistic

*x* =

*x*(

*M*).

Our statistic *x*(*M*) is approximately TW-distributed. A *p*-value can now be computed from tables of the TW distribution. Notice that our statistic is independent of the scaling of the *λ*_{i}, and it is convenient to normalize by scaling so that the sum of the eigenvalues (that is the trace of *M*) is *m*′. All eigenvalues we report are scaled in this manner.

Simulations to Demonstrate Robustness of the Tests of Significant Structure

We first made a series of simulations in the absence of population structure. (Some additional details are in the Methods section.) Our first set of runs had 100 individuals and 5,000 unlinked SNPs, and the second 200 individuals and 50,000 unlinked SNPs. In each case we ran 1,000 simulations and show in A and B probability–probability (P–P) plots of the empirical and TW tail areas. The results seem entirely satisfactory, especially for low

*p*-values in the top right of A and B. For assessment of statistical significance, it is the low

*p*-value range that is relevant. The simulations show more generally that the TW theory is relevant in a genetic context, that the normalizations of

Equations 5–

7 are appropriate, and that the calculation of the effective marker size has not seriously distorted the TW statistic.

Detecting Additional Structure

It is very important to be able to answer the question:

*Does the data show evidence of additional population structure over and above what has already been detected?* The test we propose is extremely simple. If our matrix

*X* has eigenvalues

and we already have declared the top

*k* eigenvalues to be significant, then we simply test λ

_{k}_{+1},…λ

_{m} as though

*X* was a (

*m*′ −

*k*) × (

*m*′ −

*k*) Wishart matrix. Johnstone shows [22, Proposition 1.2] that this procedure is conservative, at least for a true Wishart matrix. We tested this by generating data in which there is one eigenvalue that is overwhelmingly significant, and examined the distribution of the

*second* eigenvalue. As shown by the P–P plot of , the fit is again very good, especially for small

*p*-values. If an eigenvalue is not significant, then further testing of smaller eigenvalues should not be done.

Cluster Analysis and PCA

There is a much closer relationship between our PCA and a cluster-based analysis than is at first apparent. Consider a model of genetic structure where there are

*K* populations, and fix a marker and variant allele. The populations have diverged from an ancestral population recently. Suppose that the allele frequency of the variant in the ancestral population is

*P,* and in population

*i* is

*p*_{i}. Conditional on

*P,* assume that

**p** = (

*p*_{1},

*p*_{2},…

*p*_{K}) has mean (

*P*,

*P*,…

*P*) and covariance matrix

*P*(1 −

*P*)

*B* for some matrix

*B*. Much past work in genetics uses this paradigm, with variations on the distribution of

*B,* and on the detailed distribution of

**p** conditional on

*P;* for instance, both Nicholson et al. [

18] and STRUCTURE [

9] in “correlated frequency mode,” and in the “F-model” of [

10]. The setup here is nearly inevitable if one is considering allele frequencies in populations that have only diverged a small amount. In [18, p. 700] for the case of a diagonal matrix

*B,* it is shown that the diagonal term

*B*_{ii} can be interpreted as the “time on the diffusion time-scale” (inversely proportional to effective population size) in which population

*i* has undergone genetic drift.

Suppose we sample (autosomal) genotypes from these

*K* populations. Assume there are

*M*_{i} samples from population

*i,* and set

We suppose that the divergence of each population from a root population, as measured by

*F*_{ST} or equivalently by divergence time on the diffusion timescale, is of order

*τ,* which is small. What are the eigenvalues of the theoretical covariance

*C* of the samples for the marker after our mean adjustment and normalization? Let

*M* become large, while the relative abundance of the samples stays constant across populations. It can be shown (see the mathematical details, Theorem 3) that if

*B* has full rank, then

*C* has

*K* − 1 large eigenvalues that tend to infinity with

*M, M* −

*K* eigenvalues that are 1 +

*O*(

*τ*) and one zero eigenvalue that is a structural zero, arising from the fact that our mean adjusted columns all have zero sum. We are interested in the case that

*τ* 1 while

*M* 1.

Thus, natural models of population structure predict that most of the eigenvalues of the theoretical covariance will be “small,” nearly equal, and arise from sampling noise, while just a few eigenvalues will be “large,” reflecting past demographic events. This is exactly the situation that Johnstone's application of Tracy–Widom theory addresses. We also note that on real data (as we show below), the TW theory works extremely well, which shows that the model will be a reasonable approximation to “truth” in many cases.

Axes of Variation

Thus, we expect that the theoretical covariance matrix (approximated by the sample covariance) will have *K* − 1 “large” eigenvalues, with the remainder small and reflecting the sampling variance. We call the eigenvectors of the theoretical covariance, corresponding to the large eigenvalues, “axes of variation.” These are a theoretical construct, as of course we only observe the sample covariance. Nevertheless, for eigenvectors that are highly significant by our tests, we expect the corresponding eigenvector to correlate well with the true “axis of variation.” An analogy here is that we would often think of an allele as having a “true” population frequency in a population, and would regard the frequency of the allele in a sample as an estimator of the true frequency.

As defined, our axes of variation do depend on the relative sample sizes of the underlying populations, so that differences between populations each with a large sample size are upweighted. This is worth remembering when interpreting the results, but does not seem a major impediment to analysis.

A Formal Test Is Appropriate in Our Applications

We do not review in detail older methods for testing for significance. One technique is the “broken stick” model [

27,

28], used, for instance, in a recent population genetics analysis [

5]. In this model, one normalizes the

*m*′ nonzero eigenvalues to sum to 1, then sorts them in decreasing order, and compares the

*k*-th eigenvalue with the expected size of the

*k*-th largest subinterval of the unit interval, partitioned by “breaking” the interval at

*m*′ − 1 uniformly chosen points. This method does not use the number of markers in any way, thus it cannot be making optimum use of the data. In particular for datasets with large numbers of markers, real population structure may go undetected.

We believe that the application of PCA to genetic data—and our way of analyzing the data—provides a natural method of uncovering population structure, for reasons that are subtle; thus, it is useful to spell them out explicitly. In most applications of PCA, the multivariate data has an unknown covariance, and PCA is attempting to choose a subspace on which to project the data that captures most of the relevant information. In many such applications, a formal test for whether the true covariance is the identity matrix makes little sense.

In genetics applications we believe the situation is different. Under standard population genetics assumptions such as a panmictic population, the natural null is that the eigenvalues of the true covariance matrix are equal, a formal test is appropriate, and deviations from the null are likely to be of real scientific and practical significance. To support this, in our experience on real data we take our null very seriously and attempt to explain all statistically significant axes of variation. Often the explanation is true population structure in the data, but we also often expose errors or difficulties in the data processing. Two examples follow.

In some population genetic data from African populations, the fourth axis of variation showed some San individuals at each end of the axis. This made little genetic sense, and the cause was some related samples that should have been removed from the analysis.

In a medical genetic study, an unexplained axis of variation was statistically correlated with the micro-titer plates on which the genotyping had been carried out. This suggested that the experimental setup was contributing to the evidence for structure, instead of real population differentiation.

In both these cases more careful data preprocessing would have eliminated the problem, but analysis and preparation of large datasets is difficult, and more tools for analysis and error-detection are always of benefit.

ANOVA Statistics Given Labeled Populations

In many practical applications, samples will already be grouped into subpopulations (for instance, in medical genetics there are often two populations: cases and controls). It is natural to want to test if our recovered eigenvectors reflect differences among the labeled subpopulations. We therefore fix some eigenvector, and can regard each individual as associated with the corresponding coordinate of the eigenvector. We want to test if the means of these coordinate values in each subpopulation differ significantly. Our motivation is firstly that this is a powerful check on the validity of our (unsupervised) Tracy–Widom statistics, and secondly that the supervised analysis helps in interpretation of the recovered axes of variation.

The conventional statistic here is an ANOVA F-statistic. (See for instance [

29]). We have here a “one-way layout,” where we want to test if the group means significantly differ. This amounts to a check on our Tracy–Widom statistic, which we compute ignoring the labels. We also routinely compute an F-statistic for every pair of populations, and each eigenvector (unpublished data). We give three examples of ANOVA analysis on real data. In the first, we look at population data from sub-Saharan Africa, genotyped with 783 microsatellites and 210 biallelic indels in the CEPH–HGDP dataset [

30,

31]. We group the West African and Bantu speaking populations (Yoruba, Bantu South Africa, and Bantu Kenya) as “Bantu” and also examine samples from San and Mandenka. We show plots of the first two eigenvectors in . shows the key statistics for this dataset. In , the ANOVA

*p*-value is obtained from the usual

*F*-statistic, and we apply ANOVA to each of the first three eigenvectors.

| **Table 1**Statistics from HGDP African Data |

There is excellent agreement between the supervised and unsupervised analysis. The lack of significance of the third eigenvector is an indication that no additional structure was apparent within the Bantu.

Our second study took samples from three regions: Northern Thailand, China (Han Chinese), and Japan. The last two population samples were available from the International Hapmap Project [

32]. The Thai samples (25 individuals after removing some close relatives and outliers) were collected by J. Seidman and S. Sangwatanaroj as part of a disease study, though here we focus on the population genetics. Our analysis of these data used 40,560 SNPs.

In we plot the first two eigenvectors. Notice that the population separation is clear, but that the natural separation axes are not the eigenvectors. Further, the Thai and Chinese populations appear to show a cline, rather than two discrete clusters grouped around a central point. We suspect that this shows some evidence of genetic admixture, perhaps involving a population in Thailand that is related to the Chinese. (See also , which we describe later.) shows the eigenvalues, the TW significance, and an ANOVA *p*-value for the first three eigenvectors. Again there is excellent agreement between the supervised and unsupervised analyses.

| **Table 2**Statistics from Thai/Chinese/Japanese Data |

In the third dataset, which was created and analyzed by Mark Shriver and colleagues [

5], we have data from 12 populations. The missing data pattern showed some evidence of population structure, with the missing data concentrated in particular samples, populations, and SNPs. For this reason, we only used markers for analysis for which there was no missing data, and we corrected for LD using our regression technique (see below). The details of the data preprocessing steps are described in Methods. We analyzed samples from 189 individuals on 2,790 SNPs. On this dataset, we find the leading eigenvalue statistics to be as shown in .

| **Table 3**Statistics from Shriver Dataset |

In all the datasets mentioned above, we have very good agreement between the significance of the TW statistic, which does not use the population labels, and the ANOVA, which does. This verifies that the TW analysis is correctly labeling the eigenvectors as to whether they are reflecting real population structure.

Shriver and colleagues [

5], using different principal components methods and broken stick statistical analysis [

27,

28], recovered four significant components on this dataset. Our analysis has clearly recovered more meaningful structure, providing empirical validation of the power of this approach.

An Estimate for the Data Size Needed for Significance

A recent paper by Baik, Ben Arous, and Péché [

16] gives theorems for the asymptotics of the distribution of the largest eigenvalue of a sample covariance matrix when the true covariance matrix has a few eigenvalues greater than 1 and the rest equal to 1. This is the situation in genetic data for which there are just a few meaningful axes of variation. Unfortunately the theorems proved are only for the case of data matrices whose entries are complex numbers, but Baik, Ben Arous, and Péché conjecture that the results hold for real data, too. We state a form of the conjecture, which we call the BBP conjecture, and then provide evidence for its applicability to genetics.

Let *l*_{1} be the lead eigenvalue of the *theoretical* covariance matrix, with the remainder of the eigenvalues 1. Set *γ*^{2} = *n*/*m*. Let *L*_{1} be the largest eigenvalue of the sample covariance. We will let *n*, *m* become large with the ratio *n*/*m* tending to a limit.

(1) If l_{1} < 1 + 1/*γ, then as m,n → ∞*, *L*_{1}, *suitably normalized, tends in distribution to the same distribution as when l*_{1} *= l.*

(2) If l_{1} > 1 + 1/*γ, then as m,n → ∞*, *the TW statistic becomes unbounded almost surely.*

That is, the behavior of

*L*_{1} is qualitatively different depending on whether

*l*_{1} is greater or less than 1 + 1/

*γ*. This is a

*phase-change* phenomenon, and we will define

as the

*BBP threshold*. This is an asymptotic result, showing that as the data size goes to infinity, the transition of the behavior, as

*l*_{1} varies, becomes arbitrarily sharp. The result, as stated above, is proved in [

16] for data where the matrix entries are complex numbers, and statement (2) of the conjecture is proved in [

17], which demonstrates that the behavior is qualitatively different according to whether

*l*_{1} is greater or less than 1 + 1/

*γ*. There seems little doubt as to the truth of statement (1) above. It has been shown (D. Paul, Asymptotic behavior of the leading sample eigenvalues for a spiked covariance model,

http://anson.ucdavis.edu/~debashis/techrep/eigenlimit.pdf) that, under the assumptions of statement (1) above, the lead eigenvector of the sample covariance is asymptotically uncorrelated with the lead eigenvector of the theoretical covariance, but we believe that the question of the distribution of the leading eigenvalue is still open.

Consider an example of two samples each of size

*m*/2, diverged from each other at time

*τ,* where unit time is 2

*N* generations, and assume that

*N* is the effective population size. We assume

*τ* is small, from which it follows that

We find that

It follows that the BBP threshold is reached when

This is interesting by itself:

*Define D, the data size, to be the product of the number of samples and number of SNPs genotyped. For two subpopulations of equal sample size, the phase change threshold is reached when* 1/*F*_{ST} is equal to the square root of the data size D, independently of the number of individuals and markers, at least when both are large.

At a fixed data size, the expected value of the leading eigenvalue of the data matrix (and the power to detect structure) of course is a continuous function of *F*_{ST}, but the BBP conjecture suggests that for large data sizes there will only be a small transition region. Above the region, detection of structure will be easy, and below it, impossible.

Let us take *nm* = 2^{20} (about one million genotypes), so that the BBP threshold is *F*_{ST} = 2^{−10}. We let *m* = 2^{k} (*k* = 5…8) and set *n* = 2^{20−k} so that *nm* = 2^{20}.

Now for each value of *m,* generate simulated data, varying *F*_{ST} from 2^{−13} to 2^{−7}. For each simulation, we compute *L*_{1}, the TW statistic, and a *p*-value. We show the TW statistics in .

The phase change is evident. Further, from [

16, p. 1650ff] (also see [

17, Equation 1.10]): above the BBP threshold we have that

in probability as

*m*,

*n* → ∞. It then follows that above the BBP threshold, we can expect the TW statistic to be increasing with the number of individuals

*m* if the data size

*mn* is fixed. That is, increasing sample size, rather than marker number, is advantageous for detecting structure above the BBP threshold, but not below it. This effect is clearly visible in (note the behavior of the

*p*-value for

*m* = 256). We summarize:

*For two equal size subpopulations, there is a threshold value of F*_{ST},
*, below which there will be essentially no evidence of population structure. Above the threshold, the evidence accumulates very rapidly, as we increase the divergence or the data size. Above the threshold for fixed data size mn, the evidence is stronger as we increase m, as long as n* *m.*
Another implication is that these methods are sensitive. For example, given a 100,000 marker array and a sample size of 1,000, then the BBP threshold for two equal subpopulations, each of size 500, is

*F*_{ST} = .0001. An

*F*_{ST} value of .001 will thus be trivial to detect. To put this into context, we note that a typical value of

*F*_{ST} between human populations in Northern and Southern Europe is about .006 [

15]. Thus, we predict:

*most large genetic datasets with human data will show some detectable population structure.*The BBP phase change is

*not* just a phenomenon of the eigenvector-based analysis we are discussing here. We suspect that at least for biallelic unlinked markers, no methods for detecting structure will do much better than our TW-based techniques. This implies that no method will have any significant success rate if population divergence is below the BBP threshold, while above threshold, reasonable methods will succeed. To test this we made a series of simulations, each with 1,600 biallelic markers and two populations each of size 50. We varied

*F*_{ST} and ran both our eigenanalysis and STRUCTURE. (See

Methods for more detail about the simulations and analysis.) We were not successful in using STRUCTURE to produce a higher likelihood for the existence of two clusters rather than one except for the very largest

*F*_{ST} levels. We wanted to place our methods and STRUCTURE on a “level playing field.” Our PCA methods return a leading eigenvector, while running STRUCTURE with

*K* = 2 clusters, returns for each individual the probability of belonging to cluster 1. We used a nonparametric idea, applying a probit transform to both the output of both the PCA and of STRUCTURE, and then running an ANOVA analysis, both for PCA and STRUCTURE output. (The probit transform uses order statistics (ranks) to map the observations into points appropriate if the underlying distribution is standard normal. See, for example, [

33].) This amounts to carrying out an unsupervised analysis and then checking to see if the recovered “structure” reflects the truth.

Thus, we will compute three *p*-values: 1) a TW statistic from an unsupervised analysis; 2) an ANOVA *p*-value (F-statistic) after probit transform of the leading principal component; 3) an ANOVA *p*-value (F-statistic) after probit transform of the STRUCTURE cluster probabilities.

shows the results from a representative set of runs: we show the geometric mean of the *p*-value in simulations, based on a TW statistic (unsupervised) or a nonparametric ANOVA analysis, both for the eigenanalysis and for STRUCTURE.

| **Table 4**BBP Phase Change: Eigenanalysis and STRUCTURE |

Here the BBP threshold is .0025. Below the threshold nothing interesting is found by the TW unsupervised statistic. Above the threshold, the TW statistic is usually highly significant, and the ANOVA analyses show that the true structure has become apparent. At the threshold we *sometimes* have recovered significant structure, but it will be hard (usually impossible) to tell if the structure is real or a statistical artifact. Below the threshold, the structure is too weak to be useful. In these runs, at the critical threshold, the eigenanalysis slightly outperformed STRUCTURE. We have not carefully investigated whether we could obtain better results by varying the STRUCTURE parameters.

Summarizing: below the threshold, neither procedure succeeds with reasonable probability, at the threshold success is variable, and above the threshold success is nearly guaranteed.

Admixture

In an admixed population, the expected allele frequency of an individual is a linear mix of the frequencies in the parental populations. Unless the admixture is ancient—in which case the PCA methods will fail as everyone will have the same ancestry proportion—then the mixing weights will vary by individual. Because of the linearity, admixture does not change the axes of variation, or, more exactly, the number of “large” eigenvalues of the covariance is unchanged by adding admixed individuals, if the parental populations are already sampled. Thus, for example, if there are two founding populations, admixed individuals will have coordinates along a line joining the centers of the founding populations.

We generated simulated data, by taking a trifurcation between populations (*A*,*B*,*D*) 100 generations ago. Population *C* is a recent admixture of *A* and *B*. The mixing proportion of *A* in an individual from C is Beta-distributed *B*(3.5,1.5) so that the average contribution of population *A* in an individual of population *C* is .7 (see ). Effective population sizes are 10,000 for each population. We then simulated data for 10,000 unlinked markers (more details are in the Methods section). *F*_{ST} between any pair of *A*,*B*,*D* is .005. We are attempting to mimic the data of , and chose to run our analysis on simulated samples from populations *B,C,D,* not using samples from *A*. We expect two significant eigenvalues corresponding to the splits of populations *B,C,* and *D*. If population *A* is included in the analysis, we also get just two significant eigenvalues, as predicted by theory. This is what is observed (unpublished data), with, as predicted, the admixed population not adding to the number of axes of variation (the third eigenvalue is not significant). In we show a plot of the first two eigenvectors. Note the dispersion of population *C* along a line. This is diagnostic of admixture. The resemblance of and is striking.

There remain issues to resolve here. Firstly, recent admixture generates large-scale LD which may cause difficulties in a dense dataset as the allele distributions are not independent. These effects may be hard to alleviate with our simple LD correction described below. STRUCTURE [

10] allows careful modeling. Secondly, more ancient admixture, especially if the admixed population is genetically now homogeneous, may lead to a causal eigenvalue not very different from the values generated by the sampling noise. Suppose, for example, in our simulation above, we let population

*C* mate panmictically for another 20 generations. Then we will get three clusters for

*A, B, C* that are nearly collinear, but not exactly because of the recent 20-generation divergence, which is reflecting genetic drift unique to that population.

A third issue is that our methods require that divergence is small, and that allele frequencies are divergent primarily because of drift. We attempted to apply our methods to an African-American dataset genotyped on a panel of ancestry-informative markers [

34]. The Tracy–Widom theory breaks down here with dozens of “significant” axes that we do not believe have genetic meaning. Perhaps this is to be expected, as on our informative panel

*F*_{ST} is big (.58) and the theory could be expected to perform poorly. In addition our methods are here not dealing adequately with LD caused by large admixture blocks.

This is an issue for our TW techniques, but not for PCA as such. Indeed, on this dataset the correlation of our principal eigenvector with the estimated European ancestry for each individual recovered by the admixture analysis program ANCESTRYMAP [

12] is a remarkable .995 (STRUCTURE produces similar results). ANCESTRYMAP has complex modeling of admixture LD, and was also provided with parental allele frequencies, but did no better than the simple PCA. (There is an issue of interpretation here: the leading eigenvector is almost perfectly correlated with ancestry, but to infer actual ancestry proportions an affine transform must be applied, translating and scaling the values. In practice, some parental allele frequencies will be needed to determine the appropriate transform. A similar issue arises with STRUCTURE if parental frequencies are unknown.)

Finally, if “admixture LD” is present, so that in admixed individuals long segments of the genome originate from one founder population, simple PCA methods will not be as powerful as programs such as STRUCTURE [

10], ADMIXMAP [

11], and ANCESTRYMAP [

12], where there is careful modeling of the admixture blocks and the transitions. The power of these methods lies in the fact that genome-wide samples may have similar proportions of inheritance from the ancestral populations, but locally they will inherit either 0, 1, or 2 alleles from each ancestral population. Methods that specifically attempt to assign local ancestries will be able to determine the specific patterns typical of each ancestral population locally. An interesting and challenging problem is to build tools that retain the power of these more complex models on admixed data and that also run rapidly on large datasets.

Correcting for LD

The theory above works well if the markers are independent (that is have no LD), but in practice, and especially with the large genotype arrays that are beginning to be available, this is difficult to ensure. In extreme cases uncorrected LD will seriously distort the eigenvector/eigenvalue structure, making results difficult to interpret. Suppose, for example, that there is a large “block” [

35,

36] in which markers are in complete LD, and we have genotyped many markers in the block. A large eigenvector of our Wishart matrix

*X* will tend to correlate with the genotype pattern in the block (all markers producing the same pattern). This will distort the eigenvector structure and also the distribution of eigenvalues.

We recommend the following if LD between markers is a concern in the data. Pick a small integer

*k* > 0, corresponding to the number of adjacent markers one uses for adjustment (

*k* = 1 will often suffice). In the data matrix M we will “predict” each column by running a multivariate regression on the

*k* previous columns. We then will analyze the residuals. Concretely: we first form

*M,* as in

Equation 2. For each column

*j*Set:

Choose

**a** to minimize

and now calculate

*X* =

*RR′* instead of

*MM′*. It is first important to check that in the absence of LD the suggested correction does not seriously distort the Tracy–Widom statistic. In A and B we show P–P plots, uncorrected, and with five levels (

*k* = 1…5) of correction. The first figure is with 100 individuals and 5,000 markers, the second with 200 individuals and 50,000 markers. Then in A and B we analyze a simulated dataset with severe LD. We generate blocks in perfect LD, in which the probability that a block contains

*L* markers is 2

^{−L}. We show the corresponding plots. Note that here the uncorrected statistic is distributed quite differently than the Tracy–Widom distribution. Our suggested correction strategy seems to work well, and should be adequate in practice, especially as most large genotype arrays will attempt to avoid high levels of LD. We would recommend that before analyzing a very large dataset with dense genotyping, one should filter the data by removing a marker from every pair of markers that are in tight LD.

Comparison with STRUCTURE

In the work above on the BBP phase change, we already showed some comparisons between STRUCTURE and our methods. A fair comparison to STRUCTURE is not easy, as the two programs have subtly different purposes and outputs. STRUCTURE attempts to describe the population structure by probabilistic assignment to classes, and we are attempting to determine the statistically significant “axes of variation,” which does not necessarily mean the same thing as assigning individuals to classes.

Our impression, confirmed by , is that when our analysis finds overwhelmingly evident population structure, then STRUCTURE will as well, and when nothing at all is found, STRUCTURE will fail, too. In a problem where the effect is marginal, it may be hard to say which analysis is preferable.

STRUCTURE is a sophisticated program with many features we have not attempted to match. STRUCTURE has an explicit probability model for the data, and this allows extra options and flexibility. It incorporates a range of options for ancestry and for allele frequencies, and has explicit options for modeling microsatellite distributions.

On the other hand, eigenanalysis has advantages over STRUCTURE. First, it is fast and simple, and second, it provides a formal test for the number of significant axes of variation.

One future possibility is to somehow incorporate recovered significant eigenvectors into STRUCTURE—in particular with regard to choosing the number of subpopulations, which is not statistically robust in the STRUCTURE framework. A sensible default for the number of clusters in STRUCTURE is one more than the number of significant eigenvalues under the TW statistic.

Missing Data and Other Problems

The most problematic issue when applying any method to infer population structure is that genotyping may introduce artifacts that induce apparent (but fallacious) population structure.

Missing genotypes by themselves are not the most serious concern. Simply setting

*M*(

*i*,

*j*) = 0 in

Equation 3 if marker

*j* is missing for individual

*i* is reasonable if we are testing the null, that there is no structure, and the missing data is “missing at random.” Unfortunately “informative missingness” [

37,

38] is extremely frequent in genetic data. Probably the most common and serious issue is that with current technology, heterozygotes are more difficult to call than homozygotes. Thus, true heterozygotes are more likely to be called as missing. This is discussed in detail in [

38], which is recommended as a very useful discussion of the issues, especially as they apply to medical genetics. If DNA quality (or quantity) varies among our samples, then certain individuals may have an unusual amount of missing data, and then appear as outliers in our eigenanalysis—we in fact have seen this in many runs on real data.

Another issue that may produce confounding effects is if data from different populations or geographical areas is handled differently (which may be inevitable, especially in the initial processing); then, in principle, this may induce artifacts that mimic real population structural differences. Even restricting analysis to markers with no missing data, apart from an inevitable power loss, does not necessarily eliminate the problems. After all, if a subset (the missing data) is chosen in a biased way, then the complementary subset must also be biased.

We have no complete solution to these issues, though there is no reason to think that our eigenvector-based methods are more sensitive to the problems than other techniques [

9]. One check we do recommend is to generate a test matrix by taking the initial counts

*C*(

*i*,

*j*) to be 0 if the corresponding data is present; otherwise, set

*C*(

*i*,

*j*) = 1. This is equivalent to only focusing on the pattern of missing data. The eigenanalysis on this test matrix will show significant TW statistics if the missing data

*by itself* is showing evidence of population structure. If so, the results should be regarded with some suspicion, especially if the eigenvectors show high correlation to the eigenvectors of the main analysis. We here echo [

38] and recommend that the analyst should “control all aspects of source, preparation and genotyping, using the paradigms of blindness and randomization,” but, as the reference states, this will not always be possible.

Another possible source of error, where the analyst must be careful, is the inclusion of groups of samples that are closely related. Such a “family” will introduce (quite correctly from an algorithmic point of view) population structure of little genetic relevance, and may confound features of the data of real scientific interest. We found that this occurred in several real datasets that we analyzed with eigenanalysis and in which related individuals were not removed.