PLoS One. 2010; 5(9): e12510.
Published online 2010 September 17. doi:  10.1371/journal.pone.0012510
PMCID: PMC2941459

Theoretical Formulation of Principal Components Analysis to Detect and Correct for Population Stratification

Dale J. Hedges, Editor

Abstract

The Eigenstrat method, based on principal components analysis (PCA), is commonly used both to quantify population relationships in population genetics and to correct for population stratification in genome-wide association studies. However, it can be difficult to make appropriate inferences about population relationships from the principal component (PC) scatter plot. Here, to better understand the working mechanism of the Eigenstrat method, we consider its theoretical or “population” formulation. The eigen-equation for samples from an arbitrary number ($K$) of populations is reduced to that of a matrix of dimension $K$, the elements of which are determined by the variance-covariance matrix for the random vector of the $K$ allele frequencies. Solving the reduced eigen-equation is numerically trivial and yields eigenvectors that are the axes of variation required for differentiating the populations. Using the reduced eigen-equation, we investigate the within-population fluctuations around the axes of variation on the PC scatter plot for simulated datasets. Specifically, we show that there exists an asymptotically stable pattern of the PC plot for large sample sizes. Our results provide theoretical guidance for interpreting the pattern of the PC plot in terms of population relationships. For applications in genetic association tests, we demonstrate that, as a method of correcting for population stratification, regressing out the theoretical PCs corresponding to the axes of variation is equivalent to simply removing the population mean of allele counts and works as well as or better than the Eigenstrat method.

Introduction

The genetic structure of populations is important both in population genetics and in genetic epidemiology. From the viewpoint of population genetics, detecting and quantifying population structure is crucial for understanding the demographic and evolutionary histories of populations [1], [2]. In genetic epidemiology, population stratification may induce false positives and must be corrected for [3], [4]. In both candidate gene association studies and genome-wide association studies (GWAS), unrecognized ancestral differences between the cases and controls are one of the main sources of spurious associations.

The most common methods used in the study of human population structure are clustering approaches [5]–[7] and principal components analysis (PCA) [1], [8], [9]. The most widely used clustering method, as implemented in the STRUCTURE program, provides the probability of group membership of samples [5]. This approach, however, is computationally intensive and hence impractical for analyzing large numbers of markers. Another problem with the clustering approach is that it assumes that the population of interest can be divided into distinct genetic groups; it is therefore less suited to situations where a subtle structure exists, or where individuals are associated according to attributes other than the specified ancestries.

The PCA method was first applied to detecting and characterizing population structure more than 30 years ago [1]. By taking allele frequencies at different loci as a random vector and using the first few principal components (PCs), Cavalli-Sforza and co-workers constructed synthetic maps in their study of the evolutionary history of human populations [1], [2]. Recently, PCA has been applied to large-scale association studies using data for single-nucleotide polymorphisms (SNPs) in attempting to detect a few top axes of large genetic variation [10], [11].

In 2006, Patterson and co-workers [8] developed a new approach that uses PCA to detect population structures from large-scale genotype data of a sample of individuals. Instead of treating different markers as components and constructing PCs to represent the main variations from all markers, as was traditional (e.g., [10]), Patterson et al. [8] indexed the random vector by individuals, taking genotype data at different markers as its realizations. In the resultant PC scatter plot using axes of the top PCs, individuals from different populations have different coordinates and thus different locations. Price et al. [9] proposed a method of correcting for population stratification in association studies by regressing out the top PCs obtained by this new method from the genotype data. The method was implemented in the package EIGENSTRAT and is referred to as the Eigenstrat method. The Eigenstrat method has been applied to quantifying fine structures and describing the relationships of many different populations, such as European American [12], [13], European [14], [15], and Japanese populations [16], and is now the gold standard for detecting and correcting for population stratification.

Although the Eigenstrat method is becoming more popular, making appropriate inferences about population relationships from the PC scatter plot remains a challenging task (e.g., [17]). In this paper, we address this issue by considering the theoretical or “population” formulation of the Eigenstrat PCA method. Here, the term “population” means that the PCA is formulated for a hypothetical marker with allele frequencies drawn from different distributions for different populations. In contrast, the term “sample” means that PCA is performed using markers observed on the sample. We establish an explicit connection between the pattern of the PC plot and the variance-covariance parameters of the random vector of allele frequencies. We propose that these parameters, which are independent of the relative sample sizes of the populations, are more suitable than the patterns of the PC plot for quantifying population divergence. Based on our theoretical formulation of PCA, we prove the existence of an asymptotic pattern of the PC plot when the population sizes become large, and derive the formula for numerically calculating this asymptotic pattern for given population parameters. We then illustrate how to apply our theory in quantifying population structures and relationships using HapMap [18] and simulated data. We also use the theoretical formulation to investigate the intra-population fluctuations on the PC scatter plot constructed using “sample” marker data.

Our theoretical formulation of PCA also applies to association studies. In the Eigenstrat method, the confounding effect of population structure is controlled for by regressing out the first few top sample PCs obtained from genotype data. Here, we propose that population stratification can also be corrected for by regressing out the theoretical or “population” PCs calculated from the estimates of the variance-covariance parameters using our formulation. This is not only an alternative to the Eigenstrat method in GWAS, but may also be applied in candidate gene association studies when fewer markers are available for study. It turns out, as we rigorously show, that this method is equivalent to subtracting the population mean of allele counts. The proposed method is tested and compared with other methods using simulations and found to have superior performance to the Eigenstrat method when the differentiation among populations was limited.
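The stated equivalence between regressing out the theoretical PCs and subtracting the population mean can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a block-structured covariance matrix with one common variance `a`, one within-population covariance `b`, and one between-population covariance `c`. All parameter values and variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 6, 10]                          # hypothetical sample sizes
labels = np.repeat(np.arange(len(sizes)), sizes)
n = labels.size

# Block-structured covariance with hypothetical parameter values:
# a on the diagonal, b within a population, c between populations.
a, b, c = 0.5, 0.2, 0.05
same_pop = labels[:, None] == labels[None, :]
sigma = np.where(same_pop, b, c) + (a - b) * np.eye(n)

# Mean adjustment followed by PCA of the theoretical covariance matrix.
center = np.eye(n) - np.ones((n, n)) / n
vals, vecs = np.linalg.eigh(center @ sigma @ center)
axes = vecs[:, -(len(sizes) - 1):]          # the K-1 axes with large eigenvalues

# Allele counts (0, 1 or 2) at one marker, then regress out the axes.
x = rng.integers(0, 3, size=n).astype(float)
x_centered = x - x.mean()
residual = x_centered - axes @ (axes.T @ x_centered)

# Direct alternative: subtract each individual's population mean.
pop_means = np.array([x[labels == k].mean() for k in range(len(sizes))])
demeaned = x - pop_means[labels]

print(np.allclose(residual, demeaned))
```

Because the axes are orthonormal and constant within each population, the least-squares regression reduces to a projection, which is why the two residual vectors agree in this sketch.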

Results

Statistical model

Suppose we have genotype data for individuals sampled from $K$ populations with sample sizes [formula e005], respectively. In Eigenstrat theory, the data are modeled in the following way: the components of the random vector are the genotypes of the sampled individuals, [formula e006] with [formula e007]. Here, [formula e008] represents the count of the variant allele for individual [formula e009] from population [formula e010] for a random marker. Data for different markers are taken as different measurements (“samples”) of this random vector. In this statistical model, the randomness of a component, [formula e011], comes from two sources: the marker is randomly chosen, and the genotype is determined randomly conditional on the allele frequency of the chosen marker. The probability distribution of the allele frequency depends on which population the individual is from. The populations are characterized by the variance-covariance matrix of the random vector [formula e012] of allele frequencies

equation image
(1)

In this model, the genetically independent individuals are not statistically independent. Individuals from the same population have stronger correlations than those from different populations. We denote the variance-covariance matrix of the random vector [formula e014] as

equation image
(2)

where

equation image
(3)

and

equation image
(4)

where

equation image
(5)
equation image
(6)
equation image
(7)

are the variance of [formula e021] for any individual [formula e022] in population [formula e023], the covariance of two different individuals [formula e024] and [formula e025] in population [formula e026], and the covariance of two individuals in populations [formula e027] and [formula e028], respectively. As shown in Text S1, [formula e029] is related to the variance-covariance matrix of [formula e030] by

equation image
(8)
equation image
(9)
equation image
(10)

where [formula e034] is the mean of [formula e035], for [formula e036] and [formula e037].
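Relations of this kind follow from the laws of total variance and covariance. The sketch below is an illustration, not a reproduction of Equations (8) to (10), which may differ in form or normalization. It assumes Hardy-Weinberg equilibrium, so that the allele count $X_{ki}$ of individual $i$ in population $k$ is Binomial$(2, p_k)$ given the allele frequency $p_k$, and that genotypes are independent across individuals given the frequencies. The symbols $X_{ki}$, $p_k$ and $\mu_k = E[p_k]$ are notation assumed here.

```latex
\begin{align}
\operatorname{Var}(X_{ki})
  &= E\!\left[2p_k(1-p_k)\right] + \operatorname{Var}(2p_k)
   = 2\mu_k(1-\mu_k) + 2\operatorname{Var}(p_k), \\
\operatorname{Cov}(X_{ki}, X_{kj})
  &= \operatorname{Cov}(2p_k, 2p_k) = 4\operatorname{Var}(p_k), \qquad i \neq j, \\
\operatorname{Cov}(X_{ki}, X_{lj})
  &= \operatorname{Cov}(2p_k, 2p_l) = 4\operatorname{Cov}(p_k, p_l), \qquad k \neq l.
\end{align}
```

The first line uses $E[p_k^2] = \operatorname{Var}(p_k) + \mu_k^2$. Under these assumptions, every covariance between individuals is inherited from the shared randomness of the allele frequencies.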

One-population case

For PCA, we need to calculate the eigenvalues and eigenvectors of the variance-covariance matrix of the random vector of interest [19]. We begin with the simplest case, where all individuals are from the same population. Useful insights can be gained from this trivial situation, as shown below. In this case, the variance-covariance matrix for the random vector [formula e038] is given by Equation (3) with [formula e039], for which the eigensolutions can be easily obtained ([19] pp. 469–470). The eigenvalues can be divided into two groups. The first group includes only one large eigenvalue, [formula e040], with the associated eigenvector [formula e041]. The second group includes all the other eigenvalues, which are smaller and are all equal: [formula e042]. The coordinate of the random vector [formula e043] along the first PC is [formula e044], proportional to the average of the allele counts over all samples. If the individuals within this population are not correlated with each other, [formula e045] and [formula e046]. That is, [formula e047] reduces to the small eigenvalue. In contrast, if the individuals are completely correlated (e.g., if all individuals are monozygotic twins), [formula e048], [formula e049], and [formula e050]. In general, the stronger the correlation between individuals, the larger [formula e051] is and the smaller the small eigenvalues are. This means that the only PC with a large eigenvalue here represents the co-variation of all individuals, whereas the small eigenvalues represent the variation between individuals.
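The two groups of eigenvalues can be illustrated numerically. This sketch assumes the one-population covariance matrix has a common variance `a` on the diagonal and a common covariance `b` off the diagonal. The names `a`, `b` and `n` and their values are hypothetical, not the paper's notation. For such a matrix the standard result is one eigenvalue `a + (n - 1) * b`, with a constant eigenvector, and `n - 1` eigenvalues equal to `a - b`.

```python
import numpy as np

def compound_symmetry(n, a, b):
    """n x n covariance matrix: variance a on the diagonal, covariance b elsewhere."""
    return (a - b) * np.eye(n) + b * np.ones((n, n))

n, a, b = 6, 0.5, 0.1                 # hypothetical values
sigma = compound_symmetry(n, a, b)

eigvals = np.linalg.eigvalsh(sigma)   # returned in ascending order
large = eigvals[-1]                   # the single large eigenvalue
small = eigvals[:-1]                  # the n-1 equal small eigenvalues

print(large, a + (n - 1) * b)
print(small, a - b)
```

Setting `b = 0` makes the large eigenvalue collapse onto the small ones, and setting `b = a` drives the small eigenvalues to zero, matching the two limiting cases described in the text.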

Since there is only one population, the only PC with the large eigenvalue here does not reflect the variation caused by population structure. In Eigenstrat theory, therefore, one needs to perform the following mean adjustment:

equation image
(11)

For the mean-adjusted random vector [formula e053], the variance-covariance matrix has the same structure as that of [formula e054], but with different diagonal and off-diagonal elements given by

equation image
(12)
equation image
(13)
equation image
(14)

It is easy to show that the large eigenvalue is now reduced to [formula e058] and the small eigenvalues remain unchanged:

equation image
(15)

This shows how the mean adjustment in Eigenstrat theory removes the overall variance represented by the first PC that reflects the joint variation of all components because of their correlation instead of stratification. The same mean-adjustment will be performed for the general case where individuals are from two or more populations, for the same reason.
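The effect of the mean adjustment on the spectrum can be sketched as follows. The sketch assumes that the adjustment subtracts, for each marker, the average allele count over all individuals, which corresponds to multiplying the random vector by the centering matrix `I - 11^T / n`. The parameter names and values are hypothetical.

```python
import numpy as np

n, a, b = 6, 0.5, 0.1                            # hypothetical values
sigma = (a - b) * np.eye(n) + b * np.ones((n, n))

# Centering matrix: subtracts the across-individual mean from each component.
center = np.eye(n) - np.ones((n, n)) / n

# Covariance of the mean-adjusted vector.
sigma_adjusted = center @ sigma @ center
adjusted_vals = np.linalg.eigvalsh(sigma_adjusted)   # ascending order

print(adjusted_vals)   # one eigenvalue near 0, the rest equal to a - b
```

In this sketch the constant eigenvector is mapped to zero by the centering, so the large eigenvalue disappears while the `n - 1` small eigenvalues keep the value `a - b`.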

Two-population case

Now we turn to the first nontrivial case, where there are [formula e060] individuals in population 1 and [formula e061] individuals in population 2 in the samples. In this case, the variance-covariance matrix of [formula e062] is

equation image
(16)

and can be shown to have solutions as follows. The small eigenvalues of [formula e064] are just those of [formula e065] and [formula e066], the same as in the one-population case. There are two large eigenvalues, each of which corresponds to an eigenvector whose coordinates are constant for individuals from the same population. However, these two large eigenvalues do not reflect only the variations caused by the population structure; as shown in the last subsection, they also represent the co-variation of the individuals. This can be easily seen if we assume [formula e067]. In this case, the two large eigenvalues are simply the large eigenvalues of [formula e068] and [formula e069]. If we were only interested in detecting population structure, the PCA for [formula e070] would be sufficient. However, we are also interested in correcting for the population stratification, and therefore we hope to obtain PCs mainly representing variations due to population structure. This is why we need to investigate the mean-adjusted vector [formula e071] as defined in the last subsection. Here, again, the variance-covariance matrix of [formula e072] has the same structure as that of [formula e073] and has the following eigensolution. The small eigenvalues of [formula e074] are still the same as those of [formula e075]. (Note that, as in the case of one population, [formula e076].) The two large eigenvalues become [formula e077] with eigenvector [formula e078] and

equation image
(17)

with eigenvector

equation image
(18)

Note that Equation (18) is equivalent to Equations (13a) and (16b) in [17]. The first large eigenvalue ([formula e081]) reflects the fact that the mean of the vector is zero, a result of the mean adjustment, whereas the second large eigenvalue represents the variation caused by the population structure. The only nonzero large eigenvalue ([formula e082]) is very large compared with the small eigenvalues, for large [formula e083] and [formula e084]. So, if there are only two populations, we would have only one eigenvector showing a clustering structure on a PC scatter plot using data of “samples” of markers. Any other eigenvectors would not have anything to do with stratification among populations.

It should be noted that the eigenvector corresponding to the only large eigenvalue and reflecting the population structure (Equation (18)) depends only on the ratio of sample sizes of the two populations, [formula e085], not on the other parameters ([formula e086], [formula e087], [formula e088], [formula e089], or [formula e090]). This is true only in the two-population case and is not a generic property, as will be clear soon.
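The claim that the structure-related eigenvector depends only on the sample sizes can be checked with a small numerical sketch. It assumes a two-population covariance matrix with variances `a1` and `a2`, within-population covariances `b1` and `b2`, and between-population covariance `c`. These names, the helper functions, and all values are hypothetical. The expected vector used for comparison follows from requiring a unit-length vector that is constant within each population and orthogonal to the constant vector. It is not copied from Equation (18).

```python
import numpy as np

def two_pop_cov(n1, n2, a1, b1, a2, b2, c):
    """Block covariance for two populations (hypothetical parametrization)."""
    n = n1 + n2
    sigma = np.full((n, n), c, dtype=float)
    sigma[:n1, :n1] = b1
    sigma[n1:, n1:] = b2
    np.fill_diagonal(sigma, [a1] * n1 + [a2] * n2)
    return sigma

def top_axis(sigma):
    """Leading eigenvector of the mean-adjusted covariance, sign fixed."""
    n = sigma.shape[0]
    center = np.eye(n) - np.ones((n, n)) / n
    _, vecs = np.linalg.eigh(center @ sigma @ center)
    v = vecs[:, -1]
    return v if v[0] > 0 else -v

n1, n2 = 4, 8
n = n1 + n2

# Two different sets of variance-covariance parameters, same sample sizes.
v_a = top_axis(two_pop_cov(n1, n2, 0.5, 0.10, 0.6, 0.15, 0.02))
v_b = top_axis(two_pop_cov(n1, n2, 0.7, 0.20, 0.4, 0.10, 0.05))

# Unit vector, constant within populations, orthogonal to the constant vector.
expected = np.concatenate([
    np.full(n1, np.sqrt(n2 / (n1 * n))),
    np.full(n2, -np.sqrt(n1 / (n2 * n))),
])

print(np.allclose(v_a, expected), np.allclose(v_b, expected))
```

Both parameter sets give the same axis because, with two populations, orthogonality to the constant vector leaves only one direction among the vectors that are constant within populations.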

$K$-population case

In the general case of $K$ populations, the eigensolutions of the variance-covariance matrix Equation (2) are as follows. The small eigenvalues are the same as those for the individual populations. There are $K$ large eigenvalues, each corresponding to an eigenvector whose coordinates are constant for individuals from the same population. These large eigenvalues, again, reflect not only the variation caused by population stratification but also the overall co-variation of the individuals. After the mean adjustment, the variance-covariance matrix [formula e094] has the same structure as [formula e095] in Equation (2), and its submatrices [formula e096] and [formula e097] have the same structure as the corresponding [formula e098] and [formula e099] in Equations (3) and (4), respectively. The elements of [formula e100] and [formula e101] are given by

equation image
(19)
equation image
(20)
equation image
(21)

where

equation image
(22)
equation image
(23)

It is not difficult to show that the eigensolutions with small eigenvalues are still the same as those in the one-population and two-population cases (see also [8]). To find the eigensolutions with large eigenvalues of [formula e107], which describe the population distinctions, we define

equation image
(24)

and for each of the [formula e109] with dimension [formula e110],

equation image
(25)

Then the eigenequation

equation image
(26)

is reduced to

equation image
(27)

By using the following identity, which is proven in Text S2,

equation image
(28)

the trivial eigensolution, reflecting the fact that the components of [formula e115] have a zero sum after the mean adjustment, is immediately obtained:

equation image
(29)

This identity, (28), also means that any other nontrivial solutions must satisfy

equation image
(30)

The eigenvectors are usually normalized by

equation image
(31)

Equation (27) can be used to numerically calculate the $K-1$ nontrivial eigenvalues and the corresponding eigenvectors for given variance-covariance parameters and sample sizes ([formula e120]). Thus, it provides a theoretical tool for connecting the patterns of the PC scatter plot and the relationships between populations.
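A reduction of this kind can be sketched numerically. The code below is one generic way to reduce the problem to a small matrix, obtained by restricting the mean-adjusted covariance to vectors that are constant within populations. It is not necessarily the same algebraic form as Equation (27). The parametrization (variances `a`, within-population covariances `b`, between-population covariances `c`), the function names, and all values are hypothetical.

```python
import numpy as np

def full_covariance(sizes, a, b, c):
    """n x n block covariance; c[k, l] is the covariance between populations k and l."""
    labels = np.repeat(np.arange(len(sizes)), sizes)
    sigma = c[np.ix_(labels, labels)].astype(float)
    for k, (a_k, b_k) in enumerate(zip(a, b)):
        idx = np.flatnonzero(labels == k)
        sigma[np.ix_(idx, idx)] = b_k      # within-population covariance
        sigma[idx, idx] = a_k              # variances on the diagonal
    return sigma

def reduced_matrix(sizes, a, b, c):
    """Symmetric K x K matrix sharing the large eigenvalues of the adjusted covariance."""
    sizes = np.asarray(sizes, dtype=float)
    K = len(sizes)
    # Action of the full covariance on vectors constant within populations.
    M = c * sizes[None, :]
    np.fill_diagonal(M, a + (sizes - 1) * b)
    # Mean adjustment restricted to the same vectors.
    P = np.eye(K) - np.outer(np.ones(K), sizes / sizes.sum())
    R = P @ M @ P
    # Orthonormal coordinates make the reduced matrix symmetric.
    d = np.sqrt(sizes)
    S = (d[:, None] * R) / d[None, :]
    return (S + S.T) / 2

sizes = [5, 10, 20]
a = np.array([0.5, 0.6, 0.55])
b = np.array([0.2, 0.25, 0.22])
c = np.array([[0.0, 0.05, 0.02],
              [0.05, 0.0, 0.08],
              [0.02, 0.08, 0.0]])

n = sum(sizes)
center = np.eye(n) - np.ones((n, n)) / n
full_vals = np.linalg.eigvalsh(center @ full_covariance(sizes, a, b, c) @ center)
red_vals, red_vecs = np.linalg.eigh(reduced_matrix(sizes, a, b, c))

# Per-population coordinates on each axis (one point per population).
points = red_vecs / np.sqrt(sizes)[:, None]

print(full_vals[-2:], red_vals[-2:])
```

In this sketch the 3 x 3 problem reproduces the two largest eigenvalues of the 35 x 35 problem, and the per-population coordinates on each nontrivial axis have a zero sum when weighted by sample size. This is consistent with the text's later remark that populations with large samples sit near the origin.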

Application to population genetics

PC plot patterns and population structure

The formulation we derived here not only provides a means of connecting PC plot patterns and population structure but also suggests an alternative to the $F_{ST}$ statistic for describing population relationships. The complete set of parameters describing the population relationships for $K$ populations can be put into a vector of variances

equation image
(32)

and a covariance matrix

equation image
(33)

These parameters, referred to as variance-covariance parameters, together with the sample sizes ([formula e125]), completely determine, by Equation (27), the theoretical (or “population”) eigenvectors. These eigenvectors, referred to as axes of variation in [8], are uniform within a population without a structure. Thus, when they are used to make a PC plot, each population is represented by a single point, which is referred to as the representative point of the population. Here, we illustrate, by using examples, how the pattern of the representative points can be used to infer population relationships.

In the first example, we illustrate the effect of population sizes on the pattern of a PC scatter plot. Although the variance and covariance between each pair of populations are the same, as shown in Table 1, the three representative points in the PC scatter plot are distributed unevenly because of the unbalanced sample sizes, as seen in Figure 1. The representative points of populations with small sample sizes are around the borders, whereas the points for populations with large sample sizes are located near the zero point, as can be explained by Equation (30).

Figure 1. Example 1 of axes of variation calculated from variance-covariance parameters and sample sizes.
Table 1. Parameters for the three populations in Figure 1.

In the second example, we have five populations, the first three of which are close to one another (in the sense that the corresponding covariances are large) and are far away from the other two distant populations (see Table 2). Figure 2 shows that the first three populations can hardly be distinguished from the two-dimensional PC plot using the first two eigenvectors. In eigenvector 1, P4 and P5 are contrasted with the three closely related populations (P1, P2 and P3), while in eigenvector 2 P4 is contrasted with P5, and the other three are in the middle. The three closely related populations are distinguished in eigenvectors 3 and 4. This example suggests that one has to examine a large enough number of eigenvectors in order to find all the significant population differences. The first two eigenvectors are the most important, but the others are also needed if the samples are from more than three populations. However, if there are only two populations, a two-dimensional PC plot is not needed; only the first eigenvector shows the population structure.

Figure 2. Example 2 of axes of variation calculated from variance-covariance parameters and sample sizes.
Table 2. Parameters for the five populations in Figure 2.

The representative points depend on the sample sizes as well as the variance-covariance parameters, as pointed out in [8]. We note that, even if equal sample sizes are used, the representative points cannot replace the variance-covariance parameters in characterizing the population relationships, because their values depend on the presence of one another in the analysis.

Estimation of variance-covariance parameters and axes of variation

In practice, the variance-covariance parameters in Equations (32) and (33) are unknown and can only be estimated from genotype data of a large number of markers. The estimates of these parameters can be used to calculate the estimates of axes of variation using Equation (27). If the population memberships of the samples are known, estimation of the variance-covariance parameters is straightforward. For any one of the parameters in Equations (32) and (33), we can simply take the average of the corresponding elements of the sample variance-covariance matrix as its estimate. However, in reality, the information on population membership is usually unavailable and needs to be inferred using the PCA or other methods such as STRUCTURE. For inference of population structure from a PC plot, a generic clustering algorithm may be appropriate [20].
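When population memberships are known, the averaging described above can be sketched as follows. The function name, its arguments, and the parameter names are hypothetical. The input `cov` stands for an individuals-by-individuals sample variance-covariance matrix computed across markers, and every population is assumed to contain at least two individuals.

```python
import numpy as np

def estimate_parameters(cov, labels):
    """Average blocks of a sample covariance matrix by population label.

    Returns (var, within, between): per-population variances, per-population
    covariances between distinct individuals, and a K x K matrix of
    between-population covariances (its diagonal is left at zero).
    """
    labels = np.asarray(labels)
    pops = np.unique(labels)
    K = len(pops)
    var = np.zeros(K)
    within = np.zeros(K)
    between = np.zeros((K, K))
    for i, p in enumerate(pops):
        idx_p = np.flatnonzero(labels == p)
        n_p = len(idx_p)
        block = cov[np.ix_(idx_p, idx_p)]
        var[i] = np.trace(block) / n_p
        # Off-diagonal average; requires n_p >= 2.
        within[i] = (block.sum() - np.trace(block)) / (n_p * (n_p - 1))
        for j, q in enumerate(pops):
            if i != j:
                idx_q = np.flatnonzero(labels == q)
                between[i, j] = cov[np.ix_(idx_p, idx_q)].mean()
    return var, within, between

# Usage on an exactly block-structured matrix with hypothetical values.
labels = np.array([0, 0, 0, 1, 1])
a_true = np.array([0.5, 0.6])
b_true = np.array([0.1, 0.2])
c_true = 0.03
cov = np.full((5, 5), c_true)
cov[:3, :3] = b_true[0]
cov[3:, 3:] = b_true[1]
np.fill_diagonal(cov, a_true[labels])

var, within, between = estimate_parameters(cov, labels)
print(var, within, between[0, 1])
```

With real genotype data the blocks are noisy rather than constant, and the averages then act as simple moment estimates of the corresponding parameters.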

In contrast, PCA in practice is based on the “sample” of markers. Namely, eigenvectors are calculated using genotypes of a large number of markers. These eigenvectors are therefore referred to as “sample” eigenvectors. The points on the PC scatter plot using the sample eigenvectors, called the sample points, are scattered to some extent because of sampling fluctuation. Our representative points using the estimated axes of variation should be located in the middle of the corresponding clusters of the sample points. We performed simulations to show how the axes of variation can be evaluated and to compare them with the sample eigenvectors used in Eigenstrat theory. Figure 3 shows an example of our simulation, using the simulation parameter values given in Table 3. P1 and P2 were simulated using an $F_{ST}$ of 0.01 and thus were closer to each other than to P3 or P4, for which $F_{ST}$ was much larger (0.43). The distance between P3 and P4 was even larger than their distances to P1 or P2. The representative points obtained from the estimated parameters listed in Table 3 were right in the centers of the corresponding sample points. Also listed in Table 3 are the estimated correlation coefficients [formula e132] ([formula e133]). Compared with the covariances [formula e134], the corresponding correlations [formula e135] seem to be more suitable for representing the population distances.

Figure 3. PC scatter plot and estimated axes of variation for a simulation.
Table 3. Simulating values for $F_{ST}$ used for the four populations P1, P2, P3, and P4 and the estimated parameters.

Fluctuations in sample eigenvectors within populations and asymptotic PC plot patterns

In practice, within-population fluctuations of the PC scatter plot using sample eigenvectors may be so strong that closely related populations have overlapping clusters and hence cannot be distinguished. Here, we first investigate the factors that affect the within-population fluctuations for a given population divergence: the sample sizes of the populations and the number of markers. We then study the asymptotic behavior of the patterns of the PC scatter plot as the sample size becomes large.

A remark is in order. The fluctuations of the sample points on the PC scatter plot within a population should not be confused with the random variation of marker data between individuals within a population. Recall from our theoretical consideration in Section 2 that the between-individual variation within a pure population exists even theoretically and is represented by the small eigenvalues and the corresponding eigenvectors. In contrast, the within-population fluctuation of the sample points is mainly due to the limited “sample size” of markers and should decrease as more markers are included in the analysis. The observed fluctuations of sample points around the representative points may also reflect subtle population structure; this component cannot be reduced by increasing the number of markers.
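The dependence on the number of markers can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes a single pure population, known allele frequencies drawn from an arbitrary range, and a simple per-marker standardization chosen for illustration; the function names are hypothetical. It estimates the covariance between two unrelated individuals from a finite set of markers and measures how much that estimate fluctuates across replicates.

```python
import math
import random


def normalized_genotype(p, rng):
    """Draw one diploid genotype (0, 1 or 2 copies) and standardize it."""
    g = (rng.random() < p) + (rng.random() < p)
    return (g - 2.0 * p) / math.sqrt(2.0 * p * (1.0 - p))


def pair_covariance(n_markers, rng):
    """Sample covariance between two unrelated individuals over n_markers."""
    total = 0.0
    for _ in range(n_markers):
        p = rng.uniform(0.1, 0.9)  # assumed allele-frequency range
        total += normalized_genotype(p, rng) * normalized_genotype(p, rng)
    return total / n_markers


def spread(n_markers, n_reps=200, seed=1):
    """Standard deviation of the covariance estimate across replicates."""
    rng = random.Random(seed)
    values = [pair_covariance(n_markers, rng) for _ in range(n_reps)]
    mean = sum(values) / n_reps
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n_reps - 1))
```

Under these assumptions each standardized genotype has unit variance, so the spread of the estimate is expected to scale roughly as one over the square root of the number of markers: `spread(1600)` should be about a quarter of `spread(100)`. This is the sampling component of the fluctuations only; any contribution from real substructure would not shrink in this way.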

We performed simulations to demonstrate the effect of increasing population size (An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e140.jpg) and number of markers (An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e141.jpg) on the within-population fluctuations of sample points on the PC scatter plot. In our simulations, we first generated three populations (P1, P2, and P4), each with the same $F_{ST}$ value of 0.003; the three populations should therefore be equidistant from one another. To mimic a subtle subpopulation structure, we created another population (P3) from P2 by adding to the allele frequency vector of P2 a random vector whose elements were independently and uniformly distributed within An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e143.jpg. This resulted in three distinct populations, P1, P2+P3 and P4; P2+P3 has a subtle structure. Figure 4 shows the PC scatter plot using the first two eigenvectors for various values of An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e144.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e145.jpg. If the subtle structure within P2+P3 is ignored, this plot should be all that is needed to distinguish these three populations. As we see from this figure, as the number of markers (An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e146.jpg) increased, the fluctuations within each population gradually decreased, but the distance between P2 and P3 remained unchanged, indicating that fluctuations due to the limited “sample size” of markers can be reduced by adding markers, whereas fluctuations reflecting subtle structure cannot. Since there were actually four populations, we also plotted the first and third eigenvectors in Figure 5. Here, we see that eigenvector 1 mainly addressed the difference between P1+P4 and P2+P3, eigenvector 2 the difference between P1 and P4, and eigenvector 3 the remaining difference between P2 and P3.

Figure 4
Effects of sample size and marker number on within-population fluctuations of PC scatter plot: eigenvector 2 vs. eigenvector 1.
Figure 5
Effects of sample size and marker number on within-population fluctuations of PC scatter plot: eigenvector 3 vs. eigenvector 1.

Figures 4 and 5 also show the effect of increasing the number of individuals in each population. As An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e153.jpg increased, the within-population fluctuations became smaller and the distinctions among populations became clearer. The effect of increasing An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e154.jpg was much stronger than that of increasing An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e155.jpg, in agreement with what was found in [8]. Here, we give an explanation for this phenomenon as follows. For a finite number of markers, each individual in PCA actually acts as a population. When the number of individuals (An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e156.jpg) is small, the variation between individuals from the same population is comparable to that caused by population difference, so PCA tends to address these variations in the first few eigenvectors. As An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e157.jpg increases, the variation due to population differences becomes overwhelming, so PCA addresses only this variation in the first few eigenvectors and leaves the trivial ones to other eigenvectors with small eigenvalues.

In Figures 4 and 5, we also plotted the theoretical patterns calculated using the population sizes (An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e158.jpg) and the estimated variance-covariance parameters from the case with the largest An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e159.jpg (500) and largest An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e160.jpg (100,000). The absolute distances between the representative points became smaller as An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e161.jpg increased, because of the normalization equation (31), but the relative pattern remained almost the same, especially for large An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e162.jpg. The ratio of distance between P2 and P3 to that between P2+P3 and P3 on the PC plot approached a constant value, implying that the pattern on the PC plot reached an asymptotic shape as An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e163.jpg became large. This kind of asymptotic behavior can be derived from our theoretical considerations as follows.

Let An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e164.jpg (An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e165.jpg) denote the relative sample size (or proportion) of population An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e166.jpg in the samples of interest. It is shown in Text S2 that the asymptotic form of the eigen-equation (27) is

equation image
(34)

with

equation image
(35)

for An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e169.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e170.jpg. The small part being neglected for large An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e171.jpg is

equation image
(36)

where An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e173.jpg is the small eigenvalue for population An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e174.jpg. From Equation (34), we see that asymptotically the eigenvectors are independent of An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e175.jpg for given proportions of populations and variance-covariance parameters listed in Equations (32) and (33), whereas the large eigenvalues increase linearly with An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e176.jpg. Figure 6 shows how the theoretical predictions of the dimensions of the pattern on the PC plot vary with sample size for the simulated datasets plotted in Figures 4 and 5. The dimensions shown are: An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e177.jpg, the distance between P2 and P3 on eigenvector 1; An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e178.jpg, the distance between P1 and P3 on eigenvector 1; An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e179.jpg, the distance between P1 and P4 on eigenvector 2; and An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e180.jpg, the distance between P2 and P3 on eigenvector 3. Here, the estimated variance-covariance parameters from the case with the largest number of markers (An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e181.jpg) were used.

Figure 6
Approach to the asymptotic form.

Note that the asymptotic form of the eigen-equation, and hence the asymptotic form of the PC plot patterns, do not depend on the values of the variances An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e187.jpg (An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e188.jpg). They are determined only by the intra- and inter-population covariances, and the relative sample sizes.

For An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e189.jpg to be negligible compared to An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e190.jpg, we need to have an An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e191.jpg such that

equation image
(37)

In the simplest case, where all populations have the same variance (An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e193.jpg) and the same intra-population covariance (An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e194.jpg) and each pair of populations has the same inter-population covariance (An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e195.jpg), we have a simple expression for the critical size An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e196.jpg as

equation image
(38)

indicating that the closer the populations (An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e198.jpg) the larger the sample size needed for the asymptotic pattern to be approached. Note, however, that this is true only for cases where there are more than two populations in the sample. If there are only two populations, as shown in the previous section, the pattern in eigenvector 1 is determined only by the relative sample sizes (see Equation (18)).

Application to HapMap data

We estimated the variance-covariance parameters and the axes of variation for some of the HapMap [18] populations. An example is given in Figure 7, where the first three sample eigenvectors are plotted from the analysis of four populations: Chinese in Denver (CHD), Gujarati Indians in Houston (GIH), Japanese in Tokyo (JPT), and Tuscan Italians (TSI), using markers on chromosome 1. The estimates of the variance-covariance parameters for these four populations are given in Table 4. The axes of variation calculated from these estimates are also shown in Figure 7 and were consistent with the sample eigenvectors calculated directly from the raw variance-covariance matrix. It can be seen from Figure 7 that the two genetically very close populations, CHD and JPT, are contrasted only on the third eigenvector; the first two eigenvectors address the difference between CHD+JPT on the one hand and GIH and TSI on the other, and the difference between GIH and TSI. We note that CHD and JPT can be distinguished not only when PCA is run together with other populations but also when PCA is run on these two populations alone (data not shown). As shown in Table 4, the genetic diversity within GIH or TSI is so large that the average covariance between two random individuals both from GIH or both from TSI is larger than the covariance between an individual from CHD and an individual from JPT. This explains why on eigenvector 3, where CHD and JPT are contrasted, the clusters of GIH and TSI are also elongated, showing a within-population structure.

Figure 7
Four HapMap populations.
Table 4
Estimates and the corresponding standard errors for the variance-covariance matrix of the four populations CHD, GIH, JPT, and TSI using HapMap data for chromosome 1.

Application to genetic epidemiology

Correcting for population stratification using axes of variation

Now we turn to the issue of correcting for population stratification in genetic association studies. In the Eigenstrat method, the correction for stratification is performed by regressing out the variation caused by population structures [9]. In the theory of PCA, the random vector An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e199.jpg can be expanded as a linear combination of all PCs:

equation image
(39)

where An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e201.jpg is the total number of individuals and

equation image
(40)

is the An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e203.jpgth PC. To correct for population stratification, we subtract the first An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e204.jpg terms from this expansion, which are the variations due to the differences between the An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e205.jpg populations. The sum of the remaining terms, describing the variations between individuals,

equation image
(41)

is then used for the disease association test.
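The subtraction described above can be sketched in a few lines. This is an illustrative sketch only: it assumes the leading eigenvectors are already available as orthonormal vectors over individuals, and the example axis and allele counts are made up. The function `remove_components` is a hypothetical name, not part of any Eigenstrat software.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def remove_components(x, axes):
    """Return x minus its projections onto the given orthonormal axes."""
    residual = list(x)
    for v in axes:
        score = dot(v, x)  # coefficient of x along this axis
        residual = [r - score * vi for r, vi in zip(residual, v)]
    return residual


# Hypothetical example: 4 individuals, one unit-length axis contrasting two groups.
axis = [0.5, 0.5, -0.5, -0.5]
allele_counts = [2.0, 1.0, 0.0, 1.0]
residual = remove_components(allele_counts, [axis])
print(residual)  # [1.5, 0.5, 0.5, 1.5]
```

The residual is orthogonal to every removed axis, which is what it means for the corresponding terms of the expansion to have been subtracted.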

The PCs removed for correcting population stratification in Equation (39) are those obtained using the “sample” markers in the Eigenstrat method and are referred to as sample PCs. Here, we propose to use the same strategy for correcting population stratification but with the PCs defined using the axes of variation (referred to as representative PCs or “population” PCs). Our motivation is as follows. When the Eigenstrat method is used, some of the genotype variations corresponding to within-population fluctuations in the sample eigenvectors are removed, in addition to those corresponding to population stratification. As shown in previous sections, these within-population fluctuations are mainly due to the finite “sample size” (i.e., the limited number of markers) and thus are irrelevant to the issue of population stratification. Admittedly, in practice, a proportion of the within-population fluctuations may be due to a subpopulation structure. However, as shown in previous sections, the fluctuations in the first few eigenvectors only partially represent this kind of subpopulation structure. Consider the example given in Figures 4 and 5. When the samples are thought of as coming from three distinct populations P1, P2+P3, and P4, only the first two eigenvectors should be used to correct for population stratification. The variation caused by the difference between P2 and P3 would then be partially removed; although the fluctuation described by An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e207.jpg on eigenvector 1 would be taken into account, that described by An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e208.jpg on eigenvector 3 would remain in the residual. So the variation caused by the difference between P2 and P3 would only be altered and not completely removed. Only partially removing the variations caused by a subpopulation structure may not be really helpful for reducing false-positive rates in a case-control study. In this specific example, where the subpopulation structure is the simplest, we could remove the corresponding variation by simply adding the third PC in the sum in Equation (41). In reality, however, subpopulation structures are far more complex and are hence represented by many PCs. Regressing out too many PCs in Equation (41) would remove too much inter-individual variation within a pure population and hence would significantly reduce the power of association tests. We therefore prefer to use the representative PCs in Equation (41) for removing the main variations caused by major population stratification, while keeping the variations due to subtle subpopulation structures unchanged. Since the “population” PCs are used here, the proposed method is referred to as popu-Eigenstrat.

Now let us derive the theoretical expression of the residuals in Equation (41). Since we use the representative PCs, for a given An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e209.jpg, the vector An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e210.jpg, which is subtracted from An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e211.jpg, has a structure like

equation image
(42)

and for each of the An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e213.jpg,

equation image
(43)

In Text S3, it is shown that

equation image
(44)

where

equation image
(45)

is the mean of An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e217.jpg over individuals from population An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e218.jpg. So it turns out that our representative PC-based correction is simply equivalent to subtracting the population group means for each individual. Our method does not even use any information about the variance-covariance parameters.
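The equivalence with subtracting group means can be checked numerically. The sketch below is an illustration under a stated assumption, namely that the removed components span the set of vectors that are constant within each population; unit-length population indicator vectors are used as a convenient orthonormal basis of that set. The data and function names are hypothetical.

```python
import math
from collections import defaultdict


def subtract_population_means(values, labels):
    """Subtract from each value the mean of its own population group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, lab in zip(values, labels):
        sums[lab] += x
        counts[lab] += 1
    return [x - sums[lab] / counts[lab] for x, lab in zip(values, labels)]


def project_out_indicators(values, labels):
    """Remove projections onto unit-length population indicator vectors."""
    residual = list(values)
    for pop in sorted(set(labels)):
        size = sum(1 for lab in labels if lab == pop)
        unit = [1.0 / math.sqrt(size) if lab == pop else 0.0 for lab in labels]
        score = sum(u * x for u, x in zip(unit, values))
        residual = [r - score * u for r, u in zip(residual, unit)]
    return residual


# Hypothetical allele counts at one marker for two populations.
counts = [2, 1, 2, 0, 1, 0, 0]
labels = ["P1", "P1", "P1", "P2", "P2", "P2", "P2"]
by_means = subtract_population_means(counts, labels)
by_projection = project_out_indicators(counts, labels)
```

Both routes give the same residuals up to rounding, and neither uses any variance-covariance parameter; only the population memberships enter.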

Results of simulations

We conducted simulations to compare the performance of the popu-Eigenstrat method with that of the original Eigenstrat method and with that of the covariate-adjustment method, which uses population labels as a covariate. See the Methods section for details of the simulations. Table 5 shows the results. For An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e219.jpg, which is typical of differentiation between divergent European populations, the proposed method, popu-Eigenstrat, using the representative PCs, achieved almost the same rate of false-positive associations and comparable power as the original Eigenstrat method, which uses the sample PCs. Compared with the covariate-adjustment method, our method had slightly lower power but also a lower rate of false-positive associations.

Table 5
Proportion of associations reported as significant (An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e220.jpg) using different methods of stratification correction for simulated data.

It is interesting to compare the results for the two different allele frequency sets for the Causal-Specific SNPs (see Methods for definitions of these simulated SNPs). When the allele frequency ratio was the inverse of the sample size ratio, the power was zero if no correction was performed. The reason was that in this case, even though a very high proportion of tests had very low p-values, the disease allele was incorrectly identified as the wild allele. The power increased to An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e230.jpg after the methods of stratification correction were applied. In contrast, when the allele frequency ratio was the same as the sample size ratio, the power without correction was as high as An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e231.jpg! Without population stratification, we could not have achieved such an extremely high power for the given sample size, allele frequencies and relative risk. In this case, the population with a high disease allele frequency was over-sampled and thus the difference of allele frequency between the case and control groups was enlarged. After the methods of stratification correction were applied, the power was reduced to its “normal” level (An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e232.jpg). This observation indicates that special attention has to be paid to the populations' allele frequency spectra when the power of different stratification correction strategies is compared. For the Causal-StruInfo SNPs in our simulation (see Methods for definitions of these simulated SNPs), some of the SNPs for which the allele frequency ratio was the inverse of the sample size ratio may have contributed to an increase in power after stratification correction was applied. However, more SNPs had an allele frequency ratio with the same trend as the sample size ratio and hence contributed more to a reduction in power when stratification correction was applied.

For a smaller $F_{ST}$, 0.003, we found that the performance of the original Eigenstrat method was poorer. The rate of false positives was reduced less by the Eigenstrat method than by the other methods. The power was increased less by the Eigenstrat method than by the other methods in the case where the ratio of disease-causing allele frequencies was the inverse of the ratio of sample sizes. Only when the two ratios were the same did the Eigenstrat method give power significantly higher than the others. The other methods of stratification correction worked similarly to the case of larger $F_{ST}$. This can be explained as follows. As $F_{ST}$ decreased, the distances between the populations decreased, and hence the fluctuations in the sample eigenvectors increased. This in turn made these sample PCs less representative of the population structure, resulting in a poorer performance in correcting for stratification.

We also examined how power was affected by regressing out too many PCs using the Eigenstrat method. In the case of An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e236.jpg, for the Causal-StruInfo SNPs, the power was reduced from An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e237.jpg (see Table 5) obtained using only two PCs to An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e238.jpg, An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e239.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e240.jpg when using 50, 90 and 150 PCs, respectively.

Finally, the results of our comparison between Eigenstrat and popu-Eigenstrat should be interpreted with caution, because additional information (i.e., population memberships) was given to popu-Eigenstrat.

Discussion

The Eigenstrat method is a powerful tool to detect and correct for population stratification by treating genotype data as “samples” of markers. In this work, we have provided a framework in which the large eigenvalues and the corresponding eigenvectors necessary for differentiating the population structure are theoretically connected to the variance-covariance parameters of the random vector of the allele frequencies of the populations. These variance-covariance parameters can serve as an alternative to the traditional $F_{ST}$ statistic for quantifying population relationships. In practice, our formulation provides theoretical guidance on how to correctly infer population structures from the pattern of the PC plot. Using the developed formulation, we have shown that there exists an asymptotic pattern on the PC plot as the sample size becomes large. We have also shown that the asymptotic pattern can be easily obtained by numerically solving the asymptotic form of the reduced eigen-equation for given covariance parameters and relative sample sizes.

Based on our theoretical consideration and simulations, we have investigated the factors that affect the within-population fluctuations of the sample eigenvectors (as obtained from the Eigenstrat method) around the axes of variation. As the sample size becomes large, the overall asymptotic pattern of the PC plot quickly forms. The within-population fluctuations in the asymptotic pattern are then mainly determined by the number of markers and the subpopulation structure. These fluctuations corresponding to the very subtle subpopulation structures are entangled with the normal inter-individual variations within a pure population and hence can hardly be adjusted without significantly affecting the power of association tests.

These conclusions led us to a novel method of correcting for population stratification: we can regress out the representative PCs instead of the sample PCs used in the Eigenstrat method. We showed theoretically that this method is equivalent to simply removing the population mean of the allele counts. Implementation of the proposed method therefore becomes trivial once the samples' population memberships are known (either self-reported or identified using the Eigenstrat method or any other method, such as STRUCTURE). Our simulation studies showed that the proposed method worked as well as the Eigenstrat method for reducing false-positive rates and for maintaining the power of association tests. The proposed method outperformed the method of simply using the population label as a covariate in reducing false-positive rates, although it had slightly lower power. Our proposed method can also be used in candidate gene association studies or replication studies, as long as the population memberships are known and a trend test is preferred.

In the present work, we have not considered admixture of populations. As shown in [8], PCA carried out on samples that include admixed individuals produces an interesting pattern on the PC scatter plot: the admixed samples lie along a line between the two source populations. Similar patterns have been observed for other populations [13], [16]. Work is currently in progress to extend our theoretical formulation to the situation of admixture in order to explain the observed patterns.

Methods

Simulations of population structure

Following [4] and [9], we simulated genotype data for a specified number of populations with specified values of $F_{ST}$ using the Balding-Nichols model [21]. The ancestral allele frequency An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e243.jpg was first generated from the uniform distribution on An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e244.jpg for each locus An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e245.jpg. The allele frequencies in population An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e246.jpg were then drawn from a beta distribution with parameters An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e247.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e248.jpg, where An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e249.jpg is the An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e250.jpg for population An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e251.jpg. No linkage disequilibrium was considered here. The distance between a pair of populations was determined by the populations' $F_{ST}$ values with respect to the ancestral population. Only when An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e253.jpg is chosen to be the same for all populations to be simulated does it become an estimate of An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e254.jpg for all the populations.
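A minimal sketch of this kind of simulation is given below. It is not the authors' program. It uses the standard Balding-Nichols parametrization, in which the population allele frequency is drawn from a beta distribution with parameters p(1 − F)/F and (1 − p)(1 − F)/F for ancestral frequency p and population-specific F; the exact expressions and the range of the uniform distribution are not legible in this copy of the text, so the range 0.1 to 0.9 and all names and values in the code are assumptions made for illustration.

```python
import random


def balding_nichols_frequencies(ancestral_p, fst_values, rng):
    """Draw one allele frequency per population around an ancestral frequency."""
    freqs = []
    for fst in fst_values:
        scale = (1.0 - fst) / fst
        a = ancestral_p * scale
        b = (1.0 - ancestral_p) * scale
        freqs.append(rng.betavariate(a, b))
    return freqs


def simulate_genotypes(freq, n_individuals, rng):
    """Draw diploid allele counts (0, 1, 2), assuming Hardy-Weinberg equilibrium."""
    return [(rng.random() < freq) + (rng.random() < freq)
            for _ in range(n_individuals)]


rng = random.Random(2010)
fst_values = [0.01, 0.01, 0.01]      # one F value per population (example values)
ancestral_p = rng.uniform(0.1, 0.9)  # assumed range, see the note above
freqs = balding_nichols_frequencies(ancestral_p, fst_values, rng)
genotypes = [simulate_genotypes(f, 5, rng) for f in freqs]
```

With this parametrization the drawn frequencies have mean p and variance F p(1 − p), so a smaller F keeps each population closer to the ancestral frequency. Markers are drawn independently, in line with the statement that no linkage disequilibrium was considered.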

Simulations of association tests

Our simulations for association tests were similar to those reported in [9]. One of the differences between our simulations and those in [9] is that we considered three populations rather than two populations. In our simulations, we assumed that the prior probability of sampling individuals from each of these three populations was the same, An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e255.jpg. We assumed that the ratio of disease prevalences in the three populations was 1:2:5, and that these prevalences were very small. The numbers of cases and controls simulated for each population were their expected values, namely, 30, 60, and 150 for cases and 80, 80, and 80 for controls.
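The expected counts follow from weighting the sampling priors by the relative prevalences. The short sketch below reproduces the arithmetic; the totals of 240 cases and 240 controls are inferred from the numbers quoted above, and the function name is hypothetical.

```python
from fractions import Fraction


def expected_counts(priors, weights, total):
    """Split `total` individuals in proportion to prior * weight."""
    products = [p * w for p, w in zip(priors, weights)]
    norm = sum(products)
    return [total * x / norm for x in products]


priors = [Fraction(1, 3)] * 3   # equal sampling probability for each population
prevalence_ratio = [1, 2, 5]    # relative disease prevalences
cases = expected_counts(priors, prevalence_ratio, 240)
# For a very rare disease, controls follow the sampling priors almost exactly.
controls = expected_counts(priors, [1, 1, 1], 240)
print([int(c) for c in cases], [int(c) for c in controls])
# [30, 60, 150] [80, 80, 80]
```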

For each individual in each population, we simulated four different categories of SNPs. The first and second categories were generated using allele frequencies based on the Balding-Nichols model [21] with An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e256.jpg or An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e257.jpg for all populations, and the SNPs were thus population-structure-informative (StruInfo SNPs). The genotype data were generated differently for the first and second categories. For the first category, data for both cases and controls were generated in the same way, by using Hardy-Weinberg equilibrium, and hence were not associated with the disease. SNPs in the first category were thus referred to as Null-StruInfo SNPs. We simulated genotypes of 10,000 Null-StruInfo SNPs. They were first used to infer the variance-covariance matrix of the three populations and the sample PCs and then served as replicates for estimating the type I error rate. For the second category of SNPs, referred to as Causal-StruInfo SNPs, the genotypes were simulated differently for cases and controls. For controls, the simulation of genotypes was the same as for the Null-StruInfo SNPs, whereas for cases, we used a risk model with a relative risk An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e258.jpg for the causal allele. A case individual was assigned genotype 0, 1, or 2 with probabilities An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e259.jpg, or An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e260.jpg, respectively, where An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e261.jpg is the causal allele frequency.
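The genotype probabilities themselves are not legible in this copy of the text. As an illustration only, the sketch below assumes a multiplicative risk model on top of Hardy-Weinberg proportions, a form often used in simulations of this kind, in which each copy of the causal allele multiplies the risk by R. The function name and the example values of p and R are hypothetical, and the formula should not be read as the authors' exact expression.

```python
def case_genotype_probabilities(p, relative_risk):
    """Genotype probabilities (0, 1, 2 causal alleles) among cases.

    Assumes Hardy-Weinberg proportions in the population and a multiplicative
    risk model: each copy of the causal allele multiplies risk by relative_risk.
    """
    r = relative_risk
    weights = [(1 - p) ** 2, 2 * r * p * (1 - p), (r * p) ** 2]
    total = sum(weights)  # algebraically equal to (1 - p + r * p) ** 2
    return [w / total for w in weights]


# Hypothetical values, chosen only to exercise the function.
probs = case_genotype_probabilities(p=0.2, relative_risk=1.5)
```

Under this assumed model, setting the relative risk to 1 recovers the Hardy-Weinberg proportions used for controls, and the causal allele frequency among cases becomes Rp / (1 − p + Rp).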

We also simulated 10,000 SNPs for each of the third and fourth categories. The third category of SNPs, referred to as Null-Specific SNPs, were disease-independent, like the Null-StruInfo SNPs, but had a fixed allele frequency set for the three populations. For the fourth category, Causal-Specific SNPs, the allele frequencies for the three populations were also fixed, but the cases and controls were simulated differently, as for the Causal-StruInfo SNPs, using the same relative risk, An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e262.jpg. We used two different allele frequency sets for the specific SNPs: An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e263.jpg, and An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e264.jpg; and An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e265.jpg and An external file that holds a picture, illustration, etc.
Object name is pone.0012510.e266.jpg. In the first set, the ratio of the allele frequencies of the populations was the same as that of the sample sizes. In the second set, the ratio of the allele frequencies was the inverse of that of the sample sizes. The Null-Specific SNPs were intended for estimating type I error rate and the Causal-Specific SNPs for estimating power for a specific allele frequency set in the populations.

Following [8], we used the Armitage trend χ² statistic for association tests without stratification correction (without-correction), and we used the generalized Armitage trend χ² statistic when stratification was corrected for by regressing out the sample PCs (Eigenstrat) or the representative PCs (popu-Eigenstrat). For a fourth test, we used the population label as a covariate (covariate-adjustment). Association statistics producing a P value [equation image pone.0012510.e269] were reported as significant.
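
As a rough illustration of the uncorrected and corrected tests, the Python sketch below assumes the formulation commonly used with Eigenstrat: the trend statistic is N times the squared correlation between genotype and phenotype, and the generalized statistic is (N − K − 1) times the squared correlation after K axes of variation have been regressed out of both. For the adjustment it relies on the equivalence stated in the abstract, that regressing out the theoretical PCs amounts to removing the population mean, and it takes K to be the number of populations minus one. That choice of K, the function names, and the toy data are assumptions made for illustration.

```python
def squared_corr(x, y):
    """Squared Pearson correlation; 0.0 if either input is constant."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def trend_stat(genotypes, phenotypes):
    """Uncorrected trend statistic, taken here as N * r**2."""
    return len(genotypes) * squared_corr(genotypes, phenotypes)


def remove_group_means(values, labels):
    """Subtract from each value the mean of its own population."""
    sums, counts = {}, {}
    for v, lab in zip(values, labels):
        sums[lab] = sums.get(lab, 0.0) + v
        counts[lab] = counts.get(lab, 0) + 1
    return [v - sums[lab] / counts[lab] for v, lab in zip(values, labels)]


def adjusted_trend_stat(genotypes, phenotypes, labels):
    """Generalized statistic, taken here as (N - K - 1) * r**2 on
    population-mean-centered data, with K = populations - 1 (assumed)."""
    k = len(set(labels)) - 1
    g = remove_group_means(genotypes, labels)
    y = remove_group_means(phenotypes, labels)
    return (len(g) - k - 1) * squared_corr(g, y)


# Toy data: population B has both more causal-looking alleles and more
# cases, but genotype and phenotype are unrelated within each population.
labels = ["A"] * 6 + ["B"] * 6
genotypes = [0, 1, 0, 1, 0, 1, 1, 2, 1, 2, 1, 2]
phenotypes = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]

raw = trend_stat(genotypes, phenotypes)
adjusted = adjusted_trend_stat(genotypes, phenotypes, labels)
```

In this toy example the uncorrected statistic is positive purely because of the difference between populations, whereas the adjusted statistic is zero up to rounding.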

Programs and scripts used in this work are available at https://cge.mdanderson.org/dma/User/ProgramsScripts/popuPCA/ and comprise (a) VarCov, a C++ program for calculating the sample variance-covariance matrix from genotype data; (b) EstimateSigma, a C++ program for estimating the variance-covariance parameters from the sample variance-covariance matrix; (c) ReducedMat.awk, an awk script for calculating the reduced matrix from the variance-covariance parameters and the sample sizes; and (d) an R script for calculating the eigenvalues and eigenvectors of the reduced matrix.
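
To give a sense of what the first step of such a pipeline involves, the Python sketch below computes a sample variance-covariance matrix between individuals from genotype data. It is not the VarCov program: its conventions are not described in this text, so the sketch assumes the normalization commonly used with Eigenstrat, in which each SNP is centered by its mean allele count and scaled by sqrt(p(1 − p)), with p estimated as half the mean. The function name and the toy genotypes are hypothetical.

```python
import math


def sample_covariance(genotypes):
    """Covariance between individuals from a genotype matrix.

    genotypes[i][s] is the allele count (0, 1, or 2) of individual i at
    SNP s. Each SNP is centered and scaled (ASSUMED Eigenstrat-style
    normalization), then products are averaged over the SNPs used.
    """
    n = len(genotypes)
    n_snps = len(genotypes[0])
    columns = []
    for s in range(n_snps):
        col = [genotypes[i][s] for i in range(n)]
        mu = sum(col) / n
        p = mu / 2.0                      # estimated allele frequency
        if p <= 0.0 or p >= 1.0:
            continue                      # skip monomorphic SNPs
        scale = math.sqrt(p * (1.0 - p))
        columns.append([(g - mu) / scale for g in col])
    if not columns:
        raise ValueError("no polymorphic SNPs to use")
    used = len(columns)
    return [
        [sum(c[i] * c[j] for c in columns) / used for j in range(n)]
        for i in range(n)
    ]


# Toy data: individuals 0 and 1 resemble each other, as do 2 and 3.
toy_genotypes = [
    [0, 1, 2],
    [0, 0, 2],
    [2, 1, 0],
    [2, 2, 0],
]
cov = sample_covariance(toy_genotypes)
```

The eigenvalues and eigenvectors of such a matrix can then be obtained with any standard linear algebra routine.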

Supporting Information

Text S1. Derivation of Equations (8–10). (0.03 MB PDF)

Text S2. Derivation of Equation (28) and Equation (34). (0.03 MB PDF)

Text S3. Derivation of Equation (44). (0.02 MB PDF)

Footnotes

Competing Interests: The authors have declared that no competing interests exist.

Funding: This work was supported by National Institutes of Health grants R01ES09912, R01CA133996, P01CA034936, P30ES007784, and P50CA016672. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Menozzi P, Piazza A, Cavalli-Sforza L. Synthetic maps of human gene frequencies in Europeans. Science. 1978;201:786–792.
2. Cavalli-Sforza L, Menozzi P, Piazza A. The History and Geography of Human Genes. Princeton University Press; 1994.
3. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004.
4. Pritchard J, Donnelly P. Case-control studies of association in structured or admixed populations. Theor Popul Biol. 2001;60:227–237.
5. Pritchard J, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959.
6. Falush D, Stephens M, Pritchard J. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003;164:1567–1587.
7. Tang H, Coram M, Wang P, Zhu X, Risch N. Reconstructing genetic ancestry blocks in admixed individuals. Am J Hum Genet. 2006;79:1–12.
8. Patterson N, Price A, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2:2074–2093.
9. Price A, Patterson N, Plenge R, Weinblatt M, Shadick N, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006;38:904–909.
10. Zhu X, Zhang S, Zhao H, Cooper RS. Association mapping, using a mixture model for complex traits. Genet Epidemiol. 2002;23:184–196.
11. Yu K, Wang Z, Li Q, Wacholder S, Hunter D, et al. Population substructure and control selection in genome-wide association studies. PLoS ONE. 2008;3:e2551.
12. Price A, Butler J, Patterson N, Capelli C, Pascali VL, et al. Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 2008;4:0009–0017.
13. Tian C, Plenge RM, Ransom M, Lee A, Villoslada P, et al. Analysis and application of European genetic substructure using 300K SNP information. PLoS Genet. 2008;4:e4.
14. Heath SC, Gut IG, Brennan P, McKay JD, Bencko V, et al. Investigation of the fine structure of European populations with applications to disease association studies. Eur J Hum Genet. 2008;16:1413–1429.
15. Nelis M, Esko T, Mägi R, Zimprich F, Zimprich A, et al. Genetic structure of Europeans: a view from the north-east. PLoS ONE. 2009;4:e5472.
16. Yamaguchi-Kabata Y, Nakazono K, Takahashi A, Saito S, Hosono N, et al. Japanese population structure, based on SNP genotypes from 7003 individuals compared to other ethnic groups: effects on population-based association studies. Am J Hum Genet. 2008;83:445–456.
17. McVean G. A genealogical interpretation of principal components analysis. PLoS Genet. 2009;5:e1000686.
18. Gibbs R, Belmont J, Hardenbol P, Willis T, Yu F, et al. The International HapMap Project. Nature. 2003;426:789–796.
19. Johnson R, Wichern D. Applied Multivariate Statistical Analysis. Englewood Cliffs, NJ: Prentice-Hall; 1982.
20. Lee C, Abdool A, Huang C. PCA-based population structure inference with generic clustering algorithms. BMC Bioinformatics. 2009;10(suppl 1):S73.
21. Balding D, Nichols R. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica. 1995;96:3–12.

Articles from PLoS ONE are provided here courtesy of Public Library of Science