PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of plosgenPLoS GeneticsSubmit to PLoSGet E-mail AlertsContact UsPublic Library of Science (PLoS)View this Article
 
PLoS Genet. 2010 September; 6(9): e1001117.
Published online 2010 September 16. doi:  10.1371/journal.pgen.1001117
PMCID: PMC2940725

Analysis of Population Structure: A Unifying Framework and Novel Methods Based on Sparse Factor Analysis

Bruce Walsh, Editor

Abstract

We consider the statistical analysis of population structure using genetic data. We show how the two most widely used approaches to modeling population structure, admixture-based models and principal components analysis (PCA), can be viewed within a single unifying framework of matrix factorization. Specifically, they can both be interpreted as approximating an observed genotype matrix by a product of two lower-rank matrices, but with different constraints or prior distributions on these lower-rank matrices. This opens the door to a large range of possible approaches to analyzing population structure, by considering other constraints or priors. In this paper, we introduce one such novel approach, based on sparse factor analysis (SFA). We investigate the effects of the different types of constraint in several real and simulated data sets. We find that SFA produces similar results to admixture-based models when the samples are descended from a few well-differentiated ancestral populations and can recapitulate the results of PCA when the population structure is more “continuous,” as in isolation-by-distance models.

Author Summary

Two different approaches have become widely used in the analysis of population structure: admixture-based models and principal components analysis (PCA). In admixture-based models each individual is assumed to have inherited some proportion of its ancestry from one of several distinct populations. PCA projects the individuals into a low-dimensional subspace. On the face of it, these methods seem to have little in common. Here we show how in fact both of these methods can be viewed within a single unifying framework. This viewpoint should help practitioners to better interpret and contrast the results from these methods in real data applications. It also provides a springboard to the development of novel approaches to this problem. We introduce one such novel approach, based on sparse factor analysis, which has elements in common with both admixture-based models and PCA. As we illustrate here, in some settings sparse factor analysis may provide more interpretable results than either admixture-based models or PCA.

Introduction

The problem of analyzing the structure of natural populations arises in many contexts, and has attracted considerable attention. For example, methods for analyzing population structure have been used in studies of human history [1], [2], conservation genetics [3], domestication events [4], and to correct for cryptic population stratification in genetic association studies [5][7].

Two types of methods for analyzing population structure have become widely used: methods based on admixture models, such as those implemented in the software packages structure [6], [8], FRAPPE [9], SABER [10], and admixture [11]; and principal components analysis (e.g., [7], [12]), such as is implemented in the program SmartPCA [13]. In admixture-based models each individual is assumed to have inherited some proportion of its ancestry from each of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e001.jpg distinct populations. These proportions are known as the admixture proportions of each individual, and a key goal of these methods is to estimate these proportions and the allele frequencies of each population. Principal components analysis (PCA) can be thought of as projecting the individuals into a low-dimensional subspace in such a way that the locations of individuals in the projected space reflects the genetic similarities among them. For example, when the population structure conforms to a simple isolation-by-distance model with homogeneous migration then PCA effectively recapitulates the geographic locations of individuals [14], [15].

At first sight, these two different approaches to analysis of population structure appear to have little in common. For example, admixture-based methods involve an explicit model, whereas PCA, as usually described, does not. In this paper we describe how these approaches can be viewed within a single unifying framework. Specifically, they are both examples of low-rank matrix factorization with different constraints on the factorized matrices (e.g., [16]). Motivated by this general view we also consider a new method for analyzing population structure, sparse factor analysis (SFA), which lies in this same model class. We perform parameter estimation for SFA using a version of the expectation maximization (EM) algorithm, enabling application of SFA to genome-wide data.

We compare and contrast these three different methods on a range of real data and simulated examples. We find that SFA produces similar results to admixture-based models when the data conform to discrete and admixed populations, and can produce results similar to PCA when allele frequencies vary continuously with geography. Placing these different methods into a single framework also greatly aids comparisons among the methods, and provides helpful insights into why they may produce different results in practical applications.

Population structure via low-rank matrix factorization

In this section, we describe how admixture-based models and PCA can be viewed as factorizing an observed genotype matrix An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e002.jpg into a product of two low-rank matrices. We assume that An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e003.jpg contains the genotypes of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e004.jpg individuals at An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e005.jpg SNPs with genotypes coded as An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e006.jpg copies of a reference allele. Then both admixture-based models and PCA can be framed as models in which:

equation image
(1)

or, equivalently,

equation image
(2)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e009.jpg is a An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e010.jpg matrix and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e011.jpg is a An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e012.jpg matrix, where An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e013.jpg is typically small (Figure 1) (see Table 1 for a complete list of terms and constraints). In this framework, the primary difference between the approaches lies in the constraints or prior distributions placed on matrices An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e014.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e015.jpg as follows.

Figure 1
Low-dimensional matrix factorization via factor analysis.
Table 1
Relationship of terms in PCA, SFA, and admixture-based models.

Admixture-based models

Under admixture-based models (as found in, e.g., structure [17] and related work), explicitly marginalizing the multinomial latent variables representing individual- and SNP-specific ancestry, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e029.jpg is assumed to be distributed as binomial An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e030.jpg, with An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e031.jpg, where An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e032.jpg is the admixture proportion of individual An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e033.jpg in population An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e034.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e035.jpg is the allele frequency of the reference allele in population An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e036.jpg. It follows that An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e037.jpg, as in Equation 2 above with An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e038.jpg. Thus, admixture-based models can be viewed as performing the matrix factorization (Equation 1) with the following constraints on An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e039.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e040.jpg: the elements of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e041.jpg are constrained to be non-negative with each column summing to one; the elements of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e042.jpg are constrained to lie within An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e043.jpg. In Bayesian applications of this model, priors are placed on An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e044.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e045.jpg, which can be thought of as imposing additional “soft” constraints on the matrices.

Principal component analysis

PCA can be derived by considering the model An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e046.jpg. Specifically, consider maximizing the likelihood of this model with respect to parameters An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e047.jpg, subject to the constraints: i) the An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e048.jpg columns of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e049.jpg are orthogonal (so An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e050.jpg is diagonal); ii) the An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e051.jpg rows of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e052.jpg are orthonormal (so An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e053.jpg). Then the columns of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e054.jpg and rows of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e055.jpg give the principal components (PCs) and the corresponding PC loadings. To see this, consider performing the constrained optimization via singular value decomposition (SVD) of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e056.jpg: if An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e057.jpg is the SVD for An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e058.jpg, then setting An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e059.jpg to the first An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e060.jpg columns of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e061.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e062.jpg to the first An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e063.jpg rows of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e064.jpg satisfies the constraints and maximizes the likelihood (by standard results on optimality of the SVD; e.g., [18]). However, PCA can be performed in exactly the same way, and so the result follows.

Placing these two approaches to the analysis of population structure within a single framework helps illuminate some of their similarities and differences. For example, we can view both methods as attempting to approximate each individual's genotype vector by a linear combination of allele frequencies (Figure 2 illustrates different but equivalent linear combinations), but the admixture-based models are more restrictive because they insist on this linear combination being a convex combination (the admixture proportions must be non-negative and sum to one). This restriction makes sense if the study individuals conform closely to this assumption – that is, if each individual is indeed an admixture of a small number of ancestral populations – and in this case imposing this restriction leads to improved interpretability (each factor in An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e065.jpg corresponds to the allele frequencies of an ancestral population). On the other hand, where the study individuals do not conform closely to this assumption, such as in isolation-by-distance models considered later, the less restrictive approach of PCA may enable the representation of a wider range of underlying structure.

Figure 2
Illustration of two different ways that African and European individuals could be represented.

Furthermore, viewing both methods within the framework of matrix factorization immediately suggests many alternative approaches to analyzing population structure. By modifying the constraints or priors on the matrices, one may hope to develop better methods for different latent structures. To illustrate this possibility, we consider here a version of sparse factor analysis (SFA) where the key idea is to encourage the An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e070.jpg matrix to be sparse, attempting to represent each individual as a linear combination of a small number of underlying factors, without constraints (e.g., orthogonality) on the factors. Intuitively, sparsity can lead to more interpretable results than PCA, while the use of general linear combinations (and not only convex combinations) maintains flexibility in capturing a wider range of underlying structures. There are several different approaches to SFA (e.g., [19][22]); here we use a novel approach described below. Other possible methods for matrix factorization that may be appropriate for this problem include non-negative matrix factorization [23], and sparse PCA (e.g., [24]). We summarize results from these methods in our Discussion.

Sparse factor analysis

We now briefly describe our novel approach to SFA; see Methods for further details. The SFA model assumes An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e071.jpg, and encourages sparsity in the An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e072.jpg matrix by putting a prior on its elements (thus sparsity is a “soft” constraint, rather than a hard requirement). Specifically we use the automatic relevance determination (ARD) prior [25][27], which assumes An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e073.jpg where the variances An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e074.jpg are hyper-parameters that are estimated by maximum likelihood. If the data are consistent with a small absolute value of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e075.jpg then An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e076.jpg will be estimated to be small, which results in strong shrinkage of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e077.jpg towards zero, inducing sparsity where it is consistent with the data. To ensure identifiability we constrain the rows of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e078.jpg to have unit variance, which effectively determines the scale of the columns of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e079.jpg; other than this we place no orthogonality constraints or prior distributions on An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e080.jpg (unlike most applications of factor analysis; see also [28]).

Results

We use simulated and real human genotype data to compare and contrast SFA, PCA, and an admixture-based model, admixture [11]. (admixture typically produces results that are qualitatively similar to the results from structure, but is computationally more convenient for large data sets.) In particular, we will compare the matrices An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e081.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e082.jpg produced by each method (see above) in a variety of settings. For consistency of terminology we will refer to the columns of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e083.jpg as the loadings and the rows of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e084.jpg as the factors for each method. Because each method scales the absolute values of the factors (and loadings) in different ways, the absolute values of the factors (and loadings) are not comparable across methods, but the relative values are. Thus, when looking at the figures to follow, differences in the scales of the axes for different methods are irrelevant and should be ignored. A summary of the results with simple interpretations is in Table 2.

Table 2
Summary of results across PCA, SFA, and admixture-based models.

For PCA we follow the common practice (e.g., as in SmartPCA [13]) of first mean-centering the columns of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e096.jpg and standardizing them to have unit variance before applying PCA. This slightly complicates comparisons across methods because, formally, we are using PCA to factorize a different matrix than the other two methods. However, the results of PCA on the standardized matrix actually imply a factorization of the original matrix, but with one additional factor and corresponding loading. Specifically, the additional factor corresponds to the vector of genotype means and the additional loading corresponds to a vector of ones (see Text S1). To aid comparisons among the methods we explicitly include this additional factor and loading in the figures and discussions.

Discrete and admixed populations

For simplicity we begin by applying the methods to a small data set of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e097.jpg SNPs typed on An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e098.jpg unrelated HapMap individuals: An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e099.jpg Europeans, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e100.jpg Africans, and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e101.jpg Chinese and Japanese (data from [29]). In these data, the three continental groups are well separated, making interpretation of the results relatively straightforward and selection of an appropriate number of factors simple. (We discuss the issue of selecting an appropriate number of factors later.) We ran SFA and admixture with three factors; since both of these methods involve a numerical optimization we ran each An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e102.jpg times, using An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e103.jpg different random starting points, and in each case the results were effectively identical across runs.

Figure 3 compares the loadings from SFA and admixture with the first three PCA loadings. All three methods clearly separate out the three groups, but SFA and admixture produce qualitatively different results from PCA. In particular, in SFA and admixture, each individual has appreciable loading on only one of the three factors; from this we infer that the three corresponding factors each represent the allele frequencies of a single continental group. In contrast, in PCA, each individual has appreciable loading on all three factors, and the factors themselves do not have such a straightforward interpretation.

Figure 3
Results of applying SFA, PCA, and admixture to the HapMap genotype data.

In some ways the different representations obtained by SFA, PCA, and admixture are equivalent: the resulting matrix product, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e107.jpg, from each method is essentially identical (not shown). However, in this case we view the results of SFA and admixture as more easily interpretable. Specifically, the three SFA and admixture factors correspond to the Asian, African, and European allele frequencies, respectively. In contrast, the first PCA factor corresponds to the overall mean allele frequency, and subsequent factors correspond to other linear combinations of the allele frequencies in each group. These differences are driven by the different constraints on the An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e108.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e109.jpg matrices, not by one factorization fitting the data better. Note that, although PCA is forced into using the mean allele frequencies as its first factor by our following the common practice of applying it to the standardized genotype matrix with the genotype means removed, in this case PCA produces almost identical results when applied to the original genotype matrix (results not shown).

One consequence of SFA and admixture factors corresponding to individual group frequencies is that their results are more robust to the number of individuals included from each group. For example, when we removed half of the Africans from the sample and reran the methods, the results from SFA and admixture were essentially unchanged, whereas PCA results changed more appreciably (Figure S1). The intuition here is that, for SFA and admixture, removing some African individuals has only a small effect on the factor corresponding to Africans (because the sample African allele frequencies change slightly) and a negligible effect on the factors corresponding to the European and Asian individuals. These small changes in the factors translate into correspondingly small changes in the loadings for each remaining individual. In contrast, removing half of the Africans changes all three PCA factors: the modified sample has a different overall mean allele frequency (first factor), and this has a cascading effect on subsequent factors and their loadings. Indeed, the general lack of robustness of PCA to sampling scheme is well known [30], [31].

In more complex settings, we have also found SFA and admixture to be more robust than PCA to sampling scheme. We illustrate this using data on An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e111.jpg SNPs typed in An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e112.jpg individuals from An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e113.jpg worldwide populations, including the HapMap individuals considered above plus the Human Genome Diversity Panel [29]. These data contain a much higher proportion of individuals with European or Asian ancestry than the HapMap data alone. Analyzing these data with three factors, SFA and admixture produce loadings for the HapMap individuals that are essentially identical to those obtained from the analysis of the HapMap individuals alone (Pearson correlation An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e114.jpg for SFA; An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e115.jpg for admixture). In contrast, the corresponding PCA loadings change more substantially (correlation An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e116.jpg).

Isolation by distance models

We now compare the methods on some simple isolation-by-distance scenarios, involving both one dimensional and two dimensional habitats. For the 1-D habitat we assume An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e117.jpg demes equally-spaced on a line, and for the 2-D habitat we assume An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e118.jpg demes arranged uniformly on a An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e119.jpg by An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e120.jpg square grid. In each case demes are assumed to exchange migrants in each generation with neighboring demes. We applied PCA, SFA and admixture to data from both 1-D and 2-D simulations.

In the 1-D scenario, for each method, two factors suffice to capture the underlying geographical structure (Figure 4). However, as for the discrete data considered above, the interpretations of the resulting factors differ across methods. In SFA and admixture, the two factors represent, roughly, the allele frequencies near either end of the line (Figure 5). The genotype of each individual along the line is then naturally approximated by a linear combination of these two factors, with weights determined by their position along the line (e.g., individuals near the center of the line have roughly equal weight on the two factors). The loadings in SFA seem to capture the underlying structure slightly better near either end of the line than those from admixture, whose loadings effectively saturate at zero on the first and last third of each line. This may partly reflect the constraint that the admixture loadings must sum to one, but may also be exacerbated by the assumption of a binomial distribution, and in particular the assumption of a binomial variance. In contrast, in PCA, the first factor represents the mean allele frequencies and the second represents a difference between the allele frequencies near either end of the line. Thus PCA represents each individual as the mean allele frequency, plus the allele frequency difference weighted according to the location of the individual relative to the center (the weight being zero for individuals near the center of the line, positive at one end of the line, and negative at the other). Again, this behavior is not solely due to our applying PCA to the standardized genotype matrix: it produces almost identical results when applied to the original genotype matrix (results not shown).

Figure 4
Estimated factor loadings from PCA, SFAm, SFA, and admixture for the 1-D isolation-by-distance simulation.
Figure 5
Estimated scaled factors from SFA and admixture on the 1-D isolation-by-distance simulation against the generating allele frequencies.

For the 2-D scenario (Figure 6), the methods differ more substantially in their results. In particular they differ in the number of factors that they need to model the underlying geographical structure.

Figure 6
Results of SFA, PCA, SFAm, and admixture applied to simulated genotype data from a single 2-D habitat.

Due to the convexity constraint, admixture requires four factors, corresponding roughly to the allele frequencies at the four corners of the square habitat. (This result depends on the shape of the habitat; intuitively, the convexity constraint means that admixture needs a factor for each extreme point of a convex habitat.) Even then, the 2-D structure is only easy to visualize after the four factor loadings have been mapped into two dimensions (see Methods). As in the 1-D setting, the loadings for individuals near the edges of the grid saturate near zero or one.

In contrast, both PCA and SFA can capture the structure using three factors, although again they accomplish this in different ways. PCA uses the mean allele frequencies as the first factor, and then two factors that represent deviations from this mean in two orthogonal directions (e.g., the diagonals of the square). As a result the PCA loadings on the second and third factors effectively recapitulate the geography of the space, as previously observed [14], [15], [30].

The results from SFA are more complicated to describe. All three factors represent linear combinations of the allele frequencies on the grid, where the weights of these allele frequencies vary in a consistent way along a particular direction. For example, in the first row of Figure 6B, the first factor has increasing weight as one moves from the bottom to the top of the grid. The result is that the loadings from any two factors recapitulate a skewed version of the geography.

In both of these settings, particularly the 2-D case, the PCA loadings seem to have the simplest interpretation. This is because, after subtracting the genotype mean, the 1-D structure can be captured by a single factor, and the 2-D structure captured by two factors, in each case yielding an attractive geographical interpretation. Thus PCA's use of the mean allele frequency as its first factor, which hinders interpretability in the discrete case, actually aids interpretability in settings with more continuous structure.

However, the use of the mean allele frequencies as the first factor need not be limited to PCA. In particular it is straightforward to modify SFA to behave in a similar way, either by applying it to the genotype matrix with the genotype means subtracted, or by modifying the model to include a mean term (i.e., a factor for which all individuals have loading one). We take the later path here because we think there are advantages to estimating the mean along with the factors, rather than as a preprocessing step. We refer to this approach as SFAm; see Methods for details. Applying SFAm to both the 1-D and 2-D scenarios produces results that are effectively identical to PCA, recapitulating the geographic structure in one or two additional factors respectively (Figure 4 and Figure 6).

In summary, the fact that the first factor in PCA represents the mean allele frequencies is responsible both for the fact that it produces less interpretable factors in the discrete case and more interpretable results in the continuous case. Because SFA provides the flexibility of choice whether or not to include the mean, it can produce interpretable results in both scenarios. Indeed, in the discrete case SFA effectively recapitulates the results of admixture, and in the continuous settings SFAm effectively recapitulates the results of PCA.

Mixture of continuous and discrete populations

To illustrate the potential for SFA to produce new insights in population structure analyses, we now present a hypothetical example for which SFA seems better suited than either admixture or PCA. For this simulation we generated samples from two independent 2-D habitats, so the data have both discrete structure (between the habitats) and continuous structure (within each habitat) (Figure 7A).

Figure 7
Results on simulated genotype data from a two independent 2-D habitats.

We applied PCA, SFA and admixture to these data. Because SFA effectively requires three factors to capture a 2-D structure, we expected it to require six factors to capture this mixture of two 2-D structures, and so we applied SFA with six factors. By analogous reasoning we applied admixture with eight factors.

Reassuringly, SFA behaved as one might predict from the results on discrete and continuous simulations above: three factors were used to represent each of the two 2-D habitats. In particular SFA successfully captured the discrete structure in this case, in that individuals from the first habitat have near-zero loadings on the factors corresponding to the second habitat, and vice versa (Figure 7B). These results were consistent across multiple runs from different random starting points.

In contrast, admixture produced less consistent results from multiple runs (results not shown). In about An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e124.jpg of runs it behaved as we might have hoped, using four factors to represent the corners of each of the two habitats, and effectively capturing both the continuous and the discrete structure. In other cases admixture would converge to alternative solutions, for example using five factors for one habitat and three for another.

PCA produced qualitatively different results, with each individual having a non-zero loading on most factors. The second PCA loading is straightforward to interpret, since it separates individuals from the two habitats. However, subsequent PCA loadings, while jointly capturing the underlying structure, are geometrically beautiful but individually difficult to interpret (Figure 7C).

In this case we view the results from SFA as preferable to those from admixture or PCA. In particular, in a real data analysis, where the underlying structure is unknown, we think that we would more easily deduce the underlying structure (Figure 7A) from the results of SFA (Figure 7B) than from the results of PCA (Figure 7C). However, we could envisage results that are still more interpretable than those from SFA. In particular, one could imagine developing a method (e.g., by appropriate constraints or priors on the matrices) that mimics the results from SFAm or PCA on the single 2-D habitat. That is, one could imagine a method that uses three factors for each 2-D habitat: one factor to be the mean allele frequency, and two factors to capture the geography. Incorporating a single mean term, as do SFAm and PCA, does not achieve this goal because a single mean term does not capture the different mean allele frequencies of the two independent habitats.

Clustered sampling from a continuous population

Up to now we have avoided discussion of automatic selection of an appropriate number of factors, instead relying on intuition and heuristic arguments to guide this selection. In principle one could attempt to formalize this process within a model-selection framework, since SFA has an underlying probabilistic model. However, automatic selection of an appropriate number of factors is difficult, not least because in many practical applications there does not exist a single “correct” number of factors. For example, our 1-D simulations involved An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e125.jpg discrete populations exchanging migrants locally, so in some sense a “correct” number of factors is An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e126.jpg, but for realistic-sized data sets reliably identifying An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e127.jpg factors will not be possible, and analyzing the data with An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e128.jpg factors is unlikely to yield helpful insights. Note that interpretability of factors does not necessarily correspond with statistical significance: in isolation by distance scenarios many PCA factors may be statistically significant [13], but usually only the first few are easily interpretable, with additional factors representing mathematical artifacts [30]. For these reasons, in practice it can be helpful to run methods such as admixture and SFA multiple times, with different numbers of factors, to see what different insights may emerge. (PCA need only be run once, because adding additional factors does not change existing factors.)

To illustrate these issues we applied the methods to a situation that mimics clustered sampling from a continuous habitat; specifically we used samples of twenty individuals from each of five evenly-spaced demes from the 1-D simulation above. These samples can be represented in either a low-dimensional way, as five clusters along a continuum, or a higher-dimensional way, as five distinct populations.

Applying SFA to these data (Figure 8A), we obtain qualitatively different results depending on the number of factors used: with two factors the SFA loadings represent the five demes as five points along a line (so each factor corresponds, roughly, to the allele frequencies near each end of the line), whereas, with five factors, the SFA loadings separate the five demes into discrete groups (so each factor corresponds to the allele frequencies within a single deme).

Figure 8
Results from SFA, admixture, and PCA for the clustered 1-D simulation.

Applying admixture to these data (Figure 8B), we obtain similar results as for SFA, except that in the two factor case the five groups are compressed into three groups. Thus, as with the 1-D isolation-by-distance simulations, admixture tends to over-discretize continuous variation.

Applying PCA to these data (Figure 8C), the first two factors capture the continuous variation along the line, as in the 1-D simulations. Subsequent factors each distinguish finer-scale structure among the five demes, and the first five PCA factors, jointly, fully capture the structure. However, each factor is individually difficult to interpret. In particular, because computing additional PCA factors does not affect earlier factors, PCA never reaches a representation in which five factors each represent the allele frequencies of a single deme.

Applying SFAm to these data, with one factor plus the mean term, produces results almost identical to the first two factors of PCA (results not shown).

In summary, this simulation illustrates two important points. First, there is not necessarily a single “correct” number of factors: by applying methods such as SFA and admixture with different numbers of factors, we may obtain qualitatively different results that provide complimentary insights into the underlying structure. Second, SFA seems to be more flexible than either PCA or admixture in its ability to represent both discrete and continuous structure.

European genotype data

We now compare the three methods on a set of European individuals, consisting of genotype data on An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e130.jpg individuals at An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e131.jpg200,000 SNPs (after thinning to remove correlated SNPs). The collections and methods for the Population Reference Sample (POPRES) are described by [32]. Previous analyses of these and similar data using PCA have found that the first two PCA factors recapitulate the geography of Europe (e.g., [14], [15]).

Based on the results from the 2-D simulations, we chose to apply SFAm (with two factors plus a mean) here, rather than SFA. The results from SFAm are strikingly similar to those from PCA (Figure 9). In a few cases the sparsity-inducing prior we used in SFAm is evident, in that there is a slight tendency for factor loadings near zero to be shrunk closer to zero (appearing as faint diagonal lines of individuals in the rotated SFAm plot). However in general the effect of the sparsity-inducing prior is minimal in these kinds of situations, where the data do not actually exhibit sparsity. Different runs of SFAm produce alternative rotations of this same basic image.

Figure 9
Results from PCA, SFAm, and admixture for the POPRES European data.

As in the 2-D simulations, admixture with four factors is able to capture the geography, but only after these four factors have been mapped to a two-dimensional space (see Methods). As in the 1-D and 2-D simulations, admixture tends to push the data towards the extremes relative to PCA or SFAm, although this effect is substantially less prominent than in the simulations (perhaps due, in part, to the larger number of SNPs). The ability of admixture-based models to capture geography has been noted before [33].

All three methods are computationally tractable for data sets of this size. Of the three methods, PCA was fastest and admixture was slowest, but all three methods took less than a few hours on a modern desktop.

Admixture and Indian genotype data

Recall that, in settings with discrete structure, the SFA factors, like the admixture factors, correspond to the allele frequencies of each discrete populations. One consequence of this is that in settings involving admixed groups, the SFA loadings are highly correlated with the admixture proportions of each individual. Indeed, in some settings it is possible to translate the SFA loadings into estimates of admixture proportions. Specifically, if an individual An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e132.jpg has all positive loadings, and the loading on factor An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e133.jpg is An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e134.jpg, then An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e135.jpg is a natural estimate of that individual's admixture proportion from the population represented by factor An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e136.jpg. However, this estimate assumes implicitly that factors have all been scaled appropriately, which will only be true if the variance of the allele frequencies in the ancestral populations is similar (something that may well hold in many contexts, but would be difficult to check).

To compare all three methods on real data that appear to involve admixture, we consider the data from a recent study on individuals from India [2]. These data were sampled from An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e137.jpg “groups” geographically distributed across India; [2] hypothesized the different groups to be admixed between two ancestral population: ancestral north Indians (ANI) and ancestral south Indians (ASI). This is a challenging data set for admixture analysis because the sample contains no individuals representative of either of the two ancestral populations. For this reason, [2] uses a novel tree-based method (An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e138.jpg ancestry estimation, described in their supplemental information) to estimate the ancestry proportions of each group.

We applied PCA, SFA with two factors, and admixture with two factors to the genotype data from this study, after imputing the missing genotypes, removing some of the outlier populations as defined in the original study, and removing SNPs with a minor allele frequency less than An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e139.jpg (see Methods). We encountered problems applying SFA to these data with the low frequency SNPs included; specifically, SFA often converged to a solution where one individual had a very small residual variance term. All three methods produce very similar loadings (Figure S2) that correlate well with the ancestry proportions estimated in [2] (Pearson correlations of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e140.jpg for PCA, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e141.jpg for SFA, and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e142.jpg for admixture) (Figure 10).

Figure 10
Plot of estimated admixture proportions of each Indian group versus the relative admixture proportions from SFA on the Indian data set.

In one sense, the factor loadings provide more detailed ancestry information than the An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e143.jpg method, because the loadings are individual-specific rather than group-level. However, in this setting, the loadings provide measures of individual-specific ancestry that are reliable only in a relative sense. That is, they may correctly order the individuals in terms of their degree of ancestry in each ancestral population, but do not necessarily provide accurate ancestry proportions for each individual. For example, the estimated ancestry proportions from admixture range from An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e144.jpg to An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e145.jpg, whereas the group-level estimates from the An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e146.jpg method range from An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e147.jpg to An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e148.jpg. This reflects the difficulty of reliably estimating the ancestral population allele frequencies in the absence of any reference individuals from the ancestral populations.

Discussion

In this paper we have presented a unified view of the two most common methods to analyzing population structure – admixture-based models and PCA – by interpreting both as matrix factorization methods with different constraints on the matrices. This unification provides insights into the different behavior of these methods under various scenarios. For example, viewing admixture-based models as imposing a convexity constraint explains why these models would be expected to need four factors to capture the structure across a square habitat, whereas PCA requires only two factors plus a mean.

Viewing these methods as special cases of a much larger class of matrix factorization methods also immediately suggests many possible novel approaches to the analysis of population structure. Here we consider one such method, sparse factor analysis (SFA). We illustrate that SFA bridges the gap between PCA and admixture-based models by effectively recapitulating the results from admixture-based models in discrete population settings, and recapitulating the results from PCA in continuous settings. We also illustrate a scenario involving a mixture of discrete and continuous structure where SFA produces more interpretable results than either admixture-based models or PCA.

We have also experimented with two other matrix factorization approaches in the analysis of population structure: sparse principal components (SPC) [24] and non-negative matrix factorization [23]. SPC, implemented in the R function SPC in the R package PMA, computes sparse PCs by solving a penalized matrix factorization problem with an An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e149.jpg penalty (a penalty on the sum of the absolute values of the factor loadings) to encourage sparsity. The algorithm is greedy in that it computes the factors one at a time, each time removing the effect of the previous factors from the original matrix. The user can choose whether to require the factors to be orthogonal; in our experiments we did not require orthogonality. SPC has a user-defined tuning parameter that controls the level of sparsity. We found that, with careful choice of this parameter, we were able to get SPC to produce results similar to PCA when the data are continuous, and closer to an admixture-based model when the data are from discrete groups. In particular, the main difference from SFA was on the data from two independent 2-D habitats. where SPC did not model the two habitats in separate factors. (We were unable to apply SPC to the larger European and Indian data sets, due to limitations of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e150.jpg.)

As its name suggests, non-negative matrix factorization (NMF) [23], [34] constrains the factors and loadings to have non-negative values. For data sets considered here, we found that NMF typically produced results similar to SFA. However, NMF is less flexible than SFA in that it effectively requires the input matrix to be non-negative. In the genetic context this is not a big limitation as genotype data are most often encoded as non-negative integers (An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e151.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e152.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e153.jpg), but even here it makes NMF slightly less flexible. For example, this means that NMF cannot be applied to genotype data that have been mean-centered, and there is no sensible way to include a mean term as in SFAm. As we have seen, in some settings incorporating a mean improves the interpretability of the results.

The computational methods used to perform the matrix factorization for PCA, SFA, and admixture (and also structure) are practically quite different. In particular, the PCA factorization has a single global optimum that can be obtained analytically, and so multiple runs of PCA produce the same results. In contrast both admixture-based models and the SFA factorizations can have multiple local optima, and the computational algorithms used can produce different results depending on their starting point. In practice, in simple cases (e.g., involving a moderate number of discrete populations), both algorithms appear to produce consistent results across runs. In more complex situations we have found more variability in the results, particularly when the number of factors is large. In some cases there appear to be identifiability issues: for example, in the European data, multiple runs of SFAm produce loadings that are rotations of one another.

Another qualitative difference between the three methods is that PCA produces consistent results as more factors are added, whereas admixture-based methods and SFA may produce qualitatively different results with different numbers of factors. Although consistency may seem a desirable property, there can be benefits to the different perspectives obtained by using different numbers of factors, as we illustrated in the results. To further contrast these two behaviors, consider the application of these methods to data from a continuous 1-D habitat. As noted previously [30], the first PCA loading (after removing the mean) roughly captures position within the habitat, whereas subsequent loadings are sinusoidal functions of increasing frequency. In contrast, when SFA or admixture are run with an increasing number of factors, they redistribute their factors along the line so that each factor represents the average allele frequencies of an increasingly local region. (If too many factors are used, there is not enough signal in the data to differentiate populations on small neighboring segments, and the results become unreliable.) Although the additional factors in each case are qualitatively very different, they simply reflect different ways to capture finer-scale structure in the data. Which of these behaviors is preferable may be context-dependent, but understanding these differences is certainly helpful in interpreting the results of a data analysis.

Although we have focused on the different constraints imposed by different matrix factorization methods, they also differ in another way: their assumed error distribution. In particular, admixture-based models assume a binomial error, whereas PCA is based on a least-squares criterion, which can be interpreted as a Gaussian error, and our SFA explicitly assumes Gaussian error. The binomial error may be more appropriate for data from an admixed population, but in general it is less flexible than the Gaussian model because the binomial variance is determined by the mean, rather than being a free parameter. It seems possible that this partly explains the convergence problems we observed in admixture for the 2-D habitat, in which case it may be worth adapting the admixture model to assume a Gaussian error.

We note that there are several existing approaches to sparse factor analysis besides the novel approach that we introduce here [19][21], [35]. Although these methods have similar motivations, they differ in several respects, and we have found that these differences can substantially impact results (not shown). One advantage of our approach is its computational speed. Another feature of our approach is its lack of manually-tunable parameters (other than the number of factors). This, of course, is a double-edged sword, since on the one hand, it makes the method easy to apply, but on the other hand, reduces flexibility. In practice, as our results show, our approach is sufficiently flexible to deal with a range of contexts involving different levels of sparsity.

Our approach to SFA may also be useful in other contexts (e.g., gene expression data [22], [35] or collaborative filtering [36]). In some cases, particularly when the data do not exhibit much sparsity, it may be desirable to extend our method in various ways. For example, as we have implemented it here, SFA encourages sparsity only on the loadings, and in some contexts it may be desirable to encourage sparsity on both the factors and the loadings (as in the general penalized matrix decomposition method [24]). This could be achieved by putting an ARD prior on the elements of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e154.jpg, and applying an analog of our ECME algorithm. It may also be fruitful to consider ways to increase the sparsity in the loadings, since in some other contexts we have found that the ARD prior we use can be generous in its use of non-zero loadings. Finally, although we have argued that in the context of population structure that applying methods with different numbers of factors may yield more insight than selecting a single “correct” number of factors, this may not be equally true in all contexts. In particular, the population structure case is complicated by the fact that the factors are often highly correlated with one another (e.g., because they often represent allele frequencies in closely-related populations); in settings where factors are less correlated it may be more helpful to consider methods for automatically selecting the number factors (e.g., [37]).

Methods

Genotype simulations

We simulated genotypes from 1-D and 2-D habitats using the program ms [38], using stepping-stone models similar to [30]. In the 1-D model we assumed An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e155.jpg demes along a line and allowing a high level of migration (An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e156.jpg) between adjacent demes. This migration rate produced an An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e157.jpg of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e158.jpg between the two demes at either end of the line, which enables the two most extreme demes to be easily separable with An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e159.jpg SNPs. We sampled one diploid individual (two independent haplotypes) from each deme at An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e160.jpg independent SNPs.

For the 2-D simulations, we assumed An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e161.jpg demes arranged in a An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e162.jpg by An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e163.jpg square grid, with migration parameters An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e164.jpg between neighboring demes. We then sampled one diploid individual from each deme at An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e165.jpg independent SNPs. For the two 2-D habitat simulations, we simulated two independent sets of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e166.jpg demes and sampled a single individual from each deme at An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e167.jpg independent SNPs.

For both the simulated and the real genotype data, we encoded each genotype (AA, AB, or BB) as An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e168.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e169.jpg or An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e170.jpg.

POPRES European data

We used the POPRES European data set from [32], and processed the data as in [14]. The POPRES data set was obtained from dbGaP at http://www.ncbi.nlm.nih.gov/projects/gap/cgi- bin/study.cgi?study_id=phs000145.v1.p1 through dbGaP accession number phs000145.v1.p1. This data included An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e171.jpg individuals, each of whom identify all four grandparents as being from a particular European country, genotyped at An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e172.jpg SNPs, and pruned down to An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e173.jpg SNPs after removing one of any pair of SNPs that had an An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e174.jpg [14].

Since our SFA method does not currently deal with missing data, we imputed missing genotypes using impute2 [39]. We imputed each chromosome by intervals of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e175.jpgMb, starting at position An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e176.jpg, with a buffer of size An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e177.jpgMb on either side of the interval. We set the number of burn-in iterations to An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e178.jpg and the number of MCMC iterations to An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e179.jpg. We set the effective population size of the European sample to be An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e180.jpg, and we used the combined linkage maps from build An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e181.jpg, release An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e182.jpg (downloaded from the impute website). We used these imputed genotypes as input to all three methods to facilitate fair comparisons.

Indian data

We used the Indian genotype data from [2]. The original data includes An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e183.jpg individuals from An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e184.jpg groups; we removed the groups that appeared to be genetic outliers as described in the original paper (Sahariya, Nysha, Aonaga, Siddi, Great Andamanese, Hallaki, Santhal, Kharia, Onge, and Chenchu), leaving An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e185.jpg groups and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e186.jpg individuals with An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e187.jpg genotyped SNPs. We imputed missing genotypes using impute2 as above, but with an effective population size of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e188.jpg, and used these imputed genotypes as input to all three methods. After imputation, we pruned the data down to An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e189.jpg SNPs by removing one of any pair of SNPs that had an An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e190.jpg, and removing SNPs that had a minor allele frequency less than An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e191.jpg.

Sparse factor analysis

Let An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e192.jpg be the number of individuals in a sample and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e193.jpg be the number of genotypes. Represent each allele at a locus as a number (e.g., for SNPs from a diploid organism, as in our results above, represent An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e194.jpg as An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e195.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e196.jpg as An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e197.jpg, and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e198.jpg as An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e199.jpg). Our factor analysis model with An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e200.jpg factors can be written as:

equation image
(3)

or, equivalently,

equation image
(4)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e203.jpg is an An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e204.jpg data matrix, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e205.jpg is a An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e206.jpg-vector of column-specific means, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e207.jpg is the An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e208.jpg matrix of factor loadings, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e209.jpg is the An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e210.jpg matrix of factors, and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e211.jpg is an An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e212.jpg matrix with each element independently distributed An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e213.jpg. We put a gamma prior on the inverse residual variance that acts as a regularizer: An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e214.jpg, which has mean An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e215.jpg and variance An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e216.jpg. In practice, we set An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e217.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e218.jpg. This model, with a mean term, is referred to as SFAm in the main text; the SFA model is obtained by fixing the vector An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e219.jpg at zero. The ECME algorithm for fitting SFAm is described below; the ECME algorithm for fitting SFA is obtained by simply setting An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e220.jpg throughout. Note that here we have chosen to have column-specific (i.e., SNP-specific) means and row-specific (i.e., individual-specific) variances An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e221.jpg. It is possible to modify the ECME updates below to allow for different assumptions, for example to allow row-specific means or column-specific variances. In some contexts, including the population structure problem considered here, it might make sense to allow more general assumptions, such as variance terms on both the rows and columns of the matrix; indeed these options are implemented in the SFA software, although not investigated here.

To induce sparsity in the factor loadings An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e222.jpg, we use an automatic relevance determination (ARD) prior [40]. Specifically, we assume An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e223.jpg, where the matrix An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e224.jpg is a parameter that we estimate, together with the other parameters, using maximum likelihood. If the estimate of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e225.jpg, this implies that An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e226.jpg, thus inducing sparsity.

Integrating out An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e227.jpg, the rows of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e228.jpg are conditionally independent given the other parameters, with:

equation image
(5)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e230.jpg (a diagonal matrix with the An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e231.jpg-vector An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e232.jpg on the diagonal), and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e233.jpg. Thus the log marginal likelihood for the parameters An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e234.jpg is:

equation image
(6)

equation image
(7)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e237.jpg.

Sparse factor analysis ECME algorithm

We fit this model using an expectation conditional maximization either (ECME) algorithm [41] to maximize An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e238.jpg. This algorithm is similar to an EM algorithm, but each maximization step maximizes either the expected log likelihood, or the marginal log likelihood, for a subset of the parameters conditional on the others. Specifically, the updates to An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e239.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e240.jpg, and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e241.jpg involve maximizing the expected log likelihood (with the expectation taken over An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e242.jpg), whereas the updates to An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e243.jpg directly maximize the log marginal likelihood.

To compute the expected log likelihood requires the first and second moments of the factor loadings An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e244.jpg. The data An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e245.jpg and the loadings An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e246.jpg are jointly normal (as in, e.g., [42]):

equation image
(8)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e248.jpg is a An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e249.jpg-vector of zeros. Standard results for joint Gaussian distributions give the conditional expectation for An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e250.jpg:

equation image
(9)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e252.jpg. Similarly, the conditional second moment is given by:

equation image
(10)

The updates for An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e254.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e255.jpg, and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e256.jpg involve maximizing the expected complete data log likelihood, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e257.jpg, which from Equation 4, and including the prior distribution on An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e258.jpg, is given by:

equation image
(11)

where

equation image
(12)

Taking the derivative of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e261.jpg with respect to An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e262.jpg and setting to An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e263.jpg, we get the update for An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e264.jpg:

equation image
(13)

equation image
(14)

In these expressions, and in what follows, we are assuming element-wise multiplication when a scalar multiplies a vector or a matrix.

Taking the derivative of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e267.jpg with respect to An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e268.jpg and setting to zero, we get the update for An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e269.jpg:

equation image

equation image
(15)

Taking the derivative of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e272.jpg with respect to An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e273.jpg and setting to zero, we get the update for An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e274.jpg:

equation image
(16)

To update An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e276.jpg we can use the result from [40] to obtain the values of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e277.jpg that maximize the log marginal likelihood An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e278.jpg with fixed values of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e279.jpg, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e280.jpg, and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e281.jpg:

equation image
(17)

where An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e283.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e284.jpg, where An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e285.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e286.jpg. Note that An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e287.jpg when An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e288.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e289.jpg otherwise. This works because, given An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e290.jpg, the SFA model (Equation 3) is essentially the sparse regression model considered in [40] with An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e291.jpg playing the role of the covariates.

Note that An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e292.jpg and An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e293.jpg are non-identifiable in that multiplying the An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e294.jpg row of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e295.jpg by a constant An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e296.jpg and dividing the An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e297.jpg column of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e298.jpg by An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e299.jpg will not change the likelihood (Equation 6). To deal with this we impose an identifiability constraint, An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e300.jpg for An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e301.jpg, where An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e302.jpg. Specifically, after each iteration we divide every element of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e303.jpg by its standard deviation An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e304.jpg, and multiply the An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e305.jpg column of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e306.jpg by An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e307.jpg.

Because we choose not to update the expected values of the loading matrix An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e308.jpg between the CM steps, monotone convergence of the log marginal likelihood is not guaranteed, although in practice it appears to converge well. We find that convergence is reached for the applications described here after fewer than An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e309.jpg iterations. For each genotype data set, we run SFA multiple times with random seeds, setting the number of factors as described in the text; results presented in figures are a representative example. A C++ package containing the SFA and SFAm code is available for download at http://stephenslab.uchicago.edu/software.html.

Principal components analysis

For smaller data sets (all but the European and Indian data), we computed principal components by first standardizing the columns of the matrix An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e310.jpg (subtracting their mean and dividing by their standard deviation) and then finding the eigenvectors of the An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e311.jpg covariance matrix of the individuals in R [43] using the function eigen. In our terminology, these eigenvectors, or principal components (PCs), are the loadings, i.e., the columns of An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e312.jpg. For larger data sets, we identify the PCs using the SmartPCA software from the EigenSoft vAn external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e313.jpg package [7], [13]. For both the European genotype data and the Indian genotype data, we set the number of output vectors to An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e314.jpg, we use the default normalization style, we do not identify outliers, we have no missing data, and we remove all An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e315.jpg chromosome data.

Admixture

We ran admixture vAn external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e316.jpg [11] with multiple random starting points using the -s option.

We mapped the four-dimensional admixture proportions into two-dimensions for visualization as follows: the four-dimensional vector An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e317.jpg maps to the two-dimensional vector An external file that holds a picture, illustration, etc.
Object name is pgen.1001117.e318.jpg.

Supporting Information

Figure S1

Results of applying SFA, PCA, and ADMIXTURE to the HapMap genotype data after removing half of the Africans. Each plot in the first three columns shows the loadings estimated from the modified data set across individuals. Each plot in the second three columns shows the estimated factors for the original data set against the estimated factors for the modified data set. The first row is SFA, the second row is PCA, and the third row is ADMIXTURE. European individuals are denoted with blue ‘x’s, African individuals are denoted with red triangles, and Asian individuals are denoted with green ‘+’s. A dashed horizontal line is at zero on the y-axis. Note how the correlation of the two unaffected populations for SFA and ADMIXTURE is much higher than for any of the factors in PCA.

(5.76 MB TIF)

Figure S2

Results from PCA, SFA, and ADMIXTURE for the Indian data. Only one estimated loading from SFA and ADMIXTURE are shown because the second set of loadings are perfectly negatively correlated to the first. The results from SFA are almost identical to those from PCA for these data. The individuals are colored as in the map from Figure 10 in the main text according to their population group.

(2.06 MB TIF)

Text S1

Supplemental information. In particular, this information addresses the mathematical consequences of standardizing the genotype matrix before applying a matrix factorization method.

(0.04 MB PDF)

Acknowledgments

The authors gratefully acknowledge the help of John Novembre for providing ms scripts for the habitat simulations, information about the preprocessing of the GSK European data set, and thoughtful discussions and Bryan Howie for providing a pre-release version of impute2.

Footnotes

The authors have declared that no competing interests exist.

This work was supported in part by the Bioinformatics Research Development Fund, supported by Kathryn and George Gould, to BEE and by NIH grant HG002585 to MS. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, et al. Genetic Structure of Human Populations. Science. 2002;298:2381–2385. [PubMed]
2. Reich D, Thangaraj K, Patterson N, Price AL, Singh L. Reconstructing Indian population history. Nature. 2009;461:489–494. [PMC free article] [PubMed]
3. Wasser SK, Mailand C, Booth R, Mutayoba B, Kisamo E, et al. Using DNA to track the origin of the largest ivory seizure since the 1989 trade ban. Proceedings of the National Academy of Sciences. 2007;104:4228–4233. [PubMed]
4. Parker HG, Kim LV, Sutter NB, Carlson S, Lorentzen TD, et al. Genetic Structure of the Purebred Domestic Dog. Science. 2004;304:1160–1164. [PubMed]
5. Pritchard JK, Rosenberg NA. Use of unlinked genetic markers to detect population stratification in association studies. American Journal of Human Genetics. 1999;65:220–228. [PubMed]
6. Pritchard J. Case-Control Studies of Association in Structured or Admixed Populations. Theoretical Population Biology. 2001;60:227–237. [PubMed]
7. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006;38:904–909. [PubMed]
8. Falush D, Stephens M, Pritchard JK. Inference of Population Structure Using Multilocus Genotype Data: Linked Loci and Correlated Allele Frequencies. Genetics. 2003;164:1567–1587. [PubMed]
9. Tang H, Peng J, Wang P, Risch NJ. Estimation of individual admixture: Analytical and study design considerations. Genetic Epidemiology. 2005;28:289–301. [PubMed]
10. Tang H, Coram M, Wang P, Zhu X, Risch N. Reconstructing genetic ancestry blocks in admixed individuals. American Journal of Human Genetics. 2006;79:1–12. [PubMed]
11. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Research. 2009;19:1655–1664. [PubMed]
12. Zhu X, Zhang S, Zhao H, Cooper RS. Association mapping, using a mixture model for complex traits. Genetic Epidemiology. 2002;23:181–196. [PubMed]
13. Patterson N, Price AL, Reich D. Population Structure and Eigenanalysis. PLoS Genetics. 2006;2:e190. doi: 10.1371/journal.pgen.0020190. [PubMed]
14. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, et al. Genes mirror geography within Europe. Nature. 2008;456:98–101. [PMC free article] [PubMed]
15. Lao O, Lu TT, Nothnagel M, Junge O, Freitag-Wolf S, et al. Correlation between Genetic and Geographic Structure in Europe. Current Biology. 2008;18:1241–1248. [PubMed]
16. Buntine W. Variational extensions to EM and multinomial PCA. 2002. In: Proceedings of the European Conference on Machine Learning.
17. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959. [PubMed]
18. Eckart C, Young G. The approximation of one matrix by another of lower rank. Psychometrika. 1936;1:211–218.
19. Lucas J, Carvalho C, Wang Q, Bild A, Nevins J, et al. Sparse Statistical Modelling in Gene Expression Genomics. Cambridge University Press; 2006. pp. 155–176.
20. Fokoue E. Stochastic determination of the intrinsic structure in Bayesian factor analysis. 2004. Tech. rep., Statistical and Applied Mathematical Sciences Institute (SAMSI)
21. Carvalho C, Chang J, Lucas J, Nevins JR, Wang Q, et al. High-Dimensional Sparse Factor Modelling: Applications in Gene Expression Genomics. Journal of the American Statistical Association. 2008;103:1438–1456. [PMC free article] [PubMed]
22. Pournara I, Wernisch L. Factor analysis for gene regulatory networks and transcription factor activity profiles. BMC Bioinformatics. 2007;8 [PMC free article] [PubMed]
23. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature. 1999;401:788–791. [PubMed]
24. Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10:515–534. [PMC free article] [PubMed]
25. Mackay DJC. Bayesian methods for adaptive models. 1992. Ph.D. thesis, California Institute of Technology, Pasadena, CA.
26. Neal RM. Bayesian Learning for Neural Networks. 1996. Lecture Notes in Statistics No. 118, Springer-Verlag.
27. Tipping ME. The relevance vector machine. 2000. In: Proceedings of the Neural Information Processing Systems 12.
28. Lawrence N. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research. 2005;6:1783–1816.
29. Conrad DF, Jakobsson M, Coop G, Wen X, Wall JD, et al. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nature Genetics. 2006;38:1251–1260. [PubMed]
30. Novembre J, Stephens M. Interpreting principal component analyses of spatial population genetic variation. Nature Genetics. 2008;40:646–649. [PMC free article] [PubMed]
31. McVean G. A Genealogical Interpretation of Principal Components Analysis. PLoS Genetics. 2009;5:e1000686. doi: 10.1371/journal.pgen.1000686. [PMC free article] [PubMed]
32. Nelson MR, Bryc K, King KS, Indap A, Boyko AR, et al. The Population Reference Sample, POPRES: A Resource for Population, Disease, and Pharmacological Genetics Research. American Journal of Human Genetics. 2008;83:347–358. [PubMed]
33. Serre D, Pääbo S. Evidence for Gradients of Human Genetic Diversity Within and Among Continents. Genome Research. 2004;14:1679–1685. [PubMed]
34. Lee DD, Seung SH. Algorithms for Non-negative Matrix Factorization. 2001. pp. 556–562. In: Advances in Neural Information Processing Systems 13.
35. West M. Bayesian Factor Regression Models in the Large p, Small n Paradigm. Bayesian Statistics. 2003;7:723–732.
36. Canny J. Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM; 2002. Collaborative filtering with privacy via factor analysis. pp. 238–245.
37. Lopes HF, West M. Bayesian model assessment in factor analysis. Statistica Sinica. 2004;14:41–67.
38. Hudson RR. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002;18:337–338. [PubMed]
39. Howie BN, Donnelly P, Marchini J. A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies. PLoS Genetics. 2009;5:e1000529. doi: 10.1371/journal.pgen.1000529. [PMC free article] [PubMed]
40. Tipping ME, Faul AC. Bishop CM, Frey BJ, editors. Fast marginal likelihood maximization for sparse Bayesian models. Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. 2003.
41. Liu C, Rubin DB. The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika. 1994;81:633–648.
42. Ghahramani Z, Hinton GE. The EM algorithm for mixtures of factor analyzers. 1996. Tech. rep., CRG-TR-96-1.
43. R Development Core Team. R: A Language and Environment for Statistical Computing. 2008. R Foundation for Statistical Computing, Vienna, Austria.

Articles from PLoS Genetics are provided here courtesy of Public Library of Science