Before presenting a unifying approach, we review the justification and underlying principles of both methods.
What Is SAT?
Hoggart et al. (p. 1492 in [7]) articulated the rationale behind SAT: “In general, population stratification exists when the total population has been formed by admixture between subpopulations and when admixture proportions (defined as the proportions of the genome that have ancestry from each subpopulation) vary between individuals. …If the risk of disease varies with admixture proportions, this will confound associations of disease with genotype at any locus where allele frequencies vary between subpopulations. …If the confounder—admixture proportions—can be measured accurately, control for it can be achieved in a straightforward manner by modeling its effects in the analysis.”
We will show that how one attempts to control for parental ancestry is critical to determining whether one eliminates potential confounding due to variations in parental ancestry. To our knowledge, there are four published approaches to SAT [7]. All are built on this general principle, but take somewhat different approaches. We will not explore the specifics of those approaches here, but note that none are couched in a general framework that includes both RAM and SAT. Furthermore, none allow flexible generalization to as broad a range of situations as we would wish.
The overall issue of confounding due to admixture disequilibrium, generalized to any population, is portrayed in the accompanying path diagram. In the path diagram, rectangles represent directly observed variables, ellipses represent unobserved or latent variables, dashed ellipses represent variables that can potentially exert influences, and arrows represent direct or causal relationships. The path diagram introduces two key latent constructs, individual ancestry and individual admixture, which underlie the issue of confounding due to variation in individual ancestry. Specifically, an individual ancestry proportion, with respect to a specific parental population, is defined as the proportion of that individual's ancestors who were members of that parental population in the generation prior to the first admixture event. This is in contrast to an individual's admixture, which is the proportion of the individual's genome that is inherited from a specific parental population.
Path Diagram Illustrating the Relationship between Admixture, Ancestry, and Phenotype
The figure indicates that association testing is not a simple issue. The relationship between the putative quantitative trait locus (QTL) and phenotype is the one of interest, but it can be confounded by other variables. First, note that QTLs and individual admixture can be directly influenced by random variation due to meiosis. In addition, both the phenotype and measured admixture are potentially subject to measurement error. Furthermore, measured admixture is directly affected by individual admixture, which in turn is affected by individual ancestry. Naturally, the ancestry of the parents, represented by P1 and P2, affects individual ancestry. Individual ancestry can directly affect the putative QTL, which in turn can affect the phenotype, so individual ancestry has an indirect effect on the phenotype via the putative QTL. The right-hand side of the path diagram is a mirror image of the left-hand side, with an unobserved QTL replacing the putative QTL, and represents the potential path of spurious associations. The diagram also indicates that the product of parental ancestries affects both QTLs. Justification for these paths is provided below.
The consequences of failing to control for variation in ancestry are illustrated in A. The simple simulation reveals that type I errors occur 13.24, 41.2, and 193 times as often as expected at the .05, .01, and .001 α levels, respectively, and this inflation is attributable to confounding due to variation in ancestry. SATs are designed to be resistant to such confounding.
Conditioning on Individual Ancestry and the Product of Parental Ancestries Is Necessary and Sufficient to Control for Confounding
What Is RAM?
We define region-specific admixture as a characteristic of segments of the genomes of individuals. For any given region of the genome, one's region-specific admixture from population V is the proportion of alleles in that region that are copies of alleles from members of population V.
The rationale for RAM rests on two premises. First, the process of admixture creates linkage disequilibrium among linked loci that tends to extend over longer genetic distances than does disequilibrium under long-term panmixia. Second, even after appropriately adjusting for the degree of individual ancestry, the degree of individual region-specific admixture will covary with phenotypes that are influenced by loci that are (1) in the region under study; and (2) in disequilibrium with loci that have different allele frequencies in the parental populations. Both premises are well established [23]. Prior to the late 1990s, several authors had formally discussed the possibility of RAM-type approaches [23], but did not offer methods that would control for potential spurious associations [4]. McKeigue first introduced modern approaches to RAM that attempted to control for spurious associations induced by the admixture process [6].
Several approaches to RAM [6] have been published. Some [28] use a two-stage approach in which estimates of individual admixture and region-specific admixture are first obtained in a specialized procedure and then used in an ordinary logistic regression approach with case-control data. This two-stage approach lends itself to generalization and is a simplified form of the unified general linear model approach we present.
There are a number of methodologic points that have been alluded to but have not been completely elucidated in the literature pertaining to how one should condition upon (control for) ancestry within RAM and SAT. Within the next few sections, we seek to clarify these points.
It is unclear from past writing whether it is sufficient to control for individual admixture, individual ancestry, or both to eliminate confounding due to the admixture process. We first clarify that, although sometimes used interchangeably, an individual's admixture and an individual's ancestry are not equivalent variables. To illustrate, consider a set of full siblings that does not include any monozygotic twins. Because they are full siblings, all individuals in the set have equal individual ancestry from specific populations or regions. In fact, all individuals in the set have ancestry equal to the mean or midpoint of their parents' ancestries, represented as P1 and P2. However, due to recombination, all individuals will have slightly different admixture values.
Here we show by counterexamples that it is not sufficient to control for individual admixture and it is also not sufficient to control for individual ancestry. We then show that it is sufficient to control for both individual ancestry and the product of parental ancestry. Throughout the paper and our examples, i represents the ith individual, j the jth locus, k the number of alleles at the jth locus, and V the number of founding populations. For simplicity we assume 2 founding populations in this paper.
Controlling for individual admixture is not sufficient.
Given variations in parental ancestry, controlling for individual admixture is not sufficient. Imagine an organism with W independent genetic segments of equal genetic length. For each individual, let the two parents have equal ancestry. Suppose that the admixture of each segment is known without (measurement) error. Without loss of generality, assume that the segment-specific admixture values (denoted Xj for the jth segment) and the ancestry values are all scaled to have variance 1.0. Given the assumptions above, all segment-specific admixture values will have equal covariance with ancestry. Denote this covariance as β. Let

X = X1 + X2 + … + XW

denote the overall individual admixture value (for ease of exposition, we have not divided by W, but this is only a linear transformation and will have no impact on the result). Then, the covariance between any two segments Xj1 and Xj2 is β², and the squared correlation coefficient between Xj1 and X is

r(Xj1, X)² = [1 + (W − 1)β²] / W.

The partial correlation coefficient between Xj1 and Xj2, conditional on X, can be written in terms of simple correlation coefficients as

r(Xj1, Xj2 | X) = [r(Xj1, Xj2) − r(Xj1, X) r(Xj2, X)] / [1 − r(Xj1, X)²],

which, after substituting and reducing, equals −1/(W − 1) for any finite W > 1. Thus, it is clear in this situation that the partial correlation coefficient can never be zero and only asymptotically approaches zero as W approaches infinity (i.e., as the amount of independent information that goes into the emergent variable of admixture increases infinitely). If the partial correlation is not guaranteed to be zero, then, conditional on individual admixture, what is inherited at one segment can be correlated with what is inherited at another segment. Therefore, controlling for individual admixture is not sufficient to eliminate correlations among unlinked loci and is not sufficient to control for spurious associations. The formula further implies that the distinction between individual ancestry and individual admixture will, all other things being equal, be greatest in organisms with short genomes, such as Arabidopsis (diploid chromosome number = 10, 8.0 × 10^7 base pairs in total length), and less in organisms with long genomes, such as crayfish (diploid chromosome number = 200, 8.22 × 10^9 base pairs in total length) (see http://www.genomesize.com).
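The nonzero partial correlation can be checked by simulation. The following sketch (parameter values and variable names are our own arbitrary choices) generates W segments that are conditionally independent given ancestry, each with variance 1 and covariance β with ancestry:

```python
import numpy as np

# Sketch of the counterexample above: W conditionally independent segments,
# each with variance 1 and covariance beta with ancestry. The partial
# correlation between two segments, given overall admixture X, is -1/(W-1).
rng = np.random.default_rng(0)
N, W, beta = 200_000, 5, 0.6  # sample size, segments, cov(Xj, ancestry): arbitrary

ancestry = rng.standard_normal(N)  # scaled to variance 1
segments = beta * ancestry[:, None] + np.sqrt(1 - beta**2) * rng.standard_normal((N, W))
X = segments.sum(axis=1)           # overall admixture (not divided by W)

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing out z."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

pc = partial_corr(segments[:, 0], segments[:, 1], X)
print(pc, -1 / (W - 1))  # the two values are approximately equal (about -0.25)
```

With W = 5 the simulated partial correlation sits near −1/4 rather than zero, matching the algebraic result.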
Controlling for individual ancestry is not sufficient.
Let X1 and X2 denote Bernoulli-distributed random variables indicating whether or not one has inherited two alleles from population V at locus 1 and locus 2, respectively, and let vij denote the number of alleles inherited from population V at the jth locus for the ith individual. Assume that the two loci are unlinked and that we begin with two inbred populations, one consisting of nonadmixed individuals from population V and the other of nonadmixed members of a second population. Subsequently, N1 individuals from population V, together with individuals from the second population and, subsequently, their offspring, begin intermating for two generations in an unspecified pattern. Then, in the second admixed generation, we have a population that can be described as in the accompanying table.
Expected Population Resulting from Two Generations of Random Mating between Two Inbred Populations
As can be seen in the table, P(vij = 2) is not determined solely by individual ancestry but also depends on mating patterns and mixing proportions, via their influence on the distribution of parental mating types. This means that, even conditional upon individual ancestry, there can still be confounding because X1 will be correlated with X2. Controlling for individual ancestry may remove most of the confounding, but not all. This is even more evident when one imagines a dataset including only the two rows with V ancestry of 1/2. Within these two rows, although individual ancestry would be controlled perfectly (there would be no variation), the opportunity for confounding is present. Only offspring of matings between two admixed (F1) parents can have either X1 = 1 or X2 = 1.
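This can be made concrete with a small calculation (the mixing proportion below is an arbitrary illustrative value, not one taken from the text):

```python
# Among offspring with individual ancestry 1/2, two parental mating types occur:
# pure-V x pure-other parents (offspring heterozygous at every locus, so X1 = X2 = 0)
# and F1 x F1 parents (two V alleles inherited with probability 1/4, independently
# at each unlinked locus). Mixing the two groups induces a correlation between
# X1 and X2 even though ancestry is constant at 1/2.
p_f1 = 0.5  # assumed share of ancestry-1/2 offspring with two F1 parents

E_X1 = p_f1 * 0.25            # E[X1] = E[X2]
E_X1X2 = p_f1 * 0.25 * 0.25   # X1, X2 independent within the F1 x F1 group
cov = E_X1X2 - E_X1 ** 2
print(cov)  # 0.015625 > 0: confounding persists despite fixed ancestry
```

Any mixture of the two mating types (0 < p_f1 < 1) yields a strictly positive covariance, so conditioning on ancestry alone cannot remove the association between the two unlinked loci.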
Some models (e.g., [7]) control for the linear effect of individual ancestry or individual admixture in regression-type models in an attempt to ensure that RAM and SAT tests are not confounded by variation in ancestry. This will only be valid if one tests only for linear allelic (additive) effects at loci, without testing for dominance (genotypic) effects or epistasis. This is because, when testing for allelic effects, the expected number of alleles from population V at any one locus among individuals with ancestry A from population V is 2A, a linear function of A. However, the locus-specific effects on complex and quantitative traits cannot a priori be assumed to be additive and can even be overdominant [30]. For this reason, many investigators wisely choose to test for genotypic effects in two-degrees-of-freedom models (e.g., [12]) rather than restricting themselves to allelic (additive) effects (compare with [35]). In such situations, controlling only for the linear term of individual ancestry will be insufficient if one uses tests that allow for nonadditive genotypic effects.
Controlling for individual ancestry and the product of parental ancestries is sufficient.
The premise of conditioning on parental ancestry was first introduced by McKeigue [26]. Here we expand on the idea and show that it is necessary to condition on both individual ancestry and the product of parental ancestries. It is important to note in the following that, although we are controlling for parental ancestries, this does not imply it is necessary to include parents in RAM and SAT studies (see Text S1 for discussion of estimating parental ancestry solely from offspring data).
Let P1i and P2i denote the individual ancestries from population V for the two parents of the ith individual, respectively. Note that for any locus, the expected number of V alleles depends only on the individual's ancestry; hence, we drop the locus-specific subscript j in subsequent equations. Then, at every locus:

P(vi = 2) = P1i P2i,
P(vi = 1) = P1i(1 − P2i) + P2i(1 − P1i),
P(vi = 0) = (1 − P1i)(1 − P2i).

Furthermore, conditional on P1i and P2i, the number of alleles inherited from one population at a given locus is independent of the number of alleles inherited at another locus for all loci that are unlinked as defined by Mendel's law of independent assortment. Therefore, controlling for P(vi = 2), P(vi = 1), and P(vi = 0) is sufficient to eliminate confounding by unlinked loci. Given that P(vi = 0) + P(vi = 1) + P(vi = 2) = 1, it is only necessary to control for any two in a model. We choose to control for P(vi = 2) and P(vi = 1). If we let Y denote a phenotype and f(Y) denote some function of Y, then a testing model that would eliminate confounding induced by variations in parental ancestry would take the form:

f(Yi) = β0 + β1 P(vi = 2) + β2 P(vi = 1) + …,

in which the missing terms denoted by the ellipsis are those that one is primarily interested in testing. Substituting P(vi = 2) = P1i P2i and P(vi = 1) = P1i(1 − P2i) + P2i(1 − P1i) and collecting terms yields:

f(Yi) = β0 + β2(P1i + P2i) + (β1 − 2β2) P1i P2i + ….

Noting that, by definition, (P1i + P2i)/2 is individual ancestry (Ai), and letting α0 = 2β2 and α1 = β1 − 2β2, this becomes:

f(Yi) = β0 + α0 Ai + α1 P1i P2i + ….
As can be seen, the probability distribution of the descent status (and therefore the genotypes, if allele frequencies differed in the parental populations) depends on both first- and second-order functions of ancestry but not on any higher-order terms. Thus, to eliminate confounding due to variations in parental ancestry, it is sufficient to control for individual ancestry and the product of parental ancestries. B–D illustrate these points. Specifically, B indicates that if the confounding locus acts in an additive fashion, controlling for ancestry without the product of parental ancestries does provide adequate type I control. However, C reveals that type I errors occur 6.16, 16.4, and 36 times as often as expected at the .05, .01, and .001 α levels, respectively, when the confounding locus acts in an overdominant fashion and the linear term of ancestry alone is used to control for variation in ancestry. Finally, D indicates that adequate control is achieved when the confounding locus acts in an overdominant fashion and both the linear term of ancestry and the product of parental ancestries are used to control for variation in ancestry.
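The equivalence of conditioning on the descent-state probabilities and conditioning on individual ancestry plus the product of parental ancestries can be checked numerically; a minimal sketch with arbitrary parameter values (none taken from the text):

```python
# Check: conditioning on the descent-state probabilities P(v = 2) and P(v = 1)
# is equivalent to conditioning on individual ancestry A = (P1 + P2)/2 and the
# product of parental ancestries P1*P2. All numeric values are arbitrary.
P1, P2 = 0.3, 0.8                     # parental ancestries from population V
p2 = P1 * P2                          # P(v = 2): both alleles from V
p1 = P1 * (1 - P2) + P2 * (1 - P1)    # P(v = 1): exactly one V allele
p0 = (1 - P1) * (1 - P2)              # P(v = 0): no V alleles
assert abs(p0 + p1 + p2 - 1) < 1e-12  # probabilities sum to 1

b0, b1, b2 = 0.5, 1.2, -0.7           # arbitrary regression coefficients
A = (P1 + P2) / 2                     # individual ancestry
a0, a1 = 2 * b2, b1 - 2 * b2          # reparameterized coefficients
lhs = b0 + b1 * p2 + b2 * p1          # descent-state parameterization
rhs = b0 + a0 * A + a1 * P1 * P2      # ancestry/product parameterization
print(lhs, rhs)  # identical up to rounding
```

The two linear predictors agree exactly for any choice of P1, P2, and coefficients, which is the algebraic content of the sufficiency claim.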
The insufficiency of “conditional conditioning.”
One may choose to condition on parental ancestry only if parental ancestry is found to be statistically significant when included in the model, or if significant structure is detected in the sample, as was described by Pritchard et al. [22] as the first step in their three-step SAT procedure and by Hoggart et al. (p. 1502 in [7]). We refer to this approach as conditional conditioning. If one's goal is to ensure that under H0 the type 1 error rate remains ≤ α, which generally defines a valid test in the frequentist context, then conditional conditioning is not a valid testing strategy. That is, even though covariates may not meet criteria for statistical significance in a finite sample, this does not mean they are not confounders, and failing to include them in the model can lead to inflated type 1 error rates [36]. Therefore, if one is interested in valid RAM and SAT tests of linkage in the presence of association, it is necessary to control for parental ancestry terms as in Equations 10 regardless of their degree of statistical significance in the model. By analogy, the practice of controlling for parental ancestry only if a significance test of Hardy-Weinberg equilibrium is rejected has the same problem [37]. So too would the practice of attempting to control for parental ancestry only if other tests yielded significant evidence that the sample came from a structured population. This is illustrated in the accompanying figure, which reveals that type I errors occur 7.93, 28.87, and 66.1 times as often as expected at the .05, .01, and .001 α levels, respectively, when conditional conditioning is used.
Effect of “Conditional Conditioning” on Type 1 Error Rates
A General Linear Model
Here we introduce general models for RAM and SAT that are highly extensible. We define the following notation: Y, a phenotype that can be continuous, ordinal, or dichotomous; Ai, ancestry for the ith individual, the proportion of the ith individual's ancestors that came from parental population V; Aijk, a dummy-coded (0,1) indicator variable indicating whether the ith individual has inherited k and only k alleles at the jth locus from an ancestor that was from parental population V; and Gijk, a dummy-coded (0,1) indicator variable indicating whether the ith individual has k and only k alleles at the jth locus of a specified type. We use f( ) to denote the link function, a monotone function linking the dependent variables to the estimated model [38], a device also employed by Hoggart et al. [7]. We offer the following simple models for generalized RAM and SAT:

RAM model: f(Yi) = β0 + β1 Ai + β2 P1i P2i + β3 Aij1 + β4 Aij2 + …

SAT model: f(Yi) = β0 + β1 Ai + β2 P1i P2i + β3 Gij1 + β4 Gij2 + …

We assume for now that all variables are known without error. However, we return to the important issue of measurement error later.
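A sketch of fitting such a model by ordinary least squares for a continuous phenotype on simulated data (the simulation design, variable names, and parameter values are our own illustration, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Simulated parental ancestries from population V and derived quantities
P1, P2 = rng.uniform(size=n), rng.uniform(size=n)
A = (P1 + P2) / 2        # individual ancestry
prod = P1 * P2           # product of parental ancestries

# Number of V alleles at one locus: one allele drawn from each parent
v = rng.binomial(1, P1) + rng.binomial(1, P2)
A1, A2 = (v == 1).astype(float), (v == 2).astype(float)  # descent-state indicators

# Continuous phenotype driven by ancestry (a confounder) but not by this locus
y = 2.0 * A + rng.standard_normal(n)

# Design matrix: intercept, ancestry, product of parental ancestries,
# plus the two descent-state indicators being tested (a 2-df genotypic test)
X = np.column_stack([np.ones(n), A, prod, A1, A2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 2))  # coefficients on A1 and A2 (last two) are near zero
```

Because the locus has no effect here, the descent-state coefficients hover near zero while the ancestry coefficient recovers the simulated confounding effect; in practice a logistic, Poisson, or Cox link would replace ordinary least squares as the phenotype demands.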
These general linear models are very flexible. First, dichotomous (e.g., case vs. control), ordinal, time-to-event, or continuous phenotypes can be accommodated by letting the regression model be logistic, Poisson, Cox, or ordinary least squares, respectively. This flexibility is important. Investigators frequently want not only to assess genetic association for dichotomous and static phenotypes such as lupus (yes vs. no) in a case-control study, but also to assess genetic association with longitudinal outcomes (e.g., clinical course in medical research or growth rate in agricultural research), adjusting for covariates including demographics and ancestry. Such longitudinal phenotypes can also be accommodated by this general model via the use of mixed models and related techniques for longitudinal data [39]. Moreover, the models can be fit in standard software (e.g., SAS), which has the advantage of being widely accessible, well documented, and well tested. This radically increases the likelihood of wide and proper use. In addition, by being framed in a regression approach, all of the machinery of regression, including diagnostics [41], well-recognized effect size metrics, robust variations [42], the ability to include covariates, and the ability to test interactions, is at one's disposal. This immediately makes the models extensible to multilocus and epistatic models. Finally, the RAM approach can be expanded to test a region of a chromosome by including an estimate of the admixture of the region instead of marker-specific ancestry.
A conceptual bridge to identity in state and identity by descent.
Another advantage of the models in Equations 10 is that they make clear the relationship of RAM and SAT to identity by descent and identity in state in family-based tests of linkage and of linkage in the presence of association. RAM is analogous to linkage testing, whereas SAT is analogous to association testing. The Aijk values correspond to “descent states,” whereas the Gijk values correspond to specific allele states. Indeed, Zhu et al. [16], citing [26], refer to such Aijk quantities as “X by descent” to denote an allele having ancestry from X. This conceptual bridge is more than an intellectual nicety. It immediately makes clear how we can borrow the concept of testing for linkage conditional upon association that is now popular in linkage analysis [43], as we shall discuss below.
As already discussed, the models in Equations 10 are easily extended to allow for any phenotypic distribution. Because no constraints are placed on the distribution of the phenotypes, with two exceptions, the models can accommodate selective sampling (e.g., sampling phenotypically extreme subjects or sampling subjects on the basis of ancestry) without modification. In addition, covariates, multiple loci, gene-by-environment (or gene-by-sex, gene-by-age, etc.), and gene-by-gene (epistasis) effects are easily modeled by simply adding appropriate terms to the right side of the equation. The general linear model presented here can be extended to deal with several situations, which are briefly introduced below. If there are a total of M phenotypes to include, one can replace the variable Y on the left side of Equations 10 with a weighted linear composite of the Y values representing the multiple phenotypes as follows:

Multivariate RAM model: f(w1 Yi1 + … + wM YiM) = β0 + β1 Ai + β2 P1i P2i + β3 Aij1 + β4 Aij2 + …

Multivariate SAT model: f(w1 Yi1 + … + wM YiM) = β0 + β1 Ai + β2 P1i P2i + β3 Gij1 + β4 Gij2 + …

in which the weights wm are constants to be estimated within the regression framework and are constrained such that w1 + … + wM = 1. This constraint is necessary to make the model identifiable.
To our knowledge, no current RAM or SAT test allows related individuals to be included as subjects. (We distinguish the inclusion of related individuals as subjects from the requirement that parents or other relatives be included in some testing procedures as a means of controlling for ancestry [e.g., [46]].) Equations 10 can accommodate related individuals by utilizing software that models the covariance structure among the residuals. Finally, proper estimation of parental ancestry values will require special accommodations for related individuals (e.g., full siblings should obviously be constrained to have the same parental ancestry values).
The general linear model offered can be extended to allow one to test for linkage conditional upon association with a polymorphism in a region and, thereby, test whether that polymorphism appears to account for an observed linkage signal that was detected with RAM. The right side of Equation 10 can be expanded to include the Gijk values. In this situation, one desires a test of whether the amount of variance explained by the Aijk variables, conditional on all other variables in the model, is significantly less when the Gijk values are included in the model compared to when the Gijk values are excluded from the model. In many cases, these tests entail the use of bootstrapping.
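The bootstrap comparison can be sketched as follows (simulated toy data; the data-generating scheme, the statistic, and all names are our own choices rather than a prescribed procedure):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000

# Toy data: phenotype y, nuisance covariates Z, a descent-state indicator a_ind
# correlated with a genotype indicator g_ind, and a signal carried by g_ind.
Z = np.column_stack([np.ones(n), rng.uniform(size=n)])
g_ind = rng.binomial(1, 0.3, size=n).astype(float)
a_ind = np.where(rng.uniform(size=n) < 0.8, g_ind, rng.binomial(1, 0.3, size=n))
y = 1.0 * g_ind + rng.standard_normal(n)

def r2(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ coef).var() / y.var()

def delta_r2_a(idx):
    """Increment in R^2 from the descent-state term, with vs. without genotype."""
    Zi, gi, ai, yi = Z[idx], g_ind[idx], a_ind[idx], y[idx]
    without_g = r2(np.column_stack([Zi, ai]), yi) - r2(Zi, yi)
    with_g = r2(np.column_stack([Zi, gi, ai]), yi) - r2(np.column_stack([Zi, gi]), yi)
    return without_g - with_g

stat = delta_r2_a(np.arange(n))                      # observed drop in explained variance
boot = [delta_r2_a(rng.integers(0, n, n)) for _ in range(200)]
lo, hi = np.percentile(boot, [2.5, 97.5])            # percentile bootstrap interval
print(stat, (lo, hi))  # a clearly positive statistic: the genotype accounts
                       # for much of the descent-state (linkage) signal
```

A bootstrap interval excluding zero is evidence that the polymorphism accounts for part of the RAM linkage signal.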
Nonparametric Measurement Error Assessment and Accommodation
Until now, we have assumed that all variables are known without error. In reality, this will not be the case, and it is important to recognize this. Any of the variables involved can be measured with error, and we now address the consequences of error in each and propose responses to ensure validity of the tests in terms of type 1 error rate control. Throughout, we assume that the measurement errors are independent of each other and of all of the variables under study. We also do not dwell on how one should calculate estimates of individual and parental admixture, or estimates of the reliability thereof, when used as estimates of individual and parental ancestry. For now, we simply assume that it is possible to do so and briefly address ways in which this might best be accomplished in Text S1.
Error in the genotypes.
It is well known that genotyping errors occur and, when they occur, result in reduced power [48]. However, if the measurement error is in the determination of Gijk, this will only lower power, not inflate the type 1 error rate. Therefore, no response is needed to ensure validity of the test.
Error in the phenotypes.
Phenotypes are also often measured with error but, again, this will only serve to lower the power of the tests we offer and not inflate type 1 error rates [49]. Therefore, no response is needed to ensure validity of the tests.
Error in the estimates of region-specific individual admixture.
Unless a perfectly informative marker (i.e., a marker with allele frequencies of zero and one in one parental population and complementary frequencies in the other, respectively) is available at exactly the locus under study, the degree of regional admixture for any individual will only be known probabilistically. Let us denote the (Bayesian posterior) probabilities of individual region-specific admixture as pij1 = P(Aij1 = 1 | marker data) and pij2 = P(Aij2 = 1 | marker data). Then one can replace Aij1 and Aij2 with pij1 and pij2, respectively, in the various regression models, in an analogous manner to what would be done in some multipoint mapping approaches in experimental crosses (see p. 433 in [50]). Measurement errors here will, again, lower power, but not affect the type 1 error rate.
Error in the estimates of parental ancestry.
Error in the estimates of parental ancestry poses the greatest challenge. As several authors [7] have noted, unchecked errors in the putatively confounding variables on which one must condition will lead to incomplete control and potentially to residual confounding [51]. Therefore, some method is required to deal with measurement error in the estimates of individual ancestry. Moreover, such measurement errors, or unreliability, can be substantial, as illustrated in the accompanying table.
Reliability of Individual Admixture Estimates Used as Estimates of Individual Ancestry
Montana and Pritchard [27] noted that Hoggart et al. [7] had criticized their use of a two-stage approach in which one first calculates ancestry estimates and then, in a separate analysis, uses those estimates as covariates. A basis of the criticism was that this approach does not account for uncertainty (measurement error) in the ancestry estimates. Montana and Pritchard (p. 786 in [27]) acknowledged that this concern is “theoretically plausible, [but that] extensive simulations of the admixture mapping tests presented here, as well as simulations of the STRAT test … show that, in practice, the statistical tests are indeed correctly calibrated under the null hypothesis… [and that] there are some practical advantages to the two-stage process. First, the two-stage process makes the output much more transparent and interpretable for the end user. Second, it makes it much easier for users to take the ancestry estimates and develop other tests of association that are appropriate for their own data.” We agree with Hoggart et al. [7] that the measurement errors are a concern, and our simulations herein demonstrate that under some circumstances measurement errors can produce substantial type 1 error rate inflation. On the other hand, we also agree with Montana and Pritchard [27] that the advantages of the two-stage approach in terms of flexibility and conceptual clarity are profound. Fortunately, measurement error correction methods can allow “the best of both worlds” by retaining the flexibility of the two-stage approach while properly accounting for the measurement error.
While many methods are available (e.g., [52]), the most common approach to dealing with errors in variables on the right side of regression equations is regression calibration. In some circumstances (e.g., linear regression), it is effectively the correction for attenuation. This method is a type of resubstitution; instead of the true but unobservable predictor, one substitutes an estimate of it, conditional on the observed covariates (but not the response). The idea is then to run a standard analysis and “fix up” the standard errors at the end via devices such as bootstrapping. In linear regression, regression calibration is often considered the default option because it often works surprisingly well. In logistic regression with a relatively rare disease, regression calibration is an almost exact method. One of the major advantages of regression calibration is that it is easy to implement; after the resubstitution, a standard analysis can be run to obtain estimates [54].
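For a simple linear model, regression calibration can be sketched as follows (simulated data; we assume the measurement-error variance is known, whereas in practice it would itself have to be estimated, e.g., from repeated measurements):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
sigma_u2 = 0.04                               # assumed-known measurement-error variance

A = rng.uniform(size=n)                       # true ancestry (unobserved in practice)
W = A + rng.normal(0, np.sqrt(sigma_u2), n)   # error-prone ancestry estimate
y = 1.5 * A + rng.standard_normal(n)          # phenotype depends on true ancestry

# Naive regression on W attenuates the slope toward zero
naive = np.polyfit(W, y, 1)[0]

# Regression calibration: substitute the best linear predictor E[A | W],
# then run the standard regression on the calibrated predictor
lam = (W.var() - sigma_u2) / W.var()          # estimated reliability of W
A_hat = W.mean() + lam * (W - W.mean())
calibrated = np.polyfit(A_hat, y, 1)[0]

print(round(naive, 2), round(calibrated, 2))  # calibrated slope is close to 1.5
```

In a real analysis the standard errors of the calibrated fit would then be corrected, for example by bootstrapping the entire two-step procedure.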
Another alternative is the simulation extrapolation (SIMEX) approach [54]. SIMEX is more computationally intensive than regression calibration, but it is one of the major default options for nonlinear models that cannot be handled by correction-for-attenuation techniques or regression calibration; that is, it is extremely flexible and can be used with any incarnation of the general linear model. It is also extremely useful for problems in which the measurement error is not of the classic, additive homoscedastic type, as will occur, for example, in the current case, in which the predictor variable (ancestry) is a proportion. As with regression calibration, a great advantage of SIMEX is that it separates the primary statistical modeling component from the error correction component, thereby freeing data analysts to implement the full range of their usual battery of procedures.
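A minimal SIMEX sketch for the same simple setting (simulated data; the quadratic extrapolant is a common default but still an assumption, and it only approximately removes the attenuation):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
sigma_u2 = 0.04                               # assumed-known measurement-error variance

A = rng.uniform(size=n)                       # true ancestry
W = A + rng.normal(0, np.sqrt(sigma_u2), n)   # error-prone estimate
y = 1.5 * A + rng.standard_normal(n)

# SIMEX: deliberately add extra error at levels lam, average the resulting
# (increasingly attenuated) slopes, then extrapolate back to lam = -1,
# the level corresponding to no measurement error at all.
lambdas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
slopes = []
for lam in lambdas:
    sims = [np.polyfit(W + rng.normal(0, np.sqrt(lam * sigma_u2), n), y, 1)[0]
            for _ in range(20)]               # 20 pseudo-datasets per level
    slopes.append(np.mean(sims))

coefs = np.polyfit(lambdas, slopes, 2)        # quadratic extrapolant in lambda
simex_slope = np.polyval(coefs, -1.0)
print(round(simex_slope, 2))  # closer to the true slope (1.5) than the naive fit
```

The extrapolated slope recovers most, though not all, of the attenuation bias; richer extrapolant families trade bias against variance.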
Several other methods exist [58], including multiple imputation [59]. A and B, respectively, illustrate the residual confounding that can occur when conducting a SAT procedure without correcting for measurement error and the proper control of confounding that occurs when a measurement error correction is used. A reveals that type I errors occur 1.4, 2.6, and 4 times as often as expected at the .05, .01, and .001 α levels, respectively, when the correct SAT model is specified but imperfect measures of ancestry are used. Once measurement error corrections are applied, B indicates that the correct type I error rates are restored.
The Importance of Accommodating Measurement Error in Models
Our purpose here has not been to become bogged down in the logistics of setting up RAM and SAT studies or to provide detailed evaluations of the performance characteristics of specific designs and analytic implementations. Rather, our goal was to articulate a unified and generalizable approach to RAM and SAT. We have shown through proofs, counterexamples, and small simulations that it is necessary and sufficient to condition on both individual ancestry and the product of parental ancestries, and that it is not sufficient to “conditionally condition” on parental ancestries, in order to control for confounding in admixture studies. We provide a general linear model that is extensible to a multitude of study designs, conditions, and populations of interest, which are briefly presented but left to future work for detailed description. Within Text S1, we have also provided a semiparametric reliability assessment method as well as suggestions for accommodating measurement errors. It is worth noting that several open questions, or areas for future research, remain in order for studies using RAM and SAT to be optimally useful. These include expanding our RAM approach to case-only analysis, methods for selecting markers with which to estimate ancestry, development of panels of such markers for different ethnic groups (or demonstration that such a priori–defined panels are not needed [60]), and evaluation of methods for estimating individual ancestry and region-specific admixture (for further discussion of such issues, see [2]). Additional issues include how RAM and SAT can best be utilized in studies involving DNA pooling, and how individual ancestry estimation procedures, and the estimation of the reliability thereof, can best utilize knowledge about the pedigree structure among individuals when related individuals are studied. How best to accommodate pedigree data in the analyses remains a question for RAM and SAT, as it does for association testing in general [63]. Finally, now that a general model exists, the time is opportune for a thorough evaluation of the performance characteristics under multiple different population genetic models, genetic architectures, sampling strategies, and phenotypic distributions.