PLoS Genet. 2006 August; 2(8): e137.

Published online 2006 August 25. Prepublished online 2006 July 18. doi: 10.1371/journal.pgen.0020137

PMCID: PMC1557785

David T Redden,^{1,2,*} Jasmin Divers,^{1} Laura Kelly Vaughan,^{1} Hemant K Tiwari,^{1} T. Mark Beasley,^{1} José R Fernández,^{1,2,3} Robert P Kimberly,^{4} Rui Feng,^{1} Miguel A Padilla,^{1} Nianjun Liu,^{1} Michael B Miller,^{5} and David B Allison^{1,2,3}

Wayne N Frankel, Editor

The Jackson Laboratory, United States of America

* To whom correspondence should be addressed. E-mail: DRedden/at/UAB.edu

Received 2005 November 24; Accepted 2006 July 18.

Individual genetic admixture estimates, determined both across the genome and at specific genomic regions, have been proposed for use in identifying specific genomic regions harboring loci influencing phenotypes in regional admixture mapping (RAM). Estimates of individual ancestry can be used in structured association tests (SAT) to reduce confounding induced by various forms of population substructure. Although presented as two distinct approaches, we provide a conceptual framework in which both RAM and SAT are special cases of a more general linear model. We clarify which variables are sufficient to condition upon in order to prevent spurious associations and also provide a simple closed form “semiparametric” method of evaluating the reliability of individual admixture estimates. An estimate of the reliability of individual admixture estimates is required to make an inherent errors-in-variables problem tractable. Casting RAM and SAT methods as a general linear model offers enormous flexibility enabling application to a rich set of phenotypes, populations, covariates, and situations, including interaction terms and multilocus models. This approach should allow far wider use of RAM and SAT, often using standard software, in addressing admixture as either a confounder of association studies or a tool for finding loci influencing complex phenotypes in species as diverse as plants, humans, and nonhuman animals.

In recent years, scientific efforts to find genes influencing disease and health-related traits have sought to capitalize on the unique genetic characteristics of admixed populations. Admixture can refer to the event of two or more genetically diverse populations intermating and producing an admixed population. Admixture creates the potential for efficient identification of trait-influencing genes. However, genetic association studies using admixed populations are also prone to incorrectly concluding that a gene is linked and associated with a trait even when it is not. Several researchers have produced promising statistical methodologies for genetic association studies within admixed populations. In this paper, the authors show how these statistical methods can be unified in a broadly applicable regression framework and discuss which variables should be included in the regression models for valid testing. Because the variables required in this regression framework can only be measured with error, the authors show the consequences of these measurement errors and present measurement error correction methods applicable to this problem. By recasting the statistical methods for genetic association studies within admixed populations as regression models, a broader range of modeling and hypothesis testing becomes available.

When two or more populations have been separated by geographic or cultural boundaries for many generations, differential selection pressures, drift, and spontaneous mutations may lead to different allele frequencies in each population. If individuals from these founding populations subsequently mate, disequilibrium among linked markers in their offspring may span a greater genetic distance than typically found in panmictic populations. This extended disequilibrium can greatly facilitate the ability to detect regions of the genome harboring phenotype-influencing loci by reducing both the number of marker loci required and the cost when compared to disequilibrium mapping in panmictic populations [1,2]. However, this admixture process can, under some circumstances, produce disequilibrium between pairs of unlinked loci, creating confounding (i.e., spurious associations) in genetic association studies [3–5].

Recently, with the availability of genome-wide markers, the wider use and application of Bayesian statistical methods, the use of Markov chain Monte Carlo and hidden Markov methods, and the insight of several investigative groups [6–12], the opportunity for sophisticated admixture mapping has become a reality. These advances also provide the ability to control for possible confounding due to disequilibrium between pairs of unlinked loci created by the admixture process. Several strategies have been proposed for estimating admixture for individuals over the whole genome, as well as in specific regions of the genome [8,10,13]. Methods referred to as structured association tests (SATs) have been proposed that use individual admixture estimates to perform tests of association within admixed populations [7,11,14,15]. Regional admixture mapping (RAM) methods use genome-wide admixture estimates and region-specific admixture estimates to identify specific regions of the genome harboring loci that influence phenotypes [1,13,16]. These methods are especially interesting due to their potential for identifying genetic variants contributing to diseases or phenotypes that have markedly different distributions among breeding groups (or in humans, ethnic groups) [17]. Other methods, such as genomic control, proposed by Devlin and Roeder, attempt to correct for population stratification due to admixture in association testing without inferring or utilizing the details of the population structure [14,18,19]. These methods do not involve the estimation of individual admixture values and will not be discussed in detail here; however, they have been discussed and compared with existing SAT methods elsewhere [14,20,21].

The overall aim of this paper is to provide a general model that conceptually unites RAM and SAT methodologies into an extensible form. To accomplish this, we provide an overview of the problem and existing methods, followed by methodologic clarification. We then present our model and illustrate its properties via simulation. These simulations are not meant to provide a comprehensive description of the operating characteristics of the methods across many situations, but rather to offer illustrations of key methodological points.

Before presenting a unifying approach, we review the justification and underlying principles of both methods.

Hoggart et al. (p. 1492 in [7]) articulated the rationale behind SAT: “In general, population stratification exists when the total population has been formed by admixture between subpopulations and when admixture proportions (defined as the proportions of the genome that have ancestry from each subpopulation) vary between individuals. …If the risk of disease varies with admixture proportions, this will confound associations of disease with genotype at any locus where allele frequencies vary between subpopulations. …If the confounder—admixture proportions—can be measured accurately, control for it can be achieved in a straightforward manner by modeling its effects in the analysis.”

We will show that how one attempts to control for parental ancestry is critical to determining whether one eliminates potential confounding due to variations in parental ancestry. To our knowledge, there are four published approaches to SAT [7,11,12,22]. All are built on this general principle, but take somewhat different approaches. We will not explore the specifics of those approaches here but note that none are couched in a general framework that includes both RAM and SAT. Furthermore, none allow flexible generalization to as broad a range of situations as we would wish.

The overall issue of confounding due to admixture disequilibrium, generalized to any population, is portrayed in the path diagram of Figure 1. In the path diagram, rectangles represent directly observed variables, ellipses represent unobserved or latent variables, dashed ellipses represent variables that can potentially exert influences, and arrows represent direct or causal relationships. The path diagram introduces two key latent constructs, individual ancestry and individual admixture, which underlie the issue of confounding due to variation in individual ancestry. Specifically, an individual ancestry proportion, with respect to a specific parental population, is defined as the proportion of that individual's ancestors who were members of that parental population in the generation prior to the first admixture event. This is in contrast to an individual's admixture, which is the proportion of the individual's genome that is inherited from a specific parental population.

The figure indicates that association testing is not a simple issue. The relationship between the putative quantitative trait locus (QTL) and phenotype is the one of interest, but it can be confounded by other variables. First, note that QTLs and individual admixture can be directly influenced by random variation due to meiosis. In addition, both the phenotype and measured admixture are potentially subject to measurement error. Furthermore, measured admixture is directly affected by individual admixture, which in turn is affected by individual ancestry. Naturally, the ancestry of the parents, represented by P_{1} and P_{2}, affects individual ancestry. Individual ancestry can directly affect the putative QTL, which in turn can affect the phenotype, so individual ancestry has an indirect effect on the phenotype via the putative QTL. The right-hand side of the path diagram is a mirror image of the left-hand side, with the unobserved QTL replacing the putative QTL, and represents the potential path of spurious associations. The diagram also indicates that the product of parental ancestries affects both QTLs. Justification for these paths is provided below.

The consequences of failing to control for variation in ancestry are illustrated in Figure 2A. The simple simulation reveals that type I errors occur 13.24, 41.2, and 193 times as often as expected at the .05, .01, and .001 α levels, respectively, and this inflation is attributable to confounding due to variation in ancestry. SATs are designed to be resistant to such confounding.

We define region-specific admixture as a characteristic of segments of the genomes of individuals. For any given region of the genome, one's region-specific admixture from population *V* is the proportion of alleles in that region that are copies of alleles from members of population *V.* The rationale for RAM rests on two premises. First, the process of admixture creates linkage disequilibrium among linked loci that tends to extend over longer genetic distances than does disequilibrium under long-term panmixia. Second, even after appropriately adjusting for the degree of individual ancestry, the degree of individual region-specific admixture will covary with phenotypes that are influenced by loci that are (1) in the region under study; and (2) in disequilibrium with loci that have different allele frequencies in the parental populations. Both premises are well established [23,24]. Prior to the late 1990s, several authors had formally discussed the possibility of RAM-type approaches [23], but did not offer methods that would control for potential spurious associations [4]. McKeigue first introduced modern approaches to RAM that attempted to control for spurious associations induced by the admixture process [6,25,26].

Several approaches to RAM [6,13,16,25–29] have been published. Some [28] use a two-stage approach in which estimates of individual admixture and region-specific admixture are first obtained in a specialized procedure and then used in an ordinary logistic regression approach with case-control data. This two-stage approach lends itself to generalization and is a simplified form of the unified general linear model approach we present.

There are a number of methodologic points that have been alluded to but have not been completely elucidated in the literature pertaining to how one should condition upon (control for) ancestry within RAM and SAT. Within the next few sections, we seek to clarify these points.

It is unclear from past writing whether it is sufficient to control for individual admixture, individual ancestry, or both to eliminate confounding due to the admixture process. We first clarify that, although sometimes used interchangeably, an individual's admixture and an individual's ancestry are not equivalent variables. To illustrate, consider a set of full siblings that does not include any monozygotic twins. Because they are full siblings, all individuals in the set have equal individual ancestry from specific populations or regions. In fact, all individuals in the set have ancestry equal to the mean or midpoint of their parents' ancestries, represented as P_{1} and P_{2}. However, due to recombination, all individuals will have slightly different admixture values.

Here we show by counterexamples that it is not sufficient to control for individual admixture and it is also not sufficient to control for individual ancestry. We then show that it is sufficient to control for both individual ancestry and the product of parental ancestry. Throughout the paper and our examples, *i* represents the i^{th} individual, *j* the j^{th} locus, *k* the number of alleles at the j^{th} locus, and *V* the number of founding populations. For simplicity we assume 2 founding populations in this paper.

Given variations in parental ancestry, controlling for individual admixture is not sufficient. Imagine an organism with *W* independent genetic segments of equal genetic length. For each individual, let the two parents have equal ancestry. Suppose that the admixture of each segment is known without (measurement) error. Without loss of generality, assume that the segment-specific admixture values (denoted *X*_{j} for the j^{th} segment) are scaled to have equal variance, and let

$$T = \sum_{j=1}^{W} X_j$$

denote the overall individual admixture value (for ease of exposition, we have not divided by *W,* but this is only a linear transformation and will have no impact on the result). Then, the correlation coefficient between *X*_{j} and *X*_{j′} (*j* ≠ *j′*), conditional on *T*, is the partial correlation coefficient ρ_{X_{j}X_{j′}·T}.

The correlation coefficient can be written in terms of simple correlation coefficients

$$\rho_{X_j X_{j'} \cdot T} = \frac{\rho_{X_j X_{j'}} - \rho_{X_j T}\,\rho_{X_{j'} T}}{\sqrt{\left(1 - \rho_{X_j T}^{2}\right)\left(1 - \rho_{X_{j'} T}^{2}\right)}};$$

after substituting (independent segments give ρ_{X_{j}X_{j′}} = 0 and ρ_{X_{j}T} = ρ_{X_{j′}T} = 1/√*W*) and reducing,

$$\rho_{X_j X_{j'} \cdot T} = -\frac{1}{W - 1}$$

for *W* > 1. Thus, it is clear in this situation that the partial correlation coefficient can never be zero and only asymptotically approaches zero as *W* approaches infinity (i.e., as the amount of independent information that goes into the emergent variable of admixture increases infinitely). If ρ_{X_{j}X_{j′}·T} is not guaranteed to be zero, then, conditional on individual admixture, what is inherited at one segment can be correlated with what is inherited at another chromosome. Therefore, controlling for individual admixture is not sufficient to eliminate correlations among unlinked loci and is not sufficient to control for spurious associations. The formula further implies that the distinction between individual ancestry and individual admixture will, all other things being equal, be greatest in organisms such as *Arabidopsis* (diploid chromosome number = 10, 8.0 × 10^{7} base pairs in total length) with short genomes and less in organisms such as crayfish (diploid chromosome number = 200, 8.22 × 10^{9} base pairs in total length) with long genomes (see http://www.genomesize.com).
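
The dependence on *W* can be checked numerically. The following is a minimal sketch, not the authors' code. It assumes independent segment values with equal variance (drawn here as standard normal, which is an illustrative choice and not a model of admixture) and uses the standard partial correlation formula. The function names are hypothetical. Under these assumptions the partial correlation between two segments, given the sum over all *W* segments, is −1/(*W* − 1).

```python
import numpy as np


def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Standard first-order partial correlation of x and y given z."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))


def partial_corr_given_sum(w: int) -> float:
    """Closed-form value for two of w independent, equal-variance segments,
    conditioning on the sum T of all w segments."""
    if w < 2:
        raise ValueError("need at least two segments")
    r_jt = 1.0 / np.sqrt(w)  # corr(X_j, T) when T sums w equal-variance terms
    return float(partial_corr(0.0, r_jt, r_jt))  # independent segments: corr 0


def simulated_partial_corr(w: int, n: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo check: residualize two segments on T, then correlate."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, w))  # assumed toy distribution for segment values
    t = x.sum(axis=1)

    def resid(v: np.ndarray) -> np.ndarray:
        slope = np.cov(v, t)[0, 1] / np.var(t, ddof=1)
        return v - v.mean() - slope * (t - t.mean())

    return float(np.corrcoef(resid(x[:, 0]), resid(x[:, 1]))[0, 1])


if __name__ == "__main__":
    for w in (2, 5, 10, 100):
        print(w, partial_corr_given_sum(w), -1.0 / (w - 1))
    print("simulated, W = 5:", simulated_partial_corr(5))
```

The closed-form and simulated values shrink toward zero only as *W* grows, which is the sense in which conditioning on overall admixture leaves unlinked segments correlated.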

Let *X*_{1} and *X*_{2} denote Bernoulli-distributed random variables indicating whether or not one has inherited two alleles from population *V* at locus 1 and locus 2, respectively, and let *v*_{ij} denote the number of alleles inherited from population *V* by the i^{th} individual at the j^{th} locus.

As can be seen in Table 1, *P*(*v*_{ij} = 2) is not determined solely by individual ancestry but also depends on mating patterns and mixing proportions, via their influence on the distribution of parental mating types. This means that, even conditional upon individual ancestry *A*, there can still be confounding because, in general,

$$P(X_1 = 1, X_2 = 1 \mid A) \neq P(X_1 = 1 \mid A)\,P(X_2 = 1 \mid A).$$

Some models (e.g., [7,12]) control for the linear effect of individual ancestry or individual admixture in regression-type models in an attempt to ensure that RAM and SAT tests are not confounded by variation in ancestry. This will only be valid if one tests only for linear allelic (additive) effects at loci without testing for dominance (genotypic) effects or epistasis. This is because when testing for the allelic effects, the expected number of alleles from population *V* at any one locus among individuals with ancestry *A* from population *V* is

$$E(v_{ij} \mid A) = 2A,$$

which is linear in *A*.

However, the locus-specific effects on complex and quantitative traits cannot a priori be assumed to be additive and can even be overdominant [30–34]. For this reason, many investigators wisely choose to test for genotypic effects in two degrees of freedom models (e.g., [12]) rather than restricting themselves to allelic (additive) effects (compare with [35]). In such situations, controlling only for the linear term of individual ancestry will be insufficient if one uses tests that allow for nonadditive genotypic effects.

The premise of conditioning on parental ancestry was first introduced by McKeigue [26]. Here we expand on the idea and show that it is necessary to condition on both individual ancestry and the product of parental ancestries. It is important to note in the following that, although we are controlling for parental ancestries, this does not imply it is necessary to include parents in RAM and SAT studies (see Text S1 for discussion of estimating parental ancestry solely from offspring data).

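Let *P*_{1i} and *P*_{2i} denote the individual ancestries from population *V* for the two parents, respectively. Note that for any locus, the expected number of *V* alleles depends only on the individual's ancestry; hence, we drop the locus-specific subscript *j* in subsequent equations. Then, at every locus:

$$
\begin{aligned}
P(v_i = 2 \mid P_{1i}, P_{2i}) &= P_{1i}P_{2i},\\
P(v_i = 1 \mid P_{1i}, P_{2i}) &= P_{1i}(1 - P_{2i}) + P_{2i}(1 - P_{1i}),\\
P(v_i = 0 \mid P_{1i}, P_{2i}) &= (1 - P_{1i})(1 - P_{2i}).
\end{aligned}
$$

Furthermore, conditional on *P*_{1i} and *P*_{2i}*,* the number of alleles inherited from one population at a given locus is independent of the number of alleles inherited at another locus for all loci that are unlinked as defined by Mendel's law of independent assortment. Therefore, controlling for *P*(*v*_{i} = 0 | *P*_{1i}, *P*_{2i}) and *P*(*v*_{i} = 2 | *P*_{1i}, *P*_{2i}) is sufficient, because the three probabilities sum to one. A regression model for a phenotype *Y*_{i} that does so can be written as

$$Y_i = \alpha_0 + \alpha_1 P(v_i = 0 \mid P_{1i}, P_{2i}) + \alpha_2 P(v_i = 2 \mid P_{1i}, P_{2i}) + \ldots,$$

in which the missing terms denoted by the ellipsis are those that one is primarily interested in testing. Letting β_{0} = *α*_{0} + *α*_{1}, β_{1} = −2*α*_{1}, and β_{2} = *α*_{1} + *α*_{2} and substituting terms yields:

$$Y_i = \beta_0 + \beta_1 \frac{P_{1i} + P_{2i}}{2} + \beta_2 P_{1i}P_{2i} + \ldots$$

Noting that, by definition, (*P*_{1i} + *P*_{2i})/2 is individual ancestry (*A*_{i}), yields:

$$Y_i = \beta_0 + \beta_1 A_i + \beta_2 P_{1i}P_{2i} + \ldots$$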

As can be seen, the probability distribution of the descent status (and therefore the genotypes if allele frequencies differed in the parental populations) depends on both first- and second-order functions of ancestry but not on any higher-order terms. Thus, to eliminate confounding due to variations in parental ancestry, it is sufficient to control for individual ancestry and the product of parental ancestries. Figure 2B–2D illustrates these points. Specifically, Figure 2B indicates that if the confounding locus acts in an additive fashion, controlling for ancestry without the product of parental ancestries does provide adequate type I control. However, Figure 2C reveals type I errors occur 6.16, 16.4, and 36 times as often as expected at the .05, .01, and .001 α levels, respectively, when the confounding locus acts in an overdominant fashion and the linear term of ancestry alone is used to control for variation in ancestry. Finally, Figure 2D indicates adequate control is achieved when the confounding locus acts in an overdominant fashion and both the linear term of ancestry and the product of parental ancestries are used to control for variation in ancestry.
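
A small numerical illustration may help here. It is a sketch under a stated assumption, not the authors' code: each parent is assumed to transmit an allele of population-*V* origin with probability equal to that parent's ancestry, independently of the other parent. The function names are hypothetical. Two families with the same individual ancestry (0.5) but different parental mating types have the same expected allele count and different genotype-level probabilities, and those probabilities are recovered from individual ancestry plus the product of parental ancestries.

```python
def descent_probs(p1: float, p2: float) -> tuple[float, float, float]:
    """P(v = 0), P(v = 1), P(v = 2) for parental ancestries p1, p2 in [0, 1]."""
    return (
        (1 - p1) * (1 - p2),
        p1 * (1 - p2) + p2 * (1 - p1),
        p1 * p2,
    )


def descent_probs_from_ancestry(a: float, prod: float) -> tuple[float, float, float]:
    """The same probabilities written with individual ancestry a = (p1 + p2) / 2
    and the product of parental ancestries prod = p1 * p2."""
    return (1 - 2 * a + prod, 2 * a - 2 * prod, prod)


def expected_v_alleles(p1: float, p2: float) -> float:
    """Expected number of population-V alleles at a locus."""
    return sum(k * p for k, p in enumerate(descent_probs(p1, p2)))


if __name__ == "__main__":
    for p1, p2 in [(0.5, 0.5), (1.0, 0.0)]:  # both give individual ancestry 0.5
        a, prod = (p1 + p2) / 2, p1 * p2
        print((p1, p2), descent_probs(p1, p2),
              descent_probs_from_ancestry(a, prod), expected_v_alleles(p1, p2))
```

Because the expected allele count is 2*A* in both families, a purely additive test is protected by the linear ancestry term alone, whereas the genotype-level probabilities differ unless the product term is also held fixed.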

One may choose to condition on parental ancestry only if parental ancestry is found to be statistically significant when included in the model or if significant structure is detected in the sample as was described by Pritchard et al. [22] as the first step in their three-step SAT procedure and by Hoggart et al. (p. 1502 in [7]). We refer to this approach as conditional conditioning. If one's goal is to ensure that under H_{0}, the type 1 error rate remains ≤ α, which generally defines a valid test in the frequentist context, then conditional conditioning is not a valid testing strategy. That is, even though covariates may not meet criteria for statistical significance in a finite sample, this does not mean they are not confounders, and failing to include them in the model can lead to inflated type 1 error rates [36]. Therefore, if one is interested in valid RAM and SAT tests of linkage in the presence of association, it is necessary to control for parental ancestry terms as in Equations 10 and 11 regardless of their degree of statistical significance in the model. By analogy, the practice of controlling for parental ancestry only if a significance test of Hardy-Weinberg equilibrium is rejected has the same problem [37]. So too would the practice of attempting to control for parental ancestry only if other tests yielded significant evidence that the sample came from a structured population. This is illustrated in Figure 3, which reveals type I errors occur 7.93, 28.87, and 66.1 times as often as expected at the .05, .01, and .001 α levels, respectively, when conditional conditioning is used.

Here we introduce general models for RAM and SAT that are highly extensible. We define the following notation: *Y,* a phenotype that can be continuous, ordinal, or dichotomous; *A _{i},* ancestry for the i

RAM model:

SAT model:

These general linear models are very flexible. First, dichotomous (e.g., case vs. control), ordinal, time-to-event, or continuous phenotypes can be accommodated by letting the regression model be logistic, Poisson, Cox, or ordinary least squares, respectively. This flexibility is important. Investigators frequently want not only to assess genetic association for dichotomous and static phenotypes such as lupus (yes vs. no) in a case-control study, but also to assess genetic association with longitudinal outcomes (e.g., clinical course in medical research or growth rate in agricultural research), adjusting for covariates including demographic variables and ancestry. Such longitudinal phenotypes can also be accommodated by this general model via the use of mixed models and related techniques for longitudinal data [39,40]. Therefore, the models can be fit in standard software (e.g., SAS), which has the advantage of being widely accessible, well documented, and well tested. This radically increases the likelihood of wide and proper use. Moreover, by being framed in a regression approach, all of the machinery of regression, including diagnostics [41], well-recognized effect size metrics, robust variations [42], the ability to include covariates, and the ability to test interactions are at one's disposal. This immediately makes the models extensible to multilocus and epistatic models. Finally, the RAM approach can be expanded to test a region of a chromosome by, instead of including marker-specific ancestry, including an estimate of the admixture of the region.
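
To make the regression framing concrete, here is a toy sketch of an SAT-style ordinary least squares fit. It is not the authors' simulation and does not reproduce Figure 2. Every modeling choice in it is an assumption made for illustration: parental ancestries are drawn uniformly on [0, 1], alleles at both loci are taken to be fixed for different variants in the two parental populations (so genotype equals the count of population-*V* alleles), the unobserved QTL is overdominant, the test locus has no effect, and ancestries are known without error. Function names are hypothetical. The test locus is coded with two genotype indicators, and the model is fit with and without the product of parental ancestries.

```python
import numpy as np


def simulate(n: int = 100_000, seed: int = 1):
    """Toy admixed sample: unobserved overdominant QTL u, null test locus g."""
    rng = np.random.default_rng(seed)
    p1, p2 = rng.uniform(size=n), rng.uniform(size=n)  # parental ancestries (assumed)

    def allele_count() -> np.ndarray:
        # One allele from each parent; V-origin with probability p1 or p2.
        return (rng.uniform(size=n) < p1).astype(int) + (rng.uniform(size=n) < p2).astype(int)

    u = allele_count()  # unobserved QTL, unlinked to g
    g = allele_count()  # test locus, no effect on the phenotype
    y = (u == 1).astype(float) + rng.normal(scale=0.5, size=n)  # overdominant effect
    return p1, p2, g, y


def fit_genotype_effects(p1, p2, g, y, include_product: bool) -> np.ndarray:
    """OLS of y on intercept, ancestry, optional product of parental ancestries,
    and indicators for g == 1 and g == 2. Returns the two genotype coefficients."""
    a = (p1 + p2) / 2
    cols = [np.ones_like(a), a]
    if include_product:
        cols.append(p1 * p2)
    cols += [(g == 1).astype(float), (g == 2).astype(float)]
    beta, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    return beta[-2:]


if __name__ == "__main__":
    data = simulate()
    print("ancestry only:      ", fit_genotype_effects(*data, include_product=False))
    print("ancestry + product: ", fit_genotype_effects(*data, include_product=True))
```

In this toy setting the heterozygote coefficient at the null test locus is clearly nonzero when only the linear ancestry term is included and is close to zero once the product term is added. A logistic, Cox, or mixed model could replace the least squares step without changing the covariate set.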

Another advantage of the models in Equations 10 and 11 is that they make clear the relationships between RAM and SAT and identity by descent and identity in state in family-based tests of linkage and linkage in the presence of association. RAM is analogous to linkage testing, whereas SAT is analogous to association testing. The *A _{ijk}* values correspond to “descent states,” whereas the

As already discussed, the models in Equations 10 and 11 are easily extended to allow for any phenotypic distribution. Because no constraints are placed on the distribution of the phenotypes, with two exceptions, the models can accommodate selective sampling (e.g., sampling phenotypically extreme subjects or sampling subjects on the basis of ancestry) without modification. In addition, covariates, multiple loci, gene by environment (or gene by sex, gene by age, etc.), and gene by gene (epistasis) effects are easily modeled by simply adding appropriate terms to the right side of the equation. The general linear model presented here can be extended to deal with several situations, which are briefly introduced below. If there are a total of *M* phenotypes to include, one can replace the variable *Y* on the left side of Equations 10 or 11 with a weighted linear composite of *Y* values representing the multiple phenotypes as follows:

Multivariate RAM model:

Multivariate SAT model:

The ξ* _{m}*s are constants to be estimated within the regression framework and are constrained such that
= 1. This constraint is necessary to make the model identifiable.

To our knowledge, no current RAM or SAT test allows related individuals to be included as subjects. (We distinguish the inclusion of related individuals as subjects from the requirement that parents or other relatives be included in some testing procedures as a means of controlling for ancestry [e.g., [46,47]].) Equations 10 and 11 can accommodate related individuals by utilizing software that models the covariance structure among the residuals. Finally, proper estimation of parental ancestry values will require special accommodations for related individuals (e.g., full siblings should obviously be constrained to have the same parental ancestry values, etc.).

The general linear model offered can be extended to allow one to test for linkage conditional upon association with a polymorphism in a region and, thereby, test whether that polymorphism appears to account for an observed linkage signal that was detected with RAM. The right side of Equation 10 can be expanded to include the *G _{ijk}* values. In this situation, one desires a test of whether the amount of variance explained by the

Until now, we have assumed that all variables are known without error. In reality, this will not be the case, which is an important point to recognize. Any of the variables involved can be measured with error, and we now address the consequences of error in each and propose responses to ensure validity of the tests in terms of type 1 error rate control. Throughout, we assume that the measurement errors are independent of each other and of all of the variables under study. We also do not dwell on how one should calculate estimates of individual and parental admixture or estimates of the reliability thereof when used as estimates of individual and parental ancestry. For now, we simply assume that it is possible to do so and briefly address ways in which this might best be accomplished in Text S1.

It is well known that genotyping errors occur and, when they occur, result in reduced power [48]. However, if the measurement error is in the determination of *G _{ijk},* this will only lower power, not inflate the type 1 error rate. Therefore, no response is needed to ensure validity of the test.

Phenotypes are also often measured with error but, again, this will only serve to lower power of the tests we offer and not inflate type 1 error rates [49]. Therefore, no response is needed to ensure validity of the tests.

Unless a perfectly informative marker (i.e., a marker with allele frequencies of zero and one in one parental population and complementary frequencies in the other, respectively) is available at exactly the locus under study, the degree of regional admixture for any individual will only be known probabilistically. Let us denote the (Bayesian posterior) probabilities of individual region-specific admixture as:

Then one can replace *A _{ij}*

Error in the estimates of parental ancestry poses the greatest challenge. As several authors [7,13] noted, unchecked errors in the putatively confounding variables on which one must condition will lead to incomplete control and potentially to residual confounding [51]. Therefore, some method is required to deal with measurement error in the estimates of individual ancestry. Moreover, such measurement errors, or unreliability, can be substantial, as illustrated in Figure 4.

Montana and Pritchard [27] noted that Hoggart et al. [7] had criticized their use of a two-stage approach in which one first calculates ancestry estimates and then in a separate analysis uses those estimates as covariates. A basis of the criticism was that this approach does not account for uncertainty (measurement error) in the ancestry estimates. Montana and Pritchard (p. 786 in [27]) acknowledge that this concern is “theoretically plausible, [but that] extensive simulations of the admixture mapping tests presented here, as well as simulations of the STRAT test … show that, in practice, the statistical tests are indeed correctly calibrated under the null hypothesis… [and that] there are some practical advantages to the two-stage process. First, the two-stage process makes the output much more transparent and interpretable for the end user. Second, it makes it much easier for users to take the ancestry estimates and develop other tests of association that are appropriate for their own data.” We agree with Hoggart et al. [7] that the measurement errors are a concern and our simulations herein demonstrate that under some circumstances measurement errors can produce substantial type 1 error rate inflation. On the other hand, we also agree with Montana and Pritchard [27] that the advantages of the two-stage approach in terms of flexibility and conceptual clarity are profound. Fortunately, measurement error correction methods can allow “the best of both worlds” by retaining the flexibility of the two-stage approach while properly accounting for the measurement error.

While many methods are available (e.g., [52,53]), the most common approach to dealing with errors in variables on the right side of regression equations is regression calibration. In some circumstances (e.g., linear regression), it is effectively the correction for attenuation. This method is a type of resubstitution; instead of the true but unobservable predictor, one substitutes an estimate of it, conditional on the observed covariates (but not the response). Then the idea is to run a standard analysis, and “fix up” the standard errors at the end via devices such as bootstrapping. In linear regression, regression calibration is often considered the default option because it often works surprisingly well. In logistic regression with a relatively rare disease, regression calibration is an almost exact method. One of the major advantages of regression calibration is that it is easy to implement; after the resubstitution, a standard analysis can be run to obtain estimates [54].
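
The resubstitution step can be sketched in a few lines. This is a simplified illustration and not the authors' implementation. It assumes a single predictor, classical additive error, a linear model, and a reliability (the ratio of true-score variance to observed variance) that is known in advance; in practice the reliability would have to be estimated, ancestry estimates are proportions whose error need not be classical, and the standard errors would still need to be fixed up (e.g., by bootstrapping), which the sketch omits. Function names and the simulated values are hypothetical.

```python
import numpy as np


def ols_slope(x: np.ndarray, y: np.ndarray) -> float:
    """Slope from a simple least-squares regression of y on x."""
    return float(np.cov(x, y)[0, 1] / np.var(x, ddof=1))


def calibrate(w: np.ndarray, reliability: float) -> np.ndarray:
    """Best linear prediction of the true predictor from its error-prone
    measurement w, given the (assumed known) reliability of w."""
    return w.mean() + reliability * (w - w.mean())


def demo(n: int = 50_000, seed: int = 2) -> tuple[float, float]:
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)            # true, unobserved predictor
    w = x + rng.normal(size=n)        # observed with error; reliability = 1 / (1 + 1)
    y = 2.0 * x + rng.normal(size=n)  # true slope is 2
    naive = ols_slope(w, y)
    corrected = ols_slope(calibrate(w, reliability=0.5), y)
    return naive, corrected


if __name__ == "__main__":
    print(demo())
```

The naive slope is attenuated by roughly the reliability, and the calibrated predictor recovers a slope near the true value, which is why the method coincides with the correction for attenuation in the linear case.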

Another alternative is the simulation extrapolation (SIMEX) approach [54–57]. SIMEX is more computationally intensive than regression calibration, but it is one of the major default options for nonlinear models that cannot be handled by correction for attenuation techniques or regression calibration—that is, it is extremely flexible and can be used with any incarnation of the general linear model. It is also extremely useful for problems in which the measurement error is not of the classic, additive homoscedastic type, as will occur, for example, in the current case in which the predictor variable (ancestry) is a proportion. As with regression calibration, a great advantage of SIMEX is that it separates the primary statistical modeling component from the error correction component, thereby freeing data analysts to implement the full range of their usual battery of procedures.
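
The SIMEX idea can also be sketched briefly. The version below is a bare-bones illustration under simplifying assumptions, not the authors' procedure: one predictor, a linear model, additive homoscedastic normal error with a known standard deviation `sigma_u`, and a quadratic extrapolant. These are exactly the classical conditions, chosen to keep the sketch short; handling a proportion-valued predictor such as ancestry would require a different error model. Function names and tuning values are hypothetical.

```python
import numpy as np


def ols_slope(x: np.ndarray, y: np.ndarray) -> float:
    """Slope from a simple least-squares regression of y on x."""
    return float(np.cov(x, y)[0, 1] / np.var(x, ddof=1))


def simex_slope(w, y, sigma_u, lambdas=(0.0, 0.5, 1.0, 1.5, 2.0), reps=50, seed=0) -> float:
    """Simulation step: add extra error with variance lam * sigma_u**2 and refit.
    Extrapolation step: fit a quadratic in lam and evaluate it at lam = -1,
    the point that corresponds to no measurement error."""
    rng = np.random.default_rng(seed)
    mean_slopes = []
    for lam in lambdas:
        slopes = [
            ols_slope(w + np.sqrt(lam) * sigma_u * rng.normal(size=w.size), y)
            for _ in range(reps)
        ]
        mean_slopes.append(np.mean(slopes))
    coefs = np.polyfit(lambdas, mean_slopes, deg=2)
    return float(np.polyval(coefs, -1.0))


def demo(n: int = 20_000, seed: int = 3) -> tuple[float, float]:
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                 # true, unobserved predictor
    w = x + 0.5 * rng.normal(size=n)       # observed with error, sigma_u = 0.5
    y = 2.0 * x + rng.normal(size=n)       # true slope is 2
    return ols_slope(w, y), simex_slope(w, y, sigma_u=0.5)


if __name__ == "__main__":
    print(demo())
```

With a quadratic extrapolant the correction is approximate, so the SIMEX slope moves most of the way from the attenuated naive estimate toward the true value without reaching it exactly. The model-fitting step (`ols_slope` here) is the only part that would change for another member of the general linear model family.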

Several other methods exist [58], including multiple imputation [59]. Figures 5A and 5B, respectively, illustrate the residual confounding that can occur when conducting a SAT procedure without correcting for measurement error and the proper control of confounding that occurs when a measurement error correction is used. Figure 5A reveals type I errors occur 1.4, 2.6, and 4 times as often as expected at the .05, .01, and .001 α levels, respectively, when the correct SAT model is specified but imperfect measures of ancestry are used. Once measurement error corrections are applied, Figure 5B indicates that the correct type I error rates are restored.

Our purpose here has not been to become bogged down in the logistics of setting up RAM and SAT studies or to provide detailed evaluations of the performance characteristics of specific designs and analytic implementations. Rather, our goal was to articulate a unified and generalizable approach to RAM and SAT. We have shown through proofs, counterexamples, and small simulations that it is necessary and sufficient to condition on both individual ancestry and the product of parental ancestries, and it is not sufficient to “conditionally condition” on parental ancestries, in order to control for confounding in admixture studies. We provide a general linear model that is extensible to a multitude of study designs, conditions, and populations of interest that are briefly presented, but left to future work for detailed descriptions. Within Text S1, we have also provided a semiparametric reliability assessment method as well as suggestions for accommodating measurement errors. It is worth noting that several open questions, or areas for future research, remain in order for studies using RAM and SAT to be optimally useful. These include expanding our RAM approach to case-only analysis, methods for selecting markers with which to estimate ancestry, development of panels of such markers for different ethnic groups (or demonstration that such a priori–defined panels are not needed [60]), and evaluation of methods for estimating individual ancestry and region-specific admixture (for further discussion on such issues, see [2,61,62]). Additional issues include how RAM and SAT can best be utilized in studies involving DNA pooling and how individual ancestry estimation procedures, and the estimation of the reliability thereof, can best utilize knowledge about the pedigree structure among individuals when related individuals are studied. How to best accommodate pedigree data in the analyses remains a question for RAM and SAT as it does for association testing in general [63]. 
Finally, now that a general model exists, the time is opportune for a thorough evaluation of the performance characteristics under multiple different population genetic models, genetic architectures, sampling strategies, and phenotypic distributions.

Simulation studies were performed using the software SAS (Cary, North Carolina, United States) under the “general island” and intermixture models presented by Zhu et al. [16]. The SAT model β *f*(*Y _{i}*) = β

The authors would like to thank Dr. Chenxi Wang for providing some initial code for the SIMEX implementation; Dr. Raymond Carroll for helpful advice; Drs. David Siegmund, Jonathan Pritchard, Robert Elston, and Hongyu Zhao for helpful comments on earlier drafts; and Dr. Barbara Gower for graciously providing some of the data used to calculate reasonable parameters for our simulations.

- QTL: quantitative trait loci (or locus)
- RAM: regional admixture mapping
- SAT: structured association testing
- SIMEX: simulation extrapolation

A previous version of this article appeared as an Early Online Release on July 18, 2006 (DOI: 10.1371/journal.pgen.0020137.eor).

**Author contributions.** DTR, JD, TMB, MBM, and DBA conceived and designed the experiments. DTR, JD, and DBA performed the experiments. DTR, JD, LKV, and DBA analyzed the data. DTR, JD, HKT, JRF, MAP, and DBA contributed reagents/materials/analysis tools. DTR, JD, LKV, HKT, TMB, RPK, RF, MAP, NL, MBM, and DBA wrote the paper.

**Funding.** Supported in part by National Institutes of Health grants DK49779–03, DK51684–01, RR11811, AR049084, AR048311, AR007450, DK056336, DK067426, HL072757, CA100949, ES09912, and DK062817. The collection of some of these data was also supported by University of Alabama at Birmingham General Clinical Research Center (grant MO1-RR-00032), Nestle Food (Stouffer's Lean Cuisine Entrees) and H. J. Heinz (Weight Watchers Smart Ones).

**Competing interests.** The authors have declared that no competing interests exist.

- Lee WC, Yen YC. Admixture mapping using interval transmission/disequilibrium tests. Ann Hum Genet. 2003;67:580–588. [PubMed]
- Smith MW, O'Brien SJ. Mapping by admixture linkage disequilibrium: Advances, limitations and guidelines. Nat Rev Genet. 2005;6:623–632. [PubMed]
- Chen HS, Zhu X, Zhao H, Zhang S. Qualitative semi-parametric test for genetic associations in case-control designs under structured populations. Ann Hum Genet. 2003;67:250–264. [PubMed]
- Halder I, Shriver M. Measuring and using admixture to study the genetics of complex diseases. Hum Genomics. 2003;1:52–62.
- Cardon LR, Palmer LJ. Population stratification and spurious allelic association. Lancet. 2003;361:598–604. [PubMed]
- McKeigue PM, Carpenter JR, Parra EJ, Shriver MD. Estimation of admixture and detection of linkage in admixed populations by a Bayesian approach: Application to African-American populations. Ann Hum Genet. 2000;64:171–186. [PubMed]
- Hoggart CJ, Parra EJ, Shriver MD, Bonilla C, Kittles RA, et al. Control of confounding of genetic associations in stratified populations. Am J Hum Genet. 2003;72:1492–1504. [PubMed]
- Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics. 2003;164:1567–1587. [PubMed]
- Hanis CL, Chakraborty R, Ferrell RE, Schull WJ. Individual admixture estimates—Disease associations and individual risk of Diabetes and gallbladder-disease among Mexican-Americans in Starr County, Texas. Am J Phys Anthropol. 1986;70:433–441. [PubMed]
- Tang H, Peng J, Wang P, Risch NJ. Estimation of individual admixture: Analytical and study design considerations. Genet Epidemiol. 2005;28:289–301. [PubMed]
- Satten GA, Flanders WD, Yang QH. Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am J Hum Genet. 2001;68:466–477. [PubMed]
- Purcell S, Sham P. Properties of structured association approaches to detecting population stratification. Hum Hered. 2004;58:93–107. [PubMed]
- Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, et al. Methods for high-density admixture mapping of disease genes. Am J Hum Genet. 2004;74:979–1000. [PubMed]
- Pritchard JK, Donnelly P. Case-control studies of association in structured or admixed populations. Theor Popul Biol. 2001;60:227–237. [PubMed]
- Purcell S. Sample selection and complex effects in quantitative trait loci analysis [dissertation]. London: University of London; 2003. 409 p.
- Zhu XF, Cooper RS, Elston RC. Linkage analysis of a complex disease through use of admixed populations. Am J Hum Genet. 2004;74:1136–1153. [PubMed]
- Nievergelt C, Schork N. Admixture mapping as a gene discovery approach for complex human traits and diseases. Curr Hypertens Rep. 2005;7:31–37. [PubMed]
- Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004. [PubMed]
- Bacanu SA, Devlin B, Roeder K. The power of genomic control. Am J Hum Genet. 2000;66:1933–1944. [PubMed]
- Kohler K, Bickeboller H. Case-control association tests correcting for population stratification. Ann Hum Genet. 2006;70:98–115. [PubMed]
- Kohler K, Bickeboller H. Structured association tests in case-control studies. Ann Hum Genet. 2005;69:768.
- Pritchard JK, Stephens M, Rosenberg NA, Donnelly P. Association mapping in structured populations. Am J Hum Genet. 2000;67:170–181. [PubMed]
- Stephens JC, Briscoe D, O'Brien SJ. Mapping by admixture linkage disequilibrium in human-populations—Limits and guidelines. Am J Hum Genet. 1994;55:809–824. [PubMed]
- Pfaff CL, Parra EJ, Bonilla C, Hiester K, McKeigue PM, et al. Population structure in admixed populations: Effect of admixture dynamics on the pattern of linkage disequilibrium. Am J Hum Genet. 2001;68:198–207. [PubMed]
- McKeigue PM. Mapping genes underlying ethnic differences in disease risk by linkage disequilibrium in recently admixed populations. Am J Hum Genet. 1997;60:188–196. [PubMed]
- McKeigue PM. Mapping genes that underlie ethnic differences in disease risk: Methods for detecting linkage in admixed populations, by conditioning on parental admixture. Am J Hum Genet. 1998;63:241–251. [PubMed]
- Montana G, Pritchard JK. Statistical tests for admixture mapping with case-control and cases-only data. Am J Hum Genet. 2004;75:771–789. [PubMed]
- Zhang C, Chen K, Seldin MF, Li HZ. A hidden Markov modeling approach for admixture mapping based on case-control data. Genet Epidemiol. 2004;27:225–239. [PubMed]
- McKeigue PM. Multipoint admixture mapping. Genet Epidemiol. 2000;19:464–465. [PubMed]
- Cockett NE, Jackson SP, Shay TL, Farnir F, Berghmans S, et al. Polar overdominance at the Ovine callipyge locus. Science. 1996;273:236–238. [PubMed]
- Kim JJ, Farnir F, Savell J, Taylor JF. Detection of quantitative trait loci for growth and beef carcass fatness traits in a cross between Bos taurus (Angus) and Bos indicus (Brahman) cattle. J Anim Sci. 2003;81:1933–1942. [PubMed]
- Kim KS, Kim LL, Dekkers LCM, Rothschild MF. Polar overdominant inheritance of a DLK1 polymorphism is associated with growth and fatness in pigs. Mamm Genome. 2004;15:552–559. [PubMed]
- Luo LJ, Li ZK, Mei HW, Shu QY, Tabien R, et al. Overdominant epistatic loci are the primary genetic basis of inbreeding depression and heterosis in rice. II. Grain yield components. Genetics. 2001;158:1755–1771. [PubMed]
- Li ZK, Luo LJ, Mei HW, Wang DL, Shu QY, Tabien R, et al. Overdominant epistatic loci are the primary genetic basis of inbreeding depression and heterosis in rice. I. Biomass and grain yield. Genetics. 2001;158:1737–1753. [PubMed]
- Rebbeck TR, Martinez ME, Sellers TA, Shields PG, Wild CP, Potter JD. Genetic variation and cancer: Improving the environment for publication of association studies. Cancer Epidemiol Biomarkers Prev. 2004;13:1985–1986. [PubMed]
- Mickey RM, Greenland S. The impact of confounder selection criteria on effect estimation. Am J Epidemiol. 1989;129:125–137. [PubMed]
- Deng HW, Chen WM, Recker RR. Population admixture: Detection by Hardy-Weinberg test and its quantitative effects on linkage-disequilibrium methods for localizing genes underlying complex traits. Genetics. 2001;157:885–897. [PubMed]
- McCullagh P, Nelder J. Generalized linear models. London: Chapman and Hall; 1989. 511 p.
- Heo M, Faith MS, Mott JW, Gorman BS, Redden DT, Allison DB. Hierarchical linear models for the development of growth curves: An example with body mass index in overweight/obese adults. Stat Med. 2003;22:1911–1942. [PubMed]
- Sullivan L, Dukes K, Losina E. Tutorial in biostatistics. An introduction to hierarchical linear modeling. Stat Med. 1999;18:855–888. [PubMed]
- Fox J. Regression diagnostics. Newbury Park (California): Sage Publications; 1991. 92 p.
- Rousseeuw PJ, Leroy AM. Robust regression and outlier detection. New York: John Wiley & Sons, Inc; 1987. 329 p.
- Almasy L, Blangero J. Exploring positional candidate genes: Linkage conditional on measured genotype. Behav Genet. 2004;34:173–177. [PubMed]
- Li MY, Boehnke M, Abecasis GR. Joint modeling of linkage and association: Identifying SNPs responsible for a linkage signal. Am J Hum Genet. 2005;76:934–949. [PubMed]
- Li C, Scott LJ, Boehnke M. Assessing whether an allele can account in part for a linkage signal: The Genotype-IBD Sharing Test (GIST). Am J Hum Genet. 2004;74:418–431. [PubMed]
- Lee WC, Yen YC. Admixture mapping using interval transmission/disequilibrium tests. Ann Hum Genet. 2003;67:580–588. [PubMed]
- Lin S, Chakravarti A, Cutler DJ. Exhaustive allelic transmission disequilibrium tests as a new approach to genome-wide association studies. Nat Genet. 2004;36:1181–1188. [PubMed]
- Kang SJ, Finch SJ, Haynes C, Gordon D. Quantifying the percent increase in minimum sample size for SNP genotyping errors in genetic model-based association studies. Hum Hered. 2004;58:139–144. [PubMed]
- Edwards BJ, Haynes C, Levenstein MA, Finch SJ, Gordon D. Power and sample size calculations in the presence of phenotype errors for case/control genetic association studies. BMC Genet. 2005;6:18. [PMC free article] [PubMed]
- Liu B. Statistical genomics: Linkage, mapping, and QTL analysis. Boca Raton: CRC Press; 1997. 611 p.
- Becher H. The concept of residual confounding in regression-models and some applications. Stat Med. 1992;11:1747–1758. [PubMed]
- Cheng CL, Schneeweiss H, Thamerus M. A small sample estimator for a polynomial regression with errors in the variables. J R Stat Soc Ser B Stat Methodol. 2000;62:699–709.
- Cheng CL, Van Ness J. Statistical regression with measurement error. New York: Oxford University Press; 1999. 262 p.
- Carroll RJ, Kuchenhoff H, Lombard F, Stefanski LA. Asymptotics for the SIMEX estimator in nonlinear measurement error models. J Am Stat Assoc. 1996;91:242–250.
- Lin XH, Carroll RJ. Nonparametric function estimation for clustered data when the predictor is measured without/with error. J Am Stat Assoc. 2000;95:520–534.
- Carroll RJ, Ruppert D, Stefanski LA. Measurement error in nonlinear models. London: Chapman & Hall/CRC; 1998. 305 p.
- Stefanski LA, Cook JR. Simulation extrapolation: The measurement error jackknife. J Am Stat Assoc. 1995;90:1247–1256.
- Gustafson P. Measurement error and misclassification in statistics and epidemiology: Impacts and Bayesian adjustments. London: Chapman & Hall/CRC; 2004. 188 p.
- Rubin DB. Multiple imputation after 18+ years. J Am Stat Assoc. 1996;91:473–489.
- Zhang SL, Zhu XF, Zhao HY. On a semiparametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genet Epidemiol. 2003;24:44–56. [PubMed]
- McKeigue PM. Prospects for admixture mapping of complex traits. Am J Hum Genet. 2005;76:1–7. [PubMed]
- Reich D, Patterson N. Will admixture mapping work to find disease genes? Philos Trans R Soc Lond B Biol Sci. 2005;360:1605–1607. [PMC free article] [PubMed]
- Slager SL, Schaid DJ, Wang L, Thibodeau SN. Candidate-gene association studies with pedigree data: Controlling for environmental covariates. Genet Epidemiol. 2003;24:273–283. [PubMed]
- Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959. [PubMed]
- Neale MC, Cardon LR. Methodology for genetic studies of twins and families. Boston: Kluwer Academic Publishers; 1992. 496 p.
- Lara-Castro C, Hunter GR, Lovejoy JC, Gower BA, Fernandez JR. Apolipoprotein A-II polymorphism and visceral adiposity in African-American and white women. Obes Res. 2005;13:507–512. [PubMed]
- Bonilla C, Shriver MD, Parra EJ, Jones A, Fernandez JR. Ancestral proportions and their association with skin pigmentation and bone mineral density in Puerto Rican women from New York city. Hum Genet. 2004;115:57–68. [PubMed]
- Gower BA, Fernandez JR, Beasley TM, Shriver MD, Goran MI. Using genetic admixture to explain racial differences in insulin-related phenotypes. Diabetes. 2003;52:1047–1051. [PubMed]
- Sasieni PD. From genotypes to genes: Doubling the sample size. Biometrics. 1997;53:1253–1261. [PubMed]
- Cook JR, Stefanski LA. Simulation-extrapolation estimation in parametric measurement error models. J Am Stat Assoc. 1994;89:1314–1328.

Articles from PLoS Genetics are provided here courtesy of **Public Library of Science**
