Hum Hered. 2009 April; 68(1): 65–72.

Published online 2009 April 1. doi: 10.1159/000210450

PMCID: PMC2716289

NIHMSID: NIHMS116126

Miguel A. Padilla,^{a,}^{b,}^{*} Jasmin Divers,^{c} Laura K. Vaughan,^{d} David B. Allison,^{d,}^{e} and Hemant K. Tiwari^{d}

*Miguel A. Padilla, PhD, Department of Psychology, Old Dominion University, 250 Mills Godwin Building, Norfolk, VA 23505 (USA), Tel. +1 757 683 4448, Fax +1 757 683 5087, E-Mail mapadill@odu.edu

Received 2008 June 30; Accepted 2008 November 6.

Copyright © 2009 by S. Karger AG, Basel

Structured association tests (SAT), like any statistical model, assume that all variables are measured without error. Measurement error can bias parameter estimates and confound residual variance in linear models. It has been shown that admixture estimates can be contaminated with measurement error, causing SAT models to suffer from the same afflictions. Multiple imputation (MI) is presented as a viable tool for correcting measurement error problems in SAT linear models, with emphasis on correcting admixture estimates contaminated by measurement error.

Several MI methods are presented and compared, via simulation, in terms of controlling Type I error rates for both non-additive and additive genotype coding.

Results indicate that MI using the Rubin or Cole method can be used to correct for measurement error in admixture estimates in SAT linear models.

Although MI can be used to correct for admixture measurement error in SAT linear models, the data should be of reasonable quality, in terms of marker informativeness, because the method borrows information from the existing data to make the measurement error corrections. If the data are of poor quality, there is little information to borrow.

In statistical modeling, ignoring confounding variables can lead to either increased false positive or increased false negative rates [1] and a bias in parameter estimates either away from or toward a null value. A confounder is a variable that is correlated with the predictor(s) and the outcome variable(s) in the model, and can cause a biased estimation of the causal association between these variables if not properly taken into account. To control for a confounder's effects, it is often included in the model as a covariate, which partials out its relationship with the predictor(s) and outcome variable(s) to obtain more accurate estimates of the relationship between predictor(s) and outcome(s). In genetic association studies, there is overwhelming evidence that population stratification, assortative mating, and admixture among populations can result in intrapopulation variation in ancestry, correlations of allelic variation among unlinked loci, and ultimately confound association studies [2,3,4,5,6].

When discussing individual ancestry and individual admixture, it is important to distinguish what is meant by these two concepts. By individual ancestry (proportion) we mean the proportion of an individual's ancestors that come from a specified population. In contrast, individual admixture (proportion) is defined as the proportion of an individual's genome that is inherited from a specific parental population [7].

Several approaches to correct for population stratification and admixture have been proposed. Genomic control (GC) [4, 8, 9] and structured association testing (SAT) [10,11,12,13] are two such statistical approaches. Although GC can be useful in correcting for population stratification, we focus here on precisely estimating ancestry and using it as a covariate in SAT. The SAT model can flexibly accommodate time-to-event, dichotomous, ordinal, or continuous responses for the outcome measure and the model parameters can be estimated through standard statistical software. However, the model is subject to the same assumptions associated with standard linear models, including an implicit assumption that *all variables are measured without error*. In linear models, measurement error in predictors can introduce bias in the parameter estimates and increase the residual variance, which translates into inaccurate conclusions about hypotheses being tested.

Admixture may mask the true relationship between the phenotype (outcome variable) and genotypes (predictors) and produce false positives [14,15,16,17] and/or false negatives [18]. Individual admixture estimates are typically used as proxies for individual ancestry because individual ancestry is rarely known. Redden et al. [7] and Divers et al. [19] have shown that individual admixture estimates, as proxies for individual ancestry, are contaminated with measurement error for several reasons. First, only a subset of genetic markers with imperfectly known ancestral population allele frequencies is used to estimate admixture (i.e., not fully ancestry informative markers). Second, imperfect historical knowledge about the admixed population can lead to inaccurate estimates of individual admixture. Third, individual ancestry is the expected value of individual admixture, but the process of meiosis introduces random variation between the two constructs. Finally, genotyping errors will also contribute to individual admixture being estimated with error. All or any one of these conditions will cause a discrepancy between individual ancestry and estimates of individual admixture, which translates into error contaminated ancestry estimates.

This paper addresses accounting for admixture measurement error in SAT and explores a specific alternative, multiple imputation (MI), to the methods previously described by Divers et al. [19]. We use simulation to evaluate the performance of the proposed methods and conclude with a discussion of results and how the methods can be extended.

Redden et al. [7] formulated SAT in the form of a general linear model as follows:

$$f\left({Y}_{i}\right)={\beta}_{0}+{\beta}_{1}{A}_{i}+{\beta}_{2}{P}_{1i}{P}_{2i}+{\beta}_{3}{G}_{ij1}+{\beta}_{4}{G}_{ij2}+{\epsilon}_{i}.$$

(1.1)

In the model, *f*(*Y*_{i}) is the link function linking the *Y*_{i} variable (phenotype) to the parameters of the model, *A*_{i} is the ancestry of the *i*-th individual, *P*_{1i} and *P*_{2i} are the ancestry values of the two parents, and *G*_{ijk} is an indicator variable for the *i*-th individual with *k* and only *k* alleles at the *j*-th locus of type *m* (specific allele states). Redden et al. [7] propose inclusion of the product term for parental ancestry to better control for spurious association and achieve the desired Type I error rate. This general model can accommodate covariates such as gender, age, and treatment group and phenotypes such as time-to-event, dichotomous, ordinal, or continuous responses. The *A*_{i} ancestry component is included to control for the potential confounding effect and must either be assumed to be measured without error or be corrected for measurement error.

Admixture estimates can be expressed in the form of the classical true-score model (CTM) [20, 21] as

$${x}_{ij}={\tau}_{i}+{u}_{ij}$$

(1.2)

where *x*_{ij} is the *j*-th observed score (estimated admixture) for the *i*-th individual, τ_{i} is the true score (ancestry) for the *i*-th individual, and *u*_{ij} are the random components for the *j*-th admixture estimate (*j* = 1,2, …, *p*). In the CTM it is typically assumed that *E*(*u*_{ij}) = 0 and *var*(*u*_{ij}) = σ^{2}_{u} with the *u*_{ij} mutually independent of each other and of τ_{i} [20, 21]. It can then be shown that *E*(*x*_{ij}) = τ_{i} or μ_{xi} = τ_{i} and σ^{2}_{x} = σ^{2}_{τ} + σ^{2}_{u}. Note that τ_{i} and *u*_{ij} are latent variables that are never observed, but both influence *x*_{ij}, which is observed. Nevertheless, an estimate of σ^{2}_{u} can be obtained using only the data from the *x*_{ij}'s. This can be done through a reliability coefficient, generically defined as

$${\rho}_{x\tau}^{2}={\sigma}_{\tau}^{2}/{\sigma}_{x}^{2}={\sigma}_{\tau}^{2}/({\sigma}_{\tau}^{2}+{\sigma}_{u}^{2})$$

(1.3)

and ranges from 0 to 1 [20]. It should be noted that ρ^{2}_{xτ} is sometimes referred to as the intra-class correlation. Of specific interest here is Cronbach's alpha (α_{c}), a measure of the reliability of the equally weighted sum

$$x=\sum _{j=1}^{p}{x}_{j}$$

[22], computed as

$${\alpha}_{c}=\left[p/\left(p-1\right)\right]\left[1-\sum _{j=1}^{p}cov\left({x}_{j},{x}_{j}\right)/\sum _{j=1}^{p}\sum _{{j}^{\prime}=1}^{p}cov\left({x}_{j},{x}_{{j}^{\prime}}\right)\right].$$

(1.4)

The computation of α_{c} only requires that the *x*_{j}'s measure the same construct or latent variable (i.e., tau-equivalence) [22]. The estimated reliability coefficient in turn provides an estimate of σ^{2}_{u} as σ^{2}_{u} = σ^{2}_{x} (1 – ρ^{2}_{xτ}) = σ^{2}_{x} (1 – α_{c}) [20], which is a weighted estimate of the observed score variance. Note that α_{c} is being used instead of ρ^{2}_{xτ}. In genetic association/mapping studies of population data, ancestry informative markers (AIMs) on each of the autosomal chromosomes can be used to obtain chromosome-specific admixture estimates for each person; conditional on true individual ancestry, these estimates are independent. From here on we denote the admixture estimate for an individual by *x*_{ij}. The chromosome-specific admixture estimates can be used to estimate α_{c}. For a discussion of how Cronbach's alpha affects association tests, see Divers et al. [19].
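As a rough illustration of equation (1.4), the sketch below computes Cronbach's alpha from a matrix of chromosome-specific admixture estimates and then derives the error variance σ^{2}_{u} = σ^{2}_{x}(1 – α_{c}). The function names, the toy data, and the choice to take σ^{2}_{x} as the variance of each individual's mean estimate are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def cronbach_alpha(estimates):
    """Cronbach's alpha (Eq. 1.4) for an (individuals x p) array of estimates."""
    estimates = np.asarray(estimates, dtype=float)
    p = estimates.shape[1]
    cov = np.cov(estimates, rowvar=False)  # p x p covariance matrix of the x_j
    # trace(cov) sums cov(x_j, x_j); cov.sum() sums cov(x_j, x_j') over all j, j'
    return (p / (p - 1)) * (1.0 - np.trace(cov) / cov.sum())

def error_variance(estimates):
    """sigma_u^2 = sigma_x^2 * (1 - alpha_c).

    Assumption: sigma_x^2 is the variance of each individual's mean estimate.
    """
    estimates = np.asarray(estimates, dtype=float)
    sigma2_x = np.var(estimates.mean(axis=1), ddof=1)
    return sigma2_x * (1.0 - cronbach_alpha(estimates))

# Hypothetical toy data: 22 chromosome-specific estimates per individual,
# generated from the classical true-score model x_ij = tau_i + u_ij.
rng = np.random.default_rng(0)
n, p = 5000, 22
tau = rng.normal(0.8, 0.10, size=n)                    # true ancestry
x = tau[:, None] + rng.normal(0.0, 0.20, size=(n, p))  # observed estimates

alpha_c = cronbach_alpha(x)
sigma2_u = error_variance(x)
print(f"alpha_c = {alpha_c:.3f}, sigma_u^2 = {sigma2_u:.5f}")
```

With these toy settings the reliability of the averaged score is 0.01 / (0.01 + 0.04/22), roughly 0.85, so the estimate should land near that value.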

Consider the linear model

$$Y={\beta}_{0}+\beta X+\epsilon ,$$

(1.5)

with ε ~ *NID*(0, σ^{2}_{ε}). If *X* is measured with error, it can be shown that the Ordinary Least Squares (OLS) regression of *Y* on *X* yields a consistent estimator of

$$\beta *=\left[{\sigma}_{\tau}^{2}/({\sigma}_{\tau}^{2}+{\sigma}_{u}^{2})\right]\beta ={\alpha}_{c}\beta ,$$

(1.6)

which is attenuated towards zero. In addition, measurement error affects the residual variance as seen in the expression

$$var(Y|X)={\sigma}_{\epsilon}^{2}+{\sigma}_{u}^{2}\left[{\sigma}_{\tau}^{2}/\left({\sigma}_{\tau}^{2}+{\sigma}_{u}^{2}\right)\right]{\beta}^{2}.$$

(1.7)

From the above two expressions, the smaller the measurement error variance (σ^{2}_{u}), the closer β* will be to β and the residual variance will be less confounded. Of course, neither problem will exist when there is no measurement error (σ^{2}_{u} = 0).
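The attenuation in equation (1.6) and the inflated residual variance in equation (1.7) can be checked numerically. The following is a minimal simulation sketch; the parameter values and variable names are arbitrary choices for illustration, not values used by the authors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta0, beta = 35.0, 5.0
sigma_tau, sigma_u, sigma_eps = 1.0, 0.5, 2.0  # illustrative values only

tau = rng.normal(0.0, sigma_tau, n)                     # true predictor
x = tau + rng.normal(0.0, sigma_u, n)                   # error-contaminated predictor
y = beta0 + beta * tau + rng.normal(0.0, sigma_eps, n)  # outcome depends on the truth

# OLS of Y on the error-contaminated X (np.polyfit returns slope first)
slope, intercept = np.polyfit(x, y, 1)
resid_var = np.var(y - (intercept + slope * x), ddof=2)

reliability = sigma_tau**2 / (sigma_tau**2 + sigma_u**2)                # 0.8
expected_slope = reliability * beta                                     # Eq. 1.6
expected_resid_var = sigma_eps**2 + sigma_u**2 * reliability * beta**2  # Eq. 1.7

print(f"slope: {slope:.3f} (theory {expected_slope:.3f})")
print(f"residual variance: {resid_var:.3f} (theory {expected_resid_var:.3f})")
```

Under these settings the fitted slope should be close to 4 rather than the true value of 5, and the residual variance close to 9 rather than 4.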

Divers et al. [19] demonstrated the use of quadratic measurement error correction (QMEC) [23, 24], regression calibration [25], expanded regression calibration [26, 27], and the simulation extrapolation (SimEx) algorithm [28, 29] to address the admixture measurement error challenge in SAT models. They found that the QMEC method performed best in terms of controlling the Type I error rate and the expanded regression calibration method performed the worst. However, the QMEC method is limited to linear models making a more flexible model desirable. Multiple imputation (MI) can in principle correct for measurement error in the general SAT model of Redden et al. [7] and flexibly accommodate a variety of special cases such as logistic and Cox regression.

Measurement error problems may be conceptualized as missing data problems in which we observe imperfect measurements but the true scores are never seen (missing) [29]. Using MI to impute the missing true values, in conjunction with alpha to estimate the measurement error variance, has the advantage of relying only on the observed data, as opposed to (a) validation data in which the true values of the variable are actually observed, (b) replication data where multiple measurements of the variable are made, or (c) instrumental data [29] in which two or more alternative methods are required to measure the variable.

In MI, imputed true values are treated as probable values rather than as the one ‘true’ value; treating a single imputed value as the truth ignores imputation variability. Imputing a single value fails to take into account the uncertainty about the actual value and can lead to underestimated standard errors, confidence intervals with less than nominal coverage, and inflated Type I error rates. MI accounts for this uncertainty by imputing multiple values for each missing value, and it yields valid estimates and tests under certain assumptions about the missing data mechanism [for details, see 32, 33].
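The way MI carries imputation uncertainty into the final inference is usually through Rubin's combining rules, which the text above does not spell out. The sketch below states them under that assumption: the pooled estimate is the mean of the *m* per-imputation estimates, and the total variance adds the between-imputation variance, inflated by (1 + 1/*m*), to the average within-imputation variance. The function name and the numbers are hypothetical.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine m per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)       # pooled point estimate
    within = mean(variances)      # average within-imputation variance
    between = variance(estimates) # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total

# Hypothetical estimates of one coefficient from m = 3 imputed datasets
q_bar, total_var = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
print(q_bar, total_var)
```

A single imputation corresponds to dropping the between-imputation term, which is why its standard errors tend to be too small.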

To use MI for measurement error correction one can proceed by obtaining an estimate of the true score (ancestry) for *i*-th individual based on the observed data [21] by formulating the prediction equation from regression theory as follows:

$$({\hat{Y}}_{i}-{\mu}_{Y})/{\sigma}_{Y}={\rho}_{XY}({X}_{i}-{\mu}_{X})/{\sigma}_{X},$$

(1.8)

where Ŷ_{i} is the predicted score, ρ_{XY} is the correlation between *X* and *Y*, μ_{Y} and μ_{X}, and σ_{Y} and σ_{X} are the means and standard deviations of *Y* and *X*, respectively. Equation 1.8 can be rewritten as

$${\hat{Y}}_{i}={\rho}_{XY}\frac{{\sigma}_{Y}}{{\sigma}_{X}}\left({X}_{i}-{\mu}_{X}\right)+{\mu}_{Y}.$$

(1.9)

Substituting ${\hat{\tau}}_{i}$ for Ŷ_{i}, ρ_{Xτ} for ρ_{XY}, σ_{τ}/σ_{X} = ρ_{Xτ}, and μ_{τ} = μ_{X} yields

$${\hat{\tau}}_{i}={\rho}_{X\tau}^{2}({X}_{i}-{\mu}_{X})+{\mu}_{X}.$$

(1.10)

Note that α_{c} is used instead of ρ^{2}_{Xτ}. The variance associated with this estimated true score is ${\hat{\sigma}}_{u}^{2}={\hat{\sigma}}_{x}^{2}\left(1-{\hat{\alpha}}_{c}\right)$. The reliability index is defined as ρ_{Xτ} = σ_{τ}/σ_{X} [21]. Equation 1.10 is a Bayesian or ‘shrunken’ estimator [30]. Thus, probable true scores can be generated using estimated coefficients $\left({\hat{\alpha}}_{c}\right)$ and variances $\left({\sigma}_{u}^{2}\right)$. This idea will be revisited in the imputation process.
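To make the shrinkage in equation (1.10) concrete, here is a small sketch that pulls each observed score toward the sample mean by a factor of α_{c}. The function name and the example values are invented for illustration, and the sample mean stands in for μ_{X}.

```python
def shrunken_true_scores(x, alpha_c):
    """Eq. 1.10 with alpha_c in place of rho^2: tau_hat = alpha_c * (x - mean) + mean."""
    mu_x = sum(x) / len(x)  # sample mean used as the estimate of mu_X
    return [alpha_c * (xi - mu_x) + mu_x for xi in x]

# Hypothetical admixture estimates for three individuals, reliability 0.8
tau_hat = shrunken_true_scores([0.70, 0.80, 0.90], alpha_c=0.8)
print([round(t, 2) for t in tau_hat])  # -> [0.72, 0.8, 0.88]
```

With α_{c} = 1 the observed scores are returned unchanged, and with α_{c} = 0 every individual is assigned the mean.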

Redden et al. [7] indicated that the product of parental ancestries is required to achieve the desired Type I error rate when genotypic (as opposed to simply allelic) effects at the marker locus are tested. Divers et al. [19] found that squaring the individual admixture estimate ‘adequately approximates the product of ancestral ancestries’. Hence, in the present context, quadratic terms of the probable true scores are also required. Here, we justify the centering of the admixture estimate before implementation of MI. Assume that *X* ~ *N*(μ, σ^{2}), then

$$cov(X,{X}^{2})=({\mu}^{3}+3\mu {\sigma}^{2})-\mu ({\mu}^{2}+{\sigma}^{2})$$

(1.11)

$$cov(X,{X}^{2})=2\mu {\sigma}^{2}.$$

(1.12)

Centering *X* gives (*X* – μ) ~ *N*(0, σ^{2}), and it then follows that cov((*X* – μ), (*X* – μ)^{2}) = 0. Thus, centering the admixture estimate allows one to ignore the covariance between *X* and *X*^{2} in the imputation process, so that only the squaring of the probable true score is required.
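Equations (1.11) and (1.12) and the effect of centering can be verified with a quick simulation. This is only a numerical sanity check under arbitrary values of μ and σ chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.8, 0.1  # illustrative values only
x = rng.normal(mu, sigma, 1_000_000)

# Uncentered: cov(X, X^2) should be close to 2 * mu * sigma^2 (Eq. 1.12)
cov_raw = np.cov(x, x**2)[0, 1]
theory_raw = 2 * mu * sigma**2

# Centered: the covariance vanishes because the third central moment
# of a normal distribution is zero
xc = x - x.mean()
cov_centered = np.cov(xc, xc**2)[0, 1]

print(f"raw: {cov_raw:.5f} (theory {theory_raw:.5f}), centered: {cov_centered:.6f}")
```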

Using the SAT model proposed by Redden et al. [7], and given in equation (1.1), the following steps were implemented for the MI process.

- 1. Measurement model
  - a. *Regression method:* Regress *X*_{i}, the error contaminated variable, on the other variables in the model of interest. In our model this is *X*_{i} = β_{0} + β_{Y}*Y*_{i} + β_{2}*G*_{ij,1} + β_{3}*G*_{ij,2} + ε_{i}. This step is identical to standard imputation routines in which *X*_{i} is the variable with missing values.

- 2. Imputation process: Draw regression coefficients from the posterior distribution
  - a. *Cole et al. (2006):* This method uses the estimated parameters $\hat{\beta}={\left({\hat{\beta}}_{0}\,{\hat{\beta}}_{Y}\,{\hat{\beta}}_{2}\,{\hat{\beta}}_{3}\right)}^{\prime}$ and ${\hat{\Sigma}}_{\hat{\beta}}$ from Step 1, where ${\hat{\Sigma}}_{(\cdot )}={\hat{\sigma}}^{2}{\left({X}^{\prime}X\right)}^{-1}$, and ${\hat{\sigma}}^{2}={\alpha}_{c}{\hat{\sigma}}_{e}^{2}$. Draw a new set of *m* random parameter estimates as ${\hat{\beta}}^{\left(m\right)}=\hat{\beta}+{V}_{\hat{\beta}}^{\prime}Z$ from Step 1, where ${\hat{\Sigma}}_{(\cdot )}={V}_{(\cdot )}^{\prime}{V}_{(\cdot )}$, and *Z* is a vector of *z*_{i} ~ *NID*(0, 1).
  - b. *Rubin (1987, pp 166–167):* In this method draws are made from the new set of *m* random parameter estimates as ${\hat{\beta}}^{\left(m\right)}=\hat{\beta}+{\sigma}_{*}{V}^{\prime}Z$ from Step 1, where ${\left({X}^{\prime}X\right)}^{-1}={V}^{\prime}V$, ${\sigma}_{*}^{2}={\hat{\sigma}}^{2}\left(d{f}_{\hat{e}}-1\right)/g$, ${\hat{\sigma}}^{2}={\alpha}_{c}{\hat{\sigma}}_{e}^{2}$, $g\sim {\chi}^{2}\left(d{f}_{\hat{e}}-1\right)$, and $d{f}_{\hat{e}}$ is the degrees of freedom (df) for the error term.
  - c. *Bootstrap (Rubin, 1987):* With this method, rather than making draws from *Z* ~ *NID*(0, 1) as in 2(a) and 2(b), the residuals from the fitted model are bootstrapped. Everything remains the same as option 2(a) and 2(b) except that

    $${e}_{i}^{*}={e}_{i}/\sqrt{{\hat{\sigma}}^{2}\left(1-k/n\right)}$$

    is used instead of *z*_{i}, where *e*_{i} is the standardized residual for the *i*-th individual, ${\hat{\sigma}}^{2}$ is the estimated variance, *k* is the number of parameters in the model, and *n* is the sample size. This method has the advantage of imputing values whose distribution is similar to that of the observed values [31].

  All options in Step 2 simulate draws from the posterior predictive distribution of the parameters. This allows for ‘proper’ imputation [32] because the estimates produced in Step 2 are only probable estimates and not the true estimates.

- 3. Imputation process: Drawing *m* new probable true scores.
  - a. ${\tilde{T}}_{i}^{(m)}={\tilde{\beta}}_{0}^{(m)}+{\tilde{\beta}}_{Y}^{(m)}{Y}_{i}+{\tilde{\beta}}_{2}^{(m)}{G}_{ij,1}+{\tilde{\beta}}_{3}^{(m)}{G}_{ij,2}+{z}_{i}\hat{\sigma}$ (Cole)
  - b. ${\dddot{T}}_{i}^{(m)}={\tilde{\beta}}_{0}^{(m)}+{\tilde{\beta}}_{Y}^{(m)}{Y}_{i}+{\tilde{\beta}}_{2}^{(m)}{G}_{ij,1}+{\tilde{\beta}}_{3}^{(m)}{G}_{ij,2}+{z}_{i}{\sigma}_{*}$ (Rubin)
  - c. ${\dot{T}}_{i}^{(m)}={\tilde{\beta}}_{0}^{(m)}+{\tilde{\beta}}_{Y}^{(m)}{Y}_{i}+{\tilde{\beta}}_{2}^{(m)}{G}_{ij,1}+{\tilde{\beta}}_{3}^{(m)}{G}_{ij,2}+{z}_{i}^{*}{\sigma}_{*}$ (Bootstrap),

  where *z*_{i} ~ *NID*(0, 1).

- 4. Fit the model of interest using the new *m* probable true scores. For the SAT model discussed, this is

  $${Y}_{i}^{(m)}={\beta}_{0}+{\beta}_{1}{\hat{T}}_{i}^{(m)}+{\beta}_{2}{\hat{T}}_{i}^{(m)2}+{\beta}_{3}{G}_{ij,1}+{\beta}_{4}{G}_{ij,2}+{\epsilon}_{i}$$

  (1.13)

In the above steps, measurement correction is essentially variance correction in the form of ${\hat{\sigma}}^{2}={\alpha}_{c}{\hat{\sigma}}_{e}^{2}$.
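The four steps above can be sketched in code for the Rubin variant (options 2(b) and 3(b)). This is an illustrative reading of the procedure and not the authors' implementation: the function names, the toy data, the fixed value of α_{c}, and the use of Rubin's combining rules to pool the *m* fits are all assumptions introduced here.

```python
import numpy as np

def ols(design, response):
    """Least-squares fit; returns coefficients, residual variance, residual df."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ coef
    df = design.shape[0] - design.shape[1]
    return coef, resid @ resid / df, df

def impute_true_scores(x_obs, y, g1, g2, alpha_c, m, rng):
    """Steps 1-3 (Rubin variant): return m vectors of probable true scores."""
    n = len(x_obs)
    design = np.column_stack([np.ones(n), y, g1, g2])
    beta_hat, s2_e, df = ols(design, x_obs)  # Step 1: measurement model
    s2 = alpha_c * s2_e                      # variance correction
    # Transposed Cholesky factor V satisfies V'V = (D'D)^{-1}
    v = np.linalg.cholesky(np.linalg.inv(design.T @ design)).T
    imputations = []
    for _ in range(m):
        g = rng.chisquare(df - 1)
        sigma_star = np.sqrt(s2 * (df - 1) / g)
        z = rng.standard_normal(design.shape[1])
        beta_m = beta_hat + sigma_star * (v.T @ z)   # Step 2(b)
        noise = sigma_star * rng.standard_normal(n)
        imputations.append(design @ beta_m + noise)  # Step 3(b)
    return imputations

def fit_sat(y, t, g1, g2):
    """Step 4: quadratic SAT model; returns coefficients and their variances."""
    design = np.column_stack([np.ones(len(y)), t, t**2, g1, g2])
    coef, s2, _ = ols(design, y)
    return coef, s2 * np.diag(np.linalg.inv(design.T @ design))

def pool(coefs, variances):
    """Pool the m fits with Rubin's combining rules (an assumption here)."""
    coefs, variances = np.asarray(coefs), np.asarray(variances)
    m = coefs.shape[0]
    total = variances.mean(axis=0) + (1 + 1 / m) * coefs.var(axis=0, ddof=1)
    return coefs.mean(axis=0), np.sqrt(total)

# Hypothetical toy data, not the simulation design of the paper
rng = np.random.default_rng(42)
n = 1000
ancestry = rng.normal(0.2, 0.1, n)
x_obs = ancestry + rng.normal(0.0, 0.05, n)
x_obs = x_obs - x_obs.mean()  # center the admixture estimate first
copies = rng.binomial(2, 0.3, n)
g1, g2 = (copies == 1).astype(float), (copies == 2).astype(float)
y = 35 + 5 * ancestry + 5 * ancestry**2 + rng.normal(0.0, 2.0, n)

imputed = impute_true_scores(x_obs, y, g1, g2, alpha_c=0.8, m=5, rng=rng)
fits = [fit_sat(y, t, g1, g2) for t in imputed]
pooled_coef, pooled_se = pool([f[0] for f in fits], [f[1] for f in fits])
print(np.round(pooled_coef, 3), np.round(pooled_se, 3))
```

Under this reading, the Cole variant would differ only in keeping the scale fixed at the square root of α_{c} times the residual variance instead of redrawing σ_{*} for each imputation, and the Bootstrap variant would resample standardized residuals in place of the normal draws.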

It is important to recall that MI assumes that the missing values are missing at random (MAR). In short, MAR means the probability that values are missing on a certain variable *Y* depends on other variables in the model, but not on *Y* itself. Although MI is not specifically being used to impute missing values here, the MAR assumption still holds. What is being treated as missing are the true values, which are not observed. Even so, it is assumed that the true values have a relationship with the other variables in the model, which is the MAR assumption.

For comparative purposes, the data were analyzed through a naïve model, a model that treats the variables as if they had no measurement error.

The simulation investigated the effect of error-contaminated individual ancestry proportions on the Type I error rate in SAT models. The underlying individual ancestry distribution (*X*) was simulated by making draws from a mixture of uniform and normal distributions that mimic the ancestral distribution observed in African American populations, following the simulation procedures of Tang et al. [34]. A thousand datasets, each containing 500 markers and 1000 individuals, were generated. The delta-value of each marker was allowed to vary between 0 and 0.9. However, only ancestry informative markers were retained for individual ancestry proportion estimation. They were sampled more heavily toward the upper bound of this interval for high Cronbach's alpha values and more toward the lower bound for lower Cronbach's alpha values. These markers were evenly divided into 22 blocks, which were used to provide a set of 22 estimates of individual ancestry. These estimates were used to estimate Cronbach's alpha. From these sets, 20 sets of 500 markers for each mean Cronbach's alpha value of ${\overline{\alpha}}_{c}=0.90,\hspace{0.17em}0.80,\hspace{0.17em}0.70$ were randomly selected. The allele frequency of each marker in the admixed sample was computed as a mixture of two parental allele frequencies as follows:

$${P}_{ij}^{adx}={X}_{i}{P}_{j}^{1}+\left(1-{X}_{i}\right){P}_{j}^{2}$$

(1.14)

where *P*^{1}_{j} and *P*^{2}_{j} are the frequencies of allele 1 at the *j*-th marker for the 1st and 2nd parental populations, *X*_{i} is the simulated ancestry of the *i*-th admixed individual, and *P*^{adx}_{ij} is the allele 1 frequency for the *i*-th admixed individual for the *j*-th marker. In this simulation, given a specific delta value, *P*^{1}_{j} ~ *U*(0, 1), *P*^{2}_{j} = *P*^{1}_{j} + δ where δ ~ *Bin*(100, *delta*) × 0.01, and *X*_{i} = 0.2 × *U*(0.1, 0.9) + 0.8 × *N*(0.15, 0.05^{2}) [19, 34]. The trait or phenotypic variable was generated as
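One possible reading of the ancestry mixture and of equation (1.14) is sketched below. Several details are assumptions made for the example and are not stated in the text: the mixture is interpreted as drawing from the uniform component with probability 0.2 and from the normal component otherwise, ancestry values and *P*^{2}_{j} are clipped to [0, 1], and genotypes are drawn as binomial allele counts given each individual's admixed allele frequency.

```python
import numpy as np

def simulate_ancestry(n, rng):
    """Mixture: 20% from U(0.1, 0.9), 80% from N(0.15, 0.05^2) (assumed reading)."""
    from_uniform = rng.random(n) < 0.2
    x = np.where(from_uniform,
                 rng.uniform(0.1, 0.9, n),
                 rng.normal(0.15, 0.05, n))
    return np.clip(x, 0.0, 1.0)  # clipping to [0, 1] is an assumption

def simulate_marker(x, delta, rng):
    """Parental frequencies, admixed frequency (Eq. 1.14), and genotype counts."""
    p1 = rng.uniform(0.0, 1.0)
    p2 = min(p1 + 0.01 * rng.binomial(100, delta), 1.0)  # clipping is an assumption
    p_adx = x * p1 + (1.0 - x) * p2                      # Eq. 1.14, one value per individual
    genotype = rng.binomial(2, p_adx)                    # allele-1 count (assumed sampling step)
    return p1, p2, p_adx, genotype

rng = np.random.default_rng(3)
x = simulate_ancestry(100_000, rng)
p1, p2, p_adx, genotype = simulate_marker(x, delta=0.5, rng=rng)
print(f"mean ancestry = {x.mean():.3f}, p1 = {p1:.2f}, p2 = {p2:.2f}")
```

Under this reading the mean simulated ancestry is about 0.2 × 0.5 + 0.8 × 0.15 = 0.22.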

$${Y}_{i}=35+5{X}_{i}+0{G}_{ij,1}+3{G}_{ij,2}+{\epsilon}_{i}$$

(1.15)

$${Y}_{i}=35+5{X}_{i}+5{X}_{i}^{2}+0{G}_{ij,1}+3{G}_{ij,2}+{\epsilon}_{i}$$

(1.16)

for the linear and quadratic model, respectively, where ε_{i} ~ *N*(0, 4). The linear model was generated for comparative purposes. In the simulation, *X*_{i} is the simulated true ancestry proportion from the above mixture distribution and *W*_{i} = *X*_{i} + *e*_{i}, where *e*_{i} ~ *N*(0, σ^{2}_{i}), is the observed, error-contaminated ancestry proportion. Note that this is ancestry estimated in the form of the classical true-score model (CTM). The σ^{2}_{i} values were selected so that the observed correlations between *W*_{i} and *X*_{i} vary between 0.85 and 0.95, and to demonstrate that highly yet still imperfectly correlated true and estimated (or measured) ancestry proportions can still lead to Type I error inflation. We note that a correlation between 0.85 and 0.95 ensures that Cronbach's alpha is bounded between 0.7 and 0.9. Under this scheme, 20 datasets of 500 markers containing 1000 individuals were simulated for a total of 10,000 markers. Each marker was tested for association with the simulated phenotype.
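The link between the error variance and the target correlation can be made explicit: under the classical true-score model, corr(*W*, *X*) = σ_{X} / √(σ^{2}_{X} + σ^{2}_{e}), so a target correlation *r* implies σ^{2}_{e} = σ^{2}_{X}(1/*r*^{2} – 1), and the reliability is *r*^{2}. The sketch below uses this to contaminate a simulated ancestry variable and to generate phenotypes from equations (1.15) and (1.16). The helper names and the stand-in ancestry and genotype distributions are hypothetical, and *N*(0, 4) is read as a variance of 4.

```python
import numpy as np

def contaminate(x, target_corr, rng):
    """W = X + e, with the sd of e chosen so that corr(W, X) is near target_corr."""
    sigma_e = np.std(x, ddof=1) * np.sqrt(1.0 / target_corr**2 - 1.0)
    return x + rng.normal(0.0, sigma_e, size=len(x))

def simulate_phenotype(x, g1, g2, rng, quadratic=True):
    """Eq. 1.15 (linear) or Eq. 1.16 (quadratic) with error variance 4."""
    eps = rng.normal(0.0, 2.0, size=len(x))  # sd 2, i.e. variance 4 (assumed reading)
    y = 35 + 5 * x + 0 * g1 + 3 * g2 + eps
    if quadratic:
        y = y + 5 * x**2
    return y

rng = np.random.default_rng(7)
n = 100_000
x = np.clip(rng.normal(0.2, 0.1, n), 0.0, 1.0)  # stand-in for true ancestry
copies = rng.binomial(2, 0.3, n)                # stand-in genotype counts
g1, g2 = (copies == 1).astype(float), (copies == 2).astype(float)

w = contaminate(x, target_corr=0.90, rng=rng)
y = simulate_phenotype(x, g1, g2, rng)
observed_corr = np.corrcoef(w, x)[0, 1]
print(f"corr(W, X) = {observed_corr:.3f}, implied reliability = {observed_corr**2:.3f}")
```

A correlation of 0.85 gives a reliability of about 0.72 and a correlation of 0.95 gives about 0.90, which matches the bounds on Cronbach's alpha quoted above.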

Each dataset contained a sample of 1000 individuals with 500 markers. Both SAT models, without and with the squared ancestry term, were fitted to the data; we refer to the former as the linear SAT model and the latter as the quadratic SAT model. Assume there are two alleles (*A*, *a*) at a locus forming three genotypes (*aa*, *aA*, *AA*) and allele *A* is of interest. The genotypes can be coded to allow for testing of only additive or both additive and non-additive effects, and table 1 offers the respective coding schemes.
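Because the coding table is not reproduced in this text, the sketch below shows one common way to code the three genotypes. It is an assumption, not the authors' table: the additive code counts copies of allele *A*, while the non-additive code uses two indicators for exactly one and exactly two copies, in the spirit of the *G*_{ij,1} and *G*_{ij,2} terms of the SAT model.

```python
def code_genotype(genotype, allele="A"):
    """Return additive and indicator (non-additive) codes for one genotype string."""
    copies = genotype.count(allele)  # case-sensitive count of the allele of interest
    return {
        "additive": copies,       # 0, 1, or 2 copies
        "g1": int(copies == 1),   # exactly one copy
        "g2": int(copies == 2),   # exactly two copies
    }

codes = {g: code_genotype(g) for g in ("aa", "aA", "AA")}
for genotype, code in codes.items():
    print(genotype, code)
```

With the indicator coding, the heterozygote and homozygote effects are estimated separately, so departures from additivity such as dominance can be tested.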

Table 2 contains the Type I error rates of the linear and quadratic SAT models with additive and non-additive genotypic coding for different reliability coefficients under the naïve model (i.e., without measurement correction). The Type I error rates are liberal irrespective of genotype coding, linear or quadratic SAT model, and reliability coefficient, implying that the association test will have a higher false positive rate if there is confounding by admixture and the model is not corrected for measurement error.
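For readers who want to judge how far an empirical rate must be from 0.05 to count as liberal or conservative, the sketch below computes an empirical Type I error rate from null p-values and an approximate Monte Carlo band around the nominal level. The function names are hypothetical, and the band assumes independent tests, which may only hold approximately for markers tested on the same individuals.

```python
from math import sqrt

def type1_error_rate(p_values, level=0.05):
    """Proportion of null tests rejected at the given significance level."""
    return sum(p < level for p in p_values) / len(p_values)

def monte_carlo_band(level=0.05, n_tests=10_000, z=1.96):
    """Approximate 95% band for the empirical rate when the true rate equals level."""
    se = sqrt(level * (1 - level) / n_tests)
    return level - z * se, level + z * se

rate = type1_error_rate([0.01, 0.20, 0.04, 0.50])  # two of four below 0.05
low, high = monte_carlo_band()
print(rate, round(low, 4), round(high, 4))
```

With 10,000 tests, empirical rates outside roughly 0.046 to 0.054 would be hard to attribute to simulation noise alone.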

Table 2. Type I error rates corresponding to the β coefficients in SAT models without any measurement corrections

Tables 3, 4, and 5 provide the type I error rates with measurement correction corresponding to the Rubin, Bootstrap, and Cole methods. Table 3 contains Type I error rates for both the linear and quadratic SAT model with additive and non-additive genotypic coding for a reliability coefficient of 0.90. The type I error rates for all three methods of imputation were slightly conservative for the linear SAT model. A similar trend occurred for the quadratic SAT model with the exception of the bootstrap method, where the type I error rates for β_{3} were slightly liberal.

Table 3. Average Type I error rates after measurement correction corresponding to the β coefficients for reliability coefficient of 0.90, using 10,000 replicates

Table 4. Average Type I error rates after measurement correction corresponding to the β coefficients for reliability coefficient of 0.80, using 10,000 replicates

Table 5. Average Type I error rates after measurement correction corresponding to the β coefficients for reliability coefficient of 0.70, using 10,000 replicates

The Type I error rates of the linear and quadratic SAT model with additive and non-additive genotypic coding with a reliability coefficient of 0.80 are presented in table 4 with measurement correction using the Rubin, Bootstrap, and Cole methods. For the linear SAT model, the Bootstrap imputation method controlled the Type I error rate best, followed closely by the Cole and Rubin methods. Additionally, the Cole and Rubin methods were not as conservative as before. However, the type I error rates were liberal for the quadratic SAT model using the Bootstrap method irrespective of genotype coding system. Both the Rubin and Cole methods provided type I error rates closer to the nominal significance level of 0.05 and were slightly less conservative compared to the situation with a reliability coefficient of 0.90.

Lastly, table 5 displays the Type I error rates of the linear and quadratic SAT models with additive and non-additive genotypic coding with a reliability of 0.70. The type I error rates for the Bootstrap method were very liberal compared to either the Rubin or the Cole method. All methods performed poorly for the quadratic SAT model, although the Rubin and Cole methods kept the type I error rate closer to the nominal significance level of 0.05. The slight exception here is that both the Rubin and Cole methods were slightly conservative for the β_{4} parameter estimate.

Measurement error in linear model variables is an important consideration, and through simulation we demonstrated the importance of correcting measurement error in linear models. Of particular interest was using multiple imputation (MI) for measurement error correction in the Redden et al. [7] SAT model. Although the Redden SAT model requires individual ancestry to control for admixture confounding, individual ancestry is rarely known, so individual admixture estimates were used as a surrogate. We then described how to use MI for measurement correction. Like Divers et al. [19], we used Cronbach's alpha [35] as a component of our measurement error correction procedure. We also described three different methods for imputing probable true scores for admixture: Rubin, Bootstrap, and Cole.

In the linear SAT model, of the three different methods for imputing probable admixture scores, the Rubin and Cole methods appear to work best. Although at first it looks like the Bootstrap method controls the Type I error correctly whereas the Rubin and Cole methods are slightly conservative, as the marker informativeness begins to decrease it is the Rubin and Cole methods that control the Type I error rate while the Bootstrap method becomes liberal. Overall, the Rubin and Cole methods provided better control of the Type I error rate than the Bootstrap method. This same pattern was observed in Divers et al. [19], in that measurement error correction only appears to be required when the informativeness of the markers is of intermediate value. The reason for this is that when markers are highly informative, the measurement correction method provides little improvement. On the other hand, when marker informativeness is low, the measurement correction method has poor information to borrow for measurement correction. MI for measurement correction as presented uses the existing data to accomplish this goal and requires no external information.

In the quadratic SAT model, of the three different methods for imputing probable admixture scores, the Rubin and Cole methods again appear to work best. The Bootstrap method did not consistently provide reasonable control of the Type I error rate. One interesting point is that the type I error rates of the Bootstrap method, in all models, are very similar to the type I error rates of the model without measurement error correction, suggesting that the Bootstrap method is not providing much measurement error correction. Notably, none of the methods works particularly well for a quadratic SAT model with admixture reliability of 0.70. Because of this result the linear SAT model corrected for measurement error may be considered, yet it too can have problems if the genetic effects are markedly non-additive (e.g., overdominance).

There is now much agreement that population admixture and/or population stratification can confound association studies when not taken into account. However, it should also be mentioned that the accuracy with which admixture is measured will have an influence on Type I error. When admixture or any other continuous variable is contaminated with error, MI for measurement error correction can help control the specified Type I error rate. However, this method is only useful if the data are of reasonably good quality with respect to marker information, which means that much care should still be taken when designing association studies, and in particular when measuring variables that will be used for analysis in a statistical model.

This work was supported in part by National Institutes of Health grants: 5R01AR052658-02, ES009912, DK056336, CA100949-03, HL072757, AR007450, AR049084, R21LM008791, R01GM077490. The opinions expressed are solely those of the authors and do not necessarily represent those of the NIH or any other organization with which the authors are affiliated.

1. Weinberg CR. Toward a clearer definition of confounding. Am J Epidemiol. 1993;137:1–8. [PubMed]

2. Knowler WC, Williams RC, Pettitt DJ, Steinberg AG. Gm3;5,13,14 and type 2 diabetes mellitus: An association in american indians with genetic admixture. Am J Hum Genet. 1988;43:520–526. [PubMed]

3. Spielman RS, McGinnis RE, Ewens WJ. Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (iddm) Am J Hum Genet. 1993;52:506–516. [PubMed]

4. Devlin B, Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004. [PubMed]

5. Freedman ML, Reich D, Penney KL, McDonald GJ, Mignault AA, Patterson N, Gabriel SB, Topol EJ, Smoller JW, Pato CN, Pato MT, Petryshen TL, Kolonel LN, Lander ES, Sklar P, Henderson B, Hirschhorn JN, Altshuler D. Assessing the impact of population stratification on genetic association studies. Nat Genet. 2004;36:388–393. [PubMed]

6. Redden DT, Allison DB. The effect of assortative mating upon genetic association studies: Spurious associations and population substructure in the absence of admixture. Behav Genet. 2006;36:678–686. [PubMed]

7. Redden DT, Divers J, Vaughan LK, Tiwari HK, Beasley TM, Fernández JR, Kimberly RP, Feng R, Padilla MA, Liu N, Miller MB, Allison DB. Regional admixture mapping and structured association testing: Conceptual unification and an extensible general linear model. Plos Genetics. 2006;2:1254–1264. [PMC free article] [PubMed]

8. Devlin B, Bacanu SA, Roeder K. Genomic control to the extreme. Nat Genet. 2004;36:1129–1130; author reply 1131. [PubMed]

9. Devlin B, Roeder K, Wasserman L. Genomic control, a new approach to genetic-based association studies. Theor Popul Biol. 2001;60:155–166. [PubMed]

10. Pritchard JK, Donnelly P. Case-control studies of association in structured or admixed populations. Theor Popul Biol. 2001;60:227–237. [PubMed]

11. Satten GA, Flanders WD, Yang Q. Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am J Hum Genet. 2001;68:466–477. [PubMed]

12. Chen HS, Zhu X, Zhao H, Zhang S. Qualitative semi-parametric test for genetic associations in case-control designs under structured populations. Ann Hum Genet. 2003;67:250–264. [PubMed]

13. Zhang S, Zhu X, Zhao H. On a semiparametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genet Epidemiol. 2003;24:44–56. [PubMed]

14. Ziv E, Burchard EG. Human population structure and genetic association studies. Pharmacogenomics. 2003;4:431–441. [PubMed]

15. Hoggart CJ, Parra EJ, Shriver MD, Bonilla C, Kittles RA, Clayton DG, McKeigue PM. Control of confounding of genetic associations in stratified populations. Am J Hum Genet. 2003;72:1492–1504. [PubMed]

16. Halder I, Shriver MD. Measuring and using admixture to study the genetics of complex diseases. Hum Genomics. 2003;1:52–62. [PMC free article] [PubMed]

17. Marchini J, Cardon LR, Phillips MS, Donnelly P. The effects of human population structure on large genetic association studies. Nat Genet. 2004;36:512–517. [PubMed]

18. Deng HW. Population admixture may appear to mask, change or reverse genetic effects of genes underlying complex traits. Genetics. 2001;159:1319–1323. [PubMed]

19. Divers J, Vaughan LK, Padilla MA, Fernandez JR, Allison DB, Redden DT. Correcting for measurement error in individual ancestry estimates in structured association tests. Genetics. 2007;176:1823–1833. [PubMed]

20. Allen MJ, Yen WM. Introduction to measurement theory. Monterey, CA: Brooks/Cole Pub. Co.; 1979.

21. Crocker LM, Algina J. Introduction to classical and modern test theory. New York: Holt, Rinehart, and Winston; 1986.

22. Bollen KA. Structural equations with latent variables. New York: Wiley; 1989.

23. Cheng C-L, Van Ness JW. Statistical regression with measurement error. London: Arnold; 1999.

24. Cheng CL, Schneeweiss H. Polynomial regression with errors in the variables. J R Stat Soc Ser B. 1998;60:189–199.

25. Carroll RJ, Stefanski LA. Approximate quasi-likelihood estimation in models with surrogate predictors. J Am Stat Ass. 1990;85:652–663.

26. Schneeweiss H, Nitter T. Estimating a polynomial regression with measurement errors in the structural and in the functional case – a comparison. In: Mohammed AK, Saleh E, editors. Data Analysis from Statistical Foundations: A Festschrift in Honour of the 75th Birthday of D.A.S. Fraser. Huntington, NY: Nova Science Publishers; 2001. pp. 195–207.

27. Kuha J, Temple J. Covariate measurement error in quadratic regression. Int Stat Rev. 2003;71:131–150.

28. Cook JR, Stefanski LA. Simulation-extrapolation estimation in parametric measurement error models. J Am Stat Ass. 1994;89:1314–1328.

29. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement error in nonlinear models: A modern perspective. ed 2. Boca Raton: Chapman & Hall/CRC; 2006.

30. Lindley DV, Smith AFM. Bayes estimates for the linear model. J R Stat Soc. 1972;34:1–41.

31. Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.

32. Little RJA, Rubin DB. Statistical Analysis with Missing Data. ed 2. Hoboken, NJ: Wiley-Interscience; 2002.

33. Barnard J, Rubin DB. Small-sample degrees of freedom with multiple imputation. Biometrika. 1999;86:948–955.

34. Tang H, Peng J, Wang P, Risch NJ. Estimation of individual admixture: Analytical and study design considerations. Genet Epidemiol. 2005;28:289–301. [PubMed]

35. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297–334.

Articles from Human Heredity are provided here courtesy of **Karger Publishers**
