Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Biometrics. Author manuscript; available in PMC 2009 September 22.
Published in final edited form as:
PMCID: PMC2748149

Covariate Adjusted Correlation Analysis with Application to FMR1 Premutation Female Carrier Data


Motivated by molecular data on female premutation carriers of the fragile X mental retardation 1 (FMR1) gene, we present a new method of covariate adjusted correlation analysis to examine the association of messenger RNA (mRNA) and number of CGG repeat expansion in the FMR1 gene. The association between the molecular variables in female carriers needs to adjust for activation ratio (ActRatio), a measure which accounts for the protective effects of one normal X chromosome in females carriers. However, there are inherent uncertainties in the exact effects of ActRatio on the molecular measures of interest. To account for these uncertainties, we develop a flexible adjustment that accommodates both additive and multiplicative effects of ActRatio nonparametrically. The proposed adjusted correlation uses local conditional correlations, which are local method of moments estimators, to estimate the Pearson correlation between two variables adjusted for a third observable covariate. The local method of moments estimators are averaged to arrive at the final covariate adjusted correlation estimator, which is shown to be consistent. We also develop a test to check the nonparametric joint additive and multiplicative adjustment form. Simulation studies illustrate the efficacy of the proposed method. (Application to FMR1 premutation data on 165 female carriers indicates that the association between mRNA and CGG repeat after adjusting for ActRatio is stronger.) Finally, the results provide independent support for a specific jointly additive and multiplicative adjustment form for ActRatio previously proposed in the literature.

Keywords: Conditional correlation, Fragile X syndrome, Local method of moments, Mental retardation, Nonparametric partial correlation, Pearson correlation, Semiparametric modeling

1. Introduction

Fragile X syndrome (FXS) is the most common inherited form of X-linked intellectual disability, with cognitive and behavioral impairments associated with distinct physical features. FXS results from a hyperexpansion of a CGG trinucleotide repeat in the promoter region of the fragile X mental retardation 1 (FMR1) X-linked gene (Oberlé et al., 1991; Verkerk et al., 1991). When the number of CGG repeats exceeds 200 (full mutation) methylation and transcriptional silencing of the gene occur (Pieretti et al., 1991) with consequent absence or deficiency of the FMR1 protein (FMRP; Devys et al., 1993). Individuals with smaller expansions in the premutation range of 55 to 200 CGG repeats are called premutation carriers. Many premutation carriers have some physical and behavioral characteristics of FXS (Hagerman, 2002) while a subgroup of older adult carriers develop fragile X-associated tremor/ataxia syndrome (FXTAS) later in their lives (Jacquemont et al., 2004) and about 20% develop premature ovarian failure. For a review, see Hagerman and Hagerman (2004). However, molecular mechanisms/models for the myriad of clinical involvements associated with premutation carriers, a current area of active research, are distinct from the molecular model that characterizes FXS. More precisely, unlike full mutation, premutation alleles do not lead to transcriptional silencing of FMR1. Indeed, it has been shown that premutation male carriers have significantly elevated levels of FMR1 mRNA compared to normal controls (Tassone, Hagerman, Chamberlain, et al., 2000; Tassone, Hagerman, Taylor, et al., 2000; Kenneson et al., 2001) and mRNA levels are positively correlated with the number of CGG repeats.

For female premutation carriers, the underlying association/correlation between CGG repeat size and mRNA level is more complex. The analysis of this correlation needs to take into account (or adjust for) the protective effects from one normal X chromosome. This protective effect is quantified by the activation ratio (ActRatio), which measures the proportion of normal active X chromosomes. Although it is difficult to precisely account for the effect of ActRatio on observed mRNA level, Tassone, Hagerman, Chamberlain, et al. (2000) proposed to examine the relationship between CGG repeat size and mRNA level, after adjusting for ActRatio, based on the following adjustment


where X is the observed mRNA level, X is the unobserved (adjusted) mRNA level due to the carrier chromosome, U is the ActRatio, and a is the fixed mean level of mRNA in normal alleles. The parametric adjustment in (1) is a simple decomposition of the observed mRNA level into two parts, one from the normal allele and the other from the diseased allele. Although this simple decomposition serves as a simple and biologically sensible adjustment, it does not account for the inherent uncertainties in the precise effect of ActRatio on mRNA expression level (Tassone, Hagerman, Chamberlain, et al., 2000). Hence, we propose a more general, fully non-parametric adjustment that incorporates both additive and multiplicative effects of U, as in (1). More precisely, we consider the following adjustment, of which the previous adjustment (1) is a special case,


where [var phi]1(·) and [var phi]2(·) are unknown smooth functions of U. Similarly, the potential effect of U on the variability in CGG repeats is modeled as


where ψ1(·) and ψ2(·) are also allowed to be general unknown smooth functions to accommodate uncertainties in the effects of U, is the observed CGG repeat size, and Y is its U-adjusted form. In (2)–(3), the unobserved variables (X, Y) are defined to be the parts of (X,Ỹ) that are independent of U. Our aim is estimation of the correlation between X and Y, denoted ρXY, adjusted for the general effects of U based on the observed data {X,Ỹ,U}.

The adjustments that we consider in (2)–(3) are flexible to accommodate linear effects of U, as in (1), or nonlinear effects. In addition, the effects of U may be additive, multiplicative, or a combination of both. The trivial case where U has no effect is accommodated with [var phi]1(·) = ψ1(·) = 1 and [var phi]2(·) = ψ2(·) = 0. Also, because there are no assumptions made on the unknown functions in (2)–(3), other than smoothness, an important property of the proposed estimator for ρXY is its invariance under linear transformations, similar to the Pearson correlation. Thus, the proposed covariate adjusted correlation is unaffected by the scale of the measurements.

We note that the adjustments (2)–(3) are partly related to the work of Şentürk and Müller (2005b). They proposed an estimator for ρXY (a) under the special case when X = X[var phi](U) and Ỹ = Yψ(U) and (b) that requires identifiability conditions of E{[var phi](U)} = 1 and E{ψ(U)} = 1 and the assumptions that E() ≠ 0 and E(X) ≠ 0 for estimation. These identifiability conditions and assumptions are difficult to verify in practice generally, and they are not satisfied for the FMR1 data adjustment described above. In the current work, we propose the more general adjustments (2)–(3) that account for the X-linked nature of disease and develop a completely different estimator for ρXY that does not require the above assumptions and identifiability conditions. A key observation in the development of the proposed estimator is that the (local) conditional correlation Corr(X,Ỹ | U = u) is equal to ρXY under (2)–(3), when [var phi]1(·) and ψ1(·) are of the same sign. This condition is satisfied when both [var phi] 1(·) and ψ1(·) are positive. This means the observed measurements are positively correlated with what we want to measure, i.e., the correlation between Xand X and between Yand are positive. Based on the above observation that Corr(X,Ỹ | U = u) = ρXY under (2)–(3), the data are stratified with respect to U, where the levels of U will be approximately constant in each stratum. We then average the local method of moments estimators of ρXY, obtained in each strata, to arrive at our final covariate adjusted correlation estimator.

Although the proposed adjustment formulation (2)–(3) is motivated from the problem of assessing the association between molecular measures, adjusted for ActRatio in female premutation carrier data, it is sufficiently general for a variety of other applications. Thus, the proposed adjusted estimator and the associated theory should be of broader interest beyond the motivating area of application. For instance, examples of covariate adjustments (2)–(3) include normalization of albumin turnover and protein catabolic rate (Kaysen et al., 2002), through division by U, body surface area. Such a normalization of the observed variables is common in biomedical studies, and can be viewed as a special case of the adjustments (2)–(3) with [var phi]2(·) = ψ2(·) = 0 and [var phi]1(·) = ψ1(·) = U. A similar adjustment in environmental health is described in Schisterman et al. (2005), where the exposure level of polychlorinated biphenyl (PCB), a lipophilic compound, is adjusted through division by a function of serum lipid levels (U).

We also note that although the additive effects of a covariate can be adjusted for with standard approaches, such as partial correlation or nonparametric partial correlation, these methods cannot adjust for multiplicative (possibly nonlinear) effects. A limitation of the partial correlation is that it adjusts for only additive linear effects of a covariate. More specifically, it can be shown that the standard partial correlation between X and adjusted for a covariate U targets ρXY, when X = X + a1U + a2 and Ỹ = Y + b1U + b2. These standard methods no longer target ρXY under the more general adjustments (2)–(3). We elaborate on the special cases of the additive effects of U in Section 2 and the Appendix section.

The remainder of the article is organized as follows. We detail the proposed covariate adjusted estimator of ρXY in Section 2. The asymptotic result is also given in Section 2, where the proof is deferred to the Appendix section. In Section 3, we propose a bootstrap test to check the proposed dual additive and multiplicative adjustment structure of (2)–(3). The proposed method is further examined with simulation studies and illustrated with an application to the aforementioned data on female FMR1 premutation carriers in Sections 4 and 5, respectively.

2. Estimation

Estimation of ρXY is based on the observed data of size n, {(Ui,Xi,Yi)}i=1,n where Xi = Xi[var phi] 1 (Ui) + [var phi] 2(Ui),i = Yiψ1(Ui) + ψ2(Ui) and the unobserved variables (X, Y) are defined to be the parts of X and that are independent of U. The proposed estimator of ρXY is constructed from local method of moments estimates of ρXY. These local estimates utilize the fact that, under the general adjustments (2)–(3), the correlation between X and at a fixed U is equal to the correlation ρXY. To be more precise, denote [rho with tilde](u) to be the correlation between X and given U = u, defined by [rho with tilde](u) [equivalent] Corr(X,Ỹ | U = u) = Cov(X,Ỹ | U = u)/{Var(X | U = u)Var( | U = u)}1/2. Note that by conditioning on U = u, it follows from the definitions of and X and the invariance of ρXY to linear transformations that


if [var phi]1(u) and ψ1(u) are assumed to be of the same sign. The above relationship implies that within a neighborhood of u, the correlation between the observed variables X and , denoted ρX, will target ρXY of interest. The proposed estimator of ρXY, based on this relationship, is an average of localized method of moments estimates of [rho with tilde](u).

To obtain the targeted local estimates, we bin the observed data with respect to U. The range of U is divided into m equidistant intervals, referred to as bins and denoted by B1,… , Bm. Let Lj denote the number of subjects falling into bin j, 1 ≤ jm. To track the observations that fall into a given bin, bin-specific observations are marked by a prime. For example, data for subject k in bin j is (Ujk,Xjk,Yjk) for 1kLj. We define the following local method of moments estimator of the correlation between X and within bin j,


where MXY,j=Lj1k=1LjXjkYjk,MX,j=Lj1k=1LjXjk,MY,j=Lj1k=1LjYjk,MX2,j=Lj1k=1LjXjk2, and MY2,j=Lj1k=1LjYjk2. Guidelines for choosing the total number of bins m will be given in the simulation studies of Section 4. Because rj targets ρXY for all j = 1, …, m, a natural estimator of ρXY can be based on the average of {rj}j=1m. Therefore, the proposed covariate adjusted correlation estimator of ρXY is


which is a weighted average of the bin specific estimators. Note that the weights are proportional to the numbers of points in each bin. The covariate adjusted estimator, r, is consistent for ρXY, as given by the following result. The proof is deferred to the Appendix section.

Theorem 1: Under the technical conditions given in the Appendix,


where cn = {n/log(n)}−1/3.

We emphasize here that the consistency of the covariate adjusted correlation estimator, r, holds under the general additive and multiplicative adjustments (2)–(3). However, as pointed out in the “Introduction” section and proven in the Appendix section, the special case of additive linear effects of U (i.e., X = X + a1 U + a2 and Ỹ = Y + b1 U + b2) can be handled with standard partial correlation analysis. The partial correlation estimate is obtained by first regressing (1) X on U and (2) on U to obtain two sets of residuals. The partial correlation estimate is then obtained as the Pearson correlation between the two sets of residuals. In contrast to the additive linear case, the partial correlation does not target ρXY under general additive effects of U on X and Y, such as nonlinear effects. More specifically, consider X = X + [var phi](U) and Ỹ = Y + ψ(U), where [var phi](·) and ψ(·) are unknown smooth functions of U that may be nonlinear. Under these general additive effects, it is also shown in the Appendix section that a simple generalization of the partial correlation, called non-parametric partial correlation, targets ρXY. The only difference between partial and nonparametric partial correlation is that, for the latter, the two sets of residuals are obtained from nonparametric regressions of X on U and on U. Both partial and nonparametric partial regression do not target ρXY under the more general form of (2)–(3), as shown in the Appendix section.

We note that while r is based on an equidistant binning procedure, alternative binning approaches can be integrated to the estimation procedure proposed above. One alternative approach that we also explored is based on nearest neighbor binning. As pointed out earlier, for the equidistant binning used, Bj, j = 1,…, m, are fixed and equidistant; however, the number of data points, Lj, falling into each bin is random. In nearest neighbor binning, the bin lengths and boundaries are random, but each bin contains the same number of observations, denoted by L. This alternative binning utilizes the nearest neighbor idea by first ordering the observed distortion values Ui, i = 1,…, n, and then forming the m = n/L number of bins by grouping the sets of L nearest neighbor values among the ordered set starting with the first L to the last. Once the bins are formed, the rest of the procedure is the same as explained for the case of equidistant binning. We compare the performance of the two binning procedures in more detail in Section 4.4 with respect to various distributions for U.

Also, upon the suggestion of the editor, we explored a variation on the proposed estimator in (4) by replacing the rj's in (4) with their Fisher's z transformed values (i.e., .5{ln (1 + rj) − ln (1 − rj)}). Comparison of r with this variation is given in Section 4.5.

For inference, we use the bootstrap percentile method to form confidence intervals (CIs) based on the proposed covariate adjusted estimator in the analysis of the female FMR1 premutation data. The estimated nonparametric density of the standardized 1000 bootstrap estimates of ρXY is given in Figure 1 (bottom panel), along with the standard normal density curve. The fitted density appears close to the standard normal density, indicating that the percentile bootstrap approximation is reasonable. The coverage of the proposed bootstrap percentile CIs are examined through simulations reported in Section 4.3.

Figure 1
(Top panel) Scatter plot of the local correlation estimates rj versus UJM for j = 1,…, 20 bins, with approximately eight points per bin. A local linear smooth overlays the scatterplot with an automatically selected bandwidth of h = 0.1. (Bottom ...

An important practical issue with the application of the proposed estimator is the adequacy of the assumed adjustment forms (2)–(3). Although these assumed dual additive and multiplicative adjustment forms are fairly general compared to the additive linear restriction of other methods like partial correlation, it is still of interest to check the adequacy of these forms. We address this issue next by developing a bootstrap test to check this assumption.

3. Assessing the Adjustment Model Assumption

The dual additive and multiplicative adjustment form of (2)–(3) imply that the local correlation [rho with tilde](u) = ρXY is free of u. Hence, under the null hypothesis, H0 : Ỹ = ψ1(U)Y + ψ2(U) and X = [var phi] 1(U)X + [var phi]2(U), the scatterplot of the local correlation estimates should be randomly scattered around the constant ρXY. Even though this scatterplot, augmented with a scatterplot smoother, can provide an initial graphical check of the above assumption, we also develop a more formal assessment through a hypothesis testing procedure. The proposed test will be based on the local correlation estimates {rj}j=1m coming from each bin. Let {UjM}j=1m denote the corresponding midpoints of the bins. Then a smooth fit to the scatterplot of {(UjM,rj)}j=1m is expected to be a constant function under H0. We consider a smooth test as they are expected to be more powerful (Hart, 1997, p. 140). Reasonable test statistics quantify departures of the smooth estimator from a horizontal line at the sample mean of {rj}. Alternatively, one can quantify the departures (from the horizontal line at zero) of the smooth estimator fitted to the centered scatterplot, {(UjM,rjC)}j=1m, where rjC=rjm1jmrj. Similar to the statistics proposed by Hart (1997) and Şentürk and Müller (2005a), we adopt as a measure of departure,


where ρ^(UM;hT)=j=1mrjCwj(UM,hT) is the linear smooth fitted to the centered scatterplot using the bandwidth hT and weights wj(UM,hT), evaluated at UM.

An automatic data-based choice of the bandwidth parameter hT that is fast to implement and that leads to good results, adopted from Rice (1984), is


where Wh is an m × m matrix with ([ell], j)th element wj(UM;h),RSS(h)=rρ^2 for r′ = (r1,…, rm)T, and ρ^={ρ^(U1M,h),,ρ^(UmM,h)}T.

We approximate the sampling distribution of Rn by the wild bootstrap, because the local estimators rj are heteroscedastic with their variance dependent on U. The bootstrap samples have the form {(U1M,r1CV1),,(UmM,rmCVm)}, where Vj is sampled from the two-point distribution attaching masses (5+1)/25 and (51)/25 to the points (51)/2 and (5+1)/2 (Davidson and Hinkley, 1997, p. 272). The variables {rjCVj} have mean zero and crudely approximate the variance and skewness of the underlying distribution, because (Vj)j=1m are independent and identically distributed random variables with mean zero and with variance and third moment equal to one. Properties of this test are studied in Section 4.3.

4. Simulation Studies

In this section, we summarize the simulation studies conducted to examine (i) the finite-sample performance of the proposed covariate adjusted correlation estimator and its relative performance in comparison to no adjustment, parametric adjustment of Tassone, Hagerman, Chamberlain, et al. (2000), partial correlation and nonparametric partial correlation, (ii) the sensitivity of the proposed estimator to the choice of the number of bins m, (iii) the power of the proposed bootstrap test for checking the dual additive and multiplicative adjustment forms and the coverage of the proposed bootstrap percentile confidence interval (CI), (iv) the performance of the two binning procedures (equidistant and nearest neighbor) and their robustness to the distribution of U, and (v) the performance of the alternative estimator proposed via Fisher's z transformations.

4.1 Finite Sample Performance and Comparison to Other Adjustments

The simulation setup was designed to reflect the observed FMR1 premutation data, where the means and variation of (X,,U) are chosen to be similar to those of ( mRNA, CGG, ActRatio). Also, the correlation ρX,Ỹ = 0.32 was chosen to approximately match the observed correlation rmRNA,CGG=0.29. The covariate U is simulated from Uniform [0.2, 0.9]. The underlying unobserved variables (X, Y)T are obtained from the bivariate normal distribution with mean vector (5.5,7.5)T, σX2=1.5, σY2=1.1, and ρXY = 0.6. The functional effects of U are given by ψ1(U) = (U + 4)2/2, ψ2(U) = 25(1 − U), [var phi]1(U) = U3, and [var phi]2(U) = 2(1 − U2). The observed data are obtained as X = X[var phi]1(U) + [var phi]2(U) and = 1 (U) + ψ2(U). The simulation studies were carried out for sample sizes of n = 150, 300, and 600. The proposed covariate adjusted correlation estimator is compared to estimators from no adjustment (rX), the parametric adjustment (1) of Tassone, Hagerman, Chamberlain, et al. (2000), partial correlation and nonparametric partial correlation under three distortion settings: (a) nonparametric additive and multiplicative distortion effects, (b) parametric additive and multiplicative distortion as given in equation (1), and (c) parametric additive distortion.

The first distortion considered is (a) the general case of nonparametric additive and multiplicative effects under which only the proposed covariate adjusted estimator targets the underlying correlation coefficient. Table 1 reports the estimated absolute bias, variance, and MSE of the correlation estimators based on 1000 Monte Carlo data sets for each sample size. As evident from the results in Table 1, only the bias of the proposed covariate adjusted correlation decreases with increasing sample size. As expected, the biases of the other methods remain substantial across the different sample sizes, because they do not target ρXY.

Table 1
Estimated absolute bias, variance, and MSE of the estimators from proposed covariate adjusted correlation, no adjustment, parametric adjustment (1) of Tassone, Hagerman, Chamberlain, et al. (2000), partial correlation, and nonparametric partial correlations ...

In the second distortion setup (b) of parametric additive and multiplicative distortion ( = Y, X = (1 − U)X + 1.42U), the parametric adjustment (1) of Tassone et al. and the proposed covariate-adjusted estimator are the two methods that target the underlying correlation. For the third setup (c) of parametric additive distortion ( = Y + 5U, X = X + 10U), partial correlation, nonparametric partial correlation, and the proposed method target the correct correlation. The results for (b) and (c) are reported in Tables 2 and and3,3, respectively. As can be seen from Table 2, both partial and nonparametric partial correlation perform (equally) poorly and their biases do not decrease with increasing sample size, as expected for model (b). For parametric additive distortion, namely case (c), the incorrect form of adjustment (1) due to Tassone et al. and the unadjusted correlation result in biases that do not decrease with increasing sample size (Table 3). The biases of parametric, nonparametric partial correlation and the proposed covariate-adjusted correlation decrease with increasing n. The simpler methods of parametric and partial correlation are more efficient than the proposed method under the null models (b), (c) for the small sample size of n = 150, as expected. However, this difference seems to diminish quickly as the sample size increases to n = 300 and 600.

Table 2
Estimated absolute bias, variance, and MSE of the estimators of the correlation under the null case of parametric adjustment given in (1) from proposed covariate adjusted correlation, no adjustment, parametric adjustment (1) of Tassone, Hagerman, Chamberlain, ...
Table 3
Estimated absolute bias, variance, and MSE of the estimators of the correlation under the null case of additive parametric adjustment from proposed covariate adjusted correlation, no adjustment, parametric adjustment (1) of Tassone, Hagerman, Chamberlain, ...

4.2 Choice of m

In the simulation studies we also examine the effect of the total number of bins, m, on the proposed estimators. Similar to the results of Şentürk and Müller (2005b), where the corresponding estimator was also obtained through binning, the estimates are found to be robust to the choice of m. The results indicate that the correlation estimates and MSEs were very similar for m between 15 and 30 for n = 150, m between 20 and 40 for n = 300, and m between 25 and 50 for n = 600. For example, under simulation setup (a) of Section 4.1, for n = 300 with m [set membership] {20, 30, 40} the mean of covariate adjusted correlation estimates for ρXY = 0.467 were (0.449, 0.447, 0.439) and the MSEs were (0.0027, 0.0030, 0.0034), respectively. Although our experience with the proposed estimator indicates that the final estimate is fairly robust to a reasonably wide range of m in practice, for any given application it is prudent to consider an analysis of sensitivity to m, as was done for the FMR1 data application above.

4.3 Power of the Proposed Bootstrap Test and Coverage of the Proposed Bootstrap Confidence Interval

Next, to examine the power of the proposed bootstrap test for checking the dual additive and multiplicative form of (2)–(3), we considered two cases of deviations (alternatives) from this assumption (null case). The simulation model (a) of Section 4.1 described above is used for the null case. In the first alternative case, and X deviate from the additive and multiplicative forms (2)–(3) through:


where N0 [equivalent] [var phi]1(U)X + [var phi]2(U), M0 [equivalent] ψ1(U)Y + ψ2(U), I{e} is the indicator function for event E, and θ = 0, 1,…, 8. The functions [var phi]1(U), [var phi]2(U), ψ1(U), and ψ2(U) are as defined above. The null hypothesis (i.e., assumption (2)–(3) holds) corresponds to θ = 0 and θ = 1,…, 8 correspond to increasing alternatives. These alternatives as well as the null are displayed in Figure 2 (top left plot), where the conditional correlation functions, [rho with tilde](u), are provided. When the additive and multiplicative forms are satisfied (θ = 0), [rho with tilde](u) is constant (see Section 3). The second set of alternatives/violations explored in this simulation study are provided graphically in the top right plot of Figure 2. These conditional correlation functions correspond to the following alternative deviations:

Figure 2
(Top panel) Plots of [rho with tilde](u) from the two cases of alternatives to the proposed additive and multiplicative distortion form. The null hypothesis of additive and multiplicative forms corresponds to θ = 0, i.e., the (conditional) correlation ...

for θ = 0,…, 8. Similarly, the null hypothesis corresponds to the case of θ = 0.

Given in the lower panel of Figure 2 are the power of the bootstrap test proposed in Section 3 to check the adequacy of the dual additive and multiplicative forms. Displayed are three sets of power curves corresponding to sample sizes n = 150, 300, and 600. Two curves of the same line type are for the test at significance levels 0.05 (bottom curve) and 0.10 (top curve). Power estimates are based on 1000 Monte Carlo runs. The observed type I errors at θ = 0, for the above significance levels are, (0.015, 0.025) for n = 150, and (0.04, 0.08) for n = 600 in the first alternative deviation case. For the second alternative deviation case, the observed levels are (0.01, 0.04) for n = 150 and (0.04, 0.09) for n = 600. As expected, the levels of the bootstrap test move closer to the target values and the power functions increase with increasing deviation away from the null case of θ = 0 and with increasing sample size.

We also examined the estimated coverage levels of the proposed bootstrap percentile CIs under the simulation setting (a) of Section 4.1. Briefly, 1000 data sets were simulated at two sample sizes of n = 165 and n = 300. For each data set, 1000 bootstrap samples were generated and the estimated coverage values of the CIs corresponding to levels of (0.80, 0.90, and 0.95) are (0.75, 0.88, 0.93) for n = 165 and (0.80, 0.89, 0.95) for n = 300.

4.4 Alternate Binning Procedure and Robustness to the Distribution of U

We also ran a simulation study to compare the proposed equidistant binning estimator to one obtained via nearest neighbor binning and to evaluate the performance of the two binning procedures under different U distributions. While the bin size is kept constant in equidistant binning, the number of points per bin is kept constant in nearest neighbor binning (see Section 2). We compare the two binning procedures for three different distributions of U. Under model (a) of Section 4.1, where U was sampled from Uniform [0.2, 0.9] uniform distribution, we additionally consider cases where U is sampled from N (0.55, 0.04) and χ2 (1)/6 + 0.4. The three distributions are chosen such that they have approximately the same first two moments. The results are summarized in Table 4. The results suggest that the proposed equidistant binning is quite robust to the distribution of U and the nearest neighbor binning do not improve on the proposed equidistant binning.

Table 4
Equidistant and nearest neighbor binning for uniform, normal, and χ2 distributed U. The results are presented for three sample sizes of n = 150, 300, and 600

4.5 Comparison to Estimators via Fisher's z Transformation

We implemented the variation of the proposed estimator by replacing rj by its the Fisher-z values, and compared its performance to that of r under the simulation setup of Section 4.4 in terms of bias, variance, and MSE. We compared their performance under U distributed as uniform, normal, and χ2 as described earlier. Even though neither estimation approach was superior to the other in all aspects, we believe the simulation results reported in Table 5 are of interest. While the performance of the estimators are quite similar for uniformly distributed U, the estimator averaging Fisher-z values improves on the bias of the proposed estimator for U distributed as normal and χ2. However, the original proposed estimator yields smaller variance, especially for small sample sizes (n = 150 and 300); thus original estimator results in smaller MSE for the smaller sample size. Their MSE estimates are similar for larger n (300 and 600) in the simulation study, as expected.

Table 5
Comparison between the proposed method and the one based on Fisher's z-transformation for uniform, normal, and χ2 distributed U. The results are presented for three sample sizes of n = 150, 300, and 600

5. Application to Female FMR1 Premutation Data

The molecular measurements, CGG, mRNA, and activation ratio (ActRatio) U [equivalent] ActRatio, were obtained from experiments at the University of California at Davis on 165 female premutation carriers. Our main interest here is to target ρXY [equivalent] ρmRNA,CGG, the activation ratio-adjusted correlation between the mRNA level and CGG repeat size. Figure 3 gives the matrix plot for the observed variables [ CGG, mRNA, ActRatio], where CGG, mRNA, and ActRatio range between (57, 138) repeats, (0.78, 6.3) and (0.19, 0.91), respectively. Figure 3 suggests a potential ActRatio effect on both CGG and mRNA.

Figure 3
Matrix plot of the observed variables CGG, mRNA, and ActRatio for n = 165 female premutation carriers.

Prior to estimating ρmRNA,CGG, we assess the adequacy of the assumed dual additive and multiplicative forms (2)–(3) for the data, as described in Section 3. The local correlation estimators from each bin are given in Figure 1 (top panel), along with a local linear smooth using the automatic bandwidth choice of h = 0.1 determined by (5). A p-value of 0.56 was obtained from 1000 bootstrap replications of Rn. Thus, the adequacy of the assumed adjustment forms (2)–(3) is not rejected. Graphically, this can also be seen from Figure 1 where the linear smooth fitted to the scatterplot of the local correlations is approximately close to a constant function.

In our analysis, we compare the proposed covariate (ActRatio) adjusted estimate for the correlation, ρmRNA,CGG, to estimates obtained without adjustment and with adjustment (1) on mRNA, previously proposed by Tassone, Hagerman, Chamberlain, et al. (2000), partial correlation and nonparametric partial correlation. The estimate without adjustment corresponds to the observed Pearson correlation between mRNA and CGG(X and ). The proposed estimate is obtained using a total of 20 bins. We note here that the covariate adjusted correlation estimate was quite robust to the choice of the number of bins. For example, the estimates were very similar for the number of bins from m = 17 to m = 25. The estimates and approximate 95% CIs for ρXY from these five methods are provided in Table 6. For the unadjusted Pearson correlation, the assumed adjustment (1), partial correlation, and nonparametric partial correlation approximate CIs can be obtained using Fisher's z-transformation. The CI for the proposed covariate adjusted estimator in (4) was obtained using the percentile bootstrap method with 1000 bootstrap replications.

Table 6
Estimates and approximate 95% CIs for ρmRNA,CGG adjusted for ActRatio in n = 165 female premutation carriers. The first four estimates correspond to unadjusted Pearson correlation and parametric adjustment (1) from Tassone, Hagerman, Chamberlain, ...

The correlation between the observed mRNA levels and the CGG in female premutation carriers, unadjusted for the effect of ActRatio, is 0.29 (95% CI: 0.15–0.43, Table 6). Although still significant, the correlation estimate for female carriers falls substantially below the corresponding estimate for male carriers (correlation ≈ 0.57), which has been established in literature. As described in the “Introduction” section, this weaker association in female carriers is attributed partly to the protective effects from one normal X chromosome in female carriers which is absent in male carriers. Applying the (parametric) adjustment (1) of Tassone, Hagerman, Chamberlain, et al. (2000), specifically mRNA = (1 − ActRatio)mRNA + aActRatio, to account for ActRatio results in a stronger (adjusted) correlation point estimate of 0.34. We used the constant a = 1.42, which is the empirical mean mRNA level for normal/unaffected individuals from Tassone, Hagerman, Taylor, et al. (2000). The proposed covariate adjusted correlation under the general adjustment forms (2)–(3), allowing for nonparametric effects of ActRatio on both observed mRNA and CGG, results in an adjusted correlation point estimate of 0.37 (95% CI: 0.25 – 0.51). Although the proposed method suggests that the underlying correlation is slightly higher, this result is quite similar to the result using the parametric additive and multiplicative adjustment (1) proposed by Tassone, Hagerman, Chamberlain, et al. (2000). Hence, this application provides an independent empirical support for the previously proposed parametric joint additive and multiplicative effect of ActRatio on mRNA, derived mainly from biological motivations. Also, because the nonparametric partial correlation (0.367, 95% CI: (0.228, 0.491)) is close to the proposed adjusted correlation estimate (0.372, 95% CI: (0.252, 0.520)), informally, it is interpreted that the nonlinearity is due to the additive distortion part.

6. Concluding Remarks

Motivated by an adjustment for ActRatio in fragile X premutation female carriers, we proposed a general dual additive and multiplicative correlation adjustment model for the correlation between mRNA and CGG repeat expansion. A key feature of the methodology is that the uncertainty in the precise effects of ActRatio at the molecular level is modeled nonparametrically, thus accommodating linear or nonlinear effects. We proposed a simple covariate adjusted correlation estimator that is easy to obtain, showed that it is consistent, and examined its numerical properties in simulation studies. Although the adjustment forms are fairly general and, therefore, are automatically adaptive to special cases like linear additive or nonlinear additive effects, we also developed and assessed the performance of a bootstrap test procedure to check the adequacy of the dual additive and multiplicative forms. A test for detecting whether the distortion setting at hand would reduce to parametric or only additive cases would also be of interest, because then a simpler adjustment method can be employed. Nevertheless this remains an open problem requiring further research.

Application of the proposed covariate adjusted correlation to n = 165 fragile X premutation female carriers indicates stronger association between FMR1 mRNA level and CGG repeat expansion compared to unadjusted analysis. Our results provide new insights and additional support for a dual additive and multiplicative parametric adjustment previously proposed in the fragile X premutation literature. The proposed adjustment is also applicable to the multiplicative adjustments (normalizations) used in biomedical research, including adjustments of biomarkers of inflammation by body mass index or body surface area and individual levels of PCB exposure by individual serum lipids.

Extension of the proposed algorithm to accommodate multiple covariates poses challenges. While the adjustment for two covariates (U = (U1, U2)) would be a straightforward extension of the proposed algorithm using a two-dimensional binning procedure, as the dimension of U increases, one would quickly run into the curse of dimensionality. Because the proposed procedure involves localizing with respect to U, when the dimension of U increases, the data needed for the localization (binning) would become highly sparse. In these cases, a dimension reduction approach, such as taking a linear combination of the components of U vector may be of interest.


We are grateful to the reviewer, associate editor, and the editor for many detailed suggestions that substantially improved the article. Support for this work includes the National Institute of Health grants UL1DE19583, RL1AG032119, and RL1AG032115 (VN, FT, RJH, PJH), National Institute of Child Health and Human Development grant HD036071 (RJH, DVN, FT), National Cancer Institute grant CA-57030 (RJC), and grant UL1 RR024146 from the National Center for Research Resources (NCRR).


Proof of Consistency

We first state the technical conditions that will be used in the proof of consistency. They are: C1. The adjusting variable U is independent of the variables X and Y. In addition, the marginal density f(U) of U has compact support, i.e., aUb for some constants a, b, and satisfies infa≤u≤b f(u) > 0, supa≤u≤b f(u) < ∞. The marginal density is also uniformly Lipschitz. C2. The functions [var phi]1(·), [var phi]2(·), ψ1(·), and ψ2(·) have continuous derivatives. Furthermore, [var phi]1(·) and ψ1(·) are of the same sign. C3. The two variables X and Y satisfy the following moment conditions, E|X| < ∞ and E|Y| < ∞ for some λ ≥ 3. C4. The function h(u) = ∫ xg(x,u)dx is uniformly Lipschitz, where g(·, ·) is the joint density function of X and U and of and U. C5. The number of bins m is of order {n/log (n)}1/3.

Considering the definition of the Nadaraya–Watson estimator (Fan and Gijbels, 1996), we note that all the five terms in rj are Nadaraya–Watson estimators. For instance, consider MX, j. It has the form MX,j=Lj1k=1LjXjk={i=1nK((UiUjM)/h)Xi}/i=1nK((UiUjM)/h)N^(UjM), which is a Nadaraya–Watson estimator with K(·) = (1/2) 1 [−1,1], h = (ba)/m, and UjM is the midpoint of the jth bin, as defined in Section 3. Uniform consistency of Nadaraya– Watson estimators with kernels of compact support has been shown in Härdle, Janssen, and Serfling (1988),


where N(u) = E(X | U = u) and cn = {n/log (n)}−1/3. Then equation (6) implies supj|N^(UjM)N(UjM)|=Op(cn). Similar to the uniform consistency of MX, j, it follows that MỸ, j, MX2,j, MY2,j and MXỸ,j are all uniformly consistent over j. Hence the following holds uniformly in j


where c=E(XY|U=UjM)E(X|U=UjM)E(Y|U=UjM),s1=[E(X2|U=UjM){E(X|U=UjM)}2]1/2 and s2=[E(Y2|U=UjM){E(X|U=UjM)}2]1/2. Hence, Theorem 1 follows.


The consistency of r also follows under weaker moment conditions than the ones given in C3, where 2 < λ < 3. For the order of m and the convergence rates under weaker moment conditions, see Härdle et al. (1988) for details.

Target of partial correlation

The partial correlation between and X adjusted for U is equivalent to the correlation between the variables eỸU and eXU, denoted ρ eỸU eXU, where eỸU and eXU are the errors from the regression models = a0 + a1U + eỸU and X = b0 + b1U + eXU, respectively. Using the population normal equations for regression, under adjustments (2)–(3), we have a1 = cov{U, Ỹ}/var(U) = [cov {U,ψ2(U)} +cov{U,ψ1(U)Y}] / var(U),a0 = E() − a1 E (U), b1=cov{U, X} / var(U) = [cov {U,[var phi]2(U)} + cov{U,[var phi]1(U)X}]/ var(U), and b0 = E(X) − b1E(U). Thus, plugging in definitions of a0 and b0, we have eỸU = a0a1 U = [ψ2 (U) − E{ψ2(U)}]+ [ψ1 (U)YE{ψ1 (U)Y}]− a1{UE(U)} and eXU=Xb0b1U = [[var phi]2(U) − E{[var phi]2(U)}] + [[var phi]1(U)XE{([var phi]1(U)X}] − b1{UE(U)}. Therefore, ρeỸU eXU is not necessarily equal ρXY. However note that ρeỸU eXU = ρXY holds if ψ1(U) = [var phi]1(U) = 1 and ψ2(U) and [var phi]2(U) are both linear in U, i.e., X = X + c1U + c2 and Ỹ = Y + d1U + d2. This follows from the fact that a1 = c1 and b1 = d1 and hence eỸU = YEY and eXU = XEX when ψ1(U) = [var phi]1(U) = 1 and ψ2(U) and [var phi]2(U) are linear in U. An implicit assumption here is that cov (X, U) = cov (Y, U) = 0.

Target of nonparametric partial correlation

The nonparametric partial correlation between and X adjusted for U is equivalent to ρỸUXU, where ỸU and XU are the errors from the nonparametric regression models = E( | U) + Ỹ U and X = E(X | U) + XU, respectively. Thus, under adjustments (2)–(3), Ỹ U = E( | U) = ψ1(U){YE(Y | U)} and X U = XE(X | U) = [var phi]1(U){XE(X | U)}. Therefore, ρỸUXU does not necessarily equal ρXY unless ψ1(U) = [var phi]1(U) = 1, i.e., X = X + [var phi](U) and = Y + ψ(U) and X and Y are independent of U.


  • Davidson AC, Hinkley DV. Bootstrap Methods and Their Applications. New York: Cambridge University Press; 1997.
  • Devys D, Lutz Y, Rouyer N, Bellocq JP, Mandel JL. The FMR-1 protein is cytoplasmic, most abundant in neurons and appears normal in carriers of a fragile X premutation. Nature Genetics. 1993;4:335–340. [PubMed]
  • Fan J, Gijbels I. Local Polynomial Modelling and its Applications. London: Chapman and Hall; 1996.
  • Hagerman RJ. Physical and behavioral phenotype. In: Hagerman RJ, Hagerman PJ, editors. Fragile X Syndrome: Diagnosis, Treatment and Research. 3rd. Baltimore: Johns Hopkins University Press; 2002. pp. 3–109.
  • Hagerman PJ, Hagerman RJ. The fragile-X premutation: A maturing perspective. American Journal of Human Genetics. 2004;74:805–816. [PubMed]
  • Härdle W, Janssen P, Serfling R. Strong uniform consistency rates for estimators of conditional functionals. Annals of Statistics. 1988;16:1428–1449.
  • Hart J. Nonparametric Smoothing and Lack of Fit Tests. New York: Springer-Verlag; 1997.
  • Jacquemont S, Hagerman RJ, Leehey MA, Hall DA, Levine RA, Brunberg JA, Zhang L, Jardini T, Gane LW, Harris SW, Herman K, Grigsby J, Greco C, Berry-Kravis E, Tassone F, Hagerman PJ. Penetrance of the fragile X-associated tremor/ataxia syndrome (FXTAS) in a premutation carrier population: Initial results from the California-based study. Journal of the American Medical Association. 2004;291:460–469. [PubMed]
  • Kaysen GA, Dubin JA, Müller HG, Mitch WE, Rosales LM, Levin NW, Hemo Study Group Relationship among inflammation nutrition and physiologic mechanisms establishing albumin levels in hemodialysis patients. Kidney International. 2002;61:2240–2249. [PubMed]
  • Kenneson A, Zhang F, Hagedorn CH, Warren ST. Reduced FMRP and increased FMR1 transcription is proportionally associated with CGG repeat number in intermediate-length and premutation carriers. Human Molecular Genetics. 2001;10:1449–1454. [PubMed]
  • Oberlé L, Rousseau F, Heitz D, Kretz C, Devys D, Hanauer A, Boue J, Bertheas MF, Mandel JL. Instability of a 550-base pair DNA segment and abnormal methylation in fragile X sydrome. Science. 1991;252:1097–1102. [PubMed]
  • Pieretti M, Zhang F, Fu YH, Warren ST, Oostra BA, Caskey CT, Nelson DL. Absence of expression of the FMR-1 gene in fragile X syndrome. Cell. 1991;66:817–822. [PubMed]
  • Rice J. Bandwidth choice for nonparametric regression. Annals of Statistics. 1984;12:1215–1230.
  • Schisterman EF, Whitcomb BW, Louis GMB, Louis TA. Lipid adjustment in the analysis of environmental contaminants and human health risks. Environmental Health Perspectives. 2005;113:853–857. [PMC free article] [PubMed]
  • Şentürk D, Müller HG. Covariate adjusted regression. Biometrika. 2005a;92:59–74.
  • Şentürk D, Müller HG. Covariate adjusted correlation analysis via varying coefficient models. Scandinavian Journal of Statistics. 2005b;32:365–383.
  • Tassone F, Hagerman RJ, Chamberlain WD, Hagerman PJ. Transcription of the FMR1 gene in individuals with fragile X syndrome. American Journal of Medical Genetics. 2000;97:195–203. [PubMed]
  • Tassone F, Hagerman RJ, Taylor AK, Gane LW, Godfrey TE, Hagerman PJ. Elevated levels of FMR1 mRNA in carrier males: A new mechanism of involvement in fragile X syndrome. American Journal of Human Genetics. 2000;66:6–15. [PubMed]
  • Verkerk AJ, Pieretti M, Sutcliffe JS, Fu YH, Kuhl DP, Pizzuti A, Reiner O, Richards S, Victoria MF, Zhang F, Eussen BE, van Ommen GJB, Blonden LAJ, Riggins GJ, Chastain JL, Kunst CB, Galjaard H, Caskey CT, Nelson DL, Oostra BA, Warren ST. Identification of a gene (FMR-1) containing a CGG repeat co-incident with a breakpoint cluster region exhibiting length variation in fragile X syndrome. Cell. 1991;65:905–914. [PubMed]