
J Stat Plan Inference. Author manuscript; available in PMC 2010 February 1.

Published in final edited form as:

J Stat Plan Inference. 2009 February 1; 139(2): 454.

doi: 10.1016/j.jspi.2008.04.030

PMCID: PMC2719909

NIHMSID: NIHMS80163

Covariate adjusted regression (CAR) is a recently proposed adjustment method for regression analysis where both the response and predictors are not directly observed (Şentürk and Müller, 2005). The available data has been distorted by unknown functions of an observable confounding covariate. CAR provides consistent estimators for the coefficients of the regression between the variables of interest, adjusted for the confounder. We develop a broader class of partial covariate adjusted regression (PCAR) models to accommodate both distorted and undistorted (adjusted/unadjusted) predictors. The PCAR model allows for unadjusted predictors, such as age, gender and demographic variables, which are common in the analysis of biomedical and epidemiological data. The available estimation and inference procedures for CAR are shown to be invalid for the proposed PCAR model. We propose new estimators and develop new inference tools for the more general PCAR setting. In particular, we establish the asymptotic normality of the proposed estimators and propose consistent estimators of their asymptotic variances. Finite sample properties of the proposed estimators are investigated using simulation studies and the method is also illustrated with a Pima Indians diabetes data set.

Covariate adjusted regression has been recently proposed to adjust for the distorting effects of a confounder in a regression setting. It was motivated by a common adjustment method in medical and health related studies. The adjustment entails normalization by anthropometric measurements, such as body mass index (*BMI*) and/or other measures of body configuration, as confounding variables that affect the primary variables of interest. For example, in a study involving haemodialysis patients, it is of interest to examine the relationship between elevated plasma fibrinogen level (a risk factor for cardiovascular disease in haemodialysis patients) and other predictors, such as serum transferrin protein level (Kaysen et al., 2003; Şentürk and Müller, 2005). However, both primary variables, fibrinogen and transferrin protein levels, are known to depend on body mass index, which exerts a confounding effect on the protein measurements. A common approach to adjust for confounders like *BMI* is to normalize the primary variables of interest by simply dividing them by *BMI*. Note that this adjustment by division implies that the assumed contamination is of a multiplicative form. Let *Ỹ*, *X̃*, and *U* denote the observed fibrinogen concentration, serum transferrin level, and confounder *BMI*, respectively. Using this notation, the adjusted primary variables that are thought to be free from the confounding effect of *BMI* are

$$Y = \tilde{Y}/U, \qquad X = \tilde{X}/U.$$

The basic motivation for the above adjustment is to obtain normalized versions of the observed primary variables by removing the confounder effects, so that the measurements are comparable across patients. Other examples include normalizations by *BMI* in studies on diabetes, and division of brain volumetric structures by total brain volume in neurological studies (Pinter et al., 2001).

Şentürk and Müller (2005, 2006) proposed a more flexible adjustment, modeling the confounding through *unknown functions* of the confounder instead of the confounder itself. This reflects the uncertainty encountered in many applications about the precise nature of the commonly assumed multiplicative relation between the confounder and the variables. For the case of *p* predictors, Şentürk and Müller model the underlying variables as

$$Y = \tilde{Y}/\psi(U), \qquad X_r = \tilde{X}_r/\phi_r(U), \quad r = 1, \ldots, p,$$

where they are defined to be the parts of the observed variables, *Ỹ*, *X̃*_{1},…, *X̃*_{p}, that are independent of the observable confounder

where *e* is the error term, assumed to be independent of the underlying predictors *X*_{r} and of *U*. The estimation procedure is based on the observed data: the distorted response, *Ỹ*, the distorted predictors, *X̃*_{r}, and the confounder *U*.

The main goal of this paper is to construct estimation and inference procedures needed to allow some of the variables as unadjusted/undistorted predictors, denoted by *Z*_{1},…, *Z*_{q}. The proposed underlying regression model is then of the form

$$Y = \gamma_0 + \sum_{r=1}^{p} \gamma_r X_r + \sum_{s=1}^{q} \delta_s Z_s + e, \qquad (1)$$

where * _{r}* = ϕ(

Under the more general partial covariate adjusted regression (PCAR) setting, formally presented in Section 2, the original CAR estimators (Şentürk and Müller, 2005) for *δ*_{s} are inconsistent and may have an arbitrarily large asymptotic bias, as shown in Section 3. We propose alternative estimators that are consistent under this extended setting; estimation is discussed in Section 3. Like CAR, the proposed PCAR methodology provides consistent estimators under both multiplicative and additive distortion, as also discussed in Section 3. The inference procedures developed for CAR are not valid in the PCAR setting, mainly because of the different dependence structure needed for PCAR, which is explained in detail in Sections 2 and 3. We therefore develop new theoretical tools for valid inference in the PCAR model. The asymptotic distributions of the proposed estimators and consistent estimators of their asymptotic variances are derived in Section 4. The method is illustrated with a Pima Indians diabetes data set in Section 5, and simulation studies characterizing the finite sample properties of the proposed estimators are summarized in Section 6. The proofs of the main results are assembled in Section 7, with some technical conditions and auxiliary results deferred to the Appendix. We conclude with a brief discussion in Section 8.

We note here that an advantage of CAR and PCAR is that, under the identifiability conditions introduced in Section 2, they yield consistent estimates whether the distortion is multiplicative or additive, i.e. *Y* = *Ỹ* − ψ(*U*) and *X*_{r} =

We consider the underlying (unobserved) regression model

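$$Y_{ni} = \gamma_0 + \sum_{r=1}^{p} \gamma_r X_{nri} + \sum_{s=1}^{q} \delta_s Z_{nsi} + e_{ni}, \qquad (2)$$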

where *Y _{ni}, e_{ni}, χ_{ni}* = (1,

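$$\tilde{Y}_{ni} = \psi(U_{ni})\, Y_{ni}, \qquad \tilde{X}_{nri} = \phi_r(U_{ni})\, X_{nri}. \qquad (3)$$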

The unknown distorting functions are assumed to be smooth functions of the confounder, *U*.

Some constraints on the unknown smooth distortion functions are needed for the identifiability of the estimation problem. A set of reasonable constraints for ψ(·) and {ϕ_{r}(·)} is implied by the natural assumption that the mean distorting effect should correspond to no distortion (Şentürk and Müller, 2005), i.e. the means of the adjusted variables are the same as the means of the observed variables, *E*(*X̃*_{r}) = *E*(*X*_{r}) and *E*(*Ỹ*) = *E*(*Y*). This leads to the identifiability conditions

$$E\{\psi(U)\} = 1, \qquad E\{\phi_r(U)\} = 1, \quad r = 1, \ldots, p. \qquad (4)$$

We consider the following dependence structure. The underlying predictors *X _{r}* and the undistorted predictors

The assumption that the underlying predictors, *X*_{r}, and response, *Y*, are independent of the contaminating variable *U* is fundamental to the estimation procedure. It defines the proposed contamination setting by defining the unobserved, underlying variables. This independence assumption cannot be checked in practice since *X*_{r} and

We refer to the model described by (2)–(4) as the partial covariate adjusted regression (PCAR) model, since only a partial set of the predictors are adjusted for the confounder. Note also that the CAR model is a special case of the PCAR model.

For the estimation, note that it follows from (2) and the mutual independence of {*e* and *U*}, {*e* and (*X _{r},Z_{s}*)}, and {

(5)

where

(6)

and ϵ(*U _{ni}*) = ψ

The estimation of the regression coefficients, γ_{0}, γ_{r} and *δ*_{s}, in the underlying regression model is a two-step procedure. The first step involves estimation of the varying coefficient functions in model (5), namely β_{0}(·), β_{r}(·) and η_{s}(·), using a binning approach. These varying coefficient functions are estimable because *Ỹ*, *X̃*_{r},

The binning approach for the estimation of the varying coefficient functions involves dividing the support of *U* into *m* equidistant bins and then fitting linear regressions of *Ỹ* on the distorted predictors *X̃*_{r} and the undistorted predictors *Z*_{s} using the data falling within each bin. The observed data are the collection of *n* samples of the confounder, the distorted response, the distorted predictors and the undistorted predictors. It is assumed that the confounding covariate, *U*, is bounded below and above, *a* ≤ *U* ≤ *b*, where *a* < *b* are real numbers. In practice *a* and *b* would be taken to be min_{i} *U*_{ni} and max

After the initial binning of the data, a linear regression is fitted to the data observed within each bin *B _{nj}*,

(7)

where the response vector is is the *L _{nj}* × (

In the second step of the estimation procedure, the estimators of the targeted regression parameters, γ_{0}, γ_{r} and *δ*_{s}, are obtained as weighted averages of the estimators from the *m* bins. The proposed PCAR estimators for γ_{0}, γ_{r} and *δ*_{s} are

(8)

where . The weights in (8) depend on the number of data points in each bin, namely *L _{nj}* for

We note that the estimators *γ̂*_{n0} and *γ̂*_{nr} have the same form as the CAR estimators (Şentürk and Müller, 2005), whereas the estimators *δ̂*_{ns} are different. Furthermore, a straightforward application of the CAR algorithm yields inconsistent estimators for *δ*_{s} under the more general PCAR model. To see this, denote the original CAR estimators for

where . The estimators do not target *δ _{s}*, instead they target

For the previously mentioned case (1), in which the response is undistorted, the simpler adjustments given by Hwang (1986) and Iturria et al. (1999) for multiplicative measurement error only in the predictors would also be applicable. Hwang (1986) proposes a consistent estimator for the regression coefficients by estimating and adjusting for the bias of the regular least squares estimator. The estimation assumes that consistent estimates of the moments of the measurement error are available. Iturria et al. (1999) propose two estimation methods, where the first considers specific distributional forms for the measurement error and the second also models the distribution of the unobserved predictors. The two approaches of Hwang and Iturria et al. are similar to PCAR in assuming that the error is independent of the unobserved predictors and that it has a mean of one. PCAR differs from these two approaches in that no knowledge of the distributional forms or of the specific moments is assumed. Instead, the proposed estimation procedure uses information from the observed covariate *U*, of which the measurement error ϕ_{r}(*U*) is a function.

The CAR estimators are biased only for the coefficients of the undistorted predictors. In other words, if all variables are treated as distorted, then CAR provides consistent estimators. This is the difference between CAR and PCAR: while the PCAR set-up allows researchers to treat only a subset of the predictors as distorted, the CAR set-up requires that all predictors be treated as distorted in order to yield consistent estimates.
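The contrast between the two estimators of δ can be illustrated numerically. The following is a minimal sketch, not the authors' implementation. It assumes that the PCAR estimators are bin-size-weighted averages of the bin-wise least squares fits, with the coefficient of the distorted predictor rescaled by bin means, and that the CAR-form estimator applies the same rescaling to the undistorted predictor. The function names `binwise_ols` and `weighted_estimates`, the sample size, the number of bins, the distributions of *X* and *Z*, and the rule of skipping sparse bins instead of merging them are all illustrative assumptions; the distortion functions are borrowed from the simulation study of Section 6.

```python
import numpy as np


def binwise_ols(y_obs, x_obs, z, u, m):
    """OLS of y_obs on (1, x_obs, z) within each of m equidistant bins of u."""
    edges = np.linspace(u.min(), u.max(), m + 1)
    labels = np.digitize(u, edges[1:-1])  # bin labels 0, ..., m-1
    fits = []
    for j in range(m):
        in_bin = labels == j
        size = int(in_bin.sum())
        if size < 4:  # assumption: skip sparse bins instead of merging them
            continue
        design = np.column_stack([np.ones(size), x_obs[in_bin], z[in_bin]])
        coef, *_ = np.linalg.lstsq(design, y_obs[in_bin], rcond=None)
        fits.append((size, coef, x_obs[in_bin].mean(), z[in_bin].mean()))
    return fits


def weighted_estimates(fits):
    """Combine bin-wise fits with weights proportional to bin sizes."""
    sizes = np.array([f[0] for f in fits], dtype=float)
    coefs = np.array([f[1] for f in fits])
    xbar_j = np.array([f[2] for f in fits])
    zbar_j = np.array([f[3] for f in fits])
    w = sizes / sizes.sum()
    xbar = np.sum(w * xbar_j)  # overall mean of the distorted predictor
    zbar = np.sum(w * zbar_j)  # overall mean of the undistorted predictor
    return {
        "gamma0": np.sum(w * coefs[:, 0]),
        "gamma1": np.sum(w * coefs[:, 1] * xbar_j) / xbar,
        # assumed PCAR form: plain weighted average of the bin-wise slopes of Z
        "delta_pcar": np.sum(w * coefs[:, 2]),
        # assumed CAR form: Z is rescaled as if it were a distorted predictor
        "delta_car_form": np.sum(w * coefs[:, 2] * zbar_j) / zbar,
    }


rng = np.random.default_rng(0)
n = 5000
u = rng.uniform(2.0, 6.0, n)
x = rng.normal(1.2, 1.0, n)              # underlying predictor, independent of U
z = (u - 3.5) + rng.normal(0.0, 1.0, n)  # undistorted predictor, depends on U
e = rng.normal(0.0, np.sqrt(0.5), n)
y = 4.0 - 1.0 * x + 3.0 * z + e          # underlying model, delta = 3
y_obs = (u + 3.0) / 7.0 * y              # psi(U) * Y
x_obs = (u + 10.0) / 14.0 * x            # phi(U) * X

est = weighted_estimates(binwise_ols(y_obs, x_obs, z, u, m=40))
print(est)
```

In this set-up the plain weighted average stays near δ = 3, while the rescaled version is pulled away from it because *Z* depends on *U*.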

As the PCAR set-up allows researchers to consider a subset of predictors as distorted, an important issue is how to determine whether a predictor should be considered as distorted or undistorted. Note that this distinction cannot be made using a statistical/analytical approach via studying the dependence or the relations between the observed variables and *U*. This is because all observed variables are dependent on *U* in the model (correlate with *U*) *whether they are distorted or undistorted*. Hence, this decision instead should be made by considering the duality between (a) the decision/assumption on undistorted (distorted) variables and (b) the specification of the underlying model; more specifically, the consideration of the predictor choice in the underlying model.

More precisely, determining to include a predictor as distorted or undistorted corresponds to two completely different underlying models of interest, one involving *Z _{s}* and the other involving , respectively. Specification of the underlying model will depend on the specific interest of the researcher and the specific context of the application. For example, in the data analysis presented in Section 5, the goal is to uncover the relation between a diabetes marker and diastolic blood pressure adjusted for body mass index. Hence, the diabetes marker and diastolic blood pressure are considered as adjusted/distorted variables. On the other hand, age and triceps skin fold thickness are considered unadjusted/undistorted, as we are interested in the direct effects of age and triceps skin fold thickness on the

We also note here that the PCAR estimators given in (8) are consistent for the parameters of the underlying model (2) also under additive distortion. More precisely, consider the simple case of one distorted and one undistorted predictor. The regression model in (2) simplifies to *Y* = γ_{0} + γ_{1}*X* + δ_{1}*Z* + *e*, where the multiplicative error structure is replaced by the additive error structure, given by *Ỹ* = *Y* + ψ_{a}(*U*) and *X̃* = *X* + ϕ_{a}(*U*). The above additive error model leads to the following specific varying coefficient model: *E*(*Ỹ*|*X̃*, *Z*, *U*) = β_{0}(*U*) + β_{1}(*U*)*X̃* + η_{1}(*U*)*Z*, where β_{0}(*U*) = γ_{0} − γ_{1}ϕ_{a}(*U*) + ψ_{a}(*U*), β_{1}(*U*) = γ_{1} and η_{1}(*U*) = δ_{1}. The PCAR estimators given in (8), namely *γ̂*_{0}, *γ̂*_{1} and *δ̂*_{1}, target

respectively. This holds regardless of the specific error structure, whether it be additive or multiplicative. Furthermore, under the additive distortion model, we have that

This follows since *E*{ψ_{a}(*U*)} = *E*{ϕ_{a}(*U*)} = 0 in the additive distortion model, under the identifiability condition of no average distortion, i.e. *E*(*Ỹ*) = *E*(*Y*) and *E*(*X̃*) = *E*(*X*). Thus, the PCAR estimators proposed in (8) are consistent for the parameters of the underlying model also under the additive distortion structure.
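The additive-distortion argument can be written out step by step. The following sketch assumes that the PCAR estimators target *E*{β_{0}(*U*)}, *E*{β_{1}(*U*)*X̃*}/*E*(*X̃*) and *E*{η_{1}(*U*)}, which is how bin-size-weighted averages of the bin-wise fits would behave in the limit; this is an assumption about the form of (8), not a quotation of it.

```latex
\begin{align*}
E(\tilde{Y} \mid \tilde{X}, Z, U)
  &= \gamma_0 + \gamma_1\{\tilde{X} - \phi_a(U)\} + \delta_1 Z + \psi_a(U), \\
E\{\beta_0(U)\}
  &= \gamma_0 - \gamma_1 E\{\phi_a(U)\} + E\{\psi_a(U)\} = \gamma_0, \\
\frac{E\{\beta_1(U)\tilde{X}\}}{E(\tilde{X})}
  &= \frac{\gamma_1 E(\tilde{X})}{E(\tilde{X})} = \gamma_1, \\
E\{\eta_1(U)\}
  &= \delta_1 .
\end{align*}
```

The first line uses *X* = *X̃* − ϕ_{a}(*U*) and the independence of the error from the predictors and *U*; the second uses *E*{ψ_{a}(*U*)} = *E*{ϕ_{a}(*U*)} = 0.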

We present the asymptotic distribution of the estimators * _{n0}*,

(9)

denote the least squares estimators of the multiple regression of the unobserved data falling into *B _{nj}*, where the vectors are defined the same way as , with replacing , respectively. This quantity is not estimable, but will be used in the proof of the main results.

For the PCAR estimators given in (8) to be well defined, the least squares estimators given in (7) must exist for each bin *B _{nj}*, i.e. . Correspondingly, the estimators in (9) will exist under the condition that . The following theorems are given under event

For the following theorems, we use the following notation: λ_{ψ} = *E*{ψ^{2}(*U*)}, λ_{ϕ} = *E*{ϕ^{2}(*U*)}, λ_{ψϕr} = *E*{ψ(*U*)ϕ_{r}(*U*)}, , **χ**^{T} = (1, *X*_{1},…, *X*_{p},

Under the technical conditions *(C1)–(C7)* given in Section 7 and the Appendix, on event *E*_{n} with

where

Theorem 1 establishes the asymptotic normality of the proposed PCAR estimators. The following theorem provides consistent estimators of the asymptotic variances given in Theorem 1.

Under the technical conditions *(C1)–(C7)* given in Section 7 and the Appendix, on event *E*_{n} with

where

Normalizing by the above consistent variance estimators, it holds that

Therefore, the approximate (1−α)100% asymptotic confidence intervals for γ_{r} and *δ _{s}* have the endpoints

(10)

where *z*_{α/2} is the (1 − α/2)th quantile of the standard Gaussian distribution.
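As a small illustration of how such an interval is computed, the sketch below assumes the usual √*n* normalization, i.e. that the variance estimate refers to the asymptotic variance of √*n* times the estimation error, so that the half-width is *z*_{α/2} times the square root of the variance estimate divided by *n*. The function name `wald_interval` and the numbers in the example call are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(estimate, var_hat, n, alpha=0.05):
    """Normal-approximation interval, assuming var_hat estimates the
    asymptotic variance of sqrt(n) * (estimate - true value)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # (1 - alpha/2) quantile
    half_width = z * sqrt(var_hat / n)
    return estimate - half_width, estimate + half_width


# Hypothetical numbers for illustration only.
lower, upper = wald_interval(estimate=3.0, var_hat=4.0, n=400)
print(lower, upper)
```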

These proposed variance estimators are motivated by the identifiability conditions, the definition of the smooth varying coefficients functions given in (6), Lemma 3 and Lemma 4 (a.). Using the consistency of _{nrj} and _{nsj} for the values of the functions β_{r} and η_{s} at the midpoint of the *j*th bin and the definitions of , we target the quantities with the estimators , respectively. Furthermore, relying mainly on Lemma 3 and Lemma 4 (a.), we target and , respectively.

We illustrate the proposed partial covariate adjusted regression methodology with an application to the Pima Indians diabetes data set, available at http://www.ics.uci.edu/~mlearn. Obesity is an important contributing factor to diabetes and has been widely studied in the Pima Indians population (Smith et al., 1988; Knowler et al., 1991; Hansen et al., 1998). One-half of adult Pima Indians have diabetes and 95% of those with diabetes are overweight (National Institute of Diabetes and Digestive and Kidney Diseases, http://diabetes.niddk.nih.gov). The available data come from a larger database; the subgroup used consists of *n* = 524 females at least 21 years old and of Pima Indian heritage. (The population lives near Phoenix, Arizona, U.S.A.) An oral glucose tolerance test is one of the diagnostic tests for type II diabetes. The goal is to uncover the underlying, *BMI* adjusted, regression relation, *PGC* = γ_{0} + γ_{1}*DBP* + δ_{1}*Age* + δ_{2}*TSFT* + *e*, based on the observed plasma glucose concentration (from an oral glucose tolerance test), the observed diastolic blood pressure, triceps skin fold thickness (*TSFT*), age and body mass index. We chose to adjust only the main relation of interest, namely the one between plasma glucose concentration (the response) and diastolic blood pressure, for body mass index, and included age and triceps skin fold thickness as unadjusted predictors, as they are commonly accounted-for factors in studies on diabetes.

Table 1 gives the regression coefficient estimates for (γ_{0}, γ_{1}, δ_{1}, δ_{2}) using the proposed PCAR method, the CAR method, the ordinary least squares (OLS) estimates from regressing the observed *PGC* on (observed *DBP*, *Age*, *TSFT*) without adjusting for the confounder *BMI*, and adjustment via division, i.e. regressing observed *PGC*/*BMI* on (observed *DBP*/*BMI*, *Age*, *TSFT*). The approximate 95% asymptotic confidence intervals for the regression parameters obtained through all three methods are also displayed. The approximate confidence intervals for PCAR estimates were obtained as proposed in (10).

Parameter estimates for the regression model *PGC* = γ_{0} + γ_{1}*DBP* + δ_{1}*Age* + δ_{2}*TSFT* + *e*, obtained by least squares regression of *Ỹ* = (plasma glucose concentration) on *X̃* = (diastolic blood pressure), *Z*_{1} = *Age* and *Z*_{2} = **...**

The implementation of the binning algorithm allows for merging of sparsely populated bins. Bin widths were chosen such that there are at least (*p* + *q* + 1) points, enough to fit the linear regression with (*p* + *q*) predictors in each bin. If there were bins with fewer than (*p* + *q* + 1) elements, such bins were randomly merged with neighboring bins. The merging algorithm is randomized to avoid the introduction of any additional bias. It starts by merging bins with no points. If there is more than one such bin, it randomly picks one and merges it with the neighbor that has the smallest number of points. After all the bins with no points have been merged, the bins with one point, and eventually the bins with *p* + *q* points, are merged. For this example with *n* = 524 (after the removal of outliers), the average number of points per bin was 15, yielding a total of 34 bins after merging. Note that CAR estimates have been shown to be sufficiently robust regarding different choices of *m*, under the rate conditions given in Section 4 (Şentürk and Müller, 2006). We have found this property to hold for the proposed PCAR estimates as well, where the range of *m* values yielding robust estimates for different sample sizes is given explicitly in the next section.
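The merging rule described above can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function name `merged_bins` is hypothetical, ties between two equally sparse neighbors are broken in favor of the left neighbor (the text does not say how ties are handled), and the example data are simulated.

```python
import numpy as np


def merged_bins(u, m, min_pts, rng):
    """Split u into m equidistant bins, then merge sparse bins with a neighbor.

    Returns a list of index arrays, ordered by increasing u.
    """
    edges = np.linspace(u.min(), u.max(), m + 1)
    labels = np.digitize(u, edges[1:-1])  # bin labels 0, ..., m-1
    bins = [np.flatnonzero(labels == j) for j in range(m)]
    while len(bins) > 1:
        sizes = [len(b) for b in bins]
        smallest = min(sizes)
        if smallest >= min_pts:
            break  # every bin can support the within-bin regression
        # Sparsest bins are handled first; pick one of them at random.
        candidates = [j for j, s in enumerate(sizes) if s == smallest]
        j = int(rng.choice(candidates))
        # Merge with the adjacent bin holding the fewest points.
        neighbours = [k for k in (j - 1, j + 1) if 0 <= k < len(bins)]
        k = min(neighbours, key=lambda idx: sizes[idx])
        lo, hi = min(j, k), max(j, k)
        bins[lo:hi + 1] = [np.concatenate([bins[lo], bins[hi]])]
    return bins


rng = np.random.default_rng(1)
u = rng.uniform(2.0, 6.0, 200)
bins = merged_bins(u, m=60, min_pts=4, rng=rng)
print(len(bins), min(len(b) for b in bins))
```

Here `min_pts` plays the role of *p* + *q* + 1, the minimum number of points needed to fit the within-bin regression.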

Note that the coefficients obtained by adjustment via division are quite different from those of the other three methods applied. In this adjustment the coefficient of *DBP* becomes quite pronounced compared to the other two predictors. This is most likely due to the pseudo dependence created between the observed *PGC* and the observed *DBP* via division by the common variable *BMI*. This is an example of the misleading conclusions that adjustment by division may suggest. In other words, if the original contamination is not exactly multiplication by the confounder (*BMI* in this example), then normalization by division may create further confounding, or “coupling” (as defined in Archie, 1981), creating a pseudo dependence that does not exist in the original data.
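The coupling effect can be reproduced in a small numerical sketch. The variables below are simulated and purely illustrative (they are not the diabetes data): `x` and `y` are generated independently of each other and of `u`, so any correlation between the ratios is induced solely by dividing both by the common variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(10.0, 1.0, n)  # illustrative variable, not distorted by u
y = rng.normal(10.0, 1.0, n)  # generated independently of x
u = rng.uniform(2.0, 6.0, n)  # common divisor, independent of x and y

r_raw = np.corrcoef(x, y)[0, 1]            # correlation of the raw variables
r_ratio = np.corrcoef(x / u, y / u)[0, 1]  # correlation after division by u
print(r_raw, r_ratio)
```

The raw correlation is close to zero, while the ratios are strongly correlated even though no relation exists between the original variables.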

Even though the OLS estimates for blood pressure and age are different from the PCAR and CAR estimates, all are found statistically significant at the usual 5% level. Thus, diastolic blood pressure and age are still important predictors of *PGC* even after adjusting for body mass index. However, using OLS, *TSFT* is a significant predictor of *PGC*, but it is not significant using PCAR and CAR at the 5% significance level. This result is not too surprising, since both *TSFT* and body mass index are indicators of obesity. They are positively correlated (Pearson correlation 0.67). Thus, adjusting for one, the other becomes an insignificant factor for predicting plasma glucose concentration. We note that even though estimation via CAR leads to the same conclusion as PCAR on the significance of the predictors for this analysis, the estimates from these two methods are different for *TSFT*. This is again to be expected, since CAR estimates are shown to be biased for the undistorted predictors.

To examine the numerical properties of the estimators, we implemented the following simulation studies. The underlying multiple regression model is

$$Y = \gamma_0 + \gamma_1 X_1 + \gamma_2 X_2 + \delta Z + e, \qquad (11)$$

where the parameters of interest are (γ_{0}, γ_{1}, γ_{2}, δ)^{T} = (4,−1, 0.3, 3). The error variable is *e* ~ *N*(0, .5), and the confounder variable *U* is generated from a uniform distribution on [2, 6]. We considered the joint distribution of the predictors to be multivariate normal: (*X*_{1},*X*_{2},*Z*)^{T} ~ *N*_{3}(**µ,Σ**), with a general covariance structure

The mean vector is **µ** = (0.7, 1.2, |*U*| − 3.5)^{T}, so that the undistorted predictor *Z* is dependent on *U*. To simulate the distorted (observed) data, we consider the following distorting functions, ψ(*U*) = (*U* + 3)/7, ϕ_{1}(*U*) = (*U* + 1)^{2}/26.3333, and ϕ_{2}(*U*) = (*U* + 10)/14, satisfying the identifiability constraints that *E*{ψ(*U*)} = 1 and *E*{ϕ_{r}(*U*)} = 1. The distorted response and predictors are *Ỹ* = ψ(*U*)*Y*, *X̃*_{1} = ϕ_{1}(*U*)*X*_{1}, and *X̃*_{2} = ϕ_{2}(*U*)*X*_{2}. Under this simulation setting, we examine (1) the confidence interval coverage levels based on the asymptotic results and (2) the finite sample bias of the estimators, and we compare the CAR and PCAR estimators in terms of variance and MSE.
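The identifiability constraints can be checked directly for these distorting functions. With *U* uniform on [2, 6], *E*(*U*) = 4 and Var(*U*) = 16/12, so *E*{(*U* + 1)^{2}} = 16/12 + 25 ≈ 26.3333, which explains the normalizing constant in ϕ_{1}. The sketch below confirms this numerically with a midpoint rule; the helper name `uniform_mean` and the grid size are illustrative choices.

```python
import numpy as np


def psi(u):
    return (u + 3.0) / 7.0


def phi1(u):
    return (u + 1.0) ** 2 / 26.3333


def phi2(u):
    return (u + 10.0) / 14.0


def uniform_mean(f, a=2.0, b=6.0, k=100_000):
    """Approximate E{f(U)} for U uniform on [a, b] using the midpoint rule."""
    grid = a + (np.arange(k) + 0.5) * (b - a) / k
    return float(np.mean(f(grid)))


means = {f.__name__: uniform_mean(f) for f in (psi, phi1, phi2)}
print(means)  # each value should be very close to 1
```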

We conducted 1000 Monte Carlo simulation runs for sample sizes *n* = 100, 150, 350, 800 and 1400 to study the approximate asymptotic confidence intervals given in (10); the corresponding total numbers of bins were *m* = 16, 27, 32, 50 and 70, respectively. Table 2 summarizes the coverage and interval lengths, averaged over the 1000 simulation runs, for the approximate 95% asymptotic confidence intervals for the parameter vector (γ_{0}, γ_{1}, γ_{2}, δ)^{T} = (4, −1, 0.3, 3). The numerical study indicates that the estimated non-coverage proportions approach the target value of 0.05 as the sample size *n* increases. The estimated interval lengths decrease as *n* increases, as expected.

We also examined the bias, variance and mean squared error (MSE) of the proposed estimators in comparison to the CAR estimators. For example, the estimated (absolute bias, variance, MSE) values for the PCAR estimators at the smallest sample size *n* = 100 are (0.0112, 0.2223, 0.2224), (0.0120, 0.1392, 0.1393), (0.0028, 0.0421, 0.0421) and (0.0167, 0.0651, 0.0654) for *γ̂*_{0}, *γ̂*_{1}, *γ̂*_{2} and *δ̂*, respectively. These values are averages over 1000 Monte Carlo runs. The results are similar for other sample sizes, where the variance seems to be the dominating factor contributing to the MSE. The estimated (bias, variance, MSE) values for the CAR estimator of δ are (1.294, 1.694, 3.368) at the same sample size of *n* = 100; the point estimates of γ_{0}, γ_{1} and γ_{2} are the same for the two methods, even though their asymptotic distributions are different. The multiplicative bias factor of the CAR estimate for δ, shown to be *C*_{s} =

As stated above, for sample sizes *n* = 100, 150, 350, 800 and 1400, the total numbers of bins formed were *m* = 16, 27, 32, 50 and 70, respectively. We carried out additional simulation studies to examine the effect of the number of bins *m*. The results suggest that the estimators are robust, based on estimated MSE, when *m* is chosen in the intervals [13, 18], [18, 27], [25, 45], [35, 65] and [60, 90], corresponding to sample sizes *n* = 100, 150, 350, 800 and 1400, respectively. In a given application, the above intervals can give rough guidelines on how the choice of *m* may change with sample size, although a sensitivity analysis for the choice of *m* specific to the data would also be informative.

Finally, we note that even though the smallest sample size at which the proposed asymptotic confidence intervals attain reasonable coverage in our simulation study is *n* = 100 (the coverage is around 5% off the targeted level for *n* = 100), the proposed PCAR point estimators still yield reasonable (bias, variance and MSE) values: (0.0515, 0.5748, 0.5775), (0.0026, 0.3552, 0.3553), (0.0256, 0.0916, 0.0922) and (0.0205, 0.3328, 0.3332) for γ_{0} = 4, γ_{1} = −1, γ_{2} = 0.3, δ = 3, respectively, at *n* = 50. These values are estimated under the current simulation set-up with 3 predictor variables for *n* = 50 and *m* = 8. For the case of simple linear regression, roughly the same number of bins can be attained with *n* = 30, and hence *n* = 30 would be the smallest sample size where CAR and PCAR give reasonable point estimates. For sample sizes smaller than 30, systematic localization via binning may not be fully feasible so the PCAR estimates should be taken with caution. In this case a very rough localization by stratification (by *U*) into 2–3 groups for crude comparisons is possible.

We provide the major steps of the proofs of the main results (Theorems 1 and 2) here and defer the auxiliary results for these proofs to the Appendix, where they are listed as Lemmas 1 to 4. We introduce the following technical conditions:

- (C1) The covariate *U* is bounded below and above, −∞ < *a* ≤ *U* ≤ *b* < ∞ for real numbers *a* < *b*. The density *f*(*u*) of *U* satisfies inf_{a≤u≤b} *f*(*u*) > *c*_{1} > 0, sup_{a≤u≤b} *f*(*u*) < *c*_{2} < ∞ for real *c*_{1}, *c*_{2}, and is uniformly Lipschitz continuous, i.e., there exists a real number *M* such that sup_{a≤u≤b} |*f*(*u* + *c*) − *f*(*u*)| ≤ *M*|*c*| for any real number *c*.
- (C2) The variables (*e*, *U*, *X*_{r}) are mutually independent for *r* = 1,…, *p*. In addition, (*e*, *Z*_{s}) are assumed to be independent.
- (C3) For the predictors, sup_{1≤i≤n,1≤r≤p,1≤s≤q} {|*X*_{nri}|, |*Z*_{nsi}|} ≤ *B* for some bound *B*. In addition, the predictors *X*_{r} satisfy the condition that *E*(*X*_{r}) ≠ 0.
- (C4) Contamination functions ψ(·) and ϕ_{r}(·), 1 ≤ *r* ≤ *p*, are twice continuously differentiable, satisfying *E*ψ(*U*) = 1, *E*ϕ_{r}(*U*) = 1, and ϕ_{r}(·) > 0, 1 ≤ *r* ≤ *p*.
- (C5) The matrices **Γ**_{nj}, *j* = 1,…, *m* are nonsingular, i.e. ρ = |inf_{j} det(**Γ**_{nj})| > 0, where is a *L*_{nj} × (*p*+*q*+1) undistorted data matrix in bin *j*, and denotes the *k*th observation.

The technical conditions above are similar to those introduced in Şentürk and Müller (2006), except for the new independence structure outlined in (C2), the boundedness of the undistorted predictors *Z _{s}* in (C3), and the bin dependent limiting matrices

In the proofs of the main results, the following notations will be utilized.

*A**B*: The Hadamard product of two matrices,*A*and*B*, of the same dimension. The matrix*A**B*is also of the same dimension with (*i, j*)th element equal to the product of the (*i, j*)th elements of matrices*A*and*B*.**1***a×b*: A matrix of size*a × b*with all entries equal to one.= (_{nj}_{n0j}_{n1j},…,_{npj},_{n1j},…,_{npj})^{T}.- We use to denote the matrix and
*L*_{nj(i)}to denote the number of points in the*j*th bin such that*U*_{ni}*B*, and is the (_{nj}*r, k*)th element of the matrix for 1 ≤ r ≤ p + q + 1, where is the*k*th element in the ordered sample . - .

From Lemma 4 (b.), we have that

(12)

where

We first consider the case of *r* = 0 and show that is asymptotically normal. Using Lemma 4, (13), and some algebra, can be expressed as

Since the above sum is over all bins indexed by *j*, and over all points within the bins indexed by *k*, it is equal to the sum over all data points indexed by *i*, summed up in a random order.

Thus, the above expression for can be further simplified to

(14)

Therefore, is asymptotically equivalent to because the second term is negligible when . Next, let *F*_{n0t} be the σ-field generated by . Then {*S*_{n0t}, *F*_{n0t}, 1 ≤ *t* ≤ *n*} is a mean zero martingale for *n* ≥ 1, since *E*(*S*_{n0t}) = 0, *E*(*S*_{n0,t+1}|*F*_{n0t}) = *S*_{n0t}, and *S*_{n0t} is adapted to *F*_{n0t}. Furthermore, note that the σ-fields are nested, that is, *F*_{n0t} ⊆ *F*_{n0,t+1} for all *t* ≤ *n*. Hence, it follows from Lemma 1 that in distribution (McLeish, 1974, Theorem 2.3 and subsequent discussion). This establishes the asymptotic normality of .

We proceed next to establish the asymptotic normality of for *r* = 1,…, *p*. Let and note that _{nr} = _{nr}/_{nr}. We first show that

(15)

For (15) to hold, by the Cramér-Wold device, it is enough to show the asymptotic normality of

(16)

for real *a, b*. The asymptotic normality of will follow from (15) by applying the δ-method with _{nr} = _{nr}/_{nr}. Again applying Lemma 4 together with (13) and some simple algebra, we can express _{nr} and _{nr} as

Thus, using similar simplifications as was done for the case of *r* = 0 in (14), the linear combination (16), namely can be expressed as

The second term is asymptotically negligible and it is straightforward to verify that is a mean zero martingale for *n* ≥ 1. Analogous to the case of *r* = 0, described in more details earlier, it follows from Lemma 2 that . Finally, a direct application of the δ-method gives for 1 ≤ *r* ≤ *p*, where is explicitly given in Theorem 1.

The asymptotic normality of follows similarly to the case of , since they have similar forms in (13). (See also definition/notation 4.) The asymptotic variance which has a similar form as , is given explicitly in Theorem 1. This completes the proof of Theorem 1.

The following relation holds on event *A _{n}*

(17)

It follows from Lemma 4 (a.) and (b.). Utilizing (17) together with (13) gives

(18)

By the Law of Large Numbers, (18), and boundedness considerations

It has been shown in Şentürk and Müller (2006) that

and it can be shown similarly that Also, using Lemma 3, Lemma 4 (*a*.), (*b*.) and the Law of Large Numbers, we have

The estimators of the asymptotic variances given in Theorem 2, in terms of the above quantities, are: (1) , (2) , and (3) . Thus, the first part of Theorem 2 follows by noting that and . Asymptotic confidence intervals given in the second part of the Theorem follow immediately from Theorem 1 and Slutsky’s theorem using the consistent variance estimators.

In this work we extend covariate adjusted regression (CAR) models to partial covariate adjusted regression (PCAR) models that allow for the specification of the effects of unadjusted predictors. Asymptotic normality of the proposed estimators is derived. The PCAR (and CAR) estimation approach was designed to estimate the underlying regression relationship directly, bypassing the estimation of the exact distortion forms. Although the current approach leads to estimators that are simple to implement, with known asymptotic properties and good finite sample performance, it does not provide direct estimates of the distorting functions. If the primary interest is in the distorting functions, then alternative approaches are needed. A potential alternative approach would involve considering refined estimators for the varying coefficient functions to be used in targeting the distorting functions. However, nontrivial work is required and this remains largely an open problem.

Another interesting and relevant issue, brought to light by a reviewer, is an alternative analysis of the diabetes data given in Section 5. The analysis proceeds by considering the observed variable *Z _{s}* =

Finally, we note the following regarding the implementation of the binning procedure, in the context of the data analysis. For the theory, we assumed that the support of *U* is the interval [*a, b*]. To be able to bin the data with respect to *U* = *BMI* and compute the estimators, we take *a* and *b* to be the min/max of the data. For any given population under study, a reasonable range can be inferred to define the limits *a* and *b*. For adults, we can reasonably set the limits to 14 and 65 BMI, for instance. In our data, the observed min and max are 18.2 and 49.7 and the intervals/bins are between these observed limits. However, if one is applying the binning using the limits 14 and 65, for instance, our estimator weighted by *L _{j}*/
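
The effect of the choice of limits on the bins can be illustrated with a short sketch. It assumes equal-width bins on [*a, b*] (the bin construction itself is defined earlier in the paper and is not repeated here); the function name `bin_by_confounder`, the number of bins, and the simulated values are hypothetical and are not the diabetes data.

```python
import numpy as np


def bin_by_confounder(u, m, a=None, b=None):
    """Assign each value of u to one of m equal-width bins on [a, b].

    Assumption: equal-width bins; a and b default to the observed min and max.
    Returns (labels, counts, edges) with labels in 0..m-1.
    """
    u = np.asarray(u, dtype=float)
    a = u.min() if a is None else float(a)
    b = u.max() if b is None else float(b)
    if m < 2 or not a < b:
        raise ValueError("need m >= 2 and a < b")
    if u.min() < a or u.max() > b:
        raise ValueError("all values of u must lie in [a, b]")
    edges = np.linspace(a, b, m + 1)
    # Use interior edges only, so that u == b falls in the last bin.
    labels = np.digitize(u, edges[1:-1])
    counts = np.bincount(labels, minlength=m)
    return labels, counts, edges


# Simulated BMI-like values (illustration only, not the diabetes data).
rng = np.random.default_rng(0)
bmi = rng.uniform(18.2, 49.7, size=200)

_, counts_obs, _ = bin_by_confounder(bmi, m=10)                     # observed limits
_, counts_wide, _ = bin_by_confounder(bmi, m=10, a=14.0, b=65.0)    # population limits
print("bin counts, observed limits:  ", counts_obs)
print("bin counts, population limits:", counts_wide)
```

With the wider limits, bins that lie entirely outside the observed range contain no observations, so any bin-wise quantity computed from them is undefined and must be handled explicitly.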

We are grateful to the reviewers for many detailed suggestions which substantially improved the paper as well as to the Editor for careful review. This work is supported by Grant Number UL1 RR024146 from the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH) and NIH Roadmap for Medical Research, NIEHS grant P01-ES011269-06, NIH grants UL1RR024922 and RL1AG032115, and National Institute of Child Health and Human Development grant HD036071.

In this section we provide the additional lemmas and their proofs utilized earlier for the main results. We begin by formally defining the events under which the two main theorems of Section 4 are given. Summarizing the existence conditions for the PCAR estimators, define the events

(19)

where , ρ is as defined in (C5), is the average of the *U*’s in *B _{nj}*, and (Ω,,

We next introduce some additional technical conditions that are needed for the proof of Lemma 4 given below:

- (C6) The functions *h*_{1}(*u*) = ∫*xg*_{1}(*x, u*)*dx* and *h*_{2}(*u*) = ∫*xg*_{2}(*x, u*)*dx* are uniformly Lipschitz, where *g*_{1}(·,·) and *g*_{2}(·,·) are the joint density functions of (**χ**, *U*) and (**χ***e*, *U*), respectively.
- (C7) The error term satisfies *E*|*e*^{τ}| < ∞ for τ > 4.

Under the technical conditions *(C1)–(C7)*, on event *A _{n}* (19), the martingale differences

(a.)

(b.)

.

Let *W*_{n0t} = *w*_{n0t}*υ*_{n0t}, where , *υ*_{n0t} = γ_{0}ψ(*U _{nt}*) + ψ(

Now, is bounded uniformly in *n* and *t*, since *e _{nt}* has finite fourth moment by (C7). Also note that . Lemma 1 (a.) follows, since uniformly in

Next, consider the term given in Lemma 1 (b.). It is equal to

Using the Law of Large Numbers, it holds that . Since *T*_{4} and *T*_{5} have expected values zero and variances *O*(*n*^{−1}), they are both *O _{p}*(

where **Γ** and are as defined prior to Theorem 1. Thus, and Lemma 1 (b.) follows.

*Under the technical conditions (C1)–(C7), on event A _{n}(19), the martingale differences W_{nrt} satisfy the conditions*

(a.)

(b.)

Part (a.) of Lemma 2 follows in a similar fashion as part (a.) of Lemma 1. Therefore, we focus on the proof of part (b.). The term in Lemma 2 (b.) is equal to

Using the Law of Large Numbers, it holds that . Since *T*_{11}, *T*_{12}, *T*_{13} and *T*_{14} have expected values zero and variances *O*(*n*^{−1}), they are all *O _{p}*(

Thus

where , and Σ_{r22} = var(* _{r}*). Hence Lemma 2 (b.) follows.

*Under the technical conditions (C1)–(C6), it holds on event E _{n} that*,

where

The proof follows from Lemma 3 of Şentürk and Müller (2006) by substituting 1 in place of ϕ_{p+1},…, ϕ_{p+q}.

Under the technical conditions *(C1)–(C7)*, for a sequence *r _{n}* such that , on event

(a.)

(b.)

, *where* **Γ**_{nj} *is assumed to be nonsingular by (C5), and* .

The proof is similar to the proof of Lemma 4 given in Şentürk and Müller (2006). However, a key difference is that the limiting term in part (a.), , contains expectations taken conditional on *U*. The conditioning on *U* does not disappear because of the dependence between *Z _{s}* and

*Proof that pr*(*E _{n}*) → 1. The formula given in (32) in Şentürk and Müller (2006) can be extended to sup

This shows that pr(inf*j* det(**Θ̃**_{nj}) > ζ) → 1 as *n* → ∞, which implies pr(*Ã _{n}*) → 1 as


- Archie JP. Mathematical coupling of data: a common source of error. Annals of Surgery. 1981;193:296–303. [PubMed]
- Berkson J. Are there two regressions? J. Am. Statist. Assoc. 1950;45:164–180.
- Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu C. Measurement Error in Nonlinear Models: A Modern Perspective. Second Edition. Boca Raton, FL: Chapman & Hall; 2006.
- Cleveland WS, Grosse E, Shyu WM. Local regression models. In: Chambers JM, Hastie TJ, editors. Statistical Models in S. Pacific Grove: Wadsworth & Brooks; 1991. pp. 309–376.
- Hanson RL, Ehm MG, Pettitt DJ, Prochazka M, Thompson DB, Timberlake D, Foroud T, Kobes S, Baier L, Burns DL, Almasy L, Blangero J, Garvey WT, Bennett PH, Knowler WC. An autosomal genomic scan for loci linked to type II diabetes mellitus and body-mass index in Pima Indians. Am. J. Hum. Genet. 1998;63:1130–1138. [PubMed]
- Hastie T, Tibshirani R. Varying coefficient models. J. R. Statist. Soc. B. 1993;55:757–796.
- Hwang JT. Multiplicative errors-in-variables models with applications to recent data released by the U.S. Department of Energy. J. Am. Statist. Assoc. 1986;81:680–688.
- Iturria S, Carroll RJ, Firth D. Polynomial regression and estimating functions in the presence of multiplicative measurement error. J. R. Statist. Soc. B. 1999;61:547–561.
- Kaysen GA, Dubin JA, Müller HG, Mitch WE, Rosales LM, Levin NW. the Hemo Study Group. Relationship among inflammation nutrition and physiologic mechanisms establishing albumin levels in hemodialysis patients. Kidney Int. 2003;61:2240–2249. [PubMed]
- Knowler WC, Pettitt DJ, Saad MF, Charles MA, Nelson RG, Howard BV, Bogardus C, Bennett PH. Obesity in the Pima Indians: its magnitude and relationship with diabetes. Am. J. Clin. Nutr. 1991;53:1543S–1551S. [PubMed]
- Lai TL, Robbins H, Wei CZ. Strong consistency of least-squares estimates in multiple regression 2. J. Mult. Anal. 1979;9:343–361.
- McLeish DL. Dependent central limit theorems and invariance principles. Ann. Statist. 1974;2:620–628.
- Pinter JD, Brown WE, Eliez S, Schmitt JE, Capone GT, Reiss AL. Amygdala and hippocampal volumes in children with Down syndrome: A high-resolution MRI study. Neurology. 2001;56:972–974. [PubMed]
- Şentürk D, Müller HG. Covariate adjusted regression. Biometrika. 2005;92:75–89.
- Şentürk D, Müller HG. Inference for covariate adjusted regression via varying coefficient models. Ann. Statist. 2006;34:654–679.
- Smith JW, Everhart JE, Dickson WC, Knowler WC, Johannes RS. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus; Proceedings of the Symposium on Computer Applications and Medical Care; 1988. pp. 261–265.
- Wu CO, Yu KF. Nonparametric varying coefficient models for the analysis of longitudinal data. Int. Statist. Rev. 2002;70:373–393.
