Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC3640838

Formats

Article sections

Authors

Related links

Stat Med. Author manuscript; available in PMC 2013 May 1.

Published in final edited form as:

PMCID: PMC3640838

NIHMSID: NIHMS453468

Victor Kipnis,^{a,}^{*}^{†} Douglas Midthune,^{a} Laurence S. Freedman,^{b} and Raymond J. Carroll^{c}

The publisher's final edited version of this article is available at Stat Med

In a recent paper, Weller, Milton, Eisen, and Spiegelman discuss fitting logistic regression models when a scalar main explanatory variable is measured with error by several surrogates, i.e., a situation with more surrogates than variables measured with error. Using a regression calibration approximate model as if it were exact, they compare two methods of adjusting for measurement error. One is the standard regression calibration approach consisting of substituting an estimated conditional expectation of the true covariate given observed data in the logistic regression. The other is a novel two-stage approach when the logistic regression is fitted to multiple surrogates, and then a linear combination of estimated slopes is formed as the estimate of interest. Applying estimated asymptotic variances for both methods in a single data set with some sensitivity analysis, the authors assert superiority of their two-stage approach. We investigate this claim in some detail. A troubling aspect of the proposed two-stage method is that, unlike standard regression calibration and a natural form of maximum likelihood, the resulting estimates are not invariant to reparameterization of nuisance parameters in the model. We show, however, that, under the regression calibration approximation, the two-stage method is asymptotically equivalent to a maximum likelihood formulation, and is therefore in theory superior to standard regression calibration. However, our extensive finite-sample simulations in the practically important parameter space where the regression calibration model provides a good approximation failed to uncover such superiorityof the two-stage method. We also discuss extensions to different data structures.

Weller *et al*. [1] consider a problem where a binary variable *Y* is to be regressed on a scalar main explanatory variable *X* and a vector covariate **Z** via logistic regression, but that *X* is subject to measurement error. Instead of observing *X*, a vector covariate **W** is observed, so that there are more surrogates, **W**, than there are variables measured with error, the scalar *X*. In place of standard regression calibration as described by Carroll and Stefanski [2] and Carroll *et al*. [3], they describe a novel extension of the two-stage linear approximation approach by Rosner *et al*. [4] to estimating the slope for *X*. They compare their new approach with standard regression calibration by estimating the corresponding approximate asymptotic variances based on a single data set and some sensitivity analysis and conclude that the two-stage approach provides a more efficient estimator than standard regression calibration.

The purpose of this paper is to place these methods in a more general context than just logistic regression, and to replace the somewhat informal calculations of Weller *et al*. with a more formal theoretical treatment. There are four important aspects of this treatment.

- Weller
*et al*. work with an approximate model, namely the regression calibration approximation to the conditional distribution of*Y*given the observed covariates (**W**,**Z**) and then do all their theoretical calculations as if this approximate model were true. This is a fairly common strategy when dealing with measurement error modeling and regression calibration, see for example Thurston*et al*. [5]. In the parameter subspace where such an approximation is close to the correct model, this allows a formal theoretical analysis, which we provide. - In this problem, the primary interest is in a single parameter, the regression slope for the main explanatory variable. All other parameters are nuisance parameters. We can thus write the likelihood function of the data as
*L*(**ζ**,*β*_{1}), where**ζ**is the vector of nuisance parameters and*β*_{1}is the parameter of interest. In all such likelihood problems, the maximum likelihood estimator of*β*_{1}is unchanged when a 1–1 reparameterization of**ζ**is done: the same is true for standard regression calibration. Unfortunately, the suggested two-stage estimator is not invariant to such 1–1 reparameterizations of the nuisance parameters, and can be made to vary quite widely in a given finite set of data. We suggest a modified two-stage estimator which, in our extensive simulations, provides better finite sample results than the original method. - We show that if the regression calibration approximation were exact, the two-stage estimator would have the same asymptotic variance as the maximum likelihood estimator. Extensive numerical work shows that, asymptotically, the maximum likelihood estimator and modified two-stage estimator are typically only slightly less variable than the one based on standard regression calibration, at least in the practically important parameter space where the regression calibration model provides a good approximation.
- As noted above, the theoretical calculations in Weller
*et al*. and in the present work are based on an approximate model. It is well known that this approximation can be a very poor one if the main explanatory variable has a relatively large effect on the outcome and/or measurement error is relatively large. In such cases, as is well demonstrated by the results in Weller*et al*., the model based on the regression calibration assumption is badly misspecified leading to severely inconsistent estimators with very different asymptotic and finite-sample properties. Comparing asymptotic variances of estimators derived under the regression calibration assumption may be irrelevant in this situation. However, these are exactly the cases where Weller*et al*. emphasize the superiority of their two stage method over the standard one, disregarding involved biases and finite-sample properties. We therefore provide finite-sample simulation results which suggest that standard regression calibration is, in most cases, at least as good as the two-stage method in terms of bias and mean squared error.

The setup of the problem follows Weller *et al*. We have primary study data consisting of (*Y _{ip}*,

- Assumption 1
- The mean of
*X*given (**W**,**Z**) equals*γ*_{0}+**W**^{T}**γ**_{1}+**Z**^{T}**γ**_{2}and the variance of*X*given (**W**,**Z**) equals in both the primary and the validation data.The second part of the model is the relationship between*Y*and (*X*,**W**,**Z**) in the primary data. Here it is typical to make two assumptions. - Assumption 2
- The measurement error is non-differential, i.e.,
**W**is independent of*Y*given (*X*,**Z**). - Assumption 3
- In the primary data, the distribution of
*Y*given (*X*,**Z**) follows an additive model, with density or mass function*f*(*Y*,*β*_{0}+*X β*_{1}+**Z**^{T}**β**_{2},**ζ**), where**ζ**is a vector of nuisance parameters.

Our goal is to compare the two-stage method and the regression calibration method. There are three ways that this can be done.

- First, one can do simulation studies, something we take up in Section 3.2.
- The second approach is theoretical, involving asymptotic theory, but it involves an approximation as used by Weller
*et al*. [1] among many other authors. The approximation, called the*regression calibration*approximation, is stated below as Assumption 4. It allows for a clear asymptotic comparison among the methods, see Appendix A. - The third approach is more formal than the other two and does not depend on the regression calibration approximation. Based only on Assumptions 1–3, one can develop the asymptotic distribution of the two-stage and regression calibration methods. We show how to do this for logistic regression in Appendix B. There are however problems with this in terms of numerical comparisons.
- The two methods generally estimate different quantities, see Section 1.4 for cases that they do estimate the same thing.
- Both estimators converge to quantities that are solutions to nonlinear integral equations which depend on the joint distribution of (
*X*,**W**,**Z**). This makes numerical comparison of their biases and variances difficult because in each case, an integral equation must be solved. - Interestingly, we show in Appendix B that if we make the regression calibration approximation, the results of the developed correct asymptotic theory reduce to those in Appendix A.

The regression calibration approximation, which we state here as Assumption 4, has to do with the relationship between *Y* and (**W**, **Z**), i.e., the distribution of the *observed* data.

Assumption 4: The distribution of *Y* given (**W**, **Z**) is the same as that of *Y* given (*X*, **Z**), except *X* is replaced by its regression on (**W**, **Z**), *E* (*X* | **W**, **Z**), and the intercept is changed [3]. In symbols, it is assumed that the density/mass function of *Y* given (**W**, **Z**) is *f* (*Y*, *α*_{0} + *β*_{1}
*E* (*X* | **W**, **Z**) +**Z**^{T}**β _{2}**,

It is important to emphasize that, for most nonlinear regressions including logistic regression, Assumptions 1–4 are contradictory. In particular, Assumptions 1–3 are the ones usually made in the analysis, from which not only does Assumption 4 not follow, but, strictly speaking, it is incompatible with them. There are at least three notable exceptions.

- If
*β*_{1}= 0, then the regression calibration approximation holds exactly and Assumption 4 is true. In this case, both the two-stage method and standard regression calibration method estimate*β*_{1}= 0 asymptotically. It is a consequence of the work in Appendix A that the two estimates are asymptotically equivalent when*β*_{1}= 0. - The second case occurs when
*Y*given (*X*,**Z**) follows a homoscedastic linear model, in which case Assumption 4 is strictly true. As it turns out, in this case the two-stage method and the regression calibration method are asymptotically equivalent, as we showin Appendix A. - The third is probit regression, where the probability of disease is pr(
*Y*=1|*X*,**Z**) = Φ(*β*_{0}+*X β*_{1}+**Z**^{T}**β**_{2}). If*X*given (**W**,**Z**) is normally distributed, then the observed data follow the probit modelwhere and . In this probit case, maximum likelihood for the observed data of Y given (**W**,**Z**), the two-stage method and the standard regression calibration method all estimate , and our asymptotic results apply. Of course, if is large, then all methods are therefore badly biased.

Outside of these important cases, we are left with a dilemma about the relevance of the asymptotic theory based on Assumption 4 for comparing the two methods. There are three possible ways to argue the relevance of the asymptotic theory based on Assumption 4, all of which are equivalent for our calculations but are very different conceptually. The first is the conventional approach of staying with the usually made Assumptions 1–3 and using Assumption 4 as an approximation. In cases where
is relatively small, this approximation is quite good and the theoretical results are relevant. One should be careful though and not extend this theory to the full parametric space. In the absence of relevant theoretical results when Assumption 4 provides a poor approximation, the statistical properties of the estimated slope *β*_{1} should be investigated by extensive finite-sample simulations.

The second possibility is to make Assumptions 1, 2 and 4. In this case, the true model remains unknown, but one assumes that *β*_{1} has some approximate relevance to the parameter of interest, although we do not know exactly how to specify this relationship. The third possibility is to make Assumptions 1, 3 and 4. In this case, one does not assume that error is non-differential, but rather that it is such that Assumptions 1, 3 and 4 hold. Although all three possibilities are interesting, and our work applies to all of them, we will concentrate on the first one which is usually used in practical applications.

In summary, a generalization of the framework of Weller *et al*. based on Assumptions 1–3 and using Assumption 4 as an approximation is as follows.

(1)

(2)

(3)

The goal is to estimate *β*_{1}. Model (1)-(3) represents a likelihood formulation for the observed data. This is a new insight, and it suggests immediately that the data can be fit, efficiently, by maximum likelihood to get an asymptotically efficient estimate of the crucial parameter *β*_{1}.

An outline of this paper is as follows. In Section 2, we describe the two-stage method of Weller *et al*. generalized to our context (1)–(3). We then state our three main results:(a) in the linear model for *Y* given (*X*, **Z**), all three methods are asymptotically equivalent; (b) maximum likelihood for (1)–(3) and the two-stage method are asymptotically equivalent; and (c) standard regression calibration is not in general asymptotically equivalent to maximum likelihood for (1)–(3).

In Section 3.1, we take up the issue of how much more efficient the two-stage method is theoretically compared to standard regression calibration, by studying asymptotic variances over a broad range of interesting problems, and via simulations. In contrast to Weller *et al*., who use a single data set and compute estimated asymptotic variances, we find that in practice, the difference in asymptotic efficiency between the two methods typically appears to be slight, although there are some exceptions. As mentioned above, under usually made Assumptions 1–3, model (1)–(3) is merely an approximation; this suggests that a finite-sample simulation study that compares the standard regression calibration and two-stage methods be undertaken. The results of the simulation study are reported in Section 3.2. and Appendix C. We conclude with a discussion in Section 4. A sketch of the technical details when Assumption 4 approximately holds is given in Appendix A, while in Appendix B we derive the exact asymptotic distributions for logistic regression, and show that they reduce to the results in Appendix A when the regression calibration approximation is invoked.

Maximum likelihood for model (1)–(3) is straightforward. Standard regression calibration consists of first estimating (**γ**_{1}, **γ**_{2}) using the validation data only, substituting them into (3), assuming that **γ**_{1} is fixed to this estimated value and known, and then performing maximum likelihood to estimate the unknown slope *β*_{1}.

The two-stage method is somewhat more complex. Again, Weller *et al*. first estimate (**γ**_{1}, **γ**_{2}) using the validation data to obtain (_{1}, _{2}). Then estimation of *β*_{1} proceeds in two stages. At stage 1, the following model is fit to the data:

(4)

leading to estimates _{1} of the “naive” slopes. Because of model (3), one has *β*_{1}**e**_{one} = **α**_{1}//**γ**_{1}, where // denotes element-wise division and where **e**_{one} = (1, 1,…, 1)^{T}. At stage 2, Weller *et al*. estimate *β*_{1} by forming the weighted linear combination **τ^{T}**(

Let **Λ** be any full rank square matrix whose number of columns is the dimension of **W**. Consider using the transformed vector of surrogates
and
based on the same full-rank linear transformation **Λ** applied to both the validation and primary data. It is easy to see theoretically that both maximum likelihood and regression calibration give exactly the same estimate of *β*_{1} whether (**W*** _{iv}*,

In contrast, the two-stage estimator enjoys no such property, and indeed in some settings it can be made to take on almost any value by varying **Λ**. For example, at

can be found primary and validation data sets when **W** is two dimensional. We generated 10,000 full rank linear transformations **Λ** by having their entries be standard normal random variables. The two-stage estimator for the transformed **W**-variables over these 10,000 simulated full rank linear transformations had a minimum = −36.11, a maximum = 2.43, a mean = 0.79 and a 98*%* confidence interval of [0.75, 0.83].

We suggest linearly transforming **W** in the validation study to have mean zero and identity covariance matrix, applying the same transformation to the primary data, and then applying the two-stage method. In our extensive simulations, we have found that the modified two-stage method provides better results than the original two-stage estimator.

We show that the two-stage method achieves optimal asymptotic properties under the regression calibration assumption.

Here is our main result, the proof of which is given in Appendix A. The asymptotic variances for maximum likelihood, regression calibration and the two-stage method are given in Appendix A at equations (A.3), (A.4) and (A.5), respectively.

Under model (1)-(3), the maximum likelihood and standard regression calibration estimators have the same asymptotic variances if Y given (X, Z) follows a homoscedastic linear model or if *β*_{1} = 0, but not in general. Under model (1)-(3), the two-stage estimator of *β*_{1} is asymptotically equivalent to maximum likelihood if there are no nuisance parameters **ζ** or if *Y* given (**W**, **Z**) follows a generalized linear model in the primary data.

While, at least technically, regression calibration is not efficient for approximate model (1)-(3), it remains to be seen how inefficient it is in realistic contexts. The asymptotic variances for all considered estimators are given at (A.3), (A.4) and (A.5) in Appendix A.

To compare these variances, we consider the case of logistic regression. Without loss of generality one can set *E*(*X*) = 0 and
, and because maximum likelihood, regression calibration and the modified two-stage method for estimating *β*_{1} are all invariant to full-rank linear transformations of **W**, without loss of generality one can assume that cov(**W**) = **I**. Assuming that
, the only items that one needs to vary are (a) the distribution of **W**; (b) (
, **γ**) subject to
; (c) the odds ratio of interest; and (d) the rate of disease in the source population. The latter two determine the underlying logistic intercept *β*_{0} and the slope *β*_{1}.

First, **W** was set to be bivariate, with mean zero and covariance matrix the identity, but generated from (a) a bivariate normal distribution and then rescaled; (b) a bivariate chi-square distribution with one degree of freedom and then rescaled; and (c) a bivariate chi-square distribution with one degree of freedom with a random sign, resulting in a mixture distribution. Without loss of generality, we defined *X* = **W**^{T}
**γ**+ *ε* to have mean zero and variance one. We did this by first generating
. We then generated **γ**_{*} = Normal(0, **I**_{2}), and defined
. Then we set the true logistic intercept *β*_{0} and the true logistic slope *β*_{1} so that the odds ratio of moving from the 10* ^{th}* to the 90

The results are striking: in all cases, the asymptotic variance of the standard regression calibration estimator of *β*_{1} was less than 1.5% larger than the asymptotic variance of the maximum likelihood estimator, and the two-stage estimator.

We also considered lognormal cases. We let Σ* _{v}* be a correlation matrix with correlation

The **W** data were generated as follows. We describe here the theoretical setting, although simulation was actually used in all cases to compute the integrals involved in the asymptotic variances. First, **W**_{*} were bivariate independent lognormal random variables such that log(**W**_{*}) are bivariate independent normal variables with mean zero and variance *ζ* = Uniform(0,1). We defined **W**_{**} to be the standardized version of **W**_{*}, with both independent variables having mean zero and variance one, then made **W** to be a linear transformation of **W**_{**} that had covariance matrix **Σ*** _{v}* and exactly mean zero. In symbols, let

The lognormal distribution is notorious for generating a few extremely wild observations. For example, in Figures 1 and and22 we give box plots of 5 randomly generated data sets of size 10,000 from the lognormal and gamma distributions, each with the coefficient of variation equal to 1.0. These are plotted on the same scale to make the point that the lognormal distribution is much more prone to wild observations than is the gamma distribution. Thus, the distribution that we actually used for **W** was a truncated one, truncated at 4×IQR above the 75* ^{th}* or below the 25

Box plots of five randomly generated lognormal data sets of size 10,000 with coefficient of variation = 1.0.

Box plots of five randomly generated gamma data sets of size 10,000 with coefficient of variation = 1.0.

In most interesting situations for practical applications, with odds ratios *OR* ≤ 6, the worst case for regression calibration was when *π* = 0.075, *O R* = 6, and *n _{v}*/

We have also checked the lognormal case should the odds ratio for moving from the 10* ^{th}* to the 90

All results up to now have been asymptotic results based on the regression calibration approximation in equations (1)–(3). In this section, we use simulations to examine the finite sample properties of the two-stage and regression calibration methods when the regression calibration assumption given by equation (3) is violated. We especially concentrate on cases when equation (3) does not hold even approximately, and the relevance of asymptotic variance calculations under this assumption becomes questionable.

We let *Y* be a binary disease outcome, *X* = true exposure, *W*_{1} = first surrogate measurement of *X* and *W*_{2} = second surrogate measurement of *X*. We performed 5 sets of 10,000 simulations. For each simulation, we generated two independent data sets, a main study and a validation study. In the main study, each subject had variables *Y*, *W*_{1} and *W*_{2}. In the validation study, each subject had the variables *X*, *W*_{1} and *W*_{2}. The sample size for the main study was 10,000, while the sample size for the calibration study was 1,000. We generated the data according to the following models. First, (*W*_{1}, *W*_{2}) was bivariate normal with zero means, common variance equal to 1.0, and correlation *ρ*. We generated
and set *X* = *γ*_{0} + *γ*_{1}*W*_{1} + *γ*_{2}*W*_{2} + *ε*. Finally, we generated *Y* as a binary variable with pr(*Y* =1*| X*) =*H* (*β*_{0} +*β*_{1}
*X*), where *H* (·) is the logistic distribution function. In all simulations, we set *ρ* = 0.5, *γ*_{0} = 0, and
, thus ensuring that var(*X*) =1. Then we set the true logistic intercept *β*_{0} and the true logistic slope *β*_{1} so that the odds ratio for increasing *X* by two standard deviations of its distribution was equal to a predefined value OR, and the probability of disease in the source population was equal to a predefined value *π*. Table 1 shows the values of the parameters for each simulation. For each simulation, we estimated *β*_{1} using standard regression calibration and both the original and modified version of the two-stage method of Weller *et al*. Table 2 shows the results of the simulations.

Results of simulation study with normal W. Means, standard deviations (SD) and root mean squared errors (RMSE) of the estimators of β_{1}.

The results are instructive. In cases 1 – 4, the empirical variance of standard regression calibration is equal to or smaller than that of the unmodified two-stage method with essentially no difference between standard regression calibration and the modified two-stage method. In the most extreme case 5, there is a 3% and 9% difference in variance, with the original and modified two-stage methods, respectively, being less variable. However, in all cases, standard regression calibration is less biased than either two-stage method and, as a result, has smaller root mean squared error. When one deals with true odds ratios greater than 20, cases 4–5, the regression calibration approximation can be particularly poor, and finite-sample properties of the considered estimators become especially relevant.

We also ran simulations when surrogate measurements had a lognormal distribution. In addition to a normally distributed error *ε* in the linear regression of *X* on **W**, at the request of one of the reviewers, we also considered the case of a lognormal error, both homoscedastic and heteroscedastic that depended on **W**. The description of those simulations and their results are given in Appendix C. All results are reasonably in line with those in the previous section for the normal case.

We used the data described by Weller *et al.* to understand the properties of the methods when *W*_{1}, *W*_{2} are categorical. Following Section 6 in Weller *et al*,. we considered the case when **W** = (*W*_{1}, *W*_{2}) had in total 3 distinct values. We set **W** to be categorical with 3 levels according to probabilities (0.4,0.3,0.3), and let *W*_{1}, *W*_{2} be the dummy variable indicators according to the two highest levels, with the induced correlation of *ρ* = −3/7. We set γ_{0} = 0.127, and
. Table 3 shows the values of the parameters for each simulation. In the simulations, the primary sample size was 10,000, the validation sample size was 1,000, and 10,000 simulated data sets were generated at each configuration. The results given in Table 4 are quantitatively similar to those for continuously distributed surrogate values.

We have shown that the two-stage method of Weller *et al*. is not regular in the sense that the finite sample estimates of the odds ratio can vary widely if different 1–1 transformations of nuisance parameters are made. Despite this, the two-stage method is asymptotically equivalent to maximum likelihood for approximate model (1)–(3).

Our most striking conclusion is that, under the regression calibration approximation, the two-stage method is asymptotically less or no more variable than standard regression calibration in a wide variety of circumstances.

We performed numerical calculations that somewhat temper the main conclusion. We found in most cases that the two-stage and standard regression calibration methods were very nearly equivalent in asymptotic variance, exceptions being when the odds ratio was very and perhaps unrealistically large. It would be interesting to find practical cases with modest odds ratios and well behaved distributions of **W** when the variance gain from using the two-stage method was substantial, but we were unable to find such cases.

We have also conducted an extensive simulation study to compare finite-sample properties of the estimators, with a special attention to cases where the regression calibration assumption does not provide a good approximation and the relevance of the asymptotic variances of the estimators under this approximation becomes questionable. We have found that only in such extreme cases the standard regression calibration estimator had a slightly greater finite-sample variance than the two-stage estimator. Yet, in all cases, standard regression calibration was less biased and, as a result, had smaller root mean squared error.

Under the regression calibration approximation, the two-stage method is asymptotically efficient for the problems we have described. It would be interesting to see how the method could be applied to other problems and assess its asymptotic properties in these situations. One of the properties of standard regression calibration, or maximum likelihood in the regression calibration context, is that it generalizes easily to cases where there are more than one predictor measured with error, and the surrogates for each such predictor may or may not be the same. It is possible to define a two-stage procedure for such cases, although it is not straightforward and the resulting estimators remain not invariant to reparameterization of nuisance parameters Whether they have optimality properties that match the asymptotic properties of the regression calibration maximum likelihood is unknown.

To illustrate this, suppose that there are *K* = 2 variables measured with error and
and
. This is a case that one set of covariates, **W**_{2} appear in each regression, but the others, **W**_{1} and **W**_{3}, appear in only one of the regressions. Then write
, **β**=(*β*_{1}, *β*_{2})^{T}, so that

It is not clear what **α** is here, and how one can take advantage of the fact that there is knowledge that **W**_{3} is not a predictor of *X*_{1} given (**W**_{1}, **W**_{2}). While it is possible to devise estimators, and weight them, their properties are unclear; maximum likelihood and standard regression calibration have no such lack of clarity.

We thank Dr. Weller for providing us with the data used in their original analysis. Carroll’s research was supported by a grant from the National Cancer Institute (CA57030) and by Award Number KUS-CI-016-04, made by King Abdullah University of Science and Technology (KAUST).

It is readily seen that the scores for estimating (*α*, *β*_{1}, *β*_{2}, *γ*_{0}, **γ**_{1}, **γ**_{2}, **ζ**) in model (3) are orthogonal to the score for estimating *σ _{ε}*, so we will without loss of generality assume
is known. Also, without loss of generality, we center

It suffices to prove the main result in the special case that there are no nuisance parameters **ζ**. In generalized linear models there may be a nuisance parameter **ζ**, but the loglikelihood is of the form

However, the score for *ζ* is orthogonal to the score for (*α*, *β*_{1}, **γ**_{1}, **β**_{2}). To see this well-known fact, we note that the score for (*α*, *β*_{1}, **γ**_{1}, **β**_{2}) is

where the superscript means first derivative. However, because this is a generalized linear model, *E*{**G**(*Y*, **W**, **Z**)*|*
**W**, **Z**} = 0. Differentiating the score for (*α*, *β*_{1}, **γ**_{1}, **β**_{2}) with respect to *ζ* yields −**G** (*Y*, **W**, **Z**)/*ζ*^{2}, so it too has mean zero. This means that the Fisher information matrix is block-diagonal, that the scores are orthogonal, and hence in generalized linear models the asymptotic theory is the same as if *δ* were known.

It suffices to work in the special case with no accurately measured covariates **Z** in the model: the extra generality is accomplished by somewhat more complicated definitions of what are terms *A* and **B** defined below. In this case, we replace **γ**_{1} by **γ** and *β*_{1} by *β*. The log-likelihood score then takes the form (*y*, *v*) = log{*f* (*y*, *v*)}/*v* with *v* = + *α*+ *β***W**^{T}**γ**. Hence, we do the detailed calculations for the model

We also use standard facts about likelihood theory, e.g., that the square of the score has expectation that is −1 times the expectation of the derivative of the score, so that, for example, we define in the primary study

These are easily confirmed by conditioning on **W**. Define *n _{v}*/

We use repeatedly the fact that if

then

where , and . It is also worth remembering that

(A.1)

The score for (*α*, **γ**, *β*) is given as

It follows with routine calculations, and remembering that cov(**W*** _{iv}*) =

Let . It follows from likelihood theory that

(A.2)

We want to compute . Define

Then

(A.3)

(A.4)

The estimating equation for (*α*, **γ**, *β*) is

where

Define **M** to be the covariance matrix of the left hand side of the last expansion so that

Then we have that

(A.4)

This method estimates **γ** in the validation data only, and thus
. Hence,
. The method then fits the likelihood
, and the loglikelihood score for this is just
. This has information matrix

Hence, the asymptotic covariance matrix of is the lower right corner submatrix of , and thus

The two-stage method considers a weighted linear combination of {diag()}^{−1}_{1}. Now, by algebra,

remembering that *β*(1,…,1)^{T} ={diag(**γ**)}^{−1}**α**_{1}. Hence,

If **e _{one}** = (1, 1, …, 1)

However,

and hence

(A.5)

Here we show that the maximum likelihood and two-stage estimators of *β* are asymptotically equivalent. Write
and **N** =**C** − **B***A*^{−1}**B*** ^{T}*. Studying (A.3) and (A.4), we have to show that

Then

where

,

*a*_{11} = *A*^{−1}{1+*β*^{2}**B**^{T}**a**_{22}**B***A*^{−1}}, *a*_{12} = −*βA*^{−1}*B*^{T}*a*_{22} and *a*_{21} = −*βa*_{22}*BA*^{−1}.

so that the MLE and the two-stage estimator are asymptotically equivalent. This completes the argument.

Here we show that the regression calibration estimator is not generally asymptotically equivalent to the other estimators. To do this, we simply need to exhibit a special case, and the simplest such case arises when Σ* _{v}* =

and hence

the lastfollowing from our more general result.

For standard regression calibration, we get a different answer. Let **D**_{*} and **M**_{*} be the lower non-diagonal versions of **D**and **M**, so that

Then

Direct calculations show that

This is in general different from the maximum likelihood and two-stage variance. It does coincide with them in the case of linear regression, where
, and in the case when *β* = 0.

Case | γ_{1} | γ_{2} | β_{0} | β_{1} | R^{2}
^{*} | Odds Ratio exp(2β_{1}) | Probability of Disease in Population | ||
---|---|---|---|---|---|---|---|---|---|

1 | 0.5 | 0.3 | −3 | 0.50 | 0.51 | 0.49 | 0.13 | 2.7 | 0.05 |

2 | 0.4 | 0.2 | −2 | 0.75 | 0.72 | 0.28 | 0.41 | 4.5 | 0.14 |

3 | 0.3 | 0.1 | −2 | 1.00 | 0.87 | 0.13 | 0.87 | 7.4 | 0.16 |

4 | 0.2 | 0.1 | −1 | 1.25 | 0.93 | 0.07 | 1.45 | 12.2 | 0.32 |

5 | 0.1 | 0.05 | 0 | 1.50 | 0.98 | 0.02 | 2.21 | 20.1 | 0.50 |

Case | β1 | Original Two-Stage Method | Modified Two-Stage Method | Regression Calibration Method | ||
---|---|---|---|---|---|---|

1 | 0.50 | 0.13 | Mean | 0.490 | 0.491 | 0.493 |

SD | 0.022 | 0.022 | 0.022 | |||

RMSE | 0.024 | 0.024 | 0.023 | |||

2 | 0.75 | 0.41 | Mean | 0.692 | 0.697 | 0.701 |

SD | 0.043 | 0.042 | 0.043 | |||

RMSE | 0.072 | 0.068 | 0.065 | |||

3 | 1.00 | 0.87 | Mean | 0.842 | 0.860 | 0.871 |

SD | 0.090 | 0.083 | 0.083 | |||

RMSE | 0.182 | 0.163 | 0.153 | |||

4 | 1.25 | 1.45 | Mean | 0.933 | 0.960 | 0.984 |

SD | 0.130 | 0.123 | 0.127 | |||

RMSE | 0.343 | 0.315 | 0.295 | |||

5 | 1.50 | 2.21 | Mean | 0.913 | 1.010 | 1.070 |

SD | 0.341 | 0.323 | 0.337 | |||

RMSE | 0.679 | 0.587 | 0.546 |

Case | β1 | Original Two-Stage Method | Modified Two-Stage Method | Regression Calibration Method | ||
---|---|---|---|---|---|---|

1 | 0.50 | 0.13 | Mean | 0.485 | 0.486 | 0.487 |

SD | 0.022 | 0.022 | 0.022 | |||

RMSE | 0.027 | 0.026 | 0.026 | |||

2 | 0.75 | 0.41 | Mean | 0.684 | 0.689 | 0.693 |

SD | 0.046 | 0.043 | 0.043 | |||

RMSE | 0.080 | 0.075 | 0.071 | |||

3 | 1.00 | 0.87 | Mean | 0.838 | 0.854 | 0.864 |

SD | 0.093 | 0.083 | 0.081 | |||

RMSE | 0.187 | 0.168 | 0.158 | |||

4 | 1.25 | 1.45 | Mean | 1.004 | 1.033 | 1.056 |

SD | 0.153 | 0.140 | 0.139 | |||

RMSE | 0.290 | 0.258 | 0.239 | |||

5 | 1.50 | 2.21 | Mean | 1.086 | 1.184 | 1.260 |

SD | 0.377 | 0.359 | 0.380 | |||

RMSE | 0.560 | 0.478 | 0.449 |

Case | β1 | Original Two-Stage Method | Modified Two-Stage Method | Regression Calibration Method | ||
---|---|---|---|---|---|---|

1 | 0.50 | 0.13 | Mean | 0.458 | 0.473 | 0.479 |

SD | 0.116 | 0.115 | 0.109 | |||

RMSE | 0.123 | 0.118 | 0.111 | |||

2 | 0.75 | 0.41 | Mean | 0.566 | 0.602 | 0.616 |

SD | 0.187 | 0.196 | 0.180 | |||

RMSE | 0.262 | 0.246 | 0.224 | |||

3 | 1.00 | 0.87 | Mean | 0.700 | 0.757 | 0.783 |

SD | 0.293 | 0.298 | 0.292 | |||

RMSE | 0.419 | 0.385 | 0.364 | |||

4 | 1.25 | 1.45 | Mean | 0.733 | 0.793 | 0.825 |

SD | 0.356 | 0.358 | 0.355 | |||

RMSE | 0.628 | 0.581 | 0.554 | |||

5 | 1.50 | 2.21 | Mean | 0.780 | 0.872 | 0.911 |

SD | 0.450 | 0.462 | 0.466 | |||

RMSE | 0.849 | 0.780 | 0.751 |

The purpose of this appendixis to provide a precise asymptotic analysis in the case of logistic regression, clearly stating where the regression calibration is used. In cases that the regression calibration is used, we will replace = by .

Let *H* (*x*) =exp(*x*)/{1+exp(*x*)} be the logistic distribution function. Our model is that in the validation and primary studies, which are of size *n _{p}* and

We define *c _{F}* =

The two-stage method makes the assumption that there is a unique (*α*_{0}, **α**_{1}) such that

(B.1)

In this case, (*α*_{0},**α**_{1}) are the parameters that logistic regression of *Y* on **W** in the primary study actually converges to in probability. It should be noted that, precisely,

(B.2)

In other word, (B.2) says that the observed data do not follow a logistic regression exactly. Logistic regression of *Y* on **W** in the primary study means solving the estimating equation

Define , and make the further definitions

where *H*
^{(1)}(*x*) =*H* (*x*){1 − *H* (*x*)}. It is a consequence of standard estimating equation theory that

This means that

The two-stage method considers a weighted linear combination of {diag(**γ**)}^{−1}_{1}. We have that

where

If **e**_{one} = (1, 1, …, 1)^{T}, then the optimal weighted combination in terms of variance of {diag()}^{−1}_{1}has weights
. The two stage estimator of *β* then is _{TS} = **ω**^{T}{diag(**γ**)}^{−1}_{1}, and we have that

(B.3)

Of course, an implication of (B.3) is that the two-stage estimator converges to **ω**^{T}{diag(**γ**)}^{−1}**α**_{1}).

Equation (B.3) makes it clear that the two-stage method actually converges to **ω**^{T}{diag(**γ**)}^{−1}**α**_{1}). However, **ω** = **ω**(**Q**_{1}, **Q**_{2}, **γ**, **α**_{1}) depends on (**Q**_{1}, **Q**_{2}, **γ**, **α**_{1}). Since (*α*_{0}, **α**_{1}) are the solutions to the integral equations (B.1), which in turn depend on the distributions of (*X*, **W**) in the primary study, the asymptotic variance (B.4) is not particularly useful for comparison with regression calibration.

However, there are important simplifications if one makes the same regression calibration approximations that are made by Weller, et al.[1], namely that for some *β*_{TS},

(B.4)

(B.5)

where in both cases we have invoked the term to reflect the regression calibration approximation. In this case, it is readily seen from (B.4) that **Q**_{1} **Q**_{2}, **ω**^{T}{diag(**γ**)}^{− 1}**α**_{1} *β*_{TS}, and then the regression calibration approximation to the asymptotic variance of the two-stage method reduces to the result in the Appendix A. In particular, define
. Then we have that **Q**_{3} (**C** − **B***A*^{−1}**B**^{T})^{−1}, which agrees with the result in Appendix A. Using (B.5), we also have

which also agrees with the results in Appendix A.

We will distinguish here between the regression calibration method and the regression calibration approximation.

The regression calibration method makes the assumption that there is a unique (*β*_{0}, *β*_{RC}) such that

(B.6)

Then (_{0}, _{1}) solve

(B.7)

This means that the estimating equations are

where if ,

Letting *v _{ip}* =

it follows that if **e**_{1} = (0, 0, …, 0, 1)^{T},

(B.8)

Like (B.3), equation (B.8) is not particularly useful for making comparisons among various methods because of the need to solve the integral equation (B.6) that in turn depends on the distribution of **W*** _{ip}*, etc. However, the equivalent regression calibration approximation to (B.4)–(B.5) is that

(B.9)

Under (B.4)–(B.5) and (B.9), *β*_{TS} *β*_{RC}. Equation (B.9) implies that
. In this case, **D**_{1} and **D**_{2} simplify to

exactly as described in Appendix A.

In this appendix, we present results of three sets of simulations where surrogates W_{1} and W_{2} had bivariate lognormal distributions. We first generated (*Z*_{1}, *Z*_{2}) as bivariate standard normal with correlation *ρ*_{*}. Let
, *μ*_{*} = exp(*σ*^{2}/2) and
, so that the coefficient of variation is *σ*_{*}/*μ*_{*} = 2. By simulation, we found *ρ*_{*} such that *corr*{exp(*σZ*_{1}), exp(*σZ*_{2})}= *ρ*. Then (*W*_{1}, *W*_{2}) defined by *W _{j}* ={exp(

As with other simulations described in the main text, we generated *X* as *X* = *γ*_{0} + *γ*_{1}*W*_{1} +*γ*_{2}*W*_{2} + *ε*, where *ε* was uncorrelated with (*W*_{1}, *W*_{2}) and *E*(*ε*) = 0,
, thus ensuring *E* (*X*) = 0, var(*X*) =1. As before, we generated *Y* as a binary variable with pr(*Y* =1*| X*) =*H* (*β*_{0} +*β*_{1}
*X*), where *H* (·) is the logistic distribution function. Then we set the true logistic intercept *β*_{0} and the true logistic slope *β*_{1} so that the odds ratio for increasing *X* by two standard deviations of its distribution was equal to a predefined value OR, and the probability of disease in the source population was equal to a predefined value *π*.

In the first set of simulations, we generated *ε* to be normal and independent of (*W*_{1}, *W*_{2}). In the second set, we set *ε* = *δ* − *E*(*δ*), where *δ* was independent of (*W*_{1}, *W*_{2}) and had a lognormal distribution with
and CV = 1. In the third set of simulations, we considered a case of heteroscedastic error *ε* = (*γ*_{1}*W*_{1} +*γ*_{2}*W*_{2})× (*δ* −1), *δ* was independent of (*W*_{1}, *W*_{2}), lognormal with mean *E*(*δ*) =1, variance
, and *CV*(*δ*) = var(*δ*).

In all simulations, we set *ρ* = 0.5, *γ*_{0} = 0, *n _{v}* =1,000 and

1. Weller EA, Milton DK, Eisen EA, Spiegelman D. Regression calibration for logistic regression with multiple surrogates for one exposure. Journal of Statistical Planning and Inference. 2007;137:449–461.

2. Carroll RJ, Stefanski LA. Approximate quasi-likelihood estimation in models with surrogate predictors. Journal of the American Statistical Association. 1990;85:652–663.

3. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. 2. Chapman and Hall CRC Press; Boca Raton, FL: 2006.

4. Rosner B, Willett WC, Spiegelman D. Correction of logistic regression relative risks and confidence intervals for systematic within-person measurement error. Statistics in Medicine. 1989;8:1051–1070. [PubMed]

5. Thurston SW, Spiegelman D, Ruppert D. Equivalence of regression calibration methods in main study/external validation study designs. Journal of Statistical Planning and Inference 2003. 2003;113:527–539.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |