


J Stat Plan Inference. Author manuscript; available in PMC 2010 July 1.

Published in final edited form as:

J Stat Plan Inference. 2009 July 1; 139(7): 2341–2350.

doi: 10.1016/j.jspi.2008.10.024. PMCID: PMC2674251

NIHMSID: NIHMS85580

Department of Biostatistics, Mailman School of Public Health, Columbia University, 722 West 168th Street, New York City, N.Y. 10032, U.S.A.

**Abstract**

When data are missing, analyzing records that are completely observed may cause bias or inefficiency. Existing approaches to handling missing data include likelihood, imputation and inverse probability weighting. In this paper, we propose three estimators inspired by deleting some completely observed data in the regression setting. First, we generate artificial observation indicators that are independent of the outcome given the observed data and draw inferences conditioning on the artificial observation indicators. Second, we propose a closely related weighting method. The proposed weighting method has more stable weights than those of the inverse probability weighting method (Zhao and Lipsitz, 1992). Third, we improve the efficiency of the proposed weighting estimator by subtracting the projection of the estimating function onto the nuisance tangent space. When data are missing completely at random, we show that the proposed estimators have asymptotic variances smaller than or equal to the variance of the estimator obtained from using completely observed records only. Asymptotic relative efficiency computation and simulation studies indicate that the proposed weighting estimators are more efficient than the inverse probability weighting estimators under a wide range of practical situations, especially when the missingness proportion is large.

**1. Introduction**

When data are missing, analyzing only completely observed records could cause bias or inefficiency. One way of handling missing data is to maximize the observed likelihood obtained by integrating the likelihood for the full data and observation indicators over the missing data (e.g. Little and Rubin, 1987; Laird, 1988). In the non-likelihood framework, approaches such as imputation and inverse probability weighting have been proposed. The imputation method replaces the contribution of the estimating function with missing statistics by its conditional expectation given the observed data (Reilly and Pepe, 1995; Paik, 1996). The inverse probability weighting method (Zhao and Lipsitz, 1992) uses only the completely observed records, but weights each record by the inverse of its probability of observation. The two approaches reflect different viewpoints of problems involving missing data: the imputation method fills in the missing data with the most plausible values, while inverse probability weighting blows up the observed records to properly represent the whole data (Fleiss, Levin and Paik, 2003). Correspondingly, these two approaches represent different ways of constructing unbiased estimating functions. Recently, Lipsitz et al. (1999) proposed a combination of the two in which the authors used weights derived from imputation models. Here we propose a third approach, which deletes some of the completely observed records artificially, thereby motivating a class of estimators. We propose to delete some records so that, after deletion, the observation process is independent of the outcome and therefore the complete-record analysis is valid. The implementation of this approach is simpler than that of existing approaches and can be widely disseminated to users. Intuitively, we discard some observed information to undo the harm caused by the missingness mechanism and restore the original structure that existed in the full data. A similar idea was implemented in survival analysis via artificial censoring to fix dependent censoring (Lin, Robins and Wei, 1996).

Specifically, we propose three estimators. We call the first the deletion estimator, directly reflecting the main idea. That is, we create artificial observation indicators so that the artificially created indicator is conditionally independent of the outcome, and analyze only the records that are ‘artificially’ observed. The artificially created observation indicator is a decreasing function of the observation indicator, and the artificially observed records constitute a subset of the completely observed records. The proposed deletion estimators are consistent when data are missing at random. Although counterintuitive, we show that the deletion estimator has asymptotic variance smaller than or equal to that of the estimator obtained from the complete-record analysis, despite using a smaller number of records, when data are missing completely at random. The second proposed estimator involves a weighting method where the weight is the probability of deletion in the deletion method. The weights of the proposed method are more stable than those of the inverse probability weighting method, and the resulting estimates are more efficient when the missing proportion is high. We also show that the weighting estimate is the limit of the mean of repeatedly computed deletion estimates as the number of replications approaches infinity. Finally, the third estimator modifies the second using the argument of Robins et al. (1994) to improve efficiency. The efficient version of the proposed weighting estimator is shown to perform better than the counterpart of the inverse probability weighting in situations where the proposed weighting estimator performs better than the IPW estimators.

**2. Motivation**

Let *Y* denote the outcome and (*X,Z*) denote covariates. We consider situations where our interest lies in the regression setting and *E*(*Y*|*X,Z*), with a known parametric form, is the quantity of interest. Outcome *Y* and covariate *Z* are completely observed, while covariate *X* could be missing. Let *R* be the observation indicator for *X*. Throughout the paper, we assume that the data are missing at random (Rubin, 1976), i.e., the observation probability does not depend on the missing variable *X* itself:

$$P(R|X,Y,Z)=P(R|Y,Z).$$

(1)

Our motivation for the proposed method starts from the observation that if *R* additionally satisfies the condition *R* ⊥ *Y*|*Z*, then *E*(*Y*|*X,Z,R* = 1) = *E*(*Y*|*X,Z*), and we can consistently estimate *E*(*Y*|*X,Z*) simply by conducting the Complete-Record (CR) analysis. In practice, however, we cannot control *R* and cannot force *R* to satisfy this additional condition. Therefore, a key idea of the proposed method is to generate an artificial variable, say *R**, such that *R** ⊥ *Y*|*Z*, and to estimate *E*(*Y*|*X,Z*) using only records whose artificial variable *R** equals 1. This approach is simple and can be handled by most standard software. The problem then boils down to how to generate such an *R**.
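The generation step can be sketched as follows. This is an illustrative reading, not the paper's code: `pi_obs` stands for the fitted observation probability *π*(*Y,Z*), `pi_M` for a *Y*-free target probability with *π_M*(*Z*) ≤ *π*(*Y,Z*), and the retention probability is assumed to take the form *q_i* = *π_M*(*Z_i*)/*π*(*Y_i,Z_i*) (consistent with the identity *πq* = *π_M* used in Appendix A); all function and variable names are mine.

```python
import numpy as np

def artificial_indicator(R, pi_obs, pi_M, rng):
    """Thin the completely observed records so that observation no longer
    depends on Y: a record with R_i = 1 is kept (R*_i = 1) with probability
    q_i = pi_M_i / pi_obs_i, so that P(R* = 1 | Y, Z) = pi_obs * q = pi_M(Z),
    which is free of Y."""
    q = np.clip(pi_M / pi_obs, 0.0, 1.0)   # retention probabilities, <= 1
    return R * rng.binomial(1, q)          # delete some observed records

rng = np.random.default_rng(0)
R = np.array([1, 1, 1, 0, 1])                  # observation indicators
pi_obs = np.array([0.9, 0.6, 0.8, 0.5, 0.6])   # fitted P(R = 1 | Y, Z)
pi_M = np.full(5, 0.5)                         # Y-free target, <= pi_obs
R_star = artificial_indicator(R, pi_obs, pi_M, rng)
```

By construction *R** ≤ *R*: records already missing stay missing, and only some of the complete records survive the artificial deletion.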

For a concrete example, consider the case in which (*Y,Z*) are all binary. For brevity, denote *P*(*R_{i}* = 1|

In Section 3, we propose three estimators motivated by the idea of deleting. The first is the usual estimator but using the records with *R** = 1 only; we call this the deletion estimator. The second is the weighting estimator using records with *R* = 1 only, with weight *P*(*R_{i}** = 1|

**3. Three Proposed Estimators**

To formalize the idea presented in Section 2, suppose that *P*(*R* = 1|*Y,Z*) is a known function indexed by an unknown parameter *α*, denoted by *π*(*Y,Z; α*) > 0, and that *π* is a differentiable function of *α*. Furthermore, under condition (1), suppose there exists a consistent and asymptotically normally distributed estimator $\widehat{\alpha}$, which can be expressed as
${n}^{\frac{1}{2}}(\widehat{\alpha}-{\alpha}_{0})=\sum {A}_{i}({\alpha}_{0})+{o}_{p}(1)$, where *α*_{0} is the true value for *α* and the *A_{i}*(*α*_{0}) are independent terms with mean zero.

$${U}_{n}^{\ast}(\beta )=\sum {R}_{i}^{\ast}g({X}_{i},{Z}_{i};\beta )\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};\beta )\}=0.$$

(2)

We show in the next section that the estimating function (2) has zero expectation and that its solution $\widehat{\beta}^{\ast}$ is consistent for the true value *β*_{0} and asymptotically normally distributed.

The construction of the deletion estimator is motivated by the need to fix bias, but it could potentially cause a loss of efficiency. However, it turns out that the estimator in fact gains efficiency over the CR estimator when data are missing completely at random. In Appendix D we show that the asymptotic variance of the deletion estimator is smaller than or equal to that of the CR estimator when data are missing completely at random. This may initially sound counterintuitive, since *R** ≤ *R* and the complete-record analysis utilizes more records than the deletion method. However, the number of deleted records is small when data are missing completely at random, and furthermore the deletion is executed effectively using information from the observed data, which leads to a more efficient estimator.

A practical weakness of the deletion method is that the randomness of *R** produces different estimates each time given the same observed data. To improve upon this feature, one can contemplate the estimating function *E*{*U_{n}**(*β*)|*R*, *X*_{o}, *Y*, *Z*}:

$$E\{{U}_{n}^{\ast}(\beta )|R,{X}_{o},Y,Z\}={U}_{n}(\beta ,\widehat{\alpha})=\sum {R}_{i}{q}_{i}(\widehat{\alpha})g({X}_{i},{Z}_{i};\beta )\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};\beta )\}=0$$

(3)

Equation (3) is the proposed weighting estimating equation, and its solution is the proposed weighting estimator. If the weight *q_{i}*($\widehat{\alpha}$) is replaced by 1/*π*(*Y_{i}*, *Z_{i}*; $\widehat{\alpha}$), equation (3) becomes the inverse probability weighting (IPW) estimating equation.
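A small numeric comparison illustrates the stability claim. The probabilities below are illustrative values of my choosing, and `pi_M` is taken as the smallest probability on the grid, a stand-in for a *Y*-free lower envelope of *π*(*Y,Z*):

```python
import numpy as np

# Observation probabilities, including a small one that destabilizes IPW.
pi = np.array([0.05, 0.10, 0.30, 0.60, 0.90])
pi_M = pi.min()                 # Y-free lower envelope (illustrative)

ipw_weights = 1.0 / pi          # IPW weight 1/pi: grows without bound as pi -> 0
wt_weights = pi_M / pi          # proposed weight q = pi_M / pi: always <= 1
```

Here the largest IPW weight is 1/0.05 = 20, while the proposed weights never exceed 1, which is the sense in which they are "more stable" when some observation probabilities are small.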

We can show that the estimating function (3) has zero expectation and the resulting estimator is consistent and asymptotically normally distributed.

**Theorem 1.** ${n}^{\frac{1}{2}}(\widehat{\beta}-{\beta}_{0})$ *is asymptotically normally distributed with mean 0 and variance*
${\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},{\alpha}_{0})\sum ({\beta}_{0},{\alpha}_{0}){\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},{\alpha}_{0})$, *where*

$${\mathrm{\Gamma}}_{1}({\beta}_{0},{\alpha}_{0})=\underset{n\to \infty}{\mathrm{lim}}{n}^{-1}E[-\frac{\partial {U}_{n}(\beta ,\alpha )}{\partial {\beta}^{T}};{\beta}_{0},{\alpha}_{0}],$$

*and*
$\sum ({\beta}_{0},{\alpha}_{0})=\mathit{Var}\{{n}^{-\frac{1}{2}}{U}_{n}({\beta}_{0},\widehat{\alpha});{\beta}_{0},{\alpha}_{0}\}$. *A consistent estimator for* Σ(*β*_{0}, *α*_{0}) *is given by*

$${n}^{-1}\sum {[{R}_{i}{q}_{i}(\widehat{\alpha})g({X}_{i},{Z}_{i};\widehat{\beta})\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};\widehat{\beta})\}+{\mathrm{\Gamma}}_{2}(\widehat{\alpha},\widehat{\beta}){A}_{i}(\widehat{\alpha})]}^{\otimes 2},$$

*where* *A*^{⊗2} = *AA^{T}*,

$${\mathrm{\Gamma}}_{2}({\beta}_{0},{\alpha}_{0})=\underset{n\to \infty}{\mathrm{lim}}{n}^{-1}E[\frac{\partial {U}_{n}(\beta ,\alpha )}{\partial {\alpha}^{T}};{\beta}_{0},{\alpha}_{0}].$$

See Appendix A for proof.

Based on this result, we can establish the asymptotic property of the deletion estimator by expressing ${U}_{n}^{\ast}(\beta ,\widehat{\alpha})={U}_{n}(\beta ,\widehat{\alpha})+\{{U}_{n}^{\ast}(\beta ,\widehat{\alpha})-{U}_{n}(\beta ,\widehat{\alpha})\}$.

**Theorem 2.** ${n}^{\frac{1}{2}}({\widehat{\beta}}^{\ast}-{\beta}_{0})$ *is asymptotically normally distributed with mean 0 and variance*
${\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},{\alpha}_{0})\{\sum ({\beta}_{0},{\alpha}_{0})+\mathrm{\Omega}({\beta}_{0},{\alpha}_{0})\}{\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},{\alpha}_{0})$ *where* Γ_{1}(*β*_{0}, *α*_{0}) *and* Σ(*β*_{0}, *α*_{0}) *are defined in Theorem 1, and*
$\mathrm{\Omega}({\beta}_{0},{\alpha}_{0})=\mathit{Var}[{n}^{-\frac{1}{2}}\{{U}_{n}^{\ast}(\beta ,\widehat{\alpha})-{U}_{n}(\beta ,\widehat{\alpha})\};{\alpha}_{0},{\beta}_{0}]$. Ω(*β*_{0}, *α*_{0}) *can be consistently estimated by*

$${n}^{-1}\sum {q}_{i}(\widehat{\alpha})\{1-{q}_{i}(\widehat{\alpha})\}g({X}_{i},{Z}_{i};\widehat{\beta}\ast ){\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};\widehat{\beta}\ast )\}}^{2}g{({X}_{i},{Z}_{i};\widehat{\beta}\ast )}^{T}.$$

See Appendix B for proof.
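The consistent estimator of Ω displayed above is a weighted outer-product sum and can be computed in a few lines. The array names are mine; this is a sketch of the formula, not code from the paper:

```python
import numpy as np

def omega_hat(q, g, resid):
    """Estimate Omega(beta0, alpha0), the extra variance of the deletion
    estimator due to the randomness of R*.  Inputs (per record i):
      q:     (n,)   retention probabilities q_i(alpha-hat)
      g:     (n, p) values of g(X_i, Z_i; beta*-hat)
      resid: (n,)   residuals Y_i - mu(X_i, Z_i; beta*-hat)
    Returns n^{-1} * sum_i q_i (1 - q_i) resid_i^2 g_i g_i^T."""
    w = q * (1.0 - q) * resid ** 2            # scalar weight per record
    return (g * w[:, None]).T @ g / len(q)    # (p, p) matrix

# Tiny check with one record: q=0.5, g=2, resid=3 gives 0.25 * 9 * 4 = 9.
omega = omega_hat(np.array([0.5]), np.array([[2.0]]), np.array([3.0]))
```

Note that records with *q_i* = 1 (no chance of deletion) contribute nothing, matching the intuition that Ω vanishes when no records are actually deleted.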

Note that Ω(*β*_{0}, *α*_{0}) captures the extra variability of the deletion estimator caused by the randomness of *R** given the observed data. This leads us to consider yet another related estimator. If we were to repeatedly compute the deletion estimates *K* times, we would obtain *K* different deletion estimates, say ${\widehat{\beta}}_{1}^{\ast},\ldots,{\widehat{\beta}}_{K}^{\ast}$, due to the randomness of deleting records. Let
${\overline{\beta}}_{K}={K}^{-1}{\sum}_{k=1}^{K}{\widehat{\beta}}_{k}^{\ast}$ and ${\overline{\beta}}_{\infty}={\mathrm{lim}}_{K\to \infty}{\overline{\beta}}_{K}$. We show in Appendix C that
${n}^{\frac{1}{2}}(\widehat{\beta}-{\overline{\beta}}_{\infty})={o}_{p}(1)$.

We can express *U_{n}*(*β*, *α*) in (3) equivalently as

$${U}_{n}(\beta ,\alpha )=\sum {R}_{i}{\pi}_{i}{({Y}_{i},{Z}_{i};\alpha )}^{-1}h({X}_{i},{Z}_{i};\beta ,\alpha )\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};\beta )\}=0,$$

where *h*(*X_{i}*, *Z_{i}*; *β*, *α*) = *π_{M}*(*Z_{i}*; *α*)*g*(*X_{i}*, *Z_{i}*; *β*) and *π_{M}*(*Z_{i}*; *α*) = *π*(*Y_{i}*, *Z_{i}*; *α*)*q_{i}*(*α*) does not depend on *Y_{i}*. To improve efficiency, following Robins et al. (1994), we subtract the projection of the estimating function onto the nuisance tangent space and consider

$$\begin{array}{ccc}{U}_{n}^{\mathrm{eff}}(\beta ,\alpha ,\varphi )& =& \sum {R}_{i}{\pi}_{i}{({Y}_{i},{Z}_{i};\alpha )}^{-1}h({X}_{i},{Z}_{i};\beta )\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};\beta )\}\\ & & -\sum \pi {({Y}_{i},{Z}_{i};\alpha )}^{-1}\{{R}_{i}-\pi ({Y}_{i},{Z}_{i};\alpha )\}\varphi ({Y}_{i},{Z}_{i}).\end{array}$$

For the second term of
${U}_{n}^{\mathrm{eff}}$ to be a projection, one should find the function *ϕ* that minimizes the norm of
${U}_{n}^{\mathrm{eff}}$. Such a function *ϕ* can be found by decomposing
${U}_{n}^{\mathrm{eff}}(\beta ,\alpha ,\varphi )$ into the following two terms:

$$\begin{array}{c}{U}_{n}^{\mathrm{eff}}(\beta ,\alpha ,\varphi )=\sum h({X}_{i},{Z}_{i};\beta )\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};\beta )\}\\ +\sum \{{R}_{i}-\pi ({Y}_{i},{Z}_{i};\alpha )\}\pi {({Y}_{i},{Z}_{i};\alpha )}^{-1}[h({X}_{i},{Z}_{i};\beta )\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};\beta )\}-\varphi ({Y}_{i},{Z}_{i})]\\ ={C}_{1n}+{C}_{2n}=\sum ({C}_{1n}^{(i)}+{C}_{2n}^{(i)})\end{array}$$

Since *C*_{1n} and *C*_{2n} are uncorrelated,

$$\begin{array}{c}\mathit{Var}({C}_{1n}+{C}_{2n})=\mathit{Var}({C}_{1n})+\mathit{EVar}({C}_{2n}|X,Y,Z)+\mathit{VarE}({C}_{2n}|X,Y,Z)\\ =\mathit{Var}({C}_{1n})+E\frac{\{1-\pi ({Y}_{i},{Z}_{i};\alpha )\}}{\pi ({Y}_{i},{Z}_{i};\alpha )}E{[h({X}_{i},{Z}_{i};\beta )\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};\beta )\}-{\varphi}_{i}]}^{2}.\end{array}$$

The above expression suggests that the minimum can be achieved when *ϕ*(*Y _{i}, Z_{i}*) equals
${\varphi}_{\mathrm{eff}}^{h}({Y}_{i},{Z}_{i})=E[h({X}_{i},{Z}_{i};\beta )\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};\beta )\}|{Y}_{i},{Z}_{i}]$.
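When (*Y,Z*) are both binary, the conditional expectation defining *ϕ*_{eff} can be estimated by cell means of the complete-record scores, which is the saturated-model version used later in the simulations; under MAR given (*Y,Z*), the complete records within each (*Y,Z*) cell are representative of the full data. A sketch (the function name is mine):

```python
import numpy as np

def phi_eff_binary(score, Y, Z, R):
    """Estimate phi_eff(Y, Z) = E[h(X,Z){Y - mu} | Y, Z] for binary (Y, Z)
    by averaging the complete-record scores within each (Y, Z) cell.
      score: (n,) values of h(X_i, Z_i; beta){Y_i - mu_i} (complete records)
      Y, Z:  (n,) binary outcome and observed covariate
      R:     (n,) observation indicators"""
    phi = np.zeros_like(score, dtype=float)
    for y in (0, 1):
        for z in (0, 1):
            cell = (Y == y) & (Z == z) & (R == 1)
            if cell.any():
                phi[(Y == y) & (Z == z)] = score[cell].mean()
    return phi
```

With continuous *Y* or *Z*, the paper instead fits a working regression of the score on *Y* + *Z* + *Y* * *Z*, as described in the simulation section.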

The solution to the equation
${U}_{n}^{\mathrm{eff}}(\beta ,\alpha ,{\varphi}_{\mathrm{eff}}^{h})=0$, say
$\widehat{\beta}({\widehat{\varphi}}_{\mathrm{eff}}^{h})$, is the most efficient estimator given the form of *h*(*X,Z; β*), and its variance can be estimated by
${n}^{-1}{\widehat{\mathrm{\Gamma}}}_{3}^{-1}(\widehat{\beta}({\widehat{\varphi}}_{\mathrm{eff}}^{h}),\widehat{\alpha})\{\sum {({\widehat{C}}_{1n}^{(i)}+{\widehat{C}}_{2n}^{(i)})}^{\otimes 2}\}{\widehat{\mathrm{\Gamma}}}_{3}^{-1}(\widehat{\beta}({\widehat{\varphi}}_{\mathrm{eff}}^{h}),\widehat{\alpha})$, where
${\mathrm{\Gamma}}_{3}({\beta}_{0},{\alpha}_{0})={\mathrm{lim}}_{n\to \infty}{n}^{-1}E\{-\frac{\partial {U}_{n}^{\mathrm{eff}}(\beta ,\alpha ,{\varphi}_{\mathrm{eff}}^{h})}{\partial {\beta}^{T}}|{\beta}_{0},{\alpha}_{0}\}$. When we replace *h*(*X,Z; β*) with *g*(*X,Z; β*), the resulting estimator is an efficient version of the IPW estimator, say
${\widehat{\beta}}_{\mathrm{IPW}}({\widehat{\varphi}}_{\mathrm{eff}}^{g})$. However, neither
$\widehat{\beta}({\widehat{\varphi}}_{\mathrm{eff}}^{h})$ nor
${\widehat{\beta}}_{\mathrm{IPW}}({\widehat{\varphi}}_{\mathrm{eff}}^{g})$ is fully efficient. To obtain a fully efficient estimator, one has to replace *h*(*X,Z; β*) with an arbitrary function of (*X,Z*), say *h**(*X,Z; β*), solve for the *h**(*X,Z*) satisfying integral equation (23) of Robins et al. (1994), and find the corresponding *ϕ*, say
${\varphi}_{\mathrm{eff}}^{h\ast}=E\{{h}^{\ast}(X,Z;\beta )(Y-\mu )|Y,Z\}$. It is hard to analytically compare the efficiencies of the estimators derived from the *g* and *h* functions. Note that *ϕ* is a function of an unknown quantity involving *E*[*h*(*X_{i}*, *Z_{i}*; *β*)|*Y_{i}*, *Z_{i}*] and

**4. Asymptotic Relative Efficiency**

Given that we cannot establish any inequalities between the asymptotic variances of the efficient weighting estimator (WTEF) and the efficient IPW estimator (IPWEF), an important practical question remains as to when to use deletion-family estimators instead of inverse probability estimators. Obviously, when any one *π_{i}* is small, the weights for IPW or IPWEF could be unstable and result in non-convergence or inefficient estimators. This could happen when the overall observation probability is small, when the observation probability depends on a continuous variable whose values can be extreme, or when the effect of the covariate on the observation probability is large.

To systematically investigate the aforementioned conjecture, the asymptotic relative efficiency of WTEF over IPWEF is computed in various situations and summarized in Figure 1. Two columns of graphs show the efficiencies of ${\widehat{\beta}}_{X}$ and

**5. Simulation**

We conducted simulation studies with 500 replications for logistic regression models and classical linear models. Two sample sizes, 2000 and 500, were used, and only the results for *n* = 500 are shown. For all models we generated observation indicators with *P*(*R*=1|*Y,Z*) = logit^{−1}(*α*_{0}+

The performance of eight estimators is reported: from full data (Full), complete records (CR), IPW using the correct missingness model (IPW1), IPW using the overfitted model with the interaction term between *Y* and *Z* (IPW2), the efficient version of IPW (IPWEF), the proposed deletion method (Del), the proposed weighting method (WT), and the efficient version of the proposed weighting method (WTEF). In computing the projection term for IPWEF and WTEF when (*Y,Z*) are all binary, sample means are computed for each of the four categories; when at least one element of (*Y,Z*) is continuous, linear or logistic models are used depending on the type of *X*, with linear predictor *Y* + *Z* + *Y* * *Z*.

For all estimates, the bias of the point estimates, simulation mean square errors, and the average number of records used in each method are shown in Table 1. Table 2 reports the coverage probabilities.

We focus on three comparisons: (i) the CR estimate vs. the deletion estimate (Del); (ii) IPW (IPW1 and IPW2) vs. the proposed weighting (WT); and (iii) IPWEF vs. WTEF. First, under MCAR, note that the biases of the eight estimates are negligible. In logistic models, the deletion estimates (Del) of *β*_{0} and *β_{Z}* are more efficient than those from the complete-record analysis despite using a smaller number of records when

In both logistic and linear models, the proposed weighting estimates (WT) are generally more efficient than the IPW estimates obtained by correctly specifying missingness models (IPW1) or the IPW estimates obtained by specifying overfitted missingness models (IPW2). The efficiency of IPW1 is substantially poorer than that of the WT estimates. The advantage of the proposed weighting estimate (WT) over IPW is apparent when *Z* is continuous. Under this condition, *π*(*Y_{i}*, *Z_{i}*; $\widehat{\alpha}$) is small for extreme values of

The two efficient versions, IPWEF and WTEF, show smaller variances than their original versions, as anticipated. However, note that for the case in which *X* and *Z* are dichotomous in logistic models, and where the auxiliary models are saturated, there is no improvement over the original versions, IPW2 and WT. Between IPWEF and WTEF, the performance is comparable in logistic regression models: IPWEF has a slight advantage in *β_{Z}* and WTEF has advantages in

Coverage probabilities of the eight methods are shown in Table 2. The coverage probabilities for the consistent estimates are reasonable overall, which demonstrates that the asymptotic variances behave well in the situations considered. Under MAR, the CR estimates are biased and the coverage probabilities are far from their nominal value. Although the deletion estimators are consistent, their coverage probabilities fall below the target 95% in linear models, because the number of records used in the analysis is small. Although not shown, our simulations show that the coverage probabilities fall within the 95% confidence intervals of the nominal value when the overall sample size is 2000.

**6. Convergence Issue**

We have shown in Section 4 that the asymptotic relative efficiency of the proposed weighting estimator is generally better than that of the IPW estimators. Another reason to prefer the weighting estimator is that the weighting estimates suffer much less from non-convergence problems than the IPW estimates as the missing proportion increases. For example, under the simulation situation shown in Table 1 (*α*_{0} = −2, *α_{X}* = 0,

**7. Concluding Remarks**

We have proposed three estimators based on the idea of deleting completely observed records. The deletion estimator serves as a conceptual device for the other two estimators, but may not be attractive for practical use due to the randomness of the artificial observation indicator. The weighting and the improved weighting estimators are viable alternatives to the inverse probability weighting estimators. When some of the predicted observation probabilities are small, the proposed weighting estimators suffer much less from non-convergence problems and are more efficient than the inverse probability weighting estimators. While discarding some completely observed data to handle missing data may sound paradoxical, it proves to be effective when it is done in an informative way.

**Appendix A. Proof of Theorem 1**

Since *q*(*α*) is a differentiable function of *α*, we can write

$$\begin{array}{cc}{n}^{-\frac{1}{2}}{U}_{n}(\widehat{\beta},\widehat{\alpha})& ={n}^{-\frac{1}{2}}\{{U}_{n}({\beta}_{0},{\alpha}_{0})+\frac{\partial {U}_{n}(\beta ,\alpha )}{\partial {\beta}^{T}}(\widehat{\beta}-{\beta}_{0})+\frac{\partial {U}_{n}(\beta ,\alpha )}{\partial {\alpha}^{T}}(\widehat{\alpha}-{\alpha}_{0})\}+{o}_{p}(1)\\ & ={n}^{-\frac{1}{2}}\{{U}_{n}({\beta}_{0},{\alpha}_{0})-n{\mathrm{\Gamma}}_{1}({\beta}_{0},{\alpha}_{0})(\widehat{\beta}-{\beta}_{0})\}+{\mathrm{\Gamma}}_{2}({\beta}_{0},{\alpha}_{0})\sum {A}_{i}({\alpha}_{0})+{o}_{p}(1)\end{array}$$

where

$${\mathrm{\Gamma}}_{1}({\beta}_{0},{\alpha}_{0})=\underset{n\to \infty}{\mathrm{lim}}{n}^{-1}E\{-\frac{\partial {U}_{n}(\beta ,\alpha )}{\partial {\beta}^{T}};{\beta}_{0},{\alpha}_{0}\},\phantom{\rule{0.2em}{0ex}}{\mathrm{\Gamma}}_{2}({\beta}_{0},{\alpha}_{0})=\underset{n\to \infty}{\mathrm{lim}}{n}^{-1}E\{\frac{\partial {U}_{n}(\beta ,\alpha )}{\partial {\alpha}^{T}};{\beta}_{0},{\alpha}_{0}\},$$

$$\frac{\partial {U}_{n}(\beta ,\alpha )}{\partial {\alpha}^{T}}=\sum {R}_{i}\frac{\partial {q}_{i}(\alpha )}{\partial {\alpha}^{T}}g({X}_{i},{Z}_{i};\beta )\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};\beta )\},$$

$$\frac{\partial {q}_{i}(\alpha )}{\partial {\alpha}^{T}}=\frac{{\pi}_{M}({Z}_{i};\alpha )}{\pi ({Y}_{i},{Z}_{i};\alpha )}\{\frac{{\pi}_{M}^{\text{'}}({Z}_{i};\alpha )}{{\pi}_{M}({Z}_{i};\alpha )}-\frac{{\pi}^{\text{'}}({Y}_{i},{Z}_{i};\alpha )}{\pi ({Y}_{i},{Z}_{i};\alpha )}\},\phantom{\rule{0.5em}{0ex}}\text{and}$$

where *π′* denotes the derivative of *π* with respect to *α*. Denoting the *i*th contribution to *U_{n}*(*β*, *α*) by ${U}_{n}^{(i)}(\beta ,\alpha )$, we have

$$\begin{array}{cc}{n}^{\frac{1}{2}}(\widehat{\beta}-{\beta}_{0})& ={\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},{\alpha}_{0})\{{n}^{-\frac{1}{2}}{U}_{n}({\beta}_{0},{\alpha}_{0})+{\mathrm{\Gamma}}_{2}({\beta}_{0},{\alpha}_{0})\sum {A}_{i}({\alpha}_{0})\}+{o}_{p}(1)\\ & ={\sum}_{i=1}^{n}{\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},{\alpha}_{0})\{{n}^{-\frac{1}{2}}{U}_{n}^{(i)}({\beta}_{0},{\alpha}_{0})+{\mathrm{\Gamma}}_{2}({\beta}_{0},{\alpha}_{0}){A}_{i}({\alpha}_{0})\}+{o}_{p}(1)\\ & =\sum {H}_{i}({\beta}_{0},{\alpha}_{0})+{o}_{p}(1)\end{array}$$

First, we can show that ${U}_{n}^{(i)}({\beta}_{0},{\alpha}_{0})$ has mean zero:

$$\begin{array}{cc}E\{{E}_{R|Y,{X}_{o},Z}{U}_{n}^{(i)}(\beta ,\alpha )\}& =E[\pi ({Y}_{i},{Z}_{i};\alpha ){q}_{i}(\alpha )g({X}_{i},{Z}_{i};\beta )\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};\beta )\}]\\ & =E[{\pi}_{M}({Z}_{i};\alpha )g({X}_{i},{Z}_{i};\beta )\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};\beta )\}]=0.\end{array}$$

In addition, by the assumption made in Section 2, *A_{i}*(*α*_{0}) has mean zero, and the asymptotic normality follows.

**Appendix B. Proof of Theorem 2**

Using a Taylor expansion, we have ${U}_{n}^{\ast}({\widehat{\beta}}^{\ast})=0={U}_{n}^{\ast}({\beta}_{0})+\frac{\partial {U}_{n}^{\ast}(\beta )}{\partial {\beta}^{T}}({\widehat{\beta}}^{\ast}-{\beta}_{0})+{o}_{p}(1)$. First, it is easy to verify that

$$\underset{n\to \infty}{\mathrm{lim}}E\{{n}^{-1}-\frac{\partial {U}_{n}^{\ast}(\beta )}{\partial {\beta}^{T}};{\beta}_{0},{\alpha}_{0}\}={\mathrm{\Gamma}}_{1}({\beta}_{0},{\alpha}_{0}).$$

Then
${n}^{\frac{1}{2}}({\widehat{\beta}}^{\ast}-{\beta}_{0})={n}^{-\frac{1}{2}}{\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},{\alpha}_{0}){U}_{n}^{\ast}({\beta}_{0})+{o}_{p}(1)$. Rewriting ${U}_{n}^{\ast}({\beta}_{0})$ as ${U}_{n}({\beta}_{0},\widehat{\alpha})+\{{U}_{n}^{\ast}({\beta}_{0})-{U}_{n}({\beta}_{0},\widehat{\alpha})\}$, we have

$$\begin{array}{ll}{n}^{\frac{1}{2}}(\widehat{\beta}\ast -{\beta}_{0})& ={\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},{\alpha}_{0})[{n}^{-\frac{1}{2}}{U}_{n}({\beta}_{0},{\alpha}_{0})+{\mathrm{\Gamma}}_{2}({\beta}_{0},{\alpha}_{0})\sum {A}_{i}({\alpha}_{0})+{n}^{-\frac{1}{2}}\{{U}_{n}^{\ast}({\beta}_{0})-{U}_{n}({\beta}_{0},\widehat{\alpha})\}]+{o}_{p}(1)\\ & ={T}_{1n}+{T}_{2n}+{o}_{p}(1)=\sum _{i=1}^{n}({T}_{1n}^{(i)}+{T}_{2n}^{(i)})+{o}_{p}(1),\end{array}$$

where ${T}_{1n}^{(i)}={n}^{-\frac{1}{2}}{U}_{n}^{(i)}({\beta}_{0},{\alpha}_{0})+{\mathrm{\Gamma}}_{2}{A}_{i}({\alpha}_{0})$, ${T}_{1n}=\sum {T}_{1n}^{(i)}$, ${T}_{2n}^{(i)}={n}^{-\frac{1}{2}}\{{U}_{n}^{\ast (i)}({\beta}_{0})-{U}_{n}^{(i)}({\beta}_{0},\widehat{\alpha})\}={n}^{-\frac{1}{2}}\{{R}_{i}^{\ast}-{q}_{i}(\widehat{\alpha})\}g({X}_{i},{Z}_{i};{\beta}_{0})\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};{\beta}_{0})\}$, and ${T}_{2n}=\sum {T}_{2n}^{(i)}$. Note that ${E}_{R\ast |R,Y,{X}_{o},Z}[\{{R}_{i}^{\ast}-{q}_{i}(\widehat{\alpha})\}|R,Y,{X}_{o},Z;{\alpha}_{0}]=0$, and hence $E({T}_{2n}^{(i)}|R,Y,{X}_{o},Z;{\alpha}_{0})=0$. Furthermore, $E\{\mathit{Cov}({T}_{1n}^{(i)},{T}_{2n}^{(i)}|R,Y,{X}_{\mathrm{o}},Z)\}=0$ and $\mathit{Cov}\{E({T}_{1n}^{(i)}|R,Y,{X}_{\mathrm{o}},Z),E({T}_{2n}^{(j)}|R,Y,{X}_{\mathrm{o}},Z)\}=0$. Therefore $\sum ({T}_{1n}^{(i)}+{T}_{2n}^{(i)})$ represents a sum of independent random vectors with mean 0 and finite variance. We also find that

$$\begin{array}{cc}\mathit{Var}({T}_{1n}+{T}_{2n})& =E\{\mathit{Var}({T}_{1n}+{T}_{2n}|R,Y,{X}_{o},Z)\}+\mathit{Var}\{E({T}_{1n}+{T}_{2n}|R,Y,{X}_{o},Z)\}\\ & =E\{\mathit{Var}({T}_{2n}|R,Y,{X}_{o},Z)\}+\mathit{Var}({T}_{1n}).\end{array}$$

Therefore
${n}^{\frac{1}{2}}({\widehat{\beta}}^{\ast}-{\beta}_{0})$ is asymptotically normally distributed with mean 0 and variance *E*{*Var*(*T*_{2n}|*R, Y, X_{o}, Z*)} + *Var*(*T*_{1n}), where the first term can be consistently estimated by

$${n}^{-1}\sum {q}_{i}(\widehat{\alpha})\{1-{q}_{i}(\widehat{\alpha})\}g({X}_{i},{Z}_{i};{\widehat{\beta}}^{\ast}){\{{Y}_{i}-\mu ({X}_{i},{Z}_{i};{\widehat{\beta}}^{\ast})\}}^{2}g{({X}_{i},{Z}_{i};{\widehat{\beta}}^{\ast})}^{T}.$$

**Appendix C**

The proof is similar to that of Theorem 2 of Reilly and Pepe (1997). Denote the artificial observation indicator for the *i*th unit in the *k*th replication by ${R}_{i,k}^{\ast}$, and write the corresponding estimating function as

$${U}_{n,k}^{\ast}(\beta ,\widehat{\alpha})=\sum _{i=1}^{n}{R}_{i,k}^{\ast}g({X}_{i},{Z}_{i};\beta )({Y}_{i}-{\mu}_{i}).$$

From a Taylor expansion, we have

$$\sqrt{n}\left({\widehat{\beta}}_{k}^{\ast}-{\beta}_{0}\right)={\mathrm{\Gamma}}_{1,k}{({\stackrel{\sim}{\beta}}_{k},\widehat{\alpha})}^{-1}\frac{{U}_{n,k}^{\ast}({\beta}_{0},\widehat{\alpha})}{\sqrt{n}},$$

where ${\stackrel{\sim}{\beta}}_{k}$ lies between ${\widehat{\beta}}_{k}^{\ast}$ and ${\beta}_{0}$, and ${\mathrm{\Gamma}}_{1,k}(\beta ,\widehat{\alpha})=-{n}^{-1}\partial {U}_{n,k}^{\ast}(\beta ,\widehat{\alpha})/\partial {\beta}^{T}$. Averaging over the *K* replications,

$$\begin{array}{ll}\sqrt{n}({\overline{\beta}}_{K}-{\beta}_{0})& =\frac{1}{K}\sum _{k=1}^{K}{\mathrm{\Gamma}}_{1,k}{({\stackrel{\sim}{\beta}}_{k},\widehat{\alpha})}^{-1}\frac{{U}_{n,k}^{\ast}({\beta}_{0},\widehat{\alpha})}{\sqrt{n}}\\ & =\frac{1}{K}\sum _{k=1}^{K}\left\{{\mathrm{\Gamma}}_{1,k}{({\stackrel{\sim}{\beta}}_{k},\widehat{\alpha})}^{-1}-{\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},\widehat{\alpha})\right\}\frac{{U}_{n,k}^{\ast}({\beta}_{0},\widehat{\alpha})}{\sqrt{n}}+\frac{1}{K}\sum _{k=1}^{K}{\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},\widehat{\alpha})\frac{{U}_{n,k}^{\ast}({\beta}_{0},\widehat{\alpha})}{\sqrt{n}}\\ & ={\mathrm{\Phi}}_{1}(K,n)+{\mathrm{\Phi}}_{2}(K,n).\end{array}$$

Given the observed data and *n*, let ${\overline{\beta}}_{\infty}={\mathrm{lim}}_{K\to \infty}{\overline{\beta}}_{K}$. Then

$$\sqrt{n}({\overline{\beta}}_{\infty}-{\beta}_{0})=\underset{K\to \infty}{\mathrm{lim}}{\mathrm{\Phi}}_{1}(K,n)+\underset{K\to \infty}{\mathrm{lim}}{\mathrm{\Phi}}_{2}(K,n)={\mathrm{\Phi}}_{1}^{n}+{\mathrm{\Phi}}_{2}^{n}.$$

First, observe that Φ_{1}^{n} → 0 as *n* → ∞. Then, since
${K}^{-1}{\sum}_{k=1}^{K}{R}_{i,k}^{\ast}\stackrel{P}{\to}{q}_{i}(\widehat{\alpha})$, we have

$$\begin{array}{ll}{\mathrm{\Phi}}_{2}^{n}& =\underset{K\to \infty}{\mathrm{lim}}\frac{1}{K}{\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},\widehat{\alpha})\sum _{k=1}^{K}\frac{{U}_{n,k}^{\ast}({\beta}_{0},\widehat{\alpha})}{\sqrt{n}}=\frac{1}{\sqrt{n}}{\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},\widehat{\alpha})\sum _{i=1}^{n}\underset{K\to \infty}{\mathrm{lim}}\frac{1}{K}\sum _{k=1}^{K}{R}_{i,k}^{\ast}g({X}_{i},{Z}_{i};{\beta}_{0})({Y}_{i}-{\mu}_{i})\\ & =\frac{1}{\sqrt{n}}{\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},\widehat{\alpha})\sum _{i=1}^{n}{q}_{i}(\widehat{\alpha})g({X}_{i},{Z}_{i};{\beta}_{0})({Y}_{i}-{\mu}_{i})={\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},\widehat{\alpha})\frac{{U}_{n}({\beta}_{0},\widehat{\alpha})}{\sqrt{n}}.\end{array}$$

Thus $\sqrt{n}({\overline{\beta}}_{\infty}-{\beta}_{0})={\mathrm{\Gamma}}_{1}^{-1}({\beta}_{0},\widehat{\alpha})\frac{{U}_{n}({\beta}_{0},\widehat{\alpha})}{\sqrt{n}}+{o}_{p}(1)$, and $\sqrt{n}(\widehat{\beta}-{\overline{\beta}}_{\infty})={o}_{p}(1)$.
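The equivalence proved above can be checked numerically in the simplest case, an intercept-only model *μ*(*β*) = *β* with *g* ≡ 1, where both estimating equations solve in closed form. The observation probabilities and the choice *π_M* = 0.6 (the smaller of the two *π* values) are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 200, 5000
Y = rng.normal(1.0, 1.0, n)
pi = np.where(Y > 1.0, 0.9, 0.6)   # P(R = 1 | Y): depends on the outcome
R = rng.binomial(1, pi)
q = 0.6 / pi                        # retention probability, pi_M = 0.6

# Weighting estimate: solves sum_i R_i q_i (Y_i - beta) = 0 in closed form.
beta_wt = (R * q * Y).sum() / (R * q).sum()

# Average of K deletion estimates, each from an artificially thinned sample.
deletion_estimates = np.empty(K)
for k in range(K):
    R_star = R * rng.binomial(1, q)             # artificial indicators
    deletion_estimates[k] = Y[R_star == 1].mean()
beta_bar = deletion_estimates.mean()
# beta_bar approaches beta_wt as K grows, up to a small finite-n gap.
```

The agreement between `beta_bar` and `beta_wt` (to within Monte Carlo and finite-sample error) is exactly the β̄_∞ = β̂ + o_p(1) statement of this appendix, in its simplest closed-form instance.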

**Appendix D**

To proceed, we need notation for the estimating function for *β* without missing data, *S*(*β*) = Σ*S*(*Y_{i}*|*X_{i}*, *Z_{i}*; *β*), and for the estimating function for the nuisance parameter *α*, *U*(*α*). Then

$$\begin{array}{lll}\sum & =& {\sum}_{t}-D\\ & =& \mathrm{E}\{\frac{{\pi}_{M}{(Z)}^{2}}{\pi (Y,Z)}S(Y|X,Z;\beta )S{(Y|X,Z;\beta )}^{T}\}\\ & & -\mathrm{E}\{\mathit{RqS}(Y|X,Z;\beta )U{(\alpha )}^{T}\}{\{\mathrm{E}U(\alpha )U{(\alpha )}^{T}\}}^{-1}\mathrm{E}\{\mathit{RqU}(\alpha ){S(Y|X,Z;\beta )}^{T}\},\end{array}$$

and Ω = E{*Rq*(1 − *q*)*S*(*Y*|*X,Z*; *β*)*S*(*Y*|*X,Z*; *β*)^{T}}.

Under MCAR, *π*(*Y,Z*) = *π*(*Z*) = *π_{M}*(*Z*), so that *q* = 1, and

$${\mathrm{\Gamma}}_{1}=\mathrm{E}\{\pi (Z)S(Y|X,Z;\beta )S{(Y|X,Z;\beta )}^{T}\},$$

$${\sum}_{t}=\mathrm{E}\{\pi (Z)S(Y|X,Z;\beta )S{(Y|X,Z;\beta )}^{T}\}={\mathrm{\Gamma}}_{1},$$

*D* = E{*RS*(*Y*|*X,Z*; *β*)*U*(*α*)^{T}}{E*U*(*α*)*U*(*α*)^{T}}^{−1}E{*RU*(*α*)*S*(*Y*|*X,Z*; *β*)^{T}}, and Ω = 0. Thus under MCAR, the asymptotic variance of the deletion estimator ${\widehat{\beta}}^{\ast}$ is
${\mathrm{\Gamma}}_{1}^{-1}({\mathrm{\Gamma}}_{1}-D){\mathrm{\Gamma}}_{1}^{-1}$. Denoting the CR estimator by ${\widehat{\beta}}_{\mathit{CR}}$, the difference of the asymptotic variances is
$\mathrm{Var}({\widehat{\beta}}_{\mathit{CR}})-\mathrm{Var}({\widehat{\beta}}^{\ast})={\mathrm{\Gamma}}_{1}^{-1}D{\mathrm{\Gamma}}_{1}^{-1}$, which is positive semidefinite. This proves that the CR estimator has asymptotic variance no smaller than that of the deletion estimator.

**Publisher's Disclaimer: **This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

**References**

- Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika. 1988;75:11–20.
- Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions. John Wiley & Sons; New York: 2003.
- Laird NM. Missing data in longitudinal studies. Statistics in Medicine. 1988;7:305–315.
- Lin DY, Robins JM, Wei LJ. Comparing two failure time distributions in the presence of dependent censoring. Biometrika. 1996;83:381–393.
- Lipsitz SR, Ibrahim JG, Fitzmaurice GM. Likelihood methods for incomplete longitudinal binary responses with incomplete categorical covariates. Biometrics. 1999;55:214–223.
- Little R, Rubin D. Statistical Analysis with Missing Data. John Wiley & Sons; New York: 1987.
- Paik MC. Quasi-likelihood regression models with missing covariates. Biometrika. 1996;83:825–834.
- Reilly M, Pepe M. A mean score method for missing and auxiliary covariate data in regression models. Biometrika. 1995;82:299–314.
- Reilly M, Pepe M. The relationship between hot-deck multiple imputation and weighted likelihood. Statistics in Medicine. 1997;16:5–19.
- Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 1994;89:846–866.
- Rubin DB. Inference and missing data. Biometrika. 1976;63:581–592.
- Wang Y-G. Estimating equations with nonignorably missing response data. Biometrics. 1999;55:984–989.
- Zhao L, Lipsitz S. Designs and analysis of two-stage studies. Statistics in Medicine. 1992;11:769–782.
- Zhao LP, Lipsitz SR, Lew D. Regression analysis with missing covariate data using estimating equations. Biometrics. 1996;52:1165–1182.
