J Stat Plan Inference. Author manuscript; available in PMC 2010 July 1.
Published in final edited form as:
J Stat Plan Inference. 2009 July 1; 139(7): 2341–2350.
doi: 10.1016/j.jspi.2008.10.024
PMCID: PMC2674251
NIHMSID: NIHMS85580

HANDLING MISSING DATA BY DELETING COMPLETELY OBSERVED RECORDS

Abstract

When data are missing, analyzing records that are completely observed may cause bias or inefficiency. Existing approaches to handling missing data include likelihood, imputation, and inverse probability weighting. In this paper, we propose three estimators inspired by deleting some completely observed data in the regression setting. First, we generate artificial observation indicators that are independent of the outcome given the observed data and draw inferences conditioning on the artificial observation indicators. Second, we propose a closely related weighting method. The proposed weighting method has more stable weights than those of the inverse probability weighting method (Zhao and Lipsitz, 1992). Third, we improve the efficiency of the proposed weighting estimator by subtracting the projection of the estimating function onto the nuisance tangent space. When data are missing completely at random, we show that the proposed estimators have asymptotic variances smaller than or equal to the variance of the estimator obtained from using completely observed records only. Asymptotic relative efficiency computations and simulation studies indicate that the proposed weighting estimators are more efficient than the inverse probability weighting estimators under a wide range of practical situations, especially when the missingness proportion is large.

1. INTRODUCTION

When data are missing, analyzing only completely observed records can cause bias or inefficiency. One way of handling missing data is to maximize the observed likelihood, obtained by integrating the likelihood for the full data and observation indicators over the missing data (e.g., Little and Rubin, 1987; Laird, 1988). In the non-likelihood framework, approaches such as imputation and inverse probability weighting have been proposed. The imputation method replaces the contribution of the estimating function with missing statistics by its conditional expectation given the observed data (Reilly and Pepe, 1995; Paik, 1996). Inverse probability weighting (Zhao and Lipsitz, 1992) uses only the completely observed records, but weights each record by the inverse of its probability of observation. The two approaches reflect different viewpoints on problems involving missing data: the imputation method fills in the missing data with the most plausible values, while inverse probability weighting blows up the observed records to properly represent the whole data (Fleiss, Levin and Paik, 2003). Correspondingly, these two approaches represent different ways of constructing unbiased estimating functions. Recently, Lipsitz et al. (1999) proposed a combination of the two in which the weights are derived from imputation models. Here we propose a third approach, which deletes some of the completely observed records artificially, thereby motivating a class of estimators. We propose to delete some records so that, after deletion, the observation process is independent of the outcome and the complete-record analysis is therefore valid. The implementation of this approach is simpler than that of existing approaches and can be widely disseminated to users. Intuitively, we discard some observed information to undo the harm caused by the missing mechanism and restore the original structure that existed in the full data. A similar idea was implemented in survival analysis via artificial censoring to fix dependent censoring (Lin, Robins and Wei, 1996).

Specifically, we propose three estimators. We call the first the deletion estimator, directly reflecting the main idea. That is, we create artificial observation indicators so that the artificially created indicator is conditionally independent of the outcome, and analyze only the records that are ‘artificially’ observed. The artificially created observation indicator is a decreasing function of the observation indicator, and the artificially observed records constitute a subset of the completely observed records. The proposed deletion estimators are consistent when data are missing at random. Although counterintuitive, we show that when data are missing completely at random, the deletion estimator has asymptotic variance smaller than or equal to that of the estimator obtained from the complete-record analysis, despite using a smaller number of records. The second proposed estimator involves a weighting method whose weight is the probability of retention in the deletion method. The weights of the proposed method are more stable than those of the inverse probability weighting method, and the resulting estimates are more efficient when the missing proportion is high. We also show that the weighting estimate is the limit of the mean of repeatedly computed deletion estimates as the number of replications approaches infinity. Finally, the third estimator modifies the second using the argument of Robins et al. (1994) to improve efficiency. The efficient version of the proposed weighting estimator is shown to perform better than its inverse probability weighting counterpart in the situations where the proposed weighting estimator performs better than the IPW estimators.

2. MOTIVATION

Let Y denote the outcome and (X,Z) denote covariates. We consider regression settings in which E(Y|X,Z), with a known parametric form, is the quantity of interest. The outcome Y and the covariate Z are completely observed, while the covariate X may be missing. Let R be the observation indicator for X. Throughout the paper, we assume that the data are missing at random (Rubin, 1976), i.e., the observation probability does not depend on the missing variable X itself:

$P(R\mid X,Y,Z)=P(R\mid Y,Z). \qquad (1)$

Our motivation for the proposed method starts from the observation that if R additionally satisfies the condition $R\perp Y\mid Z$, then $E(Y\mid X,Z,R=1)=E(Y\mid X,Z)$, and we can consistently estimate $E(Y\mid X,Z)$ simply by conducting the Complete-Record (CR) analysis. In practice, however, we cannot control R and cannot force it to satisfy this additional condition. A key idea of the proposed method, therefore, is to generate an artificial variable, say $R^*$, such that $R^*\perp Y\mid Z$, and to estimate $E(Y\mid X,Z)$ using only the records whose artificial variable $R^*$ equals 1. This approach is simple and can be handled by most standard software. The problem then boils down to how to generate such an $R^*$.

For a concrete example, consider the case in which Y and Z are both binary. For brevity, denote $P(R_i=1\mid Y_i,Z_i)$ and $P(R_i^*=1\mid Y_i,Z_i)$ by $\pi(Y_i,Z_i)$ and $\pi^*(Y_i,Z_i)$, respectively. To satisfy $R^*\perp Y\mid Z$, we need to generate $R^*$ so that $\pi^*(1,Z_i)=\pi^*(0,Z_i)$, given the sample proportions of observation $\hat\pi(1,Z_i)$ and $\hat\pi(0,Z_i)$. Our strategy is to obtain $\pi^*$ by modifying $\hat\pi$. First, note that we can pretend that some observed data are missing, but cannot pretend that missing data are observed. This fact implies that $R^*$ should be bounded by R, i.e., $R_i^*\le R_i$. That is, if $R_i=1$ we may set $R_i^*=0$, but if $R_i=0$ we must set $R_i^*=0$. To achieve $\pi^*(1,Z_i)=\pi^*(0,Z_i)$ under the constraint $\pi^*(Y,Z)\le\hat\pi(Y,Z)$, we can ‘down-adjust’ as follows. If $\hat\pi(1,Z)$ is greater than $\hat\pi(0,Z)$, we pick the smaller proportion $\hat\pi(0,Z)$ and set it equal to both $\pi^*(1,Z)$ and $\pi^*(0,Z)$. This is ‘down-adjusting’ since we equalize the two target probabilities $\pi^*(0,Z)$ and $\pi^*(1,Z)$ by setting them equal to the smaller of $\hat\pi(0,Z)$ and $\hat\pi(1,Z)$. In notation, $\pi^*(Y,Z)=\min_y\hat\pi(y,Z)$. In this example, for Y = 0 there is no difference between the artificial and the original observation status, i.e., $R_i^*=R_i$ with probability 1. For Y = 1, we enforce $\pi^*(1,Z)=\pi^*(0,Z)=\hat\pi(0,Z)$. This relationship can be implemented by generating a binary $R^*$ whose probability of being 1, given R = 1, is $\hat\pi(0,Z)/\hat\pi(1,Z)$; the denominator nullifies the observation process for Y = 1, while the numerator imposes the observation process for Y = 0. In summary, given R = 1, $R^*$ takes value 1 with probability $\{\min_{y\in\{0,1\}}\hat\pi(y,Z)\}/\hat\pi(Y,Z)$. For non-binary Y and Z, we can use the same ‘down-adjusting’ strategy, generating a binary $R^*$ with probability of being 1 equal to $\{\min_{y\in(y_1,\dots,y_n)}\hat\pi(y,Z)\}/\hat\pi(Y,Z)$.
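To make the down-adjusting rule concrete, the following sketch (illustrative code, not from the paper; the data-generating model and all variable names are our own assumptions) generates $R^*$ for binary Y and Z from the cell-wise sample observation proportions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Toy data with binary Z and Y; R is the observation indicator for X,
# with P(R = 1 | Y, Z) depending on (Y, Z) only (missing at random).
Z = rng.binomial(1, 0.5, n)
Y = rng.binomial(1, 0.3 + 0.3 * Z)
R = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 1.0 * Y + 0.5 * Z))))

# Sample observation proportions pi-hat(y, z) in each (Y, Z) cell.
pi_hat = {(y, z): R[(Y == y) & (Z == z)].mean() for y in (0, 1) for z in (0, 1)}

# Down-adjust: pi*(y, z) = min_y pi-hat(y, z); an observed record survives
# with probability q = min_y pi-hat(y, Z) / pi-hat(Y, Z), so that R* <= R.
pi_min = {z: min(pi_hat[0, z], pi_hat[1, z]) for z in (0, 1)}
q = np.array([pi_min[z] / pi_hat[y, z] for y, z in zip(Y, Z)])
R_star = R * rng.binomial(1, q)

# Check: within each Z stratum, P(R* = 1 | Y, Z) is now roughly equal in Y.
for z in (0, 1):
    print(z, [round(R_star[(Y == y) & (Z == z)].mean(), 3) for y in (0, 1)])
```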

In Section 3, we propose three estimators motivated by the idea of deleting. The first is the usual estimator applied to the records with $R^*=1$ only; we call this the deletion estimator. The second is the weighting estimator that uses the records with R = 1 only, with weight $P(R_i^*=1\mid Y_i,Z_i,R_i=1)$. The third improves on the second by subtracting the projection of the estimating function onto the nuisance tangent space.

3. THREE PROPOSED ESTIMATORS

3.1 Deletion estimator

To formalize the idea presented in Section 2, suppose that $P(R=1\mid Y,Z)$ is a known function indexed by an unknown parameter α, denoted by $\pi(Y,Z;\alpha)>0$, and that π is a differentiable function of α. Furthermore, under condition (1), suppose there exists a consistent and asymptotically normally distributed estimator $\hat\alpha$, which can be expressed as $n^{1/2}(\hat\alpha-\alpha_0)=\sum_iA_i(\alpha_0)+o_p(1)$, where $\alpha_0$ is the true value of α and the $A_i(\alpha_0)$'s are independently distributed with $EA_i(\alpha_0)=0$. Using the down-adjusting idea, we set $P(R^*=1\mid Y,Z,R=1)=q(\alpha)=\pi_M(Z;\alpha)/\pi(Y,Z;\alpha)$, where $\pi_M(Z_i;\alpha)=\min_{y\in(y_1,\dots,y_n)}\pi(y,Z_i;\alpha)$. Generating $R_i^*$ so that $R_i^*=1$ with probability $q_i(\hat\alpha)$, the proposed estimator is the one obtained using the records with $R_i^*=1$ only, i.e., the solution of the following estimating equation:

$U_n^*(\beta)=\sum_iR_i^*g(X_i,Z_i;\beta)\{Y_i-\mu(X_i,Z_i;\beta)\}=0. \qquad (2)$

We show in the next section that the estimating function (2) has zero expectation and that its solution $\hat\beta^*$ is consistent for the true value $\beta_0$ and asymptotically normally distributed.

The construction of the deletion estimator is motivated by fixing the bias, but could potentially cause a loss of efficiency. However, it turns out that the estimator in fact gains efficiency over the CR estimator when data are missing completely at random. In Appendix D we show that the asymptotic variance of the deletion estimator is smaller than or equal to that of the CR estimator when data are missing completely at random. This may initially sound counterintuitive, since $R^*\le R$ and the complete-record analysis utilizes more records than the deletion method. However, the number of deleted records is small when data are missing completely at random, and furthermore the deletion is executed effectively using information from the observed data, which leads to a more efficient estimator.
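As an illustration, a minimal sketch of the deletion estimator for a binary outcome follows. It assumes a logistic observation model fit with statsmodels; the helper and all names are hypothetical, not taken from the paper:

```python
import numpy as np
import statsmodels.api as sm

def _design(u, v):
    # Intercept plus two regressor columns.
    return np.column_stack([np.ones(len(u)), u, v])

def deletion_estimate(Y, X, Z, R, rng):
    """One draw of the deletion estimator, assuming binary Y and a logistic
    observation model pi(Y, Z; alpha) = logit^{-1}(a0 + aY*Y + aZ*Z)."""
    # Step 1: estimate the observation model by regressing R on (Y, Z).
    obs_fit = sm.GLM(R, _design(Y, Z), family=sm.families.Binomial()).fit()
    pi_hat = obs_fit.predict(_design(Y, Z))

    # Step 2: pi_M(Z) = min_y pi(y, Z); with binary Y, evaluate the fitted
    # model at y = 0 and y = 1 and take the pointwise minimum.
    pi0 = obs_fit.predict(_design(np.zeros_like(Y), Z))
    pi1 = obs_fit.predict(_design(np.ones_like(Y), Z))
    q = np.minimum(pi0, pi1) / pi_hat          # retention probability q_i

    # Step 3: generate R* <= R, then run the complete-record analysis on
    # the artificially retained records only (equation (2)).
    R_star = R * rng.binomial(1, np.clip(q, 0.0, 1.0))
    keep = R_star == 1
    return sm.GLM(Y[keep], _design(X[keep], Z[keep]),
                  family=sm.families.Binomial()).fit().params
```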

3.2 Weighting estimator and its relationship with the deletion estimator

A practical weakness of the deletion method is that the randomness of $R^*$ produces different estimates each time given the same observed data. To improve upon this feature, one can contemplate the estimating function $E\{U_n^*(\beta)\mid R,X_o,Y,Z\}$, where $X_o$ is the observed part of X. The conditional expectation replaces $R_i^*$ in the estimating function with its conditional mean $q_i(\hat\alpha)$ given the observed data, as follows:

$E\{U_n^*(\beta)\mid R,X_o,Y,Z\}=U_n(\beta,\hat\alpha)=\sum_iR_iq_i(\hat\alpha)g(X_i,Z_i;\beta)\{Y_i-\mu(X_i,Z_i;\beta)\}=0. \qquad (3)$

Equation (3) is the proposed weighting estimating equation, and its solution is the proposed weighting estimator. If the weight $q_i(\hat\alpha)$ is replaced by $1/\pi(Y_i,Z_i;\hat\alpha)$, the equation becomes the Inverse Probability Weighting (IPW) equation (Zhao and Lipsitz, 1992; Robins, Rotnitzky and Zhao, 1994). A clear advantage of the proposed weighting method over IPW is that the weight $q_i(\hat\alpha)$ is bounded between 0 and 1, while the IPW weight is not. A single IPW weight can dominate, since $\pi(Y_i,Z_i;\hat\alpha)$ can be small for an outlying covariate value $Z_i$ even when the overall observation probability is moderately large.
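For comparison, a sketch of the proposed weighting fit, reusing the hypothetical helpers above (illustrative code; the model-based standard errors from this shortcut do not equal the sandwich variance of Theorem 1 below):

```python
def weighting_estimate(Y, X, Z, R):
    """Proposed weighting estimator: weight each complete record by
    q_i in (0, 1], in contrast to the IPW weight 1 / pi_i (a sketch)."""
    obs_fit = sm.GLM(R, _design(Y, Z), family=sm.families.Binomial()).fit()
    pi_hat = obs_fit.predict(_design(Y, Z))
    pi0 = obs_fit.predict(_design(np.zeros_like(Y), Z))
    pi1 = obs_fit.predict(_design(np.ones_like(Y), Z))
    q = np.minimum(pi0, pi1) / pi_hat

    keep = R == 1
    # freq_weights multiplies each record's score contribution, which
    # reproduces the weighted estimating equation (3); an IPW fit would
    # use freq_weights = 1 / pi_hat[keep] instead, which is unbounded.
    return sm.GLM(Y[keep], _design(X[keep], Z[keep]),
                  family=sm.families.Binomial(),
                  freq_weights=q[keep]).fit().params
```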

We can show that the estimating function (3) has zero expectation and the resulting estimator is consistent and asymptotically normally distributed.

Theorem 1

$n^{1/2}(\hat\beta-\beta_0)$ is asymptotically normally distributed with mean 0 and variance $\Gamma_1^{-1}(\beta_0,\alpha_0)\Sigma(\beta_0,\alpha_0)\Gamma_1^{-1}(\beta_0,\alpha_0)$, where

$\Gamma_1(\beta_0,\alpha_0)=\lim_{n\to\infty}n^{-1}E\left[-\frac{\partial U_n(\beta,\alpha)}{\partial\beta^T};\beta_0,\alpha_0\right],$

and $\Sigma(\beta_0,\alpha_0)=\operatorname{Var}\{n^{-1/2}U_n(\beta_0,\hat\alpha);\beta_0,\alpha_0\}$. A consistent estimator for $\Sigma(\beta_0,\alpha_0)$ is given by

$n^{-1}\sum_i\left[R_iq_i(\hat\alpha)g(X_i,Z_i;\hat\beta)\{Y_i-\mu(X_i,Z_i;\hat\beta)\}+\Gamma_2(\hat\beta,\hat\alpha)A_i(\hat\alpha)\right]^{\otimes 2},$

where $A^{\otimes 2}=AA^T$ and

$\Gamma_2(\beta_0,\alpha_0)=\lim_{n\to\infty}n^{-1}E\left[\frac{\partial U_n(\beta,\alpha)}{\partial\alpha^T};\beta_0,\alpha_0\right].$

See Appendix A for proof.

Based on this result, we can establish the asymptotic properties of the deletion estimator by expressing $U_n^*(\beta)=U_n(\beta,\hat\alpha)+\{U_n^*(\beta)-U_n(\beta,\hat\alpha)\}$.

Theorem 2

$n^{1/2}(\hat\beta^*-\beta_0)$ is asymptotically normally distributed with mean 0 and variance $\Gamma_1^{-1}(\beta_0,\alpha_0)\{\Sigma(\beta_0,\alpha_0)+\Omega(\beta_0,\alpha_0)\}\Gamma_1^{-1}(\beta_0,\alpha_0)$, where $\Gamma_1(\beta_0,\alpha_0)$ and $\Sigma(\beta_0,\alpha_0)$ are defined in Theorem 1, and $\Omega(\beta_0,\alpha_0)=\operatorname{Var}[n^{-1/2}\{U_n^*(\beta_0)-U_n(\beta_0,\hat\alpha)\};\alpha_0,\beta_0]$. $\Omega(\beta_0,\alpha_0)$ can be consistently estimated by

$n^{-1}\sum_iq_i(\hat\alpha)\{1-q_i(\hat\alpha)\}g(X_i,Z_i;\hat\beta)\{Y_i-\mu(X_i,Z_i;\hat\beta)\}^2g(X_i,Z_i;\hat\beta)^T.$

See Appendix B for proof.

Note that $\Omega(\beta_0,\alpha_0)$ captures the extra variability of the deletion estimator caused by the randomness of $R^*$ given the observed data. This leads us to consider yet another related estimator. If we were to compute the deletion estimate K times, we would obtain K different deletion estimates, say $\hat\beta_1^*,\dots,\hat\beta_K^*$, due to the randomness of deleting records. Let $\bar\beta_K=K^{-1}\sum_{k=1}^K\hat\beta_k^*$. We show in Appendix C that $n^{1/2}(\hat\beta-\bar\beta)=o_p(1)$ as n goes to infinity, where $\bar\beta=\lim_{K\to\infty}\bar\beta_K$.
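This limiting relationship is easy to check numerically; a sketch continuing the hypothetical helpers above, with data arrays Y, X, Z, R assumed given:

```python
rng = np.random.default_rng(1)
K = 500

# Average K independent deletion estimates; by the argument of Appendix C
# this approaches the single weighting estimate as K grows.
draws = np.array([deletion_estimate(Y, X, Z, R, rng) for _ in range(K)])
print("mean of K deletion estimates:", draws.mean(axis=0).round(3))
print("weighting estimate:          ", weighting_estimate(Y, X, Z, R).round(3))
```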

3.3 Efficient weighting estimator

We can express $U_n(\beta,\alpha)$ in a form resembling the class of estimating equations considered by Robins et al. (1994):

$U_n(\beta,\alpha)=\sum_iR_i\pi(Y_i,Z_i;\alpha)^{-1}h(X_i,Z_i;\beta,\alpha)\{Y_i-\mu(X_i,Z_i;\beta)\}=0,$

where $h(X_i,Z_i;\beta,\alpha)=\pi_M(Z_i;\alpha)g(X_i,Z_i;\beta)$. Using the argument of Robins et al. (1994), we can improve the efficiency of the estimator by subtracting the projection of the estimating function onto the nuisance tangent space, the closed linear span of all fixed multiples of the nuisance score:

$U_n^{eff}(\beta,\alpha,\phi)=\sum_iR_i\pi(Y_i,Z_i;\alpha)^{-1}h(X_i,Z_i;\beta)\{Y_i-\mu(X_i,Z_i;\beta)\}-\sum_i\pi(Y_i,Z_i;\alpha)^{-1}\{R_i-\pi(Y_i,Z_i;\alpha)\}\phi(Y_i,Z_i).$

For the second term of $U_n^{eff}$ to be a projection, one should find the function φ that minimizes the norm of $U_n^{eff}$. Such a function φ can be found by decomposing $U_n^{eff}(\beta,\alpha,\phi)$ into two terms:

$U_n^{eff}(\beta,\alpha,\phi)=\sum_ih(X_i,Z_i;\beta)\{Y_i-\mu(X_i,Z_i;\beta)\}+\sum_i\{R_i-\pi(Y_i,Z_i;\alpha)\}\pi(Y_i,Z_i;\alpha)^{-1}\left[h(X_i,Z_i;\beta)\{Y_i-\mu(X_i,Z_i;\beta)\}-\phi(Y_i,Z_i)\right]=C_{1n}+C_{2n}=\sum_i\left(C_{1n}^{(i)}+C_{2n}^{(i)}\right).$

Since $C_{1n}$ and $C_{2n}$ are uncorrelated,

$\operatorname{Var}(C_{1n}+C_{2n})=\operatorname{Var}(C_{1n})+E\operatorname{Var}(C_{2n}\mid X,Y,Z)+\operatorname{Var}E(C_{2n}\mid X,Y,Z)=\operatorname{Var}(C_{1n})+\sum_iE\left[\frac{1-\pi(Y_i,Z_i;\alpha)}{\pi(Y_i,Z_i;\alpha)}\left\{h(X_i,Z_i;\beta)(Y_i-\mu(X_i,Z_i;\beta))-\phi_i\right\}^{\otimes 2}\right].$

The above expression suggests that the minimum is achieved when $\phi(Y_i,Z_i)$ equals $\phi^{eff,h}(Y_i,Z_i)=E[h(X_i,Z_i;\beta)\{Y_i-\mu(X_i,Z_i;\beta)\}\mid Y_i,Z_i]$.

The solution to the equation $U_n^{eff}(\beta,\alpha,\phi^{eff,h})=0$, say $\hat\beta(\hat\phi^{eff,h})$, is the most efficient estimator given the form of $h(X,Z;\beta)$, and its variance can be estimated by $n^{-1}\hat\Gamma_3^{-1}(\hat\beta(\hat\phi^{eff,h}),\hat\alpha)\{\sum_i(\hat C_{1n}^{(i)}+\hat C_{2n}^{(i)})^{\otimes 2}\}\hat\Gamma_3^{-1}(\hat\beta(\hat\phi^{eff,h}),\hat\alpha)$, where $\Gamma_3(\beta_0,\alpha_0)=\lim_{n\to\infty}n^{-1}E\{-\partial U_n^{eff}(\beta,\alpha,\phi^{eff,h})/\partial\beta^T\mid\beta_0,\alpha_0\}$. When we replace $h(X,Z;\beta)$ with $g(X,Z;\beta)$, the resulting estimator is an efficient version of the IPW estimator, say $\hat\beta_{IPW}(\hat\phi^{eff,g})$. However, neither $\hat\beta(\hat\phi^{eff,h})$ nor $\hat\beta_{IPW}(\hat\phi^{eff,g})$ is fully efficient. To obtain a fully efficient estimator, one has to replace $h(X,Z;\beta)$ with an arbitrary function of (X,Z), say $h^*(X,Z;\beta)$, solve for the $h^*(X,Z)$ satisfying integral equation (23) of Robins et al. (1994), and use its corresponding φ, say $\phi^{eff,h^*}=E\{h^*(X,Z;\beta)(Y-\mu)\mid Y,Z\}$. It is hard to analytically compare the efficiencies of the estimators derived from the g and h functions. Note that φ is a function of unknown quantities involving $E[h(X_i,Z_i;\beta)\mid Y_i,Z_i]$ and $E[h(X_i,Z_i;\beta)\mu(X_i,Z_i;\beta)\mid Y_i,Z_i]$. In a GLM with canonical link function, φ is a function of $E(X_i\mid Y_i,Z_i)$, $E[\mu(X_i,Z_i)\mid Y_i,Z_i]$, and $E[X_i\mu(X_i,Z_i)\mid Y_i,Z_i]$. We suggest estimating the unknown quantities in φ by parametric modelling, as suggested by Zhao, Lipsitz and Lew (1996).
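One way to implement the projection term is sketched below (illustrative code; `pi_hat` and `pi_M` are the fitted observation probabilities and their minima over y, as in the earlier sketches). The φ regression uses complete records, which is justified under MAR since $E(\cdot\mid Y,Z)=E(\cdot\mid Y,Z,R=1)$, and the linear predictor Y + Z + Y·Z mirrors the auxiliary models used in Section 5:

```python
from scipy.optimize import root

def u_eff(beta, Y, X, Z, R, pi_hat, pi_M):
    """Efficient weighting estimating function U_n^eff for the logistic
    outcome model with h = pi_M(Z) * g (a sketch; names are illustrative)."""
    n = len(Y)
    Xf = np.where(R == 1, X, 0.0)   # fill-in never enters: see comments below
    mu = 1 / (1 + np.exp(-(beta[0] + beta[1] * Xf + beta[2] * Z)))
    g = np.column_stack([np.ones(n), Xf, Z])     # canonical-link g(X, Z)
    h_resid = pi_M[:, None] * g * (Y - mu)[:, None]

    # phi-hat(Y, Z) ~ E[h (Y - mu) | Y, Z]: componentwise OLS of h_resid on
    # (1, Y, Z, Y*Z) among complete records, then predicted for everyone.
    B = np.column_stack([np.ones(n), Y, Z, Y * Z])
    cc = R == 1
    phi = np.column_stack([sm.OLS(h_resid[cc, j], B[cc]).fit().predict(B)
                           for j in range(g.shape[1])])

    # Records with missing X get zero weight in the first term (R = 0),
    # and the second term depends on X only through phi(Y, Z).
    term1 = (R / pi_hat)[:, None] * h_resid
    term2 = ((R - pi_hat) / pi_hat)[:, None] * phi
    return (term1 - term2).sum(axis=0)

# Solve U_n^eff(beta) = 0, e.g. starting from the weighting estimate:
# beta_eff = root(u_eff, np.zeros(3), args=(Y, X, Z, R, pi_hat, pi_M)).x
```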

4. ASYMPTOTIC RELATIVE EFFICIENCY

Given that we cannot establish an inequality between the asymptotic variances of WTEF and IPWEF, an important practical question remains as to when to use the deletion-family estimators instead of the inverse probability weighting estimators. Obviously, when any single $\pi_i$ is small, the weights for IPW or IPWEF can be unstable and result in non-convergence or inefficient estimates. This can happen when the overall observation probability is small, when the observation probability depends on a continuous variable whose values can be extreme, or when the effect of the covariate on the observation probability is large.

To investigate the aforementioned conjecture systematically, the asymptotic relative efficiency of WTEF over IPWEF was computed in various situations and is summarized in Figure 1. Two columns of graphs show the efficiencies of $\hat\beta_X$ and $\hat\beta_Z$, respectively, for binary X and Z. The pattern for the asymptotic relative efficiency of $\hat\beta_0$ is similar to that of $\hat\beta_X$. Within each graph, five curves are shown for different values of $\alpha_Y$. Note that $\alpha_Y=0$ corresponds to missing completely at random, and a large value of $|\alpha_Y|$ corresponds to a severe departure from the missing-completely-at-random case. Two rows display the efficiencies under different overall missing proportions, controlled by $\alpha_0=-1$ and $\alpha_0=1$, respectively. Note that large negative values of $\alpha_0$ and $\alpha_Z$ correspond to a large overall proportion of missingness. The graphs show that in most cases the efficiency of WTEF over IPWEF increases for large negative values of $\alpha_Z$. This confirms our conjecture that the proposed estimator performs better than IPWEF when $\pi(Y_i,Z_i)$ is small. We also see that the efficiency of WTEF increases as $\alpha_0$ decreases. The graphs in the second column feature situations in which $\hat\beta_Z$ of IPWEF is more efficient than that of WTEF: when both $\alpha_Z$ and $\alpha_Y$ are negative, IPWEF has a smaller asymptotic variance than the proposed estimator. But even in the worst case, the relative efficiency is around 0.8.

Figure 1
Asymptotic Relative Efficiency of Weighting estimators vs. IPW estimators

5. SIMULATION

We conducted simulation studies with 500 replications for logistic regression models and classical linear models. Two sample sizes, 2000 and 500, were used, and only the results for n = 500 are shown. For all models we generated observation indicators with $P(R=1\mid Y,Z)=\operatorname{logit}^{-1}(\alpha_0+\alpha_YY+\alpha_ZZ)$. For the case in which data are missing completely at random (MCAR), we set $\alpha=(-2,0,1)$ for the logistic model and $\alpha=(-1.5,0,1)$ for the linear model. For the MAR case, we set $\alpha=(-2,1,1)$ for the logistic model and $\alpha=(-1.5,0.3,1)$ for the linear model. We generated two types of (X,Z): first as standard normal variables and second as binary indicators. In all cases X and Z were generated independently. Given X and Z, Y was generated as a binary indicator with probability $\mu(X,Z;\beta)=P(Y=1\mid X,Z)=\operatorname{logit}^{-1}(X+Z)$ for the logistic models, and as a normal variable with mean X + Z and variance 1 for the linear models.
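The simulation design can be reproduced along the following lines (a sketch of the MAR logistic scenario with continuous covariates; the seed and implementation details are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(2009)
n = 500

# Independent covariates; the binary-covariate scenarios replace these
# with Bernoulli indicators.
X = rng.standard_normal(n)
Z = rng.standard_normal(n)

# Outcome model: mu(X, Z; beta) = logit^{-1}(X + Z).
Y = rng.binomial(1, 1 / (1 + np.exp(-(X + Z))))

# MAR observation indicator for X with (a0, aY, aZ) = (-2, 1, 1):
# P(R = 1 | Y, Z) = logit^{-1}(a0 + aY*Y + aZ*Z).
a0, aY, aZ = -2.0, 1.0, 1.0
R = rng.binomial(1, 1 / (1 + np.exp(-(a0 + aY * Y + aZ * Z))))
X_obs = np.where(R == 1, X, np.nan)   # X is missing whenever R = 0
```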

We report the performance of eight estimators: the estimator from the full data (Full), the complete-record estimator (CR), IPW using the correctly specified missingness model (IPW1), IPW using an overfitted model that adds the interaction between Y and Z (IPW2), the efficient version of IPW (IPWEF), the proposed deletion method (Del), the proposed weighting method (WT), and the efficient version of the proposed weighting method (WTEF). In computing the projection term for IPWEF and WTEF, when Y and Z are both binary, sample means are computed within each of the four (Y,Z) categories; when at least one element of (Y,Z) is continuous, linear or logistic models (depending on the type of X) are used, with linear predictor Y + Z + Y * Z.

For all estimators, Table 1 shows the bias of the point estimates, the simulation mean squared errors, and the average number of records used by each method. Table 2 reports the coverage probabilities.

Table 1
Simulation results based on 500 replications for eight estimators
Table 2
Coverage probabilities based on 500 replications for eight estimators

We focus on three comparisons: (i) the CR estimate vs. the deletion estimate (Del); (ii) IPW (IPW1 and IPW2) vs. the proposed weighting (WT); and (iii) IPWEF vs. WTEF. First, under MCAR, the biases of the eight estimates are negligible. In the logistic models, the deletion estimates (Del) of $\beta_0$ and $\beta_Z$ are more efficient than those from the complete-record analysis, despite using a smaller number of records, whether Z is continuous or dichotomous. This simulation result with a sample size of 500 agrees with the asymptotic result stated in Section 3.1 that the variances of the deletion estimators are smaller than or equal to those of the CR estimators. On the other hand, the deletion estimate of $\beta_X$ is slightly less efficient than the complete-record estimate of $\beta_X$. In the linear models, the efficiency gain of the deletion method under MCAR is much smaller than in the logistic models, because the minimum of $\pi(Y,Z;\alpha)$ is taken over the entire range of Y and thus a higher proportion of records is deleted. The deletion method provides valid estimates in both models under MAR, while the CR analysis is biased.

In both logistic and linear models, the proposed weighting estimates (WT) are generally more efficient than the IPW estimates obtained by correctly specifying the missingness model (IPW1) or by specifying an overfitted missingness model (IPW2). The efficiency of IPW1 is substantially lower than that of the WT estimates. The advantage of the proposed weighting estimate (WT) over IPW is most apparent when Z is continuous: in that case, $\pi(Y_i,Z_i;\hat\alpha)$ is small for extreme values of $Z_i$, and the inverse probability becomes large and unstable. The efficiency gain of the proposed weighting method (WT) is most notable in linear regression with continuous Z.

The two efficient versions, IPWEF and WTEF, show smaller variances than their original versions, as anticipated. However, for the case in which X and Z are dichotomous in the logistic models, where the auxiliary models are saturated, there is no improvement over the original versions, IPW2 and WT. Between IPWEF and WTEF, the performance is comparable in the logistic regression models: IPWEF has a slight advantage in $\beta_Z$ and WTEF has an advantage in $\beta_X$. In the linear models with continuous Z, the WTEF estimates have substantially smaller variances than the IPWEF estimates.

Coverage probabilities of the eight methods are shown in Table 2. The coverage probabilities for the consistent estimators are reasonable overall, which demonstrates that the asymptotic variances behave well in the situations considered. Under MAR, the CR estimates are biased and their coverage probabilities are far from the nominal value. Although the deletion estimators are consistent, their coverage probabilities fall below the 95% target in the linear models, because the number of records used in the analysis is small. Although not shown, our simulations show that the coverage probabilities fall within the 95% confidence intervals of the nominal value when the overall sample size is 2000.

6. CONVERGENCE ISSUE

We showed in Section 4 that the asymptotic relative efficiency of the proposed weighting estimator is generally better than that of the IPW estimators. Another reason to prefer the weighting estimator is that the weighting estimates suffer far less from non-convergence than the IPW estimates as the missing proportion increases. For example, under the simulation setting shown in Table 1 ($\alpha_0=-2$, $\alpha_Y=0$, $\alpha_Z=1$), convergence problems did not occur. However, when $\alpha_Z$ was changed to −1, IPW2 failed to converge in 42 out of 500 simulation runs, whereas the weighting estimates converged in all 500. Furthermore, when $\alpha_Z$ was changed to −1.5, IPW2 failed to converge in 68 runs out of 500, while the weighting estimates failed in only 6. A similar trend is observed when $\alpha_0$ or $\alpha_Y$ is decreased.

7. CONCLUDING REMARKS

We have proposed three estimators based on the idea of deleting completely observed records. The deletion estimator serves as a conceptual device for the other two estimators, but may not be attractive in practice due to the randomness of the artificial observation indicator. The weighting and the improved weighting estimators are viable alternatives to the inverse probability weighting estimators. When some of the predicted observation probabilities are small, the proposed weighting estimators suffer much less from non-convergence and are more efficient than the inverse probability weighting estimators. While discarding some completely observed data to handle missing data may sound paradoxical, it proves effective when it is done in an informative way.

APPENDIX A: Asymptotic distribution of $\hat\beta$

Since q(α) is a differentiable function of α, we can write

$n^{-1/2}U_n(\hat\beta,\hat\alpha)=n^{-1/2}\left\{U_n(\beta_0,\alpha_0)+\frac{\partial U_n(\beta,\alpha)}{\partial\beta^T}(\hat\beta-\beta_0)+\frac{\partial U_n(\beta,\alpha)}{\partial\alpha^T}(\hat\alpha-\alpha_0)\right\}+o_p(1)=n^{-1/2}\{U_n(\beta_0,\alpha_0)-n\Gamma_1(\beta_0,\alpha_0)(\hat\beta-\beta_0)\}+\Gamma_2(\beta_0,\alpha_0)\sum_iA_i(\alpha_0)+o_p(1),$

where

$\Gamma_1(\beta_0,\alpha_0)=\lim_{n\to\infty}E\left\{-n^{-1}\frac{\partial U_n(\beta,\alpha)}{\partial\beta^T};\beta_0,\alpha_0\right\},\qquad\Gamma_2(\beta_0,\alpha_0)=\lim_{n\to\infty}E\left\{n^{-1}\frac{\partial U_n(\beta,\alpha)}{\partial\alpha^T};\beta_0,\alpha_0\right\},$
$\frac{\partial U_n(\beta,\alpha)}{\partial\alpha^T}=\sum_iR_i\frac{\partial q_i(\alpha)}{\partial\alpha^T}g(X_i,Z_i;\beta)\{Y_i-\mu(X_i,Z_i;\beta)\},$
$\frac{\partial q(\alpha)}{\partial\alpha^T}=\frac{\pi_M(Z_i;\alpha)}{\pi(Y_i,Z_i;\alpha)}\left\{\frac{\pi_M'(Z_i;\alpha)}{\pi_M(Z_i;\alpha)}-\frac{\pi'(Y_i,Z_i;\alpha)}{\pi(Y_i,Z_i;\alpha)}\right\},\ \text{and}$

$\pi'$ denotes the derivative of π with respect to α. Denoting the ith contribution to $U_n(\beta,\alpha)$ by $U_n^{(i)}(\beta,\alpha)$,

$n^{1/2}(\hat\beta-\beta_0)=\Gamma_1^{-1}(\beta_0,\alpha_0)\left\{n^{-1/2}U_n(\beta_0,\alpha_0)+\Gamma_2(\beta_0,\alpha_0)\sum_iA_i(\alpha_0)\right\}+o_p(1)=\sum_{i=1}^n\Gamma_1^{-1}(\beta_0,\alpha_0)\left\{n^{-1/2}U_n^{(i)}(\beta_0,\alpha_0)+\Gamma_2(\beta_0,\alpha_0)A_i(\alpha_0)\right\}+o_p(1)=\sum_iH_i(\beta_0,\alpha_0)+o_p(1).$

First, we can show that $U_n^{(i)}(\beta_0,\alpha_0)$ has mean zero:

$E\{E_{R|Y,X_o,Z}U_n^{(i)}(\beta,\alpha)\}=E[\pi(Y_i,Z_i;\alpha)q_i(\alpha)g(X_i,Z_i;\beta)\{Y_i-\mu(X_i,Z_i;\beta)\}]=E[\pi_M(Z_i;\alpha)g(X_i,Z_i;\beta)\{Y_i-\mu(X_i,Z_i;\beta)\}]=0.$

In addition, by the assumption made in Section 3.1, the $A_i(\alpha_0)$'s have mean zero and are independent. Then $\sum H_i$ is a sum of independently distributed random vectors with mean zero and finite variance, and $\hat\beta$ is asymptotically normally distributed.

APPENDIX B: Asymptotic distribution of $\hat\beta^*$

Using a Taylor expansion, we have $U_n^*(\hat\beta^*)=0=U_n^*(\beta_0)+\frac{\partial U_n^*(\beta)}{\partial\beta^T}(\hat\beta^*-\beta_0)+o_p(1)$. First, it is easy to verify that

$\lim_{n\to\infty}E\left\{-n^{-1}\frac{\partial U_n^*(\beta)}{\partial\beta^T};\beta_0,\alpha_0\right\}=\Gamma_1(\beta_0,\alpha_0).$

Then $n^{1/2}(\hat\beta^*-\beta_0)=n^{-1/2}\Gamma_1^{-1}(\beta_0,\alpha_0)U_n^*(\beta_0)+o_p(1)$. Rewriting $U_n^*(\beta)=U_n(\beta,\hat\alpha)+U_n^*(\beta)-U_n(\beta,\hat\alpha)$, and using the fact that $n^{-1/2}U_n(\beta,\hat\alpha)=n^{-1/2}U_n(\beta,\alpha_0)+\Gamma_2(\beta_0,\alpha_0)\sum_iA_i(\alpha_0)+o_p(1)$, we have

$n^{1/2}(\hat\beta^*-\beta_0)=\Gamma_1^{-1}(\beta_0,\alpha_0)\left[n^{-1/2}U_n(\beta_0,\alpha_0)+\Gamma_2(\beta_0,\alpha_0)\sum_iA_i(\alpha_0)+n^{-1/2}\{U_n^*(\beta_0)-U_n(\beta_0,\hat\alpha)\}\right]+o_p(1)=\Gamma_1^{-1}(\beta_0,\alpha_0)(T_{1n}+T_{2n})+o_p(1)=\Gamma_1^{-1}(\beta_0,\alpha_0)\sum_{i=1}^n\left(T_{1n}^{(i)}+T_{2n}^{(i)}\right)+o_p(1),$

where $T_{1n}^{(i)}=n^{-1/2}U_n^{(i)}(\beta_0,\alpha_0)+\Gamma_2A_i(\alpha_0)$, $T_{1n}=\sum_iT_{1n}^{(i)}$, $T_{2n}^{(i)}=n^{-1/2}\{U_n^{*(i)}(\beta_0)-U_n^{(i)}(\beta_0,\hat\alpha)\}=n^{-1/2}\{R_i^*-R_iq_i(\hat\alpha)\}g(X_i,Z_i;\beta_0)\{Y_i-\mu(X_i,Z_i;\beta_0)\}$, and $T_{2n}=\sum_iT_{2n}^{(i)}$. Note that $E_{R^*|R,Y,X_o,Z}[R_i^*-R_iq_i(\hat\alpha)\mid R,Y,X_o,Z;\alpha_0]=0$, and hence $E(T_{2n}^{(i)}\mid R,Y,X_o,Z;\alpha_0)=0$. Furthermore, $E\{\operatorname{Cov}(T_{1n}^{(i)},T_{2n}^{(i)}\mid R,Y,X_o,Z)\}=0$ and $\operatorname{Cov}\{E(T_{1n}^{(i)}\mid R,Y,X_o,Z),E(T_{2n}^{(j)}\mid R,Y,X_o,Z)\}=0$. Therefore $\sum_i(T_{1n}^{(i)}+T_{2n}^{(i)})$ represents a sum of independent random vectors with mean 0 and finite variance. We also find that

$\operatorname{Var}(T_{1n}+T_{2n})=E\{\operatorname{Var}(T_{1n}+T_{2n}\mid R,Y,X_o,Z)\}+\operatorname{Var}\{E(T_{1n}+T_{2n}\mid R,Y,X_o,Z)\}=E\{\operatorname{Var}(T_{2n}\mid R,Y,X_o,Z)\}+\operatorname{Var}(T_{1n}).$

Therefore $n^{1/2}(\hat\beta^*-\beta_0)$ is asymptotically normally distributed with mean 0 and variance $\Gamma_1^{-1}(\beta_0,\alpha_0)[E\{\operatorname{Var}(T_{2n}\mid R,Y,X_o,Z)\}+\operatorname{Var}(T_{1n})]\Gamma_1^{-1}(\beta_0,\alpha_0)$. A consistent estimator of $\operatorname{Var}(T_{1n})$ is given in Appendix A, and $E\{\operatorname{Var}(T_{2n}\mid R,Y,X_o,Z)\}$ can be consistently estimated by

$n^{-1}\sum_iq_i(\hat\alpha)\{1-q_i(\hat\alpha)\}g(X_i,Z_i;\hat\beta)\{Y_i-\mu(X_i,Z_i;\hat\beta)\}^2g(X_i,Z_i;\hat\beta)^T.$

APPENDIX C: Relationship between the deletion estimator and the weighting estimator

The proof is similar to that of Theorem 2 of Reilly and Pepe (1997). Denoting the artificial observation indicator for the ith unit in the kth replication by $R_{i,k}^*$ and the deletion estimator from the kth replication by $\hat\beta_k^*$, the estimating function for the kth replication can be written as

$U_{n,k}^*(\beta,\hat\alpha)=\sum_{i=1}^nR_{i,k}^*g(X_i,Z_i;\beta)(Y_i-\mu_i).$

From a Taylor expansion, we have

$\sqrt n(\hat\beta_k^*-\beta_0)=\Gamma_{1,k}(\beta_k,\hat\alpha)^{-1}\frac{U_{n,k}^*(\beta_0,\hat\alpha)}{\sqrt n},$

where $\beta_k$ lies between $\hat\beta_k^*$ and $\beta_0$. Then the mean of the K deletion estimators can be expressed as

$\sqrt n(\bar\beta_K-\beta_0)=\frac1K\sum_{k=1}^K\Gamma_{1,k}(\beta_k,\hat\alpha)^{-1}\frac{U_{n,k}^*(\beta_0,\hat\alpha)}{\sqrt n}=\frac1K\sum_{k=1}^K\left\{\Gamma_{1,k}(\beta_k,\hat\alpha)^{-1}-\Gamma_1^{-1}(\beta_0,\hat\alpha)\right\}\frac{U_{n,k}^*(\beta_0,\hat\alpha)}{\sqrt n}+\frac1K\sum_{k=1}^K\Gamma_1^{-1}(\beta_0,\hat\alpha)\frac{U_{n,k}^*(\beta_0,\hat\alpha)}{\sqrt n}=\Phi_1(K,n)+\Phi_2(K,n).$

Given the observed data and n, let $\bar\beta=\lim_{K\to\infty}\bar\beta_K$. Then

$\sqrt n(\bar\beta-\beta_0)=\lim_{K\to\infty}\Phi_1(K,n)+\lim_{K\to\infty}\Phi_2(K,n)=\Phi_{1n}+\Phi_{2n}.$

First, observe that $\Phi_{1n}\to0$ as $n\to\infty$. Then, since $K^{-1}\sum_{k=1}^KR_{i,k}^*\xrightarrow{P}q_i(\hat\alpha)$ as $K\to\infty$, we have

$\Phi_{2n}=\lim_{K\to\infty}\frac1K\Gamma_1^{-1}(\beta_0,\hat\alpha)\sum_{k=1}^K\frac{U_{n,k}^*(\beta_0,\hat\alpha)}{\sqrt n}=\frac{1}{\sqrt n}\Gamma_1^{-1}(\beta_0,\hat\alpha)\sum_{i=1}^n\left\{\lim_{K\to\infty}K^{-1}\sum_{k=1}^KR_{i,k}^*\right\}g(X_i,Z_i;\beta_0)(Y_i-\mu_i)=\frac{1}{\sqrt n}\Gamma_1^{-1}(\beta_0,\hat\alpha)\sum_{i=1}^nq_i(\hat\alpha)g(X_i,Z_i;\beta_0)(Y_i-\mu_i)=\Gamma_1^{-1}(\beta_0,\hat\alpha)\frac{U_n(\beta_0,\hat\alpha)}{\sqrt n}.$

Thus $\sqrt n(\bar\beta-\beta_0)=\Gamma_1^{-1}(\beta_0,\hat\alpha)U_n(\beta_0,\hat\alpha)/\sqrt n+o_p(1)$, and $\sqrt n(\hat\beta-\bar\beta)=o_p(1)$.

APPENDIX D: Asymptotic variance inequality between the deletion estimator and the CR estimator

To proceed, we need notation for the estimating function for β without missing data, $S(\beta)=\sum S(Y_i|X_i,Z_i;\beta)$, and the estimating function for the nuisance parameter α, $U(\alpha)=\sum U_i(\alpha)$. As shown in Section 3, the asymptotic variance of $\hat\beta^*$ is $\Gamma_1^{-1}(\Sigma+\Omega)\Gamma_1^{-1}$, where $\Gamma_1=E\{\pi_M(Z)S(Y|X,Z;\beta)S(Y|X,Z;\beta)^T\}$,

$\Sigma=\Sigma_t-D=E\left\{\frac{\pi_M(Z)^2}{\pi(Y,Z)}S(Y|X,Z;\beta)S(Y|X,Z;\beta)^T\right\}-E\{RqS(Y|X,Z;\beta)U(\alpha)^T\}\{EU(\alpha)U(\alpha)^T\}^{-1}E\{RqU(\alpha)S(Y|X,Z;\beta)^T\},$

and $\Omega=E\{Rq(1-q)S(Y|X,Z;\beta)S(Y|X,Z;\beta)^T\}$.

Under MCAR, $\pi(Y,Z)=\pi(Z)=\pi_M(Z)$ and thus $q=q(Z)=1$; hence

$\Gamma_1=E\{\pi(Z)S(Y|X,Z;\beta)S(Y|X,Z;\beta)^T\},$
$\Sigma_t=E\{\pi(Z)S(Y|X,Z;\beta)S(Y|X,Z;\beta)^T\}=\Gamma_1,$

$D=E\{RS(Y|X,Z;\beta)U(\alpha)^T\}\{EU(\alpha)U(\alpha)^T\}^{-1}E\{RU(\alpha)S(Y|X,Z;\beta)^T\}$, and $\Omega=0$. Thus, under MCAR, the asymptotic variance of the deletion estimator $\hat\beta^*$ is $\Gamma_1^{-1}(\Gamma_1-D)\Gamma_1^{-1}$. Denoting the CR estimator by $\hat\beta_{CR}$, the difference of the asymptotic variances is $\operatorname{Var}(\hat\beta_{CR})-\operatorname{Var}(\hat\beta^*)=\Gamma_1^{-1}D\Gamma_1^{-1}$, which is positive semidefinite. This proves that the CR estimator has no smaller asymptotic variance than the deletion estimator.


References

  • Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika. 1988;75:11–20.
  • Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions. John Wiley & Sons; New York: 2003.
  • Laird NM. Missing data in longitudinal studies. Statistics in Medicine. 1988;7:305–315.
  • Lin DY, Robins JM, Wei LJ. Comparing two failure time distributions in the presence of dependent censoring. Biometrika. 1996;83:381–393.
  • Lipsitz SR, Ibrahim JG, Fitzmaurice GM. Likelihood methods for incomplete longitudinal binary responses with incomplete categorical covariates. Biometrics. 1999;55:214–223.
  • Little R, Rubin D. Statistical Analysis with Missing Data. John Wiley & Sons; New York: 1987.
  • Paik MC. Quasi-likelihood regression models with missing covariates. Biometrika. 1996;83:825–834.
  • Reilly M, Pepe M. A mean score method for missing and auxiliary covariate data in regression models. Biometrika. 1995;82:299–314.
  • Reilly M, Pepe M. The relationship between hot-deck multiple imputation and weighted likelihood. Statistics in Medicine. 1997;16:5–19.
  • Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 1994;89:846–866.
  • Rubin DB. Inference and missing data. Biometrika. 1976;63:581–592.
  • Wang Y-G. Estimating equations with nonignorably missing response data. Biometrics. 1999;55:984–989.
  • Zhao L, Lipsitz S. Designs and analysis of two-stage studies. Statistics in Medicine. 1992;11:769–782.
  • Zhao LP, Lipsitz SR, Lew D. Regression analysis with missing covariate data using estimating equations. Biometrics. 1996;52:1165–1182.