In this section we discuss estimation of the density function of *T* for the group *D* = 1 when group membership is missing for some subjects. We assume that the covariates **x**_{i} and the variable of interest *T*_{i} are always observed. Let *V*_{i} be the indicator of whether *D*_{i} is observed: *V*_{i} = 1 if it is observed, and *V*_{i} = 0 otherwise. Thus, (**x**_{i}, *T*_{i}, *V*_{i}) is always observed, but *D*_{i} is observed only for those subjects with *V*_{i} = 1. We will extend the kernel density estimate (3) described in the last section to this situation.

The naive estimate simply applies the kernel smoothing estimate to the subgroup of subjects known to be in the group, defined by *V*_{i} = 1 and *D*_{i} = 1, i.e.,

This naive estimate is valid only under the very strong missing completely at random (MCAR) condition. In particular, it does not apply when the missingness of the group membership follows the missing at random (MAR) assumption.

Conditional on the observed outcome *T* and covariates **x**, assume that the membership observation indicator *V* is independent of the true membership *D*, i.e.,

Pr(*V*_{i} = 1 | *D*_{i}, *T*_{i}, **x**_{i}) = Pr(*V*_{i} = 1 | *T*_{i}, **x**_{i}).

This MAR assumption is common in the literature on missing values. In the aforementioned diagnostic study examples, MAR is plausible if the decision to administer the gold standard test depends only on the observed test results and covariates. The inverse probability weighting (IPW) technique is commonly used to address missing values under MAR [5, 8]. Below, we apply this approach to address missing group membership in our setting.

Let *π*_{i} = Pr(*V*_{i} = 1|*T*_{i}, **x**_{i}) be the probability that the group membership is observed for the *i*th subject. If *π*_{i} is known by design as in some two-stage studies, then those subjects with known group status in the group (*V*_{i} = 1 and *D*_{i} = 1) can be used for the density estimate with proper weighting. The idea of IPW is that each subject with known group membership is selected for verification (*V*_{i} = 1) with a probability *π*_{i} among similar subjects and thus should be weighted by the inverse of this probability in its contribution to the estimation. By applying this idea to the kernel density estimate in our setting, the density estimate with the inverse probability weight based on the known probabilities (IPWK) at a point *T* = *t* is given by

Since *V*_{i} = 0 for those with unknown *D*_{i}, the estimate above is computable only if *π*_{i} is known. Further, the estimate is only based on those with known group membership (*D*_{i} = 1 and *V*_{i} = 1).
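As a minimal sketch of how such a weighted estimate could be computed, the Python code below implements a self-normalized IPW kernel density estimate with known selection probabilities. The Gaussian kernel, the normalization by the sum of the weights, the function names `gaussian_kernel` and `ipw_kde`, and the simulated data are illustrative assumptions rather than the paper's specification; the paper's exact normalization may differ.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def ipw_kde(t, T, V, D, pi, h):
    """Self-normalized IPW kernel density estimate of T in the group D = 1.

    t  : evaluation point(s)
    T  : observed outcomes for all subjects
    V  : 1 if group membership was verified, 0 otherwise
    D  : group membership (any placeholder value where V == 0)
    pi : known verification probabilities Pr(V = 1 | T, x)
    h  : bandwidth
    """
    t = np.atleast_1d(np.asarray(t, dtype=float))
    T = np.asarray(T, dtype=float)
    # Only subjects verified to be in the group contribute to the estimate.
    use = (np.asarray(V) == 1) & (np.asarray(D) == 1)
    w = 1.0 / np.asarray(pi, dtype=float)[use]
    # K_h(t - T_i) = K((t - T_i) / h) / h, evaluated for every (t, T_i) pair.
    K = gaussian_kernel((t[:, None] - T[use][None, :]) / h) / h
    return (K * w).sum(axis=1) / w.sum()

# Hypothetical two-stage design: verification probability depends on T only.
rng = np.random.default_rng(0)
n = 2000
D = rng.binomial(1, 0.5, n)
T = rng.normal(loc=D, scale=1.0)
pi = 0.2 + 0.6 / (1.0 + np.exp(-T))   # known by design, bounded in [0.2, 0.8]
V = rng.binomial(1, pi)

grid = np.linspace(-6.0, 8.0, 281)
f_hat = ipw_kde(grid, T, V, D, pi, h=0.4)
```

With constant probabilities the weights cancel, so this sketch reduces to the naive estimate computed on the verified subgroup.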

Assume that the selection probability is continuous in *T*_{i} and **x**_{i}. Furthermore, we assume that *π*_{i} is bounded away from 0, i.e.,

*π*_{i} ≥ *c* > 0, (11)

where *c* is a constant. This condition is necessary for the estimates to behave well, since otherwise some weights may be very large, yielding unstable estimates. Under conditions (1), (2), (4), and (11), we have the following

Theorem 1

The asymptotic bias and variance of

are

where

*p* = Pr(*D*_{i} = 1) is the proportion of the group with *D*_{i} = 1 in the whole population and

.

It is clear that we obtain the same bias as in the complete-data case, but with a larger variance. Under MCAR, *π*_{i} is a constant and hence

. The asymptotic variance in this special case is

, which is consistent with the complete-data case, since the actual sample size in this case is *npπ*. In other words, the results in the theorem reduce to the complete-data case when there are no missing values in the group membership, i.e., *π*_{i} = 1.

In most studies other than those based on two-stage designs, the probabilities *π*_{i} are not known. For example, although physicians in our setting may decide on SCID assessment based on the subject’s demographics, mental health history, and screening test results, it is quite rare that they make their decisions by modeling *π*_{i} and generating a random *V*_{i} based on the model. In such cases, the missing mechanism satisfies the MAR assumption, but with unknown *π*_{i}. Although MAR may not hold exactly in observational studies, it will hold approximately if sufficient information is included when modeling the weight function *π*_{i}. In either situation, we need to model and estimate *π*_{i}.

Since the indicator of observed group membership is binary, we can model *π*_{i} using logistic regression:

logit(*π*_{i}) = **x**_{i}^{⊤}*β*. (14)

To simplify the notation, we have subsumed the constant term of the logistic regression as well as *T* into the vector of covariates **x**. Given the above model, we can readily estimate *β*. In particular, the MLE of *β* can be obtained by solving the following score equations:

∑_{i=1}^{n} (*V*_{i} − *π*_{i}) **x**_{i} = **0**.

Note that unlike the density estimate, all subjects (whether *D*_{i} is observed or not) are used for estimating *β* in the above equations.
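The score equations of a logistic model have no closed-form solution, but they can be solved numerically, for example by Newton-Raphson iteration. The sketch below shows one way to do this with NumPy. The function name `fit_selection_model`, the simulated covariates, and the coefficient values in `beta_true` are hypothetical and chosen only for illustration; any standard logistic regression routine would serve the same purpose.

```python
import numpy as np

def fit_selection_model(X, V, n_iter=25, tol=1e-10):
    """Solve the logistic score equations sum_i (V_i - pi_i) x_i = 0 by Newton-Raphson.

    X : (n, p) design matrix, including the constant column and T
    V : (n,) verification indicators for ALL subjects, verified or not
    """
    X = np.asarray(X, dtype=float)
    V = np.asarray(V, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (V - pi)                         # left-hand side of the score equations
        info = X.T @ (X * (pi * (1.0 - pi))[:, None])  # Fisher information matrix
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated example: verification depends on T and one covariate x.
rng = np.random.default_rng(0)
n = 5000
T = rng.normal(size=n)
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), T, x])   # constant term and T subsumed into X
beta_true = np.array([-0.5, 1.0, 0.5])
V = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

beta_hat = fit_selection_model(X, V)
pi_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))   # estimated verification probabilities
```

Every subject enters the fit, since *V*_{i}, *T*_{i}, and **x**_{i} are always observed; only the subsequent density estimate is restricted to verified group members.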

Denote the estimated probabilities of being selected for verification by *π̂*_{i}. By substituting these estimates into (10), we obtain the following IPW estimate with weights based on a model of the missing mechanism (IPW):

Under conditions (1), (2), (4), (11), and (14), we have the following

Theorem 2

The asymptotic bias and variance of

are

where

.

Comparing Theorems 1 and 2, we see that

and

have the same asymptotic bias and variance. These are direct consequences of the following lemma, which gives the asymptotic distributions of the respective estimates

and

. A proof of the lemma is provided in the appendix.

Lemma 3

Under conditions (1), (2), (4), (11), and (14), for fixed *h*, we have (a)

where

, and (b)

where

, **c**_{1} = *E*[(1 − *π*_{i})**x**_{i} | *D*_{i} = 1] and **c**_{2} = *E*[(1 − *π*_{i})*K*_{h}(*t* − *T*_{i})**x**_{i} | *D*_{i} = 1].

It is straightforward to prove Theorem 1 based on the fact that

and (17). Theorem 2 is based on (18) and the fact that

. It should be pointed out that although the expression for the variance has extra terms, IPW with estimated missing probabilities in general behaves slightly better than IPWK, even when the selection probabilities *π*_{i} are known (see the simulation study in Section 5 for details). A similar phenomenon is well known in regression analysis.
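To make the procedure of this section concrete, the following sketch runs it end to end on one simulated data set: it computes the naive estimate, the weighted estimate with the true probabilities (IPWK), and the weighted estimate with probabilities fitted by logistic regression (IPW) at a single point. Everything here is an illustrative assumption: the Gaussian kernel, the self-normalized form of the weights, the data-generating model, the bandwidth, and the helper names `expit`, `kde_weighted`, and `fit_logistic`. It is not the simulation design of Section 5, and a single run says nothing about the relative efficiency of IPWK and IPW.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def kde_weighted(t, T, w, h):
    """Weighted Gaussian-kernel density estimate at a scalar point t (weights normalized)."""
    K = np.exp(-0.5 * ((t - T) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    return np.sum(w * K) / np.sum(w)

def fit_logistic(X, V, n_iter=50, tol=1e-10):
    """Newton-Raphson solution of the logistic score equations."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = expit(X @ beta)
        info = X.T @ (X * (pi * (1.0 - pi))[:, None])
        step = np.linalg.solve(info, X.T @ (V - pi))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated data: T | D = d ~ N(d, 1); verification depends on T only (MAR).
rng = np.random.default_rng(1)
n = 20000
D = rng.binomial(1, 0.5, n)
T = rng.normal(loc=D, scale=1.0)
X = np.column_stack([np.ones(n), T])
pi_true = expit(X @ np.array([-0.5, 1.2]))
V = rng.binomial(1, pi_true)
D_obs = np.where(V == 1, D, -1)           # -1 marks unverified membership

# The selection model is fitted on all subjects; the density uses verified members only.
pi_hat = expit(X @ fit_logistic(X, V.astype(float)))
known = (V == 1) & (D_obs == 1)

t0, h = -0.5, 0.3
f_true = np.exp(-0.5 * (t0 - 1.0) ** 2) / np.sqrt(2.0 * np.pi)   # N(1, 1) density at t0
f_naive = kde_weighted(t0, T[known], np.ones(known.sum()), h)
f_ipwk = kde_weighted(t0, T[known], 1.0 / pi_true[known], h)
f_ipw = kde_weighted(t0, T[known], 1.0 / pi_hat[known], h)
print(f_true, f_naive, f_ipwk, f_ipw)
```

In this design subjects with larger *T* are more likely to be verified, so the verified subgroup over-represents large values of *T*; the inverse weights are intended to undo that selection.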