|Home | About | Journals | Submit | Contact Us | Français|
Density function is a fundamental concept in data analysis. Nonparametric methods including kernel smoothing estimate are available if the data is completely observed. However, in studies such as diagnostic studies following a two-stage design the membership of some of the subjects may be missing. Simply ignoring those subjects with unknown membership is valid only in the MCAR situation. In this paper, we consider kernel smoothing estimate of the density functions, using the inverse probability approaches to address the missing values. We illustrate the approaches with simulation studies and real study data in mental health.
As a fundamental concept in understanding univariate continuous outcomes, the probability density function is frequently used in data analysis. Among the nonparametric methods developed, the kernel smoothing estimate may be the most popular approach[7, 10, 13]. In the standard setting where a simple random sample (i.i.d.) from the target population is completely observed (complete data case), the problem of kernel smoothing has been thoroughly studied. Recently, some studies have generalized the kernel estimate to non-complete data situations. For example, estimates of density function were considered in the context of missing values in the variable of interest . In this paper we discuss estimates of density function within the diagnostic test setting where some subjects have missing disease status.
In modern clinical trials, it is quite common that the subjects are not sampled directly from the target population. In studies of specific diseases, the case and control approach in which subjects are directly sampled from cases and controls can not be applied if the disease status is unknown for each subject. A common approach in such situations is to first recruit subjects from a larger population, including both diseased and non-diseased subjects, and then ascertain the disease status at a later time. Such a two-stage design  is common, especially in diagnostic test studies. If the disease status or membership for the disease and non-disease groups for each subject is confirmed, the problem of density estimate reduces to the standard setting with analysis based on the subsample consisting of those subjects belonging to the target population. However, if not all subjects are known for their true disease status, then we are facing the missing data problem.
There are many reasons for missing disease status and some of them may be related with the missing values themselves. For example, if the membership is the onset of some disease, and the gold standard test for diagnosis is too expensive or involves intensive procedure such as surgeries, some of the subjects at lower risk of the disease may choose not to take the standard test (in some cases, it is actually unethical to do so). For example, structural clinical interview for DSM-IV diagnosis (SCID), which is in general viewed as the gold-standard for diagnosing depression, usually involves hours of interview of the doctor with the patient, making it expensive and inconvenient for the patients. Thus, it may not be practical to have each patient administered SCID.
This paper is motivated by problems from a real study. In a recent study of depression among postpartum women, accuracies of several screening tests for depression are assessed among postpartum mothers. A total of 419 postpartum women initially agreed to participate in the study, and their demographic information and screening test results were collected. The depression status was not known in advance for these subjects. But only 198 subjects completed the subsequent SCID. Assessing test accuracy with the subject’s disease status subject to missing is known as the verification bias problem in diagnostic test studies. Ignoring those subjects with missing disease status in general produces biased estimates. Although methods are available for correcting such bias, they focus on modeling the operating receiver characteristic (ROC) curves and/or the area under the ROC curve[1, 4, 9]. Since ROC curves are determined by the distributions of test results and the disease status, density estimates of the distributions of test outcomes for the diseased and non-diseased provide a direct alternative to the problem.
Since the membership of disease and non-disease groups is missing for some subjects, standard kernel smoothing methods do not apply. The naive approach that uses only those subjects with known group membership in the analysis is valid only in the special situation when the membership is missing completely at random (MCAR). As mentioned above, the missing group membership usually occurs in a systematic way, creating a basis for biased estimates when applying such a naive approach.
In the rest of the paper, we develop a kernel smoothing method for addressing the bias issue under missing group membership within the current context. For comparison purposes as well as a way to introduce notation, we first give a brief review of standard univariate kernel smoothing under complete data in Section 2. We then propose a new approach to address the limitations of traditional methods where the group membership is subject to missing in section 3. Bandwidth selection for the proposed smoothing approach is discussed in Section 4. Simulation studies are carried out in Section 5 to examine the performance of the approach, with a real study example given in Section 6. The paper concludes with a discussion in Section 7.
Let (xi, Ti, Di) be a simple random sample from a population of interest, where Di is a membership indicator of groups of interest such as diseased and non-diseased groups in our context, xi is a vector of covariates, and Ti is a univariate random variable such as a test outcome in our study. We are interested in estimating the density function of Ti among the different groups. Assume Di = 1 for the group of interest, and Di = 0 for the remaining subjects. In the case of complete data where Di is observed for each subject, standard univariate kernel smoothing methods can be applied to the subsample of Di = 1. Note that in this case, the covariates xi are not used in the estimate of density function of Ti.
Let f(t) denote the probability density function of Ti for the group Di = 1. Assume that the second derivative of f is square integrable,
Let K(·) denote a symmetric density function with a finite second moment, i.e.,
Also, let be the number of subjects with Di = 1 in the sample. The kernel density estimate of f(t) at a point T = t using the kernel function K(·) is defined as
where the bandwidth h > 0 is a constant and . The kernel function K(·) generally gives more weight to observations closer to t. The choice of the bandwidth is closely related to the behavior of the kernel estimate. Larger bandwidths correspond to estimates that are more biased but less unstable, while smaller bandwidths yield estimates that are less biased, but more unstable. Bandwidth should be carefully selected to balance bias and variance.
Since the bias reduces to 0 as the bandwidth h approaches infinity, the bandwidth needs to be small to reduce bias. However, it also needs to be large enough to include sufficient subjects; otherwise, the variance may be very large and the estimates themselves may not even exist. More precisely, we let the bandwidth h be a non-random sequence of positive numbers such that
where fh(t) = E [Kh(t − Ti)|Di = 1] and R(K) = ∫K(z)2dz. Taking both the bias and variance into consideration, the behavior of the estimate at each point is assessed by the mean squared error (MSE) of the estimate at the point. As a direct consequence of (5), the asymptotic MSE of the kernel density estimate in (3) is
It is often desirable to assess the estimate over the entire real line. By integrating the MSE over the entire real line, the mean integrated squared error (MISE) provides such a measure to assess the overall accuracy of the estimate. As a direct consequence of (6), the asymptotic MISE of (3) is given by
where R(f″) = ∫f″(t)2dt. Optimal bandwidths should give rise small MISEs. Thus, by minimizing (7), we obtain the asymptotic optimal bandwidth .
In this section we discuss estimation of density function of T for the group D = 1 under missing group membership for some subjects. We assume that the covariates xi and the variable of interest Ti are always observed. Let Vi be the indicator of whether Di is observed; Vi = 1 if it is observed, and Vi = 0 if otherwise. Thus, (xi, Ti, Vi) is always observed, but Di is only observed for those subject with Vi = 1. We will extend the kernel density estimate (3) described in the last section to this situation.
The naive estimate would be simply applying the kernel smoothing estimate to the subgroup of those subjects who are known to be in the group defined by Vi = 1 and Di = 1, i.e.,
This naive estimate is valid only under the very strong missing completely at random (MCAR) condition. In particular, it does not apply when the missingness of the group membership follows the missing at random (MAR) assumption.
Conditional on the observed outcomes T and covariates x, assume that the membership observation indicator V is independent with the exact membership D, i.e.
This MAR assumption is common in the literature of missing values. In the aforementioned diagnostic study examples, MAR is plausible if the decision of the administration of gold standard test depends on the observed test results and covariates. The inverse probability weighting (IPW) technique is commonly used to address missing values under MAR[5, 8]. Below, we apply this approach to address missing group membership in our setting.
Let πi = Pr(Vi = 1|Ti, xi) be the probability that the group membership is observed for the ith subject. If πi is known by design as in some two-stage studies, then those subjects with known group status in the group (Vi = 1 and Di = 1) can be used for the density estimate with proper weighting. The idea of IPW is that each subject with known group membership is selected for verification (Vi = 1) with a probability πi among similar subjects and thus should be weighted by the inverse of this probability in its contribution to the estimation. By applying this idea to the kernel density estimate in our setting, the density estimate with the inverse probability weight based on the known probabilities (IPWK) at a point T = t is given by
Since Vi = 0 for those with unknown Di, the estimate above is computable only if πi is known. Further, the estimate is only based on those with known group membership (Di = 1 and Vi = 1).
Assume that the selection probability is continuous in Ti and xi. Furthermore, we assume that πi is bounded below away from 0, i.e.,
where c is a constant. This condition is necessary for the estimates to have good behaviors, since otherwise we may have some very large weights, yielding unstable estimates. Under the conditions (1),(2), and (4), and (11), we have the following
The asymptotic bias and variance of are
where p = Pr(Di = 1) is the proportion of the group with Di = 1 in the whole population and .
It is clear that we obtain the same bias as the complete case, but with a larger variance. Under MCAR, πi is a constant and hence . The asymptotic variance in this special case is , which is consistent with the complete data case, since the actual sample size in this case is npπ. In other words, the results in the theorem reduce to the complete case when there is no missing values in the group membership, i.e., πi = 1.
In most studies other than those based on two-stage designs, the probabilities πi are not known. For example, although physicians in our setting may make the decision for SCID assessment based on the subject’s demographic, history of mental health and screening test results, it is quite rare that they make their decisions by modeling πi and generating a random Vi based on the model. In such cases, the missing mechanism satisfies the MAR assumption, but with known πi. Although in observational studies, MAR may not be a correct model, it will hold approximately true, if sufficient information is included when modeling the weight function πi. In either situations, we need to model and estimate πi.
Since the indicator of observed group membership is binary, we can model πi using logistic regression:
To simplify the notation, we have subsumed the constant term of the logistic regression as well as T into the vector of covariates x. Given the above model, we can readily estimate β.
In particular, the MLE of β can be obtained by solving the following score equations:
Note that unlike the density estimate, all subjects (whether Di is observed or not) are used for estimating β in the above equations.
Denote the estimated probabilities of being selected for verification by . By substituting the estimates into (10), we obtain the following IPW estimate with weight based on modeling of the missing mechanism (IPW):
The asymptotic bias and variance of are
Comparing Theorems 1 and 2, we see that and have the same asymptotic bias and variance. These are the direct consequences of the following lemma which gives the asymptotic distributions of the respective estimates and . A proof of the lemma is provided in the appendix.
where , and (b)
where , c1 = E [(1 − πi) xi|Di = 1] and c2 = E [(1 − πi) Kh(t − Ti)xi|Di = 1].
It is straightforward to prove Theorem 1 based on the fact that and (17). Theorem 2 is based on (18) and the fact that . It should be pointed out that although the expression for the variance has extra terms, IPW with estimated missing probabilities in general has slightly better behavior than IPWK, even when the selection probabilities πi are known (see the simulation study in Section 5 for details). Note that a similar phenomenon in regression analysis is well known.
It is clear from Theorems 1 and 2 that the behaviors of the estimates are closely related to the bandwidth h used. If h is too small, then the estimates are not stable. On the other hand, if h is too big, the bias can be large. We must trade o3 between bias and variance in selection of the bandwidths. As in the complete data cases, we can assess the qualities of the estimates by their mean squared errors. Based on Theorems 1 and 2, we immediately have the following
The MSE of and at point t both equal to
Since MSE = Bias2 +Variance, the proof is straightforward.
The MSE assesses the behavior of the estimate at a single point t, with a smaller MSE indicating better fit. Asymptotically optimal bandwidth at a point t can be obtained by minimizing the corresponding MSE. To assess the behavior of the estimate with a common bandwidth over the entire range, we need to assess the integrated MSE (MISE), which is the integration of MSE over the whole range. Following from Theorem 2, we have:
The MISE for and are given by
As in the case of MSE, the smaller the MISE the better the fit. Optimal bandwidths can be selected by minimizing the MISE (20). It is easy to see that the minimum can be achieved, asymptotically, when . Thus, the asymptotic optimal bandwidth is given by:
In the formula for the optimal bandwidth (21), R(K) and μ2(K) can be easily computed. For the Epanechnikov kernel function
we have μ2 (K) = 1 and . The term c0 is related to the missing rate, and it can be estimated roughly by the sample missing rate. The computation of R(f″) is a bit involved since it involves the estimate of f″. One approach is to use the normal distribution as a basis for bandwidth selection. If f follows a normal distribution with variance σ2, the corresponding R(f″) ≈ 0.212 σ4. Under this approach, we estimate first the variance of the variable to obtain an estimate of σ and then substitute 0.212 σ4 in the place of R(f″) in (21) to obtain an estimate of the optimal bandwidth. We may further estimate f″ using the bandwidth we obtained based on the normal distribution approach, and then substitute the estimate of R(f″) in (21) to obtain an estimate of the optimal bandwidth.
Simulation studies were performed in a couple of different scenarios to assess the behaviors of the estimates. We considered the situation where T is a binormal, i.e., T is normally distributed in both groups. However, to focus on a finite range, we used a normal distribution truncated on [−1, 1]. More precisely, T|D = 1 (T|D = 0) follows standard normal N(0, 1) (N(1, 1)) truncated on [−1, 1]. In addition, the missingness of group membership D depends only on T, i.e., no other covariates xi were involved.
We assumed a missing probability model based on logistic regression, i.e., . Following this model, β0 can used to control the missing rate, while β1 indicates the degree of deviation from MCAR. In the simulation study, the proportion of the first group in the whole population was fixed at .5, i.e., p = .5. The Epanechnikov kernel function was used for density estimates.
As pointed out in , sample sizes as large as 1000 are common in diagnostic test studies, and asymptotic theory can be applied. In the simulation study, to assess the behaviors of the estimates, especially how they would change with bandwidth, under small to moderate small sample sizes, we set the sample size at 200. Note also that under our setting, there are roughly 100 in each group, however, memberships of only part of them are confirmed.
Setting β0 = 0 and β1 = 0.5, we computed the integrated squared bias, integrated variance, as well as the MISE for bandwidths that varied from 0.01 to 1 based on a Monte Carlo size of 1000. Shown in Figure 1 are the MISEs as a function of bandwidth for the four estimates. When the bandwidth increased, the bias increased, while the variance decreased. Both IPW estimates, whether based on the known or estimated selection probabilities, behaved better as compared to the naive methods. The IPW approaches had a comparable amount of bias as compared to the estimate based on the complete data. But the naive method had a much larger bias. The variances of the IPW methods and those of the naive methods were comparable, although as expected the estimate based on the complete data had a smaller variance. Similar to the regression setting, density estimates obtained based on estimated probabilities behaved slightly better than those based on the true (known in simulation) probabilities, although the difference was small. Since μ2(K) = 1, R(K) = .6, the variance of the standard normal truncated by [−1,] is , where and Φ are the density and cumulative distribution functions of the standard normal. It follows that the bandwidth suggested by (20) would be . Based on the simulation results, the minimum of the MISE was achieved around h = 0.335 for the IPW estimates, confirming that the formula (20) is reliable for computing the optimal bandwidth.
From Figure 1, we see that the naive method produces larger biases as compared to the IPW estimates, as the former did not address MAR. To assess the effect of missing data mechanism, we considered different values of β1. By fixing β0 = 0, but varying β1 from 0 to 1, the missing mechanism changed from MCAR (β1 = 0) to MAR with larger β1 indicating more deviations from MCAR. Since the missing rate was roughly 50%, we obtained from (21) a rough estimate 0.3 as the asymptotic optimal bandwidth by substituting 2 for c0 in (21) and computing R(f″) under the known distribution with the sample size 200.
The following plot contains the mean of the smoothed curves with a Monte Carlo size of 1000. The mean curves for the IPW approaches are almost identical to that of the estimate based on the complete data. It is clear from the plot that as the missingness deviates from MCAR, the amount of bias from the naive methods increased. However, the IPW methods still provided good estimates.
We now illustrate our proposed methodology using the baseline data from a real longitudinal study on depression in elderly patients (age 65 and over) recruited from primary-care practices in Monroe County, New York. In addition to depression status determined by SCID, other information collected include demographic variables, the Hamilton Depression Rating Scale (HAM-D) for depression, a 24-item observer-rated scale designed to measure the severity of depressive symptoms , and the total score on the Cumulative Illness Rating Scale (CIRS), a reliable and valid measure of medical burden that quantifies the amount of pathology in each organ system . Among the 708 patients enrolled, 249 patients were classified as having depression and the remaining 459 patients were declared as depression-free. Although the total score of HAM-D is inherently discrete, it was often treated as a continuous outcome because of its large range. In this example, we will estimate its density function among depressed patients.
Data for both the SCID and the HAM-D were collected from all participating patients in this study; therefore, similar to  we used a subset that resembled data that would be obtained from a two-phase design. Hence, we can assess the behaviors of the estimates by comparing them to the estimates based on the complete data. In this subset, the HAM-D results were available for all patients, but the SCID diagnoses were available only for certain patients selected according to the following mechanism:
Thus, the verification mechanism preferentially selected the patients who were under the age of 75, with a relatively high cumulative illness burden. Using this mechanism, 394 of the 708 patients (55.6%) were selected for SCID verification of the depression diagnosis. A logistic regression model with age and CIRS as the predictors was used to model the missing data mechanism.
The true density function is not known in this real data example; however, since there is no missing values in HAM-D and SCID in the data set, we used the estimate based on the complete data set as a reference in assessing the quality of the estimate under missing data. To model the missing data mechanism, we assumed that the missingness was related to CIRS and age, and applied the following generalized linear model with a logistic link to estimate the relationship:
Thus, the model used for the missing mechanism is not exactly the one we used to generate the missing values (the predictors are not dichotomized in the regression).
Shown in the plot below are the estimates of the density using the complete data set of all subjects who were depressed (complete), naive estimate (naive), and IPW estimates (IPW for IPW with estimated missing probabilities and IPWK for IPW with known missing probabilities). The IPW estimates are in general closer to, as compared with the naive estimate, the estimate based on the complete data. Note that although the estimated weight function was not based on the exact model for generating the missing values, bias was greatly reduced under the proposed approach (comparing to the naive method). It is also interesting to note that the IPW estimate using the known missing data probabilities (IPWK) is not as good as the one based on the estimated weight (IPW). Such a difference is also observed in various other settings involving IPW estimates, including parametric regressions and U-statistic estimates.
In this paper we generalized the kernel smoothing density estimates for diagnostic test outcomes when the true disease status is subject to missing. Through the study, it is clear that the naive approach that ignores the missing values of disease status and uses only those with known disease membership is not valid. When the missing mechanism is well understood and characterized by the MAR mechanism, the proposed methodology works well in reducing the bias. Note that the information in Xi are only used in the modeling of the missing mechanism (πi). If the relationship between the disease status Di and (Ti, Xi) is well understood, then by including this additional model on the disease mechanism we may obtain more efficient estimates.
There are some problems left unanswered for our methods. Most notable is the study of boundary effects. It is well known that many smoothing methods do not behave well without adjustment near the boundaries. Our approach is no exception, as indicated by the relatively poor behaviors of the estimates near the boundaries. Further investigation is necessary to address such biases.
Another aspect that needs future study is the effect of MAR on the estimates. Although MAR is commonly assumed in the literature and satisfied by many studies in practice, missing not at random (MNAR) may arise in some studies. When enough information is collected (via covariates) and included in modeling the missingness of disease status, the MAR model may be approximately true. But, will the MAR assumption affect the behavior of the estimates, and if so, to what extent? More generally, how do we deal with MNAR? All these need future studies.
This research was supported in part by Award Number R21 DA027521-01 from the National Institute On Drug Abuse. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute On Drug Abuse or the National Institutes of Health.
The authors would also like to thank Jeffrey M. Lyness, M.D. for providing the data used in Section 6.
In this appendix we give a proof of Lemma 3 using estimating equation techniques. Note that under MAR, we have
When the parameters of the missing mechanism, β, is known and the true value is used, we can obtain the estimate (t) by solving the following estimating equations
If β is estimated from the data using the logistic model (14), the estimate (t) can be obtained by solving the system of estimating equations consisting of (25) and (14), and we can use the technique of stacking estimating equations. Since these estimating equations are unbiased, the estimates (t) and (t) are consistent estimates of fh(t). See appendix A.3 of  for details about estimating equations.
Let , and , then where the asymptotic variance σ2 equals to the (1, 1) term of the matrix A−1BA−T, where and
Let , and , then by simple matrix computation, σ2 = p (ef2 − 2gf + s). It is straightforward to verify that .
Let Ψ3 = (Vi − πi (β))xi. Using the technique of stacking estimating equation, it follows that where the asymptotic variance σ2 is the (1, 1) term of the matrix A−1BA−T, where and .
Let c1 = E [(1 − πi) xi|Di = 1] and c2 = E [(1 − πi) Kh(t − Ti)xi|Di = 1], then
Through tedious algebraic computation it can be proved that
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.