Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC3221313

Formats

Article sections

- Abstract
- 1 Introduction
- 2 Kernel Smoothing Estimate for Complete Data
- 3 Univariate Kernel Smoothing for Incomplete Data
- 4 Bandwidth Selection
- 5 Simulation Studies
- 6 Study of Depression in Elderly Primary-care Patients
- 7 Discussion
- References

Authors

Related links

J Stat Plan Inference. Author manuscript; available in PMC 2013 March 1.

Published in final edited form as:

PMCID: PMC3221313

NIHMSID: NIHMS329537

Department of Biostatistics and Computational Biology, University of Rochester Medical Center, 601 Elmwood Ave, Box 630, Rochester, NY 14642, U.S.A

The publisher's final edited version of this article is available at J Stat Plan Inference

Density function is a fundamental concept in data analysis. Nonparametric methods including kernel smoothing estimate are available if the data is completely observed. However, in studies such as diagnostic studies following a two-stage design the membership of some of the subjects may be missing. Simply ignoring those subjects with unknown membership is valid only in the MCAR situation. In this paper, we consider kernel smoothing estimate of the density functions, using the inverse probability approaches to address the missing values. We illustrate the approaches with simulation studies and real study data in mental health.

As a fundamental concept in understanding univariate continuous outcomes, the probability density function is frequently used in data analysis. Among the nonparametric methods developed, the kernel smoothing estimate may be the most popular approach[7, 10, 13]. In the standard setting where a simple random sample (i.i.d.) from the target population is completely observed (complete data case), the problem of kernel smoothing has been thoroughly studied. Recently, some studies have generalized the kernel estimate to non-complete data situations. For example, estimates of density function were considered in the context of missing values in the variable of interest [14]. In this paper we discuss estimates of density function within the diagnostic test setting where some subjects have missing disease status.

In modern clinical trials, it is quite common that the subjects are not sampled directly from the target population. In studies of specific diseases, the case and control approach in which subjects are directly sampled from cases and controls can not be applied if the disease status is unknown for each subject. A common approach in such situations is to first recruit subjects from a larger population, including both diseased and non-diseased subjects, and then ascertain the disease status at a later time. Such a two-stage design [15] is common, especially in diagnostic test studies. If the disease status or membership for the disease and non-disease groups for each subject is confirmed, the problem of density estimate reduces to the standard setting with analysis based on the subsample consisting of those subjects belonging to the target population. However, if not all subjects are known for their true disease status, then we are facing the missing data problem.

There are many reasons for missing disease status and some of them may be related with the missing values themselves. For example, if the membership is the onset of some disease, and the gold standard test for diagnosis is too expensive or involves intensive procedure such as surgeries, some of the subjects at lower risk of the disease may choose not to take the standard test (in some cases, it is actually unethical to do so). For example, structural clinical interview for DSM-IV diagnosis (SCID)[11], which is in general viewed as the gold-standard for diagnosing depression, usually involves hours of interview of the doctor with the patient, making it expensive and inconvenient for the patients. Thus, it may not be practical to have each patient administered SCID.

This paper is motivated by problems from a real study. In a recent study of depression among postpartum women, accuracies of several screening tests for depression are assessed among postpartum mothers[3]. A total of 419 postpartum women initially agreed to participate in the study, and their demographic information and screening test results were collected. The depression status was not known in advance for these subjects. But only 198 subjects completed the subsequent SCID. Assessing test accuracy with the subject’s disease status subject to missing is known as the verification bias problem in diagnostic test studies. Ignoring those subjects with missing disease status in general produces biased estimates. Although methods are available for correcting such bias, they focus on modeling the operating receiver characteristic (ROC) curves and/or the area under the ROC curve[1, 4, 9]. Since ROC curves are determined by the distributions of test results and the disease status, density estimates of the distributions of test outcomes for the diseased and non-diseased provide a direct alternative to the problem.

Since the membership of disease and non-disease groups is missing for some subjects, standard kernel smoothing methods do not apply. The naive approach that uses only those subjects with known group membership in the analysis is valid only in the special situation when the membership is missing completely at random (MCAR). As mentioned above, the missing group membership usually occurs in a systematic way, creating a basis for biased estimates when applying such a naive approach.

In the rest of the paper, we develop a kernel smoothing method for addressing the bias issue under missing group membership within the current context. For comparison purposes as well as a way to introduce notation, we first give a brief review of standard univariate kernel smoothing under complete data in Section 2. We then propose a new approach to address the limitations of traditional methods where the group membership is subject to missing in section 3. Bandwidth selection for the proposed smoothing approach is discussed in Section 4. Simulation studies are carried out in Section 5 to examine the performance of the approach, with a real study example given in Section 6. The paper concludes with a discussion in Section 7.

Let (**x*** _{i}, T_{i}, D_{i}*) be a simple random sample from a population of interest, where

Let *f*(*t*) denote the probability density function of *T _{i}* for the group

(1)

Let *K*(·) denote a symmetric density function with a finite second moment, i.e.,

(2)

Also, let
be the number of subjects with *D _{i}* = 1 in the sample. The kernel density estimate of

(3)

where the bandwidth *h >* 0 is a constant and
. The kernel function *K*(·) generally gives more weight to observations closer to *t*. The choice of the bandwidth is closely related to the behavior of the kernel estimate. Larger bandwidths correspond to estimates that are more biased but less unstable, while smaller bandwidths yield estimates that are less biased, but more unstable. Bandwidth should be carefully selected to balance bias and variance.

Since the bias reduces to 0 as the bandwidth *h* approaches infinity, the bandwidth needs to be small to reduce bias. However, it also needs to be large enough to include sufficient subjects; otherwise, the variance may be very large and the estimates themselves may not even exist. More precisely, we let the bandwidth *h* be a non-random sequence of positive numbers such that

(4)

Under conditions (1),(2), and (4), the asymptotic bias and variance for the kernel density estimates in (3)[7] are given by

(5)

where *f _{h}*(

(6)

It is often desirable to assess the estimate over the entire real line. By integrating the MSE over the entire real line, the mean integrated squared error (MISE) provides such a measure to assess the overall accuracy of the estimate. As a direct consequence of (6), the asymptotic MISE of (3) is given by

(7)

where *R*(*f*″) = ∫*f*″(*t*)^{2}*dt.* Optimal bandwidths should give rise small MISEs. Thus, by minimizing (7), we obtain the asymptotic optimal bandwidth
.

In this section we discuss estimation of density function of *T* for the group *D* = 1 under missing group membership for some subjects. We assume that the covariates **x*** _{i}* and the variable of interest

The naive estimate would be simply applying the kernel smoothing estimate to the subgroup of those subjects who are known to be in the group defined by *V _{i}* = 1 and

(8)

This naive estimate is valid only under the very strong missing completely at random (MCAR) condition. In particular, it does not apply when the missingness of the group membership follows the missing at random (MAR) assumption.

Conditional on the observed outcomes *T* and covariates **x**, assume that the membership observation indicator *V* is independent with the exact membership *D*, i.e.

(9)

This MAR assumption is common in the literature of missing values. In the aforementioned diagnostic study examples, MAR is plausible if the decision of the administration of gold standard test depends on the observed test results and covariates. The inverse probability weighting (IPW) technique is commonly used to address missing values under MAR[5, 8]. Below, we apply this approach to address missing group membership in our setting.

Let *π _{i}* = Pr(

(10)

Since *V _{i}* = 0 for those with unknown

Assume that the selection probability is continuous in *T _{i}* and

(11)

where *c* is a constant. This condition is necessary for the estimates to have good behaviors, since otherwise we may have some very large weights, yielding unstable estimates. Under the conditions (1),(2), and (4), and (11), we have the following

The asymptotic bias and variance of are

(12)

where *p* = Pr(*D _{i}* = 1) is the proportion of the group with

It is clear that we obtain the same bias as the complete case, but with a larger variance. Under MCAR, *π _{i}* is a constant and hence
. The asymptotic variance in this special case is
, which is consistent with the complete data case, since the actual sample size in this case is

In most studies other than those based on two-stage designs, the probabilities *π _{i}* are not known. For example, although physicians in our setting may make the decision for SCID assessment based on the subject’s demographic, history of mental health and screening test results, it is quite rare that they make their decisions by modeling

Since the indicator of observed group membership is binary, we can model *π _{i}* using logistic regression:

(13)

To simplify the notation, we have subsumed the constant term of the logistic regression as well as *T* into the vector of covariates **x**. Given the above model, we can readily estimate *β.*

In particular, the MLE of *β* can be obtained by solving the following score equations:

(14)

Note that unlike the density estimate, all subjects (whether *D _{i}* is observed or not) are used for estimating

Denote the estimated probabilities of being selected for verification by . By substituting the estimates into (10), we obtain the following IPW estimate with weight based on modeling of the missing mechanism (IPW):

(15)

Under the conditions (1),(2), and (4), and (11) and (14), we have the following

The asymptotic bias and variance of are

(16)

where .

Comparing Theorems 1 and 2, we see that and have the same asymptotic bias and variance. These are the direct consequences of the following lemma which gives the asymptotic distributions of the respective estimates and . A proof of the lemma is provided in the appendix.

Under the conditions (1),(2), and (4), and (11) and (14). For fixed *h,* we have (a)

(17)

where , and (b)

(18)

where
**, c**_{1} = *E* [(1 − *π _{i}*)

It is straightforward to prove Theorem 1 based on the fact that
and (17). Theorem 2 is based on (18) and the fact that
. It should be pointed out that although the expression for the variance has extra terms, IPW with estimated missing probabilities in general has slightly better behavior than IPWK, even when the selection probabilities *π _{i}* are known (see the simulation study in Section 5 for details). Note that a similar phenomenon in regression analysis is well known.

It is clear from Theorems 1 and 2 that the behaviors of the estimates are closely related to the bandwidth *h* used. If *h* is too small, then the estimates are not stable. On the other hand, if *h* is too big, the bias can be large. We must trade o3 between bias and variance in selection of the bandwidths. As in the complete data cases, we can assess the qualities of the estimates by their mean squared errors. Based on Theorems 1 and 2, we immediately have the following

The MSE of and at point *t* both equal to

(19)

Since MSE = Bias^{2} +Variance, the proof is straightforward.

The MSE assesses the behavior of the estimate at a single point *t*, with a smaller MSE indicating better fit. Asymptotically optimal bandwidth at a point *t* can be obtained by minimizing the corresponding MSE. To assess the behavior of the estimate with a common bandwidth over the entire range, we need to assess the integrated MSE (MISE), which is the integration of MSE over the whole range. Following from Theorem 2, we have:

The MISE for and are given by

(20)

where .

As in the case of MSE, the smaller the MISE the better the fit. Optimal bandwidths can be selected by minimizing the MISE (20). It is easy to see that the minimum can be achieved, asymptotically, when . Thus, the asymptotic optimal bandwidth is given by:

(21)

In the formula for the optimal bandwidth (21), *R*(*K*) and *μ*_{2}(*K*) can be easily computed. For the Epanechnikov kernel function

we have *μ*_{2} (*K*) = 1 and
. The term *c*_{0} is related to the missing rate, and it can be estimated roughly by the sample missing rate. The computation of *R*(*f*″) is a bit involved since it involves the estimate of *f*″. One approach is to use the normal distribution as a basis for bandwidth selection. If *f* follows a normal distribution with variance *σ*^{2}, the corresponding *R*(*f*″) ≈ 0.212 *σ*^{4}. Under this approach, we estimate first the variance of the variable to obtain an estimate of *σ* and then substitute 0.212 *σ*^{4} in the place of *R*(*f*″) in (21) to obtain an estimate of the optimal bandwidth. We may further estimate *f*″ using the bandwidth we obtained based on the normal distribution approach, and then substitute the estimate of *R*(*f*″) in (21) to obtain an estimate of the optimal bandwidth.

Simulation studies were performed in a couple of different scenarios to assess the behaviors of the estimates. We considered the situation where *T* is a binormal, i.e., *T* is normally distributed in both groups. However, to focus on a finite range, we used a normal distribution truncated on [−1*,* 1]. More precisely, *T*|*D* = 1 (*T*|*D* = 0) follows standard normal *N*(0*,* 1) (*N*(1*,* 1)) truncated on [−1*,* 1]. In addition, the missingness of group membership *D* depends only on *T*, i.e., no other covariates **x*** _{i}* were involved.

We assumed a missing probability model based on logistic regression, i.e.,
. Following this model, *β*_{0} can used to control the missing rate, while *β*_{1} indicates the degree of deviation from MCAR. In the simulation study, the proportion of the first group in the whole population was fixed at .5, i.e., *p* = .5. The Epanechnikov kernel function was used for density estimates.

As pointed out in [1], sample sizes as large as 1000 are common in diagnostic test studies, and asymptotic theory can be applied. In the simulation study, to assess the behaviors of the estimates, especially how they would change with bandwidth, under small to moderate small sample sizes, we set the sample size at 200. Note also that under our setting, there are roughly 100 in each group, however, memberships of only part of them are confirmed.

Setting *β*_{0} = 0 and *β*_{1} = 0.5*,* we computed the integrated squared bias, integrated variance, as well as the MISE for bandwidths that varied from 0.01 to 1 based on a Monte Carlo size of 1000. Shown in Figure 1 are the MISEs as a function of bandwidth for the four estimates. When the bandwidth increased, the bias increased, while the variance decreased. Both IPW estimates, whether based on the known or estimated selection probabilities, behaved better as compared to the naive methods. The IPW approaches had a comparable amount of bias as compared to the estimate based on the complete data. But the naive method had a much larger bias. The variances of the IPW methods and those of the naive methods were comparable, although as expected the estimate based on the complete data had a smaller variance. Similar to the regression setting, density estimates obtained based on estimated probabilities behaved slightly better than those based on the true (known in simulation) probabilities, although the difference was small. Since *μ*_{2}(*K*) = 1, *R*(*K*) = .6, the variance of the standard normal truncated by [−1,] is
, where and Φ are the density and cumulative distribution functions of the standard normal. It follows that the bandwidth suggested by (20) would be
. Based on the simulation results, the minimum of the MISE was achieved around *h* = 0.335 for the IPW estimates, confirming that the formula (20) is reliable for computing the optimal bandwidth.

From Figure 1, we see that the naive method produces larger biases as compared to the IPW estimates, as the former did not address MAR. To assess the effect of missing data mechanism, we considered different values of *β*_{1}. By fixing *β*_{0} = 0, but varying *β*_{1} from 0 to 1, the missing mechanism changed from MCAR (*β*_{1} = 0) to MAR with larger *β*_{1} indicating more deviations from MCAR. Since the missing rate was roughly 50%, we obtained from (21) a rough estimate 0.3 as the asymptotic optimal bandwidth by substituting 2 for *c*_{0} in (21) and computing *R*(*f*″) under the known distribution with the sample size 200.

The following plot contains the mean of the smoothed curves with a Monte Carlo size of 1000. The mean curves for the IPW approaches are almost identical to that of the estimate based on the complete data. It is clear from the plot that as the missingness deviates from MCAR, the amount of bias from the naive methods increased. However, the IPW methods still provided good estimates.

We now illustrate our proposed methodology using the baseline data from a real longitudinal study on depression in elderly patients (age 65 and over) recruited from primary-care practices in Monroe County, New York. In addition to depression status determined by SCID, other information collected include demographic variables, the Hamilton Depression Rating Scale (HAM-D) for depression, a 24-item observer-rated scale designed to measure the severity of depressive symptoms [16], and the total score on the Cumulative Illness Rating Scale (CIRS), a reliable and valid measure of medical burden that quantifies the amount of pathology in each organ system [6]. Among the 708 patients enrolled, 249 patients were classified as having depression and the remaining 459 patients were declared as depression-free. Although the total score of HAM-D is inherently discrete, it was often treated as a continuous outcome because of its large range. In this example, we will estimate its density function among depressed patients.

Data for both the SCID and the HAM-D were collected from all participating patients in this study; therefore, similar to [4] we used a subset that resembled data that would be obtained from a two-phase design. Hence, we can assess the behaviors of the estimates by comparing them to the estimates based on the complete data. In this subset, the HAM-D results were available for all patients, but the SCID diagnoses were available only for certain patients selected according to the following mechanism:

(22)

Thus, the verification mechanism preferentially selected the patients who were under the age of 75, with a relatively high cumulative illness burden. Using this mechanism, 394 of the 708 patients (55.6%) were selected for SCID verification of the depression diagnosis. A logistic regression model with age and CIRS as the predictors was used to model the missing data mechanism.

The true density function is not known in this real data example; however, since there is no missing values in HAM-D and SCID in the data set, we used the estimate based on the complete data set as a reference in assessing the quality of the estimate under missing data. To model the missing data mechanism, we assumed that the missingness was related to CIRS and age, and applied the following generalized linear model with a logistic link to estimate the relationship:

Thus, the model used for the missing mechanism is not exactly the one we used to generate the missing values (the predictors are not dichotomized in the regression).

Shown in the plot below are the estimates of the density using the complete data set of all subjects who were depressed (complete), naive estimate (naive), and IPW estimates (IPW for IPW with estimated missing probabilities and IPWK for IPW with known missing probabilities). The IPW estimates are in general closer to, as compared with the naive estimate, the estimate based on the complete data. Note that although the estimated weight function was not based on the exact model for generating the missing values, bias was greatly reduced under the proposed approach (comparing to the naive method). It is also interesting to note that the IPW estimate using the known missing data probabilities (IPWK) is not as good as the one based on the estimated weight (IPW). Such a difference is also observed in various other settings involving IPW estimates, including parametric regressions[12] and U-statistic estimates[4].

In this paper we generalized the kernel smoothing density estimates for diagnostic test outcomes when the true disease status is subject to missing. Through the study, it is clear that the naive approach that ignores the missing values of disease status and uses only those with known disease membership is not valid. When the missing mechanism is well understood and characterized by the MAR mechanism, the proposed methodology works well in reducing the bias. Note that the information in *X _{i}* are only used in the modeling of the missing mechanism (

There are some problems left unanswered for our methods. Most notable is the study of boundary effects. It is well known that many smoothing methods do not behave well without adjustment near the boundaries. Our approach is no exception, as indicated by the relatively poor behaviors of the estimates near the boundaries. Further investigation is necessary to address such biases.

Another aspect that needs future study is the effect of MAR on the estimates. Although MAR is commonly assumed in the literature and satisfied by many studies in practice, missing not at random (MNAR) may arise in some studies. When enough information is collected (via covariates) and included in modeling the missingness of disease status, the MAR model may be approximately true. But, will the MAR assumption affect the behavior of the estimates, and if so, to what extent? More generally, how do we deal with MNAR? All these need future studies.

This research was supported in part by Award Number R21 DA027521-01 from the National Institute On Drug Abuse. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute On Drug Abuse or the National Institutes of Health.

The authors would also like to thank Jeffrey M. Lyness, M.D. for providing the data used in Section 6.

In this appendix we give a proof of Lemma 3 using estimating equation techniques. Note that under MAR, we have

(23)

and

(24)

When the parameters of the missing mechanism, *β,* is known and the true value is used, we can obtain the estimate (*t*) by solving the following estimating equations

(25)

If *β* is estimated from the data using the logistic model (14), the estimate (*t*) can be obtained by solving the system of estimating equations consisting of (25) and (14), and we can use the technique of stacking estimating equations. Since these estimating equations are unbiased, the estimates (*t*) and (*t*) are consistent estimates of *f _{h}*(

Let
, and
, then
where the asymptotic variance *σ*^{2} equals to the (1, 1) term of the matrix *A*^{−1}*BA*^{−}* ^{T},* where
and

Let
, and
, then by simple matrix computation, *σ*^{2} = *p* (*ef*^{2} − 2*gf* + *s*). It is straightforward to verify that
.

Let Ψ_{3} = (*V _{i}* −

Let **c**_{1} = *E* [(1 − *π _{i}*)

Through tedious algebraic computation it can be proved that

**Publisher's Disclaimer: **This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

1. Alonzo TA, Pepe MS, Lumley T. Estimating disease prevalence in two-phase studies. Biostatistics. 2003;4:313–326. [PubMed]

2. Carroll RJ, Ruppert D, Stefanski LA. Measurement Error in Nonlinear Models. London: Chapman and Hall; 1995.

3. Chaudron LH, Szilagyi PG, Tang W, Anson E, Talbot NL, Wadkins HIM, Tu X, Wisner KL. Accuracy of Depression Screening Tools for Identifying Postpartum Depression Among Urban Mothers. Pediatrics. 2010:e609–e617. [PMC free article] [PubMed]

4. He H, Lyness JM, McDermott MP. Direct estimation of the area under the receiver operating characteristic curve in the presence of verification bias. Stat Med. 2009;28(3):361–376. [PMC free article] [PubMed]

5. Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association. 1952;47:663–685.

6. Linn BS, Linn MW, Gurel L. Cumulative illness rating scale. Journal of the American Geriatrics Society. 1968;16:622–626. [PubMed]

7. Parzen Emanuel. On estimation of a probability density function and mode. Ann Math Statist. 1962;33:1065–1076.

8. Robins JM, Rotnitzky A. Semiparametric efficiency in multivariate regression models with missing data. Journal of the American Statistical Association. 1995;90:122–129.

9. Rotnitzky A, Faraggi D, Schisterman E. Doubly robust estimation of the area under the receiver-operating characteristic curve in the presence of verification bias. Journal of the American Statistical Association. 2006;101:1276–1288.

10. Simonoff JS. Smoothing Methods in Statistics. Springer; New York: 1996.

11. Spitzer RL, Gibbon M, Williams JBW. Structured Clinical Interview for Axis I DSM-IV Disorders. Biometrics Research Department, New York State Psychiatric Institute; 1994.

12. Tsiatis Anastasios A. Springer Series in Statistics. Springer; New York: 2006. Semiparametric theory and missing data.

13. Wand MP, Jones MC. Monographs on Statistics and Applied Probability. Vol. 60. Chapman and Hall Ltd; London: 1995. Kernel smoothing.

14. Wang Qihua. Probability density estimation with data missing at random when covariables are present. J Statist Plann Inference. 2008;138(3):568–587.

15. White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50(1):1–25.

16. Williams JBW. A structured interview guide for the Hamilton Depression Rating Scale. Archives of General Psychiatry. 1988;45:742–747. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |