To get a better sense of how heterogeneous data can affect the CLO method, we considered the following simulation. We constructed training, validation, and test sets with the same number of diseased and nondiseased patients as in our data (see the supplementary material available at Biostatistics online) and with 1000 cells per patient. The patient distributions were modeled using a mixture of normal distributions with 3 components with modes at μ = (1, 2, 3). The mode at 1 models the normal cells, the mode at 2 models the cycling cells, and the mode at 3 models the aneuploid cells. The variances were set as σ₁² = (0.1, 0.1, 0.4) for class 1 and σ₀² = (0.1, 0.1, 0.9) for class 0, and the weights were generated using the stick-breaking approach proposed by Sethuraman (1994), using independent beta random variables. We chose the parameters for the beta distributions to emulate our data, giving more weight to the component with mode around one, and having a larger variance for the class 0 third component than the class 1 third component. The weights of the second and third components of the mixture of normal distributions were drawn from a beta(10, 40) for the class 1 data and a beta(1, 4) for the class 0 data.
The LACLO and the CLO methods used the training data to build the model and were optimized by choosing the parameters that resulted in the largest partial AUC in the validation set. The optimal parameters for the LACLO method were K0 = 7, K1 = 10, ψ = 0, and h = 0.089. The optimal bandwidth for the CLO method was 0.01 times Silverman's rule-of-thumb bandwidth for the density of each class. The model was then trained on the combined training and validation sets using these parameters and applied to the test set to get an unbiased estimate of the performance of the methods. The accompanying figure shows the ROC curves of the LACLO and CLO methods applied to the test data. The LACLO method outperformed the CLO method. The difference in the partial AUC between the LACLO and CLO methods on the test set was statistically significant using 10,000 bootstrap samples (p < 0.001; see Section B.3 of the supplementary material available at Biostatistics online).
Summary of estimated sensitivities for detecting high grade squamous intraepithelial lesion or worse versus low grade squamous intraepithelial lesion or better on the test set, using the threshold that gave 90% specificity in the validation set
Fig. 2. ROC plot of the latent-class CLO method versus the CLO method in a simulation example.
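As an illustration only, the data-generating step of the simulation described above could be sketched in Python as follows. The sketch assumes one particular reading of the stick-breaking construction, which the text does not spell out: one beta draw sets the weight of the second component, a second independent beta draw takes its share of the remaining stick for the third component, and the first component receives what is left. The names MODES, VARIANCES, BETA_PARAMS, stick_breaking_weights, and simulate_patient are hypothetical and are not taken from the study code.

```python
import random

# Values quoted in the text; the dictionary keys are the class labels (1 and 0).
MODES = (1.0, 2.0, 3.0)  # normal, cycling, and aneuploid cells
VARIANCES = {1: (0.1, 0.1, 0.4), 0: (0.1, 0.1, 0.9)}
BETA_PARAMS = {1: (10.0, 40.0), 0: (1.0, 4.0)}


def stick_breaking_weights(a, b, rng):
    """Return (w1, w2, w3) from two independent Beta(a, b) draws.

    Assumption: the draws are assigned to components 2 and 3 in that order,
    and component 1 keeps the remainder of the unit stick.
    """
    v2 = rng.betavariate(a, b)
    v3 = rng.betavariate(a, b)
    w2 = v2
    w3 = v3 * (1.0 - v2)
    w1 = 1.0 - w2 - w3
    return (w1, w2, w3)


def simulate_patient(label, n_cells=1000, rng=None):
    """Simulate the DNA index of n_cells cells for one patient of a class."""
    if rng is None:
        rng = random.Random()
    a, b = BETA_PARAMS[label]
    weights = stick_breaking_weights(a, b, rng)
    sds = [v ** 0.5 for v in VARIANCES[label]]
    # Pick a mixture component for every cell, then draw from that normal.
    components = rng.choices(range(3), weights=weights, k=n_cells)
    cells = [rng.gauss(MODES[k], sds[k]) for k in components]
    return weights, cells
```

Under this reading the class 1 weights average about (0.64, 0.20, 0.16), so most simulated cells fall near a DNA index of 1, in line with the stated aim of giving more weight to the first component.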
The CLO method was applied to the real data from our study using the DNA index as a predictor. Density estimates were computed for each disease class (0 and 1) from the training data. The prior prevalence of disease was 14%, estimated from the training data. The uniform distribution mixing parameter ψ was varied from 0 to 0.02 (10 evenly spaced points). The bandwidth was varied by multiplying Silverman's bandwidth (Silverman, 1986) by a constant ranging from 0.2 to 20 (11 values), separately for the normal and diseased patients. We then computed the posterior log-odds for each ψ, bandwidth adjustment, and range choice. The accompanying figure shows the ROC curve of CLO using the range [0,5], ψ = 0, and bandwidth factor 10. The best classifier we were able to find with CLO had an AUC of 0.71, a partial AUC of 0.41, and 47% sensitivity at 90% specificity.
Fig. 3. ROC curve of the CLO and LACLO methods using the DNA index on validation data.
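The posterior log-odds computation described above can be sketched roughly as follows. This section does not restate the CLO formula, so the sketch assumes that a patient's score is the prior log-odds plus the sum over that patient's cells of the log ratio of the two class density estimates, with cells treated as independent. It also assumes a Gaussian kernel and that each density estimate is mixed with a uniform density on the chosen range with weight ψ. The function names gaussian_kde, mix_with_uniform, and posterior_log_odds are hypothetical.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a Gaussian kernel density estimate built from pooled cells."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density


def mix_with_uniform(density, psi, lo, hi):
    """Mix a density with a uniform density on [lo, hi] using weight psi."""
    uniform_height = 1.0 / (hi - lo)

    def mixed(x):
        inside = uniform_height if lo <= x <= hi else 0.0
        return (1.0 - psi) * density(x) + psi * inside

    return mixed


def posterior_log_odds(cells, f0, f1, prevalence):
    """Prior log-odds plus the cell-wise log density ratios (assumed form).

    With psi = 0 a cell far from all training cells can give a density of
    zero, in which case math.log raises ValueError.
    """
    score = math.log(prevalence / (1.0 - prevalence))
    for x in cells:
        score += math.log(f1(x)) - math.log(f0(x))
    return score
```

In this sketch the uniform component keeps both densities strictly positive inside the range, which is one plausible motivation for the mixing parameter ψ.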
In the first step of the optimization process, we chose between 1 and 10 groups in each disease state to calculate the LACLO score. We set the value of the bandwidth parameter, h, to be equal to 0.086 using Silverman's rule of thumb for the patient with the largest number of cells. The uniform distribution mixing parameter ψ was set to be equal to 0.01. The range rng was set to be either the interval [0,5] or the whole sample range of the training data. Using the whole range of DNA index showed an improvement over the range [0,5]. It is not surprising that the LACLO method worked well with the whole range of DNA index since LACLO can assign a cluster to the cases that had cells with a DNA index higher than 5.
The best partial AUC was 0.53 using 4 clusters in the nondiseased group and 5 clusters in the diseased group, the whole range, and the identity transformation.
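For readers who want to reproduce a bandwidth of this kind, a minimal sketch of Silverman's rule of thumb follows. The text does not say which variant of the rule was used; the sketch assumes the common form 0.9 · min(sd, IQR/1.34) · n^(−1/5), and the names silverman_bandwidth and scaled_bandwidths are hypothetical. The second helper mirrors the idea of multiplying the rule-of-thumb bandwidth by a set of constants.

```python
import statistics


def silverman_bandwidth(values):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n ** (-1/5)."""
    n = len(values)
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    # Fall back to the standard deviation when the IQR collapses to zero.
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-0.2)


def scaled_bandwidths(values, factors):
    """Multiply the rule-of-thumb bandwidth by each adjustment factor."""
    h = silverman_bandwidth(values)
    return [factor * h for factor in factors]
```

Applied to the DNA index values of a single patient, silverman_bandwidth would return one candidate value of h; the exact figure depends on the quantile convention, so it need not match the 0.086 reported above.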
In the second step of the optimization process, we fixed K0 = 4, K1 = 5, rng = the training data sample range, used the identity transformation, and optimized over the ψ and h parameters. The bandwidth h ranged from 0.05 to 0.12 and the amount to mix with a uniform distribution ψ ranged from 0 to 0.02. We maximized the partial AUC in the validation data using parameters h = 0.081 and ψ = 0 (sensitivity of 65% at 90% specificity).
At 90% specificity, this represented an 18 percentage-point increase in sensitivity compared to the CLO method (65% versus 47%). The ROC curve is presented in the accompanying figure. Note that the LACLO method with one cluster in each disease state is equivalent to CLO.
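The operating point quoted above (sensitivity at 90% specificity) can be computed from patient-level scores as in the following sketch. It assumes that a patient is called positive when the score is strictly above the threshold, and that the threshold is the smallest observed nondiseased score that leaves at least the target fraction of nondiseased patients at or below it. The function names are hypothetical.

```python
import math


def threshold_for_specificity(nondiseased_scores, target_specificity=0.90):
    """Smallest observed score giving at least the target specificity."""
    ordered = sorted(nondiseased_scores)
    # The small offset guards against floating-point error in the product.
    k = math.ceil(target_specificity * len(ordered) - 1e-9)
    k = max(k, 1)
    return ordered[k - 1]


def specificity(nondiseased_scores, threshold):
    """Fraction of nondiseased patients scored at or below the threshold."""
    n = len(nondiseased_scores)
    return sum(s <= threshold for s in nondiseased_scores) / n


def sensitivity(diseased_scores, threshold):
    """Fraction of diseased patients scored strictly above the threshold."""
    n = len(diseased_scores)
    return sum(s > threshold for s in diseased_scores) / n
```

Fixing the threshold on one data set and then evaluating sensitivity and specificity on another is what keeps the second estimate free of selection bias.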
The accompanying figure shows the density estimates for the 4 clusters used in the nondiseased class and the 5 clusters in the diseased class. The graphs are restricted to DNA indices between 1.25 and 5 since the density estimates were most different from each other in this range. There was much more variation among the density estimates in the diseased group than among those in the nondiseased group. The first cluster represents the patients who apparently had more cycling cells (between DNA indices 1 and 2), possibly due to the Hayflick limit where a high grade squamous intraepithelial lesion is likely to harbor a large aneuploid population around 1.7 as described by Rasnick (2000). The second disease cluster had the greatest proportion of cells with DNA index around 2 as well as many cells with a DNA index of around 4. The diseased and nondiseased representative density estimates were very different in appearance.
Fig. 4. Estimated density (on a limited range) of clusters in the nondiseased and diseased groups.
Using the posterior log-odds threshold that gave 90% specificity in the validation set, we applied the methods to the final test set to obtain a final unbiased estimate of their performance, shown in the summary table of estimated sensitivities. To compare 2 ROC curves in the test set, we constructed bootstrap confidence intervals of the difference between the partial AUCs (Dodd and Pepe, 2003). The LACLO method's sensitivity decreased from 66% to 52%, yet the specificity increased slightly. On the test set, the LACLO ROC curve was very similar to that of clinical cytology. The LACLO method had an AUC of 0.76 versus 0.71 from CLO (p = 0.075). The partial AUC was statistically significantly higher for the LACLO method (0.47 vs. 0.41, p = 0.01). The ROC curves of the CLO and LACLO methods are shown in the accompanying figure.
Fig. 5. ROC plot of the CLO and latent-class CLO methods on test data using the DNA index. The LACLO method used 4 clusters in the normal patients and 5 clusters in the diseased patients, ψ = 0, and h = 0.081.
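A bootstrap comparison of two partial AUCs of the kind reported above could be sketched as follows. Several details are assumptions rather than the authors' procedure: the partial AUC is taken as the unnormalized area under the empirical ROC curve for false-positive rates up to max_fpr (a hypothetical parameter, since this section does not state the range), patients are resampled with replacement within each disease class, both methods are scored on the same resampled patients, and a percentile interval is reported. The function names are hypothetical.

```python
import random


def partial_auc(case_scores, control_scores, max_fpr):
    """Area under the empirical ROC curve for FPR between 0 and max_fpr.

    Each of the highest-scoring controls contributes the fraction of cases
    scoring above it (ties count one half), times a step width of 1/n_controls.
    """
    n_top = int(max_fpr * len(control_scores) + 1e-9)
    top_controls = sorted(control_scores, reverse=True)[:n_top]
    wins = 0.0
    for c in top_controls:
        for x in case_scores:
            if x > c:
                wins += 1.0
            elif x == c:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def bootstrap_pauc_difference(cases, controls, max_fpr, n_boot=2000, seed=0):
    """Percentile bootstrap interval for pAUC(method A) - pAUC(method B).

    cases and controls are lists of (score_a, score_b) pairs, one pair per
    patient, so that each resample keeps the two scores of a patient together.
    """
    rng = random.Random(seed)

    def difference(cs, ns):
        a = partial_auc([s[0] for s in cs], [s[0] for s in ns], max_fpr)
        b = partial_auc([s[1] for s in cs], [s[1] for s in ns], max_fpr)
        return a - b

    observed = difference(cases, controls)
    draws = sorted(
        difference(
            rng.choices(cases, k=len(cases)),
            rng.choices(controls, k=len(controls)),
        )
        for _ in range(n_boot)
    )
    lower = draws[int(0.025 * n_boot)]
    upper = draws[min(n_boot - 1, int(0.975 * n_boot))]
    return observed, (lower, upper)
```

An interval that excludes zero would point to a difference between the two methods at the 5% level; the default of 2000 resamples is only a convenience and can be raised, for example to the 10,000 used in the simulation.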