
J Am Stat Assoc. Author manuscript; available in PMC 2009 October 1.

Published in final edited form as:

J Am Stat Assoc. 2008 March 1; 103(481): 61–73.

doi: 10.1198/016214507000000329

PMCID: PMC2755302

NIHMSID: NIHMS140945

Paul S. Albert, Biometric Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD 20892 (Email: albertp@mail.nih.gov).


We are often interested in estimating sensitivity and specificity of a group of raters or a set of new diagnostic tests in situations in which gold standard evaluation is expensive or invasive. Numerous authors have proposed latent modeling approaches for estimating diagnostic error without a gold standard. Albert and Dodd showed that, when modeling without a gold standard, estimates of diagnostic error can be biased when the dependence structure between tests is misspecified. In addition, they showed that choosing between different models for this dependence structure is difficult in most practical situations. While these results caution against using these latent class models, the difficulties of obtaining gold standard verification remain a practical reality. We extend two classes of models to provide a compromise that collects gold standard information on a subset of subjects but incorporates information from both the verified and nonverified subjects during estimation. We examine the robustness of diagnostic error estimation with this approach and show that choosing between competing models is easier in this context. In our analytic work and simulations, we consider situations in which verification is completely at random as well as settings in which the probability of verification depends on the actual test results. We apply our methodological work to a study designed to estimate the diagnostic error of digital radiography for gastric cancer.

Diagnostic and screening tests are important tools of modern clinical decision making. These tests help to diagnose illness to initiate treatment (e.g., a throat culture for streptococcal infection) or to identify individuals requiring more extensive follow-up (e.g., mammography screening for breast cancer). Estimation of sensitivity and specificity, measures of diagnostic accuracy, requires knowledge of the true disease state, which is assessed by a gold or reference standard. (Throughout, we use both “gold standard” and “reference standard” to mean the accepted standard for diagnosis.) Gold standard evaluation may be expensive, time consuming, or unethical to perform on all subjects and is commonly difficult to obtain in clinical studies. Latent class models offer a tempting alternative because assessment of the true status is not necessary. However, it has been shown that latent class models for estimating diagnostic error and prevalence may be problematic in many practical situations (Albert and Dodd 2004). Specifically, they showed that, with a small number of tests, estimates of diagnostic error were biased under a misspecified dependence structure, yet in many practical situations it was nearly impossible to distinguish between models based on the observed data. The lack of robustness of these models is problematic; however, the limitations of obtaining gold standard are a practical reality and reasonable alternatives are desirable.

Although it may be difficult to obtain the gold standard on all subjects, in many cases it may be feasible to obtain gold standard information on a fraction of subjects (partial gold standard evaluation). In radiological studies, for example, gold standard evaluation usually requires multiple radiologists simultaneously examining images and clinical information, so collecting the gold standard on all subjects may be infeasible for many studies, whereas collecting it on a fraction of study subjects may be feasible. Thus, methodological approaches that incorporate partial gold standard information may be an attractive alternative to latent class modeling.

Our application is a medical imaging study to compare conventional and digital radiography for diagnosing gastric cancer (Iinuma et al. 2000). In this study six radiologists evaluated 225 images on either conventional (*n* = 112) or digital (*n* = 113) radiography, to compare the sensitivity and specificity across techniques and radiologists. A gold standard evaluation was obtained from three independent radiologists simultaneously reviewing clinical information along with all imaging data to provide a reference truth evaluation of the image. Specifically, these radiologists reviewed clinical information such as patient characteristics, chief symptoms, purposes of the examination, endoscopic features, and histologic findings in biopsy specimens. This time-consuming consensus review was done on all 225 images, although this may not be feasible in larger studies or in other studies with more limited resources. Rater-specific as well as overall sensitivity and specificity were estimated by treating the consensus review by the three independent radiologists as gold standard truth. Our methodological development will focus on the data from this study.

Although our primary example is in radiology, the problem occurs more generally in medicine. For example, similar problems exist for the evaluation of biomarkers in which one wishes to compare the diagnostic accuracy of a series of tests, where a gold standard exists, but is very expensive. See, for example, Van Dyck et al. (2004), in which a set of tests for herpes simplex virus type 2 (HSV-2) was compared, but only a subset of samples was verified with the reference standard Western blot.

In this article we extend two classes of models, originally proposed for modeling diagnostic error on multiple tests without a gold standard (Albert and Dodd 2004), to the situation of estimating diagnostic error for a partially verified design. We examine the robustness of these models to the assumed dependence structure between tests. In particular, we examine bias and model selection using asymptotic results and simulation studies. We examine whether observing gold standard information on a small percentage of cases improves the lack of robustness to assumptions on the dependence between tests found when modeling without a gold standard. In Section 2 we describe our approach, which considers various models for the dependence between tests. In Section 3 we fit the various classes of models to the gastric cancer dataset and show that the results are quite different when we use the reference standard evaluation or when we model without a gold standard. In Section 4 we investigate the asymptotic bias from misspecifying the dependence structure under full as well as partial reference sample evaluation. Simulations examining the finite sample properties of partial reference sample verification are described in Section 5. We illustrate the effect of partial reference sample verification using the gastric cancer dataset in Section 6. A discussion follows in Section 7 in which we make general recommendations.

Let **Y*** _{i}* = (

$$L_{i}=\left[P(\mathbf{Y}_{i}\mid d_{i})\,P(d_{i})\right]^{v_{i}}\left[\sum_{l=0}^{1}P(\mathbf{Y}_{i}\mid d_{i}=l)\,P(d_{i}=l)\right]^{1-v_{i}},$$

(1)

where *P*(*d _{i}* = 1) is the disease prevalence, which will be denoted by

There are three types of verification processes. First, consider verification that is completely at random, which occurs if the verification process is a simple random sample chosen independently from the test results **Y*** _{i}*. The proportion of individuals verified is denoted by

We consider two different ways to specify *P*(**Y*** _{i}*|

$$P\left(Y_{ij}=1\mid d_{i},l_{id_{i}}\right)=\begin{cases}1 & \text{if } d_{i}=1 \text{ and } l_{i1}=1\\ 0 & \text{if } d_{i}=0 \text{ and } l_{i0}=1\\ \omega_{j}(1) & \text{if } d_{i}=1 \text{ and } l_{i1}=0\\ 1-\omega_{j}(0) & \text{if } d_{i}=0 \text{ and } l_{i0}=0,\end{cases}$$

(2)

where *ω _{j}*(

Depending on the application, the FM or the GRE model may better describe the dependence structure between tests. Both models need to be compared with a simple alternative, which is nested within both of these conditional dependence models. The conditional independence (CI) model, which assumes the tests are independent given the true disease status, provides such an alternative. The GRE model reduces to the CI model when *σ*_{0} = *σ*_{1} = 0, whereas the FM model reduces to the CI model when *η*_{0} = *η*_{1} = 0.
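As an illustration of how the likelihood contribution in (1) treats verified and nonverified subjects, the following is a minimal sketch for the CI model with a common sensitivity and specificity across tests. The function names and the data layout (a tuple of 0/1 ratings, a true status `d`, and a verification indicator `v`) are assumptions made for this example and are not taken from the article.

```python
import math

def ci_prob(y, d, sens, spec):
    """P(Y_i = y | d_i = d) under conditional independence (common sens/spec)."""
    p = 1.0
    for y_ij in y:
        p_pos = sens if d == 1 else 1.0 - spec  # P(Y_ij = 1 | d)
        p *= p_pos if y_ij == 1 else 1.0 - p_pos
    return p

def likelihood_contribution(y, d, v, prev, sens, spec):
    """L_i from (1): joint term if verified (v = 1), sum over d otherwise."""
    prior = {1: prev, 0: 1.0 - prev}
    if v == 1:
        return ci_prob(y, d, sens, spec) * prior[d]
    # Not verified: the true status is unknown, so marginalize over it.
    return sum(ci_prob(y, l, sens, spec) * prior[l] for l in (0, 1))

def log_likelihood(data, prev, sens, spec):
    """Sum of log L_i over subjects; data is a list of (y, d, v) triples."""
    return sum(
        math.log(likelihood_contribution(y, d, v, prev, sens, spec))
        for y, d, v in data
    )
```

For a nonverified subject the value of `d` is ignored, so it can be passed as `None`.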

For each of the models, estimation is based on maximizing $L=\prod_{i=1}^{I}L_{i}$, where *L*_{i} is given by (1). Standard errors can be estimated with the bootstrap (Efron and Tibshirani 1993).
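The bootstrap step can be sketched as follows. To keep the example short it uses the simplest setting, assumed here for illustration only: the CI model with common sensitivity and specificity under complete verification, where the maximum likelihood estimates have a closed form. The function names, the data layout, and the number of bootstrap replicates are hypothetical; the article's GRE and FM fits would require numerical maximization instead.

```python
import random
import statistics

def ci_mle_complete(data, n_tests):
    """Closed-form CI estimates (prev, sens, spec) with every subject verified.

    data: list of (y, d) pairs, y a tuple of 0/1 ratings, d the true status.
    """
    diseased = [y for y, d in data if d == 1]
    healthy = [y for y, d in data if d == 0]
    prev = len(diseased) / len(data)
    sens = sum(map(sum, diseased)) / (n_tests * len(diseased))
    spec = 1.0 - sum(map(sum, healthy)) / (n_tests * len(healthy))
    return prev, sens, spec

def bootstrap_se(data, n_tests, n_boot=500, seed=0):
    """Bootstrap standard errors, resampling subjects with replacement."""
    if {d for _, d in data} != {0, 1}:
        raise ValueError("data must contain diseased and nondiseased subjects")
    rng = random.Random(seed)
    draws = []
    while len(draws) < n_boot:
        sample = [rng.choice(data) for _ in data]
        if {d for _, d in sample} != {0, 1}:
            continue  # redraw: estimates are undefined without both groups
        draws.append(ci_mle_complete(sample, n_tests))
    # Standard deviation of each estimate across bootstrap replicates.
    return tuple(statistics.stdev(column) for column in zip(*draws))
```

Resampling whole subjects, rather than individual ratings, preserves whatever dependence exists between the tests on the same subject.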

We estimate the prevalence, sensitivity, and specificity of digital radiography for gastric cancer using the likelihood in (1) and the GRE, FM, and CI models, under both complete and no verification. Table 1 shows the overall estimates of prevalence, sensitivity, and specificity for digital radiography with the consensus measurements as a gold standard and with no gold standard. Estimates were obtained by assuming a common sensitivity and specificity across the six raters and were derived under the CI model, as well as the GRE and FM models. Bootstrap standard errors are also presented under each model. Interestingly, under complete verification, overall estimates of prevalence, sensitivity, and specificity, as well as their bootstrap standard errors, were nearly identical across the three classes of models. In addition, these estimates were identical to estimates obtained by Iinuma et al. (2000) using generalized estimating equations (Liang and Zeger 1986), a procedure known to be insensitive to assumptions on the dependence structure between tests. These results suggest that estimates of prevalence, sensitivity, and specificity are insensitive to the dependence structure between tests under complete verification. When no gold standard information is incorporated, estimates of prevalence and diagnostic error differ across models for the dependence between tests. This is consistent with results by Albert and Dodd (2004) who showed that diagnostic error estimation may be sensitive to assumptions on the dependence between tests when no verification is performed.

Table 1. Estimation of overall prevalence, sensitivity, and specificity for digital radiography using no gold standard (GS) and with the consensus rating as the gold standard

By the likelihood principle, we compare models based on their likelihood values. Using the gold standard, the log-likelihoods were −314.63, −300.36, and −305.45 for the CI, GRE, and FM models, respectively (the models have three, five, and five parameters, respectively). We compared the GRE and FM models with the CI model using a likelihood ratio test because the CI model is nested within both of these conditional dependence models. Because the parameters that characterize the conditional dependence are on the boundary (*σ*_{0} = *σ*_{1} = 0 for the GRE model and *η*_{0} = *η*_{1} = 0 for the FM model) under the null hypothesis corresponding to a CI model, the standard likelihood ratio theory is inappropriate (Self and Liang 1987). We conducted a simulation study to obtain the reference distribution under the null hypothesis by simulating 10,000 datasets under the estimated CI model and evaluating the likelihood ratio tests of *σ*_{0} = *σ*_{1} = 0 and *η*_{0} = *η*_{1} = 0 corresponding to the GRE and FM models, respectively. Based on the observed log-likelihoods and the simulated reference distribution, we reject the independence model in favor of the GRE and FM models (*P* < .001 for both models). Further, parameter estimates characterizing the conditional dependence under both conditional dependence models are sizable. For the GRE model, $\hat{\sigma}_{0}=1.1$ and $\hat{\sigma}_{1}=.37$, and for the FM model, $\hat{\eta}_{0}=.31$ and $\hat{\eta}_{1}=.38$. A comparison of the two nonnested GRE and FM models can be made by directly comparing the two log-likelihoods because both models have the same number of parameters. Under complete gold standard evaluation, this comparison clearly favors the GRE model.
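The arithmetic behind these comparisons can be sketched as follows, using the log-likelihoods reported above. The helper names are hypothetical, and the simulated null statistics would have to come from refitting the models to datasets generated under the estimated CI model, which is not reproduced here; the p-value is taken as a plain exceedance proportion, which is an assumption about the convention used.

```python
def lrt_statistic(loglik_null, loglik_alt):
    """Likelihood ratio statistic, twice the gain in log-likelihood."""
    return 2.0 * (loglik_alt - loglik_null)

def simulated_p_value(observed, null_stats):
    """Share of simulated null statistics at least as large as the observed one."""
    exceed = sum(1 for stat in null_stats if stat >= observed)
    return exceed / len(null_stats)

# Log-likelihoods under complete verification, as reported in the text.
LOGLIK_CI, LOGLIK_GRE, LOGLIK_FM = -314.63, -300.36, -305.45

lrt_gre = lrt_statistic(LOGLIK_CI, LOGLIK_GRE)  # 2 * 14.27
lrt_fm = lrt_statistic(LOGLIK_CI, LOGLIK_FM)    # 2 * 9.18
```

Because the null values lie on the boundary of the parameter space, these statistics are compared with the simulated reference distribution rather than with a chi-squared distribution.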

For the no gold standard case, the log-likelihoods for the CI, GRE, and FM models were −283.19, −280.16, and −280.30, respectively. Consistent with Albert and Dodd (2004), these results suggest that, although it is easy to distinguish between conditional dependence and a conditional independence model (likelihood ratio tests computed as described previously for complete verification showed evidence for conditional dependence; *P* values for the comparisons of the GRE and FM models relative to the CI model were .009 and .016, respectively), it may be difficult to choose between the two models for conditional dependence with no gold standard.

Table 2 shows rater-specific estimates of sensitivity and specificity, along with prevalence, for models that incorporate the gold standard information and those that do not. As with the overall estimates of sensitivity and specificity, individual rater estimates are nearly identical across models for the dependence between tests as well as to the rater-specific estimates presented in Iinuma et al. (2000). In contrast, estimates obtained using no gold standard information were highly model dependent and were very different from those estimates that used the gold standard information.

Table 2. Estimation of prevalence and rater-specific sensitivity and specificity for digital radiography with no gold standard (GS) and with the consensus rating as the gold standard

Thus, modeling approaches with complete verification appear to be more robust against misspecification of the dependence structure between tests, whereas approaches with no verification appear to lack robustness. A natural question is how the statistical properties of the estimation improve with an increasing proportion of gold standard evaluation. This will be the primary focus of this article. We discuss asymptotic and simulation results before returning to this example and varying the amount of verification. We focus on comparing the GRE and FM models because it was shown in Albert and Dodd (2004) that it is difficult to distinguish between these rather different models with no gold standard evaluation.

We examined the asymptotic bias when the dependence structure is misspecified as a function of the proportion of samples receiving gold standard evaluation. For simplicity, we examined this bias for the case when interest focuses on estimating a common sensitivity and specificity across raters (denoted as *SENS* and *SPEC*, respectively). We examined both verification that is completely at random and verification biased sampling. The misspecified maximum likelihood estimator for the model parameters, denoted by $\hat{\theta}$, converges to the value $\theta^{*}$, where

$$\theta^{*}=\underset{\theta}{\text{arg max}}\;E_{T}\left[\text{log}\,L(\mathbf{Y}_{i},\theta)\right]$$

(3)

and log *L*(**Y**_{i}, *θ*) is the individual contribution to the log-likelihood under the assumed model and the expectation is taken under the true model

$$E_{T}(\text{log}\,L_{M})=E_{T}\left[\text{log}\,L(\mathbf{Y}_{i},\theta)\right]\Big|_{\theta=\theta^{*}}$$

(4)

denotes the expectation (taken under the true model *T*) of an individual's contribution to the log-likelihood under the assumed model *M* when evaluated at $\theta^{*}$. Sensitivity and specificity are model-dependent functional forms of the model parameters,

First, we examined the case of completely at random verification (i.e., *r _{s}* =

Table 3 shows the results for various proportions of completely at random verification for five tests and a presumed constant sensitivity and specificity, with the true model being the FM model and the misspecified model being the GRE model. When we have no gold standard information (*r* = 0), there is serious bias under a misspecified dependence structure, and the expected individual contribution to the log-likelihood under the misspecified model is nearly identical (to more than six digits) to the expected log-likelihood under the correctly specified model, which is consistent with results reported in Albert and Dodd (2004). Thus, with no gold standard reference and with five tests, estimates of diagnostic error may be biased under a misspecified dependence structure, yet it may be very difficult to distinguish between models in most situations. As little as 2% gold standard verification (*r* = .02) reduces the bias considerably, and the expected log-likelihoods are no longer nearly identical, making it simpler to distinguish between models. With 20% verification, the bias is small. For complete verification (*r* = 1), marginal quantities such as sensitivity and specificity are nearly unbiased under a misspecified dependence structure. This is consistent with work by Tan, Qu, and Rao (1999) and Heagerty and Kurland (2001), who showed for clustered binary data that marginal quantities (which sensitivity, specificity, and prevalence are) are robust to misspecification of the dependence structure. The large differences in expected log-likelihoods suggest that it will be relatively simple to distinguish between models.

Table 3. Large-sample robustness of the assumed Gaussian random effects (GRE) model to the true dependence structure between tests given by the finite mixture (FM) model
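The kind of calculation behind these large-sample results can be sketched by enumerating all test patterns, computing the expected log-likelihood under the true model, and maximizing it over the assumed model's parameters. The sketch below makes several simplifying assumptions that are not the article's setup: the true model is the FM model in (2) with *η*_{d} read as the probability that *l*_{id} = 1, the assumed model is the simpler CI model rather than the GRE model (to avoid numerical integration), verification is completely at random with proportion `r`, and the maximization is a coarse grid search. All function names are hypothetical.

```python
import math

def fm_pattern_prob(k, n_tests, d, eta, omega):
    """P(one specific pattern with k positives | d) under the FM model (2)."""
    if d == 1:
        p_pos, forced = omega[1], (k == n_tests)   # l = 1: all tests positive
    else:
        p_pos, forced = 1.0 - omega[0], (k == 0)   # l = 1: all tests negative
    base = p_pos ** k * (1.0 - p_pos) ** (n_tests - k)
    return eta[d] * (1.0 if forced else 0.0) + (1.0 - eta[d]) * base

def ci_pattern_prob(k, n_tests, d, sens, spec):
    """Same pattern probability under the assumed CI model."""
    p_pos = sens if d == 1 else 1.0 - spec
    return p_pos ** k * (1.0 - p_pos) ** (n_tests - k)

def expected_loglik(theta, true, n_tests, r):
    """E_T[log L_i] for the CI model with random verification proportion r."""
    prev, sens, spec = theta
    t_prev, eta, omega = true
    total = 0.0
    for k in range(n_tests + 1):
        joint_t = {d: fm_pattern_prob(k, n_tests, d, eta, omega)
                   * (t_prev if d == 1 else 1.0 - t_prev) for d in (0, 1)}
        joint_m = {d: ci_pattern_prob(k, n_tests, d, sens, spec)
                   * (prev if d == 1 else 1.0 - prev) for d in (0, 1)}
        verified = sum(joint_t[d] * math.log(joint_m[d])
                       for d in (0, 1) if joint_t[d] > 0.0)
        unverified = sum(joint_t.values()) * math.log(sum(joint_m.values()))
        # math.comb(n_tests, k) patterns share the same number of positives.
        total += math.comb(n_tests, k) * (r * verified + (1.0 - r) * unverified)
    return total

def pseudo_true_value(true, n_tests, r, step=0.05):
    """Grid maximizer of the expected log-likelihood (sens, spec above .5)."""
    n_steps = int(round(1.0 / step))
    grid = [round(step * i, 10) for i in range(1, n_steps)]
    candidates = ((p, se, sp) for p in grid for se in grid for sp in grid
                  if se > 0.5 and sp > 0.5)
    return max(candidates, key=lambda th: expected_loglik(th, true, n_tests, r))
```

Under these assumptions the marginal sensitivity of the FM model is *η*_{1} + (1 − *η*_{1})*ω*(1). With `r = 1` the objective separates into binomial terms, so the maximizer sits at the true prevalence and the true marginal sensitivity and specificity even though the dependence structure is wrong; with smaller `r` the same function can be used to see how far the maximizer moves.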

Table 4 shows asymptotic bias with five tests when the true model is the GRE model and the misspecified model is the FM model. As in Table 3, there is substantial asymptotic bias under the misspecified model when there is no gold standard evaluation. In addition, the expected log-likelihood for the misspecified model is nearly identical to the expected log-likelihood for the correctly specified model, again showing the difficulty in choosing between competing models with no gold standard information with few tests. Similar to the results in Table 3, estimates of prevalence, sensitivity, and specificity are asymptotically unbiased under the misspecified model when there is complete gold standard evaluation (*r* = 1). Unlike the results in Table 3, a larger percentage of verification (about 50%) is necessary to achieve approximate unbiasedness. In both cases, however, a small percentage of verification results in different expected log-likelihoods under the true and misspecified models, suggesting that it is simpler to choose between competing models with even a small percentage of gold standard verification.

Table 4. Large-sample robustness of the assumed finite mixture (FM) model to the true dependence structure between tests given by the Gaussian random effects (GRE) model

Tables 3 and 4 provide an assessment of asymptotic bias under reciprocal model misspecification for both the FM and the GRE models when the sensitivity and specificity are .75 and .9, respectively. We also examined the relative asymptotic bias for a wide range of sensitivity and specificity (a grid ranging from values of .65 to .95 for both sensitivity and specificity) corresponding to the cases specified in these tables for *r* = .5. Figure 1 shows the results corresponding to the models and parameters described in Table 4 when *σ*_{1} = 3. Over the wide range of sensitivity and specificity, the maximum relative percent bias was 2.8% for sensitivity and 5.1% for specificity. Other scenarios provided similar results with all percent biases being less than 6% over the grid (data not shown).

Figure 1. Contour plot of relative asymptotic bias in sensitivity and specificity for 50% completely at random verification when the true model is a GRE model with *P*_{d} = .20, *σ*_{0} = 1.5, *σ*_{1} = 3, and *J* = 5. Relative asymptotic bias of sensitivity and **...**

We examined asymptotic bias of the FM and GRE models under alternative dependence structures. Specifically, we examined the asymptotic bias when the true conditional dependence structure *P*(**Y*** _{i}*|

For all three alternative models, we examined the bias in sensitivity and specificity of the GRE and FM models with 50% completely random verification over a wide range of sensitivity and specificity values (identical to the grid described for Fig. 1) for a prevalence of .20. Table 5 shows that the maximum relative asymptotic bias was less than 7% for both sensitivity and specificity for all three alternative models. Thus, estimates of diagnostic error appear to be quite robust with 50% completely random verification. When prevalence was very low or very high (e.g., below 5% or above 95%), there was more substantial bias under certain model misspecification with 50% completely random verification. For example, for a prevalence of .05 when the true model was the log-linear model, there was a maximum bias of 10% under a GRE model (as compared to a maximum bias of 4.3% for a prevalence of .20). However, unlike when there is no gold standard evaluation (*r* = 0), it is much easier to identify the better fitting model using likelihood and other criteria for model assessment. Further, for a rare disease, completely random verification would not generally be recommended due to efficiency considerations.

Table 5. Range in relative asymptotic bias for the GRE and FM models when the true conditional dependence structure is (i) a Bahadur model with all correlations of order 3 and higher equal to 0, (ii) a log-linear model with a three-way interaction, and (iii) a **...**

Although random verification is of concern in our application, we also consider verification biased sampling because it is so common. We examine asymptotic properties under a misspecified dependence structure with verification biased sampling. Table 6 shows asymptotic bias and expected log-likelihoods for the situation in which a random sample of cases *among those who test positive on at least one of the five tests* is verified (e.g., extreme verification biased sampling) and where the true model is the FM model and the misspecified model is the GRE model. Interestingly, these results suggest that, in some cases, an increase in the proportion verified can result in an increase in bias under the misspecified model. For example, when *η*_{1} = .5 and *η*_{0} = .2, the estimator of sensitivity is only slightly asymptotically biased (*SENS** = .77) with no gold standard evaluation (*r _{s}* = 0 for

Table 6. Large-sample robustness of the assumed Gaussian random effects model (GRE) when the true dependence structure between tests is a finite mixture (FM) model

In the next section we examine the finite sample results for both robustness and efficiency when we observe partial gold standard information.

We examine bias, variability, and model selection of the different models using simulation studies. Table 7 shows the effect of model misspecification on estimates of prevalence, sensitivity, and specificity when the true model is an FM model and we fit the misspecified GRE model. Results are shown for sample sizes of *I* = 100 and *I* = 1,000 and for various proportions of random verification *r*. Similar to simulations in Albert and Dodd (2004), we found that when *r* = 0 estimates of sensitivity, specificity, and prevalence are biased under a misspecified model, and it is difficult to distinguish between models based on likelihood comparisons. In addition, estimates under the misspecified GRE model are substantially more variable than estimates under the correctly specified FM model. However, with only a small percentage of samples verified, estimation of sensitivity, specificity, and prevalence has improved statistical properties. Table 7 shows that bias is substantially reduced when only 5% of cases are verified. With as little as 20% random verification, estimates of sensitivity, specificity, and prevalence are nearly unbiased under model misspecification. In addition, variance estimates are very similar under the misspecified model relative to the correctly specified model. Under complete verification (*r* = 1), there is essentially no effect of misspecifying the dependence structure. The table suggests that there are other advantages to measuring the gold standard test on at least a fraction of samples or individuals. First, there is a large payoff in efficiency. For sensitivity, under the correct FM model with *I* = 1,000, the efficiency gain relative to no gold standard information (*r* = 0) is 46%, 276%, and 640% for 5%, 20%, and 100% gold standard evaluation, respectively (these calculations were based on variance estimates computed to the fourth decimal place, whereas the standard errors in Table 7 are only presented to the second decimal place). This decrease in variance is even more sizable under the misspecified GRE model. Second, it becomes increasingly easy to distinguish between models for the dependence structure with increasing *r*. In Table 7 we show the percentage of times the correctly specified FM model is chosen as superior to the misspecified GRE model based on the criterion of a separation of log-likelihoods greater than 1. With five tests (*J* = 5) and a sample size of *I* = 1,000, the correctly specified FM model was declared to be superior in 12% of the cases when there was no gold standard test. The ability to choose the correct model increased dramatically with even a small fraction of gold standard evaluation. With 5%, 20%, and 100% verification, the correct model was identified in 45%, 64%, and 79% of the cases, respectively.
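The selection criterion can be written as a small rule. The sketch below is illustrative only: the function name and the "undecided" label are assumptions, and applying the rule to the gastric cancer log-likelihoods reported earlier in the article is done here purely as a worked example.

```python
def preferred_model(loglik_by_model, margin=1.0):
    """Pick the model whose log-likelihood exceeds the other's by more than margin.

    loglik_by_model: dict with exactly two entries, model name -> log-likelihood.
    Returns the winning model name, or "undecided" if the separation is too small.
    """
    (name_a, ll_a), (name_b, ll_b) = loglik_by_model.items()
    if ll_a - ll_b > margin:
        return name_a
    if ll_b - ll_a > margin:
        return name_b
    return "undecided"

# Log-likelihoods reported for the gastric cancer data.
with_gold_standard = {"GRE": -300.36, "FM": -305.45}  # separation 5.09
no_gold_standard = {"GRE": -280.16, "FM": -280.30}    # separation 0.14

choice_with_gs = preferred_model(with_gold_standard)
choice_no_gs = preferred_model(no_gold_standard)
```

A direct comparison of log-likelihoods is reasonable here because the GRE and FM models have the same number of parameters.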

Table 8 shows the effect of model misspecification on estimates of sensitivity and specificity when the true model is a GRE model and the misspecified model is the FM model. As with the asymptotic results in this situation, a random sample of more than 20% reference standard evaluation is needed to get approximately unbiased estimates under the misspecified model. However, unlike when *r* = 0, where it is difficult to choose the correct model (by the criterion that the log-likelihood for the GRE model was larger than the log-likelihood for the FM model by more than 1), we can choose between the GRE and FM models with high probability when *r* = .2.

We also examined the robustness of the GRE and FM models when the true dependence structure is governed by a Bahadur model. Specifically, we simulated data with the conditional dependence structure [*P*(**Y*** _{i}*|

Table 9 shows simulation results for the case of four raters under a correctly specified FM model and a misspecified GRE model. Estimates of sensitivity and specificity, which are seriously biased with no gold standard evaluation, are nearly unbiased under the misspecified model with 20% random verification. As in Table 7, this table illustrates the pay-off in efficiency with at least some partial gold standard evaluation under both the correct and the misspecified models. This table also shows the percentage of realizations where the FM model has a larger likelihood than the GRE model. Unlike with no gold standard evaluation, the FM model is almost always correctly identified with 20% verification. In addition, unlike with *r* = 0, models with 20% verification result in the correct ordering of sensitivity and specificity almost all of the time. We also performed simulations for the case of four raters when the true model is a GRE model and the misspecified model is the FM model. Under the misspecified FM model, bias is substantially reduced for *r* = .20 as compared to *r* = 0. Furthermore, estimates of sensitivity, specificity, and prevalence computed under the FM model were nearly unbiased for *r* = .5 (data not shown).

Next, we examine verification biased sampling. Our asymptotic results show that estimates of diagnostic error and prevalence can be biased when we oversample discrepant cases under a misspecified model, which was in contrast to results with random verification. We conducted simulations to examine this further. We examine bias in sensitivity, specificity, and prevalence estimates from a GRE model when the FM model is the correct model. We simulated under an FM model with *J* = 5, *I* = 100, *η*_{0} = .20, *η*_{1} = .50, *P _{d}* = .20,

Under the correctly specified model, oversampling discrepant cases may improve the precision of our estimates. Thus, an interesting question is whether the increase in efficiency from oversampling discrepant cases is worth the potential of serious bias under a misspecified model. We conducted a simulation where we simulated under a finite mixture model and fit the correctly specified FM model and the misspecified GRE model both under completely random verification and under a verification process where we oversampled discrepant cases. We simulated data with *J* = 5, *I* = 1,000, *P _{d}* = .2,

Next, we return to the gastric cancer dataset and use only partial gold standard evaluation. Our initial focus is on examining verification that is completely at random. We evaluated designs with different probabilities of verification (*r*). To capture the variability associated with different amounts of verified sampling, we resample data with replacement and incorporate the reference standard on a given image with probability *r*. Table 10 shows results for an assumed common and for an assumed rater-specific sensitivity and specificity for *r* ranging from .1 to .8. In each situation, we fit both the FM and the GRE models. A comparison of these results with those presented for complete verification and for no gold standard evaluation (Tables 1 and 2) is most revealing. The results suggest that the common as well as the rater-specific estimates for *r* = .50 are close to those presented for complete verification. In addition, the results for *r* = .2, although not very close to those presented for the complete verification case, are substantially closer than those estimated with the latent class models under *r* = 0 (Table 2).
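The resampling scheme described above can be sketched as follows. The function name and the data layout (each image as a pair of rater results and a reference standard value) are assumptions for illustration; the model fitting that would follow each resample is omitted.

```python
import random

def resample_with_random_verification(images, r, rng):
    """Bootstrap resample of images, keeping the reference standard with probability r.

    images: list of (ratings, reference) pairs.
    Returns a list of (ratings, reference_or_None, v), where v = 1 if verified.
    """
    resampled = []
    for _ in images:
        ratings, reference = rng.choice(images)  # draw with replacement
        verified = rng.random() < r              # completely at random
        resampled.append((ratings, reference if verified else None, int(verified)))
    return resampled

# Hypothetical toy data: three raters, reference standard 0/1.
toy_images = [((1, 1, 0), 1), ((0, 0, 0), 0), ((0, 1, 0), 0), ((1, 1, 1), 1)]
one_resample = resample_with_random_verification(toy_images, 0.5, random.Random(0))
```

Because verification is drawn independently of the ratings, the verified subset is a simple random sample of the resampled images.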

Table 10. Estimation of overall and rater-specific sensitivity and specificity as well as prevalence for digital radiography using partial verification designs

We also examined extreme bias verification. Specifically, we evaluated a design whereby we verified all images in which at least one of the six radiologists rated the image positive for gastric cancer (52% of images were declared positive by at least one radiologist). As with random verification, we constructed datasets by resampling images with replacement and incorporating reference standard information whenever a positive image for any radiologist was recorded. For a common sensitivity and specificity, estimates of sensitivity, specificity, and prevalence were .78 (SE = .05), .90 (.01), and .23 (.04) for the FM model and .72 (.11), .90 (.02), and .23 (.06) for the GRE model, respectively. There was greater discrepancy between the estimates across the two models under extreme verification bias than for a comparable proportion verified under a completely random verification mechanism (*r* = .50 in Table 10). Large differences between the FM and the GRE models for rater-specific estimates were also found (data not shown). These results, along with the analytic and simulation results, demonstrate less robustness under verification biased sampling.
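For contrast with random verification, the extreme verification design can be sketched in the same style: an image is verified exactly when at least one rater calls it positive. As before, the function name and data layout are hypothetical and the model fitting step is omitted.

```python
import random

def resample_with_extreme_verification(images, rng):
    """Bootstrap resample where only images with any positive rating are verified.

    images: list of (ratings, reference) pairs.
    Returns a list of (ratings, reference_or_None, v), where v = 1 if verified.
    """
    resampled = []
    for _ in images:
        ratings, reference = rng.choice(images)             # draw with replacement
        verified = any(rating == 1 for rating in ratings)   # depends on test results
        resampled.append((ratings, reference if verified else None, int(verified)))
    return resampled

# Hypothetical toy data: three raters, reference standard 0/1.
toy_images = [((1, 1, 0), 1), ((0, 0, 0), 0), ((0, 1, 0), 0), ((0, 0, 0), 1)]
extreme_resample = resample_with_extreme_verification(toy_images, random.Random(0))
```

Here the verification indicator is a deterministic function of the ratings, so images rated negative by every rater never contribute reference standard information.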

It has been shown in previous work that estimates of diagnostic error and prevalence are biased under a misspecified model for the dependence between tests and that, with only a small number of tests, it is difficult to distinguish between models for the dependence structure using likelihood and other model diagnostics (Albert and Dodd 2004). Under complete verification, results on generalized linear mixed models would suggest that the estimation of marginal quantities (such as prevalence, sensitivity, and specificity) is insensitive to misspecification of the dependence between tests (Tan et al. 1999; Heagerty and Kurland 2001). Our results confirm this. Furthermore, we showed that it is much simpler to distinguish between models with complete verification. A natural question is whether gold standard verification on even a small percentage of cases improves the statistical properties of estimators of sensitivity, specificity, and prevalence. We examined both whether partial verification lessens the bias when the dependence structure is misspecified and whether one is able to more easily distinguish between different models for the dependence structure between tests. For the situation where verification is independent of the test results **Y*** _{i}*, gold standard evaluation on even a small percentage of cases greatly lowers the bias for estimating prevalence, sensitivity, and specificity under a misspecified model. In addition, identifying the correct model for the dependence structure using likelihood comparisons becomes much easier with even a small percentage of gold standard evaluation. Although there are advantages to performing the gold standard test on as many individuals as possible, this is often not possible due to limited resources. Our results suggest that between 20% and 50% gold standard evaluation results in large improvements in robustness, efficiency, and the ability to choose between competing models over no gold standard information. If the gold standard test is expensive, performing the gold standard test on more than 50% of patients may not be cost-effective.

We also examined situations in which the probability of verification depends on observed test results (i.e., verification biased sampling). An important special case of verification biased sampling is extreme verification biased sampling, where individuals who test negative on all tests do not receive gold standard evaluation. Such verification sampling occurs in situations where the gold standard is invasive (e.g., surgical biopsy) and it is considered unethical to subject a patient to the invasive test when there is little evidence for disease. Unlike for a single test, where sensitivity, specificity, and prevalence are not identifiable under extreme verification bias sampling (Begg and Greenes 1983; Pepe 2003), these quantities are identifiable with multiple tests and an assumed model for the dependence between these tests. However, unlike the case where verification is completely at random, estimates of sensitivity, specificity, and prevalence may not be robust to misspecification of the dependence between tests, even with a large fraction of verification.

A gold standard can be defined in various ways depending on the scientific interest. The gold standard test could be a laboratory test, a consensus evaluation of an image, or an assessment of clinical disease. The nature of the gold standard will determine how diagnostic accuracy is interpreted. In the gastric cancer study, the gold standard was a consensus assessment (across three radiologists) of all available clinical information, including imaging data. All suspect gastric cancers were confirmed with biopsies, while patients who were negative had limited follow-up of two months to see if gastric cancer symptoms developed. A longer follow-up would have been ideal in assuring that these negative cases did not develop gastric cancer.

Other types of verification biased sampling schemes may be employed to improve efficiency. For example, our simulation results show that oversampling discrepant cases can result in improved efficiency over sampling completely at random. Our results further show that, although oversampling discrepant cases can improve efficiency, such a strategy loses the attractive feature of decreasing bias with an increasing proportion of verification found for a completely random verification mechanism. In addition, our results suggest that, for a comparable proportion of verification, choosing the correct model for the dependence between tests is more difficult for a verification process in which we oversample discrepant cases as compared with completely random verification.

Irwig et al. (1994) and Tosteson, Titus-Ernstoff, Baron, and Karagas (1994) considered optimal design strategies for the case of a single diagnostic test. Optimal design for multiple correlated tests is an area for future research. However, the choice of an optimal design will depend heavily on assumed models and parameter values for the dependence between tests. For this reason, we question the practicality of developing an optimal design in this situation.

A common criticism of latent class models for estimating sensitivity, specificity, and prevalence without a gold standard is that, without a gold standard, it is difficult to conceptualize sensitivity and specificity (Alonzo and Pepe 1999). Partial verification lessens the problem of conceptualizing the truth because a gold standard test needs to be defined and evaluated on at least a fraction of the cases.

The different models presented for analyzing partial verification data use a latent class structure for observations that do not have gold standard evaluation. In contrast with the full latent class modeling used when there is no gold standard evaluation, the semilatent class approach is more conceptually appealing, is more robust under verification completely at random, and allows for model comparisons using likelihoods with only a small number of tests.
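To make the semilatent structure concrete, the sketch below evaluates an observed-data log-likelihood in which verified subjects contribute the joint probability of their ratings and disease status, while unverified subjects contribute a two-class mixture. It assumes conditional independence between tests given disease status and a common sensitivity and specificity, which is deliberately simpler than the FM and GRE models; the function name `semilatent_loglik` and the example values are hypothetical.

```python
import math

def semilatent_loglik(data, sens, spec, prev):
    """Observed-data log-likelihood for partially verified binary ratings.

    data: list of (y, d) pairs, where y is a tuple of 0/1 ratings and d is
          1, 0, or None (None means no gold standard result is available).
    Assumes conditional independence given D (an illustrative simplification).
    """
    total = 0.0
    for y, d in data:
        s, j = sum(y), len(y)
        # P(Y = y, D = 1) and P(Y = y, D = 0) under conditional independence
        joint1 = prev * sens**s * (1 - sens) ** (j - s)
        joint0 = (1 - prev) * (1 - spec) ** s * spec ** (j - s)
        if d is None:
            total += math.log(joint1 + joint0)                # unverified: mixture over D
        else:
            total += math.log(joint1 if d == 1 else joint0)   # verified: D is observed
    return total

# Hypothetical example: one verified diseased subject, one unverified subject.
example_data = [((1, 1), 1), ((0, 0), None)]
example_loglik = semilatent_loglik(example_data, sens=0.8, spec=0.9, prev=0.3)
```

Because the two kinds of contribution sit in one likelihood, competing dependence models can be compared on the same data by swapping in a different expression for the joint probabilities.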

We thank Dr. Gen Iinuma for providing us access to the gastric cancer dataset. We thank Dr. Seirchiro Yamamoto for helping us get access to the data as well as for helpful conversations. We thank the Center for Information Technology, National Institutes of Health, for providing access to the high-performance computational capabilities of the Biowulf cluster computer system. We also thank the editor, associate editor, and two reviewers for their thoughtful and constructive comments, which have led to an improved article.

This is evaluated under the assumption of a common sensitivity and specificity across *J* tests, where the number of positive tests *S* is a sufficient statistic. Denote by *Z _{sd}* an indicator of whether the individual is verified, has

$$\begin{aligned}
E_{T}\left[\log L(\mathbf{Y}_{i},\theta_{M})\right]
&=\sum_{d=0}^{1}\sum_{s=0}^{J}E_{T}[Z_{sd}]\,\log\left[P_{M}(S=s\mid D=d)\,P_{M}(D=d)\right]\\
&\quad+\sum_{s=0}^{J}E_{T}[X_{s}]\,\log\left[P_{M}(S=s\mid D=0)\,P_{M}(D=0)+P_{M}(S=s\mid D=1)\,P_{M}(D=1)\right]+C_{v},
\end{aligned}$$

(A.1)

where *P _{M}*(

$$E_{T}[Z_{sd}]=r_{s}\,P(S=s\mid D=d)\,P(D=d)$$

(A.2)

and

$$E_{T}[X_{s}]=(1-r_{s})\left[P(S=s\mid D=1)\,P(D=1)+P(S=s\mid D=0)\,P(D=0)\right].$$

(A.3)
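Expressions (A.1) through (A.3) can be evaluated numerically once the true and assumed distributions of *S* given *D* are specified. The sketch below is illustrative only: it treats *C_{v}* as a constant that does not involve the model parameters and omits it, and it uses binomial distributions for *S* given *D* (i.e., conditional independence with common sensitivity and specificity) purely as a convenient example. The names `binom_pmf`, `cond_table`, and `expected_loglik` and all parameter values are hypothetical.

```python
import math

def binom_pmf(s, j, p):
    """P(S = s) for S ~ Binomial(j, p)."""
    return math.comb(j, s) * p**s * (1 - p) ** (j - s)

def cond_table(j, sens, spec):
    """P(S = s | D = d) for d = 0, 1 under conditional independence (an assumption)."""
    return [[binom_pmf(s, j, 1 - spec) for s in range(j + 1)],
            [binom_pmf(s, j, sens) for s in range(j + 1)]]

def expected_loglik(j, r, true_cond, true_prev, model_cond, model_prev):
    """Evaluate (A.1) without the constant C_v.

    r[s] is the probability of verification given S = s; true_cond and
    model_cond hold P(S = s | D = d) under the true and assumed models.
    """
    true_pd = [1 - true_prev, true_prev]
    model_pd = [1 - model_prev, model_prev]
    total = 0.0
    for s in range(j + 1):
        for d in (0, 1):
            ez = r[s] * true_cond[d][s] * true_pd[d]                  # (A.2)
            if ez > 0:
                total += ez * math.log(model_cond[d][s] * model_pd[d])
        ex = (1 - r[s]) * sum(true_cond[d][s] * true_pd[d] for d in (0, 1))  # (A.3)
        if ex > 0:
            total += ex * math.log(sum(model_cond[d][s] * model_pd[d] for d in (0, 1)))
    return total

J = 6
r = [0.3] * (J + 1)                       # verification completely at random
truth = cond_table(J, sens=0.8, spec=0.9)
wrong = cond_table(J, sens=0.6, spec=0.9)
at_truth = expected_loglik(J, r, truth, 0.2, truth, 0.2)
at_wrong = expected_loglik(J, r, truth, 0.2, wrong, 0.2)
```

Maximizing this quantity over the parameters of the assumed model gives the values to which the misspecified maximum likelihood estimates converge; when the assumed model coincides with the true model, as in `at_truth`, the expected log-likelihood is at least as large as for any other parameter value.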

Let *π _{ij}* be the probability of a positive response conditional on

The probability distribution can be expressed as
$f({\text{Y}}_{i}|{d}_{i})=\text{exp}({\sum}_{j=1}^{J}{\theta}_{j}{Y}_{ij}+{\sum}_{j<k}{\theta}_{jk}{Y}_{ij}{Y}_{ik}++{\theta}_{jkJ}$, where Δ is a normalization factor so that *f* (**y*** _{i}*|

This distribution assumes that the probability of a positive test (conditional on *d _{i}*) is common across the


Paul S. Albert, Biometric Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD 20892 (Email: vog.hin.liam@ptrebla).

Lori E. Dodd, Biometric Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, Bethesda, MD 20892 (Email: vog.hin.liam@lddod)

- Albert PS, Dodd LE. A Cautionary Note on the Robustness of Latent Class Models for Estimating Diagnostic Error Without a Gold Standard. Biometrics. 2004;60:427–435.
- Albert PS, McShane LM, Shih JH, et al. Latent Class Modeling Approaches for Assessing Diagnostic Error Without a Gold Standard: With Applications to p53 Immunohistochemical Assays in Bladder Tumors. Biometrics. 2001;57:610–619.
- Alonzo TA, Pepe M. Using a Combination of Reference Tests to Assess the Accuracy of a Diagnostic Test. Statistics in Medicine. 1999;18:2987–3003.
- Bahadur RR. A Representation of the Joint Distribution of Responses of *n* Dichotomous Items. In: Solomon H, editor. Studies in Item Analysis and Prediction. Stanford, CA: Stanford University Press; 1961. pp. 169–177.
- Baker SG. Evaluating Multiple Diagnostic Tests With Partial Verification. Biometrics. 1995;51:330–337.
- Begg CB, Greenes RA. Assessment of Diagnostic Tests When Disease Verification Is Subject to Selection Bias. Biometrics. 1983;39:207–215.
- Cox DR. The Analysis of Multivariate Binary Data. Applied Statistics. 1972;21:113–120.
- Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman & Hall; 1993.
- Heagerty PJ, Kurland BF. Misspecified Maximum Likelihood Estimates and Generalized Linear Mixed Models. Biometrika. 2001;88:973–985.
- Hoenig JM, Hanumara RC, Heisey DM. Generalizing Double and Triple Sampling for Repeated Surveys and Partial Verification. Biometrical Journal. 2002;44:603–618.
- Iinuma G, Ushiro K, Ishikawa T, Nawano S, Sekiguchi R, Satake M. Diagnosis of Gastric Cancer: Comparison of Conventional Radiography and Digital Radiography With a 4 Million Pixel Charge-Coupled Device. Radiology. 2000;214:497–502.
- Irwig L, Glasziou PP, Berry G, Chock C, Mock P, Simpson JM. Efficient Study Designs to Assess the Accuracy of Screening Tests. American Journal of Epidemiology. 1994;140:759–767.
- Kosinski AS, Barnhart HX. Accounting for Nonignorable Verification Bias in Assessment of Diagnostic Test. Biometrics. 2003;59:163–171.
- Liang KY, Zeger SL. Longitudinal Data Analysis Using Generalized Linear Models. Biometrika. 1986;73:12–22.
- Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford, U.K.: Oxford University Press; 2003.
- Qu Y, Tan M, Kutner MH. Random Effects Models in Latent Class Analysis for Evaluating Accuracy of Diagnostic Tests. Biometrics. 1996;52:797–810.
- Self SG, Liang KY. Asymptotic Properties of Maximum Likelihood Estimators and Likelihood Ratio Tests Under Nonstandard Conditions. Journal of the American Statistical Association. 1987;82:605–610.
- Tan M, Qu Y, Rao JS. Robustness of the Latent Variable Model for Correlated Binary Data. Biometrics. 1999;55:258–263.
- Tosteson TD, Titus-Ernstoff L, Baron JA, Karagas MR. A Two-Stage Validation Study for Determining Sensitivity and Specificity. Environmental Health Perspectives. 1994;102:11–14.
- van der Merwe L, Maritz JS. Estimating the Conditional False-Positive Rate for Semi-Latent Data. Epidemiology. 2002;13:424–430.
- Van Dyck E, Buve A, Weiss HA, et al. Performance of Commercially Available Enzyme Immunoassays for Detection of Antibodies Against Herpes Simplex Virus Type 2 in African Populations. Journal of Clinical Microbiology. 2004;42:2961–2965.
- Walter SD. Estimation of Test Sensitivity and Specificity When Disease Confirmation Is Limited to Positive Results. Epidemiology. 1999;10:67–72.
