We examined the asymptotic bias when the dependence structure is misspecified as a function of the proportion of samples receiving gold standard evaluation. For simplicity, we examined this bias for the case when interest focuses on estimating a common sensitivity and specificity across raters (denoted as SENS and SPEC, respectively). We examined both verification that is completely at random and verification biased sampling. The misspecified maximum likelihood estimator for the model parameters, denoted by θ̂*, converges to the value θ* that maximizes E_T[log L(Yi, θ)], where log L(Yi, θ) is the individual contribution to the log-likelihood under the assumed model and the expectation is taken under the true model T. The notation E_T[log L_M(Yi, θ*)] denotes the expectation (taken under the true model T) of an individual's contribution to the log-likelihood under the assumed model M when evaluated at θ*. Sensitivity and specificity are model-dependent functional forms of the model parameters, SENS* = g1(θ*) and SPEC* = g2(θ*), where g1 and g2 relate model parameters to sensitivity and specificity. Estimators of sensitivity and specificity converge to SENS* and SPEC* under misspecified models. Expressions for an individual's contribution to the expected log-likelihood under the correct and misspecified models are provided in Appendix A. Asymptotic bias for sensitivity and specificity is defined as SENS* − SENS and SPEC* − SPEC, respectively.
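As an illustration of these definitions, the limit θ* and the resulting asymptotic bias can be computed numerically in a toy setting. The sketch below is not the paper's GRE/FM computation: it assumes J = 3 tests (rather than five), a hypothetical finite-mixture-type dependence in the diseased class as the true model, and conditional independence as the misspecified working model, with all parameter values chosen purely for illustration.

```python
import math
from itertools import product

# Toy setup (hypothetical values, not the paper's): J = 3 tests, common
# marginal sensitivity/specificity, FM-type dependence in the diseased class.
J, PREV, SENS, SPEC = 3, 0.20, 0.75, 0.90
ETA, S_HI = 0.5, 0.95                    # mixture weight / "easy-case" sensitivity
S_LO = (SENS - ETA * S_HI) / (1 - ETA)   # keeps the marginal sensitivity at SENS

def prod_bern(p, y):
    """Probability of outcome vector y under independent Bernoulli(p) tests."""
    out = 1.0
    for yj in y:
        out *= p if yj == 1 else 1.0 - p
    return out

def p_true(y):
    """True marginal P_T(Y = y): mixture dependence given d = 1, independence given d = 0."""
    p_d1 = ETA * prod_bern(S_HI, y) + (1 - ETA) * prod_bern(S_LO, y)
    return PREV * p_d1 + (1 - PREV) * prod_bern(1 - SPEC, y)

def p_assumed(y, prev, se, sp):
    """Misspecified working model: conditional independence given disease."""
    return prev * prod_bern(se, y) + (1 - prev) * prod_bern(1 - sp, y)

OUTCOMES = list(product([0, 1], repeat=J))
PT = {y: p_true(y) for y in OUTCOMES}

def expected_loglik(prev, se, sp):
    """E_T[log L_M(Y; theta)] with no gold standard information (r = 0)."""
    return sum(PT[y] * math.log(p_assumed(y, prev, se, sp)) for y in OUTCOMES)

# Crude grid search for theta* = argmax of the expected log-likelihood.
GRID = [i / 100 for i in range(5, 100, 5)]
theta_star = max(product(GRID, repeat=3), key=lambda t: expected_loglik(*t))
print("theta* (prev, SENS*, SPEC*):", theta_star)
print("asymptotic bias in SENS    :", theta_star[1] - SENS)
```

The grid search is deliberately crude; the point is only that θ* is a well-defined maximizer of the expected log-likelihood under the true model, and SENS* − SENS can be read off directly from it.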
First, we examined the case of completely at random verification (i.e., rs = r for all s = 0, 1, …, J). We initially examined the asymptotic bias of estimators of sensitivity and specificity when we falsely assumed a GRE model and when the true model was an FM model, as well as when we falsely assumed an FM model and the true model was a GRE model. This reciprocal misspecification with the FM and GRE models is an extreme type of misspecification because the two models are so different.
The table below shows the results for various proportions of completely at random verification for five tests and a presumed constant sensitivity and specificity, with the true model being the FM model and the misspecified model being the GRE model. When we have no gold standard information (r = 0), there is serious bias under a misspecified dependence structure, and the expected individual contribution to the log-likelihood under the misspecified model is nearly identical (to more than six digits) to the expected log-likelihood under the correctly specified model, which is consistent with results reported in Albert and Dodd (2004). Thus, with no gold standard reference and with five tests, estimates of diagnostic error may be biased under a misspecified dependence structure, yet it may be very difficult to distinguish between models in most situations. As little as 2% gold standard verification (r = .02) reduces the bias considerably, and the expected log-likelihoods are no longer nearly identical, making it simpler to distinguish between models. With 20% verification, the bias is small. For complete verification (r = 1), marginal quantities such as sensitivity and specificity are nearly unbiased under a misspecified dependence structure. This is consistent with work by Tan, Qu, and Rao (1999) and Heagerty and Kurland (2001), who showed for clustered binary data that marginal quantities (which sensitivity, specificity, and prevalence are) are robust to misspecification of the dependence structure. The large differences in expected log-likelihoods suggest that it will be relatively simple to distinguish between models.
Large-sample robustness of the assumed Gaussian random effects (GRE) model to the true dependence structure between tests given by the finite mixture (FM) model
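A sketch of how the completely-at-random verification fraction r enters the expected log-likelihood: with probability r a subject contributes log P_M(Y, D; θ) (gold standard observed) and otherwise log P_M(Y; θ). The models and parameter values below are toy assumptions (conditional independence as the working model, a finite-mixture-type true model with J = 3 tests), not the paper's GRE/FM specification.

```python
import math
from itertools import product

# Toy setup (hypothetical values, not the paper's GRE/FM models).
J, PREV, SENS, SPEC = 3, 0.20, 0.75, 0.90
ETA, S_HI = 0.5, 0.95
S_LO = (SENS - ETA * S_HI) / (1 - ETA)   # keeps the marginal sensitivity at SENS

def prod_bern(p, y):
    out = 1.0
    for yj in y:
        out *= p if yj == 1 else 1.0 - p
    return out

def p_true_joint(y, d):
    """True P_T(Y = y, D = d): mixture-type dependence given d = 1."""
    if d == 1:
        return PREV * (ETA * prod_bern(S_HI, y) + (1 - ETA) * prod_bern(S_LO, y))
    return (1 - PREV) * prod_bern(1 - SPEC, y)

def p_model_joint(y, d, prev, se, sp):
    """Working model P_M(Y = y, D = d): conditional independence given disease."""
    return (prev if d == 1 else 1 - prev) * prod_bern(se if d == 1 else 1 - sp, y)

OUTCOMES = list(product([0, 1], repeat=J))

def expected_loglik(theta, r):
    """r * E_T[log P_M(Y, D)] + (1 - r) * E_T[log P_M(Y)]."""
    prev, se, sp = theta
    total = 0.0
    for y in OUTCOMES:
        marg = sum(p_model_joint(y, d, prev, se, sp) for d in (0, 1))
        for d in (0, 1):
            total += p_true_joint(y, d) * (
                r * math.log(p_model_joint(y, d, prev, se, sp))
                + (1 - r) * math.log(marg))
    return total

GRID = [i / 100 for i in range(5, 100, 5)]
for r in (0.0, 0.02, 0.2, 1.0):
    star = max(product(GRID, repeat=3), key=lambda t: expected_loglik(t, r))
    print(f"r = {r:<4}: SENS* = {star[1]:.2f}, SPEC* = {star[2]:.2f}")
```

With r = 1 the complete-data expected log-likelihood is separable in the parameters, so the grid maximizer is the true (prevalence, SENS, SPEC); smaller r leaves more of the dependence-structure misspecification in play.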
The next table shows asymptotic bias with five tests when the true model is the GRE model and the misspecified model is the FM model. As in the previous case, there is substantial asymptotic bias under the misspecified model when there is no gold standard evaluation. In addition, the expected log-likelihood for the misspecified model is nearly identical to the expected log-likelihood for the correctly specified model, again showing the difficulty in choosing between competing models with no gold standard information and few tests. Similar to the previous results, estimates of prevalence, sensitivity, and specificity are asymptotically unbiased under the misspecified model when there is complete gold standard evaluation (r = 1). Unlike the previous case, a larger percentage of verification (about 50%) is necessary to achieve approximate unbiasedness. In both cases, however, a small percentage of verification results in different expected log-likelihoods under the true and misspecified models, suggesting that it is simpler to choose between competing models with even a small percentage of gold standard verification.
Large-sample robustness of the assumed finite mixture (FM) model to the true dependence structure between tests given by the Gaussian random effects (GRE) model
The two tables above provide an assessment of asymptotic bias under reciprocal model misspecification for both the FM and the GRE models when the sensitivity and specificity are .75 and .9, respectively. We also examined the relative asymptotic bias for a wide range of sensitivity and specificity (a grid ranging from values of .65 to .95 for both sensitivity and specificity) corresponding to the cases specified in these tables for r = .5. Figure 1 shows the results corresponding to these models and parameters when σ1 = 3. Over this wide range of sensitivity and specificity, the maximum relative percent bias was 2.8% for sensitivity and 5.1% for specificity. Other scenarios provided similar results, with all percent biases being less than 6% over the grid (data not shown).
Figure 1 Contour plot of relative asymptotic bias in sensitivity and specificity for 50% completely at random verification when the true model is a GRE model with Pd = .20, σ0 = 1.5, σ1 = 3, and J = 5.
We examined asymptotic bias of the FM and GRE models under alternative dependence structures. Specifically, we examined the asymptotic bias when the true conditional dependence structure P(Yi | di) is a Bahadur model (Bahadur 1961), a log-linear model (Cox 1972), or a Beta-binomial model, where a description of each of these models is provided in Appendix B. All three alternative models were formulated so they had the same number of parameters as the GRE and FM models. For the Bahadur model, we considered the special case of only pairwise conditional dependence between tests (i.e., all interactions of order 3 and higher are set to 0). In this case, the conditional distribution of Yi given di is determined by the marginal probabilities P[Yij = 1 | di] for all j and a common pairwise correlation, for any i
. As with reciprocal model misspecification, we evaluated the asymptotic bias of sensitivity and specificity for an increasing fraction r of completely at random verification under a GRE and an FM model when the true model was the Bahadur model. For five tests (J = 5), SENS = .75, and Pd = .20, sensitivity and specificity were nearly asymptotically unbiased under both the GRE and the FM models with 20% completely random verification. For example, under a GRE model, SENS* = .50, .63, .72, .74, and .75 for r = 0, .02, .2, .5, and 1. Under an FM model, SENS* = .61, .73, .78, .76, and .75 for the same values of r.
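A minimal check of the pairwise Bahadur representation discussed above, under simplifying assumptions (a common marginal probability p given disease status and a common pairwise correlation ρ, with all higher-order terms set to 0; the numerical values are illustrative only): the pmf is the independence baseline times a second-order correction, and for admissible (p, ρ) it sums to 1 and stays nonnegative.

```python
import math
from itertools import combinations, product

def bahadur_pmf(y, p, rho):
    """Pairwise Bahadur pmf for a binary vector y: independence baseline times
    a correction with common second-order correlation rho (orders >= 3 set to 0)."""
    base = 1.0
    for yj in y:
        base *= p if yj == 1 else 1.0 - p
    z = [(yj - p) / math.sqrt(p * (1 - p)) for yj in y]
    corr = 1.0 + rho * sum(z[j] * z[k] for j, k in combinations(range(len(y)), 2))
    return base * corr

# Toy values (hypothetical): J = 5 tests, per-test positive probability
# p = .75 given disease, common pairwise correlation rho = .1.
J, P, RHO = 5, 0.75, 0.1
probs = [bahadur_pmf(y, P, RHO) for y in product([0, 1], repeat=J)]
print("sums to 1     :", abs(sum(probs) - 1.0) < 1e-12)
print("all nonnegative:", all(pr >= 0 for pr in probs))
```

Because the correction terms have mean zero under the independence baseline, the pmf always sums to 1; nonnegativity, however, constrains how large ρ may be, which is why the Bahadur model is typically used with modest pairwise correlations.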
For all three alternative models, we examined the bias in sensitivity and specificity of the GRE and FM models with 50% completely random verification over a wide range of sensitivity and specificity values (identical to the grid described above) for a prevalence of .20. Table 5 shows that the maximum relative asymptotic bias was less than 7% for both sensitivity and specificity for all three alternative models. Thus, estimates of diagnostic error appear to be quite robust with 50% completely random verification. When prevalence was very low or very high (e.g., below 5% or above 95%), there was more substantial bias under certain model misspecifications with 50% completely random verification. For example, for a prevalence of .05 when the true model was the log-linear model, there was a maximum bias of 10% under a GRE model (as compared with a maximum bias of 4.3% for a prevalence of .20). However, unlike when there is no gold standard evaluation (r = 0), it is much easier to identify the better-fitting model using likelihood and other criteria for model assessment. Further, for a rare disease, completely random verification would not generally be recommended due to efficiency considerations.
Table 5 Range in relative asymptotic bias for the GRE and FM models when the true conditional dependence structure is (i) a Bahadur model with all correlations of order 3 and higher equal to 0, (ii) a log-linear model with a three-way interaction, and (iii) a Beta-binomial model
Although random verification is of concern in our application, we also consider verification biased sampling because it is so common. We examine asymptotic properties under a misspecified dependence structure with verification biased sampling. The table below shows asymptotic bias and expected log-likelihoods for the situation in which a random sample of cases among those who test positive on at least one of the five tests is verified (an extreme form of verification biased sampling) and where the true model is the FM model and the misspecified model is the GRE model. Interestingly, these results suggest that, in some cases, an increase in the proportion verified can result in an increase in bias under the misspecified model. For example, when η1 = .5 and η0 = .2, the estimator of sensitivity is only slightly asymptotically biased (SENS* = .77) with no gold standard evaluation (rs = 0 for s = 0, 1, 2, …, 5) and substantially biased (SENS* = .57) under complete verification of any case with at least one positive test (r0 = 0 and rs = 1 for s = 1, 2, …, 5). This result is consistent with our simulation results, which are presented in the next section. This problem occurs more generally under a wide range of verification biased sampling schemes. For example, situations where one oversamples discrepant cases can result in bias under model misspecification. Bias can also increase with an increasing proportion of verification of discrepant cases. As an illustration, under completely random verification with the true FM model described above with η1 = .5, the sensitivity converges to SENS* = .76 and is nearly unbiased when r = .2. When we oversample discrepant cases (r0 = r5 = .20 and rs = .4 for s = 1, 2, 3, 4), estimates of sensitivity become more asymptotically biased (SENS* = .73). The asymptotic bias increases further (SENS* = .71) when rs, s = 2, 3, 4, is changed from .4 to 1.
We found similar results when the true model was the GRE model and the misspecified model was the FM model (data not shown).
Large-sample robustness of the assumed Gaussian random effects model (GRE) when the true dependence structure between tests is a finite mixture (FM) model
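The verification biased sampling calculation can be sketched numerically: the verification probability rs depends on the number of positive tests s, and the expected log-likelihood mixes complete-data and marginal contributions accordingly. The models and parameter values below are illustrative assumptions (J = 3 tests, a finite-mixture-type true model, conditional independence as the misspecified working model), not the paper's GRE/FM specification or its verification schemes.

```python
import math
from itertools import product

# Toy setup (hypothetical values, not the paper's GRE/FM models).
J, PREV, SENS, SPEC = 3, 0.20, 0.75, 0.90
ETA, S_HI = 0.5, 0.95
S_LO = (SENS - ETA * S_HI) / (1 - ETA)   # keeps the marginal sensitivity at SENS

def prod_bern(p, y):
    out = 1.0
    for yj in y:
        out *= p if yj == 1 else 1.0 - p
    return out

def p_true_joint(y, d):
    """True P_T(Y = y, D = d): mixture-type dependence given d = 1."""
    if d == 1:
        return PREV * (ETA * prod_bern(S_HI, y) + (1 - ETA) * prod_bern(S_LO, y))
    return (1 - PREV) * prod_bern(1 - SPEC, y)

def p_model_joint(y, d, prev, se, sp):
    """Working model: conditional independence given disease."""
    return (prev if d == 1 else 1 - prev) * prod_bern(se if d == 1 else 1 - sp, y)

OUTCOMES = list(product([0, 1], repeat=J))

def expected_loglik(theta, r_of_s):
    """Verification probability r_s depends on s = number of positive tests."""
    prev, se, sp = theta
    total = 0.0
    for y in OUTCOMES:
        r = r_of_s[sum(y)]
        marg = sum(p_model_joint(y, d, prev, se, sp) for d in (0, 1))
        for d in (0, 1):
            total += p_true_joint(y, d) * (
                r * math.log(p_model_joint(y, d, prev, se, sp))
                + (1 - r) * math.log(marg))
    return total

GRID = [i / 100 for i in range(5, 100, 5)]
SCHEMES = {
    "none (r_s = 0)":     [0.0] * (J + 1),
    "positives only":     [0.0] + [1.0] * J,   # verify iff >= 1 positive test
    "complete (r_s = 1)": [1.0] * (J + 1),
}
for name, r_of_s in SCHEMES.items():
    star = max(product(GRID, repeat=3), key=lambda t: expected_loglik(t, r_of_s))
    print(f"{name:20s} SENS* = {star[1]:.2f}, SPEC* = {star[2]:.2f}")
```

The mechanism here depends on the observed tests only, so the verification indicator factors out of the likelihood; whether partial, biased verification increases or decreases the asymptotic bias in this toy setting depends on the parameter values, which is the qualitative point made above.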
In the next section we examine the finite sample results for both robustness and efficiency when we observe partial gold standard information.