Take the case of a study of screening for cancer, where the aim is to determine the relationship between the results of the screening test and true disease status. Patients are screened using an imaging technology (the diagnostic test), and those with abnormal findings are recommended for biopsy (the gold standard assessment). A hypothetical example from such a screening program is shown in Figure . A total of 500 patients are screened and 100 have abnormal findings. Since those with abnormal findings are strongly recommended to undergo biopsy, 75/100 decide to have a biopsy and 50/75 are confirmed to have disease present. Of the 400 patients with normal findings, 40 are nonetheless biopsied, and 5 are found to have disease.
Example of data subject to verification bias.
To estimate the sensitivity and specificity of imaging for detecting cancer, the naive approach would be to use only data from biopsied patients. This results in a sensitivity of 91% (50/55) and a specificity of 58% (35/60). However, patients with unfavorable characteristics (those likely to be both diagnostic and gold standard positive) are overrepresented and patients with favorable characteristics (those likely to be both diagnostic and gold standard negative) are underrepresented in the sample of biopsied patients. As a result, sensitivity is overestimated and specificity underestimated. This is a classic example of verification bias: imaging results are available for all 500 patients, but the gold standard is available only for a subset, and membership of that subset is associated with the imaging result. Methods for verification bias correction, following Begg and Greenes [1], are given in the Appendix (see Additional file 1). Applying these methods to our example data set gives a corrected sensitivity of 57% (67/117) and a corrected specificity of 91% (350/383). Without verification bias correction, we would have concluded that imaging was highly sensitive but moderately specific when in fact the reverse is true.
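The arithmetic behind these figures can be checked with a short sketch. It assumes the usual form of the Begg and Greenes correction, in which the disease rate observed among verified patients within each test result is applied to all patients with that test result. The function and argument names are illustrative and do not come from the source or its appendix.

```python
def begg_greenes(n_pos, n_neg, v_pos_dis, v_pos_nodis, v_neg_dis, v_neg_nodis):
    """Return (sensitivity, specificity) corrected for verification bias.

    n_pos, n_neg : all patients with a positive / negative test result
    v_*          : verified patients, by test result and disease status
    (Names are hypothetical; this is a sketch, not the appendix code.)
    """
    # Disease rate among verified patients, within each test result.
    p_dis_given_pos = v_pos_dis / (v_pos_dis + v_pos_nodis)
    p_dis_given_neg = v_neg_dis / (v_neg_dis + v_neg_nodis)
    # Expected cell counts had every patient been verified.
    tp = n_pos * p_dis_given_pos
    fp = n_pos - tp
    fn = n_neg * p_dis_given_neg
    tn = n_neg - fn
    return tp / (tp + fn), tn / (tn + fp)


# Naive estimates use biopsied patients only.
naive_sens = 50 / (50 + 5)
naive_spec = 35 / (35 + 25)

# Corrected estimates use all 500 screened patients.
corrected_sens, corrected_spec = begg_greenes(100, 400, 50, 25, 5, 35)

print(f"naive:     sensitivity {naive_sens:.0%}, specificity {naive_spec:.0%}")
print(f"corrected: sensitivity {corrected_sens:.0%}, specificity {corrected_spec:.0%}")
```

With the example counts, the expected number of diseased patients is about 67 among the 100 with abnormal imaging and 50 among the 400 with normal imaging, which is where the 67/117 in the text comes from.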
Correction for verification bias becomes problematic if small cell counts are encountered. Table gives some different scenarios for the biopsy results of patients with a normal imaging result in our cancer screening example. In the first row, the scenario shown in figure , 40 participants with a normal imaging result were biopsied, of whom 5 were found to have disease (that is, there were 5 false negatives), giving a corrected sensitivity of 57%. In subsequent rows in table , we vary the number of false negatives and find that small changes in the data lead to large differences in our estimates: a change from 2 to 1 false negatives, for example, increases sensitivity from 77% to 87%. Clearly no robust statistical method should give such a different result given a change in status for a single patient in a 500-patient study.
Examples of data subject to verification bias and with a low number of false negatives
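The sensitivity figures quoted for these scenarios can be reproduced with the following sketch, which again assumes the standard Begg and Greenes form of the correction. The set of false negative counts looped over is chosen here for illustration and may not match the rows of the table exactly.

```python
def corrected_sensitivity(n_pos, n_neg, v_pos_dis, v_pos, v_neg_dis, v_neg):
    """Begg-Greenes corrected sensitivity (argument names are illustrative)."""
    tp = n_pos * v_pos_dis / v_pos   # expected diseased among all test positives
    fn = n_neg * v_neg_dis / v_neg   # expected diseased among all test negatives
    return tp / (tp + fn)


# 100 abnormal (75 biopsied, 50 diseased); 400 normal (40 biopsied).
results = {}
for false_negatives in (5, 4, 3, 2, 1, 0):
    results[false_negatives] = corrected_sensitivity(
        100, 400, 50, 75, false_negatives, 40
    )
    print(f"{false_negatives} false negatives: "
          f"corrected sensitivity {results[false_negatives]:.0%}")
```

Each false negative among the 40 biopsied patients stands in for 10 of the 400 patients with normal imaging, which is why moving a single patient between cells shifts the estimate by several percentage points.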
A mathematical explanation for this observation is as follows. Consider the formula for the corrected sensitivity given in the appendix (see Additional file 1):

$$\text{Sensitivity} = \frac{n_{1\cdot}\, v_{11}/v_{1\cdot}}{n_{1\cdot}\, v_{11}/v_{1\cdot} + n_{2\cdot}\, v_{21}/v_{2\cdot}} \qquad (1)$$

Here, v indicates patients with a verified outcome (e.g. a biopsy result); n indicates all patients; the first and second subscripts refer to the test and gold standard results (e.g. imaging and biopsy), respectively; the subscript values 1 and 2 denote test positive/disease present and test negative/disease absent, respectively; and a dot in place of the second subscript denotes the total over both gold standard results. The problematic cell is the false negative cell, since participants are rarely verified if they have a strongly negative diagnostic test result; moreover, when these patients are verified, they are most likely to be disease free. It can be seen from (1) that as the false negative cell count ($v_{21}$) approaches zero, the second term in the denominator of (1) also approaches zero, resulting in a corrected sensitivity that approaches 100%.
The sort of small differences in cell count which, as we have shown, can have a marked effect on estimates are an inevitable consequence of sampling variability. In our principal example, 5 of the 40 patients with negative imaging results had a positive biopsy. The 95% confidence interval for this proportion, 12.5%, is 4% to 27%: accordingly, it would not be at all unusual if, were we to repeat this experiment, we were to see only 2 of 40 patients with false negative results. In other words, in the imaging example we could have reported a sensitivity ranging from approximately 40% to 75% due to small chance differences in the number of false negatives, and there would be a very wide confidence interval around these estimates.
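The interval quoted for 5/40 can be checked with the sketch below. The source does not state which interval method was used; an exact (Clopper-Pearson) binomial interval is assumed here because it matches the reported bounds after rounding. The helper names are illustrative, and the bounds are found by simple bisection to avoid any dependency beyond the standard library.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def bisect_root(f, lo=0.0, hi=1.0, tol=1e-10):
    """Root of a monotone function f on [lo, hi], found by bisection."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial proportion x/n."""
    # Lower bound: p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else bisect_root(
        lambda p: 1 - binom_cdf(x - 1, n, p) - alpha / 2
    )
    # Upper bound: p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else bisect_root(
        lambda p: binom_cdf(x, n, p) - alpha / 2
    )
    return lower, upper


lower, upper = clopper_pearson(5, 40)
print(f"5/40 = {5 / 40:.1%}, 95% CI {lower:.1%} to {upper:.1%}")
```

A count of 2 out of 40 (5%) lies inside this interval, which is the sense in which such a result would not be unusual on repetition.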
To investigate further the effects of low false negative counts on sensitivity, and in turn the AUC, we performed the following experiment:
A) We created a simulated data set with 5000 subjects. Both diagnostic and gold standard test results were known for all 5000 subjects, constituting a fully verified data set. Since the gold standard result was known for all subjects, we were able to fix the true AUC to 0.750. Data were simulated according to the specified probability models:
a. The gold standard test result follows a Bernoulli distribution with the mean equal to the incidence of disease, which was set to 10%.
b. The diagnostic test result follows a log normal distribution where the log(test result) has a standard deviation of 1 and a mean of 0 and 1, respectively, for patients with negative and positive gold standard test results.
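Step A can be sketched as follows, assuming Python's standard `random` module as the generator (the original analysis was run in Stata). The function names, the seed, and the use of the Mann-Whitney statistic for the empirical AUC are illustrative choices rather than details taken from the source.

```python
import random


def simulate_fully_verified(n=5000, prevalence=0.10, seed=1):
    """Simulate (test_result, disease) pairs for a fully verified data set."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # a. Gold standard: Bernoulli with mean equal to the disease incidence.
        disease = 1 if rng.random() < prevalence else 0
        # b. Diagnostic test: log-normal, log-scale mean 0 (no disease) or
        #    1 (disease), log-scale standard deviation 1.
        test_result = rng.lognormvariate(float(disease), 1.0)
        data.append((test_result, disease))
    return data


def empirical_auc(data):
    """Mann-Whitney estimate of P(test result is higher in a diseased subject)."""
    rank_sum = 0
    n_dis = 0
    for rank, (_, disease) in enumerate(sorted(data), start=1):
        if disease:
            rank_sum += rank
            n_dis += 1
    n_non = len(data) - n_dis
    return (rank_sum - n_dis * (n_dis + 1) / 2) / (n_dis * n_non)


data = simulate_fully_verified()
observed_prevalence = sum(d for _, d in data) / len(data)
auc = empirical_auc(data)
print(f"prevalence {observed_prevalence:.3f}, empirical AUC {auc:.3f}")
```

The empirical AUC of any single simulated data set varies around its population value through sampling error alone, which is why the later steps compare estimates across many replications.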
B) We introduced verification bias to the data in step A such that a certain proportion v of participants were verified, where v was varied as an experimental parameter. The probability $p_i$ of verification for each subject $i$ increased with the diagnostic test result using the formula $\log[p_i/(1 - p_i)] = \alpha + 0.5\,d_i$, where $d_i$ was the decile of the diagnostic test result and the constant α was adjusted to fix the overall probability of verification to v. This gives the probabilities shown in the top half of table . We then applied a correction for verification bias, as shown in the appendix. Note that verification status depended solely on the diagnostic test result, therefore fulfilling the missing at random assumption required for this method. Since the diagnostic test in our simulation has a continuous distribution, sensitivity and specificity were derived for multiple thresholds by dichotomizing the subjects into abnormal (above the threshold) and normal (below the threshold). A receiver operating characteristic curve was then constructed from these estimates [2] to calculate an AUC corrected for verification bias.
Probability of verification used in the simulations for each decile of the diagnostic test result
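One way to obtain per-decile verification probabilities of this kind is sketched below. It assumes that the verification model is a logit link, log[p/(1 − p)] = α + 0.5d with d = 1, ..., 10; this is a reading of the formula in step B, not something the extracted text confirms. It also assumes that each decile holds exactly one tenth of the subjects, so the overall verification rate is the simple mean of the ten probabilities. The function names are hypothetical, and the printed values are not guaranteed to match the table.

```python
import math


def decile_probabilities(alpha, slope=0.5):
    """Verification probability for deciles 1..10 under the assumed logit model."""
    return [1 / (1 + math.exp(-(alpha + slope * d))) for d in range(1, 11)]


def solve_alpha(target_rate, slope=0.5, lo=-30.0, hi=30.0, tol=1e-10):
    """Find alpha by bisection so the mean probability equals target_rate."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # The mean probability increases with alpha, so move the bracket up
        # when the current mean is too low.
        if sum(decile_probabilities(mid, slope)) / 10 < target_rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


solved = {}
for v in (0.10, 0.30, 0.60):
    alpha = solve_alpha(v)
    solved[v] = decile_probabilities(alpha)
    formatted = ", ".join(f"{p:.3f}" for p in solved[v])
    print(f"v = {v:.0%}: alpha = {alpha:.3f}; p by decile = {formatted}")
```

Because verification depends on the test result only through its decile, the missing at random assumption holds by construction in this sketch, as it does in the simulation described in step B.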
C) We repeated step B five times. Since we introduced verification bias in the same manner each time, we would expect no important differences in data structure between replications. Using the same argument, we would expect no important difference in verification bias corrected AUC unless standard methods were not appropriate for these data.
D) We compared the true AUC from step A to the verification bias corrected AUC corresponding to each replication from step C.
The AUC was calculated using the trapezoid rule, where sensitivity and specificity were estimated (a) over 10 categories based on the deciles of the diagnostic test result and (b) for each unique value of the diagnostic test result using semiparametric efficient estimators (subsequently referred to as the Alonzo-Pepe method), the latter of which has been shown to have minimal bias when the verification mechanism is known [10]. We specified that the simulated data set have 5000 participants with 10% verification since these are common characteristics of large screening studies [4
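The trapezoid rule step can be illustrated with a minimal sketch that takes a list of (sensitivity, specificity) pairs, one per threshold, and integrates the resulting ROC curve. The function name is hypothetical and the operating points in the example are invented for illustration; they are not results from the simulation.

```python
def auc_trapezoid(sens_spec_pairs):
    """Area under the ROC curve through the given points, by the trapezoid rule.

    Each pair is (sensitivity, specificity) at one threshold. The corners
    (0, 0) and (1, 1) are added so the curve spans the full range.
    """
    # ROC coordinates: x = 1 - specificity, y = sensitivity.
    points = [(1 - spec, sens) for sens, spec in sens_spec_pairs]
    points += [(0.0, 0.0), (1.0, 1.0)]
    points.sort()
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Invented operating points at three thresholds (illustration only).
example_points = [(0.90, 0.40), (0.70, 0.70), (0.40, 0.90)]
example_auc = auc_trapezoid(example_points)
print(f"AUC = {example_auc:.3f}")
```

With a single threshold the rule reduces to the average of sensitivity and specificity, so a bias in either estimate at any threshold carries directly into the AUC.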
We performed a simulation experiment where we repeated steps A and B 2000 times and report the mean of the true and verification bias corrected AUC, as well as the 2.5th – 97.5th percentiles and coverage. Coverage was the proportion of 95% confidence intervals, constructed using bootstrap methods with 2000 replications, containing the true value of 0.750. We performed this simulation experiment varying the proportion verified (v = 10, 30, and 60%). Our intent in varying the proportion verified was to vary the frequency of the cell counts while keeping the relationship between the diagnostic test and outcome the same. For example, with all else being equal, one would be less likely to encounter small cell counts with 60% verified compared to 10% verified. To test whether small cell counts or overall verification rates drove our findings, we repeated our simulations using probabilities of verification as shown in the bottom half of table : in this case, the probability of small numbers of false negatives is higher in the scenario with a higher overall verification rate. For the simulations, the AUC was calculated by estimating sensitivity and specificity over 10 categories based on the deciles of the diagnostic test result; we did not calculate the AUC using the Alonzo-Pepe method as it gave similar results. All statistical analyses were conducted using Stata 9.2 (StataCorp, College Station, TX).
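The two summaries reported for each scenario, the 2.5th to 97.5th percentile range of the estimates and the coverage of the confidence intervals, can be computed as in the sketch below. The function names are hypothetical, the percentile rule is a simple nearest-rank approximation rather than Stata's definition, and the input values are invented for illustration.

```python
def coverage(intervals, true_value):
    """Proportion of (lower, upper) intervals that contain true_value."""
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    return hits / len(intervals)


def percentile_range(estimates, lower_pct=2.5, upper_pct=97.5):
    """Approximate percentile range of the estimates (nearest-rank rule)."""
    ordered = sorted(estimates)
    last = len(ordered) - 1
    return (ordered[round(last * lower_pct / 100)],
            ordered[round(last * upper_pct / 100)])


# Invented confidence intervals for the AUC from four replications.
toy_intervals = [(0.70, 0.80), (0.76, 0.90), (0.60, 0.74), (0.74, 0.76)]
toy_coverage = coverage(toy_intervals, 0.750)

# Invented corrected AUC estimates from the same replications.
toy_estimates = [0.75, 0.83, 0.67, 0.75]
low, high = percentile_range(toy_estimates)

print(f"coverage {toy_coverage:.0%}, range {low:.2f} to {high:.2f}")
```

For a well-behaved estimator, coverage of nominal 95% intervals should be close to 95%; values well below that indicate that the intervals understate the uncertainty or that the estimator is biased.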