Our derivations demonstrate that the observed ROC curve differs from the true ROC curve, with the amount of bias depending on the correlation between the screening tests for participants with disease, ρC, the rate of signs and symptoms, ψ, and the threshold for recall, θ. In some cases, the bias affects the observed ROC curves for both screening tests equally, and the scientific conclusion is the same as it would have been had the true disease state been observed. In other cases, the bias changes the direction of the scientific conclusion. The direction changes only when, among participants with disease, a higher proportion of scores on one screening test leads to recall than on the other. For that screening test, a larger percentage of participants with true disease therefore go on to have their disease status correctly ascertained, and observed in the study, than for the other screening test.
Figure 2 and Figure 3 demonstrate the possible effects of the bias on the scientific conclusions. In Figure 2, the study investigator will draw the wrong scientific conclusion. In Figure 3, the study investigator will draw the correct scientific conclusion, despite the presence of bias. For Figure 3, the participants with the highest 8% of scores on both screening Test 1 and screening Test 2 will be recalled for the sensitive and specific secondary test. For Figure 2, the participants with the highest 34% of scores on Test 1 will be recalled, but only the highest 8% on Test 2. In general, the scientific conclusion is correct when both screening tests lead to the secondary test at the same rate. The scientific conclusion may be wrong when the chance of proceeding to the secondary test depends on which screening test produced a high score.
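The recall fractions above can be translated into recall thresholds. The sketch below is a minimal illustration, assuming (hypothetically) that each screening test score is standard normal in the population; the 8% and 34% recall fractions come from the examples above, but the normality assumption and function names are illustrative only.

```python
from statistics import NormalDist

# Illustrative sketch: convert a recall fraction into a recall threshold,
# ASSUMING standard-normal screening scores (an assumption for illustration,
# not the paper's stated model).
std_normal = NormalDist()

def recall_threshold(recall_fraction):
    """Score above which the top `recall_fraction` of participants fall."""
    return std_normal.inv_cdf(1.0 - recall_fraction)

theta_test1 = recall_threshold(0.34)   # top 34% recalled (Test 1, Figure 2)
theta_test2 = recall_threshold(0.08)   # top 8% recalled  (Test 2, Figure 2)

# A lower threshold for Test 1 means more Test 1 positives proceed to the
# secondary test, so more of Test 1's true positives get verified.
print(round(theta_test1, 3), round(theta_test2, 3))
```

The lower Test 1 threshold is what drives the differential verification described in the rest of this section.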
Figure 2 True and observed ROC curves for a hypothetical example where bias changes the scientific conclusion. The parameters for this example were chosen to illustrate a case where paired screening trial bias may cause an incorrect scientific conclusion.
Figure 3 True and observed ROC curves for a hypothetical example where bias did not change the scientific conclusion. The parameters for this example were chosen to illustrate a case where paired screening trial bias did not change the direction of the difference.
As shown in Figures 2 and 3, the observed curves have inflection points, where the slope changes. There is no inflection point in the true ROC curve for either test, because the formulae that govern the sensitivity and specificity for the true curves are the same no matter what the ROC cutpoint is (see the tables of formulae). By contrast, as shown in the same tables, the formulae for the observed ROC curves change depending on whether the cutpoint is above or below θ. This causes a change in slope in the observed ROC curve. The inflection point is more obvious for Test 2 than for Test 1. The inflection point for Test 1 occurs at a specificity of about 0.80, and is obscured in Figure 2. In general, as θ increases relative to the mean of the test score distribution, the point of inflection occurs at higher values of specificity.
In Figure 2, the true ROC curve for screening Test 2 is higher than the true ROC curve for screening Test 1. Thus, screening Test 2 has better true diagnostic accuracy than screening Test 1. However, the observed ROC curve for screening Test 1 is higher than the observed ROC curve for Test 2.
In Figure 2, bias in the observed ROC curves leads to bias in the observed AUC for each test. Recall that in reality, screening Test 2 has better diagnostic accuracy than screening Test 1. The true AUC of screening Test 1 is 0.64, and the true AUC of screening Test 2 is 0.70. However, the observed AUC tells a different story. The observed AUC for screening Test 1 is 0.82, and the observed AUC for screening Test 2 is 0.75.
Since Test 2 truly has better diagnostic accuracy than Test 1, the true difference in AUC between screening Test 2 and Test 1 is positive (Test 2 true AUC − Test 1 true AUC = 0.70 − 0.64 = 0.06). However, in Figure 2, the observed difference in AUC between Test 2 and Test 1 is negative (Test 2 observed AUC − Test 1 observed AUC = 0.75 − 0.82 = −0.07). If the study investigator were to observe these exact theoretical results, the study investigator would conclude that screening Test 1 has better diagnostic accuracy than Test 2, when in fact the opposite is true.
Study investigators never observe the true state of nature. They observe data and make estimates, the precision of which depends on the sample size. They decide which screening test is better using hypothesis tests. To see which conclusion the hypothesis tests would suggest, both for the true disease status and for the observed disease status, we conducted a simulation. For the parameters of Figure 2, with a Type I error rate of 0.05, if the true disease status were known, a non-parametric test [8] would have 90% power with 33,000 participants. With the true disease status known, we would reject the null hypothesis roughly 90% of the time. The remaining 10% of the time, we would conclude that there is no difference in AUC between Test 1 and Test 2. If the true disease status were known, every time we rejected the null hypothesis, we would correctly conclude that Test 2 is better than Test 1.
If we conduct the same simulation experiment from the point of view of the study investigator, for the experimental situation of Figure 2, we see only the observed state of disease. In that case, the study investigator will reject the null hypothesis only 71% of the time. The remaining 29% of the time, the study investigator will conclude that there is no difference in AUC between Test 1 and Test 2. The lower power is due to greater variance in the observed data, compared to the true data. Every time the study investigator rejects the null hypothesis, she incorrectly concludes that Test 1 is better than Test 2.
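The non-parametric comparison underlying such a simulation rests on the empirical AUC, which equals the Mann-Whitney probability that a randomly chosen diseased participant's score exceeds a randomly chosen non-diseased participant's score. The sketch below illustrates only that estimator; the sample sizes and distribution parameters are assumptions for illustration, not the values used in the paper's simulation.

```python
import random

# Empirical AUC via the Mann-Whitney statistic: the fraction of
# (diseased, healthy) pairs in which the diseased score is higher,
# counting ties as half.
def empirical_auc(diseased, healthy):
    wins = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(diseased) * len(healthy))

# Illustrative data (assumed, not the paper's): diseased scores shifted
# upward by 0.5 standard deviations, so the true AUC is modestly above 0.5.
rng = random.Random(0)
healthy  = [rng.gauss(0.0, 1.0) for _ in range(500)]
diseased = [rng.gauss(0.5, 1.0) for _ in range(500)]
auc = empirical_auc(diseased, healthy)
print(round(auc, 3))
```

A power simulation of the kind described above would repeat this estimate over many replicated trials, once using true disease status and once using observed disease status, and count how often the null hypothesis is rejected.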
The incorrect conclusion in Figure 2 is the result of a cascade of errors. The observed sensitivity for Test 1 is inflated more than the observed sensitivity for Test 2. The increase in observed sensitivity makes the observed ROC curve higher for Test 1 than for Test 2. A higher observed ROC curve means a higher observed AUC for Test 1 than for Test 2.
To understand how and why paired screening trial bias occurs, consider a single specificity value on the true and observed ROC curves shown in Figure 2. Choose the value of specificity where there is the greatest increase in observed sensitivity relative to true sensitivity for Test 1. This occurs when specificity is 0.82. For a hypothetical study of 10,000 participants, and specificity of 0.82, the observed and true 2 × 2 tables for Test 1 and Test 2 are shown in Figure 4.
Figure 4 For the hypothetical example of Figure 2, true and observed 2 × 2 tables. Numbers were rounded to the nearest whole number. All tables were calculated at a specificity of about 0.82. This point was chosen because the difference between the true and observed sensitivity for Test 1 is greatest there.
Each one of the four tables uses a slightly different ROC cutpoint. For the observed table, Test 1 is positive if it exceeds 2.511; for the true table, Test 1 is positive if it exceeds 2.515. For the observed table, Test 2 is positive if it exceeds 1.269; for the true table, Test 2 is positive if it exceeds 1.265. The tables have different ROC cutpoints because they were chosen to have the same specificity, not the same cutpoint.
Also, the number of cases of disease observed in the study, 45, is much smaller than the true number of cases of disease in the population, 100. The observed number of cases of disease is smaller than the true number because not every participant undergoes the invasive, yet sensitive and specific secondary test, and thus some cases of disease are missed. The observed number of cases of disease is the denominator of the observed sensitivity. Because the denominator is smaller for observed sensitivity than for true sensitivity, the observed sensitivity is strongly inflated for both tests. When specificity is 0.82, the observed sensitivity of Test 1 is 0.72, with true sensitivity of 0.33. For Test 2, the observed sensitivity is 0.52, with true sensitivity of 0.43.
Yet if the bias only affected the denominator, the inflation in sensitivity would be the same for both tests. After all, the same number of observed cases is used as the denominator for both tests. The differential inflation for Test 1 compared to that for Test 2 must be due to the numerator of the observed sensitivity.
For Test 2, the numerator of the observed sensitivity is the number of study participants who are positive on Test 2, and who are observed to have disease in the study. For Test 2, the numerator for observed sensitivity, 23, is smaller than the true numerator, 43. The difference occurs because disease can only be observed if the invasive, yet sensitive and specific secondary test is used. Even though the participants have a score that exceeds the ROC cutpoint for Test 2, they do not all undergo the invasive, yet sensitive and specific secondary test. Thus, they do not yield observed cases of disease. By contrast, for Test 1, because the ROC cutpoint is higher than the threshold which leads to the invasive, yet sensitive and specific secondary test, every participant positive on Test 1 undergoes the secondary test, and is shown to have disease. For each test, there is a different proportion of participants who exceed the cutpoint, who truly have disease, and who proceed to secondary testing. This is the source of the differential bias that causes the curves to reverse order in Figure 2.
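The denominator and numerator effects described above can be checked directly from the counts quoted for the 2 × 2 tables. The sketch below reproduces that arithmetic; small discrepancies with the quoted sensitivities reflect the rounding of table entries to whole numbers.

```python
# Counts from the hypothetical 10,000-participant example at
# specificity ~0.82 (Figure 4 of the text).
true_cases = 100      # true number of diseased participants
observed_cases = 45   # diseased participants whose disease is verified

# Test 2: not every Test 2 positive crosses the recall threshold,
# so its observed numerator shrinks from 43 to 23.
test2_true_sens = 43 / true_cases       # 0.43
test2_obs_sens  = 23 / observed_cases   # ~0.51 (quoted as 0.52 after rounding)

# Test 1: its ROC cutpoint lies above the recall threshold, so every
# Test 1 positive is verified and the numerator is essentially unchanged.
test1_true_sens = 33 / true_cases       # 0.33
test1_obs_sens  = 33 / observed_cases   # ~0.73 (quoted as 0.72 after rounding)

# Both observed sensitivities are inflated (smaller denominator), but
# Test 1 is inflated far more (its numerator is not reduced).
print(round(test1_obs_sens, 2), round(test2_obs_sens, 2))
```

The shared denominator (45) inflates both tests equally; only the numerator reduction for Test 2 is differential, which is why the observed curves swap order.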
Paired screening trial bias also increases as the proportion of participants with disease who have signs and symptoms (ψ) decreases. If all the cases of the disease were observed during the trial, there would be no difference between true and observed disease status, and no bias. Yet, in every screening trial, some cases of disease are not identified by either screening test, and never present with signs and symptoms. As the proportion of participants presenting with signs and symptoms (ψ) decreases, fewer cases of disease are discovered during the trial in the interval after screening, and the difference between observed and true disease status grows.
Paired screening trial bias increases as the correlation between the results of the screening tests for participants with disease, ρC, increases. The bias in the observed ROC curves increases because, as the two index tests become more highly correlated, the number of observed cases of disease becomes smaller relative to the number of true cases of disease. When the two index tests are highly correlated, they essentially produce the same information as to whether a participant has disease. When the index tests are independent, each test makes diagnoses on its own that the other test misses. Thus, when the tests are independent, and ρC is 0, the number of observed cases is highest relative to the number of true cases. The percentage of participants receiving the infallible secondary test increases as ρC decreases. The bias lessens as the true disease status is ascertained for more participants.
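The joint effect of ρC and ψ on case ascertainment can be illustrated with a small Monte Carlo sketch. This is not the paper's exact model: the bivariate-normal scores, the threshold θ = 1.4, the diseased mean of 1.0, and ψ = 0.1 are all assumptions chosen only to show the direction of the effect.

```python
import math
import random

# Illustrative simulation (assumed model, not the paper's): among diseased
# participants, Test 1 and Test 2 scores are bivariate normal with
# correlation rho_c. A case is OBSERVED only if either score exceeds the
# recall threshold theta, or the participant presents with signs and
# symptoms (probability psi).
def observed_fraction(rho_c, psi=0.1, theta=1.4, mean=1.0, n=100_000, seed=1):
    rng = random.Random(seed)
    observed = 0
    for _ in range(n):
        g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
        s1 = mean + g1
        s2 = mean + rho_c * g1 + math.sqrt(1 - rho_c ** 2) * g2
        if s1 > theta or s2 > theta or rng.random() < psi:
            observed += 1
    return observed / n

low_corr  = observed_fraction(rho_c=0.0)  # independent tests: most cases verified
high_corr = observed_fraction(rho_c=0.9)  # highly correlated: fewer cases verified
print(round(low_corr, 3), round(high_corr, 3))
```

When the tests are independent, each catches diseased participants the other misses, so the union of recalls covers more true cases; as ρC rises toward 1, the two recall sets coincide and the observed fraction of true cases falls, which is the mechanism described above.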
In general, paired screening trial bias tends to strongly increase the estimate of sensitivity, while slightly decreasing the estimate of specificity. The increase in observed sensitivity compared to true sensitivity is expected with verification bias [9].