Our analysis has shown that differences in study design and patient selection are associated with variations in estimates of diagnostic accuracy. Accuracy was lower in studies that selected patients on the basis of whether they had been referred for the index test rather than on clinical symptoms, whereas it was significantly higher in studies with nonconsecutive inclusion of patients and in those with retrospective data collection. Increases of comparable or even greater size were seen in studies that included severe cases and healthy controls and in those in which 2 or more reference standards were used to verify index test results, but the corresponding confidence intervals were wider.
We found that studies that used retrospective data collection or that routinely collected clinical data were associated with an overestimation of the DOR by 60%. In studies in which data collection is planned after all index tests have been performed, researchers may find it difficult to use unambiguous inclusion criteria and to identify patients who received the index test but whose test results were not subsequently verified.48,49
Studies that used nonconsecutive inclusion of patients were associated with an overestimation of the DOR by 50% compared with those that used a consecutive series of patients. Studies conducted early in the evaluation of a test may have preferentially excluded more complex cases, which may have led to higher estimates of diagnostic accuracy. Yet if clear-cut cases are excluded because the reference standard is costly or invasive, diagnostic accuracy will be underestimated. These 2 mechanisms, with opposing effects, may explain why other studies have reported different results: either lower estimates of accuracy in studies with nonconsecutive inclusion50 or, on average, no effect on accuracy estimates.13
We found that studies that selected patients on the basis of whether they had been referred for the index test or on the basis of previous test results tended to have lower estimates of diagnostic accuracy than studies that set out to include all patients with prespecified symptoms. The interpretation of this finding is not straightforward. We speculate that, with this form of patient selection, patients strongly suspected of having the target condition may bypass further testing, whereas those with a low likelihood of having the condition may never be tested at all. These mechanisms tend to lower the proportion of true-positive and true-negative test results.51
An extreme form of selective patient inclusion occurred in the studies that included severe cases and healthy controls. These case–control studies had much higher estimates of diagnostic accuracy (RDOR 4.9), although the low number of such studies led to wide confidence intervals. Severe cases are easier to detect with the use of the index test, which would lead to higher estimates of sensitivity in studies with more severe cases.52 The inclusion of healthy controls is likely to lower the occurrence of false-positive results, thereby increasing specificity.52 Other studies have also reported overestimation of diagnostic accuracy in this type of case–control study.13,50
Verification is a key issue in any diagnostic accuracy study. Studies that relied on 2 or more reference standards to verify the results of the index test reported odds ratios that were on average 60% higher than the odds ratios in studies that used a single reference standard. This difference probably arises from differences between reference standards in how they define the target conditions or in their quality.53 If misclassifications by the second reference standard are correlated with index test errors, agreement will artificially increase, which would lead to higher estimates of diagnostic accuracy. Our result is in line with that of the study by Lijmer and colleagues,13 who reported a 2-fold increase with a confidence interval overlapping ours.
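Illustration (hypothetical data, not from the study). The mechanism described above, in which a second reference standard repeats the errors of the index test, can be made concrete with a small calculation. The Python sketch below assumes a cohort of 1000 patients, an index test with sensitivity 0.80 and specificity 0.90, and a second reference standard that is applied only to index-test negatives and misses half of the diseased patients among them. The counts, the 0.5 miss rate and the names dor and missed_by_second_standard are assumptions chosen for illustration.

```python
# Hypothetical cohort; every number below is an illustrative assumption,
# not a value taken from the study.

def dor(tp, fp, fn, tn):
    """Diagnostic odds ratio of a 2x2 table: (TP * TN) / (FP * FN)."""
    return (tp * tn) / (fp * fn)


# True status by a single, error-free reference standard applied to everyone
# (assumed: 300 diseased, 700 non-diseased, sensitivity 0.80, specificity 0.90).
tp, fn = 240, 60
fp, tn = 70, 630

# Differential verification: index-test positives still get the error-free
# reference standard; index-test negatives get a second reference standard
# that misses a share of the diseased patients the index test also missed.
missed_by_second_standard = 0.5   # assumed share of the 60 FN labelled "not diseased"

relabelled = fn * missed_by_second_standard
fn_obs = fn - relabelled          # false negatives that are still recognised
tn_obs = tn + relabelled          # missed cases now counted as true negatives

dor_single = dor(tp, fp, fn, tn)
dor_mixed = dor(tp, fp, fn_obs, tn_obs)
rdor = dor_mixed / dor_single     # relative DOR: two standards vs. one

sens_single = tp / (tp + fn)
sens_mixed = tp / (tp + fn_obs)

print(f"single reference standard: sensitivity {sens_single:.2f}, DOR {dor_single:.1f}")
print(f"two reference standards:   sensitivity {sens_mixed:.2f}, DOR {dor_mixed:.1f}")
print(f"relative DOR: {rdor:.2f}")
```

Under these assumptions the DOR rises from 36.0 to 75.4 (relative DOR 2.10) although the index test itself has not changed: every diseased patient missed by both the index test and the second reference standard is counted as a true negative. On the same scale, odds ratios that are on average 60% higher correspond to a relative DOR of 1.6.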
As in the study by Lijmer and colleagues, we were unable to demonstrate a consistent effect of partial verification. This may be because the direction and magnitude of the effect of partial verification are difficult to predict. If a proportion of negative test results is not verified, this tends to increase sensitivity and lower specificity, which may leave the odds ratio unchanged.54
We were unable to demonstrate significant associations between estimates of the DOR and a number of design features. The absence of an association in our model does not imply that these design features should be ignored in any given accuracy study, since the effect of design differences may vary between meta-analyses, or even within a single meta-analysis.
The results of our study need to be interpreted with the following limitations and strengths in mind. We were hampered by the low quality of reporting in the studies. Several design-related characteristics could not be adequately examined because of incomplete reporting (e.g., frequency of indeterminate test results and of dropouts, patient selection criteria, clinical spectrum, and the degree of blinding). We used the odds ratio as our main accuracy measure, which is a convenient summary statistic,55,56 but it may be insensitive to phenomena that produce opposing changes in sensitivity and specificity. Further studies should explore the effects of these design features on other accuracy measures, such as sensitivity, specificity and likelihood ratios.
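Illustration (hypothetical data, not from the study). The insensitivity of the odds ratio to opposing shifts in sensitivity and specificity can be checked with simple arithmetic. The Python sketch below takes an assumed fully verified 2-by-2 table and then applies partial verification, in which all positive index test results but only half of the negative results are verified and unverified patients are left out of the analysis. It assumes that verified negatives are selected without regard to true disease status; the counts, the 0.5 verification share and the function name accuracy are illustrative choices.

```python
# Hypothetical 2x2 counts; all values are illustrative assumptions.

def accuracy(tp, fp, fn, tn):
    """Return common accuracy measures for one 2x2 table."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "LR+": sens / (1 - spec),
        "LR-": (1 - sens) / spec,
        "DOR": (tp * tn) / (fp * fn),
    }


# Complete verification (assumed): every index test result is verified.
tp, fp, fn, tn = 240, 70, 60, 630
full = accuracy(tp, fp, fn, tn)

# Partial verification (assumed): all positives are verified, but only half of
# the negatives, chosen without regard to true disease status. Unverified
# patients drop out of the table, so FN and TN shrink by the same factor.
verified_share = 0.5
partial = accuracy(tp, fp, fn * verified_share, tn * verified_share)

for name in full:
    print(f"{name:<12} full {full[name]:7.3f}   partial {partial[name]:7.3f}")
```

In this scenario sensitivity rises from 0.80 to 0.89 and specificity falls from 0.90 to 0.82, and both likelihood ratios change (LR+ from 8.0 to 4.9, LR- from 0.22 to 0.14), yet the DOR stays at 36.0 because the verification share cancels from its numerator and denominator. The cancellation is exact only under the stated assumption; if verification among negatives depends on disease status, the DOR shifts as well.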
Our study can be seen as a validation and extension of the study of Lijmer and colleagues.13 To ensure independent validation, we did not include any of their meta-analyses in our study. Furthermore, we replaced the fixed-effects approach used by them with a more appropriate random-effects approach, which allowed the effects of the design covariates to vary between meta-analyses. This explains the wider confidence intervals in our study, even though we included 269 more studies than Lijmer and colleagues did.
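Illustration (hypothetical data, not from the study). Why a random-effects approach gives wider confidence intervals than a fixed-effects approach can be shown with a simplified calculation. The Python sketch below pools assumed relative DORs for a single design feature from six meta-analyses, first with inverse-variance fixed-effect weights and then with a DerSimonian-Laird random-effects estimate of the between-meta-analysis variance. This is a deliberately reduced, univariate stand-in and not the regression model used in this study; the six RDORs, their standard errors and all function names are assumptions.

```python
import math

# Hypothetical relative DORs for ONE design feature, estimated separately in
# six meta-analyses, with the standard errors of their logarithms.
# All numbers are illustrative assumptions, not data from the study.
rdors = [0.8, 1.2, 1.5, 2.5, 3.0, 1.0]
ses = [0.20, 0.25, 0.15, 0.30, 0.25, 0.20]

y = [math.log(r) for r in rdors]   # work on the log scale
v = [s ** 2 for s in ses]          # within-meta-analysis variances


def pool(y, v, tau2=0.0):
    """Inverse-variance pooled mean and its standard error.

    tau2 is the between-meta-analysis variance; tau2 = 0 gives the
    fixed-effect estimate.
    """
    w = [1.0 / (vi + tau2) for vi in v]
    mean = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return mean, math.sqrt(1.0 / sum(w))


def dersimonian_laird_tau2(y, v):
    """Method-of-moments estimate of the between-meta-analysis variance."""
    w = [1.0 / vi for vi in v]
    fixed_mean, _ = pool(y, v)
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, y))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - (len(y) - 1)) / c)


def ci95(mean, se):
    """95% confidence interval, back-transformed to the RDOR scale."""
    return math.exp(mean - 1.96 * se), math.exp(mean + 1.96 * se)


tau2 = dersimonian_laird_tau2(y, v)
fe_mean, fe_se = pool(y, v)          # fixed effect: tau2 = 0
re_mean, re_se = pool(y, v, tau2)    # random effects: tau2 added to each variance

for label, mean, se in (("fixed effect  ", fe_mean, fe_se),
                        ("random effects", re_mean, re_se)):
    low, high = ci95(mean, se)
    print(f"{label}: RDOR {math.exp(mean):.2f} (95% CI {low:.2f} to {high:.2f})")
print(f"between-meta-analysis variance tau^2 = {tau2:.3f}")
```

Because the assumed RDORs differ more than their standard errors can account for, the estimated between-meta-analysis variance is greater than zero and is added to the variance of every estimate. The random-effects interval is therefore roughly twice as wide as the fixed-effect interval on the log scale, even though both are based on the same six estimates.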
In general, the results of our study provide further empirical evidence of the importance of design features in studies of diagnostic accuracy. Studies of the same test can produce different estimates of diagnostic accuracy depending on choices in design. We feel that our results should be taken into account by researchers when designing new primary studies as well as by reviewers and readers who appraise these studies. Initiatives such as STARD (Standards for Reporting of Diagnostic Accuracy [www.consort-statement.org/stardstatement.htm]) should be endorsed to improve the awareness of design features, the quality of reporting and, ultimately, the quality of study designs. Well-reported studies with appropriate designs will provide more reliable information to guide decisions on the use and interpretation of test results in the management of patients.