Studies with methodologic shortcomings can overestimate the accuracy of a medical test. We sought to determine and compare the direction and magnitude of the effects of a number of potential sources of bias and variation in studies on estimates of diagnostic accuracy.
We identified meta-analyses of the diagnostic accuracy of tests through an electronic search of the databases MEDLINE, EMBASE, DARE and MEDION (1999–2002). We included meta-analyses with at least 10 primary studies without preselection based on design features. Pairs of reviewers independently extracted study characteristics and original data from the primary studies. We used a multivariable meta-epidemiologic regression model to investigate the direction and strength of the association between 15 study features and estimates of diagnostic accuracy.
We selected 31 meta-analyses with 487 primary studies of test evaluations. Only 1 study had no design deficiencies. The quality of reporting was poor in most of the studies. We found significantly higher estimates of diagnostic accuracy in studies with nonconsecutive inclusion of patients (relative diagnostic odds ratio [RDOR] 1.5, 95% confidence interval [CI] 1.0–2.1) and retrospective data collection (RDOR 1.6, 95% CI 1.1–2.2). The estimates were highest in studies that had severe cases and healthy controls (RDOR 4.9, 95% CI 0.6–37.3). Studies that selected patients based on whether they had been referred for the index test, rather than on clinical symptoms, produced significantly lower estimates of diagnostic accuracy (RDOR 0.5, 95% CI 0.3–0.9). The variance between meta-analyses of the effect of design features was large to moderate for type of design (cohort v. case–control), the use of composite reference standards and the use of differential verification; the variance was close to zero for the other design features.
Shortcomings in study design can affect estimates of diagnostic accuracy, but the magnitude of the effect may vary from one situation to another. Design features and clinical characteristics of patient groups should be carefully considered by researchers when designing new studies and by readers when appraising the results of such studies. Unfortunately, incomplete reporting hampers the evaluation of potential sources of bias in diagnostic accuracy studies.
Although the number of test evaluations in the literature is increasing, much remains to be desired in terms of methodology. A series of surveys have shown that only a small number of studies of diagnostic accuracy fulfil essential methodologic standards.1,2,3
Shortcomings in the design of clinical trials are known to affect results. The biasing effects of inadequate randomization procedures and differential dropout have been discussed and demonstrated in several publications.4,5,6 A growing understanding of the potential sources of bias and variation has led to the development of guidelines to help researchers and readers in the reporting and appraisal of results from randomized trials.7,8 More recently, similar guidelines have been published to assess the quality of reporting and design of studies evaluating the diagnostic accuracy of tests. For many of the items in these guidelines, there is no or limited empirical evidence available on their potential for bias.9
In principle, such evidence can be collected by comparing studies that have design deficiencies with studies of the same test that have no such imperfections. Several large meta-analyses have used a meta-regression approach to account for differences in study design.10,11,12 Lijmer and colleagues examined a number of published meta-analyses and showed that studies that involved nonrepresentative patients or that used different reference standards tended to overestimate the diagnostic performance of a test.13 They looked at the influence of 6 methodologic criteria and 3 reporting features on the estimates of diagnostic accuracy in a limited number of clinical problems.
We conducted this study of a larger and broader set of meta-analyses of diagnostic accuracy to determine the relative importance of 15 design features on estimates of diagnostic accuracy.
An electronic search strategy was developed to identify all systematic reviews of studies evaluating the diagnostic accuracy of tests that were published between January 1999 and April 2002 in MEDLINE (OVID and PubMed), EMBASE (OVID), the Database of Abstracts of Reviews of Effects (DARE) of the Centre for Reviews and Dissemination (www.york.ac.uk/inst/crd/darehp.htm) and the MEDION database of the University of Maastricht (www.mediondatabase.nl/) (Appendix 1). We focused on recent reviews because we expected them to include more primary studies and a wider mix of studies with and without design deficiencies.
Systematic reviews were eligible if they included at least 10 primary studies of the accuracy of the same test, if study selection had not been based on one or more of the design features that we intended to evaluate, and if sensitivity and specificity were provided for at least 90% of the studies in the review (Fig. 1). Languages were restricted to English, German, French and Dutch. If 2 or more reviews addressed the same combination of index test and target condition, we included only the largest one to avoid duplicate inclusion of primary studies.
One of us (A.R.) completed the search and performed the initial selection of systematic reviews on the basis of abstracts and titles. Potentially eligible reviews were independently assessed by 2 researchers (A.R. and N.S., or A.R. and M.D.).
Standardized extraction forms and background documents were prepared for the evaluation of the eligibility of the systematic reviews and for the extraction of data and design features from the primary studies. All assessors attended a training session to become familiar with the use of these forms. No masking of authorship or journal name was applied during this or any of the following phases of the project. Inclusion criteria were tuned during the data extraction of the first few primary studies.
Paper copies of the reports of all of the primary studies were retrieved once a systematic review was included. We excluded primary studies if we were unable to reproduce the 2 × 2 tables.
A series of items was extracted from each report that addressed study design, patient group, verification procedure, test execution and interpretation, data collection, statistical analysis and quality of reporting. From this series, we assembled a list of 15 items as potential sources of bias or variation (Appendix 2). These items were selected on the basis of recent systematic reviews of the available literature.9,14,15 Table 1 displays 9 additional items that were selected to evaluate the quality of reporting.
One epidemiologist (A.R.) assessed all of the articles. A second independent assessment was performed by one member of a team of 5 clinicians and trained epidemiologists (N.S., M.D., J.R., J.vR., P.B.). Disagreements were discussed. If necessary, the ruling of a third assessor (J.R. or P.B.) was decisive.
We used a meta-epidemiologic regression approach to evaluate the effect of design deficiencies on estimates of diagnostic accuracy across the systematic reviews.16,17,18 Covariates indicating design features were used to examine whether, on average, studies that failed to meet certain methodologic criteria yielded different estimates of accuracy. The diagnostic odds ratio (DOR) was used as the summary measure of diagnostic accuracy.
Our model can be regarded as a random-effects regression extension of the summary receiver-operating-characteristic (ROC) model used in many systematic reviews of diagnostic accuracy.19
We modelled the DOR in a particular study of a test as a function of the summary DOR for that test, the threshold for positivity in that study, the effect of a series of design features, and residual error. We wanted to determine the average effect of the respective design features, expecting that the effect would differ between meta-analyses and that it could be more prominent for one test and less prominent for another. Using a regression approach, we adjusted the effect of one design feature for the potentially confounding effect of other design features. We allowed the DOR to be related to the positivity threshold in each meta-analysis, allowing for an ROC-like relation between sensitivity and specificity across studies in each meta-analysis.
More formally, our model, a single model including all studies from each meta-analysis, expresses the observed (log) DOR $d_{ij}$ in study $j$ in meta-analysis $i$ using the following equation 1:

$$d_{ij} = \alpha_i + \beta_i S_{ij} + \sum_{m} (\gamma_m + \upsilon_{im}) X_{ijm} + e_{ij} \qquad (1)$$

where $S_{ij}$ is the positivity threshold in each study, defined as the sum of logit(sensitivity) and logit(1 – specificity); $\alpha_i$ is the overall accuracy of the test studied in meta-analysis $i$; $\beta_i$ is the coefficient indicating whether the DOR varies with $S$ in each meta-analysis; $X_{ijm}$ is the value of the design feature covariate $m$ in study $j$ included in meta-analysis $i$; $\gamma_m$ is the average effect of feature $m$ across all meta-analyses; and $\upsilon_{im}$ expresses the deviation from that average effect in meta-analysis $i$, which is assumed to be normally distributed as follows (equation 2):

$$\upsilon_{im} \sim N(0, \sigma^2_{m}) \qquad (2)$$

If the variance of an effect between meta-analyses ($\sigma^2_{m}$) is close or equal to zero, the average effect of a design feature is about the same in each meta-analysis. Larger values of $\sigma^2_{m}$ indicate that the magnitude, or even the direction, of the effect of that design feature differs substantially from one meta-analysis to another. The error term $e_{ij}$ is also normally distributed as follows (equation 3):

$$e_{ij} \sim N(0, s^2_{ij} + \sigma^2_{e}) \qquad (3)$$

and it combines 2 sources of error: sampling error ($s^2_{ij}$), which is specific for each study $j$, and a single residual error term ($\sigma^2_{e}$), which is assumed to be constant across meta-analyses. The sampling variance of the (log) DOR in each study $j$ is defined as follows (equation 4):

$$s^2_{ij} = \frac{1}{a_{ij}} + \frac{1}{b_{ij}} + \frac{1}{c_{ij}} + \frac{1}{d_{ij}} \qquad (4)$$

where $a_{ij}$, $b_{ij}$, $c_{ij}$, $d_{ij}$ are the 4 cells of the 2 × 2 table of study $j$ in meta-analysis $i$ (the cell count $d_{ij}$ in equation 4 is distinct from the log DOR $d_{ij}$ in equation 1).
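As an illustration of the per-study quantities that enter the model, the following minimal Python sketch computes the log DOR, its sampling variance and the positivity threshold S from a single 2 × 2 table. The function name, the mapping of the 4 cells to true positives, false positives, false negatives and true negatives, and the example counts are assumptions made for illustration; they are not taken from the article. The sketch does not handle tables with zero cells.

```python
import math


def logit(p):
    """Log odds of a proportion p (0 < p < 1)."""
    return math.log(p / (1.0 - p))


def accuracy_summary(tp, fp, fn, tn):
    """Summarize one 2 x 2 table (hypothetical helper, not the authors' code).

    tp, fp, fn, tn: counts of true positives, false positives,
    false negatives and true negatives; all must be > 0.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Diagnostic odds ratio on the log scale.
    log_dor = math.log((tp * tn) / (fp * fn))
    # Sampling variance of the log DOR: sum of reciprocal cell counts.
    var_log_dor = 1.0 / tp + 1.0 / fp + 1.0 / fn + 1.0 / tn
    # Positivity threshold: logit(sensitivity) + logit(1 - specificity).
    threshold_s = logit(sensitivity) + logit(1.0 - specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "log_dor": log_dor,
        "var_log_dor": var_log_dor,
        "s": threshold_s,
    }


# Hypothetical study: 100 diseased and 100 nondiseased patients.
example = accuracy_summary(tp=90, fp=20, fn=10, tn=80)
print(round(math.exp(example["log_dor"]), 1))  # DOR = (90 * 80) / (20 * 10) = 36.0
```

In this hypothetical table, sensitivity is 0.90, specificity is 0.80 and the DOR is 36; smaller cell counts would yield a larger sampling variance and therefore less weight in an inverse-variance analysis.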
The coefficient γm of a particular design feature estimates the change in the log-transformed DOR between studies with and without that feature. It can be interpreted, after antilogarithm transformation, as a relative diagnostic odds ratio (RDOR). It shows the mean DOR of studies with a specific design deficiency relative to the mean DOR of studies without this deficiency. If the relative DOR is larger than 1, it implies that studies with that design deficiency yield larger estimates of the DOR than studies without it.
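The antilogarithm step can be sketched as follows. This is a minimal illustration, assuming a Wald-type 95% confidence interval on the log scale; the function names and the standard error used in the example are hypothetical and are not reported in the article.

```python
import math


def rdor_with_ci(gamma, se, z=1.96):
    """Back-transform a log-scale coefficient to an RDOR with a Wald-type CI.

    gamma: estimated coefficient (difference in log DOR).
    se: standard error of gamma (hypothetical input in this sketch).
    """
    rdor = math.exp(gamma)
    lower = math.exp(gamma - z * se)
    upper = math.exp(gamma + z * se)
    return rdor, lower, upper


def percent_change(rdor):
    """Express an RDOR as a percentage over- or underestimation of the DOR."""
    return (rdor - 1.0) * 100.0


# A coefficient of log(1.6) corresponds to an RDOR of 1.6,
# i.e., a DOR that is on average 60% higher in studies with the feature.
rdor, lower, upper = rdor_with_ci(gamma=math.log(1.6), se=0.18)
print(round(rdor, 2), round(percent_change(rdor)))
```

An RDOR below 1 is read the same way: an RDOR of 0.5 corresponds to a DOR that is on average 50% lower in studies with the feature.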
We used the PROC MIXED procedure of SAS to estimate the parameters of this model (SAS version 9.1, SAS Institute Inc, Cary, NC). This procedure allows for the specification of random effects and the specification of the known variances of the (log) DOR, which can be kept constant (inverse variance method). Further details on how to fit these models can be found in articles by van Houwelingen and colleagues.16,17
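The idea of weighting each study by the known variance of its (log) DOR can be illustrated with a deliberately reduced Python sketch. It is not the mixed model fitted with PROC MIXED: it assumes a single meta-analysis, one binary design feature, no threshold term and no random effects, in which case the inverse-variance weighted regression coefficient equals the difference between the weighted mean log DORs of the 2 groups. All names and data below are hypothetical.

```python
import math


def pooled_log_dor(studies):
    """Inverse-variance weighted mean of log DORs.

    studies: list of (log_dor, variance) pairs with known variances.
    Returns the pooled mean and its variance.
    """
    weights = [1.0 / var for _, var in studies]
    total = sum(weights)
    mean = sum(w * d for w, (d, _) in zip(weights, studies)) / total
    return mean, 1.0 / total


def crude_rdor(with_feature, without_feature, z=1.96):
    """Fixed-effect RDOR for one binary design feature (simplified stand-in)."""
    mean_1, var_1 = pooled_log_dor(with_feature)
    mean_0, var_0 = pooled_log_dor(without_feature)
    gamma = mean_1 - mean_0          # difference in mean log DOR
    se = math.sqrt(var_1 + var_0)    # the 2 groups are independent
    return math.exp(gamma), math.exp(gamma - z * se), math.exp(gamma + z * se)


# Hypothetical data: (log DOR, sampling variance) per study.
flawed = [(math.log(40), 0.20), (math.log(40), 0.40)]
sound = [(math.log(20), 0.25), (math.log(20), 0.25)]
estimate, ci_low, ci_high = crude_rdor(flawed, sound)
print(round(estimate, 2))  # 2.0
```

A full analysis along the lines described above would additionally adjust for the other design features and the positivity threshold, and would let the effect of each feature vary between meta-analyses.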
We used the following multivariable modelling strategy. We excluded covariates from the multivariable model when 50% or more of the studies failed to provide information on that design covariate. If that proportion was 10% or less, the corresponding studies were assigned to the potentially flawed category. Otherwise, the nonreported category was kept as such in the analysis. The results of the univariable analysis were used to decide whether categories of a design feature with only a few studies could be grouped together. Categories were combined only if the underlying mechanism of bias was judged to be similar and if the univariable effect estimates were comparable.
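The handling of nonreported design features described above can be written as a simple decision rule. The sketch below is an illustration only; the function name, the returned labels and the example counts are hypothetical.

```python
def handle_unreported(n_unreported, n_total):
    """Decide how to treat a design covariate given its nonreporting rate.

    Rule as described in the text:
    - 50% or more of studies do not report the feature: drop the covariate;
    - 10% or less: assign those studies to the potentially flawed category;
    - otherwise: keep "nonreported" as its own category.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_unreported <= n_total:
        raise ValueError("n_unreported must lie between 0 and n_total")
    proportion = n_unreported / n_total
    if proportion >= 0.50:
        return "exclude_covariate"
    if proportion <= 0.10:
        return "assign_to_flawed_category"
    return "keep_nonreported_category"


# Hypothetical nonreporting counts among 487 studies.
for missing in (250, 100, 40):
    print(missing, handle_unreported(missing, 487))
```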
Our search identified 191 potentially eligible systematic reviews, from which we were able to include 31 meta-analyses20–47 of 487 primary studies (Fig. 1). Two meta-analyses of the same clinical problem but with different restrictions of patient selection were analyzed as one meta-analysis.20,34 Another meta-analysis had to be split into 4 separate meta-analyses because of differences in test techniques between the studies.46 Because of the exclusion of some primary studies (Fig. 1) and the splitting of a meta-analysis, 6 meta-analyses had fewer than 10 studies.20,32,46 The included meta-analyses addressed a wide range of diagnostic problems in different clinical settings (Appendix 3). Index tests varied from signs and symptoms derived from history taking or physical examination to laboratory tests and imaging tests. This diversity in tests is also reflected in the pooled DORs, which ranged from 1.2 to 565 (median 30).
The characteristics of the included studies are listed in Table 2. Most of the 487 studies used a clinical cohort (445 [91%]), verified all index test results with a reference standard (453 [93%]) and interpreted the reference standard without integrating index test results (463 [95%]). Only 1 study fulfilled all 13 desired design features.
The quality of reporting per item varied, from reasonably good (age and sex distribution, definition of positive and negative index test results, and reference standard results) to poor (Table 1).
The results of the univariable analysis are presented in Appendix 4. Incomplete reporting precluded the investigation of 2 potential sources of bias. Information about noninterpretable test results and information about dropouts were reported in less than 50% of the studies and were therefore not analyzed any further. Of the remaining 13 design features, 6 were not reported in more than 10% of the studies (Table 2).
The relative effects of all of the characteristics in the multivariable model are shown in Table 2 and depicted in Fig. 2. The reference groups listed in Table 2 have, by definition, an RDOR of 1 and are therefore not presented in Fig. 2.
The largest overestimation of accuracy was found in studies that included severe cases and healthy controls (RDOR 4.9, 95% CI 0.6–37.3). Only 5 studies in 2 meta-analyses used such a design, which explains the broad confidence interval. In addition, the heterogeneity in effect between meta-analyses was large (variance 0.7), because there was severe overestimation in one of the meta-analyses (detection of gram-negative infection with the gelation Limulus amebocyte lysate test) and a much smaller effect in the other meta-analysis (detection of lifetime alcohol abuse or dependence with the CAGE questionnaire). The design features associated with a significant overestimation of diagnostic accuracy were nonconsecutive inclusion of patients and retrospective data collection. Random inclusion of eligible patients and differential verification were also associated with higher estimates of diagnostic accuracy, but these effects were not significant. The selection of patients on the basis of whether they had been referred for the index test, rather than on clinical symptoms, was significantly associated with lower estimates of accuracy.
The RDORs presented in Table 2 and Fig. 2 are average effects across different meta-analyses, and effects varied between meta-analyses. The amount of variance between meta-analyses provides an indication of the heterogeneity of an effect (Table 2). Moderate to large differences were found for study design (cohort v. case–control design), the use of composite reference standards and differential verification. For the other design features, the variance between meta-analyses was close to zero.
Our analysis has shown that differences in study design and patient selection are associated with variations in estimates of diagnostic accuracy. Accuracy was lower in studies that selected patients on the basis of whether they had been referred for the index test rather than on clinical symptoms, whereas it was significantly higher in studies with nonconsecutive inclusion of patients and in those with retrospective data collection. Comparable or even higher estimates of diagnostic accuracy occurred in studies that included severe cases and healthy controls and in those in which 2 or more reference standards were used to verify index test results, but the corresponding confidence intervals were wider in these studies.
We found that studies that used retrospective data collection or that routinely collected clinical data were associated with an overestimation of the DOR by 60%. In studies in which data collection is planned after all index tests have been performed, researchers may find it difficult to use unambiguous inclusion criteria and to identify patients who received the index test but whose test results were not subsequently verified.48,49
Studies that used nonconsecutive inclusion of patients were associated with an overestimation of the DOR by 50% compared with those that used a consecutive series of patients. Studies conducted early in the evaluation of a test may have preferentially excluded more complex cases, which may have led to higher estimates of diagnostic accuracy. Yet if clear-cut cases are excluded, because the reference standard is costly or invasive, diagnostic accuracy will be underestimated. These 2 mechanisms, with opposing effects, may explain why other studies have reported different results, either lower estimates of accuracy in studies with nonconsecutive inclusion50 or, on average, no effect on accuracy estimates.13
We found that studies that selected patients on the basis of whether they had been referred for the index test or on the basis of previous test results tended to yield lower estimates of diagnostic accuracy than studies that set out to include all patients with prespecified symptoms. The interpretation of this finding is not straightforward. We speculate that, with this form of patient selection, patients strongly suspected of having the target condition may bypass further testing, whereas those with a low likelihood of having the condition may never be tested at all. These mechanisms tend to lower the proportion of true-positive and true-negative test results.51
An extreme form of selective patient inclusion occurred in the studies that included severe cases and healthy controls. These case–control studies had much higher estimates of diagnostic accuracy (RDOR 4.9), although the low number of such studies led to wide confidence intervals. Severe cases are easier to detect with the use of the index test, which would lead to higher estimates of sensitivity in studies with more severe cases.52 The inclusion of healthy controls is likely to lower the occurrence of false-positive results, thereby increasing specificity.52 Other studies have also reported overestimation of diagnostic accuracy in this type of case–control study.13,50
Verification is a key issue in any diagnostic accuracy study. Studies that relied on 2 or more reference standards to verify the results of the index test reported odds ratios that were on average 60% higher than the odds ratios in studies that used a single reference standard. The origin of this difference probably resides in differences between reference standards in how they define the target conditions or in their quality.53 If misclassifications by the second reference standard are correlated with index test errors, agreement will artificially increase, which would lead to higher estimates of diagnostic accuracy. Our result is in line with that of the study by Lijmer and colleagues,13 who reported a 2-fold increase with a confidence interval overlapping ours.
As in the study by Lijmer and colleagues, we were unable to demonstrate a consistent effect of partial verification. This may be because the direction and magnitude of the effect of partial verification are difficult to predict. If a proportion of negative test results is not verified, this tends to increase sensitivity and lower specificity, which may leave the odds ratio unchanged.54
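The arithmetic behind this argument can be illustrated with a small Python sketch. It assumes the simplest scenario, in which a random fraction of patients with negative index test results is verified and the unverified patients are dropped from the 2 × 2 table; real partial verification is often selective, so the result may differ in practice. The function names and counts are hypothetical.

```python
def accuracy(tp, fp, fn, tn):
    """Return sensitivity, specificity and DOR for one 2 x 2 table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    dor = (tp * tn) / (fp * fn)
    return sensitivity, specificity, dor


def verify_fraction_of_negatives(tp, fp, fn, tn, fraction):
    """Keep all test positives but only a random fraction of test negatives.

    Test negatives are the false negatives (fn) and true negatives (tn);
    under random verification both shrink by the same factor.
    """
    return tp, fp, fn * fraction, tn * fraction


full = accuracy(90, 20, 10, 80)
partial = accuracy(*verify_fraction_of_negatives(90, 20, 10, 80, fraction=0.5))

print("full:   ", [round(x, 3) for x in full])     # [0.9, 0.8, 36.0]
print("partial:", [round(x, 3) for x in partial])  # [0.947, 0.667, 36.0]
```

Because both negative cells shrink by the same factor, the factor cancels in the odds ratio, whereas sensitivity rises and specificity falls. This is one reason an analysis based on the DOR alone may fail to detect the effect of partial verification.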
We were unable to demonstrate significant associations between estimates of DOR and a number of design features. The absence of an association in our model does not imply that the design features should be ignored in any given accuracy study, since the effect of design differences may vary between meta-analyses, or even within a single meta-analysis.
The results of our study need to be interpreted with the following limitations and strengths in mind. We were hampered by the low quality of reporting in the studies. Several design-related characteristics could not be adequately examined because of incomplete reporting (e.g., frequency of indeterminate test results and of dropouts, patient selection criteria, clinical spectrum, and the degree of blinding). We used the odds ratio as our main accuracy measure, which is a convenient summary statistic,55,56 but it may be insensitive to phenomena that produce opposing changes in sensitivity and specificity. Further studies should explore the effects of these design features on other accuracy measures, such as sensitivity, specificity and likelihood ratios.
Our study can be seen as a validation and extension of the study of Lijmer and colleagues.13 To ensure independent validation, we did not include any of their meta-analyses in our study. Furthermore, we replaced the fixed-effects approach that they used with a more appropriate random-effects approach, which allowed the effects of the design covariates to vary between meta-analyses. This explains the wider confidence intervals in our study, despite the fact that we included 269 more studies than Lijmer and colleagues did.
In general, the results of our study provide further empirical evidence of the importance of design features in studies of diagnostic accuracy. Studies of the same test can produce different estimates of diagnostic accuracy depending on choices in design. We feel that our results should be taken into account by researchers when designing new primary studies as well as by reviewers and readers who appraise these studies. Initiatives such as STARD (Standards for Reporting of Diagnostic Accuracy [www.consort-statement.org/stardstatement.htm]) should be endorsed to improve the awareness of design features, the quality of reporting and, ultimately, the quality of study designs. Well-reported studies with appropriate designs will provide more reliable information to guide decisions on the use and interpretation of test results in the management of patients.
We thank Jeroen G. Lijmer for his useful comments on earlier drafts of the study protocol and for securing project funding. We also thank Aeilko H. Zwinderman and Augustinus A. Hart for their statistical input.
• Clinicians need to know the diagnostic accuracy of the medical tests they use. Yet, determinations of test characteristics (sensitivity, specificity and likelihood ratios) derived from comparisons with a “gold standard” vary markedly between studies.
• In this study, the authors examined the sources of variation across 15 design features of 487 published studies of diagnostic accuracy. Only 1 study had no design deficiencies. Estimates of accuracy were highest in studies that selected nonconsecutive patients, that used severe cases and healthy controls and that analyzed retrospective data.
Implications for practice: The marked variation in estimates should make clinicians cautious when reading studies reporting on the diagnostic accuracy of tests. It is important that such studies be properly designed and reported.
This article has been peer reviewed.
Contributors: Johannes Reitsma and Patrick Bossuyt initiated and supervised the study. Anne Rutjes wrote the first draft of the study protocol, designed and established the database and wrote the first draft of the article. All of the authors collected the data. Anne Rutjes and Johannes Reitsma analyzed the data and, along with Patrick Bossuyt, provided the first interpretation of the implications of the study results. All of the authors contributed to the final manuscript and gave final approval of the version to be published. Patrick Bossuyt is the guarantor.
The study was funded by a research grant from the Netherlands Organisation for Scientific Research (NWO; registration no. 945-10-012). The funding source had no involvement in the development of the study design, the collection, analysis and interpretation of the data, the writing of the report or the decision to submit the paper for publication.
Competing interests: None declared.
Correspondence to: Dr. Anne W.S. Rutjes, Department of Clinical Pharmacology and Epidemiology, Consorzio Mario Negri Sud, Via Nazionale 8, 66030 Santa Maria Imbaro, Chieti, Italy; fax +39 087 2570206