Systematic reviews of diagnostic test accuracy studies are increasingly being published, but they can be methodologically challenging. In this paper we present some of the recent developments in the methodology for conducting systematic reviews of diagnostic test accuracy studies. Restrictive electronic search filters are discouraged, as is the use of summary quality scores. Methods for meta-analysis should take the paired nature of the estimates and their dependence on threshold into account; we therefore advise authors of these reviews to use the hierarchical summary ROC or the bivariate model for the analysis of the data. Challenges that remain are the poor reporting of original diagnostic test accuracy research, and difficulties with the interpretation of the results of diagnostic test accuracy research.
Diagnostic tests are a critical component of health care, and clinicians, policy makers and patients routinely face a range of questions regarding diagnostic tests. They want to know if testing improves outcome, would like to know what test to use, to purchase, or to recommend in practice guidelines, and how to interpret test results. Well designed diagnostic test accuracy studies can help in making these decisions, provided that they transparently and fully report their participants, tests, methods and results, as facilitated, for example, by the Standards for Reporting of Diagnostic Accuracy (STARD) statement (1). That 25-item checklist was published in this and many other journals, and is now adopted by more than 200 scientific journals.
As elsewhere in science, systematic reviews and meta-analysis of accuracy studies can be used to obtain more precise estimates when small studies addressing the same test and patients in the same setting are available. Reviews can also be useful to establish whether and how scientific findings vary by particular subgroups, and may provide summary estimates with a stronger generalizability than estimates from a single study. Systematic reviews may help identify the risk of bias that may be present in the original studies, and can be used to address questions that were not directly considered in the primary studies, such as comparisons between tests. The Cochrane Collaboration is the largest international organization preparing, maintaining and promoting systematic reviews to help people make well-informed decisions about health care (2). It decided in 2003 to make preparations for including systematic reviews of diagnostic test accuracy in its Cochrane Database of Systematic Reviews (CDSR). To enable this, a working group was constituted to develop methodology, software and a handbook (see Appendix). The first diagnostic test accuracy reviews will be published in the CDSR in October 2008.
In this paper, we review recent methodological developments concerning problem formulation, location of literature, quality assessment, and meta-analysis of diagnostic accuracy studies using our experience from the work on the Cochrane Handbook. The information presented here is based on the recent literature and updates previously published guidelines by Irwig et al in this journal (3).
Diagnostic test accuracy refers to the ability of that test to distinguish between patients with disease (or more generally, a specified target condition) and those without. In such a test accuracy study, the results of the test under evaluation, or ‘index test’, are compared with those of the reference standard determined in the same patients. The reference standard is the best available method for identifying patients that have the target condition. Test accuracy is most often expressed as the test’s sensitivity (the proportion of those positive to the reference standard who are also positive to the index test) and specificity (the proportion of those negative to the reference standard who are also negative to the index test), but many alternative measures have been proposed and are in use (4,5).
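As a minimal sketch of these two definitions, the Python snippet below computes sensitivity and specificity from the four cells of a 2x2 table that cross-classifies the index test against the reference standard. The class name `TwoByTwo` and the counts are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Index test results cross-classified against the reference standard."""
    tp: int  # index test positive, reference standard positive
    fp: int  # index test positive, reference standard negative
    fn: int  # index test negative, reference standard positive
    tn: int  # index test negative, reference standard negative

    def sensitivity(self) -> float:
        # Proportion of reference-standard positives that the index test detects.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of reference-standard negatives that the index test clears.
        return self.tn / (self.tn + self.fp)


# Hypothetical counts, for illustration only.
study = TwoByTwo(tp=45, fp=10, fn=5, tn=90)
print(f"sensitivity = {study.sensitivity():.2f}")
print(f"specificity = {study.specificity():.2f}")
```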
Test accuracy is not a fixed property of a test. It can vary between patient subgroups, with their spectrum of disease, with the clinical setting, with the test interpreters, and may depend on the results of prior testing. For this reason, it is essential to include these elements in the study question. Review authors should at least consider whether the test of interest will be mainly used in general practice or in a secondary or even tertiary setting. If the index test is physical examination, for example, a test more important for family practice than for specialized care, then the review authors must realize that their review may be of limited value if the included studies were all done in a tertiary setting.
In order to make a policy decision to promote use of a new index test, evidence is required that using the new test increases test accuracy over other testing options including current practice, or has equivalent accuracy but offers other advantages (6–8). As with the evaluation of interventions, systematic reviews need to include comparative analyses between alternative testing strategies, and not focus solely on evaluating the performance of a test in isolation.
In relation to the existing situation, three possible roles for a new test can be defined: replacement, triage, and add-on (6). If a new test is to replace an existing test, then comparing the accuracy of both tests on the same population and with the same reference standard provides the most direct evidence. In triage, the new test is used before the existing test or existing testing pathway, and only patients with a particular result on the triage test continue the testing pathway. When a test is needed to rule out disease in patients who then need no further testing, one will be looking for a test that gives a minimal proportion of false negatives and thus a relatively high sensitivity. Triage tests may be less accurate than existing ones, but they have other advantages, such as simplicity or low cost. A third possible role of a new test is add-on. The new test is then positioned after the existing testing pathway, to identify false positives or false negatives after the existing pathway. The review should provide data to assess the incremental change in accuracy made by adding the new test.
An example of a replacement question can be found in a systematic review of the diagnostic accuracy of urinary markers for primary bladder cancer (9). Clinicians may use cytology to triage patients before they undergo invasive cystoscopy, the reference standard for bladder cancer. As cytology combines a high specificity with a low sensitivity (10), the goal of the review was to identify a tumor marker with sufficient accuracy to either replace cytology or to be used in addition to cytology. For a marker to replace cytology, it has to achieve equally high specificity with improved sensitivity. New markers which are sensitive but not specific may have roles as adjuncts to conventional testing. The review included studies in which the test under evaluation (several different tumor markers and cytology) was evaluated against cystoscopy or histopathology. Included studies compared one or more of the markers, cytology only, or a combination of markers and cytology.
Although information on accuracy can help clinicians in making decisions about tests, review authors and readers should realize that good diagnostic accuracy is a desirable but not a sufficient condition for the effectiveness of a test (7). To show that using a new test does more good than harm to patients tested, randomized trials of test-and-treatment strategies and reviews of such trials may be necessary. In most cases, such randomized trials are rare and systematic reviews of test accuracy may provide the most useful evidence to guide decision making, and provide key evidence to incorporate into decision models.
Identifying test accuracy studies is more difficult than searching for randomized trials (11). There is not a clear, unequivocal key word or indexing term for an accuracy study in literature databases, comparable to the term “randomized controlled trial”. The Medical Subject Heading “sensitivity and specificity” may look suitable but is inconsistently applied in most electronic bibliographic databases. Furthermore, data on diagnostic test accuracy may be hidden in studies that did not have test accuracy estimation as their primary objective. This complicates the efficient identification of diagnostic test accuracy studies in electronic databases, such as Medline. Until indexing systems properly code studies of test accuracy, searching for them will remain challenging, and additional manual searches, such as screening reference lists, may be necessary.
In the development of a comprehensive search strategy, review authors can use search strings that refer to the test(s) under evaluation, the target condition and the patient description, or a subset of these. For tests with a clear name that are used for a single purpose, searching for publications in which those tests are mentioned may suffice. For other reviews it may be necessary to add the patient description, although this is also often poorly indexed. A search strategy in Medline should contain both Medical Subject Headings and free text words. A search strategy for articles about tests for bladder cancer, for example, should include as many synonyms for bladder cancer as possible, including neoplasm, carcinoma, transitional cell and, possibly, also haematuria.
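The sketch below shows one way such a search string could be assembled: synonyms for the target condition are combined with OR, and the resulting block is combined with the test terms using AND. The field tags follow PubMed conventions (`[MeSH Terms]`, `[tiab]`); the helper `or_block` and the term lists are hypothetical illustrations, not a validated search strategy.

```python
def or_block(terms, field):
    """Join quoted terms carrying the same field tag with OR, in parentheses."""
    return "(" + " OR ".join(f'"{term}"[{field}]' for term in terms) + ")"


# Illustrative term lists only; a real review would develop and pilot these.
condition_mesh = ["Urinary Bladder Neoplasms"]
condition_free_text = [
    "bladder cancer",
    "bladder neoplasm",
    "bladder carcinoma",
    "transitional cell carcinoma",
    "haematuria",
    "hematuria",
]
test_free_text = ["cytology", "tumor marker", "tumour marker"]

# Subject heading OR free text for the condition, AND free text for the tests.
condition = (
    "("
    + or_block(condition_mesh, "MeSH Terms")
    + " OR "
    + or_block(condition_free_text, "tiab")
    + ")"
)
query = condition + " AND " + or_block(test_free_text, "tiab")
print(query)
```

Note that no methodological filter for accuracy studies is part of the string; only the condition and the tests are searched.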
Several methodological electronic search filters for diagnostic test accuracy studies have been developed, each attempting to restrict the search to articles that are most likely to be test accuracy studies (11–14). These filters rely on indexing terms for research methodology and text words used in reporting results, but they often miss relevant studies and are unlikely to decrease the number of articles one needs to screen, so they are not recommended for systematic reviews (15,16). The incremental value of searching in languages other than English and in the so-called grey literature has not yet been fully investigated.
In systematic reviews of intervention studies, publication bias is an important and well-studied form of bias, where the decision to report and publish studies is linked to their findings. For clinical trials, the magnitude and determinants of publication bias have been identified by tracing the publication history of cohorts of trials reviewed by ethics committees and research boards (17). A consistent observation has been that studies with statistically significant results are more likely to be published than studies with non-significant findings (17). Investigating publication bias for diagnostic tests is problematic, as many studies are undertaken without ethical review or study registration, so follow-up of cohorts of studies is rarely possible (18). Funnel plot based tests used to detect publication bias in reviews of randomized controlled trials have proven to be seriously misleading for diagnostic studies, and alternatives have poor power (19). Also, as test accuracy studies do not routinely report P-values or dichotomize findings as significant or not significant, the determinants for publication of diagnostic studies are unlikely to be the same as the determinants for publication of intervention studies.
More variability among diagnostic accuracy study results is to be expected than with randomized trials. Some of this variability is due to chance, as many diagnostic studies have small sample sizes (20). The remaining heterogeneity may be due to differences in study populations, but differences in study methods are also likely to result in differences in accuracy estimates (21). Test accuracy studies with design deficiencies can produce biased results (22–24). Table 1 lists some of the more important forms of bias. The sources of bias for which there is unambiguous evidence of overestimated diagnostic accuracy are the inclusion of healthy controls and the incomplete or differential use of reference standards (22,24).
Quality assessment of individual studies in systematic reviews is therefore necessary to identify potential sources of bias and to limit the effects of these biases on the estimates and the conclusions of the review. We recommend the Quality Assessment of Diagnostic Accuracy Studies (Quadas) checklist to assess the quality of diagnostic test accuracy studies (25). In addition, specific sources of bias may exist for different types of diagnostic tests. For example, in studies assessing the accuracy of biochemical serum markers, data-driven selection of the cut-off value may bias diagnostic accuracy (26,27). Review authors should therefore think carefully whether specific items need to be added to the Quadas list.
The results of quality appraisal can be summarized to offer a general impression of the validity of the available evidence. Review authors should not use an overall quality score, as different shortcomings may generate different magnitudes of bias, even in opposing directions, making it very hard to attach sensible weights to each quality item (28). A way to summarize the quality assessment is shown in Figure 1, where stacked bars are used for each Quadas item. Another way of presenting the quality assessment results is by tabulating the results of the individual Quadas items for each single study. In the analysis phase, the results of the quality appraisal may guide explorations of the sources of heterogeneity (30,31). Possible methods to address quality differences are sensitivity analysis, subgroup analysis or meta-regression analysis, although the number of included studies may often be too low for meaningful investigations. Also, incomplete reporting hampers any evaluation of study quality (32). The effects of the STARD guidelines for complete and transparent reporting (1) are only gradually becoming visible in the literature (33).
Whereas the results of a randomized trial are often reported using a single measure of effect, such as a difference in means, a risk difference, or a risk ratio, most diagnostic test accuracy studies report two or more statistics: the sensitivity and the specificity, the positive and negative predictive value, the likelihood ratios for the respective test results, or the Receiver Operating Characteristic (ROC) curve and quantities based on it (34,35).
The first step in the meta-analysis of diagnostic test accuracy is to graph the results of the individual studies. The paired results for sensitivity and specificity in the included studies should be plotted as points in a ROC space (see Figure 2), which can highlight the covariation between sensitivity and specificity. In Figure 2, the X-axis of the ROC plot displays the specificity obtained in the studies in the review. The Y-axis shows the corresponding sensitivity. The rising diagonal indicates values of sensitivity and specificity that could be obtained by guessing and refers to a noninformative test: the chances of a positive test result are identical for the diseased and the non-diseased. It is expected that most studies will lie above this line. The best diagnostic tests will be positioned in the upper left corner of the ROC space, where both sensitivity and specificity are close to 1. As confidence limits are not displayed on these plots, it is not possible to discern the cause of scatter across studies: it may be due either to small sample sizes or to between-study heterogeneity. Paired forest plots (see Figure 3) display sensitivity and specificity separately (but on the same row) for each study together with confidence intervals and tabular data. A disadvantage is that forest plots do not display the covariation between sensitivity and specificity.
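To illustrate the rows of a paired forest plot, the sketch below computes sensitivity and specificity with 95% confidence intervals for a few studies. The Wilson score interval is used here as an assumption; the text does not prescribe a particular interval method. The study labels and counts are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Hypothetical studies: (label, TP, FP, FN, TN).
studies = [
    ("Study A", 45, 10, 5, 90),
    ("Study B", 12, 3, 8, 27),
    ("Study C", 30, 20, 10, 140),
]

for label, tp, fp, fn, tn in studies:
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    sens_lo, sens_hi = wilson_interval(tp, tp + fn)
    spec_lo, spec_hi = wilson_interval(tn, tn + fp)
    print(
        f"{label}: sensitivity {sens:.2f} ({sens_lo:.2f} to {sens_hi:.2f}), "
        f"specificity {spec:.2f} ({spec_lo:.2f} to {spec_hi:.2f})"
    )
```

Small studies, such as the hypothetical Study B, produce visibly wider intervals, which is the information the ROC plot alone does not show.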
The estimated sensitivity and specificity of a test often display a pattern of negative correlation when plotted in an ROC plot. A major contributor to this appearance is the trade-off between sensitivity and specificity when the threshold for defining test positivity varies. When high test results are labelled as positive, decreasing the threshold value that defines a test result as positive increases sensitivity and lowers specificity, and vice versa. When studies included in a review differ in positivity thresholds, a ROC-curve like pattern may be discerned in the ROC plot. There may be explicit variation in thresholds if different studies use different numerical thresholds to define a test result as positive (for example, variation in the blood glucose level above which a patient is said to have diabetes). In other situations, unquantifiable or implicit variation in threshold may occur when test results depend on interpretation or judgment (for example, between radiographers classifying images as normal or abnormal) or where test results are sensitive to machine calibration.
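The trade-off described above can be reproduced with a few lines of Python. The measurements are hypothetical values of a continuous test in which high results are labelled positive; lowering the threshold raises sensitivity and lowers specificity.

```python
def accuracy_at_threshold(diseased, non_diseased, threshold):
    """Return (sensitivity, specificity) when results >= threshold are positive."""
    sensitivity = sum(x >= threshold for x in diseased) / len(diseased)
    specificity = sum(x < threshold for x in non_diseased) / len(non_diseased)
    return sensitivity, specificity


# Hypothetical test values, for illustration only.
diseased = [6.1, 6.8, 7.2, 7.5, 8.0, 8.4, 9.1, 10.2]
non_diseased = [4.2, 4.8, 5.1, 5.5, 5.9, 6.3, 6.6, 7.0]

for threshold in (8.0, 7.0, 6.0):
    sens, spec = accuracy_at_threshold(diseased, non_diseased, threshold)
    print(f"threshold {threshold}: sensitivity {sens:.3f}, specificity {spec:.3f}")
```

If three studies of the same test each used one of these thresholds, their results would trace a ROC-curve-like pattern even though the underlying test is identical.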
Because threshold effects cause sensitivity and specificity estimates to appear negatively correlated, and because threshold variation can be expected in many situations, robust approaches to meta-analysis take the underlying relationship between sensitivity and specificity into account. One way of doing so is by constructing a summary ROC curve. An average ‘operating point’ on this curve indicates where the centre of the study results lies. Separate pooling of sensitivity and specificity to identify this point has been discredited, because such an approach may identify a summary point which is not representative of the paired data, for example a point which does not lie on the summary ROC curve.
Meta-analyses of studies reporting pairs of sensitivity and specificity estimates often used the linear regression model for the construction of summary ROC curves proposed by Moses et al, which is based on regressing the log diagnostic odds ratio against a measure of the proportion reported as test positive (36). To examine differences between tests and to relate them to study or sample characteristics, the regression model can be extended by adding covariates (37). However, we now know that the formulation of the Moses model has its limitations. It fails to consider the precision of the study estimates, does not estimate between-study heterogeneity, and the explanatory variable in the regression is measured with error. These problems render estimates of confidence intervals and P-values unsuitable for formal inference (35,38).
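For readers unfamiliar with the Moses approach, the sketch below shows its basic mechanics under common conventions that are assumptions here: D is the log diagnostic odds ratio, S is the sum of the logits of the true and false positive rates (the proxy for the proportion test positive), 0.5 is added to every cell to avoid zero counts, and the line D = a + bS is fitted by unweighted least squares. The data are hypothetical. The unweighted fit also makes the limitations listed above visible: study size plays no role, and S is treated as if it were measured without error.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


def moses_littenberg(tables):
    """Fit D = a + b*S by unweighted least squares; tables hold (TP, FP, FN, TN)."""
    d_values, s_values = [], []
    for tp, fp, fn, tn in tables:
        # Continuity correction: add 0.5 to every cell (assumed convention).
        tpr = (tp + 0.5) / (tp + fn + 1)
        fpr = (fp + 0.5) / (fp + tn + 1)
        d_values.append(logit(tpr) - logit(fpr))  # log diagnostic odds ratio
        s_values.append(logit(tpr) + logit(fpr))  # threshold proxy
    n = len(tables)
    s_mean = sum(s_values) / n
    d_mean = sum(d_values) / n
    sxx = sum((s - s_mean) ** 2 for s in s_values)
    sxy = sum((s - s_mean) * (d - d_mean) for s, d in zip(s_values, d_values))
    b = sxy / sxx
    a = d_mean - b * s_mean
    return a, b


def sroc_sensitivity(a, b, fpr):
    """Sensitivity on the summary ROC curve at a given false positive rate."""
    return expit((a + (1 + b) * logit(fpr)) / (1 - b))


# Hypothetical studies: (TP, FP, FN, TN).
tables = [(40, 10, 10, 90), (30, 5, 20, 95), (45, 25, 5, 75), (35, 8, 15, 92)]
a, b = moses_littenberg(tables)
for fpr in (0.05, 0.10, 0.20):
    print(f"FPR {fpr:.2f}: summary sensitivity {sroc_sensitivity(a, b, fpr):.2f}")
```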
Two newly developed approaches to fitting random effects in hierarchical models overcome these limitations: the hierarchical summary ROC model (35,39–41) and the bivariate random effects model (38,42). The hierarchical summary ROC model focuses on identifying the underlying ROC curve, estimating the average accuracy (as a diagnostic odds ratio) and average threshold (and unexplained variation in these parameters across studies), together with a shape parameter that describes the asymmetry in the curve. The bivariate random effects model focuses on estimating the average sensitivity and specificity, but also estimates the unexplained variation in these parameters and the correlation between them. These two basic models are mathematically equivalent in the absence of covariates (43). Both models give a valid estimation of the underlying summary ROC curve and the average operating point (38,43). Addition of covariates to the models, or application of separate models to different subgroups enables exploration of heterogeneity. Both models can be fitted with statistical software for fitting mixed models (35,38,40,42).
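For orientation, the two models can be written compactly as follows. This is a sketch of their usual formulation in the cited literature, with a binomial within-study likelihood as one common variant; the symbol names are our own choice and are not taken from the text above.

```latex
% Bivariate random effects model, study i:
% y_{Ai} of n_{Ai} diseased test positive, y_{Bi} of n_{Bi} non-diseased test negative
y_{Ai} \sim \mathrm{Binomial}(n_{Ai}, \mathrm{Se}_i), \qquad
y_{Bi} \sim \mathrm{Binomial}(n_{Bi}, \mathrm{Sp}_i)

\begin{pmatrix} \operatorname{logit}(\mathrm{Se}_i) \\ \operatorname{logit}(\mathrm{Sp}_i) \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix},
\begin{pmatrix} \sigma_A^2 & \sigma_{AB} \\ \sigma_{AB} & \sigma_B^2 \end{pmatrix}
\right)

% Hierarchical summary ROC model, study i, disease group j:
% \pi_{ij} is the probability of a positive index test result
\operatorname{logit}(\pi_{ij}) = (\theta_i + \alpha_i X_{ij}) \exp(-\beta X_{ij}),
\qquad X_{ij} = \begin{cases} +\tfrac{1}{2} & \text{diseased} \\ -\tfrac{1}{2} & \text{non-diseased} \end{cases}

\theta_i \sim N(\Theta, \sigma_\theta^2), \qquad
\alpha_i \sim N(\Lambda, \sigma_\alpha^2)
```

In the first model, the means give the average logit sensitivity and specificity, and the covariance term captures the correlation between them. In the second, the threshold parameter, the accuracy parameter (a log diagnostic odds ratio) and the shape parameter correspond to the average threshold, average accuracy and curve asymmetry mentioned above.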
Estimates of summary likelihood ratios can best be derived from summary estimates of sensitivity and specificity obtained using the methods described above. Whilst some authors have advocated pooling likelihood ratios rather than pooling sensitivity and specificity or ROC curves (44–46), these methods do not account for the correlated bivariate nature of likelihood ratios, and may yield impossible summary estimates and confidence intervals, with positive and negative likelihood ratios either both above or both below 1 (47).
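A minimal sketch of this derivation is given below. The input values are hypothetical summary estimates, not results from any review. Because both ratios are computed from the same sensitivity and specificity pair, they are coherent: whenever sensitivity plus specificity exceeds 1, the positive likelihood ratio lies above 1 and the negative likelihood ratio below 1.

```python
def likelihood_ratios(sensitivity, specificity):
    """Derive the positive and negative likelihood ratios from one summary pair."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


# Hypothetical summary estimates, for illustration only.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.80)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")
```

Confidence intervals for the derived ratios should come from the fitted model, since they depend on the estimated correlation between sensitivity and specificity; this sketch covers the point estimates only.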
The ability to estimate underlying summary ROC curves and average operating points allows flexibility in testing hypotheses and estimating diagnostic accuracy. Analyses based on all included studies facilitate well powered comparisons between different tests or between subgroups of studies, which are not restricted to investigating accuracy at a particular threshold. Figure 2a shows a summary ROC curve for the diagnostic accuracy of a tumor antigen test for diagnosing bladder cancer. In contrast, when a test is being used at the same threshold in all included studies, review authors may calculate a summary estimate of sensitivity and specificity. The certainty associated with the estimate can be described by confidence regions marked on the summary ROC plot around the average point. Figure 2b shows an example of this approach.
Judgments about the validity of pooling data should be informed by considering the quality of the studies, the similarity of patients and tests being pooled, and whether the results may consequently be misleading. Where there is statistical heterogeneity in results, random effects models will describe the variability and uncertainty in the estimates, which may lead to difficulties in drawing firm conclusions about the accuracy of a particular test.
Systematic reviews of diagnostic test accuracy may evaluate more than one test, to determine which test or combination of tests can better serve the intended purpose. Indirect comparisons can be made by calculating separate summary estimates of the sensitivity and specificity for each test, including all studies that have evaluated that test, regardless of whether they evaluated the other tests. The substantial variability that can be expected between tests means that such comparisons are prone to confounding. Restricting inclusion to studies of similar design and patient characteristics may limit confounding. A theoretically preferable approach is to only use studies that have directly compared the tests in the same patients, or have randomized patients to one of the tests. Such direct comparisons do not suffer from confounding. Paired analyses can be displayed in an ROC plot, by linking the sensitivity-specificity pairs from each study with a dashed or dotted line, as in Figure 4. Unfortunately, fully paired studies are not always available.
The interpretation of the results offered in the systematic review should help readers to understand the implications for practice. This interpretation should consider whether evidence derived from the review suitably addresses the objectives of the review. It may involve considerations about whether the study sample was representative, whether the included studies indeed investigated the intended future role of the test under evaluation, and whether the results are unlikely to be biased. The potential effects of quality differences on the results, or the lack of high quality studies, should be considered. The interpretation of the findings should furthermore consider the consequences of the false positive and false negative test results and whether the estimates of accuracy that were found are sufficiently high for the foreseen role that the test will have in practice. Some reviews may not result in useful summary estimates of sensitivity and specificity, for example because of large variability in the individual study estimates, or because the authors only investigated the comparative accuracy by comparing summary ROC curves. A decision model could be used to structure the interpretation of the findings. Such a model would incorporate such important factors as the disease prevalence, likely outcomes, and the available diagnostic and therapeutic interventions that may follow the test (48). Additional information, such as costs or important trade-offs between harms and benefits, can be included.
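One small ingredient of such a decision model is the step from disease prevalence to the probability of disease after a test result. The sketch below applies Bayes' theorem in its odds form; the prevalence and likelihood ratios are hypothetical, and a full decision model would add outcomes, costs and subsequent interventions, which are not modelled here.

```python
def post_test_probability(prevalence, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_test_odds = prevalence / (1 - prevalence)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# Hypothetical inputs, for illustration only.
prevalence = 0.20
after_positive = post_test_probability(prevalence, likelihood_ratio=4.5)
after_negative = post_test_probability(prevalence, likelihood_ratio=0.125)
print(f"probability of disease after a positive result: {after_positive:.3f}")
print(f"probability of disease after a negative result: {after_negative:.3f}")
```

Repeating the calculation for the prevalence of a general practice and of a tertiary setting shows how the same accuracy estimates can lead to different practical conclusions.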
The methodology for systematic reviews of diagnostic test accuracy studies has progressed substantially in recent years. We now know more about searching, about sources of bias in study design, about quality appraisal, and about data analysis. In meta-analysis, new hierarchical random effects models have been developed with sound statistical properties that allow robust inferences. Methods for the estimation of summary ROC curves and of summary estimates of sensitivity and specificity are now available. All these advances will be described in detail in the Cochrane Handbook for Diagnostic Test Accuracy Reviews (49). Table 2 provides a summary of the key issues that both readers and review authors should consider.
Diagnostic test accuracy reviews face two major challenges. Firstly, they are limited by the quality and availability of primary test accuracy studies that address important relevant questions. More studies are needed which recruit suitable spectrums of participants, make direct comparisons between tests, use rigorous methodology, and clearly report their methods and findings. Secondly, more development is needed in the area of interpretation and presentation of the results of diagnostic test accuracy reviews. It has been shown that many clinicians struggle with the definitions of sensitivity, specificity and likelihood ratios (50,51). We have to explore how well the concept of diagnostic accuracy applies to other forms of testing, such as prognosis, prediction and monitoring, and to new test modalities, such as microarrays and genotyping. Policy makers and guideline developers may be interested not only in comparative accuracy but also in additional information, such as the costs and burden of testing, or in new test modalities. Developing systematic reviews that are relevant for policy makers and clinical practice poses a major challenge, and requires clear thinking about the scope and purpose of the review.
Publisher's Disclaimer: This is the prepublication, author-produced version of a manuscript accepted for publication in Annals of Internal Medicine. This version does not include post-acceptance editing and formatting. The American College of Physicians, the publisher of Annals of Internal Medicine, is not responsible for the content or presentation of the author-produced accepted version of the manuscript or any version that a third party derives from it. Readers who wish to access the definitive published version of this manuscript and any ancillary material related to this manuscript (e.g., correspondence, corrections, editorials, linked articles) should go to www.annals.org or to the print issue in which the article appears. Those who cite this manuscript should cite the published version, as it is the official version of record.
Contributors to the Cochrane Diagnostic Test Accuracy Working Group include (in alphabetical order):
Bert Aertgeerts, Doug Altman, Gerd Antes, Lucas Bachmann, Patrick Bossuyt, Heiner Buchner, Peter Bunting, Frank Buntinx, Jonathan Craig, Roberto D’Amico, Jon Deeks, Jenny Doust, Matthias Egger, Anne Eisinga, Graziella Fillipini, Yngve Flack-Ytter, Constantine Gatsonis, Afina Glas, Paul Glasziou, Fritz Grossenbacher, Roger Harbord, Jorgen Hilden, Lotty Hooft, Andrea Horvath, Chris Hyde, Les Irwig, Monica Kjeldstrøm, Petra Macaskill, Susan Mallett, Ruth Mitchell, Tess Moore, Rasmus Moustgaard, Wytze Oosterhuis, Madhukar Pai, Prashni Paliwal, Daniel Pewsner, Hans Reitsma, Jacob Riis, Ingrid Riphagen, Anne Rutjes, Rob Scholten, Nynke Smidt, Jonathan Sterne, Yemisi Takwoingi, Riekie de Vet, Vasivy Vlassov, Joseph Watine, Danielle van der Windt, Penny Whiting.