The series of papers in this supplement of the journal highlights common challenges in systematic reviews of medical tests and outlines their mitigation, as perceived by researchers partaking in the Agency for Healthcare Research and Quality (AHRQ) Effective Healthcare Program. Generic by their very nature, these challenges and their discussion apply to the larger set of systematic reviews of medical tests, and are not specific to AHRQ’s program.
This paper focuses on choosing strategies for meta-analysis of test “accuracy”, or, preferably, test performance. Meta-analysis is not required for a systematic review, but when appropriate, it should be undertaken with a dual goal: to provide summary estimates for key quantities, and to explore and explain any observed dissimilarity (heterogeneity) in the results of the examined studies.
“Summing-up” information on test performance metrics such as sensitivity, specificity, and predictive values is rarely the most informative part of a systematic review of a medical test.1–4
Key clinical questions driving the evidence synthesis (e.g., is this test alone or in combination with a test-and-treat strategy likely to improve decision-making and patient outcomes?) are only indirectly related to test performance per se. Formulating an effective evaluation approach requires careful consideration of the context in which the test will be used. These framing issues are addressed in other papers in this issue of the journal.5–7
Further, in this paper we assume that medical test performance has been measured against a “gold standard”, that is, a reference standard that is considered adequate in defining the presence or absence of the condition of interest. Another paper in this supplement discusses ways to summarize medical tests when such a reference standard does not exist.8
Syntheses of medical test data often focus on test performance, and much of the attention to statistical issues in synthesizing medical test evidence concerns the summarizing of test performance data; we therefore chose meta-analysis of such data as the focus of this paper. We assume that the decision to perform meta-analyses of test performance data is justified and has been made, and explore two central challenges, namely how to quantitatively summarize medical test performance when: 1) the sensitivity and specificity estimates of various studies do not vary widely, or 2) the sensitivity and specificity of various studies vary over a large range.
- Briefly, when sensitivity and specificity estimates do not vary widely across studies, it may be helpful to summarize test performance with a “summary point” (a summary sensitivity and summary specificity pair). This could happen in meta-analyses where all studies have the same explicit test positivity threshold (a threshold for categorizing the results of testing as positive or negative). If studies have different explicit thresholds, the clinical interpretation of a summary point is less obvious, and perhaps less helpful. However, an explicit common threshold is neither sufficient nor necessary for opting to synthesize data with a “summary point”; a summary point can be appropriate whenever sensitivity and specificity estimates do not vary widely across studies.
- When the sensitivity and specificity of various studies vary over a large range, rather than using a “summary point”, it may be more helpful to describe how the average sensitivity and average specificity relate by means of a “summary line”. This oft-encountered situation can be secondary to explicit or implicit variation in the threshold for a “positive” test result; heterogeneity in populations, reference standards, or index tests; study design; chance; or bias.
Of note, in many applications it may be informative to present syntheses in both ways, as they convey complementary information.
Deciding whether a “summary point” or a “summary line” is more helpful as a synthesis is subjective, and no hard-and-fast rules exist. We briefly outline common approaches for meta-analyzing medical tests, and discuss principles for choosing between them. However, a detailed presentation of methods or their practical application is outside the scope of this work. In addition, it is expected that readers are versed in clinical research methodology, and familiar with methodological issues pertinent to the study of medical tests. We also assume familiarity with the common measures of medical test performance (reviewed in the Appendix, and in excellent introductory papers).9
For example, we do not review challenges posed by methodological or reporting shortcomings of test performance studies.10 The Standards for Reporting of Diagnostic accuracy (STARD) initiative published a 25-item checklist that aims to improve the reporting of medical test studies.10 We refer readers to other papers in this issue11 and to several methodological and empirical explorations of bias and heterogeneity in medical test studies.12–14
Nonindependence of sensitivity and specificity across studies and why it matters for meta-analysis
In a typical meta-analysis of test performance, we have estimates of sensitivity and specificity for each study, and seek to provide a meaningful summary across all studies. Within each study, sensitivity and specificity are independent, because they are estimated from different patients (sensitivity from those with the condition of interest, and specificity from those without). According to the prevailing reasoning, across studies sensitivity and specificity are likely negatively correlated: as one estimate increases, the other is expected to decrease. This is perhaps more obvious when studies have different explicit thresholds for “positive” tests (and thus the term “threshold effect” has been used to describe this negative correlation). For example, the D-dimer concentration threshold for diagnosing venous thromboembolism can vary from approximately 200 to over 600 ng/mL.15 It is expected that higher thresholds would correspond to generally lower sensitivity but higher specificity, and the opposite for lower thresholds (though in this example it is not clearly evident; see Fig. 1). A similar rationale can be invoked to explain between-study variability for tests with more implicit or suggestive thresholds, such as imaging or histological tests.
Figure 1. Typical data on the performance of a medical test (D-dimers for venous thromboembolism). Eleven studies on ELISA-based D-dimer assays for the diagnosis of venous thromboembolism.15 The top panel (a) depicts studies as markers, labeled by author names ...
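The distinction between within-study independence and across-study correlation can be illustrated with a minimal sketch. The 2x2 counts below are hypothetical and chosen only to mimic a threshold effect; they are not the D-dimer data, and the helper names (`sens_spec`, `logit`, `pearson`) are our own. Each study's sensitivity uses only patients with the condition and its specificity only patients without it, whereas the correlation is computed across studies on the logit scale.

```python
import math

# Hypothetical 2x2 counts per study: (TP, FN, FP, TN).
# Illustrative only; NOT the D-dimer studies discussed in the text.
studies = [
    (90, 10, 40, 60),  # lowest threshold: high sensitivity, low specificity
    (85, 15, 30, 70),
    (80, 20, 20, 80),
    (70, 30, 10, 90),  # highest threshold: low sensitivity, high specificity
]

def sens_spec(tp, fn, fp, tn):
    """Sensitivity from patients with the condition, specificity from those without."""
    return tp / (tp + fn), tn / (tn + fp)

def logit(p):
    """Log-odds transform; assumes 0 < p < 1 (no zero cells in these counts)."""
    return math.log(p / (1 - p))

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

pairs = [sens_spec(*s) for s in studies]
logit_sens = [logit(se) for se, _ in pairs]
logit_spec = [logit(sp) for _, sp in pairs]

# Across-study correlation; negative for these hypothetical counts.
r = pearson(logit_sens, logit_spec)
print(f"correlation of logit(sensitivity) and logit(specificity): {r:.2f}")
```

With real data the sign and size of this correlation need not be so clear-cut, as the D-dimer example suggests, and a crude correlation of this kind ignores the differing precision of the studies.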
Negative correlation between sensitivity and specificity across studies may be expected for reasons unrelated to thresholds for positive tests. For example, in a meta-analysis evaluating the ability of serial creatine kinase-MB (CK-MB) measurements to diagnose acute cardiac ischemia in the emergency department,16, 17 the time interval from the onset of symptoms to serial CK-MB measurements (rather than the actual threshold for CK-MB) could explain the relationship between sensitivity and specificity across studies. The larger the time interval, the more CK-MB is released into the bloodstream, affecting the estimated sensitivity and specificity. Unfortunately, the term “threshold effect” is often used rather loosely to describe the relationship between sensitivity and specificity across studies, even when, strictly speaking, there is no direct evidence of variability in study thresholds for positive tests.
For these reasons, the current thinking is that, in general, the study estimates of sensitivity and specificity do not vary independently, but jointly, and likely with a negative correlation. Summarizing the two correlated quantities is a multivariate problem, and multivariate methods should be used to address it, as they are better motivated theoretically.18, 19
At the same time there are situations when a multivariate approach is not practically different from separate univariate analyses. We will expand on some of these issues.
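To make the multivariate framing concrete, one widely used formulation (a bivariate random-effects model on the logit scale) can be sketched as follows. The notation is our own assumption and does not come from this paper: for study $i$, $n_{1i}$ and $n_{0i}$ are the numbers of patients with and without the condition, and $Se_i$ and $Sp_i$ are the study-specific sensitivity and specificity.

```latex
\begin{aligned}
TP_i &\sim \mathrm{Binomial}(n_{1i},\, Se_i), \qquad
TN_i \sim \mathrm{Binomial}(n_{0i},\, Sp_i), \\[4pt]
\begin{pmatrix} \operatorname{logit}(Se_i) \\ \operatorname{logit}(Sp_i) \end{pmatrix}
&\sim \mathcal{N}\!\left(
\begin{pmatrix} \mu_{Se} \\ \mu_{Sp} \end{pmatrix},\;
\begin{pmatrix}
\sigma_{Se}^{2} & \rho\,\sigma_{Se}\sigma_{Sp} \\
\rho\,\sigma_{Se}\sigma_{Sp} & \sigma_{Sp}^{2}
\end{pmatrix}
\right).
\end{aligned}
```

In this sketch the two binomial likelihoods are separate, reflecting within-study independence, while the across-study correlation is carried by $\rho$. If $\rho = 0$, the model factors into two separate univariate random-effects meta-analyses, one for sensitivity and one for specificity, which is one situation in which the multivariate and univariate approaches give the same summary estimates.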