Whereas the results of a randomized trial are often reported using a single measure of effect, such as a difference in means, a risk difference, or a risk ratio, most diagnostic test accuracy studies report two or more statistics: the sensitivity and the specificity, the positive and negative predictive values, the likelihood ratios for the respective test results, or the Receiver Operating Characteristic (ROC) curve and quantities based on it (34).
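As a minimal illustration of how these paired statistics relate to a single study's 2x2 table, the following sketch computes them from hypothetical counts (all numbers are invented for illustration):

```python
# Basic paired accuracy statistics from a single study's 2x2 table.
# The counts used in the example call are invented for illustration.

def accuracy_stats(tp, fp, fn, tn):
    """Return the common paired accuracy measures for one 2x2 table."""
    sensitivity = tp / (tp + fn)               # true positive rate
    specificity = tn / (tn + fp)               # true negative rate
    ppv = tp / (tp + fp)                       # positive predictive value
    npv = tn / (tn + fn)                       # negative predictive value
    lr_pos = sensitivity / (1 - specificity)   # likelihood ratio, positive result
    lr_neg = (1 - sensitivity) / specificity   # likelihood ratio, negative result
    return {"sens": sensitivity, "spec": specificity,
            "ppv": ppv, "npv": npv, "lr+": lr_pos, "lr-": lr_neg}

stats = accuracy_stats(tp=90, fp=30, fn=10, tn=70)
print(stats)  # sensitivity 0.9, specificity 0.7, and so on
```

Note that all six measures are derived from the same four cells, which is why a single summary number rarely suffices for a diagnostic test.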
The first step in the meta-analysis of diagnostic test accuracy is to graph the results of the individual studies. The paired results for sensitivity and specificity in the included studies should be plotted as points in ROC space (see Figure 2), which can highlight the covariation between sensitivity and specificity. In Figure 2, the X-axis of the ROC plot displays the specificity obtained in the studies in the review; the Y-axis shows the corresponding sensitivity. The rising diagonal indicates values of sensitivity and specificity that could be obtained by guessing and corresponds to a noninformative test: the chance of a positive test result is identical for the diseased and the non-diseased. Most studies are expected to lie above this line. The best diagnostic tests will be positioned in the upper right corner of the ROC space, where both sensitivity and specificity are close to 1. Because confidence limits are not displayed on these plots, the cause of scatter across studies cannot be discerned: it may be due either to small sample sizes or to between-study heterogeneity. Paired forest plots display sensitivity and specificity separately (but on the same row) for each study, together with confidence intervals and tabular data. A disadvantage of forest plots is that they do not display the covariation between sensitivity and specificity.
Figure 2 a and b: ROC plot showing pairs of sensitivity and specificity values for the included studies. The height of the rectangles is proportional to the number of patients with bladder cancer in each study; the width of the rectangles corresponds to the number …
Forest plots of sensitivity and specificity of a tumor marker for bladder cancer. Based on a re-analysis of the data from Glas et al. (10).
The estimated sensitivity and specificity of a test often display a pattern of negative correlation when plotted in a ROC plot. A major contributor to this appearance is the trade-off between sensitivity and specificity as the threshold for defining test positivity varies. When high test results are labelled as positive, decreasing the threshold that defines a result as positive increases sensitivity and lowers specificity, and vice versa. When studies included in a review differ in positivity thresholds, a ROC-curve-like pattern may be discerned in the ROC plot. Variation in thresholds may be explicit, when different studies use different numerical thresholds to define a positive result (for example, variation in the blood glucose level above which a patient is said to have diabetes). In other situations, unquantifiable or implicit variation in threshold may occur, for example when test results depend on interpretation or judgment (as between radiographers classifying images as normal or abnormal) or when test results are sensitive to machine calibration.
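The threshold trade-off can be made concrete with a small sketch. The marker values below are invented, with the convention that higher values indicate disease; raising the positivity threshold lowers sensitivity and raises specificity:

```python
# Illustration of the threshold effect: varying the positivity threshold on a
# continuous marker trades sensitivity against specificity. The marker values
# below are invented ("higher value = more likely diseased").

diseased     = [3.1, 4.2, 5.0, 5.8, 6.4, 7.1]
non_diseased = [1.2, 2.0, 2.8, 3.5, 4.1, 4.9]

def sens_spec(threshold):
    """Sensitivity and specificity when results >= threshold count as positive."""
    sens = sum(x >= threshold for x in diseased) / len(diseased)
    spec = sum(x < threshold for x in non_diseased) / len(non_diseased)
    return sens, spec

for t in (3.0, 4.0, 5.0):
    sens, spec = sens_spec(t)
    print(f"threshold={t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Each threshold yields a different point in ROC space; a set of studies using different thresholds therefore traces out a curve-like scatter even when the underlying test is the same.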
Because threshold effects cause sensitivity and specificity estimates to appear negatively correlated, and because threshold variation can be expected in many situations, robust approaches to meta-analysis take the underlying relationship between sensitivity and specificity into account. One way of doing so is by constructing a summary ROC curve. An average ‘operating point’ on this curve indicates where the centre of the study results lies. Separate pooling of sensitivity and specificity to identify this point has been discredited, because such an approach may identify a summary point which is not representative of the paired data, for example a point which does not lie on the summary ROC curve.
Meta-analyses of studies reporting pairs of sensitivity and specificity estimates have often used the linear regression model for the construction of summary ROC curves proposed by Moses et al., which is based on regressing the log diagnostic odds ratio against a measure of the proportion reported as test positive (36). To examine differences between tests and to relate them to study or sample characteristics, the regression model can be extended by adding covariates (37). However, we now know that the formulation of the Moses model has limitations: it fails to take the precision of the study estimates into account, it does not estimate between-study heterogeneity, and the explanatory variable in the regression is measured with error. These problems render the estimated confidence intervals and P-values unsuitable for formal inference (35).
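For concreteness, a minimal unweighted version of this regression approach can be sketched as follows. The study counts are invented, and the limitations just described (no weighting for precision, measurement error in the explanatory variable, no between-study heterogeneity) apply to this sketch as well:

```python
import math

# Sketch of the Moses linear model for a summary ROC curve: regress
# D = log diagnostic odds ratio on S = logit(TPR) + logit(FPR), a proxy for
# the positivity threshold. Study counts (tp, fp, fn, tn) are invented, and a
# continuity correction of 0.5 is added to every cell.

studies = [(45, 10, 5, 40), (30, 5, 12, 55), (60, 20, 8, 30), (25, 4, 15, 60)]

def logit(p):
    return math.log(p / (1 - p))

D, S = [], []
for tp, fp, fn, tn in studies:
    tpr = (tp + 0.5) / (tp + fn + 1)   # sensitivity, continuity-corrected
    fpr = (fp + 0.5) / (fp + tn + 1)   # 1 - specificity, continuity-corrected
    D.append(logit(tpr) - logit(fpr))  # log diagnostic odds ratio
    S.append(logit(tpr) + logit(fpr))  # threshold proxy

# Unweighted ordinary least squares for D = a + b*S
n = len(studies)
sbar, dbar = sum(S) / n, sum(D) / n
b = (sum((s - sbar) * (d - dbar) for s, d in zip(S, D))
     / sum((s - sbar) ** 2 for s in S))
a = dbar - b * sbar

def sroc_tpr(fpr):
    """Expected TPR on the fitted summary ROC curve at a given FPR."""
    x = a / (1 - b) + ((1 + b) / (1 - b)) * logit(fpr)
    return 1 / (1 + math.exp(-x))

print(f"fitted a={a:.2f}, b={b:.2f}; TPR at FPR 0.1 = {sroc_tpr(0.1):.2f}")
```

The back-transformation of the fitted line into ROC space follows from substituting the definitions of D and S into D = a + bS and solving for logit(TPR).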
Two newly developed approaches to fitting random effects in hierarchical models overcome these limitations: the hierarchical summary ROC model (35) and the bivariate random effects model (38). The hierarchical summary ROC model focuses on identifying the underlying ROC curve, estimating the average accuracy (as a diagnostic odds ratio) and the average threshold (and the unexplained variation in these parameters across studies), together with a shape parameter that describes the asymmetry of the curve. The bivariate random effects model focuses on estimating the average sensitivity and specificity, but also estimates the unexplained variation in these parameters and the correlation between them. The two basic models are mathematically equivalent in the absence of covariates (43). Both models give a valid estimate of the underlying summary ROC curve and the average operating point (38). Addition of covariates to the models, or application of separate models to different subgroups, enables exploration of heterogeneity. Both models can be fitted with statistical software for fitting mixed models (35).
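To illustrate how fitted parameters translate into the quantities a reader sees, the sketch below back-transforms hypothetical bivariate-model means into a summary operating point, and draws the curve implied by hypothetical accuracy and shape parameters under the Rutter–Gatsonis parameterization. All parameter values are invented stand-ins for what a mixed-model fit would return:

```python
import math

# Hypothetical fitted parameters; a real analysis would obtain these from a
# mixed-model fit of the bivariate or hierarchical summary ROC model.
MU_LOGIT_SENS, MU_LOGIT_SPEC = 1.5, 2.0   # bivariate model: mean logits
LAMBDA, BETA = 3.0, 0.2                   # HSROC model: accuracy and shape

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

# Bivariate model: the summary operating point on the probability scale.
summary_sens = inv_logit(MU_LOGIT_SENS)
summary_spec = inv_logit(MU_LOGIT_SPEC)

def hsroc_tpr(fpr):
    """Expected TPR at a given FPR on the HSROC curve; BETA = 0 is symmetric."""
    return inv_logit(LAMBDA * math.exp(-BETA / 2) + math.exp(-BETA) * logit(fpr))

print(f"summary point: sens={summary_sens:.3f}, spec={summary_spec:.3f}")
print(f"HSROC TPR at FPR 0.1: {hsroc_tpr(0.1):.3f}")
```

The shape parameter BETA controls the asymmetry of the curve, mirroring the equivalence between the two models noted above: with no covariates, the bivariate means and (co)variances determine the same underlying curve.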
Estimates of summary likelihood ratios are best derived from summary estimates of sensitivity and specificity obtained using the methods described above. Whilst some authors have advocated pooling likelihood ratios rather than pooling sensitivity and specificity or ROC curves (44), these methods do not account for the correlated bivariate nature of likelihood ratios, and may yield impossible summary estimates and confidence intervals, with positive and negative likelihood ratios either both above or both below 1 (47).
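Deriving summary likelihood ratios from summary sensitivity and specificity is simple arithmetic; the sketch below uses hypothetical summary estimates. Likelihood ratios obtained this way always fall on opposite sides of 1 whenever sensitivity exceeds 1 − specificity, so the impossible pairs described above cannot arise:

```python
# Summary likelihood ratios derived from (hypothetical) summary estimates of
# sensitivity and specificity, rather than from pooling LRs directly.

summary_sens, summary_spec = 0.82, 0.88

lr_pos = summary_sens / (1 - summary_spec)   # positive likelihood ratio
lr_neg = (1 - summary_sens) / summary_spec   # negative likelihood ratio

# For any test performing above chance (sens > 1 - spec), derived LRs satisfy
# LR+ > 1 > LR-; separately pooled LRs can violate this coherence.
assert lr_pos > 1 > lr_neg
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```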
Curves or summary estimates?
The ability to estimate underlying summary ROC curves and average operating points allows flexibility in testing hypotheses and estimating diagnostic accuracy. Analyses based on all included studies facilitate well-powered comparisons between different tests or between subgroups of studies, which are not restricted to investigating accuracy at a particular threshold. One of the accompanying figures shows a summary ROC curve for the diagnostic accuracy of a tumor antigen test for diagnosing bladder cancer. In contrast, when a test is used at the same threshold in all included studies, review authors may compute a summary estimate of sensitivity and specificity. The certainty associated with the estimate can be described by confidence regions marked on the summary ROC plot around the average point; a further figure shows an example of this approach.
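Such a confidence region is usually constructed in logit space and back-transformed for display on the ROC plot. The sketch below draws a 95% ellipse from hypothetical means, standard errors, and correlation, which stand in for mixed-model output:

```python
import math

# Sketch of a 95% confidence ellipse around the summary point, drawn in logit
# space and back-transformed to (specificity, sensitivity) for an ROC plot.
# All parameter values are hypothetical stand-ins for mixed-model output.

mu_sens, mu_spec = 1.5, 2.0     # mean logit sensitivity / logit specificity
se_sens, se_spec = 0.25, 0.30   # standard errors of the two means
rho = -0.4                      # estimated correlation between the means

CHI2_95_DF2 = 5.991             # 95% quantile of chi-squared with 2 df

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def confidence_ellipse(n_points=100):
    """Points (specificity, sensitivity) on the 95% confidence boundary."""
    r = math.sqrt(CHI2_95_DF2)
    pts = []
    for i in range(n_points):
        t = 2 * math.pi * i / n_points
        # parametric boundary of a bivariate normal, via its Cholesky factor
        x = mu_spec + r * se_spec * math.cos(t)
        y = mu_sens + r * se_sens * (rho * math.cos(t)
                                     + math.sqrt(1 - rho ** 2) * math.sin(t))
        pts.append((inv_logit(x), inv_logit(y)))
    return pts
```

Because the ellipse is drawn in logit space, its back-transformed boundary always stays inside the unit square, unlike a region drawn directly on the probability scale.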
Judgments about the validity of pooling data should be informed by considering the quality of the studies, the similarity of patients and tests being pooled, and whether the results may consequently be misleading. Where there is statistical heterogeneity in results, random effects models will describe the variability and uncertainty in the estimates; such heterogeneity may make it difficult to draw firm conclusions about the accuracy of a particular test.
Systematic reviews of diagnostic test accuracy may evaluate more than one test, to determine which test or combination of tests can better serve the intended purpose. Indirect comparisons can be made by calculating separate summary estimates of sensitivity and specificity for each test, including all studies that have evaluated that test, regardless of whether they evaluated the other tests. The substantial variability that can be expected between tests means that such comparisons are prone to confounding. Restricting inclusion to studies of similar design and patient characteristics may limit confounding. A theoretically preferable approach is to use only studies that have directly compared the tests in the same patients, or have randomized patients to one of the tests. Such direct comparisons do not suffer from confounding. Paired analyses can be displayed in a ROC plot by linking the sensitivity-specificity pairs from each study with a dashed or dotted line, as in Figure 4. Unfortunately, fully paired studies are not always available.
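The advantage of the paired design can be sketched as follows: only studies that evaluated both tests contribute, and each study supplies a within-study difference that is free of between-study confounding. The accuracy values are invented, and the unweighted averaging is a deliberate simplification of a full meta-analytic model:

```python
# Sketch of a direct (within-study) comparison of two index tests. Each study
# evaluated both tests in the same patients, so per-study differences in
# accuracy are unaffected by between-study differences in case mix or design.
# All accuracy values are invented; unweighted averages are used for brevity.

# (sens_test_A, spec_test_A, sens_test_B, spec_test_B) per paired study
paired_studies = [
    (0.80, 0.70, 0.55, 0.90),
    (0.75, 0.65, 0.50, 0.88),
    (0.85, 0.72, 0.60, 0.92),
]

# Within-study differences preserve the pairing; only studies assessing both
# tests contribute, unlike an indirect comparison across separate study pools.
diff_sens = [a - b for a, _, b, _ in paired_studies]
diff_spec = [x - y for _, x, _, y in paired_studies]

mean_diff_sens = sum(diff_sens) / len(diff_sens)   # test A more sensitive
mean_diff_spec = sum(diff_spec) / len(diff_spec)   # test B more specific

print(f"mean difference in sensitivity (A - B): {mean_diff_sens:+.2f}")
print(f"mean difference in specificity (A - B): {mean_diff_spec:+.2f}")
```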
Figure 4 Direct comparison of two index tests for bladder cancer: cytology (squares) and bladder tumor antigen (diamonds). Figure 4a shows the summary ROC curve that can be drawn through these values. Figure 4b shows the summary point estimate of sensitivity …