We now describe a new system for rating the strength of evidence. Like the GRADE system, our system emphasizes the need for making judgments explicit, and uses the equivalent of a formal point system for rating the strength of a body of evidence. Our system, however, is unique in several fundamental respects. Below, we describe the three prominent areas of uniqueness: 1) the distinction between quantitative and qualitative conclusions; 2) extensive use of *a priori *criteria for judgments; and 3) the direct impact of meta-analysis and sensitivity analysis on evidence ratings. We then provide details on the system itself, including graphical depictions of how judgments can be combined (Figure , Figure , Figure , Figure , and Figure ).

Our system is designed to rate the stability and strength of evidence for each outcome an analyst chooses to evaluate (see Note 2), and is not intended to produce overall recommendations for (or against) a technology. By contrast, both USPSTF and GRADE rate the strength of the *overall evidence for the technology* (which is often based on several outcomes), and the strength of a *recommendation about the technology*. The first involves assessing the overall net benefit by balancing the benefits and harms. The second entails more judgments (listed in the USPSTF methods article) about the importance and impact of cost, ethics, law, patient expectations, and societal expectations [1]. Also, unlike GRADE and USPSTF, our evidence rating system focuses on only the *internal validity* of the evidence. Questions about generalizability can be addressed outside the scope of the rating system. We are currently examining how well our evidence rating system can be applied within the overall GRADE framework.

Quantitative and Qualitative Conclusions

Our system draws a distinction between two types of conclusions: quantitative and qualitative (see the bottom half of Table ). The quantitative conclusion addresses the question, "How well does it work?", and we refer to the corresponding rating as a "stability" rating. By contrast, the qualitative conclusion addresses the more general question, "Does it work?", and we refer to the rating of the evidence pertaining to this conclusion as a "strength" rating. Thus, a quantitative conclusion characterizes the size of the effect, whereas a qualitative conclusion characterizes the direction of the effect.

This key distinction allows one to draw a strong qualitative conclusion in the face of quantitatively heterogeneous data. Such a situation arises when the results of all studies included in an evidence base demonstrate efficacy, but the magnitude of measured treatment effect differs considerably across studies.

This situation is illustrated in a recent systematic review on drug-eluting stents for coronary artery disease [13]. This review included 14 randomized trials that compared drug-eluting stents to bare metal stents. Each trial reported the percentage of patients in each arm who underwent target lesion revascularization (TLR) after stent implantation (Figure ). A homogeneity test of these data identified substantial heterogeneity among trial results (Q = 59, p < 0.0001, I^{2} = 78%), and subsequent meta-regression analyses did not explain this heterogeneity. Consequently, we refrained from presenting an estimate of the size of treatment effect. However, all trials found that TLR rates were lower after implantation of a drug-eluting stent than after a bare metal stent, and the random-effects meta-analytic confidence interval demonstrated this clear direction of effect (see the bottom of Figure ). Thus, although one can have little confidence in the accuracy of a single quantitative estimate of the effect size, one can have high confidence that drug-eluting stents are effective in reducing TLR rates. The quantitative/qualitative distinction also underpins two notions of *consistency*. The first is quantitative consistency: do the studies report similar effect sizes? The second is qualitative consistency: do the studies report the same direction of effect?
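The two heterogeneity statistics quoted for the stent trials can be related to each other. The following is a minimal sketch, assuming the commonly used definition I^{2} = max(0, (Q - df)/Q) with df equal to the number of studies minus one; the function name is ours and is not taken from the original report.

```python
def i_squared(q: float, n_studies: int) -> float:
    """Return I^2 as a percentage, given Cochran's Q and the number of studies.

    Assumes the usual definition I^2 = max(0, (Q - df) / Q), df = n_studies - 1.
    """
    df = n_studies - 1
    if q <= 0:
        return 0.0
    # Negative values are truncated to zero (no heterogeneity beyond chance).
    return max(0.0, (q - df) / q) * 100.0


# Q and the trial count as reported for the drug-eluting stent review.
print(round(i_squared(59, 14)))
```

With Q = 59 and 14 trials, this reproduces (after rounding) the I^{2} of 78% reported above.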

Another important purpose of differentiating quantitative from qualitative conclusions is to acknowledge the different needs of those who utilize systematic reviews. Some users are primarily interested in obtaining an estimate of the amount of benefit (or harm) associated with a technology. Other users are simply interested in whether the technology provides any benefit at all. If a systematic review provides both kinds of conclusions, then both needs are met.

No other system for rating medical evidence distinguishes explicitly between quantitative and qualitative conclusions. We believe the distinction is critical to ensure that systematic reviews provide a full picture of the evidence. The GRADE group defines their ratings as the likelihood that further evidence will change one's confidence in the size of the effect, which appears to be a purely quantitative definition. The USPSTF system does not state whether its ratings refer to quantitative conclusions, qualitative conclusions, or both.

In our system, stability and strength ratings are not independent. Logically, evidence that permits a "highly stable" estimate of treatment effect (e.g., an odds ratio of 3.25 in favor of treatment) must also permit a "strong" conclusion about the direction of effect (e.g., that the odds ratio favors treatment). Thus, one built-in feature of our system is that the stability rating sets a lower bound on the strength rating. This means that "moderate" stability can be accompanied by a strength rating no lower than "moderate", and "low" stability can be accompanied by a strength rating no lower than "weak."
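The lower-bound relationship can be sketched as a simple lookup. This is only an illustration: the label "inconclusive" for the lowest strength level and the treatment of "unstable" as imposing no floor are our assumptions, and the function name is hypothetical.

```python
# Strength levels from weakest to strongest ("inconclusive" is an assumed label).
STRENGTH_ORDER = ["inconclusive", "weak", "moderate", "strong"]

# Lowest strength rating that may accompany each stability rating.
# "unstable" is assumed to impose no floor on the strength rating.
STRENGTH_FLOOR = {
    "high": "strong",
    "moderate": "moderate",
    "low": "weak",
    "unstable": "inconclusive",
}


def strength_is_compatible(stability: str, strength: str) -> bool:
    """Check that a strength rating is not below the floor set by stability."""
    floor = STRENGTH_FLOOR[stability]
    return STRENGTH_ORDER.index(strength) >= STRENGTH_ORDER.index(floor)


print(strength_is_compatible("high", "moderate"))    # floor is "strong"
print(strength_is_compatible("unstable", "strong"))  # no floor applies
```

Note that the floor works in one direction only: an unstable quantitative estimate can still coexist with a strong qualitative conclusion.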

Crucial to understanding the results of a systematic review is understanding whether the results are clinically important; statistically significant results do not necessarily represent a clinically important effect. This has been mentioned in many systematic reviews [14-18], and our quantitative/qualitative distinction provides an analytic approach. To address clinical importance in our system, one first defines precisely the magnitude of effect that is considered clinically important (e.g., a difference of 0.5% in HbA_{1c} in treatments for diabetes). Then, clinical importance can be addressed as a qualitative question: "Is the difference clinically important?" This question is addressed analytically via a comparison of effect sizes to an effect size predefined as clinically important.

Extensive *A Priori *Criteria for Judgments

Most systematic reviews use *a priori *inclusion criteria to reduce the potential for bias in judgments about which studies to include. Some also make an *a priori *judgment about which instrument will be used to assess study quality. However, many other judgments are still susceptible to bias. To reduce this potential, our system specifies the use of *a priori *judgments wherever possible. For example, the system requires that one specify *a priori *quantitative definitions of "consistent" and "robust" effects. Also, the analyst must pre-specify the minimum percentage of included studies that reported the outcome of interest in order to permit a meta-analytic estimate of effect size. If only a small percentage of included studies reported the outcome, selective outcome reporting may have occurred, thereby biasing the meta-analytic summary statistic. For study quality, the analyst must identify not only the instrument to be used, but also the scoring system (if used) and the thresholds that define study quality categories (high, moderate, or low quality). Even the threshold for statistical significance (which does not have to be the conventional 0.05 in all contexts because some clinical contexts may warrant greater or lesser concern about Type 1 errors) must be specified beforehand. To address the question of clinical importance, the minimum level considered to be clinically important must also be determined *a priori*. These definitions, and others, are discussed further in the section entitled "How the System Works".

Consequences of Meta-Analysis and Sensitivity Analysis

Many systematic reviews report the results of meta-analyses, and some also describe sensitivity analyses. Often, however, the results of these statistical analyses are not explicitly tied to ratings of the evidence. In this section, we describe how our system links analytic results to both stability and strength ratings.

The purpose of meta-analysis is not just to obtain a summary estimate of treatment effect, but also to test the data for consistency (heterogeneity testing). This latter purpose is typically accomplished using the Q-statistic, and more recently, I^{2} [19,20]. If important heterogeneity is detected, our system requires that, whenever appropriate (e.g., when the evidence base is large enough), the analyst explore potential sources of this heterogeneity using meta-regression. If heterogeneity cannot be explained by meta-regression, then our system precludes one from presenting a single summary estimate of treatment effect (i.e., a stability rating of "Unstable") (see Note 3). Some investigators advocate the use of a random effects summary statistic in this situation. However, unexplained heterogeneity could be due to differences in patient populations, and/or the way a treatment is administered. Our view is that computing a *single* summary estimate is not warranted when the evidence demonstrates the existence of *multiple* estimates.

Although our system precludes the use of random-effects models in determining a single summary estimate of treatment effect, the use of these models does have an important role. This role involves a summary of the evidence to support a *qualitative *conclusion. Even if there is substantial unexplained heterogeneity, the evidence may still indicate a consistent *direction *of effect. The confidence interval (CI) around the random effects summary statistic, which incorporates both within-study and between-study variance, may lie fully above 0 or below 0 (see Note 4). This CI, therefore, is suitable for determining whether the data suggest a clear direction of effect.
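As an illustration of how such a confidence interval can be used to judge direction of effect, the sketch below computes a random-effects summary with the DerSimonian-Laird estimator. The choice of estimator is our assumption (the text does not name one), and the effect sizes and variances are hypothetical.

```python
import math


def random_effects_ci(effects, variances, z=1.96):
    """DerSimonian-Laird random-effects mean and confidence interval (assumed estimator).

    effects   -- per-study effect sizes
    variances -- per-study within-study variances
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / total
    # Cochran's Q: weighted dispersion of study effects around the fixed-effect mean.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = total - sum(w * w for w in weights) / total
    # Between-study variance (method of moments), truncated at zero.
    tau2 = max(0.0, (q - df) / c) if df > 0 and c > 0 else 0.0
    re_weights = [1.0 / (v + tau2) for v in variances]
    re_total = sum(re_weights)
    mean = sum(w * y for w, y in zip(re_weights, effects)) / re_total
    se = math.sqrt(1.0 / re_total)
    return mean, mean - z * se, mean + z * se


def direction_of_effect(lower, upper):
    """Report a direction only when the interval lies fully on one side of zero."""
    if lower > 0:
        return "positive"
    if upper < 0:
        return "negative"
    return "unclear"


# Hypothetical, visibly heterogeneous effect sizes that all favor treatment.
mean, lower, upper = random_effects_ci([0.2, 0.5, 0.9, 0.4], [0.01, 0.02, 0.01, 0.015])
print(direction_of_effect(lower, upper))
```

Because the between-study variance widens the interval, a random-effects interval that still excludes zero supports a direction-of-effect conclusion even when no single effect size can be defended.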

Other systems, such as GRADE and USPSTF, are largely silent on the role of meta-analysis in systematic reviews. Our system uses meta-analysis and meta-regression (when clinically appropriate) to increase statistical power and employ precise study weights; furthermore, the system is unique in incorporating the results of these analyses into evidence ratings.

We now turn to the role of sensitivity analysis in our system. In this context, consider that the goal of rating evidence is to assess the likelihood that future evidence will indicate something different than what current evidence indicates. If a large amount of consistent evidence has already accumulated, then future evidence is unlikely to alter the overall strength or stability. Conversely, conclusions based on only a small amount of accumulated evidence may easily change when a single new study is published.

Considered from this perspective, we argue that sensitivity analysis (see Note 5) can substitute for certain judgments about quantity. The idea is that if the conclusion from a meta-analysis depends critically on only one or a few studies in that analysis (or if there is reason to suspect that not all relevant studies are available), then the conclusion may not be robust. Such dependence suggests that a future study may alter conclusions based on currently available studies. Consequently, our system downgrades the stability or strength ratings accordingly. Although there is a widespread sense that sensitivity analysis should be incorporated into an analysis, the system is unique in offering explicit rules for how to gauge the impact of the results of sensitivity analyses on one's confidence in the available evidence.

Sensitivity analysis can obviate the need for certain subjective judgments about the magnitude of effect. Some rating systems (e.g., GRADE) employ such judgments, and if the observed effect is very large, the evidence receives a higher strength rating. Presumably, this is because a very large effect is less likely to be overturned by future evidence and is therefore more robust. However, if there are sufficient studies to perform direct robustness tests via sensitivity analyses, then we advocate doing so, in lieu of making judgments about effect sizes. A meta-analytic sensitivity analysis incorporates effect sizes and confidence intervals from all studies, so that the test is empirically-based.

As with consistency, the quantitative/qualitative distinction helps clarify two notions of robustness. Quantitative robustness concerns the degree to which the summary effect size from meta-analysis tends to change based on relatively small alterations in the data. To assess quantitative robustness, one can perform successive meta-analyses and observe the relative changes in the summary estimate. If the changes in the estimate exceed a predetermined tolerance level, then the original summary estimate is not quantitatively robust. Qualitative robustness refers to whether the evidence base yields the same general qualitative conclusion upon alterations of the data. To assess it, one can again perform successive meta-analyses, but in this case the issue is whether the confidence intervals around summary statistics consistently indicate the same direction of effect.

For example, one qualitative robustness test we have employed utilized cumulative meta-analysis [21]. In a report on treatments for bulimia [22], we included seven randomized trials that compared the efficacy of pharmacotherapy to placebo and reported mean purging frequency. A random-effects meta-analysis found that medication yielded significantly greater effects than placebo (i.e., lower purging frequency). We tested the qualitative robustness of this finding in the following manner (see Figure ). The 95% confidence interval of the study with the largest weight (as determined by the inverse of the variance) in the meta-analysis was plotted first (the topmost horizontal segment in the figure). Then we added the study with the next largest weight, and plotted the corresponding random-effects 95% confidence interval for the two-study meta-analysis (the second segment from the top in the figure). Then we continued adding studies, one at a time, until all meta-analytic confidence intervals were plotted.

*A priori*, we had defined a qualitatively robust evidence base as one where each of the last three cumulative meta-analyses yielded the same qualitative conclusion. Because the evidence base met this criterion, we deemed it to be qualitatively robust.
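The cumulative procedure can be sketched as follows. This is a minimal illustration, not the code used in the bulimia report: the DerSimonian-Laird estimator is our assumption, the function names are ours, and the seven effect sizes and variances are hypothetical.

```python
import math


def dl_interval(effects, variances, z=1.96):
    """Random-effects (DerSimonian-Laird, assumed) confidence interval."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / total
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    c = total - sum(w * w for w in weights) / total
    # A single study has no between-study variance to estimate.
    tau2 = max(0.0, (q - df) / c) if df > 0 and c > 0 else 0.0
    re_weights = [1.0 / (v + tau2) for v in variances]
    re_total = sum(re_weights)
    mean = sum(w * y for w, y in zip(re_weights, effects)) / re_total
    se = math.sqrt(1.0 / re_total)
    return mean - z * se, mean + z * se


def cumulative_intervals(effects, variances):
    """Add studies one at a time, largest inverse-variance weight first."""
    order = sorted(range(len(effects)), key=lambda i: variances[i])
    intervals = []
    for k in range(1, len(order) + 1):
        chosen = order[:k]
        intervals.append(
            dl_interval([effects[i] for i in chosen], [variances[i] for i in chosen])
        )
    return intervals


def qualitatively_robust(intervals, last_n=3):
    """True if the last `last_n` intervals all lie on the same side of zero."""
    tail = intervals[-last_n:]
    return all(lo > 0 for lo, _ in tail) or all(hi < 0 for _, hi in tail)


# Hypothetical data: negative effects stand for lower purging frequency.
effects = [-0.5, -0.4, -0.6, -0.3, -0.7, -0.45, -0.55]
variances = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07]
print(qualitatively_robust(cumulative_intervals(effects, variances)))
```

Ordering by weight means the last analyses add the least precise studies, so a change of direction at that stage signals that the conclusion hinges on weak evidence.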

How the System Works

The system is shown graphically in five figures:

• Entry into system (Figure )

• Overview of the high quality arm (Figure )

• Homogeneous data (Figure )

• Heterogeneous data (Figure )

• Small evidence base (Figure )

An important feature of this system is that every question illustrated in the figures requires a set of *a priori *criteria. Nearly all of these *a priori *criteria are operational definitions that are quantitative. The use of *a priori *criteria helps to reduce bias and subjectivity, as discussed above, and the use of quantitative definitions increases transparency. This system assumes that the assessor has already applied appropriate inclusion/exclusion criteria and has excluded from the analysis any study with fatal flaws.

The initial entry into the system occurs with an assessment of the quality of the evidence for a specific outcome (Figure ), which we consider to be the most important aspect of the evidence. Quality sets an upper bound on the stability and strength ratings (e.g., moderate strength is only possible for data that are, at minimum, of moderate quality). Although quality evaluation can be performed with a checklist or scale, any reasonable method for separating the evidence base into different categories of quality will suffice. After an evaluation of individual study quality, studies are judged to be high, moderate, low, or very low quality. Studies of very low quality are always excluded from the evidence base, and the analyst may also choose to exclude low or even moderate quality studies. The analyst must choose a method for aggregating the quality of the individual studies to obtain an overall quality rating for the evidence base and then enter the high, moderate, or low quality arm of the system. Within these arms, the system further assesses the quantity, consistency, robustness, and (in some instances) magnitude of effect to determine the stability and strength of the evidence.

Figure through Figure detail the high quality arm of the system. The top half of each figure includes all of the questions and decisions that impact stability ratings (and quantitative conclusions), while the bottom half includes all of the questions and decisions that impact strength ratings (and qualitative conclusions). The moderate and low quality arms of the system are not shown because all aspects of this system are already displayed in the high quality arm.

At the top of these pathways, one first considers whether the evidence base is sufficient to provide a single quantitative estimate of the effect size. We generally require at least three studies, but other investigators may wish to set this criterion higher (e.g., five studies). Additionally, the system requires that a certain percentage of the studies (e.g., 80% or more) must have calculable effect sizes (that can be determined without imputation). If these criteria *are not met*, then one proceeds to Figure (small evidence base). If these criteria *are met*, then one tests the quantitative consistency of the data using a heterogeneity measure such as Q or I^{2}. Under homogeneity, one proceeds to Figure , whereas under heterogeneity, one proceeds to Figure .
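The routing decision described in this paragraph can be sketched as a small function. The default thresholds are the example values from the text (three studies, 80% with calculable effect sizes) together with the 50% I^{2} and 0.10 Q p-value cut-offs used in Example #1; treating either heterogeneity criterion as sufficient is our reading, and the function and pathway names are hypothetical.

```python
from typing import Optional


def choose_pathway(
    n_studies: int,
    n_calculable: int,
    i2_percent: float,
    q_p_value: Optional[float] = None,
    min_studies: int = 3,
    min_fraction: float = 0.80,
    i2_threshold: float = 50.0,
    q_alpha: float = 0.10,
) -> str:
    """Route an evidence base to one of the three pathways in the figures."""
    if n_studies < min_studies or n_calculable / n_studies < min_fraction:
        return "small evidence base"
    # Assumption: exceeding either pre-specified threshold counts as heterogeneity.
    heterogeneous = i2_percent >= i2_threshold or (
        q_p_value is not None and q_p_value < q_alpha
    )
    return "heterogeneous data" if heterogeneous else "homogeneous data"


# Stent example: 14 trials, all with calculable effects, I^2 = 78%, p < 0.0001.
print(choose_pathway(14, 14, 78.0, 0.0001))
# PET example: 5 studies, all calculable, I^2 = 0%.
print(choose_pathway(5, 5, 0.0))
```

All thresholds are arguments rather than constants because the system requires the analyst to fix them *a priori* for each review.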

Before detailing the steps in Figures and , we must first define the concept of "informativeness", a concept crucial to interpreting the results of individual studies and meta-analyses. Figure illustrates four different effect sizes (A through D) that are considered informative based on criteria discussed in Armitage and Berry [23]. These effects are informative because the confidence intervals around the summary effect estimates support one of four qualitative conclusions: A) the treatment is beneficial *and* the effect is clinically important (i.e., the lower bound of the 95% confidence interval around the meta-analytic summary statistic is greater than the effect size deemed clinically important); B) the treatment is beneficial but the effect may or may not be clinically important (i.e., the lower bound of the 95% confidence interval around the meta-analytic summary statistic is greater than zero but less than a clinically important effect); C) the treatment is beneficial but the effect is not clinically important (the 95% confidence interval is between zero and the effect deemed clinically important); or D) the treatment is not beneficial (the 95% confidence interval overlaps zero and does not overlap the line of clinical importance) (see Note 6). By contrast, example E in Figure would be considered inconclusive (non-informative) because the 95% confidence interval overlaps both zero and the line of clinical importance. Note that this use of "informativeness" accounts for the statistical power of the evidence base, another unique feature of our system (for a related discussion see Armitage and Berry [23]). Moreover, by incorporating clinical importance into the system, we provide clinical meaning for the end-users of systematic reviews and other evidence-based documents.
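The five cases can be expressed as a classification of a confidence interval against zero and a pre-specified clinically important effect. The sketch below assumes that positive effects denote benefit; the function name, the handling of boundary values, and the label for intervals lying entirely below zero (a case the text does not illustrate) are our assumptions.

```python
def classify_informativeness(lower: float, upper: float, important: float) -> str:
    """Classify a 95% CI as case A-E, given the clinically important effect size.

    Assumes positive effects mean benefit and `important` > 0.
    """
    if lower > upper or important <= 0:
        raise ValueError("need lower <= upper and a positive clinically important effect")
    if lower > important:
        return "A"  # beneficial and clinically important
    if lower > 0:
        # Beneficial; importance depends on whether the CI reaches the threshold.
        return "B" if upper >= important else "C"
    if upper < 0:
        return "not illustrated"  # interval entirely below zero
    if upper < important:
        return "D"  # overlaps zero, excludes a clinically important benefit
    return "E"      # overlaps both zero and the line of clinical importance


for ci in [(0.6, 0.9), (0.2, 0.8), (0.1, 0.4), (-0.1, 0.3), (-0.1, 0.7)]:
    print(ci, classify_informativeness(ci[0], ci[1], important=0.5))
```

Only case E is non-informative: a narrower interval (that is, more statistical power) would be needed to move it into one of the other four cases.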

In the homogeneous pathway (Figure ), one performs a meta-analysis to combine the study results. If the meta-analytic summary statistic is not informative, then no conclusions are reached. If the summary statistic is informative, one tests the robustness of the findings through sensitivity analysis (e.g., removal of one study at a time). If the meta-analytic summary statistic passes the robustness tests, the estimate is quantitatively robust. This produces a high stability rating for the quantitative estimate, which leads directly to a strong qualitative conclusion. The logic behind this implication is that if one is confident in the specific estimate of the effect, one is automatically confident in the general direction of that effect.
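One of the robustness tests mentioned above, removal of one study at a time, can be sketched as follows. The fixed-effect inverse-variance model and the 5% relative tolerance are assumptions made for illustration (the text leaves both to the analyst's *a priori* definitions), and the data are hypothetical.

```python
def fixed_effect_mean(effects, variances):
    """Inverse-variance weighted mean (fixed-effect model, assumed here)."""
    weights = [1.0 / v for v in variances]
    return sum(w * y for w, y in zip(weights, effects)) / sum(weights)


def leave_one_out_robust(effects, variances, tolerance=0.05):
    """True if dropping any single study shifts the summary by at most `tolerance`.

    `tolerance` is a relative change (0.05 = 5%) and must be fixed a priori.
    """
    overall = fixed_effect_mean(effects, variances)
    for i in range(len(effects)):
        rest_effects = effects[:i] + effects[i + 1:]
        rest_variances = variances[:i] + variances[i + 1:]
        reduced = fixed_effect_mean(rest_effects, rest_variances)
        if abs(reduced - overall) > tolerance * abs(overall):
            return False
    return True


# Hypothetical homogeneous evidence base: no single study dominates.
print(leave_one_out_robust([0.50, 0.52, 0.48, 0.51], [0.01, 0.01, 0.01, 0.01]))
# Hypothetical evidence base whose summary depends on one outlying study.
print(leave_one_out_robust([0.1, 0.1, 1.0], [0.01, 0.01, 0.01]))
```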

Continuing within Figure , if the findings are not quantitatively robust, one re-examines the sensitivity analyses to determine qualitative robustness (e.g., do any of the last three analyses in a cumulative meta-analysis of a given data set lead to a different qualitative conclusion?). Additional sensitivity analyses that can be used include removing each study separately or changing the effect size statistic (e.g., using Cohen's h instead of an odds ratio). We also consider tests for publication bias to be a form of sensitivity analysis, although publication bias testing requires a minimum number of available studies. Whether the findings are qualitatively robust determines whether one reaches a strong or moderate qualitative conclusion. Also, one can only reach a strong conclusion from a high quality evidence base. In the moderate and low quality arms, the qualitative conclusion can never be stronger than moderate or weak, respectively.

Figure illustrates the branch followed when an evidence base has enough studies with calculable effect sizes to potentially reach a quantitative effect estimate, but the heterogeneity test indicates significant differences among the studies. If this heterogeneity can be explained using meta-regression, one can still reach a quantitative conclusion. The quantitative conclusion is the conclusion reached about the regression coefficients, including the intercept. For example, if gender is the variable that explains heterogeneity, one might have a conclusion such as "treatment X improved symptoms twice as effectively in women as in men". If meta-regression is not possible or does not explain heterogeneity, no quantitative conclusion is possible. However, one can still perform a random-effects meta-analysis which, if informative, may allow a qualitative conclusion.

Figure illustrates what occurs when the evidence base is too small or otherwise insufficient to allow a quantitative conclusion. Some studies may not report effect sizes and standard errors (nor sufficient information for the analyst to calculate both measures). The analyst must acknowledge and adjust for the existence of such studies. This adjustment may require the estimation or imputation of effect sizes in certain studies [24]. The full evidence base is then assessed in a random-effects meta-analysis to determine if a qualitative conclusion can be reached.

If there are only two studies and both have calculable effect sizes, one performs a random-effects meta-analysis which, if informative, allows a qualitative conclusion. Of note, meta-analysis of a two-study evidence base is not required in the moderate quality arm, where a conclusion would require both studies to have a statistically significant effect. In the low quality arm, a minimum of three studies is required to reach any conclusion. A qualitative conclusion is also possible for two studies with imprecise effect sizes (that cannot be combined) if both studies are informative and show qualitatively consistent results. If two studies are qualitatively inconsistent or not informative when combined, the findings are inconclusive. If there is only one study, a large effect size is required to allow a weak qualitative conclusion (note that one cannot reach a conclusion if the single study is of moderate or low quality).

Examples

In this section, we provide two example applications of our evidence rating system. In addition to illustrating various aspects of the system, we show how the system can be used in conjunction with simple declarative conclusions that are tied to the stability and strength ratings. The first example involves drug-eluting stents (DESs) for the treatment of coronary artery disease, and the second example involves positron emission tomography (PET) in the staging of lymphoma.

Example #1: Comparison between Drug-Eluting Stents and Bare-Metal Stents for the Treatment of Coronary Artery Disease

In a 2006 report, we examined the evidence comparing the safety and efficacy of drug-eluting stents (DESs) and bare-metal stents for the treatment of angina [13].

Evidence base We included 14 randomized trials that compared drug-eluting stents to bare metal stents and reported the percentage of patients in each group who underwent target lesion revascularization (TLR) after stent implantation. The trials enrolled a total of 7,006 patients. We addressed the quantitative issue of the size of the difference in overall TLR rates and also the qualitative issue of whether there was any difference in overall TLR rates between the two types of stents.

Study quality assessment To assess the quality of the studies, we applied a quality rating scale, and determined (using *a priori* definitions of high, moderate, and low quality) that the evidence base was of high quality.

Sufficient data for quantitative estimate In order to attempt a quantitative estimate of the effect, we required *a priori *that there must be at least three studies, and at least half of the included studies reported sufficient information for us to calculate effect sizes and confidence intervals. All 14 studies reported such information, so we attempted to make a quantitative estimate.

Heterogeneity testing *A priori*, we defined quantitative consistency based on thresholds for Q and I^{2}. These thresholds were a p value for the Q statistic less than 0.10 (which would mean quantitative inconsistency) and I^{2} ≥ 50% (which would also mean quantitative inconsistency). We used a p value of 0.10 because of the known low power of Q. We used a threshold of 50% for I^{2} because this value represents moderate heterogeneity [19,20]. For this evidence base, we performed a meta-analysis using Cohen's h and found substantial heterogeneity (I^{2} = 78%, Q = 59, p value for Q was less than 0.000001).

Meta-regressions to explain heterogeneity Given the heterogeneity of effect sizes, we performed meta-regressions in an attempt to explain why study results differed. Not all potential covariates could be examined because not all studies reported some covariates. The three covariates we could examine were drug type (paclitaxel or sirolimus), mean target vessel diameter, and mean target lesion length. None of these three factors were sufficient to explain the observed heterogeneity. Therefore, we did not draw a quantitative conclusion for the outcome of TLR rates. However, we proceeded to a qualitative analysis using a random-effects model to determine whether the evidence permitted a qualitative conclusion.

Stability rating We rated the stability of the evidence as Unstable, due to the unexplained heterogeneity among effect sizes.

Informativeness We performed a random-effects meta-analysis using Cohen's h. The summary statistic was statistically significant (lower TLR rates among patients who received DESs) and clinically important (because TLR is an important patient-oriented outcome, we defined clinical importance *a priori *as any statistically significant effect). This meant that the evidence was informative.

Qualitative robustness testing *A priori*, we defined a qualitatively robust evidence base as one in which the confidence intervals of the last three cumulative, random-effects meta-analyses remained fully on the same side of zero after: 1) removal of the study with the smallest weight (i.e., the lowest precision), and 2) the additional removal of the study with the second smallest weight in the meta-analysis. Because the evidence base met both of these criteria, we deemed the evidence base to be qualitatively robust.

Strength rating The strength rating of the evidence for a qualitative difference in TLR rates was Strong, because the meta-analysis was of high quality studies, and was informative and qualitatively robust.

Wording of conclusions The conclusions for this outcome were phrased in the following manner:

• The use of DESs (Cypher and TAXUS stents) leads to lower overall TLR rates than use of bare-metal stents in patients with angina at 6 to 12 months following stent implantation. (Strength of evidence: Strong)

• Due to unexplainable differences among the findings of different trials, one cannot accurately determine how much lower these rates are at 6 to 12 months following implantation of a DES.

Example #2: Positron Emission Tomography for the Staging of Lymphoma

In a 2006 report prepared by ECRI's Health Technology Assessment Information Service (HTAIS) under contract to TRICARE Management Activity (see Note 7), we assessed the use of positron emission tomography (PET) in the staging of lymphoma. The reference standard for determining whether lymphoma has reached Stage IV is a bone marrow biopsy typically taken from the iliac crest. PET may potentially help patients avoid the invasiveness of bone marrow biopsy, depending on how accurately it detects bone marrow infiltration.

Evidence base We included five diagnostic cohort studies that performed PET as well as bone marrow biopsy and also reported sufficient information for the calculation of sensitivity and specificity. The studies reported data on a total of 243 patients. Because our research question was to determine a quantitative estimate of diagnostic accuracy, we did not attempt to draw a qualitative conclusion. Therefore, the text below refers only to the stability rating, not to a strength rating.

Study quality assessment To assess the quality of the studies, we applied a quality rating scale, and determined (using *a priori *definitions of high, moderate and low quality) that the evidence base was of moderate quality. To rate the strength and stability of this evidence, we used the moderate quality branch of the system, which is not included in this paper, but is structurally very similar to the high quality arm. The key difference is an across-the-board decrease in both stability and strength ratings (e.g., "Strong" in the high quality branch corresponds to "Moderate" in the moderate quality branch).

Sufficient data for quantitative estimate *A priori*, we decided that an evidence base could be considered sufficient to permit a quantitative estimate if there were at least three studies and also if at least 75% of the included studies had reported effect sizes or had provided sufficient information for the calculation of effect sizes. In this case, there were five studies, and both sensitivity and specificity were calculable for all five studies. Therefore, we proceeded with the quantitative analysis.

Heterogeneity testing *A priori*, we defined quantitatively consistent results as an I^{2} of less than 50% (see above). We computed the diagnostic odds ratio for each of the five studies, and the heterogeneity test revealed no heterogeneity (I^{2} = 0%). Therefore, based on the definition of quantitative consistency, we deemed these findings to be quantitatively consistent. There was a threshold effect in the data, however, as evidenced by a plot of the data in ROC space and a strong negative correlation between sensitivity and specificity. We used the method of Littenberg and Moses [25] to construct a symmetric summary ROC curve, and computed a summary diagnostic odds ratio of 12.7 (95% confidence interval 5.5 to 29.7). At the mean threshold in the included studies, this estimate corresponded to a sensitivity of 64% and a specificity of 88%.

Informativeness When evaluating this diagnostic, we attempted to reach only quantitative conclusions, not qualitative conclusions. Therefore, we did not consider informativeness, and we automatically proceeded to quantitative robustness testing for the summary diagnostic odds ratio (DOR).

Quantitative robustness testing *A priori*, we defined a quantitatively robust evidence base as one that met both of the following two conditions: 1) the confidence interval around the summary DOR did not contain a DOR 50% higher or lower than the summary point estimate, and 2) a cumulative meta-analysis in which studies were entered by precision (highest precision study first) found that all of the last three analyses produced summary effects that were within 5% of the overall summary diagnostic odds ratio. In this case, the evidence base met neither of these conditions; therefore, we deemed the estimate to be not robust.
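The two conditions can be sketched as simple checks. We read "50% higher or lower" as multiplying the point estimate by 1.5 and 0.5, which is our interpretation; the function names are ours, and the cumulative estimates passed to the second check are hypothetical because the report's intermediate values are not given here.

```python
def ci_excludes_half_shift(estimate, lower, upper, fraction=0.5):
    """Condition 1: the CI contains no value `fraction` above or below the estimate."""
    return lower > estimate * (1 - fraction) and upper < estimate * (1 + fraction)


def last_cumulative_within(cumulative_estimates, overall, tolerance=0.05, last_n=3):
    """Condition 2: the last `last_n` cumulative summaries lie within `tolerance` of the overall summary."""
    tail = cumulative_estimates[-last_n:]
    return all(abs(e - overall) <= tolerance * overall for e in tail)


# Summary DOR and 95% CI as reported: 12.7 (5.5 to 29.7).
# The interval reaches below 6.35 and above 19.05, so condition 1 fails.
print(ci_excludes_half_shift(12.7, 5.5, 29.7))

# Hypothetical cumulative DORs, highest-precision study entered first.
print(last_cumulative_within([20.0, 14.0, 13.0, 12.7], overall=12.7))
```

Under this reading, the reported interval alone is enough to show that the first condition is not met.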

Stability rating Using the system, we assigned a stability rating of Low to the estimated diagnostic odds ratio of 12.7 (95% CI 5.5 to 29.7). This rating was based on the fact that the studies were of moderate quality, and the estimate was not quantitatively robust.

Wording of conclusions The conclusion was phrased in the following manner:

For the detection of bone marrow infiltration, at mean threshold PET has a sensitivity of 64% (95% CI: 43% to 80%), and a specificity of 88% (95% CI: 76% to 95%). Stability of estimate: Low.