The content of this article has been reviewed by independent peer reviewers to ensure that it is balanced, objective, and free from commercial bias. No financial relationships relevant to the content of this article have been disclosed by the authors or independent peer reviewers.
We conducted a systematic review and meta-analysis to better define the prognostic ability of fluorine-18-fluorodeoxyglucose positron emission tomography (18F-FDG PET) following salvage chemotherapy for relapsed or refractory Hodgkin's lymphoma (HL) and aggressive non-Hodgkin's lymphoma.
We searched PubMed (from inception to January 31, 2010), bibliographies, and review articles without language restriction. Two assessors independently assessed study characteristics, quality, and results. We performed a meta-analysis to determine prognostic accuracy.
Twelve studies including 630 patients were eligible. The most commonly evaluated histologies were diffuse large B-cell lymphoma (n = 313) and HL (n = 187), which were typically treated with various salvage and high-dose chemotherapy regimens. Studies typically employed nonstandardized protocols and diagnostic criteria. The prognostic accuracy was heterogeneous across the included studies. 18F-FDG PET had a summary sensitivity of 0.69 (95% confidence interval [CI], 0.56–0.81) and specificity of 0.81 (95% CI, 0.73–0.87). The summary estimates were stable in sensitivity analyses. In four studies that performed direct comparisons between PET and conventional restaging modalities, PET had a superior accuracy for predicting treatment outcomes. Subgroup and metaregression analyses did not identify any particular factor to explain the observed heterogeneity.
18F-FDG PET performed after salvage therapy appears to be an appropriate test to predict treatment failure in patients with refractory or relapsed lymphoma who receive high-dose chemotherapy. Some evidence suggests PET is superior to conventional restaging for this purpose. Given the methodological limitations in the primary studies, prospective studies with standardized methodologies are needed to confirm and refine these promising results.
Advances in chemotherapy regimens have established Hodgkin's lymphoma (HL) and aggressive non-Hodgkin's lymphoma (NHL) as potentially curable malignancies [1, 2]. Despite progress in the treatment of these diseases, substantial proportions of patients remain refractory to, or relapse after, standard first-line chemotherapy. In these patients, salvage chemotherapy followed by high-dose consolidation chemotherapy with autologous hematopoietic stem cell transplantation constitutes the treatment of choice. However, relapse after high-dose therapy is not uncommon and the procedure is associated with substantial short-term morbidity and mortality, emphasizing the importance of identifying markers to determine disease prognosis and guide risk-oriented management [3, 4]. Although prognostic models based on clinical and laboratory characteristics have been proposed, their limited predictive accuracy underscores the need to identify more reliable prognostic markers for clinical practice [5, 6].
Fluorine-18-fluorodeoxyglucose positron emission tomography (18F-FDG PET) is an established functional imaging modality, routinely used for the staging and post-therapy response assessment of patients with HL and diffuse large B-cell lymphoma (DLBCL) [7–10]. 18F-FDG PET performed after a few cycles of first-line chemotherapy (interim response assessment) is a promising candidate predictive marker for treatment outcomes and could be used for implementing response-tailored treatment strategies. In the relapsed or refractory disease setting, several studies have evaluated the prognostic value of 18F-FDG PET when performed between salvage and consolidation high-dose therapy with stem cell transplantation; the sample sizes of these studies were, however, small, resulting in imprecise estimates of the test's prognostic value, particularly with respect to specific lymphoma subtypes. Furthermore, studies have used heterogeneous designs and protocols for PET assessment, making interpretation of the published data difficult.
To better define the prognostic value of 18F-FDG PET following salvage chemotherapy before high-dose chemotherapy for patients with relapsed or refractory HL and aggressive NHL, we conducted a systematic review of studies assessing its accuracy in predicting treatment outcomes. Using meta-analysis, we summarized the results of studies across different treatment settings and lymphoma subtypes, and estimated the effects of a positive 18F-FDG PET scan on progression-free survival (PFS).
We searched PubMed from inception through January 31, 2010 with no language restrictions. Exact search strategies can be found in online Appendix 1. To complement the search, we examined the reference lists of eligible studies and relevant review articles.
Two reviewers (T.T., I.J.D.) independently screened abstracts and further examined full-text articles of all potentially eligible citations. Studies that assessed 18F-FDG PET for patients with lymphoma during or after induction chemotherapy and before high-dose chemotherapy followed by autologous stem cell transplantation were considered eligible. We included both prospective and retrospective studies, and we considered clinical follow-up with or without pathologic confirmation to be the reference standard. We included studies that evaluated ≥10 patients and included at least five patients who experienced disease progression or relapse after high-dose chemotherapy. When a study included patients who were evaluated with a gallium scan together with those with PET, we included it only if subgroup data on PET were separately extractable. Likewise, when a study included patients who underwent allogeneic transplantation, we included it only if data on those who underwent autologous transplantation were separately extractable. We excluded studies that did not provide adequate information to allow the calculation of sensitivity and specificity, or hazard ratios with their variance for predicting treatment failures. We excluded editorials, comments, letters, and review articles.
One reviewer (T.T.) extracted descriptive data from each eligible study, which were confirmed by another reviewer (I.J.D.). We extracted the following information from eligible studies: first author, year of publication, journal, patient demographic and clinical characteristics such as the International Prognostic Score for advanced-stage HL or the International Prognostic Index for NHL, therapeutic interventions, technical specifications of 18F-FDG PET, and interpretation of PET results. A third reviewer (T.N.) also verified the data on the technical specification and interpretation of 18F-FDG PET. Two reviewers (T.T., I.J.D.) independently extracted data regarding treatment outcomes. If a study reported PET results at multiple time points during induction chemotherapy, we recorded the results of the scan performed closest to high-dose chemotherapy. When studies performed a direct comparison between PET and “conventional” restaging modalities (i.e., computed tomography [CT] or magnetic resonance imaging, and bone marrow biopsy), we also recorded data to assess the diagnostic accuracy of conventional restaging. We defined a direct comparison as the performance of conventional restaging at the same time point in at least 90% of patients who had a PET scan.
To assess the quality and reporting of studies, we evaluated 15 items that were considered relevant to the review topic, based on the Quality Assessment of Diagnostic Accuracy Studies instrument and the Reporting Recommendation for Tumor Marker Prognostic Studies guidelines [13, 14]. Online Appendix Table 1 describes how we rated each methodological item. Two reviewers (T.T., I.J.D.) independently assessed the quality items, and discrepancies were resolved by consensus.
For each study, we constructed a 2 × 2 contingency table consisting of true-positive (TP), false-positive (FP), false-negative (FN), and true-negative results, whereby all patients were categorized according to whether they were PET positive or negative, and whether they experienced treatment failure after high-dose chemotherapy. For the main analysis, we used the entire clinical follow-up as the reference standard for treatment failure diagnosis, and counted censorings as no treatment failure regardless of the duration of follow-up. Also, we considered minimal residual uptake (MRU), when defined, as a positive finding based on the reported diagnostic criteria. Regarding conventional restaging, we counted only complete remission as negative, and considered any residual mass or lesion to be positive, regardless of its size.
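The tallying rules described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the review: the function names, the three-level `pet_result` labels, and the cohort are hypothetical, and the `mru_positive` flag is an assumed convenience for switching how MRU is classified.

```python
from collections import Counter


def two_by_two(patients, mru_positive=True):
    """Tally a 2 x 2 table from (pet_result, failed) pairs.

    pet_result is "positive", "mru", or "negative" (hypothetical labels).
    failed is True when the patient progressed or relapsed; censored
    patients are passed as failed=False, mirroring the main analysis.
    """
    counts = Counter()
    for pet_result, failed in patients:
        is_positive = pet_result == "positive" or (
            pet_result == "mru" and mru_positive
        )
        if is_positive:
            counts["TP" if failed else "FP"] += 1
        else:
            counts["FN" if failed else "TN"] += 1
    return counts


def sensitivity(counts):
    # Proportion of treatment failures that were PET positive.
    return counts["TP"] / (counts["TP"] + counts["FN"])


def specificity(counts):
    # Proportion of patients without treatment failure that were PET negative.
    return counts["TN"] / (counts["TN"] + counts["FP"])


# Hypothetical cohort of 30 patients; these are not data from any included study.
cohort = (
    [("positive", True)] * 8
    + [("mru", True)] * 1
    + [("mru", False)] * 2
    + [("positive", False)] * 1
    + [("negative", True)] * 4
    + [("negative", False)] * 14
)

main = two_by_two(cohort)  # MRU counted as positive
mru_negative = two_by_two(cohort, mru_positive=False)  # MRU counted as negative
print(main, sensitivity(main), specificity(main))
print(mru_negative, sensitivity(mru_negative), specificity(mru_negative))
```

In this made-up cohort, moving MRU from positive to negative shifts patients out of the TP and FP cells, so sensitivity falls and specificity rises.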
We calculated sensitivity and specificity for each study, and then estimated summary sensitivity and specificity with their corresponding 95% confidence intervals (CIs) using bivariate random effects meta-analysis [15–17]. Summary positive and negative likelihood ratios (LRs) were calculated from the summary sensitivity and specificity estimates. We assessed between-study heterogeneity visually, by plotting sensitivity and specificity in the receiver operating characteristic (ROC) space. We also drew summary ROC curves and confidence regions for summary sensitivity and specificity [15–17]. As a global measure for the summary ROC curves, we estimated the Q* statistic, the point on the ROC curve where sensitivity and specificity are equal.
To explore heterogeneity, we performed subgroup analysis based on lymphoma histology (HL, NHL, and DLBCL), treatment setting of induction chemotherapy (first-line therapy versus salvage therapy), and whether MRU was separately categorized. To further explore whether study-level characteristics could explain between-study heterogeneity, we performed univariate metaregression analyses within a hierarchical summary ROC model. We assessed the following a priori selected covariates: year of publication, study design (prospective versus retrospective), study size, proportion of second-line patients, proportion of patients with HL, and relapse rate.
We performed sensitivity analyses to explore the effects of MRU and early censorings during the first year after high-dose chemotherapy on the summary estimates. In a sensitivity analysis, we categorized MRU as negative. Regarding early censorings, we first excluded them regardless of PET results. In a “best-case” scenario, we counted early censored patients with positive PET scans as TP. Conversely, in a “worst-case” scenario we counted early censored patients with negative PET scans as FN. For these “best-case” and “worst-case” scenarios, when a study did not report early censorings for PET+ and PET− patients, we imputed the number of censored cases based on the highest censoring rate observed in each scenario.
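The three ways of handling early censorings can be illustrated with a short Python sketch. The function name, the tuple layout, and the counts are hypothetical. The sketch also assumes that, within each scenario, the censored patients not mentioned in the text stay where the main analysis put them (as non-failures); the text does not spell this out.

```python
def reclassify(table, cens_pos, cens_neg, scenario):
    """Return an adjusted (TP, FP, FN, TN) tuple for one study.

    table     -- main-analysis counts, where early censored patients were
                 counted as no treatment failure (so they sit in FP or TN)
    cens_pos  -- early censored patients with a positive PET scan
    cens_neg  -- early censored patients with a negative PET scan
    scenario  -- "exclude", "best", or "worst"
    """
    tp, fp, fn, tn = table
    if scenario == "exclude":
        # Drop early censored patients regardless of PET result.
        return tp, fp - cens_pos, fn, tn - cens_neg
    if scenario == "best":
        # Early censored PET-positive patients are counted as TP.
        return tp + cens_pos, fp - cens_pos, fn, tn
    if scenario == "worst":
        # Early censored PET-negative patients are counted as FN.
        return tp, fp, fn + cens_neg, tn - cens_neg
    raise ValueError(f"unknown scenario: {scenario}")


# Hypothetical study: 30 patients, 1 PET-positive and 2 PET-negative early censorings.
main_table = (9, 3, 4, 14)
for name in ("exclude", "best", "worst"):
    tp, fp, fn, tn = reclassify(main_table, cens_pos=1, cens_neg=2, scenario=name)
    print(name, (tp, fp, fn, tn), round(tp / (tp + fn), 3), round(tn / (tn + fp), 3))
```

The adjusted tables from each scenario would then be pooled again with the same meta-analysis model to see whether the summary estimates move.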
To better account for time-to-event data, we also estimated the hazard ratio (HR) comparing PFS between the PET+ and PET− groups. The unadjusted HR was preferred over the adjusted HR. If the HR and its variance were not directly extractable from a study, we calculated them from reported statistics using a prespecified algorithm. We estimated a summary HR by random-effects meta-analysis. We quantified between-study heterogeneity with the I² statistic.
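As a rough illustration of this pooling step, the sketch below recovers a log HR and its standard error from a reported 95% CI and combines studies with the DerSimonian-Laird random-effects estimator. Both choices are assumptions made for the example: the text does not name the random-effects estimator, and the CI-based conversion is only one of several ways to recover a variance from published statistics. The three input studies are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% CI


def log_hr_and_se(hr, ci_low, ci_high):
    """Log hazard ratio and its standard error from a reported 95% CI."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    return math.log(hr), se


def random_effects(log_hrs, ses):
    """DerSimonian-Laird pooling; returns (pooled log HR, SE, tau^2, I^2 in %)."""
    weights = [1 / se**2 for se in ses]
    fixed = sum(w * y for w, y in zip(weights, log_hrs)) / sum(weights)
    # Cochran's Q measures dispersion of study estimates around the fixed-effect mean.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, log_hrs))
    df = len(log_hrs) - 1
    c = sum(weights) - sum(w**2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c)  # between-study variance, truncated at zero
    re_weights = [1 / (se**2 + tau2) for se in ses]
    pooled = sum(w * y for w, y in zip(re_weights, log_hrs)) / sum(re_weights)
    pooled_se = math.sqrt(1 / sum(re_weights))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, pooled_se, tau2, i2


# Hypothetical (HR, lower, upper) triples; not taken from the included studies.
studies = [(3.0, 1.5, 6.0), (5.0, 2.0, 12.5), (4.0, 2.0, 8.0)]
log_hrs, ses = zip(*(log_hr_and_se(*s) for s in studies))
pooled, pooled_se, tau2, i2 = random_effects(log_hrs, ses)
summary_hr = math.exp(pooled)
ci = (math.exp(pooled - Z_95 * pooled_se), math.exp(pooled + Z_95 * pooled_se))
print(f"summary HR {summary_hr:.2f}, 95% CI {ci[0]:.2f} to {ci[1]:.2f}, I2 {i2:.0f}%")
```

Pooling is done on the log scale because the sampling distribution of the log HR is closer to normal; the summary HR and its CI are obtained by exponentiating at the end.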
Analyses were conducted using STATA, version 10.1/SE (Stata Corp, College Station, TX), and SAS, version 9.2 (SAS Institute Inc., Cary, NC). All tests were two-sided and statistical significance was defined as a p-value < .05.
Our literature search identified 2,327 citations, of which 28 were considered potentially eligible and were retrieved for further assessment. After excluding 16 publications, we identified 12 studies eligible for this review (Fig. 1) [22–33]. A complete list of excluded studies and the reasons for exclusion is available in online Appendix 1.
The eligible studies evaluated pre–high-dose PET for a total of 630 patients, most of whom (n = 541; 87%) had refractory or relapsed lymphoma (Table 1). Seven studies exclusively included patients undergoing second-line treatment for relapsed or refractory lymphoma [26, 28–33], and five reported on mixed first-line and second-line patient populations [22–25, 27]. Although most studies assessed PET after induction chemotherapy just before high-dose therapy, the number of chemotherapy cycles before PET ranged from two to nine cycles. Nine studies (75%) had a retrospective design. Typically, studies followed up patients for 1–2 years.
Overall, the two most commonly evaluated histologies were DLBCL (n = 313; 50%) and HL (n = 187; 30%) (Table 2). Therapeutically more challenging lymphomas, such as indolent lymphomas (n = 66; 10%) and other aggressive NHLs (n = 62; 10%), were also included. Two studies exclusively evaluated patients with HL [26, 30], and one study focused only on DLBCL patients. In general, patients had a wide range of risk for treatment failure by commonly used prognostic scores such as the age-adjusted international prognostic index at second-line therapy or the prognostic score for relapsed Hodgkin's lymphoma. Studies employed diverse chemotherapy regimens both for the induction and high-dose components of treatment.
Concerning imaging techniques and technologies, included studies generally followed the Society of Nuclear Medicine guidelines (Appendix Table 2) [34, 35]. Three studies exclusively used a PET/CT scanner [23, 24, 27], and another study used a hybrid PET/CT in a subgroup of patients. All studies but one employed attenuation correction to reconstruct imaging.
Studies adopted various definitions of qualitative positive and negative diagnostic criteria (Appendix Table 3). One study adopted diagnostic criteria proposed for post-therapy response assessment, and none used criteria proposed for interim PET assessment. Five studies also defined positive or negative lesions using the standard uptake value [22, 24, 27, 29, 33]. In general, multiple nuclear medicine physicians interpreted PET results in each study. No study reported the level of between-observer agreement.
No study reported all 15 quality items that we assessed (online Appendix Table 4). Reporting was especially poor on blinding of assessors to clinical outcomes (typically treating physicians), to the results of pre–high-dose therapy PET, and to whether treatment strategies were altered based on the PET results. Although six studies [22, 24, 27, 28, 31, 32] employed blinding of PET interpreters to clinical information, only two studies [22, 31] avoided the alteration of treatment based on PET results. Detailed results of quality assessment can be found in online Appendix Table 4.
Visual assessment revealed substantial between-study heterogeneity (Figs. 2 and 3). PET sensitivity was in the range of 0.32–1.0 and specificity was in the range of 0.48–1.0. Summary estimates were 0.69 (95% CI, 0.56–0.81) for sensitivity and 0.81 (95% CI, 0.73–0.87) for specificity, for a positive LR of 3.6 and a negative LR of 0.38. The Q* statistic for the summary ROC curve was 0.87.
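The likelihood ratios quoted here follow directly from the summary sensitivity and specificity through the standard definitions LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. The short check below reproduces that arithmetic; the function name is ours and is used only for illustration.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios from sensitivity and specificity."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


# Summary estimates reported in the text.
lr_pos, lr_neg = likelihood_ratios(0.69, 0.81)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")  # LR+ = 3.6, LR- = 0.38
```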
Of eight studies that performed conventional restaging at the same time point as pre–high-dose therapy PET scans [22, 25, 26, 28, 30–33], four studies reported direct comparisons of conventional restaging with PET [22, 25, 28, 32]. One study did not report whether supplemental tests such as bone marrow biopsy were performed in addition to CT. Interpreters of PET results were blinded to conventional restaging results in three studies [22, 28, 32], whereas no study explicitly reported the blinding of assessors of conventional restaging to PET results. The summary ROC curve for PET (Q* = 0.93) stayed consistently above the curve for conventional restaging (Q* = 0.59) over the range where data points for these four studies were plotted (Fig. 4).
Prognostic accuracy was relatively stable across subgroups (online Appendix Table 5 and Fig. 5). Summary sensitivity and specificity did not significantly change when the analyses were restricted to specific histologies (i.e., HL or DLBCL), studies of salvage therapy, or studies that did not adopt MRU as a separate response criterion. We further explored between-study heterogeneity by metaregression analysis for the predefined study-level covariates. None of the predictors significantly influenced the prognostic accuracy (all p-values > .1).
Of three studies that had MRU with various definitions [22, 25, 31], only two reported such PET results in 20% and 46% of patients [25, 31]. In a sensitivity analysis, treating MRU as a “negative” result decreased the summary sensitivity to 0.63 (95% CI, 0.49–0.74) and increased the summary specificity to 0.85 (95% CI, 0.80–0.88) (Fig. 5). Eight studies reported early (<1 year) censorings, the proportion of which was in the range of 5%–17% of all cases [22–25, 27, 29, 31, 33]. The summary sensitivity and specificity were not materially different in the “best-case” and “worst-case” scenarios, or when censored cases were excluded (Fig. 5).
Eleven studies allowed the calculation of an HR for PFS [22–29, 31–33]. A positive 18F-FDG PET scan was significantly associated with a shorter PFS interval (random effects HR, 4.3; 95% CI, 3.1–6.0; p < .0001) (online Appendix Fig. 1). There was little evidence of between-study heterogeneity (I² = 14%).
18F-FDG PET performed for patients with lymphoma before high-dose chemotherapy with stem cell transplantation has good accuracy for predicting progression or relapse in the first 2 years following the completion of therapy. The summary specificity estimated in the meta-analysis was approximately 80%, which was stable in sensitivity analyses, including a “worst-case” scenario biased against PET. Overall, patients with a positive pre–high-dose therapy PET scan appear to have a four- to fivefold higher risk for treatment failure than patients with a negative scan.
At present, the literature on pre–high-dose chemotherapy PET is clinically and methodologically heterogeneous. The significant statistical heterogeneity observed among the eligible studies could at least in part be attributable to the underlying clinical heterogeneity, because studies included patients with diverse histological subtypes in different treatment settings (i.e., first-line and second-line) and used many different regimens of salvage and high-dose chemotherapy. Other potential sources of heterogeneity may include temporal changes in treatment strategies (e.g., the introduction of rituximab-based chemoimmunotherapy), supportive care, imaging technologies (e.g., transition from stand-alone PET to PET/CT), and the diagnostic criteria for PET. We nevertheless attempted to synthesize these clinically heterogeneous studies to identify factors that may explain this heterogeneity. Although in subgroup and metaregression analyses these factors did not appear to explain the observed heterogeneity, these results should be interpreted with caution because of the low statistical power. Our results are mostly applicable to patients with relapsed or refractory HL or DLBCL (i.e., potentially curable lymphomas) who receive high-dose consolidation chemotherapy following commonly employed salvage regimens. We caution that our results may not be applicable to histological subtypes such as indolent lymphomas because the reported follow-up periods are too short to evaluate PFS for these histologies. In addition, PET assessment before high-dose therapy in the salvage setting is clinically less relevant for indolent lymphoma subtypes.
Interestingly, we found some evidence that 18F-FDG PET performed before consolidative high-dose chemotherapy may be a good replacement for CT-based conventional restaging for identifying patients most likely to fail in this invasive and costly treatment. In the head-to-head comparison between these two tests, 18F-FDG PET outperformed conventional restaging.
Our primary analysis, using a predictive accuracy framework, provides valuable information on error rates (i.e., FP and FN rates) and takes into account the variability in diagnostic thresholds. This approach has the benefit of generating clinically applicable information and facilitates the exploration of heterogeneity by subgroup and metaregression analyses. It differs from the approach used in a previously published review, which assessed only the HR. Furthermore, to account for censoring, we performed sensitivity analyses, and also used a time-to-event analysis to estimate a summary HR for PFS.
Several limitations need to be taken into account when interpreting our results. In view of the heterogeneity of chemotherapy regimens in the studies, we were not able to explore their effect on the prognostic accuracy of PET. This is particularly important for assessing the effect of rituximab-containing regimens, because rituximab is now considered part of standard therapy for most patients with B-cell lymphomas, and some studies suggest that the use of rituximab may increase the FP rate of PET [38, 39]. Also, we were not able to assess the incremental value of PET compared with conventional prognostic scores. Although several studies performed multivariate analyses to take such factors into account [23–25, 29–33], relevant data were typically not available from the publications. A patient-level meta-analysis with standardized outcomes and full information on covariates and censoring would be beneficial to achieve this goal.
Given the limitations of the existing literature, future studies are needed to confirm the promising estimates of the prognostic value of 18F-FDG PET. Prospective studies with larger sample sizes focusing on clinically and histologically more homogeneous populations (e.g., only relapsed DLBCL patients after rituximab-containing first-line therapies) using standardized PET protocols and interpretation criteria are needed. Better standardization of diagnostic criteria with the involvement of well-trained assessors is particularly important given that inter-reader variability appears to be substantial, even among experts using the same criteria. To facilitate this goal, international collaborations of investigators, such as the International Workshop on Interim-PET Scan in Lymphoma, could provide guidance regarding the technical implementation of PET, the adoption of uniform response assessment criteria, and the availability of individual patient data to facilitate future evidence synthesis.
18F-FDG PET appears to be an appropriate test for the prediction of treatment outcomes in patients with refractory or relapsed lymphoma who receive high-dose chemotherapy after induction salvage chemotherapy. Given the methodological limitations of primary studies, prospective studies with standardized methodologies are needed to confirm and refine these promising results.
We thank Dr. Christopher Schmid for guidance in performing statistical analyses.
This study was supported in part by grant UL1RR025752 from the National Center for Research Resources to Tufts-Clinical Translational Science Institute (T.T. and I.J.D.), Banyu Life Science Foundation International (H19) to T.T., a research scholarship from the “Maria P. Lemos” Foundation to I.J.D., and a grant from the Ministry of Education, Culture, Sports, Science and Technology of Japan (No. 21791183) to T.N.
Conception/Design: Teruhiko Terasawa, Issa J. Dahabreh, Takashi Nihashi
Collection and/or assembly of data: Teruhiko Terasawa, Issa J. Dahabreh, Takashi Nihashi
Data analysis and interpretation: Teruhiko Terasawa, Issa J. Dahabreh, Takashi Nihashi
Manuscript writing: Teruhiko Terasawa, Issa J. Dahabreh, Takashi Nihashi
Final approval of manuscript: Teruhiko Terasawa, Issa J. Dahabreh, Takashi Nihashi