To determine whether changes in objective response measures proposed by the National Institutes of Health (NIH) correlate with clinical benefit, such as symptom burden, quality of life (QOL) and survival outcomes, we analyzed data from a multi-center prospective cohort of 283 patients with chronic graft-versus-host disease (GVHD) requiring systemic treatment. The median follow-up time of survivors was 25.1 months (range 5.4 to 47.7) after enrollment. Symptom measures included the Lee symptom scale and 10-point patient-reported symptoms. QOL measures included the SF-36, FACT-BMT and Human Activities Profile. Overall and organ-specific responses were calculated by comparing manifestations at the 6-month visit to the enrollment visit using a provisional algorithm. Complete or partial responses were considered “Response” and stable or progressive disease was considered “No Response.” Overall response rate at 6 months was 32%. Organ-specific response rates were 45% for skin, 23% for eyes, 32% for mouth and 51% for gastrointestinal tract. Response at 6 months as calculated according to the provisional response algorithm correlated with changes in symptom burden in patients with newly diagnosed chronic GVHD, but not with changes in QOL or survival outcomes. Modification of the algorithm or validation of other more meaningful clinical endpoints is warranted for future clinical trials of treatment for chronic GVHD.
Chronic graft-versus-host disease (GVHD) is one of the most devastating long-term complications after allogeneic hematopoietic cell transplantation. Clinical treatment decisions are generally based on physician-assessed response, and the degree of physician-assessed response has served as the primary endpoint in clinical trials. Definitions of physician-assessed response used in previous studies, however, have often been vague and subjective. The lack of standardized quantitative response criteria has been one of the major obstacles in the field.
In 2005, the National Institutes of Health (NIH) Consensus Conference recommended measurement tools to capture clinician and patient-assessed chronic GVHD manifestations. The NIH Conference also recommended a set of provisional algorithms to calculate overall and organ-specific response using these objective response measures. The algorithms were based on experts' opinions, with the recognition that they needed to be validated and refined according to data emerging from prospective studies. The Chronic GVHD Consortium was established to collect prospective observational data for this purpose.
In the current study, we analyzed the correlations of the calculated response based on the provisional algorithm with symptom burden, quality of life (QOL) and survival outcomes. We chose these endpoints as “gold standards” for our analysis because they represent clinical benefit, or “living better or living longer” as defined by the U.S. Food and Drug Administration [7–9]. If the calculated response is to be used as an endpoint in clinical trials, it should correlate with these well-recognized measures of clinical benefit.
We studied 283 patients with chronic GVHD who received systemic treatment, were enrolled in a multi-center, prospective, longitudinal, observational cohort between August 2007 and December 2010 (registered as NCT00637689), and had 6-month visits after enrollment. Diagnosis of chronic GVHD was made according to the NIH consensus criteria. Patients at least 2 years of age who received systemic treatment for chronic GVHD were eligible. Newly diagnosed, incident cases were defined as enrollment within 3 months after chronic GVHD diagnosis, while prevalent cases were enrolled 3 or more months after chronic GVHD diagnosis but within 3 years after transplantation. Patients were evaluated at the transplant center every 6 months. Incident cases had an additional assessment 3 months after enrollment. Treatment of chronic GVHD was not uniform or mandated for this study, although compliance with the NIH chronic GVHD consensus guidelines was encouraged. The study protocol was approved by the Institutional Review Board of each participating center, and all participants or their guardians gave written informed consent.
Organ involvement at the enrollment visit was defined as an NIH organ score of 1 or greater. Lung involvement was based on pulmonary function tests if available or on lung symptom scores otherwise. Responses were calculated according to the provisional response algorithm [4, 5, 12] as complete response (CR), partial response (PR), stable disease (SD) or progressive disease (PD), separately in the skin, eyes, mouth, gastrointestinal tract, liver and lung as well as overall (Table 1). The algorithm compares organ manifestations reported at the 6-month follow-up visit to those reported at the enrollment visit. Note that the NIH eye score was used to calculate response in the eye instead of the Schirmer’s test. Complete response in an organ indicates resolution of all reversible manifestations related to chronic GVHD, PR indicates at least 50% improvement, PD indicates an absolute increase of at least 25% or a new organ involvement, and SD indicates none of the above.
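The organ-level rules above can be illustrated with a minimal Python sketch. It is not the Consortium's implementation: the function name `organ_response`, the single numeric score per organ, and the `scale_max` parameter are hypothetical, and reading "an absolute increase of at least 25%" as 25% of the scale range is an assumption. The organ-specific measures and thresholds of Table 1 are not reproduced here, and new organ involvement is left to the overall calculation.

```python
from enum import Enum


class Response(Enum):
    CR = "complete response"
    PR = "partial response"
    SD = "stable disease"
    PD = "progressive disease"


def organ_response(baseline: float, follow_up: float, scale_max: float) -> Response:
    """Classify one organ that was involved at enrollment.

    baseline  -- organ score at the enrollment visit (must be > 0)
    follow_up -- organ score at the 6-month visit
    scale_max -- top of the scoring scale (hypothetical parameter)
    """
    if baseline <= 0:
        raise ValueError("organ must be involved at enrollment (baseline > 0)")
    # CR: all reversible manifestations have resolved.
    if follow_up == 0:
        return Response.CR
    # PD: assumption -- "absolute increase of at least 25%" is taken
    # to mean an increase of at least 25% of the scale range.
    if follow_up - baseline >= 0.25 * scale_max:
        return Response.PD
    # PR: at least 50% improvement relative to the enrollment value.
    if baseline - follow_up >= 0.5 * baseline:
        return Response.PR
    # SD: none of the above.
    return Response.SD
```

For example, on a hypothetical 0 to 3 scale, a change from 2 to 1 would be classified as PR and a change from 1 to 2 as PD under these assumptions.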
Overall response was calculated as follows: overall CR was defined as attainment of CR in all involved organs without evidence of PD in any organ; overall PR was defined as the presence of PR in at least one involved organ without evidence of PD in any organ; overall PD was defined as the presence of PD in any organ or any new organ involvement; and overall SD was defined as none of the above. To be consistent with most published phase II studies [13–16], CR or PR was considered “Response” and SD or PD was considered “No response” at the 6-month time point.
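The overall rule can be sketched in a few lines of Python. This is an illustrative reading of the definitions in the text, not the study's code; the names `overall_response` and `is_responder` and the string labels are hypothetical. Note that a literal reading classifies a patient with CR in one organ and SD in another as overall SD, because the text requires PR in at least one organ for overall PR.

```python
from typing import Mapping


def overall_response(organ_responses: Mapping[str, str],
                     new_organ_involvement: bool = False) -> str:
    """Combine per-organ labels ("CR", "PR", "SD", "PD") for involved organs."""
    labels = list(organ_responses.values())
    # Overall PD: PD in any organ or any new organ involvement.
    if new_organ_involvement or "PD" in labels:
        return "PD"
    # Overall CR: CR in every involved organ (PD already excluded).
    if labels and all(label == "CR" for label in labels):
        return "CR"
    # Overall PR: PR in at least one involved organ (literal reading).
    if "PR" in labels:
        return "PR"
    # Overall SD: none of the above.
    return "SD"


def is_responder(overall: str) -> bool:
    """CR or PR counts as "Response"; SD or PD counts as "No response"."""
    return overall in ("CR", "PR")
```

Checking PD first mirrors the wording "without evidence of PD in any organ" that qualifies both the CR and PR definitions.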
Symptom measures included the Lee symptom scale and 10-point patient-reported global rating of symptoms [17, 18]. Quality of life and functional measures included the SF-36, FACT-BMT and Human Activities Profile (HAP) [19–21]. All of these instruments have been recommended by the NIH Consensus Conference as patient-reported measures for chronic GVHD. The Lee symptom scale is a 30-item patient self-administered questionnaire specific to symptoms of chronic GVHD in the skin, eyes, mouth, breathing, nutrition, muscles and joints, energy, and mental and emotional aspects. Nutrition items were used as gastrointestinal symptoms in this study. Patients also reported their skin itching, mouth dryness, mouth pain, mouth sensitivity, eye problems and overall chronic GVHD severity on a 10-point scale for peak severity during the past week, as recommended by the NIH Consensus Conference. The SF-36 version 2 is a 36-item self-report questionnaire that assesses patient-reported health and functioning. Two summary scales from the SF-36 include the physical component score (PCS) and the mental component score (MCS). The FACT-BMT version 4.0 is a 37-item self-report questionnaire, which includes a 10-item Bone Marrow Transplant Subscale. The HAP is a 94-item self-reported assessment of energy expenditure or physical fitness that was originally developed in a population of patients with pulmonary disease, and its performance was validated in patients with chronic GVHD.
Patient characteristics were presented as median and range for continuous variables, and as frequency and percentage for categorical variables. Multivariable linear regression models were used to estimate the correlation of the calculated overall and organ-specific response at the 6-month visit with the change in symptom or QOL measures from baseline, among patients who completed the questionnaires and had involvement at baseline. Patient characteristics used for adjustment in all models included case type (incident vs. prevalent), donor-patient gender combination, stem cell source, donor-patient cytomegalovirus status, and the NIH global severity at enrollment. Statistical interactions between the explanatory variables and the patient characteristics used for adjustment were tested.
A second analysis was designed to determine whether changes in patient-reported outcomes were clinically meaningful, as defined by the NIH Consensus Conference or prior publications. Briefly, a 2-point change on a 10-point scale was considered clinically meaningful for global rating of symptoms [4, 22], while a half-standard deviation change was used for the others [4, 23, 24]. Agreement between the NIH response and clinically meaningful improvement in each measure was examined by the kappa statistic. Empirical interpretation was used for kappa coefficients (0, no agreement; 0 to 0.2, slight agreement; 0.2 to 0.4, fair agreement; 0.4 to 0.6, moderate agreement; 0.6 to 0.8, substantial agreement; and 0.8 to 1.0, almost perfect agreement).
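As an illustration of the agreement analysis, the following Python sketch computes Cohen's kappa for two binary classifications (for example, calculated response versus clinically meaningful improvement) and maps the coefficient to the interpretation bands listed above. The function names are hypothetical, the assignment of boundary values to the lower band is an assumption, and negative coefficients are treated here as "no agreement", which the text does not specify.

```python
from typing import Sequence


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two binary ratings of the same patients."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    # Observed proportion of patients on whom the two ratings agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from the marginal proportions.
    p_a = sum(a) / n
    p_b = sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa: float) -> str:
    """Empirical bands from the text; boundary handling is an assumption."""
    if kappa <= 0:
        return "no agreement"  # assumption: negatives grouped with 0
    if kappa <= 0.2:
        return "slight agreement"
    if kappa <= 0.4:
        return "fair agreement"
    if kappa <= 0.6:
        return "moderate agreement"
    if kappa <= 0.8:
        return "substantial agreement"
    return "almost perfect agreement"
```

With these bands, a kappa of 0.39 would be read as fair agreement and a kappa of 0.14 as slight agreement.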
Overall survival (OS) was calculated from the date of the 6-month visit to the date of death or last follow-up using the Kaplan-Meier method. Nonrelapse mortality (NRM) was defined as any death without relapse, and the cumulative incidence of NRM was estimated with relapse considered a competing risk. Landmark analyses fitting Cox regression models were used to compare overall mortality and NRM from the 6-month visit according to the NIH overall response at that time. Hazard ratios were estimated with adjustment for known risk factors, including months from transplantation to enrollment in the cohort, platelet count at onset of chronic GVHD, Karnofsky performance status at onset of chronic GVHD, patient age at transplantation, donor and HLA matching, donor-patient gender combination, prior grades II–IV acute GVHD, and the NIH global severity at enrollment. Statistical analyses were performed with SAS/STAT software, version 9.2 (SAS Institute, Inc., Cary, NC) and R version 2.9.2 (R Foundation for Statistical Computing, Vienna, Austria).
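For readers unfamiliar with the Kaplan-Meier method, the product-limit estimate can be sketched in plain Python as below. This is only an illustration of the technique, not the SAS or R code used in the study; the function name `kaplan_meier` is hypothetical, and times are assumed to be measured from the 6-month landmark visit. Competing-risk estimation of NRM and Cox regression are not covered.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 died: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct death time.

    times -- follow-up from the landmark visit to death or last contact
    died  -- True if the patient died, False if censored at that time
    """
    if len(times) != len(died):
        raise ValueError("times and died must have equal length")
    data = sorted(zip(times, died))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        at_risk -= leaving
    return curve
```

Patients censored at a given time are counted as at risk for deaths occurring at that same time, which is the usual convention.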
The median patient age at enrollment was 51 years (range, 2 to 79 years). The median time from transplantation to enrollment was 12.2 months (range, 2.9 to 38.5 months). The median time from onset of chronic GVHD to enrollment was 2.5 months (range, 0 to 31.5 months). Of the 283 patients, 150 (53%) were incident cases, 252 (89%) had mobilized blood cell transplantation, 133 (47%) had HLA-matched related donors, 107 (38%) had HLA-matched unrelated donors, 153 (54%) had myeloablative conditioning, and 151 (53%) had prior grades II–IV acute GVHD. The organs most frequently involved at enrollment were the mouth (61%) and skin (60%). Among 138 patients with lung involvement, 113 (82%) were based on pulmonary function tests, and 25 (18%) were based on lung symptom score. Eighty (28%) patients had severe NIH global severity at enrollment. Other demographic characteristics of patients are shown in Table 2.
The calculated overall response was not available for 3 (1%) patients. As shown in Figure 1, 12 (4%) patients had overall CR, 77 (28%) overall PR, 15 (5%) overall SD and 176 (63%) overall PD, for an overall response rate of 32% at 6 months. Organ-specific response rates at 6 months were 45% for skin, 23% for eyes, 32% for mouth, 51% for gastrointestinal tract, and 54% for liver. Compliance rates of the symptom and QOL measures ranged from 83% to 89% at enrollment and from 77% to 83% at 6 months, and patients who did not complete the measures were not included in the correlation analyses. Baseline values for symptom and QOL measures were similar between incident and prevalent cases (P-values ranged from .10 to .64). Correlations of calculated overall response with changes in symptom and QOL measures are shown in Table 3. We observed a different pattern between incident and prevalent cases, and an interaction effect between the NIH response and case type was added to each model. Among incident cases, overall responders had improved overall symptom measures compared with non-responders. For example, the change in the Lee symptom overall score over 6 months after enrollment was estimated to be 7.8 points lower (better) in responders than in non-responders. Among prevalent cases, however, overall response was not associated with changes in symptom measures. No association was observed between overall response and QOL measures in either incident or prevalent cases. Type of systemic treatment at enrollment was not associated with subsequent changes in symptom or QOL measures (P-values ranged from .28 to .99).
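The reported overall response rate can be reproduced from the category counts given above. The short Python sketch below uses a hypothetical helper, `response_rate`, and assumes that the denominator is the 280 patients with an available calculated response (283 minus 3).

```python
from typing import Mapping


def response_rate(counts: Mapping[str, int]) -> float:
    """Fraction of evaluable patients with overall CR or PR."""
    evaluable = sum(counts.values())
    responders = counts.get("CR", 0) + counts.get("PR", 0)
    return responders / evaluable


# Counts of overall response categories at 6 months, as reported in the text.
counts = {"CR": 12, "PR": 77, "SD": 15, "PD": 176}
rate = response_rate(counts)
print(f"{sum(counts.values())} evaluable patients, response rate {rate:.1%}")
```

The four categories sum to 280 evaluable patients, and 89 responders out of 280 rounds to the 32% stated in the text.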
We next analyzed symptom and QOL measures at 6 months according to type of treatment at 6 months. Prednisone treatment at 6 months was associated with higher overall symptom burden by both the Lee symptom overall score (P = .016) and 10-point overall symptoms (P = .0043), and worse QOL by the SF-36 MCS (P = .022) and FACT-BMT (P = .0019). Treatment with daily prednisone as compared to alternate-day or less frequent administration at 6 months was associated with higher overall symptom burden by both the Lee symptom overall score (P = .039) and 10-point overall symptoms (P = .022), worse QOL by the SF-36 PCS (P = .0017) and FACT-BMT (P = .0091), and worse HAP-MAS (P = .005) and HAP-AAS (P = .0013). Treatment with calcineurin inhibitor at 6 months was not associated with symptom or QOL measures.
Correlations of calculated organ response with changes in symptom measures for individual organs are shown in Table 4. Among incident cases, organ response was associated with improved symptom measures for the skin, eyes, mouth and gastrointestinal tract by the Lee symptom scale, and with improved symptom measures for the skin and eyes by the 10-point symptom scale, but not with changes in the 10-point mouth measures. Among prevalent cases, organ response was associated with improved symptom measures in the 10-point mouth pain and sensitivity, but not with changes in the 10-point skin itching, eye problem or mouth dryness, or the Lee symptom scales.
We also examined agreement between the calculated response and clinically meaningful improvement in symptoms or QOL measures among incident cases (Table 5). The calculated response rates ranged from 20% to 54%, and the rates of clinically meaningful improvement ranged from 20% to 41%. Although these rates appeared similar, kappa statistics showed no better than fair agreement between the calculated response and clinically meaningful improvement for all measures (kappas ranged from −0.04 to 0.39). Agreement for symptom measures (kappas ranged from 0.09 to 0.39) appeared to be better than agreement for QOL measures (kappas ranged from −0.04 to 0.14).
The median follow-up time of survivors was 25.1 months (range 5.4 to 47.7) after enrollment. Figure 2 displays OS and NRM from the 6-month visit according to the calculated overall response at that time. After adjusting for known chronic GVHD risk factors, overall responders did not have a statistically significant difference in risk of overall mortality (adjusted HR 0.6; 95% CI, 0.2 to 1.4; P = .20) or NRM (adjusted HR 0.8; 95% CI, 0.3 to 2.2; P = .69) compared to non-responders. Results were similar when analyses were separated by incident vs. prevalent cases, and there was no statistically significant interaction between overall response and case type for overall mortality or NRM. Type of systemic treatment at enrollment was not associated with subsequent overall mortality (P = .29) or NRM (P = .38).
The calculated overall and organ-specific responses at 6 months correlated with the corresponding overall and organ-specific changes in patient-reported symptom burden among incident cases. Calculated overall response at 6 months, however, did not correlate with changes in QOL measures or with survival outcomes.
Symptom burden is very important to patients with chronic GVHD and of great interest in clinical trials as a secondary endpoint. The results of multivariable analyses suggest that the calculated response is correlated with a change in patient symptom burden among incident cases but not among prevalent cases. Thus, patients enrolled in clinical trials more than 3 months after diagnosis of chronic GVHD may not have a detectable change in patient-reported outcomes even if chronic GVHD improves according to the current provisional algorithm for calculating response. We hypothesize that this discrepancy may arise because symptoms are more amenable to treatment early after diagnosis, or because symptom changes are less perceptible in patients with long-established chronic GVHD.
Patient-reported QOL captures patients’ perceptions of their functional ability and well-being. Improvements in these measures are considered a clinical benefit, fulfilling one component for regulatory approval [7, 8]. Our analysis did not find that response at 6 months correlated with changes in QOL. Another study from our Consortium reported a poor correlation between change in NIH global severity scores and change in QOL measures for chronic GVHD. The lack of correlation between calculated responses and QOL could be explained by the fact that they are fundamentally measuring different aspects of GVHD or have different sensitivity to change. Alternatively, resolution of chronic GVHD might not readily produce a detectable change in QOL measures, since QOL is affected by many factors other than chronic GVHD, including toxicities of previous treatments for the underlying disease, immunosuppressive treatment, and fixed deficits caused by chronic GVHD. Although steroid dose and frequency of administration were associated with symptom and QOL measures at 6 months, our data cannot distinguish cause from effect. For example, prednisone treatment and dosing may reflect high symptom burden, or prednisone treatment could be causing more symptoms.
Whether complete resolution of chronic GVHD activity results in significant improvement in QOL measures remains controversial. One retrospective study showed that once patients had chronic GVHD, they continued to have a worse QOL than those who never had chronic GVHD more than 10 years after transplantation, even though half of the patients had resolved chronic GVHD by that time. Another large prospective observational study using patient-reported activity of GVHD showed that patients with resolved chronic GVHD had a better QOL than those with active chronic GVHD. These studies focused on patients who were at least two years after transplantation, whereas the median time from transplant to enrollment in our cohort was 12 months, perhaps too soon to observe an improvement in QOL measures. Another caveat is that all these studies used different QOL measures, and sensitivity of the scale to response may differ among the measures. The source of information about chronic GVHD may also be important, since previous studies have shown that changes in the “patient-reported” severity of chronic GVHD were associated with changes in QOL, while changes in “physician-assessed” severity were not [26, 30].
The choice of assessment time points may affect the ability to detect a correlation between chronic GVHD response and outcomes. The 6-month point was chosen in this study, since it was the most commonly used time point in clinical studies [13–16], and it takes more than 3 months for some manifestations of chronic GVHD such as skin sclerosis to improve. Chronic GVHD is a chronic illness requiring systemic immunosuppressive treatment for a median duration of 2 to 3 years to achieve tolerance. A previous study showed that short-term response was not able to predict long-term treatment success, defined as withdrawal of all immunosuppressive treatment without secondary systemic treatment after initial systemic treatment of chronic GVHD. Response assessment at a later time point might provide better associations with QOL or survival, since more patients may have discontinued immunosuppressive treatment by then, and QOL may be better without the side effects of medication. More patients and longer follow-up may reveal survival differences, although the current cohort was large and the median follow-up of survivors was 25.1 months.
The current algorithm for calculating response is provisional and has several limitations. First, distinctions between reversible disease activity and irreversible damage are not clear, especially for some fibrotic manifestations such as contractures, bronchiolitis obliterans and sicca syndrome, where responses are difficult to achieve. Second, objective response measures for joints and genital tract were not addressed by the NIH Consensus Conference. Third, the optimal cut-off points and measures were based on consensus and should be refined now that data are available [12, 34–37]. Lastly, the proportion of overall CR was very small at 6 months, precluding analyses focused solely on patients with CR.
In the current study, we are not evaluating responses to a specific drug, but rather correlating different possible measures of response. This is an important distinction because duration of response and fluctuating disease activity do not affect our results. We were looking for internal consistency in the various measures we collected at two different time points, irrespective of why the patient got better or worse. If patients improved because of specific treatments, development of tolerance, discontinuation of toxic medications, or even adaptation to chronic GVHD, we would still hope to measure improvements with our instruments.
In summary, response at 6 months after enrollment correlated with changes in symptom burden among incident cases, but not among prevalent cases. Response at 6 months was not associated with changes in QOL or survival outcomes in either group. Modifications of the current response algorithm are needed to improve correlations with clinical benefits, such as changes in QOL and survival outcomes. Alternatively, validation of other more meaningful clinical endpoints is warranted for future clinical trials of treatment of chronic GVHD.
The authors thank Sally Arai, MD, for contributing patients.
This work was supported by grants CA118953 and CA163438 from the National Institutes of Health. Y.I. is a recipient of the Japan Society for the Promotion of Science Postdoctoral Fellowships for Research Abroad.
Financial disclosure: The authors declare no competing financial interests.