Health-related quality of life (HRQL) is an important outcome in drug trials. Little is known about how the Short Form-36 (SF-36) and Saint George's Respiratory Questionnaire (SGRQ) perform in idiopathic pulmonary fibrosis (IPF).
To examine the validity of the SF-36 and SGRQ and to determine scores from each that would constitute a minimum important difference (MID).
We analyzed data from a recently completed trial that enrolled subjects with well-defined IPF who completed the SF-36, SGRQ, and Baseline/Transition Dyspnea Index at baseline and six months. We compared mean changes in HRQL scores between groups of subjects whose disease severity changed over six months according to clinical anchors (FVC, DLCO, and dyspnea). We estimated the MID for each domain by using both anchor- and distribution-based approaches.
Results supported the validity of the SF-36 and SGRQ for use in longitudinal studies. Mean changes in domain scores differed significantly between subjects whose clinical status improved and those whose clinical status declined according to the anchors. MID estimates ranged from 2-4 points for the SF-36 and from 5-8 points for the SGRQ.
In IPF, the SF-36 and SGRQ possess reasonable validity for differentiating subjects whose disease severity changes over time. More studies are needed to continue the validation process, to refine estimates of the MIDs for the SF-36 or SGRQ, and to determine if a disease-specific instrument will perform better than either of these.
Idiopathic pulmonary fibrosis (IPF) is a progressive interstitial lung disease (ILD) without effective therapy. Patients with IPF have impaired health-related quality of life (HRQL) in nearly every domain,1 and dyspnea is one strong driver of that impairment.2
By quantifying patients' perceptions,3 HRQL instruments capture information that physiologic or radiologic measures do not. Thus, investigators view HRQL as an important outcome when attempting to determine the effectiveness of a particular intervention. In patients with IPF, the Short Form-36 (SF-36) and Saint George's Respiratory Questionnaire (SGRQ) yield scores reflecting impaired HRQL, and at single time points, their scores correlate with clinical measures of IPF severity.4 What remains unknown in IPF is whether either instrument is responsive to underlying change in status and whether it can discriminate between patients whose status over time improves, remains unchanged, or declines. Also lacking for IPF is a basic understanding of how to interpret changes in HRQL scores. Finally, the minimum score change considered clinically important (i.e., the minimum important difference or MID) is known for the SF-36 and SGRQ for certain conditions, but not for IPF.
The overarching goal of this study was to advance understanding and improve interpretation of SF-36 and SGRQ scores in IPF. The main hypotheses were that scores would decline in subjects whose disease progressed; that both the SF-36 and SGRQ could discriminate patients who improve, remain stable, or decline over time; and that we could use both anchor- and distribution-based methods to establish MID estimates for these instruments in patients with IPF.
We used data from a recently completed trial (the Bosentan Use in ILD-1 or BUILD-1)5 for this retrospective analysis. Details of the BUILD-1 study have been described previously.5 Briefly, subjects had very well-defined IPF according to accepted consensus guidelines.6,7 The SF-36 version 1®, SGRQ, and Baseline/Transition Dyspnea Index (BDI/TDI) were administered at baseline, six months, and twelve months. We used baseline and six-month data for our study because this provided us the greatest number of data points with which to perform our analyses.
The SF-36 is a generic questionnaire with 36 items that measure functional health and well-being.8 It comprises eight domains and two psychometrically-established summary components, each derived from four domain scores. Domain and summary component scores range from 0-100; higher scores correspond to better health status or well-being. For each domain and summary component, as endorsed by SF-36 developers, we used scoring algorithms to generate linear T-score transformations (http://gim.med.ucla.edu/FacultyPages/Hays/util.htm; last accessed August 1, 2008). Such transformations place scores on scales with mean scores equal to 50 (and standard deviations of 10). The SGRQ is a self-administered, obstructive lung disease-specific questionnaire with 50 items comprising three domains, each scored from 0-100, with higher scores corresponding to worse HRQL.9 The BDI has three domains.10 The TDI is a follow-up questionnaire that asks respondents to rate (from ‘major deterioration’ = −3 to ‘major improvement’ = +3) how dyspnea has changed over time for each BDI domain; thus, scores for the TDI range from −9 (largest deterioration) to +9 (largest improvement).
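The linear T-score transformation described above can be sketched as follows. This is a minimal illustration only: the function name `t_score` and the reference mean and standard deviation passed to it are hypothetical placeholders, not the published SF-36 norms or the scoring algorithm actually used in this study.

```python
def t_score(raw_score, norm_mean, norm_sd):
    """Linear T-score: the reference mean maps to 50 and one reference SD to 10 points."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return 50.0 + 10.0 * (raw_score - norm_mean) / norm_sd


# Hypothetical reference values, for illustration only (not the published norms).
example = t_score(raw_score=70.0, norm_mean=80.0, norm_sd=20.0)
print(example)  # 45.0, i.e., half a standard deviation below the reference mean
```

On this scale, a domain score of 45 reads the same way for every domain: half a standard deviation below the reference population mean.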
We used baseline values to calculate mean scores, standard deviations, standard errors of measurement (SEM), and internal consistency reliability (Cronbach's alpha11) coefficients for each instrument. Next, we applied the methods of Kosinski and colleagues12 and their use of known-groups validity13 to examine relationships between either SF-36 or SGRQ scores and FVC, DLCO, and dyspnea, which we will hereafter refer to as anchors. Excluded from our analyses were subjects whose FVC, DLCO, TDI, or entire HRQL questionnaires were missing at either baseline or six months.
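For readers unfamiliar with these quantities, the sketch below shows the standard textbook formulas for Cronbach's alpha and the SEM (SEM = SD × √(1 − reliability)). It assumes complete item-level responses; the function names and example values are hypothetical and are not taken from the BUILD-1 data or from the SAS code used in this study.

```python
import math
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("at least two items are required")
    # Sample variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*responses)]
    # Sample variance of each respondent's summed score.
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def standard_error_of_measurement(baseline_sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return baseline_sd * math.sqrt(1.0 - reliability)


# Hypothetical values, for illustration only: an SD of 10 and alpha of 0.91
# give an SEM of about 3 points.
print(standard_error_of_measurement(baseline_sd=10.0, reliability=0.91))
```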
We began the analyses by calculating anchor change scores. For FVC, we categorized subjects as “unchanged” if the difference in the raw FVC value at month six was within 7% (inclusive) of the baseline value, as “changed minimally” if the difference at month six was between 7 and 12% (exclusive) of baseline, and as “changed more than minimally” if the difference was ≥ 12%. We used the widely accepted cut-off value of 15% to represent a significant difference from baseline in DLCO; we elected not to parse DLCO into more categories because of the greater statistical “noise” in DLCO as compared with FVC and because it is far less clear to us what the range for a minimum change in DLCO should be. Thus, we did not use DLCO as an anchor in the MID analyses (see below). We used TDI scores as an anchor because dyspnea has been shown to be a strong influence on HRQL in patients with IPF,2 and attempts have been made to define the MID for the TDI (at least in populations other than IPF14).
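The FVC categorization rule can be expressed compactly. The sketch below assumes that "difference" means the percent change in raw FVC relative to the baseline value and that improvement and decline are labeled separately; the function name and category labels are illustrative, not taken from the study's analysis code.

```python
def fvc_change_category(baseline_fvc, month6_fvc):
    """Classify six-month change in raw FVC relative to baseline."""
    if baseline_fvc <= 0:
        raise ValueError("baseline_fvc must be positive")
    pct_change = 100.0 * (month6_fvc - baseline_fvc) / baseline_fvc
    magnitude = abs(pct_change)
    if magnitude <= 7.0:  # within 7% of baseline, inclusive
        return "unchanged"
    direction = "improved" if pct_change > 0 else "declined"
    if magnitude < 12.0:  # strictly between 7% and 12%
        return f"{direction} minimally"
    return f"{direction} more than minimally"  # 12% or more


# Hypothetical FVC values in liters, for illustration only.
print(fvc_change_category(3.0, 2.7))  # a 10% drop falls in the minimal-change band
```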
Next, we calculated mean SF-36 and SGRQ scores for subjects within each anchor change category. We used ANOVA—one for each HRQL domain—to compare contrasts in mean changes in SF-36 or SGRQ domain scores across anchor change categories. These models generated F-statistics; a larger F-statistic connotes a domain that yields a larger separation between mean HRQL scores across anchor change categories and/or a smaller within group variance.
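To make the interpretation of the F-statistic concrete, the following sketch computes the omnibus one-way ANOVA F as the ratio of between-group to within-group mean squares. It does not reproduce the contrasts fitted in SAS for this study, and the change scores shown are invented for illustration.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Omnibus F-statistic: between-group mean square over within-group mean square."""
    all_values = [value for group in groups for value in group]
    n_total = len(all_values)
    n_groups = len(groups)
    grand_mean = mean(all_values)
    # Between-group sum of squares: separation of group means from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of subjects around their own group mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    return ms_between / ms_within


# Hypothetical HRQL change scores for three anchor change categories.
declined = [1.0, 2.0, 3.0]
unchanged = [2.0, 3.0, 4.0]
improved = [3.0, 4.0, 5.0]
print(one_way_anova_f([declined, unchanged, improved]))  # 3.0
```

Widening the gaps between the group means, or shrinking the spread within each group, raises the F-statistic, which is why a larger F was read as better discrimination across anchor change categories.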
We used Pearson product-moment correlation coefficients to examine relationships between anchors and HRQL scores. To derive MID estimates for domains from each instrument, we used the effect size (ES) and the 1-SEM criterion15,16 as distribution-based approaches. Although there is no consensus about how or even whether17 the ES should be used in the estimation of MIDs, some investigators consider 0.5 to correspond to the MID,18,19 and that is what we used here. In the first anchor-based approach, we used linear regression to examine the relationship between change in HRQL (dependent variable) and change in the anchor—FVC or TDI score (independent variables).20 We derived a point estimate for the MID by plugging into these equations values representing a minimal change (e.g., 10% for raw FVC—roughly the midpoint of our minimum change range of 7-12%—and one point for the TDI) in the independent variable. In the second anchor-based approach, we calculated the weighted average of mean change scores for each HRQL domain for subjects who changed (either improved or declined) minimally according to the FVC and TDI anchors. All analyses were performed with SAS version 9.1.3 (SAS Institute Inc., Cary, NC), and p-values < .05 were considered statistically significant.
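The MID estimators described in this paragraph can be sketched as below. Several details are assumptions made for illustration: the 0.5 ES criterion is taken as half the baseline standard deviation, the regression is simple ordinary least squares with one anchor, and the weighted average uses the absolute values of group mean changes weighted by group size. All function names and data are hypothetical.

```python
from statistics import mean


def mid_half_sd(baseline_sd):
    """Distribution-based estimate: an effect size of 0.5 equals half the baseline SD."""
    return 0.5 * baseline_sd


def mid_from_regression(anchor_change, hrql_change, minimal_anchor_change):
    """Anchor-based estimate: predicted HRQL change at a minimal change in the anchor."""
    mean_x, mean_y = mean(anchor_change), mean(hrql_change)
    sxx = sum((x - mean_x) ** 2 for x in anchor_change)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(anchor_change, hrql_change))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * minimal_anchor_change


def weighted_mean_abs_change(group_means, group_sizes):
    """Size-weighted average of absolute mean changes in the minimally changed groups."""
    total = sum(group_sizes)
    return sum(abs(m) * n for m, n in zip(group_means, group_sizes)) / total


# Hypothetical data: percent change in FVC and change in an HRQL domain score.
fvc_pct_change = [-10, 0, 10]
domain_change = [-3.0, 0.0, 3.0]
print(mid_half_sd(8.0))
print(mid_from_regression(fvc_pct_change, domain_change, minimal_anchor_change=10))
print(weighted_mean_abs_change([4.0, -2.0], [10, 30]))
```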
Except for the Symptoms domain for DLCO, mean change scores from each SGRQ domain differed significantly between categories of change in each of the three anchors. Findings were similar for certain SF-36 domains (Table located in online supplement).
For the SF-36, the Physical Functioning and Social Functioning domains along with the Physical Component Summary score (PCS) were most useful (i.e., valid) to discriminate between all categories of change in the FVC anchor (Figure 1). The Role Emotional domain (RE) discriminated best between the subset of subjects whose FVC either improved or declined minimally: the difference in RE change scores between subjects in whom FVC improved by 7-12% and those in whom FVC declined by 7-12% was 1.1 standard deviation units (SDU), i.e., the difference between an increase of 10.6 points for subjects with FVC improvement of 7-12% and a decline of 5.3 points for subjects with FVC decline of 7-12%, divided by the baseline standard deviation for RE: (10.6 − (−5.3))/14.2. For the SGRQ, the Impact domain discriminated best between all categories of change in the FVC anchor as well as between subjects whose FVC either improved or declined minimally: the difference in Impact change scores between subjects in whom FVC improved by 7-12% and those in whom FVC declined by 7-12% was 0.7 SDU.
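The standardized difference reported for the RE domain can be checked with a few lines of arithmetic. The function name is illustrative; the three input values are those given in the text.

```python
def standardized_difference(mean_change_improved, mean_change_declined, baseline_sd):
    """Difference between two group mean changes, in baseline standard deviation units."""
    return (mean_change_improved - mean_change_declined) / baseline_sd


# RE domain: +10.6 points (FVC improved 7-12%), -5.3 points (FVC declined 7-12%),
# baseline standard deviation 14.2.
re_difference = standardized_difference(10.6, -5.3, 14.2)
print(round(re_difference, 1))  # 1.1
```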
For the SF-36, the PCS and RE domains discriminated best between all categories of change in the DLCO anchor (Figure 2A). The difference in RE change scores between subjects in whom DLCO improved by > 15% and those in whom DLCO declined by > 15% was 0.9 SDU. For the SGRQ, the Impact domain discriminated best between categories of change in the DLCO anchor (Figure 2B). The difference in Impact change scores between subjects in whom DLCO improved by > 15% and those in whom DLCO declined by > 15% was 0.8 SDU.
For the SF-36, the Vitality (VT) and PCS domains discriminated best between all categories of change in the TDI anchor. Because of low numbers of subjects with TDI scores of 1 or −1, for this analysis, we elected to compare differences in HRQL change scores between subjects with TDI scores of 2 and those with TDI scores of −2. The VT domain remained most useful to discriminate between subjects whose TDI either improved or declined by 2 points: the difference in VT change scores between subjects in whom TDI improved by 2 and those in whom TDI declined by 2 points was 1.1 SDU. The SGRQ Impact domain discriminated best between all categories of change in the TDI anchor (Figure 3). For the SGRQ, the Symptoms domain discriminated best between subjects whose TDI either improved or declined by 2 points: the difference in Symptoms change scores between subjects in whom TDI improved by 2 and those in whom TDI declined by 2 points was 0.9 SDU.
Correlations between the two anchors used in these analyses and HRQL scores are presented in Table 3. For the SF-36, distribution-based MID estimates were greater than anchor-based estimates (Table 4). For a given domain, the 1-SEM and 0.5ES estimates were fairly similar. On balance, minimally important changes in FVC corresponded to slightly higher MID estimates than did minimally important changes in the TDI anchor. Means of MID estimates for the SF-36 ranged from 2 for the GH domain to 4 for a number of domains. As for the SF-36, for the SGRQ, distribution-based MID estimates were greater than anchor-based estimates. Grand means of MID estimates for SGRQ domains ranged from 5 for the Activity domain to 8 for the Symptoms domain.
We performed the first systematic examination of the longitudinal performance of the SF-36 and SGRQ in patients with IPF. We found subjects whose clinical status changed most had the greatest changes (in the appropriate direction) in SF-36 and SGRQ scores; subjects whose clinical status did not change had essentially no change in HRQL scores; and subjects whose clinical status changed minimally had minimal changes in HRQL scores. We also derived the first MID estimates for the SF-36 and SGRQ in IPF.
There are no data on the longitudinal performance characteristics of the SGRQ in IPF. In the only longitudinal study to examine the SF-36 in IPF,21 Tomioka and colleagues showed that certain domains discriminated between subjects whose clinical status had changed according to pulmonary physiology or peripheral oxygenation. They did not estimate MIDs for SF-36 domains.
Validation is a process involving testing multiple hypotheses about an instrument to determine whether it “behaves” as expected of one designed to measure HRQL,22 and whether its scores can be used confidently (e.g., to determine whether a therapeutic intervention is beneficial). Our results support the validity of the SF-36 and SGRQ for longitudinal use in IPF and allow us to apply meaning to changes in SF-36 and SGRQ scores. For example, a group of IPF patients whose SF-36 PCS score (which assesses physical health) drops by three points is likely to have an FVC decline of at least 12% and worsening dyspnea (three-point decline in TDI).
Discriminating between subjects who improve or decline—an attribute some label as discriminant validity—is key to the usefulness of any HRQL instrument. That all domain scores did not change to the same degree (or at all) for certain anchors is not unexpected and does not detract from the usefulness of an instrument. As demonstrated by higher F-statistics, the SGRQ Impacts domain best discriminated between change categories in each of the three anchors. Among SF-36 scales, the PCS best discriminated between change categories in two of the three anchors. This is not surprising, given the greater impairment in physical domains in IPF and that the PCS integrates the four SF-36 physical health domains.
The recently modified definition of MID is that it is the smallest difference in a score that informed patients or proxies perceive as important, either beneficial or harmful, and which would lead the patient or clinician to consider a change in management.20 There is no one correct way to estimate the MID; it should be done using multiple methods.17 There are no published MID estimates for the SF-36 for IPF or even, to our knowledge, for COPD. Examining the results of a study by Kosinski and colleagues,12 in which MIDs for the SF-36 were derived in subjects with rheumatoid arthritis, gives some perspective to our SF-36 MID estimates: after converting estimates from their study to norm-based scores, we found our estimates to be very similar. Their MID estimate for the PF domain was 3 points versus 3 from this study; for RP, 5 vs. 4; BP, 5 vs. 3; GH, 1 vs. 2; VT, 4 vs. 3; SF, 4 vs. 4; RE, 5 vs. 4; MH, 5 vs. 3; PCS, 3 vs. 3; and MCS, 4 vs. 3. These similarities are not surprising: one expects that a generic instrument (like the SF-36) would behave similarly, no matter the population.
For the SGRQ, our MID estimates were greater than its widely accepted MID of four points, an estimate derived in patients with obstructive diseases by using expert opinion and anchor-based approaches.23 The divergence likely reflects differences in IPF vs. COPD and the differing behavior of the SGRQ in each. Recall that the SGRQ is obstructive disease-specific, and certain items tap constructs (e.g., wheezing) not pertinent to IPF patients. Published distribution-based MID estimates for the SGRQ vary widely, ranging from 1.3 to 8.4 units.23 Our distribution-based estimates ranged from 6-13.
We chose FVC, DLCO, and dyspnea as anchors because, in patients with IPF, each is key to tracking clinical status, and they are commonly used trial outcomes. We considered a 7-12% change in raw FVC as minimally important, because this range covers both 7% (recently shown to carry prognostic significance in IPF24,25) and 10% (a common endpoint in clinical trials); this gave us a reasonable range around the globally accepted 10% value. In populations other than IPF, a one-unit change in TDI is the MID,14 so we used it here.
The primary limitation of this study is the relatively small number of subjects whose pulmonary physiology changed over time, which left us with imprecise MID estimates. Unfortunately, patient-reported global change scores (where a subject rates his overall HRQL at present in relation to baseline, often on a 7-choice Likert scale) were not collected in the BUILD-1 trial; if they had been collected, such scores could have been used as an anchor. Some investigators argue that global change scores make the best anchors.17 The inclusion criterion that subjects' baseline six-minute walk distance (6MWD) had to be between 150 and 499 meters means the results of our analyses may not be translatable to all IPF patients (e.g., those in the end stages of the disease who are unable to walk 150 meters in six minutes). A strength of our study is that it yielded the first-ever estimates of MIDs for the SF-36 and SGRQ in IPF, results that could be useful for guiding future research. In future IPF studies, investigators should perform confirmatory assessments of validity, responsiveness, and MIDs for the SF-36 and SGRQ (or any other instrument). Until a disease-specific instrument is developed and tested, investigators can confidently administer either or both the SF-36 and SGRQ in their studies and pay close attention to domains that have been shown to be useful.
In sum, we examined the SF-36 and SGRQ in a longitudinal IPF study and found them to perform reasonably well. Each possessed validity for discriminating subjects whose disease status changed by differing degrees over time. We derived the first estimates of the MIDs for these two instruments in IPF. More studies are needed to refine these estimates and further advance our understanding of the behavior of these instruments in IPF.
The authors wish to acknowledge the efforts of all the personnel at Actelion Pharmaceuticals (trial sponsor) and of the investigators involved in the BUILD-1 trial: Ishaar Ben-Dov, Charles Chan, Jean-Francois Cordier, James Dauber, Joao De Andrade, Adaani Frost, Thomas Geiser, Marilyn Glassberg, Jeffrey Golden, Gary Hunninghake, Sanjay Kalra, Lisa Lancaster, Robert Levy, Fernando Martinez, Keith Meyer, Joachim Mueller-Quernheim, Paul Noble, Christophe Pison, Charles Poirier, Milton Rossman, Paola Rottoli, Gerd Staehler, Domonique Valeyre, Athol Wells, Gordon Yung and David Zisman. We also wish to thank Dr. Diane Fairclough for her comments on a prior version of this manuscript and Dr. Ron Hays for his availability to answer questions pertaining to the MID.
Study conceptualization and design: Swigris, Wamboldt
Data collection: Swigris, Brown, Behr, du Bois, King, Raghu and the BUILD-1 investigators
Statistical analyses: Swigris, Wamboldt
Manuscript preparation and final approval: Swigris, Brown, Behr, du Bois, King, Raghu
The work in this manuscript is the original work of the stated authors. None of the authors has any real or potential conflicts with information in this manuscript.