We combined anchor- and distribution-based methods to establish minimally important differences (MIDs) for six PROMIS-Cancer scales in advanced-stage cancer patients.
Participants completed six PROMIS-Cancer scales and 23 anchor measures at an initial (n=101) and follow-up (n=88) assessment 6 to 12 weeks later. Three a priori criteria were used to identify useable cross-sectional and longitudinal anchor-based MID estimates. The mean standard error of measurement was also computed for each scale. The focus of the analysis was on IRT-based MIDs estimated on a T-score scale. Raw score MIDs were estimated for comparison purposes.
Many cross-sectional (64%) and longitudinal (73%) T-score anchor-based MID estimates were excluded because they did not meet a priori criteria. The following are recommended T-score MID ranges: 17-item Fatigue (2.5–4.5), 7-item Fatigue (3.0–5.0), 10-item Pain Interference (4.0–6.0), 10-item Physical Functioning (4.0–6.0), 9-item Emotional Distress-Anxiety (3.0–4.5), and 10-item Emotional Distress-Depression (3.0–4.5). Effect sizes corresponding to these MIDs averaged between 0.40 and 0.63.
This study is the first to address MIDs for PROMIS measures. Studies are currently being conducted to confirm these MIDs in other patient populations and to determine whether these MIDs vary by patients’ level of functioning.
Patient-reported outcomes (PROs) range from specific concepts such as a single symptom (e.g., perceived pain) to more general or multidimensional concepts such as health-related quality of life (HRQL). Clinical researchers and clinicians interested in incorporating PRO assessments into their work have long desired brief yet precise PRO measures. In recent years, the Patient-Reported Outcomes Measurement Information System (PROMIS) Network, a National Institutes of Health Roadmap Initiative, has advanced PRO measurement by developing item banks for measuring major self-reported health domains affected by chronic illness [1–3]. An item bank is a collection of calibrated items from which short form measures and computer-adaptive tests can be derived. Scores on short forms derived from the same item bank are calibrated on the same measurement metric and can therefore be compared.
The National Cancer Institute (NCI) provided supplemental PROMIS funding to ensure that the PROs developed by the network were valid for cancer patients and survivors. Input was solicited from domain experts and patients to improve the cancer relevance of PROMIS measures of fatigue, pain, physical function, and emotional distress. The resulting measures will be referred to herein as PROMIS-Cancer scales.
In addition to brevity and precision, PROs used in research and clinical practice must also be interpretable. One tool for enhancing the interpretability of PROs is the minimally important difference (MID), which we define as a difference in score that is large enough to have implications for a patient’s treatment or care. However, for certain treatment settings and objectives, such as in patients with advanced-stage disease where palliation is the intent of treatment, the MID is patient-centered and may have no specific reference to the clinical aspect of the patient’s change. Our primary objective was to develop preliminary IRT-based MIDs for six PROMIS-Cancer scales that were created as short form versions of item banks. A secondary objective was to develop a reference table linking IRT-based MIDs to raw score MIDs.
Patients were recruited at two cancer centers in the Chicago metropolitan area: a private, suburban and a public, urban hospital. Eligible patients were at least 18 years old, had a diagnosis of advanced-stage cancer (Stages III or IV), and were able to read and understand English. In an attempt to capture the spectrum of care for advanced-stage cancer, aside from hospice care, eligible patients could be receiving any cancer treatment or follow-up care. The study was approved by the institutional review boards for the participating sites and informed consent was obtained for all participants, who received $20 compensation for each completed assessment.
Patients completed two assessments: one at baseline and one 6 to 12 weeks later. Assessments were completed in clinic using touch screen computers. At baseline, patients completed PROMIS-Cancer scales and anchor measures (described below). Sociodemographic (e.g., age, gender, ethnicity) and clinical (e.g., treatment, stage of disease) variables were captured through self-report or from medical records.
We estimated MIDs for the following six PROMIS-Cancer scales: 17-item Fatigue (Fatigue-17), 7-item PROMIS Fatigue (Fatigue-7), 10-item Pain Interference (PainInt-10), 10-item Physical Function (PhysFunc-10), 9-item Anxiety (Anxiety-9), and 10-item Depression (Depression-10). The six PROMIS-Cancer scales are presented in supplementary on-line appendices. Items had 5-point ordinal rating scales, and all PROMIS-Cancer scales were scored such that a higher score represents higher levels of the concept (e.g., a higher Fatigue-7 score indicates more fatigue, and a higher PhysFunc-10 score indicates better physical functioning).
Using anchor-based and distribution-based methods that we [6–12] and others [13–15] have published, we identified MIDs for static, fixed-length forms. IRT-based scores were determined by mapping item responses on a given PROMIS-Cancer scale to the item calibrations in the corresponding item bank. The item calibrations were based on data for over 2000 patients with different cancer types and stages of disease that were collected as part of the PROMIS-Cancer supplement from the NCI. The resulting item parameters based on data from cancer populations were then linked to the corresponding PROMIS general population calibrations using a common item equating procedure. As a result of the linking procedure, scores from the PROMIS-Cancer banks and the corresponding PROMIS general banks are placed on a common scale and, hence, are comparable. IRT-based scores (i.e., theta) can be transformed to T-scores, with a mean of 50 and a standard deviation of 10 in the reference population. Raw scores were calculated as the prorated sum of the item responses, and were computed if more than 50% of the items on the scale were answered. MIDs were estimated using similar methodology for both T-scores and raw scores so that conversions from one to the other can be made by end-users. The emphasis of this paper is on the T-score MIDs because IRT-based T-scores are a better reflection of the underlying concept being measured.
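The two scoring rules in the preceding paragraph can be sketched in Python as follows. This is a minimal illustration, not the scoring software used in the study. It assumes that theta is standardized (mean 0, SD 1) in the reference population, so that the T-score is a simple linear rescaling, and it reads "prorated sum" as the mean of the answered items multiplied by the number of items on the scale. The function names are hypothetical.

```python
from typing import Optional, Sequence


def theta_to_t_score(theta: float) -> float:
    """Rescale an IRT theta estimate to the T-score metric.

    Assumption: theta has mean 0 and SD 1 in the reference population,
    so the T-score metric has mean 50 and SD 10.
    """
    return 50.0 + 10.0 * theta


def prorated_raw_score(responses: Sequence[Optional[int]]) -> Optional[float]:
    """Prorated sum of item responses; None marks an unanswered item.

    A score is returned only if more than 50% of the items were answered.
    Assumption: proration = mean of answered items x number of items.
    """
    answered = [r for r in responses if r is not None]
    if len(answered) * 2 <= len(responses):
        return None  # half or fewer of the items answered: no score
    return sum(answered) / len(answered) * len(responses)


if __name__ == "__main__":
    print(theta_to_t_score(-1.5))
    print(prorated_raw_score([3, 4, None, 2, 5, None, 3]))
```

Under these assumptions, a respondent who skips two of seven items still receives a raw score on the full 7-item range, whereas one who answers only half of the items receives none.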
Distribution-based measures rely on the statistical distributions of PRO data, including effect size measures [17, 18] and the standard error of measurement (SEM). To be confident in the MID, we must confirm that the magnitude of the MID is larger than the measurement error or the minimally detectable difference of the scale. The SEM is a measure of precision of the scale and can be interpreted as the smallest difference or change score likely to reflect a true difference or change rather than measurement error. It also reflects the minimally detectable difference in a scale [20, 21]. Therefore, the lower bound on the MID range based on anchor-based estimates was compared to the SEM. If the SEM was greater than the lower bound of the MID range, then the lower bound was increased to be larger than the SEM. In IRT analysis, each person has a standard error associated with his/her score. Thus, for the T-score MIDs, the SEM in this study was measured as the mean standard error across the sample, whereas for the raw score MIDs, the SEM was computed per convention (i.e., $\mathrm{SEM} = SD\,(1 - r_{xx})^{1/2}$, where $SD$ is the standard deviation of the scale score and $r_{xx}$ is the reliability of the scale).
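As a brief sketch of the two SEM definitions just described, the Python functions below compute the conventional raw score SEM from a standard deviation and a reliability coefficient, and the T-score SEM as the mean of the person-level standard errors. The function names and the example values are illustrative assumptions, not figures from this study.

```python
import math
from statistics import mean
from typing import Sequence


def raw_score_sem(sd: float, reliability: float) -> float:
    """Conventional SEM: SD * sqrt(1 - r_xx)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def t_score_sem(person_standard_errors: Sequence[float]) -> float:
    """T-score SEM taken as the mean IRT standard error across the sample."""
    return mean(person_standard_errors)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the arithmetic.
    print(raw_score_sem(sd=10.0, reliability=0.91))
    print(t_score_sem([2.0, 3.0, 2.5]))
```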
Anchor-based approaches can be cross-sectional or longitudinal. Cross-sectional approaches involve comparing PRO scores for patients in clinically-distinct groups, such as categories of performance status rating. Longitudinal approaches involve comparing changes in PRO scores to patient-reported assessments of change over time (either prospectively or retrospectively determined) or to clinically-relevant measures such as response to treatment [7, 9]. We collected data on 23 clinically-relevant, self-reported anchors for the cross-sectional and longitudinal anchor-based analyses. Table 1 illustrates the anchors that were used to estimate MIDs for the different scale scores. We paired anchors with PROMIS-Cancer scales based on (1) precedence, that is, whether the anchor had been used in previous research to establish MIDs for a similar domain, and (2) our confidence in their clinical relevance for estimating the MID for a given scale.
We defined three a priori criteria for useable anchor-based estimates. The first was a Spearman correlation between an anchor (e.g., Brief Pain Inventory) and PROMIS-Cancer scale of at least 0.3 [10, 23]. If an anchor was collapsed into categories, we computed the Spearman correlation between the PRO score and the collapsed categories. The second criterion required a sample size of at least 10 in the clinically-distinct group (e.g., ECOG performance status 0, 1, 2, or 3) or change score group (e.g., minimally better, minimally worse) that was used to calculate a cross-sectional or longitudinal MID, respectively. We believed, based on some of our previous MID experience, that a difference score or change score based on fewer observations would be too unstable. The final criterion was that the anchor-based estimate had a corresponding effect size within a plausible range of 0.2–0.8; that is, estimates with effect sizes <0.2 are unlikely to be “important” and estimates with effect sizes >0.8 are unlikely to be “minimal.” If an anchor-based estimate did not meet all of these criteria, it was not considered in the final MID determination.
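The three a priori criteria amount to a simple filter over candidate estimates, which might be sketched in Python as below. The record layout and names are hypothetical. Taking the absolute value of the correlation and of the effect size is an assumption made here so that anchors scored in the opposite direction to the scale (for example, a performance status rating against physical function) are handled the same way.

```python
from dataclasses import dataclass


@dataclass
class AnchorEstimate:
    """One candidate anchor-based MID estimate (hypothetical record layout)."""
    anchor: str
    spearman_rho: float      # correlation between anchor and scale
    smallest_group_n: int    # size of the smallest group used in the estimate
    effect_size: float       # estimate divided by the relevant SD


def is_usable(est: AnchorEstimate,
              min_rho: float = 0.3,
              min_n: int = 10,
              min_es: float = 0.2,
              max_es: float = 0.8) -> bool:
    """Apply the three criteria: correlation, group size, plausible effect size."""
    correlated = abs(est.spearman_rho) >= min_rho        # assumption: sign ignored
    large_enough = est.smallest_group_n >= min_n
    plausible = min_es <= abs(est.effect_size) <= max_es
    return correlated and large_enough and plausible


if __name__ == "__main__":
    # Hypothetical candidates, not data from the study.
    candidates = [
        AnchorEstimate("anchor A", 0.45, 14, 0.52),
        AnchorEstimate("anchor B", 0.25, 20, 0.40),   # fails the correlation criterion
        AnchorEstimate("anchor C", -0.50, 8, 0.60),   # fails the sample size criterion
        AnchorEstimate("anchor D", 0.60, 12, 1.10),   # fails the effect size criterion
    ]
    print([c.anchor for c in candidates if is_usable(c)])
```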
In the cross-sectional analysis, anchors were used to categorize patients into multiple clinically-distinct groups. Many different anchors can be used for this purpose, provided individuals can be classified into distinct categories that are both clinically relevant and minimally different. Score differences between adjacent, clinically-distinct categories represent estimates of the MID. Effect sizes for these estimates were computed by dividing the adjacent category score difference by the overall SD for the sample.
For anchors with a 5-point ordinal rating scale, each category was considered a clinically-distinct group. For anchors with an 11-point response scale, we referred to Butt et al., who determined cut-off scores to identify clinically-distinct groups for 11-point single-item measures of fatigue, pain, anxiety, and depression. Using these criteria, three severity groups were formed: 0–3 = none/mild; 4–6 = moderate; 7–10 = severe. Mean scale scores were computed for each of the three categories and differences in mean scores across adjacent categories (e.g., none/mild vs. moderate; moderate vs. severe) were considered estimates of the MID.
For multi-item anchors such as the Hospital Anxiety and Depression Scale (HADS), Brief Pain Inventory (BPI), and Functional Assessment of Chronic Illness Therapy (FACIT)-Fatigue subscale, established cutpoints for scores were used to categorize patients into clinically-distinct groups [25–27]. Differences in mean PROMIS-Cancer scale scores across adjacent, distinct categories of the FACIT-Fatigue, HADS or BPI were considered estimates of the MID. To our knowledge, cutpoints for distinguishing multiple (i.e., more than 2) clinically distinct groups have not been published for the SF-36 10-item physical function subscale (SF-36 PF); thus, it was not used as an anchor in the cross-sectional analysis. Cross-sectional analyses were conducted separately using Assessment 1 and Assessment 2 data.
Prospective anchors were measured longitudinally, i.e., at Assessment 1 (T1) and Assessment 2 (T2). For single-item anchors with a 5-point response scale, change scores can range from −4 to +4. We considered a 1-point change, either positive (improvement) or negative (decline) to be clinically meaningful. For single-item anchors with 11-point response scales, change scores can range from −10 to +10. While there are no recognized guidelines for interpreting meaningful change on an 11-point scale, Farrar et al. have identified a 2-point improvement as clinically meaningful on an 11-point pain scale. Thus, for the two 11-point pain items (global pain, BPI worst pain), we classified patients improving by 2–3 points as “a little better” and those declining by 2–3 points as “a little worse.” Mean changes in the PROMIS-Cancer scale scores corresponding to these anchor changes were considered estimates of the MID. Although these anchor-change categories are based on findings for a pain scale, we extended them to any anchor using an 11-point scale.
Established MIDs for multi-item anchors were used to identify patients who had experienced a meaningful change in a scale score. An MID of 1.5 has been estimated for the HADS subscales and the MID for the FACIT-Fatigue subscale is 3–4 points [6, 30]. Proposed MIDs for the SF-36 PF subscale, which has a 0–100 score range, include 7–8 points based on one SEM [15, 31] and 10 points based on a consensus panel of experts using a Delphi method. Based on these findings, we considered a range of 8–10 points as the MID for the SF-36 PF subscale. Patients were classified as “a little better” or “a little worse” if their HADS, FACIT-Fatigue or SF-36 PF scores increased or decreased by at least the lower end of the published MID ranges (i.e., 1.5 points for HADS, 3 points for FACIT-Fatigue and 8 points for SF-36 PF), but no more than 2 times the MID (i.e., 3 points for HADS, 6 points for FACIT-Fatigue and 16 points for SF-36 PF). Estimates of the MID were the mean PROMIS-Cancer scale change scores in the “a little better” and “a little worse” anchor change categories. We are not aware of an established MID for the BPI Pain Interference multi-item scale; thus, we estimated the MID as ½ standard deviation (1.4 points in this dataset), which is a liberal approximation (i.e., erring on the high end) of the MID when empirical data are lacking.
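The classification rule for multi-item anchors (a change of at least the lower MID bound but no more than twice that bound) can be sketched in Python as below. The thresholds in the usage example are those quoted in the text. The function names and the `higher_is_better` flag are assumptions added here to make the direction of improvement explicit for each anchor, and the change scores are made up.

```python
from statistics import mean
from typing import Dict, List, Optional, Sequence


def classify_anchor_change(change: float, mid_lower: float,
                           higher_is_better: bool) -> Optional[str]:
    """Label an anchor change score as a small improvement or decline.

    A change counts only if its magnitude is at least mid_lower and no more
    than 2 * mid_lower; anything smaller or larger returns None.
    """
    if not mid_lower <= abs(change) <= 2 * mid_lower:
        return None
    improved = (change > 0) == higher_is_better
    return "a little better" if improved else "a little worse"


def longitudinal_estimates(anchor_changes: Sequence[float],
                           scale_changes: Sequence[float],
                           mid_lower: float,
                           higher_is_better: bool) -> Dict[str, float]:
    """Mean scale change score within each small-change anchor category."""
    buckets: Dict[str, List[float]] = {}
    for a_change, s_change in zip(anchor_changes, scale_changes):
        label = classify_anchor_change(a_change, mid_lower, higher_is_better)
        if label is not None:
            buckets.setdefault(label, []).append(s_change)
    return {label: mean(values) for label, values in buckets.items()}


if __name__ == "__main__":
    # Hypothetical FACIT-Fatigue changes (higher = less fatigue) and
    # hypothetical scale change scores; lower MID bound of 3 points.
    facit_changes = [4, 5, -3, -6, 9, 1]
    scale_changes = [-3.0, -2.0, 2.5, 3.5, -8.0, 0.2]
    print(longitudinal_estimates(facit_changes, scale_changes,
                                 mid_lower=3, higher_is_better=True))
```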
The global rating of change (GRC) was first suggested as a clinical anchor by Jaeschke and colleagues and has been implemented in several of our previous MID studies [7, 12, 22]. The GRC items in this study were worded specifically for each PROMIS-Cancer scale; that is, the meaning of each domain measured by the scale (physical function, pain, etc.) was briefly described at Assessment 2 and then patients rated the degree of change they had experienced on each of these domains since Assessment 1. Responses were scored on a 7-point scale ranging from −3 = “very much worse” to +3 = “very much better.” Mean change scores on the PROMIS-Cancer scales corresponding to GRC item responses of +1 or +2 (“a little better,” “moderately better”) and −1 or −2 (“a little worse,” “moderately worse”) were considered estimates of the MID. Due to the large number of anchor-based MID estimates calculated, we summarized the results using nonparametric statistics, namely medians and interquartile ranges.
A total of 101 patients completed the first assessment and 88 completed the second. Participants were predominantly female, non-Hispanic White, and receiving chemotherapy (Table 2). Participants had worse fatigue, physical function and anxiety than a general cancer population as indicated by mean Assessment 1 T-scores greater than 50 (or less than 50 for physical function) as shown in Table 3. Levels of pain and depression were comparable to the general cancer population. The most frequent responses to anchor items tended to be for the middle categories except for pain, anxiety and depression for which respondents experienced mild levels. The Assessment 1 mean FACIT-Fatigue score of 34.0 was between reported values for anemic (23.9) and non-anemic (40.0) cancer patients.
Although the T-score metric has a mean of 50 and standard deviation of 10 in the reference sample, the observed standard deviations in this sample were less than 10 for all short forms and ranged from 6.7 for Fatigue-17 to 9.4 for PainInt-10 (Table 3). Almost all mean standard errors were less than 1/3 standard deviation (Table 4), reflecting good precision of the T-scores.
Spearman correlations between anchors and short form T-scores were greater than 0.3 for all anchors at both assessments except between ECOG performance status and Anxiety-9 at Assessment 2. An average of 45 MID estimates (range 37–52) was calculated for each of the six PROMIS-Cancer scales over both assessments. An average of 36% of the estimates that were calculated for each short form (range 27%–41%) met our a priori criteria for determining the MID. The most common reason for excluding an MID estimate was that the sample size for one or both of the adjacent categories being compared was less than 10. Very few estimates were discarded because the effect size for the adjacent category score difference was less than 0.2, but quite a few were discarded because the effect size was greater than 0.8. The medians for the usable cross-sectional MID estimates ranged from 4.0 for Depression-10 to 5.7 for PhysFunc-10. The effect sizes corresponding to these medians ranged from 0.48 for Depression-10 to 0.61 for Fatigue-17. The minimum, maximum, median and interquartile ranges of useable cross-sectional estimates for each short form are presented in Table 5.
An average of 17 MID estimates (range 14–20) was calculated for each PROMIS-Cancer scale in the longitudinal analysis. Across the six scales, an average of only 27% (range 6%–40%) of the calculated estimates met a priori criteria for determining the MIDs. The main reason estimates were discarded was a Spearman correlation for an anchor change score and short form change score of less than 0.3. Very few estimates were discarded based on a sample size less than 10 in the anchor change score category or because the effect size for the short form change score was less than 0.2. No estimates were discarded due to an effect size for the change score greater than 0.8. The medians for the useable longitudinal MID estimates were considerably lower than those from the cross-sectional analysis and ranged from 2.4 for Fatigue-7 to 3.5 for PainInt-10. Effect sizes corresponding to the median MIDs ranged from 0.29 for Fatigue-7 to 0.42 for Anxiety-9. The minimum, maximum, median and interquartile ranges of useable longitudinal estimates for each short form are presented in Table 6.
All usable anchor-based estimates of the MID are plotted in Fig. 1. As with any empirically derived value, there is uncertainty and variability associated with MIDs. To reflect this, we recommend MID ranges rather than single point estimates [8–10, 12]. In this study, the interquartile ranges shown in Fig. 1, rounded to the nearest half-integer, were used to inform the recommended MID ranges for each PROMIS-Cancer scale. For example, the interquartile range for the Depression-10 scale T-scores was 2.6–4.3 points (Fig. 1), which we rounded to 2.5–4.5.
The final step was to compare the lower bounds of the MID ranges to the SEMs in Table 4. The lower bounds for the Anxiety-9 and Depression-10 scales rounded to the nearest half-integer were both 2.5 points, which is smaller than their SEMs of 2.6 and 2.8 points, respectively. Therefore, the lower bounds on the MID ranges for these two scales were increased to the next half-integer of 3.0 points to ensure that the MID exceeds the measurement error. The lower bounds of the MID ranges for all other scales were greater than the SEMs. Recommended T-score MIDs are summarized in Table 7.
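The rounding and SEM adjustment described in the last two paragraphs can be reproduced with a short Python sketch. The Depression-10 values (interquartile range 2.6–4.3, SEM 2.8) are taken from the text. The function names are hypothetical, and ties at exact quarter points are rounded upward here, which is an assumption the source does not address.

```python
import math
from typing import Tuple


def round_to_half(x: float) -> float:
    """Round to the nearest half-integer (ties rounded upward)."""
    return math.floor(x * 2 + 0.5) / 2


def recommended_mid_range(q1: float, q3: float, sem: float) -> Tuple[float, float]:
    """Round an interquartile range to half-integers, then make sure the
    lower bound exceeds the standard error of measurement."""
    lower = round_to_half(q1)
    upper = round_to_half(q3)
    if lower < sem:
        # Move up to the next half-integer strictly greater than the SEM.
        lower = math.floor(sem * 2) / 2 + 0.5
    return lower, upper


if __name__ == "__main__":
    # Depression-10: IQR 2.6-4.3 and SEM 2.8, as reported in the text.
    print(recommended_mid_range(2.6, 4.3, sem=2.8))
```

For Depression-10 this yields the recommended range of 3.0–4.5: the rounded lower bound of 2.5 falls below the SEM of 2.8 and is therefore raised to 3.0.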
In our previous work, the MID ranges were presented as whole integers to facilitate interpretation of an individual patient’s score, which can only change by a whole number on a raw score scale. However, on an IRT-based T-score scale, an individual patient’s score can change by less than a whole integer. Thus, the recommended T-score MID ranges in Table 7 are not constrained to being bounded by whole integers. Useable raw score MID estimates are reported in Fig. 2 and recommended raw score MID ranges rounded to the nearest whole integer are summarized in Table 7.
We combined multiple anchor-based estimates from a sample of 101 advanced cancer patients to determine the MIDs for six PROMIS-Cancer scales. We defined three criteria for identifying appropriate anchor-based MID estimates. Finally, we compared the MIDs to the SEM to ensure that the MID ranges exceeded a standard unit of measurement error for each scale. IRT-based T-score MIDs and raw score MIDs were estimated. Through this exercise we were able to begin to estimate PROMIS MIDs, at least as applied to this patient population.
We observed T-score standard deviations at Assessment 1 less than 10 for all short forms, which was smaller than expected. It is possible that patients in this study, who were restricted to Stage III or IV disease, were more homogeneous with respect to their PROs than were the Stage I through IV cancer patients who participated in the calibration study.
The cross-sectional T-score MID estimates were larger than the longitudinal estimates and a large number of estimates were discarded from the cross-sectional analysis because the effect sizes were too large (i.e., >0.8). This may be due to the definitions of clinically-distinct groups in the cross-sectional analysis. That is, it is possible that the adjacent groups represented more than a minimally important difference in the domain measured by the scale. This was apparent even in commonly used cross-sectional anchors such as patient-reported ECOG performance status and general health. For example, for four of the six PROMIS-Cancer scales at Assessment 1 and for three of the six scales at Assessment 2, cross-sectional anchor-based estimates based on the comparisons of the fair vs. good categories of general health were discarded because the effect sizes were greater than 0.8. This is not unusual. We often observed very large PRO score differences (e.g., effect sizes >1.0) when comparing adjacent categories of physical functioning or general health anchors with 4–5 severity levels [6, 9, 10]. Thus, it is possible that these categories, while representing clinically-distinct groups, do not represent a minimal clinical distinction. There was no discernable pattern regarding whether certain anchors performed better than others overall in the cross-sectional analyses.
The longitudinal analyses did not yield many usable estimates for the MID, primarily due to weak Spearman correlations between anchor change scores and short form change scores. The MID estimates that were calculated tended to be quite small. A great deal of change in PROs often occurs in the earlier phases of treatment when side effects are first experienced by patients. Over time, side effects are managed and/or patients may adapt to them, both of which can contribute to the stabilization of PRO scores over time. In the present study, patients were eligible for enrolment at any time during treatment and follow-up care to better reflect the spectrum of care for advanced-stage cancer. Therefore, it is possible that outcomes such as pain, fatigue, physical function and emotional distress for many patients in our sample had already stabilized at the time of the first assessment. As a result, little change from the first to second assessment would be experienced by those patients, which may further explain why the longitudinal MID estimates were smaller than the cross-sectional MID estimates. The GRC items, which are some of the more prevalent and longstanding anchors used in MID analyses [14, 21, 34], were not useful in this analysis. Only the fatigue GRC item had sufficiently large correlations with the two fatigue scales to yield useable estimates. The other four GRC items produced no useable MID estimates for their respective PROMIS-Cancer scales. Problems with GRC items as anchors have been noted by us and others.
Turner et al. recently recommended that investigators use anchor-based methods to determine MIDs for health-related PROs. However, we observed numerous problems with anchors (e.g., poor correlation with PROs, inability to distinguish minimal clinically distinct groups) that can threaten one’s ability to be comfortable with proposed MIDs until sufficient data have been amassed. Particularly problematic were the legacy transition rating anchors (i.e., GRC). In light of the potential problems that we noted with anchors, placing primacy on anchor-based methods seems unsupported until better methodology exists for identifying appropriate anchors for an MID analysis. At present, no one method is without limitations. Thus, we recommend triangulation across multiple methods [6–12] in which anchor- and distribution-based estimates are considered together.
Despite being scored on the same scale, recommended T-score MIDs varied by PROMIS-Cancer scale. This may represent variability in relevance of anchors across domains, differences in precision of measurement of these particular selected scales in this particular patient population, or true variability in magnitude of change required to surpass a threshold for meaningfulness in a given measured concept.
A strength of this study was the large number of variables available (between six and ten per PROMIS-Cancer scale) to use in the anchor-based analyses. Therefore, we were able to discard estimates based on a priori criteria and still have plenty of useable estimates on which to base the MIDs. Our study is not without limitations. The sample size of 101 and 88 at the first and second assessments, respectively, was small relative to other published MID studies. As a result of the small sample, a substantial number of cross-sectional MID estimates were discarded because they failed to meet the a priori criterion of at least 10 subjects in each adjacent, clinically distinct anchor category. Replication of this analysis with a larger sample is ongoing. We used a correlation of 0.3 to determine whether an anchor was appropriate for estimating the MID for a given scale. It is possible that this criterion was not stringent enough; others have recommended a correlation of at least 0.5 between GRC anchors and PRO change scores. The sample of patients with advanced-stage cancer evaluated here experienced little longitudinal change in PRO scores and was potentially homogeneous; thus, results may not be generalizable to cancer patients as a whole. The range of time between the first and second assessments (6–12 weeks) was quite variable and may have contributed to the small PROMIS-Cancer scale change scores and to the weak correlations observed between anchor and scale change scores. In addition to triangulation across methods, MIDs should also be triangulated across multiple samples [10, 23]. Thus, the MIDs estimated here should be confirmed in other samples of cancer patients. Finally, and possibly most importantly, the MID may vary as a function of the level of impairment experienced by patients. In other words, the MID may vary by location on the severity continuum.
To address this concern, one would need a sample large enough to support separate MID analyses within subsamples defined by T-scores. For example, three separate MIDs could be determined for patients with T-scores ≤45, >45 to <55, and ≥55. The sample in this study was too small to accomplish this. However, this study represents the first of many that will address MIDs in PROMIS-Cancer and general PROMIS scales. Studies are currently being conducted that will allow us to answer the question of whether MIDs for PROMIS scales differ by level of functioning.
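A stratified analysis of the kind proposed here would begin by assigning each patient to a T-score band. The Python sketch below uses the three example bands from the text. The function names and the sample scores are hypothetical, and this illustrates the grouping step only, not an analysis that was carried out in this study.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def t_score_stratum(t_score: float) -> str:
    """Assign a baseline T-score to one of the three example bands."""
    if t_score <= 45:
        return "<=45"
    if t_score < 55:
        return ">45 to <55"
    return ">=55"


def group_by_stratum(t_scores: Sequence[float]) -> Dict[str, List[float]]:
    """Collect T-scores by band so that MIDs could be estimated within each."""
    strata: Dict[str, List[float]] = defaultdict(list)
    for score in t_scores:
        strata[t_score_stratum(score)].append(score)
    return dict(strata)


if __name__ == "__main__":
    # Hypothetical baseline T-scores.
    print(group_by_stratum([38.2, 45.0, 47.5, 54.9, 55.0, 63.1]))
```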
PROMIS is an IRT-based measurement system that measures self-reported physical, mental and social health using large item banks that drive brief, precise assessment referenced to United States general population norms [1–3]. These normative T-scores are standardized with the mean set at 50 and standard deviation set at 10 units. This is the first paper to estimate the minimally important difference of those T-scores, using cross-sectional and longitudinal data. We estimated MID ranges for five PROMIS domains: fatigue, pain, depression, anxiety, and physical functioning. Interquartile ranges of those estimated MIDs were in the range of 3–5 points for fatigue, anxiety and depression, and 4–6 points for pain and physical function. This corresponds to approximately one-third to two-thirds of a standard deviation for each scale, and offers a beginning point for interpretation of difference and change with PROMIS measures.
This work was supported by the National Institutes of Health grants U01 AR052177 and R01 CA60068. The authors would like to thank Seung Choi, Ph.D. for calculating the IRT-based T-scores; Sarah Rosenbloom, Ph.D. for directing and Jacquelyn George for coordinating the parent grant of which this study was a part; Maria Corona, Yvette Garcia, Natalie Gela and Ramya Iyer for recruiting patients and collecting data; Michael Bass for programming the assessment software; and Katy Wortman for assisting with data management. We also thank all the study participants at Kellogg Cancer Care Center of NorthShore University HealthSystem and John H. Stroger, Jr. Hospital of Cook County.
Portions of this manuscript were presented at the International Society of Quality of Life Research (ISOQOL) annual meeting in London, UK, October 28–30, 2010.