There are clinical and research settings in which concerns about respondent burden make the use of longer self-report measures impractical. Though computer adaptive testing provides an efficient strategy for measuring patient reported outcomes, the requirement of a computer interface makes it impractical for some settings. This study evaluated how well brief short forms, constructed from a longer measure of patient reported fatigue, reproduced scores on the full measure. When the items of an item bank are calibrated using an item response theory model, it is assumed that the items are fungible units. Theoretically, there should be no advantage to balancing the content coverage of the items. We compared short forms developed using a random item selection process to short forms developed with consideration of the items’ relation to subdomains of fatigue (ie, physical and cognitive fatigue). Scores on short forms developed using content balancing more successfully predicted full item bank scores than did scores on short forms developed by random selection of items.
Fatigue is a primary complaint of people with numerous conditions and diseases including multiple sclerosis (MS),1 stroke,2 cancer,3 post-polio,4 arthritis,5 and Parkinson’s disease.6 Fatigue and other patient-centered outcomes often are measured using retrospective self-reports. Because clinical time is at a premium and response rates have been found to be significantly lower with longer versus shorter surveys,7 patient outcomes often are measured using relatively brief scales. Despite their practicality, shorter measures typically yield less reliable scores than do longer measures. An alternative to static scales is computer adaptive testing (CAT), but CAT administrations require a computer interface and may be impractical in some settings. One alternative offered by classical test theory (CTT) methods is to construct parallel tests,8 but tests that are truly parallel are difficult if not impossible to construct. Item response theory (IRT) methods offer a more promising approach. Once a parent item bank has been calibrated to an IRT model, items can be extracted to comprise one or more short forms and scores can be calibrated to a common mathematical metric. Whereas CTT scoring methods assume that participants respond to the same items, with an IRT model persons’ trait-levels can be estimated on a common mathematical metric even when persons respond to different subsets of items. Thus, an IRT calibrated item bank provides the opportunity to develop multiple short forms whose scores are directly comparable.
The purpose of this study was to evaluate how well short forms constructed from a longer measure of patient reported fatigue could reproduce scores on the full measure. Two methods for developing short forms were compared – selecting items randomly and balancing the content of items based on targeted subdomains. In addition, the impact of number of items in the short forms was explored.
In a previous study, a sample of persons with MS (n = 466) responded to the 21-item Modified Fatigue Impact Scale (MFIS).9 For the current study, the data were reduced to include only MFIS responses of those who completed all 21 items (n = 374; 80%). Participants were recruited through the Multiple Sclerosis Association (MSA) of King County (Washington). MSA members were mailed a survey. Approximately 400 returned completed surveys, about a 55% response rate. Information on nonrespondents is unavailable because the surveys were mailed by the association. Study investigators did not have access to the mailing list.
The MFIS9 was developed to assess the impact of fatigue on a variety of daily activities. The item content is included in Table 1. Respondents rate their fatigue over the previous four weeks on a 0–4 scale where 0 = Never, 1 = Rarely, 2 = Sometimes, 3 = Often, and 4 = Almost always. The MFIS can be scored as a general measure of fatigue by summing across all items. Alternatively, subscale scores can be generated to estimate levels of physical (9 items), cognitive (10 items), and psychosocial fatigue (2 items).
An assumption of IRT models and CTT is unidimensionality; that is, it is assumed that a single latent construct drives the variance in scores. It is well-recognized that the assumption of unidimensionality is never strictly met in the context of health outcomes measurement. A scale of a very narrowly-defined construct could be expected to exhibit good fit to a unidimensional model based on conventional fit criteria,10 but most health constructs have greater conceptual breadth and require a broader range of indicators.11 Health outcomes are conceptually complex and never perfectly meet strictly defined unidimensionality assumptions.12–15
A number of approaches have been suggested for evaluating model assumptions, and often the findings of several methods are compared.16–20 Reise and Haviland14 have recommended comparing first-order unidimensional models with bifactor models.19,20 With a bifactor model, in addition to a general factor, there are “group” factors that account for score variation caused by subdomains. For the current study, we considered the results of an exploratory factor analysis (EFA), a first-order unidimensional confirmatory factor analysis (CFA), and a bifactor analysis. The factor analyses were conducted using Mplus software.21 For the EFA, we used unweighted least squares estimation. For the one-factor and bifactor CFAs, we used weighted least squares with mean and variance adjustment. Because of the categorical nature of the response data, a polychoric correlation matrix was analyzed. Fit was evaluated based on the Comparative Fit Index (CFI),10 the Tucker–Lewis Index (TLI),22 and the Root Mean Square Error of Approximation (RMSEA).23,24 To assess local independence, we examined the magnitude of residual correlations.25,26 The residuals represent the variance not accounted for by the model and, if local independence holds, they should not be substantially correlated.
The items of the MFIS were divided into 2-, 3-, 4-, and 5-item short forms using two item selection strategies: content balancing and random selection. This resulted in an 8-cell design (4 sizes of short forms × 2 item selection strategies). The items of the MFIS were used to create 10 different short forms within each study cell. Thus, a total of 80 short forms were generated (8 cells × 10 replications).
Within each selection strategy, the 2-item short forms were comprised of unique items; that is, no item appeared in more than one short form. There were not enough items to build wholly unique short forms for the 3- to 5-item conditions, but each short form was comprised of a unique grouping of items. Within each study condition of 10 short forms, an effort was made to balance the number of short forms for which any given item was selected.
As already noted, the MFIS can be scored as a single scale or as three subscales to measure the impact of cognitive, physical, and psychosocial fatigue. We conducted our own content review of the items and elected, for content balancing purposes, to reclassify one of the physical fatigue items. It was our judgment that item 21, “needed to rest more often or for longer periods”, could indicate cognitive as well as physical fatigue. We defined an “other” category that included the two psychosocial items and item 21. Adding this item made content balancing somewhat easier because there were more items from which to choose in populating this subdomain.
For the second item selection condition, items were randomly assigned to short forms. When this selection resulted in a duplicate short form within a study cell, a new item subset was generated at random so that, within each study cell, there were no short forms with the exact same items.
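The random selection condition can be illustrated with a minimal sketch. It assumes items are identified by the numbers 1 to 21 and that a duplicate form is simply redrawn; the function name `random_short_forms` and the seeds are hypothetical, and the sketch does not attempt the additional balancing of item usage across forms described above.

```python
import math
import random

def random_short_forms(item_ids, form_size, n_forms, seed=0):
    """Draw n_forms distinct random item subsets of size form_size."""
    if n_forms > math.comb(len(item_ids), form_size):
        raise ValueError("not enough distinct item subsets for this design")
    rng = random.Random(seed)
    seen = set()
    forms = []
    while len(forms) < n_forms:
        # A frozenset ignores item order, so the same items drawn in a
        # different order count as a duplicate and are redrawn.
        candidate = frozenset(rng.sample(item_ids, form_size))
        if candidate in seen:
            continue
        seen.add(candidate)
        forms.append(sorted(candidate))
    return forms

items = list(range(1, 22))  # the 21 MFIS items, numbered 1-21 (assumed labels)
# One study cell per short form size, 10 replications each.
design = {size: random_short_forms(items, size, n_forms=10, seed=size)
          for size in (2, 3, 4, 5)}
```

Each entry of `design` holds the 10 item subsets for one short form size, so the four sizes together give the 40 randomly selected short forms.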
Responses to the 21 items of the MFIS were calibrated using Masters’ Partial Credit Model (PCM)27 and Parscale Software.28 The PCM is appropriate for calibrating responses to items that offer three or more response options (eg, never/sometimes/always). Scores are obtained based on a derived probability function that models how persons with different levels of the outcome being measured (fatigue in the current study) are likely to respond to items. Fit to the PCM was evaluated using the computer macro, IRTFIT.29 We report both S-X² and S-G² fit statistics (P < 0.01).30,31
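To make the probability function concrete, the following sketch computes PCM category probabilities for a single item in the standard textbook parameterization, where the probability of category k is proportional to the exponential of the summed differences between the trait level and the first k step difficulties. The step difficulties shown are hypothetical, not estimates from this study, and Parscale's own parameterization may differ in scaling.

```python
import math

def pcm_probabilities(theta, step_difficulties):
    """Category probabilities for one item under the Partial Credit Model.

    theta: person trait level in logits.
    step_difficulties: delta_1..delta_m for the item (m = 4 for a 0-4 scale).
    Returns a list of m + 1 probabilities, one per response category.
    """
    # Cumulative sums of (theta - delta_j); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    peak = max(cumulative)  # subtracting the max keeps exp() stable
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical step difficulties for a 5-category item.
probs = pcm_probabilities(theta=0.5, step_difficulties=[-1.5, -0.5, 0.5, 1.5])
```

Because every item shares the same trait metric, trait levels estimated from any subset of calibrated items are expressed on the same scale, which is what allows short form and full scale scores to be compared directly.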
A total of 81 scores were generated for each individual in the validation sample. One set of scores was estimated based on responses to all 21 MFIS items (full scale scores). The other 80 were based on responses to each of the 80 short forms.
Persons’ full scale scores served as a standard by which the short forms were evaluated. Pearson product-moment correlation coefficients were calculated between short form and full scale scores, and within each study cell, the range and average of correlation coefficients were calculated (using Fisher Z transformation).32 In addition, the root mean squared errors (RMSE) were calculated. The “errors” were defined for the purposes of this study as short form score minus full scale score.
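The two summary statistics described here can be sketched as follows. The function names are illustrative, and the inputs are assumed to be plain lists of correlation coefficients or of trait-level estimates in matching person order.

```python
import math

def mean_correlation(r_values):
    """Average correlation coefficients via Fisher's z transformation."""
    z_values = [math.atanh(r) for r in r_values]  # r -> z
    return math.tanh(sum(z_values) / len(z_values))  # mean z -> r

def rmse(short_form_scores, full_scale_scores):
    """Root mean squared error, with error = short form - full scale score."""
    errors = [s - f for s, f in zip(short_form_scores, full_scale_scores)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))
```

Averaging on the z scale rather than averaging r directly reduces the bias that arises because the sampling distribution of r is skewed when correlations are high.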
To evaluate the errors associated with short form scores relative to the error associated with full scale scores, we derived confidence intervals for persons’ full scale scores. These were defined as calibrated full scale scores ± two standard errors. The standard errors were the individual standard errors obtained for each person in the PCM calibration of all 21 MFIS items. We calculated the percentage of short form scores in each study condition that were within this range.
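As a minimal sketch of this calculation, assuming three parallel lists holding each person's short form score, full scale score, and full scale standard error (the function name and the example values are hypothetical):

```python
def proportion_within_interval(short_form_scores, full_scale_scores,
                               standard_errors, width=2.0):
    """Share of short form scores within full scale score +/- width * SE."""
    hits = sum(
        1
        for s, f, se in zip(short_form_scores, full_scale_scores, standard_errors)
        if abs(s - f) <= width * se
    )
    return hits / len(full_scale_scores)

# Made-up scores for three persons, in logits.
coverage = proportion_within_interval(
    short_form_scores=[0.1, 1.0, -0.5],
    full_scale_scores=[0.0, 0.0, -0.4],
    standard_errors=[0.3, 0.3, 0.3],
)
```

In this toy example two of the three short form scores fall inside the interval, so `coverage` is two thirds. Because the standard errors vary by person, the interval width differs across respondents.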
A total of 374 persons responded to all items of the MFIS. Data from these respondents were used for the current study. The study population was largely female (79.4%) and overwhelmingly Caucasian (93%), with an average age of 49 years (range of 21–78). Participants reported their course of disease based on a self-report item that displays five figures in which severity of symptoms is plotted against time.33 Each figure represents a different pattern of symptom severity over time, and respondents are asked to indicate the one that “best describes the course of your MS over time”. Of those who responded to this item, 55% selected a plot consistent with “relapsing remitting”, 27% with “secondary progressive”, and 18% with “primary progressive”.
The fit statistics for a first-order unidimensional CFA model yielded mixed results. The CFI10 and TLI22 were 0.891 and 0.942 respectively, suggesting moderate model fit.10 The RMSEA, however, was very high (0.331), indicating very poor fit. Half of the residual pairs had correlations >0.10 (n = 106); more than half of these were >0.20 (n = 56).
An EFA was conducted. The ratio of the first and second eigenvalue was 3.95, with the first factor accounting for 60.0% of the variance. The correlation between the first and second factor was 0.57. Item factor loadings obtained in the EFA are provided in Table 1. Loadings from both a one- and two-factor solution are included. The loadings supported a two-factor solution in which cognitive items loaded on the first factor, and all other items loaded on the second factor. The items categorized as psychosocial in the original categorization and as “other” in our reclassification loaded with the physical items on the second factor.
A bifactor model was fitted in which all items loaded on a general factor, cognitive items loaded on one orthogonal group factor and all other items on a second orthogonal group factor. The fit of this model was substantially better than that of the first-order, one-factor model (CFI = 0.961, TLI = 0.993, RMSEA = 0.105), and only three residual correlations had absolute values greater than 0.10. The general factor accounted for 67% of the common variance. The physical and “other” items accounted for 29% of the common variance, and the cognitive factor accounted for only 7%. Because the cognitive specific factor accounted for such a small proportion of variance, we concluded that the data were sufficiently unidimensional for calibration with the partial credit model.27 Though the items we designated as “other” did not define a separate group factor, for content balancing purposes only, we retained the category.
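One common way to express each factor's share of common variance in a bifactor solution is the ratio of that factor's summed squared loadings to the summed squared loadings across all factors. The sketch below assumes that definition, which may not match the exact computation used here, and the loadings in the example are invented for illustration.

```python
def common_variance_shares(general_loadings, group_loadings):
    """Share of common variance attributable to each factor.

    general_loadings: general-factor loadings, one per item.
    group_loadings: dict mapping a group factor's name to the loadings
    of the items assigned to it.
    """
    sums_of_squares = {"general": sum(l * l for l in general_loadings)}
    for name, loadings in group_loadings.items():
        sums_of_squares[name] = sum(l * l for l in loadings)
    total = sum(sums_of_squares.values())
    return {name: ss / total for name, ss in sums_of_squares.items()}

# Invented loadings for four items split across two group factors.
shares = common_variance_shares(
    general_loadings=[0.8, 0.8, 0.8, 0.8],
    group_loadings={"cognitive": [0.4, 0.4], "physical_other": [0.4, 0.4]},
)
```

With these invented loadings the general factor accounts for 80% of the common variance and each group factor for 10%.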
Three of the 21 items (15%) failed to fit the PCM at alpha = 0.01. The items that failed at this criterion were items 16, 18, and 19 (item content reported in Table 1).
Pearson product-moment correlations were computed between short form and full scale scores and these were compared across study conditions (short form size and item selection strategy). The results for the ten replications per study cell were summarized by calculating the range of correlation values and the average correlation. Average correlations were calculated by transforming r-values into corresponding z-scores, finding the mean of those scores, and then transforming this value back to an r-value.32 Figure 1 is a box plot displaying the 10th, 25th, 50th, 75th, and 90th percentiles for each study condition. As the plot indicates, results were substantially better for the short forms created based on content-balancing. Even for short forms comprised of only two items, correlations ranged from 0.83 to 0.90 (mean = 0.87). With the 5-item short forms, correlations ranged from 0.94 to 0.96 (mean = 0.95). Though the short forms based on random selection of items fared relatively well, with mean correlations of 0.81, 0.89, 0.91, and 0.93 for the 2-, 3-, 4-, and 5-item short forms, respectively, the results were inferior to those obtained with the content-balanced short forms and far more variable.
As expected, short forms with more items performed better than those with fewer items. For the short form sizes evaluated in this study, there was little “leveling-off” of the advantage gained by having more items. The 5-item short forms performed better than the 4-item short forms; 4-item short forms performed better than the 3-item short forms, and so on.
We made the assumption that trait-level estimates based on all 21 items of the MFIS would be superior to estimates based on fewer items. For this study, therefore, “error” was defined as short form score minus full-scale score. Figure 2 compares the RMSE values calculated based on this definition of error. RMSEs are in the metric of the scale, and their magnitude can be interpreted relative to the range of theta estimates (7.7 logits in the current study). The pattern of RMSE results mirrored the correlation results. Increasing the number of items reduced the observed error as did developing short forms based on content-balancing. The 5-item content-balanced short forms performed particularly well in approximating full-scale scores. The RMSE for these short forms was 0.43, which is 5.6% of total score range.
Though we used the full-scale score as our gold standard, this estimate also has an error associated with it. The IRT calibration outputs a standard error of estimate (SEM) for every person. These vary by trait level. We computed the 95% confidence interval (± 2 SEMs) around each respondent’s full-scale trait level estimate and then calculated the proportion of short form scores from each condition that fell within this range. Figure 3 shows the results. Like the previous comparisons, these analyses show the superiority of the content-balanced short forms and the increase in precision gained by adding more items. For example, for the content-balanced short forms the proportion that fell within the ± 2 SEM confidence interval ranged from 0.68 for the 2-item short forms to 0.87 for the 5-item short forms. Of the scores based on short forms comprised of randomly selected items, the proportions falling within the ± 2 SEM confidence interval were 0.58 and 0.80, respectively, for the 2- and 5-item short forms.
We found a clear advantage for using a content-balancing strategy over random selection of items in developing short forms. We did not investigate the impact of difficulty-balancing because of the limits of our item pool. In the current study, short forms developed to be content-balanced proved to be balanced with respect to item difficulty as well. Content- and difficulty-balancing should be compared with a larger item pool to evaluate whether one approach is superior to the other.
Despite the limitations of our study, the results warrant several conclusions. The PCM proved an effective model for developing multiple subscales calibrated to a common mathematical metric, and even very brief subscales produced reasonable approximations of full scale scores, particularly when the subscales were developed to represent the subdomains of the measured construct. Increases in the number of items per subscale yielded the expected increases in precision. Future research should further investigate the impact of item content and item parameters in the development of short forms from a parent item bank.
The authors declare no conflicts of interest.