A post-hoc simulation of a computer adaptive administration of the items of a modified version of the Roland Morris Disability Questionnaire.
To evaluate the effectiveness of adaptive administration of back pain-related disability items compared to a fixed 11-item short form.
Short form versions of the Roland Morris Disability Questionnaire have been developed. An alternative to paper-and-pencil short forms is to administer items adaptively so that items are presented based on a person’s responses to previous items. Theoretically, this allows precise estimation of back pain disability with administration of only a few items.
Data were gathered from two previously conducted studies of persons with back pain. An item response theory model was used to calibrate scores based on all items, items of a paper-and-pencil short form, and several computer adaptive tests (CATs).
Correlations between each CAT condition and scores based on a 23-item version of the Roland Morris Disability Questionnaire ranged from 0.93 to 0.98. Compared to an 11-item short form, an 11-item CAT produced scores that were significantly more highly correlated with scores based on the 23-item scale. CATs with even fewer items also produced scores that were highly correlated with scores based on all items. For example, scores from a five-item CAT had a correlation of 0.93 with full scale scores. Seven- and nine-item CATs correlated at 0.95 and 0.97, respectively. A CAT with a standard-error-based stopping rule produced scores that correlated at 0.95 with full scale scores.
A CAT-based back pain-related disability measure may be a valuable tool for use in clinical and research contexts. Use of CAT for other common measures in back pain research, such as other functional scales or measures of psychological distress, may offer similar advantages.
Self-reports of back pain-related disability often are used as endpoints in clinical research studies that evaluate interventions and treatments. Such measures also may supply information useful in clinical decision making. Among the most widely used measures of back pain disability is the Roland-Morris Disability Questionnaire (RM).1 The RM is comprised of 24 items to which respondents answer “yes” or “no” to indicate limitations they experience due to back pain. The original RM was developed by identifying items of the Sickness Impact Profile (SIP)2 relevant to low back pain. Patrick and colleagues3 developed a 23-item, modified version of the RM (RM-MOD) by selecting items sensitive to change over time. The two versions share 19 items in common, but five of the original items were replaced by four new items in the modified version. Evidence regarding the psychometric properties of the RM-MOD has been published.3
The RM-MOD’s 23 items constitute a relatively brief scale, and if it were the only instrument administered to patients, it would not impose a substantial response burden. However, in practice, back pain-related disability is often one of a large number of patient-reported outcomes of interest, and typically several outcome instruments are administered.4 Using traditional psychometric methods, comprehensive assessment would require completion of long questionnaires. This is suboptimal since response rates have been found to be significantly lower with longer versus shorter surveys.5
At least three short forms of the RM have been developed. The items included in the RM, the RM-MOD, and the three short forms are displayed in Table 1. Stratford and colleagues6 developed an 18-item version (RM-18) based on classical psychometric analyses including calculation and evaluation of response frequencies, inter-item and item-total correlations, and coefficient alpha. They concluded that the RM-18 was as effective as the full 24-item RM. Atlas and colleagues constructed a 12-item questionnaire (RM-12).7 The RM-12 was found to have somewhat lower coefficient alpha values (in the derivation sample, 0.82 for the RM-12 compared to 0.90 for the RM) but had reproducibility equal to that of the RM. The RM-12 also demonstrated high construct validity. Stroud and colleagues8 developed an 11-item version of the RM (RM-11) using an item response theory approach. The coefficient alpha of the RM-11 was similar to that of the 24-item RM (0.88 versus 0.90, respectively). Correlations between RM-11 scores and those obtained with the RM-18 and 24-item measures were 0.95 and 0.93, respectively. The results of the three studies suggest that the RM may be shortened substantially without greatly sacrificing the strength of its psychometric properties.
Another approach to decreasing the number of items to which subjects respond is computer adaptive testing (CAT) in which, after a person responds to a starting item, items are selected and presented based on preliminary estimates of persons’ trait level. This preliminary estimate is based on persons’ responses to previous items.9 After responses to each successive item, the preliminary estimate is updated and a new item is selected based on this estimate. Thus, the assessment is adapted to respondents’ levels of the outcome being measured. Items that contribute substantively to estimating persons’ trait levels are presented. As an example, if a patient responds to a question and indicates inability to walk a block, there is little point in asking about ability to walk a mile. Though CAT has been employed in educational and psychological testing for 30 years, only recently has it been applied to the measurement of health outcomes.10–17 The purpose of the current study was to evaluate whether a CAT administration of back pain disability items could result in more efficient measurement of back pain disability.
Data were gathered from two previously conducted studies.18, 19 IRB permission was obtained from the University of Washington Human Subjects Committee for this reanalysis of the data. Detailed methods have been published.18, 19 One of the two studies was a multi-center prospective cohort study of 495 patients with presumed discogenic back pain (“the Discogenic study”18). Participants had one- or two-level disc degeneration confirmed by imaging and neurological evaluations. The second study was a clinical trial of 380 participants with low back pain randomly assigned to rapid magnetic resonance imaging or standard radiographs (the Seattle Lumbar Imaging Project, “SLIP”).19 In both studies, RM scores served as the primary outcome, and in both, data were collected at more than one time point. For the current study, we used only data collected at baseline.
The mathematical model that allows CAT is Item Response Theory (IRT). IRT is a probability model and estimation of scores is achieved without the requirement that every person respond to the same items.17, 20 IRT models assume that a single latent construct (e.g., “back-pain related disability”) drives persons’ item responses (unidimensionality assumption). Health outcomes are conceptually complex and never perfectly meet the unidimensionality assumption.21–24 The pertinent question is whether the presence of secondary dimensions causes the results of a unidimensional IRT calibration to be invalid.25
To evaluate the degree to which the unidimensionality assumption was met, we conducted a first-order, confirmatory factor analysis using Mplus software.26 Because of the categorical nature of the response data, the polychoric correlation matrix was analyzed. In addition we plotted eigenvalues (scree plot) and evaluated the correlation between factors in a two-factor solution in an exploratory factor analysis.
We fit RM-MOD item responses to the two-parameter logistic model (2-PL) using Parscale version 4.1.27 The 2-PL model estimates both an item difficulty and an item discrimination parameter. Expected a posteriori score estimation was used so that scores could be estimated for persons who endorsed all (or none) of the items. Fit to the 2-PL model was evaluated using the computer macro, IRTFIT.28 We report S-X² and S-G² fit statistics (p<0.01).29, 30
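As an illustration of why expected a posteriori (EAP) scoring yields finite scores even for all-or-none response patterns, the following is a minimal Python sketch, not the Parscale implementation. It assumes a standard normal prior, a fixed quadrature grid from -4 to 4, and hypothetical item parameters; the function names `p_endorse` and `eap_estimate` are introduced here for illustration only.

```python
import math

def p_endorse(theta, a, b):
    """2-PL probability of endorsing an item: 1 / (1 + exp(-a * (theta - b))).

    a is the item discrimination, b the item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_estimate(responses, items, n_points=81, lo=-4.0, hi=4.0):
    """Return (EAP score, posterior SD) for one respondent.

    responses: list of 0/1 answers; items: list of (a, b) tuples, same order.
    Assumption: standard normal prior evaluated on an evenly spaced grid.
    """
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for u, (a, b) in zip(responses, items):
            p = p_endorse(theta, a, b)
            w *= p if u == 1 else (1.0 - p)  # likelihood of this response
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

# Hypothetical (a, b) parameters, not the calibrated RM-MOD values.
example_items = [(1.5, -1.0), (1.2, 0.0), (2.0, 1.0)]
all_endorsed = eap_estimate([1, 1, 1], example_items)
none_endorsed = eap_estimate([0, 0, 0], example_items)
```

Because the prior pulls the posterior toward the population mean, both `all_endorsed` and `none_endorsed` are finite, whereas a maximum likelihood estimate would be unbounded for these patterns.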
With traditional measures, assessment stops when a respondent completes all items. With CAT measures, participants respond only to a subset of items, and so a “stopping rule” must be specified. Stopping rules can be based either on the number of items administered (fixed-length CAT) or on a pre-specified standard error of measurement (SEM; variable-length CAT). The SEM is an estimate of a measurement’s precision. It estimates the standard deviation of the differences between persons’ “true scores” and their observed scores (the ones obtained on the measure). With variable-length CATs, the administration continues until the pre-specified SEM is reached or all items have been administered.
We simulated 5-, 7-, 9-, and 11-item fixed-length CATs. We also simulated a CAT based on a SEM stopping rule of 0.5. All respondents in the current study answered all 23 items of the RM-MOD (RM-MODIRT). We used a computer algorithm to simulate CAT administration of the items. The computer program was written using SAS/STAT software, Version 9.1 for Windows XP-Pro.31 In the simulation, Item 1, “I stay at home most of the time because of my back,” was “presented” to each respondent as the initial item, and an initial estimate was made of persons’ levels of back pain disability based on their responses. The item chosen to be presented next was the remaining item that provided the most “information” given the initial estimate of the person’s trait level. (Information in IRT is an extension of the concept of reliability.32) Based on the response to this item, the estimate of the person’s trait level was updated. This process continued until the stopping rule was reached or all 23 items had been presented.
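The simulation loop described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the SAS program used in the study: it assumes EAP scoring with a standard normal prior, uses the posterior standard deviation as the SEM, and selects the next item by maximum 2-PL information at the current estimate. The function names and the example item parameters are hypothetical.

```python
import math

GRID = [-4.0 + 0.1 * i for i in range(81)]  # quadrature points (assumed range)

def p_endorse(theta, a, b):
    """2-PL probability of endorsing an item (a: discrimination, b: difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """2-PL item information at theta: a^2 * P * (1 - P)."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)

def eap(answered, items):
    """EAP estimate and posterior SD. answered maps item index -> 0/1."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for i, u in answered.items():
            p = p_endorse(theta, *items[i])
            w *= p if u == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def simulate_cat(full_responses, items, start_item=0, max_items=None, sem_target=None):
    """Replay one person's recorded answers as if administered adaptively.

    Stops at max_items (fixed-length CAT), when the SEM falls to sem_target
    (variable-length CAT), or when every item has been presented.
    Returns (theta, sem, list of administered item indices in order).
    """
    n = len(items)
    limit = n if max_items is None else min(max_items, n)
    administered = [start_item]
    answered = {start_item: full_responses[start_item]}
    theta, sem = eap(answered, items)
    while len(administered) < limit:
        if sem_target is not None and sem <= sem_target:
            break
        remaining = [i for i in range(n) if i not in answered]
        # Present the unadministered item that is most informative at theta.
        nxt = max(remaining, key=lambda i: item_information(theta, *items[i]))
        answered[nxt] = full_responses[nxt]  # look up the recorded response
        administered.append(nxt)
        theta, sem = eap(answered, items)
    return theta, sem, administered

# Hypothetical item bank and response pattern, for illustration only.
bank = [(1.0, -1.5), (1.5, -0.5), (2.0, 0.0), (1.2, 0.8), (1.8, 1.5), (0.9, 2.0)]
recorded = [1, 1, 0, 1, 0, 0]
fixed_length = simulate_cat(recorded, bank, max_items=3)
sem_based = simulate_cat(recorded, bank, sem_target=0.5)
```

Because the responses are looked up from the completed questionnaire rather than collected live, the same recorded data can be replayed under each stopping rule and compared against the full-scale estimate.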
For purpose of comparison, in addition to the CAT simulations we calibrated IRT scores for the RM-11 (RM-11IRT). These scores were compared to IRT scores based on the full RM-MOD (RM-MODIRT) and to scores obtained in the CAT conditions.
The success of the CAT administrations and of the IRT-calibration of the RM-11 was evaluated by comparing them to the full-scale score (RM-MODIRT). Pearson product moment correlations were calculated between sets of scores. In addition, we calculated residuals. We defined residuals as CAT score (or RM-11 score) minus RM-MODIRT. These residuals indicated how far off the mark CAT and short form scores were from the 23-item RM-MODIRT scores, our gold standard for the current study.
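The two comparison quantities described here, the Pearson correlation and the residual, can be computed as in the following Python sketch. The function names are introduced for illustration, and the inputs are assumed to be equal-length lists of scores in the same respondent order.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def residuals(short_scores, full_scores):
    """Residual = CAT (or RM-11) score minus full-scale score, per respondent."""
    return [s - f for s, f in zip(short_scores, full_scores)]
```

A positive residual means the shortened administration overestimated the full-scale score for that respondent; a negative residual means it underestimated it.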
Though a scale may be more precise at some levels of back pain-related disability than others (e.g., moderate disability versus severe disability), with classical methods a single summary reliability estimate is calculated for the entire scale.20 Thus, differences in a scale’s measurement precision at different levels of trait are masked. An advantage of IRT models is that reliability is calculated for every level of trait. To have a reference point with which to compare the magnitude of residuals in the current study, we compared each person’s residual with the SEM obtained for that person in the 2-PL calibration of the 23-item scale. We then compared, across conditions, the percentage of scores whose residuals were less than one SEM. The choice of one SEM was somewhat arbitrary.
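The within-one-SEM comparison can be sketched in Python as follows. This assumes that the absolute residual is compared with each person's own SEM from the full-scale calibration; the function name and the example values are hypothetical.

```python
def percent_within_one_sem(short_scores, full_scores, full_sems):
    """Percentage of respondents whose absolute residual is less than their
    person-specific SEM from the full-scale calibration."""
    hits = sum(
        1
        for s, f, sem in zip(short_scores, full_scores, full_sems)
        if abs(s - f) < sem
    )
    return 100.0 * hits / len(full_scores)

# Hypothetical scores for four respondents, for illustration only.
pct = percent_within_one_sem(
    [0.1, 1.0, -0.5, 2.0],   # CAT or short-form scores
    [0.0, 0.2, -0.4, 2.0],   # full-scale scores
    [0.3, 0.3, 0.3, 0.3],    # person-specific SEMs
)
# Three of the four residuals fall inside one SEM, so pct is 75.0.
```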
In the combined samples of the Discogenic study18 and the SLIP,19 740 participants (85%) were white, 71 (9%) African-American or Black, 18 (2%) Asian, and 24 (3%) were of Hispanic origin. The mean age was 47 years (range = 18 to 93 years, SD = 13 years). Of the participants in the combined samples, 45.0% reported working full-time; 10.3% worked part-time; 20.5% were retired; and 4.2% were unemployed. Eighteen percent reported receiving workers’ compensation because of their back. Additional demographic and clinical details have been published elsewhere.18, 19, 33
The results of a confirmatory factor analysis of a one-factor model (to evaluate unidimensionality) were assessed by examining several fit indices including the comparative fit index (CFI), root mean square error of approximation (RMSEA), and standardized root mean square residual (SRMR). Strict standards for these measures have been suggested (CFI>0.95, RMSEA <0.06, SRMR <0.08).25 In practice such high standards are seldom met in the context of health outcome measurement.21, 34 Such was the case in the current study. The CFI was 0.90; RMSEA was 0.08; and the SRMR was 0.09.
When fit statistics fail to meet strict standards for good fit, Reeve and colleagues25 recommend conducting an exploratory factor analysis, plotting eigenvalues against the rank of the eigenvalue (scree plot), and evaluating several statistics including percent of score variability on the first factor (≥20% is desirable), ratio of first and second eigenvalues (>4 supports unidimensionality assumption), and correlations among factors. Following this advice, we conducted a follow-up exploratory factor analysis. By these more relaxed standards, there was good support for the essential unidimensionality of the item responses. The scree test supported a single dominant dimension; the first factor accounted for 51.3% of the variance; the ratio of the first and second eigenvalues was 7.1; and the correlation between the first two factors was 0.69.
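The eigenvalue-based checks listed above can be expressed as a small Python sketch. It assumes the eigenvalues come from the item correlation matrix, so that they sum to the number of items and the first eigenvalue divided by that sum gives the percent of variance on the first factor. The function name and the example eigenvalues are hypothetical, and the thresholds are the ones quoted in the text.

```python
def unidimensionality_summary(eigenvalues):
    """Summarize eigenvalue-based evidence for a dominant first factor.

    eigenvalues: eigenvalues of the item correlation matrix (assumed input).
    """
    ordered = sorted(eigenvalues, reverse=True)
    percent_first = 100.0 * ordered[0] / sum(ordered)
    ratio = ordered[0] / ordered[1]
    return {
        "percent_first_factor": percent_first,
        "eigenvalue_ratio": ratio,
        "meets_percent_criterion": percent_first >= 20.0,  # >= 20% desirable
        "meets_ratio_criterion": ratio > 4.0,              # > 4 supports unidimensionality
    }

# Hypothetical eigenvalues for a five-item scale, for illustration only.
summary = unidimensionality_summary([3.0, 0.6, 0.5, 0.5, 0.4])
```

These summaries complement, rather than replace, inspection of the scree plot and of the correlations among factors.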
Values of S-X² and S-G² statistics indicated good fit to the 2-PL model. None of the 23 items were found to be misfitting based on a criterion of p<0.01. The fit of the items to the model corroborated our judgment that the responses were essentially unidimensional and therefore appropriate for calibration using an IRT model.
As expected, scores from CAT administrations with more items were more highly correlated with RM-MODIRT scores than were scores from CAT administrations with fewer items. Correlations between RM-MODIRT scores and scores from CATs with 5, 7, 9, and 11 items were 0.93, 0.95, 0.97, and 0.98, respectively. The correlation between RM-11IRT scores and RM-MODIRT scores was 0.93. Coincidentally, this is exactly the value of the Pearson product-moment correlation reported by Stroud and colleagues for the correlation between summed RM-11 scores and summed scores of the 24-item version of the RM.8 In comparison, the correlation between the 11-item fixed-length CAT scores and RM-MODIRT scores was 0.98. This correlation was significantly higher than the correlation of 0.93 obtained between scores from the 11-item fixed form and the RM-MODIRT (t=21.796, p<0.001).35
SEM-based CAT scores had a correlation of 0.95 with RM-MODIRT scores. The number of items administered in the SEM CAT condition ranged from 2–23 (see Figure 1). Though a mean of 8.3 items was administered, there was substantial variation in the number of items, and the majority of persons received seven or fewer items. For 121 of the 874 respondents (14%), all 23 available items were administered without the estimate reaching the stopping rule of SEM ≤ 0.50. This is indicative of the sparseness of items in the bank that target substantial portions of the sample. Indeed the RM-MOD exhibited substantial ceiling effects in the study sample; 96 persons (11%) scored in the upper 5% of the score range, endorsing 22 or 23 out of the 23 items. In contrast, for persons well-targeted by the scale, considerable efficiency was achieved as evidenced by the fact that in 53% of cases, five or fewer items were administered before the stopping rule of SEM ≤ 0.50 was reached.
Figure 2 plots, for each study condition, the percentage of scores whose absolute residual (CAT/RM-11 score – RM-MODIRT score) was less than one SEM (obtained for each score in the IRT calibration). Higher percentages indicate CATs that more successfully approximated the score obtained on the full scale. Percentages ranged from 68% for the 5-item CAT to 92% for the 11-item CAT. Notable is the substantially larger percentage of 11-item CAT scores (92%) compared to the RM-11 (73% within one SEM). In fact, better results were obtained with a 7-item CAT (76% of scores within one SEM) than with the RM-11, even with the 36% “savings” in response burden. A 7-item scale represents a 70% savings in response burden over the 23-item RM-MOD.
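The response burden savings quoted above follow from the ratio of items administered to items in the comparison form, worked here for the 7-item CAT against the 11-item RM-11 and the 23-item RM-MOD:

```latex
\[
1 - \frac{7}{11} \approx 0.36, \qquad 1 - \frac{7}{23} \approx 0.70
\]
```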
The SEM-based CAT had a stopping rule of SEM = 0.5, roughly equivalent to reliability of 0.69. This CAT condition performed better than the 5-item CAT, but not as well as the 7-item CAT.
The results of the current study indicate that substantial savings in response burden could be obtained using an adaptive approach to scaling back pain-related disability. CAT administrations with as few as 5 items predicted RM-MOD scores with reasonable accuracy, and an 11-item CAT performed substantially better than an 11-item short form. These results are evidence that the hypothesized scaling efficiency of computer adaptive testing can be realized in a clinical outcomes context. Of particular note is that these results were achieved with an item bank that was not developed specifically for computer adaptive administration. To be maximally effective, CAT item banks should be large and of sufficient breadth so that floor and ceiling effects are avoided.9 In the current study, 11% of participants endorsed either all (6.5%) or all but one (4.5%) of the RM-MOD items, suggesting an insufficient number of items that target high levels of back pain-related disability—a ceiling effect. An item bank developed specifically for CAT not only would include more than 23 items but also more items designed to discriminate among persons with high levels of back pain-related disability. A CAT supported by such an item bank would be expected to result in even greater measurement efficiency across a broader range of disability than was observed in the current study.
The complexity of developing a CAT is acknowledged. It requires substantial effort in item development as well as advanced psychometric skills and specialized software. However, once a CAT is developed, the algorithm that selects items and estimates scores operates “behind the scenes.” For responders, the experience of reporting outcomes using CAT is no different from responding to a more traditional computer-administered questionnaire, except that the number of questions and length of time required are reduced.
Even in contexts in which response burden is not a major issue, CAT retains advantages over assessment using fixed forms. As was suggested by our comparison of the 11-item CAT and the RM-11, with the same number of items, greater measurement precision can be achieved. This is a legitimate goal in its own right: increases in precision result in increases in statistical power, so greater measurement precision also results in smaller sample size requirements for clinical research.36
A limitation of this study is that the CAT administrations were simulated. Participants did not respond to items of an actual CAT. We made the assumption that persons would give the same answers to the items of the RM-MOD whether they were presented by CAT or in a fixed, paper-and-pencil format. Though this assumption seems reasonable, its validity should be tested in future research. We also did not evaluate any of the non-technical but critical issues regarding use of CAT in patient populations, including whether patients are comfortable using a computer to report their outcomes.
Our findings suggest that a CAT-based back pain-related disability measure could be a valuable tool for use in clinical and research contexts, particularly when response burden is a concern and/or multiple assessments are planned.
Dr. Deyo’s effort was supported by a NIAMS Multidisciplinary Clinical Research Center, grant No. 5 P60-AR48093.
Drs. Cook, Crane, Johnson, and Amtmann were funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant 5U01AR052171-03 to University of Washington. Information on the “Dynamic Assessment of Patient-Reported Chronic Disease Outcomes” can be found at http://nihroadmap.nih.gov/clinicalresearch/index.asp