When functional scales are to be used as treatment outcome measures, it is essential to know how responsive they are to clinical change. This information is essential not only for clinical decision-making, but also for the determination of sample size in clinical trials. The present study examined the responsiveness of a German version of the Oswestry Disability Index version 2.1 (ODI) after surgical treatment for low back pain. Before spine surgery 63 patients completed a questionnaire booklet containing the ODI, along with a 0–10 pain visual analogue scale (VAS), the Roland Morris disability questionnaire, and Likert scales for disability, medication intake and pain frequency. Six months after surgery, 57 (90%) patients completed the same questionnaire booklet and also answered Likert-scale questions on the global result of surgery, and on improvements in pain and disability. Both the effect size for the ODI change score 6 months after surgery (0.87) and the area under the receiver operating characteristics (ROC) curve for the relative improvement in ODI score in relation to global outcome 6 months after surgery (0.90) indicated that the ODI showed good responsiveness. The ROC method revealed that a minimum reduction of the baseline (pre-surgery) ODI score by 18% (equal to a mean 8-point reduction in this patient group) represented the cut-off for indicating a “good” individual outcome 6 months after surgery (sensitivity 91.4% and specificity 82.4%). The German version of the ODI is a sensitive instrument for detecting clinical change after spinal surgery. Individual improvements after surgery of at least an 18% reduction on baseline values are associated with a good outcome. This figure can be used as a reliable guide for the determination of sample size in future clinical trials of spinal surgery.
In recent years patient-oriented, self-administered questionnaires have been used with increasing frequency in the assessment of outcome after treatment for low back pain. For assessing “back-specific function”, most state-of-the-art reviews [6, 11] recommend either the Oswestry Disability Index (ODI) [15, 16] or the Roland Morris Questionnaire (RM). A number of studies have been carried out to examine the psychometric characteristics of these instruments, especially when validating various non-English language versions, but most of these investigations have only been concerned with the reliability (internal consistency and test–retest reliability) and validity of the given questionnaires (e.g. [7, 19]). Good reliability and validity are prerequisites of any instrument, especially when it is to be used to discriminate between subjects or predict prognosis [3, 24, 29]. However, the requirements for successful cross-sectional discrimination are not necessarily the same as those for successful longitudinal evaluation, and when functional scales are to be used as treatment outcome measures, it is essential to know how well they can detect small but important clinical changes, i.e. how “responsive” they are. This information is essential not only for clinical decision-making, but also for the determination of sample size in clinical trials, to ensure that they are adequately powered to detect a difference between treatments if one is present.
Previous studies have used “effect sizes” to examine the responsiveness of the Oswestry Disability Index to surgical treatment . However, the effect size, i.e. the mean change-score for a group of patients divided by the standard deviation of all the change-scores, predominantly depicts the overall group response; a more complete picture of the responsiveness of an outcome measure on an individual basis is obtained with the use of receiver operating characteristics (ROC). The ROC approach assesses how successfully a given change-score can discriminate between patients who improved and those who did not improve as a result of any given treatment . In this way, both sensitivity and specificity to change for a range of possible cut-off change-scores can be calculated.
The present study examined the responsiveness of a German version of the Oswestry disability index, as compared with that of the Roland Morris Disability Score [14, 28] and the visual analogue scale for pain intensity, in a group of Swiss patients undergoing spine surgery.
The ODI version 2.1 (the English version of which is reprinted in full in ) is a self-administered questionnaire, which comprises ten items to assess the extent of the patient’s back pain and difficulty in carrying out nine different activities of daily life: personal care, lifting, walking, sitting, standing, sleeping, sex life, social life, and travelling. The questionnaire is completed in reference to the patient’s functional status “today”. Each item is scored from 0 to 5, with higher values representing greater disability. The total score is multiplied by 2, and normally expressed as a percentage (in the present study this percentage will simply be referred to as “the ODI score” and discussed in terms of points (0–100), to avoid confusion when discussing percentage changes in the score (as a mathematical expression) following surgery).
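The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (ten items, each scored 0 to 5, with the total multiplied by 2); the function name `odi_score` is hypothetical, and the handling of unanswered items is not covered in this passage, so the sketch simply requires all ten items.

```python
def odi_score(item_scores):
    """Return the ODI score in points (0-100) from ten item scores (each 0-5).

    Assumption: all ten items were answered; missing items are not handled.
    """
    if len(item_scores) != 10:
        raise ValueError("expected ten item scores")
    if any(not 0 <= s <= 5 for s in item_scores):
        raise ValueError("each item must be scored from 0 to 5")
    # The total of the ten items (0-50) is multiplied by 2 to give 0-100 points.
    return 2 * sum(item_scores)


# Hypothetical patient: item total of 15 gives an ODI score of 30 points.
print(odi_score([1, 2, 3, 0, 1, 2, 3, 0, 1, 2]))
```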
The cross-cultural adaptation, reliability and validity of the German version of the ODI version 2.1 are described in detail in Mannion et al. .
Sixty-eight patients with low back pain (LBP) agreed to take part in the study. All had been referred to the hospital’s Spine Unit for surgery in connection with spinal stenosis, herniated disc, failed back, spondylolisthesis, or degenerative disease with chronic LBP. The patients completed a baseline questionnaire (see below), sent to them by post approximately 2–3 weeks before their operation. Sixty-three underwent the planned surgery (mainly decompression, fusion, metal removal, or a combination of these), and 57 of these (90%) completed a second questionnaire 6 months after the operation. There were 31 women and 26 men, with a mean age of 53.2 (14.6) years.
The patients completed a questionnaire booklet containing the German version of the ODI , 0–10 visual analogue scales for back/leg pain intensity in the last week (VASpain) and for general health (VAShealth), and a German version of the Roland Morris (RM) disability questionnaire (validated by Exner and Keel ). The RM enquires as to whether back pain hinders the performance of 24 activities of daily living (today), each with possible responses of “yes” and “no”; the RM score ranges from 0 to 24 points. At follow-up, the questionnaire booklet also contained the following items: two Likert scale questions enquiring how the patient’s (1) back/leg pain and (2) disability in everyday activities had changed compared with the time before the operation (in each case, 6 categories from “now free of complaints/problems” to “now worse”); a question about how much the operation had helped (5 categories from “helped a lot” to “made things worse”); and a question enquiring as to whether, with his/her current knowledge of the result, the patient would make the same decision to undergo surgery if he/she found himself in the same situation as before the operation (“yes”/“no”).
The study was approved by the local ethics committee.
Paired t-tests were used to examine the significance of the change in group mean scores for each instrument, from pre-surgery to 6 months post-surgery. The effect size for each instrument was calculated by taking the mean of the individual change scores and dividing this by the corresponding standard deviation of these change scores . The effect size was also calculated for each instrument in relation to the five categories of the global outcome question, “did the operation help?” Examination of the correlation between the instrument change-scores and the (ordinal) global outcome scale gave a further indication of responsiveness . The sensitivity and specificity of each instrument, relative to patient global outcome, was examined using the receiver operating characteristic (ROC) method . It has been suggested that instrument responsiveness can be considered analogous to evaluating a diagnostic test, in which the instrument is the diagnostic test and the global outcome represents the gold standard . The ROC curve synthesises information on sensitivity and specificity for detecting improvement according to some dichotomised, external criterion. It consists of a plot of “true-positive rate” (sensitivity) versus “false positive rate” (1-specificity) for each of several possible cut-off points in change score . Thus, sensitivity and specificity are calculated for a change score of 1 point, 2 points, and so on. The five global outcome categories for the question “how much did the operation help?” were collapsed to provide a dichotomous outcome variable: “good outcome” (included “helped a lot” and “helped”) and “poor outcome” (included “only helped a little”, “didn’t help”, “made things worse”). (As most of the patients were undergoing elective surgery, we felt that the overall result “only helped a little” should be categorised as a poor outcome.) 
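As a minimal sketch of two of the steps described above, the following Python computes the effect size as the mean of the individual change scores divided by their standard deviation, and collapses the five global outcome categories into “good” and “poor”. The function names and the example scores are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which form was used.

```python
from statistics import mean, stdev

# Categories counted as a "good" outcome, as described in the text.
GOOD_CATEGORIES = {"helped a lot", "helped"}


def effect_size(pre_scores, post_scores):
    """Mean of the individual change scores divided by their SD.

    Change is defined as pre minus post, so a positive value means improvement.
    Assumption: sample standard deviation (n - 1 denominator).
    """
    changes = [pre - post for pre, post in zip(pre_scores, post_scores)]
    return mean(changes) / stdev(changes)


def is_good_outcome(category):
    """Dichotomise the five-category global outcome rating."""
    return category in GOOD_CATEGORIES


# Hypothetical ODI scores for four patients before and 6 months after surgery.
pre = [50, 40, 60, 30]
post = [30, 35, 40, 20]
print(round(effect_size(pre, post), 2))
print(is_good_outcome("only helped a little"))
```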
The area under the ROC curve (ROCarea) was interpreted as the probability of correctly discriminating between patients with a “good” and a “poor” outcome, based on the change in instrument scores (examined for ODI, RM and VASpain). The ROCarea can range from 0.5 (no accuracy in discriminating) to 1.0 (perfect accuracy in discriminating). The ROC curve was used to indicate the cut-off change-score for distinguishing between “good” and “poor” outcomes , using the approach of minimising “errors” (equivalent to maximising the sum of the specificity and sensitivity) .
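The ROC quantities described above can be illustrated with a short, dependency-free Python sketch. It is not the software used in the study; the function names and the example data are hypothetical. The area is computed as the proportion of “good”/“poor” patient pairs in which the “good” patient has the larger change score (ties counted as one half), which corresponds to the probability interpretation given in the text, and the cut-off is chosen by maximising sensitivity plus specificity.

```python
def sens_spec(changes, good, cutoff):
    """Sensitivity and specificity when 'improved' means change >= cutoff."""
    tp = sum(1 for c, g in zip(changes, good) if g and c >= cutoff)
    fn = sum(1 for c, g in zip(changes, good) if g and c < cutoff)
    tn = sum(1 for c, g in zip(changes, good) if not g and c < cutoff)
    fp = sum(1 for c, g in zip(changes, good) if not g and c >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def roc_area(changes, good):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [c for c, g in zip(changes, good) if g]
    neg = [c for c, g in zip(changes, good) if not g]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_cutoff(changes, good):
    """Observed change score that maximises sensitivity + specificity.

    If several cut-offs tie, the smallest is returned.
    """
    candidates = sorted(set(changes))
    return max(candidates, key=lambda c: sum(sens_spec(changes, good, c)))


# Hypothetical change scores (pre minus post) and dichotomised outcomes.
changes = [20, 15, 12, 11, 8, 10, 2, 0]
good = [True, True, True, True, True, False, False, False]
print(roc_area(changes, good))
print(best_cutoff(changes, good), sens_spec(changes, good, 11))
```

Because tied cut-offs are resolved here by taking the smallest, a fuller implementation might instead report the whole range of tied values.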
Statistical significance was accepted at the P<0.05 level.
The mean scores for ODI, RM, VASpain and VAShealth before and 6 months after surgery are shown in Table 1. Each of the disability scores (ODI and RM) showed a significant reduction of 30–35% 6 months after surgery (P<0.001), and the changes in scores correlated highly significantly with each other (Fig. 1). The VASpain showed a reduction of 43% 6 months after surgery (P<0.001) and VAShealth improved by about 22% (P=0.02).
Considering the whole group data, the effect sizes were similar for the two disability questionnaires (ODI, 0.84; RM, 0.90) and were both somewhat lower than that of VASpain (1.07). As expected, the effect size for VAShealth (0.29) was considerably smaller than that for any of the condition-specific measures. (The VAShealth measures were not considered in any further analyses.)
Six months after surgery, 40% of patients reported that the operation “helped a lot”, 26% that it “helped”, 14% that it “only helped a little”, 18% that it “didn’t help” and 2% that it “made things worse”. There was a highly significant correlation between these “global outcome” ratings and the Likert-scale ratings of perceived improvement in disability (Spearman’s ρ=0.83, P<0.001) and perceived improvement in pain (Spearman’s ρ=0.82, P<0.001). This indicated that the “global outcome” categories, themselves, had good construct validity in relation to changes in perceived pain and disability.
The change-scores for ODI and RM each showed a significant correlation with the global outcome categories when the latter were expressed as ordinal data (scale of 1–5): ODI Spearman’s ρ=0.69, P<0.001; RM Spearman’s ρ=0.67, P<0.001.
Table 2 shows the mean change-scores and effect sizes (pre-surgery to 6 months after surgery) for the ODI, RM and VASpain for each of the five global outcome categories. For each instrument, the mean change-scores differed between the outcome categories, though not always statistically significantly (NB the group sizes for the poor outcome categories were generally quite small). A significant difference in the score-change between patients who reported that the operation “helped a lot” and all less favourable outcomes was observed for each instrument.
The difference in the mean score-change between the categories “didn’t help” and “helped” was 10 points for the ODI, 3.6 points for the RM and 2.1 points for the 0–10 VASpain.
When the five-category global outcome ratings were dichotomised (see Statistical analysis above), the majority of patients (66%) reported a “good” outcome (“poor” outcome, 34%). As expected, the effect size statistics in the “good” group were significantly greater than those in the “poor” group for each outcome measure (Table 2). Thus, all three instruments showed good sensitivity to change. For the two disability questionnaires, ODI and RM, the effect size statistics for the “good” outcome group were similar (both around 1.3), and both were somewhat lower than that of VASpain (1.6). The difference in the mean ODI change score between the “good” and “poor” categories was approximately 20 points.
Both disability questionnaires showed good specificity, i.e. the effect size for the patients in the “poor” global outcome group was minimal (Table 2). In contrast, VASpain showed a moderate effect size of 0.50 for the “poor” outcome group; even for the sub-category “operation didn’t help”, the effect size for VASpain was 0.53. This indicates that some patients who had not improved according to their global outcome category had still shown a moderate improvement in relation to pain intensity, suggesting that the VASpain is less specific to change than the two disability questionnaires.
Using the dichotomised global outcome as the “external criterion”, the ROC curves for the change-scores for ODI, RM and VASpain were each far to the left above the diagonal, indicating that each had some discriminative ability. The ROCareas for ODI, RM and VASpain were 0.85 (SEM 0.06), 0.84 (SEM 0.05) and 0.88 (SEM 0.05), respectively.
When the change-scores were expressed as a percentage of their baseline value, the areas under the ROC curves were even higher [0.90 (SEM 0.04), 0.86 (SEM 0.05), 0.92 (SEM 0.04) for ODI, RM and VASpain, respectively] (Fig. 2).
Very similar results were obtained when, instead of using the “global outcome rating”, dichotomous categories formed by collapsing the 6-category Likert scales of the degree of improvement in pain and in disability were used (data not shown).
Assuming equivalent importance for false-positive and false-negative errors, the absolute change-scores with the best cut-off points for predicting global outcome (“good”/“poor”) were calculated. Different cut-offs sometimes gave the same optimised product of sensitivity and specificity, and in these instances the range is given. The cut-offs were approximately 11 points for ODI (83.8% sensitivity, 84.2% specificity), 1.5 points for RM (83.8% sensitivity, 73.7% specificity) and 1.5–2.8 points for VASpain (76.3–81.6% sensitivity, 73.7–89.5% specificity). The corresponding cut-offs for the percentage score reduction from baseline were 18% for ODI (91.7% sensitivity, 84.2% specificity), 8% for RM (88.9% sensitivity, 73.7% specificity) and 32% for VASpain (86.5% sensitivity, 94.7% specificity). These change-scores can be considered to represent the minimal clinically significant change, at the level of the individual patient.
In the present study, the responsiveness of the ODI, determined using the various recommended statistical methods , was confirmed in a group of LBP patients undergoing spinal surgery. The difference in the mean ODI change score between the global outcome categories “good” and “poor” was approximately 20 points. This is higher than the score of 10 points previously reported by Hagg et al.  for the difference in ODI change-score between patients who showed “improvement” and those who showed “no relevant change” after surgery. However, in the present study, the global category “good” included not only those patients for whom the operation “helped”, but also those who reported that the operation “helped a lot” (i.e. more than just “improved”). When the difference between the narrower categories “operation helped” and “operation didn’t help” was examined (analogous to the analysis carried out by Hagg et al. ), then a similar mean ODI change-score (10 points) to that of Hagg et al.  was obtained. In the present study, no minimal clinically relevant difference for “worsening of the condition” could be calculated, as too few patients declared that the operation “made things worse”.
Demonstrating that post-treatment scores are significantly different from pre-treatment scores and that the change-scores are greater in an “improved” group than in a “no change” group addresses the sensitivity to change of the scale, but not its specificity [4, 12]. The concept of specificity to change is also important, since changes without clinical relevance may occur in function scale scores. For example, in the present study, the change-score for the VASpain was very high in the “good” outcome group (4.2 points; effect size 1.6) but was also moderately high in the “poor” outcome group (0.7 points; effect size 0.50), indicating that a number of patients who were not improved according to the global outcome criterion still decreased appreciably in their pain score.
In order to better quantify the responsiveness of the ODI, the ROC method was used. The area under the ROC was 0.85 for ODI and 0.84 for RM. These values are generally somewhat higher than those previously reported in the literature for acute or chronic LBP patients undergoing conservative treatment (ODI: 0.76 , 0.78 , 0.94 , 0.78 ; RM: 0.79 , 0.93 , 0.77 ). Slight differences between studies may be the result of the questionnaire version used (e.g. Beurskens et al.  used an older version of the ODI), or the differing LBP populations and treatment strategies investigated (e.g. acute versus chronic LBP; conservative versus surgical patients).
The patients showed quite wide-ranging disability scores at baseline, and we therefore considered it of interest to examine whether the responsiveness of the instruments improved when, instead of absolute change scores, relative change scores [i.e. (pre-surgery score − 6-month score)/pre-surgery score] were used as the “discriminating variable” in the ROC analysis. For each of the three instruments (ODI, RM and VASpain), the areas under the ROC curve were even higher (0.90, 0.86, and 0.92, respectively) when the relative scores were used. Thus, we tentatively suggest that it may be more appropriate to discuss the cut-off scores for indicating “improvement” (as determined from ROC curves) in terms of the percentage change-score from baseline. Although percentages of change scores are not recommended for use in the statistical analysis of outcome in clinical trials, they may be of some practical use for the calculation of sample size for such trials, especially when populations with differing baseline scores are being investigated: using the percentage of change value, one can calculate the corresponding absolute score-change required to be considered as “improvement” in relation to the expected baseline scores for the given population. This absolute value can then be used in the subsequent power calculations. For the ODI, a “good” global outcome was predicted (with 92% sensitivity and 84% specificity) by a change in ODI score greater than or equal to an 18% reduction from baseline values. In relation to the mean pre-surgery ODI value in the present study (45 points), this is equivalent to an approximate 8-point reduction. The ROC analysis done using the absolute change-score revealed a cut-off for a good outcome of 11 points. Interestingly, both of these cut-off values are somewhat higher than the previously reported values for conservatively treated acute or chronic LBP patients of 4–6 points and 6 points.
The precise value may depend on the patient group and treatment under investigation: in less disabled patients, changes of up to 6 points may represent a similar percentage reduction from baseline to that reported for the patients in the present study.
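The conversion between a percentage cut-off and an absolute score-change can be illustrated with a small Python sketch. The function names are hypothetical; the 18% cut-off and the 45-point mean baseline are taken from the text, whereas the 30-point baseline is an invented example of a less disabled population.

```python
def relative_change(pre_score, post_score):
    """Relative change: (pre-surgery score - follow-up score) / pre-surgery score."""
    if pre_score == 0:
        raise ValueError("relative change is undefined for a baseline of 0")
    return (pre_score - post_score) / pre_score


def absolute_cutoff(expected_baseline, percent_cutoff=18.0):
    """Absolute score reduction corresponding to a percentage cut-off."""
    return expected_baseline * percent_cutoff / 100.0


# Mean baseline in the present study (45 points): about an 8-point reduction.
print(round(absolute_cutoff(45), 1))
# Hypothetical, less disabled population with a mean baseline of 30 points.
print(round(absolute_cutoff(30), 1))
# A hypothetical patient going from 50 to 41 points shows an 18% reduction.
print(round(relative_change(50, 41), 2))
```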
An individual change-score of 8–11 points lies relatively close to minimal detectable change (MDC95%) for the ODI (9 points; Mannion et al. ). This is the value required to detect (with 95% confidence) real individual change over and above measurement error . Nonetheless, in clinical practice, the 95% confidence level may be too strict for governing the presence of real individual change: with a standard error of measurement (SEM) of 3.4 points for the ODI , a score-change of approximately 7 points (2×SEM) could still be considered “real change” with a 92% confidence level, or of 5 points (1.5×SEM) with an 86% confidence level .
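The figures quoted above can be approximately reproduced with the following sketch, which rests on two assumptions that are not spelled out in the text: that the minimal detectable change is calculated with the commonly used formula z × √2 × SEM, and that the confidence levels quoted for 2×SEM and 1.5×SEM correspond to a one-sided normal probability. The function names are hypothetical.

```python
import math


def minimal_detectable_change(sem, z=1.96):
    """MDC for a change score, assuming the formula z * sqrt(2) * SEM."""
    return z * math.sqrt(2) * sem


def confidence_of_real_change(change, sem):
    """One-sided normal confidence that a change exceeds measurement error.

    Assumption: the change score has a standard error of sqrt(2) * SEM.
    """
    z = change / (math.sqrt(2) * sem)
    # Standard normal cumulative distribution function via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))


sem = 3.4  # standard error of measurement for the ODI, as given in the text
print(round(minimal_detectable_change(sem), 1))          # close to 9 points
print(round(confidence_of_real_change(2.0 * sem, sem), 2))
print(round(confidence_of_real_change(1.5 * sem, sem), 2))
```

Under these assumptions the two confidence values come out at roughly 0.92 and 0.86, in line with the levels quoted in the text.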
Clinically relevant group mean changes in an outcome instrument appear to be somewhat more difficult to define, and are not (directly) determined by the same factors as those governing clinically relevant individual change [18, 30]. As regards the ODI, the clinically relevant group mean change is likely to be considerably lower than 10 points; indeed, previous studies have suggested that differences in group mean scores as low as 4 points can carry clinical significance . Perhaps power calculations for clinical trials in low back pain research should be based on the proportion of individuals who are expected to achieve a clinically relevant change-score rather than on the expected (and difficult to ascertain) clinically relevant group mean change; this might lead to more relevant findings, although the necessary sample sizes for the trials would undoubtedly increase [5, 10].
It is important to point out that both strategies used to assess an instrument’s responsiveness (effect sizes and the ROC method) depend on some external criterion for rating “improvement”; further, to perform the ROC analysis, this criterion must be dichotomous. However, there exists no “gold standard” for assessing outcome and, in reality, there are often more than two grades of improvement that can be considered to carry clinical relevance. In the present study, the five-category Likert scale for “how much the operation helped” was collapsed into a dichotomous variable for “good” and “poor” outcome to provide the external criterion for use in the ROC analyses. Although we do not suggest that this measure constitutes a definitive gold standard for assessing outcome, it can at least be expected to reflect the most important changes to the individual patient elicited by the operation. The construct validity of this global outcome scale appeared to be satisfactory: it showed highly significant associations with each of the two Likert scale ratings for improvement in disability and in pain, and when the effect size/ROC analyses were carried out using improvement in disability or pain as the external criteria, the results were largely consistent with those obtained using the global outcome scale. In the absence of a true gold standard, the best one can do is ensure construct validity of the criterion that is ultimately chosen for use. Further, as highlighted by Beurskens et al., most people would be reluctant to label patients as improved or worse contrary to their personal rating of the global effect of treatment. An alternative may have been to use the answer to the question “if you found yourself in the same situation as before the operation, would you make the same decision to undergo surgery, with your current knowledge of the result?” (yes/no). However, although it has been used as a main outcome measure in other retrospective studies (e.g. ), our experience with this question has indicated that it is a confusing construct for some patients to understand. Some report no change in (or even a worsening of) symptoms or disability, but still tend to say “yes”, as if perhaps interpreting the question to be an enquiry regarding their propensity to think that “everything’s worth a try” as opposed to a direct evaluation of their perceived outcome after the intervention received. Further, people sometimes simply don’t like to consider that they “made a wrong decision” and therefore answer “yes” regardless of the outcome, to avoid being confronted with feelings of regret or self-blame. As such, and in keeping with the methodology used by previous authors, we consider that collapsing the five-category Likert scale into a dichotomous variable provides the more accurate representation of global outcome.
Our studies on the responsiveness of the German version of the ODI version 2.1 are the first to address both the sensitivity and specificity of ODI change-scores in categorising outcome after spinal surgery and to provide cut-off scores for interpreting meaningful clinical change. A good global outcome was predicted (with 92% sensitivity and 84% specificity) by a change in ODI score greater than or equal to an 18% reduction of the individual’s baseline value.
The authors would like to thank Gordana Balaban, Simon Smit and Katrin Knecht for the administration of the questionnaires. The study was funded by the Schulthess Klinik Research Funds.
Part 1 of this article can be found at http://dx.doi.org/10.1007/s00586-004-0815-0