A proof of concept case study.
To introduce and evaluate a method for identifying what constitutes a minimal clinically important difference (MCID) in the SF-36 Physical Function scale at the patient level.
MCID has become increasingly important to researchers interested in evaluating patient care. Over the last 30 years, an array of approaches for assessing MCID has evolved with little consensus on which approach applies in any given situation.
Three approaches for estimating standard errors of measurement (se) and a 30% change approach for establishing MCID were evaluated for the PF scale with SPORT patients in the IDH cohort. MCIDs for each se approach were then developed based on 1) these standard errors and 2) clinically relevant factors including: a) baseline PF score, and b) acceptable risk for type I error.
IDH patients (N=996) identified from the SPORT database met inclusion criteria. The se for the CTT-based test level approach was 9.66. CTT-score-level and IRT-pattern-level standard errors varied depending on the score, and ranged from (2.73–7.17) and (5.96–16.2), respectively. As predicted, CTT-score-level se values were much smaller than IRT-pattern-level se values at the extreme scores and IRT-pattern-level se values were slightly smaller than CTT score-level se values in the middle of the distribution. Across follow-up intervals, the CTT-score-based approach consistently demonstrated greater sensitivity for identifying patients who were Improved or Worsened. Comparisons of CTT-based-score-level se and 30% improvement rule MCID estimates were as hypothesized: MCID values for 30%-gains demonstrated substantially lower sensitivity to change for baseline PF scores in the 0–50 range but were similar to CTT-score-level-based MCIDs when baseline scores were above 50.
The CTT-based score-level approach, which establishes MCID from the clinical relevance of the baseline PF score and the tolerance for erroneously accepting an observed change as reliable, provided the more sensitive and theoretically compelling approach for estimating MCID at the patient level. This, in turn, will provide fundamentally important information to the clinician regarding treatment efficacy at the patient level.
The concept of the Minimal Clinically Important Difference (MCID) has been studied for almost 30 years,1,2,3,4,5,6,7 yet its implementation and integration into the fields of study have generally not been very successful. When a body of research has reached this level of maturity without being widely adopted, researchers in the field need to question the viability of their endeavors. Perhaps the time has come to reframe the problem.
MCID has two basic components: 1) evaluation of the magnitude of change, for which a wide variety of options have been suggested; and 2) clinical implications or importance of that change, or lack thereof, which is dependent on treatment and timing issues related to patient and/or clinician expectations. These clinical factors have not been systematically studied in conjunction with the evaluation of the magnitude of change. Further, the will to integrate these two concepts does not seem strong. The logical approach would be to disentangle these two issues by calling the magnitude of change one thing and the clinical implications of change another.
Recently, Dhawn et al 8 introduced the notion of minimally detectable measurement difference (MDMD) to describe what others have called standard errors of measurement-based approaches for assessing MCID.9–11 The difference between MDMD and MCID is that MDMD simply seeks to determine if the change is reliable, meaning larger than can be explained by measurement error. Education and psychology have a rich literature in this area, generally under the key words “reliable change”.12–14 In contrast, MCID attempts to determine if the change is “Important” or “Clinically Relevant.” Thus, MDMD or reliable change is a necessary but not a sufficient condition for establishing MCID.
The issue then becomes, what should constitute the criteria for determining if the change that is reliably measurable (MDMD) is also clinically important (MCID)? The proposed method of establishing MCID combines the notion of MDMD with the clinical relevance of a given change considering: 1) a patient factor (i.e., patient baseline level on the outcome of interest); and 2) a clinician judgment factor, namely the clinician’s risk tolerance for judging an observed change as clinically important when, in fact, it is not.
This study uses the SF-36 Physical Function score from SPORT patients diagnosed with an intervertebral disc herniation. The methods used to select this sample have been described in detail by Birkmeyer.15 Patients with complete scores at baseline, 6-week, 3-month, 6-month and 1-year follow-ups were evaluated.
The standard error of measurement is the index that estimates the consequence of the lack of perfect reliability within the metric of the survey score. Thus, it is important for the clinician to understand that: 1) all instruments come with a standard error particular to that population responding to the instrument; and 2) that error must be taken into account when interpreting the change in any given patient’s score.
The CTT-based standard error of measurement for an instrument is defined by Equation 1:

$$se_{GL} = S_X \sqrt{1 - r_{XX}} \qquad (1)$$

where se-GL (GL = Group level) is the standard error of measurement for the survey instrument as a whole, or at the group level, based on classical test theory, rXX is an estimate of the stability of the trait being assessed by the instrument, and SX is the standard deviation of the scores obtained by administering the tool to a sample from the cohort of interest.
Although Equation 1 is the most commonly used definition of se, Standard 2.10 in the 1985 Standards for Educational and Psychological Testing calls for reporting standard errors at the score level.16 CTT-based score-level standard errors (se-SL) were computed using a modification of the Thorndike17 method proposed by Feldt and Qualls.18 The details of these procedures as applied to these data are provided in Appendix A.
In contrast, Item Response Theory (IRT) derives its standard error of measurement (se-IRT) from the pattern of responses across the items.19 Each unique pattern results in a unique standard error of measurement. Bjorner20 provides a basic “clinician-friendly” primer for IRT within a medical context; a more extensive, but not overly mathematical, view is provided by Embretson.19
CTT and IRT methods were each used to determine PF scores and the standard error of measurement associated with that score for the 10-item physical function subscale of the SF-36. Under CTT methods, the ability estimate was determined using the standard scoring algorithm from the SF-36 version 1.0. The standard deviation was estimated across baseline and all follow-ups and test-retest reliability was set at .89. The group level standard error was then computed using Equation 1.
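As a rough illustration of the group-level computation, the sketch below applies the standard CTT formula, se = S_X × sqrt(1 − r_XX), which matches the symbols described for Equation 1. The reliability of .89 comes from the text. The pooled standard deviation of 29.13 is an assumed value, chosen here only so that the result lands near the reported 9.66, and the function name is hypothetical.

```python
import math


def group_level_se(sd_x: float, r_xx: float) -> float:
    """Classical test theory standard error of measurement: S_X * sqrt(1 - r_XX)."""
    if not 0.0 <= r_xx <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd_x * math.sqrt(1.0 - r_xx)


# r_xx = .89 is taken from the text; sd_x = 29.13 is an ASSUMED pooled SD,
# not a value reported in the study.
se_gl = group_level_se(sd_x=29.13, r_xx=0.89)
print(round(se_gl, 2))
```

Because the group-level value depends only on the sample standard deviation and the reliability estimate, it is the same for every patient regardless of score.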
Under IRT-methods the 2-parameter generalized partial credit model (G-PCM)21 was used to estimate each patient’s latent or underlying PF ability level (by convention symbolized by the Greek letter theta - θ) using Parscale 4.1.21 Thetas, which are arbitrarily scaled to have mean 0 and standard deviation 1, were linearly transformed to have the same mean and standard deviation as the CTT-based scores to facilitate comparison of IRT- and CTT-based ability estimates and standard errors of measurement. Person-level standard errors of measurement were estimated using the Bayesian EAP (expected a posteriori estimation) method.22 Readers who wish more details regarding IRT methods are referred to an article by Bjorner et al,23 which provides a detailed summary of the polytomous IRT approach written for a clinical audience.
Charters and Feldt32 provide a compelling argument that establishing confidence intervals around the observed score X remains the most appropriate approach for evaluating the potential distinction between the observed score (X) and the underlying true score (T), though others have been suggested. 24–31
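A minimal sketch of a confidence interval around an observed score is given below. It assumes normally distributed measurement error and a two-sided interval, neither of which is stated explicitly in the text; the observed score of 55 is an arbitrary example and the function name is hypothetical. The se of 9.66 is the group-level value reported in this study.

```python
from statistics import NormalDist


def score_confidence_interval(x: float, se: float, alpha: float = 0.05):
    """Two-sided (1 - alpha) interval around an observed score, assuming normal error."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # e.g. about 1.96 for alpha = .05
    return x - z * se, x + z * se


# Observed score of 55 is an arbitrary illustration; se = 9.66 is the reported se-GL.
low, high = score_confidence_interval(x=55.0, se=9.66, alpha=0.05)
print(round(low, 1), round(high, 1))
```

Raising alpha (accepting more risk of a type I error) narrows the interval, which is how a clinician's risk tolerance can enter the judgment.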
In this study, the X of interest is the difference between the patient’s reported PF score at follow-up and at baseline (DeltaPF = PF_follow-up − PF_baseline). Evaluating the magnitude of DeltaPF involves two steps:
Evaluating clinical relevance requires simultaneously considering the nature of the scale and the importance of the observed clinical change for a particular patient. First, one needs to understand what change means on the instrument. In the case of the PF scale, the responses to the 10 3-choice items making up the scale are summed and then transformed to range from 0 to 100 by 5s. This means that the granularity of the rescaled PF scale is 5 points. In other words, a single change of one level on a single item will change the PF score by 5 points.
Second, one needs to understand the meaning of change relative to the precision of the instrument, which is reflected by the standard error of measurement. Since the granularity of the PF is 5, if the standard error of measurement is 10, then the granularity-to-se ratio (i.e., 5/10 or .5) provides a basic sensitivity scale. On this sensitivity scale, values less than 1 are generally considered weak evidence for scale sensitivity, meaning that one unit of change on the scale (5 points on the PF scale) represents a small amount relative to the precision of the scale.
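The scoring and sensitivity ideas above can be sketched as follows. The item coding of 1 to 3 is an assumption about how the ten 3-choice items are stored, and both function names are hypothetical; the 0 to 100 rescaling in steps of 5 and the 5/10 ratio follow the text.

```python
def pf_score(item_responses):
    """Rescale ten 3-choice items (ASSUMED coding 1-3) to the 0-100 PF metric."""
    if len(item_responses) != 10 or any(r not in (1, 2, 3) for r in item_responses):
        raise ValueError("expected ten responses coded 1, 2, or 3")
    raw = sum(item_responses)       # raw sum ranges from 10 to 30
    return (raw - 10) * 100 / 20    # each raw point is worth 5 PF points


def granularity_to_se_ratio(granularity: float, se: float) -> float:
    """Size of one scale step relative to the standard error of measurement."""
    return granularity / se


baseline = pf_score([1, 2, 1, 2, 1, 2, 1, 2, 1, 2])  # raw sum 15
one_step = pf_score([2, 2, 1, 2, 1, 2, 1, 2, 1, 2])  # one item moved up one level
print(one_step - baseline)                 # size of a single-level change
print(granularity_to_se_ratio(5, 10))      # the ratio used as an example in the text
```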
For the sake of simplicity, let’s limit our discussion to a patient’s baseline PF score. Average PF scores for surgical candidates for IDH at baseline typically range from 30–40.33 In this IDH SPORT sample, the average baseline PF score was 35.8 with a standard deviation of 24.1. Such a high standard deviation indicates a great deal of variability in the sample. Thus, the mean score for the group may not be reflective of the PF levels of many patients in the group.
Given the large variability in the sample, at the patient level a relatively large number of patients are likely to have low scores (0–25) and many will have high scores (85–100). At the patient level, the standard error of measurement can be thought of for a single patient’s score the same way that the standard deviation of the mean can be thought of for a group’s average score. If the standard error of measurement is large, our confidence that the observed score for that patient is precise is low; if the error is small, our confidence that the observed score is precise is high.
For extreme scores (both low and high) it should be clinically interesting to detect any change at follow-up. For low and high scores, any worsening is likely to reflect important loss of function and any improvement is likely to reflect important function gain. In the middle of the distribution, small changes are more likely to reflect changes on multiple items, with some items showing gain and others worsening. Thus, the importance of a small net gain or loss when PF scores are in the middle of the distribution is more difficult to evaluate based on change alone. It is for this reason that score-level standard errors of measurement should be preferred to test-level and IRT-based standard errors of measurement: theory dictates that group-level standard errors of measurement are constant across all scores, and IRT-based standard errors of measurement tend to be larger for extreme scores than for scores in the middle of the distribution. In contrast, score-level standard errors of measurement in well-constructed scales tend to be smaller for extreme scores than for scores in the middle of the distribution.
From this logic, clinical importance for detecting change for extreme scores on the PF scale seems clear. A clinically important change is a function of the: 1) size of the change (DeltaPF), 2) standard error of measurement associated with the change (se-D) and 3) clinical importance of detecting change, which here is defined by the baseline PF score. Again, considering the baseline PF score is important because it is difficult to know where you’re going if you don’t know where you started.
Figure 1 summarizes the different decision rules associated with making a clinical evaluation of whether or not DeltaPF is considered clinically important when baseline PF scores are in the 0–25, 30–80, or 85–100 range. When DeltaPF is positive and above the cut point the patient is judged to have demonstrated reliable and clinically important improvement. When DeltaPF is negative and below the cut point, the patient is judged to have demonstrated reliable and clinically important worsening. Otherwise, no reliable change has been demonstrated and the patient is judged to be the same.
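The three-way decision rule can be sketched as below. The cut-point values in the usage lines are placeholders rather than values from Table 3, the function name is hypothetical, and treating a change exactly equal to the cut point as meeting it is an assumption.

```python
def classify_change(delta_pf: float, improve_cut: float, worsen_cut: float) -> str:
    """Classify a PF change score as Improved, Worsened, or Same.

    improve_cut and worsen_cut are positive magnitudes; a change that
    equals the cut point is ASSUMED to meet it.
    """
    if delta_pf >= improve_cut:
        return "Improved"
    if delta_pf <= -worsen_cut:
        return "Worsened"
    return "Same"


# Placeholder cut points for illustration only (not taken from Table 3).
print(classify_change(delta_pf=15, improve_cut=10, worsen_cut=5))
print(classify_change(delta_pf=-5, improve_cut=10, worsen_cut=5))
print(classify_change(delta_pf=5, improve_cut=10, worsen_cut=5))
```

Keeping separate cut points for improvement and worsening lets the rule differ by baseline range, for example a smaller worsening threshold when the baseline score is already very low.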
It is hypothesized that:
The 30% change rule defines MCID as 30% change in the possible gain (for improvement) and 30% change in the possible loss (for worsening). Thus, for a PF baseline of 30, which has a possible gain of 70 points, the 30% rule would define MCID for improvement as 21 (i.e., .30 × 70) and MCID for worsening as 9 (.30 × 30). From this definition, improvement in low scores, where there is a lot of possible improvement, will be associated with large MCID values (e.g., 30 points if PF at baseline is 0), whereas improvement for large scores, where possible improvement is small, will be associated with small MCID values. Similarly, MCIDs for worsening will be small for low extreme scores and large for high extreme scores.
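The arithmetic of the 30% rule, as defined above, can be sketched in a few lines. The function name and the scale-bound parameters are hypothetical conveniences; rounding the result to the 5-point granularity of the PF scale is deliberately left out because the text does not specify how it was done.

```python
def thirty_percent_rule(baseline_pf: float, fraction: float = 0.30,
                        scale_min: float = 0, scale_max: float = 100):
    """Return (MCID for improvement, MCID for worsening) under the 30% rule."""
    possible_gain = scale_max - baseline_pf
    possible_loss = baseline_pf - scale_min
    return fraction * possible_gain, fraction * possible_loss


# Baseline of 30: possible gain 70, possible loss 30 (the example in the text).
mcid_gain, mcid_loss = thirty_percent_rule(30)
print(round(mcid_gain, 2), round(mcid_loss, 2))
```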
Based on this 30% gain rule approach for defining MCID, it is hypothesized that:
A total of 1411 patients diagnosed with intervertebral disc herniations (IDH) were identified from SPORT. Of these, 996 patients had complete data at baseline and across all follow-ups. The average age was 53.6 with standard deviation 16.0, and ranged from 18 to 92 years of age. A little over half of the patients (50.2%) were female.
Pearson correlations between CTT- and IRT-based PF scores were .976, .981, .980, .980, and .980 at baseline, 6-week, 3-month, 6-month, and 1-year follow-up, respectively. Table 1 summarizes average CTT- and IRT-based PF scores as well as change-from-baseline scores at each follow-up interval. The pattern of results was quite similar for the CTT- and IRT-based scoring. Overall, change from baseline was statistically significant at each follow-up interval, but the large standard deviations for these change scores indicate that change was quite variable at the patient level, as is clear from the magnitudes of the minimum and maximum scores at each follow-up interval.
Figure 2 summarizes the se-GL, se-SL and se-IRT based standard errors of measurement. As required by Equation 1, the se-GL value is constant at 9.66 across all score levels, and as would be expected, the se-SL values are smaller at the extremes and somewhat larger in the center of the distribution and the se-IRT values are larger at the extremes and smaller in the center of the distribution.
Estimating the standard error of measurement for a difference score requires some notion of the magnitude of the correlations between scores across time. The correlations summarized in Table 2 were used in conjunction with the baseline and follow-up standard errors of measurements (se-GL, se-SL, and se-IRT) to estimate se-D for each baseline to follow-up difference based on Equation 2.
Table 3 summarizes MCID cut points for determining Improved or Worsened change relative to baseline PF score and time to follow-up for each of the three se approaches. Table 4 summarizes the percentages of patients classified as Improved, Same, or Worse from baseline to 1-year follow-up based on the se-GL se-SL and se-IRT methods with the same rules for establishing risk tolerance across the methods. The heavy lines in Table 4 serve as a reminder of the cut points where risk tolerance was adjusted to reflect the clinician’s notion of clinical importance. As depicted in Table 4, when using the se-SL approach, at 1-year follow-up 74.4% of the 996 patients demonstrated clinically relevant improvement, 18.5% no change and 7.1% clinically relevant worsening. Clinically relevant improvement ranged from 81.2–89.7% when baseline PF was 0–35, from 54.8–70.5% when baseline PF was 40–80, and from 40–53.3% when baseline PF was 85–100. Results for the 6-week, 3-month and 6-month follow-ups were consistent with these 1-year follow-up results (data not presented).
Table 5 summarizes the consistency of classification, symmetry, and relative discrimination of the score-level and response-level standard error of measurement approaches for establishing MCID at all follow-up intervals. The weighted kappa estimates were all quite high, and percent agreements were all above 84%. Not surprisingly, given the similarity of the weighted kappas, the test for equal weighted kappas across the four follow-up times was not significant (p < .21). On the other hand, Bowker’s symmetry index (Figure 3) was also quite high, indicating that although the two methods agreed much of the time, when there was disagreement, the pattern of disagreement was not similar across the two methods. The percent asymmetry index, which is designed to estimate the relative discrimination of the two approaches, indicates that when the two methods disagreed, patients classified as Same by the IRT approach were more likely to be classified as either Improved or Worsened by the se-SL approach.
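For readers unfamiliar with these agreement statistics, the sketch below computes percent agreement and Bowker's symmetry statistic for a 3 × 3 cross-classification. The counts are made up for illustration and are not the study's data; the function names are hypothetical, and the weighted kappa and the percent asymmetry index are not reproduced here.

```python
def percent_agreement(table):
    """Percentage of patients placed in the same category by both methods."""
    total = sum(sum(row) for row in table)
    agree = sum(table[i][i] for i in range(len(table)))
    return 100.0 * agree / total


def bowker_statistic(table):
    """Bowker's test of symmetry: sum over i<j of (n_ij - n_ji)^2 / (n_ij + n_ji)."""
    k = len(table)
    stat = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            n = table[i][j] + table[j][i]
            if n > 0:  # skip empty off-diagonal pairs to avoid dividing by zero
                stat += (table[i][j] - table[j][i]) ** 2 / n
    df = k * (k - 1) // 2  # degrees of freedom when all pairs are non-empty
    return stat, df


# HYPOTHETICAL counts: rows = se-SL class, columns = se-IRT class,
# in the order Worsened, Same, Improved.
table = [
    [60, 12, 1],
    [2, 170, 3],
    [1, 40, 707],
]
stat, df = bowker_statistic(table)
print(round(percent_agreement(table), 1), round(stat, 2), df)
```

A large symmetry statistic alongside high agreement is the pattern described above: the methods mostly agree, but the disagreements fall mainly on one side of the diagonal.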
Table 6 summarizes the differences between the CTT-based score-level standard error approach for estimating MCID for improvement and for worsening vs. the 30% change rule.
This study compared classical test theory and item response theory approaches for scoring the physical function (PF) scale of the SF-36 and found that both scoring approaches provided very similar results. The pattern of mean scores indicated that, on average, the patients reported initially large and then incremental improvement in physical function from baseline. The high correlations, similar distributions, nearly identical point estimates (means) at each assessment, and similar change scores indicate that CTT- and IRT-based estimates of the SF-36 PF scale lead to similar interpretations at the group level. However, the PF score standard deviations were large, and at each time point at least some patients reported PF scores of 0 and 100, which indicated a wide diversity of PF scores at the patient level at baseline and at each follow-up assessment.
As clearly demonstrated in Figure 2, the expected patterns of results were consistent with theory. As expected, MCID values defining both Worsening and Improvement varied by baseline PF score and were typically smaller for the CTT score-level approach compared to the CTT group-level and the IRT response-level approach for extreme PF scores.
The hypothesis that the se-SL approach would be more sensitive to change at the extremes compared to the se-IRT approach, and that the se-IRT approach would be more sensitive to change when PF change scores were nearer the middle compared to the se-SL approach, was only partially supported. Overall, the se-SL approach consistently identified more patients as Improved or as Worsened than the se-GL and se-IRT approaches, regardless of the location of the PF score relative to its scale. When the se-IRT was as good as or better than the se-SL approach, the trend was for the baseline and follow-up PF scores to be near the center of the distribution.
Tables 4 and 5 provide clear evidence of the effects of grouping baseline PF scores and specifying different probability cut points when estimating MCID. When applying the se-SL approach, the threshold for establishing MCID was a 5-point loss when baseline PF scores were 5 or 10, and a 5-point gain when baseline PF scores were 90 or 95, thus providing maximum sensitivity when making clinical judgments at these extremes. In contrast, demonstrating worsening for baseline PF scores of 5 or 10 or improvement when baseline PF scores were 90 or 95 was not possible when applying the se-GL or se-IRT approaches.
The differences between the CTT-based score-level standard error approach for estimating MCID and the 30% improvement criterion suggested by Dworkin et al 6 and Ostelo et al 7 were generally as hypothesized. MCID values for 30%-gains demonstrated substantially lower sensitivity to change for baseline PF scores in the 0–50 range but were similar to CTT-score-level-based MCIDs when baseline scores were above 50. As expected, the reverse held true when evaluating the 30% loss rule: MCID values for 30%-losses demonstrated substantially lower sensitivity to change when baseline PF scores were at or above 50 but were similar to CTT-score-level-based MCIDs when baseline scores were below 50.
MCID values for gains clearly demonstrated substantially lower sensitivity to change for PF scores in the 0–50 range, where 30% improvement MCID values ranged from 15 to 25. In comparison, se-SL-based MCIDs for PF scores at these levels were 5 or 10. The MCID values estimated by the 30% gain approach continued to be larger than those from the score-level standard error approach until baseline PF scores reached 50. When baseline PF scores were 50–65, the two approaches for estimating MCID were similar. Contrary to expectation, the se-SL-based MCIDs for PF scores generally continued to be smaller than those for the 30% gain rule even when PF scores were 70 or greater. Only 6.25% (6/36) of the MCIDs were smaller for the 30% gain rule, and in each case, the difference was 5 points, the smallest difference possible. The same pattern of results was observed when comparing the se-SL-based MCID approach with the 30% loss rule.
Comparing the smaller MCID values based on the se-SL approach with those from the se-GL and se-IRT approaches compares “apples to apples”. Given that smaller standard error of measurement estimates result in smaller MCID values and, therefore, greater sensitivity in identifying change, are the smaller MCID values necessarily better? In the few cases where the 30% change rule produced smaller MCID values compared to the se-SL approach, the argument against the 30% change rule seems obvious. Why would a clinician feel comfortable judging a change to be important when the change is thought to be no greater than would be expected due to measurement error? The argument that the generally smaller MCID values observed for the se-SL approach represent a better way of estimating change depends on the clinician’s strength of belief that the cut points established for defining change reflect sound clinical judgment. As this is a new approach, the soundness of the suggested cut scores provided in this paper is strictly theoretical, and the validity of this framework can be evaluated empirically, just as others may question the validity of a 30% gain rule as opposed to a 10%, 20% or even 50% rule.
The clinical importance of change is undoubtedly related to a multitude of factors including, but not limited to: 1) patient baseline score; 2) individual factors associated with the patient such as age, gender, general health, health-related habits such as smoking, alcohol use, medications; and 3) personality factors such as their general mood or ability to comply with treatment requirements. Some would argue that baseline level might be a reasonable proxy for many of these factors. In fact, in the various SPORT papers, baseline status was often the single strongest predictor of follow-up performance. Furthermore, little if any research is currently available that provides a framework by which the clinician can integrate these various factors into a coherent and reproducible estimate of patient prognosis.
Standard errors of measurement are instrument- and cohort-specific. The pattern of results observed for the PF scale in the IDH cohort may not hold for other scales (e.g., Bodily Pain, Oswestry) or with other cohorts (e.g., Degenerative Spondylolisthesis or Spinal Stenosis). As demonstrated in this study, for the PF scale within the IDH cohort, standard errors of measurement and, as a consequence, MDMD and MCID estimates, can vary depending on the:
As a consequence, these study results are limited to the IDH cohort, the three methods of estimating se used in this study, the probability levels used to define MCID; and the time intervals from baseline specified (i.e., baseline to 6 weeks, 3 months, 6 months and 1 year).
Expansion of these methods to other scales commonly used for studying treatment efficacy and to the other two diagnostic cohorts enrolled in the SPORT trials would allow a first look at the possible generalizability of the se-SL approach for establishing MCID cut points across patient diagnostic categories.
Demonstrating that patient-level results based on MCID-related classification of patients’ change status (Worse, Same, Better) will provide more relevant guidance to the clinician when forming clinical judgments is an important next step. In these data, the average gain in PF from baseline to one year (from Table 1) was 31.04 with a standard deviation of 30.5 and change scores ranging from −90 to 100. These same data, interpreted at the patient level, suggest a different picture. From Table 4, when using the se-SL approach, overall 74.4% (741/996) showed improvement, 18.5% (184/996) no change and 7.1% (71/996) worsening. Furthermore, of the 50 patients whose baseline PF scores were high (85–100), 40% demonstrated improvement; of the 364 patients with baseline PF scores between 40–80, 62% demonstrated improvement; and of the 582 patients whose baseline PF scores were between 0–35, 85% demonstrated improvement. These results suggest that, notwithstanding the overall improvement rate of roughly 74% on the PF scale, patients are much less likely to experience this magnitude of success if their baseline PF scores are already high. Which information provides the clinician and the patient with a better understanding of the possible treatment outcomes for a given patient: the group-level mean, or the patient-level summary of Improved, Same, Worsened, broken down by baseline PF levels?
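The patient-level summary can be checked with simple arithmetic. The sketch below uses only the counts and subgroup rates quoted in the paragraph above; the variable names are hypothetical.

```python
# Counts at 1-year follow-up for the se-SL approach, as quoted from Table 4.
overall_counts = {"Improved": 741, "Same": 184, "Worsened": 71}
total = sum(overall_counts.values())
overall_pct = {label: round(100 * n / total, 1) for label, n in overall_counts.items()}
print(total, overall_pct)

# (patients in baseline band, proportion improved), as quoted in the text.
subgroups = {
    "0-35": (582, 0.85),
    "40-80": (364, 0.62),
    "85-100": (50, 0.40),
}
# Improved patients implied by the rounded subgroup rates.
implied_improved = sum(n * rate for n, rate in subgroups.values())
print(round(implied_improved))
```

The subgroup sizes add up to the full sample, and the improvement count implied by the rounded subgroup rates is within rounding error of the overall count, so the two summaries describe the same patients at different resolutions.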
As for implementing this patient level approach into clinical practice, a subset of Table 3 just showing the se-SL results would be sufficient to allow the practicing clinician to estimate the MCID for any given IDH patient at 6-weeks, 3-Months, 6-Months or 1 year.
This study was undertaken as a proof of concept for using a standard error of measurement approach in conjunction with clinical judgment based on clinically relevant criteria known at baseline to establish a general approach for establishing MCID. CTT-based score-level standard errors of measurement, in conjunction with cut score decisions based on the clinical relevance of the baseline PF score and the clinician’s risk tolerance, provided a more sensitive approach for estimating MCID compared to the CTT-based group-level or the IRT-based pattern-level approaches. Choosing a simpler approach, such as defining MCID as a 30% change, over the computational rigor of this approach represents a poor choice, as one is trading accuracy for ease of use, especially when estimating gain when baseline scores are below 50 or estimating loss when baseline scores are above 50. The future of this approach depends on demonstrating the value added when patient-level MCID information is used to help guide clinical decision-making.
Estimating Score-Level Standard Errors of Measurement using Classical Test Theory
Computing CTT-based score-level standard errors of measurement is a four-step process:
Table A1 provides all of the information needed to estimate the se-SL based standard error of measurement and evaluate the credibility of the parallelism of part tests 1 and 2. The similarity in the part 1 and 2 means, standard deviations, and coefficient alpha internal consistency reliability estimates, along with the high Pearson correlation between the two scores, was sufficient to consider these part tests as being reasonably parallel forms.
For example, a PF score of 55 could be split into two part tests with a part 1 score of 25 and a part 2 score of 30. If the overall means of the part 1 and part 2 tests were 27.3 and 28.8, then by Equation 3: $se_{SL} = \sqrt{[(25 - 30) - (27.3 - 28.8)]^2} = \sqrt{(-3.5)^2} = 3.5$.
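The arithmetic of this worked example can be reproduced as below, using the values that appear inside the worked expression (part scores of 25 and 30, part means of 27.3 and 28.8). The formula is inferred from that expression alone, the function name is hypothetical, and the sketch covers only this single-patient step, not the full four-step procedure.

```python
import math


def single_patient_se_sl(part1: float, part2: float,
                         mean1: float, mean2: float) -> float:
    """Square root of the squared difference between a patient's part-score
    difference and the difference between the part-test means (form INFERRED
    from the worked example)."""
    return math.sqrt(((part1 - part2) - (mean1 - mean2)) ** 2)


se_sl = single_patient_se_sl(part1=25, part2=30, mean1=27.3, mean2=28.8)
print(round(se_sl, 1))
```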