There is evidence of increasing use of cost-utility analysis to assess the relative value of alternative treatment interventions when resources are limited.[1, 2] To estimate Quality-Adjusted Life Years (QALYs) for the denominator of the incremental cost-effectiveness ratio (ICER), outcomes of treatment are measured using a single score, anchored at 0 for death and 1 for perfect health, and weighted for the relative desirability of the health state. Standards for economic evaluations recommend using societal values (utilities or preferences). The two main approaches to obtaining “societal health state values” include: 1) direct measurement of values for health states of a representative sample of the population using methods such as standard gamble, time tradeoff, and visual analogue scale ratings, and 2) indirect measurement using preference-weighted health state classification systems such as the Quality of Well Being Scale, the EuroQoL EQ-5D, [5, 6] the McMaster Health Utilities Index HUI,[7–9] or the SF-6D. [10, 11] In addition, methods have been developed to estimate health state values from existing HRQOL data, for example using regression models.
Preference-weighted health state classification systems are increasingly used in cost-utility analyses to estimate change in QALYs. Furthermore, they are increasingly used as measures of health outcome in clinical trials. Systems vary in their approaches to the design of each component: the descriptive system, preference measurement method, source of community preferences, and approaches to scoring. Questions remain about the comparability of systems in specific populations and about the extent to which differences in systems could impact the results of cost-utility analyses and therefore, policy decisions. In choosing among measurement systems researchers need to know the strengths and weaknesses of alternatives to optimize measurement performance for the particular problem under study, to interpret score changes or differences, and for study planning.
Evidence from cross-sectional comparisons indicates that significant variation exists in mean scores obtained from different systems. [13–19] However, when the research purpose is to measure change due to treatment, as in cost-utility analysis, longitudinal studies are necessary to evaluate system performance. Longitudinal head-to-head comparisons of preference-weighted systems indicate that change scores vary across systems used to estimate health state values (HSVs). [20–34] In particular, there is some evidence of difficulty measuring change at the worst levels of health (floor effects) for SF-derived systems,[15, 27, 35] and at the best levels (ceiling effects) for EQ-5D[14, 36] and in detecting changes in function unrelated to the extremities for HUI3.
To our knowledge, there are no longitudinal comparisons of these systems in the population of patients with intervertebral disc herniation, a common and costly spine disorder. Therefore, we aimed to conduct a critical comparison of the measurement characteristics of the EQ-5D-UK, EQ-5D-US, HUI3, HUI2, SF-6D and an algorithm to estimate the QWB from SF-36 data (eQWB) among Spine Patient Outcomes Research Trial (SPORT) participants with intervertebral disc herniation (IDH).
We used baseline and one year data from an ongoing prospective study of interventions for symptomatic lumbar spine disorders (SPORT). The design of this study has been previously reported in detail. In brief, SPORT is a multi-center study including three randomized trials and three observational cohorts. To be eligible for SPORT, participants were 18 years or older and had a diagnosis of IDH, Spinal Stenosis (SpS), or Degenerative Spondylolisthesis (DS). Participants were excluded if there was evidence of non-surgical treatment for fewer than six weeks for IDH and twelve weeks for SpS and DS; cauda equina syndrome; contraindications to spine surgery; possible pregnancy; active malignancy; current fracture; infection; or prior lumbar spine surgery.
The instruments used to characterize health state values are described below.
The EuroQoL EQ-5D includes five attributes rated on three levels to define 245 health states (when “dead” and “unconscious” are added). Using the same EQ-5D health state classification system with the reference time frame “today,” we applied EQ-5D-UK preference weights and EQ-5D-US weights. The UK (York) weights were measured using time-tradeoff values for a subset of health states from a sample of the UK population. [5, 6] The US weights were measured using time tradeoff in a representative sample of the US population. Both systems use additive models of attribute independence with different adjustments for any health state at the worst possible level.
The McMaster Health Utilities Index has been well described. [8, 9, 39–41] Using the same health state classification system, SPORT is licensed to apply the Mark 2 (HUI2) and the Mark 3 (HUI3) utility functions. HUI2 represents seven attributes on four or five levels and defines 24,000 health states. HUI3 has five or six levels for each of its eight attributes and encompasses 972,000 unique health states. HUI2 and HUI3 use multiplicative multi-attribute utility functions based on visual analogue and standard gamble scores obtained from community samples in Canada. [9, 40, 41] The reference time frame for the questionnaire was “the past four weeks” and we did not include the fertility dimension in our survey.
The SF-6D, version 2, provides a method for deriving a preference score from the SF-36 instrument. [10, 11] It represents six attributes on up to six levels. An additive model was used and community weights were derived using standard gamble utilities from a UK population for a subset of health states.
The Quality of Well-Being scale (QWB) is a preference-based health measure that includes three additive functional dimensions and a symptom dimension. Community preferences were measured using category rating for a representative sample of 866 adults in the San Diego area. Scores can range from 0.0 to 1.0, though the lowest score for a health state other than death is 0.32. We previously estimated QWB scores using a regression model based on five subscales of the SF-36 reported by the Beaver Dam Health Outcomes Study.
Criteria for comparison used for this study included a disease-specific measure, the Oswestry Disability Index (ODI) and patient ratings of satisfaction with symptoms (symptom satisfaction), self-perceived progress (progress rating), and self-perceived health (SPH).
The ODI includes nine items on six levels and yields an index score from least to most disability of 0 to 100. For consistency of interpretation, we subtracted ODI scores from 100 so that higher scores indicate better health.
The participant is asked, "If you had to spend the rest of your life with the symptoms you have now, how would you feel about it?" The response categories are very dissatisfied, somewhat dissatisfied, neutral, somewhat satisfied, and very satisfied.
The participant is asked, "How would you rate your progress with your spine-related problem since you first enrolled in SPORT?" Response categories are major improvement, minor improvement, no change, minor worsening, and major worsening.
Self-perceived health was assessed with the first question of the SF-36, which asks, "In general would you say your health is excellent, very good, good, fair, or poor?"
Participants first completed the ODI, followed by the SF-36; the EQ-5D including VAS rating; a symptom satisfaction rating; progress rating; and the HUI.
We summarized participant characteristics according to the four criteria: ODI, symptom satisfaction rating, progress rating, and self-perceived health rating. Mean change scores (change scores) were calculated for each system from baseline to one year. We described the distribution of change scores using means, standard deviations and ranges. We summarized the distribution of health state classifications by dimension and level using percents.
We tested differences between change scores using signed rank tests. We assessed longitudinal validity by calculating Spearman correlation coefficients for change scores for system pairs and using tests for trend across changes in levels of each criterion measure. We evaluated floor and ceiling effects for each system by calculating the proportion of participants who received the highest and lowest possible scores at baseline and at one year. This analysis was repeated for the key dimensions of pain and physical function.
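The floor and ceiling calculation described above can be sketched in a few lines of Python. This is a minimal illustration only, not the STATA code used in the study: the function name, the example scores, and the score bounds of 0.0 and 1.0 are all hypothetical.

```python
def floor_ceiling_proportions(scores, lowest, highest, tol=1e-9):
    """Return (floor, ceiling): the share of scores at the lowest and
    highest possible value of an instrument."""
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")
    at_floor = sum(1 for s in scores if abs(s - lowest) <= tol)
    at_ceiling = sum(1 for s in scores if abs(s - highest) <= tol)
    return at_floor / n, at_ceiling / n


# HYPOTHETICAL one-year scores for an instrument assumed to range from 0.0 to 1.0.
example_scores = [1.0, 0.76, 1.0, 0.52, 0.85, 1.0, 0.69, 0.0]
floor_prop, ceiling_prop = floor_ceiling_proportions(
    example_scores, lowest=0.0, highest=1.0
)
print(f"floor: {floor_prop:.1%}, ceiling: {ceiling_prop:.1%}")
```

The same function applies to a single dimension (for example pain) by passing the dimension levels and the worst and best level codes instead of overall scores.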
We calculated responsiveness statistics and 95% confidence intervals at one year for each system using distribution- and anchor-based methods.
We calculated distribution-based effect size and standardized response mean estimates as follows:
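Assuming the conventional definitions of these two statistics, with $\bar{x}_{0}$ and $\bar{x}_{1}$ the mean scores at baseline and at one year, $s_{0}$ the standard deviation of baseline scores, and $s_{\Delta}$ the standard deviation of individual change scores, they can be written as:

$$
\mathrm{ES} = \frac{\bar{x}_{1} - \bar{x}_{0}}{s_{0}}, \qquad
\mathrm{SRM} = \frac{\bar{x}_{1} - \bar{x}_{0}}{s_{\Delta}}
$$

Under these definitions the two statistics share a numerator and differ only in the standard deviation used to scale it, so the SRM is smaller than the effect size whenever change scores vary more than baseline scores.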
We calculated anchor-based Minimal Important Difference (MID) estimates and 95% confidence intervals according to four anchors: ODI, symptom satisfaction, progress rating, and self-perceived health (SPH) rating. MID was calculated as the mean change for those who reported minimal important change according to each anchor. The scores of those who worsened were multiplied by −1.[47, 48] Minimal important change was defined as one level of change from baseline to one year for symptom satisfaction and self-perceived health. For progress rating, minimal important change was defined as report of minor improvement or minor worsening at one year. SPORT sample size calculation was based on a 10-point change in ODI, and consistent with this and other work on important change for ODI, we used a 10–19 point change as the definition of minimal important change in this study. [43, 49]
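The anchor-based MID calculation can be illustrated with a short Python sketch. It assumes the anchor is coded as a signed number of levels changed (+1 for one level better, −1 for one level worse) and uses a normal-approximation confidence interval; both choices, along with the function name and the example data, are assumptions made for illustration and are not taken from the study's analysis code.

```python
from math import sqrt
from statistics import mean, stdev


def anchor_based_mid(change_scores, anchor_changes):
    """Mean change among participants with minimal important change.

    anchor_changes holds the signed number of anchor levels changed
    (assumed coding: +1 = one level better, -1 = one level worse).
    Scores of those who worsened are multiplied by -1 before averaging.
    """
    selected = []
    for change, anchor in zip(change_scores, anchor_changes):
        if anchor == 1:
            selected.append(change)
        elif anchor == -1:
            selected.append(-change)
    if len(selected) < 2:
        raise ValueError("need at least two participants with minimal change")
    mid = mean(selected)
    # Normal-approximation 95% CI (an assumption of this sketch).
    half_width = 1.96 * stdev(selected) / sqrt(len(selected))
    return mid, (mid - half_width, mid + half_width)


# HYPOTHETICAL change scores and anchor changes for seven participants.
changes = [0.10, 0.06, -0.08, 0.30, 0.12, -0.04, 0.00]
anchors = [1, 1, -1, 2, 1, -1, 0]
mid, (ci_low, ci_high) = anchor_based_mid(changes, anchors)
print(f"MID = {mid:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

Participants who changed by more than one anchor level, or not at all, are excluded, so only the five with minimal change contribute to the estimate.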
All analyses were undertaken using STATA, version 9 (STATA Corporation, College Station, Texas).
Data at one year were available for 1,000 participants whose mean age was 42 years (±11) (Table 1). This was a highly educated population with the majority classifying their race as white. The majority of participants reported improved health based on all criterion measures.
A summary of mean change in health state values (change scores) is shown in Table 2. The largest mean change score was 0.40 for EQ-5D-UK, which was approximately three times the mean change of 0.13 for eQWB. Standard deviations of the change scores were largest for EQ-5D-UK, followed by HUI3, EQ-5D-US, HUI2, SF-6D, and eQWB. Standard deviations were largest for the change scores, followed by baseline scores, and smallest for the 1-year scores.
Correlation between change scores as measured by Spearman coefficients ranged from 0.55 to 0.99 (Table 3). Not surprisingly, strongest correlations were noted between change scores from related systems, such as EQ-5D-UK and EQ-5D-US; HUI3 and HUI2; and SF6D and eQWB. Moderate to strong correlations were noted between change scores of all other systems.
When compared using signed rank tests, all change scores were significantly different from each other except EQ-5D-US and HUI2. All systems demonstrated linear trends and high correlations between change scores and change in levels of ODI, symptom satisfaction, progress rating, and self-perceived health (Figures 1a–d).
At one year, less than 1% of participants received the lowest possible score for each system, and 28% received the highest possible score for EQ-5D-UK and EQ-5D-US (Table 2). In contrast, HUI3 and HUI2 classified less than 10% at the ceiling, SF-6D defined 5%, and eQWB classified less than 1%. At baseline, each system classified a significant proportion of patients at the worst level for usual/physical function and pain. For the pain dimension, the percentage at the floor was: 40% for EQ-5D; 29% for HUI3; 19% for HUI2; and 28% for SF-6D. For mobility/physical function, it was 3% for EQ-5D; 2% for HUI3 and HUI2; and 14% for SF-6D. EQ-5D classified 25% at the floor for usual activities. At one year, there were large proportions at the best level (ceiling) for all dimensions. Specifically, in the mobility/physical function dimension, the percentage at the ceiling was: 73% for EQ-5D; 82% for HUI3 and HUI2; and 22% for SF-6D. The proportions at the ceiling for pain were: 33% for EQ-5D; 20% for HUI3 and HUI2; and 12% for SF-6D. All systems classified a smaller proportion at the floor at one year. Floor effects were also noted for EQ-5D in usual activities and pain/discomfort, and SF-6D in role limitations and vitality.
Table 4 summarizes responsiveness statistics. The estimated QWB score was most responsive, followed by the SF-6D. EQ-5D-UK was consistently the least responsive, although EQ-5D-US, HUI3 and HUI2 demonstrated similar or slightly less responsiveness. For example, the effect sizes for EQ-5D-UK and eQWB were 1.2 and 2.3 respectively. The Standardized Response Means (SRM) were 1.1 and 1.4 respectively.
Overall, MIDs were smaller for eQWB and SF-6D than for the other four systems. Values for EQ-5D-UK were approximately three to five times larger than those for eQWB. For example, using ODI as the anchor, the MID estimate for EQ-5D-UK was 0.12, while the estimate for eQWB was 0.05.
Our study is the first large longitudinal comparison of preference-weighted system performance in persons with a confirmed diagnosis of IDH. Correlations between systems and tests of trend with external criteria support the notion that all systems were valid measures of HRQOL. Estimates of effect size and standardized response mean indicate that all systems demonstrated the ability to measure change in key dimensions of HRQOL in this population of persons with spine disorders.
Considering the results of our validity tests together with the differences in mean change in health state values across systems, it is clear that there is no one system whose overall performance was superior to others. For example, the superior responsiveness of eQWB and SF-6D, evidenced by the effect size and SRM estimates found in this study, confers an important advantage by enabling the detection of change with fewer study participants. Similarly, MID estimates indicated that EQ-5D-UK would require a larger magnitude of change to be considered clinically important compared to eQWB. However, the limited variation in scores upon which estimates of responsiveness are based has implications for policy applications. eQWB and, to a lesser extent, SF-6D did not provide scores across the full range of health state values relative to the anchors of dead and perfect health. Our study findings were consistent with other comparisons that support overall validity of all systems, and somewhat better responsiveness for SF-6D, paired with potential overvaluation of lower health states.[32, 50] Other studies indicate variations in performance across diagnoses and severity of health states.
In the absence of a clearly superior system, the combination of unique strengths and limitations incorporated into preference-weighted health state classification systems presents difficult tradeoffs for researchers considering system choice. SF-6D demonstrated superior responsiveness and fewer ceiling problems for the pain and mobility dimensions. It is based on the longest of the surveys, with 36 questions, and covers several dimensions particularly relevant to persons with spine problems. Although the SF-36-derived approach may convey advantages in terms of responsiveness and ceiling effects, there is some indication that SF-6D may provide higher values for more severe health states and therefore may undervalue interventions.
The ease of administration is a key advantage of EQ-5D-UK and EQ-5D-US. However, floor and ceiling effects in dimensions highly relevant to spine disorders should be weighed against resource efficiency. EQ-5D-UK does not provide health state values between 0.88 and 1, and has been shown to provide lower mean values for health states compared to other systems.[14, 36]
The questionnaire for HUI2 and HUI3 is of intermediate length compared to SF-36 and EQ-5D, with fifteen questions. HUI3 incorporates dimensions not likely to be critical in the spine population, such as speech, vision, and hearing, and the dimensions covering mobility are limited to ambulation and dexterity. HUI3 has potential limitations in characterizing diminished mobility other than ambulation. HUI2 includes mobility and self-care dimensions, which are important aspects of HRQOL for persons with spine disorders. Both HUI3 and HUI2 demonstrated limitations in characterizing change in mobility in our study.
Practical issues should also be considered in choosing a measurement system. The tradeoffs between resources required for survey administration, acceptability to participants, and measurement properties must be considered carefully. The systems considered in this study range from 5 to 36 questions, with various response levels. Depending on the research context, these may represent important differences.
Although all systems demonstrated very similar patterns of psychometric performance across all criteria, some important differences emerged that can be compared to prior research. Although responsiveness statistics indicated acceptable performance for all systems, eQWB and SF-6D demonstrated the highest responsiveness, as indicated by larger effect size and SRM and smaller MIDs. The EQ-5D-UK had the lowest responsiveness of the measures. Other studies of the EQ-5D-UK and SF-6D in persons with various conditions have found similar results. [34, 51–54] Walters and Brazier found slightly better responsiveness for SF-6D than EQ-5D in the results of combined analysis of data from 11 cohorts. In the same study, the SRM and effect size were larger for SF-6D than for EQ-5D among patients with unspecified back pain. Longworth and Bryan found that SF-6D was limited in capturing change for severe health states but was more responsive among better health states compared to EQ-5D-UK among liver transplant patients. In contrast, Conner-Spady found slightly better responsiveness for EQ-5D-UK compared to HUI3 and SF-6D among rheumatology patients stratified by change status.
We found that responsiveness statistics for HUI3 and HUI2 were similar to or slightly better than those for EQ-5D-UK and EQ-5D-US and slightly worse than those for SF-6D and eQWB. Studies conducted among persons with stroke, epilepsy, and heart disease have reported similar patterns. [20, 21, 55] In contrast, Feeny et al. found better responsiveness for HUI3 and HUI2 than SF-6D among patients undergoing hip replacement. HUI3 demonstrated slight advantages over HUI2 in responsiveness among patients undergoing breast reduction surgery.
Responsiveness statistics, including MIDs, are important for study planning and for interpretation of changes in each system among patients with symptoms related to IDH. Estimates of effect size, SRM, and MID were generally larger in magnitude in our study than in other studies. [20–22, 24, 30, 34, 51–55] This can be explained by the large functional health status changes in our population over the study period. Our MID results for those who reported minimal change were 0.08 for SF-6D and 0.15 for EQ-5D-UK, meaning that a mean change of 0.08 in a clinical study using SF-6D would correspond with the lowest threshold for important change from the patient perspective using progress rating as the criterion. Alternatively, when judging the magnitude of change reported in clinical studies, a mean difference between treatment arms of 0.08 would indicate clinically meaningful difference using SF-6D. However, using EQ-5D-UK, the threshold would be 0.15.
Similarly, deficits in coverage for systems indicated by floor or ceiling problems have very important implications for system performance in measuring change over time. No ceiling effect was noted at baseline in our study. Consistent with previous studies conducted using data from persons with health conditions and from general population samples, large ceiling effects were noted in our study for the overall preference score for EQ-5D-UK and EQ-5D-US at one-year follow-up. [14, 18, 36, 53, 57] Smaller, but potentially significant ceiling effects were noted for HUI3, HUI2, and SF-6D. These results would indicate that EQ-5D-UK and US may have difficulty characterizing change for long-term outcomes compared to other instruments.
Perhaps of greatest significance was the remarkable proportion of patients at the ceiling on all systems for the key dimensions targeted by treatment. All systems classified significant proportions of participants at the floor or the ceiling of the pain and mobility dimensions at baseline or follow-up. These dimensions are particularly relevant for measuring effects of treatment in this population. Floor effects were evident at baseline in the pain dimension for EQ-5D-UK, EQ-5D-US, HUI3, HUI2, and SF-6D. Our review of the literature found that SF-6D demonstrated floor effects and a limited range of available scores. This is consistent with other studies of SF-36 in this population.[15, 27, 30, 58]
Ceiling effects were evident in the mobility dimension at baseline for EQ-5D-UK, EQ-5D-US, HUI3, and HUI2, and in mobility and pain dimensions for all systems at one year. Ceiling effects were greater for EQ-5D-UK and EQ-5D-US than other systems in the pain dimension. In contrast, ceiling effects were greater for HUI3 and HUI2 in the mobility dimension. Feeny et al. suggested that HUI3 may be limited in detecting changes in mobility that do not involve the hands. Our findings are consistent with this concern. However, HUI2 performed similarly to HUI3 in spite of describing mobility in broader terms.
Although psychometric evaluations are fundamental in establishing the measurement characteristics of preference-weighted systems, it is critical to assess validity in the context of their application for policy decision making. To address this question, we compared estimates of mean score change, or mean change in health state value, since this estimate is fundamental to QALY calculation. Except for EQ-5D-US and HUI2, we found that systems produced significantly different estimates of mean change in health state value. EQ-5D-UK produced the largest estimate of mean change, followed by HUI3, HUI2 and EQ-5D-US, SF-6D, and finally eQWB. Other studies have found differences in head-to-head longitudinal comparisons.[20–34, 51, 54–56, 59] Similarly, these studies reported that EQ-5D-UK estimates were generally largest, followed by HUI3, HUI2, and finally SF-6D. These patterns are generally consistent with the results of cross-sectional comparisons.
Although comparisons of mean health state values were more common, we identified comparisons of QALYs or ICERs obtained using relevant systems in the published literature. Pickard et al. found that QALY differences calculated using EQ-5D-UK or HUI3 were two times larger than those obtained using SF-6D or HUI2. Tosteson et al. reported ICERs for EQ-5D-UK and SF-6D in their cost-effectiveness analysis of surgery relative to non-operative treatment for persons with spinal stenosis with and without degenerative spondylolisthesis. The ICER (95% CI) for spinal stenosis using the EQ-5D-UK was $77,600 ($49,564–$120,042) compared to $93,400 ($59,205–$143,660) using the SF-6D. The ICER for spinal stenosis with degenerative spondylolisthesis was $115,600 ($90,839–$144,863) using the EQ-5D-UK compared to $172,500 ($132,178–$221,930) using SF-6D. Van den Hout reported change in QALYs for EQ-5D-UK, EQ-5D-US, and SF-6D in their cost-utility analysis of early surgery versus prolonged conservative care among patients with sciatica from IDH. Their results were consistent with our findings: EQ-5D-UK provided the largest change in health state values and the smallest cost-utility ratio, followed by EQ-5D-US and SF-6D. The gains in QALYs were 0.044 (95% CI: 0.005 to 0.083) for EQ-5D-UK, 0.032 (0.005 to 0.059) for EQ-5D-US, and 0.024 (0.003 to 0.046) for SF-6D. Another study investigating acupuncture compared to usual care for nonspecific back pain reported very similar ICERs using EQ-5D-UK and SF-6D. Joore et al. found differences in ICERs and the probabilities of acceptability for the ICERs across five conditions using EQ-5D-UK and SF-6D. Specifically, higher probabilities of acceptability were found using EQ-5D-UK for milder health conditions and using SF-6D for more severe health conditions. This study highlights the need to assess the performance of preference-weighted health state classification systems for specific conditions.
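To make the link between QALY gains and cost-utility ratios concrete, the following Python sketch divides one fixed incremental cost by the three QALY gains quoted above for the sciatica analysis (in the order EQ-5D-UK, EQ-5D-US, SF-6D). The incremental cost of 1,000 is a hypothetical placeholder, not a figure from any study, so only the relative size of the resulting ratios is meaningful.

```python
def icer(incremental_cost, incremental_qalys):
    """Incremental cost-effectiveness ratio: extra cost per QALY gained."""
    if incremental_qalys <= 0:
        raise ValueError("incremental QALYs must be positive for this sketch")
    return incremental_cost / incremental_qalys


# QALY gains quoted in the text for the sciatica cost-utility analysis.
qaly_gain = {"EQ-5D-UK": 0.044, "EQ-5D-US": 0.032, "SF-6D": 0.024}

# HYPOTHETICAL incremental cost, held fixed across systems for illustration.
incremental_cost = 1000.0

icers = {system: icer(incremental_cost, gain) for system, gain in qaly_gain.items()}
for system, value in icers.items():
    print(f"{system}: {value:,.0f} per QALY")
```

With the cost held fixed, the ratio of two ICERs is the inverse of the ratio of the QALY gains, so the SF-6D ratio is about 1.8 times the EQ-5D-UK ratio (0.044 / 0.024).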
The results of our study indicate that using a regression model to “map” from SF-36-based health states to the QWB score for persons with intervertebral disc herniation may be a reasonable approach. The eQWB performed well in psychometric validity tests, but provided the lowest estimation of mean change in health state value, and very little variation in preference estimates. Kaplan et al. reported similar results among patients with arthritis. Although the psychometric properties of SF-6D and eQWB may be fairly similar, we recognize the additional steps of incorporating direct valuations as an advantage of the SF-6D over the “mapped” estimates produced by eQWB. Furthermore, our results indicate that eQWB may be more likely to produce qualitatively different cost-utility results than alternative systems. However, interest in developing methods to estimate health state values from existing HRQOL data appears to be increasing, and their relative performance should be investigated. It should be noted that the performance of the eQWB is dependent on the characteristics of the SF-36 health state classification system, the Quality of Well Being Scale, and the regression model used to link the two. Our results indicate that the eQWB estimates may be used with some caution in this population, but this should be weighed carefully against the advantages of the SF-6D.
As with any validation study of HRQOL instruments, there is no gold standard for the performance of preference-weighted health state classification systems. Although it is possible to generate and test hypotheses about the behavior of the systems under known circumstances, some variation would be expected between the systems under examination and the measure used to test validity.
We used a longitudinal cohort design to calculate responsiveness statistics, including effect size. This is not the same computation as is used to characterize the effect of treatment compared to control. It may be argued that for the purpose of interpretation in clinical trials, estimation of the effect size of treatment is the most relevant calculation. Because all patients in this trial undergo some treatment, either surgical or non-surgical, and most were expected to improve, we measured observed change in the entire group. However, effect size is commonly used in applications similar to ours to address longitudinal validity.
Since the order of administration of instruments was not counterbalanced, we cannot rule out an order effect. However, because the questions in each instrument are of similar nature, it is doubtful that there would be significant learning or framing effects exhibited in this application.
In summary, this study provided information about the performance and interpretation of several of the most widely used preference-weighted health state classification systems. The evidence supports the notion that all systems are measuring the same construct, but each has unique characteristics that should be considered when choosing a system. We found evidence that all systems demonstrate validity in this population, with some caveats. All systems demonstrated evidence of ceiling or floor effects for key dimensions relevant to spine disorders. In the context of cost-effectiveness analysis, we found that change scores were significantly different except for EQ-5D-US and HUI2. Change scores were largest for EQ-5D-UK, followed by HUI3, HUI2, EQ-5D-US, SF-6D, and finally eQWB. Such differences indicate that care should be taken when interpreting cost-utility analyses from different systems. Researchers choosing a system should carefully consider the characteristics of each system relative to study goals.
Acknowledgement of Support: The authors would like to acknowledge funding from the following sources: Grant Number F32HD056763 from the National Institute of Child Health And Human Development. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of Child Health And Human Development or the National Institutes of Health. Support for this research was provided by the National Institute of Arthritis and Musculoskeletal and Skin Diseases (U01-AR45444-01A1 and P60-AR048094-01A1) and the Office of Research on Women's Health, the National Institutes of Health, and the National Institute of Occupational Safety and Health, the Centers for Disease Control and Prevention and a New Investigator Fellowship Training Initiative grant from the Foundation for Physical Therapy.
Commercial Support/Conflicts Statement: No conflicts to declare