Several methods of estimating SEMs of 5 commonly used preference-scored HRQoL indexes showed these standard deviations to be substantial and, in most ranges of health, well above an often-used value for "minimally important difference" (MID) of 0.03-0.04 (33-35), although values of MID as high as 0.07 have been suggested (36). According to previous literature, this would make the indexes investigated inappropriate for individual patient monitoring (1), although it must be recognized that HRQoL indexes and their subscales may often be used only as ancillary to other information. A recent publication provides guidance on how to apply SEM in assessing the uncertainty in clinical change scores (37).
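To make concrete how an SEM translates into uncertainty around an individual change score, a common rule computes the smallest change distinguishable from measurement error (often called the minimal detectable change, MDC) as z·√2·SEM, the √2 reflecting that a change score is the difference of two error-prone measurements. The sketch below is illustrative only; the SEM value of 0.06 is a hypothetical figure in the range discussed here, not a result from this study.

```python
import math

def minimal_detectable_change(sem, confidence_z=1.96):
    """Smallest individual change distinguishable from measurement error.

    The sqrt(2) arises because a change score is the difference of two
    measurements, each carrying error with standard deviation `sem`.
    """
    return confidence_z * math.sqrt(2) * sem

# Hypothetical SEM of 0.06 for a preference-scored index (illustrative only)
mdc = minimal_detectable_change(0.06)
print(round(mdc, 3))  # 0.166 -- far above an MID of 0.03-0.04
```

Even a modest SEM of this size implies that an individual change must exceed roughly 0.17 on the index scale before it can be confidently attributed to real change rather than noise.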

Indexes differed in the magnitude of their SEMs, with the HUI3 having the largest and the SF6D_36v2 the smallest standard deviation. This conclusion held both for SEM-TR, based on test-retest, and for SEM-S, based on variation of each index around a joint construct of underlying health. Importantly, SEM varied considerably across the range of health, so that average SEM depends on the population composition. Our SEM estimates may be helpful in choosing the most precise index for a certain range of health. However, ceiling effects play a central role and cause SEM to be artificially small close to the maximum index value of 1. SEM in the mid-range of health is quite comparable across indexes.

Reliability coefficients for health outcome measures can be estimated using a variety of methods. The common element is the creation of a ratio of true to observed variance. Some investigators use measures of internal consistency, while others use estimates derived from repeated applications of the measures to the same populations. This analysis primarily uses a method that depends on several modeling assumptions. Nonetheless, the reliabilities computed from the estimated SEM fell firmly within the ranges of previously reported values, except for the QWB-SA (38). In the latter overview, reliability coefficients were tabulated from a range of disease-specific and community studies, with the middle of the range being 0.71 for SF-6D, 0.72 for EQ-5D, and 0.76 for the HUI3. From the NHMS we have 0.71 for SF-6D, 0.70 for EQ-5D, and 0.77 for HUI3. As noted previously, however, reliability coefficients depend on the range of health in the population under study, and COHMS does indeed provide lower estimates. A population-based study in Canada (39) arrived at a reliability estimate of 0.77 for the HUI3, identical to our NHMS estimate. The reliability coefficients for the indexes estimated by us and others are adequate or almost adequate for population studies (40).
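The link between SEM and a reliability coefficient used throughout this discussion follows classical test theory, where SEM = SD·√(1 − r), so reliability can be recovered from an SEM estimate and the observed-score standard deviation. A minimal sketch with purely illustrative numbers (the SD and SEM below are not values from this study):

```python
import math

def reliability_from_sem(sem, observed_sd):
    """Classical test theory: SEM = SD * sqrt(1 - r)  =>  r = 1 - (SEM/SD)**2."""
    return 1.0 - (sem / observed_sd) ** 2

# Illustrative values: an SEM of 0.06 against an observed SD of 0.12
# yields a reliability in the range reported for these indexes.
r = reliability_from_sem(0.06, 0.12)
print(round(r, 2))  # 0.75
```

This also makes the dependence on the sampled range of health explicit: a narrower population (smaller observed SD) mechanically lowers the reliability coefficient even when the SEM itself is unchanged.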

Our estimates for QWB-SA reliability of 0.59 and 0.64 are well below the reliability of 0.90 previously reported. However, those earlier estimates of QWB reliability used an entirely different methodology. It may also be noted that the QWB-SA was found to be the least strongly related to the construct of underlying health in the IRT analysis (16), and the reliability estimate from the NHMS may therefore reflect some unique variance being included in SEM-S. The IRT analysis identifies common variance across measures. While all five indexes include items on physical and emotional health and on symptoms such as pain and discomfort, the QWB-SA differs from the other measures in that it includes an extensive set of items on symptoms and health problems, some of which are acute. This unique symptom-problem content may explain why the QWB-SA was less strongly related to the shared construct, and may account for some of the variability between visits in the COHMS. Hence, the reliability of the QWB-SA may have been underestimated in our analyses.

We further found that SEM varies across the range of health, although less for the QWB-SA and the SF6D_36v2 than for the HUI2, HUI3, and EQ-5D. This non-constancy may lead to misleading estimates of responsiveness and reliability from studies of patients representing a limited range of health. For example, ceiling effects may lead to underestimation of SEM and a corresponding overestimation of reliability and responsiveness in healthy samples. Notably, our overall SEM is estimated as lower from the NHMS, where the percentages falling at the ceiling of the indexes are higher than in the cataract sample. The differences in SEM between indexes also somewhat mirror the differences in index ranges, where the minimum observed value of the HUI3 is -0.34 but that of the SF6D_36v2 is as high as 0.30; hence they are partly explained by index scaling. Our results provide some insight into the signal-to-noise ratio in different ranges of health, and show that different indexes may be best in different ranges. However, we found the signal-to-noise ratio more sensitive to modeling choices, such as the cut-points chosen for the indexes in the IRT model, than the SEM estimates themselves were.
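The ceiling mechanism described above can be illustrated with a toy simulation: when scores are clipped at the index maximum of 1, the apparent test-retest error SD shrinks for respondents whose true health is near the ceiling. This is a sketch under an assumed Gaussian error model, not the estimation procedure used in this study; the error SD of 0.06 is hypothetical.

```python
import random
import statistics

random.seed(0)

def retest_sd(true_mean, error_sd=0.06, ceiling=1.0, n=20000):
    """Apparent SEM from simulated test-retest scores clipped at the ceiling."""
    diffs = []
    for _ in range(n):
        t1 = min(random.gauss(true_mean, error_sd), ceiling)
        t2 = min(random.gauss(true_mean, error_sd), ceiling)
        diffs.append(t1 - t2)
    # Test-retest SEM estimate: SD of differences divided by sqrt(2)
    return statistics.stdev(diffs) / 2 ** 0.5

# Respondents near the ceiling (true mean 0.98) show a smaller apparent SEM
# than mid-range respondents (true mean 0.70), despite identical true error.
print(retest_sd(0.70) > retest_sd(0.98))  # True
```

The same mechanism explains why a healthy sample with many scores at the ceiling can yield artificially low SEM and hence inflated reliability.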

We estimated two conceptually different SEMs across two separate samples, one representing a general population and the other post-cataract surgery patients. Given these differences, the similarity of the results is surprising and reassuring. Nonetheless, some caution is in order.

The structural SEM-S in the general population, estimated around the underlying measure of health, contains some unique variance, i.e., sensitivity of an index to health conditions not reflected in the other indexes. This unique variance would be considered measurement error if the goal is to estimate the core construct of health common to all indexes, but not if the goal is to measure the construct represented by the specific index itself. On the other hand, some collinearity in the prediction models underlying SEM-S may have remained and led to underestimation. Such collinearity may have arisen from correlated errors in responses to questions that are similar across indexes.

Test-retest SEM-TR from the cataract sample almost surely contains variance due to short-term fluctuations in health, such as those due to acute illness episodes. Hence, SEM-TR is quite likely an overestimate of SEM, as short-term health fluctuations would be considered measurement error if the goal is to measure the impact of chronic illness only. Our study of SEM-TR has the weakness of not having access to repeated measures closer together than 5 months apart, although stability of long-term health is difficult to confirm in any study. Short time intervals are well known to raise the alternative problem of recall bias.

Our method to adjust for the reliability of theta is not precise. First of all, the reliability coefficient used was derived from an IRT procedure that did not take sampling weights into account (16). We adopted this approach to be faithful to our previously published methodology, and also because different methods attempting to produce weighted reliability coefficients did not yield consistent results. Second, the method of adjustment is technically correct only when a linear relationship is used to predict index scores, and only for the overall estimates of SEM. The complexity of our model precluded a more exact solution. In spite of these caveats, SEM-TR and the unadjusted and adjusted SEM-S are all close enough to provide a reasonably narrow range for the size of SEM for the five indexes. In addition, intervals constructed from SEM-S capture close to the expected percentage of differences between time points from the repeated measures.

In addition to generating a better understanding of preference-scored indexes, our analysis provides guidance on the magnitude of SEMs of the indexes, which should be useful in assessing responsiveness in studies too small to provide reliable internal error standard deviation estimates.