We evaluate the effects of mode and order of administration on health-related quality of life (HRQOL) scores.
We analyzed HRQOL data from the Clinical Outcomes and Measurement of Health Study (COMHS). In COMHS, we enrolled patients with heart failure or cataracts at three sites (University of California, San Diego, UCLA, and University of Wisconsin). Patients completed self-administered HRQOL instruments at baseline and months 1 and 6 post-baseline, including the EQ-5D, Health Utilities Index (HUI), Quality of Well-Being Scale, self-administered (QWB-SA) and the SF-36v2™. At the 6-month follow-up, individuals were randomized to mail or telephone administration first, followed by the other mode of administration. We used repeated measures mixed effects models, adjusting for site, patient age, education, gender and race.
Included were 121 individuals entering a heart failure program and 326 individuals scheduled for cataract surgery who completed the survey by mail or phone at the 6-month follow-up. The majority of the sample was female (53%) and white (86%). About a quarter of the sample had high school education or less (26%). The average age was 66 (36–91 range). HRQOL scores were higher (more positive) for phone administration following mail administration. The largest differences in scores between phone and mail responses occurred for comparisons of telephone responses for those who were randomized to a mail survey first compared to mail responses for those randomized to a telephone survey first (i.e., mode effects for responses that were given on the second administration of the HRQOL measures). The QWB-SA was the only measure that did not display the pattern of mode effects. The biggest differences between modes were 4 points on the SF-36v2™ Physical Health and Mental Health Component Summary Scores, 0.06 on the SF-6D, 0.03 on the QWB-SA, 0.08 on the EQ-5D, 0.04 on the HUI2 and 0.10 on the HUI3.
Telephone administration yields significantly more positive HRQOL scores for all of the generic HRQOL measures except for the QWB-SA. The magnitude of effects was clearly important, with some differences as large as a half-standard deviation. These findings confirm the importance of considering mode of administration when interpreting HRQOL scores.
Health-related quality of life (HRQOL) refers to how well one is able to function in daily life and perceived well-being. HRQOL is conceptualized as encompassing physical, mental, and social function and wellbeing. Generic HRQOL profile measures are designed to yield scores for each of the multiple aspects of HRQOL. In contrast, generic preference-based measures are designed to assess overall value or desirability of health states and to produce a single summary score that reflects the combined impact of the multiple domains of HRQOL. Among the most widely used generic preference measures are the Quality of Well-Being (QWB) Scale, the Health Utilities Index (HUI), and the EQ-5D. For each of these measures, societal preferences were obtained for a range of health states and used to provide a summary score for every possible state assessed by the instrument.
HRQOL measures can be self-administered (e.g., mail survey) or administered by a trained interviewer (e.g., phone interview). Self-administration tends to be less expensive and a more feasible method of data collection than interviewer-administration. In addition, there is some evidence that respondents prefer self-administration over being interviewed. However, self-administration often results in more missing data and is only possible for those who have sufficient reading and comprehension skills. In addition, with a self-administered mail survey it is possible that someone other than the target respondent helps with or completes the survey. Furthermore, later questions can influence answers to earlier ones if respondents look over the entire questionnaire before completing it.
Although a telephone interview may put greater cognitive demands on respondents, it also has the potential for collecting higher quality data because the interviewer can clarify questions and alleviate confusion. Interviewer administration also yields higher participation rates than self-administration. However, interviewer-administration can yield more socially desirable responses than self-administration. For example, computer-assisted telephone data collection of the SF-36v.1 survey yielded lower rates of missing data but more positive HRQOL reports than self-administered mail surveys. As a result, separate norms for mail and telephone administration of the SF-36v.1 were created. Similarly, HUI3 scores were found to be significantly higher for phone than mail administration. In contrast, a recent study found no mean differences in scores on the Functional Assessment of Cancer Therapy-General Survey for people randomly assigned to interview versus self-administration.
This paper examines possible mode effects in HRQOL data from the Clinical Outcomes and Measurement of Health Study (COMHS). The primary purpose of the study was to examine responsiveness of HRQOL to change in patients with heart failure or cataracts. This paper focuses on differences in mail and telephone responses to four generic HRQOL measures collected at the final follow-up data collection interval.
HRQOL measures (see description below) were administered by mail at baseline (first visit after referral for heart failure; between the last clinic visit and surgery for cataract), and at 1-month and 6-months post-baseline. At the 6-month follow-up participants also were asked to participate in a telephone interview. The order of administration of mail versus telephone was randomized.
The SF-36v2™ consists of 36 questions that were selected from a larger pool of items in the Medical Outcomes Study. It is most frequently administered using a 4-week recall period for the majority of the items: twenty of the items use a past 4 weeks reporting interval, while a “now” or implicit now time interval is used for the 5 general health perceptions and 10 physical functioning items.
The SF-36v2™ assesses 8 health concepts using multi-item scales (35 items): physical functioning (10 items), role limitations caused by physical health problems (4 items), role limitations caused by emotional problems (3 items), social functioning (2 items), emotional well-being (5 items), energy/fatigue (4 items), pain (2 items), and general health perceptions (5 items). An additional single item assesses change in perceived health during the last 12 months. We focus on the physical health and mental health component summary scores (SF-36v2™ PCS and MCS) derived from the 8 SF-36v2™ scales.
The SF-6D is computed from a subset of 11 of the 36 questions in the proprietary SF-36v2™ questionnaire. The SF-6D reduced the SF-36 to 6 domains (physical function, role limitation, social function, pain, mental health, and vitality), each comprising 4 to 6 levels, and jointly defining 18,000 health states. Scoring was derived from standard gamble assessments by a population sample from the United Kingdom. We separately coded an algorithm in SAS and verified its output scores with both the developer and vendor, leading to clarification of and a minor update to the algorithm distributed by the vendor. The scoring algorithm produces scores ranging from 0.30 to 1.0 for those alive.
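The count of 18,000 health states is the product of the number of levels in each domain. The sketch below checks that arithmetic; the per-domain level counts (6, 4, 5, 6, 5, 5) are an assumption taken from the standard SF-6D classification and are not listed in the text above.

```python
from math import prod

# Assumed number of levels per SF-6D domain (not stated in the text above).
SF6D_LEVELS = {
    "physical_function": 6,
    "role_limitation": 4,
    "social_function": 5,
    "pain": 6,
    "mental_health": 5,
    "vitality": 5,
}


def count_health_states(levels_per_domain):
    """Number of distinct states: one level chosen independently per domain."""
    return prod(levels_per_domain.values())


sf6d_states = count_health_states(SF6D_LEVELS)
print(sf6d_states)  # 18000
```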
The QWB-SA assesses self-reported functioning using a series of questions designed to record limitations over the previous 3 days, within three separate domains (Mobility, Physical Activity, and Social Activity). In addition, the QWB-SA includes a series of questions that ask about the presence or absence of different symptom/problem complexes. The 4 domain scores are combined into a total score that provides a numerical point-in-time expression of well-being that ranges from zero (0) for dead to one (1.0) for asymptomatic optimum functioning. Excluding dead (0.00), the minimum possible QWB-SA score is 0.09 and the maximum is 1.0. The original QWB obtained preference ratings of 856 people from the general population. The QWB-SA used convenience samples to model preference for case descriptions and the models were shown to be highly correlated with the population ratings in the original QWB preferences.
The HUI is a family of health status and preference-based HRQOL measures suitable for use in clinical and population studies. Each member of the family includes a health status classification system, a preference-based multi-attribute utility function, data collection questionnaires, and algorithms for deriving HUI variables from questionnaire responses.
HUI utility scores are based on community preference surveys in Canada. The utility scores are based directly on von Neumann-Morgenstern utility theory and extensions to that theory to accommodate multiple attributes. Standard HUI questionnaires cover both HUI Mark 2 (HUI2) and Mark 3 (HUI3) systems. HUI2 consists of seven attributes: sensation (vision, hearing, speech), mobility, emotion, cognition, self-care, pain, and fertility. Fertility was not assessed in the current study and was assumed to be normal. There are three to five levels per attribute, ranging from highly impaired to normal. Similarly, HUI3 consists of eight attributes: vision, hearing, speech, ambulation, dexterity, emotion, cognition, and pain. There are five or six levels per attribute. A single questionnaire, available in both self-administered (15 questions) and interviewer-administered (40 questions) form, collects data sufficient to score both HUI2 and HUI3. The HUI items can be administered using a specific time period, such as the past 4 weeks, but they have also been administered using a “usual health” recall period.
Both HUI2 and HUI3 scoring functions have health states scored less than 0 (dead). HUI2 scores range from −0.03 to 1.00; HUI3 scores range from −0.36 to 1.00.
The EQ-5D descriptive system consists of 5 dimensions: mobility, self-care, usual activity, pain/discomfort, and anxiety/depression. Each dimension has 3 levels designated simply as no problem, some problem, or extreme problem, and subjects are asked to check the level most descriptive of their current level of function or experience on each dimension. Five dimensions, each with three levels, yield 243 possible distinct health states comprising the classification system. The classification system has been assigned several different standardized scores derived through population-based samples of respondents asked to assign values to subsets of the 243 states using the anchoring labels noted above. A commonly used scoring system is a “tariff” system of weights applied to the dimension levels (and an adjustment for interaction) derived in the United Kingdom from a community sample of persons who valued health states using the time-tradeoff method.
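To make the classification system concrete, the following sketch enumerates every combination of the 5 dimensions and 3 levels described above. The five-digit state codes (for example 11111 for no problems on any dimension) follow a common EQ-5D labeling convention and are used here only for illustration; no valuation weights are applied.

```python
from itertools import product

DIMENSIONS = (
    "mobility",
    "self_care",
    "usual_activity",
    "pain_discomfort",
    "anxiety_depression",
)
LEVELS = (1, 2, 3)  # 1 = no problem, 2 = some problem, 3 = extreme problem


def enumerate_states():
    """Return every health state as a five-digit code, one digit per dimension."""
    return [
        "".join(str(level) for level in combo)
        for combo in product(LEVELS, repeat=len(DIMENSIONS))
    ]


states = enumerate_states()
print(len(states))            # 243
print(states[0], states[-1])  # 11111 33333
```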
A set of valuation weights has been derived from a U.S. sample and was used for the present study. This scoring algorithm was derived from time tradeoff assessments of EQ-5D health states made by a population sample of some 4000 US adults in face-to-face household interviews. These U.S.-weighted EQ-5D scores range from −0.11 to 1.00.
First, we estimated correlations (product-moment and intra-class) between the HRQOL scores administered by mail and telephone. Then, we evaluated differences in HRQOL scores by mode using repeated measures mixed effect models with random intercepts, controlling for fixed effects of site, patient age (35–44, 45–64, 65 and older), education (1–11th grade, high school graduate, some college, 4-year college or above), gender and race (White versus non-White). We evaluated all possible two-way interactions and found that only 4% of them were statistically significant at the p < 0.05 level, no more than expected by chance alone.
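The two correlation types answer slightly different questions: the product-moment correlation ignores a constant shift between modes, whereas an intra-class correlation penalizes it. The sketch below illustrates this with made-up scores. The one-way random-effects form, ICC(1,1), is an assumption, since the text does not say which ICC variant was used, and the mixed effects models are not reproduced here.

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between paired scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def icc_oneway(x, y):
    """One-way random-effects ICC(1,1) for two measurements per person."""
    n, k = len(x), 2
    grand = mean(x + y)
    person_means = [(a + b) / 2 for a, b in zip(x, y)]
    ms_between = k * sum((m - grand) ** 2 for m in person_means) / (n - 1)
    ms_within = sum(
        (a - m) ** 2 + (b - m) ** 2 for a, b, m in zip(x, y, person_means)
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical scores: telephone is exactly 0.05 higher than mail for everyone.
mail = [0.70, 0.80, 0.60, 0.90, 0.75]
phone = [score + 0.05 for score in mail]

r = pearson_r(mail, phone)
icc = icc_oneway(mail, phone)
print(round(r, 3), round(icc, 3))  # 1.0 0.905
```

When the two coefficients are close, as reported for these data, systematic shifts between modes are small relative to between-person variation.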
A sample of 535 patients was enrolled in the longitudinal study: 159 individuals entering a heart failure program and 376 individuals scheduled for cataract surgery. This analysis included 447 people (121 heart failure, 326 cataract) who completed the survey by mail and/or phone at the 6-month follow-up. The majority of the enrolled and analytic sample, respectively, was female (51% and 53%) and white (84% and 86%). About a quarter of the enrolled and analytic sample had high school education or less (28% and 26%). The average age was 66 (35–91 range) for the enrolled sample and 66 (36–91 range) for the analytic sample. Respondent characteristics for those randomized to mail versus telephone mode of administration first were similar (see Table 1). Most of those who completed both modes of data collection completed the two administrations within 3 weeks of one another (61%), but the maximum gap was 213 days between administrations.
Product-moment and intra-class correlations were similar for correlations between HRQOL scores for mail and telephone administration (see Table 2). Correlations ranged from 0.59 for the HUI2 to 0.84 for the SF-36v2™ PCS. In short, the most consistent HRQOL scores for mail and telephone administration were obtained for the SF-36v2™ PCS while the most inconsistency was observed for the QWB-SA and the HUI2.
Mean differences by mode and order of administration, adjusting for covariates, are shown in Table 3. The first column in Table 3 lists the HRQOL measure. Significant differences (p < 0.05) between each of the 4 groups of mode by order combinations are indicated in each row by superscripts. Cells in the same row that share a letter do not differ significantly (p > 0.05) from one another. For example, the means in the 2nd (mail response after telephone) and 3rd (mail response before telephone) columns do not differ significantly from one another because each pair of means in a row shares a superscript letter. Similarly, the means in the 4th (telephone response before mail) and 5th (telephone response after mail) do not differ significantly from one another. In contrast, the SF-36v2™ PCS means for mail responses (whether after telephone or before telephone) are significantly different from the telephone response after mail because the “a” superscript for the latter mean is not present on the means for the mail responses.
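The superscript convention can be read mechanically: two means in a row differ significantly only when they share no letter. The sketch below encodes that rule. The letters assigned to the row are hypothetical, chosen to match the pattern described for the SF-36v2™ PCS rather than copied from Table 3.

```python
def differ_significantly(letters_a, letters_b):
    """Two means differ (p < 0.05) only if their superscripts share no letter."""
    return set(letters_a).isdisjoint(letters_b)


# Hypothetical superscripts for one row of the table (columns 2 to 5).
pcs_row = {
    "mail_after_phone": "b",
    "mail_before_phone": "b",
    "phone_before_mail": "ab",
    "phone_after_mail": "a",
}

# Mail responses share "b", so they do not differ from one another.
print(differ_significantly(pcs_row["mail_after_phone"],
                           pcs_row["mail_before_phone"]))  # False

# Neither mail mean carries "a", so both differ from phone-after-mail.
print(differ_significantly(pcs_row["mail_before_phone"],
                           pcs_row["phone_after_mail"]))   # True
```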
The estimates in the 2nd (mail response) column and 4th (telephone response) column of the table are from the people who were randomized to telephone administration first. The estimates in the 3rd (mail response) column and 5th (telephone response) column are from the people who were randomized to mail survey administration first. These comparisons of pairs of columns represent within group differences by mode. The sample size is larger in column 4 than in column 2 (and in column 3 than in column 5) because fewer people completed both assessments than completed just the initial assessment. In contrast, the comparison of the 2nd (mail response) with the 5th (telephone response) column and the comparison of the 3rd (mail response) with the 4th (telephone response) column represent between group comparisons of mode effects because different people are being compared.
For the initial administration of the measures (shown in columns 3 and 4), the telephone administration yielded significantly higher (more positive) scores than mail administration for the SF-36v2™ MCS, SF-6D, and HUI3. Interestingly, telephone administration of the QWB-SA yielded significantly lower scores than mail administration. For the second administration of the measures (columns 2 and 5), HRQOL scores were significantly higher (more positive) for telephone than mail administration for every measure except the QWB-SA (which was not significantly different).
The maximum difference between scores for the mode by order groups (Table 4) ranged from 3 to 10 points on the transformed scores. Effect sizes for these differences ranged from 0.2 to 0.5 (small to medium size). The largest difference was found for the SF-6D and the EQ-5D and the smallest was for the QWB-SA and the HUI2.
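An effect size of this kind is a mean difference divided by a standard deviation. The sketch below shows the calculation and the conventional Cohen labels (0.2 small, 0.5 medium, 0.8 large). The standard deviation of 10 used in the example is hypothetical; the text does not report which standard deviation was used.

```python
def effect_size(mean_difference, standard_deviation):
    """Standardized mean difference: absolute difference in SD units."""
    if standard_deviation <= 0:
        raise ValueError("standard_deviation must be positive")
    return abs(mean_difference) / standard_deviation


def cohen_label(d):
    """Conventional verbal label for a standardized difference."""
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"


# A 4-point difference with a hypothetical SD of 10 points.
d = effect_size(4.0, 10.0)
print(d, cohen_label(d))  # 0.4 small
```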
Participants tended to report higher (more positive) HRQOL scores when measures were administered by telephone, especially after a mail administration. They tended to report lower (less positive) scores in mail administration, especially following a telephone administration. These differences are also consistent with previous research in which telephone administration yielded more positive HRQOL reports than mail (e.g., Hanmer et al.); however, a cross-over design has not been previously used so the mode-by-order findings are new. The magnitude of the differences ranged from small to medium effect sizes. Furthermore, these differences ranged from 3–10% of the difference between dead and perfect health for the preference measures and exceed established guidelines on the magnitude of minimally important differences [20, 21].
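The 3–10% range can be recovered from the largest between-mode differences reported for the preference-based measures, treating dead as 0 and perfect health as 1. A small sketch of that arithmetic follows; the function name is illustrative.

```python
# Largest between-mode differences reported in this article for the
# preference-based measures (0 = dead, 1 = perfect health).
MAX_DIFFERENCE = {
    "SF-6D": 0.06,
    "QWB-SA": 0.03,
    "EQ-5D": 0.08,
    "HUI2": 0.04,
    "HUI3": 0.10,
}


def percent_of_scale(difference, dead=0.0, perfect_health=1.0):
    """Express a score difference as a whole percent of the dead-to-perfect range."""
    return round(100 * difference / (perfect_health - dead))


percentages = {name: percent_of_scale(d) for name, d in MAX_DIFFERENCE.items()}
print(min(percentages.values()), max(percentages.values()))  # 3 10
```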
Interestingly, mode effects were not as pervasive for comparisons of initial administrations of the HRQOL measures by mode. That is, people randomized to complete a mail survey first before the telephone interview reported similar HRQOL scores in comparison to those randomized to a telephone interview first before completing a mail survey. This comparison of mode effects represents the typical parallel group design whereby different conditions are randomized to subjects.
In this study, we used a within-group cross-over design because, to detect a given difference, it requires only (1 − ρ)/2 times the sample size of a between-group design, where ρ is the correlation between a person's two administrations. Mode effects were more consistently seen in the second administration of the HRQOL measures (i.e., mail responses that occurred following a telephone interview versus telephone responses that were obtained after a mail survey was completed). The reason for the interaction of mode with order of administration of the HRQOL surveys is unclear, but having previously completed the set of HRQOL measures appeared to exacerbate the difference between mail and telephone responses.
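The sample-size factor quoted above can be checked under a normal approximation, taking the symbol in its numerator to be the correlation between a person's two administrations (an assumption, called rho below). The sketch ignores carryover and period effects, and the difference, standard deviation, alpha and power values are illustrative only.

```python
from statistics import NormalDist


def z_term(alpha=0.05, power=0.80):
    """(z_{1-alpha/2} + z_{power})^2 for a two-sided test."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2


def n_parallel_total(delta, sd, alpha=0.05, power=0.80):
    """Total N for two independent groups: 2 groups x 2 * z_term * sd^2 / delta^2."""
    return 4 * z_term(alpha, power) * sd ** 2 / delta ** 2


def n_crossover_total(delta, sd, rho, alpha=0.05, power=0.80):
    """Total N when each person gives both modes; Var(diff) = 2 * sd^2 * (1 - rho)."""
    return 2 * (1 - rho) * z_term(alpha, power) * sd ** 2 / delta ** 2


rho = 0.84  # highest mail-telephone correlation reported in this article
ratio = n_crossover_total(5, 10, rho) / n_parallel_total(5, 10)
print(round(ratio, 3))  # 0.08, i.e. (1 - rho) / 2
```

The higher the correlation between the two administrations, the larger the saving, which is why a cross-over design is attractive for measures with consistent scores across modes.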
The QWB-SA behaved differently than the other HRQOL measures with the one significant mode effect being inconsistent with previous work and the other measures in this study. In particular, telephone responses to the QWB-SA for those randomized to telephone administration first were lower than self-administered QWB-SA responses. The reason for this is not known but it is possible that the presence of an interviewer resulted in more attention to the QWB-SA symptom list. Unlike the other measures, the QWB-SA requires respondents to indicate whether they had recently experienced each of a long list of symptoms or problems for each of the past 3 days. The systematic method required in the telephone interview may have increased the likelihood that each of the many symptoms was considered. It is unclear why this would have only occurred when the telephone administration occurred prior to mail administration.
Because most studies do not administer measures twice within a short time interval at the final wave of data collection, mode effects might be less of a concern if the pattern of findings observed here were the norm. That is, the mode effects we observed for the condition of administration that is typically used (one administration) were not as pronounced as they were for the second administration of the measures. But the literature on the whole and the significant mode effects observed here for the second administration of the HRQOL measures suggest the need for caution in comparing HRQOL estimates that differ by mode of administration.
It is therefore advisable to consistently use one mode of data collection in longitudinal studies whenever possible. If multiple modes of data collection are necessary (e.g., to ensure adequate response rates), then adjustment for the more positive health reports seen in interviewer administration should be considered. Normative data need to be reported by mode of administration when significant mode effects are present. Further work is needed to explore the interaction between order of administration and the mode effects observed here.
Source of financial support: This study was supported by grant P01AG020679 from the National Institute on Aging. Dr. Hays was also supported in part by the UCLA Resource Center for Minority Aging Research/Center for Health Improvement in Minority Elderly (P30AG021684) and the UCLA/DREW Project EXPORT (P20MD000148 and P20MD000182). Drs. Hays and Kaplan were also supported in part by the UCLA Claude Pepper Older Americans Independence Center (AG028748). David Feeny has a proprietary interest in Health Utilities Incorporated (HUInc.), Dundas, Ontario, Canada. HUInc distributes copyrighted HUI materials and provides methodological advice on the use of HUI.