Assessment of health-related quality of life (HRQOL) using patient-reported outcomes in arthroplasty has become popular because it provides a unique perspective on successful elective procedures. However, challenges exist in the assessment of HRQOL in clinical practice and in clinical research. Patient compliance with multiple and sometimes lengthy HRQOL assessments administered at multiple follow-up visits is problematic. Many well-validated HRQOL instruments are available, and progress has been made in defining the minimal clinically important difference in hip and knee arthroplasty that denotes the minimal change perceived to be important by patients. Challenges in understanding the literature are attributable to the use of various HRQOL scales, with different scoring ranges and scoring algorithms, different interpretations of highest score, and differences in the presentation of raw versus transformed scores.
Hip and knee arthroplasties are associated with significant improvement in health-related quality of life (HRQOL).1,2 The primary reasons for using an HRQOL measure in assessing outcomes in arthroplasty are that the outcome of arthroplasty is not specific to the joint or limb but to overall impact on health;1,2 in addition, the value of measuring HRQOL is to assess the value of a procedure compared with that of any medical treatment. Cost-effectiveness cannot be compared without these HRQOL data.3
In recent years, the incorporation of HRQOL assessment into arthroplasty research and clinical practice has seen dramatic growth. The challenge is to continue widespread HRQOL assessment in an accurate, efficient, and economically reasonable fashion. This has created unique challenges, such as balancing the utility of HRQOL data with the additional burdens that such data collection represents. In a busy clinical practice, it is increasingly difficult to add yet another outcome assessment.
Assessment of pain is of prime interest in many arthroplasty studies. HRQOL instruments used in arthroplasty studies have varying psychometric properties and varying levels of validation. Minimal clinically important differences (MCIDs) have been recently defined for some instruments. These three characteristics may aid in choosing an HRQOL assessment and in calculating power for a prospective study.
Lessons learned through empirical experience have indicated that a successful HRQOL assessment process must make allowances for practical considerations such as patient burden.3 The critical challenge, therefore, is to balance the researcher’s needs for assessment data with the patient’s willingness to provide information. There is a natural tendency among study investigators to design HRQOL assessments comprehensively, including assessments for information that would be “nice to know” rather than designing a data set that economically and efficiently addresses the specific study hypotheses.
There has been discussion in recent literature as to the relative efficiencies of brief assessments.4 Much research has indicated that often “less is more.”5 Although single-item assessments of relevant HRQOL domains cannot capture the detailed information that a longer assessment can, in measuring the same domain, the brief assessments have demonstrated greater variability and sensitivity to change.6 For multi-item assessments to achieve the same degree of sensitivity to change as single-item assessments do, all of the items in the multi-item assessment must, on average, move the same amount in concert, or the difference is lost. One argument in favor of multi-item generic questionnaires is that they are more likely to capture the net description of a patient’s overall health, such as the global impact of arthroplasty on a patient’s health.
Figure 1 presents an example of this phenomenon,7 wherein assessing the HRQOL of lung cancer patients was accomplished by way of a single-item linear analog assessment (ie, Uniscale) compared with a longer multi-item assessment (ie, Lung Cancer Symptom Scale). The results indicated that the single-item assessment “Please rate your overall quality of life over the past two weeks” had greater sensitivity to change over time than did a multi-item assessment measuring the HRQOL of the same patient.
Salant and Dillman8 present a comprehensive approach to assessment design and implementation known as the Total Design Method, which has demonstrated success in a wide variety of applications. Basic ideas include presenting the assessment instrument as a professionally appearing booklet that engages the patient, with a contact number and a photograph of the principal investigator, rather than handing out a clipboard of poorly photocopied sheets of paper. Included in the booklet is a communiqué from the investigator explaining why the data the patient is being asked to provide are important and what advances in patient care might result; this presentation makes the assessment instrument look like a conversation rather than a test.
Short questionnaires are known to improve patient compliance, response rate, and the quality of response.9 Research has indicated that patients will complete assessments that include fewer than 12 questions without much consideration of the burden. Once two dozen questions have been answered, the patient has typically given all that he or she will give without perceiving a burden. Beyond 25 items, the degree to which the patient continues to comply is strongly related to the relevance, ease, logic, and degree of controversy of the items included. Once 50 items have been asked, one can expect up to 5% attrition and missed questions unless care is taken with the design. Beyond 80 questions, fatigue starts setting in even for healthy persons. A single question that interrupts the patient’s flow of assessment compliance could raise the likelihood that he or she will decide the overall assessment is not worth the time.
Psychometric theory suggests that in some situations one must ask the same question repeatedly or ask questions that are opposites to elicit an accurate response. This sort of approach incurs a risk of raising confusion and ire in the patient, who may not understand why he or she is being asked something repeatedly and why his or her intelligence is being insulted. A good test for determining whether an assessment is compliant with the aforementioned recommendations is to have the investigators complete the questionnaire in a simulated situation. They should ask themselves three questions as they complete the assessment: (1) Are there any questions that are unnecessary to the ultimate analysis? Each data point should have a plan in advance for analysis; otherwise, one is collecting data unnecessarily. (2) Are any questions unclear, confusing, or inflammatory? (3) Are any issues missed?
In terms of patient burden, the logistics of completing the assessment should be considered. This could involve assessments that can be read to the patient and/or completed over the telephone without loss of validity. Proxy respondents can provide usable data if the patient no longer can, but the responses must be assessed with special analytical procedures.10
Another issue of burden relates to the frequency with which HRQOL assessment data are collected. Although it would be optimal from an informational standpoint to gather such data daily, doing so is typically not feasible. Deciding on the assessment schedule is primarily a function of the research hypothesis/question at hand. Is the entity of interest how patients changed over time? Are we interested in only their best score over time? Is the critical information contained in the change from baseline to a given point? Can we define a “response” for each person over time and thereby produce an outcome measure that is comparable with tumor response? We examined these questions in a previous publication.3
Two short questionnaires derived from long versions that are relevant to arthroplasty literature are the reduced version of the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) and the Medical Outcomes Study 12-Item Short Form (SF-12), a shorter version of the SF-36. The reduced version of the WOMAC was developed and validated in a cohort of total knee arthroplasty (TKA) and total hip arthroplasty (THA) patients. The WOMAC reduced scale,11 which contains 7 function items reduced from 17 items in the full WOMAC,12 was further externally validated in a cohort of 100 patients with mild/moderate knee osteoarthritis (OA).13 Validation in additional cohorts and generation of population norms is desirable. By reducing patient burden, this offers a practical, shorter alternative to the full-length WOMAC. Tubach et al14 developed a different shortened version of the WOMAC with eight function items; this assessment tool was validated in patients with OA. Validation in the arthroplasty population is lacking. The SF-12 is a generic measure of quality of life (QOL) that has been validated in the general population.15–17 It correlates well with the WOMAC and the SF-3618–21 and has the advantage of having population norms.
We recommend that clinicians and researchers ask the aforementioned three questions before adding additional assessments in clinical follow-up or research study. Short, psychometrically valid questionnaires that address the main question should be the goal.
Because pain is the most common reason for performing TKA and THA, it is not surprising that pain is the most important outcome of interest.22,23 Most HRQOL measures have pain subscales that capture important pain domains, including pain severity on a visual analog or Likert scale (ie, pain subscales of the WOMAC function scale and the Knee Society score [KSS]), and whether pain is present during certain activities (WOMAC, KSS). It is important to define the source and location of pain, especially in the distinction between hip, knee, and back—information that is not captured in most HRQOL assessments. For the purposes of TKA and THA, left-right and hip-knee scores are the minimum necessity for attribution.
In some arthroplasty studies that specifically focus on the quality, location, and impact of pain as an outcome, it may be appropriate to use a pain inventory such as the Brief Pain Inventory or the Short Form McGill Pain Questionnaire (SF-MPQ) in addition to an HRQOL measure.24–26 The advantage of using the SF-MPQ is that sensory (11 items) and affective (4 items) dimensions of pain can be assessed separately; a total score then can be obtained by adding the two results. This may be particularly relevant in view of the recent findings that clinical and/or subclinical depression and anxiety are strong predictors of persistent pain and of suboptimal outcomes in arthroplasty patients.27,28 Use of the SF-MPQ may aid in separating sensory and affective aspects of pain.
When pain is the primary outcome in a study, we recommend using a validated pain questionnaire, such as the Brief Pain Inventory or the SF-MPQ, rather than a composite HRQOL questionnaire. In cases in which HRQOL is the primary outcome and pain a secondary outcome, we recommend using an HRQOL questionnaire that has a validated pain subscale.
A summary of commonly used, disease-specific, and generic instruments used for HRQOL assessment of TKA and THA and their strengths and psychometric properties is provided in Table 1. A recent review found that the KSS and Harris hip score (HHS), followed by the WOMAC and SF-36, were the most common outcome instruments used in clinical trials of knee replacement.62 The authors also noted substantial variation in the types of outcome assessments used.
The KSS and HHS were both developed by expert physician consensus and are condition-specific. The HHS provided the only outcome measure for patients with hip arthroplasty in the 1970s through the 1990s, before the advent and common use of the WOMAC and the SF-36. Because of their clinical relevance and the fact that they are based on a sound clinical framework, the KSS and HHS are still commonly used. An update of the KSS instrument is in progress.
In response to a need for more validated instruments in arthroplasty, a brief, seven-item, arthroplasty-specific HRQOL assessment unified scale, the American Academy of Orthopaedic Surgeons Lower Limb Core Instrument, was developed.56 It specifically addressed the overlap between the SF-36 and WOMAC, complements the SF-36, and has excellent psychometric properties (Table 1).
The SF-36 is ubiquitous in health care research and is perhaps the most widely used generic HRQOL assessment, primarily because of the availability of population-based normative and comparative data. Credible alternatives to the SF-36 are available.24
Many studies of patients with OA and other medical conditions have used specific subscales in addition to summaries or total scores. Examples include use of the role physical, physical function, and pain subscales of the SF-36 and of the function subscale of the WOMAC in studies of patients with OA or arthroplasty.63,64 The use of some but not all subscales, if determined a priori, may be appropriate, although this may lead to multiple comparisons and require statistical adjustment such as the Bonferroni correction. Additionally, the amount of change on each subscale that is considered clinically important may be different. In some cases, it is meaningful to describe individual components of an instrument. For example, in addition to presenting HHS scores during follow-up of 108 patients after primary THA with a cementless porous-coated anatomic hip, Kim and Kim65 presented data regarding dependence in walking, limp, and range of motion, without performing additional unnecessary statistical comparisons for these scale components. Thus, additional important information should be presented, when appropriate.
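The Bonferroni adjustment mentioned above can be illustrated with a minimal sketch. The subscale names and P values below are hypothetical and are not drawn from any study cited here; the function names are likewise illustrative only.

```python
def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance threshold under the Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


def bonferroni_significant(p_values: dict, alpha: float = 0.05) -> dict:
    """Flag each subscale comparison as significant after adjustment."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {name: p <= threshold for name, p in p_values.items()}


# Hypothetical P values for four subscales chosen a priori.
example_p_values = {
    "SF-36 physical function": 0.004,
    "SF-36 role physical": 0.030,
    "SF-36 bodily pain": 0.010,
    "WOMAC function": 0.020,
}

if __name__ == "__main__":
    # With four comparisons, the per-comparison threshold is 0.05 / 4.
    print(bonferroni_threshold(0.05, len(example_p_values)))
    print(bonferroni_significant(example_p_values))
```

In this hypothetical case, only comparisons with P values at or below 0.0125 remain significant, which shows how quickly reporting many subscales erodes statistical power.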
It is essential for assessment of HRQOL that both a generic and a condition- or limb-specific instrument be used because specific instruments are more sensitive to change, and generic instruments capture qualifying information about a patient’s general health and allow comparison with patients with other diseases.49,66 Based on psychometric properties and ease of completion, we recommend using the SF-36 (or the SF-12 for brevity) as the generic and the Knee Injury and Osteoarthritis Outcome Score, the WOMAC, or American Academy of Orthopaedic Surgeons Lower Limb Core Instrument as the specific HRQOL instrument. The Oxford Hip44 or Oxford Knee35 scales are alternatives to consider. When comparison with previous studies is most important, the KSS or HHS may be preferred because the two are the most commonly used instruments in arthroplasty studies. The linkage between research hypothesis and obtainable data should be the primary consideration in the choice of outcome measure.
The concept of MCIDs is a key issue in assessing patient-reported outcomes (PROs) because of its relevance to clinical practice.67 With an elective procedure such as arthroplasty, this concept is particularly important. Clinical trials and cohort studies of HRQOL provide a change in population means (ie, pre- and postintervention scores) or compare the changes between different interventions (ie, answer the question, “How small a change in the outcome could this study detect?”). These population-level differences are commonly extrapolated to an individual level. However, patients are not interested in knowing population-level differences. Rather, they wish to know the likelihood that they will experience a meaningful improvement for the risk they take with an intervention (ie, “Is this change meaningful to me?”). This concept illustrates the difference between clinical and statistical significance (ie, an intervention may be both clinically and statistically significant or have only clinical or only statistical significance). MCID estimates are calculated using probabilistic arguments drawing on results from statistical theory or using so-called anchor-based measures, usually patient or physician global scales. Anchor-based methods pose a global question to the patient (or physician), asking about the overall improvement in pain (or function) experienced between two visits.
The next step is calculation of the amount of change on a pain scale (eg, visual analog scale, WOMAC pain scale) that corresponds to the minimal change on the global scale (usually the “somewhat better” response). Thus, the amount of change on a global scale serves as an anchor for calculation of MCID estimates. The “anchor” in defining the MCID is either an objective outcome or a patient response to a global question that is related to the outcome of interest. For example, Quintana et al68 used a five-point patient-based anchor in defining MCID in patients with TKA by asking the patients about the improvement in their knee 6 months after the intervention, with the possible responses “a great deal better,” “somewhat better,” “equal,” “somewhat worse,” and “a great deal worse.” Changes corresponding to “somewhat better” were used to establish the MCID for improvement.
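One common way to operationalize an anchor-based estimate is to average the score change among patients who chose the minimal-improvement response. The sketch below assumes this mean-change approach and a scale on which higher scores are better; the record layout, function name, and patient data are hypothetical and do not reproduce the method or results of any cited study.

```python
from statistics import mean


def anchor_based_mcid(records, anchor_level="somewhat better"):
    """Mean score change among patients who selected the anchor response.

    Each record is a dict with 'baseline', 'followup', and 'anchor' keys
    (a hypothetical layout). Scores are assumed to be oriented so that a
    higher value means a better outcome.
    """
    changes = [
        r["followup"] - r["baseline"]
        for r in records
        if r["anchor"] == anchor_level
    ]
    if not changes:
        raise ValueError(f"no patients reported {anchor_level!r}")
    return mean(changes)


# Hypothetical patients with scores on a 0-to-100 scale.
example_records = [
    {"baseline": 40, "followup": 62, "anchor": "somewhat better"},
    {"baseline": 35, "followup": 55, "anchor": "somewhat better"},
    {"baseline": 50, "followup": 74, "anchor": "somewhat better"},
    {"baseline": 30, "followup": 85, "anchor": "a great deal better"},
    {"baseline": 45, "followup": 47, "anchor": "equal"},
]

if __name__ == "__main__":
    # Changes in the "somewhat better" group are 22, 20, and 24 points.
    print(anchor_based_mcid(example_records))  # prints 22
```

Patients who chose other responses do not enter the estimate, which is why the wording and number of response options on the anchor can shift the resulting MCID.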
MCID estimates have been described for the WOMAC and SF-36. For patients with primary THA, the MCID was 26 points for WOMAC stiffness and 29 points for WOMAC pain.55 For the same cohort, MCID estimates for the eight SF-36 subscales ranged from 11 points for the SF-36 physical role subscale to 20 points on the SF-36 physical function subscale. For primary TKA, the MCID estimates were 15 points for WOMAC stiffness and 23 points for WOMAC pain.54 For the same cohort, MCID estimates for the eight SF-36 subscales ranged from 12 points for the SF-36 physical function subscale to 17 points for the SF-36 bodily pain subscale.
Another advantage of calculating MCID estimates is that they can be used as a clinical trial outcome. TKA and THA are associated with large gains in HRQOL. However, when HRQOL outcomes are used to compare surgical approaches (eg, minimally invasive versus regular), specific surgeries (eg, patellar resurfacing versus no resurfacing, cruciate-retaining versus cruciate-sacrificing), or medical interventions in arthroplasty patients, the differences may have smaller effect sizes. In such instances, an MCID estimate is, therefore, a key characteristic for design of adequately powered studies.
For example, two studies that compared patellar resurfacing with no resurfacing reported no difference in KSS between the two groups, but neither study defined the MCID or was adequately powered to find some clinically meaningful difference.69,70 If these studies had collected some patient-reported global measure of improvement, then these data may have been used both to calculate MCIDs and for power calculations. Nonetheless, with 44 patients per treatment group, the study by Diduch et al70 had 80% power to detect a difference of 61% of a standard deviation (SD) (a moderately large and almost certainly clinically meaningful effect size); hence, the detectable effect from this design, although not specified, was likely of a reasonable size. If the sample size had been 64 patients per group, there would have been 80% power to detect the generally accepted clinically significant benchmark of 50% of the SD.10 The fact that no statistically significant result was observed is indicative that likely no clinically meaningful change was missed (although type II errors do occur).
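The power figures quoted above can be approximated with a short calculation. The sketch below assumes a two-sided, two-sample comparison of means at a significance level of 0.05 and uses the normal approximation, so its results differ slightly from exact t-based calculations (about 0.60 SD rather than 0.61 SD for 44 patients per group, and 63 rather than 64 patients per group for an effect of 0.5 SD). The function names are illustrative only.

```python
from math import ceil, sqrt
from statistics import NormalDist


def detectable_effect_size(n_per_group: int, alpha: float = 0.05,
                           power: float = 0.80) -> float:
    """Smallest standardized difference (in SD units) detectable with the
    given power in a two-sided, two-sample comparison (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 / n_per_group)


def n_per_group_for_effect(effect_size: float, alpha: float = 0.05,
                           power: float = 0.80) -> int:
    """Patients needed per group to detect a standardized difference."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    print(round(detectable_effect_size(44), 2))  # about 0.60 SD
    print(round(detectable_effect_size(64), 2))  # about 0.50 SD
    print(n_per_group_for_effect(0.5))           # 63 under this approximation
```

If an MCID and the SD of the outcome are known, dividing the MCID by the SD gives the effect size to pass to `n_per_group_for_effect`.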
One of the important issues regarding MCID calculation is that the results depend on the anchor used (ie, 4-, 5-, or 7-point scale)71 and, possibly, on patient expectations. The expected change correlating with “somewhat improved” may be different for a surgical versus medical intervention. For example, the MCID on the WOMAC in a TKA population was 23 for pain and 20 for function subscales.54 In a similar study of patients with OA undergoing nonsteroidal anti-inflammatory drug therapy, MCID estimates were 20 for WOMAC pain and 9 for WOMAC function.72 This example of different MCID estimates on WOMAC function scales may be attributable to differences in the patient population, baseline pain level, patient expectation, or the anchor used.
MCID estimates may be different in patients with revision versus primary arthroplasty for multiple reasons. Patients undergoing revision arthroplasty may have different pain severity and a higher likelihood of persistent pain and functional limitation postrevision. The largest improvements in HRQOL are seen in primary arthroplasty, and revision may be viewed more as a procedure to maintain the HRQOL, with less capability for symptomatic relief. Some revision surgeries (eg, intervening earlier in the course of osteolysis around an implant associated with prosthetic wear) may be performed not only to address symptoms but to prevent worse problems that may be more complex to manage in the future. In these cases, MCID or even HRQOL scores may not be the most relevant outcomes.
The use of MCID estimates facilitates the interpretation of normative data and baseline status in evaluating the health status of various populations of patients undergoing THA and TKA. There is a critical need to understand THA and TKA in regard to populations with differing aggregate health status. For example, a THA may be performed in a golfer who is having problems getting around the course, or in a patient from an underserved community who is on the verge of requiring a wheelchair. The baseline health status of these two persons may be quite different, and the outcome may be better in the golfer. The change in clinical status, however, may be greater in the borderline wheelchair patient; and the improvement in independence and mobility, as well as the cost savings for the health care system, may make this procedure more cost-effective. The risks of operating on the severely disabled patient may be greater, but so is the possible benefit.
A somewhat related concept is presenting the data regarding the proportion of patients who achieved previously described clinical end points (ie, responder analysis). For example, for the HHS, the definitions of excellent (90–100), good (80–89), fair (70–79), and poor (<70) outcomes have been described. In a study by Kim and Kim,65 the proportion of patients in these categories was reported. In a study by Diduch et al70 that followed 114 TKAs in young, active patients long-term, the authors reported that the mean KSS for function was 89 and that 94% of knees had good or excellent function. While recognizing the caveat that this particular categorization in HHS is somewhat arbitrary, we recommend that authors consider reporting the proportion of patients in the poor or fair category at baseline who shifted to a better or worse category at follow-up. In any event, responder analysis has the advantage of appearing similar to reports on other clinical variables such as treatment response or disease progression.
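A responder analysis of this kind can be sketched as follows, using the HHS categories listed above. The patient scores are hypothetical, the function names are illustrative, and the handling of fractional scores (for example, treating 89.5 as good) is an assumption rather than part of the published definitions.

```python
from collections import Counter

# Categories ordered from worst to best.
HHS_ORDER = ["poor", "fair", "good", "excellent"]


def hhs_category(score: float) -> str:
    """Map a Harris hip score to its outcome category."""
    if not 0 <= score <= 100:
        raise ValueError("HHS must lie between 0 and 100")
    if score >= 90:
        return "excellent"
    if score >= 80:
        return "good"
    if score >= 70:
        return "fair"
    return "poor"


def shift_proportions(pairs):
    """Among patients rated poor or fair at baseline, return the proportion
    whose category improved, stayed the same, or worsened at follow-up.

    `pairs` is an iterable of (baseline_score, followup_score) tuples.
    """
    counts = Counter()
    for baseline, followup in pairs:
        start = HHS_ORDER.index(hhs_category(baseline))
        if start > 1:  # skip patients already good or excellent at baseline
            continue
        end = HHS_ORDER.index(hhs_category(followup))
        if end > start:
            counts["improved"] += 1
        elif end < start:
            counts["worsened"] += 1
        else:
            counts["unchanged"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no patients were poor or fair at baseline")
    return {k: counts[k] / total for k in ("improved", "unchanged", "worsened")}


# Hypothetical (baseline, follow-up) HHS pairs.
example_pairs = [(55, 92), (65, 85), (72, 75), (75, 68), (85, 95)]

if __name__ == "__main__":
    print(shift_proportions(example_pairs))
```

In this hypothetical sample, four patients were poor or fair at baseline; two improved by at least one category, one stayed in the same category, and one worsened.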
We recommend that more studies be done to derive MCID estimates for commonly used HRQOL scales in arthroplasty. With knowledge of MCID on these scales, the clinical significance of results for trials of surgical and nonsurgical interventions in arthroplasty patients can be interpreted in addition to simple statistical significance. At present, MCID estimates are known only for the WOMAC and SF-36, so studies comparing the proportion of patients achieving MCID with one treatment versus another may prefer these instruments.
One of the challenges of HRQOL assessments is that each has its own range and interpretation calibration. For example, Table 2 presents scores for two HRQOL assessments drawn from an arthroplasty study.21 If one looks only at the raw scores, one would conclude that the scores were highly variable across the different assessments. A commonly used approach to solve this problem is to translate each scale onto a continuum of 0 to 100 for ease of interpretation. Once the transformations have been made, however, it is clear that all assessments are reporting similar levels of QOL (roughly 59% to 63% of the theoretic range of the assessment). One may also use T-scores or Z-scores, but nonstatisticians typically find these values difficult to interpret. A T-score is the measurement expressed in standard deviation units from a given mean score in a sample, given that the population standard deviation is unknown. A Z-score is the same as the T-score except that the population standard deviation is known.
Both the World Health Organization Quality of Life assessment short version (WHOQOL-BREF) and the WOMAC, for example, use this transformation onto a 0-to-100 scale to improve interpretability, although for the WHOQOL-BREF, a score of 100 indicates best QOL, whereas for the WOMAC, a score of 100 indicates worst QOL.12,61
We recommend transformation to 0-to-100-point scales in which 100 consistently indicates the best possible outcome for ease of interpretation and comparability across outcome measures. The transformation should be clearly described in both the abstract and methods sections of the paper. For established scales for which precise normative estimates are available, one might also include summary statistics in tables for the raw scores so that researchers who do not report transformed scores are still able to make cross-study comparisons.
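The recommended transformation is a linear rescaling, with a reversal for instruments on which a high raw score indicates a worse outcome. The sketch below is a minimal illustration; the function name and the example scoring ranges are assumptions chosen for illustration and should be replaced with the published range of the instrument in use.

```python
def to_percent_scale(raw: float, min_score: float, max_score: float,
                     higher_is_better: bool = True) -> float:
    """Rescale a raw score onto 0 to 100, where 100 is the best outcome.

    min_score and max_score are the theoretical limits of the instrument.
    Set higher_is_better=False when a high raw score means a worse outcome,
    so that the result is reversed.
    """
    if max_score <= min_score:
        raise ValueError("max_score must exceed min_score")
    if not min_score <= raw <= max_score:
        raise ValueError("raw score lies outside the instrument's range")
    percent = 100.0 * (raw - min_score) / (max_score - min_score)
    return percent if higher_is_better else 100.0 - percent


if __name__ == "__main__":
    # Hypothetical subscale scored 0 to 20, where 20 is the worst outcome.
    print(to_percent_scale(8, 0, 20, higher_is_better=False))  # prints 60.0
    # Hypothetical item scored 1 to 5, where 5 is the best outcome.
    print(to_percent_scale(3, 1, 5))  # prints 50.0
```

Because the direction flag is explicit, scores from instruments with opposite orientations end up on a common scale on which 100 always indicates the best outcome.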
Assessing outcomes following total joint arthroplasty can be described as asking the question, “What constitutes a good result?” For example, is a good outcome an absolute final measurement of HRQOL or a benefit that is described by a change in clinical status that correlates well with the patient’s satisfaction? Although clearly more research is needed in this area to understand the hierarchy of outcomes from a patient perspective, we speculate that it is a combination of patients’ overall assessment of change, the present state (ie, pain, function, range of motion in the index joint), adverse effects/complications, and the economic and social burden associated with the surgery and rehabilitation. In essence, this is a reflection of the evolution unfolding in modern medicine whereby the patient is viewed as more than merely the sum of disease indicators.73
HRQOL assessment in arthroplasty clinical research is in many ways advanced relative to the incorporation of PROs in other medical disciplines. Usable assessment tools already exist specifically for this patient population (eg, WOMAC, HHS, KSS), and generic assessments have been applied successfully (eg, SF-36, SF-12). MCID estimates have already been derived specifically for arthroplasty populations and are beginning to be applied in the interpretation of clinical research studies. The potential impact of arthroplasty on patient HRQOL is profound; the typical effect sizes one would expect for successful treatments are large, well beyond the MCID, especially in terms of pain and physical function. As a result, arthroplasty studies have consistently been able to demonstrate large effect sizes for changes in HRQOL. There is room for improvement in the development of a standard approach to HRQOL assessment, analysis, and interpretation.
The future for HRQOL assessment related to arthroplasty may lie in translating the successful experiences of clinical research into clinical practice. Ultimately, the test of the feasibility of HRQOL assessment will be whether such information can be collected routinely in clinical practice and used to improve clinical care. Use of outcomes assessment in physician offices is becoming feasible with the advent of computerized touch-screen technology. This is the key to data collection on a large scale, which is the goal of total joint registries.
Grant support was received from the NIH CTSA Award 1 KL2 RR024151-01 (Mayo Clinic Center for Clinical and Translational Research), North Central Cancer Treatment Group (CA25224-27), and Cancer Center grant CA 15083-32.
Dr. Singh or an immediate family member has received research or institutional support from the Mayo Clinic Center for Clinical and Translational Research, the North Central Cancer Treatment Group, and the Cancer Center. Dr. Sloan or an immediate family member has received research or institutional support from the Mayo Clinic Center for Clinical and Translational Research, the North Central Cancer Treatment Group, and the Cancer Center. Dr. Johanson or an immediate family member has received royalties from Exactech, serves as a paid consultant to or is an employee of Stelkast, and has received research or institutional support from DePuy, Exactech, IsoTis Orthobiologics, Zimmer, the Mayo Clinic Center for Clinical and Translational Research, the North Central Cancer Treatment Group, and the Cancer Center.
Evidence-based Medicine: Levels of evidence are described in the table of contents. In this article, references 9, 64, and 69 are level I studies. References 1, 2, 5, 11–22, 24–26, 28, 30, 31, 34–38, 42–63, 66–68, 70, and 72 are level II studies. References 23, 27, 39, and 65 are level III studies. References 3, 4, 6, 7, 10, 29, 32, 71, and 73 are level V expert opinion.
Citation numbers printed in bold type indicate references published within the past 5 years.