To develop and evaluate a prototype measure (OA-DISABILITY-CAT) for osteoarthritis research using Item Response Theory (IRT) and Computer Adaptive Test (CAT) methodologies.
We constructed an item bank consisting of 33 activities commonly affected by lower extremity (LE) osteoarthritis. A sample of 323 adults with LE osteoarthritis reported their degree of limitation in performing everyday activities and completed the Health Assessment Questionnaire-II (HAQ-II). We used confirmatory factor analyses to assess scale unidimensionality and IRT methods to calibrate the items and examine the fit of the data. Using CAT simulation analyses, we examined the performance of OA-DISABILITY-CATs of different lengths compared to the full item bank and the HAQ-II.
One distinct disability domain was identified. The 10-item OA-DISABILITY-CAT demonstrated a high degree of accuracy compared with the full item bank (r=0.99). The item bank and the HAQ-II scales covered a similar estimated scoring range. In terms of reliability, 95% of OA-DISABILITY reliability estimates were over 0.83 versus 0.60 for the HAQ-II. Except at the highest scores, the 10-item OA-DISABILITY-CAT demonstrated superior precision to the HAQ-II.
The prototype OA-DISABILITY-CAT demonstrated promising measurement properties compared to the HAQ-II, and is recommended for use in LE osteoarthritis research.
Disability related to osteoarthritis is widely recognized as a serious problem having significant impact at the individual and societal levels. (1–8) Consequently, disability assessment has become important in the evaluation of persons with osteoarthritis. (9–12) In contrast to functional limitation, which can be defined as a restriction in performance of a specific task at the person level (e.g. put on shoes), disability can be viewed as a limitation in performance of activities within the context of social roles (e.g. visit friends). (13, 14)
Persons with osteoarthritis exhibit a wide range of disability and possess the potential to make large changes over the course of treatment, making it difficult to develop one outcome instrument which works well in all patient groups. (7, 9, 15) An ideal disability measurement instrument would cover the full range of activities relevant to osteoarthritis treatment, with a sufficient number of increments in response categories to measure meaningful change across the disability continuum. Furthermore, to understand the impact of arthritis interventions on the disablement process, measurement instruments would be based on a sound conceptual model, allowing differentiation between functional limitation and disability outcomes.(5, 14, 16–18)
Current disability measures characterize the significant progress in disability assessment made to date, but persistent challenges continue to be faced by developers and users of instruments (9, 11, 19–22). These problems include incomplete coverage of the range of disability levels across patients over the course of treatment, and obtaining adequate precision without excessive instrument length.
A fundamental tension between measurement quality and practicality of administration has persisted for decades. Comprehensive fixed-form instruments have suffered from prohibitive respondent burden and administration costs. (21, 23) The introduction of short-form alternatives has raised concerns over relative losses in score precision and ability to measure clinically meaningful change.(24, 25) These and other well known limitations are largely due to traditional administration methods requiring that a fixed set of questions be administered to all subjects. Often, respondents must address redundant questions or those of low relevance. (26–28) Therefore, superior measures of disability that overcome these limitations of existing instruments would improve the basis for valid judgments about the effectiveness of various osteoarthritis treatments for use in cohort studies.
Contemporary methods for outcome instrument construction and data collection provide an opportunity to improve psychometric properties while reducing respondent burden and administrative costs. The introduction of Item Response Theory (IRT) methods (29–31) has allowed researchers to develop outcome instruments with improved performance across a broad range of disability. However, IRT methods by themselves have not resolved the problem of respondent burden. The recent introduction of computer adaptive testing (CAT) methods combined with IRT methods in the health measurement field offers a potential solution to this challenge. (32, 33) In CAT administration, an iterative computer program uses information from a subject’s previous responses to tailor item selection to provide the most information at the respondent’s current score estimate, thereby eliminating redundant questions about activities that are too hard or too easy. A key strength of this approach is that all scores are on the same metric, regardless of the number of items administered, thus facilitating comparisons across time or across groups with different disability levels.(34)
In this study we developed a disability instrument for lower extremity (LE) osteoarthritis research (OA-DISABILITY-CAT) using IRT and CAT methodologies, and evaluated its psychometric performance in relation to the full item bank and to the Health Assessment Questionnaire-II (HAQ-II), a widely used health measure in the field.
The Health Assessment Questionnaire (20) is a generic, multi-dimensional instrument designed to measure function and disability outcomes for rheumatoid arthritis (RA) that has been commonly used in many other disease areas, including osteoarthritis.(19) Subsequent versions of the HAQ, including the MHAQ, MDHAQ, and the HAQ-II, have addressed limitations of the original HAQ. In this study, we used the 10-item HAQ-II (22, 35). In a comparison of the HAQ-II with the earlier versions of the HAQ and the SF-36, the HAQ-II demonstrated a reliability estimate of 0.88, was highly correlated with the other measures, and had fewer ceiling and floor effects than the earlier HAQs. (22) Higher scores indicate more disability for the HAQ-II; however, for consistency in this study we reversed scores so that higher scores indicate less disability.
We conducted 6 semi-structured focus groups each consisting of 5 to 6 patients with LE osteoarthritis. Experienced moderators elicited patients’ perspectives on important outcomes for osteoarthritis research. Transcripts of audiotapes of the sessions were content analyzed.
We directed 3 multi-disciplinary focus groups including 5 to 6 clinicians who had extensive expertise in the treatment of patients with osteoarthritis.
We conducted a comprehensive review of the literature and generated a list of daily life activities covering a broad range of disability levels. The final item bank consisted of 33 daily life activities commonly affected by LE osteoarthritis.
We performed cognitive testing on the item bank to identify problems with questions that would diminish instrument performance by asking 6 adult patients with LE osteoarthritis scripted questions about item meaning.
In the final scale, subjects were asked to report the amount of limitation they had doing each activity as: 1) Not at all limited, 2) A little, 3) A lot, 4) Did not do this activity because of the arthritis in my legs, 5) Did not do an activity for reasons other than the arthritis in my legs. The time frame “on an average day over the past month” was used. If the subject responded ‘did not do activity because of arthritis’, those responses were treated as ‘couldn’t do activity’; if the subject responded ‘did not do an activity for reasons other than arthritis’, those responses were treated as missing data for the analysis.
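The response handling described above can be illustrated with a minimal sketch. The numeric codes follow the order of the response options in the text; keeping code 4 as the most limited scored category is an assumption for illustration, since the scoring values used in the analysis are not reported here, and the function name is hypothetical.

```python
from typing import List, Optional

# Response codes in the order the options are listed in the text.
NOT_LIMITED, A_LITTLE, A_LOT, NOT_DONE_ARTHRITIS, NOT_DONE_OTHER = 1, 2, 3, 4, 5


def recode_response(raw: int) -> Optional[int]:
    """Map a raw response code to an analysis category, or None for missing data."""
    if raw == NOT_DONE_OTHER:
        # "Did not do ... for reasons other than arthritis" is treated as missing.
        return None
    if raw in (NOT_LIMITED, A_LITTLE, A_LOT, NOT_DONE_ARTHRITIS):
        # Code 4 is retained as "couldn't do activity" (assumed to be the
        # most limited scored category).
        return raw
    raise ValueError(f"unexpected response code: {raw}")


raw_responses: List[int] = [1, 4, 5, 2, 3]
recoded = [recode_response(r) for r in raw_responses]
```

Returning `None` rather than a numeric sentinel keeps missing responses from being mistaken for a scored category in later calculations.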
We recruited a convenience sample of 323 adults from the greater Boston area from a pool of patients who had previously participated in osteoarthritis research and from a local orthopedic surgeon’s practice. In all cases the diagnosis of knee and/or hip osteoarthritis was confirmed by a physician, and the patient experienced pain or stiffness within the past 30 days consistent with the ACR clinical criteria for defining osteoarthritis. For the majority of the sample, arthritis was confirmed by radiographic evidence as well.
Subjects were contacted by phone to determine eligibility, which included: 18 years or older, able to speak English, experienced pain or stiffness in their knee or hip within the prior month, evidence on radiograph of a definite osteophyte for the knee or hip or joint space narrowing for the hip or confirmation from the subject of a physician’s diagnosis of osteoarthritis of the knee or hip. Subjects were not eligible if they used a wheelchair in their home, or had been diagnosed with rheumatoid arthritis, systemic lupus erythematosus, gout, or psoriatic arthritis. Subjects were stratified by functional level, ascertained by the Physical Function domain of the SF-36, to ensure a range of functional ability in the sample.
The OA-DISABILITY-CAT item bank and HAQ-II items were administered by trained interviewers during a home visit with each subject. The HAQ-II was administered first using pen and paper, followed by computerized administration of three instruments, including the OA-DISABILITY-CAT. The order of the three computerized instruments was counterbalanced. Gender-specific items were administered to the relevant gender. Demographic information (age, sex, ethnicity, race, education, living and housing status) was collected for each subject. All procedures were approved by the Institutional Review Board at Boston University.
We tested the underlying structure of the proposed disability items in a series of confirmatory factor analyses (36) and evaluated item loadings and residual correlations between items using MPlus software.(37) To maximize precision in our evaluation of these skewed categorical data, we chose unweighted least squares (ULS) estimation based on polychoric correlation matrices and variance adjusted estimation methods. (36, 38) We assessed eigenvalues associated with each factor extracted. Our analysis of model fit included the ratio of chi-square to degrees of freedom, Comparative Fit Index (CFI), Tucker-Lewis Index (TLI), and Root Mean Square Error of Approximation (RMSEA). For CFI and TLI, values range from 0 to 1, with higher values indicating better test model fit compared to a baseline model, and 0.95 or greater representing acceptable fit. RMSEA represents misfit per degree of freedom, and lower values signify better fit. Values less than 0.05 suggest a “very good fit”, with values around 0.08 interpreted as “mediocre” fit. Values >0.1 are generally viewed as indicative of a “poor fit.” (39, 40) Our second approach used the magnitude of the factor loadings on the primary factor. Finally, we considered residual correlations; those less than or equal to 0.20 suggest that the primary factor explains the correlation between items, and indicate acceptable fit. (41, 42) Higher residual correlations signified violation of the local independence assumption.
The item calibrations were estimated using the generalized partial credit model (GPCM).(43–45) We estimated IRT-based scores for the disability domain using Weighted Maximum Likelihood Estimation. (38, 46) We evaluated fit using the likelihood ratio chi-square (G2) statistics for each item based on the comparison of expected and observed values across the distribution of the domain. Bonferroni corrected p-values were used in the significance tests and the likelihood ratio chi-square statistic for the whole test was also examined to verify model fit of the domain. The scores estimated from the IRT model were standardized to have a mean of 50 and standard deviation of 10. All of the IRT analyses were performed using the software package PARSCALE. (47)
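For readers unfamiliar with the GPCM, the following sketch shows how category probabilities for a single item can be computed from a discrimination parameter and a set of step parameters, and how scores can be rescaled to a mean of 50 and standard deviation of 10. It is illustrative only: the parameterization (no scaling constant, step parameters given directly), the use of the population standard deviation, and all function names are assumptions, and it does not reproduce the PARSCALE estimation used in the study.

```python
import math
from typing import List, Sequence


def gpcm_probs(theta: float, a: float, steps: Sequence[float]) -> List[float]:
    """Category probabilities for one GPCM item.

    `steps` holds m step parameters for an item with m + 1 ordered categories.
    """
    # Cumulative logits: 0 for the lowest category, then running sums of a * (theta - b).
    logits = [0.0]
    for b in steps:
        logits.append(logits[-1] + a * (theta - b))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def to_t_metric(scores: Sequence[float]) -> List[float]:
    """Linearly rescale scores to mean 50 and (population) standard deviation 10."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    return [50.0 + 10.0 * (s - mean) / sd for s in scores]


# Hypothetical item with four categories and hypothetical person scores.
example_probs = gpcm_probs(theta=0.0, a=1.0, steps=[0.0, 0.0, 0.0])
example_t = to_t_metric([-1.0, 0.0, 1.0])
```

In this example all step parameters equal the person score, so the four categories are equally likely.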
A basic assumption of IRT models is that a subject’s score on an item should depend entirely on the subject’s score in the domain being measured and the statistical characteristics of the item. Significant differential item functioning (DIF) indicates that background variables (such as age or gender) influenced the response. (48) There are two kinds of DIF: uniform DIF, which means the item response difference was constant across the score reporting scale for disability; and non-uniform DIF, which means the item response difference between, say, males and females, was not consistent across the score reporting scale for disability.
DIF was assessed using logistic regression, with the OA-DISABILITY-CAT item score chosen as the dependent variable and background variables assigned as the independent variable. In the DIF analysis if the background effect was significant and the interaction effect with a person’s disability level was not, then the item had uniform DIF; on the other hand, if the interaction effect was significant, the item had non-uniform DIF. The analytic strategy successively added disability levels, background variables and interaction terms into the model and model comparison was based on the likelihood ratio test. The effect size of the DIF was classified based on the R-square change between models. (49)
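The decision rule in this nested-model strategy can be sketched as follows. The sketch assumes the three logistic regression models have already been fitted elsewhere and that only their log-likelihoods are available; it also assumes a single background variable contributing one degree of freedom per step and a significance level of 0.05. These choices, and the function names, are illustrative assumptions rather than details reported in the study, and the R-square effect size classification is not shown.

```python
import math


def lr_test_p_value(loglik_reduced: float, loglik_full: float) -> float:
    """P-value of a likelihood ratio test with 1 degree of freedom."""
    statistic = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    # Survival function of the chi-square distribution with df = 1.
    return math.erfc(math.sqrt(statistic / 2.0))


def classify_dif(
    loglik_level: float,        # model 1: disability level only
    loglik_group: float,        # model 2: adds the background variable
    loglik_interaction: float,  # model 3: adds level-by-background interaction
    alpha: float = 0.05,        # assumed significance level
) -> str:
    """Return 'non-uniform', 'uniform', or 'none' for one item."""
    p_group = lr_test_p_value(loglik_level, loglik_group)
    p_interaction = lr_test_p_value(loglik_group, loglik_interaction)
    if p_interaction < alpha:
        return "non-uniform"
    if p_group < alpha:
        return "uniform"
    return "none"


# Hypothetical log-likelihoods for a single item.
example_dif = classify_dif(-200.0, -195.0, -194.9)
```

Checking the interaction term first mirrors the rule in the text: a significant interaction marks non-uniform DIF regardless of the main background effect.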
Once a final item bank was identified and item calibrations were generated for the disability domain, we constructed the OA-DISABILITY-CAT algorithms on HDRI™ software developed at Boston University. The CATs were designed to be administered from a stand-alone computer or from a web-based platform. We programmed the CATs to use weighted maximum likelihood (WML) score estimation and selected initial items from those in the middle of the pain and disability ranges. The response to the first item was fed into the CAT algorithm and the application calculated a probable score as well as a person-specific standard error (measure of precision). Additional questions were selected and administered until the maximum number of items had been administered (in our analyses, 5, 10 or 15 items were administered).
An assumption of IRT is that all items are locally independent, that is, patients’ responses to any pair of items are statistically independent. (29) Often, items with local dependence are removed from the item bank. In our case, we did not eliminate them from the item bank, but rather dealt with them by special programming within the CAT algorithm that allowed use of only one item within a set of locally dependent items.
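The general logic of a fixed-length CAT with a local dependence restriction can be sketched as below. This is not the HDRI™ implementation. As labeled assumptions: item parameters and names are hypothetical, items are chosen by maximum Fisher information at the current estimate, the first item is chosen at a starting estimate of 0 (the middle of the range), and a grid-based expected a posteriori (EAP) estimate with a standard normal prior stands in for the WML estimator used in the study.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Item:
    name: str
    a: float                              # discrimination (hypothetical)
    steps: Tuple[float, ...]              # GPCM step parameters (hypothetical)
    dependence_set: Optional[str] = None  # label shared by locally dependent items


def gpcm_probs(theta: float, item: Item) -> List[float]:
    logits = [0.0]
    for b in item.steps:
        logits.append(logits[-1] + item.a * (theta - b))
    peak = max(logits)
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def information(theta: float, item: Item) -> float:
    """GPCM item information: a^2 times the variance of the category score."""
    probs = gpcm_probs(theta, item)
    mean = sum(k * p for k, p in enumerate(probs))
    second = sum(k * k * p for k, p in enumerate(probs))
    return item.a ** 2 * (second - mean ** 2)


GRID = [g / 10.0 for g in range(-40, 41)]  # score grid from -4 to 4


def eap_estimate(answers: Sequence[Tuple[Item, int]]) -> Tuple[float, float]:
    """Posterior mean and SD under a standard normal prior (stand-in for WML)."""
    log_post = []
    for theta in GRID:
        value = -0.5 * theta * theta
        for item, category in answers:
            value += math.log(gpcm_probs(theta, item)[category])
        log_post.append(value)
    peak = max(log_post)
    post = [math.exp(v - peak) for v in log_post]
    total = sum(post)
    mean = sum(t * p for t, p in zip(GRID, post)) / total
    var = sum((t - mean) ** 2 * p for t, p in zip(GRID, post)) / total
    return mean, math.sqrt(var)


def run_cat(
    bank: Sequence[Item], respond: Callable[[Item], int], max_items: int
) -> Tuple[float, float, List[str]]:
    """Administer up to max_items items; return score, standard error, item names."""
    answers: List[Tuple[Item, int]] = []
    remaining = list(bank)
    theta, se = 0.0, 1.0  # start in the middle of the range
    while remaining and len(answers) < max_items:
        item = max(remaining, key=lambda it: information(theta, it))
        answers.append((item, respond(item)))
        # Drop the administered item and any item in the same dependence set.
        remaining = [
            other for other in remaining
            if other is not item
            and (item.dependence_set is None
                 or other.dependence_set != item.dependence_set)
        ]
        theta, se = eap_estimate(answers)
    return theta, se, [item.name for item, _ in answers]


bank = [
    Item("item_a", 1.2, (-1.0, 0.0, 1.0)),
    Item("item_b", 1.0, (-0.5, 0.5, 1.5), dependence_set="set_1"),
    Item("item_c", 1.1, (-0.4, 0.4, 1.4), dependence_set="set_1"),
    Item("item_d", 0.8, (-2.0, -1.0, 0.0)),
    Item("item_e", 1.5, (0.0, 1.0, 2.0)),
]
score, se, administered = run_cat(bank, respond=lambda item: 0, max_items=3)
```

Filtering the remaining pool after each administration is one simple way to ensure that at most one item from a locally dependent set is used, while leaving all items in the calibrated bank.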
We conducted simulations to estimate the performance of CATs of different fixed item lengths (i.e. 5, 10, and 15 items) with respect to the full item bank. Mean scores generated by CATs of different lengths were compared with scores generated by the full item bank for the entire sample and across osteoarthritis conditions. To compare the relative precision of the CAT scores at multiple points along the scale with the full item bank we plotted the standard errors in relation to each subject’s disability scores. Pearson correlations were calculated between each of the CAT-generated scores and the full item bank scores to estimate the CAT’s accuracy. We compared the OA-DISABILITY-CAT item distributions, floor and ceiling effects, reliability, and precision against the HAQ-II. To create an appropriate comparison we placed the HAQ-II and OA-DISABILITY-CAT scores on the same metric by fixing the calibrations of one of the instruments and placing the other one on this same scale. Thus, we calibrated the HAQ-II items by anchoring the OA-DISABILITY-CAT item calibrations in the disability domain. For both scales in this analysis, higher scores indicate less disability. To examine the distribution of the OA-DISABILITY-CAT and HAQ-II items, we calculated expected values for each response category for each item. We considered the range of the scale to be the corresponding person score estimates between the expected value of the lowest and highest response category in each scale. In addition, we calculated the percent at the ceiling and floor for each scale. We compared the relative precision of the OA-DISABILITY-CAT scores with the HAQ-II using standard errors. Reliability, the degree to which the differences across patient measurements are due to actual differences in disability (true variance) rather than to measurement error, was examined by comparing the ratio of the true variance to the total variance for each instrument at multiple points along the scale.
Reliability was estimated as follows: $1/(1 + \mathrm{SE}^2)$, where SE is the standard error. (50) Any section of the reliability function <0.70 was considered to be inadequate. To test construct validity of the OA-DISABILITY instruments, we calculated Pearson correlation coefficients between the HAQ-II and the OA-DISABILITY instruments (5-, 10-, and 15-item CATs and the full item bank). We hypothesized that the correlations would be strong (>0.60).
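As a worked illustration of the reliability calculation, the sketch below converts a standard error into a conditional reliability and finds the largest standard error compatible with the 0.70 threshold. It assumes the standard error is expressed on a metric whose true score variance is 1; a standard error reported on the mean 50, standard deviation 10 metric would first need to be divided by 10. That rescaling step and the function names are assumptions for illustration.

```python
import math


def conditional_reliability(se: float) -> float:
    """Reliability = 1 / (1 + SE^2), with SE in standard deviation units."""
    return 1.0 / (1.0 + se ** 2)


def max_se_for_reliability(target: float) -> float:
    """Largest SE (in standard deviation units) that still reaches the target."""
    return math.sqrt(1.0 / target - 1.0)


# SE ceiling implied by the 0.70 adequacy threshold used in the text.
se_limit = max_se_for_reliability(0.70)
```

Under these assumptions, a reliability of at least 0.70 corresponds to a standard error of roughly 0.65 standard deviation units or less.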
In the study sample, the average age was 62 years (sd = 15), 65% of participants were female, and a large proportion had knee osteoarthritis (Table 1). In the disability scale the average percentage of subjects who responded “didn’t do an activity because of arthritis” was 9.21% (sd 12.26%). The average percentage of subjects who responded “didn’t do an activity for reasons other than arthritis” was 18.18% (sd 13.34%).
Confirmatory factor analysis results were consistent with unidimensionality of the OA-DISABILITY-CAT domain. A unidimensional model (chi-square(df)=251(96), p<0.0001) across all 33 items achieved an acceptable level of fit, explained 62% of the variance, and was easily interpretable. Only 1.4% of the residual covariances were greater than +/−0.20, which means that the local independence assumption was satisfied. Remaining fit statistics were as follows: CFI was 0.95; TLI was 0.99; and RMSEA was 0.07.
The data fit the generalized partial credit model; the chi-square (df)=320(338), p=0.76. In terms of item fit, there was only 1 misfitting item (fairly heavy house or yard work) in the item bank.
There were 3 items which displayed DIF by age (taking part in a regular exercise program, doing low demand sports such as golfing or bowling and using public transportation). After adjusting for disability level, taking part in a regular exercise program was more difficult for those with hip osteoarthritis, and therefore demonstrated DIF by osteoarthritis condition. No items displayed gender DIF, and only uniform DIF was detected in this analysis.
Pearson correlation coefficients between the 5-, 10-, and 15-item OA-DISABILITY-CATs and the full item bank were 0.93, 0.99, and 0.99 respectively. This high degree of accuracy is illustrated by score plots for the 10-item CAT and the full item bank (Figure 1). Table 2 shows that the descriptive statistics of scores from the 5-, 10-, and 15-item OA-DISABILITY-CATs were similar to those for the full item bank and for mean scores generated across osteoarthritis conditions. As might be expected, the 5-item CAT had a smaller range of scores than the 10- and 15-item CATs and the full item bank. The standard errors of the 10-item OA-DISABILITY-CAT were slightly larger than those of the full item bank across the range, reflecting the smaller number of items used to calculate the overall score.
In Figure 2 the breadth of item and response category coverage across the continuum for the OA-DISABILITY item bank is displayed relative to that of the HAQ-II scale. Item and response category coverage is displayed as the range of scores for the sample that correspond to the highest and lowest values of expected item response categories in each scale. The OA-DISABILITY item bank and the HAQ-II scales covered a similar estimated scoring range (Figure 2). The ceiling and floor calculations further illustrated these results; for example, 13 subjects (4.02%) were at the ceiling (scores indicating least disability) for the OA-DISABILITY item bank compared to 18 (5.57%) for the HAQ-II scale. No floor effects (scores indicating most disability) were detected for either scale.
Correlations between the HAQ-II and the OA-DISABILITY-CATs (5-, 10-, and 15-item CATs and the full item bank) ranged from 0.71 to 0.74.
The conditional reliability of the OA-DISABILITY item bank was very strong across the center of the disability continuum. For example, 95% of OA-DISABILITY reliability estimates were over 0.83 versus 0.60 for the HAQ-II (Figure 3). Reliabilities decreased as the level of disability approached the ceiling for both the OA-DISABILITY item bank and the HAQ-II; however, the OA-DISABILITY item bank reliability remained superior to the HAQ-II. At the extreme floor, reliability for the HAQ-II surpassed that of the OA-DISABILITY-CAT.
Figure 4 displays the precision of the 10-item OA-DISABILITY-CAT and the HAQ-II scales as measured by the conditional standard error of measurement statistic. The 10-item OA-DISABILITY-CAT demonstrated superior precision to the HAQ-II for much of the range of scores; however, at the very highest scores precision of the two instruments was similar, and in some cases the precision of the HAQ-II exceeded that of the OA-DISABILITY-CAT.
The results of these analyses revealed that the OA-DISABILITY item bank and the OA-DISABILITY-CAT scales performed well in this sample of persons with LE osteoarthritis. The full 33 item bank calibrated well with a unidimensional IRT model, providing similar breadth and, on average, more precise and reliable estimates of disability than the HAQ-II. The OA-DISABILITY item bank was similar to the HAQ-II with respect to breadth of coverage of the continuum of disability while the 10-item OA-DISABILITY-CAT showed improved reliability and precision throughout much (but not all) of the disability continuum as compared with the HAQ-II. Therefore, the OA-DISABILITY-CAT will be of particular benefit in providing precise and reliable measurement of disability with few items. Nonetheless, further improvements could be made to this scale at the ceiling.
The correlations found between the OA-DISABILITY instruments and the HAQ-II were very strong, (51, 52) and offer one indication that the OA-DISABILITY-CAT item bank provides a valid representation of the repercussions of LE osteoarthritis. In spite of the fact that the two scales have important differences, we would expect that there would be a trend toward increased OA-DISABILITY-CAT scores as HAQ-II scores increase.
In previous research, the HAQ-II has demonstrated strengths in breadth of coverage of the severity of the impact of arthritis. However, it includes items that measure both function (e.g. lift heavy objects) and disability (e.g. do outside work) within the same instrument. This poses a major barrier to investigation of the disablement process, where it is critically important to use distinct measures of impairments, functional limitations, and disability.(17) The incremental improvements of the OA-DISABILITY-CAT over the HAQ-II in terms of reliability and measurement precision are augmented by the considerable advantage of focusing specifically on disability. Instruments such as the OA-DISABILITY-CAT and the OA-FUNCTION-CAT (53) which strive for conceptual clarity allow research to uncover factors that impact the development or resolution of disability given that impairment and/or functional limitations have occurred.(14, 17)
Although preliminary, the results from the present study are encouraging and consistent with previous work indicating that the 10-item CATs will likely provide the opportunity to maximize psychometric properties with minimal data collection time and administrative burden. (34, 35) Further work is needed to ascertain the administrative burden and responsiveness to clinically meaningful change.
At this stage of the development of the item banks, we did not remove any items due to DIF. However, there were some interesting results. Taking part in a regular exercise program, doing low demand sports such as golfing or bowling and using public transportation were more difficult for older patients. Taking part in a regular exercise program was more difficult for those with hip osteoarthritis. These predictable patterns of differences support construct validity of our instrument.
DIF can be handled in several ways. One approach is to simply remove the items from the calibrated item bank and only use those without DIF. One disadvantage of this approach is that it may diminish the sensitivity of the resulting item banks and thus reduce the utility of the CAT instrument. An alternative approach would be to establish different sets of calibrations for hip and knee patients and incorporate them into future CAT applications. We are especially interested in pursuing this second approach in future research.
Several limitations of the research should be acknowledged. Although the sample used in this study was adequate, it was relatively small for an IRT analysis. One consequence is that the person and item standard errors were larger than might be desirable for wider uses of the item banks. Secondly, the effect of sample size on the number of unexpected responses for any particular item in the bank could potentially have led to erroneously labeling an item “fitting” when, with a larger sample, the opposite evaluation may be true. Finally, the impact of sample size for DIF analysis is that our results could have underestimated the presence of DIF. Clearly the structure of the OA-DISABILITY-CAT revealed in this study needs to be replicated in other samples with LE osteoarthritis.
In addition, real data simulations are based on the assumption that the answers to a subset of those items selected using CAT would be identical to the answers given if they were embedded in a larger fixed-form instrument, such as was administered to the calibration sample. Such simulations are likely good (but not perfect) approximations of actual CAT administrations and may overestimate the score agreement of CATs with the full item bank. Future research needs to examine the accuracy of CAT estimates in prospective studies.
This study revealed that the OA-DISABILITY item bank and 10-item OA-DISABILITY-CAT provided comparable or superior measurement properties compared to a widely used traditional measure in a sample of patients with LE osteoarthritis. The strong conceptual basis for this disability scale, combined with incremental improvements in reliability and precision compared to the HAQ-II, supports the OA-DISABILITY-CAT as a strong candidate for future measurement of osteoarthritis-related disability. Further work is needed to test the performance of the OA-DISABILITY-CAT prospectively. This preliminary study and the evolving body of work indicate that the CAT approach combined with IRT offers a viable solution to the longstanding conflict between the need for accuracy in clinical assessment and the equal need for practicality of administration.
Supported by the NIH R01 AR 051870 and 1F32HD056763 and an Independent Scientist Award (K02 HD45354-01) to Dr. Haley
In this section I will ask you about everyday activities you may have done over the past month. I will ask you to what degree you felt limited in doing each activity because of the arthritis in your legs.
For each activity, please choose from the following answers in describing how limited you felt, on an average day, during the past month, because of the arthritis in your legs:
Not at all limited (1)
A little (2)
A lot (3)
For those activities that you did not do, I want to know if that is because of the arthritis in your legs. (4)
If you did not do an activity for reasons unrelated to arthritis, select the response: “Did not do the activity for reasons other than the arthritis in my legs”. (5)
We are interested in learning how your illness affects your ability to function in daily life. Place an X in the box which best describes your usual abilities over the past week.
Without any difficulty (0)
With some difficulty (1)
With much difficulty (2)
Commercial Support/Conflicts Statement. Drs. Haley and Jette have stock interest in CRE Care LLC, which distributes the OA-DISABILITY-CAT Instrument products.