This study showed that the OA-FUNCTION-CAT functional difficulty and functional pain item banks and CAT scales have strong psychometric properties in this sample of persons with hip and/or knee OA. The full 125-item banks calibrated well with a uni-dimensional IRT model, providing greater breadth and more precise, more accurate, and more reliable estimates of functional difficulty and functional pain than the WOMAC. CAT performance remained close to that of the full item bank and superior to that of the WOMAC.
High correlations of the OA-FUNCTION-CAT item bank and the simulated CATs with the WOMAC can be viewed as one indication of the validity that the OA-FUNCTION-CAT item bank provides in characterizing the functional consequences of hip or knee OA. While the WOMAC has demonstrated acceptable measurement properties in this population, it was noted with a Rasch analysis that the items congregated in the center of the ability range, with several redundancies [38]. This is not surprising since focusing items in the mid range has been a common approach to the coverage/practicality dilemma within traditional survey construction methods. Our assessment of the breadth of the item banks indicated that the functional difficulty and functional pain item banks improved significantly on the content and scale coverage of the WOMAC.
Indeed, all measures of performance used in this study, including those for reliability and precision, showed improved function of the OA-FUNCTION-CAT over the WOMAC. More specifically, because of the focused effort to improve coverage, the greatest gains were achieved at the high end of the scales. Therefore, the OA-FUNCTION-CAT might be of particular benefit in capturing change among symptomatic patients at either end of the functional difficulty or functional pain domains. However, further improvements could be made to minimize the remaining ceiling effect noted in our analyses.
The results of these analyses are encouraging and consistent with prior studies indicating that 10-item CATs can decrease the time required for data collection while enhancing psychometric properties [35]. However, these results are preliminary. Future research is needed to assess the administrative burden and the ability of the OA-FUNCTION-CAT to detect smaller clinical and patient-relevant differences between groups and over time.
Our analyses with regard to DIF revealed some interesting results. Lifting heavy objects was more difficult for women than for men, and women reported greater levels of pain when lifting 25-pound objects than did men. Men reported more pain in getting clothes off, and women generally had more difficulty getting in and out of trucks or SUVs than did men. Subjects with knee OA had more difficulty than those with hip OA with items that involved stairs and squatting and kneeling. Those with both hip and knee OA had more difficulty with rolling and moving in bed and lower extremity self-care tasks than those who had only one joint affected (hips or knees). Adults with hip OA had more difficulty with moderate lifting than those with either knee or both joints affected. These predictable patterns of differences across different joint conditions suggest construct validity of our instrument.
The number of items that showed DIF by site of arthritis (hip versus knee) was relatively small for the functional domain but greater for the pain domain. These DIF findings revealed that the level of pain in certain activities (for example, climbing stairs, squatting, and kneeling) appears to be greater in the knee patients than in the hip patients. These results, if replicated in future research, may justify the development of separate calibrations for those items with DIF within different sites of OA. Given the limited sample size of patients in each type of OA in this sample, we did not feel justified in creating separate calibrations by site of arthritis. This is an issue that can be addressed in future research.
There are several alternatives for handling DIF. One approach is to remove the items demonstrating DIF, leaving only those without DIF in the item bank. Unfortunately, this may eliminate items that contribute to the sensitivity and content validity of the resulting item banks. Another alternative would be to develop separate sets of calibrations for hip and knee patients and for males and females and incorporate them into future CAT applications. This is an approach that we consider interesting for potential future research.
Several limitations of this research, including potential limits to the generalizability of this predominantly white, highly educated sample and a rather modest sample size for these analyses, should be acknowledged. In addition, different ethnic group ancestry was not examined. Given the level of CFI/RMSEA values, the structure of the OA-FUNCTION-CAT revealed in this study needs to be replicated in other samples with other sites of lower extremity OA. Similarly, a sample size of 328 subjects for these IRT analyses is acceptable if not ideal. A relatively small sample size has two consequences. First, the person and item standard errors are larger than might be optimal for broader application of the item banks. Second, the effect of the relatively small number of unexpected responses for any particular item is more pronounced in a small sample, potentially leading to erroneously labeling an item as 'fitting'. For a two-parameter IRT model, it has been shown that a graded response model can be estimated based on 250 or more subjects [39]. From the item parameter recovery point of view, evidence suggests that increasing the number of items to be analyzed has little effect on the item parameter recovery but that increasing the number of categories will increase the error variance of the parameter estimates [40]. Given our relatively small number of categories (four), the sample size for these analyses is adequate.
Simulations of CAT scores, such as those used in this study, are possible whenever datasets include responses to all items in an item pool. Simulations are based on the assumption that the answers to a subset of those items selected using CAT would be identical to the answers given when embedded in a larger fixed-form instrument. Simulations are approximations of actual CAT administrations, and although they are likely to be good estimates, they may overestimate agreement between CAT and full item bank scores. Another area for future research is to assess the accuracy of CAT estimates in prospective studies.
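The logic of such a simulation can be illustrated with a minimal sketch. It is not the software used in this study: it assumes a graded response model with four response categories, maximum-information item selection, EAP scoring on a fixed grid with a standard normal prior, and a 10-item stopping rule. The item parameters and responses are synthetic, and all function names (`category_probs`, `item_information`, `simulate_cat`, `full_bank_score`) are hypothetical. The key step is that each "administered" item's answer is looked up from the stored full-bank responses rather than asked anew.

```python
import numpy as np

GRID = np.linspace(-4.0, 4.0, 161)   # grid of candidate trait levels (assumed range)
PRIOR = np.exp(-0.5 * GRID**2)       # unnormalized standard normal prior (assumption)

def category_probs(theta, a, b):
    """Graded response model: P(category k | theta) for one item, shape (n_theta, K)."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    cum = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))  # P(X >= k)
    bounds = np.hstack([np.ones((theta.size, 1)), cum, np.zeros((theta.size, 1))])
    return bounds[:, :-1] - bounds[:, 1:]

def item_information(theta, a, b):
    """Fisher information of one graded response item at a scalar theta."""
    cum = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    bounds = np.concatenate([[1.0], cum, [0.0]])
    slope = bounds * (1.0 - bounds)
    probs = np.clip(bounds[:-1] - bounds[1:], 1e-12, None)  # guard against division by ~0
    return a**2 * np.sum((slope[:-1] - slope[1:]) ** 2 / probs)

def eap(posterior):
    """Expected a posteriori estimate of theta from an unnormalized posterior on GRID."""
    weights = posterior / posterior.sum()
    return float(np.sum(weights * GRID))

def simulate_cat(responses, a, b, n_items=10):
    """Replay a CAT using one person's stored responses to the full item bank."""
    posterior = PRIOR.copy()
    theta, used = 0.0, []
    for _ in range(n_items):
        # Select the unused item that is most informative at the current estimate.
        info = [-np.inf if j in used else item_information(theta, a[j], b[j])
                for j in range(len(a))]
        j = int(np.argmax(info))
        used.append(j)
        # Look up the stored answer instead of administering the item again.
        posterior = posterior * category_probs(GRID, a[j], b[j])[:, responses[j]]
        theta = eap(posterior)
    return theta, used

def full_bank_score(responses, a, b):
    """Score based on every item in the bank, for comparison with the CAT score."""
    return simulate_cat(responses, a, b, n_items=len(a))[0]

# Synthetic 30-item bank and one synthetic respondent (illustrative values only).
rng = np.random.default_rng(0)
n_bank = 30
a = rng.uniform(1.0, 2.5, n_bank)                       # discriminations
b = np.sort(rng.normal(0.0, 1.0, (n_bank, 3)), axis=1)  # ordered thresholds, 4 categories
true_theta = 0.5
responses = np.array([rng.choice(4, p=category_probs(true_theta, a[j], b[j])[0])
                      for j in range(n_bank)])

cat_theta, administered = simulate_cat(responses, a, b)
full_theta = full_bank_score(responses, a, b)
print(f"CAT estimate: {cat_theta:.2f}, full-bank estimate: {full_theta:.2f}")
```

Because the CAT score and the full-bank score are computed from the same stored answers, agreement between them in this kind of replay may be higher than it would be if the CAT were administered independently.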