Evaluating how a patient experiences a health care intervention has become increasingly important in a wide range of clinical studies, including pharmaceutical, behavioral, device, and procedural trials. Such experiences are typically captured via “patient-reported outcomes” (PROs). Although PROs have played an important role in major clinical trials and in labeling decisions by the Food and Drug Administration (FDA), the assessment of PROs has been problematic. For example, many PRO measurement instruments are long and burdensome for patients and research staff. Even rigorously designed instruments can miss the full range of patient experiences or be insensitive to change over time because of floor or ceiling effects. Furthermore, most therapeutic areas lack standardization: many similar PRO measures exist, but there is no standard metric, making it difficult to compare or combine scores across studies. For these reasons, the National Institutes of Health (NIH), under the NIH Roadmap theme of reengineering the clinical research enterprise (http://nihroadmap.nih.gov/), has identified better assessment of PROs as a pressing need. Accordingly, the NIH has funded the Patient-Reported Outcomes Measurement Information System Network (PROMIS, http://www.nihpromis.org/) to develop better measures of PROs for chronic diseases. The primary focus of this study is to assess clinical investigators’ perceptions of the utility of PRO measures in clinical trials and to anticipate perceived barriers to the adoption of new PRO measures and methods.
PROMIS, along with other recent initiatives (eg, [3-5]), is exploring a new method for improving PRO measurement based on item response theory (IRT). IRT is a psychometric framework that has become the standard in educational testing and is gaining popularity in health assessment because of its ability to generate shorter measures without compromising reliability or sensitivity. IRT models are appropriate when one is interested in measuring an unobserved trait (or construct) that presumably exists along a continuum, such as fatigue or physical functioning. For questions (or items) with multiple ordinal responses (eg, Not at all to Very much), IRT models describe the likelihood of selecting each of the response options as a function of the person’s continuous underlying trait. Once IRT models have been estimated for all of the items, one is able to do two things: (1) identify which items will provide the most information about different parts of the underlying continuum, and (2) translate a person’s responses to items into an estimate of that person’s status on the underlying continuum. Additional details about IRT have been described elsewhere [7-9].
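To make this concrete, the following minimal sketch computes response-option probabilities under one widely used IRT model for ordinal items, Samejima’s graded response model. The item wording, discrimination, and threshold values are invented for illustration and are not PROMIS calibrations.

```python
import math

def grm_probs(theta, a, thresholds):
    """Graded response model: probability of each ordinal response
    category for a person with latent trait level theta.

    a          -- item discrimination (slope)
    thresholds -- ordered category boundaries b_1 < ... < b_{k-1}
    Returns a list of k probabilities, one per response option.
    """
    # P(response >= category j) follows a 2-parameter logistic curve.
    cum = ([1.0]
           + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
           + [0.0])
    # Category probabilities are differences of adjacent cumulative curves.
    return [cum[j] - cum[j + 1] for j in range(len(cum) - 1)]

# Hypothetical fatigue item with 4 response options
# (Not at all / A little / Quite a bit / Very much):
probs = grm_probs(theta=0.0, a=1.5, thresholds=[-1.0, 0.0, 1.2])
print([round(p, 3) for p in probs])
```

A person higher on the fatigue continuum (larger theta) would place more probability on the higher response options, which is exactly the relationship the estimated item curves encode.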
IRT models allow individual questions to be linked, using psychometric principles, in an “item bank.” An item bank is a comprehensive collection of questions (and their response options) designed to measure an underlying construct (eg, fatigue) across its entire continuum. Creation of an item bank requires rigorous qualitative (eg, expert review, patient focus groups, and cognitive interviews) and quantitative (eg, large-scale field testing of candidate items) evaluation to ensure the items are comprehensible to patients, valid, and precise. Field-test data are then used to calibrate the items, that is, to estimate each item’s statistical properties. With an item bank, researchers have multiple options for instrument development, including creating a customized short form made up of a fixed set of items or administering a tailored test through computerized adaptive testing (CAT). Customized short forms are created by selecting the items that are most sensitive to the distribution of a given trait in the population under study. For example, investigators interested in measuring fatigue in patients with severe heart failure could select the items that are most sensitive to high levels of fatigue. Investigators studying milder heart failure might opt for a short form that includes items more sensitive to lower levels of fatigue. Because all of the items in the two short forms are from the same item bank, the fatigue scores estimated in both clinical studies will be on the same metric.
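The short-form idea above can be sketched as selecting, from a calibrated bank, the items that carry the most statistical (Fisher) information at the trait level of interest. The bank below uses a simple two-parameter logistic model with invented item names and calibrations purely for illustration.

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a two-parameter logistic item at trait level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def build_short_form(bank, target_theta, n_items):
    """Pick the n_items from the bank that are most informative at target_theta.

    bank -- dict mapping item id to (discrimination a, difficulty b)
    """
    ranked = sorted(bank,
                    key=lambda item: info_2pl(target_theta, *bank[item]),
                    reverse=True)
    return ranked[:n_items]

# Hypothetical fatigue bank with invented calibrations:
bank = {
    "worn_out":        (1.8, 1.5),   # sensitive to high fatigue
    "tired_after_nap": (1.5, 0.8),
    "low_energy":      (1.6, 0.0),
    "rarely_tired":    (1.4, -1.2),  # sensitive to low fatigue
}

# A severe heart failure study would target high fatigue levels:
print(build_short_form(bank, target_theta=1.5, n_items=2))
# A milder population would target lower levels:
print(build_short_form(bank, target_theta=-1.2, n_items=2))
```

The two studies end up with different fixed item sets, yet because both draw on the same calibrated bank, the resulting fatigue estimates sit on a common metric.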
CAT is a computer algorithm that selects successive items based on an individual’s responses to previously administered items. For example, if an initial item asked how difficult it was for the respondent to walk up a flight of stairs and the respondent’s answer was “extremely difficult,” this would suggest that the respondent’s physical functioning is on the low end. The CAT algorithm would then select a second question that asks about difficulty doing an easier activity, such as walking on flat ground. On the other hand, if the response to the initial item was “not at all difficult,” a second question might ask about a more difficult activity, such as running a mile. In this way, CAT is always searching for the next item that will provide the most unique information about the person. CAT is already used for a variety of standardized educational tests, including the Graduate Record Examination, the North American Pharmacist Licensure Examination, and the United States Medical Licensing Examination. Although CAT is a more sophisticated option than short forms, any option that uses items from the same item bank will produce comparable scores, even when different respondents answer different items. This is another key advantage of measures based on IRT.
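One common way a CAT engine realizes this item-by-item search is to re-estimate the trait after each response and then administer the unasked item with maximum information at that estimate. The sketch below illustrates that logic under a two-parameter logistic model with a crude grid-based trait estimate; the item names, calibrations, and scoring of responses are hypothetical, not the algorithm of any specific testing program.

```python
import math

def p_2pl(theta, a, b):
    """Probability of endorsing a 2-parameter logistic item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_theta(responses):
    """Crude expected-a-posteriori trait estimate on a grid, uniform prior.

    responses -- list of (a, b, endorsed) tuples for items answered so far,
                 where endorsed is 1 (eg, "not at all difficult") or 0.
    """
    grid = [g / 10.0 for g in range(-30, 31)]
    weights = []
    for t in grid:
        like = 1.0
        for a, b, x in responses:
            p = p_2pl(t, a, b)
            like *= p if x else (1.0 - p)
        weights.append(like)
    return sum(t * w for t, w in zip(grid, weights)) / sum(weights)

def next_item(bank, administered, theta):
    """Select the unasked item with maximum Fisher information at theta."""
    candidates = [i for i in bank if i not in administered]
    return max(candidates,
               key=lambda i: (bank[i][0] ** 2)
                             * p_2pl(theta, *bank[i])
                             * (1.0 - p_2pl(theta, *bank[i])))

# Hypothetical physical functioning items (discrimination, difficulty):
bank = {"stairs": (1.7, 0.0), "flat_ground": (1.5, -1.5), "run_mile": (1.6, 1.5)}

# "Extremely difficult" on the stairs item -> item not endorsed (low functioning):
theta = eap_theta([(1.7, 0.0, 0)])
print(next_item(bank, {"stairs"}, theta))  # selects the easier activity
```

Running the same loop with the stairs item endorsed ("not at all difficult") pushes the trait estimate upward and selects the harder activity instead, mirroring the walking versus running example in the text.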
Widespread use of IRT item banks could be advantageous to health outcome assessment. However, an important dimension of the effectiveness of this technology is adoption by the end user community—clinical trialists. The purpose of this study was twofold: (1) to evaluate a brief tutorial designed as a basic introduction to IRT-based item banks, and (2) to elicit investigators’ questions and concerns about both current and IRT-based PRO measurement strategies in clinical trial settings. Understanding these concerns is valuable for promoting the use of item banks and CAT in clinical trial settings. This information will also be helpful for sponsors who may want to learn more about the issues and challenges of using PRO item banks, as well as for developers of item banks, who bear the burden of educating both potential users and relevant stakeholders (such as funding agencies and regulatory boards) on the merits of this new technology.