The Patient-Reported Outcomes Measurement Information System (PROMIS®) is an NIH Roadmap initiative designed to improve self-reported outcomes using state-of-the-art psychometric methods (for detailed information, see
www.nihpromis.org). Adapting the
World Health Organization’s (2007) tripartite framework of physical, mental, and social health, PROMIS has developed and calibrated item banks assessing emotional distress, pain, fatigue, sleep disturbance, physical functioning, and social participation (
Buysse et al., 2010;
Cella et al., 2010;
Cella, Yount, et al., 2007;
Fries, Cella, Rose, Krishnan, & Bruce, 2009;
Revicki et al., 2009). It is the most ambitious attempt to date to apply models from item response theory (IRT) to health-related assessment. The PROMIS approach involves iterative steps of comprehensive literature searches; item pooling; development of conceptual frameworks; qualitative assessment of items using expert review, focus groups, and cognitive interviewing; and quantitative evaluation of items using techniques from both classical test theory and IRT (see
Cella et al., 2010;
Cella, Gershon, Lai, & Choi, 2007;
Reeve et al., 2007). We report here on the development and calibration of three item banks capturing the most prominent aspects of emotional distress: depression, anxiety, and anger. We also discuss the conceptual and psychometric challenges that arise when IRT models are applied to constructs involving psychological symptoms and psychopathology (as contrasted with cognitive skills or aptitudes).
The use of models from IRT to refine measures of psychological symptoms and psychopathology has a 30-year history (
Bejar, 1977;
D. C. Clark, vonAmmon Cavanaugh, & Gibbons, 1983;
de Jong-Gierveld & Kamphuis, 1985;
Gibbons, Clark, vonAmmon Cavanaugh, & Davis, 1985;
Thissen & Steinberg, 1983), although the conceptual basis for IRT has a longer lineage (see
Bock, 1997). The abundance and heterogeneity of measures of emotional distress have made it difficult to determine the comparability of individual scales. Variations in both item content and context (e.g., time frame, response scales, number of response options) further complicate this task. The use of IRT models provides one approach for calibrating different scales on the same metric, but, to date, most applications of IRT to measures of emotional distress have been conducted with single instruments (e.g.,
Bernstein, Rush, Thomas, Woo, & Trivedi, 2006;
Carmody et al., 2006;
Gollwitzer, Eid, & Jurgensen, 2005;
Pallant & Tennant, 2007). Such analyses are informative about the items from those instruments, but they fail to take full advantage of the potential of IRT-calibrated item banks derived from a comprehensive review of measures assessing the construct of interest (
Chang & Reeve, 2005). Item banking and calibration of a large set of items using IRT models can provide a more thorough assessment of a construct, using a common metric that also makes possible the linking of scores between measures used in clinical trials, observational research, and epidemiological studies.
IRT-calibrated item banks underlie the use of computerized adaptive testing (CAT;
Kingsbury & Weiss, 1980,
1983;
McBride & Martin, 1983;
Weiss, 2004) in which the presentation of items is tailored individually to respondents and their levels of the latent construct. The result is an efficient procedure for reducing both the total number of items administered and the measurement error following the administration of each successive item (
Gibbons et al., 2008;
Lord, 1980;
Weiss, 1985,
2004). Simulation studies indicate that CAT using as few as five polytomous items can achieve excellent precision and that scores derived from CAT correlate strongly with the conventional total score from a measure (
Bjorner, Chang, Thissen, & Reeve, 2007;
Choi, Reise, Pilkonis, Hays, & Cella, 2010;
Choi & Swartz, 2009;
Gardner et al., 2004;
Gardner, Kelleher, & Pajer, 2002). Several studies reporting IRT-calibrated CAT administration of single measures of emotional distress are available (e.g.,
Fliege et al., 2005;
Forbey & Ben-Porath, 2007;
Gibbons et al., 2008;
Waller & Reise, 1989;
Walter et al., 2007). The goal of PROMIS, however, is to move beyond IRT analysis of individual instruments to create item banks that provide a comprehensive profile of health status (including physical, mental, and social health), that are psychometrically sound, and that are publicly available on the Internet (
Revicki & Sloan, 2007).
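The adaptive logic described above can be sketched in a few lines of code. The example below is an illustration only, not the PROMIS software: it assumes a graded response model for five-category items, expected a posteriori (EAP) scoring with a standard normal prior, maximum-information item selection, and a stopping rule based on the posterior standard deviation. The item parameters in `BANK`, the function names, and the default values of `max_items` and `se_target` are all hypothetical.

```python
import math
import random

# Hypothetical 5-category items: (discrimination a, four ordered thresholds b).
# These values are invented for illustration and are not PROMIS calibrations.
BANK = [
    (2.8, [-0.5, 0.3, 1.0, 1.8]),
    (2.2, [-1.0, -0.2, 0.6, 1.4]),
    (3.1, [0.0, 0.7, 1.3, 2.0]),
    (1.9, [-1.5, -0.6, 0.2, 1.0]),
    (2.5, [0.4, 1.0, 1.7, 2.4]),
    (2.0, [-0.8, 0.0, 0.9, 1.6]),
    (2.7, [-0.2, 0.5, 1.2, 2.2]),
    (1.7, [-2.0, -1.0, 0.0, 0.8]),
]
GRID = [-4.0 + 0.1 * i for i in range(81)]  # quadrature points for theta


def cumulative(theta, a, bs):
    """P(X >= k) for k = 0..m under a graded response model (an assumption)."""
    return [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]


def category_probs(theta, a, bs):
    """Probability of each response category at a given theta."""
    cum = cumulative(theta, a, bs)
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]


def item_information(theta, a, bs):
    """Fisher information of one polytomous item at theta."""
    cum = cumulative(theta, a, bs)
    info = 0.0
    for k in range(len(bs) + 1):
        p = cum[k] - cum[k + 1]
        dp = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p > 1e-12:
            info += dp * dp / p
    return info


def eap(responses, bank):
    """Posterior mean and SD of theta given (item_index, category) pairs."""
    post = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # standard normal prior, unnormalized
        for idx, cat in responses:
            w *= category_probs(t, *bank[idx])[cat]
        post.append(w)
    total = sum(post)
    mean = sum(t * w for t, w in zip(GRID, post)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, post)) / total
    return mean, math.sqrt(var)


def run_cat(bank, answer, max_items=5, se_target=0.3):
    """Administer items one at a time, re-estimating theta after each response."""
    administered = []
    theta, se = eap(administered, bank)
    while len(administered) < max_items and se > se_target:
        used = {i for i, _ in administered}
        remaining = [i for i in range(len(bank)) if i not in used]
        if not remaining:
            break
        # Pick the unused item that is most informative at the current estimate.
        best = max(remaining, key=lambda i: item_information(theta, *bank[i]))
        administered.append((best, answer(best)))
        theta, se = eap(administered, bank)
    return theta, se, administered


def simulate_response(theta, a, bs, rng):
    """Draw one response category for a simulated respondent."""
    probs = category_probs(theta, a, bs)
    return rng.choices(range(len(probs)), weights=probs)[0]


rng = random.Random(0)
true_theta = 1.0  # simulated respondent, one SD above the prior mean
theta_hat, se, given = run_cat(
    BANK, lambda i: simulate_response(true_theta, *BANK[i], rng)
)
print(f"{len(given)} items administered, theta = {theta_hat:.2f}, SE = {se:.2f}")
```

Each pass through the loop in `run_cat` corresponds to one tailored item: the estimate of the latent construct is updated, and the next item is chosen to be most informative at that updated estimate, which is why the posterior standard deviation shrinks as items accumulate.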
Applying IRT models to the measurement of emotional distress involves at least two major challenges: addressing issues of dimensionality and accommodating the asymmetrical nature of the constructs. With regard to the first issue, the conventional wisdom is that traditional tests of ability in the educational literature (e.g., measures of verbal and mathematical proficiency, for which IRT has been used most often) are more likely to fit unidimensional models than are scales of emotional distress (
Gibbons et al., 2008). Instruments assessing emotional distress often sample items from multiple domains (e.g., mood, cognition, behavior, somatic symptoms) to capture a comprehensive set of manifest indicators of the latent construct. Therefore, it is common to observe higher correlations within domains than is expected under the conditional independence assumption of a unidimensional IRT model (
Bjorner et al., 2007;
Steinberg & Thissen, 1996). One of the goals of the current work was to begin with multidimensional conceptual frameworks that were informed by previous empirical (e.g., factor analytic) work and then to shape carefully the most informative unidimensional scales that could be derived for the constructs (
Reise, Moore, & Haviland, in press). This process involved both conceptual and psychometric decisions that produced, in an iterative way, the final content of the item banks (a key issue that we discuss below).
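For reference, the conditional (local) independence assumption of a unidimensional model can be stated formally. The notation is introduced here only for illustration: X_i denotes the response to item i, θ the single latent construct, and n the number of items.

```latex
P(X_1 = x_1, \ldots, X_n = x_n \mid \theta) \;=\; \prod_{i=1}^{n} P(X_i = x_i \mid \theta)
```

Under this assumption, responses to any two items are uncorrelated once θ is held fixed. Items drawn from the same domain (e.g., two somatic symptoms) that remain correlated after conditioning on θ violate it:

```latex
\operatorname{Cov}(X_i, X_j \mid \theta) \neq 0 \quad \text{for some } i \neq j
```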
The second challenge is the asymmetrical nature of constructs for emotional distress reflected in positively skewed distributions in the general population. In an IRT context, skewed distributions may lessen confidence in parameter estimates (because of the limited representation of high-threshold response choices, even in large samples). Test information functions may become peaked and truncated, with a relatively narrow bandwidth (
Reise & Waller, 2009). Because our aim was to create a standard frame of reference for outcomes measurement in diverse settings for clinical research (including clinical trials, observational research, and epidemiological studies), we assembled a sample that was representative of the full range of severity of emotional distress. For this purpose, we relied heavily on an Internet panel, enriched with clinical participants from the PROMIS research sites. The Internet panel provided us with community participants who varied across the full spectrum of health (from no self-reported medical or psychiatric conditions to multiple, comorbid conditions); the clinical samples provided us with bona fide patients who were the most likely to endorse items at the highest levels of severity (see
Rothrock et al., 2010).
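The idea of a peaked, truncated information function can be made concrete with a small numerical sketch. It assumes a graded response model and an invented bank (`SKEWED_BANK`) whose thresholds all lie at or above the population mean, as might happen when few respondents endorse the higher response categories. The function names, parameter values, and the information target of 10 (a standard error of about 0.32 on a metric with unit variance) are hypothetical choices, not values taken from PROMIS.

```python
import math


def item_information(theta, a, bs):
    """Fisher information of one graded-response item at theta."""
    # Cumulative probabilities P(X >= k), padded with 1 and 0 at the ends.
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    info = 0.0
    for k in range(len(bs) + 1):
        p = cum[k] - cum[k + 1]
        dp = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p > 1e-12:
            info += dp * dp / p
    return info


def bank_information(theta, bank):
    """Information of the whole bank: the sum of its item informations."""
    return sum(item_information(theta, a, bs) for a, bs in bank)


def precise_range(bank, target=10.0):
    """Lowest and highest grid points where bank information meets the target."""
    grid = [-4.0 + 0.1 * i for i in range(81)]
    hits = [t for t in grid if bank_information(t, bank) >= target]
    return (hits[0], hits[-1]) if hits else None


# Hypothetical bank: eight 5-category items, all thresholds at or above theta = 0.
SKEWED_BANK = [(2.5, [0.1 * i + 0.7 * k for k in range(4)]) for i in range(8)]

for theta in (-2.0, 0.0, 1.0, 2.0):
    info = bank_information(theta, SKEWED_BANK)
    print(f"theta = {theta:+.1f}  information = {info:.2f}")
print("range meeting target:", precise_range(SKEWED_BANK))
```

In this invented bank, information is concentrated above the mean and falls off sharply below it, so scores for respondents with little distress would carry large standard errors. A calibration sample that covers the full range of severity is one way to support thresholds, and therefore information, across a wider bandwidth.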
Most attempts to measure emotional distress begin with moderate to marked indicators of severity, often in the service of identifying people in need of treatment. In the case of depression, these indicators include explicit feelings of sadness, hopelessness, helplessness, and worthlessness recognized as different from one’s usual emotional experience and associated with functional impairment. In this context, the appropriate content (e.g., boredom, wistfulness, nostalgia) and meaning of low-threshold items indicating minimal depression may be problematic. One risk is that the nature of such items may be sufficiently different that they no longer tap the same construct as items of greater severity. For example, items that capture low levels of depression may overlap with transient negative affect, which is universal and may not be as informative specifically about depression. In the current work, we did not alter the conventional content of items assessing emotional distress but rather relied on the use of five-level response options to try to capture low levels of distress—in this case, by endorsements of “never” or “rarely.” Five response choices appear to be a satisfactory number for polytomous items, with the goal of creating items (and scales) that have as large an effective range of measurement as possible (
Hawthorne, Mouthaan, Forbes, & Novaco, 2006;
Roberson-Nay, Strong, Nay, Beidel, & Turner, 2007).
In summary, this article describes the steps we took to develop comprehensive item pools, to evaluate these items using qualitative methods (focus groups and cognitive interviewing), to administer a reduced number of items to a large calibration sample, and to fit IRT models to the resulting data. The final products were IRT-calibrated banks (suitable for CAT) of 28, 29, and 29 items for depression, anxiety, and anger, respectively, and short forms of seven to eight items that provide information comparable to legacy measures containing more items.