Sleep and wakefulness are fundamental neurobiological states regulated by homeostatic and circadian processes. Sleep and wake function in humans can be measured along many dimensions, including qualitative and quantitative aspects, as well as signs and symptoms of specific sleep disorders. Likewise, many different measurement tools are available: retrospective self-reports, prospective self-reports (sleep diaries), longitudinal measures of rest-activity patterns using wrist actigraphy, physiological recordings (polysomnography), and even functional imaging measures.
Among self-report measures, the Pittsburgh Sleep Quality Index (PSQI; Buysse et al., 1989) is the most widely used scale for sleep disturbance, and the Epworth Sleepiness Scale (ESS; Johns, 1991; Johns, 1992) is the most widely used measure of daytime sleepiness, each with over 500 literature citations. Because of the nature of its items and its component structure, however, individual items in the PSQI are not conducive to validation with modern psychometric techniques such as item response theory (IRT) models. As discussed below, IRT models are useful for selecting the items that provide the greatest information for describing a trait such as sleep disturbance. One consequence in the case of the PSQI is that the instrument discriminates relatively poorly at lower levels of severity because of the positive skew of its score distribution. The ESS also has limitations, the major one being that it assesses behaviors (e.g., falling asleep in daily situations) that may not apply to all respondents. Thus, given content and psychometric concerns with the PSQI and ESS, there is a critical need for improved patient-reported outcomes (PROs) of sleep and sleep-related impairment during wakefulness. Measures of this sort can be thought of as general “thermometers” that provide continuous, relative values for every individual in the population, rather than as condition-specific measures that categorize individuals based on a cut-off score.
The Patient-Reported Outcomes Measurement Information System (PROMIS™) is an NIH Roadmap initiative designed to improve PROs using state-of-the-art psychometric methods (e.g., models from IRT; for detailed information, see www.nihpromis.org). The PROMIS Sleep Disturbance (SD) and Sleep-Related Impairment (SRI) item banks were developed using a rigorous and systematic methodology, including literature reviews, qualitative item review, focus groups, cognitive interviewing, and psychometric testing using methods from both classical test theory (CTT) and IRT (Buysse et al., 2010). This work is the most ambitious attempt to date to apply IRT methods to self-report measures of sleep and sleep-related waking impairments. The final SD and SRI item banks contain 27 and 16 items, respectively. The SD and SRI item banks assess qualitative aspects of sleep and wake function. They do not include quantitative or time-based items and do not assess symptoms of specific sleep disorders. Thus, they function as generic measures appropriate for gauging the severity of sleep-wake problems on a continuum, applicable across a range of conditions.
As noted, the use of IRT models was critical to the development of all PROMIS scales, but the distinction between CTT and IRT methods deserves emphasis. CTT, also called true score theory, assumes that each observed score equals the individual's true score plus some error (Lord & Novick, 1968). The relationship among observed score, true score, and error underlies the concept of reliability, which is defined as the ratio of true score variance to observed score variance. Different approaches have been developed to estimate reliability: alternate-form reliability, examining the particular form of a test; test-retest reliability, examining the occasion of test administration; and internal consistency, examining the individual items of a test. Under the CTT framework, the standard error of measurement, which describes the expected score fluctuations due to error, is constant for all scores in the same population.
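The CTT quantities above can be made concrete with a short sketch. The snippet below computes an internal-consistency reliability estimate (Cronbach's alpha) and the single, constant standard error of measurement that CTT assigns to every score; the response matrix is invented purely for illustration.

```python
import math

# Hypothetical item responses (rows = respondents, columns = items),
# each scored 0-4 as on a typical PRO response scale.
scores = [
    [0, 1, 1, 0],
    [2, 2, 3, 2],
    [4, 3, 4, 4],
    [1, 0, 1, 1],
    [3, 4, 3, 2],
]

def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

n_items = len(scores[0])
item_vars = [variance([row[j] for row in scores]) for j in range(n_items)]
totals = [sum(row) for row in scores]
total_var = variance(totals)

# Cronbach's alpha: an internal-consistency estimate of reliability,
# i.e., of the ratio of true score variance to observed score variance.
alpha = (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Under CTT the standard error of measurement is one constant for the
# whole population, regardless of where a score falls on the scale.
sem = math.sqrt(total_var) * math.sqrt(1 - alpha)
```

This constancy of the standard error is precisely what IRT relaxes: as the next paragraphs describe, IRT lets measurement precision vary across the severity continuum.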
Unlike CTT, IRT refers to a class of psychometric techniques in which the probability of choosing each item response category is modeled as a function of a latent trait of interest. By convention, the latent trait is scaled along a dimension called theta (θ), which has a mean of 0 and a standard deviation of 1. Item discrimination and item difficulty are the two major parameters used to define IRT models and describe individual items. The item discrimination parameter (a), also called the slope parameter, determines the shape of the category response curves, with higher slope parameters yielding steeper curves; curves that are narrow and peaked indicate response categories that differentiate well across θ values. The item difficulty parameter (b), also called the threshold parameter, indicates the item's location on the θ scale and represents the θ level at which a respondent has a .50 probability of responding above the corresponding threshold.
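The roles of the a and b parameters can be sketched with the two-parameter logistic (2PL) response function, a standard IRT form; the parameter values below are invented for illustration.

```python
import math

def p_above_threshold(theta, a, b):
    """2PL item response function: probability that a person at severity
    theta responds above a threshold with location b and slope a."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# By definition of difficulty, the probability is exactly .50 at theta == b.
assert abs(p_above_threshold(1.0, a=2.0, b=1.0) - 0.5) < 1e-9

# Higher discrimination (a) makes the curve steeper around the threshold,
# i.e., the item separates nearby theta levels more sharply.
shallow = (p_above_threshold(0.5, a=0.8, b=0.0)
           - p_above_threshold(-0.5, a=0.8, b=0.0))
steep = (p_above_threshold(0.5, a=2.5, b=0.0)
         - p_above_threshold(-0.5, a=2.5, b=0.0))
assert steep > shallow
```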
The relationship between the probability of choosing a certain response category (e.g., never, rarely, sometimes, often, always) for a specific item and the underlying severity level can be described by a monotonically increasing (S-shaped) function called the item characteristic function (ICF). An ICF can be transformed into an item information curve, which indicates the amount of information a single item provides at each point along the severity (θ) scale. The individual item information curves can be summed to form a test information curve, which indicates the amount and precision of information the entire test provides at every point of θ (see the figures below for examples). Thus, the amount of information provided by a test may vary depending on the level of a respondent's severity of sleep disturbance or sleep-related impairment (θ). These standardized curves can be used to compare the measurement precision of two or more scales. In this paper, we compare the test information curves for the PROMIS SD and SRI item banks, the SD and SRI short forms, the PSQI, and the ESS.
Test Information Curves for the PROMIS Sleep Disturbance Full Item Bank, Short Form, Epworth Sleepiness Scale (ESS), and Pittsburgh Sleep Quality Index (PSQI)
Test Information Curves for the PROMIS Sleep-Related Impairment Full Item Bank, Short Form, Epworth Sleepiness Scale (ESS), and Pittsburgh Sleep Quality Index (PSQI)
IRT models permit investigators to evaluate the performance of a single item or subsets of items as well as the entire test. Different items are better at discriminating people having different levels on the continuum of severity. For instance, the question, “Do you fall asleep while watching TV in a dark room late at night” would identify a milder degree of sleepiness than the question, “Do you fall asleep while talking to other people during the daytime?” One practical application of this feature, which CTT cannot provide, is the ability to construct short forms or tailored assessments, using a subset of items selected to maximize precision along clinically important ranges of severity.
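The short-form construction idea in the paragraph above can be sketched as an item-selection rule: rank calibrated items by their average 2PL information over a clinically important θ range and keep the top few. The item identifiers and parameter values here are hypothetical.

```python
import math

def info(theta, a, b):
    """2PL item information at severity theta."""
    prob = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a ** 2 * prob * (1.0 - prob)

# Hypothetical calibrated bank: item id -> (discrimination a, difficulty b).
bank = {
    "i1": (1.2, -2.0),
    "i2": (2.4, 0.5),
    "i3": (1.8, 1.0),
    "i4": (0.9, 3.0),
}

# Target a clinically important severity range, e.g., theta in [0, 2].
grid = [0.0, 0.5, 1.0, 1.5, 2.0]

def avg_info(item_id):
    a, b = bank[item_id]
    return sum(info(t, a, b) for t in grid) / len(grid)

# Build a 2-item short form from the items most informative in that range.
short_form = sorted(bank, key=avg_info, reverse=True)[:2]
```

Items located far outside the target range (here, i1 and i4) contribute little information there and are dropped, which is exactly why a well-chosen short form can retain most of a bank's precision where it matters clinically.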
Another advantage of IRT is that individuals' θ estimates are independent of the specific items administered from a larger calibrated item bank. With this feature, IRT serves as the basis for computerized adaptive testing (CAT), a method that provides a unique sequence of items tailored to the individual's own severity (θ). CAT avoids administering test items that add little information to an individual's assessment. For instance, during CAT administration of the SD item bank, item S90 (I had trouble sleeping) might be administered first. S90 is a useful initial item because it has a high slope parameter (a), indicating high information content. If the individual endorses the most severe category (always), the CAT would be unlikely to choose item S116 (My sleep was refreshing) as the next item, because S116 mainly addresses a lower range of severity (indicated by its small threshold values). The net result is that CAT provides an extremely efficient method of PRO administration. For more information regarding technical issues in IRT methodology, see Embretson & Reise (2000). A more detailed description of the specific PROMIS analytic framework is available elsewhere (see Reeve et al., 2007).
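One step of the CAT logic described above can be sketched as "administer the unused item with maximum information at the current provisional θ estimate." The calibrations below are invented for illustration (the identifiers S90 and S116 echo the example in the text; S44 and all parameter values are hypothetical).

```python
import math

def info(theta, a, b):
    """2PL item information at severity theta."""
    prob = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a ** 2 * prob * (1.0 - prob)

# Hypothetical calibrations: item id -> (slope a, location b).
bank = {
    "S90": (3.0, 0.0),    # high slope: informative opening item
    "S116": (2.0, -1.5),  # targets the low-severity range
    "S44": (2.2, 1.8),    # targets the high-severity range
}

def next_item(theta_estimate, administered):
    """One CAT step: pick the not-yet-administered item that is most
    informative at the current provisional severity estimate."""
    remaining = {k: v for k, v in bank.items() if k not in administered}
    return max(remaining, key=lambda k: info(theta_estimate, *remaining[k]))
```

A respondent who endorses severe categories ends up with a high provisional θ, so this rule skips the low-range item in favor of a high-range one, which is the efficiency gain the text describes.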
Short form item selection order of PROMIS Sleep Disturbance item bank
Individual items from the IRT-calibrated SD and SRI item banks can be selected to create short forms for assessing SD and SRI. Short forms can be constructed adaptively in real time, with each item chosen on the basis of the respondent's answers to previous items, as in CAT. Alternatively, static short forms (i.e., containing a fixed set of items) can be created for administration without CAT, e.g., in paper-and-pencil format. In this study, we report on short form development from the PROMIS SD and SRI item banks. In particular, we report the performance of static 8-item short forms of the PROMIS SD and SRI in comparison with their full item banks and with the legacy measures, the PSQI and ESS.