The fit between the severity distribution of the study population and the range of coverage of the instrument is centrally relevant to responsiveness, standard deviation of change, and sample size requirements. An instrument with items about eating, dressing, and grooming will not be sensitive to change in a population of college athletes, since most will be able to do these activities easily. An instrument built around strenuous and difficult items about walking and climbing stairs will not perform well in a population with longstanding RA, since most will have difficulty in many categories.
To illustrate the role of floor and ceiling effects in reducing study power, Figure 2 shows data for over 18,000 PROMIS participants from these six instruments, drawn from an approximately normal population [10]; such plots are sometimes termed "boat diagrams". The horizontal axis represents different functional abilities, with zero representing the population mean, each unit to the left representing one SD below the mean, and each unit to the right representing one SD above the mean. The vertical axis represents the standard error (instrument reliability), a sensitivity criterion, shown with reference reliabilities of 0.90 and 0.95. Instruments yield information curves that are more informative at some physical impairment (theta) levels than others. Above the population mean, all current instrument curves rapidly lose their sensitivity and rise steeply, representing a ceiling effect. Below the mean, some curves lose power at about three SDs, while others maintain sensitivity to beyond four SDs. Depending on the severity distribution in a sample, sensitivity for a given instrument will be higher or lower. The improvement observed with Item-Improved and IRT-based instruments is largely due to better coverage of areas of lesser impairment.
Figure 2. Instrument sensitivity and disease severity. Physical Function, on the horizontal scale, is mapped against sensitivity (reliability) on the vertical scale. A better scale has a greater breadth of Physical Function ability. The lower the curve, the greater ...
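The link between an information curve, the standard error, and the reference reliabilities can be sketched numerically. The following is a minimal illustration only, not the PROMIS calibration: it assumes a simple two-parameter logistic (2PL) model with dichotomous items, whereas the instruments discussed here use multi-category responses, and the item parameters are hypothetical. It also assumes the common convention that theta is scaled to unit variance, so that reliability is approximately 1 - SE^2 (0.90 corresponds to an SE of about 0.32, and 0.95 to about 0.22).

```python
import math


def item_information(theta, a, b):
    """Fisher information of one 2PL item at ability theta.

    a is the discrimination and b the difficulty (both hypothetical here).
    """
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def standard_error(theta, items):
    """Standard error of the theta estimate: 1 / sqrt(total information)."""
    info = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(info)


def se_for_reliability(reliability):
    """SE matching a target reliability, assuming theta has unit variance."""
    return math.sqrt(1.0 - reliability)


# Hypothetical 20-item form: equal discriminations, difficulties spread
# from 3 SD below the mean to 0.5 SD above it (poor ceiling coverage).
items = [(2.5, -3.0 + i * 3.5 / 19) for i in range(20)]

se_90 = se_for_reliability(0.90)
se_95 = se_for_reliability(0.95)

for theta in (-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0):
    se = standard_error(theta, items)
    if se <= se_95:
        band = "reliability >= 0.95"
    elif se <= se_90:
        band = "reliability >= 0.90"
    else:
        band = "reliability < 0.90"
    print(f"theta={theta:+.1f}  SE={se:.3f}  {band}")
```

Plotting the SE against theta for such a form gives the "boat" shape described above: a flat, low region where the item difficulties are concentrated, and steeply rising sides where no items are targeted.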
The Item-Improved HAQ outperforms the Item-Improved PF-10 when functional abilities are more than 1.4 SD units below the population mean, while the opposite is true when abilities are better (Figure 2). The PROMIS PF 20 combines these attributes and has the greatest sensitivity across the widest range of Physical Function.
Item-improvement processes and use of IRT-based items can lead to improved instrument performance. In turn, this can make clinical research more efficient and less costly by reducing required sample sizes. The Legacy PF-10 is the most extensively used Physical Function scale and is a valid benchmark for assessing change. However, it is limited, having only 10 items and three response options, which render it less sensitive than 20-item scales or scales with five response options.
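The effect of responsiveness on required sample size can be illustrated with the standard normal-approximation formula for comparing two group means, n per group = 2(z_{1-alpha/2} + z_{1-beta})^2 / d^2, where d is the standardized effect size. The sketch below is only an illustration under that assumption; the two effect sizes are hypothetical values chosen to show the direction of the effect and are not results from this study.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects per arm for a two-sided, two-sample comparison.

    Uses the normal approximation; a t-based calculation gives slightly
    larger numbers for small samples.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_beta = z.inv_cdf(power)
    return math.ceil(2.0 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


# Hypothetical standardized effect sizes for the same treatment measured
# with a less responsive and a more responsive instrument.
less_responsive = 0.30
more_responsive = 0.36

n_less = n_per_group(less_responsive)
n_more = n_per_group(more_responsive)
print(f"d={less_responsive:.2f}: {n_less} per group")
print(f"d={more_responsive:.2f}: {n_more} per group")
print(f"relative reduction: {1.0 - n_more / n_less:.0%}")
```

Because the required sample size scales with 1/d^2, even a modest gain in the standardized effect size translates into a proportionally larger saving in subjects.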
The Item-Improved HAQ performed essentially as well as the PROMIS PF 20 in the RA patients studied. This initially unexpected result was partly due to identical response options in the two instruments and to a number of shared items. More importantly, however, these RA patients had average disease severity one to two SDs below the mean, where the HAQ is most sensitive. The PROMIS PF 20 is strongest in comparison to the HAQ in the one-half SD range near the population mean (Figure 2), a population not included here. The PROMIS PF 20, therefore, may be expected to outperform the HAQ in such populations [5
The Item-Improved PF-10 outperformed the Item-Improved HAQ in normal populations. Of historical interest, the Legacy HAQ (perhaps not accidentally) has generally been utilized for RA and other serious chronic illness, and the PF-10 for more normal populations. The IRT-based PROMIS PF 20, considering all levels of impairment, outperforms the other instruments.
PROMIS and other CAT applications, better at estimating function at the extremes, should provide substantial further improvement [8]. PROMIS is presently investigating CAT applications. It appears likely that the full potential of CAT applications requires calibration of additional items at the floor and at the ceiling to further improve the range of coverage. These items are presently being evaluated. Moreover, CAT requires electronic administration in real time, and the logistics are more cumbersome than with traditional pencil and paper administrations.
Theoretically, a nearly equivalent result may be obtained by using brief forms generated from the same calibrated item bank and tailored for strong reliability in a particular severity (theta) range matched to the study population. Such brief forms may also be developed through simulated CAT research, where the paths most frequently chosen identify the best items [8
These data raise the issue of floor and ceiling effects [9]. PROMIS has defined the physical function domain as "Physical Function" rather than "Disability", expanding the conceptual basis to both increments in ability and decrements in ability. The predominant "gaps" in the present PROMIS PF item bank are in "ceiling" activities above the population average and at the "floor" represented by institutionalized populations. The more immediate problem is the ceiling, where more than 20% of these RA patients had zero HAQ scores. In healthy aging studies, over 50% may be at this ceiling. This results in a substantial loss of information and reduces study power [22]. Yet, we must be able to assess Physical Function above the mean to study normal or nearly normal subjects where detection of improvement is possible only if the scale permits above average scores.
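The loss of information at the ceiling can be made concrete with a small simulation. This is a sketch under stated assumptions, not an analysis of the study data: latent function is assumed to be standard normal, every subject is assumed to improve by the same hypothetical amount (0.3 SD), and the instrument is assumed to record any ability above its ceiling as the ceiling score. With the ceiling placed at the population mean, about half of the subjects start at the ceiling, in line with the healthy aging scenario mentioned above.

```python
import random


def observed_mean_change(ceiling, shift=0.3, n=20000, seed=12345):
    """Mean change recorded by an instrument that cannot score above `ceiling`.

    Latent baseline ability is N(0, 1); every subject truly improves by
    `shift` SD. All parameter values are hypothetical.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        before = rng.gauss(0.0, 1.0)
        after = before + shift
        # Scores above the ceiling are recorded as the ceiling value.
        total += min(after, ceiling) - min(before, ceiling)
    return total / n


no_ceiling = observed_mean_change(ceiling=float("inf"))
at_mean = observed_mean_change(ceiling=0.0)

print(f"true improvement recorded without a ceiling: {no_ceiling:.3f}")
print(f"improvement recorded with a ceiling at the mean: {at_mean:.3f}")
```

Subjects who begin at or above the ceiling contribute no recorded change, so the observed mean improvement is much smaller than the true one, and more subjects are needed to detect it.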
Additional advantages accrue to PROMIS instruments that strengthen our confidence in recommending their use. The PROMIS process [12], beginning with item improvement, is directed at enabling instruments that are more patient-centered and validly translatable, have better clarity in diversely educated groups, have less Differential Item Functioning (DIF) across subgroups, and are focused on supporting efficiency in clinical research. We, historically involved with the HAQ and PF-10, currently recommend the PROMIS PF 20 as the best available instrument for clinical studies with Physical Function endpoints.
To our knowledge, this is the first randomized study of alternative PRO instruments to document positive results with IRT-based instruments over traditional ones. We are all used to studies comparing treatments, not instruments, where effect sizes are often 0.6 or higher, rather than 0.06. This requires a different perspective, since improved PRO instruments are not interventions, but are more precise outcome measurement tools, much like a more precise sedimentation rate or more precise measurement of blood pressure. The improved precision effect may be smaller than the Minimally Clinically Important Differences (MCID) for interventions. Improved instruments do, however, make it easier to detect whether an MCID is present or not. This, in turn, results in more efficient clinical research, requiring fewer subjects, fewer centers, shorter recruitment periods, reduced time to study completion, and easier monitoring of the trial, all very important considerations.