

HHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal
Pain. Author manuscript; available in PMC 2010 November 1.
Published in final edited form as:
PMCID: PMC2775487

Development and Psychometric Analysis of the PROMIS Pain Behavior Item Bank


The measurement of pain behavior is a key component of the assessment of persons with chronic pain; however, few self-reported pain behavior instruments have been developed. We developed a pain behavior item bank as part of the Patient-Reported Outcomes Measurement Information System (PROMIS). For the Wave I testing, because of the large number of PROMIS items, a complex sampling approach was used in which participants were randomly assigned to respond either to two full item banks or to multiple 7-item blocks of items. A web-based survey was designed and completed by 15,528 members of the general population and 967 individuals with different types of chronic pain. Item response theory (IRT) analysis models were used to evaluate item characteristics and to scale both items and individuals on the pain behavior domain. The pain behavior item bank demonstrated good fit to a unidimensional model (Comparative Fit Index = 0.94). Several iterations of IRT analyses resulted in a final 39-item pain behavior bank, and different IRT models were fit to the total sample and to those participants who experienced some pain. The results indicated that these items demonstrated good coverage of the pain behavior construct. Pain behavior scores were strongly related to pain intensity and moderately related to self-reported general health status. Mean pain behavior scores varied significantly by groups based on pain severity and general health status. The PROMIS pain behavior item bank can be used to develop static short-form and dynamic measures of pain behavior for clinical studies.

Keywords: Pain behavior, item response theory analysis, patient reported outcomes, psychometric analysis, chronic pain, item banks


People with pain engage in behaviors that communicate to others that they are experiencing pain [6,13,31]. Pain behaviors may include verbal complaints of pain and suffering, non-language sounds, facial expressions, body posturing and gesturing, and limitations in activities. There is growing recognition that pain behavior is a key outcome to assess in persons suffering from chronic pain [7,26]. Careful measurement of pain behavior can be helpful in several ways. First, it can provide clues about the existence, intensity and causes of pain. Second, knowledge about pain behaviors can provide insights into a person’s attempts to cope with or manage pain [33]. This information may help clinicians identify and reinforce adaptive pain coping efforts. Information on pain behaviors can also reveal maladaptive coping efforts that can be targeted in treatment interventions. Finally, pain behaviors may provide a behavioral marker of enhanced risk for the development of chronic pain and disability [11].

Direct observation provides the most objective approach to assessing pain behavior [13]. Two basic methods have been developed: sampling pain behavior during a standardized task [12] or in a naturalistic setting [23]. Although direct observation methods are objective, they have drawbacks (e.g., need for ongoing observer training, costs, intrusiveness) that limit their utility in clinical settings.

Self-report provides an alternative strategy for measuring pain behavior [14]. Individuals with pain are often aware of pain’s effect on their behavior and can report on their experience and activities. Given the advantages of self-report (e.g., low cost, ease of administration), it is surprising that self-report pain behavior measures are not more widely used in pain assessment. Among the self-report measures of pain behavior that have been developed and validated, the Pain Behavior Check List (PBCL) is the most widely used in research studies [14,19]. The PBCL assesses four categories of pain behaviors (distorted ambulation, affective distress, facial/audible expressions, and help seeking behavior). Although there is evidence for the reliability and validity of PBCL scores [14], the scale is brief and does not capture potentially important categories of pain behaviors such as isolating oneself from others, crying, and massaging a painful area, among others.

A new effort to develop a comprehensive pain behavior item bank was undertaken through the NIH Patient-Reported Outcomes Measurement Information System (PROMIS) [3,21]. The goal of the current study was to develop a self-report measure that reliably assesses a full array of pain behaviors and can be integrated into clinical studies. This report focuses on the development and psychometric evaluation of the PROMIS pain behavior item bank. The psychometric characteristics of the final pain behavior item bank were evaluated based on both classical test theory and modern measurement theory methods [8,21,22,28]. Modern measurement theory approaches include application of item response theory (IRT) models to evaluate item characteristics and to scale both items and individuals on a unidimensional construct (i.e., pain behavior). IRT methods allow for a more comprehensive understanding of item responses and their relationships to the targeted domain. Items with good psychometric properties can be assembled into short-form scales or computerized adaptive tests for more targeted patient assessment.


Study Design

A web-based survey was developed and conducted to field test 14 candidate PROMIS item banks, including pain behavior. This field test was completed during 2006 using an internet panel maintained by YouGov/Polimetrix. The 14 item banks, representing five domains (i.e., pain, fatigue, physical functioning, social activities, and emotional distress), were administered to selected participants [3]. Because of the large number of items to be tested, a complex sampling approach was used. Participants were randomly assigned either to respond to two full item banks or to multiple 7-item blocks of items. Item banks comprised the full set of candidate items for a particular domain; for example, the pain impact bank had 56 items and the pain behavior bank had 52 items. Participants administered item blocks responded to fourteen 7-item blocks, one for each of the 14 PROMIS sub-domains (e.g., fatigue experience and impact, physical function, pain behavior, impact and quality, social function, depression, anxiety, anger).

Study Participants

The PROMIS field test sample was developed to be generally comparable to distributions of gender, age, race/ethnicity (White/Black/Hispanic/Other) and education (High School or less versus more than High School) based on the 2000 US census data. These participants were identified from the YouGov/Polimetrix internet panel. A final scale-setting, community sample was identified and weighted to reflect the 2000 United States General Census. For the current study, the study subjects included those who completed either all 52 items of the pain behavior item bank or a subset of 7 items from this bank.

Wave I Sample

Because of the number of item banks tested in Wave I, a complex data collection strategy was employed (see Figure 1). This strategy included two arms and a total sample size of 21,133. A total of 19,601 were recruited by Polimetrix, with the remaining 1,532 recruited from selected PROMIS primary research sites. In the full bank testing arm, 7,005 persons from the general population were administered 2 of the 14 sub-domain-specific PROMIS item banks. In the second arm, 14,128 individuals were administered randomly selected 7-item blocks measuring each of 14 PROMIS-targeted sub-domains (Figure 1). The primary research sites and the Polimetrix sample included both community and clinical samples. The clinical samples included persons with heart disease (n = 1,156), cancer (n = 1,754), rheumatoid arthritis (n = 557), osteoarthritis (n = 918), psychiatric illness (n = 1,193), COPD (n = 1,214), spinal cord injury (n = 531), and other conditions (n = 560).

Figure 1
PROMIS Wave 1 Sample

Secondary Data Collection

After initial calibrations of the Wave I data, it was noted that, because most of the participants were from the general community, there were relatively few persons who had moderate to severe pain. We recruited an additional sample to increase the number of participants with more severe pain intensity, including members of the American Chronic Pain Association (ACPA), a non-profit organization focused on education and support for people with chronic pain.

Nine hundred and sixty-seven participants with chronic pain were recruited through the ACPA. An invitation to complete the PROMIS pain survey was posted on the ACPA website. To be eligible, participants had to be 21 years of age or older and have at least one chronic pain condition for at least 3 months prior to participating in the survey. Those who met eligibility criteria were asked to complete an informed consent form. After obtaining informed consent, participants immediately began the survey. The survey was posted on the website of the ACPA from September 2007 to March 2008.


Pain Behavior Item Bank

A draft pain behavior item bank was developed based on the published literature, clinician review, and qualitative research with patients experiencing various kinds of pain. Existing observer-rated and self-reported pain behavior instruments were reviewed [5,10,12,14,15,20,23,24,27,29,30,32], and we summarized the range and type of item content covered by the existing published measures. A comprehensive set of draft pain behavior items was generated, and these items were then reviewed by six pain and outcomes research experts. After several iterations of item review and revision, a final draft set of pain behavior items was available for field testing. Content of the final set of pain behavior items was revised based on the results of cognitive debriefing interviews [4]. Cognitive debriefing interviews are designed to determine whether respondents understand the instructions, content of items, recall period, and response scales of a patient-reported instrument [34]. Based on the cognitive debriefing interviews, several of the items were modified to increase respondent understanding. The final pain behavior item bank consisted of 52 items covering movement (e.g., move slowly, stiffness), affective (e.g., irritable, angry), social interaction (e.g., ask for help, withdraw), and facial/verbal (e.g., groan, grimace) behaviors. As can be seen in Table 1, the items cover a very wide range of pain behaviors. Respondents rated how frequently they engaged in each pain behavior using a 6-point Likert-type response scale, ranging from 1 (had no pain) to 6 (always). The recall period for the pain behavior items was the past 7 days.

Table 1
Descriptive Statistics for PROMIS Pain Behavior Items: Pooled WAVE1 and ACPA Data

Other Measures

Demographic information was collected for the study participants (i.e., age, gender, race/ethnicity, education). In addition, all participants completed the PROMIS average pain intensity item based on a 0 to 10 numeric rating scale and verbal pain rating scale (“none” to “very severe”). Information was also collected on a number of chronic medical conditions in the Wave I sample and on different chronic pain conditions in the ACPA sample.

All participants in the Wave I sample completed nine additional global rating items covering physical health, mental health, physical function, fatigue, emotional distress, social satisfaction, social activities, general health perceptions, and quality of life [9]. The recall period for the global items was the past 7 days.

Statistical Analysis

Descriptive statistics for the demographic characteristics and average pain intensity were summarized for the entire study sample. We report means and standard deviations for the pain behavior bank items. The psychometric analyses were guided by the PROMIS analysis plan [21]. Item response theory (IRT) analyses were applied to evaluate the measurement qualities of the pain behavior items [8,21,22,28]. IRT makes full use of data provided by participants’ responses to each item of an instrument or item bank. Simultaneous estimates of the properties of items and persons’ levels of the trait being measured are used to develop a probabilistic model that places individuals’ scores and item properties on the same metric [22]. IRT models estimate a set of properties for each item in the scale (i.e., item calibrations). The item properties represent an item’s relationship with a measured construct (e.g., pain impact, pain behavior) [8,21,22]. The item parameters from an IRT analysis allow instrument developers to select the most effective items to target a patient’s level of functioning.

We examined item-total correlations and Cronbach’s alpha for the pain behavior items. Exploratory and confirmatory factor analyses (CFA) were conducted to assess the unidimensionality of the pain behavior items. For the exploratory analysis, we did not specify any number of factors. A one-factor solution was specified and goodness of fit tests were evaluated in the confirmatory analysis. The fit statistics examined included: Comparative Fit Index (CFI), Root Mean Square Error of Approximation (RMSEA), and Standardized Root Mean Square Residual (SRMR). MPLUS was used to perform the CFA [16].
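As an illustration of the internal consistency statistic named above, the following minimal sketch computes Cronbach's alpha from a complete respondent-by-item matrix. The function name `cronbach_alpha` and the small response matrix are hypothetical and are not study data; the sketch also assumes every respondent answered every item, which was not the case under the block design used in Wave I.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondent rows (one score per item)."""
    k = len(responses[0])
    # Variance of each item across respondents.
    item_vars = [pvariance([row[j] for row in responses]) for j in range(k)]
    # Variance of the summed score across respondents.
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: 5 respondents x 3 items scored 1-6 (not study data).
demo = [
    [1, 1, 2],
    [2, 3, 2],
    [3, 3, 4],
    [5, 4, 5],
    [6, 6, 5],
]
alpha = cronbach_alpha(demo)
```

Because alpha depends only on the ratio of the summed item variances to the total-score variance, using population or sample variances gives the same value as long as the choice is consistent.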

The graded response IRT model was fit to the 52 items using MULTILOG [25] to estimate item parameters. The graded response model is a flexible IRT model that supports examination of item and scale properties, estimation of item characteristics, and estimation of persons’ pain behavior scores based on responses to items of the pain behavior item bank. The item characteristics that are estimated with the graded response model are slope and threshold parameters. The slope parameter indicates the strength of the relationship between an item and the measured construct (i.e., pain behavior). The threshold parameters provide information on item difficulty or severity, and locate the item along the measured construct. IRTFIT [1] was used to assess IRT model fit for each item. IRTFIT computes the extension of S-X² and S-G² [17,18] for items with more than two response categories. These statistics estimate the fit of the item responses to the IRT model, that is, whether the responses follow the pattern predicted by the model. Statistically significant differences indicate poor fit. The S-X² (a Pearson X² statistic) and S-G² (a likelihood ratio G² statistic) are fit statistics that use the sum score of all items and compare the predicted and observed response frequencies for each level of the scale sum score.
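To make the roles of the slope and threshold parameters concrete, the sketch below evaluates graded response model category probabilities for a single item. It assumes the logistic form without a 1.7 scaling constant, and the slope and threshold values are hypothetical rather than calibrated PROMIS parameters; `grm_category_probs` is an illustrative name, not part of MULTILOG.

```python
import math

def grm_category_probs(theta, slope, thresholds):
    """Graded response model: probability of each ordered response category.

    thresholds must be in increasing order; m thresholds give m + 1 categories.
    """
    # Cumulative probability of responding in category k or higher,
    # padded with 1.0 (lowest category or higher) and 0.0 (above the highest).
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-slope * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# Hypothetical item with six response categories (five thresholds).
probs = grm_category_probs(theta=0.0, slope=2.0,
                           thresholds=[-1.5, -0.5, 0.3, 1.2, 2.0])
```

A larger slope makes the cumulative curves steeper, so the category probabilities change more sharply with theta; the thresholds locate where along theta each higher category becomes likely.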

The development of an item bank starts with more items than will be used in the final bank. The item selection process for the pain behavior bank was iterative. First, the IRT calibration was conducted to estimate item parameters, and IRT model fit was estimated. Items were excluded from the next iteration if the p-value of either S-X² or S-G² was less than 0.001. The process continued until no items exhibited significant misfit (alpha = 0.001).
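The iterative selection rule described above can be sketched as a simple loop. In this sketch, `fit_pvalues` is a hypothetical callback standing in for a calibration and fit run (the study used MULTILOG and IRTFIT); it is assumed to return, for each remaining item, the p-values of the two fit statistics. The toy callback `fake_fit` exists only to exercise the loop.

```python
def prune_misfitting_items(items, fit_pvalues, alpha=0.001):
    """Recalibrate and drop misfitting items until none remain below alpha.

    fit_pvalues: callable(list_of_items) -> {item: (p_sx2, p_sg2)};
    a hypothetical stand-in for an IRT calibration plus item-fit run.
    """
    current = list(items)
    while current:
        pvals = fit_pvalues(current)
        # An item misfits if either fit statistic has p < alpha.
        misfit = {i for i in current if min(pvals[i]) < alpha}
        if not misfit:
            break
        current = [i for i in current if i not in misfit]
    return current

def fake_fit(current):
    # Toy rule: "c" always misfits; "b" misfits only once "c" has been removed.
    out = {i: (0.5, 0.5) for i in current}
    if "c" in current:
        out["c"] = (0.0001, 0.2)
    elif "b" in current:
        out["b"] = (0.3, 0.0005)
    return out

kept = prune_misfitting_items(["a", "b", "c", "d"], fake_fit)
```

The toy callback shows why recalibration after each removal matters: an item can fit acceptably in one item set and misfit in the next.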

At the conclusion of the iterative IRT and fit analyses, 39 items were selected for the PROMIS pain behavior bank. However, there was a concern regarding the slope parameters of these items. The slope parameters were higher than generally found in other IRT studies. We hypothesized that the high slope parameters were an artifact caused by a large proportion of subjects who reported experiencing “no pain”. This no-pain population created either non-normality or a mixed sample (generally healthy people and those with some pain) that affected the IRT parameter estimation. The high slopes were thought to result from the large variance in responses and the mixture of two distinct populations: people with pain and people without pain.

Several options were proposed to address the high slope findings. One was to exclude subjects who answered 0 or 1 on an 11-point pain intensity scale (0 = “No pain”, 10 = “Worst pain imaginable”). The other solution was to fit the pain behavior items with a constrained nominal model [2]. For a nominal model, each response option has its own slope parameter. In our constrained nominal model, two slopes were estimated: one slope for the “had no pain” option, and one slope for all other response options. We decided that excluding subjects who reported no pain was the most feasible approach and would yield item parameters consistent with other PROMIS domains. Therefore, IRT analysis was again conducted for the 39 selected items with the “no-pain” subjects excluded. This resulted in lower slope parameter estimates.

In IRT, reliability is conceptualized as “information,” which reflects the fact that measurement precision can differ across the levels of a construct. The relationship between information and standard error (SE) is defined by the formula SE(θ) = 1/√I(θ), where θ is the estimated trait level, SE(θ) is the standard error of θ, and I(θ) is the information at θ. As the formula indicates, increased scale information is associated with smaller SEs and, therefore, greater precision.
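A short numerical sketch may help connect information, standard error, and reliability. It uses the standard IRT relationship SE(θ) = 1/√I(θ) and, as an added assumption not stated in the text, the common approximation that reliability is about 1 − SE² when θ is scaled to unit variance. The function names are illustrative.

```python
import math

def standard_error(information):
    """Standard error of theta at a given level of test information."""
    return 1.0 / math.sqrt(information)

def approx_reliability(information):
    """Approximate reliability, assuming theta is scaled to variance 1."""
    return 1.0 - standard_error(information) ** 2

# Information of 10 gives SE of about 0.32 and reliability of about 0.90;
# information of 5 gives SE of about 0.45 and reliability of about 0.80.
se_10 = standard_error(10.0)
rel_10 = approx_reliability(10.0)
rel_5 = approx_reliability(5.0)
```

Under these assumptions, a reliability threshold of 0.90 corresponds to information of at least 10 at that trait level, and 0.80 corresponds to information of at least 5.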

Differential item functioning (DIF) was examined by gender, age and education. DIF exists when an item functions differently for respondents from different subgroups [21,22]. DIF can be consistent across the range of the trait being measured (uniform DIF), or its impact can vary depending on trait level (non-uniform DIF). Logistic regression as described in Zumbo [33] was used to detect DIF. The estimated IRT theta was used as the conditioning variable. DIF was assessed for gender, education (high school or less vs. high school and above), and age (< 65 vs. >=65).
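The nested-model logic behind logistic regression DIF detection can be sketched as follows. This is a simplified illustration, not the study's procedure: it swaps in binary logistic regression for a dichotomized item (the study analyzed ordinal responses), uses separate 1-df likelihood ratio tests for uniform and non-uniform DIF, and runs on simulated data with a deliberately built-in uniform group effect. It assumes NumPy is available; `fit_logistic` and `dif_tests` are hypothetical helper names.

```python
import math
import numpy as np

def fit_logistic(X, y, iters=50):
    """Newton-Raphson logistic regression; returns the maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def dif_tests(theta, group, y):
    """Return (p_uniform, p_nonuniform) from nested likelihood ratio tests."""
    ones = np.ones_like(theta)
    ll1 = fit_logistic(np.column_stack([ones, theta]), y)          # theta only
    ll2 = fit_logistic(np.column_stack([ones, theta, group]), y)   # + group
    ll3 = fit_logistic(
        np.column_stack([ones, theta, group, theta * group]), y)   # + interaction

    def chi2_sf_1df(x):
        # Upper tail of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2)).
        return math.erfc(math.sqrt(max(x, 0.0) / 2.0))

    return chi2_sf_1df(2 * (ll2 - ll1)), chi2_sf_1df(2 * (ll3 - ll2))

# Simulated data (not study data): group shifts the logit by 1.0 at every theta.
rng = np.random.default_rng(0)
n = 4000
theta = rng.normal(size=n)
group = rng.integers(0, 2, size=n).astype(float)
logit = 1.5 * theta + 1.0 * group
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)
p_uniform, p_nonuniform = dif_tests(theta, group, y)
```

In this simulation the group term improves fit substantially after conditioning on theta, which is the signature of uniform DIF; a significant interaction term would instead indicate non-uniform DIF.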

The concurrent validity of the pain behavior bank items was further tested by correlating the pain behavior scores with measures of pain intensity and the global PROMIS items. We evaluated known groups validity by comparing mean pain behavior scores by groups defined by pain intensity (none, mild, moderate, severe, very severe) and general health status (poor, fair, good, very good, excellent). Analysis of variance was used to compare the mean pain behavior scores by pain intensity and general health status group. Scheffe’s adjustment was used to control for multiple comparisons.


Demographic Characteristics

The overall sample (n = 21,133) was 52% female. The sample median age was approximately 50 years; 12% were 18–29 years; 12% were 30–39 years; 16% were 40–49 years, 32% were 50–64 years, and 28% were 65 years or older. The racial/ethnic breakdown was 82% white; 9% black; 8% multi-racial; and 1% Asian/Pacific Islanders or Native Americans. Nine percent of the sample was Hispanic or Latino. Educational attainment was categorized as less than high school (3%), high school diploma (16%), some college (39%), college degree (24%), and advanced degree (19%).

Across the full bank and block design arms, a total of 15,528 persons met inclusion criteria and responded to full item banks (N=881) or 7-item subsets (N=13,695) of the items of the pain behavior bank. In addition, we excluded 125 subjects who gave repetitive strings of ten or more identical responses or had a response time less than or equal to one second per item.

There were 967 participants who responded to 43 pain behavior items and a pain intensity item through the ACPA survey. Average age was 48.2 years (SD = 11.1). Eighty-one percent of the respondents were female; 91% were Caucasian, 1.5% were African-American, and 5% were of Hispanic origin. Eighty-one percent of the participants had an education equal to or greater than high school. The data were combined with Wave 1 full-bank test data to complete the IRT analysis calibrations.

Item Descriptive Statistics

Table 1 summarizes the descriptive statistics for the pain behavior items. Although the pooled data (WAVE1 and ACPA) included a total of 15,792 subjects, some respondents (Wave 1 block testing) were administered 7-item subsets rather than the whole pain behavior item bank, so each individual item was answered by approximately 3,000 subjects. The sample sizes were smaller for those items that were not included in the ACPA survey.

As can be seen in Table 1, for many of the items approximately 20% or more of subject responses are at the floor of the response distribution (indicating no pain) (range of 17.0% to 26.7%). Very few responses were at the ceiling of the response distribution (range 0.7% to 16.6%); most items had less than 5% at the ceiling.

Before the IRT analyses, the response distribution of each item was examined. The results show that items 40 (“flung my arms and limbs around”) and 41 (“screamed”) each had fewer than 5 observations in the highest response category (6=Always). A decision was made to collapse the two highest response categories for these two items.

Evaluation of Unidimensionality

We evaluated item-total correlations and internal consistency reliability (Cronbach’s alpha) for the total set of 52 pain behavior items. Item-total correlations ranged from 0.44 to 0.87 and internal consistency reliability was 0.99.

The exploratory factor analyses found that the first factor explained 90% of the variance in the pain behavior items. The CFA findings for the one factor solution for the 52 pain behavior items resulted in a CFI of 0.902 and TLI of 0.991, with RMSEA equal to 0.156 and SRMR equal to 0.035.

Based on the item-total correlations and the CFA results, we determined that there was sufficient evidence supporting unidimensionality of the pain behavior item pool.

IRT Analyses

Originally, all subjects who met the inclusion criteria were included in the IRT analysis. The graded response model was fit to these data. Items were excluded from the item list if the p-value of either S-X² or S-G² was less than 0.001. After removal of misfitting items, the IRT calibration was repeated until no item showed significant misfit. In the first iteration, items 1, 7, 36, 53, and 54 were excluded. In subsequent iterations, eight more items were excluded (items 4, 5, 12, 15, 19, 20, 30, and 52). The item parameters of the final 39 items are shown in Table 2. The slope parameters ranged from 3.55 (“I limped because of pain”) to 5.94 (“When I was in pain I screamed”). These slope parameters are larger than normally observed. The threshold values ranged from −0.72 to 2.15, reflecting a somewhat narrow coverage of the range of the pain behavior domain.

Table 2
IRT Item Parameters and Fit Statistics for PROMIS Pain Behavior Items: Pooled Wave1, CORE Cancer, and ACPA Data (N = 15,834)

To address the unexpectedly large slope parameters we explored excluding subjects who self-reported having no pain. A total of 6,035 subjects who reported 0 or 1 to the global average pain item were excluded, and the item parameters of the 39 items were re-calibrated. The new item parameters were noticeably lower and are shown in Table 3. The slope parameters ranged from 1.81 (“I limped because of pain”) to 3.41 (“I had pain so bad it made me cry”). The threshold values ranged from −2.52 to 3.21 indicating good coverage across the range of the pain behavior domain.

Table 3
IRT Item Parameters and Fit Statistics for PROMIS Pain Behavior Items, Excluding Subjects with 0 or 1 Global Pain Responses: Pooled WAVE-1 and ACPA Data (N=9,589)

The 39 pain behavior items were also calibrated with the nominal model. The nominal model was constrained to have one slope parameter for the “Had no pain” response option and equal slope parameters for all other response options. Table 4 shows the item parameters for the nominal model. The item parameters for the nominal model cannot be directly compared to those for the graded response model. However, one can gauge the magnitude of the slope by subtracting adjacent parameter estimates. For example, for item 2, the slope for the item characteristic curve of the response option “Had no pain” is the difference between the parameters a2 and a1, 6.24. The slopes for the item characteristic curves for the other response options are the differences between a3 and a2, a4 and a3, and so on. Because these are constrained to be equal, they are all 2.98. Slopes for the response option “Had no pain” were much larger than those for the combined other response options.

Table 4
IRT Item Parameters and Fit Statistics for PROMIS Pain Behavior Items, Excluding Subjects with 0 or 1 Global Pain Responses: Pooled WAVE-1 and ACPA Data (N=9,589)

Descriptive Statistics and Reliability

The derived IRT theta scores were transformed into T-scores with a mean of 50 and standard deviation (SD) of 10 based on the PROMIS general population sub-sample. The mean T-score for the overall sample was 51.2 (SD=9.4), with a median of 50.8 and a range from 10 to 90. Figure 2 summarizes the standard errors over the range of pain behavior scores for the total item bank, a 7-item short form, and a 7-item computerized adaptive test (CAT). For the full item bank, reliability is 0.90 or greater across most of the score distribution, and the short form and CAT have reliabilities exceeding 0.80 across the majority of the score distribution. Cronbach’s alpha for the full item bank was 0.98.
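The T-score metric is a linear rescaling of theta, which can be sketched as below. The sketch assumes the reference (general population) sub-sample has theta mean 0 and SD 1 unless other values are supplied; `theta_to_t` and `se_to_t_metric` are illustrative names, and the input values are hypothetical.

```python
def theta_to_t(theta, ref_mean=0.0, ref_sd=1.0):
    """Rescale theta so the reference sample has mean 50 and SD 10."""
    return 50.0 + 10.0 * (theta - ref_mean) / ref_sd

def se_to_t_metric(se_theta, ref_sd=1.0):
    """Standard errors rescale by the same factor of 10 (no shift)."""
    return 10.0 * se_theta / ref_sd

# Hypothetical values: theta of 1.5 maps to T = 65; SE of 0.3 maps to 3 T-score points.
t_example = theta_to_t(1.5)
se_example = se_to_t_metric(0.3)
```

On this metric, a respondent with T = 60 sits one reference-sample SD above the reference mean.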

Figure 2
Standard Error of Measurement for Pain Behavior Bank, Short Form, and 7-item CAT


To assess the concurrent validity of the pain behavior measure, we correlated the IRT theta score with the global health rating and pain intensity rating. The correlation between pain behavior scores and self-reported pain intensity ratings was 0.69 (p<0.001). The correlation between pain behavior scores and self-reported global health ratings was −0.48 (p<0.001).

We also tested the difference between pain behavior theta scores by groups defined based on the pain intensity ratings and global health ratings. Analysis of variance was used to compare mean theta scores among groups based on pain intensity rating (‘none’, ‘mild’, ‘moderate’, ‘severe’, ‘very severe’) (Figure 3). There were significant differences among mean pain behavior scores (p<0.0001), and all pair-wise comparisons were statistically significant after Scheffe’s adjustment (all p<0.001). Analysis of variance was used to compare pain behavior scores among subjects grouped by global health ratings (‘poor’, ‘fair’, ‘good’, ‘very good’, ‘excellent’) (Figure 4). We found significant differences by global health groups (p<0.0001), and all pair-wise comparisons were significant with Scheffe’s adjustment (all p<0.001).

Figure 3
Mean Pain Behavior T-Scores by Pain Intensity
Figure 4
Mean Pain Behavior T-Scores by Global Health Status

Differential Item Functioning

Ordinal logistic regression [35] was used to assess items for DIF. DIF was assessed for gender, education (high school or less vs. high school and above), and age (< 65 vs. >=65). The IRT theta score was used as the conditioning variable. Non-uniform DIF was examined first, followed by uniform DIF. DIF was determined to be significant when the p-value for the chi-square was less than 0.001. No items had non-uniform DIF or uniform DIF detected due to education levels (data not shown). No items had non-uniform DIF due to gender (data not shown). Item 18 (“ask for help doing things needed to be done”) had uniform DIF due to gender. No non-uniform DIF for the pain behavior items was detected due to age (data not shown). However, there were five items that showed uniform DIF due to age. After reviewing the content of these items, four of these five items were associated with age-related movement: “moved extremely slowly” (item 8), “bend over while walking” (item 22), “use cane or something for support” (item 29), and “move my limbs protectively” (item 50). The other item that showed uniform DIF due to age was item 48 (“gritted my teeth or clenched my jaw”). In all cases, the older participants demonstrated more age-related movement problems (i.e., the items were easier to endorse for older participants).


As part of the larger PROMIS project, we developed a self-report pain behavior measure and a 39-item bank covering a wide array of pain behaviors. This comprehensive pain behavior item bank can be used to develop static short-forms and focused measures of pain behavior for different clinical studies and patient populations. The pain behavior item bank can also be administered as a computerized adaptive test (CAT). CAT allows for more individualized assessment of pain behavior in which the items administered are tailored to individual respondents and their severity level on the domain of pain behavior. CAT enables the assessment of the health domain with very few items (see Figure 2). The set of calibrated pain behavior items also allows the researcher to design a short-form pain behavior scale that targets a specific region of the pain behavior construct.

Based on the results of this study, the PROMIS pain behavior item bank has good evidence supporting unidimensionality, model fit, and coverage of the pain behavior construct, and preliminary evidence supporting construct validity. For the final selected items, a one-factor CFA resulted in excellent fit and good item-total correlations, indicating unidimensionality of these pain behavior items. Although the item bank covers different aspects of pain behavior, this set of items appears to collectively measure a single construct, pain behavior. Our findings differ from previous research on pain behavior, which indicates multiple pain behavior sub-domains []; however, previous research has not used a modern measurement theory approach to pain behavior assessment.

We were challenged by the large slopes from the IRT analysis of the PROMIS general population sample. These large slopes were likely attributable to the significant proportion of subjects who reported no pain and, therefore, for whom the pain behavior items were not relevant. When the IRT analysis was performed including the sample from the ACPA and including only those subjects who reported some pain, the item slopes were lower and more consistent with those expected in an IRT calibration. Also, a nominal IRT model, which estimated slopes for responses of “no pain” and for all other responses resulted in slopes more consistent with expectation. We recommend that for future applications of the pain behavior item banks, researchers use the item calibrations from the sample of subjects with some pain (Table 3). We anticipate that the pain behavior item bank will be used primarily in samples of individuals experiencing some pain.

Reliability of a new patient-reported outcome is most often evaluated using internal consistency reliability coefficients (i.e., Cronbach’s alpha). In IRT analyses, reliability or measurement precision is assessed across different ranges of scores and provides information as to the reliability of the scores at different levels of the targeted domain. We found evidence that the full pain behavior item bank and two short-form measures have acceptable reliability for group comparisons across the range of the pain behavior construct.

We have some preliminary evidence supporting the validity of the pain behavior scores. The pain behavior scores were strongly related to pain intensity scale scores and moderately related to self-ratings of general health status. Mean pain behavior scores vary significantly by groups based on verbal ratings of pain severity and by general health status. Those subjects in the moderate or severe pain severity groups were more likely to report greater pain behaviors. Clearly, additional evidence supporting the validity of the pain behavior scores is needed, especially from clinical populations and patients with different types of chronic pain. Future studies will also need to compare observer rated pain behavior scales with the PROMIS self-reported pain behavior scores to further demonstrate construct validity.

We identified little evidence of differential item functioning (DIF) among the pain behavior items. None of the items showed DIF by education level, and only one item (asking for help) demonstrated uniform DIF by gender. Several items related to mobility and movement demonstrated uniform DIF related to age. When selecting subsets of PROMIS pain behavior items for specific patient populations, we recommend selecting those items with no evidence of DIF.
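
For readers less familiar with the distinction between uniform and non-uniform DIF, one common formulation is the nested ordinal logistic regression framework described by Zumbo [35]. The sketch below is generic: the symbols (item response Y with ordered categories k, trait estimate θ, group indicator G) are notation introduced here for illustration, and this passage does not itself specify that the DIF analyses were run in exactly this form.

```latex
\begin{align}
\text{Model 1:}\quad & \operatorname{logit} P(Y \le k) = \alpha_k + \beta_1 \theta \\
\text{Model 2:}\quad & \operatorname{logit} P(Y \le k) = \alpha_k + \beta_1 \theta + \beta_2 G \\
\text{Model 3:}\quad & \operatorname{logit} P(Y \le k) = \alpha_k + \beta_1 \theta + \beta_2 G + \beta_3 (\theta \times G)
\end{align}
```

Under this formulation, uniform DIF corresponds to a meaningful improvement of Model 2 over Model 1 (β₂ ≠ 0, a constant group shift at every trait level), and non-uniform DIF to an improvement of Model 3 over Model 2 (β₃ ≠ 0, a group difference that changes with the trait level).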

The PROMIS pain behavior item bank overlaps in content with other patient rated pain behavior instruments [5,10,12,14,15,20,23,24,27,29,30,32]. For example, the PBCL contains items covering distorted ambulation, affective distress, facial/audible expressions, and help seeking behavior. The current PROMIS item bank also includes items representing these aspects of pain behavior as well as items assessing other pain behaviors (such as reclining, trying to minimize movement, isolating oneself from others, crying, screaming, increased body tension, avoiding physical contact with others, or massaging a painful area). Future research is needed to evaluate the relationship between the PROMIS pain behavior measure and the PBCL. Future studies should also examine how self-reports of pain behavior collected using the PROMIS measures relate to direct observations of pain behavior.

There are several limitations that should be considered when interpreting the results of these psychometric analyses. First, the psychometric analysis of the self-reported pain behaviors in the PROMIS item bank included general population samples, clinic samples, and members of the ACPA. A significant proportion of the general population did not report experiencing pain, and this required us to perform several additional IRT analyses to address this problem. For PROMIS, our interest was to have the scores for all the item banks in a metric normed to the general population. However, the inclusion of a general population sample may have affected the item parameter estimates; in fact, such an effect was evident in the reduced slope estimates obtained when those with very low pain behavior scores were excluded from the calibration. Second, although the ACPA sample reported various chronic pain conditions and moderate to severe pain, the demographic characteristics of the ACPA sample (e.g., 81% female, 91% Caucasian) may limit the generalizability of the current findings to the general population of people with chronic pain. Future research is needed to examine whether this had any significant impact on the pain behavior scores, and further research is needed in patients with various types of chronic pain conditions (e.g., osteoarthritis, low back pain, neuropathic pain, fibromyalgia). Finally, we were unable to clearly distinguish individuals experiencing acute versus chronic pain in the general population sample, and additional research is needed to examine the PROMIS pain behavior items in these different pain populations.

In conclusion, we have developed a patient-reported pain behavior item bank based on careful and systematic item generation, patient interviews, and the application of modern measurement theory methods. Based on the results of this study, we have evidence supporting satisfactory measurement precision across a range of the pain behavior construct, and some evidence supporting construct validity. Additional research is now needed to further investigate the construct validity, test-retest reliability, and responsiveness of the new pain behavior item bank in longitudinal studies. Short-form and CAT pain behavior scales need to be developed and evaluated in clinical populations. The PROMIS pain behavior item bank is now available for applications in clinical studies of patients with chronic diseases and chronic pain. This new patient-reported pain behavior measure has the potential to improve measurement of pain behavior and to increase our understanding of the effects of pain on patients’ lives and of the effects of interventions.


Funding for this research was provided to the participating institutions by the National Institutes of Health through the NIH Roadmap for Medical Research Cooperative Agreement (1U01-AR052177) for the Patient Reported Outcome Measurement Information System (PROMIS) project. PROMIS is a National Institutes of Health (NIH) Roadmap initiative to develop a computerized system measuring patient-reported outcomes in respondents with a wide range of chronic diseases and demographic characteristics. PROMIS was funded by cooperative agreements to a Statistical Coordinating Center (Northwestern University, PI: David Cella, PhD, U01AR52177) and six Primary Research Sites (Duke University, PI: Kevin Weinfurt, PhD, U01AR52186; University of North Carolina, PI: Darren DeWalt, MD, MPH, U01AR52181; University of Pittsburgh, PI: Paul A. Pilkonis, PhD, U01AR52155; Stanford University, PI: James Fries, MD, U01AR52158; Stony Brook University, PI: Arthur Stone, PhD, U01AR52170; and University of Washington, PI: Dagmar Amtmann, PhD, U01AR52171). NIH Science Officers on this project are Susan Czajkowski, PhD, Lawrence Fine, MD, DrPH, Laura Lee Johnson, PhD, Louis Quatrano, PhD, Bryce Reeve, PhD, William Riley, PhD, Susana Serrate-Sztein, MD, and James Witter, MD, PhD. This manuscript was reviewed by the PROMIS Publications Subcommittee prior to external peer review. See the PROMIS web site for additional information on the PROMIS cooperative group.




1. Bjorner JB, Smith KJ, Orlando M, Stone C, Thissen D, Sun X. IRTFIT: A Macro for Item Fit and Local Dependence Tests under IRT Models. Lincoln, RI: QualityMetric Incorporated; 2006.
2. Bock RD. Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika. 1972;37:29–51.
3. Cella D, Yount S, Rothrock N, et al. The Patient-Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap cooperative group through its first two years. Med Care. 2007;45 suppl 1:S3–S11.
4. DeWalt DA, Rothrock N, Yount S, Stone AA. Evaluation of item candidates: the PROMIS qualitative item review. Med Care. 2007;45 suppl 1:S12–S21.
5. Dirks JF, Wunder J, Kinsman R, McElhinny J, Jones NF. A Pain Rating Scale and a Pain Behavior Checklist for clinical use: development, norms, and the consistency score. Psychother Psychosom. 1993;59:41–49.
6. Fordyce WE. Behavioral Methods for Chronic Pain and Illness. St. Louis, MO: C.V. Mosby; 1976.
7. Hadjistavropoulos T, Herr K, Turk DC, Fine PG, Dworkin R, Helme R, Jackson K, Parmalee P, Rudy T, Beattie B, Chibnall J, Craig K, Ferrell B, Ferrell B, Fillingim R, Gagliese L, Gallagher R, Gibson S, Harrison E, Katz B, Keefe FJ, Lieer S, Lussier D, Schmader K, Tait R, Weiner D, Williams J. An interdisciplinary expert consensus statement on assessment of pain in older persons. Clin J Pain. 2007;23 Suppl 1:S1–S43.
8. Hambleton RK. Applications of item response theory to improve health outcomes assessment: developing item banks, linking instruments, and computer adaptive testing. In: Lipscomb J, Gotay CC, Snyder C, editors. Outcomes Assessment in Cancer: Measures, Methods, and Applications. Cambridge: Cambridge University Press; 2005. pp. 445–464.
9. Hays RD, Bjorner J, Revicki DA, Spritzer K, Cella D. Development and psychometric evaluation of the PROMIS short-form health status measure. Qual Life Res. in press.
10. Husebo BS, Strand LI, Moe-Nilssen R, Husebo SB, Snow AL, Ljunggren AE. Mobilization-Observation-Behavior-Intensity-Dementia Pain Scale (MOBID): Development and validation of a nurse-administered pain assessment tool for use in dementia. J Pain Symptom Manage. 2007;34:67–80.
11. Jensen MP. Validity of self-report and observational measures. In: Jensen TS, Turner JA, editors. Proceedings of the 8th World Congress on Pain: Progress in Pain Research and Management. Seattle, WA: IASP Press; 1997.
12. Keefe FJ, Block AR. Development of an observation method for assessing pain behavior in chronic low back pain patients. Behavior Therapy. 1982;13:363–375.
13. Keefe FJ, Williams DA, Smith SJ. Assessment of pain behaviors. In: Turk DC, Melzack R, editors. Handbook of Pain Assessment. New York, NY: Guilford Press; 2001. pp. 170–187.
14. Kerns RD, Haythornthwaite J, Rosenberg R, Southwick S, Giller EL, Jacob MC. The Pain Behavior Check List (PBCL): factor structure and psychometric properties. J Behav Med. 1991;14:155–167.
15. McDaniel LK, Anderson KO, Bradley LA, Young LD, Turner RA, Agudelo CA, Keefe FJ. Development of an observation method for assessing pain behavior in rheumatoid arthritis patients. Pain. 1986;24:165–184.
16. Muthén LK, Muthén B. Mplus User’s Guide. Third Edition. Los Angeles, CA: Muthén & Muthén; 2004.
17. Orlando M, Thissen D. Likelihood-based item-fit indices for dichotomous item response theory models. Appl Psychol Measure. 2000;24:50–64.
18. Orlando M, Thissen D. Further investigation of the performance of S-X²: An item fit index for use with dichotomous item response theory models. Appl Psychol Measure. 2003;27:289–298.
19. Osman A, Barrios FX, Kopper B, Osman JR, Grittmann L, Troutman JA, Panak WJ. The Pain Behavior Check List (PBCL): psychometric properties in a college sample. J Clin Psychol. 1995;51:775–782.
20. Prkachin KM, Hughes E, Schultz I, Joy P, Hunt D. Real-time assessment of pain behavior during clinical assessment of low back pain patients. Pain. 2002;95:23–30.
21. Reeve BB, Hays RD, Bjorner JB, Cook KF, Crane PK, Teresi JA, Thissen D, Revicki DA, Weiss DJ, Hambleton RK, Honghu L, Gershon R, Reise SP, Lai J, Cella D. Psychometric evaluation and calibration of health-related quality of life items banks: plans for the patient-reported outcome measurement information system (PROMIS). Med Care. 2007;45(5 Suppl 1):S22–S31.
22. Reise SP. Item response theory and its applications for cancer outcomes measurement. In: Lipscomb J, Gotay CC, Snyder C, editors. Outcomes Assessment in Cancer: Measures, Methods, and Applications. Cambridge: Cambridge University Press; 2005. pp. 425–444.
23. Richards JS, Nepomuceno C, Riles M, Suer A. Assessing pain behavior: the UAB pain behavior scale. Pain. 1982;14:393–398.
24. Terstegen C, Koot HM, de Boer JB, Tibboel D. Measuring pain in children with cognitive impairment: pain response to surgical procedures. Pain. 2003;103:187–198.
25. Thissen D, Chen WH, Bock D. MULTILOG. Lincolnwood, IL: Scientific Software International; 2002.
26. Turk DC, Dworkin RH, Revicki DA, Harding G, Burke LB, Cella D, Cleeland CS, Cowan P, Farrar JT, Hertz S, Max MB, Rappaport BA. Identifying important outcome domains for chronic pain clinical trials: an IMMPACT survey of people with pain. Pain. 2008;137:276–285.
27. Turk DC, Wack JT, Kerns RD. An empirical examination of the “pain-behavior” construct. J Behav Med. 1985;8:119–130.
28. Van der Linden WJ, Hambleton RK, editors. Handbook of Modern Item Response Theory. New York: Springer; 1997.
29. Vlaeyen JW, Pernot DF, Kole-Snijders AM, Schuerman JA, Van Eek H, Groenman NH. Assessment of the components of observed chronic pain behavior: the Checklist for Interpersonal Pain Behavior (CHIP). Pain. 1990;43:337–347.
30. Waddell G, et al. Non-organic physical signs in low back pain. Spine. 1980;5:117–125.
31. Waters SJ, Dixon KE, Keefe FJ. Pain assessment. In: Ayers S, Baum A, McManus C, Newman S, Wallston K, Weinman J, West R, editors. Cambridge Handbook of Psychology, Health and Medicine. 2nd Edition. Cambridge, UK: Cambridge University Press; 2007. pp. 300–303.
32. Weiner D, Pieper C, McConnell E, Martinez S, Keefe F. Pain measurement in elders with chronic low back pain: traditional and alternative approaches. Pain. 1996;67:461–467.
33. Wilkie DJ, Keefe FJ, Dodd MJ, Copp LA. Behavior of patients with lung cancer: Description and associations with oncologic and pain variables. Pain. 1992;51:231–240.
34. Willis G. Cognitive Interviewing: A Tool for Improving Questionnaire Design. Thousand Oaks, CA: Sage Publications; 2004.
35. Zumbo BD. A Handbook on the Theory and Methods of Differential Item Functioning (DIF): Logistic Regression Modeling as a Unitary Framework for Binary and Likert-Type (Ordinal) Item Scores. Ottawa, ON: Directorate of Human Resources Research and Evaluation, Department of National Defense; 1999.