HHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
J Pain. Author manuscript; available in PMC 2011 November 1.
Published in final edited form as:
PMCID: PMC3129595

PROMIS Pediatric Pain Interference Scale: An Item Response Theory Analysis of the Pediatric Pain Item Bank


An aim of the National Institutes of Health (NIH) Patient Reported Outcomes Measurement Information System (PROMIS) initiative is to develop item banks and computerized adaptive tests (CAT) that are applicable across a wide variety of chronic disorders. The PROMIS Pediatric Cooperative Group has concentrated on the development of pediatric self-report item banks for ages 8-17 years. The objective of the present study is to describe the Item Response Theory (IRT) analysis of the NIH PROMIS pediatric pain item bank and the measurement properties of the new unidimensional PROMIS Pediatric Pain Interference Scale. Test forms containing pediatric pain items were completed by a total of 3,048 respondents. IRT analyses regarding scale dimensionality, item local dependence, and differential item functioning were conducted. A pain item pool was developed to yield scores on a T-score scale with a mean of 50 and standard deviation of 10. The recommended 8-item unidimensional short form for the PROMIS Pediatric Pain Interference Scale contains the item set which provides the maximum test information at the mean (50) on the T-score metric. A simulated CAT was computed that provides the most information at five possible score locations (30, 40, 50, 60, and 70 on the T-score metric).

Keywords: Pain, pediatrics, PROMIS, pain interference, Item Response Theory


The Patient Reported Outcomes Measurement Information System (PROMIS) is a National Institutes of Health (NIH) Roadmap Initiative, created to advance the assessment of patient-reported outcomes (PRO) in chronic diseases. To achieve this goal, self-report items are evaluated using modern measurement theory [“Item Response Theory” (IRT)] in order to derive assessments that are maximally reliable, valid, and generalizable for individuals falling along the full spectrum of the trait being measured [1]. A primary objective is to develop a group of item banks and computerized adaptive tests across a wide variety of chronic disorders [29]. During the past 5 years, the PROMIS Pediatric Cooperative Group has concentrated on the development of pediatric self-report PRO item banks for ages 8-17 years across five generic health domains (physical function, pain, fatigue, emotional health, social health) from the patient perspective, consistent with the larger PROMIS network [4]. It was anticipated that measures of these five health domains would be applicable across numerous pediatric chronic health conditions, and hence were developed as generic or nondisease-specific scales.

Given the widespread occurrence of chronic and recurrent pain in pediatric populations [12], particularly in pediatric chronic diseases [41], an item bank focused on pediatric pain items was an essential component of the PROMIS Pediatric Cooperative Group’s efforts. While the measurement of pain intensity using visual analogue scales [20; 40], rating scales [39; 36; 11], and pictorial scales [22; 21] has received empirical attention in pediatric populations over the past two decades as evidenced by recent comprehensive reviews [8; 32; 44; 5; 23; 26], the measurement of the pain interference construct has received less empirical attention, and consequently was an important focus in the development of the PROMIS pediatric pain item bank [46; 17]. For the purposes of this study, the a priori operational definition of “pain interference” was the interference by pain on daily activities during the past 7 days (interference upon physical, psychological, and social functioning). At the end of each item stem was the phrase “…when I had pain” to explicitly distinguish the items as pain-specific interference, rather than as generic functioning items.

While other scales have been developed that measure physical activities in pediatric patients, including those which have utilized either Rasch or IRT analyses [48; 14], these scales typically contain generic items (i.e., not pain-specific content) or have been used predominantly in specific populations [27]. In contrast, the Child Activity Limitations Interview (CALI) was designed to assess functional impairment in activities of daily living secondary to pediatric chronic and recurrent pain [28; 27]. However, the CALI and CALI-21 were developed utilizing Classical Test Theory rather than IRT. Early research with the CALI-21 demonstrates that it has two factors described as representing “Active and Routine activities”; such detailed factor analysis was an advance over earlier pain measures [27; 26]. Additional analyses of data from the CALI-21 would be helpful to investigate the possibility of local dependence and gender DIF. The larger sample sizes and IRT analytic techniques used in PROMIS item development permit these more detailed levels of psychometric scrutiny.

Thus, the majority of pediatric pain functional impairment scales, consistent with other pediatric assessment instruments, have utilized Classical Test Theory and have rarely taken advantage of IRT analysis in the scale development process [15; 19]. By utilizing IRT analysis, the resulting item bank can be the basis of a more customizable measure for meeting a researcher’s or clinician’s needs. Depending on the desired level of precision, the evaluator can then select the number of items to administer and obtain scores on the same metric as all other users of this item bank [10].

Consequently, the objective of the present study is to address this measurement gap in the pediatric pain literature by describing the IRT analysis of the PROMIS pediatric pain item bank and the measurement properties of the new PROMIS Pediatric Pain Interference Scale, including investigations of scale dimensionality and sources of local dependence and differential item functioning.


Sampling Plan

Participants were recruited in hospital-based outpatient general pediatrics and subspecialty clinics and in public school settings between January 2007 and May 2008 in North Carolina and Texas. This sample was derived to include a broad range of experiences from children who were healthy and children with chronic illnesses. Children completed questionnaires that included items across several domains of health including physical function, pain, fatigue, emotional distress, and social health. North Carolina and Texas were chosen as recruitment sites because of the diversity of cultural experience and population characteristics in those areas.

To be eligible to participate in the large-scale testing survey, subjects were required to meet the following inclusion criteria: between the ages of 8 to 17 years old; able to speak and read English; and able to see and interact with a computer screen, keyboard, and mouse. They provided informed assent prior to study entry and a parent or guardian provided informed consent. Both the informed assent and the informed consent were administered in English so parents were also required to read and speak English. Parent reports were used to determine whether or not the child had any limitations (e.g., physical or cognitive) that would make it too difficult to complete a computer administered survey.

Potential clinic participants were identified through a variety of methods, such as review of pediatric clinic appointment rosters or approach in clinic waiting rooms, according to protocols approved by the institutional review boards (IRBs) of The Children’s Hospital at Scott and White (S&W) in Texas, the University of North Carolina (UNC), and Duke University pediatrics clinics. The UNC, Duke, and S&W general pediatric clinics were representative of health issues for which children have physician office visits (e.g., well child visits, acute illnesses, and some chronic illnesses). The specialty clinics included Pulmonology, Allergy, Gastroenterology, Rheumatology, Nephrology, Obesity, and Endocrinology and primarily saw children with more serious chronic illnesses. Children with asthma were oversampled during recruitment because asthma-specific items were tested. It was anticipated that pediatric patients in Rheumatology, Gastroenterology, and General Pediatrics would manifest recurrent or chronic pain based on previous literature [39; 16; 45].

School-based participants were recruited through the Chapel Hill-Carrboro (NC) Public School System including elementary after school programs as well as required middle and high school health classes. An informational packet about the study, including informed consent documents and a sociodemographic form, was mailed to all of the parents with children enrolled in the health classes to complete and return to the school.

Parents signed an informed consent document and children signed an informed assent document that outlined the following: purpose of the study, participation requirements, potential benefits and risks of participation and measures implemented to protect participant privacy. Child participants received a $10 gift card in return for their time and effort. The study protocols were approved by the institutional review boards at each institution.

To limit respondent burden, the number of items administered to any respondent was limited to no more than 76 items out of the entire pool of 293 PROMIS items and the legacy questionnaires. The items were written to accommodate low literacy levels [8]. Based on the experience of the research team, it was estimated that the younger children would be able to complete the survey in about 25 minutes and the adolescents in about 15 minutes. The 293 PROMIS items were divided among 4 testing forms and one additional form containing only general ‘legacy’ scales (See Table 1). The legacy scales were administered on a separate test form to characterize the population, but were not administered together with the PROMIS items. As such, this data collection does not allow us to compare individual responses on the legacy instruments with responses to the PROMIS items. Some items were administered on more than one form. The inclusion of overlapping items on different forms permits an evaluation of the associations between domains. Each PROMIS item from non-disease specific banks was administered to at least 754 respondents across four forms.

Table 1
Survey participants demographic and background information

Children without asthma were assigned sequentially to 1 of 5 forms (4 forms with PROMIS items and a few legacy general items and 1 form containing only legacy scales). This sampling plan was developed for collecting responses to the candidate items from the targeted PROMIS domains and was designed to accommodate multiple objectives: (1) confirm the factor structure of the domains; (2) evaluate items for local dependence (LD) and differential item functioning (DIF); and (3) calibrate the items for each domain using Item Response Theory.

We developed the PROMIS Pediatric item banks using a strategic item generation methodology adopted by the PROMIS Network [6]. Six phases of item development were implemented: identification of existing items, item classification and selection, item review and revision, focus group input on domain coverage, cognitive interviews with individual items, and final revision before field testing. Identification of items refers to the systematic search for existing items in currently available pediatric scales. This was utilized to identify an initial item pool of over 3345 items. Expert item review and revision was conducted by trained professionals who reviewed the wording of each item and revised as appropriate for conventions adopted by the PROMIS network [4; 6]. Focus groups were used to confirm domain definitions, and to identify new areas of item development for future PROMIS item banks [46]. Cognitive interviews were used to examine and refine wording of individual items [17]. The pediatric items were written in the past tense with a seven day recall period and most utilized a standard set of response options [17]. Items successfully screened through the cognitive interview process were sent to field testing. The final item set contained 293 items across the 6 domains (Physical Function, Emotional Distress, Social Role Relationships, Fatigue, Pain, Asthma) [17].

Most pain items had a 7-day recall period and used standardized 5-point response options (never, almost never, sometimes, often, almost always). Occasionally, participants responded to items on an 11-point pain intensity scale (0 through 10), or a response scale in reference to the number of days (0 through 7 days). A complete list of items may be found in the Tables and Appendix.

Statistical and Psychometric Methods

The PROMIS methods used for the psychometric evaluation and calibration of the pain items have been previously described [29]. First, traditional descriptive statistics were computed to verify that there were no empty (zero frequency) response categories for any item, and as preliminary checks on the validity of the data. Included in these checks were tables of marginal frequencies of item responses and the correlations of item scores with the total summed score.

The IRT model that is used here for item analysis and scoring is based on the assumption that responses to the items indicate individual differences on a single underlying, or latent, variable (here, pain interference). To confirm the validity of that assumption, the second phase of the data analysis used confirmatory factor analysis (CFA) of the interitem polychoric correlation matrix to ensure that the latent variable underlying the item responses was unidimensional. These analyses were performed using the DWLS algorithm as implemented in the software LISREL [18]; this approach takes into account the categorical nature of the item responses, in a way that corresponds with the IRT model that is subsequently used.

In addition to the single-factor model, models with additional factors and/or error covariances were fitted; the need for these served as an indication of local dependence (LD) for pairs or small numbers of items. LD is a term that describes any violation of the local independence assumption of unidimensional IRT [15]; that assumption is that all of the observed covariation among the item responses is accounted for by the single latent variable being measured by the scale. If a pair of items is more correlated than is accounted for by the latent variable underlying the responses to all of the (other) items, responses to those items behave to some extent as though the same question had been asked twice (which would produce perfect LD). If an additional factor appears for a small subset of items, those items as a cluster measure some other aspect of individual difference variation, and the data analyst must decide whether to measure that additional aspect separately or set it aside. In constructing the pain interference scale, only one item from each locally dependent subset was retained; the others were set aside.

Third, after conducting CFA, item sets determined to be unidimensional were calibrated by fitting Samejima’s Graded Response Model (GRM; [30]) using the software Multilog [7] (the GRM has been selected for other PROMIS scales [29]). Calibration, as that term is used in IRT analysis, refers to the estimation of a set of parameters for each item that characterize the relation of the item responses with the latent variable (here, pain interference) being measured. For each item, the GRM estimates a slope or discrimination parameter (a), reflecting the degree of association of the item responses with the latent construct being measured, and a set of threshold parameters (b_k), four for items with five response options (seven for items with eight response options), that indicate the level of pain interference at which a response in a given category or higher becomes probable. In item analysis, the item parameters are used to compute an information function for each item. The statistical information provided by each item reflects the degree to which the item contributes to the precision of measurement of the scale in an additive way: if five items each have information equal to 2.0 at some value of the latent variable, then the information value for the five-item scale is 10. The error variance of the scale at that value of the latent variable is the inverse of the information, 0.1 in standard-score units. Classical Test Theory is based on algebra that assumes that the error variance has the same value for all scores; in the classical theory, for scores in standard units, reliability is one minus the error variance, so for error variance 0.1, reliability is 0.9. IRT more realistically represents error variance as a quantity that varies as a function of the latent variable; error variance is small for levels of the latent variable where the items provide information, and larger elsewhere.
Nevertheless, because 0.9 has often been considered a useful value of reliability, for IRT analysis 10 is a useful value of information. The item parameters computed during calibration can be used to identify the levels of the latent variable for which the items provide information, and items can be selected until aggregate information exceeds some desired value, like 10.

The item parameters obtained in the calibration phase are also used to compute IRT scale scores, either for a summed score for a fixed set of items, or for response patterns for any arbitrary subset of items in a pool. IRT scale scores are estimates of the value of the latent variable (pain interference, here) for which the observed item responses are likely. As a consequence of the assumption of unidimensionality, the IRT scale scores are on a single continuum, and comparable, even if respondents are measured using different subsets of items. This aspect of IRT represents one of its most important advantages over the classical theory, which can provide comparable scores only for a fixed set of items. One use of this feature of IRT is to assemble alternate short forms that yield comparable scores. Another more extreme use is to administer CATs, which adaptively select a customized set of items for each respondent, to provide maximum information at the level of that person. When using a CAT, each person may respond to a different set of questions; nevertheless, their IRT scale scores are quantitatively comparable.
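Response-pattern scoring can be illustrated with a minimal sketch of expected a posteriori (EAP) estimation, one common IRT scoring method; the text does not name the estimator, so treat that choice as an assumption. The item labels and parameters are hypothetical, the prior is assumed to be standard normal on the reference-group metric, and results are reported as T = 50 + 10θ. Because each score uses only the items a respondent actually answered, respondents who see different subsets still land on the same continuum.

```python
import math

def grm_probs(theta, a, b):
    """Graded response model category probabilities for one item (a = slope, b = thresholds)."""
    star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    return [star[k] - star[k + 1] for k in range(len(b) + 1)]

def eap_t_score(responses, params, n_points=81, lo=-4.0, hi=4.0):
    """EAP score and posterior SD on the T-score metric for any subset of items.

    responses: dict of item label -> observed category (0-based, e.g. 0-4)
    params: dict of item label -> (slope, list of thresholds)
    """
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    weights = []
    for t in grid:
        w = math.exp(-0.5 * t * t)  # standard normal prior, unnormalized
        for item, x in responses.items():
            a, b = params[item]
            w *= grm_probs(t, a, b)[x]  # likelihood of the observed category
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return 50.0 + 10.0 * mean, 10.0 * math.sqrt(var)

# Hypothetical parameters for three items.
PARAMS = {
    "item_a": (2.4, [-0.2, 0.6, 1.4, 2.2]),
    "item_b": (2.1, [0.1, 0.9, 1.7, 2.5]),
    "item_c": (1.9, [0.4, 1.2, 2.0, 2.8]),
}

if __name__ == "__main__":
    # Two respondents answer different subsets; both results are T-scores.
    print(eap_t_score({"item_a": 2, "item_b": 1}, PARAMS))
    print(eap_t_score({"item_c": 3}, PARAMS))
```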

The goodness of fit of the IRT model to the data was examined using the S-X2 statistic [24; 25] (generalized by Bjorner et al. [3]); a nonsignificant S-X2 value suggests adequate fit of the model to the data.

Fourth, for item selection for the final pool, differential item functioning (DIF) was investigated between males and females using the IRT-LR DIF detection procedure [35] as implemented in the software IRTLRDIF [33]. In this case, DIF indicates that the relation of item responses with the latent variable differs between boys and girls. Such a difference suggests that some other factor, related to gender but different from the construct being measured, influences item responses, which is a violation of the assumption of unidimensionality. Here again, a nonsignificant χ2 indicates a lack of DIF. Because DIF detection involves a large number of tests of significance, the Benjamini-Hochberg procedure [2; 47] was used to control for multiple comparisons. In addition to χ2 statistics, graphical methods, as suggested by Steinberg & Thissen [31], were used to evaluate the magnitude of effect sizes when significant DIF was detected. After the item pool was selected, we also evaluated DIF between younger (ages 8-11) and older (ages 12-17) respondents; because we do not expect the scale to be used for the purpose of comparison of pain interference among children classified by age, we did not include these results among the item selection criteria, but the results are reported here.
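The Benjamini-Hochberg step-up procedure cited above can be sketched in a few lines. The p-values in the usage example are made up for illustration, and the false discovery rate q = 0.05 is an assumption; the text does not state the level used.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Flag which hypotheses are rejected while controlling the false discovery rate at q.

    Step-up rule: sort the m p-values, find the largest rank k with
    p_(k) <= k * q / m, and reject the hypotheses with the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

if __name__ == "__main__":
    # Hypothetical DIF p-values for ten items.
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p))  # only the first two are flagged
```

Five of these made-up p-values fall below 0.05, but only two survive the correction, which is the kind of protection against chance findings that motivates its use across many DIF tests.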

Finally, though IRT scale scores may be computed from either item response patterns or summed scores, we expect scale scores for summed scores to be used more often. Thus, the Appendix Table A1 provides a translation table to be used for this purpose [34]. The IRT scale scores reported here use the North Carolina sample as the reference group.
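A summed-score translation table of this kind is commonly built with the Lord-Wingersky recursion, which accumulates the likelihood of every possible summed score one item at a time. The sketch below illustrates that general approach under stated assumptions rather than reproducing the authors' procedure from [34]: a standard normal prior stands in for the reference group, θ is evaluated on a grid from -4 to 4, and the item parameters are hypothetical. Each row gives a summed score, its EAP T-score, and the posterior SD in T-score units.

```python
import math

def grm_probs(theta, a, b):
    """Graded response model category probabilities for one item."""
    star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    return [star[k] - star[k + 1] for k in range(len(b) + 1)]

def summed_score_table(items, n_points=81, lo=-4.0, hi=4.0):
    """Translation table from summed score to EAP T-score (Lord-Wingersky recursion).

    items: list of (slope, thresholds) for a fixed form; categories scored 0..m-1.
    Returns a list of (summed score, T-score, posterior SD in T units).
    """
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    # like[s][q]: probability of summed score s at grid point q (before any item: score 0).
    like = [[1.0] * n_points]
    for a, b in items:
        probs = [grm_probs(t, a, b) for t in grid]
        n_cat = len(b) + 1
        new = [[0.0] * n_points for _ in range(len(like) + n_cat - 1)]
        for s, row in enumerate(like):
            for k in range(n_cat):
                for q in range(n_points):
                    new[s + k][q] += row[q] * probs[q][k]
        like = new
    prior = [math.exp(-0.5 * t * t) for t in grid]  # standard normal, unnormalized
    table = []
    for s, row in enumerate(like):
        w = [r * p for r, p in zip(row, prior)]
        total = sum(w)
        mean = sum(t * wi for t, wi in zip(grid, w)) / total
        sd = math.sqrt(sum((t - mean) ** 2 * wi for t, wi in zip(grid, w)) / total)
        table.append((s, 50.0 + 10.0 * mean, 10.0 * sd))
    return table

if __name__ == "__main__":
    form = [(2.0, [-0.5, 0.5, 1.5, 2.5]), (1.8, [0.0, 0.8, 1.6, 2.4])]  # hypothetical
    for score, t, sd in summed_score_table(form):
        print(f"{score:2d}  T = {t:5.1f}  SD = {sd:4.1f}")
```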


Test forms containing PROMIS pediatric pain items were completed by a total of 3,048 respondents. The sample was about 52% female and 58% of the children were between the ages 8 to 12 years old. Sixty percent were Caucasian, 21% Black, 6% multi-racial, and 13% other races (Asian/Pacific Islanders, Native Americans and Other Races). Eighteen percent of the sample was of Hispanic ethnicity. The vast majority of the adults providing informed consent for the children were parents of the child (92%) or grandparents (4%). The educational attainment of these parents or guardians ranged from less than high school (8%) to advanced degree (13%) with 25% reporting a college degree, 33% some college, and 21% a high school diploma. Approximately 23% of the children participating in the survey had a chronic illness diagnosis during the past 6 months. Participant characteristics are summarized in Table 1.

There were adequate numbers of pain items on each of the four forms to permit factor analysis of each. Tables 2 and 3 provide the factor loadings from models that fit well. The models indicate that the items on separate forms are generally unidimensional, though with some evidence of local dependence. Local dependence, or nuisance multidimensionality, is modeled in Forms 1, 2, and 4 (Table 2) by error covariances (in this case between two items, or “doublets”). Form 3 (Table 3) contains three items (a “triplet”) pertaining to the physical limitations caused by pain, and as such was modeled as a second factor (with a correlation between the general pain interference factor and the “difficulty moving” subfactor). Indicators of goodness of fit suggest all four models fit the data well, using indices suggested by Reeve et al. [29]: For Form 1 (Table 2), χ2(7) = 9, CFI = 1.00, TLI = 1.00, RMSEA = 0.02; Form 2 (Table 2), χ2(12) = 10, CFI = 1.00, TLI = 1.00, RMSEA = 0.00; Form 3 (Table 3), χ2(10) = 8, CFI = 1.00, TLI = 1.00, RMSEA = 0.00; and Form 4 (Table 2), χ2(13) = 21, CFI = 1.00, TLI = 1.00, RMSEA = 0.03.
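The RMSEA values reported above can be roughly checked against the usual formula, RMSEA = sqrt(max(χ2 - df, 0) / (df (N - 1))). This is only a sketch: the per-form sample size is not reported here, so N = 762 (about 3,048 / 4) is an assumption, the reported χ2 values are rounded, and LISREL's DWLS output may base the index on an adjusted χ2.

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation from a model chi-square."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

if __name__ == "__main__":
    n_per_form = 762  # assumption: roughly 3,048 respondents split over four forms
    reported = {"Form 1": (9, 7), "Form 2": (10, 12), "Form 3": (8, 10), "Form 4": (21, 13)}
    for form, (chi2, df) in reported.items():
        print(form, round(rmsea(chi2, df, n_per_form), 2))
```

Under that assumed N, the rounded values agree with the reported 0.02, 0.00, 0.00, and 0.03; a χ2 at or below its degrees of freedom always yields an RMSEA of zero.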

Table 2
Factor Loadings and Error Covariances for Pain Interference Items on Forms 1, 2, and 4
Table 3
Factor Loadings and Error Covariances for Pain Interference Items on Form 3.

The local dependence in Forms 1 through 4 occurs primarily because items share similar wording or have shared content that differs from the content of the scale’s other items. As an example of shared item content, Form 3 contains a “triplet,” or 3 items with responses that are more related than expected given the items’ relationship with the pain interference dimension; in this case the triplet measures physical limitations caused by pain. In other instances, local dependence may result from similar wording or a shared response scale. Form 2 contains two items measuring pain intensity on a 0 to 10 scale. In addition to being similarly worded and assessed on a unique response scale, these items measure pain intensity, while the scale’s other items assess interference with daily activities caused by pain. To ensure unidimensionality of the final scales, only one item from each doublet or triplet was included in the final item pool.

Following the factor analyses, locally independent sets of items from Forms 1 through 4 were calibrated using the GRM. To control for local dependence identified in the item factor analyses, separate item calibrations were completed for each collection of unidimensional items. This process resulted in two sets of calibrations for each form (three in the case of Form 3). To avoid capitalization on chance, we conservatively selected parameter estimates across calibrations that had the lower estimated slope. Table 4 shows the item parameter estimates, item fit statistics (S-X2), and DIF statistics (LR χ2) for the items comprising the final pool (sorted in order of magnitude of slope parameters), and for the items set aside.

Table 4
Item Parameters, Fit Indices, and DIF Statistics for the Pain Interference Items

The Benjamini-Hochberg correction for multiplicity was used with the fit and DIF statistics. Two items had either significant DIF or significant lack of fit as indicated by the S-X2 statistic; however, these items were retained given the relatively good fit of the items comprising the final pool. As indicated in Table 4, there were 15 items set aside. Five were set aside from locally dependent item sets. An additional five were set aside due to low discrimination parameters; interestingly, these items measured pain intensity, and as such discriminate poorly between levels of pain interference. Finally, four items were set aside for DIF (both threshold and slope DIF). As an interpretive example of threshold DIF, boys were less willing to endorse the item “It was hard to do sports or exercise when I had pain,” after controlling for mean and variance differences between boys and girls. Additionally, slope DIF occurred for the item “I felt grumpy when I had pain,” indicating that “feeling grumpy” is a poor indicator of pain interference for boys. The remaining 13 items comprise the final pain item pool.

In the analysis of DIF by age, five of the 13 items in the pool exhibited significant DIF. For three of those items, the aggregate effect size of the DIF is very small: For the items “It was hard for me to pay attention when I had pain,” “I had trouble doing schoolwork when I had pain,” and “I felt angry when I had pain,” the difference between older and younger children in the expected value of the item response on the 0-4 scale is much less than a half point across the entire range of the latent variable pain interference. To a large extent, the tendency is for those three items to be slightly more discriminating for older than younger children. For the item “It was hard to remember things when I had pain,” younger children tend to give slightly higher responses than older children; the difference, which varies as a function of the latent variable, is around a half point on the 0-4 scale. For “It was hard to get along with other people when I had pain,” older children tend to select slightly higher responses than younger children (again, the difference is a fraction of a point on the 0-4 scale, and is only observed for respondents at high levels of pain interference).

Figure 1 shows test information functions for the pain item pool and four potential short forms on a T-score scale with a mean of 50 and standard deviation of 10 (on which all PROMIS scales are reported). Test information is the expected value of the inverse of the squared standard error of measurement, and indicates the precision of scores on a scaled metric. A standard error of measurement of approximately 0.32 (on a standardized metric, or 3.2 on a T-score metric) is associated with a test information value of 10 and hence a reliability coefficient of approximately 0.90. Three 8-item short forms provide test information greater than 10 for a range of scores between, approximately, 45 and 70 on the T-score scale. The recommended 8-item short form in the Appendix contains the item set which provides the maximum test information at the mean (50) on the T-score metric. However, if more score precision is required (or “broader” precision), the complete item pool, with its item parameters, is contained in Table 4 and may be used to compute IRT response pattern scores or IRT-scaled scores from summed scores.
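The correspondence between information, standard error, and reliability in this paragraph is simple arithmetic. The sketch below (function names are illustrative) converts a test information value into a standard error on the θ and T-score metrics and into the classical-style reliability 1 - 1/I, and maps T-scores back to θ.

```python
import math

def t_to_theta(t_score):
    """T-score (mean 50, SD 10) to the standardized latent metric."""
    return (t_score - 50.0) / 10.0

def precision_from_information(info):
    """Standard error (theta and T metrics) and reliability implied by test information."""
    se_theta = 1.0 / math.sqrt(info)
    return {"se_theta": se_theta, "se_T": 10.0 * se_theta, "reliability": 1.0 - 1.0 / info}

def information_from_se(se_theta):
    """Inverse relation: information is 1 / SE^2."""
    return 1.0 / se_theta ** 2

if __name__ == "__main__":
    print(precision_from_information(10.0))  # SE about 0.32 (3.2 on the T metric), reliability 0.90
    print(t_to_theta(45), t_to_theta(70))    # the 45-70 T range spans theta -0.5 to 2.0
```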

Figure 1
Test information functions for Pediatric Pain Interference Scale.

Figure 1 also serves as a simulated Computer Adaptive Test (CAT). A CAT selects items based on an individual’s response to previous items. As such, a CAT can theoretically choose the most informative items for an individual depending on their level of the trait being measured, in this case, pain interference. For this simulation, separate test information functions are computed from the 8 items that provide the most information at five possible score locations (30, 40, 50, 60, and 70 on the T-score metric). In other words, the items used to generate the test information function at T = 50 are those that a perfect CAT would select for an individual at the mean of pain interference. To consider the usefulness of CAT given these items, one may compare both the range of score precision and the magnitude of score precision across the separate potential short forms. In this case, because the items in the final pool generally discriminate in the same range, there is little score precision gained between the four potential short forms. However, the PROMIS Assessment Center contains the item pool and is capable of administering these items as a CAT if the researcher desires to do so.
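The "perfect CAT" simulation described here amounts to ranking items by their information at each target location and keeping the top eight; the recommended short form corresponds to the T = 50 case. This sketch assumes GRM item parameters are available (the bank shown is hypothetical, not the published calibration). It is not a CAT engine: a real CAT, as described above, chooses each item based on responses to previous items rather than from a known score location.

```python
import math

def grm_item_information(theta, a, b):
    """Fisher information of one graded response model item at latent level theta."""
    star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    info = 0.0
    for k in range(len(b) + 1):
        p = star[k] - star[k + 1]
        dp = a * (star[k] * (1.0 - star[k]) - star[k + 1] * (1.0 - star[k + 1]))
        if p > 0.0:
            info += dp * dp / p
    return info

def best_items_at(bank, t_score, n_items=8):
    """The n_items with the most information at a T-score location, and their total information."""
    theta = (t_score - 50.0) / 10.0
    ranked = sorted(bank, key=lambda name: grm_item_information(theta, *bank[name]), reverse=True)
    chosen = ranked[:n_items]
    total = sum(grm_item_information(theta, *bank[name]) for name in chosen)
    return chosen, total

def simulated_cat(bank, locations=(30, 40, 50, 60, 70), n_items=8):
    """Best n_items and their test information at each T-score location."""
    return {t: best_items_at(bank, t, n_items) for t in locations}

# Hypothetical bank: label -> (slope, thresholds).
BANK = {
    f"item{i}": (1.5 + 0.2 * i, [-1.0 + 0.3 * i, 0.3 * i, 1.0 + 0.3 * i, 2.0 + 0.3 * i])
    for i in range(10)
}

if __name__ == "__main__":
    for t, (chosen, total) in simulated_cat(BANK).items():
        print(t, round(total, 1), chosen)
```

Comparing the totals and item sets across locations is the comparison the text describes: when the same items top the ranking everywhere, a CAT gains little over a fixed short form.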


Recent recommendations from the Pediatric Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (PedIMMPACT) indicated that investigators conducting clinical trials in pediatric chronic and recurrent pain “should consider assessing outcomes in pain intensity; physical functioning; emotional functioning; role functioning; symptoms and adverse events; global judgment of satisfaction with treatment; sleep; and economic factors [23].” However, the consensus by the PedIMMPACT group was that “pain-related functional impairment” measures still require further research. The PROMIS Pediatric Pain Interference Scale in part addresses this identified gap in the empirical literature with the advantages of IRT analyses in the instrument development process.

The present study describes the development of the new NIH PROMIS Pediatric Pain Interference Scale based on an iterative series of IRT analyses regarding scale dimensionality, item local dependence, and differential item functioning. After determining scale dimensionality, items with local dependence and differential item functioning were next identified and removed resulting in the final unidimensional PROMIS Pediatric Pain Interference Scale. A number of possible methods for scoring are presented that can be tailored to meet the objectives of a particular clinical research endeavor. To our knowledge, this is the first pediatric pain interference scale developed through IRT analyses.

The vast majority of generic pediatric pain measures in the empirical literature have utilized Classical Test Theory and generally have not taken full advantage of IRT analysis in the scale development process. The potential advantages of utilizing IRT analysis in item and scale development include greater flexibility in selecting items from the existing pediatric pain item bank tailored to the objectives of a particular clinical research investigation. Further, scales that have been developed with Classical Test Theory often have gaps in their ability to measure the full spectrum of the latent construct; in contrast, with IRT-calibrated items one can construct a measure that is useful across the full continuum of the latent variable [10]. Thus, this analytic methodology provides clinical researchers the opportunity to select the most meaningful items for their study design and hypotheses. In the present study, we proposed a short form measuring pediatric pain interference; however, a smaller subset of items from the item bank can also be used and scored on the same metric as the larger set using a more dynamic CAT algorithm.

Because the pain items were administered across several test forms, we were unable to perform factor analyses on the entire bank. This limitation makes it impossible to ensure that pain items from different forms do not exhibit local dependence. Additionally, the factor analyses might have turned out differently had the pain items been analyzed as a single set. Instead, factor analyses were conducted on the subset of pain items administered on each form. Because the pain items were written to fill content areas identified in qualitative work and were then randomly allocated to the test forms, the different forms can be viewed as replications. Patterns of multidimensionality that recurred across these replicated analyses increased our confidence in the factor analytic results. We are currently conducting cross-sectional testing with the entire bank to verify these results.

We recruited children from clinics in Texas and North Carolina and from schools in North Carolina to obtain a sample with diverse experiences, not only in health outcomes but also in cultural and ethnic influences. This study does not report on the use of the items in languages other than English or with children living in other countries; as such, we cannot assume that the scales would have the same test characteristics in those populations.

Using the current sample, we determined that two of the items in the pool, “It was hard to remember things when I had pain” and “It was hard to get along with other people when I had pain,” exhibit enough DIF between younger and older children that it would be unwise to use them in an instrument meant to compare pain interference levels across age groups. However, for comparisons within an age group on other variables, such as treatment, those items are discriminating and useful, so they remain in the pool. Future research with other samples may reveal other sources of DIF for other items; an advantage of IRT is that it can detect item-level DIF (a concept completely ignored by Classical Test Theory) and “flag” items to be used only with caution when comparing across levels of a variable for which DIF exists. Because comparisons across gender are ubiquitous, items exhibiting substantial DIF between boys and girls have been set aside from the item pool. Although the careful analysis of DIF performed in this study led to a smaller item bank, we believe this approach will ultimately yield a more broadly applicable measure for comparing results across important populations.
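What DIF means in IRT terms can be illustrated with expected item scores. Under an assumed graded response model, an item shows DIF when two groups at the same level of pain interference (the same θ) have different response probabilities. The sketch below uses hypothetical parameters in which one group's thresholds are shifted upward by 0.5 (uniform DIF; a slope difference would instead produce non-uniform DIF). It only illustrates the concept; it does not implement the statistical DIF tests used in this study, and the names `expected_score`, `THRESHOLDS_GROUP_A`, and `THRESHOLDS_GROUP_B` are illustrative.

```python
import math

def cumulative_probs(theta, a, thresholds):
    """Graded response model P*(X >= k), padded with 1 (k = 0) and 0 (k = m + 1)."""
    return [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]

def expected_score(theta, a, thresholds):
    """E[X | theta] = sum_k k * P(X = k) for responses scored 0..m."""
    star = cumulative_probs(theta, a, thresholds)
    return sum(k * (star[k] - star[k + 1]) for k in range(len(star) - 1))

# Hypothetical parameters for one item in two groups (e.g., younger vs. older children).
# Group B's thresholds are shifted up by 0.5 (uniform DIF): at the same level of
# pain interference, group B endorses the higher response options less often.
A = 1.8
THRESHOLDS_GROUP_A = [-0.5, 0.3, 1.1, 1.9]
THRESHOLDS_GROUP_B = [b + 0.5 for b in THRESHOLDS_GROUP_A]

if __name__ == "__main__":
    for theta in (-1.0, 0.0, 1.0, 2.0):
        ea = expected_score(theta, A, THRESHOLDS_GROUP_A)
        eb = expected_score(theta, A, THRESHOLDS_GROUP_B)
        print(f"theta={theta:+.1f}  group A={ea:.2f}  group B={eb:.2f}  gap={ea - eb:.2f}")
```

Because the gap appears at equal θ, it cannot be attributed to a difference in pain interference itself, which is why such an item would bias comparisons between the groups while remaining usable for comparisons within a group.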

The PROMIS Pediatric Items use a 7-day recall period. The appropriate recall period for pain and other symptoms and functions is a topic of considerable debate, with no sound conclusions as to the “best” way to construct a measure, particularly for children. Recall effects would almost certainly be more pronounced for pain severity or frequency than for pain interference. The pain interference items ask respondents to assess how pain affected their activities, which anchors their pain experience in those other activities.

The PROMIS pediatric pain item bank was developed to provide accurate and efficient assessment of this important domain using IRT item calibrations, in anticipation of its use in pediatric patients with chronic and recurrent pain. We are currently testing this item bank, along with other PROMIS pediatric scales, in children with rheumatic disease, sickle cell disease, cancer, chronic kidney disease, or obesity, and in a rehabilitation population, to further evaluate aspects of construct validity. In conclusion, the present study provides initial IRT calibrations of the PROMIS pediatric pain interference item bank and describes the creation of the NIH PROMIS Pediatric Pain Interference Scale, which addresses an important gap in the current literature. Further research is needed on construct validity, including hypothesized associations with emotional distress [38; 36; 9], fatigue [37; 11], functional status [43; 13], pediatric pain coping strategies [42; 45], and generic health-related quality of life [39; 16; 11], as well as on the responsiveness of this new scale and item bank in larger samples of pediatric patients with chronic and recurrent pain.


The present study provides initial calibrations of the National Institutes of Health (NIH) Patient Reported Outcomes Measurement Information System (PROMIS) pediatric pain item bank and the creation of the PROMIS Pediatric Pain Interference Scale. It is anticipated that this new scale will have application in pediatric chronic and recurrent pain.


This work was funded by the National Institutes of Health through the NIH Roadmap for Medical Research, Grant 1U01AR052181-01. Information on the Patient-Reported Outcomes Measurement Information System (PROMIS) can be found at and We would like to acknowledge the contribution of Harry A. Guess, MD, PhD to the conceptualization and operationalization of this research prior to his death. We thank Jolynn Pek, Guillaume Filteau, and James McGinley for assistance with the data analysis.


Listed below are the item stems for the recommended eight-item short form of the PROMIS Pediatric Pain Interference Scale. All items use a 7-day recall period (the preface is “In the past seven days”) and a 5-point response scale with the options never (0), almost never (1), sometimes (2), often (3), and almost always (4).

PROMIS Pediatric Pain Interference Scale Items

  • I had trouble sleeping when I had pain.
  • It was hard for me to pay attention when I had pain.
  • It was hard to stay standing when I had pain.
  • It was hard to have fun when I had pain.
  • I had trouble doing schoolwork when I had pain.
  • It was hard for me to walk one block when I had pain.
  • It was hard for me to run when I had pain.
  • I felt angry when I had pain.

The translation from summed score to scale score for this short form is given in Table A1.

Table A1

Summed Score to Scale Score Translation Table for the Recommended Short Form


Scale scores are on a T-score scale (mean 50, standard deviation 10); the values in the SD column are conditional standard errors of measurement.
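One standard way to build a summed-score-to-scale-score table like Table A1 is the summed-score expected a posteriori (EAP) approach: the Lord–Wingersky recursion gives the probability of each summed score at each θ, and combining these with a standard normal prior yields a posterior mean and standard deviation for every summed score, which are then rescaled to T = 50 + 10θ. The sketch below assumes that approach and a graded response model; the eight items' parameters in `ITEMS`, the quadrature settings, and all function names are hypothetical, so its output does not reproduce Table A1.

```python
import math

def category_probs(theta, a, thresholds):
    """Graded response model P(X = k), k = 0..m."""
    star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [star[k] - star[k + 1] for k in range(len(star) - 1)]

def summed_score_dist(theta, items):
    """Lord-Wingersky recursion: P(summed score = s | theta), s = 0..max score."""
    dist = [1.0]
    for a, thresholds in items:
        probs = category_probs(theta, a, thresholds)
        new = [0.0] * (len(dist) + len(probs) - 1)
        for s, p_s in enumerate(dist):
            for k, p_k in enumerate(probs):
                new[s + k] += p_s * p_k  # add this item's score k to running sum s
        dist = new
    return dist

def summed_score_table(items, n_quad=81, lo=-5.0, hi=5.0):
    """(summed score, EAP T-score, posterior SD in T units) for every summed score."""
    grid = [lo + i * (hi - lo) / (n_quad - 1) for i in range(n_quad)]
    prior = [math.exp(-0.5 * t * t) for t in grid]  # N(0, 1) up to a constant that cancels
    likes = [summed_score_dist(t, items) for t in grid]
    table = []
    for s in range(len(likes[0])):
        w = [prior[q] * likes[q][s] for q in range(n_quad)]
        total = sum(w)
        mean = sum(wq * t for wq, t in zip(w, grid)) / total
        var = sum(wq * (t - mean) ** 2 for wq, t in zip(w, grid)) / total
        table.append((s, 50.0 + 10.0 * mean, 10.0 * math.sqrt(var)))
    return table

def summed_score(responses):
    """Sum eight responses coded 0 (never) .. 4 (almost always)."""
    if len(responses) != 8 or any(r not in (0, 1, 2, 3, 4) for r in responses):
        raise ValueError("expected eight responses coded 0-4")
    return sum(responses)

# Hypothetical parameters for eight items: NOT the published PROMIS calibrations.
ITEMS = [(2.0, [-0.6 + 0.1 * i, 0.2 + 0.1 * i, 1.0 + 0.1 * i, 1.8 + 0.1 * i]) for i in range(8)]

if __name__ == "__main__":
    table = summed_score_table(ITEMS)
    s = summed_score([1, 0, 2, 1, 0, 1, 3, 0])
    _, t_score, se = table[s]
    print(f"summed score {s}: T = {t_score:.1f}, SE = {se:.1f}")
```

With eight items scored 0 to 4, the table has 33 rows (summed scores 0 to 32), and the posterior SD reported for each row plays the role of the conditional standard error of measurement described in the table note.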




[1] Ader DN. Developing the Patient-Reported Outcomes Measurement Information System (PROMIS). Medical Care. 2007;45(Suppl 1):S1–S2.
[2] Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995;57:289–300.
[3] Bjorner JB, Smith KJ, Edelen MO, Stone C, Thissen D, Sun X. IRTFIT: A Macro for Item Fit and Local Dependence Tests under IRT Models. QualityMetric Incorporated; Lincoln, RI: 2007.
[4] Cella D, Yount S, Rothrock N, Gershon R, Cook K, Reeve B, Ader DN, Fries JF, Bruce B, Rose M. The Patient-Reported Outcomes Measurement Information System (PROMIS): Progress of an NIH Roadmap Cooperative Group during its first two years. Medical Care. 2007;45(Suppl 1):S3–S11.
[5] Cohen LL, Lemanek K, Blount RL, Dahlquist LM, Lim CS, Palermo TM, McKenna KD, Weiss KE. Evidence-based assessment of pediatric pain. Journal of Pediatric Psychology. 2008;33:939–955.
[6] DeWalt DA, Rothrock N, Yount S, Stone AA. Evaluation of item candidates: The PROMIS qualitative item review. Medical Care. 2007;45(Suppl 1):S12–S21.
[7] du Toit M. IRT from SSI. Scientific Software International; Lincolnwood, IL: 2003.
[8] Eccleston C, Jordan AL, Crombez G. The impact of chronic pain on adolescents: A review of previously used measures. Journal of Pediatric Psychology. 2006;31:684–697.
[9] Eccleston C, McCracken LM, Jordan A, Sleed M. Development and preliminary psychometric evaluation of the parent report version of the Bath Adolescent Pain Questionnaire (BAPQ-P): A multidimensional parent report instrument to assess the impact of chronic pain on adolescents. Pain. 2007;131:48–56.
[10] Embretson SE, Reise SP. Item Response Theory for Psychologists. Erlbaum; Mahwah, NJ: 2000.
[11] Gold JI, Mahrer NE, Yee J, Palermo TM. Pain, fatigue, and health-related quality of life in children and adolescents with chronic pain. Clinical Journal of Pain. 2009;25:407–412.
[12] Goodman JE, McGrath PJ. The epidemiology of pain in children and adolescents. Pain. 1991;46:247–264.
[13] Hainsworth KR, Davies WH, Khan KA, Weisman SJ. Development and preliminary validation of the Child Activity Limitations Questionnaire: Flexible and efficient assessment of pain-related functional disability. Journal of Pain. 2007;8:746–752.
[14] Haley SM, Fragala-Pinkham MA, Dumas HM, Ni P, Gorton GE, Watson K, Montpetit K, Bilodeau N, Hambleton RK, Tucker CA. Evaluation of an item bank for a computerized adaptive test of activity in children with cerebral palsy. Physical Therapy. 2009;89:589–600.
[15] Hill CD, Edwards MC, Thissen D, Langer MM, Wirth RJ, Burwinkle TM, Varni JW. Practical issues in the application of item response theory: A demonstration using items from the Pediatric Quality of Life Inventory™ (PedsQL™) 4.0 Generic Core Scales. Medical Care. 2007;45(Suppl 1):S39–S47.
[16] Huguet A, Miró J. The severity of chronic pediatric pain: An epidemiological study. Journal of Pain. 2008;9:226–236.
[17] Irwin DE, Varni JW, Yeatts K, DeWalt DA. Cognitive interviewing methodology in the development of a pediatric item bank: A Patient Reported Outcomes Measurement Information System (PROMIS) study. Health and Quality of Life Outcomes. 2009;7(3):1–10.
[18] Joreskog KG, Sorbom D. LISREL 8.5. Scientific Software International, Inc.; Lincolnwood, IL: 2003.
[19] Langer MM, Hill CD, Thissen D, Burwinkle TM, Varni JW, DeWalt DA. Item response theory detects differential item functioning between healthy and ill children in quality-of-life measures. Journal of Clinical Epidemiology. 2008;61:268–276.
[20] McGrath PA. The measurement of human pain. Endodontics and Dental Traumatology. 1986;2:124–129.
[21] McGrath PA. Pain in children: Nature, assessment, and treatment. Guilford; New York: 1990.
[22] McGrath PJ. The clinical measurement of pain in children: A review. Clinical Journal of Pain. 1986;1:221–227.
[23] McGrath PJ, Walco GA, Turk DC, Dworkin RH, Brown MT, Davidson K, Eccleston C, Finley GA, Goldschneider K, Haverkos L, Hertz SH, Ljungman G, Palermo T, Rappaport BA, Rhodes T, Schechter N, Scott J, Sethna N, Svensson OK, Stinson J, von Baeyer CL, Walker L, Weisman S, White RE, Zajicek A, Zeltzer L. Core outcome domains and measures for pediatric acute and chronic/recurrent pain clinical trials: PedIMMPACT recommendations. Journal of Pain. 2008;9:771–783.
[24] Orlando M, Thissen D. Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement. 2000;24:50–64.
[25] Orlando M, Thissen D. Further examination of the performance of S-X², an item fit index for dichotomous item response theory models. Applied Psychological Measurement. 2003;27:289–298.
[26] Palermo TM. Assessment of chronic pain in children: Current status and emerging topics. Pain Research and Management. 2009;14:21–26.
[27] Palermo TM, Lewandowski AS, Long AC, Burant CJ. Validation of a self-report questionnaire version of the Child Activity Limitations Interview (CALI): The CALI-21. Pain. 2008;139:644–652.
[28] Palermo TM, Witherspoon D, Valenzuela D, Drotar DD. Development and validation of the Child Activity Limitations Interview: A measure of pain-related functional impairment in school-age children and adolescents. Pain. 2004;109:461–470.
[29] Reeve BB, Hays RD, Bjorner JB, Cook KF, Crane PK, Teresi JA, Thissen D, Revicki DA, Weiss DJ, Hambleton RK, Liu H, Gershon R, Reise SP, Lai JS, Cella D. Psychometric evaluation and calibration of health-related quality of life item banks: Plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Medical Care. 2007;45(Suppl 1):S22–S31.
[30] Samejima F. Graded response model. In: van der Linden WJ, Hambleton RK, editors. Handbook of Item Response Theory. Springer-Verlag; New York: 1997. pp. 85–100.
[31] Steinberg L, Thissen D. Using effect sizes for research reporting: Examples using item response theory to analyze differential item functioning. Psychological Methods. 2006;11:402–415.
[32] Stinson JN, Kavanagh T, Yamada J, Gill N, Stevens B. Systematic review of the psychometric properties, interpretability and feasibility of self-report pain intensity measures for use in clinical trials in children and adolescents. Pain. 2006;125:143–157.
[33] Thissen D. IRTLRDIF: Software for the computation of the statistics involved in item response theory likelihood-ratio tests for differential item functioning. L.L. Thurstone Psychometric Laboratory, The University of North Carolina at Chapel Hill; Chapel Hill, NC: 2001.
[34] Thissen D, Nelson L, Rosa K, McLeod LD. Item response theory for items scored in more than two categories. In: Thissen D, Wainer H, editors. Test Scoring. Lawrence Erlbaum Associates; Mahwah, NJ: 2001. pp. 141–186.
[35] Thissen D, Steinberg L, Wainer H. Detection of differential item functioning using the parameters of item response models. In: Holland PW, Wainer H, editors. Differential Item Functioning. Lawrence Erlbaum Associates; Hillsdale, NJ: 1993. pp. 67–113.
[36] Varni JW, Burwinkle TM, Katz ER. The PedsQL™ in pediatric cancer pain: A prospective longitudinal analysis of pain and emotional distress. Journal of Developmental and Behavioral Pediatrics. 2004;25:1–8.
[37] Varni JW, Burwinkle TM, Limbers CA, Szer IS. The PedsQL™ as a patient-reported outcome in children and adolescents with fibromyalgia: An analysis of OMERACT domains. Health and Quality of Life Outcomes. 2007;5(9):1–12.
[38] Varni JW, Rapoff M, Waldron SA, Gragg RA, Bernstein BH, Lindsley CB. Chronic pain and emotional distress in children and adolescents. Journal of Developmental and Behavioral Pediatrics. 1996;17:154–161.
[39] Varni JW, Seid M, Knight TS, Burwinkle TM, Brown J, Szer IS. The PedsQL™ in pediatric rheumatology: Reliability, validity, and responsiveness of the Pediatric Quality of Life Inventory™ Generic Core Scales and Rheumatology Module. Arthritis and Rheumatism. 2002;46:714–725.
[40] Varni JW, Thompson KL, Hanson V. The Varni/Thompson Pediatric Pain Questionnaire: I. Chronic musculoskeletal pain in juvenile rheumatoid arthritis. Pain. 1987;28:27–38.
[41] Varni JW, Walco GA, Katz ER. Assessment and management of chronic and recurrent pain in children with chronic diseases. Pediatrician. 1989;16:56–63.
[42] Varni JW, Waldron SA, Gragg RA, Rapoff MA, Bernstein BH, Lindsley CB, Newcomb MD. Development of the Waldron/Varni Pediatric Pain Coping Inventory. Pain. 1996;67:141–150.
[43] Varni JW, Wilcox KT, Hanson V, Brik R. Chronic musculoskeletal pain and functional status in juvenile rheumatoid arthritis: An empirical model. Pain. 1988;32:1–7.
[44] von Baeyer CL, Spagrud LJ. Systematic review of observational (behavioral) measures of pain for children and adolescents aged 3 to 18 years. Pain. 2007;127:140–150.
[45] Walker LS, Baber KF, Garber J, Smith CA. A typology of pain coping strategies in pediatric patients with chronic abdominal pain. Pain. 2008;137:266–275.
[46] Walsh TR, Irwin DE, Meier A, Varni JW, DeWalt DA. The use of focus groups in the development of the PROMIS pediatrics item bank. Quality of Life Research. 2008;17:725–735.
[47] Williams V, Jones LV, Tukey JW. Controlling error in multiple comparisons, with examples from state-to-state differences in educational achievement. Journal of Educational and Behavioral Statistics. 1999;24:42–69.
[48] Young NL, Williams JI, Yoshida KK, Wright JG. Measurement properties of the Activities Scale for Kids. Journal of Clinical Epidemiology. 2000;53:125–137.