To create self-report physical function (PF) measures for children using modern psychometric methods for item analysis as part of Patient Reported Outcomes Measurement Information System (PROMIS).
PROMIS qualitative methodology was applied to develop two PF item pools comprised of 32 mobility and 38 upper extremity items. Items were computer administered to subjects aged 8–17 years. Scale dimensionality and sources of local dependence (LD) were evaluated with factor analysis. Items were analyzed for differential item functioning (DIF) between genders. Items with LD, DIF, or low discrimination were considered for removal. Computerized adaptive testing performance was simulated, and short forms were constructed.
A total of 3,048 children (51.8% female, 40% non-white, 22.7% chronically ill) participated. At least 754 respondents answered each item. Factor analytic results confirmed two dimensions of PF. Fifty-two of the 70 items tested were retained, yielding a 23-item Mobility bank and a 29-item Upper Extremity bank, and 8-item short forms were created. The item banks provide high information from the population mean to 3 standard deviations below it.
PROMIS pediatric PF item banks and 8-item short forms assess two dimensions, mobility and upper extremity function, and show good psychometric characteristics after large scale testing.
The Patient Reported Outcomes Measurement Information System (PROMIS) was created through a National Institutes of Health initiative to improve patient reported outcomes (PRO) assessment (1). PROMIS uses modern psychometric methods, including item response theory (IRT), to construct item banks from which static short forms or computerized adaptive tests (CAT) may be created to measure outcomes in a more efficient and precise manner than is possible using classical test theory(2). We describe the development of PROMIS physical function (PF) scales for pediatrics.
Item banks developed to satisfy the assumptions of IRT offer several advantages related to the measurement properties of IRT. Three conditions are necessary for item bank development: unidimensionality, meaning that a scale measures a single underlying construct; lack of local dependence (LD), meaning that items share no covariance beyond that explained by the underlying construct; and lack of differential item functioning (DIF), meaning that people from different groups (e.g., age, gender) who have a given level of an underlying trait have the same probability of a given response. IRT based scales offer interval level scaling for better interpretation of change, calibration of items across a broad range of an underlying trait to overcome floor/ceiling effects, increased efficiency, and increased precision allowing more sensitivity to change(3). Furthermore, IRT based item banks support CAT, which employs an algorithm whereby only the most informative items targeting an individual’s functioning level are selected. CAT stands in stark contrast to traditional, fixed-length questionnaires which, in order to capture a breadth of patient abilities, may require patients to answer items that are irrelevant to them and thus create high respondent burden.
There are examples of other disability scales developed using IRT, including the Activities Scale for Kids (ASK), for children with musculoskeletal disorders(4), and the Pediatric Evaluation of Disability Inventory (PEDI), for children with developmental disorders(5, 6). The former includes domains of “personal care”, “play”, “locomotion”, and others, while the latter divides PF into two dimensions, “mobility” and “self-care”. Further, multi-dimensional CAT has been implemented in the PEDI(7). Yet, such measurement approaches have not been widely used outside of the disability community. The PROMIS scales aim to address the need for an IRT-based measurement system applicable across a range of health conditions, available for self- or proxy-administration, that is publicly available.
The PROMIS network aims to standardize PRO assessment across multiple chronic illness populations by creation of PRO item banks using a uniform methodology (8, 9) to cover a range of domains of health-related quality of life (HRQOL). The framework for the health domains measured by PROMIS item banks is based on the WHO tripartite conceptualization of health (physical, social, and emotional)(1, 10), with PF a central component of physical health. In addition to PF, PROMIS pediatric item banks were developed to measure pain, fatigue, anger, anxiety, depressive symptoms, peer relationships, and asthma symptoms by self-report in children ages 8–17 years old(11–14), with proxy-report versions in development for ages 5–17 years old. This report describes the construction and psychometric item analysis of the PROMIS pediatric physical function Mobility and Upper Extremity banks.
The PROMIS pediatric PF domain was conceptualized as “one’s ability to carry out various activities, ranging from self-care (activities of daily living) to more challenging and vigorous activities that require increasing degrees of mobility, strength, or endurance.” We hypothesized that PF is multi-dimensional; in addition to the two dimensions considered in this project, other dimensions remain in need of measurement.
PROMIS methodology for initial item pool creation has been well described elsewhere(9). In brief, a multi-step process began with systematic identification and compilation of PRO items in existing scales. Items were then sorted by unidimensional aspects of a latent trait. New items were devised where there were apparent gaps in coverage across the continuum of a construct. After these processes, a pool of 177 candidate PF items was created. Redundant, vague, misclassified, confusing, or disease specific items were then set aside. Items selected for inclusion were re-written to conform to a standardized stem, recall period, and response options. Individual items were revised to reflect input from cognitive interviews(15). PF items have a standard 7-day recall period (“In the past 7 days”), are written in the past tense (“I could…”), and have a standard 5-point response option: “With no trouble, With a little trouble, With some trouble, With a lot of trouble, Not able to do.” Items were written from the perspective of capability (16). Seventy PF items classified into two pools, Mobility (32) and Upper Extremity (38), remained for testing.
The items were divided across 4 different test forms, as listed in Tables 1a and 1b, and were computer administered to children aged 8–17 years. Each item was administered to at least 754 respondents. Due to a testing scheme that included several items from each of the item banks under development (e.g., PF, pain, fatigue), no one individual was administered the entire bank of PF items. Participants were recruited from medical clinics in North Carolina and Texas, and community schools in North Carolina. Parental informed consent and minor assent were obtained for all study participants. IRB approval was received from participating institutions.
The sample size requirements and testing scheme were designed to support the following analytic plan: 1) assessment of the domain factor structure, 2) tests for differential item functioning (DIF), 3) evaluation of local dependence, and 4) calibration of PF items using IRT methods.
The study population was characterized with descriptive statistics. We followed a standardized PROMIS framework for psychometric item analyses(8, 12). Data quality was verified and analyses were conducted to ensure that IRT model assumptions were met. Item bank dimensionality was assessed with confirmatory factor analysis (CFA) of the inter-item polychoric correlation matrices using the diagonally weighted least squares (DWLS) algorithm in the computer program LISREL(17). Initial CFA models were fit with two correlated factors (Mobility and Upper Extremity). Investigations for LD included identifying significant error covariances between pairs or small clusters of items(18). If LD was identified, only one item from the subset was selected to remain in the item bank and the others were set aside.
Items in the banks that satisfied the unidimensionality criterion, as demonstrated by CFA, were subsequently calibrated with Samejima’s Graded Response Model (GRM)(19, 20) using Multilog(21) software. For each item, the GRM estimates a slope or discrimination parameter (a), which indicates the degree of association between the item responses and the underlying construct (in this case either Mobility or Upper Extremity), and four thresholds (b_k) for five-category items. Each threshold marks the level of physical functioning at which a response in a given category or higher becomes more likely than not. The fit of the IRT model was assessed with the S-X² statistic(22–24), for which a non-significant result indicates adequate model fit.
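To illustrate how the slope and thresholds determine response probabilities, the following Python sketch computes the five category probabilities for a single graded response item. It is a minimal sketch of the standard logistic form of the model, not the Multilog implementation. The function name and parameter values are hypothetical (they are not estimates from the item banks), and the sketch assumes that responses are coded so that higher categories reflect better function.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under the graded response model.

    theta      -- latent trait level on the standard (z-score) metric
    a          -- slope (discrimination) parameter
    thresholds -- ordered thresholds b_1 < ... < b_m (m = 4 for 5 categories)
    """
    # Cumulative probabilities of responding in category k or higher.
    # The lowest boundary is 1 and the boundary above the top category is 0.
    cum = (
        [1.0]
        + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
        + [0.0]
    )
    # Each category probability is the difference of adjacent cumulatives.
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical item: illustrative slope and thresholds only.
probs = grm_category_probs(theta=0.0, a=2.0, thresholds=[-3.0, -2.0, -1.0, 0.0])
print([round(p, 3) for p in probs])
```

In this example the highest threshold equals the respondent's trait level, so the probability of the highest category is exactly one half, which matches the interpretation of a threshold given above.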
DIF between males and females was evaluated using IRT-LR DIF as implemented in IRTLRDIF(25, 26); a non-significant test statistic was the criterion for no DIF. The Benjamini-Hochberg procedure was used to make inferential decisions in the context of multiple comparisons(27, 28).
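For readers unfamiliar with the Benjamini-Hochberg procedure, the following Python sketch shows its step-up logic. The p-values are invented for illustration and are not the DIF results of this study, and the false discovery rate of 0.05 is an assumption, since the level used in the analysis is not stated here.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which tests are rejected.

    p_values -- one p-value per test (e.g., one DIF test per item)
    q        -- target false discovery rate (assumed 0.05 here)
    """
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at or below (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank

    # Reject every test ranked at or below k_max (the step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Illustrative p-values only.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example))
```

With these ten illustrative p-values only the two smallest are flagged, even though five fall below an unadjusted 0.05 cutoff.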
Short forms were created by selecting the items from the calibrated item bank that were most informative at one standard deviation below the mean (i.e., at a T-score of 40). The appendix contains IRT scale scores computed for the summed scores of both short forms(29), as we expect the summed scores will be most useful for end-users.
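The selection rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it uses the standard item information formula for the graded response model, the item names and parameters in the example bank are invented, and the location theta = -1 corresponds to a T-score of 40 under the usual conversion T = 50 + 10 * theta.

```python
import math


def grm_item_information(theta, a, thresholds):
    """Fisher information of one graded response item at trait level theta."""
    # Cumulative probabilities of responding in category k or higher.
    cum = (
        [1.0]
        + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
        + [0.0]
    )
    info = 0.0
    for k in range(len(thresholds) + 1):
        p = cum[k] - cum[k + 1]  # probability of category k
        # Derivative of the category probability with respect to theta.
        dp = a * (cum[k] * (1.0 - cum[k]) - cum[k + 1] * (1.0 - cum[k + 1]))
        if p > 0.0:
            info += dp * dp / p
    return info


def select_short_form(bank, theta=-1.0, n_items=8):
    """Pick the n_items most informative items at theta.

    bank -- dict mapping item name to (slope, list of thresholds)
    """
    ranked = sorted(
        bank,
        key=lambda name: grm_item_information(theta, *bank[name]),
        reverse=True,
    )
    return ranked[:n_items]


# Hypothetical three-item bank; names and parameters are illustrative only.
bank = {
    "item_a": (3.0, [-2.5, -2.0, -1.5, -1.0]),
    "item_b": (1.0, [-2.5, -2.0, -1.5, -1.0]),
    "item_c": (3.0, [2.0, 2.5, 3.0, 3.5]),
}
print(select_short_form(bank, theta=-1.0, n_items=2))
```

In the example, the highly discriminating item whose thresholds lie far above the target location ranks last, which shows why both the slope and the threshold locations matter when items are chosen for a particular score range.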
The candidate pediatric PF items were administered to a racially diverse study cohort of 3,048 children. 22.7% had a chronic medical condition (Table 2).
Results from a two common factor model confirmed that there are two dimensions underlying the 70 item PF item pool, as proposed a priori, which we have labeled Mobility and Upper Extremity Function. However, the two dimensions proved to be highly correlated (from r = 0.61 to 0.93 across forms). Factor loadings and error covariances are displayed in Appendix Tables A1a through A1d. Goodness of fit indices, evaluated as recommended by Reeve et al.(8), suggest adequate model fit. Form 1 (Table A1a): χ2(72) = 64, p = 0.74, CFI = 1.00, TLI = 1.00, RMSEA = 0.00. Form 2 (Table A1b): χ2(100) = 118, p = 0.10, CFI = 1.00, TLI = 1.00, RMSEA = 0.02. Form 3 (Table A1c): χ2(227) = 346, p = 0.00, CFI = 1.00, TLI = 1.00, RMSEA = 0.03. Form 4 (Table A1d): χ2(204) = 288, p = 0.00, CFI = 1.00, TLI = 0.99, RMSEA = 0.03.
On each of the four forms there were doublets or triplets of items which exhibited LD. This can happen when subsets of items share content or wording that is similar, yet different from the scale’s other items. For example, Form 4 has three locally dependent mobility items, “I could walk a mile”, “I could walk more than one block” and “I could keep up when I played with other kids” (Appendix Table A1d). These all convey the shared idea of endurance, quite distinct from other items on the form. In order to ensure unidimensionality of the scales, in general only one item from each doublet or triplet was preserved in the final item bank (Appendix Tables A1a-d; A2a, b). There were two exceptions. In Form 1, two Upper Extremity items with LD were both excluded due to poor psychometric performance. In addition, two Mobility items that exhibited LD to a small (albeit statistically significant) extent were both retained in the final item pool. Inclusion and exclusion decisions were made via team discussion, with an attempt to take both clinical and psychometric perspectives into account.
Eight items (6 Mobility and 2 Upper Extremity) were set aside after the factor analyses for one or more of the following reasons: loading on both factors, LD, a weak relationship with the construct, or too little variation. These eight items are marked with the notation “(d)” in Tables A1a-d. In Form 3, one of the Mobility items, “I needed help with a bath”, loaded onto the Upper Extremity factor and was thus moved to the Upper Extremity item pool.
After the factor analyses were complete, locally independent subsets of items were calibrated using the GRM. To avoid calibrating LD items, each calibration included only a single item from each LD doublet or triplet. Appendix Tables A2a and A2b show the item parameter estimates, sorted by the magnitude of information each item provides at one standard deviation below the mean. The tables also provide item fit (S-X²) and DIF statistics (LR X²) for the final item banks. Listed at the bottom of the tables are the 10 additional items that were set aside during the final assembly of the pool: three items for LD or poor model fit, five items for DIF, and two items for lack of discrimination. The final Mobility item bank contained 23 items and the final Upper Extremity bank contained 29 items.
Information functions of the full item bank and short form for Mobility and Upper Extremity are shown in Figure 1 and Figure 2, respectively. These are presented on a T-score scale in which the mean is set at 50 with a standard deviation of 10. Test information is the expected value of the inverse of the squared standard error of measurement, such that a standard error of measurement of approximately 0.32 (or 3.2 on a T-score metric) is associated with a test information value of 10, which corresponds to a reliability coefficient of 0.90. The short forms for both Mobility and Upper Extremity, each comprising the 8 most informative items at one standard deviation below the population mean (i.e., at 40 on a T-score metric), have information values greater than 10 between scores of about 20 and 45 for Mobility and about 20 and 40 for Upper Extremity.
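The correspondence among test information, standard error, and reliability quoted above can be checked with a few lines of Python. This is a minimal sketch that assumes the latent trait is scaled to unit variance in the reference population, so that reliability equals one minus the squared standard error; the function names are illustrative.

```python
import math


def standard_error(information):
    """Standard error of measurement on the z-score metric."""
    return 1.0 / math.sqrt(information)


def standard_error_t_metric(information):
    """Standard error on the T-score metric (standard deviation of 10)."""
    return 10.0 * standard_error(information)


def reliability(information):
    """Reliability, assuming unit variance of the latent trait."""
    return 1.0 - 1.0 / information


info = 10.0
print(round(standard_error(info), 2))           # about 0.32
print(round(standard_error_t_metric(info), 1))  # about 3.2
print(round(reliability(info), 2))              # 0.9
```

The same functions can be used to read other values off the information curves; for example, an information value of 5 corresponds to a reliability of 0.80 under the same assumption.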
Figures 1 and 2 also function as simulated CATs. Both figures contain five information functions, each representing the collective information of the 8 (potentially different) items that individuals at one of five score locations (30 through 70, on a T-score scale) would receive given a perfectly operating CAT. In this case, because the items are informative in generally the same locations, a CAT may not dramatically improve the efficiency of the questionnaire. Nonetheless, PROMIS Assessment Center (http://www.assessmentcenter.net/) contains the calibrated item bank and allows the user to select and administer items as a CAT.
PROMIS pediatric PF item banks and short forms assessing two dimensions of PF, Mobility and Upper Extremity Function, show strong psychometric characteristics after initial large scale testing. Factor analysis supported the creation of separate PF item banks. No individual completed the full set of items in either item bank, owing to concerns over respondent burden, and this is a potential limitation. An advantage of this approach, however, is that the findings were replicated across four forms, providing helpful evidence that persuaded us to keep the banks separate. Although the item banks represent different dimensions of PF, they are, not surprisingly, highly correlated. The approach to measurement of PF in the PROMIS pediatric item banks presented here diverges from that of the PROMIS PF bank for adult patients, in which PF is treated as a unidimensional construct, though not without debate(30).
With few exceptions(31),(32), traditional PF scales do not allow disaggregation of various aspects of PF, such as lower or upper body function. Aggregation of multiple aspects of PF into a single summary score may blunt the instruments, i.e., reduce precision, mute responsiveness to change, and on a more basic level reduce information and interpretability. The realm of PF assessment and how to define and handle its apparent subdomains is an underdeveloped area, resulting in divergent approaches to its measurement. Although multidimensional CAT is attractive in the potential for increased efficiency of measurement, this method requires added complexity in scoring(33).
One of the primary advantages of IRT derived measurement scales is the potential to overcome floor/ceiling effects by broadening the range of measurement. Despite following PROMIS procedures for item development and testing, the PROMIS Pediatric PF scales show a ceiling effect. Indeed the scaled scores for Mobility and Upper Extremity Function short forms reach only 59 and 57, respectively, less than one standard deviation above the population mean, with a broader range at lower levels of function, down to 14 and 10, respectively.
Large scale testing is currently underway in children with chronic illnesses, including cancer, chronic kidney disease, and rheumatic diseases. Future work is needed for translation and cross cultural validation. Additional item development is needed to enhance measurement of individuals with levels of function above the population mean.
We described the item development, item analyses, and construction of the first version of the PROMIS Pediatric Mobility and Upper Extremity Function item banks. These instruments were tested in a large, diverse population of children ages 8–17 years and show excellent test properties in preliminary testing. Additional work is underway to further validate and calibrate the instruments in a variety of chronic illness populations.
We are grateful to Harry A. Guess, MD, PhD, under whose vision and leadership this PROMIS project to develop item banks for pediatrics took shape.
Funding for this research was provided to participating institutions by the National Institutes of Health through the NIH Roadmap for Medical Research, Cooperative Agreements 1U01AR052181-01 to University of North Carolina, PI: Darren DeWalt, MD, MPH; and U01AR52186 to Duke University, PI: Kevin Weinfurt, PhD. Additional information on the Patient Reported Outcomes Measurement Information System is available at http://nihroadmap.nih.gov and http://www.nihpromis.org