HHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal.

J Appl Meas. Author manuscript; available in PMC 2013 June 19.
Published in final edited form as:
J Appl Meas. 2010; 11(3): 304–314.
PMCID: PMC3686485
NIHMSID: NIHMS471290

The Use of PROMIS and Assessment Center to Deliver Patient-Reported Outcome Measures in Clinical Research

Abstract

The Patient-Reported Outcomes Measurement Information System (PROMIS) was developed as one of the first projects funded by the NIH Roadmap for Medical Research Initiative to re-engineer the clinical research enterprise. The primary goal of PROMIS is to build item banks and short forms that measure key health outcome domains manifested in a variety of chronic diseases, so that they can serve as a “common currency” across research projects. To date, item banks, short forms and computerized adaptive tests (CATs) have been developed for 13 domains relevant to pediatric and adult subjects. To enable easy delivery of these new instruments, PROMIS built a web-based resource (Assessment Center) for administering CATs and collecting other self-report data, tracking item and instrument development, monitoring accrual, managing data, and storing statistical analysis results. Assessment Center can also deliver custom, researcher-developed content, and it has numerous features that support both simple and complex accrual designs (branching, multiple arms, multiple time points, etc.). This paper provides an overview of the development of the PROMIS item banks and details Assessment Center functionality.

Historically there has been little consistency in the assessment of patient-reported outcome measures in clinical research. Investigators often measure common symptoms such as fatigue, pain or depression with varied questionnaires, resulting in data that cannot easily be compared. To help standardize the measurement of common symptoms and other aspects of self-reported health, the National Institutes of Health (NIH) awarded several interrelated grants under the NIH Roadmap for Medical Research Initiative to re-engineer the clinical research enterprise. These grants went to seven institutions in 2004 to form a cooperative network under the heading of the Patient-Reported Outcomes Measurement Information System (PROMIS).

The primary goal of PROMIS was to build item banks and short forms that measure key health outcome domains manifested in a variety of chronic diseases. All items considered for inclusion in the PROMIS item banks underwent rigorous qualitative review by experts and patients, as well as quantitative analysis of data collected from general population and clinical samples. Data from a large sample of individuals with a variety of chronic diseases were analyzed to calibrate the final item banks. Concurrently, the project built an electronic web-based resource for administering computerized adaptive tests, collecting self-report data, and reporting health assessment results instantly. This paper briefly details the development of the PROMIS item banks and the construction of Assessment Center.

Part I. The Development of the PROMIS Item Banks

Drawing on decades of experience reflected in the published literature and in the investigators' previous work, PROMIS began item bank development by cataloguing items from well-established instruments. PROMIS investigators conducted inclusive searches and evaluations of existing instruments to enrich the pool of domain-relevant items considered as potential candidates for the PROMIS item banks. Each domain work group constructed its own search strategy based on the specific needs identified within the domain. For example, the Emotional Distress domain group identified 4 general areas (referred to as “subdomains”) for starting bank development: depression, anxiety, anger, and substance misuse. The group then created search strategies to identify all known items covering these topics (Klem et al., 2009). Items that applied only to a specific population were not filtered out but were kept for further qualitative analysis. Through these searches, PROMIS investigators identified thousands of items relevant to the PROMIS domains. In this initial cycle no judgment was made regarding the quality or redundancy of the items; items were selected only if their content fit, or was deemed proximal to, the domain definitions.

Confronted with thousands of items, the investigators needed a method for sorting through the content and selecting the most representative and informative items. We called this method “binning and winnowing.” First, items were placed into common “bins,” typically defined by the content of the stem. Next, items were “winnowed” out if they were deemed redundant with, or inferior to, alternative items in the same bin (DeWalt, Rothrock, Yount, Stone, and PROMIS Cooperative Group, 2007).
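The bookkeeping side of binning and winnowing can be sketched in a few lines of Python. This is a toy illustration only: in PROMIS the bin assignments and the decisions about which items were redundant or inferior came from expert review, so the `bin_label` and `keep_rank` fields below are hypothetical stand-ins for those judgments, and the item texts are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateItem:
    item_id: str
    stem: str
    bin_label: str  # hypothetical: content bin assigned by a reviewer
    keep_rank: int  # hypothetical: reviewer preference within the bin (1 = best)


def bin_items(items):
    """Binning: group candidate items by the content bin of their stem."""
    bins = defaultdict(list)
    for item in items:
        bins[item.bin_label].append(item)
    return dict(bins)


def winnow(bins, keep_per_bin=1):
    """Winnowing: keep the best-ranked items per bin; the rest count as redundant or inferior."""
    return {
        label: sorted(members, key=lambda it: it.keep_rank)[:keep_per_bin]
        for label, members in bins.items()
    }


# Invented example items, not actual PROMIS content
candidates = [
    CandidateItem("A1", "I felt tired", "tiredness", 2),
    CandidateItem("B7", "I felt fatigued", "tiredness", 1),
    CandidateItem("C3", "I felt sad", "sadness", 1),
]
kept = winnow(bin_items(candidates))
```

Keeping the reviewer's judgment as explicit data, rather than as an automatic similarity score, mirrors the fact that winnowing was an expert decision.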

As described in DeWalt et al. (2007), preparation for item administration continued by unifying and rewriting items to produce a bank that was relevant, optimized, and adapted to the technologies being used: computer-based testing, item response theory (IRT), and computerized adaptive testing (CAT). Item revision drew on two key sources: expert opinion and patients/potential research participants (through focus groups and cognitive interviews). The focus groups provided patient input on conceptual gaps in the domain definitions, leading to the identification of new items, especially where existing items were judged not to provide adequate coverage. Cognitive interviews helped ensure that potential research participants, especially those with low literacy, understood the items as intended.

An extensive sampling plan was created (Cella et al., in press) in order to (a) calibrate the items in all of the domains, (b) enable later linking to legacy instruments, (c) calculate scores for several disease populations, (d) explore the factor structure of each domain, and (e) conduct differential item functioning analyses. Data collection took place over a nine-month period starting in the summer of 2006. Two data collection designs (“full bank” and “block administration”) were used. Full-bank administration allowed evaluation of dimensionality and calibration within item banks. Block administration permitted evaluation of associations between domains. Calibrating items on a common metric (item linking) was possible because blocks of items from the full bank were administered to both general population and clinical samples. Each item was administered to at least 900 respondents from the general population and 500 respondents with a chronic medical condition. Participants who were administered full banks also completed appropriate “legacy” items (items from widely used fixed-length measures such as the SF-36). In addition to the PROMIS and legacy items, participants were administered approximately 21 auxiliary items, including global health ratings, sociodemographic variables, and degree of limitation related to 25 chronic medical conditions. A complete description of the calibration strategy can be found in Reeve et al. (2007).
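The per-item minimums (at least 900 general population and 500 clinical respondents) can be expressed as a simple audit over administration records. The record layout `(respondent_id, item_id, sample)` and the sample labels `"general"` and `"clinical"` are assumptions made for this sketch, and the records in the usage lines are synthetic.

```python
from collections import Counter

MIN_GENERAL = 900   # minimum general population respondents per item (from the text)
MIN_CLINICAL = 500  # minimum respondents with a chronic condition per item (from the text)


def respondents_per_item(administrations):
    """Count distinct respondents per (item_id, sample) from (respondent_id, item_id, sample)."""
    seen = set()
    counts = Counter()
    for respondent_id, item_id, sample in administrations:
        if (respondent_id, item_id) in seen:
            continue  # count each respondent at most once per item
        seen.add((respondent_id, item_id))
        counts[(item_id, sample)] += 1
    return counts


def under_covered_items(counts, item_ids):
    """Return the items that fall short of either minimum."""
    return [
        item_id for item_id in item_ids
        if counts[(item_id, "general")] < MIN_GENERAL
        or counts[(item_id, "clinical")] < MIN_CLINICAL
    ]


# Synthetic records: item Y is one general population respondent short
records = (
    [(f"g{i}", "X", "general") for i in range(900)]
    + [(f"c{i}", "X", "clinical") for i in range(500)]
    + [(f"g{i}", "Y", "general") for i in range(899)]
    + [(f"c{i}", "Y", "clinical") for i in range(600)]
)
short = under_covered_items(respondents_per_item(records), ["X", "Y"])
```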

The PROMIS calibration sample included 21,133 subjects: 1,532 recruited from PROMIS Network sites and 19,601 from an online panel maintained by YouGovPolimetrix (www.polimetrix.com). Panel members were sampled using a matching procedure (Rivers, 2006) designed to be representative of the 2000 U.S. Census (stratified by 5 adult age bands, gender, and race/ethnicity for blacks and Hispanics) and to ensure the inclusion of subjects who had not graduated from high school. Analyses suggest that the representativeness of the online sample was comparable to that of a probability-based general population sample (Liu et al., in press).

The dataset was used to produce IRT parameters for each item according to the PROMIS analysis plan (Reeve et al., 2007), which included factor analysis to assess unidimensionality and estimation of item parameters using the graded response model (other models were used for some of the banks). Items were further assessed for differential item functioning by age and gender. Final unidimensional item banks were created for each of the domains. These banks demonstrated reliability, construct validity, and precision (Cella et al., in press). Similar pediatric item banks were created using additional pediatric samples (Irwin et al., 2010a; Irwin et al., 2010b; Varni et al., in press; Yeatts et al., 2010) (see Table 1). To deliver PROMIS CATs, we began developing Assessment Center (www.assessmentcenter.net).
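For reference, the graded response model gives the probability of each ordered response category as the difference between adjacent cumulative "boundary" curves, each a logistic function of the slope a and a category threshold b. The sketch below uses the logistic form without the 1.7 scaling constant; whether a given bank's parameters include that constant is an assumption to check against its calibration. The parameter values in the usage line are illustrative, not PROMIS calibrations.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under Samejima's graded response model.

    theta: person location; a: item slope (discrimination);
    thresholds: strictly increasing b_1 < ... < b_(m-1) for an m-category item.
    Returns a list of m probabilities for categories 0..m-1.
    """
    # Cumulative boundary curves P*(X >= k), padded with P*(X >= 0) = 1 and P*(X >= m) = 0
    boundaries = [1.0]
    boundaries += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    boundaries.append(0.0)
    # Probability of exactly category k is the gap between adjacent boundary curves
    return [boundaries[k] - boundaries[k + 1] for k in range(len(thresholds) + 1)]


# Illustrative 4-category item evaluated at theta = 0
probs = grm_category_probs(0.0, 2.0, [-1.0, 0.0, 1.0])
```

Because the thresholds are symmetric around theta = 0 here, the two middle categories come out equally likely, as do the two extreme categories.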

Table 1
Number of Items in each Item Bank and Short Form (Gershon et al., 2010a)

Members of the PROMIS Statistical Coordinating Center created a web-based platform that enables researchers to build data collection websites for administering PROMIS CATs and other instruments to research participants or clinical samples. A highly structured development methodology was used, beginning with Joint Application Design sessions in which representatives from numerous institutions supplied a general wish list of features. Use cases were then written to ensure that end-user expectations matched programmer specifications. The software development methodology included daily programmer “scrum” sessions, weekly Usability Acceptability Testing, and continuous Quality Assurance activities before and after release. A complete description of the technology development process used to create Assessment Center can be found in Gershon et al. (2010a).

Assessment Center is a dynamic management tool that enables researchers to centralize assessment development and administration activities. With features that support instrument development, study administration, data management, and storage of statistical analysis results, Assessment Center houses a library of instruments and items with an emphasis on health-related quality of life. All PROMIS instruments are available electronically for inclusion in computer- and web-based data collection. PDFs of all existing PROMIS instruments are also available for paper-and-pencil administration.

All work within Assessment Center is organized within a study. This enables users to keep research activities separate from other studies and users, and to move from instrument selection to data collection and accrual monitoring easily. Assessment Center users must be part of the study team with applicable permissions to have access to any study information or participant data. Each role has very specific rights. For example, some roles permit a team member to modify instruments, while others permit a team member to manage participant data.
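Study-scoped, role-based access of this kind can be sketched as a lookup from (study, user) to a role and from a role to a set of rights. The role names and rights below are hypothetical and do not reproduce Assessment Center's actual role definitions.

```python
from enum import Flag, auto


class Right(Flag):
    VIEW_STUDY = auto()
    EDIT_INSTRUMENTS = auto()
    MANAGE_PARTICIPANTS = auto()


# Hypothetical roles; the real application defines its own roles and rights
ROLE_RIGHTS = {
    "instrument_editor": Right.VIEW_STUDY | Right.EDIT_INSTRUMENTS,
    "data_manager": Right.VIEW_STUDY | Right.MANAGE_PARTICIPANTS,
}


def can(study_team, study_id, user, right):
    """True only if `user` is on the team for `study_id` with a role granting `right`."""
    role = study_team.get((study_id, user))
    if role is None:
        return False  # not on this study's team: no access to its information or data
    return right in ROLE_RIGHTS[role]


team = {("S1", "alice"): "instrument_editor", ("S1", "bob"): "data_manager"}
```

Keying membership by study as well as by user keeps one study's data invisible to members of other studies, which is the separation the paragraph describes.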

Users can search the instrument library for instruments and add them to a study, or they can create their own instruments. The instrument library displays the user’s custom instruments, team members’ instruments, and “public” instruments. The list of “public” instruments is currently limited to PROMIS instruments. In the future, additional “public” instruments will be included, such as the National Institute of Neurological Disorders and Stroke (NINDS) Quality of Life Outcomes in Neurological Disorders (Neuro-QOL) battery (Cella, Victorson, Nowinski, Peterman, and Miller, 2006; Neuro-QOL: Quality of Life Outcomes in Neurological Disorders) and the component instruments that make up the NIH Toolbox for Assessment of Neurological and Behavioral Function (Gershon et al., 2010b).

The instrument library also maintains default CAT parameters for all existing instruments (see Figure 2). Assessment Center currently provides direct support for administering CATs using the graded response model (Samejima, van der Linden, and Hambleton, 1996). With some modification of the IRT parameters, indirect support is also provided for the generalized partial credit model (Muraki, 1992), the partial credit model (Masters, 1982), and the rating scale model (Andrich, 1978). In future releases, we expect to add direct support for all of these models, as well as for one- and two-parameter dichotomous models. Other CAT parameters include the item selection method, specific rules for initial item selection, and information about the prior person distribution. An instrument can be set to continue administering items until a specified standard error cutoff is reached, subject to minimum and maximum test lengths.
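The stopping rule described above (stop once the standard error reaches the cutoff, but never before the minimum or after the maximum test length) fits in a small function. The text does not specify Assessment Center's item selection method; maximum information at the current ability estimate, shown here, is one common choice and is an assumption of this sketch, as are the illustrative cutoff and length values.

```python
def should_stop(n_administered, standard_error, *, se_cutoff, min_items, max_items,
                items_remaining):
    """Apply a standard-error stopping rule bounded by minimum and maximum test length."""
    if n_administered >= max_items or items_remaining == 0:
        return True  # hard limits: maximum length reached or bank exhausted
    if n_administered < min_items:
        return False  # always administer at least the minimum number of items
    return standard_error <= se_cutoff  # precise enough: stop


def select_next_item(info_at_theta, administered):
    """Assumed rule: pick the unused item with the most information at the current theta."""
    candidates = {i: v for i, v in info_at_theta.items() if i not in administered}
    return max(candidates, key=candidates.get) if candidates else None


# Illustrative settings, not Assessment Center defaults
rule = dict(se_cutoff=0.3, min_items=4, max_items=12)
```

Checking the hard limits before the precision criterion ensures the maximum length always wins, while the minimum length prevents a lucky early estimate from ending the test prematurely.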

Figure 2
CAT Parameters

Users can also create new instruments. After assigning instrument properties (e.g., name, domain), the user can either import items from any instrument in the library to which he or she has access or create new items of several types. Statistics for existing items are available for review (see Figure 3). The item statistics pages, which contain data analyzed in previous studies, allow a researcher to review how an item performed in previous testing and analysis.

Figure 3
Item Statistics

A series of statistics can be stored for any sample population on the item statistics page. Figure 3 shows statistics for the PROMIS sub-sample of persons younger than 64 years. Below this area, differential item functioning information can be displayed for any specified subgroups. At the bottom of the screen is information for each calibration sample associated with the item. Space is available to display both the category response function and the item information function graphs. For rating scale items, IRT parameters such as slope, guessing, and category thresholds are displayed. In the future, the fields displayed will depend on the IRT model selected. An additional area may be used to store various model fit and scalability indices.
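The item information function can be computed directly from graded response model parameters as the sum, over categories, of the squared derivative of each category probability divided by that probability. The sketch assumes the logistic form without a 1.7 scaling constant; with a single threshold it reduces to the familiar two-parameter logistic information a^2 * P * (1 - P).

```python
import math


def grm_item_information(theta, a, thresholds):
    """Fisher information of one graded-response item at theta."""
    # Boundary curves P*(X >= k), padded with the fixed end points 1 and 0
    star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    # Derivative of each logistic boundary is a * P* * (1 - P*); the end points give 0
    dstar = [a * p * (1.0 - p) for p in star]
    info = 0.0
    for k in range(len(star) - 1):
        p = star[k] - star[k + 1]      # category probability
        dp = dstar[k] - dstar[k + 1]   # its derivative with respect to theta
        if p > 0:
            info += dp * dp / p
    return info
```

For example, a single-threshold item with slope 2 evaluated at its threshold yields information 2^2 * 0.5 * 0.5 = 1.0.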

Assessment Center users can preview instruments either as a list or in a one-item-per-screen format, as would be the case when delivering a CAT. For CAT instruments and their associated item calibrations, a view option displays not only the item text but also the underlying CAT parameters and the interim statistics calculated as each item is administered. Figure 4 shows the third item of a CAT. On the left are the item IDs of the first two items and the number of the response selected for each. The next column shows the test taker’s current person ability estimate and its standard error of measurement at that point in the assessment. The bottom of the screen displays a summary of the CAT parameters in effect for the assessment.
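The interim ability estimate and standard error shown in the preview can be illustrated with expected a posteriori (EAP) estimation on a grid, where the standard error is taken as the posterior standard deviation. The text does not state which estimator or prior Assessment Center uses; the standard normal prior, the grid from -4 to 4, and the item parameters below are assumptions of this sketch.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Graded response model category probabilities (logistic, no 1.7 constant)."""
    star = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [star[k] - star[k + 1] for k in range(len(star) - 1)]


def eap_estimate(responses, items, n_points=81, lo=-4.0, hi=4.0):
    """EAP theta and posterior SD under an assumed standard normal prior.

    responses: (item_id, category) pairs answered so far.
    items: dict mapping item_id -> (slope, thresholds).
    """
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for item_id, category in responses:
            a, thresholds = items[item_id]
            w *= grm_category_probs(theta, a, thresholds)[category]  # likelihood
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)


# Hypothetical 3-category item; the respondent chose the highest category
items = {"i1": (2.0, [-1.0, 1.0])}
theta_hat, se = eap_estimate([("i1", 2)], items)
```

Re-running the function after each response reproduces the kind of running estimate and standard error that the preview displays, and the standard error can then feed a stopping rule.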

Figure 4
CAT Preview

Seven item types can be used: multiple choice, drop list, check box, text, numeric, date, and comments. To retain and report the steps of item and instrument development, and to comply with FDA guidance on PRO development regarding the need to track item development history (U. S. Department of Health and Human Services Food and Drug Administration, Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research, and Center for Devices and Radiological Health, 2009), Assessment Center includes an item history feature (see Figure 5). Whenever an existing item is changed in any way (e.g., addition of a response option or a grammatical change to the stem), an item history record is created automatically.
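An automatically maintained item history can be modeled by keeping each item's current version immutable and appending the previous version to a log whenever a revision actually changes something. The class and field names are hypothetical; this is a data-structure sketch, not Assessment Center's schema.

```python
from dataclasses import dataclass, field, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class ItemVersion:
    stem: str
    response_options: tuple


@dataclass
class Item:
    item_id: str
    current: ItemVersion
    history: list = field(default_factory=list)

    def revise(self, note, **changes):
        """Apply changes; if anything actually differs, log the prior version first."""
        new_version = replace(self.current, **changes)
        if new_version == self.current:
            return False  # no real change, so no history record
        self.history.append({
            "changed_at": datetime.now(timezone.utc),
            "previous": self.current,
            "note": note,
        })
        self.current = new_version
        return True


# Hypothetical item for illustration
item = Item("ITEM001", ItemVersion("I felt worthless", ("Never", "Rarely", "Sometimes", "Often")))
item.revise("grammar change to stem", stem="I felt worthless.")
item.revise("add response option",
            response_options=("Never", "Rarely", "Sometimes", "Often", "Always"))
```

Because versions are frozen, a history entry cannot be altered after the fact, which suits an audit trail.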

Figure 5
Item History

Following instrument selection/creation, the overall study protocol can be organized. Assessment Center allows for fixed ordering of instruments as well as more complex presentation schemes. An Advanced Study Set-Up utility enables users to create studies that have multiple arms (e.g., intervention and control), multiple assessments for longitudinal study designs, and customized instrument presentation per assessment. There are numerous options for controlling the order in which items are administered; in addition to CAT, researchers have the flexibility to present items in fixed, branched, or random orders.
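Fixed, random, and branched presentation orders can be sketched as follows. Branching is modeled here as hypothetical rules that map an (item, response) pair to the next item, with a default sequence used when no rule applies; CAT ordering is omitted because it depends on the IRT machinery. The item IDs and rules are invented.

```python
import random


def order_items(items, mode, rng=None):
    """Return items in fixed (as authored) or random order."""
    if mode == "fixed":
        return list(items)
    if mode == "random":
        rng = rng or random.Random()
        shuffled = list(items)
        rng.shuffle(shuffled)
        return shuffled
    raise ValueError(f"unsupported ordering mode: {mode}")


def branched_path(start, answer_for, branch_rules, default_next):
    """Walk a branched instrument; assumes the rules contain no cycles."""
    path, item = [], start
    while item is not None:
        response = answer_for(item)
        path.append((item, response))
        # A matching branch rule overrides the default next item
        item = branch_rules.get((item, response), default_next.get(item))
    return path


# Hypothetical skip logic: answering "no" to Q1 skips Q2
rules = {("Q1", "no"): "Q3"}
default_next = {"Q1": "Q2", "Q2": "Q3", "Q3": None}
```

For instance, `branched_path("Q1", {"Q1": "no", "Q3": "ok"}.get, rules, default_next)` visits Q1 and then jumps straight to Q3.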

Researchers are able to create a study-specific data collection platform located at a unique web address. The content of this website’s home page is customizable by the user and can include study-specific graphics. Researchers have the option of including up to three consent documents (e.g., consent, assent, and HIPAA authorization).

Participant registration in Assessment Center is used to collect participant demographic and contact information. This information is stored in a separate database from consent and assessment data. It is also used to populate accrual reports that meet the criteria for NIH progress reports. Registration questions are presented after consent forms and before instruments and can be completed by participants or study staff. Researchers can select from numerous standard demographic field types or enter their own unique fields. A summary table within Assessment Center provides current accrual statistics and can be used to calculate response rates.

Studies that include the PROMIS v1.0 CATs or a PROMIS Profile instrument can generate single-time-point reports for each participant. For each domain, the participant’s standard score is provided along with information on how that score compares with other people in the general population, others in his/her age group, and others of his/her gender. The second portion of the report includes a graph showing scores with standard error bars (see Figures 6 and 7).
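PROMIS standard scores are reported on the T-score metric, which rescales the IRT ability estimate so that the reference population has a mean of 50 and a standard deviation of 10. The sketch below converts an estimate to a T-score and expresses it as a percentile of a reference group, assuming normally distributed scores in that group; the age-group mean and SD in the usage lines are hypothetical, not published norms.

```python
import math


def t_score(theta):
    """Rescale theta (reference population mean 0, SD 1) to T (mean 50, SD 10)."""
    return 50.0 + 10.0 * theta


def percentile_vs_group(t, group_mean=50.0, group_sd=10.0):
    """Percent of a normally distributed reference group scoring below t."""
    z = (t - group_mean) / group_sd
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


t = t_score(1.0)
general = percentile_vs_group(t)
# Hypothetical age-group norms, for illustration only
age_group = percentile_vs_group(t, group_mean=53.0, group_sd=9.0)
```

A T-score of 60 sits one standard deviation above the general population mean, roughly the 84th percentile under the normality assumption.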

Figure 6
Questionnaire Summary Report Page 1
Figure 7
Questionnaire Summary Report Page 2

Study data can be exported from Assessment Center as five different CSV files, which can be opened in programs such as Excel, SAS, or SPSS. The first export file, Export Assessment Data, contains instrument and item data; no registration data are included. The second, Export Assessment Scores, contains participants’ T-scores, based on general population norms, for PROMIS CAT instruments. The third, Export Registration Data, contains the data entered into the registration fields. The fourth, Export Consent Data, includes information from the consent forms. Finally, the fifth, Export Pivoted Assessment Data, presents assessment data in the format typically used by statistical analysts. A Data Dictionary report describes the item IDs and response scores included in all of the data exports.
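Reshaping long-format assessment data (one row per participant per item) into a wide layout (one row per participant, one column per item) is the kind of transformation a pivoted export performs. The column names `participant_id`, `item_id`, and `response` are assumptions; the actual export layout is described by the Data Dictionary report.

```python
import csv
import io


def pivot_assessment_rows(rows):
    """Pivot long rows (one per participant-item response) to one CSV row per participant.

    rows: iterable of dicts with hypothetical keys 'participant_id', 'item_id', 'response'.
    """
    wide, item_ids = {}, []
    for row in rows:
        item = row["item_id"]
        if item not in item_ids:
            item_ids.append(item)  # keep item columns in first-seen order
        wide.setdefault(row["participant_id"], {})[item] = row["response"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["participant_id"] + item_ids,
                            restval="", lineterminator="\n")  # blank cell = not answered
    writer.writeheader()
    for pid, answers in wide.items():
        writer.writerow({"participant_id": pid, **answers})
    return out.getvalue()


long_csv = "participant_id,item_id,response\nP1,Q1,3\nP1,Q2,4\nP2,Q1,1\n"
wide_csv = pivot_assessment_rows(csv.DictReader(io.StringIO(long_csv)))
```

Here participant P2 never answered Q2, so that cell is left blank in the wide output rather than dropped.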

Three additional software products are available to complement Assessment Center (links to downloads can be found at http://www.nihpromis.org): Assessment Center Offline, Firestar, and PROMIScore. Assessment Center Offline can be used to administer surveys and assessments without an Internet connection; data are stored on the local computer used for test administration. A download wizard is available within the Assessment Center application. Firestar (Choi, 2009) is designed for simulating CAT under various IRT models. The program provides a feature-rich graphical simulation environment for evaluating the use of item banks for CAT. It has been used in a number of post-hoc simulation studies and has been evaluated extensively using both empirical and simulated data (Choi, Reise, Pilkonis, Hays, and Cella, 2009; Choi and Swartz, 2009). PROMIScore can be used to score data from PROMIS short forms, with or without missing data, and includes capabilities for administering the instruments to individual patients and generating their score profiles.

The Future

As of 2010, PROMIS is in its sixth year of NIH support. Presently, there are 11 adult and 9 pediatric item banks available for use in Assessment Center. Clinical validation studies, providing important longitudinal information on responsiveness to change, are underway. In the next 3 years of PROMIS activity, several new item banks will be developed and validated to measure adult and pediatric domains in physical, mental and social health. In addition, validation studies of existing banks will deepen our understanding of the performance of these new standards for clinical research. It is our hope that the original vision of the NIH in supporting PROMIS for 9 years—to standardize the measurement of symptoms and function using precise, valid item banks and their applications—will be fully realized and will launch a combined scientific effort to expand its applications.

Figure 1
Instrument Characteristics

References

  • Andrich D. A rating formulation for ordered response categories. Psychometrika. 1978;43:561–573.
  • Cella D, Gershon R, Lai JS, Choi S. The future of outcomes measurement: item banking, tailored short-forms, and computerized adaptive assessment. Quality of Life Research. 2007;16(Suppl. 1):133–141.
  • Cella D, Riley W, Stone AA, Rothrock N, Reeve BB, Yount S, et al. Initial item banks and first wave testing of the Patient Reported Outcomes Measurement Information System (PROMIS) network: 2005–2008. Journal of Clinical Epidemiology. in press.
  • Cella D, Victorson D, Nowinski C, Peterman A, Miller DM. The Neuro-QOL project: Using multiple methods to develop a HRQOL measurement platform to be used in clinical research across neurological conditions. Quality of Life Research. 2006;A-14:1353.
  • Choi S. Firestar: Computerized adaptive testing simulation program for polytomous item response theory models. Applied Psychological Measurement. 2009;33(8):644–645.
  • Choi SW, Reise SP, Pilkonis PA, Hays RD, Cella D. Efficiency of static and computer adaptive short forms compared to full length measures of depressive symptoms. Quality of Life Research. 2009. Epub ahead of print.
  • Choi SW, Swartz RJ. Comparison of CAT item selection criteria for polytomous items. Applied Psychological Measurement. 2009;33(6):419–440.
  • Davis KM, Chang CH, Lai JS, Cella D. Feasibility and acceptability of computerized adaptive testing (CAT) for fatigue monitoring in clinical practice. Quality of Life Research. 2002;11(7):134.
  • DeWalt DA, Rothrock N, Yount S, Stone AA; PROMIS Cooperative Group. Evaluation of item candidates: The PROMIS qualitative item review. Medical Care. 2007;45(5 Suppl 1):S12–S21.
  • Gershon R, Rothrock NE, Hanrahan RT, Jansky LJ, Harniss M, Riley W. The development of a clinical outcomes survey research application: Assessment Center℠. Quality of Life Research. 2010a;19(5):677–685.
  • Gershon RC, Cella D, Fox NA, Havlik RJ, Hendrie HC, Wagster MV. Assessment of neurological and behavioural function: The NIH toolbox. Lancet Neurology. 2010b;9(2):138–139.
  • Irwin DE, Stucky B, Langer MM, Thissen D, Dewitt EM, Lai JS, et al. An item response analysis of the pediatric PROMIS anxiety and depressive symptoms scales. Quality of Life Research. 2010a;19(4):595–607.
  • Irwin DE, Stucky BD, Thissen D, Dewitt EM, Lai JS, Yeatts K, et al. Sampling plan and patient characteristics of the PROMIS pediatrics large-scale survey. Quality of Life Research. 2010b. Epub ahead of print.
  • Klem M, Saghafi E, Abromitis R, Stover A, Dew M, Pilkonis P. Building PROMIS item banks: Librarians as coinvestigators. Quality of Life Research. 2009;18(7):881–888.
  • Lai JS, Cella D, Chang CH, Bode RK, Heinemann AW. Item banking to improve, shorten and computerize self-reported fatigue: An illustration of steps to create a core item bank from the FACIT-Fatigue Scale. Quality of Life Research. 2003;12(5):485–501.
  • Liu H, Cella D, Gershon R, Shen J, Morales LS, Riley W, et al. Representativeness of the PROMIS Internet Panel. Journal of Clinical Epidemiology. in press.
  • Masters GN. A Rasch model for partial credit scoring. Psychometrika. 1982;47:149–174.
  • Muraki E. A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement. 1992;16(2):159–176.
  • Neuro-QOL. Quality of Life Outcomes in Neurological Disorders. Retrieved May 1, 2010, from http://www.neuroqol.org.
  • Reeve BB, Hays RD, Bjorner JB, Cook KF, Crane PK, Teresi JA, et al. Psychometric evaluation and calibration of health-related quality of life item banks: Plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Medical Care. 2007;45(5 Suppl 1):S22–S31.
  • Rivers D. Sample matching: Representative sampling from Internet panels. Palo Alto, CA: Polimetrix; 2006.
  • Samejima F, van der Linden WJ, Hambleton R. The graded response model: Handbook of modern item response theory. New York: Springer; 1996.
  • U. S. Department of Health and Human Services Food and Drug Administration, Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research, and Center for Devices and Radiological Health. Guidance for industry patient-reported outcome measures: use in medical product development to support labeling claims. 2009. Available from http://purl.access.gpo.gov/GPO/LPS113413.
  • Varni JW, Stucky B, Thissen D, Dewitt EM, Lai JS, Dewalt DA, et al. PROMIS pediatric pain interference scale: An item response theory analysis of the pediatric pain item bank. Journal of Pain. in press.
  • Ware JE Jr, Kosinski M, Bjorner JB, Bayliss MS, Batenhorst A, Dahlof CG, et al. Applications of computerized adaptive testing (CAT) to the assessment of headache impact. Quality of Life Research. 2003;12(8):935–952.
  • Yeatts KB, Stucky B, Thissen D, Irwin D, Varni JW, DeWitt EM, et al. Construction of the Pediatric Asthma Impact Scale (PAIS) for the Patient-Reported Outcomes Measurement Information System (PROMIS). Journal of Asthma. 2010;47(3):295–302.