The NIH Patient-Reported Outcomes Measurement Information System (PROMIS) Roadmap initiative is a cooperative group program of research designed to develop, evaluate, and standardize item banks to measure patient-reported outcomes relevant across medical conditions. For adults, 11 domains have been developed in physical, mental, and social health.
The objective of the current study was to assess the feasibility and construct validity of PROMIS item banks versus legacy measures in an observational study in systemic sclerosis (SSc).
Patients with SSc in a single academic center completed PROMIS item banks administered by computerized adaptive testing (CAT) during the clinic visit and legacy instruments (using paper-and-pencil). The construct validity of PROMIS items was evaluated by examining correlations with corresponding legacy measures using multitrait-multimethod analysis.
Participants consisted of 143 SSc patients with an average age of 51.5 years; 71% were female and 68% were Caucasian. The average number of items completed for each CAT-administered item bank ranged from 5 to 8 (69 CAT items per patient), and the average time to complete each CAT-administered item bank ranged from 48 seconds to 1.9 minutes per patient (average time = 11.9 minutes per patient for 11 banks). All correlations between PROMIS domains and respective legacy measures were large and in the hypothesized direction (ranging from .61 to .82).
Our study supports the construct validity of the CAT-administered PROMIS item banks and shows that they can be administered successfully in a clinic with support staff. Future studies should assess the feasibility of PROMIS item banks in a busy clinical practice.
The National Institutes of Health (NIH) Patient-Reported Outcomes Measurement Information System (PROMIS) Roadmap initiative (www.nihpromis.org) is a cooperative research program designed to develop, evaluate, and standardize item banks to measure patient-reported outcomes (PROs) across different medical conditions as well as the US population (1). The goal of PROMIS is to develop reliable and valid item banks using item response theory (IRT) that can be administered as computerized adaptive tests (CAT)(1–3). CAT selects the most informative questions from an item bank on the basis of a person’s previous responses; this process determines an individualized score using a minimum number of questions while preserving precision. Eleven adult domains have been developed to date as short forms and CATs in physical, mental, and social health(1). We tested the 11 PROMIS domains in patients with systemic sclerosis (scleroderma; SSc) in a single center observational study.
Scleroderma, meaning thickened skin, is a rare disease that affects 300 to 700 people per million population(4). Scleroderma manifests itself in several forms, including localized disease, overlap syndromes, scleroderma-like diseases, and systemic sclerosis (SSc)(5). SSc is an autoimmune disease that includes thickening of the skin and internal organ involvement (heart, lung, gastrointestinal, and kidney involvement). People with SSc have both skin hardening and internal involvement; depending on the extent of skin involvement, SSc is divided into limited SSc and diffuse SSc(5). Patients with limited SSc generally have a more favorable outcome, with a 5-year survival as high as 86%(6). Diffuse SSc is characterized by rapid skin thickening and potentially severe pulmonary, cardiac, renal, and gastrointestinal involvement occurring in the first 3–5 years of disease and may be associated with poor survival(5). SSc is a chronic rheumatic disease with no effective treatment or cure, in which patients cope with pain, disfigurement, disability, and feelings of helplessness, each of which can impair health-related quality of life (HRQoL)(7–9). In this study we sought to assess the feasibility of administering PROMIS item banks in an academic clinical setting and the construct validity of PROMIS domains versus legacy measures in an observational study of patients with SSc. We hypothesized that the PROMIS item banks could be administered in a clinical setting with adequate staff support without disrupting the flow of the clinic, and that the item banks would be highly correlated with corresponding legacy instruments.
We recruited SSc patients receiving care in the UCLA Scleroderma Program to serve as participants in the UCLA Scleroderma Quality of Life Study. The original objective of the study was to assess minimally important differences for OMERACT-endorsed outcomes measures in SSc. We added PROMIS item banks to this study. Adult patients (≥18 years) with a diagnosis of SSc were included in the study(10). Patients with SSc were further divided into limited SSc, diffuse SSc, and overlap syndrome. Limited SSc is defined as skin thickening distal, but not proximal, to the knees and elbows, with or without facial involvement; diffuse SSc is defined as skin thickening both distal and proximal to the knees and elbows, with or without facial involvement; and overlap syndrome is defined as SSc together with another rheumatic disease (such as inflammatory myositis or rheumatoid arthritis).
This study is a single-center observational study in which patients with SSc are invited to participate during their clinic visits. The UCLA Scleroderma clinic is a weekly rheumatology clinic where patients with SSc are seen by 3 scleroderma experts (D.K., D.E.F., and P.J.C.). Each clinician has 2 dedicated examination rooms and is assigned 10–12 patients. The current analysis reports the baseline data. SSc patients with new (60-minute time slot) or follow-up (usually 30–45 minute time slot) consultations are approached at the time of their scheduled clinic visit by the front desk staff or the nurse checking in the patient and are invited to participate in the study. If a patient is interested, s/he is handed a UCLA Institutional Review Board-approved written consent and HIPAA forms. The physician completes the clinical visit and then discusses the study in detail. Because one of the objectives of the study is to assess the feasibility of administering PROMIS item banks in a clinical setting, we invite all patients irrespective of their disabilities. If the patient is interested, s/he signs the consent and HIPAA forms and the study coordinator directs the patient to the PROMIS Assessment Center (www.assessmentcenter.net) to complete the 11 item banks (discussed below) (11). Assessment Center is an online research management tool supported by PROMIS that administers the item banks as CATs. The patient completes the PROMIS domains in the examination room while the physician uses the other assigned examination room to examine the next patient. The item banks are completed using a dedicated desktop computer in each examining room, and the patient has complete privacy. The patient is then asked to complete the legacy instruments in the room. The majority of the legacy measures were completed using paper-and-pencil during the clinic visit; in rare instances they were completed at home and returned within 1 week using pre-stamped envelopes.
PROMIS version 1.0 item banks including anger, anxiety, depression, fatigue, pain behavior, pain impact, physical function, sleep disturbance, wake disturbance, satisfaction with participation in social roles and satisfaction with participation in discretionary social activities were administered as CATs (available at www.nihpromis.org). With the exception of physical function which does not include a time frame and the social health banks that reference “lately,” all item banks reference the past 7 days.
All banks other than pain behavior use five response options that most commonly reflect intensity (e.g., not at all, a little bit, somewhat, quite a bit, very much) or frequency (e.g., never, rarely, sometimes, often, always). Pain Behavior includes a “Had no pain” response option as well. All PROMIS instruments are scored using a T score metric so that the mean in the U.S. general population is 50 with a standard deviation of 10. Higher scores reflect more of what is being measured. Therefore, high scores for physical and social function are desirable, whereas high symptom scores are undesirable. CATs were set to administer enough items to achieve a standard error (SE) < 0.30 (corresponding to reliability > 0.90) after a minimum of 5 items were administered per bank. Each CAT stopped after 20 items were administered even if the SE criterion was not met. Additional information about the banks is available at www.nihpromis.org. Legacy instruments were also included in this study. Legacy instruments are the most widely used survey instruments to assess a particular patient-reported outcome and were considered the state of the science prior to PROMIS. Legacy instruments included the SF-36® version 2 (12), Health Assessment Questionnaire-Disability Index (HAQ-DI) (13), 10-item Center for Epidemiologic Studies Depression Scale (CES-D)(14), Functional Assessment of Chronic Illness Therapy (FACIT)-Fatigue(15), and Medical Outcomes Study (MOS) Sleep scale(16). These legacy instruments were chosen because they have been endorsed by the Outcomes Measures in Rheumatology (OMERACT)(17, 18) and/or recently evaluated in SSc(19, 20).
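The stopping rule described above can be sketched in a few lines. This is only an illustration, not the Assessment Center implementation: the function names are hypothetical, and the relation reliability = 1 − SE² is an assumption that holds when scores are expressed on a standardized metric with unit variance.

```python
def reliability_from_se(se: float) -> float:
    """Approximate reliability for a score on a unit-variance metric.

    Assumption (not stated in the study): reliability = 1 - SE^2.
    """
    return 1.0 - se ** 2


def cat_should_stop(n_items: int, se: float,
                    min_items: int = 5, max_items: int = 20,
                    se_target: float = 0.30) -> bool:
    """Return True when a CAT following the described rule would stop."""
    # Hard cap: stop at the maximum number of items regardless of precision.
    if n_items >= max_items:
        return True
    # Otherwise require both the minimum item count and the SE criterion.
    return n_items >= min_items and se < se_target


print(round(reliability_from_se(0.30), 2))  # 0.91, i.e. reliability > 0.90
print(cat_should_stop(4, 0.25))   # False: fewer than 5 items administered
print(cat_should_stop(6, 0.28))   # True: minimum met and SE below 0.30
print(cat_should_stop(20, 0.40))  # True: 20-item cap reached
```

Under this reading, a bank that reaches the SE criterion quickly stops at 5 items, which is consistent with the lower end of the 5 to 8 item range reported in the Results.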
The SF-36 version 2 is a generic health status measure consisting of 36 items assessing 8 scales.(12, 21) The 8 scales are summarized into Physical Component Summary (PCS) and Mental Component Summary (MCS) scores. The scales and summary scores are normalized to the U.S. general population, for whom the mean score is 50 and the standard deviation is 10. We used the 4-week recall period version of the SF-36 v.2.
The Health Assessment Questionnaire-Disability Index (HAQ-DI) is an arthritis-targeted measure intended for assessing functional ability in arthritis(13). It is a self-administered 20-question instrument that assesses a patient’s level of functional ability and includes questions that involve both upper and lower extremities. The HAQ-DI score ranges from 0 (no disability) to 3 (severe disability). It has a 7-day recall period.
Depressive symptoms were measured with the 10-item Center for Epidemiologic Studies Depression (CESD-10) Scale(14). The CESD-10 uses a 4-point categorical response scale (total score range 0 to 30), with higher scores representing greater depressive symptoms. A score ≥10 on the CESD-10 represents depressive symptoms. It has a 7-day recall period.
The Functional Assessment of Chronic Illness Therapy-Fatigue (FACIT-Fatigue) is a 13-item questionnaire that assesses self-reported fatigue and its impact upon daily activities and function over the past 7 days. The range of possible scores is 0–52, with lower scores reflecting more fatigue.
The Medical Outcomes Study (MOS) Sleep scale(16) yields a sleep problems index and 6 scale scores. Answers were based on a retrospective assessment over the past 4 weeks. Quantity of sleep is scored as the average hours slept per night. The other scales and 9-item sleep problem index are scored on a 0–100 possible range, and higher scores indicate more of the concept being measured.
Mean scores, standard deviations (SD), ranges, and percentages of respondents scoring the minimum (floor) and maximum (ceiling) possible scores were calculated to evaluate scale score distributions for PROMIS and legacy instruments. For easy interpretability, floor effect is presented as “worst” possible score and ceiling as “best” possible irrespective of the direction of the scale.
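As an illustration of the floor and ceiling calculation, the following minimal sketch computes the percentage of respondents at the worst and best possible scores. The function name and the example scores are hypothetical and are not study data; the HAQ-DI direction (0 = best, 3 = worst) follows the instrument description above.

```python
def floor_ceiling_pct(scores, worst, best):
    """Percent of respondents at the worst (floor) and best (ceiling) score."""
    n = len(scores)
    floor_pct = 100.0 * sum(s == worst for s in scores) / n
    ceiling_pct = 100.0 * sum(s == best for s in scores) / n
    return floor_pct, ceiling_pct


# Hypothetical HAQ-DI scores: 0 = no disability (best), 3 = severe (worst).
example_scores = [0, 0.5, 1.25, 3, 0, 2.0, 1.0, 0.75, 1.5, 2.25]
print(floor_ceiling_pct(example_scores, worst=3, best=0))  # (10.0, 20.0)
```

Passing the worst and best values explicitly keeps the floor tied to the worst score and the ceiling to the best score regardless of the direction of the scale.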
Internal consistency reliability for legacy items was estimated using Cronbach’s alpha(22). An alpha ≥ 0.70 is considered satisfactory for group comparisons(23). We also assessed if there were differences in socio-demographics, type of SSc or disease duration in patients falling in the 1st versus 4th quartiles for total CAT-items completed.
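For readers unfamiliar with the statistic, Cronbach’s alpha for a scale of k items can be computed as k/(k − 1) × (1 − sum of item variances / variance of the total score). The sketch below uses only the Python standard library; the function name and the example responses are hypothetical and do not reproduce the study analysis.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; each row holds one respondent's item scores."""
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of the summed scale score.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses from 5 respondents to a 3-item scale.
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 4],
]
alpha = cronbach_alpha(responses)
print(alpha >= 0.70)  # True for this example
```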
The construct validity of PROMIS measures was evaluated by examining correlations with corresponding legacy measures using a computer program for analyzing a multitrait-multimethod (MTMM) matrix. Construct validity is supported in MTMM analyses when the highest correlations are found for different methods of assessing the same domain (validity diagonals) and weaker correlations are found among measures of different domains (21). The 6 PROMIS domains selected for analysis were depression, fatigue, pain behavior, physical function, sleep disturbance, and satisfaction with participation in discretionary social activities, because only these 6 PROMIS domains had corresponding legacy scales administered in the study. The corresponding legacy scales were the CESD-10, FACIT-Fatigue, SF-36 bodily pain, SF-36 physical functioning, MOS 9-item sleep problem index, and SF-36 social functioning, respectively. Analyses were also repeated substituting HAQ-DI for SF-36 physical functioning and SF-36 vitality for FACIT-Fatigue. We hypothesized correlation coefficients for validity diagonals of ≥0.50 (a large effect size) and that these would be significantly larger than off-diagonal correlations.
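The descriptive part of the MTMM criterion can be sketched as follows. This is not the computer program used in the study: the function name and the correlation values are hypothetical, and the sketch omits the significance tests comparing diagonal and off-diagonal correlations. Rows are assumed to be PROMIS domains and columns the matching legacy scales in the same order, with absolute values used because some scales are scored in opposite directions.

```python
def check_validity_diagonals(corr, threshold=0.50):
    """For each domain, test whether the validity diagonal is at least
    `threshold` and larger than every other entry in its row and column."""
    n = len(corr)
    results = []
    for i in range(n):
        diag = abs(corr[i][i])
        # Off-diagonal correlations sharing the same row or column.
        off = [abs(corr[i][j]) for j in range(n) if j != i]
        off += [abs(corr[j][i]) for j in range(n) if j != i]
        results.append(diag >= threshold and all(diag > o for o in off))
    return results


# Hypothetical 3 x 3 matrix (not the values in Table 5).
example = [
    [0.80, 0.40, 0.30],
    [0.50, 0.70, 0.20],
    [0.30, 0.75, 0.60],
]
print(check_validity_diagonals(example))  # [True, False, False]
```

In the example, the second and third domains fail because an off-diagonal correlation of 0.75 exceeds their validity diagonals of 0.70 and 0.60.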
We recruited 143 patients with SSc. The average (SD) age was 51.5 (14.7) years; 117 (71%) were female and 94 (68%) were Caucasian. The mean (SD) disease duration was 7.5 (8.2) years. Seventy-six (55%) had limited SSc, 55 (39%) had diffuse SSc, and 9 (6%) had overlap syndrome. On average, patients had moderate functional disability as defined by HAQ-DI of ≥1.0 (24). Forty-five (32%) had depressed mood (CESD > 10) and SF-36 PCS (mean score=38.2) and MCS (mean score= 47.8) scores were 1.2 and 0.2 SD below the U.S. population means, respectively (Table 1). The average fatigue level (FACIT-F) was 31.6, a standard deviation below the general population and very close to the cut score of 30 indicating clinically significant fatigue. Ceiling effects (proportion of patients who reported no impairment) were seen in 6% of CESD scores, 10% of HAQ-DI scores, and 2% of FACIT-F scores.
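The distance of the SF-36 summary scores from the population norm follows directly from the T score metric (mean 50, SD 10) described in the Methods. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def sd_below_norm(mean_score: float,
                  norm_mean: float = 50.0,
                  norm_sd: float = 10.0) -> float:
    """Number of population SDs by which a mean score falls below the norm."""
    return (norm_mean - mean_score) / norm_sd


pcs_gap = sd_below_norm(38.2)  # SF-36 PCS mean reported above
mcs_gap = sd_below_norm(47.8)  # SF-36 MCS mean reported above
print(round(pcs_gap, 1), round(mcs_gap, 1))  # 1.2 0.2
```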
Reliability, as assessed by Cronbach’s alpha, for all legacy domains was >0.70 and ranged from 0.73 (SF-36 general health) to 0.96 (SF-36 role limitations-physical).
Scores on PROMIS physical functioning domains in this sample were about 1.0 SD worse than scores from a sample representing the US general population(25) (Table 2). Other scales were 0.2 to 0.66 SD below the general population. Ceiling effects were seen in the pain domains (15–20%), suggesting no pain in these patients. Also, in 8 of 11 banks, no patient answered the maximum of 20 questions in the item bank (Table 2). In the remaining 3 banks (Social Satisfaction Discretionary, Social-Satisfaction Roles, Wake Disturbances), 5–12% of patients completed the maximum number of questions.
As discussed in the Methods section, CATs were set to administer a minimum of 5 items and enough items to achieve a standard error (SE) < 0.30. The average number of items completed for each CAT-administered item bank ranged from 5 to 8 (69 CAT items per patient for all 11 CAT item banks), and the average time to complete each CAT-administered item bank ranged from 48 seconds to 1.9 minutes per patient (average time = 11.9 minutes per patient for 11 CAT item banks; Table 3). Some patients’ physical impairments required modifications to study administration, while other patients had difficulty understanding the questions and required additional clarification and time to complete the study questionnaire. For example, some patients had one or more digital amputations and/or severe hand contractures related to their SSc and were physically unable to operate a mouse or computer keyboard (n=27 subjects). In these instances, the patients read the questions on the screen themselves and provided the answer to the coordinator, who then selected the given response in Assessment Center. The same procedure was followed for patients who complained of poor vision and were unable to read the questions despite the type font being set to its largest size. The mean time to administer PROMIS items was not different between patients who required additional help and those who did not (p=.0993). There were no differences in education level (p=0.3) or depressed mood (p=0.5) between patients who required additional help and those who did not. Ten patients (7%) required > 25 minutes to complete their assessment. Of these, 5 patients took an average of 9.0 minutes to complete the anger item bank. Only 2 of these 10 patients belonged to the physical impairment group.
We also explored whether there were any differences in the demographics of patients who completed ≤ 56 items (1st quartile) vs. ≥ 81 items (4th quartile; Table 4) when combining all 11 item banks. There were no statistically significant differences between these groups with regard to their socio-demographic characteristics (Table 4).
The correlations in the MTMM matrix are provided in Table 5. Validity diagonals (correlations among different methods of measuring the same domain) were the largest correlations across the row and column in every case with one exception: the PROMIS scale (satisfaction with participation in discretionary social activities) had about the same size correlation with the legacy scale FACIT-Fatigue (r=.62) as with its SF-36 social functioning counterpart. In 83% of the paired t-tests comparing correlations, the validity-diagonal correlation was statistically significantly larger than the relevant off-diagonal correlation in the MTMM matrix, providing substantial support for the construct validity of the measures. The correlation between HAQ-DI and the PROMIS physical functioning CAT was 0.71, and that between SF-36 vitality and the PROMIS fatigue CAT was 0.75. Other MTMM matrices were analyzed replacing SF-36 physical functioning with HAQ-DI and FACIT-Fatigue with SF-36 vitality. Similarly, validity diagonals were the largest correlations across row and column with at most one exception in these matrices. The matrix with SF-36 physical functioning and FACIT-Fatigue is presented (Table 5) because it had the highest average convergent validity correlation.
The PROMIS initiative aims to create reliable and valid item banks that can be used across chronic conditions. This study expands on the original PROMIS effort by testing the performance of the PROMIS CATs in systemic sclerosis (SSc). We have demonstrated the feasibility of administering CATs in patients with SSc seen in an academic center without interrupting the flow of the clinic. In addition, the PROMIS domains showed construct validity against the appropriate legacy instruments.
One of the objectives of PROMIS is to develop meaningful and precise instruments while reducing respondent burden(1, 2). In our patients with SSc, the mean/median time to complete the 11 CAT-administered domains was 11.9/9.0 minutes. Patients and providers can therefore anticipate approximately one minute per concept measured in clinical practice settings where PROMIS CATs are employed. In comparison, there were 91 items in the 5 legacy instruments that captured 6 health constructs (physical functioning, mental health, bodily pain, social functioning, sleep, and fatigue). Applying the rule of thumb of completing 3–5 items/minute(26), these would take approximately 18.2 to 30.3 minutes for 6 health constructs.
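The burden comparison above is simple arithmetic and can be reproduced as follows. The function name is hypothetical; the inputs (91 legacy items, 3–5 items per minute, 11.9 minutes for 11 CAT banks) are taken from the text.

```python
def completion_minutes(n_items: int, items_per_minute: float) -> float:
    """Estimated completion time under a fixed answering rate."""
    return n_items / items_per_minute


LEGACY_ITEMS = 91
fastest = completion_minutes(LEGACY_ITEMS, 5)  # 5 items per minute
slowest = completion_minutes(LEGACY_ITEMS, 3)  # 3 items per minute
print(round(fastest, 1), round(slowest, 1))    # 18.2 30.3

# Mean CAT time per bank: 11.9 minutes across 11 banks.
print(round(11.9 / 11, 2))  # 1.08, i.e. about one minute per domain
```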
The time estimate is important given that the PROMIS domains were administered during the office visit with a dedicated study coordinator and in patients with moderate physical disability. Data collection occurred in the clinic, usually after a physician visit. On rare occasions, patients initiated the study visit before seeing the physician. This may explain the time of 79.7 minutes for a single patient, who likely stopped in the middle to be interviewed and examined by the physician but did not log off.
The administration of PROMIS domains by the study coordinator in patients with disability did not significantly increase the time to complete the items. The study coordinator was able to read the questions or help record patients’ choices, especially for patients with disabilities. However, this study was performed in a research setting. If these measures are to be used in routine clinical practice, that task would fall to clinical staff (e.g., nurse, receptionist), which could substantially affect clinic flow and should be assessed in future studies.
By using an item bank for a particular domain that covers the continuum of the construct, PROMIS aims to reduce the floor and ceiling effects in PROs(2). As an example, Rose et al (27) used IRT to construct and evaluate a preliminary item bank for physical function (N=17,726). In simulations, a 10-item CAT eliminated floor effects and decreased ceiling effects. When using the 9-item HAQ-DI or 10-item SF-36 physical function measure, there were significant floor and ceiling effects. In our study, the floor effect was minimal, but a higher ceiling effect was seen in the assessment of pain impact and behavior (ranging between 20–25%), suggesting that a large proportion of patients reported no pain. The lack of a high proportion of patients with floor/ceiling effects on other measures reflects the moderate-to-severe physical, mental, and social impact of SSc.
We assessed the construct validity of the PROMIS domains against legacy measures. All correlations between PROMIS domains and corresponding legacy measures were large and in the hypothesized direction; the absolute value of these correlations ranged from .61 to .82. The PROMIS scale for satisfaction with participation in discretionary social activities correlated as highly with the legacy scale FACIT-Fatigue (r=.62) as it did with its SF-36 social functioning counterpart (r=.61). This is probably because fatigue plays a large role in the ability to participate in social activities.
With the exception of physical function, which does not include a time frame, and the social health banks, which reference “lately,” all PROMIS item banks reference the past 7 days. In comparison, the SF-36 BP and SF scales and the MOS Sleep scale use a 4-week recall period. In a previous study, researchers compared acute (one-week recall) and standard (four-week recall) versions of the SF-36 scale scores in 142 patients with asthma(28). The acute form scores were reliable, the scales conformed to assumptions underlying their scoring and scaling, and the scales had factor content similar to the standard version.
Our study has several strengths. First, to our knowledge, this is the first application of CAT-administered PROMIS items in a clinical setting without disrupting the flow of the clinic. Second, this study administered both legacy instruments and PROMIS domains, allowing us to assess the construct validity of PROMIS within SSc, a disease not specifically targeted in the instrument development.
Our study has limitations. First, we did not assess responsiveness of the PROMIS banks and legacy instruments. This will require longitudinal data. Second, we only included 6 legacy instruments in this study whereas we included all 11 PROMIS item banks. This is because the original objective of the study was to assess minimally important differences for OMERACT-endorsed outcomes measures in SSc. Third, we only evaluated the feasibility of PROMIS item banks in the research clinical setting. We had a dedicated study coordinator who was responsible for helping patients complete the PROMIS domains and providing legacy instruments. We had a dedicated clinic area, and the administrators were supportive of our research objectives. The feasibility of administering PROMIS domains in routine clinical practice still needs to be determined. Fourth, we did not evaluate differential item functioning (DIF), as we administered the CATs in this study. We assumed that item performance was similar in the US population and our sample. Previous analyses of DIF in PROMIS have shown that the items are generally robust to age, gender, and education. Ongoing work is evaluating DIF by other subgroups (e.g., by disease group).
Finally, this study did not assess the clinical utility of the measures. As an example, legacy instruments such as HAQ-DI and CESD are used for clinical decision-making(19, 24). Rather, the objective of the study was to assess construct validity; other studies are needed to determine appropriate score cut points and linkage of legacy instrument scores with PROMIS item banks. In conclusion, our study provides support for the construct validity of CAT-administered PROMIS item banks and shows that it is feasible to administer them in clinical practice. Studies are underway to assess clinical utility and responsiveness of the PROMIS item banks.
D. Khanna, Maranian, Rothrock, Cella, Gershon, Furst, PP Khanna, Spiegel, Bechtel and Hays were supported by a National Institutes of Health Award (NIH/NIAMS U01 AR057936A) and by the National Institutes of Health through the NIH Roadmap for Medical Research Grant (AR052177). D. Khanna is also supported by NIAMS (K23 AR053858-04) and the Scleroderma Foundation (New Investigator Award). Dr. PP Khanna was also supported by a National Institutes of Health Award (T32 AR 053463). Hays was also supported by the UCLA Resource Center for Minority Aging Research/Center for Health Improvement in Minority Elderly (RCMAR/CHIME), NIH/NIA Grant Award Number P30AG021684, and the UCLA/Drew Project EXPORT, NCMHD, 2P20MD000182.