General PROMIS Methods
The process began with a step-wise qualitative item review that included: 1) identification of items from existing fatigue scales, 2) item classification and selection, 3) item revision, 4) focus group exploration of domain coverage, 5) CI on individual items, and 6) final revision before field testing [9]. More than 80 fatigue questionnaires were initially reviewed, yielding a list of over 1000 potential fatigue items, though many of these were quite similar to one another. By the time of the CI step, a total of 136 potential items remained (see examples in the accompanying table). These items were grouped into four non-overlapping sets of 34 items each, with one set administered to each subject in the first round of CI (some subjects were administered additional revised items during a second round of CI). We allowed similar questions to be grouped together so that respondents could consider and comment on the similarities and differences between wording choices if that seemed important to them.
Examples of PROMIS fatigue items
Recruitment of CI Participants
The participant sample was intended to represent a diverse range of chronic health conditions (e.g., diabetes, chronic pulmonary disease, cardiovascular disease, musculoskeletal disease, chronic pain, and chronic gastrointestinal conditions) and socio-demographic characteristics. Fatigue is common in many of these conditions, but its severity varies substantially between individuals. The aim was to include subjects with mild, moderate, and severe levels of fatigue in the review of each item. Participants were interviewed at the University of North Carolina (UNC) at Chapel Hill Medical School. Potential participants were recruited from two sources: the North Carolina Musculoskeletal Health Project and the UNC General Internal Medicine Practice.
The North Carolina Musculoskeletal Health Project is a collaborative database established by researchers and clinicians at the Thurston Arthritis Research Center and the Department of Orthopedics at the UNC Medical School. The database contains a list of consecutive patients seen at the UNC rheumatology and orthopedics clinics who consented to participate in future studies. Potential cognitive interview participants were mailed an invitation letter that described the purpose and nature of the cognitive interviews and asked whether they would be willing to participate. Interested patients could contact the study personnel by email or phone, and the research staff also followed up with phone calls to assess interest and determine eligibility. In addition, patients were directly approached and screened for eligibility at the UNC General Internal Medicine Practice with the permission of the treating physician. The study was approved by the UNC Institutional Review Board (protocol #05-2571).
Patients were eligible to participate in the cognitive interviews if they: 1) were at least 18 years of age, 2) had seen a physician for a chronic health condition within the past 5 years, 3) were able to speak and read English, 4) were willing to provide written informed consent prior to study entry, 5) had no concurrent medical or psychiatric condition that, in the investigator’s opinion, might preclude participation in the study, and 6) had no cognitive or other impairment (e.g., visual) that would interfere with completing an interview.
Conducting the Cognitive Interviews
Each CI was conducted face-to-face and lasted approximately 45-60 minutes. Patients completed paper-and-pencil questionnaires consisting of 34 items (from the total of 136) and were then debriefed by the interviewers. Going item by item, the interviewer followed a script of open-ended questions seeking comments on the item stem (the body of the question), the response options, and the time frame (the period covered by the questions, which was uniformly set at seven days). The interviewer asked summary questions at the end of the interview (the questions are shown in the accompanying table of probes).
Probes for Cognitive Interview
All 136 items were reviewed by five to six participants during the first interview round, and the 19 items subjected to a second round of CI were reviewed by at least three more participants. While this is not a large number of interviews per item, most of the PROMIS items were taken, with slight modifications, from existing questionnaires that had already been administered to large numbers of subjects. In addition, CI was only one in a series of techniques used to refine the questionnaire items.
CI data were collected by trained interviewers at the UNC Chapel Hill Medical School. The interviewers were faculty or graduate students in public health or social work who completed two four-hour CI training sessions covering methods, protocol review, and practice with feedback. Each interview was conducted by two staff members: one conducted the interview while the other took detailed notes and recorded the session. Recordings were used only to fill gaps in the notes and were not transcribed verbatim. After each interview, one staff member organized the notes into a cohesive report together with the comments from the other cognitive interviews for the given item.
Modification of Items on the Basis of CI
After completion of the first round of cognitive interviews, we decided on an item-by-item basis whether each item needed revision, based on the cognitive debriefing feedback. As mentioned in the introduction, there is no standard method for using CI feedback to modify items. The summary of CI feedback for each item was reviewed by a group of five individuals at Stony Brook University, including persons with expertise in the study of fatigue and the development of self-report measures (the present authors [AS, DJ, and CC] and two other members of the research team). The group decided by consensus whether to retain, revise, or eliminate each item. In arriving at a decision, the group placed particular weight on comments raised by more than one respondent; however, a single negative remark was occasionally enough to prompt revision or elimination (e.g., a remark signaling a serious misunderstanding of the item stem). For items judged to require substantial revision, a second round of CI was undertaken, with three to five participants reviewing each item.
Evaluation of Decisions Made in Response to CI
After the CI process was complete, we formally evaluated whether the items we accepted fared better than the eliminated items with respect to the concerns raised by subjects during CI. Because we were most interested in an item’s final disposition (retained versus eliminated), items revised after Round 1 were not rated until CI Round 2 was complete and their final disposition was known. (Two items were revised after Round 2 without another CI round; these were excluded from the analysis.)
We developed methods to assess the following questions regarding the retained versus eliminated items: Did the retained items have fewer serious CI concerns than eliminated items? Were eliminated items more likely to be seen as non-applicable to respondents’ lives? What types of concerns were raised for eliminated versus retained items, as categorized by the QAS-99 [18]? Below we describe the methods used to address each of these questions. Two of the present authors (DJ and CC) applied these methods approximately four months after the initial decisions had been made. To minimize the influence of the earlier decisions on these more quantitative formal evaluations, information on item disposition was removed from the item spreadsheets used during the evaluations.
Metric for evaluating whether retained items raised fewer serious concerns during CI than eliminated items
We categorized the concerns raised for each item as mild or serious, defined as follows:
We considered a concern mild when a subject suggested alternate wording without specifically stating that the current wording was bad; comments using words like “preferred”, “offered”, or “suggested” signaled a mild concern.
We considered a concern serious if one or more of the following conditions were met: a) the respondent insisted on a wording change, using expressions like “should”, “needs to”, or “must”; b) the respondent said something specifically negative about the existing item, regardless of whether alternate wording was provided; or c) the respondent’s comments reflected a misunderstanding of either the item stem or the response options.
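As a toy illustration only (the actual coding was done by trained human raters; the keyword lists below are illustrative assumptions, not the study’s coding manual), the heuristic above can be sketched in Python:

```python
# Toy sketch of the mild/serious coding heuristic described above.
# The real classification was performed by trained human raters; the
# keyword lists here are illustrative assumptions, not the coding manual.
MILD_CUES = ("preferred", "offered", "suggested")
SERIOUS_CUES = ("should", "needs to", "must")

def classify_concern(comment: str, misunderstanding: bool = False) -> str:
    """Return 'serious', 'mild', or 'none' for one respondent comment."""
    text = comment.lower()
    # A misunderstanding of the stem or response options is always serious.
    if misunderstanding or any(cue in text for cue in SERIOUS_CUES):
        return "serious"
    if any(cue in text for cue in MILD_CUES):
        return "mild"
    return "none"

print(classify_concern("I preferred the word 'tired' here."))   # mild
print(classify_concern("This question must say 'past week'."))  # serious
```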
Each item was reviewed by three to six CI participants (five to six in the first round of CI, and at least three in the second round), who could indicate whether they had problems with the stem and/or response categories. Because the total number of concerns raised (mild or serious) could vary with the number of participants, the number of concerns for each item was divided by the number of participants who viewed that item, computed separately for mild and for serious concerns. Our primary focus was on the serious concerns. This procedure gave us a quantitative picture of the concerns raised for each eliminated and retained item.
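The normalization just described is simple division; a minimal sketch follows, with hypothetical per-item counts (none of these numbers come from the study):

```python
# Hypothetical per-item tallies: (serious concerns, mild concerns, reviewers).
items = {
    "item_A": (2, 1, 5),
    "item_B": (0, 3, 6),
    "item_C": (1, 0, 3),
}

# Concern rate = number of concerns / number of participants who saw the item,
# computed separately for serious and mild concerns.
for name, (serious, mild, n_reviewers) in items.items():
    print(f"{name}: serious={serious / n_reviewers:.2f}, "
          f"mild={mild / n_reviewers:.2f}")
```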
To determine how reliably we could assess the severity of participants’ concerns, two raters jointly reviewed a subset of items (n = 19), coding and then discussing the items one at a time in an effort to increase inter-rater coding consistency. The remaining items were then rated independently. The two raters classified participants’ concerns as mild and/or serious in an identical fashion for 93% of the items (111/119). As an alternative measure of reliability, the intraclass correlation (using a two-way mixed model) was .91 (p < .001) for mild concerns and .93 (p < .001) for serious concerns. The few remaining differences between raters were resolved by joint discussion of the items, so all ratings reported are the consensus opinion of both raters.
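As a sketch of how such a two-way mixed-model intraclass correlation might be computed (assuming the third-party pandas and pingouin packages; the item names and concern counts are invented for illustration):

```python
# Sketch of a two-way mixed-model ICC, assuming the pingouin package;
# the item names and concern counts below are hypothetical.
import pandas as pd
import pingouin as pg

# Long format: one row per (item, rater) pair with that rater's count of
# serious concerns for the item.
df = pd.DataFrame({
    "item":    ["i1", "i1", "i2", "i2", "i3", "i3", "i4", "i4"],
    "rater":   ["R1", "R2", "R1", "R2", "R1", "R2", "R1", "R2"],
    "serious": [2, 2, 0, 1, 3, 3, 1, 1],
})

icc = pg.intraclass_corr(data=df, targets="item", raters="rater",
                         ratings="serious")
# ICC3 is the single-rater, two-way mixed-effects model.
print(icc.set_index("Type").loc["ICC3", ["ICC", "pval"]])
```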
Metric for evaluating whether eliminated items were more likely to be seen as non-applicable to respondents’ lives
We counted the number of subjects who stated that an item was not applicable to their lives during the past 7 days, i.e., that the particular experience or event mentioned in the item did not occur for them during that time period. For instance, subjects who were not working rated the following item as non-applicable: “How often did you feel used up at the end of the workday?”
As with the severity ratings, two raters jointly reviewed a subset of items (n = 19), coding and then discussing the items one at a time to increase inter-rater coding consistency, and then rated the remaining items independently. A 99% agreement rate (118/119) was achieved for the applicability ratings, and the intraclass correlation for applicability ratings was .86 (p < .001). The difference between raters on the single discrepant item was resolved by joint discussion, and all ratings reported are the consensus opinion of both raters.
Metric for evaluating the types of concerns raised for eliminated versus retained items using the QAS-99
We used the QAS-99 [18] to categorize the item problems identified during the CI process. The QAS-99 consists of eight major categories of item problems (shown in the accompanying table). Most of the categories (categories 3-8) identify types of problems associated with each item from the respondent’s perspective, but category 1 (Reading) pertains to difficulties reading items from the interviewer’s perspective, and category 2 (Instructions) pertains to difficulties respondents have with the overall instructional set rather than with any individual item. Because the focus of the CI in PROMIS was to obtain item-by-item analysis from the respondent’s perspective, we excluded categories 1 and 2. The major categories we assessed were therefore: Clarity, Assumptions, Knowledge/Memory, Sensitivity/Bias, Response Categories, and Other Problems.
To capture the complexity of the rating task, we established inter-rater reliability for the QAS-99 classifications on the items most likely to exhibit problems: the 55 items that were revised or eliminated across the two rounds of CI. For training purposes, the two raters jointly reviewed and discussed a subset (n = 8) of the items that would undergo QAS-99 classification, coding and then discussing them one at a time to increase inter-rater coding consistency. Inter-rater reliability was established on the 47 remaining items.
Establishing inter-rater reliability for the QAS-99 ratings was more complicated than for the severity and non-applicability ratings, because more than one QAS-99 problem category could be assigned to each item [18]; the raters could therefore agree on some but not all categories. For example, on a given item Rater 1 might assign problems with Clarity and Assumptions, while Rater 2 assigned problems with Clarity and Response Categories. Because inter-rater agreement could accordingly be determined in multiple ways, we defined two levels of agreement: identical and partial. Identical agreement required that the choices of the two raters were exactly the same; it was obtained for 79% (37/47) of the items. Partial agreement is a more lenient standard, requiring that at least one (and possibly more) of the two raters’ choices was the same. All differences on the QAS-99 ratings were resolved by the two raters through discussion of the items, and all ratings reported are the consensus opinion of both raters.
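A minimal sketch of the two agreement standards, treating each rater’s QAS-99 assignments for an item as a set of category labels (the example assignments are hypothetical):

```python
# Sketch of the identical/partial agreement standards using sets of
# QAS-99 category labels; the example assignments are hypothetical.
def identical_agreement(r1: set[str], r2: set[str]) -> bool:
    """Both raters assigned exactly the same categories to the item."""
    return r1 == r2

def partial_agreement(r1: set[str], r2: set[str]) -> bool:
    """At least one assigned category is shared between the raters."""
    return bool(r1 & r2)

# Hypothetical item: Rater 1 flags Clarity and Assumptions,
# Rater 2 flags Clarity and Response Categories.
r1 = {"Clarity", "Assumptions"}
r2 = {"Clarity", "Response Categories"}
print(identical_agreement(r1, r2))  # False
print(partial_agreement(r1, r2))    # True
```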