To operationalize the domains, an 80-item pool was constructed by selecting questions from existing instruments and writing new ones where none existed. The items in the pool were categorized under the domains they were intended to measure and were reviewed by a subset of the expert panel for face and content validity.
All 80 items were further refined with three rounds of face-to-face cognitive testing with 20 respondents with chronic conditions. Items were evaluated in terms of how well they were understood, the degree to which there was variability in responses, and the adequacy of the response categories. Seventy-five items were retained after the cognitive interviews and used for the pilot study.
The pilot study was conducted with a convenience sample of 100 respondents. Participants were recruited through newspaper advertisements and were paid for their participation. Respondents ranged in age from 19 to 79 and reported a wide range of chronic conditions. Items were administered through a telephone interview that included the 75-item pool and a limited set of demographic and health status questions.
The initial set of items constituting the PAM was selected using Rasch analysis (Rasch 1960; Wright and Masters 1982; Wright and Stone 1979; Massof 2002). Rasch measurement can be used to create interval-level, unidimensional, probabilistic Guttman-like scales from ordinal data such as rating scale responses to survey questions. The measurement model calibrates the “difficulty” of the items in terms of response probabilities. The calibration of an item on the measurement scale indicates how much of the measured variable a respondent must exhibit to be able to endorse the item.
Once the measure is constructed, individuals are measured as to where they fall on the scale, and their location represents how much of the variable each respondent possesses. In the case of the PAM, an individual's location indicates how activated the person is. Both the people who are measured and the items doing the measurement are located on the same equal interval scale, yet these two parameters are statistically independent of each other. This concept of parameter separation means that the calibration of the items is independent of the activation levels of the particular respondents measured.
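The dichotomous Rasch model underlying this kind of calibration can be sketched as below. The function name and the example values are illustrative, not drawn from the study; the key property is that a person located exactly at an item's calibration has a .5 probability of endorsing it.

```python
import math

def p_endorse(theta: float, b: float) -> float:
    """Rasch model: probability that a person with activation level
    theta endorses an item with difficulty calibration b (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A person located at the item's calibration endorses it with p = .5;
# persons with higher activation endorse it with higher probability.
print(round(p_endorse(0.0, 0.0), 2))  # 0.5
print(round(p_endorse(2.0, 0.0), 2))  # 0.88
```

Because person and item parameters enter the model only through their difference, the items can be calibrated independently of which particular respondents are measured, which is the parameter-separation property described above.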
The precision with which an item's scale location, or calibration, has been estimated is represented by the item's standard error of measurement. Likewise, the precision of each individual respondent's estimated scale location is specified by the standard error of measurement of that person.
Item selection is based on item fit statistics representing how much responses to an item deviate from the model's expectations. A fit value of 1.0 indicates perfect fit to model expectations. Fit values >1.0 indicate more stochastic variability in responses than expected (e.g., persons with low measured activation endorsing items requiring a high level of activation) and fit values <1.0 indicate that responses to the item by persons of different activation levels do not vary as much as the model expects.
Two item fit statistics are calculated. Infit is an information-weighted residual and is most sensitive to item fit when the item's scale location is close to the respondent's scale location. Outfit is more sensitive to item fit for items with a scale location that is distant from the respondent's scale location. Simulation studies and experience suggest that item fit values between .5 and 1.5 produce sufficient unidimensionality and expected response variability for useful rating scale measurement (Smith 1996). All analyses were conducted with the Winsteps Rasch model software application (Linacre 2002).
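As a minimal sketch of the two fit statistics for a dichotomous item, assuming the model-expected endorsement probabilities have already been estimated (the function name is illustrative):

```python
def fit_mean_squares(responses, probs):
    """Infit and outfit mean-squares for one item across persons.

    responses: observed 0/1 endorsements, one per person
    probs: model-expected endorsement probability for each person
    """
    resid_sq = [(x - p) ** 2 for x, p in zip(responses, probs)]
    variances = [p * (1 - p) for p in probs]  # binomial information
    # Outfit: unweighted mean of squared standardized residuals;
    # dominated by persons located far from the item.
    outfit = sum(r / v for r, v in zip(resid_sq, variances)) / len(responses)
    # Infit: information-weighted mean-square; dominated by persons
    # located near the item.
    infit = sum(resid_sq) / sum(variances)
    return infit, outfit
```

Under this formulation both statistics have an expected value of 1.0 when the data fit the model, matching the .5–1.5 acceptance band cited above.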
The table below shows the 21 items constituting the preliminary activation measure, the calibrated scale location (difficulty) of each item, and the fit and item discrimination statistics. The item difficulty calibration indicates how much activation is required for a patient to have a .5 probability of responding “agree” to an item. Item scale locations have been transformed from the original logit metric to a user-friendly 0–100 metric, where 0 is the lowest and 100 the highest possible activation as measured by this set of items. Although the metric allows for a potential range of 0–100, the items included in the measure covered only the range from 40 to 60, not tapping what would theoretically be the lowest or highest ranges of the construct.
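A rescaling of this kind is a simple linear transformation of the logit calibrations, clipped to the reporting range. The slope and intercept below are illustrative placeholders; the study does not report the constants it used.

```python
def logit_to_score(theta, slope=10.0, intercept=50.0):
    """Linearly rescale a logit calibration to a 0-100 reporting metric.

    slope and intercept are illustrative values, not the authors';
    any linear rescaling preserves the interval-level properties."""
    return min(100.0, max(0.0, slope * theta + intercept))

print(logit_to_score(0.0))  # 50.0 -- a mid-range item under these constants
print(logit_to_score(1.0))  # 60.0
```

Because the transformation is linear, item order and the equal-interval property of the Rasch scale are unaffected by the choice of constants.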
Preliminary (from Stage 2) 21-Item Activation Measure with Calibrations
All the domains derived through the conceptualization stage are reflected in the 21 items, except for the domain of accessing appropriate and high-quality care. While items addressing this domain correlate with the 21-item measure, fit statistics revealed that these items tap a construct different from activation.
Most importantly, this analysis indicates that the items form a unidimensional, probabilistic Guttman-like scale. Close inspection of the difficulty order of items on the scale suggests that they reflect a developmental model of activation (Bond and Fox 2001). Beliefs about the patient role and basic knowledge about one's condition and treatment appear to be important early developmental steps. Items in this early stage involve areas such as knowledge of medications and needed lifestyle changes as well as a belief that active involvement in one's health care is important. Only a small amount of activation is required to be able to endorse these items. Skills and confidence appear to come at later developmental steps. Items at the midpoint of the scale involve confidence that one can identify when medical care is needed, and that one can follow through on medical recommendations and handle symptoms on one's own. Items at the top of the activation continuum, indicating greatest activation, include maintaining needed lifestyle changes, having the confidence to handle new situations or problems, and keeping chronic illness from interfering with one's life.
Rasch person reliability is the proportion of the total sample variability in measured activation that is not measurement error. Rasch person reliability provides upper and lower bounds to the estimate of the “true score” reliability of a measure. Real person reliability is calculated under the assumption that all of the misfit in the responses is due to departure of the data from the model's expectations. This is the lower bound reliability of the measurement of persons in this sample with this set of items. Model person reliability is based on the assumption that the data fit model expectations and that the misfit in the data is due to the probabilistic nature of the model. This is the upper-bound reliability. The true reliability of the measure lies somewhere between these lower and upper bounds. The Rasch person reliability for the preliminary 21-item measure was between .85 (real) and .87 (model). Cronbach's alpha was .87.
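The reliability described above is the share of observed person variance that is not measurement error. A minimal sketch, assuming each person's activation estimate and SEM are available (the real/model distinction corresponds to which set of standard errors is supplied; the function name is illustrative):

```python
def person_reliability(measures, sems):
    """Rasch person reliability: proportion of observed variance in the
    person measures that is not attributable to measurement error."""
    n = len(measures)
    mean = sum(measures) / n
    observed_var = sum((m - mean) ** 2 for m in measures) / n
    error_var = sum(s ** 2 for s in sems) / n  # mean squared SEM
    true_var = max(0.0, observed_var - error_var)
    return true_var / observed_var
```

Passing misfit-inflated SEMs yields the lower ("real") bound and model-based SEMs the upper ("model") bound, bracketing the true reliability as in the .85–.87 range reported for the 21-item measure.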
We also conducted a test–retest reliability assessment. Thirty respondents from the pilot survey were reinterviewed two weeks after the initial interview using the same protocol. For each person we calculated the standard error of measurement (SEM) of the estimated activation at each time point. The SEM times 1.96 gives the 95 percent confidence interval (CI) for each person's measured (estimated) activation. Twenty-eight of the 30 respondents had a retest activation estimate within the 95 percent CI of their test activation estimate.
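The per-person stability check described above reduces to a simple interval test (illustrative function name and values; the scores are hypothetical, not respondents' data):

```python
def retest_within_ci(test_measure, test_sem, retest_measure, z=1.96):
    """True if the retest activation estimate falls inside the 95 percent
    confidence interval around the initial (test) estimate."""
    half_width = z * test_sem
    return abs(retest_measure - test_measure) <= half_width

print(retest_within_ci(55.0, 2.0, 58.5))  # True  (CI is 51.08 to 58.92)
print(retest_within_ci(55.0, 2.0, 60.0))  # False
```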
To assess criterion validity, we interviewed 10 respondents from the pilot study: five who scored at the lowest end of the activation scale, and five who scored at the highest. An in-depth, open-ended, semistructured interview protocol was used to elicit elaborated explanations of how respondents dealt with common problems and challenges associated with managing their conditions, such as handling a situation with a physician who did not answer questions well, their responses to recommendations to change their lifestyle, and handling self-treatments on their own. The interviews were transcribed and three judges, blinded to the person's measured activation, reviewed and independently categorized each transcript as that of a person “low” or “high” in activation.
The three independent judges' classifications of respondents agreed with their measured activation level (high or low) 83 percent of the time (25 of the 30 classifications were correct). Cohen's kappa for agreement between measured activation and each judge's classification was .80, .90, and .90 (p <.001 for all three kappas). No respondent was misclassified by all three judges. These findings suggested that the preliminary measure had criterion validity when evaluated using the key criterion of self-described behavior.
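Cohen's kappa corrects observed agreement for the agreement expected by chance from the two raters' marginal rates. A minimal sketch for the binary low/high classification used here (the example data are constructed for illustration, not the study's transcripts):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two binary ('low'/'high') classification lists."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    # Chance agreement from each rater's marginal 'high' and 'low' rates.
    pa = sum(x == "high" for x in a) / n
    pb = sum(x == "high" for x in b) / n
    p_e = pa * pb + (1 - pa) * (1 - pb)
    return (p_o - p_e) / (1 - p_e)

# Illustrative: a judge agreeing with the measure on 9 of 10 cases.
measured = ["high"] * 5 + ["low"] * 5
judge = ["high"] * 5 + ["low"] * 4 + ["high"]
print(round(cohens_kappa(measured, judge), 2))  # 0.8
```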