This study examined two approaches to linking items from two pain surveys to form a single item bank with a common measurement scale. We conducted a secondary analysis of two independent surveys: the IMMPACT Survey, comprising a Main Survey (959 chronic pain patients; 42 pain items) and a Pain Module (N=148; 36 pain items), and the CORE Survey (400 cancer patients; 43 pain items). Common items were included among the three data sets. Two approaches were examined: one in which all items were calibrated to an item response theory (IRT) model simultaneously, and another in which items were calibrated separately and the scales were then transformed to a common metric. The two approaches produced similar linking results across the two sets of pain interference items because there was a sufficient number of common items and a large enough sample size. For pain intensity, simultaneous calibration yielded more stable results. Separated calibration yielded unsatisfactory linking results for pain intensity because there was only a single common item and the sample size was small. The results suggest that the simultaneous IRT calibration method produces more stable item parameters across independent samples and is therefore recommended for developing comprehensive item banks. Patient-reported health outcome surveys are often limited in sample size and in the number of items owing to the difficulty of recruitment and the burden on patients. As a result, the surveys either lack statistical power or are limited in scope. Using IRT methodology, survey data can be pooled so that studies lend strength to each other, expanding the scope and increasing the sample size.
Modern measurement theory methods are being used increasingly to develop and evaluate existing and new symptom assessment and health outcome instruments (1, 2, 3, 4, 5, 6, 7). For example, the NIH sponsored Patient-Reported Outcome Measurement Information System (PROMIS) initiative is focused on developing and evaluating item banks designed to measure pain, fatigue, physical function, social function, emotional distress and other domains. Item response theory (IRT) analyses have been instrumental in the psychometric evaluation of these item banks (8, 9). While common in the educational testing field, scale linking has not been frequently applied to health outcome measures (4). However, there are situations where scale linking is needed such as when researchers are developing several large health outcome item banks each consisting of a large number of items. In order to reduce burden on the participants, item banks may be divided into several comparable subsets (8, 9). Participants are administered only a subset of the items that comprise the item bank(s). Through scale linking, all subsets can be put on a common measurement metric. Another example where test linking may be needed is when a researcher wants to create one instrument by pooling health outcome items from two studies where there is overlap in the items administered across the two studies.
This study examines several methods for linking pain-related items across two patient surveys: a survey conducted by the Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT) committee (10) and a survey conducted by the Center on Outcomes, Research and Education (CORE), Evanston Northwestern Healthcare and Northwestern University (6). The IMMPACT Survey included 959 patients with chronic pain conditions and the CORE Survey included 400 patients with various cancers. Because the two surveys included common pain interference and intensity items, we were able to combine the two datasets to conduct the linking study. The obvious benefits of combining the datasets are that the subject sample size and the number of items are larger and the domain coverage of the items is broader.
The process of creating a common scale is usually referred to as scale linking in IRT analyses. The application of IRT starts with item calibration, the process of fitting item response data to an IRT model and obtaining item parameters for each item (11, 12). There are different ways of linking sets of items using IRT, depending on the data collection design. Two sets of items can be linked through common subjects, common items, or both. The item banks for the IMMPACT and CORE surveys included common items, but the different samples constituted non-equivalent groups. There are two ways to achieve linking in such a situation, that is, in a common-item non-equivalent groups design. One is to link by “simultaneous (concurrent) IRT calibration” (13). The other is by “separated IRT calibration,” in which each sample is calibrated separately and then linked through “scale transformation” (13). For this study, we compared these two scale linking methods.
The CORE and IMMPACT survey item sets included items measuring three pain domains: pain intensity, pain quality and pain interference.
The CORE pain item bank data were collected in a cross-sectional survey of 400 cancer patients. Table 1 summarizes the demographic and clinical characteristics of this sample. The original CORE pain item bank contained 61 items, but this item bank was reduced to a final data set of 43 items (6). Items without good psychometric characteristics were excluded from the final pain item bank. The development of this bank is described elsewhere (6).
The IMMPACT Survey was a cross sectional internet-based survey of individuals with chronic pain recruited through the American Chronic Pain Association website (10). The purpose of the survey was to gather data on the importance and relevance of different domains impacted by chronic pain and to measure pain intensity, quality and interference, and functional and psychological well-being outcomes in people with chronic pain. The sample consisted of 959 respondents who had one or more chronic pain conditions (see Table 1). The survey included a Main Survey section with a total of 42 pain-related items, and the Pain Module that included 36 items from the CORE pain item bank. All 959 patients took the Main Survey; only a subset of 148 also took the Pain Module.
Pooling the IMMPACT Main Survey, the Pain Module, and the CORE pain item bank yields 29 pain interference, 10 pain intensity, and 38 pain quality items. The three item sets shared eight common items: 1 pain intensity item and 7 pain interference items. That left the IMMPACT Main Survey with 10 unique pain interference items and 24 unique pain quality items. The IMMPACT Pain Module had no items unique to itself; all of its items were common with the CORE pain item bank. These included 12 pain interference items, 10 pain intensity items, and 14 pain quality items. Figure 1 shows the composition of the unique and shared common items of the three item sets for the three pain domains.
Since no common item was shared between the IMMPACT Main Survey with Pain Module and the CORE item bank for the pain quality domain, the two could not be linked using the separated calibration approach. Thus, the pain quality domain was excluded from this study. The two remaining pain domains, pain interference and pain intensity, were analyzed separately, as each was assumed to be unidimensional.
Samejima’s Graded Response Model (14) as implemented in MULTILOG (15) was used to calibrate the items. The pain interference and pain intensity domains were analyzed separately. Higher response categories indicate worse pain or severe pain-related problems. The extreme response categories for some of the items were excluded from the IRT calibration because no patients endorsed these categories. The maximum number of response categories allowed by MULTILOG is ten. Therefore, for the 11-category items, the two highest response categories were combined into one category.
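As an illustration of the model described above, the following minimal Python sketch computes graded response model category probabilities for one item and collapses the highest response categories. It assumes the logistic form without a scaling constant; the function names and the parameter values are hypothetical and are not taken from MULTILOG or from the study data.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities under a graded response model.

    theta: latent trait value; a: slope; thresholds: ordered b_1 < ... < b_{K-1}.
    Returns K probabilities for response categories 0..K-1.
    """
    # Cumulative probabilities P(X >= k), bracketed by 1 (k = 0) and 0 (k = K).
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Each category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

def collapse_top_categories(response, max_category):
    """Merge every category above max_category into max_category."""
    return min(response, max_category)

# Hypothetical 4-category item (illustrative values only).
probs = grm_category_probs(theta=0.5, a=2.0, thresholds=[-1.0, 0.0, 1.0])

# An 11-category item scored 0..10 recoded to 10 categories scored 0..9.
recoded = [collapse_top_categories(r, 9) for r in [0, 5, 9, 10]]
```

Here the two highest categories (9 and 10) end up sharing the code 9, which mirrors the recoding of the 11-category items described above.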
Simultaneous IRT calibration represents the easiest and most direct way to link two sets of items when there are shared common items. In this method, the item parameters of the common items are constrained to be equal across the samples. In essence, each common item is treated as one single item administered to different groups of samples. When the simultaneous calibration is completed, all items are automatically on the same measurement scale (16).
The mean and standard deviation of the population distribution of the latent trait (θ) are also estimated during the calibration. When the samples being linked are considered equivalent, a single-group IRT calibration is conducted. In this study, the groups are non-equivalent because they come from different populations (e.g., cancer and chronic pain); therefore, a multiple-group IRT calibration is used, in which the mean of one group is fixed to zero and the means of the other groups are estimated (13).
For multiple groups simultaneous calibration, we included all items and all subjects in a single calibration run. The sample from the CORE survey was treated as the reference group with its population mean fixed to zero. The IMMPACT sample mean was estimated. Linking using simultaneous IRT calibration with multiple groups was conducted separately for the pain interference and pain intensity domains.
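The data layout implied by a single calibration run can be sketched as follows. This is only an illustration of how records from several groups might be stacked so that each common item occupies one column and items not administered to a respondent are marked as missing; the function name, item identifiers, and toy responses are hypothetical, and the sketch does not reproduce the MULTILOG input format.

```python
def pool_responses(groups):
    """Stack response records from several groups into one data set.

    groups: dict mapping group name -> list of dicts {item_id: response}.
    Returns (item_ids, rows); each row is (group_name, [response or None, ...]).
    """
    # Union of item ids in first-seen order, so a shared item gets one column.
    item_ids = []
    for records in groups.values():
        for record in records:
            for item in record:
                if item not in item_ids:
                    item_ids.append(item)
    # One row per respondent; None marks an item that was not administered.
    rows = []
    for name, records in groups.items():
        for record in records:
            rows.append((name, [record.get(item) for item in item_ids]))
    return item_ids, rows

# Hypothetical toy data: "avg_pain" plays the role of a common item.
groups = {
    "CORE": [{"avg_pain": 3, "core_only": 1}],
    "IMMPACT": [{"avg_pain": 7, "immpact_only": 4}],
}
item_ids, rows = pool_responses(groups)
```

The group label kept with each row is what would allow a multiple-group calibration to fix the mean of the reference group and estimate the mean of the other.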
The second linking approach that we used was to conduct separate item calibrations followed by scale transformation. One IRT calibration was conducted for the combined IMMPACT Main Survey and IMMPACT Pain Module. The second IRT calibration was conducted for the CORE pain item bank.
This method results in separate sets of item parameters for each data set that are not on the same measurement scale, because the samples of the two studies are considered non-equivalent. To link the separate sets of item parameters, a “scale transformation” is performed on the common items (13, 16). Scale “transformation constants” are calculated and used to place the item parameters on a common mathematical metric. This method takes advantage of the parameter invariance of IRT models. When the assumptions of the model are met, item parameters calibrated in different samples are the same up to a linear transformation. Because the measurement scale in IRT is unobservable, there is no natural origin for the scale, and it is arbitrarily given a mean of 0 and a standard deviation of 1. The linear transformation of one metric to the other can be expressed as

\[ \theta^{*} = A\theta + B \qquad (1) \]

where A is the slope and B is the intercept. The item parameters can then be transformed to the target metric using the same coefficients as follows:

\[ a_j^{*} = \frac{a_j}{A}, \qquad b_j^{*} = A\,b_j + B \qquad (2) \]

where \(a_j\) and \(a_j^{*}\) are the slope parameters and \(b_j\) and \(b_j^{*}\) are the location or threshold parameters of item j on the original and target metrics, respectively.
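The transformation described above can be sketched in a few lines of Python. The sketch assumes the standard linear form θ* = Aθ + B, a* = a / A, and b* = A·b + B; the function names, the constants A and B, and the item parameters are hypothetical values chosen for illustration, not results from this study.

```python
def transform_theta(theta, A, B):
    """Move a trait value onto the target metric: theta* = A * theta + B."""
    return A * theta + B

def transform_item(a, thresholds, A, B):
    """Move one item onto the target metric: a* = a / A, b* = A * b + B."""
    return a / A, [A * b + B for b in thresholds]

# Hypothetical transformation constants and item parameters.
A, B = 0.8, 2.0
a, bs = 2.0, [-1.5, -0.5, 0.4]
theta = 0.3

a_star, bs_star = transform_item(a, bs, A, B)
theta_star = transform_theta(theta, A, B)

# The logit a * (theta - b) is unchanged by the transformation, which is why
# response probabilities are identical on the two metrics.
logit_before = a * (theta - bs[0])
logit_after = a_star * (theta_star - bs_star[0])
```

Comparing `logit_before` with `logit_after` shows the invariance that the method relies on: only the labels of the scale change, not the modeled response probabilities.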
In this study, we transformed the item parameters from the IMMPACT item calibration onto the metric of the CORE item calibration, using the CORE calibration as the common metric. There are different scale transformation methods (13, 16). The most commonly used are the mean/mean, mean/sigma, and characteristic curve (Haebara and Stocking-Lord) methods. We used these four transformation methods to obtain the transformation constants with the computer program STUIRT (17). STUIRT is one of many equating and scale transformation computer programs (available at www.uiowa.edu/~casma). These programs can be used in conjunction (13). In this study, linking using separated IRT calibration was conducted separately for pain interference and pain intensity.
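For the two moment-based methods, the transformation constants can be computed directly from the common-item parameter estimates. The following Python sketch uses the usual textbook definitions (mean/sigma: A is the ratio of the standard deviations of the thresholds; mean/mean: A is the ratio of the mean slopes) and assumes that, for graded items, all thresholds of the common items are pooled. The function names and parameter values are hypothetical, and the sketch is not the STUIRT implementation.

```python
from statistics import mean, pstdev

def mean_sigma(b_source, b_target):
    """Mean/sigma constants from common-item thresholds on both metrics."""
    A = pstdev(b_target) / pstdev(b_source)
    B = mean(b_target) - A * mean(b_source)
    return A, B

def mean_mean(a_source, a_target, b_source, b_target):
    """Mean/mean constants from common-item slopes and thresholds."""
    A = mean(a_source) / mean(a_target)
    B = mean(b_target) - A * mean(b_source)
    return A, B

# Hypothetical common-item estimates on the source metric.
a_source = [1.2, 2.0, 1.6]
b_source = [-2.0, -1.0, 0.0, 0.5]

# Toy target-metric estimates built with known constants (0.8, 2.0), so both
# methods should recover them; real estimates would also contain sampling error.
a_target = [a / 0.8 for a in a_source]
b_target = [0.8 * b + 2.0 for b in b_source]

A_ms, B_ms = mean_sigma(b_source, b_target)
A_mm, B_mm = mean_mean(a_source, a_target, b_source, b_target)
```

With error-free toy data the two methods agree exactly; with real estimates they generally give somewhat different constants, which is one reason several methods are usually compared.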
To evaluate how well the two calibration approaches worked, we first examined whether the calibration converged. There is no closed-form solution for the item parameters in IRT. MULTILOG uses the marginal maximum likelihood method (18) to estimate the item parameters. The calibration is said to converge when item parameter estimates are obtained. Usually, non-convergence indicates a problem in fitting the data to the IRT model.
Non-convergence is evident when there are extreme item parameter estimates. The underlying scale of the IRT model is conventionally set to a mean of 0 and a standard deviation of 1. Therefore, any threshold parameter beyond ±6.0 is considered extreme, since it is six standard deviations away from the mean. Other evidence of non-convergence relates to the order of the threshold parameters. The graded response model assumes that the response options are monotonically ordered; consequently, the threshold parameters are ordered. Non-ordered threshold parameters in the graded response model indicate non-convergence or problematic model fitting. In this study, we examined the values of the item parameters and the order of the threshold parameters to evaluate how well the two calibration approaches worked.
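The two checks described above are simple to automate. The following Python sketch flags an item whose threshold estimates are extreme (beyond ±6.0) or not strictly increasing; the function name and the example values are hypothetical.

```python
def flag_item(thresholds, limit=6.0):
    """Return a list of problems found in one item's threshold estimates."""
    problems = []
    # Extreme estimates: more than `limit` standard deviations from the mean.
    if any(abs(b) > limit for b in thresholds):
        problems.append("extreme threshold")
    # The graded response model requires strictly increasing thresholds.
    pairs = zip(thresholds, thresholds[1:])
    if any(later <= earlier for earlier, later in pairs):
        problems.append("thresholds not increasing")
    return problems

# Hypothetical estimates: a well-behaved item and a problematic one.
ok_item = flag_item([-1.0, 0.2, 1.5])
bad_item = flag_item([0.5, 16.6, 3.0])
```

An empty list means that neither symptom was found; it does not by itself establish that the model fits the data.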
The results of the simultaneous IRT calibration are summarized in Table 2 for the pain interference domain. There were 29 pooled items and 1,364 subjects for the pain interference domain. The slope parameters were all reasonably large, ranging from 1.84 to 3.74, and all the threshold parameters were monotonically increasing. The mean θ for the CORE sample was fixed at zero, and the mean θ for the IMMPACT sample was 2.22. The item characteristic curves suggest that 10 response categories may be too many. Some of the middle response categories overlap with each other. Subjects with similar degrees of interference (based on estimated θ) were equally likely to endorse more than one of the overlapping response categories. On average, the IMMPACT sample reported higher levels of pain interference. This was expected, since all CORE subjects were cancer patients and not all of them experienced significant pain (6). In contrast, the IMMPACT sample comprised patients who had chronic pain from rheumatoid arthritis, osteoarthritis, lower back pain, fibromyalgia, diabetic neuropathy, and other neuropathic pain.
There were 10 pooled items and 1,364 subjects for the pain intensity domain. There was only one item that was answered by all subjects (Pain on average in past week). Only the patients administered the IMMPACT Pain Module survey and the CORE patients answered the other 9 items. The slope parameters were very high, from 2.78 to 4.73, except for the item “I have minor aches and pains” which had a slope of 0.90. This item was one of the unfolding model items (19), where the patients with no pain or with severe pain would both answer “None of the time.” This item does not fit the graded response model. Mean θ of the IMMPACT sample was 2.41, higher than the CORE sample. The item characteristic curves show that, again, the items with 10 response categories can be reduced to fewer response categories with minimal loss of information.
These results indicated that linking was applied successfully using the simultaneous calibration approach. The calibrations converged for both domains, and there were no significant problems observed for either the item parameters or item characteristic curves.
The pain interference domain provided a good test of the separated calibration approach. There were 7 common items between the IMMPACT Main Survey and the CORE items, with 959 and 400 subjects in each study, respectively. In addition, the IMMPACT Pain Module and CORE surveys shared 12 common items and had 148 and 400 subjects, respectively. We first estimated item parameters for the pain interference items for the IMMPACT sample. These included 959 subjects who answered the Main Survey and 148 subjects who also answered the Pain Module. There were 29 interference items in this data set: 17 items from the Main Survey and 12 from the Pain Module. Item calibration was also completed for the CORE pain interference items. There were 19 pain interference items completed by 400 CORE patients. There were 19 common items between these two data sets (7 from the Main Survey and 12 from the Pain Module). Unlike the simultaneous calibration, where the mean θ was fixed for one of the samples and estimated for the other, the mean θ was always fixed at zero in each separated calibration. That is, the mean θ was fixed at zero for both the IMMPACT and CORE samples when their parameters were calibrated in separate runs. The slope parameters ranged from 1.20 to 2.99 for the items in the IMMPACT sample, and from 2.49 to 5.96 for the CORE sample. The higher slopes for the CORE items were due to the higher correlation between the items and the more homogeneous responses among the CORE survey patients. The threshold parameters ranged from −5.56 to 0.66 for the IMMPACT items and from −0.11 to 1.92 for the CORE items. The ranges of the threshold parameters cannot be directly compared between the IMMPACT and CORE items at this stage, because the two calibrations are not yet on a common metric.
We also conducted separate calibrations for the items within the pain intensity domain. For the IMMPACT sample, only one item was answered by all IMMPACT patients, and the other 9 items were answered by the 148 Pain Module patients. Because of the small sample size, and since many of the response categories had zero counts, the calibration was not stable. This was evident in the item parameter estimates. One item had a slope of 0.13. Thresholds for some items were as high as 16.6 and even 37.0, and were not monotonically increasing. Based on this result, we chose not to attempt to link the items of the intensity domain using the separated calibration approach.
After the item parameters were estimated, the item parameters from the IMMPACT calibration were transformed to be on the same metric as the CORE calibration. Using the computer program STUIRT, we obtained the transformation constants, A and B (see equation 1), using four different scaling methods. The results are shown in Table 3 for the pain interference domain. We found that the transformation constants were very close among the four transformation methods, and those obtained using the Haebara and Stocking-Lord approaches were the most similar to each other. We used these constants to transform the item parameters of the non-common items, placing the items from the two separated calibrations onto a common metric. The transformed item parameters using the Stocking-Lord transformation method are shown in Table 4.
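The idea behind the Stocking-Lord method can be sketched as a loss function: for candidate constants A and B, it measures the squared difference between the test characteristic curves of the common items on the target metric and of the transformed source items. The Python sketch below is a simplified illustration under several assumptions (graded response items, an equally weighted θ grid, a one-directional criterion); the function names and parameter values are hypothetical, and it is not the STUIRT implementation. In practice a numerical optimizer would search for the A and B that minimize this loss.

```python
import math

def expected_score(theta, a, thresholds):
    """Expected item score under the graded response model (categories 0..K-1).

    Equals the sum of the cumulative probabilities P(X >= k) for k = 1..K-1.
    """
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)

def stocking_lord_loss(A, B, source_items, target_items, grid):
    """Sum of squared gaps between the two test characteristic curves."""
    loss = 0.0
    for theta in grid:
        tcc_target = sum(expected_score(theta, a, bs) for a, bs in target_items)
        # Source items are first moved to the target metric with (A, B).
        tcc_source = sum(
            expected_score(theta, a / A, [A * b + B for b in bs])
            for a, bs in source_items
        )
        loss += (tcc_target - tcc_source) ** 2
    return loss

# Hypothetical common items as (slope, thresholds) on the source metric.
source_items = [(1.6, [-2.0, -1.0]), (2.4, [-1.5, 0.0])]
# Toy target-metric parameters generated with known constants (0.8, 2.0).
target_items = [(a / 0.8, [0.8 * b + 2.0 for b in bs]) for a, bs in source_items]
grid = [x / 2 for x in range(-8, 9)]  # theta from -4 to 4 in steps of 0.5

loss_at_truth = stocking_lord_loss(0.8, 2.0, source_items, target_items, grid)
loss_elsewhere = stocking_lord_loss(1.0, 0.0, source_items, target_items, grid)
```

In this toy setting the loss is essentially zero at the generating constants and larger elsewhere, which is the property the minimization exploits.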
Next, we compared the simultaneous approach and the separate approach for the pain interference domain. Note that although we were comparing the item parameters of the same items between the two approaches, the comparison was not direct. The IMMPACT and CORE items were on the same scale as a result of the simultaneous calibration approach. The IMMPACT and CORE items were also on the same scale as a result of the separated calibration approach. However, the measurement scales for the two approaches were not on the same metric scale. The two sets of “item bank” calibrations differed by a linear transformation. Therefore, another scale transformation was necessary to make the two banks on the same mathematical metric in order to compare the simultaneous calibration and separated calibration approaches.
We correlated the item parameters of the two approaches. The correlation between the slope parameters was 0.923 (Figure 2). The correlations between the threshold parameters ranged between 0.911 and 0.992, except for the first threshold parameter (b1, the point at which the probability of endorsing the second or a higher response category passes 50%). Figure 3, the scatter plots of the item threshold parameters, shows a straight-line relationship between the linking methods, except for one outlier for b1. The b1 parameter for the item “Pain interference with your daily activities,” a 4-point scale item, was very small because there were zero observations for the lowest response option. Without this outlier, the two sets of item parameters were very similar. We also examined the descriptive statistics and correlations of the IRT scores of the two approaches. Table 5 shows these statistics, including the correlations between the IRT scores for the pain interference domain obtained with the two approaches. The two scales differed by a factor of 0.784, the ratio of the standard deviations of the IRT scores of the CORE sample (0.821/1.047). The correlations between the IRT scores of the two approaches (see Table 5) were as high as 0.999 for the IMMPACT sample, for the CORE sample, and overall. Thus, the two calibration approaches produced very similar item characteristics.
In this study, we demonstrated how pain items from separate surveys can be linked to the same measurement scale to form a single item bank when there are some shared common items. For the datasets used in this study, we obtained better linking results using simultaneous calibration when the sample size was relatively small. This is demonstrated by the results for the pain intensity domain, where the separated calibration did not converge. All but one of the items in the pain intensity domain were answered by only 148 subjects in the IMMPACT sample. When the items were calibrated separately, we obtained extreme item parameter estimates (threshold parameter estimates as high as 16.6 and 37.0), and some of the threshold parameters were not monotonically ordered.
These findings correspond to simulation studies that have found the simultaneous calibration method produced more accurate results than separated calibration when the IRT model fits the data (13, 16). Cook et al. (20) also found that sample sizes of 300 or more subjects were necessary for linking health outcome measures. When sample size is small, the separated calibration approach is not advisable based on the finding of non-convergence results.
For the pain interference domain, we had a sufficient sample size for linking scales. In this case, we obtained similar results when linking by simultaneous calibration and by separated calibration (as shown in Table 5). Therefore, the results of this study suggest that simultaneous calibration is preferable when linking sets of items from two surveys, particularly with smaller sample sizes and fewer common items. However, with a sufficient sample size and enough common items, separated calibration approaches can yield results similar to those of simultaneous calibration.
There is no fixed rule regarding the number of common items across two samples. Angoff (21) suggested that common items should contain the larger of 20 items or 20% of the total number of items. Other investigators (22) recommend from 5 to 10 items to form the link between samples. In general, a larger number of common items will result in more precise and stable item calibrations in the bank. In our comparison, even when there were few (< 5) common items, we found that the simultaneous IRT calibration method produced converged item parameter estimates across independent samples of pain intensity and pain interference items. The results of this study suggest that simultaneous calibration methods should be the recommended approach for linking sets of health domain items across two or more studies.
Scale linking is common in education owing to the need to maintain consistent and comparable test scores across many test forms and test administrations. Reports of empirical studies of scale linking in education are numerous (13, 18, 23, 24, 25, 26, 27, 28). Linking scales across samples is less common in health outcome research. While the methods and experience may be borrowed from education, the content and items of health outcome measures are distinctive, and the appropriateness of different linking methods requires further exploration. Testing in education is routine and longitudinal, and the population is relatively homogeneous. The sample sizes are often large, and the number of test items is almost unlimited. In health outcome studies, the population is more heterogeneous, the sample sizes are frequently small, and the number of items may be limited. However, the small sample sizes and limited item banks in health outcome studies make linking scales and items all the more important. Through linking, small studies can be pooled to create larger samples and larger, more diverse item banks. More empirical research is needed to understand and further explicate how best to link health outcome instruments and item banks. Further guidance is needed as to the best methods, along with empirically based recommendations for accomplishing linking between different health outcome studies.
In conclusion, patient-reported health outcome surveys are often limited in sample size and in the number of items owing to the difficulty of recruitment and the burden on patients. Thus, the surveys’ content coverage may be restricted and their statistical power limited. Using IRT methodology, survey data can be pooled so that studies lend strength to each other, expanding the content coverage and increasing the sample size. This in turn increases the statistical power of the pooled study and provides a more comprehensive item bank. In addition, linking the two studies allows scores to be placed on the same metric, which improves the comparison of findings between studies even if the same set of items was not included in both studies.