The aims of this paper are to present findings related to differential item functioning (DIF) in the Patient Reported Outcome Measurement Information System (PROMIS) depression item bank, and to discuss potential threats to the validity of results from studies of DIF. The 32 depression items studied were modified from several widely used instruments. DIF analyses of gender, age and education were performed using a sample of 735 individuals recruited by a survey polling firm. DIF hypotheses were generated by asking content experts to indicate whether or not they expected DIF to be present, and the direction of the DIF with respect to the studied comparison groups. Primary analyses were conducted using the graded item response model (for polytomous, ordered response category data) with likelihood ratio tests of DIF, accompanied by magnitude measures. Sensitivity analyses were performed using other item response models and approaches to DIF detection. Despite some caveats, the items that are recommended for exclusion or for separate calibration were “I felt like crying” and “I had trouble enjoying things that I used to enjoy.” The item, “I felt I had no energy,” was also flagged as evidencing DIF, and recommended for additional review. On the one hand, false DIF detection (Type 1 error) was controlled to the extent possible by ensuring model fit and purification. On the other hand, power for DIF detection might have been compromised by several factors, including sparse data and small sample sizes. Nonetheless, practical and not just statistical significance should be considered. In this case the overall magnitude and impact of DIF was small for the groups studied, although impact was relatively large for some individuals.
Surveys that assess latent traits or states such as attitudes, affect, and health often use scales in order to increase the likelihood of accurate measurement. Conceptual and psychometric measurement equivalence of such scales are basic requirements for valid cross-cultural and demographic subgroup comparisons. Differential item functioning (DIF) analysis, commonly used to study the performance of items in scales, examines whether or not the likelihood of item (category) endorsement is equal across subgroups that are matched on the state or trait measured. Two basic types of DIF examined are uniform and non-uniform. Uniform DIF implies that one group is consistently more likely than another to endorse an item at each level of the trait or state, e.g., depression. Non-uniform DIF is observed when there is cross-over, so that at certain levels of the state or trait, one group is more likely to endorse the item, while at other levels, the other group is more likely to endorse the item (see also the Glossary).
The underlying state for the data presented in this paper is depression, and the items are scored in the impaired direction, reflecting depression symptomatology. Socio-cultural and health-related factors appear to affect the response patterns as well as the factorial composition of scales assessing depressive symptomatology (Mui, Burnette & Chen, 2001; Pedersen, Pallay & Rudolph, 2002). The variety of depression scales, the differences in the methodology, and the diversity in group variables examined for DIF across the reviewed studies make the synthesis of findings a difficult task. However, consistent findings across the articles suggest the presence of differential functioning in a substantial number of CES-D (Radloff, 1977) items as a function of a variety of sociodemographic and health-related variables. For example, the authors of several studies found that one or both of the interpersonal items, “people are unfriendly” and “people dislike me” showed DIF with respect to one or more of several variables: race, physical disorder, stroke, and interview mode (Chan, Orlando, Ghosh-Dastidar & Sherbourne, 2004; Cole, Kawachi, Maller & Berkman, 2000; Grayson, Mackinnon, Jorm, Creasey & Broe, 2000; Pickard, Dalal & Bushnell, 2006; Yang & Jones, 2007). Similarly, the affective CES-D item tapping sadness showed DIF based on physical disorder and interview mode (Grayson et al.; Chan et al.). DIF was also observed in the “crying” items (contained in the CES-D, and most other depression scales) with respect to gender (Cole et al.; Gelin & Zumbo, 2003; Reeve, 2000; Yang & Jones), race/ethnicity (Spanish-speakers) (Azocar, Areán, Miranda & Muñoz, 2001; Teresi & Golden 1994), physical disorder (Grayson et al.), and stroke (Pickard et al.).
The impact of DIF in the CES-D was found to be substantial in some studies (Chan, et al., 2004; Cole et al., 2000), but less so in others (Pickard et al. 2006). Low impact of DIF was also observed in the Hospital Anxiety and Depression Scale (Zigmond & Snaith, 1983) in one study of breast cancer patients (Osborne, Elsworth, Sprangers, Oort & Hopper, 2004). The impact of DIF in the Beck Depression Inventory (BDI) (Beck, Ward, Mendelsohn, Mock & Erbaugh, 1961) was demonstrated to be sizable, with artificially inflated scores for Latinos (in contrast to English speakers) (Azocar et al. 2001). Similarly, another analysis demonstrated that half of the items on the BDI scale accounted for 80% of the differential test (scale) functioning, and item response theory (IRT)-adjusted cutoff scores reduced considerably the false negative rate of clinically diagnosed patients with depression who would have been classified as non-depressed without DIF adjustment (Kim, Pilkonis, Frank, Thase & Reynolds, 2002). As is demonstrated by these studies, the findings of salient DIF in many depression measures underscore the need for examination of DIF in items measuring depression. A more detailed review can be found in Teresi, Ramirez, Lai and Silver (2008).
The aims of this paper are to present findings related to DIF in the Patient Reported Outcome Measurement Information System (PROMIS) (Cella et al. 2007; Reeve et al. 2007) depression item bank, and to discuss strengths and limitations of approaches used in DIF detection analyses. Analyses of gender, age and education were performed. Sample sizes were insufficient to examine race/ethnicity.
The overall sample is discussed in Liu et al. (under review). However, a brief description of the subsamples used for the primary and sensitivity analyses is presented here. These data are from individuals who were administered the full bank of emotional distress items; data were collected from a survey panel by a polling firm, Polimetrix (www.polimetrix.com; www.pollingpoint.com).
The studied (also called the focal) group was females in the analyses of gender; the sample sizes for the groups were 379 females and 356 males. In the analyses of education, the studied group was low education through some college (n=518), and the reference group was college or advanced degree (n=217). The studied group for age was those 65 and over (n=201); the sample size for the younger reference group was 533. The sensitivity analyses sample sizes were 258 for the group aged 60 and over, and 476 for the group under age 60.
Depressive symptoms was a subdomain of emotional distress, and the 32 depression items studied were modified from several instruments, including two items from the Geriatric Depression Scale (Yesavage et al. 1982), one from the BDI (Beck et al. 1961), four from the CES-D (Radloff, 1977), and three from the Medical Outcomes Study (Stewart, Ware, Sherbourne & Wells, 1992). Other items came from an assortment of sources. It is noted that many of the items are quite similar to those from popular and older scales used cross-nationally, such as the Depression scale from the Geriatric Mental State or the Comprehensive Evaluation and Referral Examination (CARE) (Gurland et al. 1976; Copeland et al. 1976; Golden, Teresi & Gurland, 1984), and the most recent rendition of these instruments, the EURO-D (Prince et al. 1999). The timeframe for all items was the past 7 days. Items were administered using a five point response scale: ‘never’, ‘rarely’, ‘sometimes’, ‘often’ and ‘always’. Because of sparse data, the categories, ‘often’ and ‘always’ were collapsed, resulting in four ordinal response categories for the preliminary analyses; however, the final analyses required collapsing ‘sometimes’, ‘often’ and ‘always’, resulting in three categories. Sensitivity analyses of binary data were conducted, collapsing ‘never’ and ‘rarely’ vs. the other categories.
Extensive qualitative analyses, including focus groups and cognitive interviews were performed prior to data collection. Based on these data, the items were modified for use in PROMIS in order to refer to the same time frame, have the same response options, and target a 6th grade reading level (see DeWalt, Rothrock, Yount & Stone, 2007). Thirteen focus groups with 104 participants, largely from outpatient psychiatric clinics were convened (DeWalt et al.). Individuals were selected to be representative of a variety of chronic diseases, cultures and ages. Cognitive interviews were conducted; examined were the meaning of the item, the recall and decision process, including social desirability, and the response process. The protocol was based on that of Willis (2005). All questions were first completed by respondents using a paper-and-pencil format, followed by cognitive interviews with probes to elicit the information; five cognitive interviews were performed for each item (see DeWalt et al.).
DIF hypotheses were generated by asking a set of clinicians and other content experts to indicate whether or not they expected DIF to be present, and the direction of the DIF with respect to several comparison groups: gender, age, race/ethnicity, language and education. (Hypotheses with respect to race/ethnicity were also elicited, but subgroup sample sizes were not sufficient for DIF analyses.) A definition of DIF was provided, and the following instructions related to hypotheses generation were given. “Differential item functioning means that individuals in groups with the same underlying trait (state) level will have different probabilities of endorsing an item. Put another way, reporting a symptom (e.g., crying frequency) should depend only on the level of the trait (state), e.g., depression, and not on membership in a group, e.g., male or female. Very specifically, randomly selected persons from each of two groups (e.g., males and females) who are at the same (e.g., mild) level of depression should have the same likelihood of reporting crying often. If it is hypothesized that this is not the case, it would be hypothesized that the item has gender DIF.” Forms were developed for this purpose, and completed by 11 individuals (four clinical health psychologists, one clinical psychologist, two psychiatrists, one oncology nurse, and three “other” professionals). A summary table (available from the authors) was developed arraying the hypotheses and findings from the literature.
Prior to any formal tests of DIF, a best practice recommended by Hambleton (2006) was used to examine the data at a basic, descriptive level. Ten equal intervals of the sum score were formed based on the focal group sums. The item means were examined for each group within each of the levels, and tested for significant group differences. Such “fat matching” (Dorans & Kulick, 2006) is not ideal; however sparse data precluded finer distinctions.
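The descriptive check described above can be sketched in a few lines. This is a minimal illustration only, not the authors' procedure: it assumes "equal intervals" means equal-width intervals spanning the focal group's range of sum scores, and the function name, record layout, and group labels are hypothetical.

```python
from statistics import mean

def interval_item_means(records, n_intervals=10):
    """records: list of (group, sum_score, item_score) tuples.

    Returns {(interval_index, group): mean item score}.
    Assumption: intervals are equal-width and span the focal group's sum scores.
    """
    focal_sums = [s for g, s, _ in records if g == "focal"]
    lo, hi = min(focal_sums), max(focal_sums)
    width = (hi - lo) / n_intervals or 1.0  # guard against a zero-width range
    cells = {}
    for group, sum_score, item_score in records:
        k = int((sum_score - lo) / width)
        # Clamp so scores at or beyond the focal range land in the end intervals.
        k = max(0, min(n_intervals - 1, k))
        cells.setdefault((k, group), []).append(item_score)
    return {key: mean(vals) for key, vals in cells.items()}

# Hypothetical toy data: (group, sum score, score on one item)
toy = [("focal", 0, 0), ("focal", 10, 2),
       ("reference", 0, 1), ("reference", 5, 1), ("reference", 10, 2)]
result = interval_item_means(toy)
```

Within each interval, the two group means for an item can then be compared; with sparse data, many cells will hold few or no respondents, which is the limitation of "fat matching" noted above.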
Item response theory (IRT) is often used in DIF analyses (see Hambleton, Swaminathan & Rogers, 1991; Lord, 1980; Lord & Novick, 1968). The method used for DIF detection that is described in this paper was the IRT log-likelihood ratio (IRTLR) approach (Thissen, 1991, 2001; Thissen, Steinberg & Gerard, 1986; Thissen, Steinberg & Wainer, 1993), accompanied by magnitude measures (Teresi, Kleinman & Ocepek-Welikson, 2000). DIF magnitude was assessed using the non-compensatory DIF (NCDIF) index (Raju, van der Linden & Fleer, 1995; Flowers, Oshima & Raju, 1999). Finally, scale level impact was assessed using expected scale scores, expressed as group differences in the total test (scale) response functions. These latter functions show the extent to which DIF cancels at the scale level (DIF cancellation). The findings presented here focus on IRTLR; however, other methods were also used in sensitivity analyses to examine DIF in this item set. These other methods include SIBTEST (Shealy & Stout, 1993a,b) for binary items and Poly-SIBTEST (Chang, Mazzeo & Roussos, 1996) for polytomous items. SIBTEST is non-parametric, conditioning on the observed rather than latent variable, and does not detect non-uniform DIF.
A second method used in sensitivity analyses was logistic regression (Swaminathan and Rogers, 1990) and ordinal logistic regression (OLR) (Zumbo, 1999; Crane, van Belle & Larson 2004), which typically condition on an observed variable. Uniform DIF is defined in the OLR framework as a significant group effect, conditional on the depression state; nonuniform DIF is a significant interaction of group and state. Three hierarchical models are tested; the first examines depression state (1), followed by group (2) and the interaction of group by state (3). Non-uniform DIF is tested by examining 3 vs. 2; then uniform DIF is tested by examining the incremental effect of 2 vs. 1, with a chi-square (1 d.f.) test (Camilli & Shepard, 1994). A modification, IRTOLR (Crane et al. 2004; Crane, Gibbons, Jolley and van Belle, 2006) uses the depression estimates from a latent variable IRT model, rather than the traditional observed score conditioning variable, and incorporates effect sizes into the uniform DIF detection procedure. Finally, also used was the multiple indicators, multiple causes (MIMIC) approach (Jöreskog & Goldberger, 1975; Muthén, 1984), which is a parametric model with conditioning on a latent variable; while related to IRT, the model comes from the tradition of factor analyses and structural equation modeling, and does not test for non-uniform DIF (see also Jones, 2006).
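The three hierarchical OLR models can be written out as follows. This is a sketch that assumes a proportional-odds (cumulative logit) form; the symbols are introduced here for illustration: Y is the item response, k a response category, θ the conditioning variable (the observed score in traditional OLR, an IRT estimate in IRTOLR), G the group indicator, and LL a model log-likelihood.

```latex
\begin{align*}
\text{Model 1:}\quad & \operatorname{logit} P(Y \ge k \mid \theta, G) = \alpha_k + \beta_1 \theta \\
\text{Model 2:}\quad & \operatorname{logit} P(Y \ge k \mid \theta, G) = \alpha_k + \beta_1 \theta + \beta_2 G \\
\text{Model 3:}\quad & \operatorname{logit} P(Y \ge k \mid \theta, G) = \alpha_k + \beta_1 \theta + \beta_2 G + \beta_3 (\theta \times G) \\[6pt]
\text{Non-uniform DIF:}\quad & G^2_{3\,\text{vs}\,2} = (-2LL_2) - (-2LL_3) \sim \chi^2_{1} \\
\text{Uniform DIF:}\quad & G^2_{2\,\text{vs}\,1} = (-2LL_1) - (-2LL_2) \sim \chi^2_{1}
\end{align*}
```

Under this form, a significant β₃ corresponds to non-uniform DIF and, in its absence, a significant β₂ corresponds to uniform DIF.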
The following analyses were conducted using the graded (for polytomous, ordered response category) item response model (Samejima, 1969). Sensitivity analyses were performed using a two parameter logistic model (for items that were collapsed into two categories, non-symptomatic and symptomatic). The graded response model is given in the glossary under IRT.
The expectation is that respondents who are depressed would be more likely than those who are not depressed to respond in a symptomatic direction to an item measuring depression. Conversely, a person without depression is expected to have a lower probability (than a person with depression) of responding in a depressed direction to the item. The curve that relates the probability of an item response to the underlying state or trait, e.g., depression, measured by the item set is known as an item characteristic curve (ICC). This curve can be characterized by two parameters in some forms of the model: a discrimination parameter (denoted a) that is proportional to the slope of the curve, and a location (also called threshold, difficulty, or severity) parameter (denoted b) that is the point of inflection of the curve. (See also the Glossary for definitions.) According to the IRT model, an item shows DIF if people from different subgroups but at the same level of depression have unequal probabilities of endorsement. Put another way, the absence of DIF is demonstrated by ICCs that are the same for each group of interest.
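To make the ICC concrete, the following minimal sketch computes the two-parameter logistic curve and, from it, graded response category probabilities. It assumes the plain logistic form without a scaling constant; the function names and all parameter values are hypothetical and are not estimates from this study.

```python
import math

def icc_2pl(theta, a, b):
    """Probability of a symptomatic response at depression level theta (2PL form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def graded_category_probs(theta, a, bs):
    """Graded response model: bs are ordered thresholds (one fewer than categories).

    Each category probability is the difference between adjacent
    cumulative ("this category or higher") curves.
    """
    cum = [1.0] + [icc_2pl(theta, a, b) for b in bs] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]

# Hypothetical uniform DIF: same discrimination, higher severity for one group.
p_reference = icc_2pl(0.0, a=1.5, b=0.0)
p_focal = icc_2pl(0.0, a=1.5, b=0.5)
probs = graded_category_probs(0.0, a=1.5, bs=[-1.0, 1.0])
```

Here two people at the same depression level (theta = 0) have different endorsement probabilities solely because the b parameter differs by group, which is what uniform DIF means in this framework.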
IRTLR, the DIF detection procedure used in these analyses, is based on a nested model comparison approach (Thissen et al. 1993). First, a compact (or more parsimonious) model (model 1), with all parameters constrained to be equal across groups for a studied item (together with the anchor items defined below and in the Glossary), is tested against an augmented model (model 2) with one or more parameters of the studied item freed to be estimated distinctly for the two groups. The procedure involves comparison of differences in log-likelihoods (−2LL) (distributed as chi-square) associated with nested models; the resulting statistic is evaluated for significance with degrees of freedom equal to the difference in the number of parameter estimates in the two models. For the graded response model, the degrees of freedom increase with the number of b (difficulty or severity) parameters estimated. (There is one less b estimated than there are response categories.) Severity (b) parameters are interpreted as uniform DIF only if the tests of the a parameters are not significant; in that case, tests of b parameters are performed, constraining the a parameters to be equal. The final p values are adjusted using Bonferroni (Bonferroni, 1936) or other methods such as Benjamini-Hochberg (B-H) (Benjamini & Hochberg, 1995; Thissen, Steinberg & Kuang, 2002).
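The nested model comparison and the multiple-comparison adjustments can be illustrated with a short sketch. It is not the IRTLRDIF implementation: the function names are hypothetical, the chi-square p-value is computed in closed form for 1 or 2 degrees of freedom only, and all numeric inputs are made up for illustration.

```python
import math

def lr_test(neg2ll_compact, neg2ll_augmented, df):
    """Likelihood ratio statistic (difference in -2LL) and its chi-square p-value."""
    g2 = max(neg2ll_compact - neg2ll_augmented, 0.0)
    if df == 1:
        p = math.erfc(math.sqrt(g2 / 2.0))   # chi-square survival function, 1 df
    elif df == 2:
        p = math.exp(-g2 / 2.0)              # chi-square survival function, 2 df
    else:
        raise ValueError("closed form implemented for df = 1 or 2 only")
    return g2, p

def bonferroni(p_values, alpha=0.05):
    """Reject where p <= alpha divided by the number of tests."""
    return [p <= alpha / len(p_values) for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p <= (rank/m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            largest_k = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= largest_k
    return rejected

# Made-up -2LL values and p-values, for illustration only.
g2, p = lr_test(1010.0, 1006.16, df=1)
flags_bh = benjamini_hochberg([0.001, 0.02, 0.03, 0.5])
flags_bonf = bonferroni([0.001, 0.02, 0.03, 0.5])
```

With these illustrative p-values, Bonferroni retains only the smallest one while B-H retains three, which shows why the two adjustments can disagree in general even though they agreed for the results reported in this paper.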
Important first steps (not presented here) in the analyses include examination of model assumptions such as unidimensionality (see Reise, Morizot & Hays, 2007). These analyses were conducted prior to release of these data sets for DIF analyses, and provided evidence of essential unidimensionality. A standardized residual measure of goodness-of-fit was also calculated; it is defined as the difference between the observed and expected frequency, divided by the square root of the expected frequency, for each response pattern associated with a particular level of the underlying state or trait (denoted theta) measured by the scale. The standardized residual is distributed approximately normally, with a mean of 0 and a variance (σ²) of 1. High values are indicative of poor fit.
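In symbols, the standardized residual described above can be written as follows, where O and E (notation introduced here for illustration) are the observed and model-expected frequencies for a response pattern at a given level of theta:

```latex
z = \frac{O - E}{\sqrt{E}}
```

As a purely hypothetical example, an observed frequency of 30 against an expected frequency of 25 gives z = (30 − 25)/√25 = 1.0, well within the range expected under good fit.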
If no prior information about DIF in the item set is available, initial DIF estimates can be obtained by treating each item as a “studied” item, while using the remainder as “anchor” items. Anchor items are assumed to be without DIF, and are used to estimate theta (depression state level), and to link the two groups compared in terms of depression state level. Anchor items are selected by first comparing a model with all parameters constrained to be equal between two comparison groups, including the studied item, and a model with separate estimation of all parameters for the studied item. This process of log-likelihood comparisons is performed iteratively, and is described in detail in Orlando-Edelen, Thissen, Teresi, Kleinman, and Ocepek-Welikson (2006).
The magnitude of DIF refers to the degree of difference in item performance between or among groups, conditional on the trait or state being examined. Expected item scores can be examined as measures of magnitude. (See Figure 1 for examples.) An expected item score is the sum of the weighted (by the response category value) probabilities of scoring in each of the possible categories for the item. A method for quantification of the difference in the average expected item scores is the non-compensatory DIF index (Raju and colleagues, 1995) used in DFIT (Oshima, Kushubar, Scott, Raju, 2009; Raju, Fortmann-Johnson, Kim, Morris, Nering, & Oshima, 2009). While chi-square tests of significance are available, these were found to be too stringent, over identifying DIF. Cutoff values established based on simulations (Fleer, 1993; Flowers et al. 1999) can be used in the estimation of the magnitude of item-level DIF. For example, for the data presented here, the cutoff values are 0.006 for binary items, and 0.024 and 0.054 for polytomous items with three or four response options (after collapsing categories due to sparse data) (Raju, 1999). Because NCDIF is expressed as the average squared difference in expected scores for individuals as members of the focal group and as members of the reference group, the square root of NCDIF provides an effect size in terms of the original metric. Thus, for a polytomous item with three response categories, the recommended cutoff of 0.024 would correspond to an average absolute difference greater than 0.155 (about 0.16 of a point) on a three point scale (see Raju, 1999; Meade, Lautenschlager & Johnson, 2007). 
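The expected item score and the NCDIF index can be sketched as below. This is an illustration under stated assumptions rather than the DFIT implementation: categories are scored 0, 1, 2, ... (the choice of origin cancels in the group difference), the graded response model uses the plain logistic form, and the function names and parameter values are hypothetical.

```python
import math

def boundary(theta, a, b):
    """Probability of responding in a given category or higher."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_item_score(theta, a, bs):
    """Sum over categories of (category value x category probability)."""
    cum = [1.0] + [boundary(theta, a, b) for b in bs] + [0.0]
    return sum(k * (cum[k] - cum[k + 1]) for k in range(len(bs) + 1))

def ncdif(focal_thetas, focal_params, reference_params):
    """Average squared difference in expected item scores over focal-group thetas.

    Each params tuple is (a, bs): the item scored once with the focal
    parameters and once with the reference parameters.
    """
    diffs = [expected_item_score(t, *focal_params)
             - expected_item_score(t, *reference_params)
             for t in focal_thetas]
    return sum(d * d for d in diffs) / len(diffs)

# Hypothetical three-category item with a uniform shift in severity.
thetas = [-1.0, -0.5, 0.0, 0.5, 1.0]
index = ncdif(thetas, (1.5, [0.2, 1.2]), (1.5, [0.0, 1.0]))
effect_size = math.sqrt(index)           # back on the original item-score metric
cutoff_effect = math.sqrt(0.024)         # three-category cutoff, about 0.155
```

Taking the square root returns the index to the item-score metric, which is how the three-category cutoff of 0.024 translates to roughly 0.155 of a point.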
Because of the sensitivity of cutoff thresholds to the distribution of parameter estimates, simulations to derive cutoffs based on empirical distributions have been incorporated into the latest versions of software such as DFIT (Raju, Fortmann-Johnson, Kim, Morris, Nering, & Oshima, 2009) and ordinal logistic regression (Choi, Gibbons & Crane, 2009). The issue is what difference is meaningful and makes a practical difference. Recent work on effect sizes is presented in Stark, Chernyshenko & Drasgow (2004); Steinberg & Thissen (2006); and Kim, Cohen, Alagoz & Kim (2007).
Expected item scores were summed to produce an expected scale score (also referred to as the test or scale response function), which provides evidence regarding the effect of DIF on the total score. Group differences in these test response functions provide overall aggregated measures of DIF impact. Impact at the individual level was examined by comparing DIF-adjusted and unadjusted estimates of the latent depression state scores. Estimates were adjusted for all items with DIF, not just for those with DIF after adjustment for multiple comparisons or those with high DIF magnitude.
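DIF cancellation at the scale level can be seen in a small constructed example. The sketch below uses two binary items for simplicity (the primary analyses used polytomous items), and the parameter values are hypothetical, chosen so that the two items show DIF of equal size in opposite directions.

```python
import math

def p_endorse(theta, a, b):
    """Expected score for a binary item (probability of a symptomatic response)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_scale_score(theta, items):
    """Test (scale) response function: sum of expected item scores."""
    return sum(p_endorse(theta, a, b) for a, b in items)

# (a, b) pairs; the focal group's items are shifted in opposite directions.
reference_items = [(1.5, 0.0), (1.5, 0.0)]
focal_items = [(1.5, 0.3), (1.5, -0.3)]

item_gap = abs(p_endorse(0.0, 1.5, 0.3) - p_endorse(0.0, 1.5, 0.0))
scale_gap = abs(expected_scale_score(0.0, focal_items)
                - expected_scale_score(0.0, reference_items))
```

At theta = 0 each item differs between groups by about 0.11 in expected score, yet the scale-level difference is essentially zero. In this constructed case the cancellation is exact only at theta = 0 and approximate elsewhere.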
Software used was IRTLRDIF (Thissen, 2001) and MULTILOG (Thissen, 1991). Additionally, NCDIF (Raju et al. 1995; Flowers et al. 1999) was evaluated using DFITP5 (Raju, 1999). Prior to application of the DFIT software, estimates of the latent state or trait (theta) are usually calculated separately for each group, and equated together with the item parameters. Baker’s (1995) EQUATE program was used in an iterative fashion in order to equate the theta and item parameter estimates for the two groups and place them on the same metric. If DIF is detected, the item showing DIF is excluded from the equating algorithm, and new DIF-free equating constants are computed, and purified iteratively.
The results of preliminary analyses of group item means within sum score intervals provided information used in collapsing categories for use in more formal tests of DIF. The analyses showed that (a) the data were sparse and skewed; (b) the distributions were different for age groups; (c) a few items emerged as likely candidates for flagging, with findings consistent with the formal DIF analyses using the various methods.
Examination of standardized residuals (not shown) showed that most items fit the IRT model for the age and education analyses (after collapsing categories from four to three). Collapsing categories to three resolved the problem of sparse data for most analyses; however cell sizes were relatively small (12–20 in the symptomatic category above “none/never”) for the high education group for some items, e.g., “I felt I had no reason for living.” The three category solution resulted in all items fitting for the low education group except for slight misfit in category 3 for items 4 and 24 (z=2.13). For high education, some misfit was evident with z scores ranging from 2 to 4. For age, no misfit was observed among older subjects (z=−1.01 to 1.47). For younger subjects, some misfit was observed, with z scores ranging from 2 to 3. For gender, all items fit the IRT model with four response categories for females (z=−1.20 to 1.47); however, some misfit was observed for males. Reduction to three categories reduced the sparse data and misfit; however, some items still evidenced relatively high misfit (z=3.00 to 5.00).
Shown in Table 1 are the final item parameters and DIF tests for gender. The most severe indicator of depression was “no reason for living”; among the least severe indicators were “I felt that I had no energy”, and “felt lonely”. Females were more depressed; the estimated mean was −0.27 for females, and −0.55 for males, indicating that the difference between the average depression levels for women and men was about one fourth of a standard deviation. As shown, eleven items evidenced gender DIF prior to adjustment for multiple comparisons, two with non-uniform DIF. Items with non-uniform DIF were: “felt helpless” and “sad”. Uniform DIF was evident for the items: “crying”, “nothing could cheer me up”, “people did not understand me”, “trouble feeling close to people”, “depressed”, “unhappy”, “nothing interesting”, “life was empty”, “trouble enjoying things I used to do”. After adjustment for multiple comparisons, using either the Benjamini-Hochberg (1995) or Bonferroni (1936) correction, only the item “I felt like crying” showed uniform DIF; the NCDIF index for this item (0.074) was above the cutoff. The item is a more severe indicator for males; it takes higher levels of depression for men to endorse the item. The magnitude of DIF for this item is shown in Figure 1.
The most severe indicators for the education analyses were: “no reason for living”, “worthless”, “helpless”, “nothing cheers me up”, “wanted to give up”. The estimated mean for the depression state for the low education group was somewhat higher than that of the high education group (−0.38 vs −0.49). Four items were found to have education-related DIF prior to Bonferroni/Benjamini-Hochberg correction, two with non-uniform DIF, “felt hopeless”, and “pessimistic”. After adjustment, no items evidenced education-DIF (see Table 2). Overall, the magnitude of DIF was small (see Table 4), and only one item had NCDIF above the cutoff, “I felt that I had no energy”. This item was a more severe indicator for those with higher education (see Figure 1).
Shown in Table 3 are the analyses of age. The original analyses of age with four response categories produced sparse data and very high a parameter estimates, resulting in false non-uniform DIF detection, and the identification of only three anchor items. In order to reduce sparse data, the top three categories were collapsed, yielding three response categories. The most severe indicator was “no reason for living”; among the least severe was “no energy”. Comparison of the distributions indicated less depression for the older than for the younger cohort (estimated μ = −0.84 for older respondents vs. −0.23 for the younger group). The results of these analyses produced 15 anchor items. Before adjustment for multiple comparisons, 13 items showed DIF, one with non-uniform DIF; after Bonferroni/B-H correction, two items showed uniform DIF: “I felt I had nothing to look forward to” (NCDIF=0.031), and “I had trouble enjoying the things I used to enjoy” (NCDIF=0.080). Both of these items were more severe indicators for the younger cohort. Two other items had NCDIF values above the cutoff: “I felt like crying” (NCDIF=0.065) and the item, “I found that things in my life were overwhelming” (NCDIF=0.026). Both items evidenced uniform DIF prior to adjustment for multiple comparisons, and were more severe indicators for the older cohort.
It is possible that lack of model fit and sparse data may have resulted in false DIF detection in the earlier analyses with four response categories; thus as stated above, the primary analyses were performed using three response categories. Additionally, because some depression scales use binary versions of many of the items examined, and in order to obtain more robust results, IRTLR analysis was also performed using a binary version of the items: not present vs. symptomatic (combining the categories above none and rarely into the value, 1). This reduced the number of parameters estimated, and also reduced the sparse data.
Consistent with the primary analyses, only one item showed significant gender DIF in the sensitivity analyses, the crying item. The NCDIF index for the crying item ranged across sensitivity analyses from 0.043 to 0.091. This represents a relatively small effect size or absolute difference ranging from 0.21 to 0.30 on a two to four point scale, depending on the analyses.
For education, after adjustment for multiple comparisons, no items evidenced DIF; this is the same result found in the primary analyses. The NCDIF index was low for all items, consistently showing low magnitude of DIF. However, in the primary analyses with three categories, one item, “I felt that I had no energy”, although not significant after adjustment for multiple comparisons, did evidence NCDIF over the threshold for education comparisons.
For age, the results for the analysis examining the focal group, aged 60 years and over instead of 65 and over showed that the item, “nothing to look forward to” showed uniform DIF, and the item, “trouble enjoying things I used to” showed non-uniform DIF after Bonferroni/B-H correction. The latter finding was consistent with the findings for the polytomous version, in which the item was found to show uniform DIF of relatively high magnitude. The NCDIF ranged from 0.046 to 0.089 across sensitivity analyses for this item.
The various analyses of age produced similar parameter estimates. The item, “I had trouble enjoying the things I used to do” evidenced uniform DIF; across most of the range of the depression scores, the probability of endorsement was higher for the older than the younger cohort. Regardless of analyses, this item showed significant, relatively higher magnitude DIF.
A concern is that sparse data may have produced spurious (large and inconsistent) a parameters. The correlations between the a parameter estimates for the final models and those used in the sensitivity analyses ranged from 0.905 to 0.998 for age, from 0.895 to 0.995 for education, and from 0.934 to 0.992 for gender, providing some evidence for the consistency of estimates for the final models.
The consistent finding across all methods is that there is gender DIF associated with the item, “I felt like crying”. Higher conditional endorsement was observed for women; the item was a more severe indicator of depression for men than for women. This finding was both hypothesized by PROMIS content experts, and found in the literature on DIF in depression measures.
The item, “I had trouble enjoying the things I used to enjoy” was hypothesized to have higher conditional endorsement in men. This was not observed to be significant with IRTLR after adjustment, but the hypothesis was confirmed by two analyses (IRTOLR and DFIT). This item was found by all methods to have DIF, either for gender, age and/or education.
Another item, “I felt that I had no energy”, was hypothesized by content experts to possibly show gender and age DIF, and was confirmed by three methods (IRTOLR, Poly-SIBTEST and DFIT) to show age, gender or education DIF. Conditional on depression, those 65 and over were less likely to report no energy (IRTOLR, IRTLR before Bonferroni adjustment), and those with lower education were more likely to endorse the item (Poly-SIBTEST); the βuni from the SIBTEST analyses was −0.26. This result was consistent with the primary analyses, in which the NCDIF index (0.039) was above the threshold. No confirmatory literature was available.
Figure 2, showing the test response functions, is a summary of the findings across the methods regarding the impact of DIF associated with the PROMIS depression items. Impact, examined at the aggregate level, was found to be minimal in the IRTOLR and MIMIC analyses when mean scale or latent state scores were examined with and without adjustment for DIF. As shown in Figure 2, this result is confirmed using the expected scale scores that are based on the IRTLR/MULTILOG result. The Differential Test Function (DTF) (Raju et al. 1995) values (a density-weighted summary of differences between groups in test response functions) were small, and not significant: 0.071 for gender, 0.047 for education and 0.129 for age.
Analyses at the individual level using IRTOLR showed DIF impact for some people; this result was confirmed using IRTLR. An examination of the differences in thetas with and without adjustment for DIF showed that 22.4% of subjects changed by at least 0.5 theta (about one half standard deviation), of which 5.8% changed by the equivalent of one standard deviation in the analyses of gender; for education the figures are 53.5% and 25.1%, and for age 3.8% changed by at least 0.5 standard deviations. The impact is in the direction of false positives for depression. For example, using a cutoff of theta ≥ 1, comparable to about a standard deviation above the mean, 9.5% would be classified as depressed prior to DIF adjustment, but not after adjustment in the analyses of gender; the comparable figures for education and age are 13.6% and 3.4%, respectively. Thus, the impact at the individual level was large for at least 100 people. Examination of the characteristics of those with very large changes in theta (≥1.25) shows that for the gender analyses all were females with lower education (86.1%); in the analyses of education, all except two were of lower education. Thus a common component across the analyses is lower education. The discrepancy between the aggregated and the individual impact result is in part because the expected scale scores reflect DIF cancellation, in that items with DIF in one direction may cancel items with DIF in another direction, producing low overall impact. However, specific individuals may still be affected.
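The two individual-level summaries used above (the share of respondents whose theta changes by at least a given amount, and the share classified as depressed only before DIF adjustment) can be computed as in the following sketch. The function name is hypothetical and the theta values are made up; they are not data from this study.

```python
def impact_summary(unadjusted, adjusted, shift=0.5, cutoff=1.0):
    """Return (proportion changing by >= shift, proportion above cutoff only before adjustment)."""
    n = len(unadjusted)
    pairs = list(zip(unadjusted, adjusted))
    changed = sum(abs(u - a) >= shift for u, a in pairs)
    # Possible false positives: above the cutoff before adjustment, below it after.
    reclassified = sum(u >= cutoff and a < cutoff for u, a in pairs)
    return changed / n, reclassified / n

# Made-up theta estimates for four respondents.
theta_unadjusted = [1.2, 0.1, -0.5, 1.5]
theta_adjusted = [0.6, 0.0, -0.4, 1.4]
prop_changed, prop_reclassified = impact_summary(theta_unadjusted, theta_adjusted)
```

Because these summaries are computed person by person, they do not benefit from the cancellation that occurs when differences are aggregated over items and respondents.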
Despite the caveats discussed below, and based on review of (a) the hypotheses, (b) findings from the literature, and (c) the collective results from the various analyses, two items were recommended for exclusion from calibration or for treatment using some other technique that accounts for DIF: “I felt like crying” and “I had trouble enjoying things that I used to enjoy.” The item “I felt I had no energy” was also flagged as showing DIF across several methods and comparisons, and was recommended for further review.
The findings were of relatively few items with significant or salient DIF in the depression item bank. Only two items were recommended for exclusion from calibrations, and one was recommended for further review. These items were hypothesized to show DIF and found in the literature to evidence DIF. Variants of the crying item have been identified in most studies as evidencing DIF (Cole et al. 2000; Gelin & Zumbo, 2003; Grayson et al. 2000; Pickard et al. 2006; Reeve, 2000; Teresi & Golden, 1994; Yang & Jones, 2007). Steinberg and Thissen (2006) in discussing DIF in the context of personality inventories and other depression surveys showed that uniform gender DIF of relatively large effect size has been observed for the crying item. While it is possible to use other methods, such as separate group calibrations for items with DIF, expert review of content area is necessary to determine whether an item is sufficiently clinically salient to remain in the item pool or bank. Procedures for accounting for DIF in the context of CAT require further development.
Limitations of the study include the inability (due to small sample sizes) to examine DIF by ethnicity or language. Smaller sample sizes may also have affected the power to detect DIF for the analyses of education and age. However, sensitivity analyses, conducted using other models, e.g., MIMIC, did not yield substantively different results; MIMIC has been found in simulation studies to be more powerful than IRTLR for uniform DIF detection under conditions of smaller sample sizes (Woods, 2009).
A caveat is that, to the extent that the findings are not robust given the various features of the data described above, these impact results could be incorrect. It is noted that many of the analyses of impact from the literature (see below) concluded that the impact of DIF on depression scale scores was not trivial, and the impact on some individuals may be large. The findings from these analyses were similar. While the aggregated impact was low, with evidence of DIF cancellation, relatively large individual impact (defined as a large change in the depression estimate after DIF adjustment) was observed for about 14% of the sample for at least one analysis. This underscores the need for removal or separate calibration of items with a high magnitude of DIF.
These findings should be interpreted in the context of factors that may affect DIF (see Teresi, 2006). Several features of the data have been found to be problematic in terms of DIF detection for one or more methods. These include the presence of sparse and skewed data, usually from small subgroup sample sizes. For example, simulation studies have shown that skewed data (with floor effects) resulted in reduced power for DIF detection using ordinal logistic regression (Scott, Fayers, Aaronson, Bottomley, de Graeff et al., 2009). An attempt was made to remedy this by collapsing categories and performing sensitivity analyses using binary items. Gelin and Zumbo (2003), using an ordinal logistic regression approach to examine DIF in the CES-D items, found that DIF results were affected by the endorsement proportions and by the way in which items were scored (ordinally, as binary items, or in terms of a persistency threshold, i.e., a frequency of at least 3–7 days), with higher magnitude of DIF for the binary and ordinal methods. In the analyses presented in this paper, the results were similar, regardless of categorization method.
Parametric models are more powerful for DIF detection with small subgroup sample sizes such as those observed in this study. IRTLR has been found in simulation studies, conducted using several polytomous response IRT models, to result in false DIF detection in the presence of group differences in the state/trait distributions when large sample sizes are studied (Bolt, 2002). While group differences in distributions were observed, particularly for age groups, simulation studies have not observed type 1 error inflation in studies of smaller subgroups such as those present in PROMIS.
IRTLR has also been found to be affected by lack of purification, magnitude of DIF present, and degree of DIF cancellation in simulations of binary data, resulting in both over and under-identification of items with DIF (Finch, 2005; Navas-Ara & Gómez-Benito, 2002; Wang & Yeh, 2003; Finch & French, 2007). An assumption in the use of IRTLR log-likelihood tests for DIF-detection is that the conditioning variable is DIF-free. The authors of one recent simulation (Finch & French) recommend first using SIBTEST or LR to purify the matching variable. In the case of ordinal data, OLR would be most efficient, followed by IRTLR. Purification was performed for the data set reported in this paper; however, predetermined anchor items were not available. Although Poly-SIBTEST was one of the methods used in sensitivity analyses, it was not used as a first-stage screen to identify potential anchor items as suggested by Finch and French, but rather to examine the convergence of findings across methods. It is noted that the recommendations of these authors were based on the results of simulations of binary data, and may not hold for polytomous data, such as those reported in this paper. The selection of anchor items has been discussed in earlier literature (e.g., Cohen, Kim and Wollack, 1996; Thissen, Steinberg and Wainer, 1993); more recent research has focused on anchor items and purification approaches in the context of IRTLR (Orlando-Edelen, Thissen, Teresi, Kleinman and Ocepek-Welikson, 2006; Woods 2008), MIMIC (Shih & Wang, 2009; Wang, Shih, Yang, in press) and the more general multigroup confirmatory factor analysis approach to examining invariance (French & Finch, 2008). These studies generally support the use of anchor items or the selection of invariant referent items. 
One study showed that the inclusion of at least 4 anchor items was preferable, in the context of power for IRTLR DIF detection (Wang and Yeh, 2003); a similar result was observed for MIMIC (Shih and Wang, 2009). On the positive side, a sufficient number (at least 10 in these analyses) of anchor items were identified for each IRTLR analysis, mitigating the impact of DIF on initial estimates.
False DIF detection (type 1 error inflation) can also result from model misspecification for polytomous data (Bolt, 2002). A recent examination of the effects of model misspecification on the likelihood ratio test used in IRTLR and in other nested-model comparisons, such as logistic regression, considered models that included 5, 10 and 30 items; data were generated using the one, two and three parameter logistic IRT models, with a normally distributed latent variable. Thus, generalization is to binary items. Lack of model fit was found to affect the G2 statistic used in nested models to test for DIF in IRTLR. If the least restricted model does not fit the data, this misspecification will result in incorrect statistical inferences (Maydeu-Olivares & Cai, 2006) and inflated type 1 error (Yuan & Bentler, 2004). Model fit was examined in the analyses reported here, and some misfit was observed for the high education group, and for males. Further collapsing of categories resolved the misfit problems for education groups, and reduced, but did not eliminate, misfit for males.
Finally, a good practice recommended by several authors (e.g., Crane et al. 2004; Hambleton, 2006; Millsap, 2006; Teresi & Fleishman, 2007) is to apply magnitude measures to identify salient DIF. For example, one recent simulation study of logistic regression (French & Maller, 2007) found that use of effect sizes under several conditions reduced false DIF detection, albeit at the expense of reduced power. An issue is what flagging rules to use in DIF detection (Hidalgo & López-Pina, 2004). Simulation studies of flagging rules or cutoff thresholds indicative of magnitude have resulted in differing suggested values, leading to a recent recommendation (Meade et al. 2007) to derive empirically cutoff values appropriate for the data set used. While such magnitude measures were applied in these analyses, cutoff values were not data-specific. PROMIS investigators have examined different criteria for flagging DIF (Crane et al. 2007), and are currently developing the capability to derive empirical thresholds using Monte Carlo simulations (Choi, Gibbons & Crane, 2009).
Despite these limitations, several strengths of the study include the extensive qualitative analyses performed that led to item revisions, the generation of DIF hypotheses and the use of purification of the conditioning depression variable. Additionally, model assumptions were tested, and sparse data controlled to the extent possible by collapsing categories. Finally, extensive sensitivity analyses were performed, and multiple methods were used in combination with DIF magnitude measures in order to investigate DIF and converge on valid, consistent findings, as suggested by Hambleton (2006).
Little DIF was found in the depression item bank for the groups studied. The extensive qualitative analyses that preceded this effort may have mitigated the extent of DIF in the item bank. On the one hand, false DIF detection (Type 1 error) was controlled to the extent possible by ensuring model fit and purification. On the other hand, power for DIF detection might have been compromised by several factors, including sparse data and small sample sizes. Nonetheless, practical and not just statistical significance should be considered. In this case the overall magnitude and impact of DIF were small for the groups studied, although impact for some individuals was relatively large, supporting the removal or separate calibration of a few items. This is a particularly important consideration in the context of item banks, because individuals may receive only a subset of items, with the potential for magnification of the impact of DIF for some people. Future analyses of the item bank should be performed examining ethnicity and language.
A question arises as to the practical implication of DIF for selection and prediction. A discussion of the relationship of measurement invariance to fair selection and prediction invariance is beyond the scope of this article, but is discussed in several seminal works, e.g., Meredith (1993), Millsap (1997), and more recently in Millsap (2007), Meredith and Teresi (2006), and Borsboom, Romeijn and Wicherts (2008). As shown by Millsap (2007), and illustrated in part by the results presented here, measures may show no aggregate prediction bias, but can produce systematic selection errors at the individual level due to measurement bias.
Item banks are being used increasingly to assess health and psychological domains; in that context it is critical that decisions and resource allocation based on these assessments result from a valid measurement process. The methods described in this paper are key steps in the development and evaluation of item banks, and of short-form measures that may be constructed from such banks. These methods may also be applied to examine the performance of existing measures. Individual differences, reflected in cultural, gender, educational or ethnic diversity, must be considered in the development and evaluation of measures. Analysis of DIF in the PROMIS depression item bank is an important step toward the goal of increasing the likelihood of measurement equivalence across diverse groups.
Anchor items are those items found (through an iterative process or prior analyses) to be free of DIF. These items serve to form a conditioning variable used to link groups in the final DIF analyses. (See also the discussion of purification, below.)
In the context of item response theory, DIF is observed when the probability of item response differs across comparison groups such as gender, country or language, after conditioning on (controlling for) level of the state or trait measured, such as depression.
Uniform DIF occurs if the probability of response is consistently higher (or lower) for one of the comparison groups across all levels of the state or trait.
Non-uniform DIF is observed when the probability of response is in a different direction for the groups compared at different levels of the state or trait. For example, the response probability might be higher for females than for males at higher levels of the measure of the depression state, and lower for females than for males at lower levels of depression. For some DIF detection methods, e.g., logistic regression, non-uniform DIF is defined as a significant group by depression interaction.
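The distinction between uniform and non-uniform DIF can be illustrated with a minimal sketch for a binary item. The two-parameter logistic curve, the theta grid and the (a, b) values below are hypothetical choices made for illustration only; the sketch compares population curves and is not one of the statistical tests used in the analyses reported here.

```python
import math

def p_endorse(theta, a, b):
    """2PL probability of endorsing a binary item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def dif_type(reference, focal, tol=1e-9):
    """Classify the pattern of group differences over a theta grid.

    reference, focal: hypothetical (a, b) parameter pairs for each group.
    """
    grid = [x / 10 for x in range(-40, 41)]  # theta from -4.0 to 4.0
    diffs = [p_endorse(t, *focal) - p_endorse(t, *reference) for t in grid]
    focal_higher = any(d > tol for d in diffs)
    focal_lower = any(d < -tol for d in diffs)
    if focal_higher and focal_lower:
        return "non-uniform"  # the curves cross within the grid
    if focal_higher or focal_lower:
        return "uniform"      # same direction at every grid point
    return "none"

# Equal slopes, shifted difficulty: one group is more likely to endorse throughout.
shift_only = dif_type(reference=(1.5, 0.0), focal=(1.5, -0.5))
# Different slopes, same difficulty: the direction reverses at theta = 0.
slope_differs = dif_type(reference=(1.0, 0.0), focal=(2.0, 0.0))
print(shift_only, slope_differs)
```

In this illustration a group difference in the difficulty parameter alone yields uniform DIF, whereas a group difference in the slope produces the cross-over that defines non-uniform DIF.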
The magnitude of DIF relates to the degree of DIF present in an item. In the context of IRT, a measure of magnitude is non-compensatory DIF (NCDIF), described for binary items as the unsigned probability difference (Camilli & Shepard, 1994), and later expanded to polytomous items by Raju and colleagues (1995). This index reflects the group difference in expected item scores (see Expected Item Scores). In essence, this method provides an estimate of the expected score that an individual would obtain if scored using the parameters and depression estimates for group X, and then using the parameters and depression estimates for group Y. NCDIF is the average squared difference between the expected item scores for a given individual as a member of the focal group and as a member of the reference group. Theoretical work in this area was provided by Chang and Mazzeo (1994). (For computational details, see Collins, Raju & Edwards, 2000; Morales, Flowers, Gutiérrez, Kleinman & Teresi, 2006; Teresi et al. 2007.)
Specific cutoff values are used to indicate salient DIF. While chi-square tests of significance are available, these were found to be too stringent, over-identifying DIF. Cutoff values established based on simulations (Fleer, 1993; Flowers, Oshima & Raju, 1999) provide an estimate of the magnitude of item-level DIF. The cutoff values are controversial; for example, for polytomous items with five response options the recommended cutoff in the manual is 0.096 (Raju, 1999). However, simulation studies have suggested the use of different cutoff values for five response categories: 0.032 for smaller sample sizes of 300 per group (Bolt, 2002) or 0.016 (Flowers, 1995). The most recent recommendations (Meade et al. 2007) suggest using 0.0115 for a liberal test of DIF for five response category items, and 0.009 for a conservative test for sample sizes ≤ 500/group. Recently, Oshima, Raju and Nanda (2006) have recommended other cutoff values for binary items, and empirically derived cutoffs based on the data set have been incorporated into DFIT8 (Oshima, Kushubar, Scott & Raju, 2009).
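The NCDIF definition and the role of the cutoff can be sketched as follows. This is an illustrative calculation only: the function names and the expected item scores are hypothetical, and the two cutoffs are taken from the values cited above to show how the choice of threshold can change whether an item is flagged.

```python
def ncdif(scored_as_focal, scored_as_reference):
    """Average squared difference in expected item scores.

    Each list holds one expected item score per focal-group member, computed
    once with the focal-group parameters and once with the reference-group
    parameters (both evaluated at that person's depression estimate).
    """
    n = len(scored_as_focal)
    return sum((f - r) ** 2
               for f, r in zip(scored_as_focal, scored_as_reference)) / n

def is_salient(value, cutoff):
    """Flag DIF as salient when the index exceeds the chosen cutoff."""
    return value > cutoff

# Hypothetical expected item scores for four focal-group members.
scored_as_focal = [1.2, 2.0, 2.9, 3.5]
scored_as_reference = [1.0, 2.4, 2.6, 3.5]

value = ncdif(scored_as_focal, scored_as_reference)
flag_manual = is_salient(value, cutoff=0.096)   # cutoff cited from the manual
flag_flowers = is_salient(value, cutoff=0.016)  # cutoff cited from Flowers (1995)
print(round(value, 4), flag_manual, flag_flowers)
```

With these made-up scores the same item is flagged under the lower cutoff but not under the higher one, which is why the selection of a threshold matters.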
Impact refers to the influence of DIF on the scale score. There are various approaches to examining impact, depending on the DIF detection method. In the context of IRTLR, differences in “test” response functions can be constructed by summing the expected item scores to obtain an expected scale score. Plots (for each group) of the expected scale score against the measure of the state or trait (e.g., depression) provide a graphic depiction of the difference in the areas between the curves, and show the relative impact of DIF. The Differential Test Functioning (DTF) index (Raju et al. 1995) is a density-weighted summary measure of these differences, and reflects the aggregated net impact. The DTF is the sum of the item-level compensatory DIF indices, and as such reflects the results of DIF cancellation. (See also Stark et al. 2004.)
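The relationship between the DTF and the item-level compensatory indices can be sketched numerically. The sketch assumes the usual formulation in which the compensatory index for an item is the average, over focal-group members, of the product of the item-level difference and the scale-level difference; the function name and the difference values are hypothetical.

```python
def dtf_and_cdif(d):
    """Return (DTF, list of compensatory DIF indices).

    d[i][j] is the difference in expected item score for item i (focal minus
    reference parameters) at the depression estimate of focal-group member j.
    """
    n_people = len(d[0])
    # Difference in expected scale scores for each person: sum over items.
    total = [sum(item[j] for item in d) for j in range(n_people)]
    dtf = sum(t * t for t in total) / n_people
    cdif = [sum(item[j] * total[j] for j in range(n_people)) / n_people
            for item in d]
    return dtf, cdif

# Two hypothetical items with DIF in opposite directions, three people.
differences = [
    [0.20, 0.30, 0.10],
    [-0.20, -0.25, -0.10],
]
dtf, cdif = dtf_and_cdif(differences)
# Item-level non-compensatory indices for comparison.
ncdif = [sum(x * x for x in item) / len(item) for item in differences]
print(dtf, cdif, ncdif)
```

Because the two items differ in opposite directions, the scale-level index is far smaller than either item-level non-compensatory index, which is the cancellation described above.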
Individual impact can be assessed through an examination of changes in depression estimates (thetas) with and without adjustment for DIF. The unadjusted thetas are produced from a model with all item parameters set equal for the two groups. The adjusted thetas are produced from a model with parameters that showed DIF based on the IRTLRDIF results estimated separately (freed) for the groups.
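A minimal sketch of this comparison follows. The function name, the default thresholds (a change of at least 0.5 theta and a classification cutoff of theta ≥ 1, as used in the results) and the theta values are illustrative assumptions, not the code used for the analyses.

```python
def individual_impact(unadjusted, adjusted, change=0.5, cutoff=1.0):
    """Summarize individual-level DIF impact from paired theta estimates.

    Returns the proportion whose theta changed by at least `change`, and the
    proportion at or above `cutoff` before adjustment but below it afterwards.
    """
    pairs = list(zip(unadjusted, adjusted))
    n = len(pairs)
    changed = sum(1 for u, a in pairs if abs(u - a) >= change)
    reclassified = sum(1 for u, a in pairs if u >= cutoff and a < cutoff)
    return changed / n, reclassified / n

# Hypothetical thetas for five respondents, before and after DIF adjustment.
theta_unadjusted = [1.2, 0.3, -0.5, 1.6, 0.9]
theta_adjusted = [0.6, 0.2, -0.4, 1.5, 0.95]

prop_changed, prop_reclassified = individual_impact(theta_unadjusted, theta_adjusted)
print(prop_changed, prop_reclassified)
```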
DIF is said to cancel if the net impact of DIF is trivial. For example, if the differences between expected scale scores (defined below) for the groups compared are negligible, resulting in small areas between the curves relating expected scale scores to the measure of the latent state, depression, then DIF is said to cancel. Because the expected scale score is on the raw metric of the scale score, at each level of depression disorder, it is possible to locate the average scale score associated with that degree of symptomatology.
An expected item score (EIS) is the sum of the weighted (by the response category value) probabilities of scoring in each of the possible item categories. Used by Wainer, Sireci and Thissen (1991), this effect size measure is gaining in popularity. (See also Collins et al. 2000; Orlando-Edelen et al. 2006; Steinberg & Thissen, 2006; Teresi et al. 2007.)
The expected scale score is the sum of the expected item scores. The test response function (Lord & Novick, 1968) relates average expected scale scores to theta (the estimate of depression). Note that these scores are typically not weighted by the response frequency; however, such a weight can be applied so that the results reflect the relative frequencies in the sample.
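The expected item score and expected scale score can be sketched under a graded response model as follows. The item parameters, the scoring of categories as 0, 1, 2, ... and the function names are assumptions made for illustration; the logistic form omits the scaling constant D.

```python
import math

def boundary_prob(theta, a, b):
    """Probability of responding at or above the category bounded by b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def category_probs(theta, a, thresholds):
    """Category probabilities for one item with len(thresholds) + 1 categories."""
    cum = [1.0] + [boundary_prob(theta, a, b) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

def expected_item_score(theta, a, thresholds):
    """Sum of category values weighted by their probabilities (values 0, 1, ...)."""
    probs = category_probs(theta, a, thresholds)
    return sum(value * p for value, p in enumerate(probs))

def expected_scale_score(theta, items):
    """Sum of expected item scores; items is a list of (a, thresholds) pairs."""
    return sum(expected_item_score(theta, a, bs) for a, bs in items)

# Hypothetical three-item scale with five response categories per item.
items = [
    (1.5, [-1.0, 0.0, 1.0, 2.0]),
    (2.0, [-0.5, 0.5, 1.5, 2.5]),
    (1.2, [0.0, 1.0, 2.0, 3.0]),
]
# Test response function: expected scale score at selected theta values.
curve = [(theta, expected_scale_score(theta, items)) for theta in (-2, -1, 0, 1, 2)]
for theta, score in curve:
    print(theta, round(score, 3))
```

Evaluating the same function with the parameter estimates for each comparison group gives the pair of curves whose difference is examined for impact.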
Several forms of item response theory models are available for binary, categorical and ordinal data. Because the data presented here were ordinal, with up to five ordered response categories, a graded response model (Samejima, 1969) was applied to the data using MULTILOG (Thissen, 2001). In this model (which reduces to the 2 parameter logistic item response model with binary data), we assume ordered responses, x = k, where k = 1, 2, …, m. The discrimination parameter or slope can be defined as a_i, and the difficulty parameters for response k as b_ik.
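For reference, the graded response model is commonly written in the form below. This is a reconstruction from the definitions given in this entry rather than a quotation; the threshold indexing, the boundary conventions P*(1) = 1 and P*(m+1) = 0, and the omission of the scaling constant D are assumptions that follow the usual presentation, and indexing conventions vary across sources.

```latex
P(x = k \mid \theta) = P^*(k) - P^*(k+1)
  = \frac{1}{1 + \exp[-a_i(\theta - b_{i,k-1})]}
  - \frac{1}{1 + \exp[-a_i(\theta - b_{ik})]},
\qquad P^*(1) = 1, \quad P^*(m+1) = 0 .
```

For the lowest category the first term is replaced by 1, and for the highest category the second term is replaced by 0.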
P*(k) is the item characteristic curve (ICC) describing the probability that a response is in category k or higher, for each value of θ (see Thissen, 2001; Orlando-Edelen et al. 2006). The model assumes an average discrimination across response categories. Note that the scaling parameter, D, is used in some IRT programs, but not in MULTILOG, the program used in these analyses.
Item sets that are used to construct preliminary estimates of the attribute assessed, e.g., depression, include items with DIF. Thus, estimation of a person’s standing on the attribute may be incorrect when this contaminated estimate is used. Purification is the process of iteratively testing items for DIF so that final estimation of the trait can be made after taking this item-level DIF into account. Because simulation studies have shown that most methods of DIF detection are adversely affected by lack of purification, the process is critical, particularly for IRTLR. Using this method, anchor items are selected that are free of DIF. For most models, these anchor items and the studied item form the conditioning set of items used in the DIF detection process. During this iterative process items may change in terms of DIF status, a result of the use of a less than optimal (contaminated) conditioning variable at various steps in the analyses. The final estimates of the attribute use all items, but only after parameters have been appropriately set as freely or equally estimated, depending on whether the items showed DIF or not.
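The iterative logic of purification can be sketched generically. In the sketch below, `has_dif` is a hypothetical stand-in for whatever DIF test is applied (for example, a likelihood ratio test), the item names are illustrative labels, and the toy rule is constructed only to show how an item's DIF status can change once a contaminated item leaves the anchor set; none of this reproduces the software used in the analyses.

```python
def purify(items, has_dif, max_iter=20):
    """Iteratively re-test items until the set of flagged items stabilizes.

    has_dif(item, anchors) -> bool is a user-supplied DIF test that conditions
    on the current anchor set (hypothetical interface).
    Returns (flagged items, final anchor items).
    """
    flagged = set()
    for _ in range(max_iter):
        anchors = [i for i in items if i not in flagged]
        new_flagged = {
            i for i in items
            if has_dif(i, [a for a in anchors if a != i])
        }
        if new_flagged == flagged:
            break  # DIF status no longer changes
        flagged = new_flagged
    return sorted(flagged), [i for i in items if i not in flagged]

def toy_test(item, anchors):
    """Toy rule: 'crying' always shows DIF; 'energy' appears to show DIF only
    while the contaminated item 'crying' is part of the conditioning set."""
    if item == "crying":
        return True
    if item == "energy":
        return "crying" in anchors
    return False

items = ["crying", "energy", "sad", "worthless"]
flagged, anchors = purify(items, toy_test)
print(flagged, anchors)
```

In this toy run the second item is flagged on the first pass and cleared on the next, once the first item has been removed from the conditioning set.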
These analyses were conducted on behalf of the Statistical Coordinating Center to the Patient Reported Outcomes Measurement Information System (PROMIS), a United States National Institutes of Health roadmap project. PROMIS was funded by cooperative agreements to a Statistical Coordinating Center (Evanston Northwestern Healthcare, PI: David Cella, PhD, U01AR52177) and six Primary Research Sites (Duke University, PI: Kevin Weinfurt, PhD, U01AR52186; University of North Carolina, PI: Darren DeWalt, MD, MPH, U01AR52181; University of Pittsburgh, PI: Paul A. Pilkonis, PhD, U01AR52155; Stanford University, PI: James Fries, MD, U01AR52158; Stony Brook University, PI: Arthur Stone, PhD, U01AR52170; and University of Washington, PI: Dagmar Amtmann, PhD, U01AR52171). NIH Science Officers on this project are Deborah Ader, PhD, Susan Czajkowski, PhD, Lawrence Fine, MD, DrPH, Louis Quatrano, PhD, Bryce Reeve, PhD, William Riley, PhD, and Susana Serrate-Sztein, PhD. This manuscript was reviewed by the PROMIS Publications Subcommittee prior to external peer review. See the web site at www.nihpromis.org for additional information.
Funding for analyses was provided in part by the National Institute on Aging (NIA), Resource Center for Minority Aging Research at Columbia University, PI: Rafael Lantigua, Co-Director, Jeanne Teresi (AG15294), and by the NIA project, AG025308, Understanding Disparities in Mental Status Assessment, PI: Richard Jones. This paper was prepared for the International Conference on Survey Methods in Multinational, Multiregional and Multicultural Contexts, Berlin, Germany, June 25–28, 2008.