This research provides an example of testing for differential item functioning (DIF) using multiple indicator multiple cause (MIMIC) structural equation models. True/False items on five scales of the Schedule for Nonadaptive and Adaptive Personality (SNAP) were tested for uniform DIF in a sample of Air Force recruits with groups defined by gender and ethnicity. Uniform DIF exists when an item is more easily endorsed for one group than the other, controlling for group mean differences on the variable under study. Results revealed significant DIF for many SNAP items and some effects were quite large. Differentially-functioning items can produce measurement bias and should be either deleted or modeled as if separate items were administered to different groups. Future research should aim to determine whether the DIF observed here holds for other samples.
Differential item functioning (DIF) occurs when an item on a test or questionnaire has different measurement properties for one group of people versus another, irrespective of group-mean differences on the variable under study. For example, if the Schedule for Nonadaptive and Adaptive Personality (SNAP; Clark 1996) item “I enjoy work more than play” has gender DIF, the probability of responding “True” is different for men versus women even when they are matched on degree of workaholism. Men and women may have different mean levels of workaholism; this is separate from the issue of DIF. Detecting DIF is important because undetected DIF can lead to inaccurate conclusions about group differences and invalidate procedures for making decisions about individuals.
Numerous methods have been proposed for identifying DIF (Camilli and Shepard 1994; Holland and Wainer 1993; Millsap and Everson 1993). For most methods, it is desirable to select a few DIF-free items to define the matching criterion that is used for testing the other items for DIF. For some methods, people are matched on summed scores (i.e., the sum of observed item scores); for others, people are matched on an estimate of the latent variable that underlies the item scores. The matching is likely to be more accurate with latent-variable methods because they account for measurement error in the items.
DIF testing using latent variables may be accomplished using a multiple-group model or a single-group model with covariates. For categorical item data, both of these models may be parameterized either as an item response model fitted to the data directly or as a confirmatory factor analysis (CFA) model fitted to a matrix of polychoric correlations. Multiple group models are usually parameterized as item response models (this method is often referred to as IRT-LR-DIF; Thissen et al. 1986; Thissen et al. 1988, 1993), and single-group models with covariates are a type of multiple indicator multiple cause (MIMIC) structural equation model. Woods (in press) contrasts these two latent-variable approaches and compares them in simulations.
The present research is an application of the MIMIC-model approach, illustrated in Fig. 1 (this figure is further discussed below). Muthén (e.g., 1985, 1988, 1989) popularized the use of MIMIC models to test for DIF using estimation methods appropriate for categorical data (see also MacIntosh and Hashim 2003; Muthén et al. 1991). A simple MIMIC model has one latent variable (i.e., factor) regressed on an observed grouping variable to permit group mean differences on the factor. An item is tested for DIF by regressing it (i.e., responses to it) on the grouping variable. There is evidence of differential functioning if group membership significantly predicts item responses, controlling for group mean differences on the factor.
The MIMIC approach has several advantages (some of which are shared by IRT-LR-DIF). Matching is based on the latent variable which is likely to be more accurate than a summed score. Multidimensional items, or multiple factors, are easily modeled. Good software is available for estimating the models with methods designed for categorical data (e.g., Mplus; Muthén and Muthén 2007). It is easy to examine DIF for more than two groups at once and to control for additional covariates when testing for DIF. Covariates may be continuous or categorical. Separate item parameter estimates for each group are not a direct byproduct of the analysis (as with IRT-LR-DIF), but they are easily calculated as a function of the regression coefficients.
A disadvantage of MIMIC models is that they test for uniform, but not nonuniform,¹ DIF (see definitions in Mellenbergh 1989 and Camilli and Shepard 1994, p. 59). In the present study, the MIMIC approach was used despite this disadvantage because an important grouping variable in the data, ethnicity, has six levels, and the sample sizes are small in some groups (e.g., Ns=17, 68, 75). Muthén (1989) pointed out that MIMIC models perform well with smaller within-group sample sizes than multiple-group models, and more easily accommodate more than two groups. Based on simulations with binary item data, Woods (in press) concluded that DIF tests were more accurate with MIMIC models than IRT-LR-DIF when the focal-group N was small (e.g., around 25, 50, or 100).
The present study further investigates the five SNAP scales that showed differential test functioning (DTF) in previous research by testing individual items for DIF. Woods et al. (2008) found that five subscales of the SNAP functioned differently for Air Force recruits depending on their gender, ethnicity, or both. All items were considered simultaneously in this previous study so this was DTF rather than DIF. MIMIC-modeling is preferable to IRT-LR-DIF for these data because some of the focal-group sample sizes are small.
The data are the same as those analyzed by Woods et al. (2008). The sample consisted of 2,026 Air Force recruits (1,265 male, 761 female) completing basic military training at Lackland Air Force Base in San Antonio, Texas. Most were between 18 and 25 years old (Mdn=19). They self-identified as Caucasian (1,305), African-American (348), Hispanic (75), Asian (68), or Native American (17), with the remaining 213 classified as “other.” Oltmanns and Turkheimer (2006) describe details about the sample and data collection procedures.
The SNAP is a factor-analytically-derived self-report questionnaire that was originally developed as a tool for the assessment of personality disorders in terms of trait dimensions (Clark 1996). The 375 True/False items can be organized into three basic temperament scales (Negative Temperament, Positive Temperament, and Disinhibition) and 12 trait scales (Mistrust, Manipulativeness, Aggression, Self-harm, Eccentric Perceptions, Dependency, Exhibitionism, Entitlement, Detachment, Impulsivity, Propriety, and Workaholism). Alternatively, the items may be clustered into 13 diagnostic scales, corresponding to personality disorder categories presented in the Diagnostic and Statistical Manual of Mental Disorders (DSM-III-R; American Psychiatric Association 1987). Six validity scales identify individuals who have produced scores indicating response biases, careless/defensive responding, or deviance (Clark 1996; Simms and Clark 2006). Although it would be interesting and potentially valuable to evaluate DIF for all SNAP scales, the present investigation is limited to a subset of the scales that showed DTF in previous research: Three trait scales (Entitlement, Exhibitionism, and Workaholism), and two temperament scales (Disinhibition and Negative Temperament).
All MIMIC models used here were similar to the example shown in Fig. 1 (particular models are described in the next section). This is a standard one-factor² CFA model plus observed covariates: The factor is regressed on dummy-coded indicators of gender (male = 0, female = 1) and ethnicity. The latent scale is identified by fixing the residual variance of the factor, ζ, to 1. There are five binary indicators of ethnicity, with white as the reference group. Because there were only 17 Native Americans, this category was used only for SNAP scales with fewer than 17 items (Entitlement and Exhibitionism). For the other scales (Disinhibition, Workaholism, and Negative Temperament), Native Americans were combined with the “other” ethnic group. Testing items for DIF involves regressing them on all of the grouping variables. The model in Fig. 1 permits DIF for item 2 while assuming all other items are DIF free.
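The covariate coding described in this paragraph can be sketched in a few lines of Python. This is only an illustration: the function name `dummy_code`, the string labels, and the ordering of the indicators (gender first, then ethnicity in the order African American, Asian, Hispanic, Native American, other) are assumptions chosen to match the subscripts on the γ and β coefficients reported in the results; they are not part of the original Mplus setup.

```python
# Hypothetical helper: dummy-code gender and ethnicity as described in the text.
# Ordering of the ethnicity indicators is an assumption matching the subscripts
# 2..6 used for the coefficients in the results.
ETHNICITY_LEVELS = ["African American", "Asian", "Hispanic", "Native American", "other"]


def dummy_code(gender, ethnicity):
    """Return [x1, ..., x6]: x1 is gender (male = 0, female = 1); x2..x6 are
    ethnicity indicators with white as the reference group (all zeros)."""
    if gender not in ("male", "female"):
        raise ValueError(f"unknown gender label: {gender!r}")
    if ethnicity != "white" and ethnicity not in ETHNICITY_LEVELS:
        raise ValueError(f"unknown ethnicity label: {ethnicity!r}")
    x = [1 if gender == "female" else 0]
    x += [1 if ethnicity == level else 0 for level in ETHNICITY_LEVELS]
    return x


# White men are the group for which every covariate equals 0.
print(dummy_code("male", "white"))
print(dummy_code("female", "Asian"))
```

For the scales on which Native Americans were combined with “other”, the same sketch would apply with the Native American indicator dropped and those recruits relabelled as “other”.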
All analyses were carried out using Mplus (version 4.21, Muthén and Muthén 2007). All models were parameterized as two-parameter logistic item response models and fitted to the data using the robust maximum likelihood estimator “MLR”. The Mplus parameterization is:

$$P(u_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp[-(a_j \theta_i - \tau_j)]} \qquad (1)$$

where uij is a response given by person i to item j, θ is the latent variable (i.e., factor), and aj and τj are discrimination and threshold parameters, respectively. The Mplus threshold differs from the threshold in Birnbaum’s (1968) popular 2PL model, bj. The 2PL parameterization is:

$$P(u_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp[-a_j(\theta_i - b_j)]} \qquad (2)$$

Nevertheless, τj is just a rescaled version of bj (τj = ajbj), so the interpretation is similar: bj = τj/aj is the value of θ at which the probability of endorsing the item is .5.
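A small numerical check may make the relation τj = ajbj concrete. The sketch below assumes the usual logistic forms, P = 1/(1 + exp[−(ajθ − τj)]) for the Mplus parameterization and P = 1/(1 + exp[−aj(θ − bj)]) for the 2PL model; the function names and the parameter values are hypothetical and chosen only for illustration.

```python
import math


def irf_mplus(theta, a, tau):
    """Probability of endorsing the item in the slope/threshold form (assumed)."""
    return 1.0 / (1.0 + math.exp(-(a * theta - tau)))


def irf_2pl(theta, a, b):
    """Probability of endorsing the item in Birnbaum's 2PL form."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item parameters, not estimates from the study.
a_j, b_j = 1.5, -0.4
tau_j = a_j * b_j  # rescaled threshold

# The two forms give the same curve once tau_j = a_j * b_j.
for theta in (-2.0, -0.4, 0.3, 1.7):
    assert abs(irf_mplus(theta, a_j, tau_j) - irf_2pl(theta, a_j, b_j)) < 1e-12

# The curve crosses .5 at theta = b_j = tau_j / a_j.
print(irf_mplus(tau_j / a_j, a_j, tau_j))
```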
The following procedures were repeated for each of the five SNAP scales. First, DIF-free items were identified empirically; the remaining items are studied items. Second, each studied item was individually tested for DIF. Third, a final model was constructed that permitted τj to differ across groups for all differentially-functioning (D-F) items. Estimates of discrimination parameters (aj), thresholds (τj), group mean differences on the factor for the kth covariate (γk), and DIF effects (i.e., regression coefficients indicating association with the grouping variables: βjks) from the final model are reported.
If the DIF status of all items is unknown prior to a MIMIC analysis, it seems desirable to fit a model supposing all items have DIF: All items would be regressed on the grouping variables. Unfortunately, such a model is not identified. There is also a conceptual problem because at least one DIF-free item is needed to define the factor on which the groups are matched. Therefore, preliminary analyses were performed to select a subset of DIF-free items to define the factor in subsequent analyses. Every item was tested for DIF with all other items presumed DIF-free. This was accomplished by regressing one item at a time on all of the grouping variables. The model in Fig. 1 illustrates this type of test for item 2. Item j was assigned to the DIF-free subset if aj was at least .5 and all βjks were nonsignificant (α=.05).
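The assignment rule at the end of this paragraph is simple enough to state as code. The following is a minimal sketch, assuming that each item's preliminary model supplies a discrimination estimate and one p-value per β coefficient; the function name `is_dif_free` and the example values are hypothetical.

```python
def is_dif_free(a_j, beta_p_values, min_a=0.5, alpha=0.05):
    """Return True if the item qualifies for the DIF-free subset: its
    discrimination is at least min_a and none of its DIF effects is
    significant at level alpha."""
    return a_j >= min_a and all(p >= alpha for p in beta_p_values)


# Hypothetical preliminary results for three items (not values from the study):
# (discrimination estimate, p-values for the six DIF effects)
preliminary = {
    "item A": (1.20, [0.41, 0.63, 0.12, 0.88, 0.30, 0.57]),
    "item B": (0.35, [0.41, 0.63, 0.12, 0.88, 0.30, 0.57]),  # discrimination too low
    "item C": (1.10, [0.01, 0.63, 0.12, 0.88, 0.30, 0.57]),  # one significant effect
}

dif_free = [name for name, (a, ps) in preliminary.items() if is_dif_free(a, ps)]
print(dif_free)
```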
The assumption that all other items are DIF-free is increasingly incorrect for scales with more DIF. However, previous simulation studies indicate that the error produced by violation of this assumption is inflated Type I error (Finch 2005; Stark et al. 2006; Wang 2004; Wang and Yeh 2003). In the present context, inflated Type I error means some DIF-free items will appear to have DIF and not be selected for the DIF-free subset. That is not particularly problematic. All items not included in the DIF-free subset are subsequently tested for DIF, so if an item really is DIF-free but excluded from the DIF-free subset initially, researchers are still likely to conclude it is DIF-free based on the subsequent test (which should be nonsignificant).
Items not assigned to the DIF-free subset (studied items) were tested individually for DIF using likelihood ratio (LR) difference tests for nested models. The LR statistic is −2 times the difference in log likelihoods, and follows a χ2 distribution with df equal to the difference in the number of estimated parameters. Also, with the Mplus “MLR” estimator, the LR statistic must be divided by a term that is a function of the number of estimated parameters in each model and the scaling correction factors given by Mplus. This was carried out as shown in an example given on the Mplus website (http://www.statmodel.com/chidiff.shtml).
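The scaled difference test described here can be sketched as follows. The formula is the one given on the Mplus page cited above as best understood here: with log likelihoods L0 and L1, scaling correction factors c0 and c1, and parameter counts p0 and p1 for the constrained and full models, the difference scaling is cd = (p0·c0 − p1·c1)/(p0 − p1) and the test statistic is −2(L0 − L1)/cd. The function name and the input values are illustrative assumptions, not results from this study.

```python
def scaled_lr_difference(loglik0, c0, p0, loglik1, c1, p1):
    """Scaled likelihood ratio difference test for nested models fitted with
    a robust ML estimator.

    Model 0 is the constrained (nested) model and model 1 is the full model.
    Returns (test statistic, degrees of freedom)."""
    if p1 <= p0:
        raise ValueError("the full model must estimate more parameters")
    cd = (p0 * c0 - p1 * c1) / (p0 - p1)  # difference test scaling correction
    if cd <= 0:
        raise ValueError("non-positive scaling correction; test is not usable")
    statistic = -2.0 * (loglik0 - loglik1) / cd
    return statistic, p1 - p0


# Illustrative inputs only.
stat, df = scaled_lr_difference(loglik0=-2606.0, c0=1.450, p0=39,
                                loglik1=-2583.0, c1=1.546, p1=47)
print(round(stat, 2), df)
```

The resulting statistic is referred to a χ2 distribution with the returned degrees of freedom.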
To test studied item j for DIF, a full model was compared to a more constrained model. In both the full and constrained models, all of the original items from the scale were used, and items assigned to the DIF-free subset were not regressed on any grouping variables. In the full model, all studied items were permitted to have DIF (i.e., all studied items were regressed on all grouping variables). In the constrained model, invariance was presumed for item j (i.e., item j was not regressed on any grouping variables). A significant difference between these models indicates that fit significantly declines if item j is assumed DIF-free. Therefore, item j has DIF.
An alternative approach is to compare a model that presumes no DIF in any item to a model that permits DIF for studied item j. This was not done because the LR statistic follows a χ2 distribution more closely when the baseline model fits the data as closely as possible. A model presuming no DIF in any item is probably rather far from reality. Stark et al. (2006) recently discussed this issue in the context of multiple-group DIF testing.
To control the false discovery rate, the Benjamini and Hochberg (1995) procedure was applied within each SNAP scale (see also Thissen et al. 2002; Williams et al. 1999). The MULTTEST procedure in SAS was used to obtain Benjamini-Hochberg adjusted p-values for the LR statistics; these adjusted values, rather than the raw p-values, were compared to α=.05.
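For readers without access to SAS, the Benjamini-Hochberg adjustment is short enough to sketch directly. The version below is a plain Python illustration of the standard step-up adjustment, not the code used in the study; the function name `bh_adjust` and the example p-values are hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Each sorted p-value p(r) is multiplied by m / r (m tests, rank r), and a
    running minimum taken from the largest rank downward keeps the adjusted
    values monotone."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four studied items on one scale.
raw = [0.01, 0.04, 0.03, 0.20]
adj = bh_adjust(raw)
flagged = [p < 0.05 for p in adj]
print([round(p, 4) for p in adj], flagged)
```

In this hypothetical example, three raw p-values fall below .05 but only one adjusted value does, which is the kind of correction the procedure is meant to provide.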
A final MIMIC model was constructed for each SNAP scale, in which only items that showed significant DIF were regressed on the grouping variables. The factor was also regressed on the grouping variables. The final model provides estimates of aj, τj, group mean differences on the factor (γk), and DIF effects (βjks). A negative βjk indicates that τj is smaller for the focal group (women, African Americans, Asians, Hispanics, Native Americans, or “other”s) than for the corresponding reference group (men or whites). In other words, the level of the latent variable required for recruits to respond “True” to the item was lower for members of the focal group. A positive βjk indicates that τj is larger for the focal group.
One τj will be reported for each item. For items without DIF, this τj applies to all participants. For differentially-functioning (D-F) items, this τj applies only to white men (i.e., when all covariates = 0), because there is a separate τj for each of the 12 groups (white men, white women, African American men, African American women, etc.). The τj for the other 11 groups may be calculated by adding the regression parameter(s) for the corresponding DIF effect(s) to the τj for white men. An example of this computation is provided when specific results are described. Example item response functions (IRFs) are also presented, which are given by Equation (1) and show the probability of responding “True” as a function of a person’s level of the factor (and the item parameters). D-F items have a separate IRF for each of the 12 groups.
The group mean level of Entitlement was significantly larger for African American versus white recruits (γ2=0.52, SE=.06) and for Hispanic versus white recruits (γ4=0.38, SE=.15). As shown in Table 1, one Entitlement item (number 49) qualified for assignment to the DIF-free subset; all others were tested for DIF. Table 1 lists the items (ordered by LR statistic), with the LR statistic, raw p-value, and Benjamini-Hochberg adjusted p-value (pBH). Four items printed in bold type have uniform DIF (pBH is less than .05). Item parameter estimates from the final model are also listed in Table 1. Remember that the τj for D-F items applies to white male recruits only.
For D-F items, group differences in τj (i.e., βjks) are given in Table 6. An asterisk flags significant effects (α=.05). Controlling for ethnicity, all four items were more easily endorsed by women than men. Holding sex constant, the threshold for item 83 was larger for African Americans and “other”s compared to whites. In contrast, τj for item 120 was much lower for Asian versus white recruits, and τj for item 125 was lower for African Americans and Hispanics compared to whites.
Focusing on item 120 (“I have many qualities others wish they had”) as an example, τj may be computed for each group. For African American male recruits, β2 is added to the τj for white male recruits given in Table 1: −1.01 + 0.26 = −0.75. For Asian male recruits, β3 is used instead: −1.01 − 1.14 = −2.15. For Hispanic, Native American, and “other” men, this calculation would use β4, β5, and β6. To obtain thresholds for women, β1 is included in the addition. The threshold is −1.01 − 0.46 = −1.47 for white female recruits, and −1.01 − 0.46 + 0.26 = −1.21 for African American female recruits. The thresholds for women in other ethnic groups are calculated analogously.
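The arithmetic in this paragraph can be reproduced with a short Python sketch. It uses only the estimates quoted in the text for item 120 (τj for white men, β1, β2, and β3); the coefficients for Hispanic, Native American, and “other” recruits are not quoted here and are therefore left out. The function and variable names are hypothetical.

```python
# Estimates for Entitlement item 120 as quoted in the text.
TAU_WHITE_MEN = -1.01
BETA = {
    "female": -0.46,           # beta_1
    "African American": 0.26,  # beta_2
    "Asian": -1.14,            # beta_3
}


def group_threshold(female, ethnicity):
    """Threshold for a group: the white-male threshold plus the DIF effect
    for each covariate that equals 1 for that group."""
    tau = TAU_WHITE_MEN
    if female:
        tau += BETA["female"]
    if ethnicity != "white":
        # Raises KeyError for groups whose coefficient is not quoted in the text.
        tau += BETA[ethnicity]
    return round(tau, 2)


for female in (False, True):
    for ethnicity in ("white", "African American", "Asian"):
        label = ("women" if female else "men") + ", " + ethnicity
        print(label, group_threshold(female, ethnicity))
```

The same rule gives −1.01 − 0.46 − 1.14 = −2.61 for Asian women, the lowest of the six thresholds computed here.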
For item 120, Fig. 2 displays the IRF for white men (solid line) and for the group that differs most from them: Asian women (dashed line). Dots are plotted at probability = .5 to indicate the value of τj for each of the other ten groups. The full IRF is not shown for the other groups to simplify the graph. Because MIMIC models do not permit group differences in aj, the IRF is the same shape for each group, just shifted over on the latent axis according to τj.
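The statement that the IRFs share a shape and differ only by a shift along the latent axis can be checked numerically. The sketch below assumes the logistic form P = 1/(1 + exp[−(ajθ − τj)]); the thresholds are computed from the estimates quoted for item 120, but the discrimination value is hypothetical because aj for this item is not quoted in the text.

```python
import math


def irf(theta, a, tau):
    """Probability of responding True, in the slope/threshold form (assumed)."""
    return 1.0 / (1.0 + math.exp(-(a * theta - tau)))


A_HYPOTHETICAL = 1.0                      # placeholder, not the study's estimate
TAU_WHITE_MEN = -1.01                     # quoted in the text
TAU_ASIAN_WOMEN = -1.01 - 0.46 - 1.14     # white-male threshold plus beta_1 and beta_3

# Horizontal displacement between the two curves on the latent axis.
shift = (TAU_ASIAN_WOMEN - TAU_WHITE_MEN) / A_HYPOTHETICAL

for theta in (-3.0, -1.0, 0.0, 2.0):
    p_group = irf(theta, A_HYPOTHETICAL, TAU_ASIAN_WOMEN)
    p_ref_shifted = irf(theta - shift, A_HYPOTHETICAL, TAU_WHITE_MEN)
    # Same curve, moved along the latent axis by `shift`.
    assert abs(p_group - p_ref_shifted) < 1e-12
    # A lower threshold means the item is more easily endorsed at every theta.
    assert p_group > irf(theta, A_HYPOTHETICAL, TAU_WHITE_MEN)
```

Because the shift is negative, the curve for Asian women lies to the left of the curve for white men, which matches the description of Fig. 2.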
Two items were assigned to the DIF-free subset for Exhibitionism. Results listed in Table 2 show that eight items displayed significant DIF. Details are given in Table 6. Men and women differed on four items, African American and white recruits differed on six items, and there was one Hispanic-white difference and one “other”-white difference. The mean level of Exhibitionism was significantly lower for women than men (γ1= −0.14, SE=.05) and for Asians versus whites (γ3= −0.36, SE=.14).
Four Disinhibition items were assigned to the DIF-free subset (see Table 3). Twenty-two items had significant DIF. As shown in Table 6, women differed from men on 17 items. Differences from whites were observed on 14 items for African Americans, seven items for “other”s, three items for Asians, and two items for Hispanics. The mean level of Disinhibition was significantly lower for female versus male recruits (γ1= −0.21, SE=.06) and for African American versus white recruits (γ2= −0.18, SE=.08).
The DIF-free subset consisted of eight items for Negative Temperament, and there were 10 D-F items (see Table 4). As apparent from Table 6, there were gender differences on seven items, and differences from whites for African Americans on five items, “other”s on five items, Asians on four items, and Hispanics on one item. Controlling for ethnicity, the factor mean was significantly greater for female versus male recruits (γ1=0.16, SE=.05).
Eight items were assigned to the DIF-free subset for Workaholism, and there were four D-F items (see Table 5). As shown in Table 6, African Americans differed from whites on all four items, and men differed from women on two items. Differences from whites were also observed for Asian recruits on two items and “other”s on two items. The mean level of workaholism was significantly greater for “other”s than whites (γ6=0.21, SE=.08).
This research illustrated how MIMIC models may be used to test for DIF with binary item response scales. The methodology (and software used here) also applies straightforwardly to items with ordinal (Likert-type) response scales. Hopefully, examples of methodology for testing DIF will help to increase the frequency with which researchers apply these procedures to other scales and samples in pursuit of eliminating measurement bias in psychology.
Future research with additional samples is needed because the pattern of DIF, and the parameter estimates, observed for Air Force recruits may not be the same for other samples. These Air Force recruits are similar to many college samples with respect to age and ethnicity, but more heterogeneous with respect to education and intelligence. There were more men in this sample than in many samples obtained through psychology participant pools. Also, there may be personality differences between college students and young adults who self-select into the Air Force. The present study focused on 5 SNAP scales that appeared to show differential test functioning in a previous analysis with the same data. With other samples, some of the other 9 SNAP trait and temperament scales or 13 SNAP diagnostic scales should be examined.
The relatively small sample sizes for specific ethnic minority groups were a limitation of this study and led to the choice of MIMIC modeling instead of IRT-LR-DIF. MIMIC modeling has advantages (mentioned earlier), but one disadvantage is that only uniform, not nonuniform, DIF is tested. Also, simulations have suggested that with focal-group sample sizes less than about 100, parameter estimates in models with many parameters (as in the final models used here) are likely to be less accurate than with larger focal groups (Woods, in press). A warranted aim for the future is to assess DIF on SNAP items with larger focal-group sample sizes, either with MIMIC modeling, or with IRT-LR-DIF, if possible.
Keeping in mind these limitations and qualifications, the present results suggested that some SNAP items functioned differently for different demographic groups. Although a person’s probability of endorsing an item should depend only on their level of the latent variable and qualities of the items, responses to some SNAP items also depended on gender, ethnicity, or both. The DIF effects for some items were very large; for Entitlement item 120, for example, the threshold for Asian recruits was 1.14 lower than for white recruits. These findings imply that scores on these five SNAP scales do not mean the same thing for all Air Force recruits.
D-F items have been observed on many psychological scales. Waller et al. (2000) found that many items on the Minnesota Multiphasic Personality Inventory (MMPI; Hathaway and McKinley 1940) functioned differently for black versus white respondents, and opined that “any omnibus inventory… is likely to contain numerous items that perform differently across various homogeneous groups” (p. 142). Perhaps this has happened because some instruments were written before methods for testing DIF were well developed or widely available, and many others were simply created without attention to the possibility of DIF.
One obvious strategy for eliminating DIF is to revise or delete D-F items on extant scales, and to routinely test for DIF when new measures are constructed. As part of this, it will be useful to understand causes of DIF. Surely group membership is a proxy for some other (probably continuous) variables. For example, in the present study, it is unclear why Asian women more readily reported having “many qualities others wish they had” such that this SNAP item was not as strongly indicative of Entitlement for them as it was for white men. Were Asian women actually more talented in some way? The present findings raise many questions of this sort that could be explored in future research.
Another way of managing DIF is to model it. For example, group mean differences on SNAP scales that have D-F items could be estimated with the DIF modeled using a MIMIC or multiple-group model. As in the final model fitted in the present study, parameters for D-F items would be estimated separately for each group, whereas parameters for invariant items would be held equal across groups. Such a model gives an estimate of the mean difference with DIF taken into consideration. Scores for each individual could be computed from these models as well. Although it might be ideal to have DIF-free instruments, modeling DIF can certainly help to reduce the proliferation of misleading results.
¹ Uniform DIF occurs when item thresholds differ between groups: An item is more easily endorsed for one group than the other. DIF is nonuniform if item discrimination also differs between groups; thus, the group difference depends on the level of the latent variable.
² Because each scale is evaluated for DIF separately from the other scales, it is not problematic for a SNAP item to be included on more than one scale.