J Psychopathol Behav Assess. Author manuscript; available in PMC 2010 May 3.

Published in final edited form as:

J Psychopathol Behav Assess. 2009; 31(4): 320–330.

doi: 10.1007/s10862-008-9118-9

PMCID: PMC2862492

NIHMSID: NIHMS197126

Carol M. Woods: cwoods@artsci.wustl.edu

This research provides an example of testing for differential item functioning (DIF) using multiple indicator multiple cause (MIMIC) structural equation models. True/False items on five scales of the Schedule for Nonadaptive and Adaptive Personality (SNAP) were tested for uniform DIF in a sample of Air Force recruits with groups defined by gender and ethnicity. Uniform DIF exists when an item is more easily endorsed for one group than the other, controlling for group mean differences on the variable under study. Results revealed significant DIF for many SNAP items and some effects were quite large. Differentially-functioning items can produce measurement bias and should be either deleted or modeled as if separate items were administered to different groups. Future research should aim to determine whether the DIF observed here holds for other samples.

Differential item functioning (DIF) occurs when an item on a test or questionnaire has different measurement properties for one group of people versus another, irrespective of group-mean differences on the variable under study. For example, if the Schedule for Nonadaptive and Adaptive Personality (SNAP; Clark 1996) item: “I enjoy work more than play” has gender DIF, the probability of responding “True” is different for men versus women even when they are matched on degree of workaholism. Men and women may have different mean levels of workaholism—this is separate from the issue of DIF. Detecting DIF is important because it can lead to inaccurate conclusions about group differences and invalidate procedures for making decisions about individuals.

Numerous methods have been proposed for identifying DIF (Camilli and Shepard 1994; Holland and Wainer 1993; Millsap and Everson 1993). For most methods, it is desirable to select a few DIF-free items to define the matching criterion that is used for testing the other items for DIF. For some methods, people are matched on summed scores (i.e., the sum of observed item scores); for others, people are matched on an estimate of the latent variable that underlies the item scores. The matching is likely to be more accurate with latent-variable methods because they account for measurement error in the items.

DIF testing using latent variables may be accomplished using a multiple-group model or a single-group model with covariates. For categorical item data, both of these models may be parameterized either as an item response model fitted to the data directly or as a confirmatory factor analysis (CFA) model fitted to a matrix of polychoric correlations. Multiple group models are usually parameterized as item response models (this method is often referred to as IRT-LR-DIF; Thissen et al. 1986; Thissen et al. 1988, 1993), and single-group models with covariates are a type of multiple indicator multiple cause (MIMIC) structural equation model. Woods (in press) contrasts these two latent-variable approaches and compares them in simulations.

The present research is an application of the MIMIC-model approach, illustrated in Fig. 1 (this figure is further discussed below). Muthén (e.g., 1985, 1988, 1989) popularized the use of MIMIC models to test for DIF using estimation methods appropriate for categorical data (see also MacIntosh and Hashim 2003; Muthén et al. 1991). A simple MIMIC model has one latent variable (i.e., factor) regressed on an observed grouping variable to permit group mean differences on the factor. An item is tested for DIF by regressing it (i.e., responses to it) on the grouping variable. There is evidence of differential functioning if group membership significantly predicts item responses, controlling for group mean differences on the factor.

Fig. 1. Example MIMIC model permitting DIF for only item 2. γ_{k} = regression coefficient showing the group mean difference on the factor for covariate *k*; β_{jk} = regression coefficient showing the group difference in the item threshold for item ...

The MIMIC approach has several advantages (some of which are shared by IRT-LR-DIF). Matching is based on the latent variable which is likely to be more accurate than a summed score. Multidimensional items, or multiple factors, are easily modeled. Good software is available for estimating the models with methods designed for categorical data (e.g., Mplus; Muthén and Muthén 2007). It is easy to examine DIF for more than two groups at once and to control for additional covariates when testing for DIF. Covariates may be continuous or categorical. Separate item parameter estimates for each group are not a direct byproduct of the analysis (as with IRT-LR-DIF), but they are easily calculated as a function of the regression coefficients.

A disadvantage of MIMIC models is that they test for uniform, but not nonuniform^{1}, DIF (see definitions in Mellenbergh 1989 and Camilli and Shepard 1994, p. 59). In the present study, the MIMIC approach was used despite this disadvantage because an important grouping variable in the data, ethnicity, has six levels, and the sample sizes are small in some groups (e.g., *N*s=17, 68, 75). Muthén (1989) pointed out that MIMIC models perform well with smaller within-group sample sizes than multiple-group models, and more easily accommodate more than two groups. Based on simulations with binary item data, Woods (in press) concluded that DIF tests were more accurate with MIMIC models than IRT-LR-DIF when the focal-group *N* was small (e.g., around 25, 50, or 100).
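The distinction between uniform and nonuniform DIF can be made concrete with a small numerical sketch. It assumes a logistic item response function with a threshold τ and a discrimination *a* (the form used for estimation in this study); the function names and all parameter values are hypothetical and chosen only for illustration.

```python
import math

def endorse_prob(theta, a, tau):
    """Probability of a "True" response under a logistic item response function."""
    return 1.0 / (1.0 + math.exp(tau - a * theta))

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

# Hypothetical parameter values (not estimates from the SNAP data).
reference = {"a": 1.2, "tau": 0.5}
uniform_focal = {"a": 1.2, "tau": -0.3}     # only the threshold differs
nonuniform_focal = {"a": 0.6, "tau": -0.3}  # the discrimination differs as well

def logit_gap(focal, theta):
    """Focal-minus-reference difference in log-odds at a matched latent level."""
    return (logit(endorse_prob(theta, **focal))
            - logit(endorse_prob(theta, **reference)))

thetas = [-2.0, 0.0, 2.0]
uniform_gaps = [logit_gap(uniform_focal, t) for t in thetas]
nonuniform_gaps = [logit_gap(nonuniform_focal, t) for t in thetas]

print(uniform_gaps)     # the same gap at every theta
print(nonuniform_gaps)  # the gap changes with theta
```

With uniform DIF the group difference in log-odds is the same at every level of the latent variable, whereas with nonuniform DIF it varies with that level and can even change sign. Only the first pattern is detectable with the MIMIC approach used here.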

The present study further investigates the five SNAP scales that showed differential test functioning (DTF) in previous research by testing individual items for DIF. Woods et al. (2008) found that five subscales of the SNAP functioned differently for Air Force recruits depending on their gender, ethnicity, or both. All items were considered simultaneously in this previous study so this was DTF rather than DIF. MIMIC-modeling is preferable to IRT-LR-DIF for these data because some of the focal-group sample sizes are small.

The data are the same as those analyzed by Woods et al. (2008). The sample consisted of 2,026 Air Force recruits (1,265 male, 761 female) completing basic military training at Lackland Air Force Base in San Antonio, Texas. Most were between 18 and 25 years old (*Mdn*=19). They self-identified as Caucasian (1,305), African-American (348), Hispanic (75), Asian (68), or Native American (17), with the remaining 213 classified as “other.” Oltmanns and Turkheimer (2006) describe details about the sample and data collection procedures.

The SNAP is a factor-analytically-derived self-report questionnaire that was originally developed as a tool for the assessment of personality disorders in terms of trait dimensions (Clark 1996). The 375 True/False items can be organized into three basic temperament scales (Negative Temperament, Positive Temperament, and Disinhibition) and 12 trait scales (Mistrust, Manipulativeness, Aggression, Self-harm, Eccentric Perceptions, Dependency, Exhibitionism, Entitlement, Detachment, Impulsivity, Propriety, and Workaholism). Alternatively, the items may be clustered into 13 diagnostic scales, corresponding to personality disorder categories presented in the Diagnostic and Statistical Manual of Mental Disorders (DSM-III-R; American Psychiatric Association 1987). Six validity scales identify individuals who have produced scores indicating response biases, careless/defensive responding, or deviance (Clark 1996; Simms and Clark 2006). Although it would be interesting and potentially valuable to evaluate DIF for all SNAP scales, the present investigation is limited to a subset of the scales that showed DTF in previous research: Three trait scales (Entitlement, Exhibitionism, and Workaholism), and two temperament scales (Disinhibition and Negative Temperament).

All MIMIC models used here were similar to the example shown in Fig. 1 (particular models are described in the next section). This is a standard one-factor^{2} CFA model plus observed covariates: The factor is regressed on dummy-coded indicators of gender (male = 0, female = 1) and ethnicity. The latent scale is identified by fixing the residual variance of the factor, ζ, to 1. There are five binary indicators of ethnicity, with white as the reference group. Because there were only 17 Native Americans, this category was used only for SNAP scales with fewer than 17 items (Entitlement and Exhibitionism). For the other scales (Disinhibition, Workaholism, and Negative Temperament), Native Americans were combined with the “other” ethnic group. Testing items for DIF involves regressing them on all of the grouping variables. The model in Fig. 1 permits DIF for item 2 while assuming all other items are DIF free.
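Written out, the model just described can be sketched as follows. The notation is assembled from the symbols in Fig. 1 together with a logistic item response function; *x_{ik}* (the dummy-coded covariate *k* for person *i*) is introduced here for illustration, and the sign attached to β_{jk} follows a common convention for direct effects that is assumed rather than taken from the article.

$$\begin{aligned}
\theta_{i} &= \sum_{k}\gamma_{k}x_{ik}+\zeta_{i}, \qquad \mathrm{Var}(\zeta_{i})=1,\\
\Pr(u_{ij}=1\mid \theta_{i},\mathbf{x}_{i}) &= \frac{1}{1+\exp\left[\tau_{j}-a_{j}\theta_{i}-\sum_{k}\beta_{jk}x_{ik}\right]}.
\end{aligned}$$

For an item assumed to be DIF-free, every β_{jk} is fixed to 0. Under this sign convention, the effective threshold for a group with covariate pattern **x** would be τ_{j} − Σ_{k} β_{jk}x_{k}, which is one way group-specific item parameters can be obtained from the regression coefficients.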

All analyses were carried out using Mplus (version 4.21, Muthén and Muthén 2007). All models were parameterized as two-parameter logistic item response models and fitted to the data using the robust maximum likelihood estimator “MLR”. The Mplus parameterization is:

$$\Pr(u_{ij}=1\mid \theta )=\frac{1}{1+\exp[\tau_{j}-a_{j}\theta ]}, \tag{1}$$

where *u_{ij}* is a response given by person

$$\Pr(u_{ij}=1\mid \theta )=\frac{1}{1+\exp[-a_{j}(\theta -b_{j})]}=\frac{1}{1+\exp[a_{j}b_{j}-a_{j}\theta ]}. \tag{2}$$

Nevertheless, τ_{j} is just a rescaled version of
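Comparing the exponents of Eqs. 1 and 2 term by term gives τ_{j} = *a_{j}b_{j}*. The following minimal sketch checks that relation numerically; the function names and parameter values are hypothetical.

```python
import math

def prob_threshold_form(theta, a, tau):
    """Eq. 1: logistic response function written with a threshold tau."""
    return 1.0 / (1.0 + math.exp(tau - a * theta))

def prob_ab_form(theta, a, b):
    """Eq. 2: the same function written with discrimination a and location b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tau_from_b(a, b):
    """Convert the (a, b) parameterization to a threshold: tau = a * b."""
    return a * b

def b_from_tau(a, tau):
    """Convert a threshold back to b: b = tau / a."""
    return tau / a

a, b = 1.5, 0.4  # hypothetical item parameters
tau = tau_from_b(a, b)

# Both forms give the same probability at any latent level.
for theta in (-1.0, 0.0, 0.4, 2.0):
    assert math.isclose(prob_threshold_form(theta, a, tau),
                        prob_ab_form(theta, a, b))
```

Because the two forms are equivalent, estimates reported in one parameterization can be converted to the other without refitting the model.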

The following procedures were repeated for each of the five SNAP scales. First, DIF-free items were identified empirically; the remaining items are *studied items*. Second, each studied item was individually tested for DIF. Third, a final model was constructed which permitted group variance in τ_{j} for all differentially-functioning (D-F) items. Estimates of discrimination parameters (

If the DIF status of all items is unknown prior to a MIMIC analysis, it seems desirable to fit a model supposing all items have DIF: All items would be regressed on the grouping variables. Unfortunately, such a model is not identified. There is also a conceptual problem because at least one DIF-free item is needed to define the factor on which the groups are matched. Therefore, preliminary analyses were performed to select a subset of DIF-free items to define the factor in subsequent analyses. Every item was tested for DIF with all other items presumed DIF-free. This was accomplished by regressing one item at a time on all of the grouping variables. The model in Fig. 1 illustrates this type of test for item 2. Item *j* was assigned to the DIF-free subset if *a_{j}* was at least .5 and all β

The assumption that all other items are DIF-free is increasingly incorrect for scales with more DIF. However, previous simulation studies indicate that the error produced by violation of this assumption is inflated Type I error (Finch 2005; Stark et al. 2006; Wang 2004; Wang and Yeh 2003). In the present context, inflated Type I error means some DIF-free items will appear to have DIF and not be selected for the DIF-free subset. That is not particularly problematic. All items not included in the DIF-free subset are subsequently tested for DIF, so if an item really is DIF-free but excluded from the DIF-free subset initially, researchers are still likely to conclude it is DIF-free based on the subsequent test (which should be nonsignificant).

Items not assigned to the DIF-free subset (studied items) were tested individually for DIF using likelihood ratio (LR) difference tests for nested models. The LR statistic is −2 times the difference in log likelihoods, and follows a χ^{2} distribution with *df* equal to the difference in the number of estimated parameters. Also, with the Mplus “MLR” estimator, the LR statistic must be divided by a term that is a function of the number of estimated parameters in each model and the scaling correction factors given by Mplus. This was carried out as shown in an example given on the Mplus website (http://www.statmodel.com/chidiff.shtml).
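The scaled difference test can be sketched in a few lines. The formula below is the log-likelihood version described on the Mplus web page cited above, where each model contributes its log likelihood, its scaling correction factor, and its number of estimated parameters. The function names and the input values in the example are hypothetical, and the chi-square tail probability is computed with a small standard-library helper rather than a statistics package.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from a closed form, then step the shape parameter up to df / 2
    # using Q(s + 1, y) = Q(s, y) + y**s * exp(-y) / Gamma(s + 1).
    if df % 2 == 0:
        s, q = 1.0, math.exp(-y)
    else:
        s, q = 0.5, math.erfc(math.sqrt(y))
    while s < df / 2.0:
        q += y ** s * math.exp(-y) / math.gamma(s + 1.0)
        s += 1.0
    return q

def scaled_lr_test(ll_nested, c_nested, p_nested, ll_full, c_full, p_full):
    """Scaled LR difference test for two nested models fitted with MLR.

    ll_* are log likelihoods, c_* scaling correction factors, and p_*
    numbers of estimated parameters (nested = more constrained model).
    """
    df = p_full - p_nested
    cd = (p_nested * c_nested - p_full * c_full) / (p_nested - p_full)
    trd = -2.0 * (ll_nested - ll_full) / cd
    return trd, df, chi2_sf(trd, df)

# Hypothetical output values from two fitted models.
stat, df, p_value = scaled_lr_test(-5010.0, 1.05, 40, -5000.0, 1.04, 46)
print(round(stat, 2), df, round(p_value, 4))
```

The divisor `cd` plays the role of the "term" mentioned above: it combines the scaling correction factors and parameter counts of the two models, so the unscaled statistic cannot simply be referred to a χ^{2} table when MLR is used.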

To test studied item *j* for DIF, a full model was compared to a more constrained model. In both the full and constrained models, all of the original items from the scale were used, and items assigned to the DIF-free subset were not regressed on any grouping variables. In the full model, all studied items were permitted to have DIF (i.e., all studied items were regressed on all grouping variables). In the constrained model, invariance was presumed for item *j* (i.e., item *j* was not regressed on any grouping variables). A significant difference between these models indicates that fit significantly declines if item *j* is assumed DIF-free. Therefore, item *j* has DIF.
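The pair of models compared for one studied item can be laid out as a simple data structure. This is only a bookkeeping sketch under assumed names (the item labels, covariate labels, and the function itself are hypothetical); it does not fit any model, but it shows which regressions distinguish the full model from the constrained one.

```python
def dif_model_specs(items, dif_free, covariates, item_j):
    """Map each item to the covariates it is regressed on in each model.

    Returns (full, constrained, df), where df is the number of regression
    coefficients removed when item_j is assumed DIF-free.
    """
    if item_j not in items or item_j in dif_free:
        raise ValueError("item_j must be a studied item on this scale")
    # Full model: every studied item is regressed on every grouping variable;
    # items in the DIF-free subset are regressed on none.
    full = {i: ([] if i in dif_free else list(covariates)) for i in items}
    # Constrained model: identical, except item_j has no direct effects.
    constrained = {i: list(c) for i, c in full.items()}
    constrained[item_j] = []
    df = len(covariates)
    return full, constrained, df

# Hypothetical four-item scale with one DIF-free item.
items = ["item1", "item2", "item3", "item4"]
covariates = ["female", "african_american", "asian", "hispanic", "other"]
full, constrained, df = dif_model_specs(items, {"item1"}, covariates, "item3")
print(df)  # one coefficient per grouping variable is dropped for item3
```

Because the two models differ only in the direct effects for item *j*, the difference in the number of estimated parameters equals the number of grouping variables.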

An alternative approach is to compare a model that presumes no DIF in any item to a model that permits DIF for studied item *j*. This was not done because the LR statistic follows a χ^{2} distribution more closely when the baseline model fits the data as closely as possible. A model presuming no DIF in any item is probably rather far from reality. Stark et al. (2006) recently discussed this issue in the context of multiple-group DIF testing.

To control the false discovery rate, the Benjamini and Hochberg (1995) procedure was applied within each SNAP scale (see also Thissen et al. 2002; Williams et al. 1999). The MULTTEST procedure in SAS was used to obtain Benjamini-Hochberg adjusted *p*-values for the LR statistics; these adjusted values, rather than the raw *p*-values, were compared to *α*=.05.
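For readers without access to SAS, the adjustment can be sketched directly. The function below computes step-up adjusted *p*-values in the usual Benjamini-Hochberg way; the function name and the example *p*-values are hypothetical, and the sketch is not a reproduction of the SAS output.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Work from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values for four items
adj = benjamini_hochberg(raw)
flagged = [p < 0.05 for p in adj]
print(adj, flagged)
```

Each adjusted value is compared to .05 in place of the raw *p*-value, so an item is flagged only if it remains significant after accounting for the number of tests on its scale.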

A final MIMIC model was constructed for each SNAP scale, in which only items that showed significant DIF were regressed on the grouping variables. The factor was also regressed on the grouping variables. The final model provides estimates of *a_{j}*, τ

One τ_{j} will be reported for each item. For items without DIF, this τ

The group mean level of Entitlement was significantly larger for African American versus white recruits (γ_{2}=0.52, SE=.06) and for Hispanic versus white recruits (γ_{4}=0.38, SE=.15). As shown in Table 1, one Entitlement item (number 49) qualified for assignment to the DIF-free subset; all others were tested for DIF. Table 1 lists the items (ordered by LR statistic), with the LR statistic, raw *p*-value, and Benjamini-Hochberg adjusted *p*-value (*p_{BH}*). Four items printed in bold type have uniform DIF (

For D-F items, group differences in τ_{j} (i.e., β

Focusing on item 120 (“I have many qualities others wish they had”) as an example, τ_{j} may be computed for each group. For African American male recruits, β

For item 120, Fig. 2 displays the IRF for white men (solid line) and for the group that differs most from them: Asian women (dashed line). Dots are plotted at probability = .5 to indicate the value of τ_{j} for each of the other ten groups. The full IRF is not shown for the other groups to simplify the graph. Because MIMIC models do not permit group differences in

Two items were assigned to the DIF-free subset for Exhibitionism. Results listed in Table 2 show that eight items displayed significant DIF. Details are given in Table 6. Men and women differed on four items, African American and white recruits differed on six items, and there was one Hispanic-white difference and one “other”-white difference. The mean level of Exhibitionism was significantly lower for women than men (γ_{1}= −0.14, SE=.05) and for Asians versus whites (γ_{3}= −0.36, SE=.14).

Four Disinhibition items were assigned to the DIF-free subset (see Table 3). Twenty-two items had significant DIF. As shown in Table 6, women differed from men on 17 items. Differences from whites were observed on 14 items for African Americans, seven items for “other”s, three items for Asians, and two items for Hispanics. The mean level of Disinhibition was significantly lower for female versus male recruits (γ_{1}= −0.21, SE=.06) and for African American versus white recruits (γ_{2}= −0.18, SE=.08).

Table 4. Item parameter estimates and tests for differential item functioning: Negative Temperament (28 items)

The DIF-free subset consisted of eight items for Negative Temperament, and there were 10 D-F items (see Table 4). As is apparent from Table 6, there were gender differences on seven items, and differences from whites for African Americans on five items, “other”s on five items, Asians on four items, and Hispanics on one item. Controlling for ethnicity, the factor mean was significantly greater for female versus male recruits (γ_{1}=0.16, SE=.05).

Eight items were assigned to the DIF-free subset for Workaholism, and there were four D-F items (see Table 5). As shown in Table 6, African Americans differed from whites on all four items, and men differed from women on two items. Differences from whites were also observed for Asian recruits on two items and “other”s on two items. The mean level of workaholism was significantly greater for “other”s than whites (γ_{6}=0.21, SE=.08).

This research illustrated how MIMIC models may be used to test for DIF with binary item response scales. The methodology (and the software used here) also applies straightforwardly to items with ordinal (Likert-type) response scales. Worked examples of DIF testing such as this one may encourage researchers to apply these procedures to other scales and samples in pursuit of eliminating measurement bias in psychology.

Future research with additional samples is needed because the pattern of DIF, and the parameter estimates, observed for Air Force recruits may not be the same for other samples. These Air Force recruits are similar to many college samples with respect to age and ethnicity, but more heterogeneous with respect to education and intelligence. There were more men in this sample than in many samples obtained through psychology participant pools. Also, there may be personality differences between college students and young adults who self-select into the Air Force. The present study focused on 5 SNAP scales that appeared to show differential test functioning in a previous analysis with the same data. With other samples, some of the other 9 SNAP trait and temperament scales or 13 SNAP diagnostic scales should be examined.

The relatively small sample sizes for specific ethnic minority groups were a limitation of this study and led to the choice of MIMIC modeling instead of IRT-LR-DIF. MIMIC modeling has advantages (mentioned earlier), but one disadvantage is that only uniform DIF, not nonuniform DIF, is tested. Also, simulations have suggested that with focal-group sample sizes less than about 100, parameter estimates in models with many parameters (as in the final models used here) are likely to be less accurate than with larger focal groups (Woods, in press). A warranted aim for the future is to assess DIF on SNAP items with larger focal-group sample sizes, either with MIMIC modeling or, if possible, with IRT-LR-DIF.

Keeping in mind these limitations and qualifications, the present results suggested that some SNAP items functioned differently for different demographic groups. Although a person’s probability of endorsing an item should depend only on their level of the latent variable and qualities of the items, responses to some SNAP items also depended on gender, ethnicity, or both. The DIF effects for some items were quite large. These findings imply that scores on these five SNAP scales do not mean the same thing for all Air Force recruits.

D-F items have been observed on many psychological scales. Waller et al. (2000) found that many items on the Minnesota Multiphasic Personality Inventory (MMPI; Hathaway and McKinley 1940) functioned differently for black versus white respondents, and opined that “any omnibus inventory… is likely to contain numerous items that perform differently across various homogeneous groups” (p. 142). Perhaps this has happened because some instruments were written before methods for testing DIF were well developed or widely available, and many others were simply created without attention to the possibility of DIF.

One obvious strategy for eliminating DIF is to revise or delete D-F items on extant scales, and to routinely test for DIF when new measures are constructed. As part of this, it will be useful to understand causes of DIF. Surely group membership is a proxy for some other (probably continuous) variables. For example, in the present study, it is unclear why Asian women more readily reported having “many qualities others wish they had” such that this SNAP item was not as strongly indicative of Entitlement for them as it was for white men. Were Asian women actually more talented in some way? The present findings raise many questions of this sort that could be explored in future research.

Another way of managing DIF is to model it. For example, group mean differences on SNAP scales that have D-F items could be estimated with the DIF modeled using a MIMIC or multiple-group model. As in the final model fitted in the present study, parameters for D-F items would be estimated separately for each group, whereas parameters for invariant items would be held equal across groups. Such a model gives an estimate of the mean difference with DIF taken into consideration. Scores for each individual could be computed from these models as well. Although it might be ideal to have DIF-free instruments, modeling DIF can certainly help to reduce the proliferation of misleading results.

^{1}*Uniform* DIF occurs when item thresholds differ between groups: An item is more easily endorsed for one group than the other. DIF is *nonuniform* if item discrimination also differs between groups; thus, the group difference depends on the level of the latent variable.

^{2}Because each scale is evaluated for DIF separately from the other scales, it is not problematic for a SNAP item to be included on more than one scale.

- American Psychiatric Association. Diagnostic and statistical manual of mental disorders. 3. Washington, DC: Author; 1987.
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B. 1995;57:289–300.
- Birnbaum A. Some latent trait models. In: Lord FM, Novick MR, editors. Statistical theories of mental test scores. Reading, MA: Addison-Wesley; 1968. pp. 395–479.
- Camilli G, Shepard LA. Methods for identifying biased test items. Thousand Oaks, CA: Sage; 1994.
- Clark L. SNAP Manual for administration, scoring, and interpretation. Minneapolis: University of Minnesota Press; 1996.
- Finch H. The MIMIC model as a method for detecting DIF: comparison with Mantel-Haenszel, SIBTEST, and the IRT likelihood ratio. Applied Psychological Measurement. 2005;29:278–295.
- Hathaway SR, McKinley JC. A multiphasic personality schedule (Minnesota): I. Construction of the schedule. Journal of Psychology. 1940;10:249–254.
- Holland PW, Wainer H. Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum; 1993.
- MacIntosh R, Hashim S. Variance estimation for converting MIMIC model parameters to IRT parameters in DIF analysis. Applied Psychological Measurement. 2003;27:372–379.
- Mellenbergh GJ. Item bias and item response theory. International Journal of Educational Research. 1989;13:127–143.
- Millsap RE, Everson HT. Methodology review: statistical approaches for assessing measurement bias. Applied Psychological Measurement. 1993;17:297–334.
- Muthén BO. A method for studying the homogeneity of test items with respect to other relevant variables. Journal of Educational Statistics. 1985;10:121–132.
- Muthén BO. Some uses of structural equation modeling in validity studies: Extending IRT to external variables. In: Wainer H, Braun HI, editors. Test Validity. Hillsdale, NJ: Lawrence Erlbaum; 1988. pp. 213–238.
- Muthén BO. Latent variable modeling in heterogeneous populations. Psychometrika. 1989;54:557–585.
- Muthén BO, Kao C, Burstein L. Instructionally sensitive psychometrics: an application of a new IRT-based detection technique to mathematics achievement test items. Journal of Educational Measurement. 1991;28:1–22.
- Muthén LK, Muthén BO. Mplus: Statistical Analysis with Latent Variables, (Version 4.21) [Computer software] Los Angeles, CA: Muthén & Muthén; 2007.
- Oltmanns TF, Turkheimer E. Perceptions of self and others regarding pathological personality traits. In: Krueger RF, Tackett J, editors. Personality and psychopathology: Building bridges. New York: Guilford; 2006.
- Simms LJ, Clark LA. Differentiating Normal & Abnormal Personality. New York: Springer; 2006. Chapter 17: The schedule for nonadaptive and adaptive personality (SNAP): A dimensional measure of traits relevant to personality and personality pathology.
- Stark S, Chernyshenko OS, Drasgow F. Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology. 2006;91:1291–1306.
- Thissen D, Steinberg L, Gerrard M. Beyond group-mean differences: The concept of item bias. Psychological Bulletin. 1986;99:118–128.
- Thissen D, Steinberg L, Wainer H. Use of item response theory in the study of group difference in trace lines. In: Wainer H, Braun H, editors. Test validity. Hillsdale, NJ: Erlbaum; 1988. pp. 147–169.
- Thissen D, Steinberg L, Wainer H. Detection of differential item functioning using the parameters of item response models. In: Holland PW, Wainer H, editors. Differential item functioning. Hillsdale, NJ: Erlbaum; 1993. pp. 67–111.
- Thissen D, Steinberg L, Kuang D. Quick and easy implementation of the Benjamini-Hochberg procedure for controlling the false positive rate in multiple comparisons. Journal of Educational and Behavioral Statistics. 2002;27:77–83.
- Waller NG, Thompson JS, Wenk E. Using IRT to separate measurement bias from true group differences on homogeneous and heterogeneous scales: An illustration with the MMPI. Psychological Methods. 2000;5:125–146.
- Wang W. Effects of anchor item methods on detection of differential item functioning within the family of Rasch models. The Journal of Experimental Education. 2004;72:221–261.
- Wang W, Yeh Y. Effects of anchor item methods on differential item functioning detection with the likelihood ratio test. Applied Psychological Measurement. 2003;27:479–498.
- Williams VSL, Jones LV, Tukey JW. Controlling error in multiple comparisons, with examples from state-to-state differences in educational achievement. Journal of Educational and Behavioral Statistics. 1999;24:42–69.
- Woods CM. Evaluation of MIMIC-model methods for DIF testing with comparison to two-group analysis. Multivariate Behavioral Research. In press.
- Woods CM, Oltmanns TF, Turkheimer E. Detection of aberrant responding on a personality scale in a military sample: An application of evaluating person fit with two-level logistic regression. Psychological Assessment. 2008;20:159–168.
