Examination of the equivalence of measures involves several levels, including conceptual equivalence of meaning, as well as quantitative tests of differential item functioning (DIF). The purpose of this review is to examine DIF in patient-reported outcomes. Reviewed were measures of self-reported depression, quality of life (QoL) and general health. Most measures of depression contained large amounts of DIF, and the impact of DIF at the scale level was typically sizeable. The studies of QoL and health measures identified a moderate amount of DIF; however, many of these studies examined only one type of DIF (uniform). Relative to DIF analyses of depression measures, less analysis of the impact of DIF on QoL and health measures was performed, and the authors of these analyses generally did not recommend remedial action, with one notable exception. While these studies represent good beginning efforts to examine measurement equivalence in patient-reported outcome measures, more cross-validation work is required using other (often larger) samples of different ethnic and language groups, as well as other methods that permit more extensive analyses of the type of DIF, together with magnitude and impact.
A series of cross-national studies of mental health conditions among community residents, conducted in the 1960s and 1970s (e.g., Gurland, Fleiss, Cooper, Kendell & Simon, 1969), identified differences among countries in terms of prevalence rates. A question arose regarding the reasons for the differences, and whether measurement bias could be partially responsible. One result of these studies was the development of standardized methods for assessment of physical and mental health states (e.g., Copeland et al., 1976; Golden, Teresi, & Gurland, 1984; Gurland et al., 1972; Spitzer, Fleiss, Burdock, & Hardesty, 1964). More recently, these efforts at standardization have resulted in initial attempts (e.g., Bode et al., 2006; Fliege et al., 2005; Hahn, Cella, Bode, Gershon, & Lai, 2006; Lai, Cella, Chang, Bode, & Heinemann, 2003; Lai et al., 2005; Reeve et al., 2007; Ware et al., 2003) to develop item banks that can be used in computerized adaptive testing (CAT). Because relatively small subsets of items are used to establish health status in CAT, it is necessary to ensure that the item bank is acceptable from the perspective of measurement equivalence.
At the same time, there has been growing concern about disparities in access to and delivery of health services, possibly resulting in differential outcomes associated with care (Bloche, 2004; National Research Council, 2004; Smedley, Stith & Nelson (Eds.), 2003; Steinbrook, 2004). Health care decisions are often made based on assessments of the health status of individuals; however, as illustrated in an edited volume of reviews (Skinner, Teresi, Holmes, Stahl & Stewart, 2001), evidence of the cultural equivalence of health-related measures is sparse. The major goal of a recently published special issue of Medical Care (Teresi, Stewart, Morales & Stahl, 2006) was to provide state-of-the-art overviews of both qualitative and quantitative methods that can be used to examine measurement equivalence. A major quantitative method for the examination of cultural equivalence is differential item functioning (DIF). Because of the increasing diversity observed in many societies, such analyses are becoming central to measurement development and evaluation. The purpose of this review article is to summarize the findings with respect to DIF in measures of self-reported depression, quality of life and general health. Detailed reviews are presented in an accompanying table.
Although examination of the equivalence of measures involves several levels, including conceptual equivalence of meaning, this review will focus narrowly on methods and results based on analysis of DIF. In addition to focusing only on quantitative methods, excluded from formal review in the accompanying table are analyses of factorial invariance, which is another method for examining measurement equivalence. A discussion of the similarities and differences of these two approaches (factorial invariance and DIF analyses) is beyond the scope of this review, but is summarized in several articles (McDonald, 2000; Meade & Lautenschlager, 2004; Mellenbergh, 1994; Millsap & Everson, 1993; Raju, Laffitte & Byrne, 2002; Reise, Widaman & Pugh, 1993; Takane & De Leeuw, 1987; Teresi, 2006a). The focus of this article will be on issues critical to the examination of DIF, as well as a review of DIF analyses in patient-reported outcomes in selected areas. Definitions of the concepts used within this article will be provided as a means of orienting the reader to the topic.
DIF involves the evaluation of conditional relationships between item response and group membership. Groups should be selected for study based on theoretical considerations that include whether or not the construct studied is hypothesized to have the same conceptual meaning across groups. For example, if the construct studied is a specific type of pain, clinical experts should decide if this is best measured by a disease-specific or generic scale. If disease-specific, it makes little sense to study that scale for DIF with respect to different disease groups because the construct itself was intended to be different across groups. On the other hand, if a theoretical argument can be advanced that a scale, e.g., a health-related quality of life (HRQoL) subscale, should measure the same unidimensional construct across groups that differ in education, literacy or type of disease, then the scale should be studied to ensure that DIF is of low magnitude.
As an illustration of a definition of DIF: a randomly-selected person of low literacy with low perceived HRQoL should have the same chance of responding in the low HRQoL direction to an item measuring HRQoL as would a randomly selected individual also with low HRQoL, but who is of high literacy. For this example, uniform DIF indicates that the DIF is in the same direction across the HRQoL continuum, while non-uniform DIF means that the direction of DIF is different, depending on the level of HRQoL.
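The distinction between uniform and non-uniform DIF can be made concrete with item response curves. The sketch below uses hypothetical two-parameter logistic (2PL) item parameters, not values from any study reviewed here: a between-group difference in difficulty (b) shifts the endorsement probability in the same direction at every trait level (uniform DIF), while a between-group difference in discrimination (a) produces a gap that changes sign across the continuum (non-uniform DIF).

```python
import math

def p_2pl(theta, a, b):
    """2PL item response function: probability of endorsing the item
    for a person at trait level theta, given discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical parameters: uniform DIF = groups differ only in difficulty b;
# non-uniform DIF = groups differ in discrimination a, so the gap changes sign.
for theta in (-2.0, 0.0, 2.0):
    uniform_gap = p_2pl(theta, 1.0, 0.0) - p_2pl(theta, 1.0, 0.5)
    nonuniform_gap = p_2pl(theta, 1.5, 0.0) - p_2pl(theta, 0.7, 0.0)
    print(theta, round(uniform_gap, 3), round(nonuniform_gap, 3))
```

Printing the gaps at low, middle and high theta shows the uniform gap keeping one sign throughout, while the non-uniform gap is negative at low theta and positive at high theta.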
Magnitude of DIF refers to the degree of DIF, and can be measured by examining parameters or statistics associated with the method, for example, the odds ratio, beta coefficient or increment in R-square associated with the DIF term for the studied item. An important point is that in mental and physical health assessment, the pool of items is limited so that items cannot be discarded as easily as in educational testing. Because DIF detection methodologies are influenced by sample size, many items may show significant DIF for at least one comparison, even after adjustment for multiple group comparisons; however, such statistically significant DIF may not be clinically meaningful. It is thus critical to assess the magnitude of DIF in order to determine its saliency. Internal impact goes beyond the item level to determine the impact of DIF on the entire measure or scale. Impact can be assessed at the aggregate level by examining the relationship between the expected scale score and the disability or quality of life estimate; for example, how much do mean group differences in total score distributions change with and without inclusion of the items with DIF? Another example is the impact of DIF on the relationships of demographic characteristics with health variables (Crane, Gibbons, Jolley & Van Belle, 2006; Fleishman & Lawrence, 2003; Fleishman, Spector & Altman, 2002; Morales, Flowers, Gutierrez, Kleinman & Teresi, 2006; Teresi, Cross & Golden, 1989). DIF may also influence the relationship between patient-reported health variables and predicted outcomes such as access to care, functional decline and morbidity. This latter relationship has been referred to as external impact, predictive validity or predictive scale bias, and may be examined in terms of predictive values and regression coefficients. The impact measures just described are all at the aggregate or group rather than individual level. The impact on specific individuals can also be examined.
One method for DIF adjustment is the removal of items that contribute to overall DIF; this is not necessarily the best procedure because most analyses have been of pre-existing, relatively short scales. Item removal can thus result in change of meaning of the construct, imbalance of severe and less severe indicators, and lowered reliability (see Teresi, 2006b; Hambleton, 2006). Moreover, some items may show DIF cancellation, and their removal could bias a test because some items favor one group and others favor the other. Borsboom, Mellenbergh and Van Heerden (2002) and Borsboom (2006) discuss the idea that within-group comparison leads to absolute rather than relative bias; items showing absolute (but not relative) bias may not need to be removed. On the other hand, in large-scale item banking projects, in which the goal is to construct relatively DIF-free item sets, removal of items with a large magnitude of consistently identified DIF may be warranted. When selection and treatment decisions are based on individual person-assessments, the presence of DIF in the measure can result in bias and negative impact. These decisions must be balanced in the context of a conceptual map drawn by content experts who help to determine the relative salience of each item for the intended construct.
Presented briefly are several methods for DIF detection that were used in the articles reviewed below. Methods can be categorized broadly as parametric or non-parametric, and differ in terms of whether the conditioning patient-reported outcome variable is based on a latent variable or observed score.
Non-parametric contingency table approaches include the Mantel-Haenszel chi-square method (M-H; see Holland & Thayer, 1988; Dorans & Holland, 1993). The M-H method examines whether the odds of a symptomatic response within each score group on the measure are the same across groups. This method was used by two authors reviewed in Table 1 (Azocar, Areán, Miranda & Muñoz, 2001; Cole, Kawachi, Maller & Berkman, 2000) to examine depression measures. An extension of this method using a variant of the gamma statistic (Goodman & Kruskal, 1954) for polytomous responses was used by Bjorner and colleagues (1998) to examine the SF-36 Health Survey (Ware, Gandek & The IQOLA Project Group, 1994), and by Groenvold and colleagues (1995). Using the M-H method, a common odds ratio (which tests whether or not the likelihood of item symptom response is the same across disability groups) also can be used to construct a DIF magnitude measure. Odds are converted to log odds and various transformations provide interpretable magnitude measures.
An advantage of such methods is that few assumptions are required. However, similar to the Rasch model (Rasch, 1980; described below), this method may not be optimal if the discrimination parameters vary across groups (see Bock, 1993). Moreover, the M-H method uses an observed score treated as a categorical variable rather than the theoretically preferred latent conditioning variable.
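As a minimal sketch of the M-H computation (with invented counts, not data from any study reviewed), the common odds ratio pools the 2×2 tables formed within each score stratum; the −2.35 log transform to the ETS delta scale shown here is one conventional magnitude measure borrowed from educational testing.

```python
import math

def mh_common_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across score strata.

    Each stratum is a 2x2 table (a, b, c, d):
      a = reference group, symptomatic; b = reference group, not symptomatic;
      c = focal group, symptomatic;     d = focal group, not symptomatic.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical counts for one item, stratified by total score group.
strata = [(10, 40, 5, 45), (25, 25, 15, 35), (40, 10, 30, 20)]
or_mh = mh_common_odds_ratio(strata)
# Log-odds transform to the ETS delta scale; in educational testing,
# values beyond roughly |1.5| are often treated as salient DIF.
delta_mh = -2.35 * math.log(or_mh)
print(round(or_mh, 2), round(delta_mh, 2))
```

An odds ratio of 1.0 (delta of 0) indicates no conditional group difference; here the pooled odds ratio above 1 indicates that, within score strata, the reference group is more likely to respond symptomatically to this item.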
Related to the M-H method is the parametric contingency-table approach to the examination of DIF, based on logistic regression (LR; Swaminathan & Rogers, 1990). LR examines whether the odds of admitting to a symptom are different between two groups. Item response is predicted from total observed scores, group status and the interaction of group by the total score. Like M-H, this method has traditionally used a summary conditional disability measure based on observed scores; however, Crane and colleagues (2004, 2006) have developed a method that incorporates a latent conditioning variable based on IRT estimation. LR has been expanded to include ordinal logistic regression (OLR) to accommodate polytomous data. A likelihood ratio test, distributed as chi-square, with separate tests for uniform and non-uniform DIF, has been recommended (Jodoin & Gierl, 2001). Purification has also been recommended (Camilli & Shepard, 1994), in which a corrected or unbiased estimate of DIF is achieved by removing items with DIF from the total score (but retaining the studied item) before the final DIF analysis.
LR procedures yield estimates of odds ratios that provide information about the direction and magnitude of the DIF. For example, inclusion of the group variable (for a test of uniform DIF) permits computation of the exponent of the regression coefficient associated with group membership. In addition, at each step in the model building, a corresponding effect size estimate is the R-square difference between the OLR models, which can be applied to both binary and ordinal items (Gelin & Zumbo, 2003; Zumbo, 1999). The beta coefficient also can be evaluated for significance and for magnitude.
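The model-building sequence just described can be sketched as three nested logistic models. The example below is illustrative only: it uses a binary item, a hand-rolled Newton-Raphson fit, and simulated data (coefficients, sample size and seed are all assumptions). The group term tests uniform DIF, the score-by-group interaction tests non-uniform DIF, and the exponentiated group coefficient gives the odds-ratio magnitude.

```python
import math
import numpy as np

def fit_logit(X, y, iters=25):
    """Maximum-likelihood logistic regression via Newton-Raphson.
    Returns (coefficients, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        beta = beta + np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return beta, float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def chi2_sf_df1(x):
    """Survival function of a chi-square variate with 1 df."""
    return math.erfc(math.sqrt(max(x, 0.0) / 2.0))

rng = np.random.default_rng(0)
n = 500
score = rng.normal(size=n)            # observed conditioning score
group = rng.integers(0, 2, size=n)    # focal (1) vs. reference (0) group
# Simulate uniform DIF: group shifts the item's log-odds at every score level.
logit = 1.2 * score + 1.0 * group
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

ones = np.ones(n)
X1 = np.column_stack([ones, score])                        # base model
X2 = np.column_stack([ones, score, group])                 # + group: uniform DIF
X3 = np.column_stack([ones, score, group, score * group])  # + interaction: non-uniform DIF
_, ll1 = fit_logit(X1, y)
b2, ll2 = fit_logit(X2, y)
_, ll3 = fit_logit(X3, y)

p_uniform = chi2_sf_df1(2 * (ll2 - ll1))     # likelihood ratio test for uniform DIF
p_nonuniform = chi2_sf_df1(2 * (ll3 - ll2))  # likelihood ratio test for non-uniform DIF
odds_ratio = float(np.exp(b2[2]))            # magnitude of uniform DIF for the group term
print(p_uniform, p_nonuniform, odds_ratio)
```

With the simulated uniform DIF, the first likelihood ratio test is significant and the interaction test is not; a polytomous item would use ordinal (proportional odds) regression instead, with the same nesting logic.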
Advantages of LR include ability to model multidimensional data and capability to include covariates. A disadvantage is the use of the observed conditioning variable. While it is possible to use a latent conditioning variable (see Crane, Van Belle & Larson, 2004), this is infrequently applied in practice, and adds to the complexity of the analyses (see Millsap, 2006). Crane and colleagues (Crane, Gibbons, Narasimhalu, et al., 2007; Crane, Gibbons, Ocepek-Welikson, et al., 2007) used the latent conditioning OLR method in the analyses of a quality of life and a general health measure reviewed in Table 1. LR with an observed conditioning variable was used to cross-validate results in the paper by Petersen and colleagues (2003) to examine a measure of quality of life. It was also used in the studies by Cole and colleagues (2000) to examine depression items, and by Scott and colleagues (2006a, 2007) and Perkins and colleagues (2006), examining quality of life and health, respectively.
Estimates of disability in latent variable models are not based on observed scores. Latent variable methods include MIMIC (Muthén, 1984) and various item response theory (IRT) models, such as the Rasch (one-parameter) model or the IRTLR tests based on two- or three-parameter IRT models. Most of these models can be linked to IRT; thus a brief explication of IRT follows. According to the IRT model, an item shows DIF if people from different subgroups, but at the same level on the underlying construct measured, have unequal probabilities of responding symptomatically to a particular item.
Item scores are related to the level of the underlying construct, e.g., depression, by functions that provide an estimate of the probability of occurrence of each possible score on an item for a randomly selected individual of given disorder, disability or symptomatology level. In most applications in psychology and physical health, one or two parameters are estimated: the item difficulty (bi) is the point on the total symptomatology continuum where the probability of a specific symptom response is .5. In applications in which symptoms are scored in a disordered direction, a high b means that the item maximally discriminates (separates symptomatology levels or groups) at a higher or more severe level of symptomatology. High b's are characteristic of items that are positively responded to by individuals with more symptomatology, so that relative to items with lower b's, individuals have to be at greater levels of symptomatology before they will have a 50% chance of endorsing the item. The difficulty (severity) parameter is tested for uniform DIF. The two-parameter logistic (for binary items; Lord and Novick, 1968) and graded response (Samejima, 1969) or generalized partial credit models (Muraki, 1992) (for polytomous items) estimate a discrimination parameter that is used in tests of non-uniform DIF.
The one-parameter (1-PL) Rasch (Rasch, 1960) model (used in many of the DIF studies reviewed here) does not incorporate a discrimination parameter, and can be used to examine uniform DIF when non-uniform DIF is not of concern. Rasch models are among the most popular methods for use in development and evaluation of health measures; as a byproduct, DIF analyses can be performed. The basic concept of this approach is to compare the item locations between two groups (i.e., reference versus focal or studied groups) using a t-test. Assume that item i has two difficulty estimates (di1 and di2) for groups 1 and 2 with associated errors, Si1 and Si2, respectively. The formula to test for DIF is t12 = (di1 − di2) / (Si1² + Si2²)½, in which (Si1² + Si2²)½ estimates the expected standard error of the difference between di1 and di2 (also see Wright and Stone, 1979). The obtained value of t12 is compared to the critical value of t. For example, if an alpha level of .05 is used, item i is considered to have significant DIF when |t12| > 1.96. Another method commonly used within the 1-PL/Rasch framework is to examine whether significant displacement values are found when the item calibrations are anchored (i.e., fixed as expected scores). Typically, item calibrations are anchored using values obtained from the reference group, and displacement values are examined by analyzing the focal group data. The displacement is defined as the “(observed score − expected score)/modeled score variance”, which is used to test the hypothesis that the data are generated with the expected scores. An item with a significant displacement value (e.g., more than 2 standard errors from the expected value) is considered to demonstrate uniform DIF between reference and focal groups. Displacement values can be requested using WINSTEPS (Linacre, 2005). There are many other software packages that can be used to conduct Rasch or 1-PL IRT analysis and generate item calibrations, e.g., RUMM and ConQuest.
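Using the notation above, the between-group t-test is straightforward to compute. The difficulty estimates and standard errors below are hypothetical values for a single item, not taken from any study reviewed.

```python
import math

def rasch_dif_t(d1, se1, d2, se2):
    """t statistic for DIF between two Rasch difficulty estimates:
    t12 = (di1 - di2) / (Si1^2 + Si2^2)^(1/2)."""
    return (d1 - d2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical calibrations for one item in reference vs. focal group.
t12 = rasch_dif_t(-0.40, 0.11, 0.15, 0.13)
flagged = abs(t12) > 1.96  # two-sided test at alpha = .05
print(round(t12, 2), flagged)
```

Here the half-logit difference in difficulty is large relative to its standard error, so the item would be flagged for uniform DIF.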
Kubinger (2005) presents situations in which Rasch models are appropriate and discusses approaches to examining model fit. For example, samples can be partitioned into groups, e.g., by gender, and fit tested using Andersen’s likelihood ratio test. The problem arises that with large sample sizes, even with adjustments for multiple comparisons, the type 1 error rate may be inflated (larger than the nominal level), resulting in the spurious classification of items as misfitting (e.g., Smith, Rush, Fallowfield, Velikova, & Sharpe, 2008). Kubinger makes the important point that magnitude of violations should be considered. Are the differences in parameter estimates of clinical consequence? Graphical displays of such differences can aid in interpretation.
In some of the articles, an extended Rasch model was used, in which the fit between the data and the model is examined using analysis of variance of residuals (ANOVA; Hagquist & Andrich, 2004). Each person is classified according to one of N class intervals, and according to the studied group variable, e.g., gender. A byproduct is a set of residuals that can be used in a two-way ANOVA. The class intervals are chosen to ensure sufficient cell sizes. A significant class-interval effect, irrespective of the group variable, indicates that the item does not fit the model across the construct continuum. A significant group effect, controlling for class interval, is an indicator of uniform DIF, while a significant interaction between class intervals and group is indicative of non-uniform DIF, although the discrimination parameters are still held constant.
Analysis of variance using Rasch logits for detection of DIF is an extension of the t-test used to examine differences in the difficulty parameter between groups. A comparison of the Rasch model ANOVA procedure with the Mantel-Haenszel and logistic regression approaches has shown favorable performance for detection of uniform DIF; however, logistic regression showed better performance in the detection of non-uniform DIF (see Whitmore & Schumacker, 1999). Rasch models may not be optimal for the detection of DIF in health data because of the assumption of equal discrimination parameters; however, assuming adequate fit to the data (see Kubinger, 2005; Zumbo and Thomas, 1997), they are often used in circumstances in which sample sizes are smaller (see Lai, Teresi & Gershon, 2005).
The two-parameter logistic and graded response model (Samejima, 1969), used in programs such as IRTLRDIF (www.unc.edu/~dthissen/dl.html) to model ordinal (polytomous) data produces discrimination and difficulty parameters that are the basis for several of the magnitude estimates related to the probability differences (SPD, UPD; Camilli & Shepard, 1994) and Differential Functioning of Items and Tests (DFIT) methodology (Raju, van der Linden & Fleer, 1995). Examples of their use can be found in Morales et al. (2006), Teresi, Kleinman & Ocepek-Welikson (2000) and Teresi et al. (2007). DFIT also accommodates the one parameter (Rasch) model, although there is less experience with applications of this model to DFIT. Several magnitude and impact measures are available in the context of the DFIT methodology.
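For a binary item under the two-parameter logistic model, the probability-difference magnitude measures can be sketched as follows. This is a simplified illustration: a fuller implementation would use graded-response category probabilities and the focal group's estimated theta distribution, and the item parameters and theta values here are invented.

```python
import math

def p_2pl(theta, a, b):
    """Two-parameter logistic item response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def probability_differences(a_ref, b_ref, a_foc, b_foc, thetas):
    """Signed and unsigned probability differences between the focal-group
    and reference-group item response functions, averaged over a sample
    of trait (theta) values."""
    diffs = [p_2pl(t, a_foc, b_foc) - p_2pl(t, a_ref, b_ref) for t in thetas]
    signed = sum(diffs) / len(diffs)
    unsigned = sum(abs(d) for d in diffs) / len(diffs)
    return signed, unsigned

# Hypothetical item parameters and focal-group theta values.
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
spd, upd = probability_differences(1.0, 0.0, 1.3, 0.4, thetas)
print(round(spd, 3), round(upd, 3))
```

The unsigned measure is always at least as large as the absolute signed measure; a large unsigned value with a near-zero signed value indicates DIF that cancels across the trait continuum.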
The Rasch model was used in 10 of the 32 articles included in the table, most with respect to quality of life and general health measures. Four authors (Bjorner, et al. 2004; Chan, Orlando, Ghosh-Dastidar, Duan & Sherbourne, 2004; Hepner, Morales, Hays, Edelen, & Miranda, 2008; Kim, Pilkonis, Frank, Thase & Reynolds, 2002) used an IRT method other than Rasch; these included a two-parameter graded response or generalized partial credit model, and one author used a non-parametric IRT approach (Moorer, Suurmeijer, Foets & Molenaar, 2001).
A latent variable approach linked to IRT as originally proposed by Birnbaum (Lord & Novick, with contributions by Birnbaum, 1968) is the multiple indicators, multiple causes (MIMIC; Jöreskog and Goldberger, 1975; Muthén & Muthén, 1998–2004) approach to DIF detection (see also Thissen, Steinberg & Wainer, 1993). This approach can be characterized as a confirmatory factor analysis model that incorporates a threshold (difficulty) value. This model uses equality constraints to test whether the item parameters differ between groups. The test of DIF is whether the likelihood of a symptom is different between groups after controlling for disorder as well as other covariates. The DIF measure is the coefficient of the path relating the studied background variable (covariate) to the item, after controlling for the indirect effects (through the latent variable) of other covariates on the item. MIMIC models examine the magnitude of DIF through examination of the direct effect estimate (which detects any residual variance in item response associated with membership in a particular group).
Six studies reviewed in Table 1 used the MIMIC approach. For example, Gallo and colleagues (1998) examined differential performance of depression items between African Americans and Whites by estimating the direct effects of item parameters (DIF indicators), after controlling for the indirect effects of covariates. Yang and Jones (2007) also applied this method to depression items, and Yu and colleagues (2007) used MIMIC to examine items measuring health. MIMIC has been used (Grayson, MacKinnon, Jorm, Creasey & Broe, 2000) to determine the impact of bias on depression scale scores through examination of the bias effect on the total score, estimated as the sum of all the direct loadings from a specific predictor (background characteristic) to the items, contrasting this with the genuine effect that arises from the predictor on the latent variable. Impact has also been examined using MIMIC by comparing the estimated group effects (differences in the coefficients relating group membership to disability) in models with, and without, adjustment for DIF (Fleishman, Spector & Altman, 2002). A major advantage of MIMIC is the inclusion of covariates; possible disadvantages include the inability to examine non-uniform DIF.
Disadvantages of the MIMIC, Rasch and other IRT-based methods are that violations of model assumptions and lack of fit can lead to false DIF detection. All models have to be checked carefully, thus adding to the steps necessary to properly implement the method. An issue in the measurement of self-reported psychological and physical health is the nature of the construct, and whether model misspecification can occur if items are generative rather than emergent. Most latent variable measurement models that form the basis for many DIF detection methods assume that the latent factor causes the symptoms. As an example, it is assumed that indicators of a physical health disorder factor, such as heart disease, blood pressure, shortness of breath, etc. are correlated because they are caused by physical health disorder. However, if the indicators cause the physical health disorder (they are generative rather than emergent), model misspecification may result (Cohen, Cohen, Teresi, Marchi & Velez, 1990; see also Bollen & Lennox, 1991; Fayers & Hand, 1997; Fayers, Hand, Bjordal & Groenvold, 1997). Model assumptions play an important role in DIF detection; lack of model fit and violation of model assumptions can result in false DIF detection. Not all of the articles reviewed provided tests of the assumptions discussed below.
Multidimensionality can be mistaken for DIF (Mazor, Hambleton & Clauser, 1998). While some models, e.g., MIMIC and LR can accommodate multidimensional data, most models and applications assume unidimensionality of the underlying trait, and there are additional assumptions associated with specific models. Two major approaches to assessing dimensionality are parametric factor analytic or bifactor models (see Reise, Morizot, Hays, 2007), and the non-parametric methods. In the context of the Rasch model, various fit tests have been used to examine dimensionality.
A contributor to inaccurate DIF detection is lack of model fit (e.g., Bolt, 2002). DIF analyses will be incorrect if, for example, a one-parameter model is selected for DIF detection, when a two or three parameter model would better fit the data (Hambleton, 2006). Numerous fit indices (many distributed as chi-squares) have been investigated; most tests of goodness-of-fit are influenced by combinations of sample size, distributional form, and estimation procedure.
While a detailed discussion is beyond the scope of this review, the issue is: does lack of model fit mean lack of unidimensionality? Does it mean DIF? Are all three synonymous? It has been argued (e.g., Teresi, 2006a) that the three concepts, while interrelated, are not synonymous, and that model fit, model assumptions and DIF should all be tested separately, and not conflated. As pointed out by Kubinger (2005), lack of model fit does not imply lack of unidimensionality. While some, e.g., Roussos and Stout (1996), have taken the view that DIF implies multidimensionality, McDonald (2000) provides several scenarios for DIF that are not necessarily due to a second nuisance dimension. Borsboom and colleagues (2002, 2006) argue that an alternative cause of DIF might be “relative bias” that might occur if an individual is rating him/herself in relation to others in the setting, for example, members of a football team as contrasted with members of some other team sport. Within groups, the item may perform well, and be related to the measure of the underlying construct, but show DIF due to relative bias or to factors such as poor translation. To return to a point made earlier, it is essential that content experts determine the context in which DIF should be studied, and if possible generate hypotheses about potential DIF that can be tested.
A major assumption of most methods is that all items in the measure other than the studied item are unbiased. Thus, iterative or two-stage purification is recommended in order to avoid erroneous DIF detection (e.g., Clauser, Mazor & Hambleton, 1993; Holland and Thayer, 1988). Methods of purification vary, but all are based on the notion that items with DIF (except for the studied item for most methods) are removed from the item set used to estimate the disability measure. These DIF-free items form the anchor set.
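The purification loop itself is simple to express. In the sketch below, `dif_test` stands in for any of the detection methods above (here a toy threshold rule on invented group differences); note that most methods would retain the studied item in the conditioning score, a detail omitted from this sketch.

```python
def purify(items, dif_test, max_iters=10):
    """Iterative (two-stage) purification sketch: re-estimate the anchor set
    until the set of DIF-flagged items stabilizes.

    `dif_test(item, anchors)` is a caller-supplied function that returns True
    when `item` shows DIF conditional on a score built from `anchors`.
    """
    flagged = set()
    for _ in range(max_iters):
        anchors = [i for i in items if i not in flagged]
        new_flags = {i for i in items if dif_test(i, anchors)}
        if new_flags == flagged:  # converged: the anchor set is stable
            return anchors, flagged
        flagged = new_flags
    return anchors, flagged

# Toy stand-in for a real DIF test: flag items whose (hypothetical) group
# difference exceeds a threshold, ignoring the anchor set for simplicity.
group_diff = {"item1": 0.05, "item2": 0.60, "item3": 0.10, "item4": 0.45}
anchors, flagged = purify(list(group_diff), lambda i, a: group_diff[i] > 0.3)
print(sorted(anchors), sorted(flagged))
```

In a real analysis, `dif_test` would re-fit the M-H, logistic regression or IRT model at each pass using only the current anchor items in the conditioning score, so the flagged set can change from one iteration to the next.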
Several of the studies examined translations of instruments; these can be affected by lack of conceptual equivalence in different groups. Qualitative analyses are thus important in the determination of reasons for DIF, such as changes in content, format, difficulty of words or sentences, and differences in cultural relevance. Roussos and Stout (1996) recommended a substantive (qualitative) analysis in which DIF hypotheses are generated, and it is decided whether or not unintended "adverse" DIF is present as a secondary factor. Substantive reviewers examine item content, and previously published analyses in order to generate DIF hypotheses. This review is followed by statistical analyses comprised of confirmatory tests of DIF hypotheses. This procedure can be extended to patient-reported outcome measures through use of qualitative methods that include focus groups and cognitive interviews (see Nápoles-Springer, Santoyo-Olsson, O’Brien and Stewart, 2006). This process is rarely performed in practice, and usually after the fact; several of the authors reviewed here provided extensive post-hoc evaluation of the possible reasons for DIF. Examples of good substantive reviews are the articles by Azocar et al., 2001; Gallo et al., 1998; Groenvold, et al., 1995; Kucukdeveci, Sahin, Ataman, Griffiths & Tennant, 2004; Kutlay, Kucukdeveci, Gonul & Tennant, 2003; Pagano & Gotay, 2005 and Prieto et al., 2003.
Based on considerations of parsimony, this review was restricted to three domains: depression, quality of life and general health. Collectively, these domains were selected for review because they represent important patient-reported outcomes, contain overlapping item content, and are included in major item banking projects, such as the Patient Reported Outcomes Measurement Information System (PROMIS; www.NIHPROMIS.org) project (Reeve, 2006; Reeve et al., 2007). PROMIS, part of the U.S. National Institutes of Health (NIH) roadmap initiative (RFA-RM-04-011), aims to provide an infrastructure for clinicians and researchers by establishing generic item banks across various disease groups, and applications of computerized adaptive testing.
Patient populations were limited to adults. Measures developed for use with children and adolescents were excluded. This decision was made in the interest of parsimony, and because measurement in children encompasses different issues; constructs may not be conceptually equivalent, and targeted outcomes could be different for children as contrasted with adults.
Identification of manuscripts addressing measurement equivalence using DIF was conducted through the use of several search engines. An initial search using the Columbia University Library PubMed database was conducted on November 11, 2005. Two parameters were specified: time frame (from 1995 to 2005) and key words appearing in the citation and the abstract, i.e., “Differential Item Functioning.” An Ovid-assisted cross-referencing search was conducted on the same date, using the same parameters. After deleting duplicates, a total of 120 unique references was identified. A web-based ProQuest search conducted on December 2, 2005, as a second crosscheck, failed to identify any new articles. An additional search was conducted using the Columbia University Library PubMed database on February 6, 2006, expanding the dates to include 2006 and the key words to include “item bias.” A search was conducted on February 12, 2006 through the Northwestern University library system using the keywords, “quality of life,” “health-related quality of life,” “DIF,” and “item bias.” Finally, a search was conducted in July, 2008. The purpose was to check the results of the search in the areas of quality of life, depression and general health; several additional articles were identified.
The selected articles from the first two searches were then divided (in terms of their content) into “methodological” and “applied”. Only applied articles (i.e., articles in which DIF was applied to a specific measure) were retained for evaluation. However, methodological articles were used in the overall review. A second-level iteration of manuscript selection was then conducted in which only manuscripts focusing on measures of the constructs of interest (depression, general quality of life and general health) that used samples from adult populations of sufficient size were included.
Table 1 is a summary of some of the information contained within the MEGS, which were created in order to provide evaluative criteria for measurement review. The MEGS contain elements determined by the measurement cores of the U.S. Resource Centers for Minority Aging Research (RCMAR) to be important in the evaluation of measures for equivalence across groups differing in characteristics such as ethnicity, literacy and education. Additional information can be found at the following website (www.research-HHAR.org). Elements include: sample characteristics (recruitment, data collection methods, response rate); format of the measure (design, readability, type (level) of measurement, scoring (range, direction, rules and missing data), and translations); psychometric properties (scale construction, basic summary statistics, variability, test-retest, interrater, and internal consistency reliability, content, construct, concurrent and predictive validity, and sensitivity to change); differential item functioning (variables studied, sample size, DIF method used, tests of model assumptions, purification, evidence of uniform and non-uniform DIF, and magnitude and impact of DIF); and review of strengths and weaknesses. Specific comments related to each article appear in Table 1, in the column titled “Review”.
Presented above are brief definitions of the elements used in the MEGS that are relevant to examination of DIF. These include methods for examination of DIF, tests of model fit and assumptions, purification, evidence of uniform and non-uniform DIF, magnitude and impact of DIF, all of which can affect the detection rate (e.g., Rogers & Swaminathan, 1993; Whitmore & Schumacker, 1999). Note that basic psychometric analyses such as reliability and validity of the measures are not the focus of this review, but are assumed to have been examined.
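The distinction between uniform and non-uniform DIF referenced throughout this review can be made concrete under the two-parameter logistic (2PL) IRT model. The sketch below uses hypothetical item parameters chosen purely for illustration: uniform DIF appears as a shift in item difficulty (b) with a common discrimination (a), so one group is favored at every level of the latent trait, while non-uniform DIF involves unequal discriminations, so the groups' item characteristic curves cross.

```python
import math

def p_endorse(theta, a, b):
    """Two-parameter logistic IRT model: probability of item endorsement
    at trait level theta, with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical parameters, for illustration only.
# Uniform DIF: same discrimination, shifted difficulty for the focal group.
ref_u, foc_u = dict(a=1.2, b=0.0), dict(a=1.2, b=0.5)
# Non-uniform DIF: discriminations differ, so the curves cross.
ref_n, foc_n = dict(a=1.5, b=0.0), dict(a=0.6, b=0.0)

thetas = [-3 + 0.5 * i for i in range(13)]
uniform_gaps = [p_endorse(t, **ref_u) - p_endorse(t, **foc_u) for t in thetas]
nonuniform_gaps = [p_endorse(t, **ref_n) - p_endorse(t, **foc_n) for t in thetas]

# Uniform DIF: the reference group is favored at every trait level.
print(all(g > 0 for g in uniform_gaps))
# Non-uniform DIF: the direction of the gap reverses across the trait range.
print(any(g > 0 for g in nonuniform_gaps) and any(g < 0 for g in nonuniform_gaps))
```

This is why methods that test only for uniform DIF (a common limitation noted in the review) can miss items whose bias changes direction across the trait continuum.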
Shown in the table are the results of DIF analyses for the measures reviewed. For brevity, only selected elements from longer internal summary reviews of each article are shown. These include the name of the measure, source of the DIF analyses, the method used, the results and the summary of the methodological review. Because this information is included in the table, it is not repeated; rather each section below is a brief summary of findings and recommendations for future work.
Two forms of severe mental illness, unipolar major depression and bipolar disorders, have been identified by a World Health Organization study (Lopez & Murray, 1998) as being among the leading causes of disability. Recent findings from the National Health Interview Survey (Pratt, Dey & Cohen, 2007) showed that the prevalence of serious psychological distress among Hispanic older persons (65 and over) (5.9%) was more than twice that among non-Hispanic Blacks (2.4%) and Whites (2.1%). These differences among race and ethnic groups were not observed among younger cohorts. Several studies of depression (e.g., Callahan & Wolinsky, 1994; Gallo et al., 1998; Koenig et al., 1992; Teresi et al., 2002) have shown lower rates of depression among Blacks as contrasted with other groups, and among older as contrasted with younger cohorts. In order to determine whether differences in rates between race/ethnic, age and gender groups reflect actual differences and not item bias, studies of factorial invariance and DIF are needed.
Several constructs or appellations related to depression include dysthymia, emotional distress, general distress, serious psychological distress, affective disorder and anxiety. Typically a measure labeled as one of the above contains at least some depression items, and some items from most depression scales are also found on measures of these other constructs. There is a vast literature discussing the interrelationships among some of these constructs (e.g., Hockwarter, Harrison & Amason, 1996; Huelsman, Nemanick & Munz, 1998; Lawton, Kleban, Rajagopal, Dean & Parmelee, 1992; Teresi, Abrams & Holmes, 2000), a discussion of which is beyond the scope of this presentation. However, because the majority of the measures reviewed below are intended to measure depression, this is the term that is used here to discuss the constellation of symptoms examined in terms of factorial invariance and DIF.
Several studies have examined the factor structure of depression measures, most using exploratory factor analyses; however, studies using confirmatory multi-group factor analyses have recently emerged. Strict residual-level factorial invariance (using a multi-group factor model) is equivalent to DIF testing using a two-parameter IRT model (see Meredith and Teresi for a discussion). Most of the 14 DIF studies of depression used a variant of IRT; six used the one-parameter Rasch or other IRT models, including the two-parameter model with likelihood tests, and five used MIMIC or restricted factor analyses. Others used the Mantel-Haenszel, ordinal logistic regression, or SIBTEST (Shealy and Stout, 1993) methods.
An important first step in the conduct of DIF analyses is examination of dimensional invariance, because unidimensionality is an assumption of most methods used. Dimensionality is usually tested using a factor analytic approach. Reviewed below are studies of factorial invariance or DIF in measures of depression; another review of two popular depression measures can be found in Mui, Burnette, and Chen (2001). The reviews related to depression presented in the MEGS include the years 1995 to 2008; however, some important earlier work is briefly reviewed (for a more detailed review of these earlier studies, see Teresi & Holmes, 1994, 2001).
Studies of factorial invariance and individual factor analyses with different groups have been conducted with respect to various measures of depression. For example, Foley, Reed, Mutran, and DeVellis (2002) examined the factor structure of the CES-D among older African Americans using an exploratory factor model. The eigenvalue for the first factor was 6.26, explaining 31% of the variance, while the second was 1.63, explaining 8% of the variance; thus an essentially unidimensional depression/somatic symptoms factor characterized the data. The results demonstrated that the factor structure was different from that observed in previous studies, and that collectively across studies, divergence among solutions was observed. Gregorich (2006) subjected the Somatic and Retarded Activity factor of the CES-D (originally reported by Radloff using exploratory factor analyses) to confirmatory factor analyses. Samples of Black and White men over age 50 were studied; the results indicated that metric invariance was achieved for all five items; however, the item, ‘effort’ did not achieve strong factorial invariance of item intercepts, and ‘appetite’ was not strictly invariant (equal residual variances).
The majority of recent studies of DIF in depression measures have focused on the CES-D. The group variables as well as the method used to examine DIF varied across studies, however. Using an IRT log likelihood approach, Chan et al. (2004) found 12 items manifesting DIF (uniform and/or non-uniform): “happy”, “enjoyed life”, “could not get going”, “talked less than usual”, “everything was an effort”, “felt as good as others”, “felt depressed”, “felt sad”, “trouble keeping my mind on what I was doing”, “people dislike me”, “my life had been a failure”, and “felt hopeful about the future”, when mode effect (phone vs. mail) was examined. Discussing the impact of DIF, the authors noted that DIF could result in an increase of up to six points on the depression continuum for the mail respondents. On the other hand, Cole et al. (2000) found 17 of the 20 CES-D items to be relatively free of item bias by age, gender, and racial groups. Only three items, “people are unfriendly”, “people dislike me”, and “crying spells” were found to function (uniformly) differently among subgroups of gender and race. The magnitude of DIF on the interpersonal items, however, reflected proportional odds up to three points higher for Blacks as compared to Whites with equivalent levels of depressive symptomatology. Similarly, the magnitude of DIF on the “crying spells” item showed a two-point increase in proportional odds for women as compared to men, matched on overall depressive symptoms. These artificially increased odds for endorsing such items by Blacks and/or by women could carry as an overall bias at the scale level. The authors highlighted a shorter, relatively DIF-free version of the CES-D, which correlated .99 with the original scale. More recently, Yang and Jones (2007) replicated the findings of Cole and colleagues (2000), using a latent variable model approach, MIMIC. 
Data from the New Haven Established Population for the Epidemiologic Studies of the Elderly were used to examine DIF related to age (75 and over vs. younger), gender, and race (Black vs. White). Conditional on depression, Blacks were more likely to respond in a higher category to the items “people are unfriendly” and “people dislike me”. The proportional odds for women were higher than for men for the item “crying spells”. These items showed DIF of relatively large magnitude: the proportional odds ratios were all two or above. Using Rasch analysis, similar findings were reported by Covic, Pallant, Conaghan and Tennant (2007). “I felt tearful” and “I had crying spells” showed significant DIF for both age and gender. The subgroup aged 53 years or less (compared with those aged 54–65 and 66+) and females (compared with males) were significantly more likely to endorse these two items.
In their examination of the contribution of specific physical disorders to uniform DIF in the CES-D, Grayson et al. (2000) found item-specific effects for age, gender, and marital status.
Older participants reported being more “bothered by things” and less “hopeful about the future”. Men found things “less of an effort”, were “less fearful”, “slept better”, and reported “crying less”; being widowed was associated with “feeling at least as good as others”, and with more “fear and loneliness”. (pg 276)
Additionally, mobility, ADL, and IADL impairment showed direct effects on “poor appetite”, “finding everything an effort”, “restless sleep”, and “inability to get going”. In terms of physical disorders, heart disease, stroke, any other systemic disease, gait instability, and cognitive impairment showed positive association with depression. However, individuals with physical disorders, for reasons unassociated with depression, underreported on items such as: “felt as good as others”, “talked less than usual”, “people are unfriendly”, “enjoyed life”, “crying spells”, “felt sad”, and “people dislike me”, and showed higher endorsement on items such as: “poor appetite”, “everything an effort”, and “inability to get going”. As discussed by the authors, depending upon the group variable being examined, the impact of DIF on the total CES-D score ranged from trivial to considerable (over seven times the magnitude of the effect on depression). Similarly, Pickard, Dalal, and Bushnell (2006) found that four items, “My sleep was restless,” “I felt that people disliked me,” “I did not feel like eating,” and “I had crying spells”, demonstrated statistically significant uniform DIF when stroke and primary-care groups were compared. The authors do not recommend stroke-specific modifications to the CES-D, however, arguing that only one item was identified as uniquely psychometrically problematic. It is noted that the small sample size for the stroke patient group renders the study results exploratory.
Gelin and Zumbo (2003) examined the CES-D using the Health and Health Care Survey data collected from 600 community resident adults residing in Northern British Columbia, Canada (290 women and 310 men). Using an ordinal logistic regression approach, they found that the manner in which items were scored affected DIF results, as did the endorsement proportions. Results changed depending on whether items were scored as binary, ordinal, or according to a persistency (frequency of at least 3 to 7 days) threshold. The “crying” item showed gender DIF of high magnitude for both the binary and ordinal scoring methods, in the direction that conditional endorsement was higher for women than for men; the item was a much more severe indicator for men because it took higher levels of depression before men would endorse it. While DIF was observed for two other items (“effort” and “hopeful”) using the persistence method, the low item prevalence renders these results less robust.
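The logistic regression DIF procedure applied by Gelin and Zumbo tests nested models: the matching (trait) score alone, then a group term (uniform DIF), then a group-by-score interaction (non-uniform DIF), with likelihood-ratio chi-square tests between successive models. Below is a minimal binary (rather than ordinal) sketch of this logic on simulated data; the sample sizes and data-generating values are hypothetical, and the Newton-Raphson fitter is a bare-bones stand-in for a statistics package.

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Newton-Raphson logistic regression; returns the maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        beta += np.linalg.solve(X.T @ (X * (p * (1 - p))[:, None]), X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
n = 2000                              # hypothetical sample size per group
theta = rng.normal(size=2 * n)        # matching (trait) variable
group = np.repeat([0.0, 1.0], n)      # 0 = reference, 1 = focal
# Simulate uniform DIF: the focal group needs a higher trait level to endorse.
p_true = 1.0 / (1.0 + np.exp(-(theta - 0.8 * group)))
y = (rng.random(2 * n) < p_true).astype(float)

ones = np.ones(2 * n)
ll1 = fit_logistic(np.column_stack([ones, theta]), y)                        # score only
ll2 = fit_logistic(np.column_stack([ones, theta, group]), y)                 # + group
ll3 = fit_logistic(np.column_stack([ones, theta, group, theta * group]), y)  # + interaction

chi_uniform = 2 * (ll2 - ll1)      # 1-df test for uniform DIF
chi_nonuniform = 2 * (ll3 - ll2)   # 1-df test for non-uniform DIF
print(chi_uniform > chi_nonuniform)
```

Because only a difficulty shift was simulated, the uniform-DIF chi-square is large while the interaction test remains near its null distribution; in an ordinal application, each model would instead be a proportional-odds regression.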
Several factorial invariance studies of the General Health Questionnaire-12 have been performed. This 12-item measure assesses minor psychiatric disorders and has been viewed as a measure of general distress. Items include “concentration,” “sleep disorder due to worry,” “feeling depressed,” “worthless,” “unhappy,” and “lack of enjoyment of activities.” Three factors have been observed: anxiety/depression, social dysfunction, and loss of confidence (Shevlin and Adamson, 2005). However, these analyses showed that a higher-order factor or a 12-item summary measure may be sufficient; factorial invariance of loadings, error variances and factor variances was established for gender. The authors of another study of the GHQ-12 (Makikangas et al., 2006), ignoring minor misfit and one noninvariant factor loading, demonstrated factorial invariance of thresholds, loadings and factor means over time. Jorm et al. (2005) examined the factorial invariance of the Goldberg Depression and Anxiety scales using a community survey of 7485 persons in several age categories: 20–24, 40–44, and 60–64. These authors established weak (metric) factorial invariance for the two factors. A generalized measure of psychological distress was also recommended: the sum of all items, excluding one, “difficulty falling asleep”. A potential weakness, acknowledged by the authors, is that older age groups were not examined. Additionally, only metric invariance was established. Duncan-Jones, Grayson and Moran (1986) used IRT to examine the gender bias of the 12-item GHQ. Two items (“feeling constantly under strain” and “feeling unable to overcome difficulties”) were more related to depression for women than for men.
An examination of item bias (using IRT) associated with the SHORT-CARE Depression scale was conducted by Teresi and Golden (1994); these authors found that some of the somatic symptoms (“headaches”, “crying”, and “lack of interest”) were relatively less severe indicators of depression for Latinos than for White, non-Latinos. “Crying” showed DIF of higher magnitude: across the disability spectrum, the likelihood of endorsement of this item in particular was higher for Latinos than for White, non-Latinos. A subset (with some modifications) of the SHORT-CARE Depression Scale, including the “crying” item, is contained within the EURO-D, a widely used measure that has recently been evaluated psychometrically (Castro-Costa, Dewey, Stewart, Banerjee, Huppert, Mendonca-Lima, et al., 2008).
Gibbons, Clark, Vonammon-Cavanaugh, and Davis (1985) used IRT to examine the BDI, comparing medically ill inpatients with psychiatric patients. The vegetative symptoms, “loss of weight” and “of sexual interest”, were particularly poor discriminators of depression severity in the medically ill sample. Two items (“loss of satisfaction”, “loss of social interest”) were found to maximally assess depression severity. Azocar et al. (2001), examining uniform DIF, identified four BDI items as biased for the Spanish- (vs. English-) speaking sample. The items “I feel like I am being punished”, “I feel like crying”, and “I believe I look ugly” were more likely to be endorsed, and the item “I can’t do any work at all” was less likely to be endorsed, by Spanish speakers regardless of their level of depression. The authors point out that the impact of DIF in this scale is such that it could result in an artificial increase of the mean scores for Latino samples of up to six points (possible scores ranged from 0 to 30) above those of English-speaking samples with equivalent depression levels. Kim et al. (2002), using item response theory to examine the contribution of age to DIF in the BDI, found three items reflecting uniform DIF across all levels of depression: “loss of libido”, “weight loss”, and “disappointment in self”; midlife patients were more likely to endorse “loss of libido” and “disappointment in self”, and less likely to endorse “weight loss”, than late-life patients. This finding is similar to the earlier study reviewed above (Gibbons et al., 1985), which found that the vegetative symptoms (“loss of weight” and of “sexual interest”) were not well related to depression severity among a medically ill sample. Kim et al. also found non-uniform DIF on 8 of 11 BDI items: “self-criticism”, “social withdrawal”, “irritability”, “guilt feelings”, “sense of failure”, “sleep disturbance”, “somatic preoccupation”, and “work inhibition”.
They found that the impact of DIF on the BDI was not trivial, given that approximately half of the items on the scale accounted for 80% of the differential test functioning.
In contrast to the above findings of DIF related to several items in the BDI, Tang, Wong, Chiu, Lum, and Ungvari (2005) failed to document any uniform or non-uniform DIF of the GDS with respect to age, education or cognitive impairment in a sample of Chinese patients. However, more recently, Broekman, Nyunt, Niti, Jin, Ko, Kumar, et al. (2008), examining DIF in a heterogeneous Asian population, found ten of the GDS-15 items to show DIF associated with age, gender, ethnicity and chronic illness. Specifically, six items, i.e., “drop many activities and interests”, “prefer staying home”, “more problems with memory”, “feel pretty worthless”, “not full of energy”, and “not happy most of the time”, showed age-related DIF. Five items showed gender-related DIF: “afraid that something bad is going to happen”, “prefer staying home”, “more problems with memory”, “feel situation is hopeless”, and “not satisfied with life”. The four items that showed ethnicity-related DIF were “prefer staying home”, “think not wonderful to be alive”, “feel pretty worthless”, and “more problems with memory”. Finally, two items showed illness-related DIF: “feel pretty worthless” and “not full of energy”. The authors concluded that the cumulated effects of specific item bias due to age, gender, ethnicity and chronic illness could potentially bias the total test score.
Items from depression diagnostic scales were also investigated for DIF. For example, Gallo et al. (1998), examining the Diagnostic Interview Schedule (DIS; Robins, Helzer, Croughan & Ratcliff, 1981) for racial bias, found certain items showing uniform DIF. “Sleep disturbance” and “sadness” were less likely to be endorsed by African Americans, and “difficulty with concentrating” and “thoughts of death” were more likely to be reported by older African Americans than by Whites. No discussion of the impact of DIF was presented, however.
Other research examining DIF associated with Diagnostic and Statistical Manual (DSM) symptoms has been performed (Simon and Von Korff, 2006). The Composite International Diagnostic Interview (CIDI) (Kessler, Wittchen, Abelson, McGonagle, Schwarz and Kendler, 1998), a DSM-III-R-based diagnostic interview schedule, showed race/ethnicity-related DIF (Breslau, Javaras, Blacker, Murphy, and Normand, 2008). Blacks (vs. Whites) were less likely to endorse “lack of energy”, “felt worthless” and “thoughts of suicide” at the item level, and “loss of energy” and “self-reproach” at the symptom level. Similarly, underestimation of depression was found among Hispanics (in contrast with Whites) for “increased weight” and “waking early” at the item level and for “suicidality” at the symptom level. In another study, the Patient Health Questionnaire depression scale (PHQ-9; Kroenke, Spitzer, and Williams, 2001), a criterion-based depression measure based on symptoms from the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV), was examined. Exploratory factor analyses performed across four samples, African American (n=598), Chinese-American (n=941), Latino (n=974) and non-Hispanic White (n=2,520), yielded essentially unidimensional factors across samples (Huang, Chung, Kroenke, Delucchi, and Spitzer, 2006). Eigenvalues ranged from 3.5 to 4.42, and the percentages of variance explained were: 38.9 (Chinese), 39.6 (Latino), 40.1 (African American), and 49.1 (non-Latino White). In tests for DIF using the Mantel-Haenszel statistic, items such as “anhedonia”, “sleep”, and “appetite” showed significant DIF in the Chinese-American and Latino groups as compared with White non-Latinos. In addition, “depressed mood”, “low energy”, “appetite change”, and “low self-esteem” evidenced DIF for Latinos as contrasted with non-Latino Whites.
Most of the items found to show DIF in those two groups using the Mantel-Haenszel method no longer showed significant DIF after controlling for the covariates age, gender, and English-language ability in the MIMIC model. No item-level DIF was reported for the African American group. While there were few scale-level differences among the groups defined by race/ethnicity before DIF adjustments, there were some differences in the proportion over the threshold classified as depressed. Additionally, subgroup comparisons involving gender, age and language could be influenced by DIF. In contrast with some of the findings reviewed above showing DIF in Black and White comparisons, in an examination of DIF associated with depression items in the Primary Care Evaluation of Mental Disorders (PRIME-MD; Spitzer, Williams, Kroenke, Linzer, deGruy, Hahn, et al., 1994), Hepner and her colleagues (2008) did not find any significant DIF among lower income Black and White women.
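The Mantel-Haenszel procedure used in several of the studies above stratifies respondents by total (or rest) score and pools the resulting 2×2 group-by-endorsement tables into a common odds ratio, where a value near 1.0 indicates no uniform DIF. The following is a minimal sketch using made-up counts in which the reference group endorses the item more often at each score level:

```python
from collections import defaultdict

def mantel_haenszel_or(records):
    """Mantel-Haenszel common odds ratio across score strata.

    records: iterable of (score, group, endorsed) tuples, with
    group in {"ref", "focal"} and endorsed in {0, 1}.
    """
    tables = defaultdict(lambda: [[0, 0], [0, 0]])
    for score, group, endorsed in records:
        tables[score][0 if group == "ref" else 1][endorsed] += 1
    num = den = 0.0
    for (r0, r1), (f0, f1) in tables.values():
        n = r0 + r1 + f0 + f1
        num += r1 * f0 / n   # ref endorsed * focal not endorsed
        den += r0 * f1 / n   # ref not endorsed * focal endorsed
    return num / den

# Hypothetical counts for one item, stratified by two score levels.
records = (
    [(0, "ref", 1)] * 10 + [(0, "ref", 0)] * 30 +
    [(0, "focal", 1)] * 5 + [(0, "focal", 0)] * 35 +
    [(1, "ref", 1)] * 20 + [(1, "ref", 0)] * 20 +
    [(1, "focal", 1)] * 12 + [(1, "focal", 0)] * 28
)
print(round(mantel_haenszel_or(records), 3))  # 2.333: conditional odds favor the reference group
```

In operational DIF screening, the pooled odds ratio is usually accompanied by a Mantel-Haenszel chi-square test and transformed to a delta scale to classify DIF magnitude; the sketch here shows only the core pooling step.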
The response patterns and the factorial composition of scales assessing depressive symptomatology have been found to be affected by several factors (Mui et al., 2001; Pedersen, Pallay, & Rudolph, 2002). For example, the affective CES-D item tapping sadness showed DIF based on physical disorder and interview mode, and a similar DIS item reflected DIF based on race. Additionally, the CES-D interpersonal items “people are unfriendly” and “people dislike me” showed DIF with respect to one or more of several variables: interview mode (Chan et al., 2004), gender, race (Cole et al., 2000), physical disorder (Grayson et al., 2000), and stroke (Pickard et al., 2006). DIF was also observed in the “crying” item of the CES-D by age, gender, race, physical disorder, and stroke condition (Azocar et al., 2001; Cole et al., 2000; Covic et al., 2007; Gelin and Zumbo, 2003; Grayson et al., 2000; Pickard et al., 2006; Reeve, 2000; Steinberg and Thissen, 2006; Teresi and Golden, 1994; Yang and Jones, 2007).
The impact of DIF in the CES-D, as discussed by the respective authors, ranged from trivial to substantial, depending on the reference group studied. For example, in some studies DIF was found to result in a considerable artificial increase in the overall depression score for mail responders (Chan et al., 2004) and for Blacks (Cole et al., 2000); on the other hand, scale adjustments were not warranted for stroke patients (Pickard et al., 2006). Similarly, Osborne and colleagues (2004), discussing the impact of DIF in the HADS (Zigmond & Snaith, 1983), did not recommend adjustments for cancer patients. The impact of DIF in the BDI (Beck et al., 1961) was demonstrated to be sizable, showing artificial, favorable endorsement of some of the items by Spanish-speaking Latinos (in contrast to English speakers; Azocar et al., 2001); similarly, another analysis demonstrated that half of the items in the scale accounted for 80% of the differential test functioning (Kim et al., 2002).
In summary, about two thirds of the studies reviewed in the area of depression examined magnitude, and almost all estimated the impact of DIF; a little over one third examined non-uniform DIF. In general, findings were of large amounts of DIF of sizeable magnitude and impact. Adjustments of scale scores were frequently recommended.
Quality of life has been conceptualized in many different ways (see Lawton, 1991; Katz and Gurland, 1991; Gurland & Gurland, in press). It is beyond the scope of this article to discuss issues related to the definition, except to point out that overlapping domains (e.g., emotional, physical and functional states) are included in most generic quality of life measures. It is also noted that a distinction can be made between health-related quality of life and general quality of life; the latter includes additional dimensions such as environmental and personal resources (Albert and Teresi, 2002). Given that most of the quality of life measures reviewed here focus on physical, functional and emotional states, most of the findings generalize to DIF in health-related quality of life.
Various approaches were used to examine DIF in the measures of health-related quality of life; most fall within the IRT framework. While parametric IRT examines DIF at the item level, some non-parametric IRT methods (e.g., Mokken) define DIF at the response category level by testing the hypothesis of “equal item step order” across subgroups. Among the nine studies of DIF in quality of life measures reviewed, four used the Rasch model and two used logistic regression; the two-parameter model was not used in these articles.
Instruments that measure quality of life often contain items related to emotional health. A recent analysis (Crane, Gibbons, Ocepek-Welikson, et al., 2007; Teresi, Ocepek-Welikson, Kleinman, Cook, Crane, Gibbons, et al., 2007) of 15 emotional distress items from a study of quality of life included several depression items from a number of scales (Cancer Rehabilitation Evaluation System [CARES-SF; Ganz, Schag, Lee & Sim, 1992; Schag, Ganz, & Heinrich, 1991]; European Organization for Research and Treatment of Cancer Quality of Life Questionnaire [EORTC; Aaronson, et al., 1993]; Functional Assessment of Cancer Therapy [FACT; Cella, 1997], Medical Outcomes Study Short Form Health Survey [SF-36; Hays, Sherbourne, & Mazel, 1993; McHorney, Ware & Raczek, 1993; Ware and Sherbourne, 1992]). The findings from two analyses of these data showed that several items evidenced uniform DIF. Conditional on positive mental health, White respondents were more likely than African Americans to report that they were “not worried about dying” and “did not feel worried” (Teresi, et al., 2007). Crane, Gibbons, Ocepek-Welikson, et al. (2007) also identified “feeling worried” as showing DIF for the comparison of African Americans and Whites; however, this was not significant after Bonferroni correction. Women were more likely to say that they were “able to enjoy life” and were “content with their quality of life” (Teresi et al.); Crane, Gibbons, Ocepek-Welikson, et al. (2007) also identified “content with quality of life” as showing gender DIF before, but not after Bonferroni correction. Older respondents (66 and over) as contrasted with the younger cohort were more likely to report that they “were not worried about dying”; “feeling calm and peaceful” also showed age DIF (Teresi, et al., 2007). Crane, Gibbons, Ocepek-Welikson, et al. (2007) did not find DIF related to age, but did find one item (“being a happy person”) to show DIF for marital status. 
These authors concluded that there could be DIF impact associated with race on the General Distress scale for some individuals.
Several authors have examined DIF in the EORTC QLQ-C30 emotional function subscale. Two items repeatedly showed uniform DIF in studies of different language and ethnic groups: “Did you worry” (in six studies) and “Did you feel irritable” (in three studies). In a study that compared language groups (Bjorner et al., 2004), ‘worry’ was estimated to be the most informative item, with the largest mean threshold parameter and one of the largest slopes. In a study of 359 Caucasian, Filipino, Hawaiian, and Japanese cancer patients, the ‘worry’ item was significantly less difficult for Hawaiians (Pagano & Gotay, 2005). The item “Did you worry” showed DIF for English vs. Norwegian, Dutch, and French speakers (Petersen et al., 2003) in a cross-national study of 10 countries. This item also showed non-uniform DIF when comparing subgroups of Norwegian with English speakers. A review of 13 translations of the EORTC performed by Scott, Fayers, Bottomly, Aaronson, de Graff, Groenvold, et al. (2006) found that respondents using the Norwegian, Turkish or the two Chinese translations were less likely than English speakers to endorse “Did you worry”, while Germans were more likely to endorse this item. An examination of several EORTC items by Teresi and colleagues (2007) found significant yet moderate DIF on the ‘worry’ item, in the direction that Blacks required more positive health in order to endorse it. In a review by Bjorner et al. (2004), “Did you feel irritable?” consistently had the smallest slope and mean threshold, indicating that respondents were less likely to report symptoms based on this item, and that it provided markedly less information than the other items. In the ten-country cross-national study, “Did you feel irritable” showed DIF for English vs. Norwegian, Spanish, and German speakers (Petersen et al., 2003). Scott et al. (2006a) also found DIF for Spanish speakers and, in addition, for Dutch speakers.
This item also showed non-uniform DIF for subgroups of Spanish and English speakers (Bjorner et al., 2004). Two separate studies found that two additional items in the subscale showed DIF for language groups: “Did you feel tense” for English vs. Swedish and Spanish speakers (Petersen et al., 2003) and for English vs. Polish and Singapore Chinese speakers (Scott et al., 2006a). The second item, “Did you feel depressed”, showed uniform DIF for English vs. Norwegian and Swedish speakers in both studies. In addition, German and Finnish speakers (Petersen et al., 2003), and the Polish and two Chinese translations (Scott et al., 2006a), showed uniform DIF for this item.
In a study that examined item bias in the EORTC QLQ-C30 (Aaronson et al., 1993) among age groups and forms of treatment among Danish breast cancer patients (Groenvold et al., 1995), one physical function item, “Do you have to stay in a bed or a chair for most of the day?”, was biased across both age and treatment groups. This item was also found to show DIF for several language and cultural groups (Scott, et al., 2006a, 2007). In addition, “trouble doing strenuous activities”, “taking a long walk” and “taking a short walk” showed DIF among age (Teresi, et al., 2007) and language (Scott, et al., 2006a) groups, and the latter two for cultural groups (Scott, et al., 2007). Crane, Gibbons, Ocepek-Welikson, et al. (2007) also reported uniform DIF that remained after Bonferroni correction: for age for “trouble with a long walk”, and for gender for “trouble doing strenuous exercise”.
Scott and colleagues found that Turkish vs. English speakers and Islamic (Turkey, Iran and Egypt) vs. UK groups required more “help with eating, dressing, washing or using the toilet” (Scott, et al., 2006a, 2007). Among Turkish respondents, the item, “I find it difficult to take care of people I am close to”, showed DIF for age. Younger, in contrast to older respondents, were less likely to report difficulty in caring for persons to whom they are close (Kutlay et al., 2003).
The item, “Did pain interfere with your daily activities”, showed DIF across language (Scott, et al., 2006a), cultural groups (Scott, et al., 2007), and cancer treatment groups (Groenvold, et al., 1995). Those receiving chemotherapy were less likely to report “difficulty remembering things”. Relative to other language comparison groups, English speakers scored significantly lower on the item, “did pain interfere with your daily activities” (Scott et al., 2006a). South Western Europeans were significantly less likely to endorse “have you had pain” (Scott et al., 2007). In a separate study, Caucasians showed significantly less difficulty with “work at job” and “constipation”. Caucasians and Japanese had greater difficulty with “social activities” than did Hawaiians and Filipinos (Pagano & Gotay, 2005). Danish and German speaking respondents were more likely to endorse the “family life” item, and along with the Spanish speaking group, score lower on the “social activities” item than did English speakers (Scott et al., 2006a).
A Rasch analysis of the EUROQoL (Kind, 1996), performed by Prieto and colleagues (2003), identified the emotional function item, ‘Anxiety/depression’, as the easiest (most likely) item to endorse across ten European countries. ‘Mobility’ and ‘self-care’ were the most difficult (least likely) items to endorse for all countries except Denmark, where respondents were more likely to endorse the ‘mobility’ item (Prieto et al., 2003).
A study of the Turkish translation of the Rheumatoid Arthritis Quality of Life Scale (De Jong, Van der Heijde, McKenna & Whalley, 1997) found two biased physical function items, “I find it difficult to walk to the shops” and “I sometimes have problems using the toilet”; the latter was more difficult for Turkish respondents than for Americans. The authors also found uniform cross-cultural DIF in the items “Often gets frustrated” and “Feels unable to control situation”.
An examination of the WHOQOL-BREF found DIF for age groups for “able to get around”, “satisfied with sex life”, and “ability to get things you like to eat”. In addition, four items exhibited DIF between elementary, secondary and higher education groups (Wang, Yao, Tsai, Wang, & Hsieh, 2006).
Groenvold and colleagues (1995) and Pagano and Gotay (2005) examined the impact of removing items with DIF from the scales; both concluded that biased items should not be removed. Bjorner and colleagues (2004) examined the effect of removing items with DIF from the scoring algorithm; they concluded that scoring algorithms that took language-related DIF into account did not perform as well as those that ignored DIF. In contrast, an analysis performed by Petersen et al. (2003) found that scale scores were equivalent when the biased item, “worry”, was removed from the scale. No recommendations were made to remove items.
The general health measures often contain subscales or items measuring several domains, including physical and mental health, pain, vitality and social role functioning. Five of the nine studies in this area examined a commonly used instrument to measure general health: the Short Form (SF)-12 (Ware, Kosinski & Keller, 1996), the SF-36 (Ware & Sherbourne, 1992; Ware, Gandek & The IQOLA Project Group, 1994) or the RAND-36 (Hays, Sherbourne, & Mazel, 1993). A little over half of the nine studies of general health applied a latent variable DIF model, such as Rasch or MIMIC. However, Bjorner and colleagues tested for uniform and non-uniform DIF in the SF-36 using a partial gamma coefficient in Danish and American populations (Bjorner, Kreiner, Ware, Damsgaard & Bech, 1998). Using this method, they found that four items from the Physical Functioning scale and two from the General Health scale behaved differently for these cultural groups. Fleishman and Lawrence (2003) examined DIF in the SF-12 (Ware et al., 1996) by race/ethnicity, age, gender, and education level for an American adult sample (aged 17 or older) using the MIMIC model. Ten of the twelve items showed DIF for the demographic variables tested. Yu and colleagues (2007) examined DIF in the physical function (PF) and mental health (MH) subscales of the SF-36 for demographic characteristics and for hypertension, rheumatic conditions, diabetes, respiratory diseases, and depression, also using the MIMIC model. Uniform DIF was observed for numerous items. Perkins, Stump, Monahan and McHorney (2006) examined DIF in all subscales of the SF-36 in two large national datasets: the National Survey of Functional Health Status contained general population data, and the Medical Outcomes Study contained data from a chronically ill population. Data were examined with respect to age, education, race and gender using proportional-odds logistic regression. While numerous items exhibited DIF, most were not of high magnitude.
Moorer and colleagues (2001) examined the RAND-36 (Hays et al., 1993) using a non-parametric IRT model, Mokken scale analysis for polychotomous items (MSP). Using this method, the authors found no evidence of DIF across disease groups (multiple sclerosis, rheumatism, and COPD). In addition, two groups examined DIF in 15 RAND-36 items using the item response theory log-likelihood ratio and ordinal logistic regression approaches in a sample of patients with cancer or HIV/AIDS (Teresi et al., 2007; Crane, Gibbons, Ocepek-Welikson, et al., 2007). High-magnitude DIF was observed for three of the five items identified with uniform DIF.
The majority of DIF findings relate to the physical function subscale. Nine of the ten items exhibited DIF in at least one group; “vigorous activities” was most frequently cited as problematic, while “walk one block” showed no DIF findings. “Vigorous activities” showed both uniform and non-uniform DIF for age (Perkins et al., 2006; Yu, Yu, & Ahn, 2007; Teresi et al., 2007), education (Perkins et al.; Yu et al.), income (Yu et al.), disease group (Yu et al.), gender (Perkins et al.), and race (Perkins et al.; Teresi et al.). While older people with poor physical function reported less limitation than younger people, among those with high physical function, older persons reported more limitation with “vigorous activities” than did younger people (Perkins et al., 2006). A MIMIC model showed negative effects of age for this item (Yu et al., 2007). Differences of large magnitude were found in a sample of patients with cancer, HIV and AIDS; conditional on functional status, White respondents were less likely than Blacks, and those 66 and older were less likely than younger persons, to report that they were capable of “vigorous activities” (Teresi et al., 2007). In general and sick populations, those with less education and Blacks reported less limitation with respect to “vigorous activities”, with more pronounced differences at lower to mid-levels of physical functioning (Perkins et al.). In addition to similar education findings, Yu and colleagues (2007) found that lower income had positive DIF effects. The analyses by Perkins and colleagues of a sample of chronically ill persons showed non-uniform gender DIF for this item. Finally, in a large data set, those with respiratory disease scored lower than expected on “vigorous activities” (Yu et al., 2007).
Older people reported more limitations in “moderate activities” (Yu et al., 2007; Fleishman & Lawrence, 2003), and people with hypertension more frequently endorsed “health limits moderate activities” (Yu et al.). Conditional on function, fewer Americans than Danes reported limitations in “lifting or carrying groceries” (Bjorner et al., 1998), while Blacks as contrasted with Whites (Teresi et al.) and females (Yu et al.; Teresi et al.) reported greater limitations on this item. Older people, those with less education, and females reported more limitations with “stair climbing” (Fleishman & Lawrence; Yu et al.). Age showed both uniform negative effects for “bending/kneeling/stooping” (Yu et al.) and non-uniform effects (Perkins et al.; Teresi et al.): older people with poor physical function reported less limitation than did younger people, whereas among those with high physical function, older people reported more limitations with this item than did younger persons. Those with hypertension more frequently endorsed limitations in “bending/kneeling/stooping” (Yu et al.). Crane, Gibbons, Ocepek-Welikson, et al. (2007) found DIF for this item across three forms of logistic regression.
Both uniform and non-uniform DIF were found for “walking more than a mile”. Both Crane, Gibbons, Ocepek-Welikson, et al. (2007) and Teresi and colleagues found DIF in this item using the same data set but two DIF detection methods (logistic regression and IRT log-likelihood ratio). Whites compared with Blacks, and older compared with younger persons, reported more limitations (Teresi et al.). Among those with low physical functioning, Blacks reported fewer limitations than Whites with “walking more than a mile”, while at higher function levels Blacks reported more limitation than Whites (Perkins et al.). Age showed negative effects for this item (Yu et al.). Fewer Danes than Americans indicated limitations in “walking more than a mile” and “walking several blocks” (Bjorner et al.). Older people (Perkins et al.), females (Yu et al.), and Americans (compared with Danes; Bjorner et al.) reported fewer limitations with respect to “bathing or dressing”. This item also showed non-uniform DIF for age and education (Perkins et al.).
Each mental health item showed DIF in at least one of the three studies in which DIF was evidenced for this subscale. Americans (compared with Danes; Bjorner et al., 1998) and Blacks (compared with Whites) were more likely to endorse ‘been a very nervous person’, while those with less than a college education and those who were married were less likely to endorse this item (Yu et al., 2007). The group with less education showed a negative effect for “nothing could cheer you up” (Yu et al.), and Americans were more likely than Danes to endorse this item (Bjorner et al.). “Felt calm and peaceful” showed multiple effects across studies: older age groups (Perkins et al., 2006; Fleishman & Lawrence, 2003; Yu et al.; Teresi et al., 2007), those with low education (Perkins et al.; Fleishman & Lawrence) and minorities (Fleishman & Lawrence; Yu et al.) were more likely than expected to give higher ratings, while those who were married endorsed this item less often than expected (Yu et al.). Fleishman and Lawrence found that women and older age groups were less likely to report having “felt downhearted”. Finally, those with less education (Perkins et al.), those with low income, and women (Yu et al.) were more likely to report having “been a happy person”.
All five general health items showed DIF for at least one group across two studies. “Health in general” was rated significantly lower by Danes than by Americans (Bjorner et al., 1998), and by older people (70+) compared with younger (18–39), by those with low education (0–11 years) as contrasted with high education (13+ years), and by Blacks vs. Whites (Perkins et al.). In both data sets examined by Perkins and colleagues, older people, and those 55–69 years old compared with those 18–39 years old, were less likely to report “getting sick easier”. In both data sets, older people were less likely to ‘expect their health to get worse’. In the sample from the sick population, “expect health to get worse” showed DIF for gender, age and race (Perkins et al.). Additionally, Perkins et al. found that the item “health is excellent” showed gender DIF in the sick population; males were more likely to endorse the item. DIF was also observed for “I am as healthy as anybody I know” (Bjorner et al.).
Those aged 70 and older reported less pain than younger people (Fleishman & Lawrence). Perkins and colleagues found that all items in the vitality subscale showed either uniform or non-uniform DIF for age in both data sets examined. Non-uniform findings were more pronounced for those at the lower range of the vitality scale. Conditional on vitality, older people reported having less ‘energy’ and having felt less “full of pep”; however, they were less likely to report having “felt worn out”, and “felt tired”. In addition, “felt worn out” showed DIF effects for education and “felt tired” showed non-uniform DIF for race (Perkins, et al.). The examination of the SF-12 showed women were more likely to report not “having energy”; this item was also rated more highly (less symptomatology) by older people, Blacks and Hispanics (Fleishman & Lawrence).
Physical and emotional role items infrequently exhibited DIF. The physical and emotional role limitation items of the SF-12 showed small effects for education. Those with low education (<12 years) reported that they “accomplished less” (Fleishman & Lawrence). Perkins and colleagues found one physical role item showed positive DIF (“limited in kind of work”) for females in the sick population, and older people were more limited in the sick and general populations (Perkins, et al.). An examination of the SF-12 “social activities” item showed older people reported more interference with “social activities” than those younger than 40 and less interference for ‘other’ race (Fleishman & Lawrence).
Another general health instrument, the Stanford Health Assessment Questionnaire (HAQ; Fries, Spitz, Kraines & Holman, 1980) was evaluated for Turkish and American populations. The authors found that the item “grip” showed DIF by gender, and the three item Activities subscale showed DIF by culture; Turkish respondents scored slightly higher than did American respondents (Kucukdeveci et al., 2004). Hahn and colleagues (2005) examined uniform DIF for items in the three subscales of the Functional Assessment of Cancer Therapy – Breast (FACT-B; Cella, 1997) for Austrian and American patients using a one-parameter Rasch measurement model, and item location comparison. The trial outcome index (TOI) showed that Americans responded significantly more positively to “enjoy life” and “feel sexually attractive”, and significantly less positively for “bothered by weight change”, “energy”, and “arms swollen and/or tender”. In the social/family well-being subscale (SWB), “family communication” was more negatively rated by Austrians, and “satisfaction (sex life)” was more negatively rated (had a higher threshold calibration) by Americans. Three items from the emotional well-being scale (EWB), “proud of coping”, “worry (dying)”, and “sad” showed DIF between the groups. Adjusting for DIF did not alter the direction of differences, but slightly altered the effect sizes for each group for all three scales. Teresi and colleagues (2007), examining several FACT items, found that “able to enjoy life” and “content with my quality of life” were more severe indicators for men, and that “worried about dying” was a more severe indicator for the younger cohort and for Blacks.
One additional general health measure, the Sickness Impact Profile (Bergner, Bobbitt, Carter & Gilson, 1981), was examined for DIF. The authors adjusted for sickness level and examined differences between age groups. Younger participants were less likely to endorse “I get around only by using a walker, crutches” and “I do not walk up or down hills”. Younger men were less likely to endorse the mobility item “I do not get around in the dark or in unlit places without someone’s help” (Lindeboom et al., 2004).
Half of the studies of health measures examined impact; one (Hahn et al., 2005), examining the FACT-B (Cella, 1997), concluded that the impact was slight. The authors of a study of the SF-36 (Bjorner et al., 1998) concluded that the impact of DIF at the scale level was slight with respect to comparisons of the Danish vs. American translations; however, the authors of another study (Fleishman & Lawrence, 2003), examining the SF-12 (Ware et al., 1996), concluded that the impact of DIF with respect to self-reported Black status and age was large enough to be important. Crane, Gibbons, Narasimhalu, et al. (2007) also found scale-level impact related to race for all subscales of the FACT. While several SF items showed significant DIF, the authors of these studies did not recommend removal of items.
Review of articles examining DIF in patient self-reported outcomes across three domains (depression, quality of life and general health) identified several poor-performing items within and across domains. These items were flagged because they evidenced large-magnitude and/or consistent DIF across studies. Among items measuring depression, seven were problematic. The “crying” item, despite slight semantic divergence across various depression scales (e.g., CESD, Short Care, BDI), showed consistent DIF for demographic and health-related groups. Additional poor-performing items, “people dislike me” (CESD), “everything was an effort” (CESD), “sleep disturbance” (Short Care, DIS, BDI, PHQ-9), “feeling calm and peaceful” (SF) and “appetite” (CESD, BDI, PHQ-9), showed differential functioning for various demographic and/or health-related groups. Similarly, “energy”, included in depression (GDS, PHQ-9) and general health (SF) scales, showed consistent differential item functioning with regard to several demographic variables.
Turning to items measuring quality of life, worth noting are the items, “felt worried” and “worried about dying”, which are included in emotional distress (FACT) subscales as well as in quality of life (EORTC QLQ-C30) measures. A variant of the “worry” item was repeatedly found to perform differently for racial/ethnic, age, and language groups. Also contained in the quality of life measure, the EORTC QLQ-C30, and in health measures such as the SF and the HAQ are items related to walking. The problematic items are “taking a long walk”, “taking a short walk”, “I find it difficult to walk to the shops”, “walking more than a mile”, “walking several blocks”, and “I do not walk up or down hills”. The general health item “vigorous activities” also showed DIF for a wide variety of demographic variables.
Because the following points apply to most of the studies, they are given as an overview prior to the review of each content area. Ten of the analyses used a one-parameter Rasch model, six used MIMIC, and three used an M-H or other contingency-based method; therefore, over half of the analyses were not capable of examining non-uniform DIF. However, a few studies using the Rasch model were able to examine a form of non-uniform DIF by applying the ANOVA approach to Rasch logits or residuals. The ANOVA approach allows an interaction term of studied group by ability level, which provides an estimate of non-uniform DIF. However, most Rasch analyses were based on t-tests, accompanied by plots of difficulties with 95% confidence intervals; these analyses permitted examination only of uniform DIF. Some of the analyses were very thorough, examining assumptions and impact. Few studies using the Rasch approach examined the magnitude of DIF or incorporated purification. A frequently cited reason for the use of the model was its smaller sample-size requirement; however, some of the subgroup sample sizes (e.g., around 30) were most likely too small for generalization. Generally, the studies that focused on DIF, rather than examining DIF as a byproduct of other analyses, produced more comprehensive results.
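As an illustration of the contingency-based approach mentioned above, the Mantel-Haenszel procedure pools stratified 2x2 tables into a common odds ratio. The following is a minimal, self-contained sketch, not code from any reviewed study; the function name and the simulated ten-item data are purely illustrative.

```python
import math
import random
from collections import defaultdict

def mantel_haenszel_or(item, group, total):
    """Mantel-Haenszel common odds ratio for uniform DIF on a binary item.

    Respondents are stratified on the observed total score; within each
    stratum the 2x2 table of group (0 = reference, 1 = focal) by item
    response (0/1) is pooled.  Values near 1.0 suggest no uniform DIF.
    """
    strata = defaultdict(lambda: [[0, 0], [0, 0]])
    for y, g, t in zip(item, group, total):
        strata[t][g][y] += 1
    num = den = 0.0
    for (a0, a1), (b0, b1) in strata.values():
        # a0/a1: reference group not-endorsed/endorsed; b0/b1: focal group
        n = a0 + a1 + b0 + b1
        num += a1 * b0 / n   # reference endorses, focal does not
        den += a0 * b1 / n   # focal endorses, reference does not
    return num / den if den > 0 else float("nan")

# Illustrative simulation: ten Rasch-type items, with 1.5 logits of
# uniform DIF injected into item 0 for the focal group.
random.seed(1)
responses, groups = [], []
for i in range(4000):
    g = i % 2
    theta = random.gauss(0.0, 1.0)
    row = []
    for j in range(10):
        b = -0.9 + 0.2 * j + (1.5 if (j == 0 and g == 1) else 0.0)
        p = 1.0 / (1.0 + math.exp(b - theta))
        row.append(1 if random.random() < p else 0)
    responses.append(row)
    groups.append(g)

totals = [sum(r) for r in responses]
alpha_dif = mantel_haenszel_or([r[0] for r in responses], groups, totals)
alpha_null = mantel_haenszel_or([r[5] for r in responses], groups, totals)
print(alpha_dif, alpha_null)   # alpha_dif well above 1; alpha_null near 1
```

Because the statistic conditions only on the observed total score, it is sensitive to uniform DIF; a group-by-ability interaction (non-uniform DIF) can leave the pooled odds ratio near 1 even while the stratum-specific ratios vary, which is the limitation noted above.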
Few studies employed a two-parameter IRT model, which does allow the detection of non-uniform DIF, and most of the authors of studies using the one-parameter model did not test the fit of that model against alternative models with more parameters. No studies used a three-parameter IRT model. Unlike in educational testing, guessing is rarely examined in studies of DIF in health and psychology. However, as pointed out by Kubinger and Gottschall (2007), guessing behavior may be culturally determined. While it is doubtful that respondents will guess in answering health or mental health questions, guessing could be a factor in cognitive assessments. Additionally, other response sets, such as a tendency to use the extreme or positive ends of the response category continua, can play a role in biasing item responses (McHorney and Fleishman, 2006). It has been found, for example, that Latino respondents tend to endorse the extreme categories (Marin, Gamba and Marin, 1992). Such factors may help to explain findings related to DIF. Azocar and colleagues (2001) provide a detailed discussion of DIF related to extreme endorsement among Spanish speakers as contrasted with English speakers.
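For reference, the models at issue can be stated compactly in standard IRT notation (this sketch is generic, not taken from any particular study reviewed):

```latex
% Two-parameter logistic (2PL) model: discrimination a_i, difficulty b_i
P_i(\theta) = \frac{1}{1 + \exp\!\left[-a_i(\theta - b_i)\right]}

% Three-parameter logistic (3PL) model: adds a lower asymptote c_i
% (a "guessing" parameter)
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + \exp\!\left[-a_i(\theta - b_i)\right]}
```

Uniform DIF corresponds to a group difference in the difficulty b_i alone, while non-uniform DIF corresponds to a group difference in the discrimination a_i. Because the one-parameter (Rasch) model constrains all a_i to be equal, group differences in discrimination cannot be represented, which is why analyses restricted to that model can detect only uniform DIF.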
A handful of studies used the MIMIC latent variable model; the authors applying this approach were meticulous in the execution of the analyses, usually examining assumptions, magnitude and impact. This method allowed simultaneous examination of multiple exogenous (studied) variables. The drawback to the approach is that the method does not allow detection of non-uniform DIF.
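In the usual formulation, a MIMIC model for a continuous latent response y_i* underlying item i and a grouping covariate x can be sketched as follows (generic notation, not reproduced from the reviewed studies):

```latex
% Measurement model: a direct effect \beta_i of the covariate on the item
y_i^* = \lambda_i \eta + \beta_i x + \varepsilon_i

% Structural model: the covariate may also shift the latent trait itself
\eta = \gamma x + \zeta
```

A nonzero direct effect \beta_i signals uniform DIF, while \gamma captures a true group difference on the latent trait \eta. Because the loading \lambda_i is not permitted to vary with x, a group-by-trait interaction (non-uniform DIF) lies outside the model.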
One study applied a non-parametric IRT method. A strength of the Mokken scale procedures used by Moorer and colleagues (2001) is that DIF is examined across scales; however, as others have shown, parametric methods of DIF detection might have identified more items with DIF. In that study, no evidence of DIF was observed across disease groups.
The logistic regression approach was also used infrequently; when applied, however, it allowed estimation of non-uniform DIF because an interaction term for group and ability could be included. A downside of these analyses was that observed scores, rather than the theoretically preferred latent variables, were typically used as conditioning variables. While it has generally been advised that purification be used to avoid false DIF detection, this practice was infrequently applied.
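The hierarchical testing logic of the logistic regression approach can be sketched in a few lines: fit nested logistic models for a binary item, conditioning on an observed (rest) score, and compare log-likelihoods; the group main effect tests uniform DIF, and the group-by-score interaction tests non-uniform DIF. This is a self-contained illustration under simulated data; the function names and the ten-item example are hypothetical, not taken from the reviewed studies.

```python
import math
import random

def _solve(A, b):
    """Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def fit_logit(X, y, iters=25):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        H = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            eta = max(-30.0, min(30.0, sum(b * v for b, v in zip(beta, xi))))
            mu = 1.0 / (1.0 + math.exp(-eta))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    H[j][k] += w * xi[j] * xi[k]
        beta = [b + s for b, s in zip(beta, _solve(H, grad))]
    return beta

def loglik(X, y, beta):
    ll = 0.0
    for xi, yi in zip(X, y):
        eta = max(-30.0, min(30.0, sum(b * v for b, v in zip(beta, xi))))
        mu = 1.0 / (1.0 + math.exp(-eta))
        ll += yi * math.log(mu) + (1 - yi) * math.log(1.0 - mu)
    return ll

# Illustrative data: ten Rasch-type items with 1.2 logits of uniform DIF
# injected into item 0 for the focal group; the rest score (sum of the
# other nine items) serves as the observed conditioning variable.
random.seed(7)
responses, groups = [], []
for i in range(2000):
    g = i % 2
    theta = random.gauss(0.0, 1.0)
    row = []
    for j in range(10):
        b = -0.9 + 0.2 * j + (1.2 if (j == 0 and g == 1) else 0.0)
        row.append(1 if random.random() < 1.0 / (1.0 + math.exp(b - theta)) else 0)
    responses.append(row)
    groups.append(g)

y = [r[0] for r in responses]
rest = [sum(r[1:]) for r in responses]
m = sum(rest) / len(rest)
X0 = [[1.0, t - m] for t in rest]                  # conditioning score only
X1 = [x + [float(g)] for x, g in zip(X0, groups)]  # + group main effect
X2 = [x + [x[1] * x[2]] for x in X1]               # + group-by-score interaction

ll0 = loglik(X0, y, fit_logit(X0, y))
ll1 = loglik(X1, y, fit_logit(X1, y))
ll2 = loglik(X2, y, fit_logit(X2, y))
lr_uniform = 2.0 * (ll1 - ll0)      # tests uniform DIF (1 df)
lr_nonuniform = 2.0 * (ll2 - ll1)   # tests non-uniform DIF (1 df)
```

Each likelihood-ratio statistic is referred to a chi-square distribution with one degree of freedom; with the simulated uniform DIF, the first statistic is large while the interaction term adds essentially nothing. Using the observed rest score in place of a latent ability estimate reproduces the conditioning limitation noted above.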
An important limitation is that in the interest of parsimony, the studies reviewed here focused only on formal tests of DIF. However, strict metric-level factorial invariance (using a multi-group factor model) is equivalent to DIF testing using a 2-parameter IRT model (Meredith & Teresi, 2006). Thus, there are measures that have been evaluated using an equivalent method that are not reviewed here. Additionally, although it is believed that the search was comprehensive, it is possible that some studies not contained in the databases searched were inadvertently omitted. A few studies were not included in the table because the DIF analysis was very limited, and could not be evaluated, based on the information presented in the article, or the sample sizes were too small. One study using two methods (Crane, Gibbons, Ocepek-Welikson, et al., 2007; Teresi et al., 2007) was not included in the table because the item set was from several scales; however, the findings were summarized in the body of the paper. Finally, the review was focused only on three areas of patient-reported outcomes, albeit highly salient in terms of clinical trials and observational studies. As previously stated, there were numerous articles on DIF related to function and disease-specific health that were not included in the interest of parsimony.
Most studies (8 of 14) reviewed in the area of depression examined magnitude, and all but one estimated the impact of DIF; about one-third examined non-uniform DIF. In general, findings were of large amounts of DIF of sizeable magnitude and impact. Adjustments of scale scores were frequently recommended. Among the quality of life studies reviewed, just over half included an assessment of magnitude and of impact of DIF. The authors of these studies tended to conclude that the impact of DIF was minimal, and scale adjustments were not warranted. However, DIF may have been underestimated, as only two studies included a formal evaluation of non-uniform DIF. Finally, of the studies of general health reviewed, half included measures of magnitude; impact was discussed with respect to over half of the studies; however, formal tests of non-uniform DIF were rarely performed. The authors of one of the five studies reviewed that examined the impact of DIF in general health measures concluded that DIF had an important impact on the emotional (mental) health component (Fleishman & Lawrence, 2003). Within the limitations of this review, it might be concluded that depression measures are more subject to DIF than are other types of measures; however, a major caveat is that most of the analyses of general health measures reviewed here did not incorporate tests of non-uniform DIF.
In summary, as a whole, these studies provide good beginning estimates of the presence of DIF in measures of patient-reported outcomes; however, most results should be cross-validated with other (usually larger) samples, using a method that permits examination of non-uniform DIF, while also incorporating the use of latent variables. Examination of magnitude and impact, coupled with qualitative review of item content is also critical in order to achieve an understanding of the role of DIF in assessment of patient-reported outcomes.
These analyses were supported in part by the United States National Institute on Aging: Resource Centers on Minority Aging Research (5P30AG015294), the United States National Institutes of Health Roadmap project: Patient Reported Outcomes Information System (PROMIS) (5U01AR052177), and the United States National Institute on Aging project: Understanding Disparities in Mental Status Assessment (AG0253008).