In this national survey, both Hispanics and non-Hispanic blacks tended to endorse depression questions less frequently than non-Hispanic whites. This was particularly true for blacks, who were significantly less likely than whites to have 5 of the 8 symptoms examined. Hispanics were less likely than whites to have 1 of the 8 symptoms, suicidality. We used a nonparametric IRT approach to examine whether the lower levels of endorsement among minorities in this survey could be explained by DIF against minorities, defined as a lower probability of endorsing questions among minorities compared with non-Hispanic whites after adjusting for differences in underlying levels of depression.
DIF against minorities was found for several of the questions used to assess depression. DIF against minorities at the question level led to DIF against minorities at the symptom level for 2 symptoms in the black-white comparison and for 1 symptom in the Hispanic-white comparison. However, the existence of DIF does not necessarily imply that the conclusions regarding the relative prevalence of depression between groups are incorrect. To assess the impact of the DIF against minorities on estimates of the relative prevalence of depression between ethnic groups, we followed a procedure of removing questions with DIF against minorities if this could be done without adversely affecting the construct validity of the test. After performing this procedure, we were able to reduce DIF at the symptom level for the black-white comparison and eliminate DIF at the symptom level for the Hispanic-white comparison. Recalculating the prevalence of depression without the questions with DIF, we found no evidence that differences in lifetime prevalence between groups were significantly affected by DIF in the original questions. With respect to the main question posed in this study, we found no evidence that the low prevalence of depression among minorities relative to whites is a result of DIF.
DIF Against Blacks
Two questions were removed because they had significant DIF against blacks: felt worthless and thoughts of suicide. In both cases, there were other questions measuring the same symptom that were not found to exhibit DIF. This suggests that nuisance factors related to the specific wording of the question with DIF, rather than factors related more generally to the symptom, account for the observed DIF in both cases, which implies that the DIF in these 2 cases is likely to be adverse. Because of DIF in the felt worthless question, DIF was also found for the symptom of self-reproach. DIF in this symptom was no longer found when the felt worthless question was removed. This suggests that one means of improving the consistency of the assessment of depression would be to remove the felt worthless question. However, an equally valid and perhaps more clinically defensible approach would be to add a question with an alternative phrasing that might be more commonly endorsed by minority groups at the same level of depression. Using questions with countervailing DIF by design may be the best way to achieve consistent measurement given the reality of cultural variations in the idioms of depression and the way that the questions are combined to determine a diagnostic assessment.
Although 1 of the 4 questions used to assess suicidality displayed DIF against blacks, no DIF was found at the level of the suicidality symptom even with this question included. This is most likely a result of the facts that (a) this question has a lower prevalence in the population than other questions used to assess this symptom and (b) blacks were slightly (nonsignificantly) more likely than whites to endorse other suicide questions at the same level of depression. Therefore, DIF in this question has a negligible impact on the relative prevalence of depression.
DIF against blacks was also found for the lack of energy question. Because this was the only question used to assess this symptom, we could not remove it from the test and maintain the test’s construct validity. Therefore, these data do not provide sufficient evidence to suggest whether DIF in this case is due to the specific wording of the question or to some clinically meaningful variation in the manifestation of depression between ethnic groups.
DIF Against Hispanics
Three questions were found to have DIF against Hispanics. Two of these questions, weight gain and early waking, did not result in DIF at the symptom level because of countervailing influences from other questions. Both of these questions inquire about discrete experiences that are understood to be instances of the more general symptom: weight gain is a discrete type of appetite and weight change, and early waking is a discrete type of sleep disturbance. In neither case, therefore, can we rule out the possibility that DIF for these questions is benign, that is, due to an auxiliary factor that is clinically meaningful. Consequently, we cannot make a recommendation as to whether these questions should be removed or changed on the basis of these data.
The third question with DIF against Hispanics was the thoughts of suicide question. As in the black-white comparison, DIF in this question against Hispanics did not adversely affect consistency of measurement at the symptom level.
These results should be interpreted in light of 4 limitations of the IRT approach to assessment of DIF and response bias. First, our assessment of DIF relies on an internal criterion to assess respondents’ underlying levels of depression. Because we did not have a valid subset of questions, we had to use the leave-one-out approach to measure depression levels. This approach may result in estimates of depression that are contaminated by DIF, which creates problems with detecting DIF in individual questions and symptoms. In particular, it is impossible to detect DIF if all questions (or symptoms) have DIF of similar magnitude and direction (Camilli, 1993). This theoretical concern has been borne out in simulation studies by Gierl et al. (2004), who investigated the performance of SIBTEST in the presence of pervasive DIF. They found that in these circumstances, SIBTEST may fail to detect questions with significant DIF and, further, may falsely flag questions that in reality lack DIF.
Two aspects of our DIF detection procedure limited the possibility that our conclusions would be distorted by pervasive DIF. First, we used 1-sided tests to examine DIF against minorities. This prevented us from erroneously removing any DIF-free questions that, because of contamination in our underlying measure of depression, appeared to have DIF in favor of minorities. Second, we used an iterative purification procedure whereby we eliminated questions with DIF against minorities and then re-examined the remaining questions for DIF until no new questions with DIF were found (Camilli and Shepard, 1994). Because questions with severe DIF against minorities will still seem to have significant DIF and will therefore be removed after the first pass, the estimate of depression used in the second pass will be less contaminated by DIF. This means that we will then be able to detect questions with moderate DIF against minorities in the second pass, and so on and so forth. If most or all questions have similar amounts of DIF against minorities (e.g., because of a greater reluctance to disclose potentially embarrassing information), then we will not be able to detect DIF even with an iterative purification procedure based on 1-sided tests. However, in less extreme situations, the 1-sided iterative procedure will allow us to locate those questions that truly have DIF against the minority group.
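The logic of the 1-sided iterative purification described above can be sketched in a few lines of Python. This is an illustration only, not the SIBTEST procedure used in the analysis: the statistic below is a simplified standardized difference in endorsement rates, stratified on a leave-one-out rest score, and it omits SIBTEST's regression correction. The function names (`dif_z`, `purify`, `simulate`), the group labels, the critical value, and the simulated data are all assumptions introduced for this example.

```python
import math
import random
from collections import defaultdict


def dif_z(responses, groups, item, matching_items):
    """One-sided z statistic for DIF against the focal group on `item`.

    Respondents are matched on a leave-one-out rest score: the number of
    endorsed matching items, excluding the studied item. A positive value
    means the reference group endorses the item more often than the focal
    group at the same rest score.
    """
    rest_items = [j for j in matching_items if j != item]
    strata = defaultdict(lambda: {"ref": [], "focal": []})
    for row, group in zip(responses, groups):
        rest_score = sum(row[j] for j in rest_items)
        strata[rest_score][group].append(row[item])

    # Only strata containing both groups allow a matched comparison.
    usable = [s for s in strata.values() if s["ref"] and s["focal"]]
    n_focal = sum(len(s["focal"]) for s in usable)
    if n_focal == 0:
        return 0.0

    diff, var = 0.0, 0.0
    for s in usable:
        w = len(s["focal"]) / n_focal  # weight strata by focal-group size
        p_ref = sum(s["ref"]) / len(s["ref"])
        p_foc = sum(s["focal"]) / len(s["focal"])
        diff += w * (p_ref - p_foc)
        var += w ** 2 * (p_ref * (1 - p_ref) / len(s["ref"])
                         + p_foc * (1 - p_foc) / len(s["focal"]))
    return diff / math.sqrt(var) if var > 0 else 0.0


def purify(responses, groups, n_items, z_crit=1.645):
    """Remove items with DIF against the focal group, then retest the rest.

    z_crit=1.645 corresponds to a one-sided test at the 5% level (assumed).
    Items that appear to favor the focal group are never removed.
    """
    matching = list(range(n_items))
    flagged = []
    while len(matching) > 1:
        new = [i for i in matching
               if dif_z(responses, groups, i, matching) > z_crit]
        if not new:
            break  # no new items with DIF: the matching set is purified
        flagged.extend(new)
        matching = [i for i in matching if i not in new]
    return flagged


def simulate(n_per_group=1500, n_items=6, seed=0):
    """Hypothetical data: only item 0 has DIF against the focal group."""
    rng = random.Random(seed)
    responses, groups = [], []
    for group in ("ref", "focal"):
        for _ in range(n_per_group):
            theta = rng.gauss(0, 1)  # latent depression level
            row = []
            for i in range(n_items):
                p = 1 / (1 + math.exp(-theta))
                if group == "focal" and i == 0:
                    p *= 0.4  # lower endorsement at the same latent level
                row.append(1 if rng.random() < p else 0)
            responses.append(row)
            groups.append(group)
    return responses, groups


responses, groups = simulate()
print(purify(responses, groups, n_items=6))
```

Because each pass rebuilds the rest score from the items that survived the previous pass, the matching criterion becomes progressively less contaminated. If every item were shifted by a similar amount, however, the stratified differences would shrink toward zero and nothing would be flagged, which is the pervasive-DIF case noted above.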
A second limitation of this analysis is SIBTEST’s assumption that DIF is unidirectional. This means that when DIF against minorities occurs in a question, minorities have a lower probability of endorsing that question at all levels of depression. However, there may be questions where DIF occurs in opposite directions at different levels of depression. If this is the case, then our SIBTEST procedure would not allow us to detect these questions as having DIF.
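A small numerical sketch shows why a signed index can miss DIF that changes direction. The endorsement probabilities and weights below are invented for illustration, and the index is a simplified weighted difference rather than the SIBTEST statistic itself.

```python
def signed_and_absolute_dif(p_ref, p_focal, weights):
    """Weighted signed and absolute differences in endorsement across strata."""
    signed = sum(w * (r - f) for w, r, f in zip(weights, p_ref, p_focal))
    absolute = sum(w * abs(r - f) for w, r, f in zip(weights, p_ref, p_focal))
    return signed, absolute


# Hypothetical endorsement probabilities at five matched depression levels.
# The focal group endorses more often at low levels and less often at high
# levels, so the direction of DIF reverses partway through the range.
p_ref = [0.10, 0.30, 0.50, 0.70, 0.90]
p_focal = [0.20, 0.35, 0.50, 0.65, 0.80]
weights = [0.2, 0.2, 0.2, 0.2, 0.2]

signed, absolute = signed_and_absolute_dif(p_ref, p_focal, weights)
print(f"signed index:   {signed:.3f}")
print(f"absolute index: {absolute:.3f}")
```

In this constructed case the signed index is essentially zero because the differences at low and high levels cancel, while the absolute index is clearly positive. A test built on the signed quantity would therefore report no DIF for such a question.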
A third limitation stems from the 2-stage structure of the CIDI assessment of depression, which does not collect full information on depressive symptoms for respondents who are negative on the initial screening questions. This means that we cannot investigate DIF in the screening questions where it may actually do the most damage in terms of biasing group prevalence estimates. For example, we still do not know whether blacks were less likely to screen positive for depression because of DIF in the screening questions or because of actual differences in prevalence. It would be valuable to repeat this analysis in a sample that contained full data on the screen negatives.
This study also has the limitation of sample size. The sample size limits our ability to examine DIF with respect to specific ethnic subgroups, such as Puerto Ricans versus Mexicans in comparison with non-Hispanic whites. It also limits our ability to examine DIF within sociodemographic subgroups, such as males and females. Although our main conclusions support the findings of epidemiological comparisons between broad ethnic groups (e.g., Hispanics compared with non-Hispanic whites), it is important to recognize that neither this epidemiological pattern nor our methodological result necessarily holds true for all subgroups within the broad ethnic groups examined.
Finally, this study relies on retrospective recall of depressive symptoms. Though this is a potential limitation of the method, there is no evidence of differential recall between ethnic groups (Shrout et al., 1993