The literature on cross-cultural measurement of mental disorders has focused on the important issues of literate translations of English questions that do justice to idiomatic expressions within specific language communities (22, 23) and the internal psychometric features of the translated instruments (24), such as test-retest reliability. Although these are important steps for the establishment of cross-cultural comparability of measures, they do not guarantee that individuals will provide the same responses to questions posed in the new language as they would in the first language (25). When bilingual respondents are available in a study population, it is possible to directly test the equivalence of measures by employing randomized survey designs (26, 27).
To our knowledge, our study was the first to use a randomized survey experiment to test whether a carefully prepared translated version of an epidemiologic battery was empirically equivalent to the version prepared in the original language (English). This experiment helps address the question of whether literate translated phrasings of questions evoke the same or different recollections of symptoms of psychiatric illnesses as the original questions. This is a preliminary step toward determining whether the measure distinguishes between psychiatric illnesses and culturally determined responses to illness.
The results for the Spanish and English versions of the WMH-CIDI were comparable for 10 of the 11 diagnoses considered here. For these diagnoses, there was no evidence that the rates in the Spanish arm differed from those obtained in the English arm for bilingual respondents.
The one possible exception was GAD, which seemed to have a higher lifetime rate among those bilingual respondents assigned to the English arm. We looked for evidence that the translation of a specific question or set of questions could account for this difference, and we found none. The excess rate in the English-version group was spread across questions and decision rules. We also checked to see whether this excess was consistent with differences between monolingual English respondents and monolingual Spanish respondents, and we found that the pattern was not consistent in these comparisons. In fact, monolingual English respondents had slightly lower rather than higher rates of GAD. When all four groups are compared together, it seems that the bilingual respondents assigned to the English arm are the ones whose lifetime rate (8.2 percent) is unusual in comparison with monolingual English (5.7 percent), bilingual Spanish (4.0 percent), or monolingual Spanish (6.2 percent) respondents. Because the risk of Type I error increases with multiple significance tests, and given the fact that we considered 30 comparisons in , we cannot rule out the possibility that the difference in GAD is simply a sampling fluctuation.
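The multiple-testing concern can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that the 30 comparisons are independent and that each is tested at a nominal 0.05 level. Neither assumption is stated in the text (diagnoses are likely correlated, which would change the exact figure), and the function names are hypothetical.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n_tests tests.

    Assumes the tests are independent and every null hypothesis is true.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


n_comparisons = 30      # number of comparisons reported in the text
nominal_alpha = 0.05    # assumed per-test level, not stated in the text

fwer = familywise_error_rate(n_comparisons, nominal_alpha)
expected_false_positives = n_comparisons * nominal_alpha
per_test_level = bonferroni_threshold(n_comparisons, nominal_alpha)

print(f"Chance of at least one false positive: {fwer:.3f}")
print(f"Expected number of false positives:    {expected_false_positives:.1f}")
print(f"Bonferroni per-test threshold:         {per_test_level:.5f}")
```

Under these assumptions, the chance of at least one nominally significant contrast arising by chance alone is roughly 79 percent, so a single significant result among 30 comparisons is unsurprising.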
There were several limitations to our approach. One is that our comparisons focused only on the concrete response probabilities and the use of this information to generate prevalence rates. Even when these rates are strictly comparable across language groups, reasonable questions can still be raised about the extent to which Western Anglo-American psychiatric nosology can be applied universally across the globe (28–30).
Another limitation is that we cannot guarantee that the response processes in the bilingual group were accurate reflections of what happens with monolingual groups. Bilingual respondents tend to be more educated and more complex in their cultural identity, and hence they might have had unique responses to the questions we asked them. There is an active research literature in linguistics that aims to determine how language processing is affected by knowledge of more than one language (31). Subsequent studies of the psychometric item response patterns (32) in the four language groups could help determine whether there was any evidence of differential item functioning across the groups. However, the psychometric methods make assumptions about how response patterns can be modeled that are not needed in the experimental analysis presented here.
A final limitation, an important one, is that our results cannot establish that the various diagnostic algorithms are uniformly equivalent across the two language versions. Our method made use of available bilingual respondents in a survey of Latino Americans and was not designed to make a definitive statement about the absolute equivalence of the translations. We presented confidence intervals for the contrasts between the language versions to provide readers with information about the degree of certainty that the versions would produce the same prevalence results. Although the majority of the 95 percent confidence intervals in included a value consistent with the versions' being equivalent, a number of the intervals also included values that would be consistent with important differences. This occurred for extremely rare disorders for which we had little power to detect most differences. For a number of other disorders, however, the 95 percent confidence intervals suggest that our data are inconsistent with large differences (e.g., odds ratios of 3 or more). The power and precision of our analyses were affected by the relatively modest number of bilingual respondents who were identified in our representative sample.
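To illustrate why intervals for rare disorders can include both equivalence and large differences, the following minimal sketch computes an odds ratio with a simple log-based (Woolf) 95 percent confidence interval. It is not the method used in the study: it ignores the complex survey design and weights, the function name is hypothetical, and the counts are invented solely for illustration.

```python
import math


def odds_ratio_ci(cases_a, noncases_a, cases_b, noncases_b, z=1.96):
    """Odds ratio (arm A vs. arm B) with a Woolf log-based confidence interval.

    Simple-random-sample approximation; does not account for survey weights
    or clustering. All four cell counts must be positive.
    """
    cells = (cases_a, noncases_a, cases_b, noncases_b)
    if min(cells) <= 0:
        raise ValueError("all cell counts must be positive")
    odds_ratio = (cases_a * noncases_b) / (noncases_a * cases_b)
    se_log_or = math.sqrt(sum(1.0 / count for count in cells))
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# HYPOTHETICAL counts, 200 respondents per arm (not taken from the study).
rare = odds_ratio_ci(3, 197, 2, 198)       # rare disorder: very few cases
common = odds_ratio_ci(40, 160, 38, 162)   # more common disorder

for label, (or_, lo, hi) in (("rare", rare), ("common", common)):
    print(f"{label:>6}: OR = {or_:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

With these invented counts, the interval for the rare disorder spans both 1 and 3, so it cannot distinguish equivalence from a large difference, whereas the interval for the more common disorder excludes odds ratios of 3 or more.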
Even with these limitations, we believe that the use of language experiments in health surveys of multilingual populations provides a feasible step toward testing the translation equivalence of the survey instruments. Routine tests of translational equivalence of major surveys are consistent with good management goals of continuous quality improvement of products and services. The feasibility of continuous quality improvement procedures such as survey experiments is increased by the widespread use of electronic data collection methods, such as laptop computers, which can be programmed to implement rigorous randomization designs. As epidemiologists and health policy analysts turn toward the problem of documenting risk differences and reducing service disparities that affect immigrant groups, it is essential to adopt routine checks on the equivalence of reports obtained in different languages to rule out spurious influences on patterns of findings.
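As a minimal sketch of how computer-assisted interviewing software might randomize the interview language for bilingual respondents, the example below assigns each respondent to an arm with equal probability. The function name, the equal allocation, and the respondent identifiers are assumptions made for illustration and do not describe the procedure actually used in the survey.

```python
import random

ARMS = ("English", "Spanish")


def assign_language_arms(respondent_ids, seed=None):
    """Assign each bilingual respondent to a language arm by simple randomization.

    A fixed seed makes the allocation reproducible for auditing; equal
    allocation probability is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return {rid: rng.choice(ARMS) for rid in respondent_ids}


# Hypothetical respondent identifiers.
bilingual_ids = [f"R{n:03d}" for n in range(1, 11)]
allocation = assign_language_arms(bilingual_ids, seed=2024)

for rid, arm in allocation.items():
    print(rid, arm)
```

Simple randomization can leave the arms unequal in size in small samples; a blocked design would be one alternative if balanced arms were required.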
Perhaps most critically, the finding of scant translation effects from this experiment adds confidence in the validity of the comparisons of NLAAS groups that differ with regard to the proportion of persons who provided information in Spanish versus English. Recent reports have been able to document important differences in the mental health morbidity of Latinos of different national origins (33) and to provide a clearer description of the so-called immigrant paradox, whereby economically disadvantaged immigrants appear to have unusually good health relative to Anglos and children of immigrants (34). Both basic epidemiologic inferences and health-services analyses benefit from rigorous tests of language effects.