Using our eight-step developmental approach, we were able to identify, and in many cases remediate, problems rooted in interviewer (mis)behavior, translation error, cultural adaptation, and general question design. The nature of these problems suggests several conclusions concerning translated instruments. First, for interviewer-administered instruments, data quality is not simply a function of correct translation, but also of proper delivery. Even where questions contained no obvious translation problems, interviewers sometimes misread them in ways that fundamentally affected their meaning. In training, researchers need to stress that interviewers should read each question as worded, to avoid variance in responses attributable only to discrepancies in interviewer performance. Alternatively, if interviewers are allowed to deviate from standard wording (in an attempt to assist respondents), it is important to monitor interviews carefully for signs of bias, for example by behavior coding the delivery of each question, as sketched below.
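To make such monitoring concrete, the following minimal sketch tabulates behavior codes of the familiar exact/slight-change/major-change variety. The data, code labels, and 15% threshold are hypothetical illustrations, not the TUSCS-CPS coding scheme or results:

```python
from collections import Counter

# Hypothetical behavior codes for each (question, administration) pair:
# "E" = exact reading, "S" = slight change, "M" = major change in wording.
coded_exchanges = [
    ("Q1", "E"), ("Q1", "M"), ("Q1", "E"), ("Q1", "E"),
    ("Q2", "E"), ("Q2", "S"), ("Q2", "E"), ("Q2", "E"),
    ("Q3", "M"), ("Q3", "M"), ("Q3", "E"), ("Q3", "S"),
]

def major_change_rate(exchanges):
    """Return, per question, the proportion of administrations coded
    as a major wording change ("M")."""
    totals, majors = Counter(), Counter()
    for question, code in exchanges:
        totals[question] += 1
        if code == "M":
            majors[question] += 1
    return {q: majors[q] / totals[q] for q in totals}

# Flag questions whose rate of major reading changes exceeds a chosen
# threshold (15% here) as candidates for retraining or rewording.
for q, rate in sorted(major_change_rate(coded_exchanges).items()):
    if rate > 0.15:
        print(f"{q}: {rate:.0%} of readings involved a major change")
```

Tallies of this kind can separate questions that interviewers routinely deliver as worded from those that invite improvisation, and so help target either training or question revision.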
Second, our results suggest that respondents answering the questionnaire in languages other than English, especially those who are elderly or unacculturated to American society, may be less familiar with the implicit rules of questionnaire interaction. They may therefore be less likely to choose from the specific response categories provided, or to answer in the terms implied by the question. In these cases the interviewer needs to be prepared to use neutral follow-up probes to elicit codeable responses (e.g., “How many minutes or hours would you say it is before you smoke your first cigarette of the day?”).
Further, we note that even an intensive translation effort relying on multiple experts did not eradicate some clear errors subsequently identified through empirical pretesting, and that in general, each evaluation step served as a successive filter, catching problems that had slipped through earlier steps. Even with a multi-step process, however, some problems may persist: sources of cross-cultural non-comparability sometimes cannot be eliminated simply by finding a corrected verbal translation. In particular, a basic concept (e.g., “light” versus “regular” cigarette) may not be represented in a way that makes survey questions about it meaningful for all groups. Through pretesting, however, we can at least identify these limitations and be prepared to interpret results with the appropriate caveats.
Alternatively, the pretesting process may reveal particular problems that can be resolved, but whose resolution requires modification of the source (e.g., English) questionnaire, especially when the problems appear generic in nature rather than tied to a particular cultural group or translated version. Therefore, for investigations in which constraints on the source version are less severe than for the TUSCS-CPS, we advocate the practice of decentering (McKay et al., 1996; Werner & Campbell, 1970), in which the original questionnaire is considered open to modification based on results obtained through testing of the translations.
Caveats and suggestions for further research
First, we acknowledge that the techniques used – cognitive interviewing and behavior coding – are relatively new in their application to cross-cultural research, and can be applied in a wide variety of ways. As such, there are no commonly accepted practices known to guarantee validity and avoid biased results. Most seriously, given that assessment of difficulties across languages necessarily involved different sets of bilingual researchers, our cognitive probing and coding methods might have been applied inconsistently across groups, such that observed discrepancies between language versions may reflect between-group variation in our testing processes rather than true differences between versions. On the other hand, our interviewers did independently record difficulties that were common across languages (generic problems), which is somewhat reassuring, as it represents agreement concerning the identification of a common subset of problems with the evaluated items; one rough way to quantify that overlap is sketched below.
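For illustration only (the item numbers and flag sets below are hypothetical, not our actual findings), a simple set comparison distinguishes items flagged in every language version, which are candidates for generic problems in the source instrument, from items flagged in a single version, which may reflect translation-specific difficulties or between-team variation in how the methods were applied:

```python
# Hypothetical sets of item numbers independently flagged as problematic
# by the bilingual testing team for each language version.
flagged = {
    "English":   {2, 5, 7, 11},
    "Spanish":   {2, 5, 9, 11},
    "Cantonese": {2, 5, 11, 14},
    "Korean":    {2, 5, 11},
}

# Items flagged in every version are candidate "generic" problems,
# likely rooted in the source questionnaire rather than any translation.
generic = set.intersection(*flagged.values())

# Items flagged in exactly one version suggest translation-specific
# difficulties, or between-team variation in applying the methods.
all_items = set.union(*flagged.values())
unique = {i for i in all_items if sum(i in s for s in flagged.values()) == 1}

print("Candidate generic problems:", sorted(generic))  # [2, 5, 11]
print("Version-specific flags:", sorted(unique))       # [7, 9, 14]
```

A high proportion of generic flags, as in this toy example, is consistent with teams detecting real item defects rather than artifacts of their own divergent procedures.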
A second limitation is that our pilot test, which served as the basis for behavior coding, did not obtain equivalent participant groups with respect to demographic variables that might influence the survey response process. In particular, the significance of the wide gender discrepancies across our groups is unclear; future studies might consider more intensive efforts to recruit female Asian smokers in particular. As a final caveat, our measures of item function were indirect, as we lacked criterion measures by which to directly assess the degree of response error produced by the evaluated survey questions for any language group. We therefore advocate future research efforts that access true criterion measures (e.g., objective counts of cigarettes smoked in the past 30 days).
Despite these limitations, we conclude that pretesting with cognitive interviewing and behavior coding likely reveals significant defects in translated versions, and appears potentially useful as a general set of procedures for maintaining quality control. This investigation therefore adds to the growing literature indicating that the additional time and expense of pretesting is worthwhile in pursuit of cross-cultural comparability of translated survey items. In closing, we note that the field of survey translation and cross-cultural evaluation is dynamic, and that demonstrated “best practices” have yet to be established. We suggest that researchers attend to further developments in this area, as these will guide the future development and quality assurance of health survey questionnaires generally.