A primary objective of the review process is the derivation of conclusions from the available data to guide future policy. In this case, however, the development of such a policy would appear to be hindered by the existence of directly contradictory results obtained from research of equal quality. As noted above, the data from the studies contained in this review have already been the subject of extensive evaluation (NRC 2000
; Rice et al. 2003
) and continue to excite controversy that is so far unresolved (Davidson et al. 1999a
; Grandjean and White 1999
; Stern and Gochfeld 1999
). A secondary objective of the review process, the identification of data gaps in the literature, appears inappropriate in circumstances where so much research has been carried out to date. Although the contribution of existing published research is unquestioned, it may be time to concede that there is little further that can be drawn from these data or, one suspects, from repeated studies of a similar type. Experience from other fields (Spurgeon 2002
) suggests that further cross-sectional studies employing similar neurobehavioral outcomes will serve only to increase rather than reduce the uncertainty surrounding this issue. In the remainder of this article, therefore, I discuss some of the possible reasons for the inconsistency in the existing data and indicate some areas where alternative approaches might be required to achieve some progress in this field.
The common objective of the investigations reviewed above was to establish whether there is an association between prenatal exposure to MeHg and developmental effects. Although the various studies had many elements in common, perhaps the most noticeable feature of the studies as a group was the variation in the methods used to assess the two basic elements of the association, namely, the exposure and the effect. It is not surprising that research using different combinations of biological and psychological measures produces inconsistent results. The debate surrounding each of these elements, although undoubtedly complex, merits resolution in advance of any further research.
In terms of the most appropriate biological marker of prenatal exposure, opinion is divided between maternal hair and cord blood as the biological sample of choice. Studies that have attempted to define the relationship between different biological indices have produced inconsistent and somewhat wide-ranging results, and conversion from one set of values to another appears to involve a number of questionable assumptions (Office of Environmental Health 1999
). Other difficulties in the interpretation of the data set arise from the use of different units of measurement and a lack of clarity in some studies about whether the measure is of organic, inorganic, or total mercury concentration. Thus, there is continuing uncertainty about the association between elements of the diet and concentrations in child hair, maternal hair, cord blood, and maternal blood, as well as about the strength of any relationships among these measures and between each of them and the actual exposure of the fetus. Elements of the debate about hair versus blood samples must be linked to a large number of other unanswered questions surrounding prenatal exposure measurement. These relate particularly to the relative importance of exposure at different periods of gestation, the relationship between these and average exposures, and the importance of peak exposures. The development of the central nervous system is time related and unidirectional. The inhibition of one stage of development tends to cause alterations to subsequent processes, with limited capacity for compensation for cell loss (Annau and Eccles 1986
; Trask and Kosofsky 2000
). Both the dose and timing of any environmental insult are important in terms of the specific nature of any adverse effects. How far do our current methods of prenatal exposure assessment reflect the need to take this into account?
The present enthusiasm for evidence-based policy and practice appears to offer an ideal opportunity to address these types of questions, either through the medium of an expert workshop or that of a written systematic review. The important issues in either process include a) definition of the important questions to be addressed to achieve valid and reliable assessment of prenatal exposure, b) identification of available data that could be used to answer these questions, and c) identification of new research required to fill any identified data gaps. In advance of some consensus on these issues, further research is likely to provoke more controversy rather than lead to any resolution of the current uncertainty.
The outcomes used in these studies were predominantly psychological tests. Use of such tests in environmental and occupational health research, which began in the early 1980s, has always been controversial, and the apparent inconsistencies in the resulting data have provoked much debate (Koller et al. 2004
; Levy et al. 2004
). Results relating to prenatal MeHg exposure represent a particular example of a wider problem and highlight a number of questions related to the more general field of neurobehavioral toxicology.
Specifically, two main areas are of concern. The first, and perhaps the more straightforward, relates to the control of variables that either represent potential confounders or may act as modifiers of the effects under investigation (Spurgeon and Gamberale 1997
). They are perhaps best considered under the broad headings of situational variables (physical testing conditions and test procedures), tester variables (reliability of the examiners), and subject variables (individual characteristics such as age, gender, and socioeconomic group). In all epidemiologic research involving psychological testing, the list of these variables is potentially very long, and researchers appear divided about which to include. In research on MeHg, the majority of studies consider important subject characteristics such as age, ethnic and socioeconomic group, and aspects of parental lifestyle. However, for a number of other variables (e.g., aspects of the caregiving environment), inclusion is patchy. For many of these variables, useful literature is available on their effects on children’s abilities or on test performance, and it may be possible to reach an evidence-based consensus on their inclusion or exclusion. For other, mainly procedural factors, data appear relatively scarce. A systematic review that encompasses other areas of psychology, for example, that pertaining to human–computer interaction, might reveal relevant information. For example, how much does the size of the screen affect performance on a computer-administered test? How much does the physical location of testing (home, laboratory, hospital) affect test performance? Existing data on the effects of time of day (Smith 1992
), for example, indicate that in epidemiologic studies this factor should always be controlled. Intuitively it would seem appropriate that the physical testing situation and procedures should be standardized for all subjects as far as is practically possible, regardless of whether firm evidence exists about the influence of heating, lighting, noise control, or the arrangement of furniture. Less well-researched aspects of the test situation can be explored usefully within the researchers’ data. Is there, for example, a significant difference between test scores obtained at the beginning and at the end of the week or at different times of the year?
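The kind of within-data check suggested above can be sketched with a simple permutation test comparing scores gathered early versus late in the week. The scores and group sizes below are hypothetical; any standard two-sample test would serve equally well.

```python
# Sketch of a within-data check: a two-sided permutation test for a
# difference in mean test scores between early-week and late-week sessions.
# Scores are hypothetical.
import random
import statistics

def permutation_test(a, b, n_perm=10000, seed=1):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(statistics.mean(perm_a) - statistics.mean(perm_b))
        if diff >= observed:
            count += 1
    return count / n_perm

early_week = [102, 98, 105, 110, 97, 101]  # hypothetical test scores
late_week = [95, 99, 92, 96, 100, 94]
p = permutation_test(early_week, late_week)
print(p)  # a small p would flag a day-of-week effect worth controlling
```

The same check applies directly to time-of-year or time-of-day comparisons within a study's own data.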
The effects of the tester, particularly where tests are not computer-administered, may be important, not only because of different interactions with different subjects but also because of the examiner’s variable moods, motivation, levels of fatigue, and tendency to introduce systematic errors into the testing procedure. It cannot be assumed that confining testing to one examiner or using examiners who have undergone a single period of training removes tester variation. In terms of reliability, it may be advantageous to employ more than one tester in some circumstances. Measures such as the videotaping of testing procedures, double scoring, and examination of the test data for trends related to some of these factors have all been used to account for or eliminate this potential source of variation (Harvey et al. 1988
). Similarly, it is important to include some estimation, albeit a subjective rating, of the child's level of cooperation with the testing procedure; this is potentially a major source of variation in test performance yet is rarely alluded to in published reports. Ideally, tests should also include parallel forms or practice trials to ensure that maximal performance level is recorded for each subject.
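Double scoring invites a quantitative check of scorer agreement, for which Cohen's kappa is one standard option. A minimal sketch, with hypothetical pass/fail item codes from two raters:

```python
# Sketch: quantifying agreement between two independent scorers ("double
# scoring") with Cohen's kappa, which corrects raw agreement for the
# agreement expected by chance. Ratings are hypothetical.
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters scoring the same items."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # chance agreement from each rater's marginal category frequencies
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (observed - expected) / (1 - expected)

rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail"]
print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.75
```

A kappa well below 1 on a double-scored subsample would signal tester variation of the kind discussed above.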
All except one of the studies in the field of research under discussion here present detailed accounts of quality control procedures in relation to MeHg assessment. It is relatively rare to find equally detailed discussion of procedures for outcome assessment. This is a situation that occurs frequently in neurobehavioral investigations. Lack of reference to quality control does not necessarily imply that control was limited but may suggest something about the attitude of researchers toward its importance. The implications for further research are 2-fold. First, systematic work is needed on the effects of factors considered likely to affect test performance, including both a review of the available data and, if necessary, further investigative work. Second, consensus must be reached on good practice such as that available in some other areas of toxicology, notably animal experimentation. Although this consensus may exist at an informal level in the field, the considerable methodologic variations between different neurobehavioral studies suggest that many aspects are currently opinion based rather than evidence based.
A second and fundamental issue in terms of outcome measures relates to the types of tests used and, by implication, the interpretation of the results they provide and their comparability between studies. The tests employed in the studies described above are mainly tests of intellectual functioning. However, those used in different studies, and sometimes within the same study, derive from a number of separate traditions of intellectual assessment, each developed for a different purpose and a different client group. Although each has some advantages, none was developed specifically for neurotoxicity research, and none is entirely appropriate for this type of application.
Attainment tests are attractive in the sense that they offer the opportunity to benchmark the performance of children in basic skills such as literacy and numeracy against that of their peers. However, such tests tend to reflect the use of abilities rather than the underlying abilities themselves. Given the range of social and educational factors interacting with the ability to produce attainment, this effectively introduces additional variables into the equation (Gadzella et al. 1989
). In contrast, the neuropsychological approach characterized, for example, by tests such as the Trailmaking test used in the Seychelles study or the Bender Gestalt test used in the Faroe Islands study was developed to provide detailed evaluation of patients with suspected damage to the brain. Such damage might have resulted from head injury or other insult or from a degenerative disease of the nervous system. In these circumstances the purpose of assessment is to provide detailed information about the nature of the problem in functional terms and thus provide a basis for rehabilitation and progress monitoring. Assessment in a clinical setting tends to be a flexible process that draws as much on the qualitative aspects of the interaction between psychologist and client as it does on the numerical test scores. The clinician is interested in the patient as an individual and reaches a professional judgment on the basis of a number of sources of information. There is a risk that tests of this nature lose much of their value when applied in a routine fashion to large groups of people. Many who work in the field of clinical neuropsychology appear to be deeply uneasy about the transfer of these techniques to an epidemiologic setting (British Psychological Society 2001
). Particular concern arises when tests designed for administration by a psychologist are adapted for computer presentation. Researchers with a neuropsychological background tend to adopt a clinical approach by administering a very large battery of tests to cover all aspects of functioning (Davidson et al. 1995a
; Grandjean et al. 1997
; Kjellstrom 1991
). In an epidemiologic setting this can be inappropriate, resulting in multiple comparisons and the possibility of chance findings. Moreover, it often leads to confusion from a psychological point of view, where the results appear as a collection of apparently unconnected findings with no discernible meaningful pattern. Where studies use the same tests, it is common for significant associations to appear in both studies but in relation to different outcomes (Grandjean et al. 1997
). Finally, there are questions about the ability of tests designed for more severely affected patient groups to detect relatively subtle effects in community samples (Spurgeon 1996
; Stollery 1985
).
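The multiple-comparisons problem noted above can be made concrete: even at a modest per-test significance level, the probability of at least one chance finding grows quickly with battery size. A minimal sketch, assuming (for simplicity) independent tests at alpha = 0.05:

```python
# Sketch: family-wise false-positive risk for a battery of n outcome
# measures, each tested at a per-test alpha, assuming independence.
# P(at least one chance finding) = 1 - (1 - alpha)**n.

def chance_finding_risk(n_tests, alpha=0.05):
    """Probability of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

for k in (1, 10, 20, 50):
    print(k, round(chance_finding_risk(k), 3))
# With 20 outcome measures, the chance of at least one spurious
# "significant" association already exceeds 60%.
```

Real test scores are correlated, so the true risk is somewhat lower than this bound, but the qualitative point about large batteries stands.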
Tests derived from a psychometric tradition are concerned with the assessment of intelligence quotients (IQ) in the general population and were originally developed to describe normal distributions of cognitive functioning. The Wechsler Scale (Wechsler 1991
) represents the most widely used test battery in this respect. Developmental scales for very young children fit within this tradition, replacing formal testing where this is impractical, although it should be noted that maternal reports of developmental milestones are subject to numerous sources of error such as inaccurate recall, differing definitions of certain behaviors, and presentational bias (Axelson and Rylander 1984
).
The measurement of IQ is a reassuringly familiar concept supported by a wealth of normative data and experience built up over many years. Unfortunately, IQ tests were originally developed within a theoretical framework of cognitive functioning that prevailed more than a half-century ago. Such tests reflected a contemporary need to categorize individuals on a quantitative scale to predict future performance, an approach now considered somewhat crude and simplistic. Although such tests maintain their predictive validity in some settings (Neisser et al. 1996
), they are relatively blunt instruments that combine a number of different abilities within each test (Lezak 1988
). This aspect limits the information that can be derived from the assessment and makes interpretation difficult when conflicting results emerge from different studies. When placed in the context of more recent theoretical developments in cognitive psychology, established IQ tests do not provide results that can be easily linked to current models of cognitive processes.
A primary objective of neurobehavioral research is the detection of subtle effects on cognitive functioning in community samples after neurotoxicant exposure. For epidemiologic purposes, tests should be quick and easy to administer. The results should be interpretable at group level and comparable between different studies. Given these criteria, none of the tests currently in use appear to be entirely fit for this purpose. Speed and ease of administration do not represent major challenges in an age of advanced information technology. However, improvements in interpretability and comparability are more complex issues likely to require a radical change of approach. In recent years a number of authors have pointed to the overemphasis on empiricism in this field and the lack of a strong theoretical underpinning for the assessment tools employed (Stephens and Barker 1998
; Stollery 1990
; Williamson 1990
). The development of tests grounded in well-established cognitive theory would allow results to be discussed in terms of the specific aspects of cognitive processing under investigation rather than simply by reference to broad and largely uninformative categories of effect such as “memory” or “attention.” Modern approaches to the study of memory processes, for example, have long distinguished between several elements that contribute to the final outcome (initial registration of information, encoding, transfer to long-term store, loss of information by decay or interference, and use of cues for retrieval) (Baddeley 1987
). Each may be differentially susceptible to neurotoxic insult, but effects on one specific process cannot be uncovered by most current tests that provide a simple global outcome score. Moreover, overall scores may mask specific effects where subjects employ compensatory strategies among different processes to achieve maximum performance. The development of tests, for both children and adults, based on techniques currently available to separate and measure these specific processes would provide much more useful information about the nature and size of any observed effect. This type of approach would ultimately pave the way for much greater comparability between the results of different investigations and for the development of comparable assessment techniques for children at different ages during longitudinal investigations. Despite much international effort during the last 25 years, agreement on a universally approved set of tests has proved elusive (World Health Organization 1989
). At the same time, the pursuit of comparability over time and between studies has tended to inhibit the development of new techniques. It seems unlikely that consensus on appropriate assessment tools will be achieved in advance of a consensus on the theoretical basis for those tools. Fortunately, much of the information required for these new developments is readily available in the existing cognitive, experimental, and developmental psychology literature.