|Home | About | Journals | Submit | Contact Us | Français|
The potential for ill-informed causal inference is a major concern in published longitudinal studies evaluating impaired neurological function in children prenatally exposed to background levels of methyl mercury (MeHg). These studies evaluate a large number of developmental tests. We propose an alternative analysis strategy that reduces the number of comparisons tested in these studies.
Using data from the 9-year follow-up of 643 children in the Seychelles Child Development Study (SCDS), we grouped 18 individual endpoints into one overall ordinal outcome variable as well as by developmental domains. Subsequently, ordinal logistic regression analyses were performed.
We did not find an association between prenatal MeHg exposure and developmental outcomes at 9 years of age.
Our proposed framework is more likely to result in a balanced interpretation of a posteriori associations. In addition, this new strategy should facilitate the use of complex epidemiological data in quantitative risk assessment.
Mercury is widespread in the environment due to multiple natural and anthropogenic sources (Counter and Buchanan 2004; McDowell et al. 2004). The organic species methyl mercury (MeHg) has been of particular concern because it occurs in low concentrations naturally and is especially toxic to the developing nervous system (WHO 1990). The primary mode of human exposure to MeHg is through fish consumption. Developmental neurotoxicity can occur following high levels of exposure to prenatal MeHg (WHO 1990). These findings raise the question of whether children whose mothers consumed fish contaminated with background levels during pregnancy are at an increased risk of impaired neurological function.
To address this important public health concern, a number of longitudinal studies have been carried out, including ones in New Zealand (Crump et al. 1998), the Seychelles islands (Myers et al. 2003) and the Faroe Islands (Grandjean et al. 1997). From benchmark analyses of these studies, it has been estimated that a level of 4 to 25 parts per million (ppm) measured in maternal hair may carry a risk to the infant (Budtz-Jorgensen et al. 2000; Budtz-Jorgensen et al. 2001; Crump et al. 1998; National Research Council 2000; van Wijngaarden et al. 2006). However, risk characterization of developmental deficits in relation to prenatal MeHg exposure and its interpretation is inherently difficult due to the large number of developmental test endpoints evaluated in these studies. For example, the Seychelles Child Development Study (SCDS) has examined subjects repeatedly and assessed 21 primary endpoints at 9 years of age (Myers et al. 2003) They have tested approximately 100 primary endpoints over the course of an 18-year follow-up. The Faroese study examined their subjects first at age 7 years and published the findings from 20 developmental tests (Grandjean et al. 1997). The New Zealand investigators based their derivation of benchmark doses on five tests scores from a larger developmental test battery of 26 tests (Crump et al. 1998).
Given this abundance of data, the question arises how to evaluate the basic hypothesis statistically. One strategy would be to analyze each endpoint individually in relation to exposure and test the statistical significance of each association. This is the approach taken in most clinical and public health research. However, under this paradigm one would expect 5 percent of the examined associations to be statistically significant due to random variability in the data (or “chance”). Thus, 5 out of 100 developmental tests evaluated in the SCDS in relation to prenatal MeHg exposure would be expected to be statistically significant without a guarantee that these represent true associations. Indeed, based on probability plots of p-values we previously concluded at 9 years of follow-up that the two statistically significant associations were probably chance observations, especially given the lack of statistical associations at earlier points of follow-up (Myers et al. 2003).
Some have argued that one should not rely on p-values when interpreting statistical associations, in part because systematic error such as bias and confounding may influence study findings to a much greater extent than random error (Rothman 1990; Savitz and Olshan 1995; Savitz and Olshan 1998; Thompson 1998). This may be especially true in observational studies of potential health risks. Nevertheless, reliance on p-values remains common and may lead to false-positive (or false-negative) associations. Therefore, there is a need to address the analysis of multiple endpoints in relation to one primary exposure (the “multiple comparisons” issue) in order to avoid ill-informed causal inference. An often-used approach to handle such multiple comparisons is to adjust the cut-off for rejecting the null hypothesis according to the number of statistical tests performed, such as the Bonferroni adjustment (Glantz 2002). For example, in one study (or a set of longitudinal studies based on the same study population) with 100 endpoints one would divide the alpha level of 0.05 (for statistical significance at the 5 percent level) by 100, thus now considering only those hypothesis tests yielding a p-value of 0.0005 or less to be statistically significant. Albeit simple, this correction is overly conservative when a large number of comparisons are made (Glantz 2002), and suffers from numerous additional limitations which have been described in detail elsewhere (Perneger 1998).
A key issue in discussing the merit of multiple testing adjustments is the distinction between a priori hypotheses and a posteriori observations (Thompson 1998; Veazie 2006). Recently, more refined ways have been developed to address multiple hypotheses in a single study, including Bayesian approaches (Berry and Hochberg 1999; Gelman and Tuerlinckx 2000). A Bayesian approach to models of multiple related hypotheses may include using hierarchical models to model multiple parameters as exchangeable. This type of approach allows information about individual parameters to be influenced by the magnitude of related parameters, e.g. a borrowing of strength of information across related parameters. Bayesian or empirical Bayes methods can be appropriate for either a priori or a posteriori investigation (Efron et al. 2001; Veazie 2006). On the other hand, alternative methods for a priori investigation are not necessarily statistical in nature. Rather, they may rely on investigators specifically stating the overarching research hypothesis and determining whether this hypothesis is a composite hypothesis (i.e., a composition of single hypotheses). For instance, one could hypothesize that prenatal exposure to MeHg (or, more accurately, maternal hair levels of MeHg) is associated with a decline in “cognitive functioning” in a cohort of children, where cognitive function can be measured in a variety of different ways. Arguably, this is different from hypothesizing that prenatal MeHg exposure is specifically associated with a lower score on the Boston Naming Test, which may or may not be representative of the broader construct of cognitive functioning intended to be measured by the investigators. One might surmise that the broader construct is of true public health interest. If this is the case, it would not appear appropriate to arrive at a conclusion following from a single hypothesis reported in isolation from other hypotheses addressed in the same study. To enhance our ability to properly interpret complex data, an a priori analysis approach should be developed that reflects the composite nature of the hypothesis of interest.
The evaluation of a large number of endpoints in the MeHg literature also complicates the derivation of safe exposures. The choice of an endpoint on which to base safe exposures is often arbitrary given the variety of study populations and the methodologies employed in epidemiological studies (van Wijngaarden et al. 2006). In the absence of conclusive evidence to suggest that subtle differences in cognitive and motor performance reported in relation to prenatal MeHg exposure impair an individual child’s overall ability to function, it is uncertain which endpoint is biologically most relevant (van Wijngaarden et al. 2006). Historically, some agencies have used one single endpoint from one single study to derive a safe MeHg exposure (National Research Council 2000), whereas other researchers have aggregated the original data from three prospective studies to evaluate the impact on children’s intelligence quotient (IQ) (Axelrad et al. 2007; Cohen et al. 2005). We recently suggested presenting an average benchmark dose and its corresponding range based on all available evidence to provide an indication of the exposure limits within which the true benchmark dose is likely to fall (van Wijngaarden et al. 2006).
One alternative to evaluating a large number of endpoints separately is to fit a single model that includes information on all outcomes simultaneously. Two different approaches have been taken in the statistical literature to do this, and both have been applied to MeHg exposure. Both approaches allow the outcomes to cluster into different functions or “domains”, such as “cognitive functioning”, “motor functioning”, etc, and both allow estimation of the function-specific effects. One approach uses structural equations models, which introduce a latent variable for each function or domain (Budtz-Jorgensen et al, 2002). The second approach models the MeHg effect directly, by including an overall MeHg effect, and domain-specific and outcome-specific deviations (Thurston et al., in press).
In this study, we propose another alternative method to reduce the number of statistical comparisons made in the evaluation of the risk of developmental deficits in relation to prenatal MeHg exposure. Whereas most previous studies have examined results of individual developmental tests in isolation, our definition of ‘abnormal’ versus ‘normal’ performance integrates information from all administered developmental tests into one ordinal endpoint. This approach drastically reduces the number of endpoints to be evaluated in statistical analyses. It is based on the premise that no one endpoint is biologically more relevant than any other, and that a deficit in any of the tests may be important in later life. Additionally, the nature of brain damage due to high doses of prenatal MeHg suggests that cellular functions associated with very basic brain development are impacted (Clarkson 2002). Therefore, one would expect a similarity of prenatal MeHg exposure effects on endpoints in different domains (Thurston et al. in press). Indeed, the multiple outcomes models used by Thurston and colleagues as applied to these data suggested a similarity of exposure effects on all outcomes, when the outcomes were put on a comparable scale. Most previous reports relied on generalized linear models to evaluate the association between prenatal MeHg exposure and child development, the results of which may be difficult to interpret given the subtle effects reported. Other measures of effect, such as the relative risk, may allow for a more intuitive interpretation of findings. We computed the relative risk for an adverse event and the corresponding 95 percent confidence interval across the range of prenatal MeHg exposure observed in the SCDS.
The SCDS Main cohort has been extensively described previously and is briefly summarized here (Davidson et al. 1998; Myers et al. 2003; Shamlaye et al. 2004). In 1989–1990, we initiated a longitudinal study of a cohort of 779 mother-child pairs in the Seychelles Islands (approximately half of the number of live births during that period). The children were 6 months ± 2 weeks old at enrollment. The primary purpose of this study was to examine the association between prenatal MeHg exposure from maternal fish consumption during pregnancy and the children’s developmental outcomes. Most Seychellois consume ocean fish with 10 or more meals each week and the fish in Seychelles contains background levels of MeHg similar to fish sold commercially in the United States. Prenatal exposure to MeHg was determined by measuring total mercury in maternal hair during pregnancy in the sample that best recapitulated prenatal exposure. We used cold vapor atomic absorption spectroscopy with quality control procedures and assumed a hair growth rate of 1.1 cm per month with a delay of 20 days between current blood concentrations and appearance of Hg in the first centimeter of scalp hair (Cernichiari et al. 1995). Postnatal exposure was measured in the 1 cm of hair closest to the scalp taken at the time of the evaluation. The study is double blind and individual MeHg exposure levels are not shared with clinical investigators or anyone in Seychelles. A variety of covariates known to be related to child development and believed to be potential confounders of the association between prenatal MeHg exposure and development were measured. These included maternal intelligence, evaluation of the home environment using the HOME, and the family’s socioeconomic status. A more detailed description of the MeHg exposure assessment and collection of covariate information is available elsewhere (Crump et al. 2000; Myers et al. 2003).
A small number of children enrolled in the study have been excluded over time for illnesses or injuries known to be associated with developmental deficits, including children with closed head trauma and meningitis. Seven hundred and seventeen children (92 percent of the initial cohort; 62 exclusions) were still eligible to participate at 9 years of age (Myers et al. 2003). Another 74 children were not tested at 9 years because they were residing abroad, refused to participate, or we were unable to locate them. Enrolled children were similar to those not participating (Myers et al. 2003). Thus, the current analysis is based on a cohort of 643 children (83 percent of the original cohort) followed for 9 years. For some analyses, we excluded up to 144 additional children due to missing outcome and covariate data (see Statistical Analysis section, and footnote to Table 3).
The children underwent extensive evaluations when they were 6, 19, 29, 66, and 107 months of age (Shamlaye et al. 2004). Our current analysis focuses on the examinations performed at 9 years (107 months) of age (Myers et al. 2003). Developmental evaluations at that age included tests for three general domains (tests for cognition and achievement; tests for motor, perceptual motor and memory; and tests for behavior) (Table 1). A team of three specially trained Seychellois child health and development professionals conducted the examinations. Reliability between the testers and the senior psychologist on the team was conducted regularly and was high.
For this analysis we initially considered the original 21 primary endpoints reported on previously (Myers et al. 2003) and seven additional secondary endpoints determined at that age: the Wechsler Intelligence Scale for Children III (WISC-III) verbal and performance IQ; WISC-III similarities and digit span subtests; the California Verbal Learning Test’s (CVLT) immediate recall and recognition hits; and the Boston Naming Test’s (BNT) total correct with and without cues (van Wijngaarden et al. 2006). To be consistent with our recent benchmark analysis (van Wijngaarden et al. 2006), two Continuous Performance Task (CPT) endpoints previously reported (Myers et al. 2003) were not included; risk-taking (direction of adverse effect is unclear) and hit reaction time (statistical estimation algorithm did not converge in previous analyses). We then excluded five endpoints that had missing data for five percent of more (i.e., n ≥ 32) of all children in the database, and also excluded three IQ measures which overlap with the constructs that are measured by the verbal and performance IQ measures.
For our current analyses, we defined cases as those with an abnormal score on at least one of the 18 remaining developmental endpoints considered. We defined abnormal scores using the following algorithm. We determined the 1st or 99th percentile (depending on the direction of the adverse effect) of the distribution in the cohort to identify children with an abnormal score for each subtest based on these cut-offs (Table 1). To evaluate the sensitivity of our findings to the cut-point chosen to identify children with an abnormal score, we also determined the 5th and 95th percentile (Table 1). Finally, children were classified into three case groups based on the number of tests on which they had an abnormal score: no abnormal test score on any of the 18 endpoints, one endpoint with an abnormal score, and two or more endpoints with an abnormal score. Thus, our case definition was ordinal in nature and was preserved in our statistical analyses. Case status was defined in this manner considering all endpoints combined, as well as grouped in three developmental domains consistent with our previous analyses of these data estimating benchmark dose levels (van Wijngaarden et al. 2006): cognition; motor, perceptual motor, and memory; and behavior (Table 1).
The odds ratio (OR) and 95 percent confidence interval (CI) for the association between prenatal MeHg level and abnormal development (as defined above, using the case definition defined by the 1st and 99th percentiles as the primary outcome) was estimated using ordinal logistic regression (proportional odds models), which is more efficient than binary logistic regression and produces risk estimates which can be interpreted across multiple dichotomizations of the outcome (Ananth and Kleinbaum 1997; Scott et al. 1997). The OR estimated with this model can be interpreted as the odds of more severe responses to less severe responses (e.g., 2+ abnormal test scores vs. 1 and 0 (combined) abnormal tests scores, or 1 and 2+ abnormal test scores (combined) vs. 0 abnormal values) among those with greater MeHg exposure as compared to the odds among those with lower levels of MeHg exposure. That is, the proportional odds model simultaneously compares children with 2+ abnormal outcomes to children with 0 or 1, and compares children with 1+ abnormal outcomes to those with no abnormal outcomes. The intercept for each of these 3 categories is allowed to differ, but the model assumes a common slope for MeHg for each of the two comparisons. All analyses were performed based on case status for all endpoints combined, as well as for cognition and motor function separately. We did not perform separate analyses for behavior because this domain included only one developmental test (Child Behavior Checklist) and, by definition, the number of children with an abnormal score on this test was small (n=7, 1.1 percent) for the primary outcome based on the 99th percentile. We repeated the analyses on all endpoints combined and cognitive and motor function separately based on the 5th and 95th percentile to define case status in order to evaluate the sensitivity of our findings to the cut-point chosen.
The relationship with prenatal MeHg exposure was examined using prenatal MeHg exposure as a continuous variable (as has been our practice in all previous analyses of the SCDS data) and also in a separate model using categories of MeHg exposure based on quartiles of the exposure distribution in the cohort of 643 children. All ORs for both models were adjusted for the effects of gender, family status (categorical), child’s age at testing (continuous), HOME (categorical), caregiver’s IQ (continuous), Hollingshead SES (categorical), and postnatal MeHg exposure (continuous). Other covariates were not included because they did not contribute meaningfully to the prediction of the outcome or missing data reduced the sample size significantly.
For the complete case analysis, we excluded children with missing data on any of the covariates included in the model (n=107) from the analysis. We compared unadjusted and adjusted OR estimates to assess the extent of confounding. However, excluding children with missing covariate data may impact the findings if the children who were excluded have attributes different from those upon which the model was based, or if the association of interest is different for the two different groups of children. To examine the potential bias introduced by the complete case approach, we compared the results from an unadjusted ordinal logistic regression before and after excluding children with missing covariate data. We assumed that ordinal logistic regression was an appropriate choice for modeling our data since the Score Test for the Proportional Odds assumption was not statistically significant (p>0.05) for any of our regression models (Stokes et al. 2000). All analyses were conducted with SAS (SAS Institute, Cary, NC).
The analysis of prenatal MeHg exposure as a continuous variable described above assumes a multiplicative relationship between exposure and the risk for developing the adverse outcome of interest (e.g., OR = 1.1 for one unit increase in exposure, and OR = 1.1*1.1 = 1.21 for two units increase in exposure). However, alternative forms of characterizing the exposure-response relationship may also be informative and may allow for a more straightforward estimation of safe levels of exposure (see Discussion). Therefore, we additionally derived a linear slope estimate by fitting a trend line for the underlying linear relative risk (RR) model (RR = 1 + β*x) to the ORs obtained from the models fit previously which treated exposure as a categorical variable (Rothman 1986; van Wijngaarden and Hertz-Picciotto 2004). The trend line is estimated using the median exposure concentrations and relative risk estimate in each of the exposure categories, and weights computed as the inverse variance of the relative risks (Rothman 1986). In this linear model, β represents the linear slope estimate using categorized prenatal MeHg exposure and x is the exposure level in units of ppm in maternal hair. Thus, β can be interpreted as the increase in relative risk of more severe responses to less severe responses per ppm of prenatal MeHg exposure. The lowest exposure group is used as the reference category and is not included in the computations since no variance can be estimated. In summary, using this linear model a straight line (trend line) is fit through the category-specific relative risks while forced through the origin (baseline exposure and RR of 1.0). Estimates of the linear slope were obtained with Microsoft Excel (Microsoft Corp, Redmond, WA).
Fifty-three (8.2 percent) children could not be classified according to overall case status due to missing data on one or more endpoints. Nine (1.4 percent) and 40 (6.2 percent) children could not be classified according to case status for cognition and motor function, respectively. Of the children with complete information on all 18 endpoints, 65 children (11.0 percent) scored abnormally on one or more endpoints based on cut-points using the 1st or 99th percentile, 20 (3.4 percent) of whom scored abnormally on two or more endpoints. Regarding cognition endpoints, 39 (6.2 percent) of the children with complete data had one or more abnormal test scores, and 11 (1.7 percent) scored abnormally on two or more endpoints. On motor function tests, 47 (7.8 percent) and 10 (1.7 percent) scored abnormally once or at least twice, respectively. Finally, 7 children (1.1 percent) scored abnormally on the one test representing behavior. The number of cases increased substantially when the 5th or 95th percentile was chosen to identify abnormal scores (Table 1).
Table 2 presents the association between developmental status and selected covariates. Boys were slightly more likely to do worse than girls on all tests, as were children with no biological parents in the home. Children with a higher score on HOME, higher socioeconomic status, and higher caregiver’s IQ were more likely to perform better on the developmental tests. Older children performed better overall and on motor function tests, but not on tests of cognition. Postnatal MeHg exposure measured at 9 years of age did not appear to impact performance. Findings were similar for case status based on 1st/99th and 5th/95th percentiles with the exception of family status which did not remain associated with cognition case status when based on 5th or 95th percentiles.
The results for the primary outcome from the ordinal logistic regression analyses are shown in Table 3. Prenatal MeHg exposure was not associated with developmental outcomes before or after adjustment for covariates in complete case analyses. Based on the 1st and 99th percentiles, results are compatible with both an approximately eight percent decrease and increase (based on the lower and upper 95 percent confidence limits of the risk estimates, respectively) in worse developmental performance per 1 ppm MeHg in maternal hair. After adding participants who were excluded from complete case analyses due to missing data on covariates, risk estimates for the continuous exposure measure were quite similar although point estimates for the categorical analyses were somewhat lower. Nevertheless, we considered them qualitatively comparable due to overlapping confidence intervals. The interpretation of the findings did not change substantially across case definitions (based on 1st/99th or 5th/95th percentiles), with the possible exception of slight inverse association in the linear relative risk model (Table 3).
In our present analysis of the SCDS at 9 years of follow-up, we chose to combine the test results a priori for 18 individual endpoints into one overall ordinal outcome variable, and two ordinal outcome variables representing functional domains. In this analysis there was no significant association between prenatal MeHg exposure and developmental functioning at 9 years of age. This is consistent with our previous reports using separate models for each endpoint (Myers et al. 2003). However, we argue that this new strategy is more likely to result in a balanced interpretation of a posteriori associations, because those reviewing and utilizing the findings are not distracted by statistically significant results for individual endpoints of unknown statistical and clinical value.
Our findings were expressed in odds ratios, which are perhaps easier to interpret by policy makers, clinicians and public health investigators than estimates obtained from linear regression or more broadly generalized linear models. Despite the lack of any consistent associations, our data still provide information that can be used to guide regulatory decision-making. Odds ratios reported for a one unit increase in exposure can be used for risk assessment, i.e., to estimate safe levels of exposure using methods similar to those reported previously for quantitative cancer risk assessment (van Wijngaarden and Hertz-Picciotto 2004). The maternal hair MeHg level in ppm required to produce a 1 percent additional risk of morbidity over the background risk (here called “hazardous dose”, or HD) can be calculated by first computing the relative risk corresponding to this additional risk using the following equation:
where R(0) is the background risk of morbidity. Subsequently, this RRHD1 is substituted into the following equation:
where β is the estimated slope from the linear relative risk model. Given our ordinal outcome data, we can compute two types of HD1 estimates. For example, the background risk R(0) of an abnormal score on 1 or more tests in the reference group (i.e., ≤3.34 ppm based on quartiles of the exposure distribution in this cohort) for “total cases” is 20/144 = 13.9 percent (see Table 3). That is, of the 144 children exposed to less than 3.34 ppm prenatal MeHg, 14 had one endpoint (N1) and 6 had two or more endpoints with an abnormal score (N2). The HD to increase this risk by 1 percent using the upper 95 percent confidence limit of the β (0.025) is 2.9 ppm MeHg in maternal hair above and beyond the level to which the individuals in the reference group are exposed (median = 2.2 ppm). Alternatively, the background risk R(0) of an abnormal score on 2 or more tests in the reference group for “total cases” is 6/144 = 4.2 percent (see Table 3). The HD to increase this risk by 1 percent using the same parameter estimate as above is 9.5 ppm MeHg in maternal hair. We used the 95 percent upper confidence limit for our slope estimate; the point estimate indicated a slightly beneficial effect. It is apparent from these calculations that the estimates of a hazardous dose are directly dependent on the background risk and the additional risk that is deemed acceptable. That is, the HD increases with decreasing background risk and increasing acceptable additional risk. The simple formulas described above should facilitate the use of complex epidemiological data in quantitative risk assessment.
The approach described here also has some limitations, most importantly the loss of statistical power by collapsing continuous outcome data into an ordinal variable. The relatively wide confidence intervals, in particular for the category specific odds ratios, demonstrate the lack of statistical precision. Nevertheless, analyzing ordinal variables in logistic regression is more efficient than analyzing binary variables generated from continuous data (Ananth and Kleinbaum 1997; Scott et al. 1997). It should be recognized, however, that the models used are not ‘biologically-based’ and have no relationship to specific biological events in child development. Similarly, we assumed that all endpoints are equally important given the scientifically uncertainty about which endpoints are biologically most relevant with regard to the MeHg toxicity. Furthermore, we had to exclude children for whom data was missing on one or more developmental tests across the board, whereas in our original analyses children would be excluded in the analysis of some developmental endpoints but not others thereby maximizing our use of the recruited study population. Finally, our approach may be most useful when a large number of tests have been performed within functional domains. When only few tests within a domain are considered and the issue of multiple comparisons is not as important in the interpretation of findings, generalized linear models are a more appropriate choice.
In conclusion, we present an alternative characterization of the impact of prenatal mercury exposure on child development that could complement existing risk assessment strategies. Using this method of analysis we did not find an adverse association of prenatal MeHg exposure with child development in a population with a mean prenatal exposure of 6 ppm. We hope that our proposed methods are useful when the interpretation of study findings is likely to be complicated by multiple hypothesis tests, and in quantitative risk assessment of non-cancer health outcomes using epidemiological data.
This research was supported by Grants 2R01-ES008442-05; R01-ES10219; R01-ES08442 and ES-01247 from the US National Institutes of Health; 1 UL1 RR024160-02 from the National Center for Research Resources; the Food and Drug Administration; US Department of Health and Human Services, and by the Ministry of Health, Republic of Seychelles.