
Stat Biopharm Res. Author manuscript; available in PMC 2011 May 10.

Published in final edited form as:

Stat Biopharm Res. 2011 February 1; 3(1): 97–105.

doi: 10.1198/sbr.2010.09044

PMCID: PMC3091821

NIHMSID: NIHMS286362

Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park, NC 27709, USA

**Abstract**

When evaluating carcinogenicity, tumor rates from the current study are informally assessed within the context of relevant historical control tumor rates. Current rates outside the range of historical rates raise concerns. We propose a statistical procedure that formally compares tumor rates in current and historical control groups. We use a normal approximation for the null distribution of the proposed test when there are at least 5 historical control groups and the average tumor rate is above 0.5%; otherwise, we apply standard bootstrap techniques. For comparison purposes, we show that formally basing decisions on the range of historical control rates would yield unusually high false positive rates. That is, a range-based decision rule would not maintain the nominal 5% significance level and could produce Type I error rates as high as 67%. In other cases, the power could go to zero. The proposed test, however, controls Type I errors while adjusting for survival and extra variability among the historical studies. We illustrate the methods with data from a study of benzophenone. Compared to a range-based decision rule, the proposed test has several important advantages, including operating at the specified level and being applicable with as few as one historical study.

**1. INTRODUCTION**

To safeguard public health, government and industry researchers routinely conduct rodent cancer bioassays to evaluate the carcinogenicity of chemicals to which humans are exposed. These bioassays involve several treated groups, where animals are exposed to various doses of the test chemical, and a control group, where animals are not exposed but may receive the “vehicle” used to administer the chemical to the treated groups. The information obtained from these control groups provides a rich collection of historical data. For example, since its formation in 1977 as mandated by the U.S. Congress, the National Toxicology Program (NTP) has evaluated over 500 chemicals via 2-year rodent cancer bioassays, thus generating a vast historical control database.

Researchers consider such databases when assessing a chemical’s carcinogenicity in the current experiment. Not only do they evaluate whether tumor incidence increases in treated animals relative to controls in the current study, but they also compare tumor rates in the current dose groups with rates in the historical control database. Although the dose response in the current study is typically assessed with a formal statistical procedure (see, *e.g.*, Bailer and Portier [1]), comparisons of the current study with the historical database are usually informal. Specifically, if tumor rates in the treated groups fall within the range of rates in a relevant subset of the historical control data, the effect of the chemical may be discounted. In contrast, rates in the treated groups that fall outside the historical range may be considered evidence of a real carcinogenic effect. Moreover, if the current control rate falls outside the historical range, there may be concern about whether the current control group (and study) are consistent with, and comparable to, the previous control groups (and studies).

The proper use of historical control data has been the subject of much discussion among toxicologists and pathologists, including a town hall meeting in June 2008 conducted by the Society of Toxicologic Pathology, an invited “best practices” paper [2], and a paper on developing a “global database” so that control data from various laboratories around the world can be brought together for shared use [3]. These important developments require statisticians to evaluate the common practice of using the historical range to informally assess results from the current study and to derive a formal methodology that is more appropriate and yet simple enough to be adopted for practical use.

For purposes of motivation and illustration, consider the NTP 2-year study of benzophenone (see http://ntp.niehs.nih.gov/go/16328). Among female rats, the rates of mononuclear cell leukemia (MCL) in control, low-, mid-, and high-dose groups were 19/50 (38%), 25/50 (50%), 30/50 (60%), and 29/50 (58%), respectively. Despite the suggestion of a positive trend in MCL rates with dose, the NTP’s trend test was not statistically significant (*p* = 0.058) at the usual 0.05 level. The NTP noted that MCL rates among controls in 6 recent studies ranged from 12% to 35%, which strengthens the evidence of a trend. As the MCL rate in the current control group (38%) was not within this historical range, however, one might question whether the current study was comparable to these 6 previous studies.

Elmore and Peddada [4] discussed drawbacks of using a historical range of tumor rates to evaluate current experimental data. Their main point was that outliers in the historical data can inflate the range, thus yielding a procedure with little power to detect group differences. Ironically, in the absence of outliers, if one were to use the range of historical control tumor rates to test the null hypothesis of equal tumor rates among one current and *k* historical control groups, the false positive rate could be as large as 2/(*k* + 1), which varies from 0 (for *k* = ∞) to 0.67 (for *k* = 2). For instance, when *k* = 6, as in the benzophenone example, the Type I error rate could be over 28%. Intrinsically, the range is not designed to control Type I or Type II errors. Thus, although the historical range is a widely used supplemental tool among toxicologists and pathologists, it yields an arbitrary decision rule.

Peddada et al [5] developed a method based on order-restricted inference to evaluate the dose-response in the current study while formally incorporating historical control tumor data. Toxicologists, however, expressed an interest in also comparing the tumor rate in the current control group with that among the historical control groups. We address this concern by developing a simple test for comparing tumor rates in the current and historical control groups. The proposed approach employs the poly-3 survival adjustment [1], a well accepted technique used in long-term carcinogenicity testing to adjust for differences in mortality. Our procedure also accounts for variability within and between studies. Extensive simulations show that our test operates at approximately the nominal level, whereas a pair of decision rules based on the historical range do not. We illustrate these methods with the MCL data from the NTP benzophenone study.

**2. STATISTICAL METHODS**

Let *n _{i}* be the number of animals in the *i*th control group, where *i* = 1, …, *k* index the historical control groups and *i* = *k* + 1 denotes the current control group, and let *y _{i+}* be the number of animals in group *i* that developed the tumor of interest.

Our goal is to test the null hypothesis *H*_{0}: *π ^{c}* = *π ^{h}* against the two-sided alternative *H _{a}*: *π ^{c}* ≠ *π ^{h}*, where *π ^{c}* and *π ^{h}* denote the mean tumor rates in the current and historical control groups, respectively.

We propose testing *H*_{0} against *H _{a}* with the following Wald-type statistic:

$$Q=\frac{{\widehat{\pi}}^{h}-{\widehat{\pi}}^{c}}{\sqrt{\widehat{\pi (1-\pi )}({\widehat{\sigma}}^{2}/{w}^{h}+1/{w}^{c})}}.$$

Under the null hypothesis, the Bieler-Williams [6] estimator for within-studies variation is

$$\widehat{\pi (1-\pi )}=\sum _{i=1}^{k+1}\sum _{j=1}^{{n}_{i}}{({r}_{ij}-{\overline{r}}_{i})}^{2}/\sum _{i=1}^{k+1}({n}_{i}-1),$$

where *r _{ij}* is the survival-adjusted tumor response for animal *j* in group *i*, computed from the poly-3 weights as in Bieler and Williams [6], and *r̄ _{i}* is the average of these values in group *i*. The between-studies variation is estimated by

$${\widehat{\sigma}}^{2}=\frac{{\scriptstyle \frac{1}{k-1}}{\sum}_{i=1}^{k}{({y}_{i+}-{n}_{i}{\widehat{\pi}}^{h})}^{2}/{n}_{i}}{{\widehat{\pi}}^{h}(1-{\widehat{\pi}}^{h})}.$$

We approximate the null distribution of *Q* by the standard normal distribution, so *H*_{0} can usually be tested by comparing the observed value of *Q* to the percentage points of the standard normal distribution. However, the normal approximation does not work well in certain extreme situations. Thus, for *k* < 5 or *π̂ ^{h}* ≤ 0.005, we derive the null distribution of *Q* using the bootstrap methodology of Peddada, Prescott, and Conaway [7].
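As a rough illustration, the computation of *Q* can be sketched in code. This is a hedged sketch, not the authors' exact implementation: the terminal-sacrifice time `T_TS` is an assumed value, a pooled binomial form stands in for the Bieler-Williams residual estimator of π(1 − π), and the between-study factor follows the variance formula as reconstructed above with poly-3 adjusted group sizes.

```python
import numpy as np
from math import erfc, sqrt

T_TS = 24.0  # assumed terminal-sacrifice time, in months

def poly3_weights(tumor, death_time, t_ts=T_TS):
    # Poly-3 weight per animal: 1 if it developed the tumor, else
    # (t / t_ts)^3 for a tumor-free death at time t (Bailer-Portier [1]).
    tumor = np.asarray(tumor, dtype=bool)
    frac = (np.asarray(death_time, dtype=float) / t_ts) ** 3
    return np.where(tumor, 1.0, np.minimum(frac, 1.0))

def q_test(historical, current):
    """historical: list of (tumor, death_time) array pairs, one per study;
    current: (tumor, death_time) for the current control group.
    Returns Q and a two-sided normal p-value."""
    k = len(historical)
    y = np.array([np.sum(t) for t, _ in historical], dtype=float)
    w = np.array([poly3_weights(t, d).sum() for t, d in historical])
    wh, wc = w.sum(), poly3_weights(*current).sum()
    pi_h = y.sum() / wh              # pooled historical adjusted rate
    pi_c = np.sum(current[0]) / wc   # current control adjusted rate
    # Pooled rate under H0; pi0*(1 - pi0) is a binomial stand-in for
    # the Bieler-Williams residual estimator of pi(1 - pi).
    pi0 = (y.sum() + np.sum(current[0])) / (wh + wc)
    # Between-study inflation factor (reconstructed form); 1 if k = 1.
    if k > 1 and 0.0 < pi_h < 1.0:
        sig2 = (np.sum((y - w * pi_h) ** 2 / w) / (k - 1)) / (pi_h * (1 - pi_h))
    else:
        sig2 = 1.0
    q = (pi_h - pi_c) / sqrt(pi0 * (1 - pi0) * (sig2 / wh + 1.0 / wc))
    return q, erfc(abs(q) / sqrt(2))
```

With identical adjusted rates in every group, *Q* is exactly zero; a current rate above the historical rate drives *Q* negative, matching the numerator *π̂ ^{h}* − *π̂ ^{c}*.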

Extensive simulations, reported in the next section, demonstrate that the proposed test operates at approximately the nominal level across a wide range of realistic situations.

Tumor rates from the current study are often evaluated, at least informally, in the context of the range of historical rates; for a discussion, see Keenan et al [2]. To assess this approach, we define a decision rule *R*, based on the range of unadjusted rates among the historical control groups, which rejects *H*_{0} if the unadjusted tumor rate in the current control group falls strictly below the minimum, or strictly above the maximum, of the *k* historical control rates. We define *R*^{*} analogously, based on poly-3 survival-adjusted tumor rates.

**3. SIMULATION RESULTS**

Data were simulated from a variety of situations typically encountered in the 2-year NTP bioassays. We generated two latent variables for each animal: *T*_{1}, the time to tumor onset, and *T*_{2}, the time to natural death. A simulated animal developed a tumor before death if *T*_{1} < min(*T*_{2}, *t _{TS}*), where *t _{TS}* denotes the time of terminal sacrifice.

We generated data for *k* historical control groups and one current control group. For each group, latent times to tumor onset and natural death, *T*_{1} and *T*_{2}, were generated from a pair of independent Weibull distributions with survival functions of the form *P*(*T* > *t*) = exp(−*ψt ^{γ}*). The tests are not affected by tumor lethality, so there was no need to consider dependent times to tumor onset and death.

Our simulation study investigated 288 configurations by taking all combinations of five factors: number of historical control groups (4 levels), shape of the incidence curve (3 levels), mean historical control tumor rate (4 levels), heterogeneity of the control groups (2 levels), and difference between the current and historical rates (3 levels). As control death rates in NTP studies are well estimated and our focus is on tumor rates, we used a single baseline mortality distribution in all simulations. The mortality shape parameter and baseline scale parameter were fixed at *γ*_{2} = 5 and *ψ*_{2} = 4.48 × 10^{−8}, which produce an average survival rate of 70% at 2 years, a common value in NTP long-term studies.
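The Weibull specification above is easy to check by inverse-transform sampling. Assuming time is measured in months (so that terminal sacrifice occurs at *t* = 24), the stated mortality parameters reproduce the quoted 70% survival at 2 years; the function name below is illustrative, not from the paper.

```python
import numpy as np

def weibull_latent_times(psi, gamma, size, rng):
    # P(T > t) = exp(-psi * t^gamma), so inverting U = exp(-psi * T^gamma)
    # for U ~ Uniform(0, 1) gives T = (-ln U / psi)^(1/gamma).
    u = rng.uniform(size=size)
    return (-np.log(u) / psi) ** (1.0 / gamma)

rng = np.random.default_rng(0)
death_times = weibull_latent_times(psi=4.48e-8, gamma=5, size=200_000, rng=rng)
# Analytic survival at 24 months: exp(-4.48e-8 * 24^5) ≈ 0.70
survival_2yr = np.mean(death_times > 24.0)
```

The empirical survival fraction agrees with the closed-form value exp(−*ψ*_{2} · 24^{*γ*_2}) to within Monte Carlo error.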

In contrast, we varied several factors influencing tumor rates. We considered three values for the shape of the incidence curve (*γ*_{1} = 1.5, 3, and 6), ranging from early-onset to late-onset tumors, and four values for the mean tumor rate among the historical control groups (*π ^{h}* = 0.01, 0.05, 0.15, and 0.30), ranging from rare to common tumors. For given values of *π ^{h}* and *γ*_{1}, the corresponding values of the incidence scale parameter *ψ*_{1} and the extra-variation parameter *τ* are listed in Table 1.

**Table 1.** Simulation values for the baseline incidence scale parameter (ψ_{1}) and the extra variation parameter (τ) by tumor rate (π^{h}), shape of the incidence curve (γ_{1}), and number of historical control groups (k).

For each of the 96 null configurations, where *π ^{c}* = *π ^{h}*, we simulated data and estimated the Type I error rates of the range-based rules and the proposed test at the nominal 5% level.

The decisions based on the historical range of tumor rates performed poorly. This was true whether using unadjusted rates or poly-3 survival-adjusted rates. For *k* = 2, the simulated Type I error rates varied from 27.9% to 58.7% for *R* and 36.9% to 67.0% for *R*^{*} (Table 2). Even for *k* = 10, the Type I error rates were as high as 14.5% for *R* and 18.5% for *R*^{*}. These error rates are unacceptably high. In contrast, the Type I error rates can become vanishingly small for very large *k*, yielding extremely conservative procedures (results not shown).

**Table 2.** Type I error rates for the unadjusted (R) and survival-adjusted (R*) range-based tests by tumor rate (π), shape of the incidence curve (γ_{1}), number of historical control groups (k), and absence/presence of extra variation. All tests were performed at the nominal 5% level.

If *H*_{0} is true but the observed tumor rates are distinct (i.e., no ties), the Type I error rate for a decision rule that rejects *H*_{0} if the current control tumor rate falls outside the historical range is 2/(*k* + 1), which can differ greatly from the usual 5% significance level, depending on the value of *k*. This formula is derived from the fact that under the null hypothesis of no differences among the *k* + 1 control groups, any group is equally likely to have the smallest (or largest) tumor rate. If multiple tumor rates coincide with the minimum or maximum, the use of strict inequalities in the definitions of *R* and *R*^{*} can produce Type I error rates below 2/(*k* + 1), unless ties are broken randomly (which we did not do).
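The 2/(*k* + 1) result is easy to verify by simulation. With continuous (hence tie-free) rates, the rejection probability of a range-based rule sits at exactly 2/(*k* + 1); with binomial counts, ties pull it below that bound. The helper below is an illustrative sketch using *k* = 6, as in the benzophenone example.

```python
import numpy as np

def range_rule_rejection(k, sampler, reps=100_000, seed=1):
    # Under H0, draw k historical rates and one current rate from the same
    # distribution; reject when the current rate falls strictly outside the
    # historical range (the rule R discussed in the text).
    rng = np.random.default_rng(seed)
    hist = sampler(rng, (reps, k))
    curr = sampler(rng, (reps, 1))
    reject = (curr < hist.min(axis=1, keepdims=True)) | \
             (curr > hist.max(axis=1, keepdims=True))
    return reject.mean()

# Tie-free rates: any continuous distribution gives 2/(k + 1).
continuous = lambda rng, size: rng.uniform(size=size)
# Discrete rates from 50 animals per group: ties reduce rejections.
binom50 = lambda rng, size: rng.binomial(50, 0.30, size=size) / 50.0

rate_cont = range_rule_rejection(k=6, sampler=continuous)  # ≈ 2/7 ≈ 0.286
rate_disc = range_rule_rejection(k=6, sampler=binom50)
```

The discrete version lands strictly below 2/7 because tied extremes never trigger the strict inequalities, mirroring the behavior described for *R* and *R*^{*}.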

The propensity for ties varies with two factors. Tumor rates are ratios of tumor counts and sample sizes, where the latter can be adjusted for survival effects or not. Differences among tumor rates can arise from differences among numerators, denominators, or both. The probability of observing tied tumor counts is lowest for a tumor rate of 0.5 and increases as the tumor rate approaches 0 or 1. Also, exact matches are more likely among unadjusted rates, where denominators are always integers, than among survival-adjusted rates, where denominators are typically not integers. Thus, as predicted, our simulation study obtained Type I error rates nearly identical to 2/(*k* +1) when the range-based decision rule was based on poly-3 survival-adjusted tumor rates (*R*^{*}) and the true tumor rate (*π*) was nearest one-half. See the bottom portion of Table 2, where the simulated Type I error rates are close to 66.7% for *k* = 2, 33.3% for *k* = 5, and 18.2% for *k* = 10. The false positive rate decreased from the predicted value of 2/(*k* + 1) when the range-based decision rule used unadjusted tumor rates (*R*) or when the granularity in incidence rates increased for rarer tumors, either of which tended to create a greater number of ties among the observed tumor rates.
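The tie argument can also be checked directly. For two independent Binomial(*n*, *p*) counts, the tie probability is Σ_{x} *P*(*X* = *x*)², which for *n* = 50 falls steadily as *p* moves from 0.01 toward 0.50. This quick check is illustrative and not taken from the paper.

```python
import numpy as np
from math import comb

def tie_probability(n, p):
    # P(two independent Binomial(n, p) counts are equal) = sum_x P(X = x)^2.
    pmf = np.array([comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)])
    return float(np.sum(pmf ** 2))

# Tie probability at the tumor rates used in the simulation study.
probs = {p: tie_probability(50, p) for p in (0.01, 0.05, 0.15, 0.30, 0.50)}
```

Rare tumors (*p* = 0.01) make exact ties far more likely than common ones (*p* = 0.50), which is why the range rules become increasingly conservative for rare tumors.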

The proposed test performed very well. In contrast to the range-based rules, *Q* maintained the nominal 5% level in all situations. The worst (i.e., highest) Type I error rate out of 96 null scenarios was 5.8% (Table 3). On the other hand, the most conservative results were obtained for *π* = 0.01 and *k* = 1, where the false positive rate varied from 2.2% to 2.4%. Our simulation studies show that, unless the tumor is extremely rare and only one historical control group is available, the Type I error rates for the proposed test are very reasonable.

**Table 3.** Type I error rates for the proposed test (Q) by tumor rate (π), shape of the incidence curve (γ_{1}), number of historical control groups (k), and absence/presence of extra variation. All tests were performed at the nominal 5% level.

We also note that neither the shape of the incidence curve nor the introduction of extra variability had any noticeable impact on the Type I error rates for the proposed test. Even though *Q* uses a poly-3 survival adjustment [1], originally derived under the assumption of a Weibull tumor incidence model with a shape parameter of 3, the false positive rates for shapes *γ*_{1} = 1.5 and *γ*_{1} = 6 were essentially the same as for *γ*_{1} = 3. Similarly, the Type I error rates did not seem to be affected by increasing the heterogeneity of the tumor onset times and death times among the historical control groups (Table 3).

The power of the proposed test to detect a difference between *π ^{c}* and *π ^{h}* is summarized in Table 4 for both small and large differences in mean tumor rates.

**Table 4.** Power of the proposed test (Q) to detect small and large differences^{a} between mean tumor rates for historical (π^{h}) and current (π^{c}) controls by tumor rate (π^{h}), shape of the incidence curve (γ_{1}), and number of historical control groups (k).

We do not present powers for the range-based rules because their Type I error rates can be much too high or much too low, depending on the number of historical studies involved. As an interesting special case, however, we examined rejection rates for *k* = 39, where the predicted Type I error rate for a range-based rule in the absence of tied tumor rates, 2/(*k*+1), equals the nominal 0.05 level. In the null case, the rejection rates for *R*^{*} were right on target (4.9% to 5.5%) for tumor rates of 0.15 and 0.30, but were lower (2.4% to 3.5%) for tumor rates of 0.01 and 0.05 (as expected with the higher number of ties produced by lower tumor rates). For the higher tumor rates (0.15 and 0.30), where both *Q* and *R*^{*} operated at the nominal 5% level, *Q* had 17% to 24% greater power than *R*^{*} to detect small differences in tumor rates (e.g., 44.1% for *Q* versus 35.7% for *R*^{*}) and 2% to 5% greater power to detect large differences (even though there was not much room for improvement, as the powers were all above 90%). Thus, even when selecting the “best” value of *k* with respect to the Type I error rate of *R*^{*}, the proposed test was more powerful than the range-based decision rule.

**4. ANALYSIS OF BENZOPHENONE DATA**

Benzophenone is an aryl ketone, produced in large quantities in the United States, with widespread occupational and consumer exposures through its use as a fragrance enhancer, flavor additive, photoinitiator, and ultraviolet curing agent [8]. It is also used in manufacturing pharmaceuticals, insecticides, and agricultural chemicals, as well as being an additive in plastics and adhesives. Short-term animal studies suggested that the liver and kidneys were the target organs, but toxicity also was observed in the hematopoietic system.

The NTP conducted a 2-year study of male and female B6C3F_{1} mice and F344/N rats exposed to benzophenone. Our example focuses on mononuclear cell leukemia (MCL) in female rats. Groups of size 50 received doses of 0, 312, 625, or 1250 ppm of benzophenone in their diet throughout the study. The numbers of female rats that developed MCL were 19, 25, 30, and 29, respectively, with poly-3 survival-adjusted tumor rates of 42.3%, 51.5%, 61.3%, and 59.6%. With respect to female rats, the NTP concluded that there was equivocal evidence of carcinogenic activity of benzophenone, based in part on marginally increased incidences of MCL and histiocytic sarcoma.

Several factors contributed to the uncertainty in the NTP decision. The NTP’s trend test gave a *p*-value of 0.058 for the current experimental data, which is not statistically significant at the usual 0.05 level, though the pairwise comparison of the control and mid-dose groups was marginally significant (*p* = 0.048). On the other hand, accounting for historical control data supported the notion of an increasing trend in MCL rates with dose. The NTP examined 6 contemporary feed studies and found lower MCL rates among the untreated female rats (see Table B4b of the NTP report [8]). The unadjusted tumor rates ranged from 12% to 35% in these 6 historical control groups, which suggests the spontaneous MCL rate might be lower than the 38% rate observed in the current control group. Similarly, the corresponding poly-3 survival-adjusted rates were 42.3% in the current study and 12.7% to 35.6% in the historical studies. A lower MCL rate among controls would produce a more significant *p*-value for an increasing dose-related trend, especially in view of the relatively high tumor rates in the treated groups (i.e., poly-3 rates of 51.5%, 61.3%, and 59.6%). This argument is valid if we believe the current and historical controls have the same mean tumor rate, but otherwise we cannot necessarily draw that conclusion.

The NTP observed that the MCL rate in the current control group fell outside the historical range and, for that and other reasons, declared an equivocal result with respect to the possible carcinogenicity of benzophenone in female rats. As a range-based decision rule can reject too often when there are only *k* = 6 historical control groups, we applied the proposed test to these same data. Our formal test, which operates at the proper level, gave a 2-sided significance value of *p* = 0.010, which supports the NTP’s informal observation that the current control group differs from the historical control groups with respect to MCL.

**5. DISCUSSION**

Although various statistical methods for incorporating historical control information have been proposed over the past few decades, none have gained widespread use by scientists in the field or by regulatory agencies. For example, many early procedures assumed a beta-binomial model [9], which allowed for extra-binomial variation among the control groups. Other related procedures involved generalized binomial [10] or logistic-normal [11] models. An important problem with these approaches, however, is that they do not adjust for survival, which can introduce bias when mortality differs across groups, as all animals are not at equal risk for developing a tumor. Alternative methods account for survival but make assumptions about tumor lethality [12]. Several Bayesian procedures adjust for survival and avoid lethality assumptions [13] but require investigators to specify prior distributions and hyperparameters. For these and other reasons, none of these methods have been adopted for routine use in practice.

Recently, the Technical Reports Review Subcommittee of the NTP Board of Scientific Counselors, which included two statisticians, decided against endorsing any of the current statistical methods and instead recommended developing a new procedure to address the important problem of incorporating historical control data in the analysis of a current study (http://ntp.niehs.nih.gov/files/TRRSMins0905.pdf). Consequently, Peddada et al [5] developed a simple trend test that incorporates historical control data, adjusts for survival, and makes no assumptions about tumor lethality or parametric distributions. Further discussions with NTP toxicologists and pathologists revealed the need for an additional test for comparing current and historical control groups, which was the motivation for this article.

In summary, when evaluating a chronic bioassay, toxicologists and pathologists routinely assess the relevance of historical studies to the current study by comparing tumor rates in the current control group to the range of control tumor rates from contemporary studies performed under similar conditions. Current and historical control groups are often informally labeled dissimilar if the current control rate falls outside the historical range; see Keenan et al [2] for a discussion. One natural concern is that this type of approach might be conservative and have low power when the historical range becomes too wide, which can occur if *k* is large or if there is an outlier among the tumor rates. There has been a recent push for creating a global historical control database [3], where a large value of *k* could lead to a range-based procedure that could be extremely conservative and have little power. A less appreciated, though possibly more disturbing, concern is that such a range-based process could be highly anti-conservative with huge Type I error rates when *k* is small. As an alternative to this type of range-based approach, we provide a simple procedure that controls Type I errors, while adjusting for survival effects, accounting for extra variability among historical control groups, and avoiding tumor lethality assumptions.

We emphasize that an important feature of the proposed test is that it works well with a small number of studies and in fact can be applied with only one historical study, whereas range-based decision rules would perform poorly for small *k* and would not even be defined for *k* = 1. For example, the NTP recently switched from using Fischer rats to using Sprague Dawley rats in its 2-year rodent cancer bioassay. Initially the new historical control database will not contain enough Sprague Dawley rat studies to construct a reasonable range of tumor rates, but the proposed method will be readily applicable. This issue is widespread, as many small labs conducting rodent cancer bioassays do not have extensive historical control databases. Some have even discussed the construction of a global database to deal with these types of situations [3]. Our proposed approach provides a simple solution to this problem.

Finally, although the bootstrap procedure for approximating the null distribution of the proposed test statistic in extreme cases (i.e., *k* < 5 or *π̂ ^{h}* ≤ 0.005) does not account for between-group variability, the Type I error rate is maintained, consistent with the performance of the bootstrap methodology of Peddada et al [7] in related problems.

This research was supported by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences (Z01-ES045007 and Z01-ES101744). The authors thank Grace Kissling, Joseph Haseman, and the reviewers for their helpful comments.

**References**

1. Bailer A, Portier C. Effects of treatment-induced mortality and tumor-induced mortality on tests for carcinogenicity in small samples. Biometrics. 1988;44:417–431.

2. Keenan C, Elmore S, Francke-Carroll S, Kemp R, Kerlin R, Peddada S, Pletcher J, Rinke M, Schmidt S, Taylor I, Wolf D. Best practices for use of historical control data of proliferative rodent lesions. Toxicologic Pathology. 2009;37:679–693.

3. Keenan C, Elmore S, Francke-Carroll S, Kerlin R, Peddada S, Pletcher J, Rinke M, Schmidt S, Taylor I, Wolf D. Potential for a global historical control database for proliferative rodent lesions. Toxicologic Pathology. 2009;37:677–678.

4. Elmore S, Peddada S. Points to consider on the statistical analysis of rodent cancer bioassay data when incorporating historical control data. Toxicologic Pathology. 2009;37:672–676.

5. Peddada S, Dinse G, Kissling G. Incorporating historical control data when comparing tumor incidence rates. Journal of the American Statistical Association. 2007;102:1212–1220.

6. Bieler G, Williams R. Ratio estimates, the delta method, and quantal response tests for increased carcinogenicity. Biometrics. 1993;49:793–801.

7. Peddada SD, Prescott K, Conaway M. Tests for order restrictions in binary data. Biometrics. 2001;57:1219–1227.

8. National Toxicology Program. NTP Technical Report on the Toxicology and Carcinogenesis Studies of Benzophenone (CAS No. 119-61-9) in F344/N Rats and B6C3F_{1} Mice (Feed Studies). Technical Report Series No. 533, NIH Publication No. 05-4469. Research Triangle Park, NC: U.S. Department of Health and Human Services, Public Health Service, National Institutes of Health; 2006.

9. Tarone R. The use of historical control information in testing for a trend in proportions. Biometrics. 1982;38:214–220.

10. Makuch RW, Stephens MA, Escobar M. Generalised binomial models to examine the historical control assumption in active control equivalence studies. The Statistician. 1989;38:61–70.

11. Dempster AP, Selwyn MR, Weeks BJ. Combining historical and randomized controls for assessing trends in proportions. Journal of the American Statistical Association. 1983;78:221–227.

12. Ibrahim J, Ryan L. Use of historical controls in time-adjusted trend tests for carcinogenicity. Biometrics. 1996;52:1478–1485.

13. Dunson D, Dinse G. Bayesian incidence analysis of animal tumorigenicity data. Applied Statistics. 2001;50:125–141.
