Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
J Am Stat Assoc. Author manuscript; available in PMC 2010 April 13.
Published in final edited form as:
J Am Stat Assoc. 2007 January 1; 102(480): 1212–1220.
PMCID: PMC2853781

Incorporating Historical Control Data When Comparing Tumor Incidence Rates *


Animal carcinogenicity studies, such as those conducted by the U.S. National Toxicology Program (NTP), focus on detecting trends in tumor rates across dose groups. Over time, the NTP has compiled vast amounts of data on tumors in control animals. Currently, this information is used informally, without the benefit of statistical tests for carcinogenicity that directly incorporate historical data on control animals. This article proposes a survival-adjusted test for detecting dose-related trends in tumor incidence rates, which incorporates data on historical control rates and formally accounts for variation in these rates among studies. An extensive simulation, based on a wide range of realistic situations, demonstrates that the proposed test performs well in comparison to the current NTP test, which does not incorporate historical control data. In particular, our test can aid in interpreting the occurrence of a few tumors in treated animals that are rarely seen in controls. One such example, which motivates our work, concerns the analysis of histiocytic sarcoma in the NTP’s 2-year cancer bioassay of benzophenone. Whereas the occurrence of three histiocytic sarcomas in female rats was not significant according to the current NTP testing procedure (P = 0.074), it was highly significant (P = 0.004) when control data from six recent historical studies were included and our test was applied to the combined data.

Keywords: Cancer bioassay, Carcinogenicity experiment, National Toxicology Program, Order-restricted inference, Poly-3, Quantal response, Survival adjustment


In accord with a mandate by the U.S. Congress, the NTP routinely evaluates the toxicity and carcinogenicity of various chemicals to which humans are exposed. The NTP has performed more than 500 long term rodent studies, involving over 450 chemicals, which has produced a rich collection of information about tumors in control animals. Often this database is used informally when assessing whether differences in tumor rates are dose-related, especially when tumors are rare, but a formal method of incorporating historical control data would be preferable.

As an example, the NTP recently conducted a two-year rodent bioassay to evaluate the toxicity and carcinogenicity of benzophenone, a chemical to which a large number of people are exposed (for details, see As is typical of NTP bioassays, both sexes of two rodent species were studied, with 50 animals randomly assigned to a control group and 50 to each of three dose groups for all four sex/species combinations. The results of standard analyses of histiocytic sarcoma in female rats were unsatisfying. The observed numbers of rats with this cancer were 0, 0, 1, and 2 in the control, low-dose, mid-dose, and high-dose groups. The usual trend test reported by the NTP was not statistically significant (P = 0.074) at the standard 5% level, but the NTP historical control database shows that spontaneous occurrences of this neoplasm are very rare. None of the 460 female rats among the control groups in 6 recent studies had a histiocytic sarcoma, yet 2 of 50 in the current high-dose group and 1 of 50 in the current mid-dose group developed this cancer. We are interested in determining if the trend in the benzophenone study is significant after formally taking into account that, historically, this tumor is extremely rare.

Over the past few decades, several methods have been proposed for including historical information in the analysis of current data. Many of the procedures focus on lifetime tumor rates and assume a beta-binomial model, which allows for extra-binomial variation among the tumor proportions from the historical control groups (see, for example, Tarone 1982; Yanagawa and Hoel 1985; Hoel and Yanagawa 1986; Tamura and Young 1986; Prentice et al., 1992; Ryan 1993). Similar approaches have been based on generalized binomial (Makuch et al., 1989) and logistic-normal (Dempster et al., 1983; Seewald 1994) models. A major disadvantage of these methods, however, is that they do not adjust for survival. When treatment a ects longevity, the time at risk for developing tumors can differ substantially across dose groups, in which case survival-adjusted procedures are necessary to avoid biases.

Generalizations have been suggested that account for survival, but they have various shortcomings of their own. For example, Ibrahim and Ryan (1996) proposed a Dirichlet-multinomial model that assumes tumors are rapidly lethal, while other methods are oriented toward nonlethal tumors (Ibrahim et al., 1998; Parise et al., 2001). Bayesian procedures have been developed that adjust for survival and do not make extreme lethality assumptions (Dunson and Dinse 2001; French and Ibrahim 2002), but these analyses require investigators to specify prior distributions and hyperparameters.

The need for a formal method of incorporating historical control data in the analysis of the current experiment has long been recognized (Haseman et al., 1984; Haseman 1995), but no procedure has emerged as the clear favorite (Greim et al., 2003). The Technical Reports Review Subcommittee of the NTP Board of Scientific Counselors, which includes two statisticians, has not endorsed any of the current methods and recommended a new procedure be developed for this important problem ( This article presents a new method which is relatively simple, incorporates historical control data, adjusts for survival, and avoids lethality restrictions and distributional assumptions. Section 2 describes the proposed statistical procedure and Section 3 reports the results of an extensive simulation study, which shows that our method has good operating characteristics. The proposed test is illustrated in Section 4 by applying it to two NTP data sets, and several concluding remarks are provided in Section 5.


Early procedures for analyzing carcinogenicity data, with or without incorporating historical control information, dealt with lifetime tumor rates. In general, though, methods should focus on age-specific tumor rates to avoid biases due to differential survival. Of the various age-specific rates, the most appropriate is tumor incidence (McKnight and Crowley 1984), which corresponds to the hazard function for time to tumor onset. As a function of age, the tumor incidence rate automatically adjusts for differential survival, and as the rate at which new tumors occur, it is a natural measure of carcinogenesis. Thus, we focus on the detection of a positive trend (across dose groups) in the age-specific tumor incidence rates.

2.1 Notation

Let T be the time to the first of two events, either tumor onset or death, and let Y (t) be an indicator that is 1 if the tumor is present at time t and 0 otherwise. The tumor incidence rate and the tumor-free death rate correspond to the event-specific hazard functions




respectively. Let π be the expected proportion of animals that develop a tumor during the study (i.e. the lifetime tumor rate), which can be expressed in terms of λ(t) and β(t):


where tTS denotes the terminal sacrifice time (i.e. the length of the study).

Suppose there are k treatment groups in the current experiment, with associated dose levels d1 < d2 < (...) < dk, where animals in the first group are unexposed controls (d1 = 0). More generally, di is a dose score assigned to the ith group (i = 1, 2, … , k). Let πi, λi(t), and βi(t) denote the lifetime tumor rate, the tumor incidence rate, and the tumor-free death rate in the ith group. Tests for a positive trend in tumor incidence rates involve the following null and ordered alternative hypotheses:




respectively, where at least one inequality in Ha is strict for at least one t. Equation (1) shows that tests oriented toward the {πi} are not necessarily valid for comparing the {λi(t)}, as such tests can be confounded by differences among the {βi(t)}.

Suppose nci animals in the current study are randomized to the ith group and receive dose di of the treatment (i = 1, 2, … , k). We use c to index variables and parameters related to the current study, and later we use h to index similar quantities related to the historical data. Let ycij and tcij denote the tumor status and death time, respectively, for the jth animal in the ith group of the current study (j = 1, 2, … , nci; i = 1, 2, … , k). Thus, ycij is 1 if a tumor is present at necropsy and 0 otherwise. Furthermore, nc+=i=1knci is the total number of animals in the current study, and yci+=j=1nciycij and yc++=i=1kyci+ are the numbers of tumor-bearing animals in the ith group and in the entire study, respectively.

2.2 Background

Many early analyses used the linear trend test of Cochran (1954) and Armitage (1955) to compare lifetime tumor rates. Inferences based on lifetime rates can be misleading, though, if survival differs across treatment groups. For example, unless all animals live to the end of the study, the sample proportion yci+/nci tends to underestimate πci because it does not account for early deaths reducing the time at risk for developing tumors. Bias arises when the amount of reduction in the time at risk varies with dose (i.e. differential mortality).

Bailer and Portier (1988) addressed the survival issue by defining a modified sample size nci=j=1nciδcij, where δcij is a weight for animal j in group i. Clearly an animal that developed a tumor was at risk of tumor onset and thus receives a weight of 1. An animal that dies without a tumor, however, is assigned a weight linked to the fraction of the study it survived. Bailer and Portier (1988) suggested using (tcij/tTS)3 for this fractional weight, which is known as the Poly-3 weight; then they applied the Cochran-Armitage test after substituting nci for nci. The resulting survival-adjusted estimate of πci is π~ci=yci+nci. Bieler and Williams (1993) noted that nci is a random variable and proposed a corrected variance estimator, which under the assumption that πc1 = πc2 = (...) = πck is given by


where S12=ij(rcijrci)2(nc+k),rcij=ycijπ~cδcij,rci=rci+nci,π~c=yc++nc+,wci=nci2nci, and the plus sign indicates summation over the missing index.

The usual Poly-3 trend test, with the Bieler-Williams variance adjustment, is


The NTP currently uses the following (conservative) variation of WBW:


which shrinks WBW toward 0. The continuity correction factor in the above formula is


where d=i=1kwcidii=1kwci is a weighted average of the dose scores. In the absence of survival differences, CF reduces to half the largest difference between successive dose scores. Both WNTP and WBW are approximately normal when πc1 = πc2 = (...) = πck. Bailer and Portier showed that while the Poly-3 test is oriented toward survival-adjusted comparisons of the {πci}, it also provides an approximately valid comparison of the age-specific tumor incidence rates when each λci(t) follows a Weibull model with a shape parameter of 3. However, simulation studies demonstrate that the Poly-3 test is sensitive to certain departures from the underlying Weibull model (Mancuso et al., 2002; Peddada et al., 2005).

Peddada et al. (2005) suggested an alternative trend test which uses the Poly-3 weights, but does not assume the tumor rates are linear in dose, does not depend on the numerical scores assigned to the dose groups, and does not incorporate a continuity correction. Let π^ci denote the survival-adjusted isotonic regression estimate of πci (i = 1, 2, … , k) under the constraint that πc1 ≤ πc2(...) ≤ πck. These estimates can be obtained, for example, by applying the pool-adjacent-violators algorithm (PAVA) (Barlow et al., 1972) to the {π^ci} with weights {nci}. As a simple illustration of PAVA, suppose there are k = 4 groups, all weights are equal, the true (unknown) proportions are constrained by π1 ≤ π2 ≤ π3 ≤ π4, and the sample proportions are π~1=.1,π~2=.3,π~3=.2 and π~4=.4, respectively. Clearly, π~2 and π~3 violate the constraint. The isotonic solution can be obtained by retaining those estimates which satisfy the inequality constraints while averaging the estimates which are in violation. Thus, in this illustration, the “isotonized” estimates are π^1=.1,π^2=π^3=(.3+.2)2=.25, and π^4=.4. Mathematically, in the general case, the “isotonized” values π^ci (i = 1, 2, …, k) under the above inequality constraint can be computed using the following min-max formula:


The isotonic test statistic proposed by Peddada et al. (2005) can be written


which was motivated by a procedure that Williams (1977) derived to test the equality of normal means. Define k standard normal random variables: Zi ~ N(0, 1) for i = 1, … , k. Given equal {πci}, the distribution of WISO can be approximated by the distribution of


where Z^1Z^2Z^k are the isotonized values of the {Zi} using unit weights. Peddada et al. (2005) conducted extensive simulations which showed that their test generally was more powerful than the Bailer-Portier/Bieler-Williams test when comparing age-specific incidence rates, even though both tests are more naturally oriented toward the{πci} than the {λci(t)}.

2.3 Proposed Test

Suppose there are p relevant studies in the historical control database, where this number depends on several factors. Generally, we would prefer to include only those historical studies for which the route of exposure (e.g. feed, gavage, inhalation, skin painting) is the same as in the current study. Also, we typically restrict our comparison to studies which were performed relatively recently, since background tumor rates are known to “drift” over time (Haseman, 1992). Thus, in practice, we anticipate incorporating control data from approximately 5 to 10 historical studies.

Let nhm be the number of animals in the mth historical control group, and let yhmj and thmj denote the tumor status and death time, respectively, for animal j in group m (j = 1, 2, … , nhm; m = 1, 2, … , p). In addition, set yhm+=j=1nhmyhmj and yh++=m=1pyhm+. Finally, let πhm denote the proportion of animals [summationtext] in the mth historical control group that are expected to develop a tumor if surviving the entire study (i.e. the lifetime tumor rate).

The problem of detecting trends in tumor rates can be viewed as a special case of order restricted inference and can be solved by maximizing certain distances in a directed graph. Figure 1 shows a directed graph appropriate for the usual NTP situation, where there is a high-dose (HD) group, a mid-dose (MD) group, a lowdose (LD) group, a current control (CC) group, and a collection of historical control (HC) groups. The arrows indicate the restrictions that the tumor rates in the control groups do not exceed the rate in the low-dose group, which does not exceed the rate in the mid-dose group, which does not exceed the rate in the high-dose group. The control animals, both current and historical, are assumed to come from a common population, and thus there are no arrows connecting CC and HC.

Figure 1
Directed Graph Representing the Trends of Interest. The circles denote parameters in the historical control (HC) group, the current control (CC) group, the current low-dose (LD) group, the current mid-dose (MD) group, and the current high-dose (HD) group. ...

In the spirit of Kolmogorov distance measures, we follow Peddada et al. (2001) and define a test statistic for the graph in Figure 1 as the maximum distance between connected points. Specifically, our test is max(D1, D2), where D1 is the distance between the current control group and current dose groups, and D2 is the distance between the historical control groups and current dose groups. Each of these two distances, D1 and D2, is itself the maximum of two quantities. Distance D1 is the maximum of the Bieler-Williams version of the Poly-3 statistic and the order restricted inference statistic of Peddada et al. (2005) based on the current control group. Our choice for D1 was motivated by a test proposed by Peddada and Kissling (2006), who selected this pair of statistics because the former performs well against linear trends, while the latter performs well when trends are not linear. By taking the maximum of the two, we obtain a test with reasonably good power for detecting linear or nonlinear (but still non-decreasing) trends. Distance D2 has exactly the same form as D1, except that it is based on the historical control groups rather than the current control group. Thus, the proposed test statistic is the maximum of four components. Two of the components are WBW and WISO, as described in the previous subsection, and now we derive the other two analogous components, say WBW and WISO.

Our approach views the lifetime tumor rates in the current and historical control groups (πc1, πh1, … , πhp) as random realizations from a common control population with mean π. If no animals die before the end of the study, the sample proportion yhm+/nhm has expected value π (m = 1, 2, … , p) and we assume the variance can be expressed as


where σ2 is an unknown parameter that allows for extra-binomial variation.

Now suppose that some animals die before tTS and we want to account for this by using Poly-3 adjusted sample sizes. Allowing for the random nature of the sample size in variance formula (2), we propose separate estimators for σ2 and π(1 — π), where a Bieler-Williams type adjustment is made to the latter. Under the assumption that the lifetime tumor rates are equal in all control and treated groups, we estimate σ2 by σ^2=S22[π~(1π~)], where




The Bieler-Williams estimate of π(1 — π), say π(1π^), is obtained by applying the formula for S12 to the combined set of p historical control groups and k current groups. We estimate the variance of the Poly-3 adjusted tumor proportions in the current high-dose group, π~ck, and in the pooled historical control groups, π~h=yh++nh+, by




respectively, where wh+=nh+2nh+. Note that although the extra binomial variation among the historical control groups should not extend to the current dose groups, we purposely inflate the estimated variance of π~ck by σ^ in the above formula to avoid the Behrens-Fisher problem. This modification leads to a slightly conservative test (described below), but we view it as a small price to pay for the resulting simplification.

Thus, we define a new pair of test statistics, say WBW and WISO, which are analogous to WBW and WISO, except that they are oriented toward detecting a trend in the current dose groups relative to the historical control groups rather than relative to the current control group. Specifically, define WBW exactly the same as WBW , except for replacing the current control group quantities wc1 and π~c1 with the historical control group summaries wh+ and π~h, as well as replacing S1 with S1σ^. Similarly, let WISO be the following analog of WISO:


where πˇhπˇc2πˇc3πˇck are the isotonized values of {π~h,π~c2,π~c3,,π~ck} under the constraint that π ≤ πc2 ≤ πc3 (...) πck, with weights nh+,nc2,nc3,,nck.

Finally, the proposed survival-adjusted test statistic for detecting a dose-related trend in the tumor rates, using both current and historical controls, is given by


Deriving the exact null distribution of W is not trivial. Hence we use critical values based on the approximations described below. In the next section, simulation studies reveal that our approximations control the Type I error rate very well.

Initially, assume that all animals survive to the end of the study and the lifetime tumor rates in all control and treated groups are equal to π. Also, let vg denote the estimated variance of π^g for g = h, c1, c2, … , ck. These assumptions, together with the consistency of the {vg}, suggest the following approximations:


Denote the above limiting standard normal random variables by {Z0, Z1, … , Zk}, where the first term (Z0) corresponds to the pooled set of p historical control groups and the remaining terms correspond to the k groups in the current study. Next, define U0=Z0p and Ui = Zi for i = 1, … , k. In our application, the observations in all control and treated groups are independent; thus, the {Zi} are independently distributed, as are the {Ui}. Note that all 4 components of W involve π^c2,,π^ck, but in addition to these terms WBW and WISO are also functions of π^c1, whereas WBW and WISO are also functions of π^h. Thus, we approximate the null distribution of WBW and WISO by using the random variables Z1, Z2, … , Zk, and we approximate the null distribution of WBW and WISO by using the random variables, U0, U2, U3, … , Uk. As WBW can be obtained by regressing the lifetime tumor rates on dose, it should be distributed approximately the same as the regression of Z1, Z2, … , Zk on dose:


We approximate the distribution of WBW using the analogous function of U0, U2, … , Uk:


Both WISO and WISO resemble the statistic proposed by Williams (1977) to test for a monotone trend in dose. Hence, using the asymptotic approximations given above, the null distributions for these statistics can be approximated by




respectively, where Z^1Z^2Z^k are the isotonized values of {Z1, Z2, … , Zk} with unit weights and U^0U^2U^k are the isotonized values of {U0, U2, … , Uk} with a weight of p for U0 and unit weights for the others.

Therefore, in the absence of deaths prior to the end of the experiment, the null distribution of our test statistic W can be approximated by the distribution of


As V is a function of independent N(0, 1) random variables, its distribution can be simulated by repeatedly generating k + 1 independent standard normals. The level α critical value for W is the 100(1 — α)th percentile of the simulated distribution of V. In practice, we take a million random samples to approximate the distribution of W.

More generally, when some animals die early, the components of W are functions of the Poly-3 weights, which provide a simple survival adjustment. However, despite possible survival effects, we approximate the distribution of W by the same simulated distribution of V described above. We demonstrate that the proposed test performs well in the more general case by simulating a variety of realistic data sets with many deaths before tTS. The simulation results show that our test operates at (or below) the nominal level in a wide range of situations typically observed in practice.


3.1 Study Design

Following Peddada et al. (2005), data were simulated from a variety of realistic situations commonly encountered in NTP studies, which typically run for two years and end with a terminal sacrifice. We generated two latent random variables for each animal: T1, the time to tumor onset, and T2, the time to natural death. A simulated animal develops a tumor before death if and only if T1 < min(T2, tTS), where min(T2, tTS) is the observed death time. Our simulation measured time in months, and thus tTS = 24. Data were simulated for 50 animals per group, which is the standard sample size in most NTP 2-year bioassays. Poly-3 based tests assign unit weights to all animals that die with a tumor, regardless of the tumor’s lethality. Thus, we did not simulate different levels of tumor lethality because such information is not used by the quantal response tests considered here.

We generated data for p historical control groups, a current control group, and three current dose groups. The latent times to tumor onset and death, T1 and T2, for animals in all groups were generated from independent Weibull distributions with survival functions of the form: P(T > t) = exp(-ψtγ). We assumed that all dose groups had the same shape parameters for T1 and T2, say γ1 and γ2, which govern the steepness of the tumor incidence and mortality curves, respectively. Any differences among groups were assumed to arise only through the scale parameters. Let the scale parameters for T1 and T2 be denoted by ψ1ci and ψ2ci in the i-th group in the current study (i = 1, 2, 3, 4) and by ψ1hm and ψ2hm in the m-th historical control group (m = 1, … , p). We generated T1 and T2 values for the various groups according to the following models for the scale parameters:







where ψ1 and ψ2 are baseline values, ϕ1i is specific to dose group i, ϕ2 is common to all dose groups, and XmiidN(0,1) for m = 1, … ,p. Thus, all control groups (current and historical) shared a common mortality scale parameter, ψ2, but each had a distinct incidence scale parameter. The incidence scale parameters in the historical groups (ψ1h1, ψ1h2, … , ψ1hp) were perturbed from the current control value (ψ1c1 = ψ1) in proportion to τ, to induce extra variation. Our primary focus is on the ratio of the tumor incidence rate in dose group i relative to the tumor incidence rate in the current control group, which is represented by θi=ϕ1idi.

The simulation specified multiple values for some parameters, which led to 540 configurations. We generated data for p = 5 historical control groups. Results are reported for a typical pattern of dose scores: (d1, d2, d3, d4) = (0, 0.5, 1.0, 2.0), which we refer to as 2-fold spacings. We also investigated 5-fold dose spacings, but the results are very similar to those obtained for 2-fold spacings (if not even more favorable for our test), and thus they are not shown. As control death rates in NTP studies are well estimated, and our focus is on tumor incidence, very few mortality configurations were examined. The mortality shape parameter was fixed at γ2 = 5 and we derived the baseline scale parameter value that gave a typical control survival rate of 70% at 2 years: ψ2 = 4.48 × 10-8. Two mortality scale multipliers were used, ϕ2 = 1 and 1.5, which correspond to dose effects on mortality that can be labeled as “none” and “moderate” (i.e. 70% and 45% survival at 2 years in the high-dose group).

In contrast, we varied many factors influencing tumor incidence rates. We investigated three values for the shape of the tumor onset curve, γ1 = 1.5, 3, and 6, and five values for the lifetime tumor rate in the current control group, π1 = 0.001, 0.01, 0.05, 0.15, and 0.30. For fixed values of π1, γ1, γ2, and ψ2, we used equation (1) to solve for ψ1, the baseline incidence scale parameter for an “average” control group (i.e. Xm = 0). We examined three levels of heterogeneity among the historical control groups, where the heterogeneity parameter τ varied with the background tumor rate (π1). The τ values were selected so that the resulting standard deviation of π1 among the simulated historical control groups reasonably approximated low, medium, and high values within the range of sample standard deviations observed in the NTP historical control database. Finally, we chose ϕi1 values that gave certain patterns for the incidence ratio, θ′ = (θ1, θ2, θ3, θ4). The null case for comparing incidence rates is θ′ = (1, 1, 1, 1), and we considered five non-null patterns, which varied with the background tumor rate. The specific values for ψ1, τ, and θ are given in Table 1.

Table 1
Parameter Values Used in the Simulations. The entries in the last 5 columns correspond to the 5 values of the background tumor rate (π1). The entries in the first 3 rows are the values of the baseline incidence scale parameter (ψ1) for ...

Overall, we investigated 540 configurations by taking all combinations of several factors: mortality scale multiplier (2 levels), shape of incidence curve (3 levels), background tumor rate (5 levels), heterogeneity of control groups (3 levels), and incidence ratio pattern (6 levels). Type I error rates were evaluated for the 90 configurations associated with the null hypothesis of equal incidence rates, and powers were estimated for the 450 configurations associated with various alternative hypotheses. The operating characteristics of W and WNTP were compared in each situation. Though initially it may seem odd to compare a test which uses historical control data with another test which does not, we view this as a comparison of a new method with the current standard method.

For each configuration, we generated 10,000 sets of current study data (i.e. a control group and three dose groups), and for each set of current study data, we generated a set of data from p historical control groups. We estimated the Type I error rates for W and WNTP by the empirical proportions of the 10,000 trials for which H0 was rejected. The nominal level for both trend tests was 0.05.

3.2 Results

This section summarizes the operating characteristics of the proposed test (W) and the NTP test (WNTP) as estimated in our simulation study. Both tests maintained the nominal level (see Figure 2). The Type I error rate did not exceed 0.05 in any of the 90 null cases for WNTP and only slightly exceeded the nominal level in one null case for W, and that single excess was not statistically significant. The Type I error rate generally increased with the background tumor rate (π1), especially for WNTP, as illustrated by the different symbols in Figure 2. The Type I error rates also varied with the shape of the incidence curve (γ1) and the dose effect on mortality (ϕ2), but these effects were small. Control heterogeneity (τ) had no impact on WNTP, as the NTP test does not incorporate historical data. In contrast, the Type I error rates for W decreased as variability among historical control groups increased, though this effect was not strong for typical levels of heterogeneity, such as those used here.

Figure 2
Type I Error Rate Comparison for Proposed Test (W) Versus NTP Test (WNTP). Different symbols identify the various background tumor rates: 0.001 (diamonds), 0.01 (squares), 0.05 (pluses), 0.15 (circles), and 0.30 (triangles). The diagonal reference line ...

The proposed test was more powerful than the standard test in almost all situations, with W having a fairly large advantage relative to WNTP in many cases (see Figure 3). For some configurations, the two tests had approximately the same power, as shown by the symbols along the diagonal line. When the rejection rates differed, however, W tended to be more powerful than WNTP , as indicated by the large proportion of symbols above the diagonal reference line. In fact, the proposed test was 4.5 times as powerful as the standard test in some cases and was never less than 94% as powerful. Analogous to the results for Type I error rates, the observed power tended to increase with background tumor rate, as illustrated by the different symbols in Figure 3. Similarly, power varied with the shape of the incidence curve and the dose effect on mortality, as well as the control group heterogeneity for W, but these effects usually were relatively small.

Figure 3
Power Comparison for Proposed Test (W) Versus NTP Test (WNTP). Different symbols identify the various background tumor rates: 0.001 (diamonds), 0.01 (squares), 0.05 (pluses), 0.15 (circles), and 0.30 (triangles). The diagonal reference line shows where ...

The proposed test performed well compared to the standard NTP test in general, but the most obvious improvement of W over WNTP was when tumors were rare. For example, when the background tumor rate was 0.001 (i.e. a very rare tumor), WNTP was extremely conservative, with Type I error rates close to 0 and powers between 3% and 9%. In contrast, W was less conservative than WNTP, with Type I error rates as high as 2% and powers ranging from 5% to 26%. As for the other factors, the power of both tests decreased as differential mortality increased and as tumor development shifted to later in life (i.e. as γ1 increased). In addition, the power of W decreased as the historical control groups became more heterogeneous. The effects of these other factors were relatively small, however, compared to the effect of the background tumor rate, at least at the factor levels reported here, which were chosen to simulate data representative of what the NTP observes in practice.


This section illustrates the proposed methodology with real data from two NTP studies. The first analysis focuses on a rare tumor, which is the situation that motivated our research, and the second analysis presents a contrasting example involving a common tumor. In one case, our results strengthen an NTP conclusion that was reached despite rather weak statistical evidence. The other case demonstrates how the increased power of our formal procedure for incorporating data from past studies provides an appealing alternative to the NTP’s informal practice of comparing tumor rates in current dose groups with the range of rates from historical control groups.

4.1 Analysis of Benzophenone Data

Benzophenone is an aryl ketone which is produced in large quantities in the U.S., with the potential for widespread occupational and consumer exposures through its use as a fragrance enhancer, flavor additive, photoinitiator, and ultraviolet curing agent (NTP, 2006). It is also used in the manufacture of pharmaceuticals, insecticides, and agricultural chemicals, as well as being an additive in cosmetics, plastics, and adhesives. Benzophenone does not appear to be mutagenic (NTP, 2006, p. 19), although it could potentially be carcinogenic. Short-term animal studies suggested that the liver and the kidneys were the target organs, though toxicity also was observed in the hematopoietic system.

As a consequence of its widespread use, the NTP conducted a two-year study of male and female mice and rats exposed to benzophenone. This example focuses on histiocytic sarcomas in female rats. Groups of size 50 received doses of 0, 312, 625, or 1250 ppm of benzophenone in their food throughout the study. One rat in the 625 ppm group and two rats in the 1250 ppm group developed histiocytic sarcomas. Based only on the current study data, the NTP’s Poly-3 trend test led to a significance value of P = 0.074. Though not statistically significant at the usual 0.05 level, the NTP concluded that there was equivocal evidence of carcinogenic activity of benzophenone in female rats for several reasons (NTP, 2006). First, histiocytic sarcoma is a rare neoplasm in female rats and had not been observed in any controls (0/460) in six recent NTP feed studies. In addition, the same neoplasm was positively correlated with the dose of benzophenone in the current study of female mice (P = 0.032), and there was a marginally significant trend (P = 0.058) observed for another hematopoietic system lesion (mononuclear leukemia) in female rats.

Using the proposed methodology, which formally accounts for the historical control data from the six recent feed studies, we find a significant dose-related trend (P = 0.004) in histiocytic sarcoma among female rats. Thus, our method uses historical information to bolster the weak statistical evidence which would otherwise be obtained if considering only the concurrent data. Our results provide formal support for the biologic evidence that fueled the NTP’s suspicion that benzophenone is carcinogenic in female rats.

4.2 Analysis of Gallium Arsenide Data

Gallium arsenide is a by-product of aluminum and zinc extraction (NTP, 2000). Due to its photovoltaic properties, this chemical is used extensively when producing lasers, photodetectors, light-emitting diodes, microwave devices, solar cells, and semicon-ductors. The most common exposure to gallium arsenide occurs when workers inhale small particles during the manufacturing of these devices. As in the case of benzophenone, gallium arsenide does not appear to be mutagenic (NTP, 2000, p. 27), even though it could potentially be carcinogenic. While little is known about the effects of gallium arsenide on humans, short-term rodent studies suggest that the lung is the primary target organ, with toxic effects also seen in the kidney, liver, immune system, and male reproductive system, as well as in fetal development and heme biosynthesis.

As a result of its extensive use in the microelectronics industry, the NTP conducted a two-year inhalation study of male and female mice and rats exposed to gallium arsenide (NTP, 2000). In this example, we focus on malignant lymphomas in female mice. Groups of 50 female mice were exposed in inhalation chambers to doses of 0, 0.1, 0.5, or 1.0 mg/m3 of gallium arsenide. The corresponding numbers of mice with a malignant lymphoma were 3, 8, 11, and 10, respectively, and the NTP trend test produced a significance value of P = 0.035. Even if a statistically significant result is observed in the current study, the NTP often re-evaluates the significance of dose effects on tumor rates in the current study by comparing the current rates with the range of rates observed in control animals from past studies. For instance, in our gallium arsenide example, the survival-adjusted rates of malignant lymphoma in the control and three dose groups are 0.067, 0.175, 0.269, and 0.234, respectively. The NTP compared these rates with the background malignant lymphoma rates from 20 inhalation studies in their historical control database, which ranged from 0.06 to 0.32. As all of the lifetime tumor rates in the current study were within the range of historical rates, the NTP concluded that “the incidences of malignant lymphoma in exposed females were not considered to be exposure related” (NTP, 2000, p. 86), despite a statistically significant result for the current data at the usual 0.05 level of significance. It is easy to verify, however, that methods based on the range of historical tumor rates become less powerful as the number of historical studies increases.

As an alternative to the NTP strategy, we applied our test using 19 of the same 20 historical control groups (data from one study were unavailable). The significance value for our test was P = 0.021, which shows that even after incorporating historical control information, the observed dose effects remained statistically significant. Thus, this example suggests that the historical range may not be a good criterion for evaluating the statistical significance of dose effects in the current study.

We also assessed the robustness of our method to the choice of historical data by considering a variety of subsets of 5 historical studies. In each case, the resulting significance value was P = 0.02, with differences only in the third decimal place. Furthermore, the inclusion of historical control data from as few as 5 recent studies when analyzing the current data may be su cient, as the large sample approximation seems to work reasonably well even in that situation.


The present NTP approach to evaluating animal carcinogenicity experiments applies formal statistical tests only to concurrent data. Historical control data enter the evaluation only informally – mainly in a comparison of the tumor rates in the current study to the range of background tumor rates among the historical controls. Incorporation of historical data into a formal statistical procedure would represent a much needed tool in evaluating results from cancer bioassays. In fact, the Technical Reports Review Subcommittee of the NTP Board of Scientific Counselors recently recommended that a new method be developed for this important problem (see In this paper we propose such a procedure, which formally incorporates historical control data, adjusts for survival, makes no lethality assumptions, and avoids parametric models.

We want to re-emphasize that our focus is on age-specific tumor incidence, even though this endpoint generally can not be estimated directly. The proposed test and the standard NTP test are most naturally oriented toward comparisons of lifetime rates, but we evaluate their performance with respect to hypotheses about age-specific incidence rates. Both tests use the Poly-3 adjustment to account for dose-related survival diff erences, and structurally they are expressed in terms of survival-adjusted versions of lifetime tumor rates {πi} rather than age-specific tumor incidence rates {λi(t)}. Nevertheless, we conducted extensive simulations to characterize the rejection rates of these two tests with respect to hypotheses about the {λi(t)}, where we considered the null configurations to be those having equal age-specific incidence rates rather than those having equal lifetime tumor rates.

As noted in Haseman (1992), extra-binomial variation among historical controls can arise from calendar time drift, inter-laboratory variability, and different types of controls. For example, the variability of tumor rates for vehicle controls in inhalation studies often differs from that of control tumor rates in feed studies. In order to minimize extra variability, the current NTP practice is to use the most recent historical control data and also, as far as possible, to limit the data to studies involving the same route of exposure. We believe that the NTP should continue to base their decisions about carcinogenicity on such a strategy of selecting relevant historical control data, as it is not only scientifically meaningful but will also increase the power of our test considerably. Too much variability in the historical control data can result in W having less power than WNTP in certain extreme situations.

We believe that the proposed test should be used whenever historical control data are incorporated in the analysis. Specifically, the informal range-based method presently used by the NTP should be replaced by a formal statistical test such as W. In most practical situations, W is at least as powerful as WNTP and often is much more powerful. Even when W has no power advantage over WNTP, the former still represents a more realistic approach to understanding the data than the latter. We note, however, that although on a relative scale our test enjoys its greatest power advantage over the NTP test when the tumor of interest is rare, there is still room to improve the absolute power of our test in this situation.


This research was supported by the Intramural Research Program of the NIH, National Institute of Environmental Health Sciences. The authors thank Dr. Kenny Crump, Dr. Beth Gladen, Dr. Joseph Haseman, the referees, and the editors for their helpful comments and suggestions. The authors also thank Mr. Kevin McGowan for extracting the NTP data from the database which were analyzed in this paper.


*Shyamal D. Peddada (Corresponding author, vog.hin.shein@adaddep), Gregg E. Dinse (vog.hin.shein@esnid), and Grace E. Kissling (vog.hin.shein@gnilssik) are members of the Biostatistics Branch, National Institute of Environmental Health Sciences, MD A3-03, P.O. Box 12233, Research Triangle Park, NC 27709.


  • Armitage P. Tests for Linear Trends in Proportions and Frequencies. Biometrics. 1955;11:375–386.
  • Bailer A, Portier C. Effects of Treatment-induced Mortality and Tumor-induced Mortality on Tests for Carcinogenicity in Small Samples. Biometrics. 1988;44:417–431. [PubMed]
  • Barlow R, Bartholomew D, Bremner J, Brunk H. Statistical Inference Under Order Restrictions. John Wiley & Sons; London: 1972.
  • Bieler G, Williams R. Ratio Estimates, the Delta Method, and Quantal Response Tests for Increased Carcinogenicity. Biometrics. 1993;49:793–801. [PubMed]
  • Cochran W. Some Methods for Strengthening the Common χ2 Tests. Biometrics. 1954;10:417–451.
  • Dempster AP, Selwyn MR, Weeks BJ. Combining Historical and Randomized Controls for Assessing Trends in Proportions. Journal of the American Statistical Association. 1983;87:221–227.
  • Dunson D, Dinse G. Bayesian Incidence Analysis of Animal Tumorigenicity Data. Applied Statistics. 2001;50:125–141.
  • French JL, Ibrahim JG. Bayesian Methods for a Three-state Model for Rodent Carcinogenicity Studies. Biometrics. 2002;58:906–916. [PubMed]
  • Greim H, Gelbke H-P, Reuter U, Thielmann HW, Edler L. Evaluation of Historical Control Data in Carcinogenicity Studies. Human and Experimental Toxicology. 2003;22:541–549. [PubMed]
  • Haseman JK. Value of Historical Controls in the Interpretation of Rodent Tumor Data. Drug Information Journal. 1992;26:191–200.
  • Haseman JK. Data Analysis: Statistical Analysis and Use of Historical Control Data. Regulatory Toxicology and Pharmacology. 1995;21:52–59. [PubMed]
  • Haseman JK, Hu J, Boorman GA. Use of Historical Control Data in Carcinogenicity Studies in Rodents. Toxicologic Pathology. 1984;12:126–135. [PubMed]
  • Hoel DG, Yanagawa T. Incorporating Historical Controls in Testing for a Trend in Proportions. Journal of the American Statistical Association. 1986;81:1095–1099.
  • Ibrahim JG, Ryan LM. Use of Historical Controls in Time-adjusted Trend Tests for Carcinogenicity. Biometrics. 1996;52:1478–1485. [PubMed]
  • Ibrahim JG, Ryan LM, Chen M-H. Using Historical Controls to Adjust for Covariates in Trend Tests for Binary Data. Journal of the American Statistical Association. 1998;93:1282–1293.
  • Makuch RW, Stephens MA, Escobar M. Generalised Binomial Models to Examine the Historical Control Assumption in Active Control Equivalence Studies. The Statistician. 1989;38:61–70.
  • Mancuso J, Ahn H, Chen J, Mancuso J. Age-adjusted Exact Trend Tests in the Event of Rare Occurrences. Biometrics. 2002;58:403–412. [PubMed]
  • McKnight B, Crowley J. Tests for Differences in Tumor Incidence Based on Animal Carcinogenesis Experiments. Journal of the American Statistical Association. 1984;79:639–648.
  • National Toxicology Program . U.S. Department of Health and Human Services, Public Health Service, National Institutes of Health; RTP, NC: 2000. NTP Technical Report on the Toxicology and Carcinogenesis Studies of Gallium Arsenide (CAS No. 1303-00-0) in F344/N Rats and B6C3F1 Mice (Inhalation Studies) (Technical Report Series No. 492). NIH Publication No. 00-3951. [PubMed]
  • National Toxicology Program . U.S. Department of Health and Human Services, Public Health Service, National Institutes of Health; RTP, NC: 2006. NTP Technical Report on the Toxicology and Carcinogenesis Studies of Benzophenone (CAS No. 119-61-9) in F344/N Rats and B6C3F1 Mice (Feed Studies) (Technical Report Series No. 533). NIH Publication No. 05-4469.
  • Parise H, Wand MP, Ruppert D, Ryan L. Incorporation of Historical Controls Using Semiparametric Mixed Models. Applied Statistics. 2001;50:31–42.
  • Peddada S, Dinse G, Haseman J. A Survival-adjusted Quantal Response Test for Comparing Tumor Incidence Rates. Applied Statistics. 2005;54:51–61.
  • Peddada S, Kissling G. A Survival-adjusted Quantal-Response Test for Analysis of Tumor Incidence Rates in Animal Carcinogenicity Studies. Environmental Health Perspectives. 2006;114:537–541. [PMC free article] [PubMed]
  • Peddada S, Prescott K, Conaway M. Tests for Order Restrictions in Binary Data. Biometrics. 2001;57:1219–1227. [PubMed]
  • Prentice RL, Smythe RT, Krewski D, Mason M. On the Use of Historical Control Data to Estimate Dose Response Trends in Quantal Bioassay. Biometrics. 1992;48:459–478. [PubMed]
  • Ryan L. Using Historical Controls in the Analysis of Developmental Toxicity Data. Biometrics. 1993;49:1126–1135. [PubMed]
  • Seewald W. Time Trend in Historical Controls for Tumour Incidences in Long-term Animal Studies. Applied Statistics. 1994;43:127–137.
  • Tamura RN, Young SS. The Incorporation of Historical Control Information in Tests of Proportions: Simulation Study of Tarone’s Procedure. Biometrics. 1986;42:343–349. [PubMed]
  • Tarone RE. The Use of Historical Control Information in Testing for a Trend in Proportions. Biometrics. 1982;38:214–220. [PubMed]
  • Williams D. Some Inference Procedures for Monotonically Ordered Normal Means. Biometrika. 1977;64:9–14.
  • Yanagawa T, Hoel DG. Use of Historical Controls for Animal Experiments. Environmental Health Perspectives. 1985;63:217–224. [PMC free article] [PubMed]