J R Stat Soc Ser A Stat Soc. Author manuscript; available in PMC 2014 February 1.
Published in final edited form as:
J R Stat Soc Ser A Stat Soc. 2013 February 1; 176(2): 603–608.
Published online 2012 June 28. doi:  10.1111/j.1467-985X.2012.01052.x
PMCID: PMC3616635
NIHMSID: NIHMS369244

The risky reliance on small surrogate endpoint studies when planning a large prevention trial

Summary

The definitive evaluation of treatment to prevent a chronic disease with low incidence in middle age, such as cancer or cardiovascular disease, requires a trial with a large sample size of perhaps 20,000 or more. To help decide whether to implement a large true endpoint trial, investigators typically first estimate the effect of treatment on a surrogate endpoint in a trial with a greatly reduced sample size of perhaps 200 subjects. If investigators reject the null hypothesis of no treatment effect in the surrogate endpoint trial, they implicitly assume that they would likely also correctly reject the null hypothesis of no treatment effect for the true endpoint. Surrogate endpoint trials are generally designed with adequate power to detect an effect of treatment on the surrogate endpoint. However, we show that a small surrogate endpoint trial is more likely than a large surrogate endpoint trial to give a misleading conclusion about the beneficial effect of treatment on the true endpoint, which can lead to a faulty (and costly) decision about implementing a large true endpoint prevention trial. If a small surrogate endpoint trial rejects the null hypothesis of no treatment effect, an intermediate-sized surrogate endpoint trial could be a useful next step in the decision-making process for launching a large true endpoint prevention trial.

Keywords: Cancer prevention, Cardiovascular disease, Prentice criterion, Principal stratification, Sample size calculation, Surrogate endpoint

1. Introduction

Searching for an effective treatment to prevent a chronic disease with low incidence in a middle-aged cohort, such as cancer or cardiovascular disease, is challenging because a definitive trial with a disease incidence endpoint requires a very large sample size of perhaps 20,000 or more. Before implementing such a large trial, investigators often seek evidence of treatment benefit from a small trial of perhaps 200 subjects using a surrogate endpoint observed before the true endpoint of disease incidence. For example, a recent trial of a drug to reduce the surrogate endpoint of adenoma occurrence involved a sample size of 267 (Thompson et al., 2010). Had this study used the true endpoint of colorectal cancer incidence or mortality, the sample size could have been as large as 70,000 (Atkin et al., 2010). As another example, a recent trial of a drug to reduce the surrogate endpoint of the occurrence of bronchial dysplasia involved a sample size of 100 (Lam et al., 2004). Had this study used the true endpoint of lung cancer incidence, the sample size could have been as large as 30,000 (The Alpha-Tocopherol, Beta Carotene Cancer Prevention Study Group, 1994). These surrogate endpoint trials have adequate power to detect an effect of treatment on a surrogate endpoint, and their small sample sizes are touted as an important advantage (Psaty et al., 1999). But is this the proverbial case of a free lunch that is not really free?

To answer this question, it is helpful to consider the primary goal of using a surrogate endpoint and the data available to achieve this goal. The following discussion of the use of surrogate endpoints is not a summary of the extensive statistical literature (e.g., Weir and Walley, 2006; Lassere, 2008) but focuses on key points in study design and their implications.

With treatment trials, the primary goal of using a surrogate endpoint is usually extrapolation, namely drawing conclusions about the effect of treatment on a true endpoint while shortening the duration of the trial. In this setting data are typically available from at least one historical trial with the same surrogate and true endpoints as the trial of interest. Typically a model constructed from these historical data is used to predict the effect of treatment on the true endpoint based on the surrogate endpoint in the trial of interest (e.g., Baker et al., 2012).

With the prevention trials discussed here, the primary goal of using a surrogate endpoint is usually to draw conclusions about the effect of treatment on a true endpoint using a much smaller sample size than with the true endpoint. (A secondary goal is shortening the duration of the study.) In terms of available data, typically there are no historical trials with the same surrogate and true endpoints as the trial of interest. With no data for predicting the treatment effect on the true endpoint, the focus is on hypothesis testing in order to justify a much larger definitive trial. In the hypothesis testing framework, Prentice (1989) defined a valid surrogate endpoint as a surrogate endpoint satisfying what we call the Extrapolation Assumption, namely that rejecting the null hypothesis of no treatment effect on a surrogate endpoint in favor of a beneficial treatment effect on the surrogate endpoint implies rejecting the null hypothesis of no treatment effect on a true endpoint in favor of a beneficial treatment effect on the true endpoint.

Using three models relating surrogate and true endpoints, we show that the link between the Extrapolation Assumption and the size of the surrogate endpoint trial explains why a small surrogate endpoint trial is particularly unreliable for drawing conclusions about the effect of treatment on the true endpoint.

2. Binary surrogate endpoint: mixture model

Let S = 0, 1 and T = 0, 1 denote binary surrogate and true endpoints, respectively, where outcome 0 is unfavorable and outcome 1 is favorable. For example, T = 0 (1) is incidence (no incidence) of lung cancer and S = 0 (1) is occurrence (non-occurrence) of bronchial dysplasia. Let Z = 0 (control), 1 (experimental) denote the randomization group. Let pz = pr(S=1 | Z=z), fz = pr(T=1 | Z=z), bs = pr(T=1 | S=s, Z=0), and cs = pr(T=1 | S=s, Z=1) − pr(T=1 | S=s, Z=0). The probabilities of the true endpoint are mixtures

$$f_0 = p_0\, b_1 + (1 - p_0)\, b_0 \tag{1}$$

$$f_1 = p_1\, (b_1 + c_1) + (1 - p_1)\, (b_0 + c_0) \tag{2}$$

Equations (1) and (2), which define the mixture model in Baker et al. (2012), imply f1 − f0 = (p1 − p0) (b1 − b0) + dMIX, where dMIX = p1 c1 + (1 − p1) c0.

The goal is to reject f1 = f0 in favor of f1 > f0. For a reasonable surrogate endpoint, b1 − b0 > 0. Therefore

$$p_1 - p_0 > 0 \;\Longrightarrow\; f_1 - f_0 > d_{MIX} \tag{3}$$

The Extrapolation Assumption says that p1 − p0 > 0 (namely rejecting the null hypothesis of no treatment effect on a surrogate endpoint) implies f1 − f0 > 0 (namely rejecting the null hypothesis of no treatment effect on a true endpoint); this is a special case of Equation (3) in which dMIX = 0. Thus if an investigator believes the Extrapolation Assumption holds, but in reality dMIX < 0, it is possible to conclude that treatment has a beneficial effect on the true endpoint when, in fact, a sufficiently negative dMIX makes the effect of treatment on the true endpoint detrimental. To put the magnitude of dMIX in perspective relative to the effect of treatment, we compute the relative error (as a percent), namely REMIX = 100 dMIX / (f1 − f0). The requirement for the Extrapolation Assumption that dMIX = 0 implies c0 = c1 = 0, which is called the Prentice Criterion, namely the probability of the true endpoint given the surrogate endpoint does not depend on randomization group. See also Buyse and Molenberghs (1998).
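As a rough numerical illustration of this decomposition, the sketch below evaluates the mixture model for one set of inputs and splits f1 − f0 into the part carried by the surrogate endpoint and dMIX. The function name `mixture_decomposition` and the input values are hypothetical, chosen only so that the surrogate endpoint improves while c1 < 0; they are not taken from the paper or from Table 1.

```python
def mixture_decomposition(p0, p1, b0, b1, c0=0.0, c1=0.0):
    """Evaluate the mixture model and split f1 - f0 into its two parts."""
    f0 = p0 * b1 + (1 - p0) * b0                  # control group
    f1 = p1 * (b1 + c1) + (1 - p1) * (b0 + c0)    # experimental group
    surrogate_part = (p1 - p0) * (b1 - b0)        # effect carried by S
    d_mix = p1 * c1 + (1 - p1) * c0               # effect not carried by S
    re_mix = 100 * d_mix / (f1 - f0)              # relative error, percent
    return {"f0": f0, "f1": f1, "surrogate_part": surrogate_part,
            "d_mix": d_mix, "re_mix": re_mix}


# Hypothetical inputs: the surrogate endpoint improves (p1 > p0) but c1 < 0.
out = mixture_decomposition(p0=0.3, p1=0.4, b0=0.0, b1=0.01, c1=-0.005)
effect = out["f1"] - out["f0"]
print(f"surrogate part = {out['surrogate_part']:+.4f}")
print(f"d_mix          = {out['d_mix']:+.4f}")
print(f"f1 - f0        = {effect:+.4f}")
```

With these illustrative inputs the surrogate part is positive while f1 − f0 is negative, which is the situation in which rejecting the null hypothesis for the surrogate endpoint would point in the wrong direction for the true endpoint.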

For a true endpoint trial, a two-sided type I error of 5%, and power of 90%, a standard formula for the sample size of each equal-sized randomization group (Halperin, et al. 1968) is

$$\mathrm{Size}(f_0, f_1) = \frac{(1.96 + 1.28)^2 \; 2 f_{avg}\, (1 - f_{avg})}{(f_1 - f_0)^2} \tag{4}$$

where favg = (f0 + f1)/2. The sample size for the surrogate endpoint trial with a two-sided type I error of 5% and power of 90% is Size(p0, p1) computed under the Prentice Criterion.

Consider a realistic example with f0 = 0.003 and f1 = 0.004 (Table 1). Realistic values of parameters under the Prentice Criterion can be obtained using proportional probabilities of endpoints, namely pz = R fz, which implies b0 = 0 and b1 = 1/R. For R = 100, a sample size of 73,300 for a true endpoint trial is reduced to 480, but the relative error for a very small deviation from the Prentice Criterion of c1 = −0.002 is −80%, a possibility of great concern. If 1 − pz is close to 1, Size(R p0, R p1) / Size(p0, p1) ≈ 1/R. For a deviation from the Prentice Criterion in only c1 (with c0 = 0), the relative error is REMIX(p1) = 100 p1 c1 / (f1 − f0), which implies REMIX(R p1) / REMIX(p1) = R. Thus, regardless of the values of f0 and f1, if the sample size decreases by a factor of approximately R, the relative error increases by a factor of R, in agreement with Table 1.
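The trade-off between sample size and relative error can be tabulated with a short sketch. It assumes a pooled-variance normal approximation for the per-group sample size, with exact normal quantiles in place of the rounded 1.96 and 1.28; this form is an assumption that reproduces the sample sizes quoted above after rounding, and it may differ in detail from the formula of Halperin et al. (1968). The helper name `size` and the grid of R values are illustrative.

```python
from statistics import NormalDist


def size(q0, q1, alpha=0.05, power=0.90):
    """Per-group sample size for two proportions (assumed pooled-variance form)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    q_avg = (q0 + q1) / 2
    return z ** 2 * 2 * q_avg * (1 - q_avg) / (q1 - q0) ** 2


f0, f1 = 0.003, 0.004   # true endpoint probabilities from the example
c1 = -0.002             # deviation from the Prentice Criterion (c0 = 0)

rows = []
for R in (1, 10, 100):
    p0, p1 = R * f0, R * f1                  # proportional surrogate probabilities
    n = size(p0, p1)                         # per-group sample size
    re_mix = 100 * p1 * c1 / (f1 - f0)       # relative error, percent
    rows.append((R, n, re_mix))
    print(f"R = {R:>3}: size ~ {n:>8,.0f}   relative error = {re_mix:6.1f}%")
```

Each tenfold increase in R cuts the sample size roughly tenfold and multiplies the relative error by ten, which is the scaling described in the paragraph above.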

Table 1
Relative errors and sample sizes.

3. Binary surrogate endpoint: principal stratification model

The principal stratification model for surrogate endpoints (Frangakis and Rubin, 2002) was originally formulated to estimate causal effects. Here we discuss its implications for hypothesis testing. Let S* denote the four principal strata: A (always), when S = 1 regardless of randomization assignment; C (consistent), when S = z; I (inconsistent), when S = 1 − z; and N (never), when S = 0 regardless of randomization assignment. Let ps* = pr(S*=s*), bs* = pr(T=1 | S*=s*, Z=0), and hs* = pr(T=1 | S*=s*, Z=1) − pr(T=1 | S*=s*, Z=0). By definition, p1 = pA + pC because S=1 for Z=1 only in principal strata A and C. Similarly p0 = pA + pI because S=1 for Z=0 only in principal strata A and I. Consequently p1 − p0 = pC − pI. Combining this result with

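$$f_0 = p_A\, b_A + p_C\, b_C + p_I\, b_I + p_N\, b_N \tag{5}$$

$$f_1 = p_A\, (b_A + h_A) + p_C\, (b_C + h_C) + p_I\, (b_I + h_I) + p_N\, (b_N + h_N) \tag{6}$$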

gives f1 − f0 = (p1 − p0) hC + dPS, where dPS = hA pA + hN pN + (hI + hC) pI. The relative error is REPS = 100 dPS / (f1 − f0). Here the Extrapolation Assumption, p1 − p0 > 0 implies f1 − f0 > 0, requires dPS = 0. An appealing scenario for dPS = 0 is what we call the PS Criterion: hA = hN = 0 and pI = 0, namely the probability of the true endpoint for principal strata A and N depends only on the level of the surrogate endpoint and not on randomization group, and there is no person for whom the level of the surrogate endpoint is “inconsistent” with the randomization group. The PS Criterion is analogous to the identifiability requirement in some principal stratification models (e.g., Baker et al., 2011).
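The intermediate algebra can be written out as follows. This is a sketch that assumes only the definitions of ps*, bs* and hs* given above, together with the fact that randomization makes the stratum probabilities ps* the same in both groups, so that each fz is a ps*-weighted average over the four strata.

```latex
\begin{align*}
f_1 - f_0 &= \sum_{s^* \in \{A, C, I, N\}} p_{s^*}\, h_{s^*}
           = h_A p_A + h_C p_C + h_I p_I + h_N p_N, \\
p_C &= (p_1 - p_0) + p_I
      \quad \text{(from } p_1 - p_0 = p_C - p_I \text{)}, \\
f_1 - f_0 &= (p_1 - p_0)\, h_C
           + \underbrace{h_A p_A + h_N p_N + (h_I + h_C)\, p_I}_{d_{PS}} .
\end{align*}
```

Substituting pC in the first line moves the term hC pI into the remainder, so the stratum I contribution to the remainder is (hI + hC) pI; it vanishes when pI = 0.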

The sample size formula for a true endpoint trial is Size(f0, f1). The sample size for the surrogate endpoint trial with a two-sided type I error of 5% and power of 90% is Size(p0, p1) computed under the PS Criterion.

Again let f0 = 0.003 and f1 = 0.004. Realistic values of parameters under the PS Criterion can be obtained using pz = R fz, which implies pA = f0 R, pC = (f1 − f0) R, and pN = 1 − f1 R. For R = 100, a sample size of 73,300 for a true endpoint trial is reduced to 480 (Table 1), and the relative error for a very small deviation from the PS Criterion of hA = −0.002 is −60%, a possibility of great concern. If 1 − pz is close to 1, Size(R p0, R p1) / Size(p0, p1) ≈ 1/R. For the deviation from the PS Criterion in only hA (with hN = pI = 0), the relative error is REPS(p0) = 100 hA pA / (f1 − f0) = 100 hA p0 / (f1 − f0), so REPS(R p0) / REPS(p0) = R. Thus, as with the mixture model, if the sample size decreases by a factor of approximately R, the relative error increases by a factor of R, in agreement with Table 1.
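The same arithmetic for the principal stratification model can be checked with a minimal sketch. It assumes the proportional setting pz = R fz with pI = 0 described above; the function name `ps_relative_error` and the grid of R values are illustrative.

```python
def ps_relative_error(f0, f1, R, hA=0.0, hN=0.0):
    """Relative error (percent) when p_z = R * f_z and pI = 0."""
    pA = R * f0              # "always" stratum
    pC = R * (f1 - f0)       # "consistent" stratum
    pN = 1 - R * f1          # "never" stratum
    if min(pA, pC, pN) < 0:
        raise ValueError("R is too large for these values of f0 and f1")
    d_ps = hA * pA + hN * pN  # pI = 0, so the stratum I term drops out
    return 100 * d_ps / (f1 - f0)


f0, f1, hA = 0.003, 0.004, -0.002
errors = {R: ps_relative_error(f0, f1, R, hA=hA) for R in (1, 10, 100)}
for R, re_ps in errors.items():
    print(f"R = {R:>3}: relative error = {re_ps:6.1f}%")
```

For R = 100 this gives the −60% quoted above, and the error shrinks in proportion to R as the surrogate endpoint becomes rarer and the surrogate trial correspondingly larger.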

4. Continuous surrogate endpoint: model for the mean

To investigate small surrogate endpoint trials with a continuous surrogate endpoint, let sz denote the mean value of a continuous surrogate endpoint for randomization group z, and let tz = logit(fz). Also let (σT)^2 and (σS)^2 denote the variance for a sample size of 1 of (t1 − t0) and (s1 − s0), respectively. Based on the delta method, (σT)^2 = 1 / {f0 (1 − f0)} + 1 / {f1 (1 − f1)}. A simple linear model with σT and σS as scale factors,

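$$\frac{t_0}{\sigma_T} = a_0 + b\, \frac{s_0}{\sigma_S} \tag{7}$$

$$\frac{t_1}{\sigma_T} = a_1 + (b + c)\, \frac{s_1}{\sigma_S} \tag{8}$$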

implies (t1 − t0) / σT = b (s1 − s0) / σS + dMEAN, where dMEAN = c (s1 / σS) + a1 − a0. The relative error is REMEAN = 100 dMEAN / {(t1 − t0) / σT}. Under this model with b > 0, s1 − s0 > 0 implies (t1 − t0) / σT > dMEAN. Here the Extrapolation Assumption, s1 − s0 > 0 implies t1 − t0 > 0, requires dMEAN = 0. An appealing scenario for dMEAN = 0 is what we call the Prentice Criterion for the Mean: a0 = a1 and c = 0, which implies the same effect of the mean surrogate endpoint on the true endpoint for each randomization group.

For the true endpoint trial with a two-sided type I error of 5% and power of 90%, the sample size of each equal-sized randomization group is Size*(t0, t1) = (1.96 + 1.28)^2 (σT)^2 / (t1 − t0)^2. The sample size for the surrogate endpoint trial with a two-sided type I error of 5% and power of 90% is Size*(s0, s1) computed under the Prentice Criterion for the Mean. Because this criterion gives (s1 − s0) / σS = (t1 − t0) / (b σT), we can write the surrogate endpoint sample size as Size*(b) = b^2 Size*(t0, t1), which is smaller than the true endpoint sample size when b < 1.

Again consider f0 = 0.003 and f1 = 0.004. The sample size for the true endpoint trial is 73,800. To obtain realistic parameter values, we consider sample sizes computed under the other models. For the sample size of 480 (Table 1), the relative error arising from a small deviation from the Prentice Criterion for the Mean of c = 0.002 is −46.1%, yet another possibility of great concern. Note that Size*(b / k) / Size*(b) = 1 / k^2. For the deviation from the Prentice Criterion for the Mean involving only c (with a0 = a1), REMEAN(b) = 100 t1 {c / (b + c)} / (t1 − t0), so REMEAN(b / k) / REMEAN(b) = (b + c) / (b / k + c) ≈ k for small c. Thus, if the sample size decreases by a factor of k^2, the relative error increases by a factor of approximately k, in agreement with Table 1.
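These figures can be approximately reproduced with the sketch below. It rests on several assumptions that should be read as such: exact normal quantiles are used in place of the rounded 1.96 and 1.28; the surrogate endpoint sample size is taken to scale as b^2 times the true endpoint sample size, which follows from (t1 − t0) / σT = b (s1 − s0) / σS under the Prentice Criterion for the Mean; and the closed form for the relative error is taken with a0 = a1 = 0. The names `logit`, `re_mean`, `size_true` and `n_surrogate` are illustrative.

```python
from math import log, sqrt
from statistics import NormalDist


def logit(f):
    return log(f / (1 - f))


f0, f1 = 0.003, 0.004
t0, t1 = logit(f0), logit(f1)
var_t = 1 / (f0 * (1 - f0)) + 1 / (f1 * (1 - f1))   # delta-method variance
z = NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.90)
size_true = z ** 2 * var_t / (t1 - t0) ** 2          # per-group size, true endpoint


def re_mean(b, c):
    """Relative error (percent) when only c deviates and a0 = a1 = 0."""
    return 100 * t1 * (c / (b + c)) / (t1 - t0)


n_surrogate = 480                      # surrogate trial size quoted in the text
b = sqrt(n_surrogate / size_true)      # assumes Size*(b) = b**2 * Size*(t0, t1)
ratio = re_mean(b / 2, 0.002) / re_mean(b, 0.002)

print(f"true endpoint size ~ {size_true:,.0f}")
print(f"b ~ {b:.4f}, relative error ~ {re_mean(b, 0.002):.1f}%")
print(f"relative error ratio for k = 2: {ratio:.2f}")
```

Under these assumptions the true endpoint sample size rounds to 73,800 and the relative error comes out at roughly −46%, close to the −46.1% quoted above; the small gap is consistent with 480 being a rounded sample size. Halving b, which cuts the surrogate sample size by a factor of four, slightly less than doubles the relative error.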

5. Discussion

The search for treatments to prevent cancer or cardiovascular disease involves preliminary evaluations using small surrogate endpoint trials, and as a practical matter this trend will likely continue in the genomic era in order to handle the likely explosion in potential hypotheses to be tested. Although a small surrogate endpoint trial is what is typically conducted, it has a greater potential than a large surrogate endpoint trial for a misleading conclusion that treatment has a beneficial effect on the true endpoint. This misleading conclusion could lead to a faulty decision about implementing a definitive trial with a true endpoint. The implications regarding expenditure of resources are enormous.

The focus here is on an incorrect conclusion to implement a large prevention trial after rejecting the null hypothesis in the surrogate endpoint study. It is also possible to draw an incorrect conclusion of not implementing a large prevention trial after not rejecting the null hypothesis in the surrogate endpoint study. However, this latter incorrect conclusion is of limited interest because the beneficial treatment effect would not likely be large, and researchers are looking for a large beneficial treatment effect to make a large prevention trial worthwhile.

If a small surrogate endpoint trial indicates a promising treatment, investigators should next investigate an intermediate-sized surrogate endpoint trial (with a more frequently occurring surrogate endpoint) before jumping to a very large, resource-intensive prevention trial with a true endpoint of a rare disease. Of course, the intermediate-sized surrogate endpoint trial is no guarantee of drawing a correct conclusion. Therefore other sources of evidence, such as observational studies and animal testing, need to be considered before implementing a large true endpoint prevention trial. Also, surrogate endpoint trials typically do not provide information about multiple endpoints and long-term side effects, which is another reason for caution.

Acknowledgment

This research was supported by the National Institutes of Health.

References

  • Atkin WS, Edwards R, Kralj-Hans I, Wooldrage K, Hart AR, Northover JM, Parkin DM, Wardle J, Duffy SW, Cuzick J, UK Flexible Sigmoidoscopy Trial Investigators. Once-only flexible sigmoidoscopy screening in prevention of colorectal cancer: a multicentre randomised controlled trial. Lancet. 2010;375:1624–1633.
  • Baker SG, Sargent DJ, Buyse M, Burzykowski T. Predicting treatment effect from surrogate endpoints and historical trials: an extrapolation involving probabilities of a binary outcome or survival to a specific time. Biometrics. 2012;68:248–257.
  • Baker SG, Lindeman KS, Kramer BS. Clarifying the role of principal stratification in the paired availability design. International Journal of Biostatistics. 2011;7:1.
  • Baker SG, Lindeman KS. The paired availability design: a proposal for evaluating epidural analgesia during labor. Statistics in Medicine. 1994;13:2269–2278.
  • Buyse M, Molenberghs G. The validation of surrogate endpoints in randomized experiments. Biometrics. 1998;54:1014–1029.
  • Frangakis CE, Rubin DB. Principal stratification in causal inference. Biometrics. 2002;58:21–29.
  • Halperin M, Rogot E, Gurian J, Ederer F. Sample sizes for medical trials with special reference to long-term therapy. Journal of Chronic Diseases. 1968;21:13–24.
  • Lam S, leRiche JC, McWilliams A, Macaulay C, Dyachkova Y, Szabo E, Mayo J, Schellenberg R, Coldman A, Hawk E, Gazdar A. A randomized phase IIb trial of pulmicort turbuhaler (budesonide) in people with dysplasia of the bronchial epithelium. Clinical Cancer Research. 2004;10:6502–6511.
  • Lassere MN. The Biomarker-Surrogacy Evaluation Schema: a review of the biomarker-surrogate literature and a proposal for a criterion-based, quantitative, multidimensional hierarchical levels of evidence schema for evaluating the status of biomarkers as surrogate endpoints. Statistical Methods in Medical Research. 2008;17:303–340.
  • Psaty BM, Weiss NS, Furberg CD, Koepsell TD, Siscovick DS, Rosendaal FR, Smith NL, Heckbert SR, Kaplan RC, Lin D, Fleming TR, Wagner EH. Surrogate end points, health outcomes, and the drug-approval process for the treatment of risk factors for cardiovascular disease. Journal of the American Medical Association. 1999;282:786–790.
  • Prentice RL. Surrogate endpoints in clinical trials: definitions and operational criteria. Statistics in Medicine. 1989;8:431–440.
  • The Alpha-Tocopherol, Beta Carotene Cancer Prevention Study Group. The effect of vitamin E and beta carotene on the incidence of lung cancer and other cancers in male smokers. New England Journal of Medicine. 1994;330:1029–1035.
  • Thompson PA, Wertheim BC, Zell JA, Chen WP, McLaren CE, LaFleur BJ, Meyskens FL, Gerner EW. Levels of rectal mucosal polyamines and prostaglandin E2 predict ability of DFMO and sulindac to prevent colorectal adenoma. Gastroenterology. 2010;139:797–805.
  • Weir CJ, Walley RJ. Statistical evaluation of biomarkers as surrogate endpoints: a literature review. Statistics in Medicine. 2006;25:183–203.