HHS Author Manuscripts | PMC3616635

Article sections

- Summary
- 1. Introduction
- 2. Binary surrogate endpoint: mixture model
- 3. Binary surrogate endpoint: principal stratification model
- 4. Continuous surrogate endpoint: model for the mean
- 5. Discussion
- References

J R Stat Soc Ser A Stat Soc. Author manuscript; available in PMC 2014 February 1.

Published in final edited form as:

J R Stat Soc Ser A Stat Soc. 2013 February 1; 176(2): 603–608.

Published online 2012 June 28. doi: 10.1111/j.1467-985X.2012.01052.x. PMCID: PMC3616635

NIHMSID: NIHMS369244

National Cancer Institute, Bethesda, USA

Address for correspondence: Stuart G. Baker, Biometry Research Group, Division of Cancer Prevention, National Cancer Institute, EPN 3118, 6130 Executive Blvd MSC 7354, Bethesda, MD 20892-7354, USA. Email: sb16i@nih.gov


The definitive evaluation of treatment to prevent a chronic disease with low incidence in middle age, such as cancer or cardiovascular disease, requires a trial with a large sample size of perhaps 20,000 or more. To help decide whether to implement a large true endpoint trial, investigators first typically estimate the effect of treatment on a surrogate endpoint in a trial with a greatly reduced sample size of perhaps 200 subjects. If investigators reject the null hypothesis of no treatment effect in the surrogate endpoint trial, they implicitly assume that they would likely correctly reject the null hypothesis of no treatment effect for the true endpoint. Surrogate endpoint trials are generally designed with adequate power to detect an effect of treatment on the surrogate endpoint. However, we show that a small surrogate endpoint trial is more likely than a large surrogate endpoint trial to give a misleading conclusion about the beneficial effect of treatment on the true endpoint, which can lead to a faulty (and costly) decision about implementing a large true endpoint prevention trial. If a small surrogate endpoint trial rejects the null hypothesis of no treatment effect, an intermediate-sized surrogate endpoint trial could be a useful next step in the decision-making process for launching a large true endpoint prevention trial.

Searching for an effective treatment to prevent a chronic disease with low incidence in a middle-aged cohort, such as cancer or cardiovascular disease, is challenging because a definitive trial with a disease incidence endpoint requires a very large sample size of perhaps 20,000 or more. Before implementing such a large trial, investigators often seek evidence of treatment benefit from a small trial of perhaps 200 subjects using a surrogate endpoint observed before the true endpoint of disease incidence. For example, a recent trial of a drug to reduce the surrogate endpoint of adenoma occurrence involved a sample size of 267 (Thompson *et al.*, 2010). Had this study used the true endpoint of colorectal cancer incidence or mortality, the sample size could have been as large as 70,000 (Atkin *et al.*, 2010). As another example, a recent trial of a drug to reduce the surrogate endpoint of the occurrence of bronchial dysplasia involved a sample size of 100 (Lam *et al.*, 2004). Had this study used the true endpoint of lung cancer incidence, the sample size could have been as large as 30,000 (The Alpha-Tocopherol, Beta Carotene Cancer Prevention Study Group, 1994). These surrogate endpoint trials have adequate power to detect an effect of treatment on a surrogate endpoint, and their small sample sizes are touted as an important advantage (Psaty *et al.*, 1999). But is this the proverbial case of a free lunch that is not really free?

To answer this question, it is helpful to consider the primary goal of using a surrogate endpoint and the data available to achieve this goal. The following discussion of the use of surrogate endpoints is not a summary of the extensive statistical literature (e.g., Weir and Walley, 2006; Lassere, 2008) but focuses on key points in study design and their implications.

With treatment trials, the primary goal of using a surrogate endpoint is usually extrapolation, namely drawing conclusions about the effect of treatment on a true endpoint while shortening the duration of the trial. In this setting, data are typically available from at least one historical trial with the same surrogate and true endpoints as the trial of interest. Typically, a model constructed from these historical data is used to predict the effect of treatment on the true endpoint based on the surrogate endpoint in the trial of interest (e.g., Baker *et al.*, 2012).

With the prevention trials discussed here, the primary goal of using a surrogate endpoint is usually to draw conclusions about the effect of treatment on a true endpoint using a much smaller sample size than with the true endpoint. (A secondary goal is shortening the duration of the study.) In terms of available data, typically there are no historical trials with the same surrogate and true endpoints as the trial of interest. With no data for predicting the treatment effect on the true endpoint, the focus is on hypothesis testing in order to justify a much larger definitive trial. In the hypothesis testing framework, Prentice (1989) defined a valid surrogate endpoint as a surrogate endpoint satisfying what we call the *Extrapolation Assumption*, namely that rejecting the null hypothesis of no treatment effect on a surrogate endpoint in favor of a beneficial treatment effect on the surrogate endpoint implies rejecting the null hypothesis of no treatment effect on a true endpoint in favor of a beneficial treatment effect on the true endpoint.

Using three models relating surrogate and true endpoints, we show that the link between the Extrapolation Assumption and the size of the surrogate endpoint trial explains why a small surrogate endpoint trial is particularly unreliable for drawing conclusions about the effect of treatment on the true endpoint.

Let *S* = 0, 1 and *T* = 0, 1 denote binary surrogate and true endpoints, respectively, where outcome 0 (1) is unfavorable (favorable). For example, *T* = 0 (1) is incidence (no incidence) of lung cancer and *S* = 0 (1) is occurrence (non-occurrence) of bronchial dysplasia. Let *Z* = 0 (control), 1 (experimental) denote the randomization group. Let *p*_{z} = pr(

$$f_{0} = p_{0}\, b_{1} + (1 - p_{0})\, b_{0}$$

(1)

$$f_{1} = p_{1}\, (b_{1} + c_{1}) + (1 - p_{1})\, (b_{0} + c_{0}).$$

(2)

Equations (1) and (2), which define the mixture model in Baker *et al.* (2012), imply *f*_{1} − *f*_{0} = (*p*_{1} − *p*_{0}) (*b*_{1} − *b*_{0}) + *d*_{MIX}, where

The goal is to reject *f*_{1} = *f*_{0} in favor of *f*_{1} > *f*_{0}. For a reasonable surrogate endpoint, *b*_{1} − *b*_{0} > 0. Therefore

$$p_{1} - p_{0} > 0 \ \text{ implies } \ f_{1} - f_{0} > d_{\mathit{MIX}}.$$

(3)
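The decomposition behind Equation (3) can be checked numerically. The following is a minimal sketch, not the authors' code: it evaluates Equations (1) and (2) directly and computes *d*_{MIX} as the residual *f*_{1} − *f*_{0} − (*p*_{1} − *p*_{0})(*b*_{1} − *b*_{0}), without assuming any closed form for it. The function names and all parameter values are illustrative assumptions, and the reading of *b*_{s} as a control-group probability and *c*_{s} as a treatment-group shift is inferred from the form of Equations (1) and (2).

```python
def true_endpoint_probs(p0, p1, b0, b1, c0=0.0, c1=0.0):
    """Return (f0, f1) as given by Equations (1) and (2)."""
    f0 = p0 * b1 + (1 - p0) * b0
    f1 = p1 * (b1 + c1) + (1 - p1) * (b0 + c0)
    return f0, f1


def d_mix(p0, p1, b0, b1, c0=0.0, c1=0.0):
    """Residual term: (f1 - f0) minus (p1 - p0) * (b1 - b0)."""
    f0, f1 = true_endpoint_probs(p0, p1, b0, b1, c0, c1)
    return (f1 - f0) - (p1 - p0) * (b1 - b0)


# Illustrative values only (assumptions, not taken from Table 1).
p0, p1, b0, b1 = 0.3, 0.4, 0.0, 0.01

# No shift terms: the residual vanishes, so p1 > p0 carries over to f1 > f0.
f0, f1_no_shift = true_endpoint_probs(p0, p1, b0, b1)
d_no_shift = d_mix(p0, p1, b0, b1)

# A small negative shift c1: the surrogate still improves (p1 > p0),
# but the residual is negative and large enough to push f1 below f0.
_, f1_shift = true_endpoint_probs(p0, p1, b0, b1, c1=-0.003)
d_shift = d_mix(p0, p1, b0, b1, c1=-0.003)

print(f"no shift: f1 - f0 = {f1_no_shift - f0:+.4f}, d_MIX = {d_no_shift:+.4f}")
print(f"shift:    f1 - f0 = {f1_shift - f0:+.4f}, d_MIX = {d_shift:+.4f}")
```

With these assumed values, a shift of only −0.003 in one conditional probability is enough to reverse the sign of the true endpoint effect even though the surrogate effect is unchanged.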

The Extrapolation Assumption says that *p*_{1} − *p*_{0} > 0 (namely rejecting the null hypothesis of no treatment effect on a surrogate endpoint) implies *f*_{1} − *f*_{0} > 0 (namely rejecting the null hypothesis of no treatment effect on a true endpoint); this is a special case of Equation (3) in which *d*_{MIX} = 0. Thus if an investigator believes the Extrapolation Assumption holds, but in reality

For a true endpoint trial with a two-sided type I error of 5% and power of 90%, a standard formula for the sample size of each equal-sized randomization group (Halperin *et al.*, 1968) is

$$\text{Size}(f_{0}, f_{1}) = \frac{\left[1.96\,\{f_{1}(1 - f_{1}) + f_{0}(1 - f_{0})\}^{1/2} + 1.28\,\{2\, f_{avg}(1 - f_{avg})\}^{1/2}\right]^{2}}{(f_{1} - f_{0})^{2}},$$

(4)

where *f*_{avg} = (

Consider a realistic example with *f*_{0} = 0.003 and *f*_{1} = 0.004 (Table 1). Realistic values of parameters under the Prentice Criterion can be obtained using proportional probabilities of endpoints, namely *p*_{z} =

The principal stratification model for surrogate endpoints (Frangakis and Rubin, 2002) was originally formulated to estimate causal effects. Here we discuss its implications for hypothesis testing. Let *S** denote the four principal strata: *A* (always), when *S* = 1 regardless of randomization assignment; *C* (consistent), when *S* = *z*; *I* (inconsistent), when *S* = 1 − *z*; and *N* (never), when *S* = 0 regardless of randomization assignment. Let *p*_{s*} = pr(*S** = *s**), *b*_{s*} = pr(*T* = 1 | *S** = *s**, *Z* = 0), and *h*_{s*} = pr(*T* = 1 | *S** = *s**, *Z* = 1) − pr(*T* = 1 | *S** = *s**, *Z* = 0). By definition, *p*_{1} = *p*_{A} +

$$f_{0} = p_{A}\, b_{A} + p_{C}\, b_{C} + p_{I}\, b_{I} + p_{N}\, b_{N},$$

(5)

$$f_{1} = p_{A}\, (b_{A} + h_{A}) + p_{I}\, (b_{I} + h_{I}) + p_{C}\, (b_{C} + h_{C}) + p_{N}\, (b_{N} + h_{N}),$$

(6)

gives *f*_{1} − *f*_{0} = (*p*_{1} − *p*_{0}) *h*_{C} +

The sample size formula for a true endpoint trial is *Size*(*f*_{0}, *f*_{1}). The sample size for the surrogate endpoint trial with a two-sided type I error of 5% and power of 90% is *Size*(*p*_{0}, *p*_{1}) computed under the PS Criterion.

Again let *f*_{0} = 0.003 and *f*_{1} = 0.004. Realistic values of parameters under the PS Criterion can be obtained using *p*_{z} = *R f*_{z}, which implies *p*_{A} = *f*_{0} *R*, *p*_{C} = (*f*_{1} − *f*_{0}) *R*, and *p*_{N} = 1 − *f*_{1} *R*. For *R* = 100, a sample size of 73,300 for a true endpoint trial is reduced to 480 (Table 1), and the relative error for a very small deviation from the PS Criterion of *h*_{A} = −0.002 is −60%, a possibility of great concern. If 1 − *p*_{z} is close to 1, *Size*(*R p*_{0}, *R p*_{1}) / *Size*(*p*_{0}, *p*_{1}) ≈ 1/*R*. For the deviation from the PS Criterion in only *h*_{A} (with *h*_{N} = *p*_{I} = 0), the relative error is *RE*_{PS}(*p*_{0}) = 100 *h*_{A} *p*_{A} / (*f*_{1} − *f*_{0}) = 100 *h*_{A} *p*_{0} / (*f*_{1} − *f*_{0}), so *RE*_{PS}(*R p*_{0}) / *RE*_{PS}(*p*_{0}) = *R*. Thus, as with the mixture model, if the sample size decreases by a factor of approximately *R*, the relative error increases by a factor of *R*, in agreement with Table 1.
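The sample sizes and relative error quoted above can be roughly reproduced with a short calculation. This is a sketch under stated assumptions rather than the authors' computation: `size` transcribes Equation (4) as written, takes *f*_{avg} to be the simple average (*f*_{0} + *f*_{1})/2, and uses exact normal quantiles in place of the rounded 1.96 and 1.28. The function and variable names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def size(f0, f1, alpha=0.05, power=0.90):
    """Per-group sample size, following the form of Equation (4)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # roughly 1.96
    z_beta = NormalDist().inv_cdf(power)           # roughly 1.28
    f_avg = (f0 + f1) / 2  # assumption: f_avg is the simple average
    unpooled = sqrt(f1 * (1 - f1) + f0 * (1 - f0))
    pooled = sqrt(2 * f_avg * (1 - f_avg))
    return (z_alpha * unpooled + z_beta * pooled) ** 2 / (f1 - f0) ** 2


def re_ps(h_a, p_a, f0, f1):
    """Relative error (percent) for a deviation in h_A only."""
    return 100 * h_a * p_a / (f1 - f0)


f0, f1, R = 0.003, 0.004, 100

n_true = size(f0, f1)               # true endpoint trial
n_surrogate = size(R * f0, R * f1)  # surrogate trial with p_z = R * f_z
rel_err = re_ps(h_a=-0.002, p_a=R * f0, f0=f0, f1=f1)

print(f"true endpoint trial:      {n_true:,.0f} per group")
print(f"surrogate endpoint trial: {n_surrogate:,.0f} per group")
print(f"relative error:           {rel_err:.0f}%")
```

Under these assumptions the true endpoint figure lands close to the 73,300 in the text and the surrogate figure in the neighborhood of 480; small differences are expected from rounding and from the choice of quantiles.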

To investigate small surrogate endpoint trials with a continuous surrogate endpoint, let *s*_{z} denote the mean value of a continuous surrogate endpoint for randomization group *z,* and let *t*_{z} = logit(*f*_{z}). Also let (σ_{T})^{2} and (σ_{S})^{2} denote the variance for a sample size of 1 of (*t*_{1} − *t*_{0}) and (*s*_{1} − *s*_{0}), respectively. Based on the delta method, (σ_{T})^{2} = 1 /{*f*_{0} (1−*f*_{0})} + 1/{*f*_{1} (1− *f*_{1})}. A simple linear model with σ_{T} and σ_{S} as scale factors,

$$(t_{0} / \sigma_{T}) = a_{0} + b\, (s_{0} / \sigma_{S})$$

(7)

$$(t_{1} / \sigma_{T}) = a_{1} + (b + c)\, (s_{1} / \sigma_{S})$$

(8)

implies (*t*_{1} − *t*_{0} )/ σ_{T} = *b* (*s*_{1} − *s*_{0}) /σ_{S} + *d*_{MEAN}, where *d*_{MEAN}= *c* (*s*_{1} /σ_{S})+ *a*_{1} − *a*_{0}. The relative error is *RE*_{MEAN} = 100 *d*_{MEAN} / {(*t*_{1} − *t*_{0}) / σ_{T}}. Under this model *s*_{1} − *s*_{0} > 0 implies *t*_{1} − *t*_{0} > *d*_{MEAN}. Here the Extrapolation Assumption, *s*_{1} − *s*_{0} > 0 implies *t*_{1} − *t*_{0} >0, requires *d*_{MEAN} =0. An appealing scenario for *d*_{MEAN} =0 is what we call the *Prentice Criterion for the Mean: a*_{0} = *a*_{1} and *c* = 0, which implies the same effect of the mean surrogate endpoint on the true endpoint for each randomization group.

For the true endpoint trial with a two-sided type I error of 5% and power of 90%, the sample size of each equal-sized randomization group is *Size**(*t*_{0}, *t*_{1}) = (1.96 + 1.28)^{2} (σ_{T})^{2} / (*t*_{1} − *t*_{0})^{2}. The sample size for the surrogate endpoint trial with a two-sided type I error of 5% and power of 90% is *Size**(*s*_{0}, *s*_{1}) = (1.96 + 1.28)^{2} (σ_{S})^{2} / (*s*_{1} − *s*_{0})^{2} computed under the Prentice Criterion for the Mean. Because (*t*_{1} − *t*_{0}) / σ_{T} = *b* (*s*_{1} − *s*_{0}) / σ_{S} under this criterion, we can write it as *Size**(*b*) = *b*^{2} *Size**(*t*_{0}, *t*_{1}).

Again consider *f*_{0} = 0.003 and *f*_{1} = 0.004. The sample size for the true endpoint trial is 73,800. To obtain realistic parameter values, we consider sample sizes computed under the other models. For the sample size of 480 (Table 1), the relative error arising from a small deviation from the Prentice Criterion for the Mean of *c* = 0.002 is −46.1%, yet another possibility of great concern. Note that *Size**(*b/k*) / *Size**(*b*) = 1/*k*^{2}. For the deviation from the Prentice Criterion for the Mean involving only *c* (with *a*_{0} = *a*_{1}), *RE*_{MEAN}(*b*) = 100 *t*_{1} {*c* / (*b* + *c*)} / (*t*_{1} − *t*_{0}), so *RE*_{MEAN}(*b/k*) / *RE*_{MEAN}(*b*) = (*b* + *c*) / (*b/k* + *c*) ≈ *k* for small *c*. Thus, if the sample size decreases by a factor of *k*^{2}, the relative error increases by a factor of approximately *k*, in agreement with Table 1.
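The logit-scale calculation can be sketched in the same way. The code below is illustrative only: `size_star` implements the formula for *Size**(*t*_{0}, *t*_{1}) with the delta-method variance given earlier, and `re_mean` implements the expression for *RE*_{MEAN}(*b*). The value *b* = 0.081 is an assumption chosen because it gives a relative error near the −46.1% quoted in the text; it is not a value stated by the authors. All names are hypothetical.

```python
from math import log
from statistics import NormalDist


def logit(f):
    return log(f / (1 - f))


def size_star(f0, f1, alpha=0.05, power=0.90):
    """Per-group sample size on the logit scale, Size*(t0, t1)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    # Delta-method variance of (t1 - t0) for a sample size of 1.
    var_t = 1 / (f0 * (1 - f0)) + 1 / (f1 * (1 - f1))
    return z ** 2 * var_t / (logit(f1) - logit(f0)) ** 2


def re_mean(b, c, f0, f1):
    """Relative error (percent) for a deviation in c only."""
    t0, t1 = logit(f0), logit(f1)
    return 100 * t1 * (c / (b + c)) / (t1 - t0)


f0, f1 = 0.003, 0.004
b, c, k = 0.081, 0.002, 2.0  # b and k are illustrative assumptions

n_true_logit = size_star(f0, f1)
rel_err = re_mean(b, c, f0, f1)
# Shrinking b by a factor k inflates the relative error by roughly k.
inflation = re_mean(b / k, c, f0, f1) / rel_err

print(f"true endpoint trial (logit scale): {n_true_logit:,.0f} per group")
print(f"relative error at b = {b}: {rel_err:.1f}%")
print(f"error inflation for k = {k}: {inflation:.2f}")
```

The inflation factor equals (*b* + *c*) / (*b*/*k* + *c*), which approaches *k* as *c* becomes small relative to *b*.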

The search for treatments to prevent cancer or cardiovascular disease involves preliminary evaluations using small surrogate endpoint trials, and as a practical matter this trend will likely continue in the genomic era in order to handle the likely explosion in potential hypotheses to be tested. Although a small surrogate endpoint trial is typically conducted, it has a greater potential than a large surrogate endpoint trial for a misleading conclusion that treatment has a beneficial effect on the true endpoint. This misleading conclusion could lead to a faulty decision about implementing a definitive trial with a true endpoint. The implications regarding expenditure of resources are enormous.

The focus here is on an incorrect conclusion to implement a large prevention trial after rejecting the null hypothesis in the surrogate endpoint study. It is also possible to draw an incorrect conclusion of not implementing a large prevention trial after not rejecting the null hypothesis in the surrogate endpoint study. However, this latter incorrect conclusion is of limited interest because the beneficial treatment effect would not likely be large, and researchers are looking for a large beneficial treatment effect to make a large prevention trial worthwhile.

If a small surrogate endpoint trial indicates a promising treatment, investigators should next investigate an intermediate-sized surrogate endpoint trial (with a more frequently occurring surrogate endpoint) before jumping to a very large, resource-intensive prevention trial with a true endpoint of a rare disease. Of course, the intermediate-sized surrogate endpoint trial is no guarantee of drawing a correct conclusion. Therefore, other sources of evidence, such as observational studies and animal testing, need to be considered before implementing a large true endpoint prevention trial. Also, surrogate endpoint trials typically do not provide information about multiple endpoints and long-term side effects, which is another reason for caution.

This research was supported by the National Institutes of Health.

- Atkin WS, Edwards R, Kralj-Hans I, Wooldrage K, Hart AR, Northover JM, Parkin DM, Wardle J, Duffy SW, Cuzick J, UK Flexible Sigmoidoscopy Trial Investigators. Once-only flexible sigmoidoscopy screening in prevention of colorectal cancer: a multicentre randomised controlled trial. Lancet. 2010;375:1624–1633.
- Baker SG, Sargent DJ, Buyse M, Burzykowski T. Predicting treatment effect from surrogate endpoints and historical trials: an extrapolation involving probabilities of a binary outcome or survival to a specific time. Biometrics. 2012;68:248–257.
- Baker SG, Lindeman KS, Kramer BS. Clarifying the role of principal stratification in the paired availability design. International Journal of Biostatistics. 2011;7:1.
- Baker SG, Lindeman KS. The paired availability design: a proposal for evaluating epidural analgesia during labor. Statistics in Medicine. 1994;13:2269–2278.
- Buyse M, Molenberghs G. The validation of surrogate endpoints in randomized experiments. Biometrics. 1998;54:1014–1029.
- Frangakis CE, Rubin DB. Principal stratification in causal inference. Biometrics. 2002;58:21–29.
- Halperin M, Rogot E, Gurian J, Ederer F. Sample sizes for medical trials with special reference to long-term therapy. Journal of Chronic Diseases. 1968;21:13–24.
- Lam S, leRiche JC, McWilliams A, Macaulay C, Dyachkova Y, Szabo E, Mayo J, Schellenberg R, Coldman A, Hawk E, Gazdar A. A randomized phase IIb trial of pulmicort turbuhaler (budesonide) in people with dysplasia of the bronchial epithelium. Clinical Cancer Research. 2004;10:6502–6511.
- Lassere MN. The Biomarker-Surrogacy Evaluation Schema: a review of the biomarker-surrogate literature and a proposal for a criterion-based, quantitative, multidimensional hierarchical levels of evidence schema for evaluating the status of biomarkers as surrogate endpoints. Statistical Methods in Medical Research. 2008;17:303–340.
- Psaty BM, Weiss NS, Furberg CD, Koepsell TD, Siscovick DS, Rosendaal FR, Smith NL, Heckbert SR, Kaplan RC, Lin D, Fleming TR, Wagner EH. Surrogate end points, health outcomes, and the drug-approval process for the treatment of risk factors for cardiovascular disease. Journal of the American Medical Association. 1999;282:786–790.
- Prentice RL. Surrogate endpoints in clinical trials: definitions and operational criteria. Statistics in Medicine. 1989;8:431–440.
- The Alpha-Tocopherol, Beta Carotene Cancer Prevention Study Group. The effect of vitamin E and beta carotene on the incidence of lung cancer and other cancers in male smokers. New England Journal of Medicine. 1994;330:1029–1035.
- Thompson PA, Wertheim BC, Zell JA, Chen WP, McLaren CE, LaFleur BJ, Meyskens FL, Gerner EW. Levels of rectal mucosal polyamines and prostaglandin E2 predict ability of DFMO and sulindac to prevent colorectal adenoma. Gastroenterology. 2010;139:797–805.
- Weir CJ, Walley RJ. Statistical evaluation of biomarkers as surrogate endpoints: a literature review. Statistics in Medicine. 2006;25:183–203.
