Surrogate endpoints offer the hope of smaller or shorter cancer trials. It is, however, important to realize they come at the cost of an unverifiable extrapolation that could lead to misleading conclusions. With cancer prevention, the focus is on hypothesis testing in small surrogate endpoint trials before deciding whether to proceed to a large prevention trial. However, it is not generally appreciated that a small surrogate endpoint trial is highly sensitive to a deviation from the key Prentice criterion needed for the hypothesis-testing extrapolation. With cancer treatment, the focus is on estimation using historical trials with both surrogate and true endpoints to predict treatment effect based on the surrogate endpoint in a new trial. Successively leaving out one historical trial and computing the predicted treatment effect in the left-out trial yields a standard error multiplier that summarizes the increased uncertainty in estimation extrapolation. If this increased uncertainty is acceptable, three additional extrapolation issues (biological mechanism, treatment following observation of the surrogate endpoint, and side effects following observation of the surrogate endpoint) need to be considered. In summary, when using surrogate endpoint analyses, an appreciation of the problems of extrapolation is crucial.
A surrogate endpoint, such as a biomarker, is an endpoint observed sooner than a true endpoint, such as cancer or mortality, and is used to make conclusions about the effect of an intervention on true endpoint. Consequently, surrogate endpoints offer the hope of making results available sooner and at reduced costs. However, there is no free lunch. Surrogate endpoint analyses involve an extrapolation to an unobserved effect of intervention on the true endpoint, and no statistical approach can fully address the unknown nature of this extrapolation. A bewildering variety of statistical methods have been proposed for analyzing trials with surrogate endpoints (1–12), and it is easy to lose an appreciation of the extrapolation amid the mathematics. The focus here is on two simple methods for surrogate endpoint analysis that elucidate the potential cost of extrapolation.
Importantly, the type of surrogate endpoint analysis depends on the application, with considerable differences between cancer prevention trials (13) and cancer treatment trials (14). Surrogate endpoints also play a role in double sampling trials, in which surrogate endpoints are observed in all trial participants and true endpoints are observed in a random sample of participants (15–20). In double sampling, the use of a random sampling to observe some true endpoints provides a firm basis for drawing conclusions without problems of extrapolation. In the situations discussed in this article, the investigators are more ambitious in drawing conclusions because they have no data on the true endpoint in the trial of interest and the validity of an extrapolation is a crucial consideration.
A typical definitive cancer prevention trial to study the effect of an intervention on the true endpoint of cancer incidence among healthy persons may require a sample size in the tens of thousands. Before committing time and resources to such a large trial, investigators seek preliminary evidence that the intervention to prevent cancer will likely be beneficial. A frequent source of evidence is a small trial with a surrogate endpoint (21). The choice of surrogate endpoint for cancer incidence depends on the current understanding of cancer biology (22). Examples of candidate surrogate endpoints include measurements or indicators based on tumor-associated gene expression or function; circulating blood biomarkers, such as hormone levels or cellular morphology; markers of cell proliferation; and tissue changes, such as onset of adenomas (23). The biological link may or may not be solid. However, the focus here is on statistical issues. Because it is unusual to find any previous prevention trial that measures the surrogate endpoints of interest, there are no data for constructing a model to estimate the effect of an intervention on cancer incidence based on the surrogate endpoint. In this setting, surrogate endpoint analyses are based on what we call hypothesis-testing extrapolation—namely, rejecting the null hypothesis that intervention has no effect on the surrogate endpoint implies rejecting the null hypothesis that intervention has no effect on true endpoint.
It is not uncommon for a surrogate endpoint trial for cancer prevention to be less than 1% the size of the corresponding prevention trial with a true endpoint. For example, a trial that involved a surrogate endpoint of bronchial dysplasia had a sample size of 267 (24), whereas a trial that involved a true endpoint of lung cancer incidence among healthy persons had a sample size of 70 000 (25). Because a small surrogate endpoint trial usually has adequate power to detect a specified reduction in a surrogate endpoint, investigators rarely question its relevance to a large prevention trial with a true endpoint. This may partly arise from the seductive assumption that the underlying biology is fully understood. However, as discussed below, there is no free lunch when it comes to hypothesis-testing extrapolation.
The problem with drawing conclusions from a small surrogate endpoint trial for cancer prevention is that hypothesis-testing extrapolation can be misleading because of sensitivity to violations of a key assumption (13). That assumption is the Prentice criterion: the probability of true endpoint given the surrogate endpoint is the same in both randomization groups. This criterion is named after Ross Prentice, who discussed requirements for hypothesis-testing extrapolation (26). In the case of a binary surrogate endpoint, the two requirements for valid hypothesis-testing extrapolation are 1) the Prentice criterion and 2) an association between the probabilities of true and surrogate endpoints within each randomization group (27). Sometimes an additional “requirement” is listed, namely, that the probabilities of surrogate and true endpoints depend on the intervention; however, this is not really a requirement but rather part of the statement of hypothesis-testing extrapolation.
Insight into these requirements is provided by Figures 1 and 2, which display BK plots (28–30) in a modified form that is related to a plot for continuous surrogate and true endpoints that shows that a “perfect correlate does not a surrogate make” (31). The horizontal axis is the probability of the surrogate endpoint (at one level). The vertical axis is the probability of the true endpoint (at the corresponding level). Points are labeled as C for control group and E for experimental group. The diagonal lines connect the probabilities of true endpoint for the two levels of the surrogate endpoint. The first requirement, the Prentice criterion, translates into identical diagonal lines for each randomization group, as in Figure 1. The second requirement says the diagonal lines are, in fact, not flat. The additional mistaken “requirement” says points C and E are distinct.
To understand the graphical implications of these requirements for hypothesis-testing extrapolation, it is necessary to explain the relationship between corresponding points on the vertical and horizontal axes. Point C on the vertical axis (the probability of true endpoint in the control group) is graphically computed by connecting a vertical line from point C on the horizontal axis (the probability of surrogate endpoint in the control group) to the diagonal line for the control group and drawing a horizontal line to the left. A similar algorithm applies for computing point E on the vertical axis from point E on the horizontal axis. If the requirements for hypothesis-testing extrapolation hold (a single diagonal line), then the true result (the difference between vertical points E and C) is proportional to the surrogate result (the difference between horizontal points E and C). Consequently, hypothesis-testing extrapolation holds because any surrogate result greater than zero implies a true result greater than zero.
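The proportionality described above follows from the law of total probability for a binary surrogate endpoint. The short derivation below is a sketch; the symbols (T for the true endpoint, S for the surrogate endpoint, z for the randomization group, and a and b for the intercept and slope of the diagonal line) are notation introduced here for illustration and do not appear in the original figures.

```latex
% For randomization group z in {C, E} and binary surrogate S in {0, 1}:
\Pr(T=1 \mid z) = \Pr(T=1 \mid S=0, z)
  + \bigl[\Pr(T=1 \mid S=1, z) - \Pr(T=1 \mid S=0, z)\bigr]\,\Pr(S=1 \mid z).

% Prentice criterion: \Pr(T=1 \mid S=s, z) does not depend on z, so both groups
% share one intercept a = \Pr(T=1 \mid S=0) and one slope
% b = \Pr(T=1 \mid S=1) - \Pr(T=1 \mid S=0):
\Pr(T=1 \mid z) = a + b\,\Pr(S=1 \mid z).

% Subtracting the control group from the experimental group:
\Pr(T=1 \mid E) - \Pr(T=1 \mid C) = b\,\bigl[\Pr(S=1 \mid E) - \Pr(S=1 \mid C)\bigr].
```

With a positive slope b (the second requirement rules out b = 0), a surrogate result greater than zero therefore gives a true result greater than zero.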
A deviation from the Prentice criterion can be specified as a difference in the probabilities of true endpoint given surrogate endpoint between the two randomization groups. In Figure 2, this deviation translates into different slopes for the diagonal lines for the two groups. The point “assumed E” is the probability of true endpoint in the experimental group if investigators incorrectly assumed the Prentice criterion held and believed the diagonal line for the experimental group coincided with the diagonal line for the control group. The left side of Figure 2 shows a slope of 1 for the control group, which corresponds to a large surrogate endpoint trial with the same size as the true endpoint trial. A small deviation of −0.007 in the Prentice criterion gives a slope of 0.993 for the experimental group. Because point assumed E is close to point E, hypothesis-testing extrapolation approximately holds. The right side of Figure 2 shows a slope of 0.100 for the control group (note that the scale of the horizontal axis differs from that of the left side), which corresponds to a small surrogate endpoint trial about one-tenth the size of the true endpoint trial. The same small deviation of −0.007 in the Prentice criterion gives a slope of 0.093 for the experimental group. Because the point assumed E is far from point E (in fact on the opposite side of C), hypothesis-testing extrapolation can be seriously misleading.
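The geometry of the two panels can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the calculation behind Figure 2: the slopes (1 and 0.993, 0.100 and 0.093) are taken from the text, whereas the surrogate probabilities of 0.50 and 0.53, the shared intercept of zero, and the function name are hypothetical choices made only for illustration.

```python
def extrapolated_results(slope_control, slope_experimental,
                         s_control, s_experimental, intercept=0.0):
    """Return (assumed true result, actual true result).

    Assumption: both diagonal lines share the same intercept and differ
    only in slope, so the deviation from the Prentice criterion is the
    difference between the two slopes.
    """
    true_c = intercept + slope_control * s_control
    # "Assumed E": experimental group read off the control group's line.
    assumed_e = intercept + slope_control * s_experimental
    # Actual E: experimental group read off its own line.
    actual_e = intercept + slope_experimental * s_experimental
    return assumed_e - true_c, actual_e - true_c


# Hypothetical surrogate probabilities in the control and experimental groups.
S_C, S_E = 0.50, 0.53

# Left panel: large surrogate endpoint trial (control slope of 1).
large_assumed, large_actual = extrapolated_results(1.0, 0.993, S_C, S_E)

# Right panel: small surrogate endpoint trial (control slope of 0.100).
small_assumed, small_actual = extrapolated_results(0.100, 0.093, S_C, S_E)

print(f"large trial: assumed {large_assumed:+.5f}, actual {large_actual:+.5f}")
print(f"small trial: assumed {small_assumed:+.5f}, actual {small_actual:+.5f}")
```

Under these hypothetical inputs the assumed and actual true results have the same sign with the steep line, but opposite signs with the shallow line, which is the situation in which point E falls on the opposite side of C.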
The impact of a small deviation from the Prentice criterion is quantified by the relative error, which is the error in the estimated effect of intervention on the true endpoint (the distance between assumed E and E on the vertical axis) as a percentage of the true result investigators hope to detect. Based on previous calculations that involved a sample size of 73 300 for a trial with a true endpoint, a very small deviation in the Prentice criterion of −0.002 is consistent with a relative error of −0.8%, −8%, and −80% for a surrogate endpoint trial of size 73 300, 7100, and 496, respectively (14). In other words, with a small trial of size 496, the relative error is an important concern. With an intermediate-sized trial of 7100, the relative error is only a moderate concern. With a large trial of 73 300, the relative error is a minor concern, but this defeats the purpose of using the surrogate endpoint. Although the focus of this discussion has been on the Prentice criterion with a binary surrogate endpoint, qualitatively similar results are obtained with a Prentice criterion for the mean of a continuous surrogate endpoint and for a related criterion that involves a principal stratification model with binary surrogate endpoint (32).
The low tolerance for a small deviation from the Prentice criterion is the fundamental problem with small sample sizes for surrogate endpoint trials. As an analogy, suppose you are tossing two coins, a small coin weighing 1g and a large coin weighing 100g. It is a rainy day, and mud weighing 0.2g sticks to one side of each coin when it lands on the ground. In subsequent tosses, this extra mud has a larger effect on the probability of heads with the small coin than with the large coin.
Hypothesis testing from a small surrogate endpoint trial is a major component in making decisions about implementing a definitive trial with an endpoint of cancer incidence. The above results lead to the following recommendation. Before jumping directly from a statistically significant result in a small surrogate endpoint trial to implementation of a large prevention trial with a true endpoint of cancer incidence, a moderately sized surrogate endpoint trial should be implemented to reduce the possibility of misleading hypothesis-testing extrapolation. Implementing the moderately sized surrogate endpoint trial would require finding a surrogate endpoint that occurs less often than the surrogate endpoint in the initial small surrogate endpoint trial. However, even if the moderately sized surrogate endpoint trial yields a promising result, other types of evidence, such as any results from mechanistic studies, experimental studies in animals, or observational studies in humans, should also factor into the decision of whether to launch a large prevention trial. Also a larger deviation from the Prentice criterion than specified in this example could render the results from even a moderately sized surrogate endpoint trial misleading.
Surrogate endpoints play a different role in cancer treatment trials than in cancer prevention trials. With treatment trials, the main purpose of using the surrogate endpoint is usually to shorten the duration of the trial, sometimes with an eye toward drug approval. Unlike with prevention trials, data are typically available from one or more historical trials with the same surrogate and true endpoints as in the new trial. The goal is to predict the effect of treatment on true endpoint in a new trial based on the surrogate endpoints in the new trial and a prediction model that relates surrogate and true endpoints derived from historical trials, a procedure we call estimation extrapolation. The focus here is on one relatively simple method that highlights the extrapolation.
A major challenge in formulating a prediction model is labeling control and experimental groups. A problem is that an experimental treatment in one historical trial may be a control treatment in another historical trial (33). Also, there may be more than two randomization groups in a trial (14). The following two-part strategy addresses this challenge. First, the randomization group with smallest (largest) estimated probability of the favorable surrogate endpoint is labeled as the control (experimental) group (14). This procedure assumes that the treatment effect for the surrogate endpoint is in the same direction as the treatment effect for the true endpoint. Second, the prediction model specifies the Prentice criterion—namely, the true result is proportional to the surrogate result—so that if mislabeling changes the sign of both results, the model is unchanged (33). Importantly, the rationale for using the Prentice criterion differs between estimation extrapolation and hypothesis-testing extrapolation.
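The two-part labeling strategy can be sketched as follows. This is an illustrative sketch only: the arm names, the probabilities, and the helper names are hypothetical, and the one-trial slope calculation is merely a stand-in for the proportional prediction model to show why flipping the labels leaves that model unchanged.

```python
def label_groups(surrogate_prob):
    """Label randomization groups from estimated surrogate probabilities.

    surrogate_prob maps a group name to its estimated probability of the
    favorable surrogate endpoint. The group with the smallest probability
    is labeled control and the group with the largest is labeled
    experimental.
    """
    control = min(surrogate_prob, key=surrogate_prob.get)
    experimental = max(surrogate_prob, key=surrogate_prob.get)
    return control, experimental


def proportionality_slope(surrogate_result, true_result):
    """Slope of a line through the origin: true result = slope * surrogate result."""
    return true_result / surrogate_result


# Hypothetical three-arm historical trial.
arms = {"arm A": 0.42, "arm B": 0.55, "arm C": 0.48}
control, experimental = label_groups(arms)
print(control, experimental)

# Swapping the labels changes the sign of both results, so the slope of a
# proportional model is the same either way.
slope = proportionality_slope(0.13, 0.039)
slope_swapped = proportionality_slope(-0.13, -0.039)
print(slope == slope_swapped)
```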
The standard error multiplier quantifies the uncertainty associated with the prediction model based on a successive leave-one-out analysis that mimics estimation extrapolation using past data (14). Computation of the standard error multiplier involves the following steps. Successively, one historical trial is removed from the analysis, the prediction model is fit to data from the remaining historical trials, and the prediction model is applied to the surrogate endpoints in the left-out trial to compute a model result. The extrapolation error for the left-out trial is the difference between the model result and the true result. The predicted result for the left-out trial equals the model result plus the mean of the extrapolation errors. The variance of the predicted result for the left-out trial equals the variance of the model result plus the variance of the extrapolation error. The standard error multiplier is the average, over historical trials, of the standard errors of the predicted result divided by the standard error of the true result.
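The leave-one-out steps above can be sketched in code. This is a simplified illustration under several assumptions that are not stated in the text: the prediction model is fit by least squares through the origin, the variance of the model result ignores uncertainty in the fitted slope, the extrapolation error is taken as the true result minus the model result (so that adding its mean corrects bias), and the variance of the extrapolation error is the sample variance across trials. The data and all names are hypothetical; the published calculations (14) may differ in each of these respects.

```python
from math import sqrt
from statistics import mean, variance


def fit_slope(surrogate_results, true_results):
    """Least-squares slope through the origin (assumed fitting method)."""
    numerator = sum(s * t for s, t in zip(surrogate_results, true_results))
    denominator = sum(s * s for s in surrogate_results)
    return numerator / denominator


def standard_error_multiplier(trials):
    """Return (multiplier, predicted results) from a leave-one-out analysis.

    Each trial is a dict with the surrogate result "s", its standard
    error "se_s", the true result "t", and its standard error "se_t".
    """
    model_results, model_variances = [], []
    for i, left_out in enumerate(trials):
        remaining = trials[:i] + trials[i + 1:]
        slope = fit_slope([r["s"] for r in remaining],
                          [r["t"] for r in remaining])
        model_results.append(slope * left_out["s"])
        # Simplification: treat the fitted slope as a known constant.
        model_variances.append((slope * left_out["se_s"]) ** 2)

    # Extrapolation error, defined here as true result minus model result.
    errors = [trial["t"] - m for trial, m in zip(trials, model_results)]
    predicted = [m + mean(errors) for m in model_results]
    error_variance = variance(errors)

    ratios = [sqrt(v + error_variance) / trial["se_t"]
              for v, trial in zip(model_variances, trials)]
    return mean(ratios), predicted


# Hypothetical historical trials (differences in proportions).
historical = [
    {"s": 0.10, "se_s": 0.030, "t": 0.04, "se_t": 0.025},
    {"s": 0.06, "se_s": 0.035, "t": 0.03, "se_t": 0.030},
    {"s": 0.12, "se_s": 0.028, "t": 0.07, "se_t": 0.024},
    {"s": 0.03, "se_s": 0.040, "t": 0.01, "se_t": 0.033},
]
multiplier, predicted = standard_error_multiplier(historical)
print(f"standard error multiplier: {multiplier:.2f}")
```

In a toy input where every true result is exactly half the surrogate result, the extrapolation errors vanish and the multiplier reduces to the scaled standard error of the surrogate result divided by the standard error of the true result.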
Computation of the standard error multiplier requires surrogate and true endpoints that are binary, which importantly includes an indicator of survival to a prespecified time. A surrogate or true result that is the difference in estimated survival to a clinically meaningful prespecified time is an attractive alternative to the commonly used hazard ratio, and it provides a more easily understood outcome for both clinician and patient. The main objection to the former is the need to specify a time for the analysis. However, without a strong assumption of proportional hazards, the hazard ratio depends on the duration of follow-up (34), so it also depends on a prespecified time. Because an absolute difference is more clinically relevant than a ratio (35–38), the difference in estimated survival at specified time can be a more appealing measure than the hazard ratio.
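For a binary endpoint such as survival to a prespecified time, the result and its standard error can be computed as a difference in proportions. The sketch below assumes no censoring before the prespecified time, so that simple proportions can be used; with censoring, an estimate such as Kaplan-Meier would be needed instead. The counts and the function name are hypothetical.

```python
from math import sqrt


def survival_difference(survivors_exp, n_exp, survivors_ctl, n_ctl):
    """Difference in the proportion surviving to a prespecified time.

    Returns (difference, standard error) using the usual formula for the
    standard error of a difference between two independent proportions.
    Assumes every participant is followed to the prespecified time.
    """
    p_exp = survivors_exp / n_exp
    p_ctl = survivors_ctl / n_ctl
    difference = p_exp - p_ctl
    standard_error = sqrt(p_exp * (1 - p_exp) / n_exp
                          + p_ctl * (1 - p_ctl) / n_ctl)
    return difference, standard_error


# Hypothetical trial: 160 of 200 survive in the experimental group and
# 150 of 200 survive in the control group.
difference, standard_error = survival_difference(160, 200, 150, 200)
print(f"difference {difference:.3f}, standard error {standard_error:.4f}")
```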
The standard error multiplier has been computed for the following datasets: 1) 10 historical, randomized trials for early colon cancer where the surrogate endpoint was survival to 3 years without cancer recurrence and the true endpoint was overall survival to 5 years (14,39); 2) 10 randomized trials for advanced colorectal cancer where the surrogate endpoint was survival to 6 months without cancer progression and the true endpoint was overall survival to 12 months (14,40,41); and 3) 27 randomized trials for advanced colorectal cancer, some of which are the same as in the second example, where the surrogate endpoint was tumor status assessed after 3–6 months, and the true endpoint was overall survival to 12 months (3,42). For these three examples, the standard error multipliers were 1.36, 1.33, and 1.25, respectively (14). Using the original labels for control and treatment groups (for a sensitivity analysis), the standard error multipliers were similar—namely, 1.30, 1.33, and 1.23, respectively. Detailed calculations for the second example are presented in the Supplementary Material (available online).
Before planning a new trial with only a surrogate endpoint and drawing conclusions based on estimation extrapolation, it is necessary to address the following four questions.
If the answer to Question 1 is “no,” there is no point in implementing a new surrogate endpoint trial because confidence intervals will likely be too wide to be informative. If the answer to Question 1 is “yes,” then the following three additional questions, that reframe known issues in surrogate endpoint analysis (43), need to be addressed:
If the answers to Questions 2, 3, and 4 are all “yes,” then a surrogate endpoint trial could reasonably be used to rule in a promising treatment. Of course, this is a tall order, particularly Questions 2 and 4, which could involve many unknowns. If the answers to Questions 2 and 3 are “yes,” a surrogate endpoint trial could reasonably be used to rule out an unpromising treatment without the need to consider Question 4. The detrimental consequences of incorrectly answering these questions are greatly lessened if estimation extrapolation is applied to a preliminary randomized trial with a surrogate endpoint, which is used to decide whether or not to implement a trial with a definitive true endpoint.
For clinicians and clinical trialists contemplating the use of a surrogate endpoint trial, the key point is not to lose sight of the fact that the analysis is fundamentally an extrapolation. In the cancer prevention setting, the reliance on hypothesis-testing extrapolation is particularly risky with a small surrogate endpoint trial. In the cancer treatment setting, a useful first step is to quantify the uncertainty of estimation extrapolation based on a leave-one-out analysis of previous historical trials. Subsequently, other extrapolation issues need to be addressed.
This work was supported by the Division of Cancer Prevention in the National Cancer Institute and the National Institutes of Health.
The funders did not have a role in the study design, data collection, analysis, and interpretation; the writing of the article; or the decision to submit the article for publication.