We will outline, in general, critical issues, constraints and goals associated with the validation of a new imaging endpoint, to provide guidance and a conceptual framework for the validation of individual imaging endpoints. It is not our purpose to precisely prescribe how to validate any specific new phase II trial imaging endpoint as this will depend on the specific characteristics and purpose of the particular endpoint, whether its use is restricted to a certain patient subgroup, the current state of development of the endpoint, and the technology to measure it.
The primary purpose of an imaging endpoint in the phase II setting is to serve as an early but accurate indicator of a promising treatment effect. As such, the key criterion for judging the utility of a new endpoint will be its ability to predict accurately the phase III endpoint for treatment effect, which is usually assessed by a difference between two arms in progression-free survival (PFS) or overall survival (OS). More precisely, the measure of treatment effect on the phase II endpoint must correlate sufficiently well with the measure of treatment effect on the phase III primary endpoint that the former can be considered reasonably predictive of the latter.
An initial question to be addressed is whether the new endpoint is destined to be ‘+++’ or only ‘++’, according to the TMUGS criteria (16) – in other words, will the endpoint be useable, by itself, as the primary criterion for moving to a phase III study, or will it be useable as one of several such criteria. In this paper, we focus on validating early endpoints that are anticipated to be ‘+++’. A second question relates to the current utility of RECIST in the disease setting under exploration. In a disease setting where RECIST (or existing alternatives) predict Phase III outcomes poorly, improved prediction of outcome over the current standard would be of clear utility, even if the imaging modality does not meet criteria for full endpoint validation.
It is not sufficient that the endpoint being considered for a phase II trial be a prognostic indicator of clinical outcome, although it will usually be the case that early endpoints are prognostic of clinical outcome even in the absence of a treatment effect. Within the context of a clinical trial, the early endpoint must capture at least a component of treatment benefit, a concept that specifies that a change due to treatment in the early endpoint predicts a change in the ultimate clinical endpoint. Theoretical principles to define treatment benefit were outlined by Prentice (34), although capturing the full treatment benefit (as measured by the phase III endpoint) has been recognized as too strict to be useful in practice (35). A more practical and demonstrable criterion requires that the early endpoint capture a substantial proportion of the treatment benefit, for example, more than 50% (20). This approach has been used to establish the utility of endpoints such as tumor response and progression-free survival (PFS) by demonstrating that they are sufficiently predictive of OS, even if they do not satisfy the Prentice criterion (18).
Establishing the utility of the endpoint can be separated into an early development stage and a later validation stage. Even in the early development stage, work should optimally be performed in the context of randomized studies, which most reliably allow the measurement of treatment benefit (35). Practically, much early development work will by necessity occur in the context of prospective cohort studies, which should at minimum have patients with uniform treatment. In the early development stage of a new imaging endpoint, utility determination will likely be restricted to demonstrating that in single studies the endpoint captures much of the treatment benefit at the individual patient level. Such a demonstration suggests, but does not prove, that the endpoint may also capture much of the treatment benefit at the trial level. Freedman et al (35) describe one approach to estimating the proportion of treatment effect explained by modeling the treatment effect on the ultimate endpoint (Appendix II).
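As a rough illustration of the "proportion of treatment effect explained" idea, the sketch below compares the treatment coefficient from a model of the ultimate endpoint with and without adjustment for the early endpoint, and reports one minus their ratio. It is a simplified sketch, not the procedure of Appendix II: it assumes a continuous outcome and ordinary least squares, whereas a real analysis would typically use a logistic or time-to-event model. The function name `proportion_explained` and the toy data are hypothetical.

```python
from statistics import mean


def _cov(x, y):
    """Population covariance of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def proportion_explained(treatment, early, outcome):
    """Return 1 - beta_adjusted / beta_unadjusted.

    beta_unadjusted: least-squares slope of outcome on treatment alone.
    beta_adjusted:   treatment slope when the early endpoint is also in
                     the model (closed-form two-predictor least squares).
    ASSUMPTION: a linear model stands in for the survival or logistic
    models that would be used with real trial data.
    """
    var_t = _cov(treatment, treatment)
    var_s = _cov(early, early)
    cov_ts = _cov(treatment, early)
    cov_ty = _cov(treatment, outcome)
    cov_sy = _cov(early, outcome)

    det = var_t * var_s - cov_ts ** 2
    if var_t == 0 or det == 0:
        raise ValueError("treatment and early endpoint must vary and not be collinear")

    beta_unadjusted = cov_ty / var_t
    if beta_unadjusted == 0:
        raise ValueError("no unadjusted treatment effect to explain")
    beta_adjusted = (cov_ty * var_s - cov_sy * cov_ts) / det
    return 1.0 - beta_adjusted / beta_unadjusted


# Hypothetical toy data: 4 control and 4 treated patients.
treatment = [0, 0, 0, 0, 1, 1, 1, 1]
early = [1, 2, 3, 4, 3, 4, 5, 6]                       # early imaging measure
outcome = [s + 2 * t for s, t in zip(early, treatment)]  # part of the effect bypasses `early`
print(proportion_explained(treatment, early, outcome))
```

In this toy data set half of the treatment effect on the outcome passes through the early measure and half does not, which is the kind of split the 50% threshold discussed above is meant to detect. The ratio can be unstable when the unadjusted treatment effect is small, which is one reason such estimates are usually reported with confidence intervals.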
Early and Late Phases of Endpoint Validation
Success at this early validation phase, by demonstrating a high correlation at the patient level between the early endpoint and the ultimate clinical endpoint within a trial, randomized or not, is not sufficient to validate an endpoint. Such a correlation may be a result of prognostic factors that influence both endpoints, rather than a result of similar treatment effect on the two endpoints. Despite this caveat, a reasonably high patient level correlation (for example >50%) would suggest the possible utility of the early endpoint and the value of subsequently assessing, by means of a larger analysis, the predictive ability of the early endpoint for the ultimate phase III endpoint for treatment effect at the trial level.
In the later stages of validation, as argued by Korn et al (36), the true test of the validity of an endpoint is whether it captures treatment benefit at the trial level. In other words, there must be a strong association between the measure of treatment effect as assessed by the early endpoint and the measure of treatment effect as assessed by the endpoint to be used in a phase III trial, which is most likely the estimated treatment hazard ratio associated with PFS or OS. In virtually all cases, such an assessment must be performed in the context of a meta-analysis of phase III trials in which both endpoints are measured. Such a meta-analysis may be performed using trials already conducted, if imaging data are available. However, the methodologic aspects of the meta-analysis itself must be defined prospectively in order to be statistically convincing. Such analyses have been performed for the relationship between tumor response and OS in advanced colorectal cancer (18) and in metastatic breast cancer (21). In each case, the proportion of variation in the treatment effect on OS explained by the log odds ratio (OR) of tumor response was less than 50%. In metastatic breast cancer, tumor response was seen to capture a much greater proportion of the treatment benefit reflected by PFS (92%). Such meta-analyses are substantial undertakings; the breast study included 11 trials, while the colorectal studies included 18–28 trials.
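The trial-level association described above can be sketched as a weighted regression across trials of the treatment effect on the phase III endpoint (log hazard ratio) on the treatment effect on the early endpoint (log odds ratio of response), with the resulting R² read as the proportion of variation explained. This is a minimal sketch under stated assumptions: the function name `trial_level_fit`, the choice of trial size as weight, and all numbers are hypothetical, and estimation error in each trial's effect estimates is ignored, which the published meta-analyses handle with more elaborate models.

```python
def trial_level_fit(log_or_early, log_hr_final, weights=None):
    """Weighted least-squares fit of log HR on log OR across trials.

    Returns (slope, intercept, r_squared). Each list holds one value
    per trial. ASSUMPTION: effects are treated as known quantities;
    their within-trial estimation error is not modeled.
    """
    n = len(log_or_early)
    if n < 3 or n != len(log_hr_final):
        raise ValueError("need at least 3 trials with paired effects")
    w = list(weights) if weights is not None else [1.0] * n
    sw = sum(w)

    mx = sum(wi * x for wi, x in zip(w, log_or_early)) / sw
    my = sum(wi * y for wi, y in zip(w, log_hr_final)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, log_or_early))
    syy = sum(wi * (y - my) ** 2 for wi, y in zip(w, log_hr_final))
    sxy = sum(wi * (x - mx) * (y - my)
              for wi, x, y in zip(w, log_or_early, log_hr_final))
    if sxx == 0 or syy == 0:
        raise ValueError("effects must vary across trials")

    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical effects from five trials (not taken from any cited study).
log_or = [0.10, 0.35, 0.60, 0.80, 1.10]        # log odds ratio of response
log_hr = [-0.02, -0.05, -0.20, -0.15, -0.30]   # log hazard ratio for OS
sizes = [200, 350, 150, 400, 250]              # patients per trial, used as weights

slope, intercept, r2 = trial_level_fit(log_or, log_hr, sizes)
print(f"slope={slope:.3f}  intercept={intercept:.3f}  R^2={r2:.2f}")
```

A negative slope is the expected direction here, since a larger odds ratio for response should accompany a smaller hazard ratio.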
Even with a substantial number of trials included in a planned meta-analysis, obtaining adequate power to demonstrate that a substantial proportion of the treatment benefit, at the trial level, is captured by the early imaging endpoint is challenging (see Appendix III). In the end, it will be necessary to compromise and accept that one cannot always prospectively assure the desired power to achieve the desired lower confidence bound. We stress that whatever form the meta-analysis is to take, it must be pre-specified formally in a protocol. An ad hoc approach will increase the risk of bias in the estimation of the correlation between the two measures of treatment benefit (that associated with the early endpoint versus that associated with the primary phase III endpoint).
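To give a sense of why the lower confidence bound is hard to control, the sketch below computes an approximate one-sided lower bound for a trial-level correlation using the Fisher z-transformation, for different numbers of trials. It is an illustrative calculation only, not the power analysis of Appendix III: the function name `correlation_lower_bound` and the observed correlation of 0.8 are hypothetical, and the approximation treats each trial's effect estimates as exact.

```python
import math
from statistics import NormalDist


def correlation_lower_bound(r, n_trials, confidence=0.975):
    """Approximate one-sided lower confidence bound for a correlation.

    Uses the Fisher z-transformation: atanh(r) is treated as normal
    with standard error 1 / sqrt(n_trials - 3).
    ASSUMPTION: trials are independent and their effect estimates are
    treated as exact, so the true uncertainty is likely larger.
    """
    if n_trials <= 3:
        raise ValueError("need more than 3 trials")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z_crit = NormalDist().inv_cdf(confidence)
    se = 1.0 / math.sqrt(n_trials - 3)
    return math.tanh(math.atanh(r) - z_crit * se)


# Same hypothetical observed correlation, different numbers of trials.
for n in (11, 18, 28):
    lb = correlation_lower_bound(0.8, n)
    print(f"{n} trials: lower bound on r = {lb:.2f}, on r^2 = {lb ** 2:.2f}")
```

With an observed correlation of 0.8, the bound is roughly 0.38 with 11 trials and roughly 0.61 with 28. Even the larger meta-analysis therefore leaves the proportion of variation explained (the square of the correlation) with a lower bound well under 50%.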
The recommendations above are based, in large part, on guidelines to validate a phase III surrogate endpoint. Although the basic principles behind validation of a phase II endpoint remain similar, in specific contexts the standards may appropriately be adapted for a phase II endpoint. For example, a meta-analysis of fewer trials may be all that is possible, and/or an imaging endpoint may be considered acceptable for use in phase II trials with a lower correlation between the treatment effect of interest and that estimated by the imaging endpoint (for example, capture of 50% of the treatment effect may be adequate). We further note that there may be scenarios, such as refinements to RECIST based on technical or other advances, in which the above standards of validation are not required. For example, an existing concern regarding RECIST is the reproducibility of tumor measurements across readers. If a more reproducible anatomic method were available (e.g., a computer-assisted diagnostic, or CAD, algorithm) that consistently provided the same result as an expert reader across sites, this would be an improvement upon standard RECIST and would likely be acceptable without a meta-analytic validation.