The success or failure of a clinical trial, of any phase, depends critically on the choice of an appropriate primary endpoint. In phase II and III cancer clinical trials, imaging endpoints have historically played, and continue to play, a major role in determining therapeutic efficacy. The primary goal of this paper is to discuss the validation of imaging-based markers as endpoints for Phase II clinical trials of cancer therapy. Specifically, we outline the issues that must be considered, and the criteria that would need to be satisfied, for an imaging endpoint to supplement or potentially replace RECIST-defined tumor status as a phase II clinical trial endpoint. The key criteria proposed to judge the utility of a new endpoint relate primarily to its ability to accurately and reproducibly predict the treatment effect on the eventual phase III endpoint, usually assessed as a difference between two arms in progression-free or overall survival, at both the patient level and, more importantly, the trial level. As will be demonstrated, the level of evidence required to formally and fully validate a new imaging marker as an appropriate endpoint for phase II trials is substantial. In many cases, this level of evidence will only become available by conducting a series of coordinated, prospectively designed multicenter clinical trials culminating in a formal meta-analysis. We also discuss situations where flexibility relative to the ideal rigorous evaluation may be required to accommodate inevitable real-world feasibility constraints.
The success or failure of a clinical trial, of any phase, depends critically on the choice of an appropriate primary endpoint. The endpoint must be sensitive to the effect of the treatment under study, be unambiguously and reliably measurable, and, optimally, be highly clinically relevant. In phase II and III cancer clinical trials, imaging endpoints have historically played, and continue to play, a major role in determining therapeutic efficacy. The utility of imaging-based endpoints in cancer rests on several factors. The non-invasive or minimally invasive nature of imaging allows tumor biology information to be integrated at each site of disease and, for most imaging methodologies, across multiple sites of metastasis throughout the body. Non-invasive imaging assays also lend themselves to serial evaluation, including surveillance for recurrent disease. Increasingly, advances in molecular and functional imaging will allow us to assess not only the morphology of the primary tumor and its metastases, but also the metabolic, hypoxic, proliferative and receptor status of the lesions, suggesting that the role of imaging may grow further (1–6).
The primary goal of this paper is to discuss the validation of imaging-based markers as endpoints for Phase II clinical trials of cancer therapy. We touch upon endpoints for Phase III studies only insofar as this is necessary to pursue our primary goal. Phase II clinical trials in cancer are designed to provide evidence of biological drug activity. Phase II trials have traditionally used imaging-related endpoints, such as tumor shrinkage or delayed tumor growth, as signals of anti-tumor activity. The utility of tumor response as a Phase II endpoint is supported by biology (tumors rarely shrink by themselves) and by history: drugs that induce tumor responses in early clinical trials are more likely to subsequently lead to positive phase III trials and drug registration (7,8). However, tumor response by itself is neither a necessary nor a sufficient demonstration of clinically meaningful drug efficacy, as 1) tumor response may not result in an improvement in survival or quality of life, and 2) patients may benefit from therapy without obtaining a tumor response (9,10). Therefore, tumor growth (progression) despite drug administration is increasingly viewed as evidence of drug inactivity, and progression-free survival (PFS), either overall or at a fixed time point, is increasingly used as a phase II clinical trial endpoint.
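A PFS rate at a fixed time point is typically estimated with the Kaplan-Meier product-limit method, so that patients censored before the landmark still contribute information. The following is a minimal Python sketch of that standard estimator. It assumes right-censored data coded as a follow-up time plus an event indicator. The function name and the example cohort are hypothetical and do not come from any trial discussed here.

```python
from typing import Sequence


def km_survival_at(times: Sequence[float], events: Sequence[int], t: float) -> float:
    """Kaplan-Meier estimate of the probability of remaining event-free at time t.

    times  -- follow-up time for each patient (e.g., months from registration)
    events -- 1 if progression or death was observed at that time, 0 if censored
    t      -- landmark time, e.g., 6 for a 6-month PFS rate
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    survival = 1.0
    # Distinct observed event times up to and including the landmark, in order.
    event_times = sorted({ti for ti, ei in zip(times, events) if ei == 1 and ti <= t})
    for u in event_times:
        at_risk = sum(1 for ti in times if ti >= u)  # still under observation at u
        failures = sum(1 for ti, ei in zip(times, events) if ti == u and ei == 1)
        survival *= 1.0 - failures / at_risk
    return survival


# Hypothetical cohort: months to progression/death (event=1) or censoring (event=0).
times = [2, 3, 4, 6, 8, 10]
events = [1, 0, 1, 1, 0, 0]
print(round(km_survival_at(times, events, 6), 3))  # 0.417
```

In this toy cohort, the patient censored at month 3 stays in the risk set only until censoring. The resulting 6-month PFS estimate is 5/12.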
Since 2000, a standard for evaluating imaging-related endpoints in solid tumor clinical trials has been that defined by the RECIST project (11). These criteria specify how data from standardized imaging modalities, such as CT and MRI, are used to define clinical trial endpoints. In this volume, the RECIST criteria are updated to address multiple issues that have arisen since the initial RECIST publication in 2000 (12). Given the importance of imaging-related endpoints in cancer clinical trials, and the rapid pace at which new imaging modalities are becoming available, in this paper we focus on the methodological issues that must be considered for a new imaging endpoint to be appropriately validated as a primary endpoint for Phase II clinical trials. Specifically, we outline the issues that must be considered, and the criteria that would need to be satisfied, for an imaging endpoint to supplement or potentially replace RECIST-defined tumor status as a phase II clinical trial endpoint. For example, SUV decrease on an FDG-PET scan, if appropriately validated, might be accepted as an alternative to a RECIST-based partial response assessed by CT. It might also provide an additional mechanism to upgrade a patient from a partial to a complete responder (13,14). Alternatively, volumetric imaging, if validated as a more accurate predictor of subsequent therapeutic benefit (demonstrated in phase III trials), could potentially replace the unidimensional measurement currently specified by RECIST.
To evaluate the quality of a study and to compare results across studies, established standards for reporting relevant elements of study design and analysis are vital. Guidelines for evaluating and reporting results from studies of tissue-based biomarkers have recently been developed. The Reporting Tumor Marker Prognostic Study (REMARK) guidelines, for example, describe a list of basic elements that should be documented in any report of a tissue-based tumor marker study (15). These guidelines include reporting of study design, pre-specified hypotheses, patient characteristics, and statistical analysis plans. Similarly, the Tumor Marker Utility Grading System (TMUGS) established a standardized technique for evaluating the utility of a known marker based on existing evidence (16). The principles outlined in these tissue-based biomarker guidelines are, in general, equally applicable to imaging-based biomarkers. Here we focus on study design issues in validating an imaging biomarker. Adherence to these standards, and reporting of studies according to the REMARK guidelines, will allow the generation of TMUGS level ‘+++’ or ‘++’ evidence for imaging modalities.
Several authors have proposed alternative uses of measurements obtained via morphologic imaging that may be preferable to tumor response as predictors of improvement in clinically relevant endpoints in phase III clinical trials. For example, it has been proposed that progression-free survival, assessed by the current RECIST criteria, provides greater predictive accuracy than tumor response for phase III endpoints (17,18). It has also been proposed that a continuous measure of tumor size change may be preferable to the categorical definitions of RECIST (19). In this paper we focus on technological advances designed to supplement or potentially replace tumor assessments based on RECIST. We do not address endpoints that use RECIST-based anatomical imaging data in an alternative manner. In addition, we acknowledge that anatomically based tumor assessments in phase II clinical trials, using either response rate or PFS under the current RECIST criteria or under earlier criteria such as WHO, are well documented to predict subsequent therapeutic benefit in phase III clinical trials imperfectly (20–22). However, at the present time the RECIST criteria remain a clearly defined and recognized standard, and for a new approach to be advocated, it must provide a clear advantage over that standard. We therefore consider RECIST the default competitor for new imaging approaches.
Tumor response as an endpoint in therapeutic trials was first codified by the World Health Organization (WHO), based upon initial publications that focused on the reproducibility of the metric for assessing tumor response and progression. The specific response categories have remained essentially unchanged as response assessment has evolved. So have the cutpoints for response categorization: a partial response is a 50% decrease in the WHO bidimensional product, corresponding to a 30% decrease in unidimensional measurements. The modalities used to assess tumor size, however, have evolved substantially. In addition, novel cytostatic agents, which may not induce tumor shrinkage, are being developed alongside new cytotoxic agents (4,23–25). Taken together, these factors have led to a recognition that endpoints based on RECIST have clear limitations in certain primary tumor types and with certain therapeutic agents. However, many of the shortcomings noted in the literature have other sources. Some reflect improper clinical interpretation of radiologic images. Others reflect the intrinsic limitations of any scoring system that bins data representing continuous change into discrete categories. These limitations will be relevant to any imaging modality or other biomarker technique. Here we present selected examples to illustrate these concepts.
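The correspondence between the WHO and RECIST partial-response cutpoints follows from simple geometry. If a lesion shrinks by the same proportion in both dimensions, a 30% decrease in diameter leaves (0.7)² = 0.49 of the original bidimensional product, a decrease of roughly 50%. The sketch below makes that arithmetic explicit. The equal-shrinkage (spherical) assumption and the function name are illustrative, not part of either set of criteria.

```python
def bidimensional_change(unidimensional_change: float) -> float:
    """Fractional change in the WHO bidimensional product implied by a fractional
    change in a lesion's diameter.

    Assumes (for illustration) that both perpendicular diameters change by the
    same proportion, as for a sphere.
    """
    scale = 1.0 + unidimensional_change  # e.g., -0.30 leaves 0.70 of each diameter
    return scale ** 2 - 1.0               # product of the two scaled diameters


print(round(bidimensional_change(-0.30), 2))  # -0.51, i.e., roughly the WHO 50% decrease
```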
In gastrointestinal stromal tumors (GIST), alternatives to RECIST have been proposed that, in single-institution trials, improve correlation with survival outcomes (23). These criteria are based on modifying the RECIST cutpoint for progression, as well as on using change in tumor density to assess disease status. The criteria involving changes in tumor density on post-contrast CT scans are particularly appealing as they introduce a functional element into anatomic criteria. Properly applied, these approaches may be invaluable, principally by providing additional clinically relevant data from existing scanning technology. However, at this time the reproducibility of such criteria across multiple institutions requires further validation in independent data sets. Image acquisition techniques vary among centers, and post-contrast density or perfusion measurements consequently vary widely. Proposed criteria such as these must therefore be carefully vetted not only for correlation with outcome but also for reproducibility of the measurement metric. Further, a correlation between within-patient changes in a biomarker and patient outcome is inadequate to conclude that a therapy that alters the biomarker will also alter the ultimate patient outcome: a correlate does not a surrogate make (26).
Other criteria, such as those proposed for mesothelioma, have been carefully designed to address the reproducibility issue. Tumors such as mesothelioma show considerable heterogeneity of response, evident across multiple CT slices, with innumerable potential linear diameters. Approaches that provide reproducible measures are clearly a necessary part of the strategy for documenting response in this patient population (27) (for specific details see Appendix I). However, these measures must not only be reproducible but must also correlate with true clinical outcome, through validation in large multi-institutional data sets.
Three imaging modalities or metrics currently poised to play a role in disease assessment are PET (including but not limited to the FDG tracer), dynamic contrast enhanced MRI (DCE-MRI), and three-dimensional tumor measurement. In this section we discuss key performance characteristics as they relate to the potential widespread use and acceptance of these three approaches.
When an imaging assay is used serially to assess changes in tumor characteristics, a change analysis is being performed. If such a technique is to be developed for wide use in clinical trials, then, as with any other assay, the performance characteristics of the imaging assay in the multicenter setting must be established. At this time the methods used to acquire FDG-PET scans and to assess FDG metabolism and uptake vary considerably (28,29). The same is true of studies evaluating DCE-MRI (30,31). The accuracy, variance, and reproducibility of the imaging technology must be determined to ensure a biologically meaningful quantitative or semi-quantitative index.
To provide guidance and standardize the acquisition, analysis, and interpretation of FDG-PET in clinical trials, the Cancer Imaging Program (CIP) of the NCI convened a workshop in 2005. The workshop produced consensus guidelines that are currently used in NCI trials as well as in several studies developed and performed by the pharmaceutical industry (29). The guidelines include recommendations on patient preparation, image acquisition, image reconstruction, and quantitative and semi-quantitative analysis of FDG-PET images. They also cover quality assurance, reproducibility, and other parameters important for FDG-PET studies performed before and after a therapeutic intervention. The NCI Cancer Imaging Program has also engaged the magnetic resonance imaging (MRI) community in a similar process to develop consensus guidelines for the performance of dynamic contrast enhanced MRI (DCE-MRI) and magnetic resonance spectroscopy (MRS) (http://imaging.cancer.gov/).
In the specific setting of FDG PET, multiple studies have evaluated its role in assessing response to treatment in non-small cell lung cancer (NSCLC), esophageal cancer, head and neck cancer, breast cancer, and many other tumors (1,6,14,32,33). To date, these studies have primarily been performed in single institutions with small numbers of patients. Similarly, although promising, the data for 3D measures of treatment response are scant, involve relatively small numbers of patients, and are not readily comparable. Therefore, at the present time we feel there are inadequate data to support the inclusion of functional imaging (PET and DCE-MRI) or expanded morphologic imaging (3D measurement) in the RECIST version 1.1 criteria presented in this volume (12). In a later section we describe an ongoing trial that seeks to provide part of the information needed to determine whether FDG PET and/or 3D tumor measurements can ultimately serve as valid trial endpoints.
Prior to initiating definitive studies to validate imaging-based endpoints, several criteria must be met (Table 1). The technology must be at a relatively stable stage and have the potential for broad availability across the centers that will perform the therapeutic intervention in the clinical trial. All aspects of image acquisition must be specified. These include frequency of scanning, modality, timing of image acquisition relative to injection of contrast agents or radiolabeled tracers, and pulse sequence or other imaging parameters. A standardized protocol for interpreting images, whether qualitatively or quantitatively, must be established, taking into account the specific modality parameters and the reproducibility of the measurement metric.
Standardization of technique will also help limit variability across readers, although such variability is unlikely to disappear even for modalities that produce quantitative test results. For example, SUV measurements in PET studies are subject to variability related to the definition of a region of interest (ROI) by the interpreting reader. The assessment of variability across readers is particular to imaging and remains an important consideration in imaging marker evaluation studies. Accordingly, studies should be conducted to evaluate imaging reproducibility and to document a normal range of values for replicate acquisitions and interpretations. Such studies should include an evaluation of the rating system used to establish categories of response or progression, optimally based upon biologically relevant cutpoints. Further, the appropriate patient population should be well defined, along with any limitations of the technique in certain diseases or disease subtypes.
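To make the ROI dependence concrete, the sketch below computes a body-weight-normalized SUV voxel by voxel: tissue activity concentration divided by injected dose per unit body weight. It then summarizes the SUV within two hypothetical reader-drawn ROIs. The sketch assumes decay-corrected activity concentrations and a tissue density of 1 g/mL. The arrays, dose, and weight are made-up illustrative values. A real analysis would operate on reconstructed 3D images using a protocol-specified ROI definition.

```python
import numpy as np


def suv_bw(activity_kbq_per_ml: np.ndarray, injected_dose_kbq: float,
           body_weight_kg: float) -> np.ndarray:
    """Body-weight-normalized SUV for each voxel.

    Assumes activity concentrations are decay-corrected to injection time and
    tissue density is 1 g/mL, so the result is dimensionless.
    """
    body_weight_g = body_weight_kg * 1000.0
    return activity_kbq_per_ml * body_weight_g / injected_dose_kbq


def roi_summary(suv_map: np.ndarray, roi_mask: np.ndarray) -> dict:
    """SUVmax and SUVmean within a reader-drawn ROI (boolean mask, same shape)."""
    values = suv_map[roi_mask]
    return {"suv_max": float(values.max()), "suv_mean": float(values.mean())}


# Hypothetical 4-voxel "lesion", 370 MBq injected, 74 kg patient.
suv = suv_bw(np.array([5.0, 10.0, 20.0, 8.0]), 370000.0, 74.0)
reader_a = roi_summary(suv, np.array([False, True, True, False]))  # tight ROI
reader_b = roi_summary(suv, np.array([True, True, True, True]))    # generous ROI
print(reader_a, reader_b)
```

In this toy example the two ROIs agree on SUVmax but give different SUVmean values. This is one reason the SUV metric and the ROI definition need to be fixed in the imaging protocol.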
In addition to these technical issues, in most if not all cases a sound biological rationale is assumed to exist for the use of an imaging technique as an endpoint. For example, the use of radiographic tumor response in phase II studies of cytotoxic agents assumes that tumor shrinkage is an outcome reflecting drug activity. Biologic confounders must also be accounted for. For example, when assessing tumor metabolism via FDG PET, one must consider treatment-specific issues such as nonspecific uptake due to inflammation after radiotherapy, which can affect the optimal time to assess post-treatment response. This is usually minimized by waiting several weeks after radiation therapy to obtain the post-treatment FDG PET scan, allowing the inflammation time to subside. Biological plausibility is critical for the validation of future phase II endpoints. However, we stress that it alone is inadequate to allow any endpoint to be validated as a phase II endpoint without a demonstration of correlation with a true patient benefit (phase III) outcome. We return to this point in the discussion.
We will outline, in general, critical issues, constraints and goals associated with the validation of a new imaging endpoint, to provide guidance and a conceptual framework for the validation of individual imaging endpoints. It is not our purpose to precisely prescribe how to validate any specific new phase II trial imaging endpoint as this will depend on the specific characteristics and purpose of the particular endpoint, whether its use is restricted to a certain patient subgroup, the current state of development of the endpoint, and the technology to measure it.
The primary purpose of an imaging endpoint in the phase II setting is to serve as an early but accurate indicator of a promising treatment effect. As such, the key criteria for judging the utility of a new endpoint will be its ability to predict accurately the phase III endpoint for treatment effect, which is usually assessed by a difference between two arms on PFS or overall survival (OS). More precisely, the measure of treatment effect on the phase II endpoint must correlate sufficiently well with the measure of treatment effect on the phase III primary endpoint that the former can be considered reasonably predictive of the latter.
An initial question to be addressed is whether the new endpoint is destined to be ‘+++’ or only ‘++’ according to the TMUGS criteria (16). In other words, will the endpoint be usable by itself as the primary criterion for moving to a phase III study, or only as one of several such criteria? In this paper, we focus on validating early endpoints that are anticipated to be ‘+++’. A second question relates to the current utility of RECIST in the disease setting under exploration. In a disease setting where RECIST (or existing alternatives) predicts Phase III outcomes poorly, improved prediction of outcome over the current standard would be of clear utility, even if the imaging modality does not meet the criteria for full endpoint validation.
It is not sufficient that the endpoint being considered for a phase II trial be a prognostic indicator of clinical outcome, although early endpoints will usually be prognostic of clinical outcome even in the absence of a treatment effect. Within the context of a clinical trial, the early endpoint must capture at least a component of treatment benefit. That is, a change in the early endpoint due to treatment must predict a change in the ultimate clinical endpoint. Theoretical principles defining treatment benefit were outlined by Prentice (34). However, requiring capture of the full treatment benefit (as measured by the phase III endpoint) has been recognized as too strict to be useful in practice (35,36). A more practical and demonstrable criterion requires that the early endpoint capture a substantial proportion of the treatment benefit, for example more than 50% (20,35,36). This approach has been used to establish the utility of endpoints such as tumor response and PFS by demonstrating that they are sufficiently predictive of OS, even if they do not satisfy the Prentice criterion (18,20,21,37–42).
Establishing the utility of the endpoint can be separated into an early development stage and a later validation stage (Table 2). Even in the early development stage, work should optimally be performed in the context of randomized studies, which most reliably allow the measurement of treatment benefit (35). In practice, much early development work will by necessity occur in prospective cohort studies, which should at a minimum include uniformly treated patients. In the early development stage of a new imaging endpoint, utility determination will likely be restricted to demonstrating, in single studies, that the endpoint captures much of the treatment benefit at the individual patient level. Such a demonstration suggests, but does not prove, that the endpoint may also capture much of the treatment benefit at the trial level. Freedman et al (35) describe one approach to estimating the proportion of treatment effect explained, by modeling the treatment effect on the ultimate endpoint (Appendix II).
Success at this early validation phase, by demonstrating a high correlation at the patient level between the early endpoint and the ultimate clinical endpoint within a trial, randomized or not, is not sufficient to validate an endpoint. Such a correlation may be a result of prognostic factors that influence both endpoints, rather than a result of similar treatment effect on the two endpoints. Despite this caveat, a reasonably high patient level correlation (for example >50%) would suggest the possible utility of the early endpoint and the value of subsequently assessing, by means of a larger analysis, the predictive ability of the early endpoint for the ultimate phase III endpoint for treatment effect at the trial level.
In the later stages of validation, as argued by Korn et al (36), the true test of the validity of an endpoint is whether it captures treatment benefit at the trial level. In other words, there must be a strong association between two measures of treatment effect. The first is the treatment effect assessed by the early endpoint. The second is the treatment effect assessed by the endpoint to be used in a phase III trial, most likely the estimated treatment hazard ratio for PFS or OS. In virtually all cases, such an assessment must be performed in the context of a meta-analysis of phase III trials in which both endpoints are measured. Such a meta-analysis may be performed using trials already conducted, if imaging data are available. However, the methodologic aspects of the meta-analysis itself must be defined prospectively in order to be statistically convincing. Such analyses have been performed for the relationship between tumor response and OS in advanced colorectal cancer (18,20) and in metastatic breast cancer (21). In each case, the proportion of variation in the treatment effect on OS explained by the log odds ratio (OR) of tumor response was less than 50%. In metastatic breast cancer, tumor response was seen to capture a much greater proportion (92%) of the treatment benefit reflected by PFS. Such meta-analyses are substantial undertakings: the breast study included 11 trials, while the colorectal studies included 18–28 trials.
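The trial-level analysis described above can be illustrated with a regression across trials. The treatment effect on the phase III endpoint (log hazard ratio for OS) is regressed on the treatment effect on the early endpoint (log odds ratio of response), and the weighted R² is reported. The sketch below is a simplified weighted least-squares version that assumes trial size as the weight. Published meta-analyses may use more elaborate models that account for estimation error within each trial. The function name and the per-trial inputs are hypothetical.

```python
import numpy as np


def trial_level_r2(log_or: np.ndarray, log_hr: np.ndarray, weights: np.ndarray) -> float:
    """Weighted R^2 from regressing each trial's log HR (OS) on its log OR (response).

    Each array has one entry per trial; weights (e.g., number of patients) give
    larger trials more influence.
    """
    w = weights / weights.sum()
    X = np.column_stack([np.ones_like(log_or), log_or])  # intercept and slope
    sw = np.sqrt(w)
    # Weighted least squares: scale each row by sqrt(weight), then ordinary lstsq.
    coef, *_ = np.linalg.lstsq(X * sw[:, None], log_hr * sw, rcond=None)
    residuals = log_hr - X @ coef
    weighted_mean = np.sum(w * log_hr)
    ss_res = np.sum(w * residuals ** 2)
    ss_tot = np.sum(w * (log_hr - weighted_mean) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical per-trial estimates, for illustration only.
log_or = np.array([0.10, 0.45, 0.80, 0.30, 0.60])
log_hr = np.array([-0.02, -0.10, -0.12, -0.09, -0.05])
n_patients = np.array([300.0, 450.0, 250.0, 500.0, 350.0])
print(round(trial_level_r2(log_or, log_hr, n_patients), 2))
```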
Even with a substantial number of trials included in a planned meta-analysis, obtaining adequate power to demonstrate that a substantial proportion of the treatment benefit, at the trial level, is captured by the early imaging endpoint is challenging (see Appendix III). In the end, it will be necessary to compromise and accept that one cannot always prospectively assure the desired power to achieve the desired lower confidence bound. We stress that whatever form the meta-analysis is to take, it must be pre-specified formally in a protocol. An ad hoc approach will increase the probability for bias in the estimation of correlation between the two measures of treatment benefit (that associated with the early endpoint versus that associated with the primary phase III endpoint).
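The difficulty of reaching a high lower confidence bound can be seen with a standard approximation: Fisher's z transformation of a correlation coefficient, treating each trial as a single observation. This is a minimal sketch under that assumption and not necessarily the calculation used in Appendix III. The observed correlation of 0.9 and the trial counts are illustrative.

```python
import math


def correlation_lower_bound(r: float, n_trials: int, z_crit: float = 1.959964) -> float:
    """Approximate lower limit of a two-sided 95% CI for a trial-level correlation.

    Uses Fisher's z transformation, atanh(r), with standard error 1/sqrt(n - 3),
    treating each trial as one observation; requires n_trials > 3.
    """
    if n_trials <= 3:
        raise ValueError("need more than 3 trials")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n_trials - 3)
    return math.tanh(z - z_crit * se)


# Observed trial-level correlation of 0.9, for several meta-analysis sizes.
for n in (11, 20, 40):
    lower = correlation_lower_bound(0.9, n)
    print(n, round(lower, 2), round(lower ** 2, 2))  # bound on r, and on R^2
```

With 11 trials, an observed trial-level correlation of 0.9 yields a lower 95% bound of only about 0.65, an R² of about 0.43. The bound tightens only slowly as the number of trials grows.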
The recommendations above are based, in large part, on guidelines for validating a phase III surrogate endpoint. Although the basic principles behind validation of a phase II endpoint remain similar, in specific contexts the standards may appropriately be adapted for a phase II endpoint. For example, a meta-analysis of fewer trials may be all that is possible. An imaging endpoint may also be considered acceptable for use in phase II trials with a lower correlation between the treatment effect of interest and that estimated by the imaging endpoint (for example, capture of 50% of the treatment effect may be adequate). We further note that there may be scenarios in which the above standards of validation are not required, such as refinements to RECIST based on technical or other advances. For example, an existing concern regarding RECIST is the reproducibility of tumor measurements across readers. Suppose a more reproducible anatomic method were available (e.g., a computer-assisted diagnosis (CAD) algorithm) that consistently provided the same result as an expert reader across sites. This would be an improvement upon standard RECIST and would likely be acceptable without meta-analytic validation.
There are currently ongoing national trials in the United States designed to provide data at the early validation phase (Table 2) for FDG PET as a biomarker of response in lymphoma and non-small cell lung cancer. These multicenter trials seek to validate the results of single-center and other multicenter trials that provided promising evidence of the utility of these biomarkers for patient-level outcome prediction. They thus reflect the early phase of biomarker validation. These trials have been designed for biomarker validation in two ways: by optimizing the image acquisition parameters within the real-world limitations of a multicenter trial, and by providing both local and expert assessments of the imaging results. Validating imaging methods as potential biomarkers of tumor response to treatment requires demonstrating a high degree of test-retest reproducibility for the imaging method. Test-retest reproducibility is therefore also an important element of these trials.
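Test-retest reproducibility is commonly summarized by the within-subject standard deviation and the repeatability coefficient derived from paired replicate scans. The sketch below assumes two scans per patient with no true biological change between them. The function name and the example SUV values are hypothetical. SUV repeatability is often also expressed on a relative (percentage or log) scale, which would require a different computation.

```python
import math
from typing import Sequence


def repeatability(test: Sequence[float], retest: Sequence[float]) -> tuple[float, float]:
    """Within-subject SD (wSD) and repeatability coefficient (RC) from paired scans.

    wSD = sqrt(sum(d_i^2) / (2n)), where d_i is the test-retest difference for
    patient i. RC = 1.96 * sqrt(2) * wSD (about 2.77 * wSD): about 95% of
    differences between replicate measurements are expected to fall within +/- RC
    when no true change has occurred.
    """
    if len(test) != len(retest) or len(test) == 0:
        raise ValueError("need equal-length, non-empty paired measurements")
    diffs = [a - b for a, b in zip(test, retest)]
    wsd = math.sqrt(sum(d * d for d in diffs) / (2 * len(diffs)))
    return wsd, 1.96 * math.sqrt(2) * wsd


wsd, rc = repeatability([10.0, 8.0, 5.0], [12.0, 8.0, 6.0])  # hypothetical SUVs
print(round(wsd, 3), round(rc, 3))
```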
As a specific example, ACRIN protocol 6678 (FDG-PET/CT as a Predictive Marker of Tumor Response and Patient Outcome: Prospective Validation in Non-small Cell Lung Cancer) will explore three types of evaluation of imaging biomarkers that relate to their potential role as clinical trial endpoints. Specifically the study includes (a) comparison of time-to-event distributions for biomarker “responders” and “non-responders”, (b) assessment of the predictive accuracy of the biomarkers, and (c) assessment of the test-retest reliability of the imaging measurement. The primary aim of the study is to assess whether a metabolic response, defined as a ≥ 25% decrease in peak tumor SUV post-cycle 1 of chemotherapy, provides early prediction of treatment outcome as determined by one-year patient survival. A secondary aim of the study is to compare the predictive value of FDG-PET/CT for one-year overall survival after one and two cycles of chemotherapy. Further secondary endpoints assess the test-retest reproducibility of standardized uptake values (SUVs) measured by PET/CT systems.
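The primary-aim response definition can be expressed directly in code. The sketch below classifies patients as metabolic responders when peak SUV falls by at least 25% after cycle 1, then compares crude one-year survival proportions between the two groups. It assumes complete one-year follow-up (no censoring) and uses hypothetical data structures. It is not the ACRIN 6678 statistical analysis plan, which also includes time-to-event comparisons and measures of predictive accuracy.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    baseline_peak_suv: float
    post_cycle1_peak_suv: float
    alive_at_one_year: bool  # assumes complete one-year follow-up (no censoring)


def is_metabolic_responder(p: Patient, threshold: float = 0.25) -> bool:
    """Responder if peak SUV fell by at least `threshold` (25% in the primary aim)."""
    decrease = (p.baseline_peak_suv - p.post_cycle1_peak_suv) / p.baseline_peak_suv
    return decrease >= threshold


def one_year_survival_by_group(patients: list[Patient]) -> dict[str, float]:
    """Crude proportion alive at one year among responders and non-responders."""
    groups: dict[str, list[bool]] = {"responders": [], "non_responders": []}
    for p in patients:
        key = "responders" if is_metabolic_responder(p) else "non_responders"
        groups[key].append(p.alive_at_one_year)
    return {k: sum(v) / len(v) for k, v in groups.items() if v}


cohort = [
    Patient(8.0, 6.0, True),    # 25% decrease: responder
    Patient(10.0, 5.0, True),   # 50% decrease: responder
    Patient(10.0, 9.0, False),  # 10% decrease: non-responder
    Patient(6.0, 5.5, True),    # about 8% decrease: non-responder
]
print(one_year_survival_by_group(cohort))  # {'responders': 1.0, 'non_responders': 0.5}
```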
In addition to the evaluation of FDG-PET based markers, ACRIN protocol 6678 also includes an exploration of tumor volumetry. This study will permit an early assessment of whether volumetric analysis is feasible and reproducible in the multicenter trial setting, and whether volumetric change analysis early in the course of therapy has the potential to predict a phase III endpoint (long term survival), as an independent or complementary variable to FDG PET. Additional detail on ACRIN 6678 is available in Appendix IV, and the full protocol is available at http://www.acrin.org/TabID/162/Default.aspx.
It is clear from the preceding discussion that the level of evidence required to formally and fully validate a new imaging marker as an appropriate endpoint for phase II trials is substantial. In many cases, this level of evidence will only become available by conducting a series of coordinated, prospectively designed clinical trials, such as the example described above. As the financial and time burdens of prospective clinical trials are considerable, it is necessary to consider whether selected elements of the validation of a new technology may be performed retrospectively. That is, they could use data from patients who have already been enrolled, treated, and assessed on a previous clinical trial (or even patients who were not on a clinical trial). Clearly, each component of a validation analysis must prospectively specify the hypothesis, the analytic techniques, the patient population, and the precise imaging algorithms to be used. If these elements are clearly specified prospectively in a protocol, it is possible to derive evidence from patients who have already been enrolled in a randomized clinical trial. This requires that imaging results be available from the vast majority of patients without selection bias. Such a retrospective evaluation may be most appropriate at the early validation phase of an endpoint’s development, where the focus is on individual patient-level prediction of treatment benefit. As standardization is attained in measurements based on new imaging modalities, and if the data are stored in a queryable database, such analyses may become possible. An implication of this recommendation is that ongoing and future phase III trials should incorporate appropriate collection of imaging endpoints whenever feasible.
The considerable enthusiasm surrounding new imaging modalities must be tempered by a number of examples suggesting that endpoints reflecting a biological effect of an agent may not translate into improvements in a clinically meaningful endpoint in a phase III trial. For example, DCE-MRI showed a clear and measurable change in vascular permeability and blood flow with the VEGFR-targeted agent vatalanib (PTK787/ZK 222584). Nevertheless, this change failed to predict improved survival when vatalanib was tested in phase III trials in colorectal carcinoma patients (43,44). Another example may be the use of FDG-PET to determine response in patients with GIST. FDG-PET response has been shown to be an early and sensitive indicator of the effectiveness of imatinib in this disease (23). As such, FDG-PET is useful for evaluating imatinib’s activity in individual patients and for screening for activity in phase II trials of drugs with a mechanism of action similar to imatinib’s. However, it is not clear whether similar FDG-PET effects would occur with agents whose mechanism of action differs from that of imatinib. Furthermore, FDG-PET changes similar to those seen with imatinib in GIST may not correlate with clinical benefit in other settings.
This raises a critical and difficult issue in the validation of new imaging techniques: the degree to which the validation of an early imaging endpoint may be universal rather than disease-site and agent specific. Generalizability is never guaranteed. However, independent validation of each imaging modality for each disease site and agent class is clearly intractable. As a general guideline, suppose a rigorous evaluation has been performed in one particular setting according to the principles outlined here. The level of evidence required for that imaging modality may then be lessened for disease sites that have historically performed consistently and for agents with similar mechanisms of action. Consistency of the novel imaging results with results obtained using other biomarkers (RECIST, PFS, etc.), which may be observed later in the trial, strengthens the evidence for a new biomarker. In the end, the study design, study endpoints, and level of validation required must be informed by careful examination of both the biology of the imaging marker and the mechanism of action of the therapeutic intervention. A critical difference exists between upstream markers, which may be pathway or target specific, and downstream markers (metabolism, apoptosis, proliferation), which are intended to measure biologic activity in the tumor. These guidelines focus on the downstream markers. Ultimately, researchers must balance cost and time efficiencies against potential bias. To achieve this, an iterative strategy may be adopted. Until substantial evidence of a lack of prediction exists, researchers proceed as though predictive ability previously established in similar settings continues to apply in the new setting. However, as new knowledge becomes available, studies may need to become more clearly disease or agent-class focused.
This strategy applies only in the phase II setting; phase III endpoints must have been appropriately validated to support practice-changing decisions.
Ultimately, whether the biological measurements allowed by advanced imaging are meaningful predictors of drug efficacy and patient benefit depends on multiple factors, including the importance of the biologic effect being assessed on tumor growth and survival, and the magnitude, duration and frequency of occurrence of the biologic effect in a given patient population. The relevance of these factors must be understood for each modality and in each clinical situation. We remain optimistic that through the careful design of prospective trials, coupled with protocol-specified analyses of existing, standards-based datasets, promising imaging modalities may be properly validated for inclusion into future RECIST versions.
The modified RECIST criteria employed in mesothelioma response assessment (27) consist of measuring tumor thickness perpendicular to the chest wall or mediastinum at two positions on each of three separate transverse CT sections. The sum of these six measurements defines the pleural unidimensional measurement. The transverse sections are recommended to be at least 1 cm apart and related to anatomical landmarks in the thorax to enhance reproducibility on follow-up scans. On follow-up scans, pleural thickness is measured at the same positions and at the same levels.
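The arithmetic behind the pleural unidimensional measurement can be sketched as follows. This is a minimal illustration, assuming measurements in millimetres grouped as three levels of two positions each; the function names and the example values are hypothetical, and response cutoffs are deliberately left out because they are not stated here.

```python
from typing import Sequence, Tuple

# Hypothetical representation: one transverse CT level -> thickness (mm) at two positions
Level = Tuple[float, float]

def pleural_sum(levels: Sequence[Level]) -> float:
    """Sum of the six pleural thickness measurements (2 positions x 3 levels)."""
    if len(levels) != 3 or any(len(pair) != 2 for pair in levels):
        raise ValueError("expected 3 levels with 2 measurements each")
    return float(sum(a + b for a, b in levels))

def percent_change(baseline_sum: float, follow_up_sum: float) -> float:
    """Percent change of the follow-up sum relative to baseline."""
    return 100.0 * (follow_up_sum - baseline_sum) / baseline_sum

# Hypothetical patient: the same positions and levels are re-measured at follow-up
baseline = [(12, 9), (15, 11), (8, 10)]
follow_up = [(9, 7), (11, 8), (6, 7)]
b, f = pleural_sum(baseline), pleural_sum(follow_up)
print(f"baseline={b:.0f} mm, follow-up={f:.0f} mm, change={percent_change(b, f):+.1f}%")
```

Keeping the measurements grouped by level mirrors the requirement that follow-up thickness be measured at the same positions and levels as at baseline.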
The Freedman approach (35) involves estimating the treatment effect on the true endpoint, denoted $\tau$, and then assessing the proportion of treatment effect explained by the early endpoint as $1 - \tau_a/\tau$, where the ratio is that of the estimated treatment effect adjusted for the early endpoint, $\tau_a$, to the unadjusted estimated treatment effect, $\tau$. Thus, for an early endpoint that captures none of the treatment benefit ($\tau_a = \tau$), the proportion of treatment effect explained is 0%. At the opposite extreme, for an early endpoint that captures all of the treatment benefit ($\tau_a = 0$), satisfying the Prentice criterion, the proportion of treatment effect explained is 100%. However, as noted by Freedman, this approach has statistical power limitations that will generally preclude conclusively demonstrating that a substantial proportion of the treatment benefit at the individual patient level is explained by the early endpoint. In addition, it has been recognized that the proportion explained is not in fact a true proportion, as it may exceed 100%, and that although it may be estimated within a single trial, data from multiple trials are required to provide a robust estimate of the predictive value of the early endpoint (37).
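The proportion of treatment effect explained can be illustrated with a small simulation. This is a sketch only: a linear least-squares model stands in for whatever regression model (e.g. logistic or Cox) a real analysis would use, and the simulated trial, its effect sizes, and the helper names are hypothetical.

```python
import numpy as np

def pte_from_effects(tau, tau_a):
    """Freedman's proportion of treatment effect explained: 1 - tau_a / tau."""
    return 1.0 - tau_a / tau

def ols_treatment_effect(t, z, s=None):
    """Coefficient of the treatment indicator z in a least-squares fit of t,
    optionally adjusted for the early endpoint s."""
    columns = [np.ones_like(t), z] if s is None else [np.ones_like(t), z, s]
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, t, rcond=None)
    return beta[1]  # beta[0] is the intercept, beta[1] the treatment effect

# Hypothetical trial: treatment z shifts the early endpoint s, and most of
# its effect on the true endpoint t is carried through s.
rng = np.random.default_rng(0)
n = 5000
z = rng.integers(0, 2, n).astype(float)
s = 1.0 * z + rng.normal(size=n)
t = 0.8 * s + 0.2 * z + rng.normal(size=n)  # population tau = 1.0, tau_a = 0.2

tau = ols_treatment_effect(t, z)         # unadjusted effect
tau_a = ols_treatment_effect(t, z, s)    # effect adjusted for the early endpoint
print(f"tau={tau:.2f}, tau_a={tau_a:.2f}, PTE={pte_from_effects(tau, tau_a):.0%}")

# If adjustment reverses the sign of the effect, the "proportion" exceeds 100%
print(f"PTE with tau_a = -0.2 * tau: {pte_from_effects(1.0, -0.2):.0%}")
```

The second print shows why the quantity is not a true proportion: nothing constrains $\tau_a$ to lie between 0 and $\tau$.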
In a meta-analysis of randomized clinical trials, even if 81% of the variation in the primary phase III endpoint is explained by the early endpoint (r = 0.9), 28 trials will be required to achieve 90% power to demonstrate, with 95% confidence, that the proportion of variation explained exceeds 50%. (Relaxing the requirement to 90% confidence does little; it reduces the requirement only to 23 trials.) If only 64% of the variation in the primary phase III endpoint is explained by the early endpoint (r = 0.8), a more realistic assumption, 28 trials yield 90% power to achieve a lower 95% confidence limit of only 23% on the proportion of variation explained, which is not satisfactory. One possible approach to this dilemma is to separate individual trials into homogeneous strata defined by appropriate prognostic variables, and to correlate the two measures of treatment effect over the much greater number of separate strata. This may be particularly useful if the treatment effects vary over the strata within trials. As long as care is taken that the strata are not so sparse that the estimates of treatment effect become statistically unstable, the increase in precision of the correlation estimate should outweigh the decrease in precision of the two sets of treatment benefit estimates.
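The trial counts above can be approximated with the Fisher z transformation of the trial-level correlation. The method used by the authors is not stated, so this is a sketch under two assumptions: $\operatorname{var}(\operatorname{atanh}\hat r) \approx 1/(k-3)$ for $k$ trials, and one-sided confidence bounds. Under these assumptions it gives 28 trials for the first scenario and a lower limit of roughly 22% for the second, close to, though not exactly, the quoted figures; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def trials_needed(r2_true, r2_null, confidence=0.95, power=0.90):
    """Trials needed so that the one-sided lower confidence bound on R^2
    exceeds r2_null with the given power (Fisher z approximation)."""
    z_alpha = NormalDist().inv_cdf(confidence)  # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    delta = math.atanh(math.sqrt(r2_true)) - math.atanh(math.sqrt(r2_null))
    return math.ceil(((z_alpha + z_beta) / delta) ** 2 + 3)

def expected_lower_r2(r2_true, k, confidence=0.95, power=0.90):
    """Lower confidence limit on R^2 reached with the given power using k trials."""
    z_alpha = NormalDist().inv_cdf(confidence)
    z_beta = NormalDist().inv_cdf(power)
    z_low = math.atanh(math.sqrt(r2_true)) - (z_alpha + z_beta) / math.sqrt(k - 3)
    return max(0.0, math.tanh(z_low)) ** 2  # a negative bound on r means R^2 >= 0 only

print(trials_needed(0.81, 0.50))                    # 95% confidence
print(trials_needed(0.81, 0.50, confidence=0.90))   # 90% confidence
print(f"{expected_lower_r2(0.64, 28):.0%}")         # r = 0.8 with 28 trials
```

The slow growth of $\sqrt{k-3}$ is what makes the required number of trials so large: halving the width of the confidence interval on the Fisher z scale requires roughly four times as many trials.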
This study has four objectives:
The two specific hypotheses underlying this trial are (i) a metabolic response, defined as a ≥ 25% decrease in peak tumor SUV after the first cycle of chemotherapy, provides early prediction of treatment outcome (tumor response and patient survival) and (ii) tumor glucose utilization can be measured by FDG-PET/CT with high reproducibility.
The primary endpoint of this study is the prediction of one-year overall survival by monitoring changes in tumor metabolic activity during the first chemotherapy cycle, where metabolic response is classified as ≥ 25% decrease in SUV of the primary tumor relative to baseline (pre-chemotherapy).
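The metabolic response rule can be written as a short classification function. This minimal sketch assumes the input is whichever SUV metric the protocol specifies (peak SUV of the primary tumor is assumed here), treats the 25% boundary as inclusive to match "≥ 25% decrease", and uses hypothetical participant data.

```python
def suv_percent_change(baseline_suv: float, follow_up_suv: float) -> float:
    """Percent change in tumor SUV relative to the pre-chemotherapy baseline."""
    if baseline_suv <= 0:
        raise ValueError("baseline SUV must be positive")
    return 100.0 * (follow_up_suv - baseline_suv) / baseline_suv

def is_metabolic_responder(baseline_suv: float, post_cycle1_suv: float,
                           threshold_pct: float = 25.0) -> bool:
    """Metabolic response: SUV decreased by at least threshold_pct percent."""
    return suv_percent_change(baseline_suv, post_cycle1_suv) <= -threshold_pct

# Hypothetical participants: (baseline SUV, SUV after the first cycle)
cohort = {"P01": (10.0, 7.5), "P02": (8.0, 7.0), "P03": (12.4, 6.1)}
for pid, (pre, post) in cohort.items():
    label = "responder" if is_metabolic_responder(pre, post) else "non-responder"
    print(f"{pid}: {suv_percent_change(pre, post):+.1f}% -> {label}")
```

The resulting responder/non-responder label is the binary predictor whose association with one-year overall survival the primary analysis would then assess.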
In addition to the specific endpoints described above, the trial provides data for hypothesis-generating analyses. Specifically, the following questions will be addressed:
The trial will examine the association between changes in tumor FDG uptake during chemotherapy and patient survival. Furthermore, it will determine the test-retest reproducibility of quantitative measurements of tumor FDG uptake. The trial will also evaluate the time course of changes in tumor glucose metabolism during chemotherapy and measure changes in tumor FDG uptake after one and two cycles of chemotherapy, because the optimal time point for predicting patient outcome by FDG-PET is currently unknown. Since it is not practical for participants to undergo a total of four (4) PET/CT scans (two prior to therapy and two during therapy), study participants will be randomized into two groups. Group A will undergo two PET scans prior to therapy and one PET scan after the first chemotherapy cycle. Group B will undergo one PET scan prior to therapy and two PET scans during therapy (after the first and second chemotherapy cycles). For both groups A and B, follow-up CT imaging after every other chemotherapy cycle will be used to determine best clinical response according to RECIST criteria. The participant’s treating oncologist will be contacted every three months for one year or until death, whichever occurs first, to obtain observational data to determine the primary endpoint of one-year overall survival.
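The two pre-therapy scans in Group A provide paired test-retest data. The analysis method is not specified here; one common approach, assumed for this sketch, computes the within-subject standard deviation of log SUV and the 95% repeatability coefficient $1.96\sqrt{2}\,\mathrm{wSD}$, then converts it back to percent change in SUV. The function name and the data are hypothetical.

```python
import math
from typing import Sequence, Tuple

def repeatability(pairs: Sequence[Tuple[float, float]]):
    """Test-retest repeatability from paired baseline SUVs (scan 1, scan 2).

    Returns the within-subject SD of log SUV, the 95% repeatability
    coefficient on the log scale, and the corresponding percent limits.
    """
    diffs = [math.log(b) - math.log(a) for a, b in pairs]
    wsd = math.sqrt(sum(d * d for d in diffs) / (2 * len(diffs)))
    rc = 1.96 * math.sqrt(2) * wsd
    lower_pct = 100 * (math.exp(-rc) - 1)  # smallest decrease beyond test-retest noise
    upper_pct = 100 * (math.exp(rc) - 1)   # smallest increase beyond test-retest noise
    return wsd, rc, lower_pct, upper_pct

# Hypothetical Group A participants: (first baseline SUV, second baseline SUV)
group_a = [(9.8, 10.4), (6.2, 5.9), (14.1, 13.5), (7.7, 8.1), (11.0, 10.6)]
wsd, rc, lo_pct, hi_pct = repeatability(group_a)
print(f"wSD(log SUV)={wsd:.3f}, limits of repeatability: {lo_pct:+.1f}% to {hi_pct:+.1f}%")
```

If the lower limit of repeatability is smaller in magnitude than 25%, a decrease of at least 25% in SUV would exceed the expected test-retest variation in that sample.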
A total of 228 participants will be enrolled into the study at a minimum of 8 institutions. Of the 228 eligible participants, 57 will be assigned to group A and the remaining 171 to group B.
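The 57:171 split corresponds to a 1:3 allocation ratio. The randomization scheme is not described here; the sketch below assumes permuted blocks of size four (one Group A slot and three Group B slots per block), which yields exactly 57 and 171 assignments for 228 participants. The function name and seed are hypothetical, and a multicenter trial might additionally stratify by institution.

```python
import random

def block_randomize(n_participants, block=("A", "B", "B", "B"), seed=None):
    """Permuted-block randomization with a 1:3 (A:B) ratio within each block."""
    rng = random.Random(seed)
    schedule = []
    while len(schedule) < n_participants:
        b = list(block)
        rng.shuffle(b)  # random order within each block keeps the ratio balanced
        schedule.extend(b)
    return schedule[:n_participants]

schedule = block_randomize(228, seed=2024)
print(schedule.count("A"), schedule.count("B"))
```

Because every complete block contains exactly one Group A slot, the allocation stays close to 1:3 throughout enrollment rather than only at the end.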