Contrary to our hypothesis, we found that the continuous metrics we assessed provide no predictive advantage over the categorical response metrics. However, we do recommend further study of the trichotomized response at early time points (e.g. 12, 16, and 24 weeks), with particular attention to 12-week status. This metric has at least two advantages. First, it addresses the concern over ignoring stable disease by including SD as a separate category. Second, it can be assessed earlier, since it requires neither confirmation nor data from the entire study period.
It is interesting to note that N0026 and N9841 had relatively low response rates (25-26%), yet the trichotomized response still performed as well as in N9741, which had a higher response rate (45%). Likely, the trichotomized response appropriately recognizes the survival benefit associated with stable disease by placing such patients into their own category rather than combining them with patients who progress. A natural extension to the trichotomized response would be a 5-level metric (CR vs. PR vs. Stable vs. Increasing vs. Prog). However, this also has some inherent limitations: 1) a cut-point is needed to distinguish between Increasing and Progression, and the choice is not obvious; and 2) the complete response (CR) rate is often small in oncology studies. In our data, for example, the CR rates for N0026, N9741, and N9841 were 0%, 4.2%, and 3.3%, respectively.
The inability of the continuous metrics we assessed to improve survival prediction may be due to several factors. First, when considered over an entire study population, tumor growth may be sufficiently ‘regular’ that measurements at a fixed time point post-baseline adequately characterize tumor activity. Second, imaging may be too infrequent to capture changes in tumor size. Third, unidimensional tumor size may not be the most accurate measure of disease aggressiveness; functional imaging, volumetric assessment, or other advanced imaging methods may offer improvements. Finally, it may be too much to expect any early tumor-measurement-related endpoint to predict overall survival in settings where second and later line therapies are used (18).
An important assumption for the validity of endpoints based on our continuous metrics is that patient tumors are measured at regular intervals which do not differ by arm. This eliminates the bias that could arise in the following situation: two patients have similar tumor growth trajectories, but one has a tumor measurement at j weeks and the other has a tumor measurement at j+i weeks (for i>0), by which time the tumor is a different size than at week j. As a result, these patients may have different tumor response profiles based on our continuous metrics.
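The timing bias described above can be illustrated with a minimal sketch. The exponential growth model, the baseline size, the growth rate, and the values of j and i below are all hypothetical choices made for illustration; none of them are taken from the trial data.

```python
import math


def tumor_size(baseline_mm, weekly_growth_rate, week):
    """Hypothetical exponential growth model (an assumption, not the paper's model)."""
    return baseline_mm * math.exp(weekly_growth_rate * week)


def percent_change_from_baseline(baseline_mm, size_mm):
    """Percent change in tumor size relative to baseline."""
    return 100.0 * (size_mm - baseline_mm) / baseline_mm


# Two patients with the same (hypothetical) trajectory.
baseline = 50.0   # mm, illustrative
rate = 0.01       # per week, illustrative
j, i = 12, 4      # patient A is measured at week j, patient B at week j + i

change_a = percent_change_from_baseline(baseline, tumor_size(baseline, rate, j))
change_b = percent_change_from_baseline(baseline, tumor_size(baseline, rate, j + i))

print(f"Patient A, week {j}: {change_a:.1f}% change from baseline")
print(f"Patient B, week {j + i}: {change_b:.1f}% change from baseline")
```

Although the two trajectories are identical, the later measurement yields a larger percent change, so a continuous metric computed from these values would differ between the two patients purely because of assessment timing.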
An additional limitation of the current data is the inability to effectively assess the impact of measurement data that are missing due to clinical progression, new lesions, or missed assessments. Moreover, the number of lesions measured at each assessment was variable, and the current analysis used only the lesion measurements that were available across all assessments for the patient. Since not all lesion measurements at each assessment were used, the measurement data from each cycle used to compute the metrics could be biased. Future work should consider further exploration of trichotomized response as well as alternative continuous metrics, since simple scalar summaries such as those we considered are unlikely to capture the key features of the tumor growth curve. For example, it is possible to have one patient for whom the tumor decreases over time and another patient for whom the tumor increases over time, yet for these two patients to have identical sums of measurements. Further, tumor growth curves often exhibit non-linearity, e.g. initial tumor shrinkage followed by progression. In order to capture key features of the tumor growth curve and thereby improve prediction, a metric will likely need to be composite, for example, a linear combination of multiple scalar summaries such as those considered in this paper. Longitudinal modeling, e.g. mixed models, is another option that others have previously considered (e.g. 12).
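The identical-sums example above can be made concrete with a small sketch. The measurement values and assessment weeks are invented for illustration. We also assume, for the purpose of the example, that the scalar summary is the total of the per-assessment tumor sizes and that a least-squares slope is one possible second summary; neither is claimed to be the exact definition used in our analysis.

```python
def sum_of_measurements(sizes_mm):
    """Scalar summary: total of the per-assessment tumor sizes (assumed definition)."""
    return sum(sizes_mm)


def least_squares_slope(weeks, sizes_mm):
    """Ordinary least-squares slope of tumor size against time (mm per week)."""
    n = len(weeks)
    mean_t = sum(weeks) / n
    mean_y = sum(sizes_mm) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(weeks, sizes_mm))
    var = sum((t - mean_t) ** 2 for t in weeks)
    return cov / var


# Hypothetical assessment schedule and measurements (not trial data).
weeks = [0, 6, 12, 18]
shrinking = [100.0, 80.0, 60.0, 40.0]  # tumor decreases over time
growing = [40.0, 60.0, 80.0, 100.0]    # tumor increases over time

sum_shrinking = sum_of_measurements(shrinking)
sum_growing = sum_of_measurements(growing)
slope_shrinking = least_squares_slope(weeks, shrinking)
slope_growing = least_squares_slope(weeks, growing)

print("Sums:", sum_shrinking, sum_growing)
print("Slopes:", slope_shrinking, slope_growing)
```

The two sums coincide while the slopes have opposite signs, which is why a composite of several scalar summaries may separate patients that a single summary cannot.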
In conclusion, our data suggest that categorical response metrics predict survival as well as or better than the continuous tumor-measurement-based metrics considered in this work. Furthermore, trichotomized response at early time points, possibly as early as 12 weeks, is worthy of further study as an alternative endpoint in Phase II trials.