RATIONALE: Early detection of tumor response to therapy is a key goal. Finding measurement algorithms capable of early detection of tumor response could individualize therapy treatment as well as reduce the cost of bringing new drugs to market. On an individual basis, the urgency arises from the desire to prevent continued treatment of the patient with a high-cost and/or high-risk regimen with no demonstrated individual benefit and rapidly switch the patient to an alternative efficacious therapy for that patient. In the context of bringing new drugs to market, such algorithms could demonstrate efficacy in much smaller populations, which would allow phase 3 trials to achieve statistically significant decisions with fewer subjects in shorter trials. MATERIALS AND METHODS: This consensus-based article describes multiple, image modality-independent means to assess the relative performance of algorithms for measuring tumor change in response to therapy. In this setting, we describe specifically the example of measurement of tumor volume change from anatomic imaging as well as provide an overview of other promising generic analytic methods that can be used to assess change in heterogeneous tumors. To support assessment of the relative performance of algorithms for measuring small tumor change, data sources of truth are required. RESULTS: Very short interval clinical imaging examinations and phantom scans provide known truth for comparative evaluation of algorithms. CONCLUSIONS: For a given category of measurement methods, the algorithm that has the smallest measurement noise and least bias on average will perform best in early detection of true tumor change.
Strategies for disease response assessment must be useful in a wide range of cancers, encompassing a large variety of image-based measurements and many different treatment options. Chemotherapy and neoadjuvant chemotherapy treatment protocols vary across the world and may include group protocol studies for novel agents or combinations, the application of best therapy in multicenter clinical trials, and many instances of therapy given off-study to individual patients. Many therapy plans now include surgery or radiation as additional therapy options. New biologic response modifiers (so-called targeted therapies) for diseases such as lung cancer have received increased interest recently. These generally less-toxic agents are targeted to affect the tumor blood supply or other critical pathways in cancer cell growth, differentiation, or metastatic processes. The end point of such therapies may not be cancer regression but stasis, that is, tumor growth cessation. Therefore, measures of tumor size may be an inappropriate early measure to evaluate useful change. For example, subtle changes in image-based measures in the cancer such as density, tumor margin alterations, or other pixel-based features may signal a useful response at an early stage of therapy; tumor blood flow may be an important measure for tumor vasculature-based changes; and metabolic changes may be measured by PET—all these changes preceding any change in tumor volume.
Critical to the image-based evaluation of either tumor growth or shrinkage (or some of the more subtle features mentioned already) in response to therapy is a much-improved understanding of the three-dimensional anatomic/pathologic structure of cancers. Current assessments based on two-dimensional pathology slides indicate that malignant cells occupy only a fraction of a tumor nodule's volume, whereas the remainder consists of inflammatory cells, edema, fibrosis, or necrosis. Understanding the three-dimensional structure of cancer pathologically is critical to the evaluation of three-dimensional imaging modalities. Future response assessment protocols could then target specifically the cancer component of a tumor.
The current standard method to measure tumor response using imaging is referred to as Response Evaluation Criteria in Solid Tumors (RECIST), which is based on unidimensional, linear measurements of tumor diameter [1–5]. In promoting the summed linear measurement of a limited number of target tumors, RECIST offers a simple approach that requires minimal effort. The RECIST guidelines, however, presume that tumors are spherical and change in a uniform symmetric manner. In actuality, tumors do not necessarily grow symmetrically; different portions may grow at different rates. Significant variability in the RECIST measures exists among different observers [7–10], and published work generally focuses on the surrogate of “best overall response” with only a few methods addressing other end points such as “time to progression” and “disease-free survival.” As a therapy response measurement procedure, RECIST maps linear data into an established set of four discrete categories: complete response, partial response, stable disease, and progressive disease. These categorical bins, however, are quite coarse with most trial analyses critically pivoting on partial response (defined by a 30% linear sum reduction) and progressive disease (defined by a 20% increase in tumor dimension). Furthermore, if the cancer volume is mostly inflammation, then linear size change alone may give a false impression of therapy response (the inflammation was reduced, but the cancerous component was not); in fact, a tumor may slightly increase in size after initiation of therapy because of inflammatory reactions, even though a beneficial response is occurring. As a consequence of observer measurement variability and the expectation that newer therapies will not cause initial size reduction, change in tumor volume is likely inadequate to assess early response to any therapy.
Therefore, to improve the accurate assessment of response and to reduce observer variability, other lesion characteristics that may be tracked across temporally sequential scans are required. New imaging techniques and associated new image-processing algorithms allow for early assessment of response to therapy and are being introduced into human clinical trials as outcome measures.
In 2001, a National Institutes of Health working group's consensus was published defining a biomarker as a “characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes, or pharmacologic responses to therapeutic intervention.” Further, “a biomarker that is intended to substitute for a clinical endpoint” was defined by the same working group as a “surrogate endpoint.” In cancer clinical trials, the accepted criterion standard for clinical end point is overall survival. In a further paper, various subgroups of biomarkers are described including prognostic, predictive, and surrogate end point biomarkers. It is possible that imaging can provide biomarkers for all three of these functions, but only if the image-based measure is very well characterized, as indicated in this series of papers. As imaging matures as a measurable modality, we would propose an expanded definition of a biomarker as follows: A biomarker is a validated disease characteristic which can be reliably measured in a cost-effective, repeatable and generalizable manner, and which acts as a meaningful surrogate for disease presence, absence, activity, or outcome in individuals or groups with the disease process. Examples include questionnaires, biochemical measures in various biologic fluids, or image-based metrics. Many disease processes have an established phenotype, but a phenotype is not necessarily a good biomarker, and these two terms are therefore not interchangeable. This expanded definition includes the notion that, to be useful in healthcare, validated biomarkers should have the additional properties of being cost-effective and generalizable, that is, capable of being implemented at multiple sites with uniform results.
In this paper, we summarize a recent initiative to develop a consensus approach to the benchmarking of software algorithms for the assessment of tumor response to therapy and to provide a publicly available database of images and associated meta-data. The Reference Image Database to Evaluate Response to therapy in cancer (RIDER) project is generating a database of temporally sequential computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET) scans of subjects with cancer collected longitudinally during the course of nonsurgical cancer therapy. The database will also include phantom images of synthetic tumors and short-interval patient scans for the evaluation of the variance and bias of change analysis software algorithms. This project evolved from the Lung Image Database Consortium, which is finishing the creation of a publicly available database of annotated thoracic CT scans as a reference standard for the development, training, and evaluation of computer-aided diagnostic methods for lung cancer detection and diagnosis [14,15].
The RIDER project was initiated in 2005 as a collaboration among the National Cancer Institute's (NCI) Cancer Imaging Program, the NCI's Center for Bioinformatics, the National Institute of Biomedical Imaging and Bioengineering, and the Cancer Research and Prevention Foundation, with information technology support from the Radiological Society of North America. The RIDER project was designed and continues to evolve through a consensus process among members of the RIDER steering committee composed of academic researchers, program staff at NCI, and members of the Cancer Biomedical Informatics Grid, National Institute of Biomedical Imaging and Bioengineering, the Food and Drug Administration (FDA), and the National Institute of Standards and Technology. The broad purpose of the RIDER project is to 1) develop a public resource of serial (i.e., temporally sequential) images acquired during the course of various drug and radiation therapy trials across multiple centers so that change analysis software algorithms may be optimized and benchmarked before use in future trials and 2) enable the development of appropriate evaluation strategies for these new algorithms. The data that will be available to academic researchers and to the device and pharmaceutical industries will include images from CT, MRI, and PET/CT along with relevant metadata. Images of physical phantoms and patient images acquired under situations in which tumor size or biology is known to be unchanged (in which the “true” change is known to be zero) also will be provided and will play a key role in the assessment of software algorithm performance. The RIDER project will highlight the importance of creating standardized methods for benchmarking software algorithms to reduce sources of uncertainty in vital clinical assessments such as whether a specific tumor is responding to therapy.
The longer-term goal of RIDER is to help identify image-based biomarkers to measure cancer therapy response. Such biomarkers could potentially be adopted in clinical trials submitted to the FDA for regulatory approval. Further, such image-based biomarkers could be used to more easily validate other biomarker algorithms in development, such as those from genomics, proteomics, or metabolomics projects. In addition, the Centers for Medicare and Medicaid Services seek evidence to support informed reimbursement decisions for image-based biomarkers that may eventually be used clinically. Consequently, the RIDER project is expected to accelerate 1) FDA approval of both software-based response assessment algorithms and therapeutic agents evaluated through clinical trials that use such algorithms and 2) reimbursement by the Centers for Medicare and Medicaid Services for subsequent therapeutic decisions made using such software algorithms. (Text for General Problem Description and History Review is an edited version of Armato et al.)
Early detection of tumor response to therapy is a key goal. Finding a measurement algorithm capable of early detection of tumor response could individualize therapy treatment as well as reduce the cost of bringing new drugs to market. On an individual basis, the urgency arises from the desire to prevent continued treatment of the patient with a high-cost and/or high-risk regimen with no demonstrated individual benefit and rapidly switch the patient to another therapy that may increase treatment efficacy for that patient. In the context of bringing new drugs to market, such algorithms could demonstrate efficacy in much smaller subject populations, which would allow phase 3 trials to achieve statistically significant decisions in shorter durations with fewer subjects.
The emphasis placed on the word “early” implies that most interest exists near the measurement regime of zero change, that is, the detection of truly small changes from whatever algorithm and parameter set is used and measured. Given that a tumor has a change trajectory over time, for a first-order approximation valid for a short interval, we need only the first two terms of a Taylor series to model the change trajectory, that is, the nodule's current state and its initial time rate of change. Clearly, the patients' oncologists in cases of an individual's health care, or clinical trial designers in the case of drug efficacy studies, are motivated to choose the smallest imaging interval that accurately (ratio of true calls over all calls) assesses the presence or absence of real change. From detection theory, we understand that our ability to detect small changes rests on the signal-to-noise ratio (SNR), alternatively described as effect size to variance. As this ratio increases, we migrate from the condition of being able to detect changes in large populations by averaging to the condition of using fewer subjects until we are able to detect such changes in an individual with clinically useful statistical accuracy. Whereas in most cases we can increase the effect size and thus improve the SNR by increasing the interval between imaging examinations, we would much prefer to use as short an interval as possible. In the limit as the interval between imaging examinations approaches zero, we see that we are indeed operating near the regime of zero tumor change, and it is the noise (variance) in this regime that limits our ability to see small real changes (effect size) in short-interval examinations.
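The trade-off between imaging interval and SNR can be illustrated with a minimal sketch. It assumes the first-order (two-term Taylor) model described above and a single noise SD for the measured change; the function names and all numbers are hypothetical and are not part of the RIDER project.

```python
def predicted_volume(v0, rate, dt):
    """First-order Taylor approximation: V(t0 + dt) ~ V(t0) + V'(t0) * dt."""
    return v0 + rate * dt


def change_snr(rate, dt, sigma_change):
    """SNR of a change measurement: expected change divided by the noise SD.

    sigma_change is the SD of the measured change when the true change is zero.
    """
    return abs(rate) * dt / sigma_change


def min_interval(rate, sigma_change, target_snr):
    """Shortest interval at which the expected change reaches the target SNR."""
    return target_snr * sigma_change / abs(rate)


# Hypothetical tumor: 1000 mm^3, shrinking 5 mm^3 per day, noise SD 35 mm^3.
print(predicted_volume(1000.0, -5.0, 14))  # volume expected after 14 days
print(change_snr(-5.0, 14, 35.0))          # SNR of a 14-day change measurement
print(min_interval(-5.0, 35.0, 2.0))       # days needed to reach SNR = 2
```

Under these assumptions, halving the noise SD halves the interval needed to reach a given SNR, which is why the variance near zero change is the quantity of interest.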
Given the large task required to implement these measurements across a broad spectrum of algorithms and measurement parameters in search of optimal combinations, this consensus group of authors addressing issues facing construction of the RIDER database has focused on ways of estimating a measurement algorithm's noise, that is, variance, under the condition of no change across several modalities and measurement techniques. Arguably, the most realistic and useful data sets representing zero change come from subjects with known tumors who are imaged, removed from the scanner, and then rescanned within a very short time frame. We refer to these interval examinations as “coffee break” examinations. These data sets then contain all of the realities of imaging within a short time window with whatever modality was used, that is, imaged tissue contrast-to-noise, patient motion artifacts, repositioning errors, and so on, that will be encountered in the real world. In addition, because the time interval between these scans is on the order of hours or less, we can safely assume that we know the truth, that is, there are no macroscopic changes to the tumor in the interval between these two examinations. Therefore, measurement of nonzero change by any algorithm using these coffee break data sets is an error. Note that data sets with expert annotations are not used as truth due to their demonstrated variability in segmentation and thus lack of certainty in associated change assumptions [9,17]. An alternative to collecting these coffee break examinations that contain all sources of short-term noise for an estimate of the null hypothesis against which treatment effects must be compared is the collection of a large database of treatment trials along with clinical end points that can be modeled to determine the sources of potentially multiparametric covariance; we suggest that the collection of coffee break data may be far more efficacious at much lower collection cost.
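As a minimal sketch of how coffee break pairs might be summarized, the following computes the mean (an estimate of bias) and SD (an estimate of noise) of the relative volume change an algorithm reports when the true change is zero. The function names and the example volumes are hypothetical.

```python
from statistics import mean, stdev


def relative_changes(pairs):
    """Relative volume change (v2 - v1) / v1 for each (scan 1, scan 2) pair."""
    return [(v2 - v1) / v1 for v1, v2 in pairs]


def null_change_summary(pairs):
    """Mean and SD of measured change when the true change is known to be zero.

    Any nonzero mean is bias; the SD is the measurement noise of the algorithm.
    """
    changes = relative_changes(pairs)
    return mean(changes), stdev(changes)


# Hypothetical volumes (mm^3) measured by one algorithm on three rescanned tumors.
pairs = [(100.0, 110.0), (200.0, 180.0), (50.0, 50.0)]
bias, noise_sd = null_change_summary(pairs)
print(bias, noise_sd)
```

Running the same summary for each competing algorithm on the same pairs gives a direct comparison of their noise under the zero-change condition.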
Many parameters could be exploited to measure tumor change. There are physical parameters that already have either established or suggested relationships to cancer including density, diffusivity, and elastic moduli. In addition, there are shape and composition parameters including volume, spicularity (typically quantified as the ratio of surface area to volume), heterogeneity, and vascularity (typically quantified as number of vessels intersected per unit area in a histology section). The following sections describe methodologies only for assessing the accuracy of measuring tumor volume change as rendered in anatomic imaging. In the simplest case, the same techniques can be used for assessing accuracy of measuring other parameters, but should these parameters have interactions, the measurement methods will require the use of multiparametric estimators such as generalized linear models (GLMs) potentially including mixed effects models that are not addressed herein.
By way of introduction to the problem of measuring volume change of tumors, we describe three of possibly many methods of implementing such measurements. The purpose here is to show the generality of the possible solutions as well as to view the following discussion from a common viewpoint, that is, primarily that of the quantification of tumor volume change. Consider the following two of many possible methods for estimating tumor volume change:
In the following discussion of volume change, we are fundamentally addressing directly the problem of quantifying volume change. When we discuss random error variance or bias, we are not referring to just the segmentation problem that may or may not precede more sophisticated estimates of volume change but rather to the entire change analysis methodology.
In every problem, we face two basic components of error:
For tumor change measurements, we will begin with the assumption that for similar physical imaging characteristics and subject setup, the variance and bias estimates are likely dependent on the size of the tumor as well as its complexity which includes factors such as heterogeneity, shape, and location; specifically, the derived parameters that describe each of the errors may in general be a function of these enumerated independent parameters.
In most experiments, we observe both effects simultaneously as they are not easily separated and only through the collection of sufficient data and the use of statistical analysis techniques such as GLMs with selected mixed effects are we able to separate estimates of error components. Such models are especially important when the measured quantities are truly changing with time. The modeling is complicated by having to choose the specific mixed effects and degrees of freedom (DOF). Owing to the model's large DOF, the amount of test data needed and collected under known conditions also increases. When a single measured quantity is stationary, as in the section on Estimating Variance and Bias for the Case of No Volume Change, we can also approach the problem as a simpler, ordered discovery of the two separate components.
In the following discussion, we will describe an ordered quantification of both random error and bias around the operating point of no change. This is a crucial operating point because in many practical clinical applications, we wish to discover real change in as short a time interval as possible to affirm or refute the assumption that the applied therapy is effective. The urgency arises from the desire to prevent continued treatment of the patient with a high-cost and high-risk regimen with no demonstrated benefit as well as from the need to rapidly switch an individual patient to another therapy that may increase individual efficacy. Thus, measurement noise observed in the case of no change for a specific patient is a sample of the null hypothesis that must be quantified before we can determine with some stated probability that any measured change represents true change.
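One way to attach a stated probability to a measured change is sketched here. It assumes, as a simplification not made in the text, that the zero-change measurements follow a Gaussian distribution with a mean and SD estimated beforehand; the function name and numbers are hypothetical.

```python
from statistics import NormalDist


def change_p_value(measured_change, null_mean, null_sd):
    """Two-sided probability of a change at least this extreme under no change.

    Assumes Gaussian measurement noise with the given null mean and SD.
    """
    z = (measured_change - null_mean) / null_sd
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical: an algorithm with zero bias and a 10% noise SD reports -25%.
print(change_p_value(-0.25, 0.0, 0.10))
```

A small value suggests that the measured change is unlikely to be noise alone; the smaller the null SD of the algorithm, the smaller the change that can be called real.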
Under the simplifying assumption that the random error is additive, we can estimate its variance by using input data sets where we know the underlying truth is no change. There are two main types of experiments to be considered here: “coffee break” studies, that is, very short interval examinations, and longitudinal studies, both used for gathering input data sets from which we can estimate variance.
Although we are limiting our consideration to the measurement of change near the operating point of no change, we need to make the measurement of variance for tumors of differing sizes. There is significant evidence from manual and semiautomatic segmentation that SD and therefore variance is a function of tumor size. Thus, we need a source of truth data, for example, coffee break examinations, which contain a spectrum of scanned tumor sizes to characterize the performance of the change measurement analysis for different size tumors.
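A minimal sketch of characterizing noise as a function of tumor size is to bin zero-change pairs by size and compute the SD within each bin. The binning scheme, the function name, and the volumes below are illustrative assumptions only.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import stdev


def sd_by_size(pairs, bin_edges):
    """SD of relative volume change per size bin for zero-change scan pairs.

    pairs: (volume_scan1, volume_scan2) tuples; bin_edges: ascending volumes.
    Bin 0 holds sizes below the first edge, bin 1 the next range, and so on.
    Bins with fewer than two pairs are omitted because an SD needs two values.
    """
    groups = defaultdict(list)
    for v1, v2 in pairs:
        # The first scan is used as the size and as the baseline for change.
        groups[bisect_right(bin_edges, v1)].append((v2 - v1) / v1)
    return {b: stdev(c) for b, c in sorted(groups.items()) if len(c) >= 2}


# Hypothetical volumes (mm^3): two small tumors and two large tumors.
pairs = [(100.0, 110.0), (120.0, 108.0), (1000.0, 1010.0), (1200.0, 1188.0)]
print(sd_by_size(pairs, bin_edges=[500.0]))
```

In this made-up example the small tumors show roughly ten times the relative noise of the large ones, the kind of size dependence the text anticipates.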
Because sample variance is a noisy measurement of the underlying distribution's variance, we will need many measurements of tumors with no size change. There are two possible approaches to increase the number of observations of variance to approximate the variance of the underlying distribution.
The estimate of the random error's variance may be sensitive to the estimator used, particularly in case of an error in classifying a tumor as having no size change. For example, for measurements $X_i$, $i = 1, \dots, n$, with mean $\bar{X}$, the obvious estimator is the sample variance determined by:
$$\hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2 \qquad (1)$$
which is exactly the same estimator as
$$\hat{\sigma}^2 = \frac{1}{n(n-1)} \sum_{i<j} \left( X_i - X_j \right)^2 \qquad (2)$$
a U statistics-based estimator as suggested in (c) above. However, estimators (1) and (2) are valid only under the assumption that there is no change and will be biased if the tumor varies with time. Other model-based methods are more appropriate should the tumor's volume vary with time. For long-term clinical surveillance studies of slowly varying nodules, the estimator should be less heavily biased and can be justified based on simple assumptions on the nodule growth and the homogeneity of the variance:
Robust estimation is especially useful for small data sets when outliers may make a big difference. Huber and Hoaglin et al. give extensive discussion of the pros of robust estimation in practice.
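The estimators discussed above can be compared in a short sketch. The sample variance and its pairwise (U statistic) form are standard. The successive-difference estimator and the median absolute deviation (MAD) estimator are assumptions chosen here as common examples of a drift-tolerant and a robust estimator, respectively; they are not necessarily the ones the authors had in mind.

```python
from itertools import combinations
from statistics import median


def sample_variance(x):
    """Classical sample variance with the n - 1 denominator."""
    n = len(x)
    m = sum(x) / n
    return sum((xi - m) ** 2 for xi in x) / (n - 1)


def pairwise_variance(x):
    """U-statistic form: average of (xi - xj)^2 / 2 over all unordered pairs."""
    n = len(x)
    return sum((a - b) ** 2 for a, b in combinations(x, 2)) / (n * (n - 1))


def successive_difference_variance(x):
    """Half the mean squared difference between consecutive measurements.

    Assumed here as one estimator that is less sensitive to slow drift.
    """
    n = len(x)
    return sum((x[i + 1] - x[i]) ** 2 for i in range(n - 1)) / (2 * (n - 1))


def mad_variance(x):
    """Robust variance from the median absolute deviation (MAD).

    The factor 1.4826 makes the MAD consistent with the SD for Gaussian data.
    """
    med = median(x)
    mad = median([abs(xi - med) for xi in x])
    return (1.4826 * mad) ** 2


drift = [1.0, 2.0, 3.0, 4.0]                  # noiseless linear growth
outlier = [-2.0, -1.0, 0.0, 1.0, 2.0, 50.0]   # one grossly wrong measurement
print(sample_variance(drift), pairwise_variance(drift))
print(successive_difference_variance(drift))
print(sample_variance(outlier), mad_variance(outlier))
```

For the noiseless drifting series, the two classical forms agree with each other but report a large variance that is entirely due to the trend, whereas the successive-difference form reports a smaller value. For the series with one outlier, the MAD-based value is far smaller than the sample variance.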
Once the random error's variance has been approximated, we can explicitly compute the number of observations we must obtain to test for the presence of a bias effect at some stated level of significance. The number of observations (experiments performed to measure the bias) depends on the variance of the previously determined random error distribution, the size of the bias effect we wish to measure and the probability that we will measure such an effect, that is, reject the null hypothesis, at a stated level of confidence. The required number of observations, that is, measurements of volume change, increases as
The measurement of bias is important because if present it will lead to a propensity for false-positives/negatives, depending on whether the change measurement bias is positive/negative, respectively. As the name implies, bias is a systematic error whose cause can be discovered and removed or at least modeled and ameliorated.
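The dependence of the required number of observations on noise and bias can be sketched with the standard normal-approximation formula for a two-sided one-sample test, n = ((z_{1-α/2} + z_{power}) σ / δ)². This formula is a textbook approximation used here as an assumption; the function name, default values, and example numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def observations_needed(sigma, bias, alpha=0.05, power=0.80):
    """Approximate number of change measurements needed to detect a bias.

    sigma: SD of the random error, estimated beforehand.
    bias:  smallest bias (same units as sigma) that should be detected.
    alpha: two-sided significance level; power: probability of detection.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return ceil(((z_alpha + z_power) * sigma / bias) ** 2)


# Hypothetical: noise SD of 10% relative volume change.
print(observations_needed(sigma=0.10, bias=0.05))   # detect a 5% bias
print(observations_needed(sigma=0.10, bias=0.025))  # detect a 2.5% bias
```

Under this approximation, the count grows with the variance and with the inverse square of the bias to be detected, so halving the detectable bias roughly quadruples the number of measurements.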
The determination of bias and variance in the presence of true volume change is needed if we want a completely generalized statistical characterization of a measurement method. Note that if we want to quantify volume change, not simply determine whether there was or was no change, the truth data required for this task are more difficult to obtain. Because estimates of bias and variance in volume change may be dependent on tumor volume as well as tumor volume change (along with other characteristics such as shape, type, acquisition/reconstruction protocols, and possible motion), we will want to regress both bias and variance as a function of both tumor volume and tumor volume change through GLMs. Truth data for this task can only be known from manufactured phantoms; a method for obtaining volume change truth for real tumors is difficult and has yet to be defined for RIDER. Here, the “coffee break” null change paradigm for real patient scan data is of little use because the “truth” of tumor size is not known (only the null change in tumor size is known); instead, we need estimates of true change from other accurate sources.
The key issue is that we currently have no measurement method that will provide the true change in size of an actual tumor. For real tumors that do change in size between interval scans, we are restricted to using image measurements made by expert radiologists, and this measurement method is itself subject to bias and random error [9,17]. The only way we can obtain scans with known truth for size change is to scan manufactured phantoms with known tumor characteristics and different sizes or to embed simulated, mathematically defined tumors in actual patient scan data; the critical concern here is how well such phantoms represent real tumors and their growth. To summarize:
In the preceding section, methods for assessing the relative performance of algorithms specifically for measuring tumor volume change for the purpose of early assessment of tumor response to therapy were discussed. Whereas the use of volume was explicitly examined, we could use exactly the same techniques to examine any other single parameter, for example, average mass, elasticity, etc., and the same techniques for assessment of performance would apply, that is, the measurement of variance and bias. There is, however, an explicit difference between volume and most other single parameters: volume is necessarily a singular, summary parameter whereas other parameters have tumor-dependent, heterogeneous spatial distributions of values within that volume which can be characterized in several ways including a one-dimensional histogram of its values and the histogram's summary statistics, that is, mean, variance, skew, kurtosis, and other higher moments.
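The histogram summary statistics named above can be computed directly from the voxel values inside a VOI. This is a minimal sketch using population (1/n) moments; the function name and the sample values are hypothetical.

```python
def histogram_moments(values):
    """Mean, variance, skew, and kurtosis of voxel values within a VOI.

    Uses population (1/n) central moments. Kurtosis is not excess kurtosis,
    so a Gaussian distribution gives 3. Requires nonzero variance.
    """
    n = len(values)
    mean = sum(values) / n

    def central(k):
        return sum((v - mean) ** k for v in values) / n

    variance = central(2)
    skew = central(3) / variance ** 1.5
    kurtosis = central(4) / variance ** 2
    return mean, variance, skew, kurtosis


# Hypothetical parameter values sampled from five voxels of a tumor.
print(histogram_moments([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Tracking these four numbers across interval examinations is one simple way to follow a heterogeneous parameter that a single volume figure cannot describe.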
The function of the sparsely filled Table 1 is to demonstrate the relative relationships of some different outcomes analysis methods and computed parametric models previously contributed to NCI's public archive https://imaging.nci.nih.gov/ncia/, now called the National Biomedical Imaging Archive, through the efforts of previous RIDER groups as well as a few related methods previously published.
As seen in the rows of the Outcomes Analyses, Table 1, most processing is first subjected to segmentation, that is, defining the volume of interest (VOI) for further processing as the volume of clinical interest. Registration commonly follows segmentation in that registration of the whole, complex set of organ systems is very computationally intensive and challenging given that some organs deform and slide along slip walls, for example, lung compression and slippage along the pleural surface of the rib cage. Thus, registration of the lung alone is far simpler than attempting to register the lung and chest wall simultaneously owing to the discontinuity of velocity vectors at the pleural surface. Hence, registration of a segmented lesion with itself across interval examinations is typically preferred.
After segmentation and registration, differing outcomes analyses are applied to the following potential change descriptors for detecting/measuring response to therapy:
Instead of using manually drawn multiple VOIs as the only means to approximate following spatial changes in the same tumor across interval examinations, registration of the interval data sets can be implemented to reduce the increased variance associated with manual misplacement of VOIs drawn on interval examinations. More importantly depending on the accuracy of its implementation, registration is capable of supporting voxel-by-voxel change analysis. After registration, a single VOI may be used on registered interval examinations to limit voxel-by-voxel analysis to, or generate summary statistics from, registered differences considered important by the investigator. Such registered differences may come from the same enhancing region as used to define a VOI on a registered T1-postGad series that has been mapped, that is, warped to, the series (one or more) of analytical interest. Summary statistics, typically mean and variance, can be compiled
Defining truth in realistic, complex data sets is difficult except in the case of multiple, short-interval examinations, that is, coffee break examinations. Previously, expert physician annotations have been the accepted standard, but recent studies suggest that even among recognized experts, the variance of annotations is large enough that the expense for obtaining sufficient data to observe small standard error of the mean expert trends is prohibitive. Thus, sources of truly known imaging “truth” in realistic, complex settings are invaluable. Despite the costs of scanning including increased radiation burden for CT, PET, and SPECT, such known truth can be obtained from multiple short-interval examinations on consented patients where the known truth is that no macroscopic change in the tumor can have occurred in the sufficiently short interval between scans. The use of interval examinations having uncorrelated or even partially correlated noise contributes to the knowledge of the covariance of the null hypothesis distribution and thus allows probabilistic limits to be set on the chance that the observed outcome represents real change versus noise.
Without such data, the only apparent other option for investigators is to gather large population databases with low-noise outcome measures of truth such as length of disease-free survival or complete pathologic response. These databases typically need to be accumulated from a minimum of 50 subjects so that part of the database may be used for training the change detection algorithm and the remainder of the database may be used for testing through application of the tuned algorithm; bootstrapping may be used [56,57]. For univariate data, the test typically consists of finding an optimal cut point using receiver operating characteristic (ROC) analysis on the training data set component of the database, which can then be applied to a separate test set for unbiased assessment of performance [58,59]. Much larger databases (e.g., 200 or more subjects) are typically required to achieve sufficiently small confidence limits for the area under the curve (or Az) of the ROC to see statistically significant changes between competing algorithms. The number of subjects required is large because experimental truth gathered outside carefully designed clinical trials is itself noisy because clinical treatments across multiple subjects are typically not uniform owing to differing surgical and other unplanned life-saving interventions that can significantly alter individual patient outcomes but obfuscate the effects of the initial therapy and associated image-based change analysis used to define early therapeutic response. For multivariate analyses, ROC techniques can be supplanted by use of GLM regression.
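The univariate train-and-test procedure described above can be sketched as follows. The area under the ROC curve is computed by the pairwise (Mann-Whitney) rule, and the cut point is chosen on the training set by maximizing the Youden index (sensitivity + specificity - 1). The Youden criterion is an assumption, since the text does not say how the optimal cut point is defined; the function names and scores are hypothetical.

```python
def auc(pos, neg):
    """Area under the ROC curve: chance a responder outscores a nonresponder."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cut(pos, neg):
    """Cut point maximizing sensitivity + specificity - 1 on training scores."""
    best_cut, best_j = None, -1.0
    for c in sorted(set(pos) | set(neg)):
        sensitivity = sum(p >= c for p in pos) / len(pos)
        specificity = sum(n < c for n in neg) / len(neg)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_cut, best_j = c, j
    return best_cut


def accuracy(pos, neg, cut):
    """Fraction of subjects classified correctly at a fixed cut point."""
    correct = sum(p >= cut for p in pos) + sum(n < cut for n in neg)
    return correct / (len(pos) + len(neg))


# Hypothetical change scores; larger means more measured response.
train_pos = [0.2, 0.4, 0.5, 0.6, 0.7]   # subjects with good outcome
train_neg = [0.0, 0.1, 0.15, 0.3]       # subjects with poor outcome
cut = youden_cut(train_pos, train_neg)
print(auc(train_pos, train_neg), cut)

# The cut point is then frozen and applied to a separate test set.
print(accuracy([0.45, 0.55, 0.35], [0.05, 0.2, 0.1], cut))
```

Keeping the test subjects out of the cut point selection is what makes the final accuracy an unbiased estimate of performance.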
The concluding assumption is that it is likely more cost-effective to collect multisubject, multiple short-interval coffee break examinations on which we can measure the variance for the null condition across competing algorithms for detecting early change, than it is to essentially perform a small (~200 subjects) phase 2 study to obtain the necessary database to test the algorithms for efficacy. The noise in these larger studies may also be increased owing to a multitude of other clinical and multi-institutional factors not rigorously controlled such as variation in scanners across institutions, slight variations in scanning protocols, medications, and so on.
In estimating variance and bias in the measurement of tumor volume change around the operating point of no volume change observed with coffee break data sets, the number of measured parameters is likely dependent on a number of factors including tumor size, structural complexity (e.g., inhomogeneity, shape, surroundings), the extent of change (in terms of size and morphology), and possible changes in scanner parameters between scans; that is, the problem space is of high dimensionality with respect to lesion size and complexity and scanner settings. Given realistic limitations on the number of available, finite input data sets, it will be necessary to assess complexity and control the number of variables to obtain statistically meaningful results. Investigators should initially consider pilot experiments that focus on a small number of selected points in this problem space (e.g., well-defined lesions of clinically meaningful size and size change with well-defined margins and very similar scanner parameters). Such experiments should provide insight on how to conduct experiments to characterize error for real lesions in the larger problem space.
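A minimal sketch of how bias and variance at the no-change operating point might be summarized from coffee break pairs is given below. The volumes are invented for illustration, the function name is hypothetical, and the 1.96 multiplier assumes approximately normally distributed measurement error, which would need to be checked on real data.

```python
import numpy as np


def null_change_stats(v_first, v_second):
    """Summarize percent volume change between two scans with no true change.

    Returns (bias, sd, (lower, upper)), where the limits are
    bias +/- 1.96 * sd under an assumed normal error model.
    """
    v1 = np.asarray(v_first, dtype=float)
    v2 = np.asarray(v_second, dtype=float)
    pct_change = 100.0 * (v2 - v1) / v1
    bias = float(pct_change.mean())      # systematic error around zero change
    sd = float(pct_change.std(ddof=1))   # sample standard deviation (noise)
    return bias, sd, (bias - 1.96 * sd, bias + 1.96 * sd)


# Hypothetical tumor volumes (mm^3) from the same lesions scanned twice,
# measured by two competing algorithms.
algo_a_first = [1200.0, 850.0, 2300.0, 640.0, 1500.0]
algo_a_second = [1230.0, 835.0, 2340.0, 655.0, 1480.0]
algo_b_first = [1190.0, 870.0, 2250.0, 600.0, 1550.0]
algo_b_second = [1300.0, 800.0, 2450.0, 690.0, 1400.0]

for name, first, second in [("A", algo_a_first, algo_a_second),
                            ("B", algo_b_first, algo_b_second)]:
    bias, sd, limits = null_change_stats(first, second)
    print(f"algorithm {name}: bias={bias:+.2f}%  sd={sd:.2f}%  "
          f"limits=({limits[0]:+.2f}%, {limits[1]:+.2f}%)")
```

Under these assumptions, the algorithm with the narrower limits would be the one able to flag a smaller measured change as unlikely to be noise.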
In quantifying the accuracy of measuring true lesion change, we will also need to know bias at “operating points” other than the no-volume change point described immediately above. As a first-order approximation, we can scan simple spherical phantoms of known volumes, and by using different combinations of phantom tumors for “early” and “late” interval pairs, we can evaluate multiple combinations of volume and volume change operating points. Both low and high contrast-to-noise acquisitions should be examined. Because the variance of the random error component for these measurements should be relatively small owing to the structural simplicity compared with the coffee break examinations, small bias should be relatively easier to observe. Irregularly shaped tumor phantoms, for example, those with spiculations and random orientations in the field of view, could be used to increase the complexity of the phantom measurement setting to more closely approximate outcomes in real data sets.
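The pairing of phantoms into "early" and "late" examinations can be sketched as follows. All volumes are invented, the function name is hypothetical, and the example assumes each phantom was scanned and measured twice so that the early and late measurements are independent acquisitions.

```python
from itertools import product


def bias_by_operating_point(true_volumes, measured_early, measured_late):
    """Tabulate bias in measured volume change for every early/late pairing.

    Keys are (early true volume, true volume change); values are
    measured change minus true change, in the same units as the inputs.
    """
    results = {}
    indices = range(len(true_volumes))
    for early, late in product(indices, repeat=2):
        true_change = true_volumes[late] - true_volumes[early]
        measured_change = measured_late[late] - measured_early[early]
        results[(true_volumes[early], true_change)] = measured_change - true_change
    return results


# Hypothetical spherical phantoms (mm^3): known volumes and two
# independent measurements of each phantom by the algorithm under test.
true_volumes = [500.0, 1000.0, 2000.0]
measured_early = [520.0, 1015.0, 2030.0]
measured_late = [515.0, 1020.0, 2010.0]

table = bias_by_operating_point(true_volumes, measured_early, measured_late)
for (volume, change), bias in sorted(table.items()):
    print(f"early volume={volume:7.1f}  true change={change:+8.1f}  bias={bias:+6.1f}")
```

With three phantoms this yields nine operating points, including the three no-change points on the diagonal; repeating the table for low and high contrast-to-noise acquisitions would extend it along the lines suggested above.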
All of the points discussed in the preceding paragraphs are valid for this topic as well. Additionally for heterogeneous tumors, change analysis based on one-dimensional histogram summary statistics accumulated over the volume of interest of the tumor may be misleading. Consider trying to measure response changes in a heterogeneous tumor where the changes over different regions of the tumor both increase and decrease with respect to the parameter's mean such as might be observed where therapeutic intervention is successful in some compartments while tumor growth temporarily continues in other, more isolated compartments. Under such conditions, changes in the summary statistic (mean or other moments such as variance, skew, and kurtosis) would be attenuated, and the detection of such changes, if present, would be less likely. Under these conditions, tracking of changes in individual tumor compartments supported by voxel-by-voxel change analysis has the possibility of demonstrating such confounding effects.
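The attenuation of summary statistics in a heterogeneous tumor can be illustrated with a small synthetic example. The parameter values, the two-compartment layout, and the change threshold are assumptions chosen for illustration, and the sketch presumes the two examinations are already perfectly registered voxel to voxel.

```python
import numpy as np

# Synthetic baseline parameter map for a 20 x 20 x 20 voxel volume of interest.
rng = np.random.default_rng(1)
baseline = rng.normal(100.0, 5.0, size=(20, 20, 20))

# Follow-up: one compartment responds (parameter falls), the other
# continues to progress (parameter rises) by the same hypothetical amount.
follow_up = baseline.copy()
follow_up[:10] -= 15.0   # responding half of the tumor
follow_up[10:] += 15.0   # progressing half of the tumor

# Summary statistic over the whole volume of interest: changes cancel.
mean_change = follow_up.mean() - baseline.mean()

# Voxel-by-voxel change analysis: both compartments are apparent.
voxel_change = follow_up - baseline
threshold = 10.0  # assumed threshold; in practice set from the measured null-condition noise
frac_decreased = float(np.mean(voxel_change < -threshold))
frac_increased = float(np.mean(voxel_change > threshold))

print(f"change in mean: {mean_change:+.4f}")
print(f"fraction of voxels decreased: {frac_decreased:.2f}")
print(f"fraction of voxels increased: {frac_increased:.2f}")
```

In this contrived case the mean is essentially unchanged even though every voxel changed, whereas the voxelwise fractions show half of the tumor responding and half progressing.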
The initial focus of the search for algorithms that provide early image-based markers of tumor change in response to therapy will likely use
In addition to discovering which algorithms are low-noise estimators of tumor change and thus optimally suited to detect early change, in practice
The motivation for these issues and modality-specific issues are discussed in more detail in the editorial and three companion articles of this issue [60–63]; modality-independent issues are enumerated in the Appendix.
The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government.
There are many factors common to most medical imaging modalities that affect our ability to measure tumor change with little bias or variance error. In this appendix, we enumerate some of those factors in the context of measuring tumor volume change and discuss possible methods of mitigation. We list them here as separate factors with the understanding that there are likely significant interactions between different sources of random error, that is, their factors will have nonzero covariance. Generalization of the principles discussed here to the measurement of parameter changes other than volume is relatively straightforward. Modality-specific examples are presented in the companion articles [61,62,63].
Each measurement method for quantifying tumor volume change, whether based broadly on subtracted segmentations or registration, will likely have its own characteristic variance and bias. The following describes many of the variables that affect each of the two different volume change estimation methods:
The level of breath hold can be partially or fully controlled in several ways, ranging from asking the patient to hold their breath, for example, at full inspiration (an example of partial control), to measuring tidal phase through a flow meter that actuates a valve to enforce breath holding. Other possibilities include cardiac and respiratory gating of the image acquisition system or, where provided by the vendor, list mode acquisition followed by gated reconstruction and registration of the differently gated cycles.
The basic principle in change analysis is that whenever possible, keep all potential sources of bias and variance unchanged between interval examinations, that is, use the same segmentation method for both interval examinations, and if the segmentation method is semiautomatic or fully automatic, continue to use the same tuning parameters for both examinations. Use the same scanner with the same technical protocol and consistent patient factors (contrast dose, rate of delivery, flush, injection site, breath hold, table position, etc.). The scanner should have a rigorous quality assurance program in place to ensure consistent performance, and technical protocols should be user-locked.
1 This project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under contract N01-CO-12400.