We use changes in tumor measurements to assess response and progression, both in routine care and as the primary objective of clinical trials. However, the variability of computed tomography (CT) –based tumor measurement has not been comprehensively evaluated. In this study, we assess the variability of lung tumor measurement using repeat CT scans performed within 15 minutes of each other and discuss the implications of this variability in a clinical context.
Patients with non–small-cell lung cancer and a target lung lesion ≥ 1 cm consented to undergo two CT scans within a period of minutes. Three experienced radiologists measured the diameter of the target lesion on the two scans in a side-by-side fashion, and differences were compared.
Fifty-seven percent of changes exceeded 1 mm in magnitude, and 33% of changes exceeded 2 mm. Median increase and decrease in tumor measurements were +4.3% and −4.2%, respectively, and ranged from 23% shrinkage to 31% growth. Measurement changes were within ± 10% for 84% of measurements, whereas 3% met criteria for progression according to Response Evaluation Criteria in Solid Tumors (RECIST; ≥ 20% increase). Smaller lesions had greater variability of percent measurement change (P = .005).
Apparent changes in tumor diameter exceeding 1 to 2 mm are common on immediate reimaging. Increases and decreases less than 10% can be a result of the inherent variability of reimaging. Caution should be exercised in interpreting the significance of small changes in lesion size in the care of individual patients and in the interpretation of clinical trial results.
Tumor imaging plays a fundamental role in oncology care and clinical trials, where computed tomography (CT) –based tumor measurement is a primary mechanism for determining response to therapy and time to treatment failure. Improvements in imaging technology mean that detailed tumor measurements are increasingly available for use in clinical decisions. Tumor measurement is commonly performed using a computer interface, which allows radiologists to provide precise measurements. In clinical research, there is increasing interest in specific measurement changes as it has become clear that broad response categories such as Response Evaluation Criteria in Solid Tumors (RECIST) sometimes fail to capture the activity of novel therapies.1 Phase II trials in particular increasingly present waterfall plots showing individual measurement changes for each patient and, thus, treat response as a continuous rather than categorical variable.2 Given this increased interest in quantitative tumor measurement, it becomes important to understand what measurement changes are meaningful rather than a result of variability of imaging and measurement.
Multiple factors can contribute to the variability of CT-based tumor measurement. A component of this variability is operator dependent, such as the process of selecting which CT slice to measure and the placement of a linear measurement using a computer interface. Several studies have tried to quantify this variability by having radiologists perform repeat measurements on a single set of CT scans.3–6 Separately, the process of performing a CT scan can lead to changes in the appearance of a tumor or surrounding stroma. Lastly, it has been found that the technique used for processing CT imaging data can contribute to measurement variability.7–9 However, prior studies assessing measurement variability have studied only one step of the imaging process; although these studies may be useful for investigations into reducing measurement variability, they do not have direct applicability to the clinical environment.
In this study, we attempt to quantify the variability of the complete CT measurement process. By imaging patients with lung tumors twice within several minutes and then measuring both images in a similar fashion, we gain an opportunity to quantify the variability inherent in both CT image capture and in the subsequent measurement technique. Prior rescan studies have had limited applicability to clinical oncology because these have looked at small lung nodules of unclear malignant potential.9–11 Additionally, we previously published a rescan analysis of the clinical study reported here that compared manual measurement to semi-automated measurement; however, radiologist measurements were made independently without allowing comparison of baseline and follow-up images.12 In the present study, by replicating the clinically standard side-by-side measurement process on two separately obtained scans, we provide an optimal framework for the interpretation of clinical and research measurement results. We set out to obtain data that would allow clinicians to answer two basic types of imaging questions, demonstrated in the following examples:
Patients with non–small-cell lung cancer (NSCLC) receiving systemic therapy consented to this Institutional Review Board–approved rescan study (ClinicalTrials.gov identifier NCT00579852). Patients were accrued in clinic by their treating oncologist. Patients were eligible for participation if a noncontrast CT scan of the chest was clinically indicated and their most recent CT scan report described a lung lesion ≥ 1 cm in diameter; there was no radiologist review before enrollment. Sample size calculation in the study protocol was based on the ability to detect a concordance correlation coefficient between the baseline and follow-up measurements of at least 0.75.
Patients received their initial noncontrast chest CT scan per clinical routine with either a 16-detector or 64-detector scanner during a breath hold. On completion, the patient was instructed to leave the scanner briefly before returning for a second scan obtained in the identical fashion on the same scanner. For each scan, the craniocaudal extent of the scan was separately determined with a scout image of the patient. Parameters for the 16- and 64-detector scanners were as follows: tube voltage, 120 kV (peak) and 120 kV (peak); tube current, 299 to 441 mA and 298 to 351 mA; detector configuration, 16 detectors × 1.25 mm section gap and 64 detectors × 0.63 mm section gap; and pitch, 1.375:1 and 0.984:1. The thoracic images were obtained without intravenous contrast material. Images were reconstructed with 1.25-mm nonoverlapping slice intervals and a sharper convolution kernel and stored in Digital Imaging and Communications in Medicine (DICOM) format. Patient identifiers were removed from the DICOM headers of all CT images analyzed for this study. Through collaboration with the National Cancer Institute, deidentified images have been placed in the public domain as a resource for further investigations and can be accessed at the National Biomedical Imaging Archive.
CT images were viewed at a computer workstation by three experienced radiologists (M.S.G., P.G., and R.A.L.). Radiologists were first asked to perform a measurement of maximum diameter for a selected target lesion on the first scan of each of the 33 patients. Measurements were performed manually using a computer interface. After making this baseline measurement, each radiologist viewed the second scan side-by-side with the first and was asked to measure this follow-up scan in a similar fashion. Radiologists were blinded from knowing how much time had passed between the two scans. The six measurements performed on each patient (by three radiologists on each of two different scans) were averaged to calculate the approximate size of each lesion.
The change in size between the two measurements of each tumor by each radiologist was calculated. Because it is unlikely that any real change in tumor burden occurred in the minutes between the two scans, the measured change functions as a gauge of the random variations that can be expected simply by reimaging. Because the two scans were performed at approximately the same time, one of them was randomly selected as the baseline, and the other was designated as the follow-up scan. The change in millimeters and the relative change in measurement (as a percentage) were calculated. The former is the measure of change directly observable, whereas the latter is the metric used commonly in clinical trials.13 This random assignment step was repeated 1,000 times, resulting in 1,000 distributions of the change (in millimeters) and of the relative change. We report the mean and standard deviation of measurement change by averaging these statistics over the 1,000 distributions.
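The random baseline assignment described above can be sketched in Python as follows. This is an illustration only and not the analysis code used in the study: the function name `resample_changes`, the fixed seed, and the example measurements are hypothetical, and the sketch assumes each lesion contributes one pair of diameters in millimeters.

```python
import random
import statistics


def resample_changes(pairs, n_iter=1000, seed=0):
    """Average the mean and SD of change over repeated random baseline assignments.

    pairs: list of (scan_a_mm, scan_b_mm) diameters for the same lesion.
    Returns averaged statistics for absolute (mm) and relative (%) change.
    """
    rng = random.Random(seed)
    sums = {"mean_mm": 0.0, "sd_mm": 0.0, "mean_pct": 0.0, "sd_pct": 0.0}
    for _ in range(n_iter):
        abs_changes, rel_changes = [], []
        for a, b in pairs:
            # Randomly choose which of the two scans is treated as baseline.
            baseline, follow_up = (a, b) if rng.random() < 0.5 else (b, a)
            diff = follow_up - baseline
            abs_changes.append(diff)
            rel_changes.append(100.0 * diff / baseline)
        # Accumulate the statistics of this one random assignment.
        sums["mean_mm"] += statistics.mean(abs_changes)
        sums["sd_mm"] += statistics.stdev(abs_changes)
        sums["mean_pct"] += statistics.mean(rel_changes)
        sums["sd_pct"] += statistics.stdev(rel_changes)
    # Average each statistic over all random assignments.
    return {key: total / n_iter for key, total in sums.items()}


# Hypothetical paired diameters in mm (not study data).
example_pairs = [(20.0, 21.5), (35.0, 33.8), (12.0, 12.9), (48.0, 49.1), (27.0, 26.2)]
result = resample_changes(example_pairs)
```

Because the sign of each change depends on which scan is drawn as the baseline, the averaged mean change tends toward zero while the averaged standard deviation reflects the spread of the paired differences.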
To assess what range of measurements could be expected as a result of variability, the 95% limits of agreement for change in millimeters were calculated as the mean change ± 2 standard deviations. The 95% limits of agreement can inform clinical practice because changes that fall within these limits can be considered as potentially arising as a result of measurement variability, rather than true change in tumor size. The relationship between this measurement error and lesion size was examined by fitting two separate generalized linear models with a normal probability distribution, which accounted for the intracluster correlation resulting from the fact that each scan is measured by three different radiologists. In each model, the positive measurement change (in millimeters and percent change, respectively) was modeled as a linear function of the lesion size.
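The limits-of-agreement calculation (mean change ± 2 standard deviations) can be illustrated with a short Python sketch. The function names and the example values are hypothetical, and the sketch uses the sample standard deviation, which is an assumption about a detail the text does not specify.

```python
import statistics


def limits_of_agreement(changes, k=2.0):
    """Return (lower, upper) as the mean change ± k sample standard deviations."""
    mean_change = statistics.mean(changes)
    sd_change = statistics.stdev(changes)
    return mean_change - k * sd_change, mean_change + k * sd_change


def within_limits(change, limits):
    """True if a change could plausibly arise from measurement variability alone."""
    lower, upper = limits
    return lower <= change <= upper


# Hypothetical measurement changes in mm (not study data).
example_changes = [-3.0, -1.0, 0.0, 1.0, 3.0]
limits = limits_of_agreement(example_changes)
```

With a mean change of zero, a standard deviation of 2.4 mm gives limits of −4.8 mm and +4.8 mm, which is the form of the result reported for this study.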
Between January and September of 2007, 33 patients with NSCLC consented to participation in the study. Two patients were excluded from analysis because, after undergoing measurement per protocol, the mean target lesion size was determined to be less than 1 cm. Another patient did not follow study protocol and was excluded because more than 1 day elapsed between CT scans. The characteristics of the 30 remaining patients and their 30 target lesions are listed in Table 1. The mean lesion size was 3.7 cm (range, 1.0 to 8.0 cm). The median time interval between the two scans was 8 minutes (range, 5 to 14 minutes). Twenty-seven patients (90%) were imaged with a 16-detector scanner, and three patients (10%) were imaged with a 64-detector scanner. Repeat measurements were made of each of the 30 lesions by three radiologists, totaling 90 paired measurements.
The distribution of the 90 measurement changes is shown in Figure 1A. The standard deviation of measurement change was 2.4 mm, indicating that 95% of measurement changes fell between −4.8 mm and +4.8 mm. Fifty-seven percent of measurement changes had a magnitude greater than 1 mm, and 33% of changes had a magnitude greater than 2 mm, with half of these appearing as positive changes and half appearing as negative changes (Table 2).
A waterfall plot of the 90 relative changes in tumor measurement is shown in Figure 1B, ranging from 23% shrinkage to 31% growth. The median increase was +4.3%, and the median decrease was −4.2%. Three percent of changes met the RECIST threshold for progressive disease (20% increase), whereas none met the RECIST threshold for partial response (30% decrease). Eighty-four percent of the tumor measurement changes were between −10% and +10%.
Table 3 lists the standard deviation of measurement change calculated from tumors of different sizes. Larger tumors tended to have larger magnitude measurement changes in millimeters (P = .06; Fig 2). In contrast, relative change (percent increase or percent decrease) was found to be significantly larger for smaller tumors (P = .005; Fig 2). For tumors smaller than 3 cm, 6% of the changes met RECIST criteria for disease progression, as opposed to 1% of the changes for tumors larger than 3 cm. The ranges of potential measurement change attributable to variability are listed in Table 3 for three tumors of different sizes, using 95% limits of agreement calculated from the standard deviation. Although the standard deviation of absolute measurement change increases somewhat with increased tumor size, the variability of percent change decreases.
This study represents the first time that the variability of CT-based tumor measurement has been fully quantified with the use of repeat CT imaging and conventional side-by-side measurement. The variability we found is directly applicable to measurements of lung tumors and is likely to be observed in tumor measurements in other organs. We believe our data have important relevance to the clinical care of patients with parenchymal metastases and to the interpretation of clinical trial results that rely heavily on radiographic response end points.
Our data demonstrate that CT measurement of lung lesions has variability of a clinically meaningful magnitude, with standard deviation of 2.4 mm of change between two CT scans of a tumor performed within a short period of time. This measurement variation is multifactorial in nature, partly because of differences in the appearance of the tumor or surrounding tissue (Appendix Fig A1, online only) and partly because of operator-dependent differences in the placement of a measurement. In total, the multistep process of repeat imaging and measurement of a tumor will lead to differences of ≥ 4 mm approximately 10% of the time (Table 2). Or put in other terms, for a lesion that in fact measures 4 cm, the variability of CT imaging can lead to measurements ranging from 3.5 to 4.5 cm (Table 3).
These data can assist clinical oncologists as they determine when a patient has developed disease progression. Although disease progression in clinical trials is defined by RECIST as an increase in summed tumor diameter of ≥ 20%,13 the criteria state that, “it is not intended that these RECIST guidelines play a role in [clinical] decision making, except if determined appropriate by the treating oncologist.” In clinical practice, any clear evidence of tumor growth could be judged a treatment failure, supporting the discontinuation of a therapy and possible change to another. Some oncologists may interpret a diameter increase of 1 or 2 mm as evidence of clear tumor growth; however, we found that reimaging led to measurement differences exceeding 2 mm 33% of the time, with half of these (17%) showing a more than 2 mm increase in tumor diameter. Therefore, we believe the variability inherent in CT imaging requires that clinicians consider other factors, such as changes in size of other lesions or patient toxicity, when using CT measurements to identify tumor progression.
Our findings indicate that the inherent variability of conventional unidimensional CT measurement can at times lead to the appearance of RECIST progression (≥ 20% diameter increase), considering that 3% of measurement changes calculated from the repeat CTs met this criterion. This was more common with lesions measuring between 1 and 3 cm; in 6% of these lesions, the measurement change from reimaging resulted in an appearance of a ≥ 20% increase in diameter. One strategy for avoiding cases of variability being misclassified as progression was adopted by RECIST 1.1, which now dictates that a ≥ 20% increase only qualifies as disease progression if there is “an absolute increase of at least 5 mm” in summed diameter measurements,13 and our data support this concept of a minimal change requirement.
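The combined relative and absolute requirement can be written as a simple check. The following Python sketch is illustrative only: the function name and parameter names are hypothetical, and it encodes just the two thresholds quoted above (an increase of at least 20% and of at least 5 mm in summed diameter). In RECIST 1.1, the reference is the smallest sum recorded on study; other elements of the criteria, such as new lesions, are not modeled here.

```python
def meets_progression_threshold(reference_sum_mm, current_sum_mm,
                                min_relative_increase=0.20,
                                min_absolute_increase_mm=5.0):
    """Check the >= 20% and >= 5 mm increase requirements on summed diameters."""
    if reference_sum_mm <= 0:
        raise ValueError("reference_sum_mm must be positive")
    increase = current_sum_mm - reference_sum_mm
    # Both the relative and the absolute requirement must be satisfied.
    return (increase >= min_relative_increase * reference_sum_mm
            and increase >= min_absolute_increase_mm)


# Hypothetical examples (not study data).
small_lesion = meets_progression_threshold(10.0, 12.5)   # 25% but only 2.5 mm
larger_sum = meets_progression_threshold(30.0, 37.0)     # about 23% and 7 mm
```

In this sketch, a 10-mm lesion that measures 12.5 mm on follow-up exceeds the 20% threshold but not the 5-mm requirement, which is the kind of variability-sized change that the minimum absolute increase is meant to screen out.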
This work also may have important implications for the interpretation of tumor response, particularly when measurement change is considered as a continuous variable, a technique increasingly considered as a way of better expressing a therapy's antitumor activity.1–2,14 One analysis often used is the waterfall plot, which displays the magnitude of each patient's best response as a percent measurement change (Fig 1B). To gauge its prevalence in the literature, we searched PubMed for phase I and II trials treating the major CT-measured carcinomas (lung, colorectal, pancreas, and renal) that were published in the Journal of Clinical Oncology in 2009. The search found 41 articles; nine articles focusing on radiotherapy or toxicity of therapy were excluded. Of the remaining 32 clinical trials, 15 (47%) included data addressing response as a continuous variable; 14 showed waterfall plots, and nine quantified the number of patients with a reduction in measurements. Although these statistics are used frequently, they can only be meaningfully interpreted (and meaningfully reported) if the expected variability of CT imaging and measurement is known. In the present study, we found a median decrease of 4.2% with the reimaging of a tumor after a short interval of time, indicating that half of measurement decreases as a result of variability alone will exceed 4.2%. As shown in the waterfall plot of our data (Fig 1B), 84% of percent measurement differences fell between −10% and +10%. Tumor size changes of small magnitudes are commonly reported in clinical trial results, yet the implication of these changes remains unclear considering that these differences may be solely a result of imaging variability.
As an example, in Figure 3, we present a waterfall plot of data from a recently completed phase II trial of targeted therapy in NSCLC (published separately).15 No standard statistical method exists for comparing waterfall plots, but many authors describe the proportion of patients with tumor shrinkage, including tumors with any evidence of diameter decrease. However, our data would suggest that many measurement decreases, particularly those less than 10%, may be indistinguishable from variability-related changes. Rather than calculating a tumor shrinkage rate, we would recommend that tumors with diameter changes between 0% and 10% decrease be considered relatively unchanged; for this reason, we have added a gray zone to Figure 3 to minimize the significance of changes with a magnitude less than 10%. The diameter decreases between 10% and 30% are less likely to be a result of variability and could potentially represent true antitumor effect, although the clinical implications of such a minor response would need to be investigated further. It is worth remembering that the historical roots of the RECIST response criteria date back to a variability study performed in 1976 (refs 16, 17); perhaps the improved accuracy of modern measurement could in part be a basis for reconsidering what qualifies as a tumor response. Interestingly, several groups have found that a 10% decrease in tumor diameter may be correlated with better outcomes in some cancers.18–21
Because this study measured only a single lesion for each patient, we are not able to quantify how summed measurement of multiple lesions might affect variability, although we can estimate this effect using the standard deviation of measurement variability reported above. Although variance increases in proportion to the number (n) of independent summed measurements with the same standard deviation (σ), the standard deviation (the square root of variance) increases only in proportion to the square root of n (√n × σ). This means the relative magnitude of the standard deviation can decrease with summed measurement, particularly if one assumes no correlation between the measurement error for two tumors on an individual CT scan. To illustrate this, we can consider a patient with multiple 15-mm lung tumors. Measurement of a single 15-mm tumor has a standard deviation of approximately 2.0 mm (Table 3), meaning that 95% of tumor measurements can be expected to lie within 15 ± 4.0 mm; as a percentage of tumor size, this corresponds to 95% limits of agreement of −27% and +27%. Yet when four 15-mm tumors are measured, the standard deviation increases only by a factor of two, to equal 4.0 mm; therefore, 95% of summed measurements will lie within 60 ± 8.0 mm, corresponding to 95% limits of agreement of −13% and +13% (Appendix Table A1, online only). This demonstrates how one can increase the relative accuracy of summed measurements by measuring a greater number of similarly sized lesions.
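The arithmetic in this example can be reproduced with a brief Python sketch. It is an illustration under the same assumptions stated in the text (independent measurement errors, a standard deviation of 2.0 mm for each 15-mm tumor, and limits of agreement of ± 2 standard deviations); the function name `summed_limits` and its parameters are hypothetical.

```python
import math


def summed_limits(n_lesions, lesion_mm=15.0, sd_mm=2.0, k=2.0):
    """Return (summed size in mm, SD of the sum in mm, limits of agreement in %).

    Assumes independent errors, so the SD of the sum is sqrt(n) * sd_mm.
    """
    total_mm = n_lesions * lesion_mm
    sd_sum_mm = math.sqrt(n_lesions) * sd_mm
    # Half-width of the limits of agreement as a percentage of the summed size.
    limit_pct = 100.0 * k * sd_sum_mm / total_mm
    return total_mm, sd_sum_mm, limit_pct


for n in (1, 2, 3, 4, 6):
    total, sd_sum, pct = summed_limits(n)
    print(f"{n} tumors: sum {total:.0f} mm, SD {sd_sum:.1f} mm, limits ±{pct:.0f}%")
```

Under these assumptions, the relative limits narrow from about ±27% for one tumor to about ±13% for four, matching the values quoted above. If measurement errors for lesions on the same scan were positively correlated, the narrowing would be smaller than this sketch suggests.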
In conclusion, this rescan study of lung lesions in patients with advanced NSCLC found a clinically important magnitude of measurement variability inherent in repeat CT imaging. This variability is greatest in the measurement of small tumors and has important implications for accurate determination of disease progression. Apparent changes in tumor diameter exceeding 1 to 2 mm are common on reimaging and alone may not be indicative of progression. Relative changes less than 10% may be indistinguishable from changes caused by variability alone and are unproven as a marker of efficacy in clinical trials.
| No. of 15-mm Tumors | Summed Tumor Size (mm) | Standard Deviation (mm)* | 95% Limits of Agreement (%) |
|---|---|---|---|
| 1 | 15 | 2.0 | −27 to +27 |
| 2 | 30 | 2.8 | −19 to +19 |
| 3 | 45 | 3.5 | −15 to +15 |
| 4 | 60 | 4.0 | −13 to +13 |
| 6 | 90 | 4.9 | −11 to +11 |
See accompanying editorial on page 3109
Supported in part by Grant No. R01-CA125143 (L.H.S.) from the National Cancer Institute, Bethesda, MD.
Authors' disclosures of potential conflicts of interest and author contributions are found at the end of this article.
Although all authors completed the disclosure declaration, the following author(s) indicated a financial or other interest that is relevant to the subject matter under consideration in this article. Certain relationships marked with a “U” are those for which no compensation was received; those relationships marked with a “C” were compensated. For a detailed description of the disclosure categories, or for more information about ASCO's conflict of interest policy, please refer to the Author Disclosure Declaration and the Disclosures of Potential Conflicts of Interest section in Information for Contributors.
Employment or Leadership Position: None
Consultant or Advisory Role: Lawrence H. Schwartz, Novartis (C), GlaxoSmithKline (C)
Stock Ownership: None
Honoraria: None
Research Funding: Lawrence H. Schwartz, AstraZeneca
Expert Testimony: None
Other Remuneration: None
Conception and design: Binsheng Zhao, Mark G. Kris, Lawrence H. Schwartz, Gregory J. Riely
Financial support: Mark G. Kris, Lawrence H. Schwartz
Administrative support: Binsheng Zhao, Mark G. Kris, Lawrence H. Schwartz
Provision of study materials or patients: Michelle S. Ginsberg, Robert A. Lefkowitz, Pingzhen Guo, Mark G. Kris, Gregory J. Riely
Collection and assembly of data: Geoffrey R. Oxnard, Binsheng Zhao, Michelle S. Ginsberg, Robert A. Lefkowitz, Pingzhen Guo, Lawrence H. Schwartz, Gregory J. Riely
Data analysis and interpretation: Geoffrey R. Oxnard, Binsheng Zhao, Camelia S. Sima, Leonard P. James, Lawrence H. Schwartz, Gregory J. Riely
Manuscript writing: All authors
Final approval of manuscript: All authors