


Volume of subcutaneous xenograft tumors is an important metric of disease progression and response to therapy in preclinical drug development. Non-invasive imaging technologies suitable for measuring xenograft volume are increasingly available, yet manual calipers, which are susceptible to inaccuracy and bias, are routinely employed. The goal of this study was to quantify and compare the accuracy, precision, and inter-rater variability of xenograft tumor volume assessment by caliper measurements and ultrasound imaging.
Subcutaneous xenograft tumors derived from human colorectal cancer cell lines (DLD1, SW620) were generated in athymic nude mice. Experienced independent reviewers segmented three-dimensional ultrasound data sets and collected manual caliper measurements, from which tumor volumes were calculated. Imaging- and caliper-derived volumes were compared to tumor mass, the gold standard, determined following resection. Bias, precision and inter-rater differences were estimated for each mouse among reviewers. Bootstrapping was used to estimate means and confidence intervals of variance components and intra-class correlation coefficients (ICCs) for each source of variation.
Average deviation from true volume and inter-rater differences were significantly lower for ultrasound volumes compared to caliper volumes. Reviewer ICCs for ultrasound and caliper measurements were similarly low (1%), yet caliper volume variance was 1.3-fold higher than ultrasound volume variance.
Ultrasound imaging more accurately, precisely, and reproducibly reflects xenograft tumor volume than caliper measurements. These data suggest that preclinical studies utilizing xenograft burden as a surrogate endpoint measured by ultrasound imaging require up to 30% fewer animals to reach statistical significance compared to analogous studies utilizing caliper measurements.
Longitudinal measurement of subcutaneous xenograft tumor volume is a central component of numerous preclinical studies utilizing mouse models of human cancer. Tumor volume is used as a metric to assess growth and disease progression, as well as to quantify response to therapeutic regimens. Meaningful assays of tumor volume are highly accurate, precise, and possess the requisite sensitivity to detect subtle differences between experimental arms, while utilizing as few experimental animals as possible. Measurements of xenograft tumors collected with manual calipers are rapid, non-invasive and inexpensive (1, 2). However, caliper measurements of subcutaneous xenografts are affected by contributions to the measure from epidermis and adipose tissue, as well as fur if present, each of which introduces error and variability into volume determinations. Furthermore, caliper measurements are commonly collected along the longest two dimensions of the tumor x/y plane only, with the z-axis dimension assumed to be the same as the shortest dimension (1). This practice contributes to the expediency of the method but hinders accuracy because volume estimation with this approach assumes ellipsoid-shaped xenografts, an assumption that is frequently incorrect. These weaknesses highlight the need in preclinical cancer research for improved methods that accurately and reproducibly determine xenograft tumor volumes on a routine basis.
A number of alternative methods to measure the volume of xenograft tumors have been reported, varying widely with respect to the time required to make the measurement, cost, and accuracy. Physical methods, such as cast modeling, have been shown to be superior to caliper measurements for determining xenograft tumor volume in mouse models (3). Cast modeling, however, is extremely time and labor intensive. Furthermore, cast modeling requires xenografts to be placed in a limited number of anatomical locations, such as the ventral chest wall, which provide suitable resistance to pressure to enable the cast to form properly (3). Several imaging methods suitable for assessing xenograft tumor volume exist and are quite attractive due to their non-invasive nature and potential for highly resolved measurement. Computed tomography (CT) (4) has been reported to be more reliable for assessing tumor volume than calipers in rat models of mammary cancer (5). However, the availability and expense of small animal CT scanners, the requirement of ionizing radiation, and the inherently poor soft tissue contrast of CT without the use of exogenous contrast material limit the routine use of CT for xenograft volume measurements in small animals. Magnetic Resonance Imaging (MRI) methods provide reliable tumor volume measurements without exogenous contrast materials (6, 7), yet without specialized animal holders accommodating numerous animals simultaneously (8), MRI is generally too expensive and time consuming for routine xenograft tumor measurements when studies include more than a few cohorts. Bioluminescence imaging (BLI) has been used as a measure of relative tumor burden (9) and can be rapid. However, in most cases BLI is acquired in two-dimensional planar format and is thus unable to provide an absolute tumor volume measurement. 
BLI also has the added requirement that xenografts must be generated from tumor cells engineered to express luciferase and such models require injection of luciferin substrate, limiting the breadth of preclinical models that can be studied.
Ultrasound imaging in the preclinical setting has emerged as an inexpensive, non-invasive method for measuring xenograft tumor volume. Ultrasound imaging boasts excellent soft tissue contrast without the use of exogenous contrast agents or ionizing radiation and offers considerably higher throughput than CT or MRI (up to 30 animals/hr). Previous studies have reported that ultrasound imaging provides reliable tumor volumes in longitudinal studies (10–12) and in in vitro tissue samples (13, 14). However, studies to date have not directly compared the accuracy and reliability of xenograft volumes determined by ultrasound imaging to those obtained using external caliper measurements. In this study, we show that xenograft tumor volume determination is significantly more accurate and reproducible using ultrasound imaging than external caliper measurements. Accordingly, we demonstrate that preclinical studies employing ultrasound imaging for volume determination of xenograft tumors require significantly fewer animals to reach statistical significance than analogous studies relying upon standard caliper measurements.
Studies involving mice were conducted in accordance with federal and institutional guidelines. SW620 and DLD1 human colorectal cells were cultured in DMEM supplemented with 10% fetal bovine serum at 37°C in an atmosphere of 5% CO2. For in vivo studies, xenograft tumors were generated as described (15). Briefly, 4 × 10^6 cells were injected subcutaneously on the right flank of 5–6 week old female athymic nude mice (Harlan Sprague-Dawley). Using this method, palpable tumors were typically observed within 2 weeks following injection of cells and were allowed to progress until at least 400 mm3 for these studies.
The two longest perpendicular axes in the x/y plane of each xenograft tumor were measured to the nearest 0.1 mm by three independent observers (reviewers 5–7) familiar with collecting caliper measurements of xenograft tumors in mice. The depth was assumed to be equivalent to the shortest of the perpendicular axes, defined as y. Measurements were made using a digital vernier caliper while mice were conscious and were calculated according to Equation 1 as is standard practice (1, 2):
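The caliper calculation described above can be sketched as follows. This is a minimal illustration that assumes the widely used modified ellipsoid form V = x·y²/2, with the shorter axis y reused as the depth; that specific form is an assumption made here for illustration and may differ from the exact expression used as Equation 1. The function name and example axis lengths are hypothetical.

```python
def caliper_volume_mm3(axis_a_mm: float, axis_b_mm: float) -> float:
    """Estimate tumor volume (mm^3) from two perpendicular caliper axes.

    ASSUMPTION: modified ellipsoid form V = x * y**2 / 2, where y is the
    shorter axis and also stands in for the unmeasured depth.
    """
    x = max(axis_a_mm, axis_b_mm)  # longest axis in the x/y plane
    y = min(axis_a_mm, axis_b_mm)  # shortest axis, reused as the depth
    return x * y * y / 2.0


# Hypothetical axes of 14.0 mm and 10.0 mm: 14 * 10 * 10 / 2 = 700.0 mm^3
print(caliper_volume_mm3(14.0, 10.0))
```

Because the depth is never measured, any tumor that is flatter or taller than its shorter in-plane axis will be over- or underestimated by a calculation of this kind.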
Immediately following caliper measurements, three-dimensional ultrasound imaging data sets were collected for each xenograft using a Vevo 770 ultrasound microimaging system (VisualSonics Inc.) designed for small animal imaging. For image acquisition, mice were initially anesthetized using 2% isoflurane in oxygen and then placed on a heated stage for the duration of imaging. Anesthesia was maintained during imaging using 2% isoflurane in oxygen. Xenografts were coated in warmed (37°C) Aquasonic 100 ultrasound gel (Parker Laboratories) and centered in the imaging plane. Three-dimensional B-mode data were acquired by automated translation of the 30 MHz ultrasound transducer along the entire length of the xenograft. The resulting data sets had a 17 mm × 17 mm field of view with an in-plane pixel resolution of 33.2 × 33.2 µm and an interslice spacing of 101.6 µm, resulting in 33.2 × 33.2 × 101.6 µm voxels.
For analysis of ultrasound data, images were imported into Amira 5.2 (Visage Imaging) for volumetric analysis. Tumor tissue exhibited photopenia compared to non-tumor tissue, allowing the tumor tissue to be manually segmented by four trained observers (reviewers 1–4) to obtain a volume for each xenograft. Tumor volume was determined by summing the in-plane segmented areas and multiplying this quantity by the inter-slice spacing as described (12).
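The slice-summation step can be sketched as below. This is an illustrative outline, not the Amira workflow itself: it assumes each slice's segmentation is available as a binary mask (a list of rows of 0/1 values), and the function and constant names are hypothetical. The pixel size and inter-slice spacing are the values reported for the acquisition.

```python
PIXEL_MM = 0.0332   # in-plane pixel size, 33.2 µm
SLICE_MM = 0.1016   # inter-slice spacing, 101.6 µm


def segmented_volume_mm3(masks, pixel_mm=PIXEL_MM, slice_mm=SLICE_MM):
    """Sum segmented in-plane areas over slices, then multiply by spacing.

    masks: iterable of 2D binary masks (rows of 0/1), one per slice.
    Returns the volume in mm^3.
    """
    pixel_area_mm2 = pixel_mm * pixel_mm
    total_area_mm2 = 0.0
    for mask in masks:
        # Count the pixels marked as tumor in this slice.
        n_tumor_pixels = sum(1 for row in mask for value in row if value)
        total_area_mm2 += n_tumor_pixels * pixel_area_mm2
    return total_area_mm2 * slice_mm


# Hypothetical toy input: two slices, each with a 10 x 10 segmented region.
toy_slice = [[1] * 10 for _ in range(10)]
print(segmented_volume_mm3([toy_slice, toy_slice]))
```

Unlike the caliper calculation, this approach makes no assumption about tumor shape; its accuracy depends on how well the tumor boundary can be traced in each slice.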
Animals were sacrificed immediately following ultrasound imaging, and xenograft tumors were excised and stripped of non-tumor tissue if present. Tumor mass has been shown to directly correlate with volume measured by water displacement (r = 1.0000) (2). Mass was determined to the nearest 0.1 mg using a calibrated analytical balance. Xenograft tumor volume was calculated from tissue mass assuming a density of 1 mg/mm3. This value was used as the true tumor volume (TTV) for comparison purposes.
Volumes derived from the mass of excised xenograft tumors were established as the “gold standard” value for volume. Overall bias was estimated separately for each measurement type using an intercept-only mixed model analysis of variance assuming normal errors and containing random effect terms for reviewers and mice. Inter-rater variability was assessed using the average of the absolute value of inter-rater differences over mice by measurement type. Among the 4 ultrasound reviewers, these averages comprised 6 pairwise differences per mouse, compared to 3 pairwise differences per mouse for the 3 caliper reviewers. Per mouse averages were compared using the Wilcoxon signed rank test for paired observations. Five thousand bootstrap replicates of the data were generated under the model described above to estimate nonparametric confidence intervals for sources of variability, intra-class correlation coefficients (ICC), and the total variance. This number of replicates ensured consistent precision to 3 decimal places for the confidence intervals of point estimates. The ICC is defined as a ratio of variance components; specifically, as the ratio of variability among mice to the sum of variability due to mice, reviewers, and error. An analogous ICC was also estimated for reviewers by placing reviewer variability in the numerator. Reviewers 3 and 6 were the same individual, though observations from this individual were considered independent between measurement types in all analyses as the observer was blinded to which animal was being measured.
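The inter-rater metric and the ICC definition above can be illustrated with a short sketch. The reviewer volumes and variance components below are hypothetical values chosen only to show the arithmetic; the function names are likewise illustrative, and the sketch does not reproduce the mixed model or the bootstrap.

```python
from itertools import combinations


def mean_abs_pairwise_diff(volumes):
    """Average absolute difference over all reviewer pairs for one mouse."""
    pairs = list(combinations(volumes, 2))  # 4 reviewers -> 6 pairs, 3 -> 3
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def icc(var_numerator, var_mouse, var_reviewer, var_error):
    """Ratio of one variance component to the sum of all three components."""
    return var_numerator / (var_mouse + var_reviewer + var_error)


# HYPOTHETICAL reviewer volumes (mm^3) for a single mouse.
ultrasound = [900.0, 950.0, 1000.0, 920.0]   # 4 reviewers
caliper = [1000.0, 1200.0, 900.0]            # 3 reviewers

print(mean_abs_pairwise_diff(ultrasound))  # 330 / 6 = 55.0
print(mean_abs_pairwise_diff(caliper))     # 600 / 3 = 200.0

# HYPOTHETICAL variance components (mouse, reviewer, error).
v_mouse, v_reviewer, v_error = 95.0, 1.0, 4.0
print(icc(v_mouse, v_mouse, v_reviewer, v_error))     # mouse ICC
print(icc(v_reviewer, v_mouse, v_reviewer, v_error))  # reviewer ICC
```

Every pairwise difference enters as an absolute value, so the per-mouse average is never negative regardless of which reviewer reads higher.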
The mean (standard deviation) TTV measured on 14 mice was 1117 mm3 (587 mm3). TTV ranged from 460 mm3 to 2323 mm3 with a median of 951 mm3. Typical reconstructed three-dimensional tumor volumes derived from ultrasound imaging data are shown in Figure 1, where the segmented tumor volume is shown in purple for display purposes. As displayed, tumors are ranked by TTV. Non-spheroid tumors, which comprised approximately half of the tumors studied, are denoted with an asterisk. Table 1 summarizes the cell line and the measurements of tumor volume by ultrasound and calipers as well as the TTV.

Figure 2 depicts the distributions of bias (experimental measure minus TTV) for each independent rater. Overall average bias (± s.e.) among ultrasound measurements was −53 mm3 (± 43 mm3) compared to 96 mm3 (± 88 mm3) for caliper measurements. Deviations per reviewer compared to the overall bias were relatively small. The coefficient of variation for bias and its standard deviation for ultrasound and caliper measurements were 0.81 and 0.92, respectively. Reviewer variability was relatively small compared to the variability among mice for both ultrasound and caliper measurements as assessed by the reviewer intraclass correlations of 1% for both modalities.

Figure 3 depicts Bland-Altman plots for reviewer 1 (ultrasound reviewer, 3A) and 5 (caliper reviewer, 3B). The Bland-Altman plots from these reviewers were representative of the other reviewer plots, which are all shown in Supplemental Data. Importantly, these plots illustrate more precise inter-rater agreement between ultrasound imaging measurements and TTV than for caliper measurements performed on the same mice. These data also illustrate that accurate ultrasound measurement may be limited to tumors <1500 mm3, as ultrasound estimates appeared consistently smaller than the TTV in tumors larger than 1500 mm3. Underestimation of volume in very large tumors was not observed for caliper measurements, but the variability in caliper estimates was found to increase with tumor volume.

Median (range) inter-rater variability, as assessed by the average of the absolute value of inter-rater differences, was significantly lower among ultrasound measurements, 73 mm3 (25 mm3 to 138 mm3), compared to 147 mm3 (66 mm3 to 408 mm3) for caliper measurements (p=0.001). The reviewer differences between measurement types were not highly correlated (Spearman r = 0.17), suggesting that tumor characteristics that result in large inter-rater deviations for caliper measurements are not the same as those that result in the larger inter-rater differences for ultrasound (Figure 4A). Shown another way (Figure 4B), inter-rater differences plotted against the rank of TTV for each mouse illustrate a trend toward increasing disagreement as true volume increases for caliper measurements (p=0.063) and ultrasound (p=0.004). For caliper measurements, the increase was an average 0.09 mm3 per 1 mm3 increase in true tumor volume compared to 0.04 mm3 among ultrasound reviewers. The discrepancy in p-values is likely associated with the higher variability among reviewers making caliper measurements (residual standard error for caliper reviewers = 93 mm3 versus 23 mm3 for ultrasound reviewers).

Bootstrapped estimates of the sources of variability confirmed results from previous analyses (Table 2). Interestingly, high mouse ICC and low reviewer ICC for both types of measurements indicate that multiple reviewers are not necessary to establish precise estimates of tumor volume whether measured via ultrasound or calipers. Four percent of the total variance was in the error term for caliper measurements, which in this model may suggest an interaction between reviewers and mice. We interpret the four-fold greater ICC error among caliper measurements compared to ultrasound imaging measurements to stem from the observation that inter-reviewer variability apparently increases with tumor size among caliper measurements but not for ultrasound measurements.
Studies to elucidate the biological effects and therapeutic potential of candidate drugs in oncology may begin in an in vitro setting, but promising strategies are ultimately advanced to in vivo mouse models. Though elegant transgenic mice have been developed which enable study of subtle biological and clinical traits of human cancer, a majority of in vivo assays designed to assay the efficacy of novel agents utilize simple measurement of tumor growth and/or regression. A natural progression from in vitro screening, these studies routinely employ subcutaneous human cell line xenografts grown in athymic nude mice and rely upon accurate and precise measurement of xenograft volume.
In this study, we compared the accuracy and precision of ultrasound imaging volumes of subcutaneous xenograft tumors and volumes determined using caliper measurements to the true volume of the tumor. Accuracy can be defined as proximity between the measure of an object and its true value. We quantified the accuracy of volumetric measurements as the average difference between the TTV and measured volume using either ultrasound image volumes or caliper measurements of the tumor. Precision is the average squared distance of measurements from their mean (variance or standard deviation). It is important to note that a group of measurements can be accurate, precise, both, or neither. Bias is defined as a systematic difference from the true value. In multi-arm experiments, the impact of bias is negligible because bias, usually assumed to be constant in all treatment groups, is negated by subtraction. An exception occurs when bias is not constant across the range of values measured in the study. For example, if one treatment (e.g., control) elicits no response with concomitant high values of a measurement and the active treatment group yields small response values on average when systematic bias is present, the estimated treatment difference will contain much of the larger bias of the control treatment. In short, valid measurement systems are both accurate and precise and contain minimal bias.
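The distinction between bias and precision, and the cancellation of a constant bias in a two-arm comparison, can be made concrete with a small sketch. All volumes below are hypothetical and the function names are illustrative; the sketch simply applies the definitions given above.

```python
from statistics import mean, pvariance


def bias(measured, true_values):
    """Average signed difference between measured and true volumes."""
    return mean(m - t for m, t in zip(measured, true_values))


def precision_variance(measurements):
    """Average squared distance of the measurements from their own mean."""
    return pvariance(measurements)


# HYPOTHETICAL true volumes (mm^3) for a control arm and a treated arm.
control_true = [1000.0, 1100.0, 1200.0]
treated_true = [600.0, 700.0, 800.0]

# A constant bias of -50 mm^3 applied to every measurement in both arms.
control_meas = [v - 50.0 for v in control_true]
treated_meas = [v - 50.0 for v in treated_true]

print(bias(control_meas, control_true))         # -50.0
print(mean(control_true) - mean(treated_true))  # true arm difference
print(mean(control_meas) - mean(treated_meas))  # same difference: bias cancels

# A bias that grows with tumor size does NOT cancel between arms.
control_scaled = [v * 0.9 for v in control_true]
treated_scaled = [v * 0.9 for v in treated_true]
print(mean(control_scaled) - mean(treated_scaled))  # smaller than the truth
```

The last case mirrors the exception noted above: when the bias depends on the magnitude of the value being measured, the arm with larger tumors carries more of it, and the estimated treatment difference is distorted.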
Ultrasound measurements (−53 mm3 ± 43 mm3) of tumor volume were less biased compared to tumor measurements taken with calipers (96 mm3 ± 88 mm3). Locally weighted scatterplot smoothing of data points in Bland-Altman plots shows that ultrasound measurements may underestimate tumor volume for true tumor volumes greater than 1500 mm3. While no systematic bias was detected among caliper measurements, the variability of caliper measurements increased with true tumor volume. This suggests that data transformations (e.g., natural log) may be necessary to stabilize the variability of caliper measurements before statistical methods assuming Gaussian (normal) errors can be used appropriately.
The bootstrap estimated standard deviations of caliper and ultrasound measurements (i.e., square root of the total variance excluding reviewer sources) were 583 mm3 and 508 mm3, respectively. As a rule of thumb, sample size increases in direct proportion to the ratio of variances of alternative scenarios. The ratio of caliper versus ultrasound variances was 1.32. In comparable randomized studies, the sample size necessary to have the same power to detect the same difference in tumor volume among treatment groups will be 1.32 times higher using caliper measurements compared to using ultrasound to estimate tumor volume. To illustrate this further, we conducted sample size calculations for a hypothetical two-arm mouse study where the primary endpoint is post-treatment minus pre-treatment change in tumor volume (Table 3). In the hypothetical study, treatment begins when tumors are palpable at 100 mm3. We assume an average increase in tumor volume to 1100 mm3 among control mice and that the treatment effect results in a decreased average tumor volume of 1000 mm3, 850 mm3, or 600 mm3. Change in tumor volume is compared using a two-sided, two-sample t-test that is considered statistically significant when p < 0.05. Without loss of generality, we assume pre- and post-treatment tumor measurements are uncorrelated. For the purposes of this exercise, we used the mixed model estimates of standard deviations among mice of 563 mm3 for caliper-measured mice and 490 mm3 for ultrasound-measured mice as determined in the present study. For significant results to be achieved, the number of animals required necessarily increases as the true treatment effect decreases, desired power increases, and/or the standard deviation among animals increases. This latter effect is constant, regardless of a treatment effect or desired power, and directly proportional to the ratio of the variances between measurement types. 
Note that deviations from a ratio of 1.32 are due to rounding of sample size to integer values, the effect of which increases with smaller sample sizes.
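The dependence of sample size on the variance ratio can be sketched with a standard normal-approximation formula for a two-sided, two-sample comparison. This is an assumption-laden illustration rather than the calculation behind Table 3: it uses z-quantiles in place of the t-distribution, an assumed power of 0.80, and hypothetical function names, so its outputs should not be expected to match the tabulated values. The standard deviations (563 mm3 and 490 mm3) and treatment differences (100, 250, and 500 mm3) come from the scenario described above.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(sd_mouse, delta, alpha=0.05, power=0.80):
    """Approximate (unrounded) animals per arm for a two-arm study.

    ASSUMPTIONS: normal approximation to the two-sample t-test; endpoint is
    post minus pre change with uncorrelated measurements, so the variance of
    the change is 2 * sd_mouse**2; power of 0.80 unless stated otherwise.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    var_change = 2 * sd_mouse ** 2
    return 2 * (z_alpha + z_power) ** 2 * var_change / delta ** 2


SD_CALIPER = 563.0     # mm^3, mixed model estimate among mice
SD_ULTRASOUND = 490.0  # mm^3, mixed model estimate among mice

for delta in (100.0, 250.0, 500.0):  # control 1100 vs 1000, 850, 600 mm^3
    n_cal = n_per_arm(SD_CALIPER, delta)
    n_us = n_per_arm(SD_ULTRASOUND, delta)
    # The unrounded ratio equals the variance ratio for every delta.
    print(delta, ceil(n_cal), ceil(n_us), round(n_cal / n_us, 2))
```

Before rounding, the caliper-to-ultrasound ratio is (563/490)², about 1.32, for every treatment effect and power; only the rounding up to whole animals moves the realized ratio away from that value.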
In addition to focusing on accuracy and repeatability, both of which affect the conduct of basic and clinical research, the third component of this study evaluated measurement reproducibility. Reproducibility is concerned with the precision of repeated measurements within (intra) and among (inter) reviewers. We did not assess intra-reviewer precision in this study. Inter-reviewer variability was measured as the average absolute difference between reviewer measurements by type. Thus each difference is always positive, as is the average. With 4 ultrasound reviewers and 3 caliper reviewers, there were 6 and 3 differences per mouse, respectively, comprising the average for each mouse. Average absolute differences among caliper reviewers had a median (range) of 147 mm3 (66 mm3 to 408 mm3). Ultrasound reviewers had a median (range) of 73 mm3 (25 mm3 to 138 mm3). Lines that match mice across measurement type in Figure 3 support a low Spearman correlation of 0.17 between measurement types. Importantly, from this result we conclude that tumor characteristics that elicit low reproducibility in caliper measurements do not greatly affect ultrasound measurements. The average absolute differences tended to increase in both groups as tumor size increased, with the rate in the caliper group approximately double that of the ultrasound group. Ordered by true tumor size, mice 6, 10, 13, and 14 had the highest disagreement values in the caliper group. Although our results suggest that ultrasound may underestimate tumor volume when tumors are above 1500 mm3, these plots show that the ultrasound reviewers were consistent in their assessments regardless of tumor size. We hypothesize that attenuation of ultrasound signal at depths approaching 15 mm may cause a degree of uncertainty in segmenting the basal surface of very large xenograft tumors. 
If true, this effect could be minimized at lower frequencies where ultrasound penetration is less affected by these determinants, but would decrease the spatial resolution of the ultrasound images. In either case, variance component analysis revealed very high ICC of 0.95 and 0.98 for caliper and ultrasound measurements suggesting that reviewer contributions to variability are very small relative to the total variance. Consequently, the expense of using multiple reviewers in mouse studies using either measurement type seems unwarranted.
A key determinant in the discrepancy between TTV and caliper-derived volumes, as opposed to ultrasound-derived volumes, is the assumption of spheroid shaped tumors in the case of caliper measurements. Figure 5 shows representative examples for tumors of similar size that were determined to be spheroid (5A, tumor 9) and non-spheroid (5B, tumor 10) as determined by three-dimensional visualization of the ultrasound data set. For the spheroid tumor, both ultrasound and caliper measurements accurately and precisely represent the TTV. For the non-spheroid tumor, the ultrasound measurements are both accurate and precise, while the caliper measurements are neither.

In this study, we have shown that ultrasound measurements of subcutaneous xenograft tumor volume exhibit significantly more accuracy, precision, and reproducibility than measurements made with standard calipers. Though the benefits outweigh the costs in the long-term, ultrasound necessarily requires access to imaging instrumentation, and like caliper measures, removal of fur if present. We conclude that use of ultrasound to measure tumor volume in mouse studies will result in more sensitive and reproducible measures of volume change at endpoints and throughout longitudinal studies. This is expected to translate into more efficient studies requiring fewer animals to obtain statistically meaningful results and requiring less infrastructure to support them.
This work was supported in part by funding from the National Cancer Institute (NCI): P50 95103 (Vanderbilt GI SPORE Program), U01 084239 (Mouse Models of Human Cancers Consortium), U24 CA126588 (South-Eastern Center for Small-Animal Imaging), 1RO1 CA46413, 1R01 CA140628, 1RC1 CA145138, K25 CA127349, and 1P50CA128323 (Vanderbilt ICMIC Program). ETM was supported by pre-doctoral training grant in imaging science T32 EB003817.