The statistical limitations of SUV, and in particular SUVmax, have been appreciated for some time (9). However, recent interest in the role of PET for monitoring tumor response to treatment has renewed attention to the topic (32), partly because the reproducibility of the method imposes a minimum change in SUV that is required to indicate a statistically significant change in the tumor. Overall SUV reproducibility includes components due to biologic and protocol issues, but much recent work (23,33,34) has focused on the instrument and analysis components of reproducibility. These studies used phantom experiments that approximated the noise environment encountered in clinical imaging and assessed bias and reproducibility in single-center and multicenter settings. The data presented in the current paper augment these studies by measuring aspects of reproducibility and bias in real patient images, thus accurately reflecting the statistical quality that is encountered in the clinical environment. Because the numerous factors that influence image statistical quality, and their variability between patients, are hard to capture accurately with current phantom designs, the use of real patient data in the present study is significant.
In this study, we found the within-patient SD for tumor SUVmax to be 5.6% ± 0.9% under conditions typical of whole-body oncology protocols. Comparing this value with previously published data for the overall reproducibility of SUVmax is slightly complicated by the use of different metrics, but despite this complication, the literature is quite consistent. The mean absolute percentage difference between successive SUVmax measurements has been reported by 3 studies to be 11.3% ± 8.0% (29), 13% ± 12% (10), and 16.1% ± 10.5% (35). The higher value in the last study may be due to the fact that the measurements were made on 2 different scanner systems: one PET/CT and the other PET only. Because the mean absolute percentage difference approximates the within-patient SD, these data are in good agreement with 2 other publications, which quoted 11%–12% (22) and 11.8% (within-patient SD, 16.7%/√2 = 11.8%) (30). Direct comparison of these data with those of Nahmias and Wahl (19) is not possible because their results are presented in absolute SUV units, as opposed to a relative change. However, their 95% confidence intervals of ±2.23 SUV units (within-patient SD, 2.23/2.77 = 0.80) and a mean SUVmax of approximately 8 SUV units correspond to a relative within-patient SD of about 10%, which is consistent with the previously mentioned publications.
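The conversions used above to place the published figures on a common footing can be sketched in a few lines. This is an illustrative sketch, not the analysis code of any cited study. The function names are hypothetical, and the factor 2.77 is taken to be 1.96 × √2, which assumes normally distributed measurement errors. Under the same assumption, the expected absolute difference between 2 repeated measurements is 2/√π (about 1.13) times the within-patient SD, which is why the mean absolute percentage difference serves as an approximation.

```python
import math


def within_sd_from_difference_sd(sd_of_differences):
    """Within-patient SD from the SD of test-retest differences (SD_diff / sqrt(2))."""
    return sd_of_differences / math.sqrt(2)


def within_sd_from_limits(half_width_95):
    """Within-patient SD from the half-width of 95% limits on a difference.

    Assumes normal errors, so the half-width equals 1.96 * sqrt(2) * SD (about 2.77 * SD).
    """
    return half_width_95 / (1.96 * math.sqrt(2))


def expected_mean_abs_difference(within_sd):
    """Expected |difference| between two repeats under normal errors: 2 * SD / sqrt(pi)."""
    return 2 * within_sd / math.sqrt(math.pi)


# Values quoted in the text.
print(round(within_sd_from_difference_sd(16.7), 1))   # 16.7% / sqrt(2)
sd_abs = within_sd_from_limits(2.23)                   # in SUV units
print(round(sd_abs, 2), round(100 * sd_abs / 8.0, 1))  # absolute SD, and % of a mean SUVmax of 8
print(round(expected_mean_abs_difference(1.0), 2))     # ratio of mean |difference| to SD
```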
The within-patient SD of 5.6% ± 0.9% for tumor SUVmax measured in the present study is lower than the literature values because the reports mentioned above include variability due to multiple sources, not simply image noise. These factors include differences in patient preparation, plasma glucose levels, and tracer uptake periods, as well as potentially real changes in tumor metabolism between studies performed on separate days. In addition, technical errors related to such things as scanner calibration and clock synchronization may also contribute. It is worth noting that for SUVmax, the component of variability that can be attributed to image noise accounts for approximately half the overall variability (5.6% vs. 11%–13%). Image statistical quality is therefore not a negligible consideration, at least when uptake measurements are derived from single-pixel SUVmax. Although the previously reported values of 11%–13% for overall within-patient SD may seem relatively low, they imply 95% limits of agreement for the difference between repeated measurements of around ±30% (2.77 × 11% = 30%). In other words, repeated SUVmax measurements that differ by up to 30% should be expected simply from measurement error. The excellent interobserver reproducibility that has been reported (18) for SUVmax should not be confused with the within-patient SD, which better reflects the variability that is encountered in response-monitoring studies involving sequential imaging.
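The arithmetic behind the ±30% figure, and one way to relate the noise component to the overall variability, can be illustrated as follows. This is a sketch under stated assumptions, not part of the original analysis. Function names are hypothetical, errors are assumed to be normally distributed, and the residual calculation further assumes that image noise is statistically independent of the other sources of variability so that variances add.

```python
import math


def limits_of_agreement(within_sd_percent):
    """Half-width of the 95% limits for the difference between two repeats (1.96 * sqrt(2) * SD)."""
    return 1.96 * math.sqrt(2) * within_sd_percent


def residual_sd(total_sd_percent, noise_sd_percent):
    """SD left after removing the noise component, ASSUMING independent sources (variances add)."""
    return math.sqrt(total_sd_percent ** 2 - noise_sd_percent ** 2)


total_sd = 11.0   # overall within-patient SD from the literature (%)
noise_sd = 5.6    # noise-only within-patient SD for SUVmax in this study (%)

print(round(limits_of_agreement(total_sd), 1))   # about 30%
print(round(noise_sd / total_sd, 2))             # noise SD as a fraction of the overall SD
print(round(residual_sd(total_sd, noise_sd), 1)) # non-noise component under the independence assumption
```

Note that "approximately half" refers to the ratio of SDs; if the sources are independent, the corresponding share of the variance is smaller.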
SUVpeak provides a mechanism for improving reproducibility for SUV measurements of the most metabolically active tumor region. The component of the overall within-patient SD due to image noise was reduced from 5.6% ± 0.9% with SUVmax to 2.5% ± 0.4% with SUVpeak (128 × 128 image matrix). SUVpeak is by no means a new proposal, and its use predates by many years the adoption of the term SUVpeak. In this work, we have implemented SUVpeak using a fixed-size 12-mm-diameter spheric ROI (17), positioned so as to maximize the enclosed average. Compared with SUVmax, larger bias due to the partial-volume effect is expected for small tumors, and this is clearly a limitation of the SUVpeak approach. However, greater volume averaging with SUVpeak was seen to improve reproducibility and offers a slightly more robust alternative to SUVmax. Achieving this advantage in clinical practice requires consistent placement of the peak ROI, something that is not trivial if performed manually. Fortunately, the inclusion of automated SUVpeak algorithms in the software of many commercial vendors promises to make this index more widely available and potentially as convenient to use as SUVmax. Another potential advantage of SUVpeak over SUVmax suggested by the present data may be that the reproducibility of SUVpeak is less affected by changes in pixel size. If confirmed, this property could have advantages for multicenter studies, in which images from different sites are likely to have pixels of different sizes.
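To make the SUVpeak definition concrete, the following is a minimal sketch of a fixed-size spheric ROI that is moved over the image to maximize the enclosed average. It is not the implementation used in this study or by any vendor. The function names are hypothetical, candidate spheres are assumed to be centered on voxel centers, a voxel is counted as inside when its center lies within the 6-mm radius (no partial-voxel weighting), and spheres that extend beyond the volume are skipped.

```python
def sphere_offsets(diameter_mm, voxel_mm):
    """Voxel offsets (dz, dy, dx) whose centers lie within the sphere radius."""
    radius = diameter_mm / 2.0
    sz, sy, sx = voxel_mm
    nz, ny, nx = int(radius // sz), int(radius // sy), int(radius // sx)
    offsets = []
    for dz in range(-nz, nz + 1):
        for dy in range(-ny, ny + 1):
            for dx in range(-nx, nx + 1):
                dist_sq = (dz * sz) ** 2 + (dy * sy) ** 2 + (dx * sx) ** 2
                if dist_sq <= radius ** 2:
                    offsets.append((dz, dy, dx))
    return offsets


def suv_peak(volume, voxel_mm, diameter_mm=12.0):
    """Largest mean over all fully contained spheres; volume is a nested list indexed [z][y][x]."""
    offsets = sphere_offsets(diameter_mm, voxel_mm)
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    best_mean, best_center = None, None
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                values = [
                    volume[z + dz][y + dy][x + dx]
                    for dz, dy, dx in offsets
                    if 0 <= z + dz < depth and 0 <= y + dy < height and 0 <= x + dx < width
                ]
                if len(values) < len(offsets):
                    continue  # sphere cut off by the edge of the volume
                mean = sum(values) / len(values)
                if best_mean is None or mean > best_mean:
                    best_mean, best_center = mean, (z, y, x)
    return best_mean, best_center


def suv_max(volume):
    """Single hottest voxel in the volume."""
    return max(v for plane in volume for row in plane for v in row)


# Toy example: uniform background of 1.0 with one hot voxel, 4-mm isotropic voxels.
volume = [[[1.0] * 9 for _ in range(9)] for _ in range(9)]
volume[4][4][4] = 20.0
peak, center = suv_peak(volume, voxel_mm=(4.0, 4.0, 4.0))
print(suv_max(volume), peak)
```

In this toy case the single hot voxel determines SUVmax entirely, whereas SUVpeak averages it with the surrounding voxels inside the sphere. The same averaging suppresses noise but also produces the larger partial-volume bias for small tumors noted above. In practice the search would be restricted to a volume of interest around the tumor.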
In addition to limiting the reproducibility of SUV measurements, image noise also has the potential to introduce bias. The present study provides clinical data confirming the potential for significant positive bias when the maximum pixel value is used to characterize PET uptake measurements. This trend is a consequence of the way SUVmax is defined. In a region of uniform tracer accumulation, statistical noise gives rise to a range of nonuniform pixel values. When one is considering the mean within an extended ROI, these fluctuations tend to average out, resulting in an unbiased estimate of the underlying signal (notwithstanding other sources of error). SUVmax, however, consistently takes the highest pixel value and therefore tends to overestimate the underlying average. Our data show a mean positive bias for SUVmax of 30% ± 26% for 1-min acquisitions. SUVpeak, in contrast, was biased by only 11% ± 16% for the same 1-min images. Noise-dependent bias of SUVmax has been previously reported in relation to computer simulations (9), experimental phantoms (20), and respiration-gated patient studies (36). Murray et al. (20) noted this bias effect in phantom studies with a time-of-flight PET system but did not observe it in their patient data. A possible explanation might be that, although their phantom images were statistically independent, their patient images may not have been, and a potentially misleading correlation between SUVs may have resulted.
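The origin of the positive bias can be illustrated with a simple simulation of a uniform region. This is a toy sketch, not a model of the patient data reported here. The function name and all parameter values are hypothetical, and pixel noise is assumed to be independent and Gaussian, whereas noise in reconstructed PET images is spatially correlated and not strictly Gaussian.

```python
import random
import statistics


def simulate_bias(noise_sd, n_voxels=50, n_trials=2000, true_value=1.0, seed=0):
    """Percentage bias of the ROI maximum and the ROI mean in a uniform region.

    Each trial draws n_voxels independent Gaussian pixel values around true_value.
    """
    rng = random.Random(seed)
    max_values, mean_values = [], []
    for _ in range(n_trials):
        roi = [rng.gauss(true_value, noise_sd) for _ in range(n_voxels)]
        max_values.append(max(roi))
        mean_values.append(statistics.fmean(roi))
    bias_max = 100.0 * (statistics.fmean(max_values) - true_value) / true_value
    bias_mean = 100.0 * (statistics.fmean(mean_values) - true_value) / true_value
    return bias_max, bias_mean


# Higher noise (e.g., a shorter acquisition) increases the bias of the maximum,
# while the ROI mean stays close to the true value.
for noise_sd in (0.1, 0.2):
    bias_max, bias_mean = simulate_bias(noise_sd)
    print(f"noise SD {noise_sd}: max bias {bias_max:.1f}%, mean bias {bias_mean:.2f}%")
```

Under these assumptions the bias of the maximum grows with both the noise level and the number of pixels in the region, which is consistent with the trend described above for short acquisitions.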
We acknowledge several limitations in our present work. The list-mode data were acquired 147 ± 37 min after 18F-FDG administration, and thus significant additional radioactive decay of the tracer (additional decay factor, 0.58) would be expected, compared with the more conventional oncology start time of 60 min. Our protocol attempted to compensate for this additional decay via the higher 18F-FDG activities that were administered. To approximate a typical patient administration of 370 MBq, an activity of 638 MBq would be required (370 MBq/0.58). In the present study, an average of 624 ± 83 MBq was administered, suggesting that the effect of delayed scanning may have been adequately compensated. Another limitation of our protocol was the use of nongated CT for attenuation correction of the respiration-gated PET series. Ideally, the CT data would have been gated in a similar way to the PET, allowing more accurate attenuation correction. This approach was not adopted because of the increased patient radiation dose that would have resulted. At least for the abdominal lesions, we believe that this may not have been a major limitation, because although respiratory motion can be significant in the abdomen, attenuation differences between abdominal organs are small and errors due to slightly misaligned CT are expected to be minimal. A further limitation is the fact that we do not present data for the various different tumor segmentation algorithms that have been proposed. Although we recognize this limitation, it was our intention to focus on SUVmax and SUVpeak, because they are widely used metrics that may be particularly vulnerable to image noise. Finally, the data presented in this report are strictly applicable only to the scanner model and protocol that were used. Although similar trends are expected on other scanner systems, the magnitude of the effects may differ if different acquisition and reconstruction protocols are used.
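The decay compensation described above can be checked with a short calculation. This is an illustrative sketch only. The 18F half-life of 109.77 min is a standard physical constant that is not stated in the text, the variable and function names are hypothetical, and the additional delay is taken to be the difference between the mean start time of 147 min and the conventional 60 min.

```python
F18_HALF_LIFE_MIN = 109.77  # physical half-life of 18F in minutes (not given in the text)


def decay_factor(delay_min, half_life_min=F18_HALF_LIFE_MIN):
    """Fraction of activity remaining after delay_min minutes."""
    return 2.0 ** (-delay_min / half_life_min)


extra_delay_min = 147 - 60                   # delay beyond a conventional 60-min start
factor = decay_factor(extra_delay_min)       # approximately 0.58
equivalent_activity = 370.0 / factor         # MBq needed to mimic a 370-MBq injection scanned at 60 min

print(round(factor, 2), round(equivalent_activity))
```

Using the unrounded decay factor gives an equivalent activity a few megabecquerels above the 638 MBq quoted in the text, which was obtained from the rounded factor of 0.58. Either value is close to the average administered activity of 624 ± 83 MBq.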
Although the issue of statistical noise and its effect on SUVmax has been previously explored (9,10), the subject bears reexamination in light of moves toward lower-activity protocols (20) and shorter data acquisitions (37). Although both developments are welcome in principle, the potential for increased image noise should not be overlooked when SUVmax is to be used. Given the current interest in tumor quantification and the fact that SUVmax has become the quantitative metric of choice for many centers, additional data on the influence of image noise in real patient studies are timely. This report serves as a reminder of these statistical limitations, and we hope that it will contribute to improved accuracy and reproducibility of quantitative PET studies.