|Home | About | Journals | Submit | Contact Us | Français|
This paper presents a general framework for assessing imaging systems and image-analysis methods on the basis of therapeutic rather than diagnostic efficacy. By analogy to receiver operating characteristic (ROC) curves, it utilizes the Therapy Operating Characteristic or TOC curve, which is a plot of the probability of tumor control vs. the probability of normal-tissue complications as the overall level of a radiotherapy treatment beam is varied. The proposed figure of merit is the area under the TOC, denoted AUTOC. If the treatment planning algorithm is held constant, AUTOC is a metric for the imaging and image-analysis components, and in particular for segmentation algorithms that are used to delineate tumors and normal tissues. On the other hand, for a given set of segmented images, AUTOC can also be used as a metric for the treatment plan itself. A general mathematical theory of TOC and AUTOC is presented and then specialized to segmentation problems. Practical approaches to implementation of the theory in both simulation and clinical studies are presented. The method is illustrated with a a brief study of segmentation methods for prostate cancer.
It is now generally accepted, at least in the SPIE Medical Imaging community, that image quality must be assessed on the basis of the ability of a user to perform medically of scientifically relevant tasks with the image data.1 For diagnostic tasks, many evaluation methods related to receiver operating characteristic (ROC) curves have been developed and used to assess imaging systems and reconstruction algorithms. Much less has been done with the estimation tasks that arise in the context of radiotherapy.
The overall goal of radiotherapy is to maximize the probability of destroying tumors while minimizing damage to surrounding normal tissues. Many sophisticated approaches to this goal, including three-dimensional conformal radiotherapy2 (3DCRT), intensity-modulated radiotherapy3–5 (IMRT) and image-guided radiotherapy6 (IGRT), have been developed for this purpose, but they all depend on information extracted from images of the patient. Moreover, even after the treatment plan is finalized, there is still a tradeoff between the probability of tumor control and the probability of normal-tissue complications, both of which increase when the overall intensity or duration of the treatment beam is increased.
Much work has gone into estimating these probabilities,7–12 and standardized modules for their calculation are freely available.13, 14 The form of the output from such modules is depicted in Fig. 1. The graphs on the left are for a hypothetical therapy plan where the operating point (overall dose) can be chosen for high probability of tumor control, denoted Pr(TC), and minimal probability of normal-tissue complications, Pr(NTC). In the graphs on the right, however, Pr(NTC) is higher for any given choice of Pr(TC) than in the graphs on the left, so these curves correspond to a less effective treatment plan.
The similarity of the curves in Fig. 1 to plots of true-positive fraction (probability of detection) and false-positive fraction (probability of false alarm) vs. decision threshold will be evident to anyone conversant with signal detection.1 It is then an easy step to replot Fig. 1 as Pr(TC) vs. Pr(NTC) with the overall dose (intensity or duration of the treatment beam) as a parameter. This plot, initially introduced by Moore and Mendelsohn15 and analyzed in ROC terms by Metz et al.,16 is referred to as the Therapy Operating Characteristic or TOC curve. The TOC curves corresponding to Fig. 1 are shown in Fig. 2.
This paper uses the area under the TOC curve (AUTOC) as a figure of merit for radiation therapy planning and for the imaging components that affect the plan. An AUTOC of 1.0 is ideal, and AUTOC = 0.5 corresponds to a treatment scenario in which the probability of tumor control can be increased only at the expense of an identical increase in the probability of normal-tissue complications.
In Sec. 2 we lay out the full theory of TOC with as much generality as we can muster. The result is that Pr(TC) and Pr(NTC) are each described by three nested integrals, where each integral is over a huge or even infinite number of variables and the integrands are unknown. In Sec. 3 we make some simplifying assumptions applicable to the goal of using AUTOC to evaluate segmentation algorithms, and in Secs. 4 and 5, respectively, we show how AUTOC can be estimated from phantom images or real clinical images. An illustrative application to prostate radiotherapy is presented in Sec. 6.
The imaging system is modeled as a continuous-to-discrete (CD) mapping1 where the object is a function of continuous variables and the image is a set of discrete numbers such as voxel values. We assume that the tumor and all normal organs of interest have well-defined boundaries which can be represented as mathematical surfaces in a 3D space. For patient j, these surfaces are specified by a (potentially infinite) set of parameters denoted as Bj.
The surface descriptions contained in Bj are, of course, unknown, and it is the goal of a segmentation algorithm to estimate them in some way. The resulting estimate of the surfaces for patient j is denoted j. Unlike the continuous descriptions in Bj, the estimates in j specify a set of voxels or surface tesselations that define the surfaces in a discrete sense.
The data used for estimating the boundaries consist of one or more images. Typically, the normal-organ boundaries and the tumor boundaries will be estimated from a CT or MR image, but there is an increasing interest in including biological information estimated from PET images into the treatment planning process. For generality, we denote the available image data set for patient j as Gj, but we assume that this set must always include a CT study because x-ray attenuation data are needed for treatment planning.
From the estimated boundaries and the original CT data, a dosimetrist or physicist will derive a treatment plan which specifies the beam profiles, energies and exposures. Implementation of this plan will result in a dose distribution for patient j denoted Dj(r), where r is a three-dimensional position vector in the patient’s body. We assume that the dose distribution scales by the same factor at each position r as the exposure time or beam current is varied. If we assume further, for fractionated treatments, that the exposure is varied by the same proportion in each session, we can write
where D0j(r) is the dose distribution at some reference exposure and α is the ratio of an actual exposure to the reference exposure. The horizontal axis in Fig. 1 can now be interpreted as α − 1. For notational economy, we denote α D0j(r) as Dj, with the dependence on α implicit.
In passing, we note that scaling the exposure time is not precisely the same as varying beam current because of the dose-rate-dependent balance of tumor kill, repair and regrowth processes, but these effects are neglected in this initial paper.
The stochastic quantities that control the probabilities of tumor control and normal-tissue complications are noise in the image data; randomness in the estimated boundaries derived from a given image set; uncertainty in the actual delivered dose distribution, and of course the stochastic nature of the tissue response itself. We can describe all of these effects abstractly in terms of conditional probability density functions (PDFs).
The multivariate PDF for the image noise for patient j, denoted pr(Gj|j), describes the effect of the noise in the raw projection data as propagated through the reconstruction algorithm; for much more on this PDF and copious references on the subject, see Chapter 15 of Barrett and Myers.1
The PDF for the estimated boundaries given the image data, denoted pr(j|Gj), would be a huge-dimensional delta function for an automatic segmentation algorithm because repeated trials with the same images would always deliver the same boundaries, but human analysts would not be so repeatable. We therefore retain pr(j|Gj) as an unspecified PDF for now.
The PDF for the actual dose for patient j given the estimated boundaries and the image data is denoted pr(Dj|j, Gj, j). The estimated boundaries and the variations in x-ray attenuation coefficient in the CT data determine the treatment plan and hence the intended dose distribution, but this PDF allows the actual dose to deviate in an unknown, random way from the intended one and to depend also on other characteristics of the patient such as fine details not captured in the image data.
Finally, the probability (not PDF) of tumor control, which depends on the actual dose distribution, the actual boundaries and possibly other biological factors such as oxygenation and perfusion, can be denoted Pr(TC|Dj, Bj, j). The notation here is a trifle redundant because conditioning on the patient is implicitly specifying the true organ boundaries, but we retain the dependence on Bj to emphasize its importance.
With these definitions, we can apply elementary rules of conditional probability to write the probability of tumor control for patient j, in full generality, as
Similarly, the general expression for the probability of normal-tissue complications is
Note that the probabilities given by (2) and (3) depend on both the actual and the estimated boundaries; the estimated boundaries determine the treatment plan, but given a plan and hence a dose distribution, it is the actual boundaries that determine Pr(TC|j) and Pr(NTC|j).
The general equations (2) and (3) can be used to describe and evaluate almost any aspect of the therapy-planning process, including evaluation of planning algorithms and the dose computations used therein; evaluation of imaging systems and reconstruction algorithms; and study of the accuracy and reproducibility of humans performing manual segmentation. This paper, however, focuses on evaluation of segmentation algorithms, so we can introduce some additional assumptions about the other links in the therapeutic chain.
Most importantly, we can neglect the potential inaccuracies in dose calculations and assume that the actual dose distribution, given the plan, is precisely what the therapy-planning program says it is. There are two main justifications for this assumption. First, this critical aspect of radiotherapy has been widely studied, and the dose-computation algorithms have been continually refined and validated by Monte Carlo studies.17 One can surely have confidence in them if an accurate image model is used. Second, the dose delivered by a beam of high-energy photons is rather non-local; a photon that undergoes a photoelectric or Compton interaction at a specific point in the tissue delivers its dose over a range of a centimeter or more, and there are many such interactions, so noise or limited spatial resolution in the image may have little significant effect on the dose distribution.
where D(j, Gj) is the dose distribution (interpreted as a function of continuous spatial variables) returned by the planning algorithm. The delta function can then be used to perform the integral over Dj, and (2) becomes
Another reasonable approximation is to neglect the biological subtleties of tissue damage by ionizing radiation. As noted above, Pr(TC|Dj, Bj, j) can depend on patient characteristics not captured by the image data; that was the reason for the final conditioning on j. In fact, we have little way of knowing what these characteristics are in clinical practice, and we certainly could not incorporate them realistically in a simulation study. Thus we opt to delete that last variable and write Pr(TC|Dj, Bj, j) = Pr(TC|Dj, Bj).
Now we note that the product of the last two conditional probabilities in (5) can be rewritten as a joint probability:
Thus (5) becomes
where the notation indicates that the quantity in angle brackets is to be averaged over the joint distribution of the subscripted quantities, j and Gj.
The expression in (7) suggests a Monte Carlo algorithm that can be implemented with simulated data. Suppose we begin with an anatomical phantom to represent a particular patient j and we generate K random image samples, differing only by the realization of the Poisson noise in the raw projection data. The kth such image sample is denoted and the corresponding boundary estimate is denoted . The pair is a sample from the joint distribution pr(j, Gj|j); only one index k is needed to specify the sample because is generated directly (though not necessarily deterministically) from .
Because we started with an anatomical phantom, the true boundaries Bj are known, and we can use a standard therapy planning algorithm and standard TCP and NTCP modules to evaluate the quantity in angle brackets in (7). The Monte Carlo estimate of Pr(TC|j) is then
where the probability on the righthand side is interpreted as the output of the TCP module.
The notation doesn’t show it, but the Monte Carlo estimates of (8) and (9) are still functions of α, so the computation gives one point on the TOC curve. Repeating for several values of α as in an ROC study generates the TOC curve for phantom j; the study can be repeated for several phantoms to get an average TOC curve and AUTOC.
If we wish to use real clinical images rather than simulations to generate TOC curves, we need a gold standard for the true boundaries. One approach is to have two or more experienced radiation oncologists painstakingly segment the images and agree on the resulting boundaries for the tumor and the critical organs. By analogy to the use of expert panels in standard ROC studies, the resulting Bj can then be adopted as the gold standard for evaluation of the performance of less experienced (or less painstaking) oncologists or of automatic contouring algorithms. It is also possible to begin with normal images, segment them manually in order to delineate the critical normal organs, and then add realistic simulated tumors with known boundaries.
Another problem in applying the theory introduced in this paper to real data is that we do not have the option of collecting multiple images on the same patient, so we cannot implement the factor pr(Gj|j) in (2), (3) or (5). We can proceed with the general probabilistic formulation of Sec. 3, therefore, only if we assume that, at the dose levels normally used in clinical CT, the image noise is not the limiting factor in segmentation accuracy. We can check this assumption by adding more noise to the raw CT data before reconstruction and determining at what noise level AUTOC is significantly affected, or we can revert to simulation.
If image noise is indeed negligible (for segmentation purposes) at clinical dose levels, then by analogy to (4) we can write
where j is the mean or noisefree image of patient j; in practice, we simply estimate j with the actual observed image for that patient.
How we apply this equation depends on whether the segmentation is manual or automatic. If we have N people performing manual segmentation, and the nth of these produces estimated boundaries jn, then an estimated probability of tumor control for patient j, with manual segmentation and the assumptions already incorporated into (11), is given by
A similar expression holds for the probability of normal tissue complications, so a patient-specific TOC curve is readily generated as described in Sec. 4. Further averaging over patients will yield an average TOC and AUTOC.
Evaluation of automatic segmentation algorithms is even easier. These algorithms are never uncertain but usually wrong – relative to the assumed gold standard. In estimation-theoretic terms, they have zero variance but nonzero bias. If algorithm m produces boundaries jm for patient j, Pr(TC) for that algorithm and patient is given by
Again, this expression depends implicitly on α, and a similar expression holds for , so a TOC curve specific to algorithm m and patient j is readily constructed. Averaging over patients then provides a task-based metric for the particular algorithm.
Rectal bleeding is a significant normal-tissue complication in radiotherapy of prostate cancer.18 For this reason, we chose a clinical case of prostate cancer to illustrate the use of TOC curves with clinical images for a single patient. The prostate, seminal vesicles, bladder and rectum had previously been manually segmented, and we considered these boundaries to be the gold standard.
The treatment plan actually used clinically with this segmentation provided the dose distribution for the tumor and normal organs. In the notation of (13), this procedure provided D(j1, j) with the best manual segmentation being denoted as algorithm m = 1. Because this algorithm was taken as the gold standard, we set Bj = j1 in (13) and used the TCP and NTCP modules of Gay and Niemierko13 to compute the probability of tumor control and the probability of rectum complications from the cumulative dose-volume histograms. The resulting dose-response curves are shown on the left in Fig. 3.
Then we artificially expanded the boundaries of the prostate by 5 mm in all directions and made a new treatment plan with wider beams as if these expanded boundaries were correct. Equivalently, we increased the PTV (planned treatment volume) by 5 mm in all directions.
This new segmentation was denoted m = 2 and a new dose distribution, D(j2, j), was derived. With the true boundaries still taken as Bj = j1, we again used the modules of Gay and Niemierko to compute Pr[TC|D(j2, j), j1, j], and similarly for NTCP. The resulting dose-response curves for this case are shown on the right in Fig. 3.
The two sets of dose-response curves in Fig. 3 are remarkably similar, and one might think that the intentional segmentation error made no difference at all. When the data from Fig. 3 are replotted as TOC curves in Fig. 4, however, the difference becomes evident. The gold-standard manual estimation yields AUTOC = 0.894, and the erroneous segmentation yields AUTOC = 0.857. There is no need to assign error bars to these figures of merit or to inquire whether the difference is statistically significant because there is nothing statistical about them. The TCP and NTCP modules compute the probabilities from the input dose-volume histograms and the input biological parameters for the organs in question, and there is no variance in the results. The output probabilities are no doubt biased, in the sense that they may not agree well with clinical outcomes, but the bias is the same for both segmentation algorithms. In this example, the manual segmentation was the gold standard, and the intentionally erroneous segmentation gave higher probability of rectal complications at all tumor-control probabilities, as expected. We made no effort to use biologically accurate input parameters to the modules, and we speculate that the rank ordering of segmentation algorithms will be invariant to the choice of these parameters in most situations.
We have analyzed therapy operating characteristic curves and proposed the use of the area under the curve as a figure of merit for comparison of segmentation algorithms. The sensitivity of this method of comparison was demonstrated for a single clinical case of prostate cancer.
TOC curves have many other applications as well. They can be used to compare different imaging instruments, reconstruction algorithms or image-registration methods that may go into the treatment-planning process. They can also be used to compare and optimize the planning algorithms themselves, especially in simulation studies where there is no uncertainty about the true organ boundaries. Finally, they can be used to evaluate methods to overcome practical clinical problems such as patient positioning or motion during treatment.
In all of these cases, it seems likely that the comparison of instruments, algorithms and treatment methodologies will be insensitive to the biological accuracy of the TCP and NTCP calculations. If, however, that biological accuracy can be clearly established, TOC curves for individual patients can, in principle, be used to determine the operating point (overall dose) that is best for that patient. In this case, multiple TOC curves for different normal organs and degrees of complication can be used to choose the right balance for the patient.
We have benefited significantly from discussions with Julian Rosenman and Stephen Pizer at the University of North Carolina, and we thank Charles Metz for calling our attention to references 15 and 16. This research was supported by the National Institutes of Health under grants R37 EB000803 and P41 EB002035.