Variation in target volume delineation represents a significant hurdle in clinical trials involving conformal radiotherapy. We sought to determine the impact of a consensus guideline-based visual atlas on contouring of target volumes.
A representative case from a proposed rectal cancer clinical trial involving conformal radiotherapy was contoured (Scan1) by 14 physician observers and a reference expert, following the trial's target volume delineation instructions. Gross tumor volume (GTV) and 2 clinical target volumes (CTVA, comprising internal iliac, pre-sacral, and peri-rectal nodes, and CTVB, external iliac nodes) were contoured. Observers were randomly assigned to receipt (Group_A) or non-receipt (Group_B) of a consensus guideline and atlas for anorectal cancers, then instructed to re-contour the same case/images (Scan2). Observer variation was analyzed volumetrically using the conformation number (CN, where CN=1 indicates total agreement).
In 14 evaluable contour sets (1 expert, 7 Group_A, 6 Group_B), there was greater agreement for GTV (mean CN 0.75) than CTVs (mean CN 0.46–0.65). Atlas exposure for Group_A led to significantly increased inter-observer agreement for CTVA (mean initial CN 0.68, post-atlas 0.76; p=0.03), as well as increased agreement with the expert reference (initial mean CN 0.58, 0.69 post-atlas; p=0.02). For GTV and CTVB, neither inter-observer nor expert agreement was altered after atlas exposure.
Consensus guideline atlas implementation resulted in a detectable difference in inter-observer agreement and greater approximation of expert volumes for CTVA, but not GTV or CTVB, in the specified case. Visual atlas inclusion should be considered as a feature in future clinical trials incorporating conformal radiotherapy.
Inter-observer differences in target volume delineation are a demonstrated source of potential treatment variability in the context of clinical trials that incorporate conformal radiotherapy approaches 1, 2. Recent publications suggest that target delineation consensus documentation is highly desirable for clinical trials3 and that specific instructional or educational interventions may afford a measurable effect in terms of physician contouring4, 5.
As part of efforts to improve radiotherapy implementation for Southwest Oncology Group trials, and consistent with its focus on quality improvement in cooperative studies, the SWOG Radiation Oncology Committee authorized this study as a pilot project to achieve the following primary specific aims:
This prospective IRB-exempt study was conducted under the auspices of the University of Texas Health Science Center San Antonio Institutional Review Board. It was designed as a double-blind, randomized, hypothesis-generating pilot study (Figure 1). Statistical power for agreement analysis was estimated for a non-Bonferroni-corrected paired-measures Wilcoxon test (assuming a minimum asymptotic relative efficiency of ≥0.863 compared with a paired t-test), with a specified 1-β of 0.7 and α of ≤0.05, resulting in a minimum requisite sample size of 6 observers (radiation oncologists) per group, calculated using G*Power 3 statistical software6. Goal enrollment was 10–12 observers per cohort.
Participant radiation oncologists (observers) were recruited from SWOG-participating institutions. Those who indicated interest were sent study documentation, which included a standardized case report, a description of the target volumes to be contoured, and a compact disc (CD) containing 3 mm axial CT images derived from the DICOM file of the standardized case study’s simulation CT, to be contoured twice using the Big Brother target delineation software program. “Big Brother” is a custom target volume delineation evaluation software platform developed at the Netherlands Cancer Institute7, 8. Big Brother consists of a user interface with target delineation features common to most commercial treatment planning systems9 and collects a wide array of volumetric and target delineation data unobtrusively during the contouring session. The included case study depicted the history and clinical findings from an anonymized patient with T3N0M0 adenocarcinoma of the rectum. Its instructions were modeled on a SWOG protocol in development at the time, which included detailed instructions regarding 3-DCRT and IMRT treatment plan design (SWOG S0713: A Phase II Study of Oxaliplatin, Capecitabine, Cetuximab and Radiation in Pre-operative Therapy of Rectal Cancer, ClinicalTrials.gov Identifier NCT00686166), with terminology modified to fit nomenclature established in the then-unpublished RTOG consensus guideline for target delineation in anorectal cancers10. Observers were asked to contour the structures as per Table 1. Axial CT images were extracted from a single DICOM dataset; identical copies of the reconstructed views (axial, sagittal, and coronal) were then designated Scan1 and Scan2.
Half of distributed CDs contained an automated HTML link, which, after submission of the first contouring session (Scan1) and the subsequent electronic survey, directed users to a pre-publication version of the RTOG consensus guidelines for target volumes in anorectal cancer10 as well as instructions to re-contour the exact same axial CT images a second time (Scan2) with the same case presentation, instructions and target definitions, using the RTOG consensus guideline visual atlas as a guide (Group_A). All other CDs contained HTML pop-up directions to re-contour the same volumes on the identical CT-simulation-derived dataset (Scan2), using the same aforementioned case data/instructions as previous (Group_B). Hence, Group_B did not receive consensus atlas guidance for re-contouring the case. CDs with and without the HTML link to the consensus atlas were randomly shuffled before labeling and delivery to participants; both study personnel and physician observers were unaware of which CD had been distributed to each participant until electronic survey completion.
After completion of the GTV and CTV delineation on Scan1, observers submitted the case by email and were directed to an electronic survey (Table 2). Subsequently, participants were provided with instructions to re-contour the case either with or without the assistance of an anatomically specific consensus atlas. The recently published RTOG consensus atlas10 was utilized in pre-publication form (http://www.rtog.org/pdf_document/AnorectalContouringGuidelines.pdf).
In addition, one of the members involved in the development of the RTOG consensus guideline was asked to delineate Scan1 and Scan2. This observer [LK] was designated as a “reference expert”, with her contours serving as a de facto gold standard against which to compare observer-derived contours. During the study period, only the reference expert had prior knowledge of this atlas, and thus study participants represented a tabula rasa with regard to consensus guidelines.
All delineations were first visually analyzed (Figure 2), and protocol deviations from the delivered instructions were identified by review of all axial contours. The total volume encompassed, in cubic centimeters, was calculated and tabulated for all structures. Statistical comparison of volume differentials between Scan1 and Scan2 was performed for each structure for Group_A and Group_B respectively.
Baseline inter-observer variation for the SWOG protocol was derived from the delineations on Scan1 from all observers, excepting the reference expert. Baseline intra-observer variation was derived from a comparison of the volumes of Scan1 and Scan2 in Group_B. The effect of the atlas on inter-observer variation was quantified by comparing the inter-observer variation for Scan1 and Scan2 in Group_A. For comparison within cohorts, a composite median delineation was calculated for each group. The median delineation represents a 50% coverage isosurface of the observers, such that each voxel inside is designated by at least 50% of the observers, and was calculated for GTV, CTVA and CTVB. The CTVC structure was not designated in the instructions as a necessary volume to be contoured for this clinical case, and was therefore not analyzed. For volumetric agreement analysis for Group_A, first, the common volume (CV) was calculated between either the median or expert contour (V1) and the observer contour (V2). Subsequently, as a modification of the concept introduced by van’t Riet et al.11, 12, a conformation number (CN) was derived as CN = CV^2/(V1 × V2), i.e. the squared common volume divided by the product of the two volumes. Differences in CN values for target structures (e.g. GTV, CTVA, CTVB) for Scan1 and Scan2 for Group_A, using both the reference expert and the group median delineation isosurface as a comparator, were calculated and formally assessed for statistical significance by paired-measures Wilcoxon test.
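As an illustration of the two volumetric quantities defined above, the following minimal sketch computes a 50% coverage median delineation and the conformation number on toy data. It assumes each contour has already been rasterized into a set of voxel indices. The function names (`conformation_number`, `median_delineation`, `box`) and the three toy observers are hypothetical and are not part of the Big Brother software or of the study data.

```python
from collections import Counter


def conformation_number(v1, v2):
    """CN = CV^2 / (V1 * V2), with volumes counted in voxels.

    v1, v2: sets of voxel indices (assumed representation of a contour).
    """
    if not v1 or not v2:
        raise ValueError("both contours must contain at least one voxel")
    common = len(v1 & v2)  # common volume CV
    return common ** 2 / (len(v1) * len(v2))


def median_delineation(observer_contours):
    """Voxels designated by at least 50% of the observers."""
    counts = Counter(v for contour in observer_contours for v in contour)
    threshold = len(observer_contours) / 2
    return {v for v, c in counts.items() if c >= threshold}


def box(x0, x1, y0, y1):
    """Toy 2D 'contour': all integer voxels in [x0, x1) x [y0, y1)."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


# Three made-up observers whose contours are shifted relative to each other.
obs_a = box(0, 4, 0, 4)
obs_b = box(1, 5, 0, 4)
obs_c = box(0, 4, 1, 5)

median = median_delineation([obs_a, obs_b, obs_c])
cn_values = [conformation_number(median, obs) for obs in (obs_a, obs_b, obs_c)]
print(len(median), cn_values)
```

Because CN multiplies the fraction of V1 covered by the overlap with the fraction of V2 covered by the overlap, it penalizes both under- and over-contouring relative to the comparator, and reaches 1 only when the two volumes coincide.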
For Group_B, intra-observer CN values were calculated using the aforementioned van’t Riet formula12; here the common volume (CV) was calculated between each observer's Scan1 (V1) and Scan2 (V2) contours.
After completion of planned volumetric analysis, surface distance analysis was performed to identify regional delineation variation within the CTVA volume, using virtual volume unfolding as previously presented elsewhere13, 14. Briefly, for the surface distance variation calculation, the reference structure (median or expert) was first re-sampled to 100 equidistant points per delineated slice. Secondly, for each point on the reference structure, the distance, perpendicular to the surface, to the observer derived contour was calculated. For the observer variation analysis, the standard deviation (SD) of all observers was calculated for each point on the reference structure. For comparison with the expert, the group median was calculated by taking the median of the distances for each point on the expert derived reference structure to the perpendicular surface of every observer-derived contour. Regional differentials in surface variation were then explored graphically and numerically.
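The surface distance calculation can be sketched in two dimensions for a single slice. This is a simplified illustration under stated assumptions, not the virtual volume unfolding method of the cited work. Contours are assumed to be closed polygons given as lists of (x, y) vertices, and the unsigned nearest distance to the observer contour is used as a stand-in for the perpendicular distance. The sample standard deviation is also used, as the source does not state which estimator was applied. All function names are hypothetical.

```python
import math
from statistics import stdev


def resample_closed_contour(points, n=100):
    """Return n points spaced equally along the perimeter of a closed polygon."""
    closed = points + [points[0]]
    seg_lengths = [math.dist(closed[i], closed[i + 1]) for i in range(len(points))]
    step = sum(seg_lengths) / n
    resampled, seg, seg_start = [], 0, 0.0
    for k in range(n):
        target = k * step  # arc length of the k-th sample
        # Advance to the segment that contains this arc length.
        while seg < len(seg_lengths) - 1 and seg_start + seg_lengths[seg] < target:
            seg_start += seg_lengths[seg]
            seg += 1
        length = seg_lengths[seg]
        t = (target - seg_start) / length if length > 0 else 0.0
        (x0, y0), (x1, y1) = closed[seg], closed[seg + 1]
        resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return resampled


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    denom = dx * dx + dy * dy
    t = 0.0 if denom == 0 else max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / denom))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_contour(p, contour):
    """Unsigned nearest distance from p to a closed polygon (assumed proxy)."""
    closed = contour + [contour[0]]
    return min(point_segment_distance(p, closed[i], closed[i + 1])
               for i in range(len(contour)))


def per_point_sd(reference, observers, n=100):
    """SD across observers of the distance at each resampled reference point."""
    ref_points = resample_closed_contour(reference, n)
    return [stdev(distance_to_contour(p, obs) for obs in observers)
            for p in ref_points]


# Toy example: a square reference and two observers drawn 1 and 2 units larger.
reference = [(0, 0), (10, 0), (10, 10), (0, 10)]
observers = [[(-1, -1), (11, -1), (11, 11), (-1, 11)],
             [(-2, -2), (12, -2), (12, 12), (-2, 12)]]
sd_per_point = per_point_sd(reference, observers)
print(len(sd_per_point), max(sd_per_point))
```

A signed distance (positive outside the reference, negative inside) would be needed to distinguish over- from under-contouring; the unsigned version here only measures the magnitude of disagreement.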
Eight SWOG institutions had at least one user submitting contours, as well as a single non-SWOG affiliated participant. Of the twenty-six observers directly asked to participate, 15 submitted contour set pairs, of which 14 were technically evaluable (1 expert, 7 Group_A, 6 Group_B). The non-evaluable contour set consisted of non-connected, non-overlapping contours, which precluded ready analysis with the cohort at large. Survey results were pooled and are tabulated in Table 2.
All 14 remaining observers delineated the GTV and CTVA on Scan1 and Scan2. While the CTVB was mandatory in the specific delineation instructions, it was only delineated by 10 of the 13 observers. The CTVC, which should not have been delineated, was contoured by two observers both on Scan1 and Scan2, by one observer on Scan1 only, and by one other observer on Scan2 only. For one observer of Group_A and four observers of Group_B, major deviations from the delineation protocol (e.g. GTV was not encompassed by the CTVA) were visible on axial slice review. For one other observer of Group_A, the CTVB covered the internal iliac vessels instead of the externals on both Scan1 and Scan2. For the 5 observers where the CTVA did not fully cover the GTV, the CTVA was manually edited so that the observer-contoured GTV was encompassed for the volume analysis; preliminary statistical evaluation showed minimal alteration of volumetric statistics by this modification.
Between Scan1 and Scan2, only the increase in volume of the CTVB in Group_A approached statistical significance (p=0.06) (Table 3). In Group_B, the number of CTVB slices covered by all observers dropped from 14 to 3 axial CT slices, while the average delineated number of axial CT slices contoured only dropped from 20 to 16 slices. The median GTV delineated on Scan1 for all observers had a volume of 74 cc. The average CN for baseline inter-observer variation of the GTV was 0.75 (range 0.60–0.81). The median CTVA had a volume of 709 cc and a CN of 0.65 (range 0.47–0.75), which indicates comparatively greater inter-observer disagreement for CTVA compared to the GTV. For CTVB, with a median volume of 70 cc, even less inter-observer agreement could be found, with an average CN of 0.46 (range 0.24–0.70).
Atlas exposure led to a statistically significant increase in volumetric agreement on CTVA between observers (Figure 3a) and with the expert (Figure 3b), as measured by CN. The average inter-observer CN (i.e. agreement with the median surface) increased from 0.68 (range 0.41–0.78) on Scan1 to 0.76 (range 0.57–0.87) on Scan2 (p=0.031, paired Wilcoxon signed rank test; Figure 3a). The average CN, compared with the expert, increased from 0.58 pre-atlas (range 0.42–0.70) to 0.69 post-atlas (range 0.58–0.78; p=0.016; Figure 3b). For the CTVB, however, neither inter-observer agreement (mean CN 0.39 [range 0.26–0.67] vs. 0.45 [range 0.13–0.68]; p=0.4) nor agreement with the expert (mean CN 0.31 [range 0.16–0.49] vs. 0.30 [range 0.11–0.44]; p=0.8) was altered to a statistically significant degree after atlas exposure.
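The paired Wilcoxon signed rank test used for these comparisons can be sketched with an exact enumeration, which is practical for a cohort of 7 observers. The CN values below are made up for illustration and are not the study data. The function name is hypothetical, and the source does not state whether an exact or an asymptotic p-value was reported.

```python
from itertools import product


def wilcoxon_signed_rank_exact(before, after):
    """Two-sided exact paired Wilcoxon signed rank test.

    Returns (statistic, p_value), where statistic = min(W+, W-).
    Zero differences are dropped; tied absolute differences get average ranks.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    statistic = min(w_plus, total - w_plus)
    # Under the null hypothesis each sign is equally likely: enumerate all 2^n.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** n


# Hypothetical CN values for 7 observers (NOT the study data).
cn_scan1 = [0.50, 0.55, 0.60, 0.65, 0.70, 0.72, 0.75]
cn_scan2 = [0.52, 0.59, 0.66, 0.73, 0.80, 0.84, 0.89]
stat_all, p_all = wilcoxon_signed_rank_exact(cn_scan1, cn_scan2)

# Same data, except the first observer now decreases by the smallest margin.
cn_scan2_one_down = [0.49] + cn_scan2[1:]
stat_one, p_one = wilcoxon_signed_rank_exact(cn_scan1, cn_scan2_one_down)
print(stat_all, p_all, stat_one, p_one)
```

With 7 pairs, the smallest attainable two-sided exact p-value is 2/2^7 ≈ 0.016, reached only when all observers change in the same direction. A single observer moving the other way by the smallest margin gives 4/2^7 ≈ 0.031. This illustrates how coarse the attainable significance levels are at this sample size.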
Since the atlas only affected observer variation for CTVA, exploratory post-hoc surface distance variation analysis was limited to CTVA only (Figures 4 and 5). To translate surface maps into numbers, first the reference structure (median/expert CTVA) was divided into anterior, lateral and posterior regions, and subdivided into upper and lower sub-regions at the level of the tip of the coccyx. For each of the 6 regions, the SD value covering 5%–95% of regional surface distance difference was taken to characterize the minimum and maximum regional variation (Table 4), though no formal statistical comparison of regional sub-volumes was performed. Visual inspection (Figures 4 and 5) showed that introduction of the atlas resulted in modification of surface distance between observers and the expert primarily in limited regions of the CTVA, rather than the CTVA volume globally. Modification of target volumes was most notably localized to the upper-anterior region adjacent to the bladder, and to the lower-posterior and lateral CTVA (not shown); however, statistical significance was not formally assessed. For all defined regions, except upper-posterior and upper-lateral, the upper 95% CI of the inter-observer surface standard deviation was reduced by 0.2–0.8 cm after the introduction of the atlas. As Table 4 demonstrates, >1 cm of surface variation was observed for multiple regions before atlas implementation for all users, and though reduced after atlas administration, >1 cm was still needed to cover 95% of the surface variation in CTV sub-regions.
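One plausible reading of the regional summary above is that, within each sub-region, the 5th and 95th percentiles of the per-point surface SD are reported as the minimum and maximum regional variation. The sketch below follows that assumption. The region label, the function name, and the synthetic SD values are hypothetical and are not taken from Table 4.

```python
from statistics import quantiles


def regional_sd_range(sd_by_region):
    """Map each region to the (5th, 95th) percentiles of its per-point SD values."""
    summary = {}
    for region, sds in sd_by_region.items():
        # n=20 yields cut points at 5%, 10%, ..., 95%.
        cuts = quantiles(sds, n=20, method="inclusive")
        summary[region] = (cuts[0], cuts[-1])
    return summary


# Synthetic per-point SD values in cm for one region (illustrative only).
sd_by_region = {"upper-anterior": [i / 100 for i in range(101)]}
summary = regional_sd_range(sd_by_region)
print(summary)
```

Trimming to the 5%–95% interval keeps a few outlying surface points, such as those at the cranial or caudal contour limits, from dominating the reported range.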
Regarding intra-observer variation, the absolute volume of all respective structures contoured was essentially equivalent (Table 3). Comparison between delineations on Scan1 and Scan2 in Group_B yielded an average CN of 0.80 (range 0.75–0.82), 0.68 (range 0.47–0.89) and 0.54 (range 0.16–0.72) for the GTV, CTVA and CTVB respectively. Regional intra-observer variability is illustrated graphically in Figure 5.
Despite the well-known consequences of geometric inaccuracy in target volume delineation15–17, inter-observer variability in target definition has been demonstrated in a host of studies, in various anatomical sites18. Simply put, “inter-observer variability in the definition of GTV and CTV is a major – for some tumor locations probably the largest – factor contributing to the global uncertainty in radiation treatment planning”18. Consequently, there are continuing efforts to implement solutions to possible sources of variability/error in the target volume delineation process. These solutions have included optimization of imaging inputs19–22, instructional protocol modification5, 23, 24, integration of specific training programs25, 26, development of software tools27–31, and implementation of standardized guidelines32–37 for distinct anatomic sub-sites. For clinical trials, the situation is potentially more vexing, as assurance of adequate treatment uniformity between comparison cohorts necessitates increased attention to both protocol construction and enrollee plan review, at a significant cost in time and resources for the primary investigator(s).
In terms of feasibility, the study was readily completed (total study duration of 5 months). A total of 12/26 (46%) invited SWOG institutions confirmed intent to participate; however, only 8/26 (31%) had resultant submissions. Nonetheless, our findings suggest that a reasonably powered target delineation trial might be implemented with a modicum of cooperative group resource allocation in a timely manner, and that such a study is both technically and logistically feasible. Analysis of the resultant data points to the difficulty of executing clinical trials in the conformal radiotherapy era. The high proportion of major protocol deviations was consistent with previous reports. The substantial variation from expert reference and median contour surfaces observed for all users pre-intervention (Figures 4 and 5, Table 4) suggests that efforts to further minimize inter-observer variability are imperative. As Table 4 demonstrates, substantial inter-observer surface deviation was observed for multiple CTVA sub-regions before atlas implementation. After atlas administration, reductions of 0.3, 0.6 and 0.8 cm were achieved in the upper limit of standard deviation from the median isosurface for the upper-anterior, lower-lateral and lower-posterior CTVA sub-regions, respectively. Although >1 cm would still be needed to cover 95% of all contouring variability, the achieved reduction(s) would result in a decrement in required PTV expansion margins. However, further reduction of variation is desired, because the PTV margins required to encompass the residual variation in target delineation would limit the practical advantages of IMRT over conventional RT.
Several limitations to this pilot study are evident. The sample size of this study is limited, and only a single case was contoured. Utilization of a reference expert’s contours as a de facto “gold standard” points to the fact that the “ground truth” in contouring clinical target volumes remains ambiguous (Table 3, noting variation within the reference expert user’s sequential contours). Some variance in the study might be attributable to instructions which were distinct from standard clinical practice (e.g. the external iliacs are not typically contoured for T3 rectal cases). Our invitation was limited to SWOG institutions, creating potential sampling bias, and the fact that only interested observers participated creates an avenue for selection bias. Nonetheless, our data suggest that inclusion of a visual atlas in addition to written instructions can improve conformance to a reference expert’s contours (Figure 3b), as well as reduce inter-observer variability (Figure 3a), to a statistically detectable degree. However, our data also suggest substantial residual variability in rectal target volume delineation, even after atlas utilization (Tables 3 and 4).
This study is consistent with previous investigations of educational interventions and consensus guideline application in contouring studies. Recently, Bekelman et al.25 demonstrated improvement in contour quality after a directed teaching intervention, echoing previous work by Tai et al. showing increased protocol compliance after a site-specific educational experience26. With regard to consensus guideline application as an avenue towards target variability reduction, Dimopoulos et al.32 detail a study in which 19 cervical cancer cases were contoured using GEC-ESTRO guidelines by two observers, with a resultant between-user conformity index (CI)11, 21 in the range 0.6–0.7 for target volumes, roughly consistent with CNs in the current series. Likewise, Wong et al. recently demonstrated, using a test-retest sequence, that improved consistency in seroma contouring could be observed after exposure to consensus guidelines38. In the clinical trial setting, it is likely that “trial specific” atlases should be used, based on patterns of failure data (as per Roels et al.39) or, possibly, after a pilot contouring trial similar to the present study. For instance, RTOG anorectal consensus guidelines stipulate coverage “extending CTVA ~1 cm into the posterior bladder, to account for day-to-day variation in bladder position10.” This incorporation of motion into CTV generation, rather than PTV expansion, represents a conceptual break with ICRU 6240 and other guidelines39, wherein the posterior bladder wall would not be contoured. No users in Group_A included significant portions of posterior bladder pre-atlas, whereas a majority did so after atlas exposure (in compliance with the presented atlas10 and consistent with the reference expert).
Future studies will be required to ascertain whether the observed effects of atlas administration are transferable to other anatomic sites with potentially more complicated anatomic relationships5, 24. The SWOG Radiation Therapy Committee intends to suggest building target delineation studies into clinical trial protocol development/quality assurance processes. Aspects of this dataset may also be integrated into the design of educational materials for a proposed Dutch cooperative group rectal study workshop. We plan to use portions of this dataset to construct composite models accounting for rectal motion and set-up variability14, 41, and to develop novel software strategies for evaluation42 and minimization of target delineation variance.
The addition of a visual atlas and consensus treatment guidelines to a written protocol increased clinical target volume delineation conformance with expert derived contours and increased contour agreement between participants for CTVA, but not GTV or CTVB, for the included rectal cancer case. Detected inter-observer (both with and without atlas) and intra-observer variation in contouring target structures was substantial. Visual atlas-based supplementary target volume specification materials should be considered for clinical trials involving conformal radiotherapy approaches.
C.D.F. is supported by a training grant from the National Institutes of Health/National Institute of Biomedical Imaging and Bioengineering, “Multidisciplinary Training Program in Human Imaging” (5T32EB000817-04), a Technology Transfer Grant from the European Society for Therapeutic Radiology Oncology, and the Product Support Development Grant from the Society for Imaging Informatics in Medicine. The funder(s) played no role in study design, in the collection, analysis and interpretation of data, in the writing of the manuscript, nor in the decision to submit the manuscript for publication. Portions of this data have been selected for a Poster Recognition Award at the 2009 American Society for Radiation Oncology Annual Meeting, November 1–5, 2009, Chicago, IL.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Conflicts of Interest Notification: The author(s) assert no conflict(s) of interest to declare germane to this effort.