The sensitivity of CT virtual colonoscopy (CT colonography) for detecting polyps varies widely in recently reported large clinical trials. Our objective was to determine whether a computer program is as sensitive as optical colonoscopy for the detection of adenomatous colonic polyps on CT virtual colonoscopy.
The data set was a cohort of 1,186 screening patients at three medical centers. All patients underwent same day virtual and optical colonoscopy. Our enhanced gold standard combined segmental unblinded optical colonoscopy and retrospective identification of precise polyp locations. The data were randomized into separate training (n=394) and test (n=792) sets for analysis by a computer-aided polyp detection (CAD) program.
For the test set, CAD’s per polyp and per patient sensitivities were both 89.3% (25/28, 95% CI [71.8%, 97.7%]) for detecting retrospectively identifiable adenomatous polyps at least 1 cm in size. The false-positive rate was 2.1 [2.0, 2.2] false polyps per patient. Both carcinomas were detected by CAD at a false-positive rate of 0.7 per patient; only one of the two was detected by optical colonoscopy prior to segmental unblinding. At both the 8 mm and 10 mm adenoma size thresholds, the per patient sensitivities of CAD were not significantly different from those of optical colonoscopy prior to segmental unblinding.
The per patient sensitivity of CT virtual colonoscopy computer-aided polyp detection in an asymptomatic screening population is comparable to that of optical colonoscopy for adenomas 8 mm or larger and is generalizable to new CT virtual colonoscopy data.
Colorectal cancer is the second leading cause of cancer death in Americans 1. It is known that with proper screening, colorectal cancer can be prevented. Unfortunately, many patients do not undergo screening due to the perceived inconvenience and discomfort of existing screening tests. Virtual colonoscopy (also known as CT colonography), a CT scan based imaging method, has been under study for the past 10 years and shows promise as a method of colorectal cancer screening that may be acceptable to many patients 2, 3.
Recent large clinical trials have suggested that virtual colonoscopy may have high sensitivity and specificity for polyp detection 4, 5. Other studies have raised questions about its reproducibility and accuracy in actual clinical practice 6–9. If virtual colonoscopy is to be widely disseminated for colorectal cancer screening, methods that improve consistency and accuracy would be highly desirable.
Computer-aided polyp detection has been proposed by a number of investigators as a way to improve the consistency and sensitivity of virtual colonoscopy interpretation and to reduce interpretation burden 10. Preliminary studies of prototype computer-aided detection (CAD) systems on small patient datasets have reported per polyp sensitivities from 64% to 100% and false-positive rates from 1 to 11 false-positives per patient for detecting polyps 1 cm or larger 11–17. However, there is currently insufficient evidence as to whether CAD is accurate in a screening population and whether the reported results generalize to independent data.
The purpose of this study is to provide this evidence by assessing CAD performance on a large consecutive prospectively-enrolled asymptomatic screening patient population. To ascertain the generalizability of CAD’s performance, we randomized the patients’ data into separate training and test sets, and evaluated the performance of CAD on each dataset.
The patient population consisted of 1,253 asymptomatic adults between 40 and 79 years of age at three medical centers (“Institutions 1–3”), of whom 1,233 underwent complete same day virtual and optical colonoscopy 4. Twenty of the 1,253 patients were excluded because of incomplete optical colonoscopy, inadequate preparation or failure of the CT colonographic system. The study was approved by the institutional review boards (IRBs) at all three centers. Written informed consent was obtained from all patients. This study was part of the original IRB-approved project and consent form that led to publication of Ref. 4, and the patient population is the same.
Patients underwent a 24-hour colonic preparation that consisted of oral administration of 90 ml sodium phosphate, 10 mg bisacodyl, 500 ml of barium (2.1% by weight) and 120 ml of diatrizoate meglumine and diatrizoate sodium given in divided doses 18.
A small flexible rectal catheter was inserted and pneumocolon achieved by patient-controlled insufflation of room air. Each patient was scanned in the supine and prone positions during a single breathhold using a four-channel or eight-channel CT scanner (General Electric LightSpeed or LightSpeed Ultra). CT scanning parameters included 1.25 – 2.5 mm section collimation, 15 mm per second table speed, 1 mm reconstruction interval, 100 mAs and 120 kVp.
Optical colonoscopy (OC) was performed by one of 17 experienced colonoscopists. Our technique for segmental unblinding of virtual colonoscopy results at OC has been previously described 4 and reduces OC false-negatives by as much as 12% for large adenomas (≥ 10 mm) 19. The colonoscopists used a calibrated guidewire to measure polyp size, and recorded whether the polyp was located on a haustral fold as well as a subjective assessment of polyp shape (sessile, pedunculated or flat).
CT images from the virtual colonoscopy studies from each of the 3 institutions were loaded onto a computer server. The CT images from 47 patients could not be located or restored and were excluded from further analysis; this left 1,186 patients with complete data.
To assess the performance of the CAD software, we developed an enhanced ground truth (calibration data) based upon manual determination of the three-dimensional borders of polyps. Each polyp 6 mm or larger found at optical colonoscopy was located on the prone and supine virtual colonoscopy examinations using three-dimensional endoluminal reconstructions with “fly-through” capability and multiplanar reformatted images (V3D Colon, research version, Viatronix).
For each polyp and for each position (supine and prone), a marker was placed manually in the center of each polyp using computer software. Then the borders of the polyp on each slice that contained the polyp were manually traced. The markers (approx. 500) and borders (approx. 3650) were stored in data files. The markings and tracings were done by a trained research assistant (D.B.) supervised by a radiologist (R.M.S.).
To assess the potential clinical significance of CAD false positives, we created a database of radiologist false positives to enable comparison of the two sets for any commonality. This database allowed us to determine whether radiologists and CAD made the same false positives. A trained research assistant (V.K.), supervised by a radiologist (R.M.S.) identified the false positive polyps reported on the same cases by the radiologists in Ref. 4. Each false positive that was identifiable in retrospect was marked and manually traced as above.
The CAD system has been described in detail elsewhere 12, 17. It consisted of automated identification of the colonic lumen and wall 20, electronic subtraction of opacified colonic fluid 21, calculation of colonic surface features, segmentation of candidate polyps to locate their entire three-dimensional boundaries 22 and classification to distinguish true and false positive polyp detections 23, 24.
The output of the CAD system was a series of locations of polyp candidates in the CT images. The location data could be converted to a graphical overlay on three-dimensional virtual colonoscopy images.
The CAD software compared its detections with the ground truth tracings in a blinded fashion. If any part of a detection matched any part of a manual tracing of a polyp, the detection was considered a true-positive; otherwise the detection was considered a false-positive. Similarly, if any part of a detection matched any part of a manual tracing of a radiologist false-positive, the detection was considered a matching false-positive.
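The any-part-overlap matching rule described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual software: the function name, voxel representation, and data layout are all assumptions.

```python
# Hypothetical sketch of the overlap-based matching rule: a detection is a
# true positive if any of its voxels coincides with any voxel of a manually
# traced polyp; otherwise it is a false positive.

def match_detection(detection_voxels, traced_polyps):
    """Return the index of the first traced polyp sharing at least one
    voxel with the detection, or None if the detection is a false positive."""
    det = set(detection_voxels)
    for i, polyp_voxels in enumerate(traced_polyps):
        if det & set(polyp_voxels):  # any shared (x, y, z) voxel counts as a match
            return i
    return None

# Example: the first detection overlaps polyp 0; the second overlaps nothing.
polyps = [[(10, 20, 5), (10, 21, 5)], [(40, 40, 12)]]
print(match_detection([(10, 21, 5), (11, 21, 5)], polyps))  # -> 0
print(match_detection([(99, 99, 1)], polyps))               # -> None
```

The same rule applied against the traced radiologist false-positives yields the "matching false-positive" counts reported later.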
As for other types of radiology CAD such as detecting lung nodules on CT scans or breast cancer on mammography, the CAD system for detecting polyps must be trained on proven cases. The training “teaches” the computer program how to discriminate between true polyps and non-polyps. After training, the entire CAD system, including the classifier, should be applied to new “test” cases to provide a fairer assessment of future performance.
To implement this, the data set was divided into separate training and test sets. We chose to train on 1/3 and test on the remaining 2/3 of the data. This partitioning of the data enables better statistical power during testing and quicker processing during technical development when the training set is used. The division into training and test data sets was done using a random number generator that assigned patients from all three centers to either the training or test sets (Microsoft Access). Characteristics of the patients in the training and test sets are shown in Table 1.
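The patient-level 1/3–2/3 randomization could be sketched as below. The paper used a random number generator in Microsoft Access; this Python version is only an assumed equivalent, with a fixed seed for reproducibility.

```python
# Illustrative patient-level random split into training (1/3) and test (2/3)
# sets; all polyps from a given patient follow that patient's assignment.
import random

def split_patients(patient_ids, train_fraction=1 / 3, seed=0):
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    ids = list(patient_ids)
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]  # (training set, test set)

train, test = split_patients(range(1186))
print(len(train), len(test))  # sizes close to the paper's 394 / 792 split
```

Splitting by patient rather than by polyp keeps the supine and prone exams of one patient, and multiple polyps in one patient, from straddling the two sets.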
Testing cases were sequestered and not used during development or training 25. When an acceptable training result was achieved, testing was run to produce the results shown herein. We performed training and testing both with and without merging of overlapping detections but, based on superior performance with merging during training, we present only results for merged detections. Details of the training and classifier design have been previously published 23, 24, 26.
The training was done using detections from the training set cases from all three institutions. Training was done for adenomas at 10, 8, and 6 mm size thresholds. Adenomas smaller than these size thresholds and all non-adenomatous polyps were placed in the false-positive set during training. The outputs of the training were three different classifiers, one for each size threshold, that were individually applied to the CT colonography test data.
The CAD software executed on both the Linux and Windows operating systems. The majority of the cases (> 99%) were run on a Linux super-cluster (a network of inexpensive computers linked together) to more efficiently analyze the large number of CTC exams 27. As many as 64 exams could be analyzed simultaneously on the super-cluster. CAD successfully analyzed all but four training (two supine and two prone) and three test exams (two supine and one prone). The processing time per patient was 20.2±8.0 minutes (n=1179), approximately half of which was spent reading the images across the network.
We used free-response receiver operating characteristic (FROC) analysis, the standard method for evaluating CAD performance 28. FROC analysis produces curves that graphically show the sensitivity of CAD for detecting polyps versus false-positive rate (number of false positives per patient) for different settings of a tunable parameter in the classifier. As is typical in CAD, one can tune the CAD system to yield higher sensitivity at the expense of a greater number of false-positives. FROC curves are presented for different adenoma size categories and for training and testing. Because we are focusing on the more clinically significant adenomatous polyps, true-positive detections on proven non-adenomatous polyps were ignored and not included in the false-positive rates for the FROC analysis. Because the number of non-adenomatous polyps (Table 2) was small relative to the number of patients, the effect of this procedure on false positive rates is negligible.
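A minimal FROC computation can be sketched as follows, assuming each CAD candidate carries a classifier score and a flag recording whether it matched a traced adenoma; sweeping a score threshold traces out (false positives per patient, sensitivity) pairs. All names are illustrative, and for simplicity each true detection counts once (real per-polyp scoring would deduplicate multiple hits on the same polyp).

```python
# Sketch of FROC curve construction from scored detections.

def froc_points(detections, n_patients, n_true_polyps):
    """detections: list of (score, is_true_positive) pairs.
    Returns (FP-per-patient, sensitivity) points from strictest
    to most lenient score threshold."""
    points = []
    for threshold in sorted({s for s, _ in detections}, reverse=True):
        kept = [tp for s, tp in detections if s >= threshold]
        tp = sum(kept)               # true-positive detections retained
        fp = len(kept) - tp          # false-positive detections retained
        points.append((fp / n_patients, tp / n_true_polyps))
    return points

dets = [(0.9, True), (0.8, False), (0.7, True), (0.4, False), (0.3, False)]
for fp_rate, sens in froc_points(dets, n_patients=2, n_true_polyps=2):
    print(f"{fp_rate:.1f} FP/patient, sensitivity {sens:.0%}")
```

Lowering the threshold moves rightward along the curve, trading a higher false-positive rate for higher sensitivity, which is exactly the tunable behavior described above.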
While FROC curves show the spectrum of CAD sensitivities across a range of false-positive rates, for clinical use a CAD system is typically set at a specific operating point on the FROC curve with fixed sensitivity and false-positive rate. For each of the three size thresholds, we selected an operating point on the FROC curve. We report the sensitivities and false-positive rates at these operating points in the Tables. The operating points were chosen in relatively flat parts of the FROC curves where there were diminishing gains in sensitivity as the false-positive rates were increased. The operating points were chosen somewhat arbitrarily but represent reasonable trade-offs between sensitivity and false-positive rates.
A random subset of 64 false-positives was selected from those found after application of the classifier trained on adenomas 10 mm or larger to determine their cause. Images of these false-positives were loaded into a software application developed by author J.Y. that creates a mosaic of images that can be reviewed rapidly to determine the cause of the false-positives.
To better characterize CAD performance, we computed CAD’s sensitivity three ways: for all polyps, for those surrounded by luminal air and for those submerged in opacified fluid. A polyp was considered to be submerged if by visual assessment 50% or more of its surface was covered by fluid. Polyps were not considered submerged if they were merely coated with a thin layer of opacified fluid. We also stratified detection performance by polyp shape (sessile, pedunculated or flat), location in the colon and whether the polyps were on folds.
Sensitivity was computed two ways: using all polyps found at segmentally unblinded optical colonoscopy and using only those polyps visible upon retrospective review of the CT colonography images. The former is useful for comparing the overall sensitivity of CAD to that of optical colonoscopy prior to segmental unblinding and literature reports of radiologist interpretation. The latter is useful for distinguishing CAD’s performance from shortcomings of the CTC technique itself. For example, some polyps, particularly those 6 or 7 mm in size, could not be found on the supine and/or prone views. Consequently, it is not possible to train on them or to confirm whether CAD detected them.
We report exact 95% confidence intervals for sensitivities and false positive rates (SAS Software Version 9.1), use the Fisher exact test to compare proportions, and consider statistical significance to be p<0.05. Bootstrapping was used to compute standard deviations over a range of operating points for the FROC analysis. The bootstrapping was done by determining FROC curves for each of 100 random samples of 792 test patients with replacement (duplicates allowed) and then estimating the standard deviation at fixed values of the sensitivity and false positive rate on the FROC curves.
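The resample-with-replacement scheme above can be sketched for the per-patient sensitivity as follows. The resampling follows the paper (100 bootstrap rounds over the test patients); the per-patient outcome flags and function name are assumptions for illustration.

```python
# Bootstrap estimate of the standard deviation of per-patient sensitivity:
# resample the patients with replacement, recompute sensitivity each round,
# and take the standard deviation across rounds.
import random
import statistics

def bootstrap_sensitivity_sd(patient_hits, n_rounds=100, seed=1):
    """patient_hits: list of 0/1 flags (was each polyp-bearing patient
    detected?). Returns the bootstrap SD of per-patient sensitivity."""
    rng = random.Random(seed)
    n = len(patient_hits)
    sens = []
    for _ in range(n_rounds):
        sample = [patient_hits[rng.randrange(n)] for _ in range(n)]  # with replacement
        sens.append(sum(sample) / n)
    return statistics.stdev(sens)

# 25 of 28 polyp-bearing patients detected, as at the 10 mm operating point
sd = bootstrap_sensitivity_sd([1] * 25 + [0] * 3)
print(f"bootstrap SD of sensitivity: {sd:.3f}")
```

With 25/28 detected, the bootstrap SD comes out in the vicinity of the 4% to 6% reported at the operating points in the Results.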
The patients were distributed into the training and test sets as shown in Table 1; after accounting for the 2:1 split, the age and gender distributions of the two sets were similar. The polyp distributions are shown in Table 2.
The FROC curves are shown in Figure 1 for the three different classifiers trained to detect adenomatous polyps ≥ 10, 8, and 6 mm. These curves indicate that at a constant false-positive rate, sensitivity was higher for larger polyps. Sensitivity was also higher on the training set compared to the test set, although the differences were small (< 5%) for the 8 mm and 10 mm size thresholds. The three operating points are indicated by their associated error bars.
The per polyp and per patient sensitivities at the operating point at each size threshold are shown in Table 3. At a false-positive rate of 2.1 per patient for polyps 10 mm or larger, the per polyp and per patient sensitivities were both 89.3%. Both carcinomas were found at a false-positive rate of 0.7 per patient. The sensitivities were lower for the two smaller size thresholds. Example virtual colonoscopy images of 1.4 cm, 0.8 cm and 0.6 cm polyps detected by CAD are shown in Figures 2–4.
The sensitivities of first-look optical colonoscopy (prior to segmental unblinding) and of virtual colonoscopy CAD, using as the baseline all adenomas found by segmentally-unblinded optical colonoscopy, are compared in Table 4. The per patient sensitivities of CAD were not significantly different from those of first-look optical colonoscopy at the 8-mm and 10-mm size thresholds; the per polyp sensitivities were not significantly different at the 10-mm size threshold. Optical colonoscopy initially missed one of the two carcinomas prior to segmental unblinding; CAD detected both cancers.
Standard deviations of sensitivity ranged from 4% to 6%, and of false positive rate from 0.1 to 0.3 per patient, at the operating points (Figure 1). The bootstrap analysis revealed that the standard deviations in sensitivity increased at lower false positive rates to a maximum of 10%. The standard deviations in false positive rate increased at higher false positive rates, to a maximum of 0.8 per patient.
Sensitivity was higher for adenomatous polyps in the air-filled part of the colonic lumen compared to the fluid-filled part (Table 5). The sensitivity differences were statistically significant for five of six pair-wise comparisons. In general, polyps were more frequently located in the air-filled part of the colonic lumen.
Sensitivity of polyp detection as a function of shape, location and relationship to a haustral fold is shown in Table 6. Larger polyps were most frequently pedunculated, smaller polyps were most frequently sessile. For the 6 mm and larger polyps, CAD sensitivity was lower for sessile polyps compared to pedunculated polyps and for polyps on a fold compared to polyps not on folds. There were no significant sensitivity differences for left-sided compared to right-sided polyps. None of five flat polyps were detected by CAD.
Of CAD false-negatives, 67% (2/3), 90% (9/10) and 89% (41/46) were for adenomatous polyps on or touching a fold and 67% (2/3), 80% (8/10) and 24% (11/46) were on or near (within a few voxels of) the air-fluid boundary at the 10, 8 and 6 mm size thresholds, respectively.
Analysis of 64 random CAD false-positives 1 cm or larger showed that the majority were caused by the ileocecal valve (52/64, 81%) at a FP rate of 2.1 per patient. The remainder were due to haustral or rectal folds, residual stool or fluid, or other causes.
The radiologists identified 165 false-positive polyps of all sizes in the test set, of which 126 could be found on at least one view (supine or prone). Of 1,692 CAD false-positive detections in the test set (FP rate 2.1 per patient), only 15 CAD false-positives (0.9%) matched radiologist false-positives.
CT virtual colonoscopy has progressed rapidly since its inception in 1994 29. Several large clinical trials have been published 4, 6, 8, 30. Some of these trials have reported excellent sensitivity but others have demonstrated relatively poor sensitivity. The causes of poor sensitivities have been variously attributed to out-of-date CT scanner technology, absence of bowel opacification, inadequate interpretation software, improper interpretation approach (2-D rather than 3-D) or lack of training of the interpreters 7, 31–34. While there is consensus that virtual colonoscopy is appropriate for such indications as incomplete colonoscopy, there is ongoing debate about its role in the asymptomatic average risk (screening) patient.
The process of interpreting virtual colonoscopy exams is one area that has received considerable scrutiny in recent years. For example, there is debate over whether images should be read using a primary 2-D versus primary 3-D approach, whether different interpretation software yields different results and whether training or occupation affect interpretation skill 6, 7, 9, 35–39. It is clear that different observers interpret virtual colonoscopy images with different levels of skill. For example, Fletcher et al. found that 17 of 30 false negative polyps one centimeter or larger were missed because of perceptual error 40. By detecting disease on radiologic images with high sensitivity and low false-positive rate CAD can potentially improve overall physician interpretative performance, diminish the frequency of perceptual errors, and allow more poorly performing interpreters to attain performance levels comparable to experts 41, 42.
A number of CAD systems for polyp detection have been described 12, 14, 43–52. In a typical implementation, computer-aided polyp detection analyzes the surface of the colon to identify polyp-like shapes that protrude into the colonic lumen. Factors such as colonic wall thickness, surface curvature and contrast enhancement have been proposed as useful features that can be quantitated and can distinguish polyps from normal colonic mucosa 11–14, 17, 44, 47, 53. While these works are encouraging, in general they have used small highly selected patient populations, unclear patient selection criteria, or more readily detectable conspicuous polyps to develop and assess the CAD system. In addition, with few exceptions 54, 55, data have come from a single institution with testing done on the same data used for training.
While CAD development for polyp detection has proceeded along many fronts, one common and critical element is validation of performance on a database of proven cases. There are many important issues about developing the database and validating performance if the CAD system is to be generalizable to new patient data. It is accepted by many experts that the key elements of the database are that it be an unbiased collection of proven cases of sufficient number to adequately reflect the diversity of polyp sizes, shapes and locations in the patient population. It is also critical to determine the generalizability of the CAD system by assessing its performance on a fresh set of data (a test set) different from that upon which it was developed (the training set). Our database and validation methods were chosen to fulfill these important criteria. In this paper, we used data from 1,253 consecutive screening cases from three medical institutions, less the approximately 5% that were excluded, and divided them into separate training and testing samples. The CT colonography data were validated with an enhanced gold standard, segmentally-unblinded optical colonoscopy. To our knowledge, this is the largest virtual colonoscopy database of its kind.
When we analyzed all polyps visible in retrospect on CT colonography, both the per polyp and per patient sensitivities were 89.3% at a false-positive rate of 2.1 per patient for polyps 10 mm or larger. At the 8 mm size threshold, the per polyp and per patient sensitivities were 80.8% and 87.2%, respectively, at a false-positive rate of 6.7 false polyps per patient. These results indicate that CAD reliably finds retrospectively-visible adenomatous polyps 8 mm or larger on CT colonography images.
When compared to the sensitivities of first-look optical colonoscopy and of radiologist interpretation in the largest CT colonography trials, CAD’s per adenoma sensitivity at the 10 mm size threshold (86.2%) was equivalent or better. For example, CAD’s sensitivity was not significantly different from that of the radiologists reported by Pickhardt et al. (47/51, 92.2% [95% CI: 81.1 – 97.8]), but was significantly greater than those reported by Cotton et al. (28/54, 52.0% [38.7 – 65.3]), Rockey et al. (35/55, 64% [49 – 77]) and Johnson et al. (double read, 26/41, 63.4% [46.9 – 77.9]) 4, 6–8. Note that Cotton et al. did not break down per polyp sensitivity by polyp histology, so all colorectal lesions (including hyperplastic polyps) are included. Rockey et al. report combined sensitivities for detecting adenomas and cancers.
Similarly, when compared to sensitivities of first-look optical colonoscopy (85.7% and 89.6%) and to radiologist interpretation in the largest CT colonography trials, CAD’s per patient sensitivities (89.3% and 85.4%) were equivalent or better at the 10 mm and 8 mm size thresholds, respectively, and are therefore likely to be in the clinically acceptable range. For example, at the 10 mm size threshold CAD’s sensitivity was not significantly different compared to radiologists’ as reported by Pickhardt et al. (45/48, 93.8% [82.8 – 98.7]) but was significantly greater than that reported by Cotton et al. (23/42, 55.0% [39.9 – 70.0]), Rockey et al. (37/63, 58.7% [45 – 71]) and Johnson et al. (double read, 30/47, 63.8% [48.5 – 77.3]) 4, 6–8. Note that Cotton, Rockey and Johnson did not break down per patient sensitivity by polyp histology so that all colorectal lesions (including hyperplastic polyps) are included. At the 8 mm size threshold our per patient sensitivities were not significantly different compared to that reported by Pickhardt et al. (77/82, 93.9% [86.3 – 98.0]). These comparisons do not take into account any changes in specificity that might occur as a consequence of CAD false positives.
We found that CAD developed on training data was generalizable to a separate test set. For example, the sensitivity and false positive rate of CAD were essentially identical for the training and test sets at the 10 mm size threshold. For smaller size thresholds, there was a decrease in sensitivities between the training and test sets that ranged from about 5% to 10% on average at the 8 mm and 6 mm size thresholds, respectively (Figure 1). Standard deviations at the operating points were low for sensitivity (4% to 6%) and negligible for false positive rate (0.1 to 0.3). These standard deviations, which provide an estimate of the expected change in sensitivities and false positive rates on new datasets, are likely to be in the clinically acceptable range.
For guiding practical use by clinicians and future technical improvements by researchers, it is important to ascertain particular situations in which CAD is less effective. The sensitivity of our CAD system was lower for polyps under fluid, for small sessile and flat polyps and for small polyps on folds. Many false-negatives were at the air-fluid boundary, a location difficult for CAD to analyze. Factors such as the CT attenuation and amount of opacified colonic fluid may also affect CAD performance. The bowel prep used in this study produced a relatively large volume of residual colonic fluid 56. Subsequent modifications of the bowel preparation have since reduced the amount of retained colonic fluid, which would likely improve CAD performance.
The significance of the false-positive rate is harder to assess. Physician acceptance of false-positive rates of 2.1 or 6.7 per patient, at the 10 mm and 8 mm thresholds respectively, depends on a number of issues: the efficiency (speed) with which physicians can review CAD “hits” and how difficult it is to decide whether a CAD hit is true or false. The former is determined by the quality of the user interface of the interpretation software and was not specifically investigated by us. The latter we studied at a false-positive rate of 2.1. We found that most false-positives were readily identified as normal structures such as the ileocecal valve or colonic folds. In addition, few (0.9%) of the CAD false-positives coincided with radiologist false-positives. This suggests that most CAD false-positives would be rejected by the radiologist as unlikely to represent true polyps. There is preliminary evidence that CAD false-positives do not significantly impair radiologists’ specificity even when almost 30 false-positives are shown per patient 52.
Because of the large number of CT colonography datasets in this study, we used a Linux super-cluster to perform the CAD analyses more efficiently. In clinical practice, the CAD system described herein would be run on a readily available desktop personal computer running either the Linux or Microsoft Windows operating systems. We estimate the typical processing time to be under 10 minutes per patient using such a system.
This study has several limitations. First, we could have incorrectly matched polyps found at optical and virtual colonoscopy. This error could either increase or decrease the measured sensitivity of CAD. Second, there were a number of polyps found at optical colonoscopy that we could not find retrospectively at virtual colonoscopy. Although it is possible that CAD “false-positives” were actually true-positive detections of such polyps, we suspect this occurred infrequently. To avoid bias, we did not attempt to reclassify such polyps.
We do not report performance on hyperplastic polyps. For polyps in the test set 6 mm or larger, 31.9% (65/204) were hyperplastic polyps. While hyperplastic polyps may appear indistinguishable from adenomas on CT colonography, they have no malignant potential and consequently it is less important to detect them.
CT colonography CAD is an active area of research pursued by a number of investigators both in the academic and commercial sectors. Future improvements in CAD algorithms will likely lead to even better performance. CAD systems for CT colonography are likely to become commercially available within the next few years, pending approval by the appropriate regulatory agencies.
The economics of CT colonography CAD is an important and open issue. Unlike the situation for mammography CAD, colonography CAD is not yet reimbursable. CAD could decrease expensive radiologist interpretation time and missed cancer diagnoses, leading to cost savings. However, the work-up of radiologist false positives induced by CAD could increase costs. Each of these issues will need to be assessed.
In conclusion, we found that the sensitivity and false-positive rate of computer-aided polyp detection in an asymptomatic screening population were in the range likely to be clinically acceptable and were generalizable to fresh CT virtual colonoscopy data.
We thank William R. Schindler, DO, Naval Medical Center San Diego, San Diego, CA, for providing CT colonography and supporting data; Andrew Dwyer, MD, for critical review of the manuscript; Shawn Albert and Tina R. Scott for database support; Nicholas Petrick, PhD, for helpful discussions; Maruf Haider, MD, and Meghan Miller for additional image analysis; and Sharon Robertson for manuscript preparation. Viatronix supplied the V3D Colon software free of charge. This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, Md. (http://biowulf.nih.gov). This research was supported by the Intramural Research Program of the National Institutes of Health, Warren G. Magnuson Clinical Center.