To examine the effect of time spent viewing images and level of confidence on a screening mammography test set on interpretive performance.
Radiologists from six mammography registries participated in the study and were randomized to interpret one of four test sets and complete 12 survey questions. Each test set had 109 cases of digitized four-view screening film-screen mammograms with prior comparison screening views. Viewing time for each case was defined as the cumulative time spent viewing all mammographic images before recording which visible feature, if any, was the “most significant finding”. Log-linear regression fit via GEE was used to test the effect of viewing time and level of confidence in the interpretation on test set sensitivity and false-positive rate.
119 radiologists completed a test set and contributed data on 11,484 interpretations. Radiologists spent more time viewing cases that had significant findings or for which they had less confidence in interpretation. Each additional minute of viewing time increased the probability of a true positive interpretation among cancer cases by a factor of 1.12 (95% CI: 1.06, 1.19; p<0.001), regardless of confidence in the assessment. Among radiologists who were ‘very confident’ in their assessment, each additional minute of viewing time increased the adjusted risk of a false positive interpretation among non-cancer cases by a factor of 1.42 (95% CI: 1.21, 1.68), and this viewing-time effect diminished with decreasing confidence.
Longer interpretation times and higher levels of confidence in the interpretation are both associated with higher sensitivity and false positive rates in mammography screening.
Little is known about how time spent examining different types of mammographic images affects interpretive accuracy outside of comparisons between digital and screen film mammography or mammography with and without use of computer-aided detection (1-8). In one study, Saunders, et al (1, 2) found that incorrect detection decisions for both cancer and non-cancer cases, and incorrect decisions about needed work-up, were both associated with longer interpretation time for screening mammography. In another study, Nodine, et al (3) found that a high level of confidence was associated with both shorter fixation dwell times (time spent looking at a specific area on a film) and time spent initially scanning the image, and that one second of fixation dwell time (coupled with a high level of confidence) versus time spent initially scanning the image were associated with detection of true positive lesions for experienced radiologists. They also found that prolonging the search beyond the global recognition time yielded few new lesions and increased the risk of error. Kundel, et al (4) found that expert radiologists fixated on a cancer in 1.13 seconds and that proficient radiologists appear to use a fast holistic scanning mode rather than a search-to-find mode. Understanding how time spent interpreting mammography affects performance could assist radiologists in avoiding viewing behaviors unlikely to improve accuracy.
Weaknesses of all the above studies include that very few experienced radiologists were included (between 1 and 6), so adjustment for radiologist characteristics known to affect performance was not addressed, and that none of the studies examined the particular features being examined in either normal or abnormal images. Understanding the relationships among time spent, complexity of the images and interpretive accuracy, while adjusting for possible confounders, could aid in identifying which types of mammographic findings might benefit from a second opinion. In addition, conducting an in-depth assessment of initial radiologists’ time spent and confidence in their assessment could potentially improve interpretive performance and be more effective than double reading all screening mammograms. We conducted a study with radiologists across the U.S. to examine these issues and to specifically test the hypothesis that more versus less time spent interpreting mammograms is associated with more difficult cases and lower accuracy compared to cases interpreted more quickly.
This study was conducted with six mammography registries (Carolina Mammography Registry, New Hampshire Mammography Network, New Mexico Mammography Project, Vermont Breast Cancer Surveillance System, and Group Health Cooperative in western Washington) associated with the National Cancer Institute funded Breast Cancer Surveillance Consortium (BCSC; http://breastscreening.cancer.gov). Data collected as part of this study were pooled at the BCSC Statistical Coordinating Center (SCC) in Seattle, WA for analysis. Each registry and the SCC received IRB approval for either active or passive consenting processes or a waiver of consent to enroll participants, link data, and perform analytic studies. All procedures are Health Insurance Portability and Accountability Act compliant and all registries and the SCC have received a Federal Certificate of Confidentiality (9) and other protection for the identities of women, physicians, and facilities that are subjects of this research. In addition, each registry and the SCC received IRB approval for all test set study activities.
Radiologists who interpreted mammograms at a facility contributing to any of the registries between January 2005 and December 2006 were invited to participate. We also invited 103 non-BCSC radiologists from Oregon; Puget Sound, WA; North Carolina, San Francisco, and New Mexico. A total of 469 radiologists were invited to participate, and 148 (31.6%) consented. Among these, 119 (80.4%) completed all study activities.
We selected test set cases based on cancer prevalence and expert rated difficulty identifying breast cancer to create four screening mammography test sets with 109 cases in each set. This approach was used because the goal of the larger study was to assess how cancer prevalence and type of finding (subtle, intermediate and obvious) as interpreted on a test set would correlate with actual clinical practice. The results of the larger study will be reported elsewhere.
All cases were randomly selected from screening examinations performed on women aged 40 to 69 between 2000 and 2003 from the six participating BCSC mammography registries. Women who had a mastectomy and those with a prior history of breast cancer were excluded. Participating registries contributed between 42 and 84 screening mammography examinations, all of which had a mammogram within the prior 11-30 months. Of these, approximately 26-48% were selected from each site for use in a test set. Examinations with stray marks or other quality issues on the films were excluded. Each screening examination selected consisted of craniocaudal (CC) and mediolateral oblique (MLO) views of each breast (4 views per woman for each of the screening and comparison examinations). For cancer cases, we selected images from exams for which cancer was diagnosed within 12 months following the mammogram. Non-cancer cases came from women who were cancer-free for at least two years following the mammogram. Final cases in the test sets came from 36 women known to have been diagnosed with cancer within one year of imaging, and from 94 women who remained cancer-free for two years following the imaging. Cases from these women were used more than once to configure the test sets appropriately.
The case sampling design was stratified based on clinical interpretations as true positive, true negative, false positive and false negative. Sampled cases were reviewed by an expert panel of radiologists (n=3), which was blinded to the original mammography interpretation and cancer status. The experts also categorized significant findings as mass, calcification, asymmetric density or architectural distortion; and as obvious, intermediate or subtle. Obvious findings were defined as those the expert panel agreed that 100% of community radiologists should identify. Intermediate findings were defined as those the expert panel agreed 25-99% of community radiologists would identify. Subtle findings were defined as those the expert panel indicated <25% of community radiologists would identify. The experts reached consensus on any interpretation for which an initial disagreement occurred.
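The three difficulty categories can be written as a simple threshold rule. The sketch below is illustrative only: the function name and the numeric input are hypothetical, and the panel's judgments were reached by consensus rather than computed from a measured percentage.

```python
def difficulty_category(expected_detection_pct: float) -> str:
    """Map the share of community radiologists expected to identify a
    finding (0-100) to the category labels used by the expert panel.

    Hypothetical helper: thresholds follow the definitions in the text
    (100% obvious, 25-99% intermediate, <25% subtle).
    """
    if not 0 <= expected_detection_pct <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if expected_detection_pct >= 100:
        return "obvious"
    if expected_detection_pct >= 25:
        return "intermediate"
    return "subtle"


for pct in (100, 60, 10):
    print(pct, difficulty_category(pct))
```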
We randomly selected 60 TP examinations and 16 FN examinations, so the experts could identify 14 obvious cancers, 15 intermediate cancers, and 9 subtle cancers for inclusion in the test sets. To include FP examinations, we selected examinations with BI-RADS assessment categories 0, 4 or 5 that were not associated with breast cancer within 24 months of mammography. The remaining examinations were TN examinations of both breasts. The composition of Test Sets 1 and 2 included: 47% obvious, 40% intermediate, 13% subtle, and the composition of Test Sets 3 and 4 included 20% obvious, 50% intermediate, 30% subtle. Test sets 1 and 2 each contained 15 cancer and 94 non-cancer cases. Test sets 3 and 4 both contained 30 cancer and 79 non-cancer cases. After cases were selected, the films were digitized by experts at the American College of Radiology using a Vidar Diagnostic Pro (10) and loaded into specially designed viewing software that allowed us to collect timing and location information associated with interpretations.
We randomized consenting radiologists to one of the four test sets. The test sets were self-administered using custom designed software distributed on a DVD. Participants were block randomized within strata defined by registry/site and whether a radiologist had reviewed at least 30 cancers in the BCSC database to ensure an equal number of radiologists with accurate measures of clinical sensitivity of mammography were reading each test set. This criterion was not used for non-BCSC participants.
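Block randomization within strata can be sketched as follows. This is a minimal illustration, not the study's procedure: the block size of four (one slot per test set), the seed, and the roster format are all assumptions, since the text does not report them.

```python
import random
from collections import Counter, defaultdict

TEST_SETS = [1, 2, 3, 4]


def block_randomize(radiologists, seed=0):
    """Assign radiologists to test sets using permuted blocks within strata.

    `radiologists` is a list of (radiologist_id, stratum) pairs, where a
    stratum could be (site, reviewed_at_least_30_cancers). Block size equals
    the number of test sets, which is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for rad_id, stratum in radiologists:
        by_stratum[stratum].append(rad_id)

    assignment = {}
    for stratum in sorted(by_stratum):
        block = []
        for rad_id in by_stratum[stratum]:
            if not block:  # start a fresh shuffled block of four slots
                block = TEST_SETS[:]
                rng.shuffle(block)
            assignment[rad_id] = block.pop()
    return assignment


# Hypothetical roster: 16 radiologists at one site, half in each stratum.
roster = [(f"R{i:02d}", ("site_A", i % 2 == 0)) for i in range(16)]
assignment = block_randomize(roster)
print(Counter(assignment.values()))
```

Because each stratum in this toy roster fills exactly two complete blocks, every test set receives the same number of radiologists from each stratum; incomplete final blocks would leave small imbalances.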
Each site sent consenting radiologists the DVD along with an instruction sheet informing them of their assigned test. Radiologists used either a home or work computer or laptop provided by the study with a large size screen and high-resolution graphics (≥1280×1024, ≥3GHz, 1GB of RAM, and a video card with 128MB of memory capable of displaying full 32-bit color at the listed resolutions and a DVD reader) to show 2 images at the same time. The monitor specifications were provided to radiologists if they chose to use their home or work computers. The software developed for the study allowed radiologists to: 1) choose whether the images were displayed with right breast facing right or left and left breast facing left or right; 2) rapidly toggle (≤1 second) between the display of paired images so that visual memory is retained from one displayed pair to the next; 3) magnify a portion of the displayed image; and 4) point and click on any important abnormality to record the coordinates of findings to enable capture of whether a radiologist has identified and located the lesion of highest suspicion for cancer.
Participating radiologists were instructed to interpret test sets as they would in clinical practice. Radiologists were informed that the overall cancer rate on test sets was higher than that found in a screened population (11), but they were not informed of the specific prevalence of positive examinations or cancers in the test sets. We used this approach so that all radiologists would interpret test sets with similar knowledge of the underlying prevalence of disease instead of assuming the prevalence to be similar to their own clinical experience.
Prior to evaluating test set digitized mammography films, the software prompted each participant to answer 12 demographic and clinical practice survey questions, including receipt of fellowship training, specialization, number of years spent interpreting mammograms, and the number of mammograms interpreted per week. In addition, we made use of the test set assignment as a radiologist-level characteristic, as it captured the case mixture and difficulty of films.
Our analyses focus on mammographic test set performance. As participants viewed individual cases in the test set, they were prompted to identify the most significant visible breast abnormality, and to decide whether or not the patient should be recalled for additional work-up. The decision to recall constituted a positive test result for our analyses. Recall decisions on mammography exams were modeled conditionally based on the patients’ true cancer status, and other relevant covariates described in the data analysis section, to estimate effect of the time spent viewing films on sensitivity and false-positive probability.
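With recall treated as a positive test, the two performance measures reduce to simple proportions conditional on true cancer status. The sketch below uses made-up toy data and a hypothetical function name; it shows only the unadjusted quantities, not the regression models used in the analysis.

```python
def sensitivity_and_fpr(interpretations):
    """Compute sensitivity and false-positive rate from recall decisions.

    `interpretations` is an iterable of (recalled, has_cancer) booleans.
    Sensitivity = recalled cancers / all cancers.
    False-positive rate = recalled non-cancers / all non-cancers.
    """
    tp = fn = fp = tn = 0
    for recalled, has_cancer in interpretations:
        if has_cancer:
            if recalled:
                tp += 1
            else:
                fn += 1
        else:
            if recalled:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one cancer and one non-cancer case")
    return tp / (tp + fn), fp / (fp + tn)


# Toy data (not study data): 4 cancer cases, 10 non-cancer cases.
toy = ([(True, True)] * 3 + [(False, True)] * 1
       + [(True, False)] * 2 + [(False, False)] * 8)
sens, fpr = sensitivity_and_fpr(toy)
print(sens, fpr)
```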
The test-set software randomly presented the images in a similar manner to digital mammogram interpretation using a single monitor. Each case was presented in a sequence including MLO and CC views of both breasts simultaneously, followed by MLO and CC views of each breast paired with the analogous image from the previous exam to assess whether changes from the prior mammogram were apparent. Figure 1 illustrates an example case with image presentation shown from the Test Set software. Images could be magnified as needed. The software recorded the length of time, measured in seconds, the user spent viewing each individual film, which was defined as the cumulative time spent viewing all mammographic images and identifying which, if any, visible feature was the most significant finding by a mouse click.
Radiologists were encouraged by a pop-up message within the program to examine all available images, including both current and comparison views, before indicating their decisions about any of the individual images to ensure viewing consistency during the study. Because the assessment software did not have a pause feature, any time spent away from the computer during the completion of the test set was added to the cumulative time associated with the most recently started exam. To minimize the impact on our analyses, we assessed viewing time in a controlled setting with seven radiologists viewing 320 cases where interrupted viewing time was not possible and found that >98% of interpreters completed viewing all study images for each case within five minutes, which our expert radiologists concurred with. Examinations for which viewing time exceeded 5 minutes (n=1,443) were excluded from our analyses since they likely represented a mix of uninterrupted and interrupted viewing durations.
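The exclusion rule amounts to a threshold filter on recorded viewing time. A minimal sketch, assuming each interpretation is stored as a dictionary with a hypothetical `viewing_seconds` field:

```python
MAX_VIEWING_SECONDS = 5 * 60  # five-minute cut point described in the text


def split_by_viewing_time(records, max_seconds=MAX_VIEWING_SECONDS):
    """Separate interpretations kept for analysis from those excluded.

    Records whose viewing time exceeds the cut point are excluded because
    they may include time spent away from the computer.
    """
    kept = [r for r in records if r["viewing_seconds"] <= max_seconds]
    excluded = [r for r in records if r["viewing_seconds"] > max_seconds]
    return kept, excluded


# Toy records (not study data).
records = [
    {"case": "A", "viewing_seconds": 45},
    {"case": "B", "viewing_seconds": 300},
    {"case": "C", "viewing_seconds": 1260},
]
kept, excluded = split_by_viewing_time(records)
print([r["case"] for r in kept], [r["case"] for r in excluded])
```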
Breast density was categorized as almost entirely fat, <25%; scattered fibroglandular densities, 25–50%; heterogeneously dense, 51–75%; or extremely dense, >75% (12). Users reported confidence in their assessment on each exam as either not at all confident, not very confident, neutral, confident, or very confident. For our analyses, we combined the responses for ‘Not at all confident’ and ‘Not very confident’ to form a ‘Not Confident’ category.
We recorded the expert-assessed lesion type for each case as one of mass, calcification, asymmetry, or architectural distortion. This variable was classified as missing when the expert consensus indicated there was no significant finding.
All but one radiologist, who reviewed only 104 cases, reviewed all 109, resulting in 12,966 interpretations. Of these, 1,443 (11%) observations were excluded because their duration exceeded five minutes, and an additional 39 (0.3%) were excluded due to errors in time recording caused by computer problems. In all, 11,484 (89%) interpretations were suitable for analyses.
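The counts in this paragraph can be checked with a few lines of arithmetic; all figures below are taken from the text.

```python
radiologists = 119
cases_per_set = 109
cases_not_reviewed = 109 - 104  # one radiologist reviewed only 104 cases

total = radiologists * cases_per_set - cases_not_reviewed
excluded_over_five_minutes = 1443
excluded_timing_errors = 39
analyzed = total - excluded_over_five_minutes - excluded_timing_errors

print(total)                                               # 12966
print(analyzed)                                            # 11484
print(round(100 * excluded_over_five_minutes / total))     # 11
print(round(100 * excluded_timing_errors / total, 1))      # 0.3
print(round(100 * analyzed / total))                       # 89
```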
We calculated frequency distributions for responses to the 12 demographic questions by test set assignment. To address the primary scientific question of the effect of viewing time on radiologists’ test set performance, we modeled the relative risk of a positive assessment (recall) using log-linear regression. To account for the correlation within both radiologists and exams, we implemented an extension of GEE developed specifically for analysis of non-nested multilevel data (13, 14).
We regressed a binary indicator of recall on viewing time and examination-level, patient-level, and radiologist-level covariates using a log-link function to estimate relative risks. Our models made a Poisson variance assumption, which yields valid variance estimates for relative risk in analyses of common binary outcomes (15), and applied the robust Huber-White sandwich variance estimator. We separately modeled sensitivity and false-positive rates. For both cancer and non-cancer exams, we modeled the probability of recall as a function of viewing time, with and without adjustment for: radiologist’s confidence on the exam; the radiologist’s assessment of breast density; the expert-identified lesion type; and the radiologist’s fellowship category, specialization, years interpreting mammograms, number of mammograms read per week, and random test set assignment. We expressed viewing time in the model using a single linear term, and expressed each categorical covariate with an appropriate group of indicator variables.
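The core idea, a log-link model with a Poisson working variance and a robust sandwich variance, can be sketched in plain Python for a single covariate. This is a simplified illustration, not the analysis performed in the study: it treats observations as independent, whereas the study used a GEE extension for correlation within radiologists and exams, and the function name and toy data are hypothetical. A binary covariate is used so the result can be checked by hand; the study's covariate of interest was viewing time in minutes.

```python
import math


def fit_modified_poisson(x, y, max_iter=50, tol=1e-10):
    """Fit log E[y] = b0 + b1*x by Newton's method on the Poisson score.

    Returns (relative_risk, robust_se_of_b1, (ci_low, ci_high)), where the
    standard error comes from the Huber-White sandwich A^-1 B A^-1.
    """
    def moments(b0, b1):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        a00 = sum(mu)
        a01 = sum(m * xi for m, xi in zip(mu, x))
        a11 = sum(m * xi * xi for m, xi in zip(mu, x))
        return mu, a00, a01, a11

    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(max_iter):
        mu, a00, a01, a11 = moments(b0, b1)
        u0 = sum(yi - m for yi, m in zip(y, mu))            # score for b0
        u1 = sum((yi - m) * xi for yi, m, xi in zip(y, mu, x))  # score for b1
        det = a00 * a11 - a01 * a01
        d0 = (a11 * u0 - a01 * u1) / det
        d1 = (a00 * u1 - a01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break

    # Sandwich variance for b1: second row of A^-1 applied to B.
    mu, a00, a01, a11 = moments(b0, b1)
    det = a00 * a11 - a01 * a01
    v0, v1 = -a01 / det, a00 / det
    r2 = [(yi - m) ** 2 for yi, m in zip(y, mu)]
    b00 = sum(r2)
    b01 = sum(r * xi for r, xi in zip(r2, x))
    b11 = sum(r * xi * xi for r, xi in zip(r2, x))
    se = math.sqrt(v0 * v0 * b00 + 2 * v0 * v1 * b01 + v1 * v1 * b11)
    ci = (math.exp(b1 - 1.96 * se), math.exp(b1 + 1.96 * se))
    return math.exp(b1), se, ci


# Toy data (not study data): recall in 2 of 10 unexposed, 5 of 10 exposed.
x = [0] * 10 + [1] * 10
y = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5
rr, se, ci = fit_modified_poisson(x, y)
print(rr, se, ci)
```

With a binary covariate the fitted relative risk equals the ratio of the two observed proportions (0.5 / 0.2), and the robust variance of the log relative risk matches the usual closed form (1 - p1)/(n1 p1) + (1 - p0)/(n0 p0), which gives a quick sanity check on the estimator.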
We hypothesized that the effect of viewing time on the probability of recall may differ according to the radiologist’s confidence interpreting the exam and assessed whether confidence modified the effect of viewing time on the risk of recall by including interaction terms between viewing time and each indicator associated with confidence. In analyses where interactions were statistically significant, we calculated confidence-level specific estimates of the relative risk of recall associated with a one-minute increase in viewing time. Where the interaction was not significant, we omitted it from the regression models, and estimated a time effect across levels of confidence. All analyses were conducted using the R statistical software, version 2.10.0 (16, 17).
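One way to write the interaction model described above is shown below. The notation is ours, not the authors': $Y_{ij}$ is recall by radiologist $i$ on exam $j$, $t_{ij}$ is viewing time in minutes, $C_{ij}$ is the reported confidence level with an assumed reference level $c_0$, and $\mathbf{x}_{ij}$ collects the remaining covariates.

```latex
\log \Pr(Y_{ij} = 1)
  = \beta_0 + \beta_T\, t_{ij}
  + \sum_{c \neq c_0} \alpha_c\, \mathbb{1}[C_{ij} = c]
  + \sum_{c \neq c_0} \gamma_c\, t_{ij}\, \mathbb{1}[C_{ij} = c]
  + \mathbf{x}_{ij}^{\top} \boldsymbol{\delta}

\mathrm{RR}_c = \exp(\beta_T + \gamma_c), \qquad \gamma_{c_0} = 0
```

Under this parameterization, $\mathrm{RR}_c$ is the relative risk of recall per additional minute of viewing time at confidence level $c$. When the interaction terms are dropped, every $\gamma_c$ is zero and a single time effect $\exp(\beta_T)$ applies across confidence levels.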
Seventy-six of 119 participating radiologists (64%) reported interpreting mammograms for more than 10 years, and 86 (72%) reported reading at least 50 mammograms per week (Table 1). Fifteen radiologists (13%) reported completion or plans to complete a fellowship in breast or women’s imaging.
Figure 2 presents the mean and inter-quartile range (IQR) for viewing time for the 2,291 exams of images belonging to 36 women known to have been diagnosed with cancer within one year of imaging, and for 9,193 exams of images from 94 women who remained cancer-free for two years following the imaging. Results are shown by cancer status and the confidence level each radiologist selected following the exam, with horizontal reference lines indicating the 25th, 50th, and 75th percentiles of the viewing time distribution among cancer and non-cancer exams. Median times associated with exams resulting in a positive assessment, shown with a solid vertical bar, were higher than negative exams, shown with a dotted bar, for cancers and non-cancers and across all levels of confidence. Among both cancer and non-cancer exams, median viewing times were shorter for exams on which radiologists endorsed greater confidence, and longer for exams on which readers were less confident.
Table 2 illustrates the median viewing times and IQRs for all examinations in groups defined by expert-identified finding type and participant-rated BI-RADS (12) breast density. Median viewing times for exams containing any expert findings were longer than for those with none, with the exception of those containing a mass. Fatty and extremely dense breasts have similar properties in terms of time spent viewing. Both have shorter viewing times than scattered and heterogeneously dense for cases with no findings and most cases with findings. Among exams with expert findings, those containing masses corresponded to the shortest median viewing times, and calcifications to the longest, except in cases where breasts were almost entirely fatty for which asymmetries had the longest median viewing time.
Among cases with cancer, each additional minute of viewing time increased the adjusted probability of a true positive assessment by a factor of 1.12 (95% CI: 1.06, 1.19), p<0.001 (Table 3). This effect did not significantly differ by radiologist confidence in assessment before (p=0.88) or after (p=0.73) adjustment for expert-identified lesion type, reader-rated breast density, fellowship category, specialization, years interpreting mammograms, mammograms read per week, and random test set assignment. Confidence in the assessment was significantly associated with a true positive assessment both before (p=0.003) and after (p<0.001) covariate adjustment. Radiologists who reported being very confident in their assessment were 1.32 (95% CI: 1.16, 1.50) times more likely to have correctly recalled the patient than those reporting neutral confidence.
The relative risk of false positive assessments, estimated from examinations of women who remained free of breast cancer for one year after imaging, illustrated that the relationship between viewing time and false positive probability differed significantly by radiologist confidence (Table 3), both before (p=0.016) and after (p=0.039) adjustment for the factors mentioned above. The unadjusted association between confidence and the risk of a false positive was statistically significant (p<0.001). False positive exams were about half (RR=0.55, 95% CI: 0.36, 0.86) as likely among very confident examiners as among those reporting neutral confidence. Those who reported being not very confident or not at all confident were 1.41 (95%CI: 1.17, 1.69) times more likely to recall the patient.
The effect of viewing time on the probability of a false positive assessment was significant across all confidence levels, both before and after covariate adjustment. For those who were ‘Very confident’ in their assessment, each additional minute of viewing time increased the adjusted risk of a false positive by a factor of 1.42 (95% CI: 1.21, 1.68). Relative risk estimates diminished monotonically according to falling confidence. For those reporting ‘Confident’ assessments, the risk increased by a factor of 1.40 (95% CI: 1.29, 1.52), for ‘Neutral’ assessments by a factor of 1.38 (95% CI: 1.29, 1.48), and for those who were either ‘Not very confident’ or ‘Not at all confident’ in their assessment, by a factor of 1.20 (95% CI: 1.10, 1.31).
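Because viewing time enters the model as a single linear term on the log scale, the per-minute relative risks compound multiplicatively. The sketch below applies that implication to the adjusted estimates reported above; the function name is hypothetical, and extrapolation is only meaningful within the five-minute range retained for analysis.

```python
def rr_over_minutes(rr_per_minute, minutes):
    """Relative risk implied for `minutes` additional minutes of viewing,
    assuming a log-linear effect: RR(k) = RR(1) ** k."""
    return rr_per_minute ** minutes


# Adjusted per-minute relative risks of a false positive, from the text.
ADJUSTED_FP_RR = {
    "very confident": 1.42,
    "confident": 1.40,
    "neutral": 1.38,
    "not confident": 1.20,
}

for level, rr_1min in ADJUSTED_FP_RR.items():
    print(level, round(rr_over_minutes(rr_1min, 2), 2))
```

For example, two additional minutes of viewing correspond to roughly a doubling of false-positive risk among ‘very confident’ assessments, but to a smaller increase (about 1.44-fold) among assessments made with low confidence.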
This study is the largest conducted to date, with 119 radiologists and 11,484 interpretations analyzed to examine the relationships among viewing time, type of finding, confidence and accuracy when interpreting mammography. We found that radiologists spent more time viewing mammographic findings that they ultimately recalled than those they did not recall, and that higher confidence was usually associated with shorter viewing times. Among cancer cases, increasing viewing time increased the probability of a true positive interpretation. We also found that among non-cancer exams for which radiologists felt ‘very confident’ in their assessment, each additional minute of viewing time increased the adjusted risk of a false positive interpretation by 42% and that this effect diminished according to decreasing confidence.
These findings illustrate the complex relationship between view time and confidence. While increased viewing time resulted in a small increase in sensitivity, it decreased specificity to a much larger extent. Thus, radiologists may not benefit from spending more time on an interpretation they are not confident in, but may benefit from asking a colleague for a second opinion, which may assist less experienced radiologists in gaining knowledge about and confidence in their interpretations. Addressing confidence in an educational setting may be challenging as most continuing medical education is designed to address knowledge deficits that may or may not exist, and changes in knowledge often do not translate to improvements in skill (18). Interestingly, our findings do not appear to be related to fellowship training, years of experience, or specialization in breast imaging. One might hypothesize that such educational experiences should shorten the time needed to interpret a mammogram, but little exists in the literature on this important topic. Double reading is used extensively outside the U.S. (19, 20), which appears to reduce recall rates without affecting cancer detection. Double reading is not done routinely in the U.S., principally because radiologists are not reimbursed for it; though, double reading on cases that take a long time to interpret might improve specificity.
Our study differs from those of Kundel, Nodine et al and others (2, 3, 21-23), who have studied eye position and fixation dwell times using a computer eye-head tracking system. While we also used an interactive computer system that included time assessments, we did not specifically measure eye movements. Like these investigators, we found that longer time spent on the interpretation yielded lower specificity though the 2002 study only included nine radiologists and six of them were trainees (2). Another difference is that these investigators did not include specific measures of confidence as we did in our study.
In another study, Castella et al (24) studied the influence of signal variability on human and model observers for detection tasks using simulated masses superimposed on both real patient mammographic backgrounds and synthesized mammographic backgrounds with clustered lumpy backgrounds. They found that human observers’ performance did not vary when benign masses were superimposed on real images or on the synthesized background. Uncertainty and variability in signal shape did not significantly affect human performance, though variability in signal size did. Our findings differ in that we found that level of confidence in the interpretation, a concept reflective of uncertainty, influenced interpretive accuracy.
Our study shows that interpretation using a test set methodology involves variation in time spent, level of confidence and accuracy. More time spent and lower confidence appear to result in much higher recall rates with many more false positive exams and only a small increase in sensitivity. Interventions that recognize this issue could reduce false positive cases without altering sensitivity. Options such as selective requests for a second opinion should be tested. These might involve academic detailing (university-based educational outreach involving face-to-face education by trained health care professionals) (25-27), which has shown improved performance in physicians’ use of pharmacologic agents.
The strengths of our study include the large number of participating radiologists from around the U.S. and the large number of images our analysis was based upon. Another strength is that we designed the test set software so that it was similar to interpreting digital mammography, which is now being used in greater than 70% of mammography facilities across the U.S., so it is similar to clinical practice (28). Limitations include that physicians could not be kept blind to the fact that their interpretations of the test set were being studied, so their interpretations could have been influenced by the Hawthorne effect (knowledge that they were participating in a study could have affected their interpretive behavior) (29). In addition, though we used the American College of Radiology Guidelines for computer monitor display capabilities for diagnostic radiology (30), image quality in this field is advancing rapidly and monitors did improve over the study period. Similarly, radiologists’ use of two monitors for interpretive viewing increased over this time period, which may have resulted in variability of study findings. We also used a controlled setting with seven radiologists to identify a cut point for maximum viewing time and found that >98% viewed the cases within this time period. Only two percent took longer than five minutes in our controlled setting. Though this time might be longer when radiologists interpreted in their home or work settings, we do not think it will have affected our findings to any significant degree.
In conclusion, longer interpretation times and higher levels of confidence appear to be independently associated with a small increase in sensitivity to detect cancers in screening mammography; however, longer interpretation times also appear to be associated with a much greater risk of false positives, and this association increases in magnitude with higher levels of confidence.
This work was supported by the American Cancer Society, made possible by a generous donation from the Longaberger Company’s Horizon of Hope® Campaign (SIRSG-07-271, SIRSG-07-272, SIRSG-07-273, SIRSG-07-274-01, SIRSG-07-275, SIRGS-06-281, SIRSG-09-270-01, SIRSG-09-271-01, SIRSG-06-290-04), the Breast Cancer Stamp Fund, and the National Cancer Institute Breast Cancer Surveillance Consortium (U01CA63740, U01CA86076, U01CA86082, U01CA70013, U01CA69976, U01CA63731, U01CA70040). The collection of cancer data used in this study was supported in part by several state public health departments and cancer registries throughout the U.S. For a full description of these sources, please see: http://www.breastscreening.cancer.gov/work/acknowledgement.html. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health. We thank the participating women, mammography facilities and radiologists for the data they have provided for this study. A list of the BCSC investigators and procedures for requesting BCSC data for research purposes are provided at: http://breastscreening.cancer.gov/.