Previous studies have shown that the agreement among radiologists interpreting a test set of mammograms is relatively low. However, data available from real-world settings are sparse. We studied mammographic examination interpretations by radiologists practicing in a community setting and evaluated whether the variability in false-positive rates could be explained by patient, radiologist, and/or testing characteristics.
We used medical records on randomly selected women aged 40–69 years who had had at least one screening mammographic examination in a community setting between January 1, 1985, and June 30, 1993. Twenty-four radiologists interpreted 8734 screening mammograms from 2169 women. Hierarchical logistic regression models were used to examine the impact of patient, radiologist, and testing characteristics. All statistical tests were two-sided.
Radiologists varied widely in mammographic examination interpretations, with a mass noted in 0%–7.9%, calcification in 0%–21.3%, and fibrocystic changes in 1.6%–27.8% of mammograms read. False-positive rates ranged from 2.6% to 15.9%. Younger and more recently trained radiologists had higher false-positive rates. Adjustment for patient, radiologist, and testing characteristics narrowed the range of false-positive rates to 3.5%–7.9%. If a woman went to two randomly selected radiologists, her odds, after adjustment, of having a false-positive reading would be 1.5 times greater for the radiologist at higher risk of a false-positive reading, compared with the radiologist at lower risk (95% highest posterior density interval [similar to a confidence interval] = 1.17 to 2.08).
Community radiologists varied widely in their false-positive rates in screening mammograms; this variability range was reduced by half, but not eliminated, after statistical adjustment for patient, radiologist, and testing characteristics. These characteristics need to be considered when evaluating false-positive rates in community mammographic examination screening.
Despite many recent improvements in mammography (1), the ultimate interpretation still depends on individual physicians. The level of agreement among radiologists interpreting the same test set of mammograms is relatively low (2–6), which may delay the detection of breast cancer (7). However, recent data have shown that mammography test sets may not adequately represent actual clinical practice in a community setting (8). Few studies of variability have been done in the community setting. One study (9) found variability among radiologists' recommendations for biopsy, with radiologists in academic settings having a higher positive predictive value in their recommendations to undergo biopsy compared with community radiologists. This community-based study, however, did not control for possible differences in the patient populations or for differences among radiologists other than their affiliation with an academic institution.
In our previous work (10), we estimated that a woman's cumulative risk of experiencing at least one false-positive interpretation after 10 mammograms was approximately 50%. Several variables predicted the likelihood of a woman having a false-positive result (11). Risk ratios for a false-positive screening result increased with younger age of the woman, family history of breast cancer, use of hormone replacement therapy (HRT), time between screenings, no comparison with previous mammograms, and the radiologists' tendency to call mammograms abnormal. The single largest predictor, noted in our earlier work (11), was the radiologist's individual tendency to call mammograms abnormal.
The present study was designed to explore in more detail the extent of variability among radiologists in a community setting. Our goals were to 1) describe variability among radiologists in their specific observations, interpretations, and false-positive rates in screening mammograms; 2) evaluate the impact on variability of additional individual characteristics of the patients and the radiologists (i.e., sex, age, experience) and of additional testing characteristics (i.e., year of the mammogram, health maintenance organization [HMO] versus community facility); and 3) determine if the variability noted among radiologists would be reduced after adjusting for differences in patients, radiologists, and testing characteristics.
This retrospective cohort study was conducted on women enrolled in Harvard Pilgrim Health Care, a large HMO in New England. The study design has been previously reported and is described here in brief (10,11). The HMO has encouraged women aged 40 years and older to undergo routine breast cancer screening at both HMO and local community radiology centers. Radiologists interpreting mammograms were board-certified and worked in professional associations that contracted with the HMO.
Female members of the HMO between 40 and 69 years of age on July 1, 1983, were potentially eligible for the study (n = 14 382). Women were excluded for the following reasons: a lapse in enrollment in the HMO between July 1, 1983, and June 30, 1995 (n = 8816); health coverage from a source other than Harvard Pilgrim Health Care or from a noncomputerized HMO center during the study period (n = 1093); and a history of breast cancer, prophylactic mastectomy or breast implants before July 1, 1983 (n = 146), or a prophylactic mastectomy or breast implants during the study period (n = 8). From the cohort of 4319 eligible women, a random sample was chosen consisting of 1200 women 40–49 years of age, 600 women 50–59 years of age, and 600 women 60–69 years of age, for a total eligible sample of 2400 women.
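The cohort arithmetic above can be checked with a few lines of Python. This is only a bookkeeping sketch: the counts are copied from the text, while the dictionary labels and variable names are shorthand introduced here for illustration.

```python
# Counts copied from the text; labels and variable names are illustrative shorthand.
potentially_eligible = 14382

exclusions = {
    "lapse in HMO enrollment": 8816,
    "other coverage or noncomputerized HMO center": 1093,
    "breast cancer, mastectomy, or implants before July 1, 1983": 146,
    "mastectomy or implants during the study period": 8,
}

# Eligible cohort = potentially eligible women minus all exclusions.
eligible = potentially_eligible - sum(exclusions.values())

# Stratified random sample by age group.
sample_by_age = {"40-49": 1200, "50-59": 600, "60-69": 600}
total_sample = sum(sample_by_age.values())

print(eligible, total_sample)
```

The result matches the 4319 eligible women and the sample of 2400 women reported above.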
We excluded the data on 302 mammograms done prior to 1985, because time since the previous mammographic examination could not be calculated for most mammograms obtained in the first 18 months of the study. We note that the false-positive rate for this subset was lower than that for the remainder of the mammograms (2.0% versus 6.4%, respectively). The final study period for abstraction of screening visit data was therefore 8.5 years (January 1, 1985, to June 30, 1993) with a 2-year follow-up period for assessment of breast cancer outcomes (July 1, 1993, to June 30, 1995).
This study was approved by the Human Studies Committee of Harvard Pilgrim Health Care and the University of Washington School of Medicine.
Harvard Pilgrim Health Care uses computerized records for ambulatory care services (12,13). Data on demographic characteristics, breast cancer risk factors, screening mammograms, and breast cancer outcome were extracted from these records onto standardized forms. The diagnostic interpretations for mammography were classified as normal, abnormal–probably benign, abnormal–indeterminate, or abnormal–suspicious for cancer. The radiologists' recommendations for additional testing, including physical examination by the primary care provider or surgeon, diagnostic mammography within the subsequent 12 months, ultrasound examination, and biopsy were recorded.
Information on the radiologists was obtained from the Massachusetts State Medical Registry and from HMO administrative files. Data included sex, year of birth, and year of graduation from medical school. Mammography testing characteristics included the year of the mammographic examination (1985 through 1987, 1988 through 1990, 1991 through 1993), prior mammographic examination available for comparison (yes versus no/unknown), facility type where more than 50% of radiologists' clinical time occurred (HMO versus community), and time since previous mammogram. Time since previous mammographic examination was defined as ≤18 months, >18 months, or unknown/no previous mammograms. Menopausal status, if unknown, was estimated based on the median age for the cohort with known status. If no family history of breast cancer was noted, it was assumed that there was none.
Screening mammograms were defined as those performed on asymptomatic women without previously noted abnormalities. Mammograms performed because of abnormalities noted by clinicians or patients or noted on previous mammograms were classified as diagnostic exams. Measures of accuracy were defined in a manner consistent with current recommendations regarding mammography audits (14–16) and with those used by others (17–20). A mammographic examination was classified as positive if the results were indeterminate or suspicious for cancer, or if there was a recommendation for nonroutine follow-up, including physical examination, diagnostic mammographic examination within the next 12 months, ultrasound, or biopsy. A positive test was classified as true-positive if breast cancer (invasive or ductal carcinoma in situ) was diagnosed in the patient on the basis of pathologic findings within 1 year of the test and as false-positive otherwise.
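The classification rule above can be written out as a short Python sketch. The field names, category strings, and the `Screen` record are hypothetical and are not taken from the study's abstraction forms. Labeling negative screens as true-negative or false-negative by the same 1-year cancer criterion is also an assumption made here, since the text defines only the positive categories.

```python
from dataclasses import dataclass

# Follow-up recommendations that count as nonroutine (wording is illustrative).
NONROUTINE_FOLLOW_UP = frozenset({
    "physical examination",
    "diagnostic mammography within 12 months",
    "ultrasound",
    "biopsy",
})

@dataclass(frozen=True)
class Screen:
    interpretation: str            # "normal", "probably benign", "indeterminate", "suspicious"
    recommendations: frozenset     # follow-up recommendations recorded for this screen
    cancer_within_1_year: bool     # pathologic diagnosis within 1 year of the test

def is_positive(screen: Screen) -> bool:
    """Positive if indeterminate or suspicious, or if any nonroutine follow-up was recommended."""
    return (screen.interpretation in {"indeterminate", "suspicious"}
            or bool(screen.recommendations & NONROUTINE_FOLLOW_UP))

def classify(screen: Screen) -> str:
    if is_positive(screen):
        return "true-positive" if screen.cancer_within_1_year else "false-positive"
    # Assumption: negatives are split by the same 1-year cancer criterion.
    return "false-negative" if screen.cancer_within_1_year else "true-negative"

def false_positive_rate(labels: list) -> float:
    """False positives as a share of false positives plus true negatives."""
    fp = labels.count("false-positive")
    tn = labels.count("true-negative")
    if fp + tn == 0:
        raise ValueError("no false-positive or true-negative screens")
    return fp / (fp + tn)
```

Note that under this rule a mammogram read as normal still counts as positive if, for example, an ultrasound examination was recommended.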
A total of 93 radiologists interpreted screening mammograms for the women included in this study. The number of mammograms interpreted by each radiologist ranged from 1 to 2036. Estimates of accuracy by individual radiologists may be unreliable for radiologists reading a small number of study films. Therefore, only the 24 radiologists who each read more than 50 screening mammograms were included in this analysis. These 24 radiologists interpreted screening mammograms on 2169 of the 2400 eligible women in the initial cohort; 45 of the 2169 women were subsequently diagnosed with breast cancer. Because estimates of sensitivity and true-positive rates may be unreliable, given the relatively low number of breast cancer cases in this study, only the false-positive rates were determined for each radiologist. Among the 24 radiologists, the percentage of mammograms with specific observations, diagnostic interpretations, and recommendations were noted, and results were presented for the median and range.
We estimated the effects of patient, radiologist, and testing characteristics on false-positive rates by using hierarchical logistic regression. The outcome of interest was the probability of a false-positive reading (versus a true-negative reading). Hierarchical logistic regression is similar to standard logistic regression except that we included woman-specific and radiologist-specific effects to account for the correlation between multiple readings within the same woman and by the same radiologist. We fit separate hierarchical logistic regression models for each variable of interest and for two multivariable models. The first multivariable model adjusted for patient and testing characteristics only; the second full model additionally adjusted for radiologist characteristics.
Through the inclusion of radiologist-specific effects, the models estimated adjusted false-positive rates for each radiologist by taking a weighted average of the radiologist's adjusted rate (given the covariates) and the overall mean rate. The weight given to the radiologist's rate depends on the number of mammograms read by that radiologist—more film readings resulted in more weight being given to the radiologist's rate and less weight being given to the overall average. In this way, the model indirectly adjusted for the number of films read by each radiologist. The false-positive rates are adjusted for the covariates included in the model. For example, if a radiologist tended to see many women with risk factors associated with higher false-positive rates, then that radiologist's adjusted false-positive rate would be lower than his or her observed false-positive rate.
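The weighting idea can be illustrated with a deliberately simplified Python sketch. It is not the study's model, which obtains the weighting through radiologist-specific effects in a hierarchical logistic regression. Here the weight is assumed to take the form n / (n + k) on the probability scale, and the constant `k` and all input values are hypothetical.

```python
def shrunken_rate(radiologist_rate: float, n_read: int,
                  overall_rate: float, k: float = 200.0) -> float:
    """Weighted average of a radiologist's own rate and the overall mean rate.

    The weight n_read / (n_read + k) is an illustrative assumption: it grows
    toward 1 as the radiologist reads more films, so low-volume readers are
    pulled more strongly toward the overall mean.
    """
    weight = n_read / (n_read + k)
    return weight * radiologist_rate + (1.0 - weight) * overall_rate

# Hypothetical inputs: the same observed rate, read on few versus many films.
low_volume = shrunken_rate(radiologist_rate=0.159, n_read=59, overall_rate=0.064)
high_volume = shrunken_rate(radiologist_rate=0.159, n_read=1990, overall_rate=0.064)
print(round(low_volume, 3), round(high_volume, 3))
```

In this toy setting the low-volume estimate moves most of the way toward the overall mean, whereas the high-volume estimate stays close to the radiologist's own rate.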
Hierarchical models gave direct estimates of subject-specific (conditional) means and odds ratios (ORs), which measured the expected value for an individual woman; however, in this study, we were interested in population (marginal) averages, which estimated the average across a population of women. [For a discussion on the differences between subject-specific and population averages, see (21–23)]. Therefore, we estimated the population average false-positive rates and ORs from the conditional estimates by using Monte Carlo integration. The radiologist-specific effects may also be used to examine the heterogeneity between false-positive rates among radiologists. To quantify this heterogeneity, we calculated the average OR between any two radiologists, comparing the one having a higher false-positive rate with the one having a lower false-positive rate (24).
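The heterogeneity measure can be sketched in Python as a mean over all radiologist pairs. This is a plain calculation on a list of rates, offered as an illustration of the idea. The study derived its estimates from the posterior distribution of the hierarchical model, which this sketch does not attempt to reproduce. The function names are introduced here for illustration.

```python
from itertools import combinations

def odds(p: float) -> float:
    """Convert a probability strictly between 0 and 1 into odds."""
    return p / (1.0 - p)

def mean_pairwise_or(rates: list) -> float:
    """Mean odds ratio over all pairs, always dividing the higher rate's odds by the lower's."""
    ratios = []
    for a, b in combinations(rates, 2):
        higher, lower = max(a, b), min(a, b)
        ratios.append(odds(higher) / odds(lower))
    return sum(ratios) / len(ratios)

# Hypothetical false-positive rates for three radiologists.
print(mean_pairwise_or([0.05, 0.10, 0.08]))
```

Because the higher rate is always placed in the numerator, the measure is at least 1, and it equals 1 only when all radiologists have the same rate.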
The hierarchical logistic regression models were fit using Bayesian Inference Using Gibbs Sampling (BUGS) (25). The regression coefficients were taken to be Gaussian with zero mean and precision of 1 × 10⁻⁶. The population variances were taken to be gamma (0.01, 0.01). Following procedures that are commonly used with Gibbs sampling, we ran the single-variable logistic regressions for 25 000 iterations, discarded the first 5000 iterations, and kept every 20th iteration of the remaining 20 000, for a total of 1000 samples from the posterior distribution. The full models were run for 100 000 iterations with 10 000 burn-in iterations, thinned by 90 iterations. Convergence of the Gibbs samplers was assessed by examining the trace plots. For the statistics of interest, we report the posterior mode of the population averages and the 95% highest posterior density intervals (95% HPD), which are similar to classical 95% confidence intervals (CIs).
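The burn-in and thinning arithmetic can be verified with a small Python helper. The function name is hypothetical, and the sketch assumes that thinning keeps every `thin`-th draw after the burn-in draws are discarded.

```python
def retained_draws(total_iterations: int, burn_in: int, thin: int) -> int:
    """Number of posterior draws kept after discarding burn-in and thinning.

    Assumes thinning keeps every `thin`-th draw of the post-burn-in iterations.
    """
    return len(range(burn_in, total_iterations, thin))

single_variable = retained_draws(total_iterations=25_000, burn_in=5_000, thin=20)
full_model = retained_draws(total_iterations=100_000, burn_in=10_000, thin=90)
print(single_variable, full_model)
```

The single-variable settings give the 1000 retained samples stated above. Under the same convention the full-model settings also come to 1000, although that total is not stated explicitly in the text.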
We included the following patient variables in the analyses: patients' age at the time of the mammogram, menopausal status, HRT use (current, previous, or never), family history of breast cancer (yes versus no/unknown), history of breast aspirate or biopsy (none since the start of study versus one or more), body mass index (BMI) at the time of the mammographic examination (BMI ≤25 kg/m² versus >25 kg/m²) and race (white, black, other, or unknown). Radiologists' characteristics included age of the radiologist, the number of years since graduation from medical school, and sex (male versus female). Testing characteristics included the year of the mammographic examination in three categories for parsimony in the full model (1985 through 1987, 1988 through 1990, and 1991 through 1993), whether the radiologist indicated that a prior mammographic examination was available for comparison (no/unknown versus yes), time since previous mammographic examination (≤18 months, >18 months, or unknown/no previous mammograms), and facility type (HMO versus community).
Race was not included in the full models because there were 747 women with unknown race, and using a missing category in multiple regression can bias results (26,27); however, there were no differences in the false-positive rate by race in the unadjusted model. The 149 women with missing BMI were excluded from the full models.
Over the 8.5-year study period, the 24 radiologists interpreted 8734 screening mammograms obtained on 2169 women. The median number of mammograms per woman was 4 (range = 1–9). Most of the women (78.9%) were white; 10.0% were black, 2.5% were of other races, and 8.6% were of unknown race. A family history of breast cancer was recorded for 19.7% of the women. Current HRT use was reported at some time during the study period for 12.1% of the women, and previous use was reported for 7.5%. Forty-eight percent of the women were overweight or obese (BMI >25 kg/m²).
Breast cancer was diagnosed in 45 women during the study period: local disease was present in 38 women and regional disease was present in seven women. The mean age at diagnosis of the women with breast cancer was 60 years (range = 45–76). Ductal carcinoma in situ was diagnosed in seven women. In 35 women, breast cancer was diagnosed as a result of an abnormality first noted on a screening mammogram.
The 24 radiologists worked at nine different radiology facilities, consisting of two community and seven HMO sites. The median number of mammograms interpreted per radiologist was 111 (range = 59–1990). Four radiologists each interpreted more than 1000 mammograms; one radiologist interpreted 620 mammograms and the others interpreted between 59 and 292 mammograms. The median age of the 24 radiologists at the time of interpreting their first screening mammographic examination in the cohort was 48 years (range = 31–70). Four radiologists were women. The mean number of years between graduating from medical school and interpreting their first mammographic examination for the members of the study cohort was 23 years (range = 5–46).
The observations, diagnostic interpretations, and specific recommendations for management made by the radiologists are shown in Table 1. A mass was reported by the 24 radiologists in a median of 2.3% of films interpreted, with the range being from 0% of cases interpreted by one radiologist to 7.9% of cases interpreted by another radiologist. There were wide ranges among radiologists in their notation of the presence of calcifications, fibrocystic changes, and other abnormalities. For example, one of the 24 radiologists did not observe calcifications in any film, whereas another radiologist noted the presence of calcifications in 21.3% of the films read. A wide range was also noted in the diagnostic interpretation categories of normal (range = 55.1%–83.6%) and abnormal benign (range = 6.0%–39.3%), although there was much less variability in the abnormal category suggestive of cancer (range = 0.5%–2.7%). The largest variability in recommendations was in suggesting that additional mammographic views be ordered (1.1% for one radiologist to 11.0% for another radiologist).
The observed false-positive rates of the 24 radiologists ranged from 2.6% (95% CI = 0.3% to 9.0%) to 15.9% (95% CI = 8.7% to 25.6%) and are shown graphically in Fig. 1. While the 95% CIs for the two extreme radiologists do overlap, the 95% CIs for false-positive rates from other radiologists who read more films do not overlap. For example, for the radiologist with a false-positive rate of 2.7%, the 95% CI was 1.2% to 5.3% and for the radiologist with a false-positive rate of 15.9% the 95% CI was 8.7% to 25.6%.
Table 2 shows the association of patient, radiologist, and testing characteristics with false-positive interpretations for the unadjusted and full hierarchical logistic regression models. Those women who were younger, premenopausal, using HRT at the time of the mammogram, had a positive family history of breast cancer, or had a history of previous biopsy were more likely to have a false-positive screening test result. Films interpreted by younger radiologists and by radiologists who graduated from medical school within the past 15 years were also more likely to have a false-positive result.
A secular trend was noted, with women who had mammograms in the 1990s being more likely to have a false-positive result than women who had mammograms in the 1980s. Screening mammograms for which radiologists noted a prior film available for comparison had a false-positive rate of 5.4% compared with a 9.0% false-positive rate for screens without prior films. Women who had mammograms within 18 months of previous mammograms were less likely to have a false-positive result compared with those waiting longer between screens or not having any prior screens.
Fig. 2 shows the observed and adjusted false-positive rates for the 24 radiologists in the study. The mean unadjusted OR for all possible pairwise comparisons among radiologists (comparing the radiologist at higher risk of a false-positive reading with the radiologist at lower risk) was 2.05 (Fig. 2, line A). In other words, if a woman went to two randomly picked radiologists, her odds of having a false-positive reading would be 2.05 times greater on average for the radiologist at higher risk compared with the radiologist at lower risk. Some of this variability is due to the small number of films read by certain radiologists (i.e., <100 mammograms). However, after accounting for the correlation within woman and radiologist using hierarchical logistic regression, which indirectly adjusts for the varying number of mammograms read by each radiologist and each woman's overall tendency for having a false-positive mammogram, the false-positive rates ranged from 3.5% to 11.9%, with a mean OR between radiologists of 1.68 (95% HPD = 1.33 to 2.42; Fig. 2, line B). Adjusting for patient and testing characteristics in addition to the correlation within woman and radiologist did not further reduce the variability in false-positive rates between radiologists (Fig. 2, line C; range = 3.3%–10.2%; mean OR = 1.65, 95% HPD = 1.33 to 2.44). However, after additionally adjusting for radiologists' characteristics, the range of false-positive rates was reduced to 3.5%–7.9%, and the mean OR between radiologists was 1.48 (95% HPD = 1.17 to 2.08; Fig. 2, line D).
Community radiologists varied substantially in their interpretation of screening mammograms; the variability in false-positive mammography rates was reduced by half, but not eliminated, after adjustment for differences in the patient population, the testing situation, and radiologists' characteristics. Before adjustments, the 24 radiologists varied in their false-positive interpretation rates from 2.6% to 15.9%; after full adjustment for patient, testing, and radiologist characteristics that may influence false-positive readings, variability was reduced to a range of 3.5%–7.9%. While patient, testing, and radiologists' characteristics were all important predictors of false-positive rates, radiologist characteristics were more important in accounting for variability among radiologists in this study than we had anticipated. The unexpected importance of radiologist characteristics was probably due to the similarity of patient populations and testing characteristics across radiologists in this study. However, these characteristics may not be similar in other studies; therefore, it will typically be important to adjust for all of these variables when studying radiologists' variability.
The most important radiologist characteristics appeared to be age and time since graduation from medical school, with younger radiologists and those more recently in training having higher rates of false-positive mammograms. The fact that younger radiologists and those more recently trained had two to four times the false-positive mammographic examination rates of older radiologists (Table 2) is especially noteworthy, because it is reasonable to hypothesize that those most recently trained would be more accurate than older mammographers, i.e., those trained a long time ago. It is possible that the younger radiologists missed fewer cancers than did older mammographers who were more distant from their training, because their training emphasized sensitivity over specificity.
Variability has been noted in many areas of clinical medicine (28,29). Microscopic review of breast tissue slides has an element of subjectivity in interpretation similar to that of interpretation of mammograms. For example, in the diagnosis of ductal carcinoma in situ, agreement among five pathologists with a standard interpretation on a test set of 24 breast tissue slides ranged from 71% to 92%, with individual false-positive rates ranging from 0% to 20% (30). Obviously, the CIs around the individual rates would be wide, given the small sample size, but the similarities with our findings in mammography are striking.
Several studies (2–7) have indicated that significant variability exists in the interpretation of mammograms. This variability indicates the possibility of wide ranges in false-positive mammogram interpretations by individual radiologists, which can be both alarming and expensive for the patient (10). By better understanding sources of variability in mammography interpretation, we can identify potential areas of improvement. The ultimate goal is to enhance mammography performance by reducing the rate of false-positive interpretations while maintaining high levels of sensitivity and accuracy.
It has long been known that certain clinical and demographic characteristics of women make accurate reading of mammograms more difficult (31–33). More recently, several studies (34,35) have shown that time between mammograms and the availability of previous studies for comparison also affect accuracy. However, less attention has been directed to secular trends in false-positive mammographic examination rates. We found that rates almost doubled in this community setting between 1985 and 1993. This increase in false-positive rates may be related to fear of medical malpractice litigation, given the prominence in North America of malpractice litigation for delayed detection of breast cancer.
Strengths of this study include the fact that it was done within a community setting and with radiologists who had a broad range of years of experience and who had worked in different types of clinical settings. Data were available on the radiologists, the patients, and the testing characteristics, all of which were controlled for in the analysis. Most of the prior studies of radiologists' variability in mammography have been done in a testing situation, which might not be representative of real-life clinical practice (8).
The limitations of our study include the fact that the radiologists in this study did not read the same films, and so direct comparisons are not possible (although we did adjust for patient characteristics in the models). Only 45 women were diagnosed with breast cancer; thus, we did not analyze sensitivity. In addition, some of the radiologists read fewer than 100 mammograms in the 8.5-year study period, which makes comparisons difficult because the CIs were wide. It should be noted, however, that these radiologists read additional films outside this study cohort; thus, the numbers do not represent the total number of mammograms they read during the study period. In addition, the American College of Radiology breast imaging reporting and data system (BI-RADS™) classification system was not in use at the time of the study (36). Although use of BI-RADS™ may ultimately lead to less variability among radiologists, this has not yet been shown to be the case (5). The false-positive rates for our participating radiologists were lower than the national average; thus, our results possibly underestimate the variability among radiologists elsewhere. Finally, the data in this study are for 1985 through 1993, and reading patterns among radiologists may have changed since then.
Given the retrospective nature of this study, data on some variables were not available, which may have resulted in misclassification errors. For example, fiscal incentives, medical malpractice concerns, and comfort with ambiguity in clinical decision making are radiologist-related factors that might be important and should be included in future research. Adjustments for these and other variables may further decrease the variability in false-positive rates.
In summary, community radiologists varied widely in their false-positive rates for screening mammograms. This variability was affected not only by the kind of patients seen but also by radiologists' age and experience. Younger radiologists and those more recently in training had higher rates of false-positive mammogram interpretations. This study was different from research designs that used test sets of films, because we looked at radiologists' decisions as they naturally occur in actual clinical practice. That the variability among radiologists in false-positive mammographic examination readings was reduced by half underscores the importance of adjusting for patient and radiologist characteristics when attempting to understand variability in clinical medicine.
Supported by grants from the American Cancer Society (to J. Elmore); by Public Health Service grant HS-10591 (to J. Elmore) from the Agency for Healthcare Research and Quality and the National Cancer Institute, National Institutes of Health, Department of Health and Human Services; by a Robert Wood Johnson Generalist Faculty Scholar Award (to J. Elmore); and by the Harvard Pilgrim Health Care Foundation (to S. Fletcher and M. Barton).
Joann G. Elmore, Department of Medicine, University of Washington School of Medicine, Seattle.
Diana L. Miglioretti, Center for Health Studies, Group Health Cooperative of Puget Sound, and Department of Biostatistics, University of Washington School of Public Health, Seattle.
Lisa M. Reisch, Department of Medicine, University of Washington School of Medicine, Seattle.
Mary B. Barton, Department of Ambulatory Care and Prevention, Harvard Pilgrim Health Care, and Harvard Medical School, Boston.
William Kreuter, Department of Medicine, University of Washington School of Medicine, Seattle.
Cindy L. Christiansen, Boston University School of Public Health, Health Services Department, and Center for Health Quality, Outcomes and Economic Research at Veterans Affairs Health Services Research and Development, Boston, MA.
Suzanne W. Fletcher, Department of Ambulatory Care and Prevention, Harvard Pilgrim Health Care, and Harvard Medical School, Boston.