All silicone breast implant recipients are recommended by the US Food and Drug Administration to undergo serial screening to detect implant rupture with magnetic resonance imaging (MRI). We performed a systematic review of the literature to assess the quality of diagnostic accuracy studies utilizing MRI or ultrasound to detect silicone breast implant rupture and conducted a meta-analysis to examine the effect of study design biases on the estimation of MRI diagnostic accuracy measures.
Studies investigating the diagnostic accuracy of MRI and ultrasound in evaluating ruptured silicone breast implants were identified using MEDLINE, EMBASE, ISI Web of Science, and Cochrane library databases. Two reviewers independently screened potential studies for inclusion and extracted data. Study design biases were assessed using the QUADAS tool and the STARD checklist. Meta-analyses estimated the influence of biases on diagnostic odds ratios.
Among 1175 identified articles, 21 met the inclusion criteria. Most studies using MRI (n=10 of 16) and ultrasound (n=10 of 13) examined symptomatic subjects. Meta-analyses revealed that MRI studies evaluating symptomatic subjects had 14-fold higher diagnostic accuracy estimates compared to studies using an asymptomatic sample (RDOR 13.8; 95% CI 1.83–104.6) and 2-fold higher diagnostic accuracy estimates compared to studies using a screening sample (RDOR 1.89; 95% CI 0.05–75.7).
Many of the published studies utilizing MRI or ultrasound to detect silicone breast implant rupture are flawed by methodological biases. These shortcomings may result in overestimated MRI diagnostic accuracy measures, and the reported measures should be interpreted with caution when applied to a screening population.
Substantial adverse public and media attention was directed toward the use of silicone gel breast implants in the early 1990s, when concerns linking implants with connective tissue disorders,1 cancer,2 and neurologic sequelae3 resulted in a 14-year ban by the US Food and Drug Administration (F.D.A.) beginning on April 16, 1992.4 During this ban, many rigorous clinical and epidemiological studies were conducted but failed to show compelling associations between implant rupture and autoimmune diseases.5 The lack of evidence persuaded the F.D.A. to re-approve the use of silicone breast implants in 2006 with the recommendation to screen all silicone breast implant recipients with magnetic resonance imaging (MRI) 3 years after implantation and every 2 years thereafter.6 These recommendations affect a large number of women undergoing breast augmentation. An estimated 1 million women underwent augmentation with silicone gel implants between 1963 and 1988 prior to the ban,7 and from 2008 to 2009, 265,074 women received silicone implants, comprising nearly 50% of all women undergoing breast augmentation in the US.8
With a growing number of women being implanted with silicone gel implants,8 serial MRI screening throughout the implant’s lifetime raises concerns regarding MRI as an optimal screening modality. Important criteria to be considered when choosing an optimal screening test include characteristics of the condition of interest (e.g., prevalence of the detectable condition) and test characteristics (e.g., sensitivity and specificity). The F.D.A.’s aim in screening is to detect silent ruptures, that is, ruptures in clinically asymptomatic patients.9 The range of rupture characteristics extends from large visible tears or focal ruptures through pin-sized holes to gel bleeds, which are microscopic silicone leaks through an otherwise intact implant envelope.10 Microscopic leaks may be caused by degenerating silicone elastomers and may evolve into larger leaks with migrating free silicone. How sensitive MRI is in detecting gel bleeds, compared to intracapsular and extracapsular ruptures manifesting with clinical symptoms, remains unknown. In addition, the prevalence of gel bleeds, given the subclinical presentation, is difficult to assess. The most recent study using MRI to screen for ruptures reports a prevalence of 8% among asymptomatic women for implants of median age 11 years.11 This study was restricted to implants 10–13 years of age and specific manufacturing styles of Inamed silicone breast implants (Inamed Corp., Santa Barbara, CA) and did not verify subjects with explantation. Nonetheless, the low prevalence of silent rupture calls into question the utility of MRI in a screening population. Moreover, the manufacturing of silicone gel implants has been improving with more durable shells and more cohesive gel materials,12 which will potentially further decrease the prevalence of rupture.
Screening test characteristics are fundamental in choosing the optimal test. Several authors have shown inaccurate diagnostic accuracy measures in studies flawed with study design biases, such as spectrum bias or partial verification bias.13, 14 Spectrum bias occurs when a study sample is comprised of a clinically restricted spectrum of patients. For example, symptomatic subjects are more likely to have a ruptured implant, resulting in higher sensitivity and specificity estimates. Furthermore, studies evaluating the accuracy of a screening test are particularly subject to partial verification bias. This bias occurs when not all subjects who are screened undergo the reference test, in particular, those with a negative screening test result. This bias can markedly reduce the apparent specificity and increase the sensitivity of the test.13
We examined the study quality of diagnostic accuracy studies using MRI to detect silicone breast implant rupture, given the current controversy about the MRI screening recommendation by the F.D.A.15 All silicone breast implant recipients are recommended by the F.D.A. to undergo serial MRI screening to detect implant rupture despite a lack of evidence showing serious consequences from a ruptured implant. Because some physicians believe that ultrasound is a more acceptable screening modality,16 we were also interested in the quality of diagnostic accuracy studies using ultrasound to detect implant rupture. We performed a systematic review of this literature, identified the most common biases, and examined reporting quality using validated checklists. Next, we used this identified literature to perform a meta-analysis,17, 18 a statistical method commonly used to combine results from multiple studies. Figure 1 is a schematic illustrating the steps for a meta-analysis. The meta-analysis was also conducted to quantify the effect of biases on the reported MRI diagnostic accuracy measures.
Four databases (MEDLINE, EMBASE, ISI Web of Science, and Cochrane) were searched using patient (breast and silicone), intervention, outcome, and diagnostic accuracy entry terms up to April 2010 (Table 1). Two searches from each database were performed using first, a combination of patient, intervention and outcome terms and second, a combination of patient, intervention, outcome and diagnostic accuracy terms.19 There were no language restrictions but searches were limited to human studies. The standards of quality for reporting meta-analyses of observational studies were reviewed during the planning, conducting, and reporting of this meta-analysis.20
Study inclusion criteria are shown in Table 2. First, titles and abstracts were screened independently by two reviewers. Selected articles from this screen underwent subsequent independent full-text reviews. The references of all articles selected for full text review were manually reviewed. Foreign language articles were translated and reviewed for inclusion.
Following the Cochrane Collaboration recommendations,21 methodological quality was assessed independently by two reviewers using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS) tool.22, 23 Completeness of reporting was assessed by the Standards for Reporting of Diagnostic Accuracy Studies (STARD) checklist.24, 25 The QUADAS tool is a 14-item assessment instrument developed for systematic reviews of diagnostic accuracy studies. Each item was scored as yes, no, or unclear. We defined a representative spectrum of patients to include both asymptomatic and symptomatic patients, to reflect a screening population, given the current context of MRI as a screening tool. Authors were contacted when information was inadequate in the report. The STARD statement is a 25-item checklist aimed to improve the completeness of reporting diagnostic accuracy studies. Each item was scored as yes, no, incomplete, or unclear. Inter-rater agreement was calculated by the kappa statistic. Discrepancies were resolved by consensus.
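As an illustration of the agreement statistic, the following is a minimal Python sketch of Cohen's kappa for two reviewers. It assumes the unweighted two-rater form of kappa, which the text does not specify, and the ratings shown are hypothetical rather than the study's data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal proportions, summed
    # over categories.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical checklist ratings for 10 items (not the study's data).
reviewer_1 = ["yes", "yes", "no", "unclear", "yes", "no", "yes", "no", "yes", "yes"]
reviewer_2 = ["yes", "no", "no", "unclear", "yes", "no", "yes", "yes", "yes", "unclear"]
kappa = cohen_kappa(reviewer_1, reviewer_2)
```

For these hypothetical ratings the raw agreement is 70% while kappa is about 0.49, which shows why a chance-corrected statistic is reported alongside percent agreement.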
Data extraction also included type of study (prospective or retrospective) and sample and implant characteristics. Numbers of true-positives, false-negatives, false-positives, and true-negatives were extracted, and the sensitivity (number of true positives divided by the number of true positives and false negatives) and specificity (number of true negatives divided by the number of true negatives and false positives) were calculated by the authors to confirm the reported values. Data extraction was performed independently by 2 reviewers. Discrepancies were resolved by consensus.
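The recalculation described above can be sketched in a few lines of Python. The counts and the reported values below are hypothetical, and the tolerance used for the comparison is an assumption introduced here for illustration.

```python
def sensitivity(tp, fn):
    """True positives divided by all implants ruptured at explantation."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negatives divided by all implants intact at explantation."""
    return tn / (tn + fp)


def matches_reported(calculated, reported, tolerance=0.005):
    """Check a recalculated proportion against a value reported in a paper.

    The default tolerance (half a percentage point) is an assumed allowance
    for rounding in the published value.
    """
    return abs(calculated - reported) <= tolerance


# Hypothetical 2x2 table extracted from one study (not real data).
tp, fn, fp, tn = 40, 10, 5, 45
sens = sensitivity(tp, fn)   # 40 / 50
spec = specificity(tn, fp)   # 45 / 50
sens_confirmed = matches_reported(sens, 0.80)
spec_confirmed = matches_reported(spec, 0.90)
```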
Pooled test sensitivity and specificity values were obtained using multi-level mixed-effects logistic regression models. A forest plot was generated to graphically represent heterogeneity across individual studies. A forest plot is a pictorial representation of each study’s sensitivity and specificity bounded by the 95% confidence intervals. It is useful for visualizing how similar or dissimilar the reported measures are amongst studies and to identify studies with outlying values of the measures.18
We next assessed heterogeneity in the reported sensitivity and specificity across the studies using the Q and I2 statistics26 to determine whether the measures from the different studies are similar enough to be combined into a pooled summary measure. A small p-value (< 0.05) from the Q statistic suggests statistically significant heterogeneity among studies. The I2 statistic quantifies the amount of heterogeneity among studies, and by convention, low, moderate, and high values of heterogeneity are indicated by I2 values of 25%, 50%, and 75%, respectively.26 If substantial statistical heterogeneity among studies emerges, sources of heterogeneity should be identified.26 The variations in the reported sensitivities and specificities may be due to differing sample characteristics or differences in the way each study was conducted. To identify potential sources of heterogeneity, we performed subgroup analyses to evaluate if any of the various sample and study characteristics affected the sensitivities and specificities.
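The relationship between the two heterogeneity statistics can be made concrete with a short Python sketch. It uses the standard definition of I2 as the percentage of the Q statistic in excess of its degrees of freedom (the number of studies minus 1), truncated at zero. The inputs are the Q statistics reported in the Results for the 13 ultrasound studies; the function name is an assumption introduced for illustration.

```python
def i_squared(q, n_studies):
    """I2 (in percent) from Cochran's Q and the number of studies."""
    df = n_studies - 1
    if q <= 0:
        return 0.0
    # Negative values (Q below its degrees of freedom) are truncated to zero.
    return max(0.0, (q - df) / q) * 100.0


# Q statistics reported in the Results for the 13 ultrasound studies.
i2_sensitivity = i_squared(28.0, 13)   # (28 - 12) / 28
i2_specificity = i_squared(57.0, 13)   # (57 - 12) / 57
```

These evaluate to roughly 57.1% and 78.9%, close to the reported I2 values of 57.2 and 78.9; the small difference for sensitivity is consistent with rounding of the published Q statistic.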
The effect of study characteristics was further examined using diagnostic odds ratio (DOR) as previously described by Lijmer et al.14 The DOR is useful because it is a single summary statistic for diagnostic accuracy incorporating both sensitivity and specificity (Figure 2A). It is the odds of a positive test in a diseased person relative to the odds of a positive test in a non-diseased person, in which a large DOR indicates a high sensitivity and specificity.27 Briefly, the effect of study characteristics on diagnostic accuracy can be assessed using a regression model with the logarithm of the DOR as the dependent variable. From the regression model, we can obtain an estimate for relative DOR (RDOR), which is interpreted as a ratio of the DORs with versus without the study characteristic (Figure 2B). Thus, a RDOR of 1 indicates that the study characteristic does not influence the overall DOR, whereas a RDOR greater than 1 indicates that studies with the characteristic yield larger estimates of DOR than studies without the characteristic.27
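A minimal Python sketch of the DOR follows, computed both from the cells of a 2×2 table and from sensitivity and specificity. The counts and the second study group are hypothetical. The crude ratio of two DORs at the end only illustrates how an RDOR is read; it is not the regression estimate described above.

```python
def dor_from_counts(tp, fn, fp, tn):
    """Diagnostic odds ratio from a 2x2 table: (TP/FN) / (FP/TN)."""
    if fn == 0 or fp == 0:
        raise ValueError("DOR is undefined when FN or FP is zero")
    return (tp * tn) / (fp * fn)


def dor_from_rates(sens, spec):
    """Diagnostic odds ratio from sensitivity and specificity."""
    odds_positive_if_diseased = sens / (1.0 - sens)
    odds_positive_if_healthy = (1.0 - spec) / spec
    return odds_positive_if_diseased / odds_positive_if_healthy


# Hypothetical study group A: sensitivity 0.80, specificity 0.90.
dor_a = dor_from_counts(tp=40, fn=10, fp=5, tn=45)
# Hypothetical study group B: sensitivity 0.70, specificity 0.70.
dor_b = dor_from_rates(0.70, 0.70)

# Crude ratio of the two DORs; a value above 1 means group A yields the
# larger estimate of diagnostic accuracy.
crude_rdor = dor_a / dor_b
```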
Publication bias was examined by constructing a funnel plot, and the statistical significance of asymmetry was assessed by Egger’s test.28 Stata 11.1 (StataCorp, College Station, TX) was used for statistical analyses.
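The idea behind the asymmetry test can be sketched in Python. The analyses here were run in Stata, so this is not the authors' code. The sketch assumes the usual form of Egger's regression, in which each study's effect divided by its standard error is regressed on the inverse of the standard error, and an intercept far from zero suggests funnel plot asymmetry. It also assumes the effect is the log DOR with its usual large-sample standard error. The significance test on the intercept is omitted, and all data are hypothetical.

```python
import math


def log_dor_and_se(tp, fn, fp, tn):
    """Log diagnostic odds ratio and its large-sample standard error."""
    log_dor = math.log((tp * tn) / (fp * fn))
    se = math.sqrt(1.0 / tp + 1.0 / fn + 1.0 / fp + 1.0 / tn)
    return log_dor, se


def egger_regression(effects, std_errors):
    """Ordinary least squares fit of (effect / SE) on (1 / SE).

    Returns (intercept, slope). The intercept is the asymmetry measure.
    """
    x = [1.0 / se for se in std_errors]
    y = [e / se for e, se in zip(effects, std_errors)]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must not all be equal")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical studies in which less precise studies report larger effects.
effects = [1.0, 1.5, 2.5, 3.0]
std_errors = [0.2, 0.4, 0.8, 1.0]
intercept, slope = egger_regression(effects, std_errors)
```

In this hypothetical example the effect grows with the standard error, so the intercept is well above zero; when every study reports the same effect, the intercept is zero.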
The initial search dated up to April 2010 using 4 databases (MEDLINE, EMBASE, ISI Web of Science, and Cochrane) identified 1175 articles (Figure 3). A total of 311 articles were duplicates. Of the remaining 864 articles, 768 were excluded upon review of the title or abstract based on inclusion and exclusion criteria, leaving 96 articles. Forty-two additional articles were included after a manual bibliography search from the 96 included articles, totaling 138 articles for full-text review. After full-text review, 117 articles were excluded, leaving 21 articles that evaluated the diagnostic accuracy of ultrasound and/or MRI for silicone breast implant rupture. Reasons for further exclusion are described in Figure 3. Eight studies examined both ultrasound and MRI,29–36 5 evaluated ultrasound only,37–41 and 8 evaluated MRI only.42–49
All 21 studies were diagnostic cohort studies.29–36, 42–49 Table 3 summarizes characteristics of the included studies. Among the MRI studies, 2 studies were duplicated: Scaranelo et al.35 examined breast and body coil MRIs separately, and Gorczyca et al.44 compared Fast Spin-Echo and 3-Point Dixon MRIs separately. In total, 1,098 silicone breast implants in 615 women were examined with MRI, and 1,007 silicone breast implants in 577 women were examined with ultrasound.
We assessed 21 studies using the QUADAS instrument and STARD checklist. Inter-rater agreement for the total QUADAS and STARD assessments was good for the 21 studies (76.8%, kappa statistic=0.58). More than 50% of the 16 MRI studies used a sample that was not representative of a screening sample (10 studies examined only symptomatic patients29–31, 33, 34, 36, 44, 45, 47, 48; 2 studies examined only asymptomatic patients35, 42), 9 did not explain reasons for individuals withdrawing from the study,29–32, 36, 42, 44–46 and 11 did not report uninterpretable results.29, 30, 32–34, 43–46, 48, 49 The reference test diagnostic criteria were not specified in 7 studies (43.8%),29, 34, 35, 42–44, 48 and 7 studies (43.8%)31, 34, 42–44, 48, 49 had partial verification bias (Figure 4A-B). More than 50% of the 13 ultrasound studies did not use a screening sample (10 studies examined only symptomatic patients29–31, 33, 34, 36–39, 41; 1 study examined only asymptomatic patients35), and 11 did not explain reasons for individuals withdrawing from the study29–33, 36–41 (Figure 5A-B).
Using the STARD checklist, we identified important specifications for diagnostic accuracy studies that were inconsistently addressed across studies. Only 4 of 16 MRI29, 35, 36, 43 and 5 of 13 ultrasound29, 35, 36, 40, 41 studies reported in their title or abstract that the reported sensitivity and specificity were applicable to the specified studied sample. In reporting the test results, only 31.3% of MRI (5 of 16)31, 35, 46, 47, 49 and 15.4% (2 of 13)31, 35 of ultrasound studies reported a time interval from the index test to explantation. This time interval ranged from 1 week35 to 297 days47 with a median of 3 months among the 5 MRI studies. This information is important especially for screening tests, given the possibility that rupture may have occurred during the interim period before explantation; this context is also known as disease progression bias. In addition, a few MRI (3 of 16)33, 42, 47 and ultrasound (2 of 13)33, 36 studies discussed the possibility of rupture at the time of explantation, an important detail given a surgical reference test.
Gel bleeds were inconsistently addressed across studies. Most studies (10 of 16 MRI; 7 of 13 ultrasound) did not address gel bleeds or did not include them in calculating sensitivity and specificity (Table 3). Five MRI30, 32, 34, 35, 46 and 5 ultrasound studies30, 32, 34, 35, 38 considered gel bleeds as not ruptured, and 1 MRI47 and 1 ultrasound41 study considered gel bleeds as ruptured.
Because observer variability between radiologists can be particularly problematic with imaging tests, it is important to report estimates of test reproducibility. Although 87.5% of MRI (14 of 16) and 76.9% of ultrasound (10 of 13) studies reported the number of radiologists reading the films, only 7 MRI31, 33, 44, 45, 47–49 and 5 ultrasound31, 33, 37, 38, 40 studies had 2 or more radiologists. Furthermore, only 4 MRI33, 44, 48, 49 and 2 ultrasound34, 37 studies discussed inter-observer agreement. Fewer than half of the studies discussed indeterminate or inconclusive findings (5 MRI31, 32, 36, 44, 47 and 4 ultrasound studies31, 32, 36, 37, 40).
Forest plots for the 18 MRI data sets (16 studies, 2 of which reported 2 imaging techniques separately) are illustrated in Figure 6. Significant heterogeneity was present across studies for sensitivity and specificity (sensitivity, Q-statistic p<0.01; I2 = 64.7; specificity, Q-statistic p<0.01; I2 = 84.9). The pooled sensitivity and specificity for MRI were 87.0% (95% CI 81–91%) and 89.9% (95% CI 82–94%), respectively. Though not shown as a forest plot, the sensitivity for the 13 ultrasound studies ranged from 30.0%35 to 77.0%32, with significant heterogeneity across studies (Q-statistic 28.0, p=0.01; I2 = 57.2; 95% CI 31–84). The reported specificity was also highly variable, ranging from 55.0%40 to 92.0%29, 39 with significant heterogeneity (Q-statistic 57.0, p<0.01; I2 = 78.9; 95% CI 68–90). The pooled sensitivity and specificity for ultrasound were 60.8% (95% CI 53–68%) and 76.3% (95% CI 68–83%), respectively.
Potential sources of heterogeneity and tests of heterogeneity are summarized in Table 4. We examined subgroups of study design biases, methodological characteristics, and test execution characteristics. Not included in the table are subgroups in which only 1 study was categorized to a group because statistical tests could not be done. For example, among the ultrasound studies, only 1 study used an asymptomatic sample,35 1 study was retrospectively conducted,32 and 1 study used a consecutive recruitment method.37 The sensitivity and specificity in MRI studies that used a symptomatic sample were higher (sensitivity 88%; specificity 94%) compared to studies using an asymptomatic sample (sensitivity 76%; specificity 68%), although the differences were not statistically significant. Of note, ultrasound studies without partial verification bias had significantly higher specificity (81%) than studies with partial verification bias (67%, p<0.001).
Regression analyses to quantify the influence of biases on diagnostic accuracy are illustrated in Figure 7. MRI studies using symptomatic samples had a DOR that was nearly 14-fold greater compared to the DOR of studies with asymptomatic samples (RDOR 13.8). MRI studies that used symptomatic samples had a DOR that was 1.89 times greater than studies that used a screening sample (i.e., symptomatic and asymptomatic samples). Studies that did not evaluate the condition of the explanted silicone implants in all subjects who underwent MRI evaluations (i.e. studies with partial verification bias) had a DOR 2.49 times greater than studies that surgically evaluated all implants in study subjects who underwent MRI evaluations.
Two funnel plots were constructed to assess publication bias among MRI and ultrasound studies. Significant publication bias was detected in MRI studies, indicated by an asymmetric distribution of studies (Figure 8A, p=0.01). No significant publication bias was detected among ultrasound studies (Figure 8B, p=0.87).
Our review reveals several new insights about the current literature using MRIs or ultrasounds to detect silicone gel implant ruptures. Although the pooled summary measures across the studies indicate relatively high accuracy of MRI in detecting breast implant rupture with a pooled sensitivity of 87% and a specificity of 89.9%, the majority of the current literature examined only symptomatic patients. This leads to a higher prevalence of silicone breast implant rupture and higher diagnostic accuracy estimates. We found the DOR, a measure of overall diagnostic test performance, of MRI to be 14-fold greater in symptomatic samples than in asymptomatic samples and 2-fold greater in symptomatic samples than in screening samples. This was shown in the subgroup analyses with higher sensitivity and specificity of the MRI in studies examining symptomatic samples than in studies using asymptomatic and screening samples (Table 4). These findings have widespread health policy implications given the F.D.A. recommendations to repeatedly screen silicone breast implant recipients with serial MRI exams.
Instituting a screening program requires careful consideration of several issues. First, the disease should have serious consequences.50 Currently, the morbidity associated with silicone breast implant rupture remains unclear and is still under study. Second, the disease must have a pre-clinical yet detectable stage.50 For silicone breast implant ruptures, this stage may be considered gel bleeds. Our results show a lack of consistency in addressing gel bleeds (Table 3). Third, a high prevalence of the pre-clinical stage among the screening population is optimal for a successful screening program. To date, there is a lack of evidence about the prevalence of subclinical gel bleeds. In addition, many studies report the mean age of implant at time of rupture to be greater than 10 years,29–31, 34–36, 47 suggesting that perhaps this group may consist of the high-risk sample that should garner directed attention. In light of this and the possibility of very low prevalence, adherence to the F.D.A. recommendation to screen with MRI at least 4 times within the first 10 years of silicone breast implantation will be costly and may potentially result in over-detection and over-treatment of a questionable non-life-threatening condition.
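The concern about over-detection at low prevalence can be illustrated with a positive predictive value (PPV) calculation, sketched below in Python. PPV is not reported in this review; the sketch assumes that the 8% prevalence from the cited screening study applies, and it plugs in the pooled MRI estimates (sensitivity 87%, specificity 89.9%) and the asymptomatic subgroup estimates (sensitivity 76%, specificity 68%) purely for illustration.

```python
def positive_predictive_value(sens, spec, prevalence):
    """Probability that a positive test reflects a true rupture (Bayes' rule)."""
    true_positive_rate = sens * prevalence
    false_positive_rate = (1.0 - spec) * (1.0 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


# Assumed prevalence of silent rupture, taken from the cited screening study.
PREVALENCE = 0.08

# Pooled MRI estimates across all included studies.
ppv_pooled = positive_predictive_value(0.87, 0.899, PREVALENCE)
# Subgroup estimates from MRI studies of asymptomatic samples.
ppv_asymptomatic = positive_predictive_value(0.76, 0.68, PREVALENCE)
```

Under these assumptions, roughly 43% of positive MRI results would correspond to a true rupture using the pooled estimates, and roughly 17% using the asymptomatic subgroup estimates, so most positive screens would be false positives.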
In addition, important screening test characteristics to consider include the sensitivity and specificity of the screening modality. Our results reveal many methodological flaws in the current literature, which may result in higher MRI sensitivity and specificity estimates (Figure 7). We showed that most of the included MRI studies reported diagnostic accuracy measures on symptomatic samples, which had a DOR that was nearly 14-fold and 2-fold greater than the DOR of detecting silicone breast implant ruptures in asymptomatic samples and screening samples, respectively. Thus, although MRI’s diagnostic performance in detecting silicone breast implant ruptures in a symptomatic sample may be quite good, we find that its accuracy is substantially lower in detecting rupture in asymptomatic and screening samples.
These compelling results are noteworthy given the frequency of the F.D.A. recommendations for serial MRI exams as a screening test to detect silicone breast implant ruptures. As a screening program, these recommendations have been received with widespread controversy. There is a lack of high-level evidence establishing serious health consequences from a ruptured silicone breast implant, and yet adherence to these recommendations results in substantial cost and use of resources. In particular, the benefits of screening within the first 10 years are unclear, and the effectiveness of such a screening program warrants further investigation. Moreover, screening programs should take into account patient preferences.51 Patient acceptability influences adherence to recommendations and has been an important topic for other screening programs, such as colorectal cancer52, 53 and prostate cancer screening.54
The main strength of this review is our rigorous compliance with the recommended methods for carrying out and reporting a systematic review, specifically for diagnostic accuracy studies. Systematic reviews of diagnostic accuracy studies differ from intervention studies in 3 ways: the inclusion of diagnostic accuracy search terms,19 assessment of study quality and completeness of reporting,20 and meta-analysis methods. Additionally, we attempted to identify sources of heterogeneity across these studies.55
There are several limitations to our study. First, the small number of studies may explain the lack of statistical significance of our results. The wide confidence intervals of the RDORs indicate low statistical power. Eight studies were excluded because they lacked sufficient data to construct 2×2 tables, which are essential in obtaining data to conduct meta-analyses. Efforts to contact authors were largely unsuccessful. This limitation emphasizes the importance of reporting diagnostic accuracy studies using the STARD checklist.24 Another limitation is the possibility of publication bias among MRI studies, despite an extensive search through 4 databases without any language restrictions. A possible explanation for the asymmetric funnel plot may be the small number of available included studies.56
In summary, many of the MRI and ultrasound diagnostic accuracy studies examining silicone breast implant ruptures are methodologically flawed, particularly because of the use of only symptomatic samples. The reported MRI sensitivity and specificity estimates may be too high when applied to asymptomatic or screening samples. Given the current policy recommendations to screen asymptomatic women, further research is needed to investigate and identify long-term disease consequences of rupture, the effectiveness of MRI or other more optimal screening tests in an appropriate sample, the cost of screening strategies, and patient preferences for screening.
The authors thank David B. Lumenta, MD for his help with foreign language article translations. This project was supported in part by a Ruth L. Kirschstein National Research Service Award for Individual Postdoctoral Fellows (1F32AR058105-01A1) (to Dr. Jae W. Song) and by a Midcareer Investigator Award in Patient-Oriented Research (K24 AR053120) from the National Institute of Arthritis and Musculoskeletal and Skin Diseases (to Dr. Kevin C. Chung).
Disclosures: None of the authors has a financial interest in any of the products, devices, or drugs mentioned in this manuscript.