The search for genetic risk factors for cancer and many other diseases has been a major focus of investigation in recent years. Many genome-wide association studies have been conducted, including several in breast cancer [Easton et al., 2007; Gold et al., 2008; Hunter et al., 2007; Ruiz-Narvaez et al., 2010; Thomas et al., 2009]. These studies, which rely on relatively common SNPs, have been based on the premise that the major influences on the risk of common diseases such as cancer are likely to involve relatively common variants, on the grounds that only common variants can deliver high attributable risks. Although there has been some success in identifying cancer genes through this process, in general the relative risks conferred by the identified SNPs have been small. This is leading to a reconsideration of the premise that genetic risk of cancer is caused primarily by common variants [Gorlov et al., 2008; Schork et al., 2009]. Indeed the BRCA1 and BRCA2 genes are good examples of why this hypothesis may be false, in that selected rare variants in these genes confer very high risk, while the common variants appear to convey little or no risk. Even more perplexing is the fact that among the thousands of observed rare variants only some appear to confer risk, while others, including the vast majority of single nucleotide changes, appear to be harmless. This pattern seems likely to apply to other cancer genes that are identified in the future. Faced with this possibility, it becomes increasingly important to find strategies that distinguish risk-bearing from harmless variants in a setting where the information on each individual variant is necessarily sparse. Resolving these ambiguities is therefore an important public health issue.
Our approach has been to use sequence variant data from a case-control study to evaluate individual rare variants in a highly efficient manner. In our study the cases are women with double breast malignancies (contralateral breast cancer, CBC), while the controls have unilateral breast cancer. This makes the subject pool as gene rich as possible, thereby maximizing the number of variants observed in the study. We then used hierarchical statistical modeling to use the information in the data as efficiently as possible. This strategy assesses the observed case-control ratios of the individual variants, while simultaneously evaluating the broader case-control ratios of groups of variants, where the groups are pre-defined by bioinformatic or other classifications of the variants. The study produced an interesting finding. It showed that among the rare missense VUS, which collectively demonstrated very little increased risk [Borg et al., 2010], there is a small subset of variants, characterized by adverse predictions on all three bioinformatic classifications, that has a strong collective association with breast cancer risk. The model then leads us to conclude that a variant belonging to this group is likely to carry high risk, even when the variant itself was not observed often enough in cases to provide direct evidence of high risk.
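The way group membership can drive the estimate for a sparsely observed variant can be sketched with a simple shrinkage calculation. This is only an illustration under stated assumptions, not the hierarchical model fitted in the study: the carrier counts, the reference counts `N_CASE_REF` and `N_CONTROL_REF`, and the fixed between-variant variance `tau2` are all hypothetical, and each variant's log odds ratio is simply pulled toward a precision-weighted group mean.

```python
import math

# Hypothetical reference counts: cases and controls carrying no rare variant.
N_CASE_REF, N_CONTROL_REF = 600, 1200


def log_or_and_var(cases, controls):
    """Log odds ratio and its approximate variance for one variant.

    Adds 0.5 to every cell so that zero counts do not break the logarithm.
    """
    a, b = cases + 0.5, controls + 0.5
    c, d = N_CASE_REF + 0.5, N_CONTROL_REF + 0.5
    return math.log(a * d / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def shrink_to_group(estimates, tau2):
    """Shrink each (log_or, var) estimate toward the group mean.

    tau2 is an assumed, fixed between-variant variance (hypothetical value).
    """
    weights = [1.0 / (var + tau2) for _, var in estimates]
    group_mean = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    shrunk = []
    for b, var in estimates:
        w = tau2 / (tau2 + var)  # small when the variant's own data are sparse
        shrunk.append(w * b + (1 - w) * group_mean)
    return group_mean, shrunk


# Hypothetical (case carriers, control carriers) for five variants in one group.
counts = [(1, 0), (2, 0), (1, 0), (2, 1), (0, 1)]
raw = [log_or_and_var(ca, co) for ca, co in counts]
group_mean, shrunk = shrink_to_group(raw, tau2=0.25)

for (ca, co), (b, _), s in zip(counts, raw, shrunk):
    print(f"cases={ca} controls={co} raw OR={math.exp(b):.2f} shrunk OR={math.exp(s):.2f}")
```

In this toy setting the last variant, seen only in a single control, still receives an elevated estimate because almost all of its weight comes from the group.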
How do these results compare with previous model-based strategies for assigning risk to individual rare variants? In a comprehensive analysis that combined data from multiple sources, Easton et al. examined all of the BRCA1 and BRCA2 variants that have been observed to date. The authors' method is not directly comparable with ours, and the data sets are non-overlapping. Nonetheless the two sets of results agree on the classification of some variants and disagree on others. Of special interest are the 11 variants that we identify as high risk on the basis of adverse predictions on all three bioinformatic classifications (). Of these 11 variants, 6 were considered to be VUS at the outset of our study (the other 5 are deleterious by BIC criteria). Three of these (BRCA1
G1529R and D2665G) were classified as neutral by Easton et al., while for two (W2626C and E2663V) the classifications agree. Interestingly, the sixth variant, R3052W, was confirmed as functional in a recent study using a mouse embryonic stem cell assay [Kuznetsov et al., 2008]. Tavtigian et al. (2008) have also addressed this issue using data from the Myriad Genetics Laboratory database and a variety of approaches for mapping risk to the GV and GD indices. However, none of these methods involves direct comparison of the frequencies of observed cases and controls, either for individual variants or for groups defined by GVGD criteria. Their results indicate a monotonic increase in risk as a function of their summary score, ranging from C0 to C65, but our results suggest, albeit with small sample sizes, that the elevated risk may be limited to variants in the highest (C65) category (see ).
Of the 181 rare missense variants that we examined in the model, 6 emerged as having a nominally statistically significant association with risk (the top 6 variants listed in ). At the conventional 5% significance level we expect, under the null hypothesis, about 1 nominally significant result for every 20 independent tests performed (roughly 9 among 181), and so at first glance the identification of 6 significant results from the 181 variants seems unremarkable. However, the use of conventional methods to adjust for multiple comparisons is problematic in this setting. First, we knew at the outset that the overall case-control ratio for subjects with rare variants was similar to that of the comparison group of subjects with no rare variants [Borg et al., 2010], and so the vast majority of the variants had to have neutral risk. Indeed the purpose of the exercise was to see if any signals emerged from this overall pattern of null association. Second, each individual variant occurs in very few subjects; the majority of these rare variants occurred only once. So there is a fundamental problem of low statistical power per variant, and the fact that any of these variants emerged with “significant” statistical evidence is itself remarkable. Third, the estimates and confidence intervals for the individual variants are essentially derived by grouping them on the basis of common bioinformatic categories, so these estimates and tests are highly correlated and do not satisfy the assumptions of conventional multiple testing procedures such as the Bonferroni correction [Bonferroni, 1935] and false discovery rate techniques [Benjamini and Hochberg, 1995]. Nonetheless the issue of multiple comparisons is highly relevant in this context, and further research is needed to interpret the results of such analyses of large numbers of individual rare variants. Finally, the statistical properties of the hierarchical model itself are sub-optimal in the context of studying rare variants. This is an inevitable consequence of using asymptotic statistical methods when the data are sparse. However, a detailed study of this issue has provided evidence that the relative risk estimates have low bias, and that the confidence intervals have coverage exceeding 90% for a nominal 95% interval for most data configurations [Capanu and Begg, 2010].
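To make the "first glance" arithmetic concrete, the sketch below computes what the conventional calculations would give for 181 tests at the 5% level. It assumes independent tests, which, as argued above, does not hold for these correlated estimates, so it is shown only as a reference point. The function names and the p-values in `example_p` are hypothetical.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), i.e. k or more nominal hits by chance."""
    return 1.0 - sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k))


def benjamini_hochberg(p_values, q=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k_max])


m, alpha = 181, 0.05
expected_null_hits = m * alpha              # expected nominal hits if all are null
bonferroni_cutoff = alpha / m               # per-test threshold under Bonferroni
chance_of_six_or_more = prob_at_least(6, m, alpha)

# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.6]
rejected = benjamini_hochberg(example_p, q=0.05)

print(expected_null_hits, bonferroni_cutoff, chance_of_six_or_more, rejected)
```

Under independence, six or more nominally significant results among 181 null tests would be the usual outcome rather than the exception, which is why the count alone is not informative here.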
As a cautionary note, it would be inappropriate to over-interpret the distinct results for individual variants. The power of this method derives from the grouping of variants into classes with high collective risk, and any variant that belongs to such a group is inevitably allocated a high risk by the model. Consequently the results for some individual variants will depend strongly on how we choose to group the data. We elected to use the align-GVGD and SIFT scores with pre-specified cut-offs for this purpose. If we were to use different cut-offs, or if we instead grouped variants on the basis of, say, known functional domains in BRCA1 and BRCA2, then the allocation of some variants to the high risk category could change. We did explore the empirical association of 5 functional domains with risk of CBC: BRCA1 Ring; BRCA1 BRCT; BRCA1 NLS; BRCA2 DNA Binding; BRCA2 Transactivation. Of these, only the BRCA1 Ring domain exhibited a significant association with risk. Of the 3 observed variants in this domain, 2 (C44S and C61G) were classified as “deleterious” at the outset (see ), and the third (K45T) was observed in one control subject. This variant is neutral for all 3 bioinformatic predictors.
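The dependence of group membership on the chosen cut-offs can be illustrated with a small sketch. The variant names, the predictor values, and both sets of cut-offs below are hypothetical and are not the ones used in the study; the sketch also uses only two predictors for brevity. It relies only on the general conventions that lower SIFT scores indicate a more damaging prediction and that Align-GVGD grades run from C0 to C65.

```python
# Hypothetical variants with made-up predictor outputs (illustration only).
# sift: SIFT score (lower = more likely damaging)
# gvgd: numeric part of the Align-GVGD grade (C0..C65 -> 0..65)
VARIANTS = {
    "varA": {"sift": 0.00, "gvgd": 65},
    "varB": {"sift": 0.03, "gvgd": 35},
    "varC": {"sift": 0.04, "gvgd": 65},
    "varD": {"sift": 0.20, "gvgd": 65},
}


def adverse_on_all(variants, sift_max, gvgd_min):
    """Names of variants predicted adverse by every predictor at the given cut-offs."""
    return sorted(
        name
        for name, scores in variants.items()
        if scores["sift"] <= sift_max and scores["gvgd"] >= gvgd_min
    )


# Two hypothetical choices of pre-specified cut-offs.
lenient = adverse_on_all(VARIANTS, sift_max=0.05, gvgd_min=35)
strict = adverse_on_all(VARIANTS, sift_max=0.01, gvgd_min=65)

# Variants whose "high risk group" membership depends on the cut-off choice.
moved = sorted(set(lenient) - set(strict))
print(lenient, strict, moved)
```

Any variant that appears in `moved` would be allocated a high or a neutral risk depending purely on the grouping rule, which is the reason for caution about individual-variant results.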
In summary, we conducted a hierarchical modeling analysis of rare VUS in BRCA1 and BRCA2 using data from the WECARE Study. The results show that the vast majority of rare missense variants are neutral, but the study supports the growing evidence that there exists a small group of variants that are deleterious. This group includes a few variants classified as deleterious by conventional (BIC) criteria, but it also includes a few additional VUS. These additional variants are worthy of further investigation.