Association studies have been widely used to search for common low penetrance susceptibility alleles to breast cancer in general. However, breast cancer is a heterogeneous disease and it has been suggested that it may be possible to identify additional susceptibility alleles by restricting analyses to particular subtypes. We used data on 710 SNPs in 120 candidate genes from a large candidate-gene association study of up to 4470 cases and 4560 controls to compare the results of analyses of “overall” breast cancer with sub-group analyses based on the major clinico-pathological characteristics of breast cancer (stage, grade, morphology and hormone receptor status). No single nucleotide polymorphism (SNP) was highly significant in overall-effects analysis. Subgroup analysis resulted in substantial reordering of the ranks of SNPs, as assessed by the magnitude of the test statistics, and some associations that were not significant for an overall effect were detected in sub-groups at a nominal 5% level adjusted for multiple testing. The most significant association, of CCND1 SNP rs3212879 with estrogen receptor negative tumour types (p = 0.001), did not reach genome-wide significance levels. These results demonstrate that it may be possible to detect associations using subgroup analysis that are missed in overall-effects analysis. If the associations we found can be replicated in independent studies, they may provide important insights into disease mechanisms in breast cancer.
Breast cancer tends to cluster in families, the disease being approximately twice as common in first-degree relatives of cases as in the general population (1). Some of this clustering occurs as part of specific familial breast cancer syndromes where disease results from single alleles conferring a high risk. However, such alleles are rare in the population and the majority of multiple case breast cancer families do not segregate mutations in these genes (2). The model that best describes aggregation of breast cancer in these families is a polygenic model in which susceptibility to breast cancer is conferred by a large number of genetic variants, each of which has a modest effect (3, 4, 5, 6).
Despite large research efforts over the past ten years, the number of common susceptibility alleles identified has been small. Loci implicated thus far include SNP alleles located in the CASP8, FGFR2, TNRC9, MAP3K1 and LSP1 genes, and in regions of the DNA devoid of known genes at chromosome positions 8q24 and 2q35 (7, 8). One of the reasons for this lack of success may be disease heterogeneity. Invasive breast cancers can be divided into several pathologic subtypes with different histological appearances of the malignant cells, and different clinical presentations and outcomes. Novel subtypes have also emerged as a result of gene expression profiling (9). It is plausible that the aetiology of the sub-types is different. The association of the ‘basal’ phenotype in breast cancers with rare, deleterious mutations in BRCA1 (10, 11, 12, 13) demonstrates the principle that different genetic determinants can underlie different subtypes of the disease. In addition, the associations with SNPs at the FGFR2, 2q35 and MAP3K1 loci have been shown to be largely restricted to estrogen receptor positive disease (14, 15).
Few genetic association studies have systematically evaluated association between putative common susceptibility alleles and specific sub-types of disease. However, under some models of disease susceptibility, sub-group analysis may identify associations missed by analysis of overall effects. For example, if the genetic effects in the subgroups are sufficiently heterogeneous, the power of sub-group analysis may exceed that of simple overall-effects analysis. This is supported by the finding that, under some models, power to detect epistasis between genes is greater than power to detect overall effects (16). In that report the sub-group was defined by genotype at a second locus. The largest study to date of common variants in multiple candidate genes for breast cancer susceptibility has been published recently (17). Over 700 single nucleotide polymorphisms (SNPs) in 120 genes were tested in approximately 4,400 cases and 4,400 controls. None of the SNPs reached genome-wide significance levels after adjusting for population stratification (17). However, when the admixture maximum likelihood (AML) experiment-wise test for association was applied, there was evidence for an excess of positive associations over the proportion expected by chance, suggesting that some SNPs in these candidate genes are associated with breast cancer risk (17).
The purpose of this study was to reanalyse this data set in order to identify highly significant associations with specific subgroups of breast cancer in the absence of overall effects. Sub-groups were based on the major clinico-pathological features of the tumours, namely morphology type, stage, grade, estrogen and progesterone receptor status.
Study participants (up to 4,470 cases and 4,560 controls) were selected as described by Pharoah et al. (17). In brief, cases were drawn from Studies of Epidemiology and Risk factors in Cancer Heredity (SEARCH), an ongoing population-based study ascertained through the Eastern Cancer Registration and Information Centre. All patients diagnosed with invasive breast cancer below age 55 years since 1991 and still alive in 1996 (prevalent cases, median age 48 years), together with all those diagnosed below age 70 years between 1996 and the present (incident cases, median age 54 years), are eligible to take part. Of 12,767 eligible patients, 2,284 were not contacted because their general practitioner did not respond or thought that it would be inappropriate to contact the patient. Of the 10,583 patients who were contacted, 67% have returned a questionnaire, and 64% provided a blood sample for DNA analysis. Eligible patients who did not take part in the study were similar to participants except that, as might be expected, the proportion of clinical stage III/IV cases was somewhat higher in nonparticipants (10% versus 5%). Female controls were randomly selected from the Norfolk component of the European Prospective Investigation of Cancer (EPIC), a prospective study of diet and cancer being carried out in nine European countries (18). The EPIC-Norfolk cohort comprises 25,000 individuals resident in Norfolk, East Anglia, the same region from which the cases have been recruited. Controls are not matched to cases, but are broadly similar in age (42-81 years). The ethnic background of both cases and controls as reported on the questionnaires is similar, with >98% being white. Staging and phenotyping of breast cancer cases were obtained through the cancer registry from routine pathology and clinical records. The study is approved by the Eastern Region Multi-centre Research Ethics Committee, and all patients gave written informed consent.
The samples were split into two sets in order to save DNA and reduce genotyping costs: the first set (n = 2,270 cases and 2,280 controls) was genotyped for all SNPs, and the second set (n = 2,200 cases and 2,280 controls) was then tested for those SNPs that showed marginally significant associations in overall-effects analysis for set 1 (p-heterogeneity or p-trend < 0.1). This staged approach substantially reduces genotyping costs without significantly affecting the statistical power of the overall-effects analysis. Results of sub-group analyses for set 1 data were, however, not used to select SNPs for analysis in set 2.
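The stage 1 screen described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `select_for_stage2`, the SNP identifiers and the p-values are all hypothetical, and the sketch simply keeps a SNP when either its heterogeneity or its trend p-value falls below 0.1.

```python
def select_for_stage2(stage1_results, threshold=0.1):
    """Return SNP ids that pass the stage-1 overall-effects screen.

    stage1_results maps a SNP id to a (p_heterogeneity, p_trend) pair.
    A SNP is carried forward if either p-value is below the threshold.
    """
    return [
        snp
        for snp, (p_het, p_trend) in stage1_results.items()
        if min(p_het, p_trend) < threshold
    ]


# Hypothetical stage-1 p-values, invented purely for illustration.
stage1 = {
    "snp_a": (0.04, 0.20),
    "snp_b": (0.50, 0.08),
    "snp_c": (0.35, 0.60),
}
selected = select_for_stage2(stage1)
print(selected)  # ['snp_a', 'snp_b']
```

Only overall-effects p-values enter the screen, mirroring the statement that sub-group results from set 1 were not used for selection.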
Data on 710 SNPs in 120 candidate genes were available for analysis (17) (Supplementary Table 1). Genes that encode proteins in cellular pathways that are likely to be involved in breast carcinogenesis were chosen as candidates. The major pathways studied were steroid hormone metabolism and signalling, double strand break DNA repair, oxidative damage repair, epigenetic modifiers, and cell-cycle control. Genes in the 17q21 region commonly amplified in a variety of animal models of cancer, and some carcinogen metabolism genes, were also tested. For some pathways, only a small subset of genes was selected for study. Genes evaluated by pathway and the number of SNPs assayed for each are described in Pharoah et al. (17). Common variation in most genes was captured using a minimal set of tagging SNPs (17, 19). Genotyping methods are as described (17). Concordance for duplicate samples was 98% for all assays. Failed genotypes were not repeated (the rate of failed genotypes did not exceed 8.3% for any of the SNPs under study). Hardy-Weinberg equilibrium was tested as part of genotyping quality assurance, and SNPs with serious deviations were excluded.
The aim of this study was to test for statistical association between each of 710 individual SNPs and breast cancer sub-types, and to compare the results of these with the overall-effects analyses. The sub-types were categorized according to clinical stage at diagnosis (I, II, III/IV), histopathological grade (1, 2, and 3), estrogen receptor (ER) status and progesterone receptor (PR) status (negative or positive), and histopathological morphology. Only the most common morphological types, lobular and ductal, were analysed. Pair-wise correlation coefficients were calculated to assess the correlation structure between sub-groups (Supplementary Table 2). Sample sizes for most joint phenotypes were too small for sub-grouping to be based on phenotypic correlations. Association between disease and genotype for each SNP within each subtype category for ER status, PR status and morphology was assessed using the one degree of freedom Cochran-Armitage trend test with a single parameter for allele dose. The analyses were conducted with each sub-group being compared with all of the available controls. Grade and stage were assessed as ordered categories using ordinal polytomous logistic regression.
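As an illustration of the one degree of freedom trend test, the following is a minimal sketch of the textbook Cochran-Armitage statistic with allele-dose scores 0, 1 and 2. It is not the software used in the study; the function name and the genotype counts are hypothetical, and the p-value uses the identity that the upper tail of a chi-squared variable with one degree of freedom equals erfc(sqrt(x/2)).

```python
import math


def cochran_armitage_trend(case_counts, control_counts, scores=(0, 1, 2)):
    """Return (chi2, p) for the 1-df Cochran-Armitage trend test.

    case_counts and control_counts hold genotype counts ordered by
    allele dose (0, 1 or 2 copies of the tested allele).
    """
    totals = [r + s for r, s in zip(case_counts, control_counts)]
    n_cases, n_controls = sum(case_counts), sum(control_counts)
    n_all = n_cases + n_controls

    sum_x_cases = sum(x * r for x, r in zip(scores, case_counts))
    sum_x_all = sum(x * n for x, n in zip(scores, totals))
    sum_x2_all = sum(x * x * n for x, n in zip(scores, totals))

    # Score variance term; zero when every individual has the same genotype.
    spread = n_all * sum_x2_all - sum_x_all ** 2
    if spread == 0 or n_cases == 0 or n_controls == 0:
        raise ValueError("trend test undefined for these counts")

    chi2 = (
        n_all * (n_all * sum_x_cases - n_cases * sum_x_all) ** 2
        / (n_cases * n_controls * spread)
    )
    # Upper-tail probability of a chi-squared variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical genotype counts for one SNP in one sub-group versus controls.
chi2, p = cochran_armitage_trend([100, 220, 130], [400, 800, 400])
print(round(chi2, 3), round(p, 4))
```

In a sub-group analysis the case counts would come from one sub-group only (for example ER negative cases), while the control counts would come from all available controls.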
Results for all tests were summarised using standard quantile-quantile (Q-Q) plots constructed by ranking the set of values for the test statistic from smallest to largest and plotting them against their expected values. Per-allele odds ratios and confidence intervals were estimated using logistic regression. In order to compare the previously reported overall effects analysis with the sub-group analyses reported here a Bonferroni correction was applied to correct for the number of sub-group analyses (eight). A nominal significance level of p < 0.05 was chosen for overall effects and an equivalent p < 0.00625 (=0.05/8) for subgroup analyses. Note that this is not a correction for the number of SNPs tested as such a correction would be the same for each sub-group analysis and would make no difference to the comparison between overall effects and sub-group analyses.
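The threshold arithmetic and the Q-Q construction described above can be sketched as follows. The sketch assumes test statistics that follow a chi-squared distribution with one degree of freedom under the null hypothesis, and it uses the plotting position (i - 0.5)/m for the i-th smallest of m statistics; that plotting position and all names in the block are assumptions, not details taken from the study.

```python
from statistics import NormalDist

N_SUBGROUP_ANALYSES = 8
OVERALL_ALPHA = 0.05
# Bonferroni-equivalent threshold for each sub-group analysis.
SUBGROUP_ALPHA = OVERALL_ALPHA / N_SUBGROUP_ANALYSES
print(SUBGROUP_ALPHA)  # 0.00625


def expected_chi2_quantiles(m):
    """Expected ordered values of m null chi-squared (1 df) statistics."""
    std_normal = NormalDist()
    quantiles = []
    for i in range(1, m + 1):
        prob = (i - 0.5) / m  # assumed plotting position
        # A 1-df chi-squared variable is a squared standard normal, so its
        # quantile at prob is the squared normal quantile at (1 + prob) / 2.
        z = std_normal.inv_cdf((1 + prob) / 2)
        quantiles.append(z * z)
    return quantiles


def qq_points(observed_chi2):
    """Pair each ranked observed statistic with its expected value."""
    ranked = sorted(observed_chi2)
    return list(zip(expected_chi2_quantiles(len(ranked)), ranked))


# Hypothetical test statistics, invented for illustration only.
points = qq_points([0.1, 3.2, 0.8, 1.5, 0.02])
```

Points lying close to the line of equality indicate agreement with the null distribution; points rising above it at the upper end indicate larger statistics than expected by chance.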
The number of cases by sub-group is shown in Table 1. This is a maximum sample size as not all SNPs were genotyped for both set 1 and set 2. Figure 1(A) shows the Q-Q plot for the univariate trend test for association between SNPs and breast cancer (overall effect). For chi-squared values less than three, the observed values lie close to the line expected under the null hypothesis of no association, providing no evidence of inflation of the test statistic that would suggest population stratification or other systematic bias. The deviation of the higher observed values from those expected is suggestive of multiple weak associations. One SNP showed a much higher chi-squared statistic than the others. This SNP, rs3020314 in the estrogen receptor α gene (p = 8 × 10⁻⁵), did not reach genome-wide significance, but did reach the p < 10⁻⁴ threshold that has been suggested for candidate gene studies (20). Figures 1(B)-(I) show Q-Q plots for the univariate trend test for each subgroup scan. None of the associations reached the level of significance for the most significant association in the overall-effects analysis.
In the overall-effects analyses, 52 SNPs (7.5%) were significant at the p < 0.05 level. In subgroup analysis, at the equivalent significance threshold of p < 0.00625, 7 SNPs were significantly associated with increasing cancer grade, 16 with increasing cancer stage, 7 with lobular cancer, 7 with ductal cancer, 7 with ER positive disease, 14 with ER negative disease, 6 with PR positive cancer and 5 with PR negative cancer (data not shown). Most of the SNPs detected at the p < 0.00625 level in subgroup analysis achieved at least p < 0.05 significance in overall-effects analysis, but 18 SNPs found to be significant in subgroup analysis did not. For these SNPs, the per-allele odds ratios (OR) and corresponding 95% confidence intervals (CI) in subgroup and overall-effects analysis are shown in Table 2. Sample sizes of cases and controls for each SNP tested are also indicated. The strongest association observed was for CCND1 rs3212879 in ER negative disease (p = 0.0001, OR = 1.40, 95% CI = 1.20-1.70).
Thus, sub-group analysis has not identified any highly significant associations missed by overall-effects analysis. Nevertheless, sub-group analysis may still be useful when selecting SNPs of borderline significance for further replication. In general, the number of SNPs selected for replication studies is limited by the cost of genotyping. For example, assume funding is available to attempt to replicate 50 SNPs in further studies in order to provide definitive evidence of association. One strategy would be to simply select the top 50 ‘hits’, i.e. SNPs that are significant at some pre-defined level (for example p < 0.05), from the overall-effects analysis. However, better candidates for replication may be identified from the sub-group analyses. A possible strategy would be to include SNPs for replication that achieve the same p-value as in overall-effects analysis after Bonferroni correction. Here we applied a correction factor of eight in order to compare directly the results of the subgroup and overall-effects analyses, although this correction is overly conservative as sub-groups (and therefore tests) are correlated (Supplementary Table 2). A ranking of corrected p-values representing the top 50 ‘hits’ (p < 0.05) from both overall-effects and sub-group analyses is shown in Table 3. With this strategy, twelve SNPs achieved higher ranks in subgroup analysis than in overall-effects analysis. Thus 50 SNPs chosen for replication would include 38 SNPs from the overall-effects analysis and 12 SNPs from the sub-group analyses.
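One way to read this selection strategy is sketched below. It rests on assumptions that are not stated in the text: each sub-group p-value is multiplied by eight and capped at 1, each SNP is represented by its smallest corrected p-value across the overall and sub-group analyses, and the SNPs are then ranked on that value. The function name, SNP identifiers and p-values are hypothetical.

```python
def rank_for_replication(overall_p, subgroup_p, n_subgroups=8, top_n=50):
    """Rank SNPs on their best p-value after correcting sub-group results.

    overall_p maps SNP id -> overall-effects p-value.
    subgroup_p maps (SNP id, sub-group name) -> uncorrected p-value.
    Returns up to top_n tuples of (corrected p, SNP id, source analysis).
    """
    best = {snp: (p, "overall") for snp, p in overall_p.items()}
    for (snp, subgroup), p in subgroup_p.items():
        corrected = min(1.0, p * n_subgroups)  # Bonferroni correction
        if snp not in best or corrected < best[snp][0]:
            best[snp] = (corrected, subgroup)
    ranked = sorted((p, snp, source) for snp, (p, source) in best.items())
    return ranked[:top_n]


# Hypothetical p-values, invented purely for illustration.
overall = {"snp_a": 0.01, "snp_b": 0.30, "snp_c": 0.04}
subgroup = {("snp_b", "ER negative"): 0.002, ("snp_a", "lobular"): 0.02}
for corrected_p, snp, source in rank_for_replication(overall, subgroup):
    print(snp, source, corrected_p)
```

In this toy input, snp_b would not be chosen on its overall p-value of 0.30 but enters the ranking through its corrected ER negative result, which is the behaviour the strategy is intended to capture.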
Complex diseases such as breast cancer are phenotypically heterogeneous, and this heterogeneity may obscure genetic associations. If alleles at different genetic loci are responsible for different subtypes of disease, genetic associations may be best detected by subgroup analysis. However, there is a trade-off between the increased specificity that may be obtained by sub-group analysis and the loss of statistical power from the reduction in sample size and the increase in the number of hypotheses being evaluated. The most powerful way to test for and detect associations is not known because the true underlying biological and genetic models for the data are not known.
Our results support the notion that sub-group analysis may be worthwhile because some associations that were not detected using an overall-effects approach were detected using a subgroup approach at a nominal level of p < 0.05 adjusted for multiple sub-group testing. Nevertheless, the findings are only illustrative of the potential for sub-group analysis, because none of the associations detected could be regarded as definitive, and we cannot state with certainty that sub-group analyses have identified true associations missed by the overall-effects analysis. The most strongly associated SNP was rs3212879 in CCND1, which was associated with risk of estrogen receptor negative tumours (p = 0.001 adjusted for multiple sub-group testing). This does not reach the threshold for genome-wide significance (p < 10⁻⁸), which is necessarily stringent due to the low prior probability of any individual SNP being associated with disease. The prior probability for SNPs in candidate gene studies may be somewhat higher, but there is still no biological a priori hypothesis for association between any particular SNP and a particular subtype of cancer. Despite good total sample size, the sub-group sizes are modest and power to detect sub-group effects at very stringent significance levels may be small. Statistical power may be further limited by sub-group classification error. Data for sub-group categorisation were obtained from clinical records and there is likely to be some degree of misclassification of phenotype. Nevertheless, the effect size for GPX4 rs4087542 in lobular carcinoma, and the SNPs in the TBXAS1 family in ER positive tumours and those in the CCND1 family in ER negative tumours are of sufficient magnitude to warrant replication in larger studies of patients with these subtypes of cancer.
There are other published examples of associations between genetic variants and breast cancer restricted to specific subtypes. The associations of common variants at FGFR2, MAP3K1 and 2q35 have been reported to be confined to estrogen receptor positive cancers (14, 15). These loci were not evaluated as part of our candidate gene study. In these examples, the overall effect analyses between the variants and breast cancer had reached genome-wide significance levels prior to subgroup analysis.
Our results also demonstrate that sub-group analyses may be incorporated into a strategy for selection of SNPs for replication in independent data sets. Staged study designs are commonly used in genome-wide association studies in order to reduce costs (21). The most appropriate selection of SNPs for the second and subsequent stages is critical to maximise power. To date most GWAS have based this selection on the results of overall effects analyses, but SNP selection based on overall effects and sub-group analysis with an appropriate correction for multiple testing may prove more efficient.
The potential advantages of reducing phenotypic heterogeneity by restricting analysis to specific sub-types of disease are clear. Further evaluation of such a strategy is required to provide definitive evidence of its value.
We thank the SEARCH team, the EPIC collaborators and the Eastern Cancer Registration and Information Centre (patient recruitment and phenotype data). Genotyping was carried out by many individuals from the Department of Oncology at Strangeways Research Laboratory and funded by Cancer Research UK. NM was funded by scholarships from Cancer Research UK and the Medical Research Council, PP is a Senior Clinical Research Fellow and DFE a Principal Fellow of Cancer Research UK.