|Home | About | Journals | Submit | Contact Us | Français|
Genome-wide association studies (GWAS) focus on relatively few highly significant loci while less attention is given to other genotyped markers. Employing pathway analysis to existing GWAS data may shed light on relevant biological processes, and illuminate new candidate genes. We employed a pathway-based approach to the breast cancer GWAS data of the National Cancer Institute (NCI) Cancer Genetic Markers of Susceptibility (CGEMS) project that includes 1145 cases and 1142 controls. Pathways were retrieved from three databases: KEGG, BioCarta, and the NCI’s Protein Interaction Database (PID). Genes were represented by their most strongly associated SNP, and an enrichment score (ES) reflecting the overrepresentation of gene-based association signals in each pathway was calculated using a weighted Kolmogorov-Smirnov procedure. Finally, hierarchical clustering was used to identify pathways with overlapping genes, and clusters with excess of association signals were determined by the adaptive rank-truncated product (ARTP) method. A total of 421 pathways containing 3962 genes were included in our study. Of these, three pathways (‘Syndecan-1-mediated signaling ‘, ‘Signaling of Hepatocyte Growth Factor Receptor’ and ‘Growth Hormone Signaling’) were highly enriched with association signals (PES < 0.001, False Discovery Rate (FDR) = 0.118). Our clustering analysis revealed that pathways containing key components of the RAS/RAF/MAPK canonical signaling cascade, were significantly more likely to have excess of association signals than expected by chance (PARTP = 0.0051, FDR = 0.07). These results suggest that genetic alterations associated with these three pathways and one canonical signaling cascade may contribute to breast cancer susceptibility.
Breast cancer is a complex disease with a well established genetic component (1-4). Different genetic methodologies have been employed in the last twenty years to identify multiple susceptibility loci that vary in population frequency, relative risk and potential functional role. For example, familial linkage and positional cloning studies found that rare mutations in the genes BRCA1, BRCA2, TP53, and PTEN, are associated with a 10-fold to 20-fold increase in breast cancer risk (5-8). Variations in these genes are thought to contribute to breast cancer susceptibility through different cellular mechanisms. While BRCA1 and BRCA2 belong to the DNA repair mechanism in the cells (9, 10), TP53 and PTEN are tumor suppressor genes that participate in processes related to cell cycle control and cell proliferation (11, 12). Further interrogation of candidate genes associated with these cellular processes has led to the discovery of additional rare genetic variants conferring moderate relative risks (2-3 fold) of breast cancer (13-16).
Recently, Genome-wide association studies (GWAS) have become a key paradigm in genetic studies of complex diseases. These studies are particularly powerful in identifying common-low penetrance risk alleles. Indeed, several novel markers with low (< 2) relative risk for breast cancer have been detected by this approach (17-21). However, these reported loci are only those that reached a stringent statistical “genome-wide” significance criterion, while less attention has been given to the other hundreds of thousands of genotyped markers. Therefore, employing new methods to the existing GWAS data may provide additional biological insights and highlight new candidate loci. To this end, a pathway-based approach is particularly appealing. This method examines whether the cumulative contribution of genes with a common biological denominator is greater than expected by chance. This approach has recently been applied to GWAS of several non-cancer complex diseases (22-25).
In this study we applied pathway analysis to the breast cancer GWAS of the Cancer Genetic Markers of Susceptibility (CGEMS) project of the National Cancer Institute (NCI). This study has recently identified significant association of single-nucleotide polymorphisms (SNPs) in the FGFR2 gene with breast cancer susceptibility (18). We used the modified gene-set enrichment analysis (GSEA) of Wang et al. (22) to identify an excess of genotype-phenotype association signals in pathways from different resources. Finally, we examined whether the excess of association signals in different pathways is driven by the same subset of genes comprising a common biological module. These analyses illuminated three pathways and one canonical cascade that are possibly involved in genetic susceptibility to breast cancer.
We collected pathway data from three widely-used resources: the BioCarta pathway database (26), the Kyoto Encyclopedia of Genes and Genomes (KEGG) (27), and the NCI’s Protein Interaction Database (PID) (28). Genes belonging to these pathways were associated with SNPs included in the Illumina’s Sentrix HumanHap550 genotyping BeadChip that was used in the CGEMS GWAS. SNPs were mapped to genes if they were located within a genomic region encompassing 20kb 5′ upstream and 10kb 3′ downstream of the gene’s coding region (NCBI’s human genome build 36.3). Since FGFR2 is highly associated with breast cancer in CGEMS (18), we excluded all SNPs mapped to this gene from our analysis. Finally, we restricted our analysis to pathways with 10-100 genes so as to alleviate the multiple testing problem by avoiding testing too narrowly- or too broadly- defined functional categories. A complete list of the studied pathways is available in supplementary Table 1.
The CGEMS breast cancer GWAS includes 1145 postmenopausal women of European ancestry with invasive breast cancer and 1142 controls. Associations between single nucleotide polymorphisms (SNPs) and breast cancer were determined using the Cochran-Armitage test for trend (29). Each gene Gj(j = 1, …, N, where N is the total number of gene in our dataset) was assigned the chi-square trend test statistic of the SNP with the highest statistical significance among all genotyped SNPs that mapped to its region (rj). Next, for each pathway (S) we ordered its gene’s test statistics (rj S) from highest to lowest, and used a weighted Kolmogorov-Smirnov procedure to calculate enrichment score (ES):
where WS = ΣGj*S|rj*| and NH is the number of genes in a pathway. This statistics reflects the relative overrepresentation of genotype-phenotype association signals in a particular set of genes (30). We used a permutation procedure (10,000 permutations, permuting the case-control status) to assess the significance of the pathway-based ES. This procedure generates a null distribution for the ES of each pathway based on its real genotype data and therefore accounts for the variable SNP count and the intrinsic correlation between SNPs within the different genes. Finally, a normalized enrichment score (NES) was calculated for each pathway in the observed and permutated data to allow a direct comparison of pathways of different sizes as suggested by Wang et al. (22). The calculation of the normalized enrichment score for each pathway NESS relied on the ESS, and the mean and standard deviation (SD) of in all permutations for a given pathway (S):
These data were then used to calculate a false discovery rate (FDR) (31) that controls for the proportion of expected false positive findings to be under a certain threshold. For a pathway with an NES value of NESS*, the FDR is calculated as:
Pathways with FDR < 0.2 were considered noteworthy.
To explore the relationships between pathways in our database, we defined the overlap between the set of genes in pathway A and the set of genes in pathway B as the percentage of genes common to both pathways (A and B) and calculated it as:
Next, we used a hierarchical clustering algorithm to assemble pathways with similar gene content. This algorithm iterates over all pathways, finds the most similar pair of pathways and groups them together. This process repeated until all pathways are grouped in one cluster in a hirarchial manner. To determine clusters of significantly related pathways, we used the same guidelines of the confidence interval algorithm originally developed for LD block determination in the genome described by Gabriel et al. (32). In short, clusters were defined as containing significantly related pathways if they included ≥ 5 pathways, and if ≥ 95% of all the pairwise overlap scores were greater than 9.6% (the 95th percentile of all pairwise overlap scores in our database).
Finally, we applied the ‘adaptive rank truncated product’ method (33) using the 10,000 permutated data sets, to assess whether pathways within the defined clusters had higher gene-set enrichment scores than expected by chance. This method assesses the null hypothesis that a cluster of pathways do not contain excess of pathways with high enrichment scores. Denote the ordered P-values of enrichment scores of pathways belonging to a particular cluster as (p1 ≤ … ≤ pL), with p1 being the smallest P-value. To test for the global null hypothesis, the ARTP procedure calculate the RTP statistics over all possible truncation points (k), and assess the significance of the best RTP using the permutation data.
Overall, 421 pathways and 3962 genes (164/996, 155/3088 and 102/1317 pathways/genes from BioCarta, KEGG and PID respectively) were included in our study. These were represented by 69,525 of the 528,173 SNPs (13.2%) that were originally genotyped in this GWAS. A significant variation was seen between pathway resources. Over 90% of the pathways from BioCarta in our database had <30 genes, while pathways from KEGG and PID were generally larger and more variable in size (mean gene count = 44 and 36, SD = 24 and 16 respectively). This variation between databases was evident even in pathways nominally representing the same biological processes. For example, the mTOR signaling pathway included 19, 24 and 49 genes in BioCarta, PID and KEGG respectively. Interestingly, pathways from BioCarta tended to be ranked higher (higher enrichment scores) than KEGG pathways but not than PID pathways (Kruskal-Wallis test P = 0.013 and P = 0.151 respectively). No rank differences were seen between pathways from PID and KEGG.
Of the 421 examined pathways, twenty one were significantly enriched with association signals at the p-value < 0.05 level (Table 1). Of these, ‘Syndecan-1-mediated signaling events‘ from the PID database (Figure 1A; PES = 0.00053), ‘Signaling of Hepatocyte Growth Factor Receptor’ from BioCarta (Figure 1B; PES = 0.00063), and ‘Growth Hormone Signaling’ from BioCarta (Figure 1C; PES = 0.00084), had distinctly smaller p-values than the rest of the pathways, with a noteworthy False Discovery Rate (FDR) value of 0.118. For brevity, we will further refer to these three pathways as ‘syndecan-1 signaling’, ‘c-Met signaling’ and ‘GH signaling’ respectively.
Further examination of the gene content of these three pathways revealed some overlap. For example, the gene mitogen-activated protein kinase 3 [MAPK3] is involved in all three pathways, but does not appear to contribute to their significant enrichment scores due to its non-significant association with breast cancer risk in this study (p-value = 0.35). In contrast, the gene hepatocye growth factor [HGF] was the strongest associated gene in the ‘Syndecan-1 signaling‘ and the ‘c-Met signaling’ pathways (Ptrend = 0.0007, for rs975360) and therefore a major contributor for their high enrichment scores. Another two genes (mitogen-activated protein kinase 1 [MAPK1], and met proto-oncogene (hepatocyte growth factor receptor) [MET]) were shared by these two pathways. Similarly, 6 other genes (growth factor receptor-bound protein 2 [GRB2], v-Ha-ras Harvey rat sarcoma viral oncogene homolog [HRAS1], mitogen-activated protein kinase kinase 1 [MAP2K1], phosphoinositide-3-kinase, catalytic, beta polypeptide [PIK3CB], v-raf-1 murine leukemia viral oncogene homolog [RAF1] and son of sevenless homolog 1[SOS1]) were shared by the pathways: ‘c-Met signaling’ and ‘GH Signaling’.
In light of the considerable overlap between the three most significant pathways in our study, we hypothesized that common biological modules may underlie the enrichment signals in multiple biological pathways. To explore this hypothesis, we used hierarchical clustering to identify pathways with overlapping genes likely to constitute common biological process. Consequently, sixteen clusters of highly related pathways were characterized (Figure 2A). Of these, cluster C1 (Figure 2B) was significantly more likely to include pathways with high enrichment scores than expected by chance (adaptive rank-truncated product (ARTP) P = 0.0045, FDR = 0.07). Further examination of this cluster revealed that it contained 2/3 of the top three pathways (‘Signaling of Hepatocyte Growth Factor Receptor’, and ‘Growth Hormone Signaling’) and it was dominated by the genes GRB2, SOS1, HRAS, RAF1, MAP2K1, and MAPK3, which are major components of the RAS/RAF/MAPK canonical signaling cascade (Figures 1B,1C).
To assess the power of our analysis to identify true susceptibility pathways, we constructed an artificial positive control pathway from 11 genes that were previously associated with breast cancer risk in different GWAS of postmenopausal women of European ancestry (see details in Supplemental Table 2). Applying the pathway analysis to this pseudo-pathway result with a PES = 0.0126 that ranked it 8th out of 422 pathways. It is noteworthy that most of the individual genes in the pseudo pathway had unremarkable significance (Supplemental Table 2, last column) in the CGEMS study itself. Thus the high rank of this artificial positive control pathway demonstrates the ability of the gene-set enrichment analysis to detect pathways containing multiple genes that individually have weak association signals and may be missed by standard single-SNP analysis of GWAS data.
Three pathways and one signaling cascade are highlighted in this study. The top ranked pathway in this study, ‘syndecan-1 signaling’, contains 13 genes involved in different cellular processes that are mediated by syndecan-1 [SDC1]. This gene encodes a transmembrane heparan sulfate proteoglycan which mediates signal transduction cascades leading to cell proliferation, cell migration and cell adhesion processes following interactions with extracellular matrix proteins. There are multiple lines of evidence for a potential role of syndecan-1 in breast cancer development. For example, altered syndecan-1 expression has been detected in several different tumor types and has been linked with unfavourable breast cancer prognosis (34). Additionally, expression of syndecan-1 by stromal fibroblasts, has been shown to promote breast carcinoma growth in vivo and stimulates tumor angiogenesis (35). In this GWAS, SDC1 was only moderately associated with breast cancer susceptibility (Ptrend = 0.019 for rs7563245) however, it is a key mediator of different pathways illuminated in this study (e.g. ‘c-Met Signaling’ and ‘Fibroblast Growth Factor Signaling’; see discussion below). Therefore, it can modulate breast cancer susceptibility through different biological mechanisms.
The second highest ranked pathway in our analysis, ‘c-Met signaling’ consists of 33 genes participating in signal transduction mechanisms induced by the tyrosine-kinase proto-oncogene c-Met [MET]. Stimulation of the c-Met pathway can lead to several different cellular processes related to tumor growth and progression such as proliferation, enhanced cell motility, invasion, and apoptosis (36). Moreover, both c-Met and its ligand, hepatocyte growth factor (HGF), have been shown to be dysregulated and correlated with poor prognosis in a number of human malignancies including breast cancer (37). Consequently, this pathway has served as an important therapeutic target for human cancers, particularly through the development of small-molecule that inhibit the c-Met/HGF-dependent signaling (37). Notably, co-expression of SDC1 and MET, the key players in the top two ranked pathways in our study, have been established as a marker signature associated with poor prognosic factors in ductal carcinoma in situ (DCIS) of the breast (38).
The third ranked pathway in our study, ‘GH Signaling’, contains 22 genes participating in cellular mechanisms induced by either growth hormone or insulin receptors. These two receptors as well as the insulin-like growth factor (IGF) receptor are all transmembrane tyrosine kinase receptors inducing cell growth and proliferation. Alteration in the activity of these receptors or their related pathways may lead to hyperplasia and eventually to the development of a tumor (39). Naturally, the ‘GH signaling’ pathway, had considerable overlap with both the ‘IGF-1 signaling’ and ‘insulin signaling’ pathways from BioCarta (32% and 41% respectively), however only the latter had a significant enrichment score in our study (PES = 0.0064). Examining the genes of these three closely related pathways revealed that what differentiates their respective enrichment scores is a SNP in the insulin receptor gene [INSR] with a small p-value (Ptrend = 0.0019 for rs12460755) that is absent from the ‘IGF-1 signaling pathway’.
An interesting pathway related to syndecan-1 signaling is the ‘fibroblast growth factor (FGF) signaling’. This pathway was ranked 10th in our study (Table 1; PES = 0.0219) in spite of the exclusion of the FGFR2 SNPs from our analyses (see Methods). Adding the FGFR2 SNPs to this pathway improved its enrichment signal (PES = 0.0053) that ranked it 4th out of the 421 pathways. This finding in combination with the FGFR2 signal in CEGEMS (18) and elsewhere (17) suggests that variations in other genes involved in FGFR2 signaling may modulate breast cancer susceptibility. An important extension to the gene-set enrichment analysis the clustering analysis that highlighted the RAS/RAF/MAPK canonical signaling cascade as the common denominator of pathways associated with breast cancer risk in this study. This cascade plays an essential role in transmitting extracellular signals from growth factors to promote the growth, proliferation, differentiation and survival of cells, and modification in its activity has been linked to multiple human malignancies (40). Notably, this cascade plays an important role in all three top pathways in this study. It is an integral component in the ‘Signaling of Hepatocyte Growth Factor Receptor’, and ‘Growth Hormone Signaling’ pathways, and a transducer for many of the signals initiated by the Syndecan-1 pathway.
An important limitation of the pathway-based approach for GWAS analysis is the incomplete annotation of the human genome. At present, the function of many human genes is unknown and therefore these genes cannot be assigned to known pathways. Moreover, susceptible loci in intergenic regions are also not included in a study of this kind. As a result, when employing this approach, only a small portion of the human genome variation can be studied and therefore it should only be used as a supplementary method to the standard single-SNP analysis of GWAS. Additionally, there is no gold standard for pathway definition, and different databases have different guidelines for their pathways construction and curation. Consequently, the gene content of pathways representing the same biological process may vary substantially between different databases, and this may have a major impact on the sensitivity and specificity of this approach. We aimed at minimizing this effect by selecting pathways from three commonly used and manually curate resources. Still, considerable differences were observed between similar pathways from different resources. For example, the ‘c-Met signaling’ pathway from BioCarta contained 33 genes and was ranked second in our analysis, while a pathway with the same name and presumably the same cellular function in the PID database included 52 genes but was ranked only 69th. Although 30 of the 33 genes in the BioCarta pathway were also included in the PID pathway, the remaining 22 genes likely attenuated the signal in the PID pathway. These differences emphasize the importance of further synthesizing the results of such pathway-based approach. The clustering analysis we applied in this study is one way to do so, as it helps in finding the common genes underlying enrichment signals in multiple pathways and allows one to focus on a limited number of candidate genes. Other methods aiming to improve the characterization of pathways and their overlap might be useful.
A second limitation of our study is that it does not include validation of the results using a completely independent dataset. The experiment we conducted using a positive-control-pathway with known breast cancer susceptibility genes provides us validation that our approach has high power to detect pathways containing multiple weakly associated susceptibility genes. Therefore, the top ranked pathways in our study should be prime targets for future analyses in independent datasets.
In conclusion, our results suggest that genetic alterations associated with the top three pathways and one canonical signaling cascade in our study may contribute to breast cancer susceptibility. Ultimately, additional studies would be needed to confirm and further explore the genetic variations underlying the association of these pathways with breast cancer. Moreover, this study highlights the potential insight that could be gained by pathway-based approach as a complimentary method to the primary single-SNP analysis of GWAS. Particularly, the organization of multiple association signals according to underlying biological processes may improve our understanding about the cellular mechanisms underlying this too common malignancy.
This research was supported in part by the Intramural Research Program of the NIH/National Cancer Institute.
The authors have no conflicts to disclose.