Previous studies by Sjöblom et al. and Wood et al. identified significant clustering of mutations in the “genomic landscapes” of human breast and colorectal cancers. Despite the need for larger samples to reach more accurate conclusions [1], these early studies demonstrated the potential of genome-wide studies to capture, in a single study, decades of research into the association of individual genes with cancer. However, because mutations in the gene hills are rare, the authors concluded that these less frequently mutated genes might be better studied within their pathway contexts to elucidate their functional roles in cancer. Today, it remains a major challenge for genome-wide studies of somatic mutations in cancer to identify rare somatic mutations, that is, mutations occurring in a low percentage of tumor samples, that still contribute to cancer initiation and progression.
By mapping mutations not only to genes but also to the individual domains in which they occurred, we were able to construct mutational landscapes for both genes and domains for 100 colon cancer patients (Figure and ). We also constructed the gene and domain mutational landscapes for 522 breast cancer patients (Figure and ) for comparison with another cancer type. Mapping the mutations to specific domains adds the functional context needed to explain how the mutations potentially contribute to disease. Although only a relatively small number of significantly mutated domains were shared between the colon and breast cancer patients, the method shows the potential of the domain landscape to find commonalities between different cancers at the functional level that might not be apparent at the gene level. Because the domain landscape separates the individual contributions of mutations from distinct genes that fall within a shared domain, it also revealed many properties that are not apparent from traditional gene-based analyses. These include expected instances in which a highly mutated gene contained a highly mutated domain, but also unexpected instances in which a shared domain was highly mutated although the individual genes were not, and instances in which a shared domain still contained mutations from other genes after the mutations from highly mutated genes were removed. We found domains to which all genes contributed mutations relatively equally, and domains in which only one or two genes contributed the majority of mutations. We also found highly mutated domains shared by genes in the same family and by genes from different families.
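The aggregation step described above can be illustrated with a minimal sketch. It assumes a simple interval lookup of mutation positions against domain coordinates; the gene names, domain names, coordinates and mutations are all hypothetical, and the sketch reports raw counts only. It does not reproduce the normalization or the statistical model used in this study.

```python
from collections import defaultdict

# Hypothetical domain annotations: (gene, domain, start, end), protein coordinates.
DOMAINS = [
    ("GENE_A", "DomX", 10, 50),
    ("GENE_B", "DomX", 100, 140),
    ("GENE_B", "DomY", 200, 260),
]

# Hypothetical mutations: (gene, amino-acid position).
MUTATIONS = [
    ("GENE_A", 12), ("GENE_A", 49), ("GENE_A", 75),
    ("GENE_B", 120), ("GENE_B", 210), ("GENE_B", 300),
]


def domain_landscape(domains, mutations):
    """Count mutations per domain, broken down by contributing gene."""
    counts = defaultdict(lambda: defaultdict(int))
    for gene, pos in mutations:
        for d_gene, domain, start, end in domains:
            # A mutation is assigned to a domain if it lies inside the interval.
            if d_gene == gene and start <= pos <= end:
                counts[domain][gene] += 1
    return {domain: dict(per_gene) for domain, per_gene in counts.items()}


def gene_shares(per_gene):
    """Fraction of a domain's mutations contributed by each gene."""
    total = sum(per_gene.values())
    return {gene: n / total for gene, n in per_gene.items()}


landscape = domain_landscape(DOMAINS, MUTATIONS)
for domain, per_gene in landscape.items():
    print(domain, per_gene, gene_shares(per_gene))
```

Keeping the per-gene breakdown, rather than a single total per domain, is what makes it possible to distinguish domains with evenly spread contributions from domains dominated by one or two genes.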
Comparison of our gene-based landscape for colon cancer to the landscape constructed by Wood et al.
revealed similar topographies: a few highly mutated gene mountains along with a much larger number of still significantly mutated gene hills. There was a relatively small overlap between the 154 genes identified by our study and the 140 CAN genes; only six genes were found to be significant in both studies. As noted, the two studies also used different tumor samples and different statistical models to determine significant mutation frequencies. Yet, despite these differences, four of the top five colorectal CAN genes (APC
) ranked in the top twenty genes with the highest normalized mutation frequency, and the fifth top CAN gene (PIK3CA
) was identified to have a significantly mutated domain. We also identified seven genes with significant mutation frequency from the Cancer Gene Census [15
] list known to have somatic mutations in colorectal cancers including the top five CAN genes, NRAS
. A GO term enrichment analysis of all 154 significantly mutated genes in our study identified enrichment in many biological processes and molecular functions known to be disrupted in cancer development including signal transduction, regulation of apoptosis, regulation of cell proliferation and DNA damage response.
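As an illustration of how a GO term enrichment of this kind is commonly assessed, the following sketch computes a one-sided hypergeometric tail probability. This is a generic formulation and an assumption on our part, not necessarily the test or tool used in this study; the background size, annotation count and hit count in the example are hypothetical, and only the list size of 154 genes comes from the text. Correction for multiple testing across GO terms is not shown.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20000 background genes, 500 annotated with the term,
# 154 significantly mutated genes, 15 of which carry the term.
p = hypergeom_enrichment_p(20000, 500, 154, 15)
print(f"enrichment p-value: {p:.3g}")
```

Exact integer arithmetic with `math.comb` avoids external dependencies and floating-point underflow for gene lists of this size.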
Our analysis of the gene landscape re-identified genes with known cancer associations and confirmed the enrichment of genes involved in processes critical to cancer development. This validates that our method can identify significantly mutated genes relevant to cancer, and provides evidence that the method can be applied to other specified regions within the genome, including domain regions. The main focus of this study, however, was the construction of the domain mutational landscape for colon cancer and its comparison with the gene-based mutational landscape. In total, we identified 45 domains with significant mutation frequency in the colon tumor samples. Like the gene-based landscape, the domain landscape was characterized by mountains and hills, with the highest peaks in the P53, APC_crr and CENP-B_N domains. The CENP-B_N domain, a known DNA-binding domain [22
], receives mutations from the TIGD7
genes. Although TIGD7
are both homologs of the Jrk “jerky” gene associated with epilepsy in mice [23
], they do not have known relevance in cancer development. The peaks for P53 and APC_crr were not surprising due to the well-known tumor suppressing functions of the genes containing the domains, TP53
, respectively. However, mapping mutations to the individual domains illustrates the value of our domain-centric method in providing the functional context needed to explain the role of the mutations in cancer development. The GO term enrichment analysis confirmed that the significantly mutated domains are enriched for functions important to cancer development, including kinase activity, DNA binding and repair, and signal transduction.
Our study of the domain landscape of cancer mutations also highlights the relevance of considering the modularity of the proteins when studying somatic mutations. Is the whole protein responsible for the disruption that promotes tumor growth, or are only some of the functional units of the proteins relevant? For instance, the P53 domain, also known as the P53 DNA-binding domain, contains over 90% of the known TP53
], even though the P53 DNA-binding domain covers only approximately half of the P53 protein (193 of 393 amino acids). In our study, 27 of the 31 mutations in the P53 protein occurred within the P53 DNA-binding domain. Mutations within the domain have been shown to have multiple detrimental effects, including reduced DNA-binding affinity, protein misfolding, protein instability and loss of the ability to oligomerize (reviewed in [25
]). The APC
gene contains seven repeats of the APC_crr domain that bind to the Arm domains of the beta-catenin protein in addition to thirteen other distinct domains [26
]. Truncating mutations mainly within the region of the protein containing the second and third repeats of the APC_crr domain, also referred to as the “mutation cluster region”, are known to eliminate APC
’s ability to bind and down-regulate beta-catenin, critically impairing its function as a tumor suppressor gene in the Wnt signalling pathway [26
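The P53 figures quoted above (27 of 31 mutations in a domain covering 193 of 393 amino acids) can be put in perspective with a simple worked calculation. The sketch below assumes, purely for illustration, that mutations would otherwise fall uniformly along the protein, and computes a one-sided binomial tail probability. This null model is our simplification and is not the statistical model used in this study.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Figures from the text: domain length, protein length, mutation counts.
domain_fraction = 193 / 393    # share of the protein covered by the domain
observed_fraction = 27 / 31    # share of mutations falling in the domain

# Assumption: under the null, each mutation lands in the domain with
# probability equal to the domain's share of the protein length.
p_value = binomial_tail(31, 27, domain_fraction)

print(f"expected fraction in domain: {domain_fraction:.3f}")
print(f"observed fraction in domain: {observed_fraction:.3f}")
print(f"one-sided binomial tail:     {p_value:.3g}")
```

Under this uniform assumption, roughly half of the mutations would be expected in the domain, whereas close to nine in ten were observed there, and the resulting tail probability is far below conventional significance thresholds.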
Despite not reaching significance at the gene level in our colon cancer mutation set, the PIK3CA
gene had one of the five highest normalized mutation frequencies in the breast cancer set (see Additional file 3
), and was also a top colorectal CAN gene in the Wood et al. study. PIK3CA
functions in signal transduction pathways to mediate signalling for processes such as cell growth and survival, and has been found to be oncogenic in several different cancer types [28
contains a total of five domains, so we compared the domain peaks identified by our method to the domains identified with high mutation prevalence, a measure commonly applied to identify genes mutated in a high percentage of patients. We found that while the PI3K_p85B domain, which is responsible for binding the PI3K p85 subunit to form a heterodimer [29
], was identified as a significant domain peak in both cancer types, the domain had a high mutation prevalence (threshold of 0.04) only in the colon cancer set (Figure 5). We also did not find significant mutation frequency or high mutation prevalence in the PI3K_rbd (RAS-binding) domain or in the PI3K_C2 domain, which contains signals for the cellular localization of the PIK3CA protein [30
]. The final two domains in the gene, the PI3Ka helical domain and the PI3_PI4_kinase domain, contain known somatic missense mutation hotspots in a variety of cancer types including colon and breast cancer [31
]. Only the PI3Ka helical domain had significant mutation frequency and high prevalence in the breast cancer dataset; it did not reach significant mutation frequency in the colon cancer set. We also found few mutations from either cancer set in the PI3_PI4_kinase domain; however, the C-terminal region of the domain is believed to be partially disordered [32], likely preventing alignment of the domain model to that region. As a result, mutations in the hotspot were not assigned to the domain.
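Mutation prevalence, as used in this comparison, is the fraction of patients carrying at least one mutation in a given region. A minimal sketch follows, assuming a per-patient set of mutated domains; the cohort is hypothetical, and treating the 0.04 threshold as inclusive is our assumption.

```python
PREVALENCE_THRESHOLD = 0.04  # threshold quoted in the text


def mutation_prevalence(patient_domains, domain):
    """Fraction of patients with at least one mutation in `domain`."""
    if not patient_domains:
        return 0.0
    carriers = sum(1 for hit in patient_domains.values() if domain in hit)
    return carriers / len(patient_domains)


def has_high_prevalence(patient_domains, domain, threshold=PREVALENCE_THRESHOLD):
    # Assumption: the threshold is inclusive (>=); the text does not specify.
    return mutation_prevalence(patient_domains, domain) >= threshold


# Hypothetical cohort of 100 patients: patient id -> set of mutated domains.
cohort = {f"P{i:03d}": set() for i in range(100)}
for pid in ("P000", "P001", "P002", "P003", "P004"):
    cohort[pid].add("PI3K_p85B")
cohort["P005"].add("PI3K_C2")

for name in ("PI3K_p85B", "PI3K_C2"):
    print(name, mutation_prevalence(cohort, name), has_high_prevalence(cohort, name))
```

Because prevalence counts patients rather than mutations, a domain with several mutations concentrated in a few patients can reach significant mutation frequency while remaining below the prevalence threshold.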
Figure 5. Comparison of mutation prevalence in PIK3CA domains from colon and breast cancer. Depiction of the mutation prevalence in colon and breast cancer for domains occurring on the PIK3CA gene. Each box represents a distinct domain from the PIK3CA gene.
Together, these examples demonstrate both the advantages and a potential drawback of our domain-based approach. While the traditional, gene-centric view of mutation does not consider the location of mutations within the PIK3CA gene, our domain-centric approach captures the functional modularity of protein domains and enables us to reveal specific domains critical to the cancer development process. Our approach also identifies domains with significant mutation frequency that might be missed by approaches based on mutation prevalence, as illustrated by the identification of significant mutation frequency in the PI3K_p85B domain in breast cancer patients. Yet the power of our approach derives from aggregating mutations from all genes containing a particular domain, which currently restricts our method to identifying significant mutation frequency inside domain regions. More work will be needed to extend the scope of our approach to other regions of the genome.
Comparison of the gene and domain landscapes also enabled us to identify a small number of domains, seven in total, which retained mutations even after the removal of mutations contributed from significantly mutated genes. The WAP domain in particular retained a significant number of mutations aggregated from the WFDC5
genes even after the removal of mutations from the significantly mutated WFDC8
gene. The WAP, whey acidic protein-type, domain contains four disulfide bonds at its core, characteristic of genes with protease inhibitor activity [33
has no known association with cancer; however, WFDC5
has been shown to be upregulated in genes undergoing P53-induced apoptosis [34
], and SLPI
has been shown to promote malignancy in a lung cancer cell line due to its protease inhibitor function [35
]. In addition, mutations in KAL1
are responsible for Kallmann syndrome [36
]. Therefore, because of the known cancer and disease relevance of mutations in the WAP domain of other genes, the presence of mutations in the WAP domain of WFDC8
encourages further study of the role of WFDC8
in colon cancer development.
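The filtering step behind this comparison, removing the contributions of significantly mutated genes and checking which domains still carry mutations, can be sketched as follows. The per-gene counts for the WAP domain and the placeholder gene name are hypothetical, and deciding whether the remaining mutations are still significant would require re-applying the statistical model, which is not shown here.

```python
def residual_domain_counts(domain_gene_counts, significant_genes):
    """Drop contributions from significantly mutated genes; keep the rest."""
    residual = {}
    for domain, per_gene in domain_gene_counts.items():
        kept = {g: n for g, n in per_gene.items() if g not in significant_genes}
        if kept:  # domains left with no mutations are omitted
            residual[domain] = kept
    return residual


# Hypothetical per-gene counts; OTHER_WAP_GENE is a placeholder name.
counts = {
    "WAP": {"WFDC8": 4, "WFDC5": 2, "OTHER_WAP_GENE": 1},
    "P53": {"TP53": 27},
}
significant = {"WFDC8", "TP53"}

residual = residual_domain_counts(counts, significant)
print(residual)
```

In this toy input the P53 domain disappears once its only contributing gene is removed, whereas the WAP domain retains mutations from the remaining genes, which is the pattern described above.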
The examples discussed above, in which significant domain peaks correspond to at least one significant gene peak, constitute only 14 of the 45 significantly mutated domains from the colon tumor set. The other 31 domain peaks correspond to genes without significant mutation frequencies, which go undetected in the gene landscape. Because these domains do not occur in significantly mutated genes, they would likely not be found by traditional, gene-centric studies, but they may reveal the disruption of potentially critical functional mechanisms within the cancer tissues. One of these peaks corresponds to the cortactin-binding protein-2 domain, CortBP2, which was mutated in four genes, CTTNBP2NL
(1 mutation), CTTNBP2
(2 mutations), FILIP1
(5 mutations), and FILIP1L
(1 mutation). Interestingly, FILIP1L
is a highly conserved protein known to inhibit proliferation and migration and increase apoptosis in endothelial cells [37
]. This anti-angiogenic protein acts as a tumor suppressor and its loss of function has been implicated in ovarian cancer, head and neck squamous cell carcinoma and oligodendrogliomas [38
]. While the mutation frequency for the FILIP1L
gene was not significant in our study, CortBP2 ranked in the top 75 domains with the highest mutation frequency, suggesting a novel role in colon cancer development for FILIP1L
and the other genes containing mutations in the CortBP2 domain. As with any in silico
analysis, however, the identification of domains and genes with suspected roles in cancer development can only generate new hypotheses that must ultimately be experimentally validated.