We applied methods based on the aforementioned approaches to detect signals of positive selection (MuSiC [10]) to the unified analysis of all tumors (see Methods). To evaluate the quality of the lists of driver candidates produced by each method, and by their combinations, we computed their content of known cancer genes. To that end, we employed the Cancer Gene Census (CGC) [3] as the most reliable catalog of known cancer genes to date. Nevertheless, because the CGC is biased and, arguably, many cancer genes are yet to be uncovered, we consider the rate of CGC genes in each list simply as a surrogate estimator of the actual positive predictive value of each method or combination (see Discussion). Applying this principle, we found that the four methods prioritized lists of genes highly enriched for known cancer drivers. Moreover, applying a more stringent cutoff of statistical significance increased the proportion of known cancer genes retrieved. This proportion was higher among genes exhibiting more than one signal of positive selection. In other words, the likelihood that a gene is involved in tumorigenesis increased with the number of methods that identified it, probably because the false positives of one method are likely to be discarded by the others. For example, only 84 out of 232 recurrently mutated genes (MuSiC), or 87 out of 259 FM biased genes, are also identified by other methods. However, among these genes the proportion in the CGC rises from 22% and 25%, respectively, to 54%. On the other hand, genes missed by one method may be identified by others designed to detect other signals of positive selection. For instance, while RB1 shows both clear recurrence and FM bias, it has no detectable CLUST or ACTIVE bias. Mutations in HRAS are both significantly clustered and biased towards high functional impact, but are neither significantly recurrent nor ACTIVE biased. BRAF, on the other hand, shows all signals of positive selection except FM bias.
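The evaluation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function name, the method labels and the placeholder gene identifiers (`G1` to `G7`) are all assumptions made for the example. It counts how many methods detect each gene and reports, for each count, the fraction of genes found in a reference catalog such as the CGC.

```python
from collections import Counter


def reference_rate_by_signal_count(method_lists, reference_genes):
    """Fraction of genes in the reference catalog, grouped by how many
    methods (signals of positive selection) detected each gene."""
    # Count in how many method lists each gene appears.
    signal_count = Counter()
    for genes in method_lists.values():
        signal_count.update(set(genes))

    totals, hits = Counter(), Counter()
    for gene, n_signals in signal_count.items():
        totals[n_signals] += 1
        if gene in reference_genes:
            hits[n_signals] += 1
    return {n: hits[n] / totals[n] for n in sorted(totals)}


# Toy input with placeholder gene names (hypothetical, not study data).
method_lists = {
    "recurrence": ["G1", "G2", "G3", "G4"],
    "fm_bias": ["G1", "G2", "G5"],
    "clust_bias": ["G1", "G6"],
    "active_bias": ["G7"],
}
reference_genes = {"G1", "G2", "G5"}

rates = reference_rate_by_signal_count(method_lists, reference_genes)
print(rates)
```

In this toy input, one of the five single-signal genes is in the reference set, whereas both genes detected by two or more methods are, which mirrors the qualitative trend reported in the text.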
Figure 1 (A) Illustration of the four signals of positive selection used to identify driver genes and the methods that implement them. (B) Venn diagram showing the contribution of each method in number of genes that it detects to the list of HCDs. The names of ...
Validation of methods' output lists of driver candidates and the approach taken to combine them.
Pooling all pan-cancer samples together (pan-cancer analysis) increases the statistical power to detect drivers acting across tumor types, thus facilitating the identification of driver genes that are not detected when each tumor type is analyzed individually. However, the pan-cancer analysis may also dilute the signal of mutations in drivers that act only in certain tumor types (Supplementary Fig. 1). To overcome this issue, we also analyzed each tumor type separately (per-project analysis) and added the genes identified in each project to those detected across all pan-cancer tumors (see Methods).
Next, we decided to combine the resulting 48 (four pan-cancer and 44 per-project) lists of driver candidates. We discarded the direct combination of p-values or rankings of the genes across the lists, because they reflect different signals of positive selection in different tumor types. For example, a gene exhibiting the four signals of positive selection to a mild degree across several tumor types is not necessarily a better candidate than another with one stronger signal in an individual tumor type. More elaborate combination approaches based, for example, on Bayesian classifiers or other machine learning methods are infeasible due to the lack of a gold standard dataset of drivers and passengers with which to optimize the combination. Instead, we used a rule-based approach exploiting our current knowledge of the features of cancer genes (Supplementary Fig. 2). To construct a list of high-confidence drivers (HCDs), we first selected 130 genes that exhibit more than one signal of positive selection in the pan-cancer (or any per-project) analysis. This may leave out drivers with only one signal of positive selection. To rescue some of those while keeping the false-positive rate within HCDs as low as possible, we included 40 CGC genes with one signal of positive selection. Furthermore, we upgraded to the HCD list 81 genes detected by a single approach that functionally interact with at least one HCD, considering all Pathway Commons [20] database connections except the less specific direct protein-protein interactions. In addition, we populated a list of Candidate Drivers (CDs) with 144 one-signal genes that participate in protein-protein interactions with HCDs. (See Methods and Supplementary Fig. 2 for details.) Finally, we added to the HCD list another 40 significantly mutated genes identified by the most recent version of MutSig, which also combines three signals of positive selection (Supplementary Fig. 3). (Note that because these genes are already selected based on a combination of signals of positive selection, we take the union rather than the intersection of the MutSig list of significantly mutated genes with our own HCD list.) In summary, we provide a very reliable list of 291 HCDs and a second list of 144 CDs, which is more comprehensive but expected to have a higher false-positive rate (Supplementary Table 2).
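The rule-based combination can be summarized as a short sequence of set operations; the four groups added to the HCD list account for its size (130 + 40 + 81 + 40 = 291). The Python sketch below is only an illustration under stated assumptions: the function and argument names are hypothetical, the gene identifiers are placeholders, and the rescue by functional interaction is applied in a single pass against the HCDs collected so far, which the text does not specify.

```python
def build_driver_lists(signals, cgc, functional_links, ppi_links, mutsig):
    """Apply the rules in order and return (hcd, cd) as sets of gene names.

    signals:          gene -> set of signals of positive selection detected
    cgc:              set of known cancer genes
    functional_links: gene -> set of functional (pathway) interaction partners
    ppi_links:        gene -> set of direct protein-protein interaction partners
    mutsig:           set of significantly mutated genes from MutSig
    """
    multi = {g for g, s in signals.items() if len(s) > 1}
    single = {g for g, s in signals.items() if len(s) == 1}

    hcd = set(multi)        # rule 1: more than one signal
    hcd |= single & cgc     # rule 2: one signal and known cancer gene

    # Rule 3 (single pass, an assumption): one-signal genes that
    # functionally interact with an HCD selected by rules 1-2.
    seed = set(hcd)
    hcd |= {g for g in single - hcd if functional_links.get(g, set()) & seed}

    # Candidate drivers: remaining one-signal genes with a PPI to an HCD.
    cd = {g for g in single - hcd if ppi_links.get(g, set()) & hcd}

    hcd |= mutsig           # rule 4: union with the MutSig list
    cd -= hcd               # a gene promoted to HCD is not also a CD
    return hcd, cd


# Toy input with placeholder genes (hypothetical, not study data).
signals = {
    "A": {"recurrence", "fm_bias"},
    "B": {"fm_bias"},
    "C": {"clust_bias"},
    "D": {"recurrence"},
    "E": {"fm_bias"},
}
hcd, cd = build_driver_lists(
    signals,
    cgc={"B"},
    functional_links={"C": {"A"}},
    ppi_links={"D": {"B"}},
    mutsig={"F"},
)
print(sorted(hcd), sorted(cd))
```

Gene `E` has a single signal and no supporting evidence, so it ends up in neither list, which is how the rules keep the false-positive rate of the HCD list low.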
When HCDs are mapped to a functional interaction network (see Methods), they appear enriched for biological processes within five broad modules (Chromatin remodeling, mRNA processing, Cell signaling/proliferation, Cell adhesion, and DNA repair/Cell cycle), which loosely correspond to both established and emergent cancer hallmarks (Supplementary Table 3). Thirteen selected non-CGC, or novel, cancer genes are depicted within their functional interaction context. These novel driver candidates appear alongside other well-established cancer genes. One may thus hypothesize that as more tumor genomes are sequenced, new lowly recurrent mutational drivers in these modules will emerge. To illustrate this idea, well-known cancer genes within the Cell cycle pathway, for example, are schematically represented together with less well-established HCDs. Examples of novel cell cycle driver candidates include ATR, a kinase that phosphorylates p53 and other proteins, such as CHK1 and RAD17 [21], and that has been associated with tumors with hypermutator phenotypes when defective. ATR is included in the HCD list because it is both recurrently mutated and FM biased in UCEC. CDKN1A and CDKN1B, inhibitors of cyclin-dependent kinase activity [22,23] that mediate the role of TP53 in the arrest of cellular proliferation after DNA damage, also appear to drive tumorigenesis in several pan-cancer samples alongside other well-known cell cycle genes. CDKN1A is recurrently mutated and FM biased in BLCA and in the pan-cancer analysis, whereas CDKN1B is recurrently mutated and FM biased in BRCA. Both genes are also detected by MutSig. On the other hand, in the broad module of signal transduction and proliferation, PIK3CG and PIK3CB, within the PIK3-AKT signaling pathway, appear to complement the tumorigenic role of PIK3CA. Collectively, these kinases are key in the transduction of information from receptors on the outer membrane of eukaryotic cells to effectors in the nucleus [24,25,26]. They are named after their catalytic subunits. Deregulation of PIK3CG and PIK3CB has previously been linked to tumor progression [27,28,29,30]. PIK3CB exhibits a significant FM bias and PIK3CG a significant mutational recurrence, both in the pan-cancer analysis. Thus, they are both included in the HCD list based on their functional interactions with other HCDs, such as PIK3CA. Finally, FOXA1 and FOXA2 are general transcriptional regulators involved in opening the chromatin to make DNA accessible to other regulators [31,32]. They are both misregulated in several malignancies [33,34,35,36,37]. While FOXA1 is both recurrently mutated and FM biased in BRCA, FOXA2 is recurrently mutated and FM biased in UCEC and recurrently mutated and CLUST biased in the pan-cancer analysis. In summary, these likely non-CGC driver candidates (25 are detailed in Supplementary Table 4) help to complete the landscape of tumor-causing mechanisms in known cancer pathways.
Figure 3 (A) Network representation of HCDs. Trimmed version of the functional interaction network integrated by 124 HCDs that either map to the five broad biological modules enriched among HCDs or connect them. Genes annotated in the CGC are represented as round ...
Figure 4 (A) Diagram showing 13 selected candidate cancer genes within their functional interaction context. (B) Heat-map depicting the frequency and number of samples with PAMs of the 13 selected ‘novel’ cancer genes in each tumor type and in ...
Amongst HCDs, only TP53 and PIK3CA have protein-affecting mutations, or PAMs (non-synonymous, stop, splice site and frameshift indels), in more than 10% of pan-cancer samples. Another 51 genes, some of which are not well-established drivers, bear PAMs in more than 10% of samples of at least one tumor type (Supplementary Fig. 4). Interestingly, 16 HCDs have a clear bias (Fisher's odds ratio > 25) towards sustaining PAMs in one tumor type with respect to others (Supplementary Fig. 5). (We checked that Fisher's results were not biased towards tumor types with higher mutation rates; see Methods and Supplementary Fig. 7.)
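To make the tumor-type bias criterion concrete, the sketch below computes the odds ratio and a one-sided Fisher's exact p-value for a 2x2 table of samples (one tumor type versus all others, with versus without PAMs in a given gene). It is a minimal illustration using only the Python standard library; the function name and the counts are hypothetical and are not taken from the study, and the exact test configuration used by the authors is described in their Methods.

```python
from math import comb


def tumor_type_bias(a, b, c, d):
    """Odds ratio and one-sided Fisher's exact p-value for the table

                      with PAM   without PAM
        tumor type T      a           b
        other types       c           d
    """
    # Sample odds ratio; reported as infinite when b * c is zero.
    odds_ratio = float("inf") if b * c == 0 else (a * d) / (b * c)

    n_type = a + b          # samples of tumor type T
    n_pam = a + c           # samples with a PAM in the gene
    n_total = a + b + c + d

    # Hypergeometric tail: probability of seeing at least `a` samples
    # with a PAM in tumor type T if PAMs were spread at random.
    tail = sum(
        comb(n_type, k) * comb(n_total - n_type, n_pam - k)
        for k in range(a, min(n_type, n_pam) + 1)
    )
    p_value = tail / comb(n_total, n_pam)
    return odds_ratio, p_value


# Hypothetical counts: 30 of 100 samples in one tumor type carry a PAM,
# versus 10 of 900 samples in all other types.
odds_ratio, p_value = tumor_type_bias(30, 70, 10, 890)
print(round(odds_ratio, 2), p_value < 0.05)
```

With these made-up counts the odds ratio is 30 × 890 / (70 × 10) ≈ 38.1, which would pass the threshold of 25 quoted in the text.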
Further support for the mutational drivers identified by our combined methodology stems from the analysis of copy number alterations (CNAs) across pan-cancer samples. Many HCDs are also affected by CNAs, and 38 of them are significantly altered according to GISTIC [38] and/or highly biased towards misregulation due to CNAs according to OncodriveCIS [39] (Supplementary Fig. 6). Therefore, these are also likely involved in tumorigenesis upon deletion (tumor suppressors) or amplification (oncogenes).
It has previously been suggested that tumorigenesis requires 5–7 driver mutations in common epithelial cancers, while hematological and pediatric malignancies may require fewer [8,40,41]. Even under the assumption that the HCD list is not complete, it allows us to explore this question. Pan-cancer tumors have a median of 4 PAMs in HCDs, although this number varies widely depending on the cancer type; OV and AML tumors exhibit the lowest rate (median of 2), whereas BLCA (9.5), LUSC (9) and LUAD (9) have the highest. Most tumors (94%) have at least one HCD bearing a PAM. Again, AML tumors present the highest rate of samples without PAMs in HCDs (16%), highlighting the possible relevance of other alterations in this cancer type.
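The per-tumor-type summary reported above (median number of PAMs in HCDs per sample, and the share of samples with at least one) could be computed along the following lines. This is a sketch under assumptions: the function name and toy samples are hypothetical, and it counts distinct HCD genes bearing a PAM in each sample, whereas the study may have counted individual mutations.

```python
from collections import defaultdict
from statistics import median


def summarize_pams_in_hcds(sample_pams, sample_type, hcds):
    """Per tumor type: median number of HCDs with a PAM per sample and
    fraction of samples with a PAM in at least one HCD."""
    counts_by_type = defaultdict(list)
    for sample, genes in sample_pams.items():
        # Assumption: count distinct HCD genes hit, not single mutations.
        counts_by_type[sample_type[sample]].append(len(set(genes) & hcds))

    return {
        tumor_type: {
            "median": median(counts),
            "frac_with_hit": sum(n > 0 for n in counts) / len(counts),
        }
        for tumor_type, counts in counts_by_type.items()
    }


# Toy input (hypothetical samples and tumor types, not study data).
sample_pams = {
    "s1": ["TP53", "PIK3CA", "GENE_X"],
    "s2": ["GENE_X"],
    "s3": ["TP53"],
    "s4": [],
}
sample_type = {"s1": "T1", "s2": "T1", "s3": "T2", "s4": "T2"}
summary = summarize_pams_in_hcds(sample_pams, sample_type, {"TP53", "PIK3CA"})
print(summary)
```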
(A) Histogram of the proportion of samples in the pan-cancer dataset with PAMs in HCDs. (B) Proportion of samples in each cancer type with PAMs in HCDs.