Exome sequencing (exome-seq) has aided in the discovery of a large number of mutations in cancers, yet challenges remain in converting oncogenomics data into information that is interpretable and accessible for clinical care. We constructed DriverDB (http://ngs.ym.edu.tw/driverdb/), a database that incorporates 6079 cases of exome-seq data, annotation databases (such as dbSNP, 1000 Genomes and COSMIC) and published bioinformatics algorithms dedicated to driver gene/mutation identification. We provide two points of view, ‘Cancer’ and ‘Gene’, to help researchers visualize the relationships between cancers and driver genes/mutations. The ‘Cancer’ section summarizes the calculated results of driver genes by eight computational methods for a specific cancer type/dataset and provides three levels of biological interpretation to clarify the relationships between driver genes. The ‘Gene’ section is designed to visualize the mutation information of a driver gene in five different aspects. Moreover, a ‘Meta-Analysis’ function is provided so researchers may identify driver genes in user-defined samples. The novel driver genes/mutations identified hold potential for both basic research and biotech applications.
Next-generation sequencing (NGS) has greatly increased the identification of mutations in cancer genomes and allows researchers to profile the molecular characteristics of various cancer types. In the past few years, applying exome sequencing (exome-seq) in oncogenomics studies has become the norm (1). In addition, enormous amounts of cancer genomics data have been generated by large-scale cancer projects (2) such as The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC), the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) and the Pediatric Cancer Genome Project (PCGP). Although NGS has already helped researchers discover a large number of aberrant events in cancer genomes, translating these data into information that can be easily interpreted and accessed is still challenging.
Cancers are primarily caused by the accumulation of genetic alterations and can be characterized by numerous somatic mutations. However, not all of these mutations are involved in tumorigenesis. Only a subset of mutations contributes to cancer development, whereas the others make little or no contribution. To crystallize this concept, the terms ‘driver’ and ‘passenger’ mutation have been coined (3). The mutations that confer a selective growth advantage to the tumor cell are called ‘driver’ mutations (1). ‘Passenger’ mutations are defined as those that do not confer a growth advantage but that occur in a cell that coincidentally or subsequently acquires a driver mutation (4). In most solid tumors, an average of 33–66 genes with somatic mutations were found to alter their protein products, but the count of non-synonymous mutations varies across cancer types (1). More than 80% of mutations are missense (1), and these mutations vary highly in their functional impact depending on their position and function in the protein and the nature of the replacement amino acid. It remains a significant challenge to identify cancer driver mutations because many observed missense changes are neutral passenger mutations (5). Several computational algorithms have been developed to predict the functional impact of missense mutations based on concepts including evolutionary conservation, structural constraints and the physicochemical attributes of amino acids. In the last few years, machine learning methods have been developed to specifically predict cancer-driving deleterious mutations (6–8).
A driver gene is defined as a gene whose dysfunction will cause tumorigenesis. Vogelstein et al. have demonstrated the fundamental difference between a driver gene and a driver mutation (1). Numerous computational methods to identify driver genes have been published; algorithms such as MutSigCV (9), MuSiC (10), Simon (11), OncodriveFM (12) and ActiveDriver (13) are based on the mutation frequency of an individual gene compared with the background mutation rate. However, background mutation rates among different genome regions and patients are highly variable (9). Recent studies have shown that the mutation rate varies in normal cells by more than 100-fold within the genome (14) and that such variation is higher in tumor cells (15). To correct for this bias, MutSigCV uses patient-specific mutation frequency and spectrum, as well as gene-specific background mutation rates. OncodriveFM incorporates the functional impacts of mutations as additional information. ActiveDriver identifies driver genes with statistically significant mutation rates in phosphorylation-specific regions. Other methods are based on the sub-network approach (16–24), which can identify groups of genes containing driver mutations directly from cancer mutation data, either with or without prior knowledge of pathways or other information on protein/genetic interactions. This approach is particularly successful when the observed frequencies of passenger and driver mutations are indistinguishable, a situation in which single-gene tests fail. Moreover, sub-networks are believed to identify cancer driver genes with low recurrence (25). Most sub-network-based methods, such as MEMo (19), MDPFinder (16), Dendrix (17), Multi-Dendrix (18) and RME (24), identify driver genes by the characteristic of mutual exclusivity. Moreover, sub-network methods can additionally incorporate copy number variation (CNV) data for driver gene identification (16–19,22,24).
In this study, we present the DriverDB database, which incorporates a large amount (>6000 cases) of exome-seq data, annotation databases (such as dbSNP (26), 1000 Genomes (27) and COSMIC (28)), and various bioinformatics algorithms devoted to defining driver genes or mutations. DriverDB focuses on predicting driver genes by various algorithms and provides different aspects of the mutation profiles of an individual gene. We provide two viewpoints, ‘Cancer’ and ‘Gene’, to help researchers visualize the relationships between cancers and driver genes/mutations. A ‘Meta-Analysis’ function is further included in DriverDB to allow researchers to identify driver genes in custom-defined samples according to clinical criteria.
As shown in Figure 1, DriverDB includes mutation profiles from 6079 tumor–normal pairs, including 4397 from TCGA, 861 from ICGC, 112 from PCGP, 238 from TARGET and 471 from published papers (denoted as ‘others’ in Figure 1). Detailed information for the datasets is provided in Supplementary Table S1. The mutation data and CNV data of these pairs were retrieved from the data portal of the projects or from the supplementary data of the published papers, and were then parsed using in-house Perl scripts. To ensure annotation consistency and to make the retrieval process more efficient, clinical information for each sample was manually curated, based on clinical data obtained as mentioned above. Each sample was re-annotated with 38 clinical characteristics. The summary of the clinical information is provided in Supplementary Table S2.
All mutations were mapped to known databases, and their functional impacts were predicted by numerous bioinformatics tools shown in the Annotation module in Figure 1. For annotating known variants, DriverDB incorporates the information collected from different databases including dbSNP, NHLBI GO ESP (29), 1000 genomes, COSMIC, ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/), NHGRI GWAS catalog (30), HGMD-PUBLIC (31) and OMIM (http://omim.org/). We used SnpEff (32) and VEP (33) to predict the effect of each mutation, such as non-synonymous coding, stop gained/lost and frame-shift. In addition, DriverDB scores the deleterious effects and functional impact by seven algorithms, including SIFT (34), PolyPhen2 (35), Condel (36), LRT (37), FATHMM (38), MutationAssessor (39) and MutationTaster (40). Furthermore, we scored each mutation by the number of algorithms that judge the mutation as deleterious (these numbers are denoted as ‘Driver Score’). For example, the mutation g.178952085A>G of PIK3CA, which occurs in >100 patients from various cancer types, was identified as deleterious by seven algorithms; therefore, its Driver Score is 7.
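The Driver Score described above is a simple count, which can be sketched as follows. This is a minimal illustration and not the DriverDB implementation: the function name `driver_score` and the dictionary of boolean calls are hypothetical, and how each tool's raw output is thresholded into a deleterious/non-deleterious call is not specified here, so the sketch assumes those calls are already available.

```python
# The seven algorithms named in the text.
TOOLS = ("SIFT", "PolyPhen2", "Condel", "LRT", "FATHMM",
         "MutationAssessor", "MutationTaster")


def driver_score(calls):
    """Return the number of tools that judge a mutation as deleterious.

    `calls` maps a tool name to True (deleterious) or False; a tool with
    no call is counted as not deleterious (an assumption of this sketch).
    """
    return sum(1 for tool in TOOLS if calls.get(tool, False))


# A mutation judged deleterious by all seven tools, as in the PIK3CA example.
all_deleterious = {tool: True for tool in TOOLS}
print(driver_score(all_deleterious))  # 7
```

A mutation with calls from only some tools receives a correspondingly lower score, so the value always lies between 0 and 7.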
DriverDB utilizes eight computational methods to identify driver genes of cancer types (the Cancer Driver Gene module in Figure 1). Four methods, MutSigCV, Simon, OncodriveFM and ActiveDriver, are based on mutation frequencies and utilize all mutations to identify driver genes.
For the sub-network-based methods, MEMo, Dendrix, MDPFinder and NetBox were used. We applied filters to remove mutations/genes from the analysis; the detailed criteria for each method are described in Supplementary Methods.
For each set of driver genes identified by individual/multiple method(s) in a group of cancer samples, we provide three levels of biological interpretation (Gene Ontology, Pathway and Protein/Genetic Interaction) to help researchers understand the relationships between driver genes. In the ‘Gene Ontology’ part, we used the topGO and GeneAnswers packages of Bioconductor to calculate the topology of the GO graph, as well as to visualize the many-to-many relationships between GO terms and genes. In the ‘Pathway’ analysis, we used collections from KEGG (41), PID (42), Biocarta (http://www.biocarta.com/), REACTOME (43) and MSigDB (44) to annotate driver genes. Detailed information for these eight collections is provided in Supplementary Table S3. Three databases, IntAct (45), BioGRID (46) and iRefIndex (47), were used to interpret the Protein/Genetic Interaction. We also performed the classic Fisher’s exact test and used –log(P value) to score each GO term and Pathway category in the Gene Ontology and Pathway analyses. For the ‘Pathway’ and ‘Protein/Genetic Interaction’ sections in the DriverDB web interface, the Cytoscape Web (48) tool was embedded for interactive network visualization.
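The enrichment scoring above can be illustrated with a short sketch. It assumes a one-sided (over-representation) Fisher's exact test, which is equivalent to a hypergeometric upper tail, and a base-10 logarithm; the text specifies neither the sidedness nor the log base, and the function and parameter names are hypothetical.

```python
from math import comb, log10


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test P value for over-representation.

    N: genes in the background, K: background genes annotated to the term,
    n: driver genes tested, k: driver genes annotated to the term.
    Returns P(X >= k) under the hypergeometric distribution.
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / total


def enrichment_score(k, n, K, N):
    """Score a GO term or pathway category as -log10(P value)."""
    return -log10(fisher_enrichment_p(k, n, K, N))


# Toy numbers (not from DriverDB): 2 of 2 tested genes fall in a
# 2-gene category drawn from a 4-gene background.
p = fisher_enrichment_p(2, 2, 2, 4)
print(p, enrichment_score(2, 2, 2, 4))
```

Larger scores correspond to smaller P values, so categories can be ranked directly by the score.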
The ‘Cancer’ section stores the calculated results of driver genes for a specific cancer type/dataset. First, users define the data type(s) incorporated for driver gene identification (the red rectangle in Supplementary Figure S1A) and then select a specific dataset, for example, ‘Glioblastoma multiforme’ (GBM). The result section then shows the detailed information of the specific dataset (red circle in Supplementary Figure S1B). Users can select a driver gene set identified by ‘N’ methods (the ‘Summary’ in Supplementary Figure S1B; ‘N’ is determined by a drop-down menu) or by an individual method according to the name of the method (Supplementary Figure S1B). For ‘Summary’, a heat map shows the relationship between genes and methods (Supplementary Figure S1C; the blue color indicates genes identified as driver genes by a method). For each driver gene set, a heat map shows the mutation profile of that driver gene set across samples (Supplementary Figure S1D). We also performed functional analysis at three levels of biological interpretation: ‘Gene Ontology’, ‘Pathway’ and ‘Protein/Genetics Interaction’. In the ‘Gene Ontology’ analysis, I and II indicate the topology of the GO graph by topGO and GeneAnswers, respectively (Supplementary Figure S1E), whereas III and IV show the most significant GO terms and genes. The table in Supplementary Figure S1E lists the information of all the significant GO terms. In the ‘Pathway’ analysis, there are eight collections of gene sets from public databases including KEGG, REACTOME, MSigDB, PID and Biocarta. For each collection, there is a network visualization and a table displaying the pathway categories in which the driver genes are involved (Supplementary Figure S1F). Finally, in the ‘Protein/Genetics Interaction’ part, the interactions between driver genes are illustrated according to three resources: BioGRID, IntAct and iRefIndex (Supplementary Figure S1G).
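The ‘Summary’ selection of genes identified by ‘N’ methods can be sketched as a simple vote count. This is an assumption-laden illustration: it reads ‘N’ as "at least N methods", and the function name, the input layout and the placeholder gene names are all hypothetical rather than taken from DriverDB.

```python
from collections import Counter


def genes_by_min_methods(calls_by_method, n):
    """Return genes called as drivers by at least `n` methods, sorted by name.

    `calls_by_method` maps a method name to the genes it identified.
    """
    counts = Counter(
        gene
        for genes in calls_by_method.values()
        for gene in set(genes)  # count each method at most once per gene
    )
    return sorted(gene for gene, count in counts.items() if count >= n)


# Placeholder calls for illustration only.
toy_calls = {
    "MethodA": ["GENE1", "GENE2", "GENE3"],
    "MethodB": ["GENE1", "GENE2"],
    "MethodC": ["GENE1"],
}
print(genes_by_min_methods(toy_calls, 2))  # ['GENE1', 'GENE2']
```

Lowering `n` relaxes the criterion and returns a larger gene set.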
In the ‘Gene’ section, researchers can visualize the mutation data for a specific protein encoded by a gene in five aspects: Mutation Profile, Mutation Percentage, Exon, Driver Score and Mutation Information (Supplementary Figure S2A). Here, we use the gene PIK3CA, which is identified as a driver gene in the ‘Cancer’ section, as an example. Bar chart colors in the sub-figures of Supplementary Figure S2 indicate the functional impact of a mutation, such as non-synonymous and frame-shift, as shown in Supplementary Figure S2B. For ‘Mutation Profile’ (Supplementary Figure S2C), a heat map shows the mutation rate, calculated as the mutation count divided by the sample count of a cancer, at different protein positions across several cancer types. We also provide exon and domain information with protein coordinates at the bottom of the heat map (Supplementary Figure S2C). Two bar charts located at the top and the left of the heat map indicate the sum of the mutation rate according to protein position and cancer type, respectively. The ‘Mutation Percentage’ (Supplementary Figure S2D) is similar to Supplementary Figure S2C, but the number in the heat map is calculated as follows: (mutation count of a protein region/total mutation count of a cancer) × 100. The heights of the two bar charts at the left and the top of the heat map are normalized to the mutation count of a cancer type or a protein region, respectively. In the ‘Exon’ panel, the mutation counts and the mutation types of each exon are illustrated in Supplementary Figure S2E and S2F, respectively. For the ‘Driver Score’ part, Supplementary Figure S2G and S2H indicate the Driver Score (please see the ‘Materials and Methods’ section for details) distributions of exons and protein positions, respectively. All the mutation data of a specific protein are listed under ‘Mutation Information’ (Supplementary Figure S2A).
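The two heat map quantities defined above can be written out as a small worked example. The function names and the counts are hypothetical, and the sketch assumes that the ‘total mutation count of a cancer’ refers to the mutations of the gene being viewed in that cancer type.

```python
def mutation_rate(mutation_count, sample_count):
    """Mutation Profile value: mutation count / sample count of a cancer."""
    return mutation_count / sample_count


def mutation_percentage(region_mutation_count, total_mutation_count):
    """Mutation Percentage value:
    (mutation count of a protein region / total mutation count) x 100."""
    return region_mutation_count / total_mutation_count * 100


# Toy counts (not from DriverDB): 12 mutations at one protein position in a
# cancer type with 200 samples and 48 mutations of the gene in total.
print(mutation_rate(12, 200))       # mutations per sample at this position
print(mutation_percentage(12, 48))  # share of the gene's mutations, in percent
```

The rate depends on cohort size, whereas the percentage depends only on how the mutations are distributed along the protein, which is why hotspots tend to stand out more clearly in the percentage view.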
In addition to the stored calculated results, DriverDB allows researchers to identify driver genes in a user-defined, specific set of samples. As shown in Supplementary Figure S3, users can select one or multiple datasets in DriverDB. We provide a list of clinical criteria, such as ICD-O-3 histology, tumor stage, distant metastasis and lymph node status, to help researchers select a sub-group of well-defined cancer samples according to one or multiple clinical parameters for driver gene identification. Users can review the detailed clinical information of the selected samples before submitting the job to the server for real-time calculation. When the job is completed, the user receives a notification email with a Result ID and can then visualize the driver gene results in the ‘Result and Download’ section.
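The sample selection step can be pictured as filtering sample records on one or more clinical fields. The sketch below is illustrative only: the record layout, the field names `lymph_node_status` and `icd_o_3_histology`, and the sample identifiers are hypothetical and do not reflect the DriverDB schema.

```python
def select_samples(samples, criteria):
    """Keep the samples whose clinical fields match every given criterion."""
    return [
        sample for sample in samples
        if all(sample.get(field) == value for field, value in criteria.items())
    ]


# Placeholder records for illustration only.
toy_samples = [
    {"id": "S1", "lymph_node_status": "N0",
     "icd_o_3_histology": "infiltrating duct carcinoma, NOS"},
    {"id": "S2", "lymph_node_status": "N1",
     "icd_o_3_histology": "infiltrating duct carcinoma, NOS"},
    {"id": "S3", "lymph_node_status": "N0",
     "icd_o_3_histology": "lobular carcinoma, NOS"},
]

selected = select_samples(
    toy_samples,
    {"lymph_node_status": "N0",
     "icd_o_3_histology": "infiltrating duct carcinoma, NOS"},
)
print([sample["id"] for sample in selected])  # ['S1']
```

The selected subset would then be passed to the driver gene identification methods in place of the full dataset.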
DriverDB makes the most of the massive amount of exome-seq data published in recent years by integrating driver gene analysis from numerous methods, as well as by providing visualizations of mutation information according to different aspects. As described in the ‘Introduction’ section, different bioinformatics algorithms have been developed to identify driver genes based on several assumptions and characteristics, each of which provides a different point of view regarding driver genes. DriverDB integrates the analysis results of individual/multiple method(s) and provides three levels of biological interpretation: Gene Ontology, Pathway and Protein/Genetics Interaction. These visualization results help users to quickly understand the relationships between driver genes. A representative example of driver genes identified in GBM is shown in Supplementary Figure S1. A total of 14 driver genes were identified (each gene by at least 4 methods), and nearly all samples had at least 1 deleterious mutation among these 14 genes. Ten genes (CDKN2A, EGFR, PTEN, TP53, CDK4, PIK3R1, NF1, PIK3CA, RB1 and IDH1) are known to be critical in GBM tumorigenesis (49,50). For the other four genes (ATRX, CHEK2, CPSF6 and COL6A3), our functional analysis shows that they are involved in cell cycle-related categories (Supplementary Figure S1F). Moreover, ATRX has been reported as a driver gene in pediatric glioblastomas (51) and neuroblastomas (52,53). CHEK2 is relevant to familial breast/ovarian cancer (54) and neuroblastomas (54). CPSF6 can either enhance the invasive capacities of or inhibit the proliferation of cancer cells (55). The spliced variants and the aberrant methylation of COL6A3 are also related to cancers (56–58). Genes reported in other references but not included in our 14-gene list can be identified by less stringent criteria (such as those identified by at least three methods; for example, PDGFRA, MDM2, MDM4 and CDKN2B).
The ‘Gene’ section is designed to help researchers visualize the mutation data of a driver gene. The representative example is PIK3CA, a well-known driver gene in GBM as well as in other cancers (Supplementary Figure S2). Two hotspot mutation regions (at the middle and the end of the protein) are readily apparent, especially in the ‘Mutation Percentage’ figure (Supplementary Figure S2D). Two other well-known driver genes, BRAF and KRAS, show the same characteristic (Supplementary Figure S4). However, a driver gene may have distinct hotspot mutation regions in different cancers. For example, unlike lung cancers, which carry EGFR mutations in the kinase domain (KD), activation of EGFR in GBM occurs through mutation of the extracellular domain (59). This has been noted as the reason that GBMs with mutations in the extracellular domain respond poorly to EGFR inhibitors (e.g., erlotinib) that target the active kinase conformation (59). This phenomenon was recapitulated by our calculation and is present in the ‘Mutation Profile’ of EGFR in DriverDB (Supplementary Figure S5).
In the ‘Gene’ section, bar chart colors indicate the functional impact of a mutation, which can help to convey important information. For example, FLT3 has been reported to be mutated in approximately one-third of patients in acute myeloid leukemia and has two hotspot regions: one consists of internal tandem duplication (ITD) mutations of 3–400 bp (always in-frame), and the other consists of point mutations at aspartic acid 835 of the KD (60). Such mutation information for FLT3 can be easily obtained in DriverDB (Supplementary Figure S6).
Several studies have assessed the performance of existing tools for predicting deleterious mutations, and the results have demonstrated that identifying cancer-driving mutations remains a significant challenge (5,61). Hence, we used the ‘Driver Score’, which integrates the information from seven computational tools, to describe the deleterious level of a mutation and to highlight the hotspot mutation region. For example, the Driver Score distribution of the cancer-related gene ‘MLL2’ implies that the third region of the MLL2 protein plays a more important role than other positions (Supplementary Figure S7). In summary, in the ‘Gene’ section of DriverDB, researchers can easily be informed when mutations are concentrated in one/some specific protein position(s)/domain(s)/exon(s)/cancer(s).
The ‘Meta-analysis’ section allows a user to re-define a group of samples from one/multiple datasets and then identify driver genes for the selected samples. It has been noted that mutations accumulate during tumor progression. Different driver mutations may be involved in converting a normal cell to a tumor cell, or in turning a benign tumor into a malignant one. The timing of mutations is relevant to metastasis, and there are mutations that occur during this process (1). Thus, if samples are defined according to a clear biological or clinical goal, there is an opportunity to identify a specific set of driver genes for a distinct question. To achieve this, DriverDB offers a list of clinical characteristics to define samples and provides a high degree of freedom for researchers to utilize the large amount of sequencing data. For example, in Supplementary Figure S3 we selected only 180 samples from the TCGA breast cancer project. Their lymph node pathologic spread and ICD-O-3 histology are ‘N0’ and ‘infiltrating duct carcinoma, NOS’, respectively.
A number of databases and frameworks have been developed to integrate large-scale genomic data (2), including cBioPortal (62,63) and IntOGen (64). cBioPortal contains datasets from TCGA and provides gene-based search capabilities to interactively explore multidimensional cancer genomics data. IntOGen is a framework that integrates multidimensional data for the identification of genes and biological modules involved in cancer development. DriverDB performs large-scale data mining with multiple driver gene identification algorithms in a single analysis, presents summarized driver genes, and provides different aspects for mutation visualization. Another unique feature of DriverDB is that it helps researchers to identify driver genes in a user-defined manner.
NGS has become the norm for large-scale cancer research, and cancer exome-seq results will accumulate rapidly in the next few years. For example, TCGA will examine over 11,000 samples for 20 cancer types by the end of 2014. Owing to the Publication Guidelines of TCGA (http://cancergenome.nih.gov/abouttcga/policies/publicationguidelines), parts of the data from TCGA are excluded from DriverDB. Over time, data from TCGA, as well as from other cancer projects and the literature, will no longer be subject to publication limitations and will be incorporated into updated versions of DriverDB. We envision that the novel driver genes or mutations identified and stored in DriverDB will hold great potential for both basic research and biotech product development.
Supplementary Data are available at NAR Online.
National Science Council [NSC; NSC101-2320-B-010-059-MY3, NSC101-2627-B-010-003 and NSC101-2321-B-010-011]; Veterans General Hospitals University System of Taiwan (VGHUST) Joint Research Program; Tsou’s Foundation [GHUST102-G7-3-2]; National Health Research Institutes (NHRI) [NHRI-EX102-10254SI]; Taipei Veteran General Hospital [Cancer Excellence Center Plan DOH102-TD-C-111-007]; Taipei City Hospital [10201-62-070]; National Yang-Ming University [Ministry of Education, Aim for the Top University Plan]; UST-UCSD International Center for Excellence in Advanced Bioengineering sponsored by the Taiwan NSC I-RiCE Program (in part) [NSC101-2911-I-009-101]. Funding for open access charge: NSC.
Conflict of interest statement. None declared.
We acknowledge the contribution of Chien Shu and Kun Zhang from University of California at San Diego for the analysis of exome-seq; and thank the National Center for High-performance Computing for computer time and facilities.