CCancer is an automatically collected database of gene lists, which were reported mostly by experimental studies in various biological and clinical contexts. At the moment, the database covers 3369 gene lists extracted from 2644 papers published in ~80 peer-reviewed journals. As input, CCancer accepts a gene list. An enrichment analysis is implemented to generate, as output, a highly informative survey of recently published studies that report gene lists which significantly intersect with the query gene list. A report on gene pairs from the input list that were frequently reported together by other biological studies is also provided. CCancer is freely available at http://mips.helmholtz-muenchen.de/proj/ccancer.
At the moment, various high-throughput experimental platforms are employed intensively to provide new insights into the molecular mechanisms underlying a variety of biological phenomena (1,2). An increasing number of biological or clinical studies report differentially expressed genes, epigenetically silenced genes, frequently mutated genes, genes with copy number variations or other gene lists involved in common biological processes. Although publicly available, this type of information is scattered across hundreds of papers. The only practical way to collect these valuable data is to use automatic text-mining systems.
Text-mining systems are employed by biomedical researchers to automatically extract relevant information from the literature [see ref. (3) for a review]. For example, PolySearch (4) is a generic text mining system for extracting relationships between genes and diseases. Several other databases, which are based on text mining, focus on specialized research areas: PubMeth (5) and MeInfoText (6) collect information on gene methylation in cancer. DDOC (7) and DDEC (8) collect heterogeneous information about genes differentially expressed in ovarian and esophageal cancer, such as manually curated information about the promoter regions and associated transcription factors, as well as text-mined reports.
Recently, we have developed the PLIPS database, a collection of protein lists extracted from proteomics studies by text-mining (9). PLIPS also provides a statistical framework for the interpretation of a protein list. To generate the PLIPS database, relatively few ‘text mining’ efforts were required, since a majority of proteomics studies are published in a few highly specialized proteomics journals. PLIPS covers only five major proteomics journals (Proteomics, Journal of Proteome Research, Molecular and Cellular Proteomics, Proteomics—Clinical Applications) and ~1000 different protein lists extracted from 800 independent studies.
In contrast to proteomics, high-throughput genomic technologies are used more frequently and their results are published in a much wider spectrum of journals. Gene lists that were found to play key roles in the molecular mechanisms of a variety of biological phenomena are regularly reported in general biological journals, as well as in highly specific medical journals. Thus, automatic extraction of this information requires considerable additional effort.
Here, we present a database, termed CCancer, which provides a collection of 3369 gene lists automatically extracted from tables in 2644 studies covering ~80 peer-reviewed journals. Cancer is a major focus of biomedical research. According to our estimates, more than half of the gene lists stored in CCancer were extracted from cancer-related studies. This fact motivates the name of the database.
CCancer is not only a database but also a web-based analysis platform, which employs an enrichment analysis framework (10–14) to interpret a given user-defined gene list. As input, CCancer accepts a gene/protein list. As output, it provides a catalogue of previously published studies that report a table of genes/proteins which significantly intersects with the query list. Thus, CCancer supports the interpretation of the functional context of an experimentally derived gene list. To illustrate the valuable and often unprecedented information that the user can obtain from the CCancer database, we present several examples of data analyses.
We collected all articles (~150 000) published in 80 peer-reviewed journals over the last 5–7 years. The articles were screened for tables that report gene identifiers: a search algorithm was implemented to recognize tables with gene/protein identifiers within the paper text. If a table reported at least 10 unique gene identifiers of the same type [i.e. ‘Entrez Gene IDs’, ‘Gene Symbols’, ‘RefSeq’, ‘UNIGENE’, ‘ENSEMBL’, ‘Affymetrix Probes’, ‘IPISYN (International Protein Identifier)’, ‘Uniprot, SwissProt’], the paper was selected. In total, 3369 gene/protein lists were identified from 2644 papers. All gene lists were mapped to ‘Entrez Gene IDs’. The data in CCancer cover ~20 000 unique ‘Entrez Gene IDs’.
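As an illustration only, the selection rule described above (at least 10 unique identifiers of the same type in one table) can be sketched in Python. The regular expressions, the restriction to four identifier types and the function name detect_gene_list are hypothetical simplifications; they are not the patterns used to build CCancer.

```python
import re
from collections import defaultdict

# Hypothetical, simplified patterns for a few identifier types.
# Real identifier recognition (e.g. gene symbols) is considerably harder.
ID_PATTERNS = {
    "RefSeq": re.compile(r"^[NX][MPR]_\d+(\.\d+)?$"),
    "ENSEMBL": re.compile(r"^ENS[A-Z]*[GTP]\d{11}$"),
    "UNIGENE": re.compile(r"^Hs\.\d+$"),
    "Affymetrix Probes": re.compile(r"^\d+(_[a-z])?_at$"),
}
MIN_UNIQUE_IDS = 10  # threshold stated in the text


def detect_gene_list(cells):
    """Return (id_type, sorted unique ids) if the table qualifies, else None.

    `cells` is an iterable with the text content of all cells of one table.
    """
    found = defaultdict(set)
    for cell in cells:
        token = cell.strip()
        for id_type, pattern in ID_PATTERNS.items():
            if pattern.match(token):
                found[id_type].add(token)
                break
    if not found:
        return None
    # Keep only the most frequent identifier type in this table.
    best_type = max(found, key=lambda t: len(found[t]))
    if len(found[best_type]) < MIN_UNIQUE_IDS:
        return None
    return best_type, sorted(found[best_type])
```

A table whose cells contain twelve distinct RefSeq-like accessions would be reported as a ‘RefSeq’ list, whereas a table with only nine would be rejected.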
The top journals, in terms of the number of extracted gene lists, were highly specific journals in cancer research: ‘Cancer Research’ (327 papers and 411 gene lists), ‘Oncogene’ (214 papers and 278 gene lists), ‘Clinical Cancer Research’ (149 papers and 178 gene lists) and ‘International Journal of Cancer’ (109 papers and 143 gene lists). The full list of journals is accessible on the web server (http://mips.helmholtz-muenchen.de/proj/ccancer/journals).
We would like to point out that the data were collected automatically. A gene list in the database may be incomplete (in comparison with the list originally reported in the paper) and may contain false positive gene identifiers. In our estimate, based on 100 randomly selected records, ~60% of records are of high quality (the CCancer record has <10% false negative and false positive genes relative to the original table in the paper), ~20% of records are of good quality (containing 10–25% false negatives and not more than 10% false positives) and ~15% of records contain ~35–75% of the genes actually reported in the paper table. About 3–5% of the records may represent artefacts, i.e. they result from a wrongly recognized table which does not actually report a gene list.
A comprehensive hierarchical controlled vocabulary for human disease (http://do-wiki.nubic.northwestern.edu/index.php/Main_Page) was used to link articles to human disease terms. First, we computed the background distribution for each disease term using all available articles (~150 000). For each disease term ‘A’, we selected the subset of articles in which the term was present at least once and counted, for each article in this subset, the number of times the term ‘A’ occurred. The average number of mentions of term ‘A’ per paper across this subset was then computed. If a paper mentioned term ‘A’ at least twice as often as this average, the paper was annotated with term ‘A’.
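The annotation rule can be illustrated with a minimal Python sketch. It assumes plain case-insensitive substring counting and treats ‘twice as many times’ as ‘at least twice’; both choices, as well as the names count_term and annotate_articles, are assumptions made for illustration.

```python
from collections import defaultdict

FOLD_THRESHOLD = 2.0  # "twice the average"; '>=' rather than '>' is an assumption


def count_term(text, term):
    """Case-insensitive count of non-overlapping occurrences of a term."""
    return text.lower().count(term.lower())


def annotate_articles(articles, terms, fold=FOLD_THRESHOLD):
    """Map each article id to the set of disease terms overrepresented in it.

    `articles` is a dict {article_id: full_text}; `terms` is an iterable of
    disease terms from the controlled vocabulary.
    """
    annotations = defaultdict(set)
    for term in terms:
        counts = {aid: count_term(text, term) for aid, text in articles.items()}
        # Background: only articles that mention the term at least once.
        mentioning = {aid: c for aid, c in counts.items() if c > 0}
        if not mentioning:
            continue
        background_mean = sum(mentioning.values()) / len(mentioning)
        for aid, c in mentioning.items():
            if c >= fold * background_mean:
                annotations[aid].add(term)
    return dict(annotations)
```

For instance, if three articles mention a term 10, 1 and 1 times, the background average is 4 mentions, so only the first article (10 ≥ 8) is annotated with the term.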
The gene lists from the CCancer database were annotated with ‘human disease terms’ based on the annotation of the paper from which they were extracted. We would like to point out that the biological context of the genes reported in a table may not correspond to the context of the terms overrepresented in the paper text. In each case, manual analysis is required.
To statistically link a given gene list (the query list) to the lists from the database, we implemented a standard enrichment analysis. For each gene/protein list L in the database, the number I of genes common to the query list and the gene list L is counted. The null hypothesis H0 ‘Genes from the query list (size NQ) and from the list L (size NL) have at least I common genes by chance’ is tested. The hypergeometric test, adjusted for multiple testing by a Monte–Carlo simulation procedure, is employed to assess the significance of the intersection I. The estimated P-value corresponds exactly to the definition of an experiment-wise Westfall and Young P-value (11,12,15–17).
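The unadjusted part of this test can be sketched in a few lines of Python using only the standard library. The size of the gene universe is not specified for this test in the text, so universe_size is an explicit parameter here; the function names are hypothetical, and the multiple-testing adjustment is deliberately left out.

```python
from math import comb


def hypergeom_tail(overlap, universe, query_size, list_size):
    """P(X >= overlap) when `query_size` genes are drawn without replacement
    from `universe` genes, `list_size` of which belong to the database list."""
    upper = min(query_size, list_size)
    total = comb(universe, query_size)
    favourable = sum(
        comb(list_size, x) * comb(universe - list_size, query_size - x)
        for x in range(overlap, upper + 1)
    )
    return favourable / total


def intersection_pvalue(query, db_list, universe_size):
    """Raw (unadjusted) P-value for the overlap between two gene sets."""
    overlap = len(query & db_list)
    return hypergeom_tail(overlap, universe_size, len(query), len(db_list))
```

Exact integer binomial coefficients are used so that the tail probability does not suffer from cancellation for very small P-values; for large universes a library routine for the hypergeometric survival function would be a faster alternative.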
Based on the data in the CCancer database, we identified gene pairs that were significantly associated (frequently reported together). Let N denote the total number of tables in the CCancer database and Cj the number of tables in which gene j was reported. For each pair of genes (j and k), we compute Cjk, the number of tables in which both gene j and gene k were reported. Under the null hypothesis of no association, the intersection Cjk follows a hypergeometric distribution with parameters N, Cj and Ck (Cj balls are drawn without replacement from an urn containing N balls in total, Ck of which are white). A Monte–Carlo simulation procedure was employed to adjust the P-value for multiple testing (for each gene we tested K hypotheses, where K is the total number of genes). At the significance level P < 0.05, each gene in the CCancer database was associated on average with ~20 other genes.
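Written out with the symbols defined above, the unadjusted one-sided P-value for a gene pair would take the following form. This is the standard hypergeometric tail and is given here as an assumption about the exact form of the test; X denotes the number of co-occurrences expected under random reporting.

```latex
P\left(X \ge C_{jk}\right)
  = \sum_{x = C_{jk}}^{\min(C_j,\, C_k)}
    \frac{\binom{C_k}{x}\,\binom{N - C_k}{C_j - x}}{\binom{N}{C_j}}
```

The expression is symmetric in j and k, so it does not matter which of the two genes plays the role of the ‘white balls’.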
Extracted gene/protein lists were mapped to NCBI Entrez Gene IDs. Each gene/protein list L in the database was treated as a ‘query’ list to identify other lists in the database that share a significant number of genes with it. Thus, we cross-linked all pairs of gene lists that share a significant number (P < 0.01) of common genes. This information can be browsed online (http://mips.helmholtz-muenchen.de/proj/ccancer/journals/).
Each gene/protein list from the CCancer database was linked to the gene ontology (GO) terms that were significantly (P < 0.01) enriched in the list (13). In analogy to the calculation of the intersection P-value, a hypergeometric test, adjusted for multiple testing using a Monte–Carlo simulation approach, was employed to estimate the statistical significance of a GO category.
All three cases considered (intersection between a gene list and CCancer records, GO enrichment of a gene list and significantly associated gene pairs) can be described by the same statistical model, ‘sampling without replacement’. In this model, k balls are drawn without replacement from an urn containing N balls in total, n of which are white (all others are black). The number k1 of white balls drawn from the urn then follows a hypergeometric distribution with parameters N, k and n. In our cases, however, the balls are multicolored and we actually test multiple hypotheses at the same time: ‘white balls were drawn randomly’, ‘blue balls were drawn randomly’, ‘red balls were drawn randomly’ and so on. Because we select the most enriched color, whichever it is (let us say red in this case), the P-value estimated from the hypergeometric distribution does not reflect the probability of drawing k1 balls of any one color; it reflects the probability of drawing k1 red balls. To obtain the probability of drawing k1 balls of any one color, we need to adjust the P-value for multiple testing. One way to do this is to use Monte–Carlo simulation to measure this probability directly, based on, let us say, 1000 simulations.
In this case, we simulate a random drawing of k balls 1000 times and each time estimate the P-value, based on the hypergeometric distribution, for the best (most enriched) color. We thus obtain a distribution of 1000 best P-values for random draws of k balls and compare it with the P-value of the best color in our original draw of k balls. The adjusted P-value is estimated as the share of random simulations in which the best P-value was equal to or better (smaller) than the P-value of the best color in our original draw of k balls.
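The simulation procedure described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: each ‘color’ is a database gene list, random queries are drawn uniformly from an explicit gene universe, and all function names, the seed and the data layout are hypothetical rather than taken from the CCancer implementation.

```python
import random
from math import comb


def hypergeom_tail(overlap, universe, draws, whites):
    """P(X >= overlap) for `draws` balls taken from `universe` balls,
    `whites` of which carry the colour of interest."""
    total = comb(universe, draws)
    hits = sum(
        comb(whites, x) * comb(universe - whites, draws - x)
        for x in range(overlap, min(draws, whites) + 1)
    )
    return hits / total


def best_pvalue(query, database, universe_size):
    """Smallest raw P-value of the query against any list (colour)."""
    return min(
        hypergeom_tail(len(query & genes), universe_size, len(query), len(genes))
        for genes in database.values()
    )


def adjusted_pvalue(query, database, universe, n_sim=1000, seed=0):
    """Share of random queries whose best P-value is <= the observed one."""
    rng = random.Random(seed)
    universe = sorted(universe)
    observed = best_pvalue(query, database, len(universe))
    hits = 0
    for _ in range(n_sim):
        random_query = set(rng.sample(universe, len(query)))
        if best_pvalue(random_query, database, len(universe)) <= observed:
            hits += 1
    return hits / n_sim
```

With 1000 simulations the smallest non-zero estimate is 0.001, so an adjusted P-value of 0 from this sketch should be read as ‘below 0.001’ rather than as exactly zero.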
The user can submit a list of gene/protein identifiers to find statistically significant links to previously published studies, as well as to identify gene pairs from the submitted list that were frequently reported together (P < 0.05). As input, CCancer accepts most common types of gene and protein identifiers, such as ‘Gene Symbol’, ‘Entrez Gene Id’, ‘UniProt/Swiss-Prot’, ‘UniGene’, ‘Ensembl’, ‘RefSeq Protein ID’, ‘RefSeq Transcript ID’ and ‘Affymetrix probe codes’. As output, a catalogue of previously published studies that report a table of genes which significantly intersects with the query list is provided.
After a list of potential studies is generated, all interesting hits need to be checked manually. First, the list of gene IDs common to the query and the database list needs to be checked (a link ‘Mapping protocol’ is provided on the results page). Additionally, by looking into the corresponding study (a link is provided on the results page), one can better understand the functional context of the ‘database’ gene list. As already mentioned in the ‘text mining’ section, the database was collected automatically and, thus, some hits may represent artefacts.
CCancer also provides an interface to browse gene lists from the database with a common property. At the moment, the user can select gene lists which are statistically linked to either a particular GO biological process (http://mips.helmholtz-muenchen.de/proj/ccancer/go_bp), molecular function (http://mips.helmholtz-muenchen.de/proj/ccancer/go_mf) or cellular component (http://mips.helmholtz-muenchen.de/proj/ccancer/go_cc). The possibility to browse gene lists based on their local properties is going to be extended in the future.
Next, we present examples of analyses of experimental data with CCancer to illustrate the potential utility of our database. In principle, the interpretation of a gene list using CCancer is based on the widely accepted guilt-by-association principle: significant similarities between protein lists can be indicative of similarity in the molecular mechanisms of the corresponding phenomena. The next example aims to illustrate this application of CCancer.
A study by Young et al. (18) provided evidence that autophagy-related genes mediate the acquisition of the senescence phenotype. The authors studied 53 autophagy- and senescence-related genes, which were up- or down-regulated after Ras induction. A screen in CCancer for studies reporting gene sets that significantly intersect with the genes reported in ref. (18) identified several related papers (P < 0.01, Table 1). For example, CCancer detected a study in which Staber et al. (19) report genes associated with recurrent acute myeloid leukemia after high-dose chemotherapy. The genes that were differentially expressed in patients with acute myeloid leukemia prior to high-dose chemotherapy and after relapse significantly intersect with the senescence-related genes reported in ref. (18).
Two other studies detected by CCancer covered related topics. One study (21) described genes related to macrophage activation and TH1 immune response, which were induced by low-dose radiation therapy in follicular lymphoma. A second study (22) identified genes differentially expressed in response to ionizing radiation in lymphoblastoid cells.
A relationship between cancer- and senescence-related genes or pathways would certainly have been expected. Interestingly, some topics that emerged from the comparison of senescence-related genes with the studies in the CCancer database were related to cancer therapy or prognosis. For instance, the studies commonly reported Cathepsin B and D (CTSB/D) and Cathepsin L1 (CTSL1), which participate in protein degradation and turnover (23,24).
Another output from CCancer (Table 2) reports pairs of genes that were frequently reported together. For example, the already mentioned genes Cathepsin L1 (CTSL1) and Cathepsin B and D (CTSB/D) were reported together much more frequently than would be expected by chance. CTSL1 and CTSB were reported together in 12 CCancer records, while the two genes were individually reported 78 and 49 times, respectively (P < 0.05). Interestingly, 7 out of the 12 papers were related to ‘cancer’ and three to ‘LEUKEMIA’.
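To give a feeling for these numbers, the following Python sketch applies the hypergeometric model described for gene pairs to the CTSL1/CTSB counts. It assumes that the total number of tables N equals the 3369 gene lists in the database, which the text does not state explicitly for this example, and it computes only the unadjusted tail probability, not the Monte–Carlo adjusted P-value reported above.

```python
from math import comb

N_TABLES = 3369  # assumption: N equals the number of gene lists in the database
C_CTSL1 = 78     # tables reporting CTSL1 (order follows "respectively" in the text)
C_CTSB = 49      # tables reporting CTSB
C_BOTH = 12      # tables reporting both genes

# Expected number of co-occurrences if the two genes were reported independently.
expected = C_CTSL1 * C_CTSB / N_TABLES

# Unadjusted hypergeometric tail: P(at least C_BOTH co-occurrences by chance).
raw_p = sum(
    comb(C_CTSB, x) * comb(N_TABLES - C_CTSB, C_CTSL1 - x)
    for x in range(C_BOTH, min(C_CTSL1, C_CTSB) + 1)
) / comb(N_TABLES, C_CTSL1)

print(f"expected co-occurrences under independence: {expected:.2f}")
print(f"unadjusted hypergeometric tail probability: {raw_p:.3g}")
```

Under these assumptions, only about one co-occurrence (78 × 49 / 3369 ≈ 1.13) would be expected by chance, compared with the 12 observed.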
CCancer can further be exploited to identify hints for novel clinical applications of known drugs (25) or of drugs under development, provided that the list of molecular targets of the drug is known. For a variety of cell disorders (including most cancer subtypes), CCancer stores lists of genes found to be in differential states (normal versus disordered cells). This type of information can be mined to identify new potential therapeutic implications. A significant number of common genes between the drug targets and database gene lists related to a given cell disorder can indicate that the drug may be usable for the corresponding disease.
For example, Bosutinib is a novel promiscuous kinase inhibitor. We extracted a reported list of direct interactors of Bosutinib identified by chemical proteomics (26). We used CCancer to identify previously reported gene lists that have a significant share of common genes/proteins (http://mips.helmholtz-muenchen.de/proj/ccancer/example.html) and, thus, to identify potential physiological conditions in which an application of Bosutinib might be effective. In addition to already known applications, such as different types of ‘leukemia’ (27,28), CCancer suggests other specific cancer types, such as ‘oral squamous cell carcinoma’ (28).
Here, we have generated a comprehensive collection of gene lists reported by papers in 80 peer-reviewed journals. Tables in articles usually present gene lists that are, to some extent, pre-processed and selected for significance by experts. To our knowledge, no existing text-mining system provides a similarly accessible and comparable collection of experimentally derived gene lists for analysis.
The CCancer database provides a computational interface that generates a highly informative survey of recently published cancer-related studies which report similar and significantly intersecting gene lists. As we have demonstrated, by applying this automatic analysis the user may obtain sometimes unexpected links to previous studies. It would be a tedious, if not impossible, task for experimental researchers to gain these insights by a manual analysis of the literature. Articles that contain significantly intersecting gene lists are not necessarily listed as ‘related articles’ in PubMed. In addition, CCancer implements a robust statistical treatment of the intersection between a query and a database gene list, and provides a valid estimate of the P-values by a Monte–Carlo simulation procedure. The P-values actually reflect the probability of getting an intersection of the same size, in terms of the number of genes, for any random query gene list.
This work was supported by the Helmholtz Alliance on Systems Biology (project ‘CoReNe’). Funding for open access charge: Helmholtz Center Munich – German Research Center for Environmental Health (GmbH).
Conflict of interest statement. None declared.