GeneSigDB (http://www.genesigdb.org or http://compbio.dfci.harvard.edu/genesigdb/) is a database of gene signatures that have been extracted and manually curated from the published literature. It provides a standardized resource of published prognostic, diagnostic and other gene signatures of cancer and related diseases to the community so that researchers can compare the predictive power of gene signatures or use them in gene set enrichment analysis. Since GeneSigDB release 1.0, we have expanded from 575 to 3515 gene signatures, which were collected and transcribed from 1604 published articles largely focused on gene expression in cancer, stem cells, immune cells, development and lung disease. We have made substantial upgrades to the GeneSigDB website to improve accessibility and usability, including adding a tag cloud browse function, faceted navigation and a ‘basket’ feature to store genes or gene signatures of interest. Users can analyze GeneSigDB gene signatures, or upload their own gene list, to identify gene signatures with significant gene overlap, and results can be viewed on a dynamic editable heatmap that can be downloaded as a publication quality image. All data in GeneSigDB can be downloaded in numerous formats including .gmt file format for gene set enrichment analysis or as an R/Bioconductor data file. GeneSigDB is available from http://www.genesigdb.org.
Accurately curated and annotated gene sets have emerged as essential tools for the analysis of large, complex biological datasets. Gene set analysis (GSA) is widely used in the analysis and interpretation of gene expression profiling data (1–4), evolutionary relationships (5), genomic associations—including QTL analysis (6), genotyping (7) and SNP chips (8)—and even for cross platform integration of genomics data (9). GSA aims to find sets of genes that collectively distinguish two phenotypes, even if the genes in the set are not significantly different when tested individually. This reflects the fact that genes within the cell function as members of complex networks and pathways, often with multiple, overlapping functions. As a result, direct comparisons of genes may miss biologically important connections that are only seen when these related genes are assessed collectively.
Gene sets have also become invaluable tools for characterizing and distinguishing phenotypic states. In breast cancer, for example, several gene expression signatures have been developed as commercial diagnostic assays (10) and new methods are being developed that combine the predictive strength of multiple gene signatures to increase their prognostic power (11).
Gene set resources can be broadly divided into those which assign a gene to collections based on ‘known’ gene or protein interactions or functional activity and those that include gene lists from high-throughput experimental assays. Functional and pathway databases such as Gene Ontology (GO), KEGG and Reactome capture published descriptions of cellular pathways and gene functions (12), including, in the case of GO, functional predictions inferred from orthologous sequences (13). However, these resources are incomplete as we have not yet been able to comprehensively and completely catalog the functions of all genes in the genome (13).
High-throughput experiments, such as microarray expression profiling and RNA-seq have also produced large numbers of potentially informative gene lists. Most genomics papers present one or more gene signatures that reportedly correlate with experimental phenotypes. While there has been some controversy over the value of individual gene sets, due to the fact that many fail to fully replicate in independent data sets, the analysis of the collected gene lists defined for similar phenotypes has been demonstrated to provide meaningful biological insight (14).
Despite tremendous interest in using gene signatures, public repositories such as GEO and ArrayExpress (15,16) store primary gene expression data but fail to capture the gene sets that are the end product of published analyses. Without a systematic way of reporting these, the gene sets often appear only in published tables or figures or in supplementary materials hosted on the author's or the journal's website. And as there are no accepted standards for reporting gene sets, they often appear with non-standard gene identifiers, making comparison to other lists, or even to the original data, a significant challenge. Because of these limitations, gene sets from published research studies are often inaccessible to automated computational analysis.
In August 2009, we created GeneSigDB (17) as a repository for gene sets that had been systematically collected and manually curated from published articles indexed by PubMed. Our approach in building GeneSigDB was to capture gene signatures from the literature as published, to map them to standard identifiers using transparent, reproducible protocols and to freely provide these to the research community together with some elementary analytical tools. Since its launch, GeneSigDB has had 7918 web hits, with 4404 hits in 2010 and 3354 so far this year, suggesting that this resource is of value to the biomedical research community.
GeneSigDB has grown considerably since its introduction (Figure 1), nearly doubling in size with each subsequent release. GeneSigDB 4.0, released in September 2011, contains 3515 human, mouse and rat gene sets curated from 1604 published articles. While we have continued to focus on gene sets related to cancer and stem cells, we now also include signatures for development, inflammation and immune regulation, and lung disease and we have begun to catalog signatures for miRNA expression and proteomics. The content of GeneSigDB and its composition are summarized in Tables 1 and 2.
GeneSigDB had minimal overlap with other gene signature resources when we compared the publications curated by MSigDB (18) and CCancer (19) to those in GeneSigDB. Only 198/1604 (12%) publications were curated by both GeneSigDB (n=1604) and MSigDB (n=786, Release 3.0). GeneSigDB and MSigDB are both manually curated, but CCancer gene lists are computationally extracted from publications in 100 journals indexed by PubMed (19), and ~30% of publications in CCancer are also manually curated in GeneSigDB. To estimate curation quality, we examined 121 publications that were curated by all three gene signature resources. The numbers of gene signatures identified in these 121 publications were 428, 287 and 123 by MSigDB, GeneSigDB and CCancer, respectively. MSigDB and GeneSigDB captured more data per publication than the automated curation of CCancer.
The primary data objects in GeneSigDB are genes, published articles and gene signatures from those publications. We define a gene signature as a set of gene identifiers that were experimentally derived from analysis of gene, protein or miRNA expression.
Published articles likely to contain gene signatures are identified using predefined PubMed searches as described at http://compbio.dfci.harvard.edu/genesigdb/documentation.jsp. We download and read each article, identify tables or figures containing gene signatures, and transcribe these from the main body of the article, its supplementary materials or websites referenced within the article (Figure 2). We manually transcribe the entire table and then use a pipeline based on Biomart (20) to map the published gene identifiers to EnsEMBL IDs, creating standardized gene sets.
The number of genes in a standardized set may not equal that reported in the source publication, and this may occur for a variety of reasons. Gene identifiers reported in the article may have been retired, reported probes may now be recognized as non-specific or may map to multiple genes, or gene identifiers may be invalid due to inaccurate reporting or to MS Excel gene name conversion errors (21). GeneSigDB users have the option of seeing either the original or standardized version of a gene set, and in the current release users can compare the original and standardized lists side-by-side. In addition, we have improved the visibility of unmapped genes so that users can identify them in a GeneSigDB standardized table.
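The following minimal sketch illustrates why a standardized set can differ in size from the published list. It is not the GeneSigDB pipeline: the function name `standardize` and the in-memory `lookup` table are hypothetical stand-ins for the BioMart-based mapping, and the lookup contains only two toy entries.

```python
def standardize(published_ids, lookup):
    """Map published identifiers to Ensembl gene IDs.

    `lookup` is a hypothetical dict from a published identifier to a list of
    Ensembl gene IDs (it stands in for a BioMart query). Returns the
    standardized gene set and the identifiers that could not be mapped.
    """
    standardized = set()
    unmapped = []
    for identifier in published_ids:
        targets = lookup.get(identifier.strip(), ())
        if targets:
            # A probe or symbol may map to more than one gene.
            standardized.update(targets)
        else:
            # Retired, invalid or spreadsheet-mangled identifiers end up here.
            unmapped.append(identifier)
    return standardized, unmapped


# Toy lookup table (assumption: real mappings come from BioMart).
toy_lookup = {
    "TP53": ["ENSG00000141510"],
    "BRCA1": ["ENSG00000012048"],
}

# "1-Mar" mimics a gene symbol converted to a date by a spreadsheet program.
genes, missing = standardize(["TP53", "BRCA1", "1-Mar"], toy_lookup)
print(sorted(genes), missing)
```

Keeping the unmapped identifiers alongside the standardized set mirrors the side-by-side view described above, where the reader can see which published entries were lost in mapping.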
Gene signatures in the database are identified by unique SigIDs and SigNames. The SigIDs combine the PubMed ID of the paper from which the signature was derived and the table or figure in which it was reported. For example, SigID 11823860-SuppTable2 refers to a gene signature obtained from Table 2 in the supplementary material from an article with PMID 11823860. The SigName is designed to be a more descriptive, human-readable identifier; the SigName associated with 11823860-SuppTable2 is Breast_van'tVeer02_231genes_PoorPrognosisSignature, which indicates it is a signature of poor prognosis in breast cancer, contains 231 genes and was published by van 't Veer and colleagues in 2002.
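As an illustration of the SigID convention, the sketch below splits a SigID into its PubMed ID and its table or figure label. The class and function names are hypothetical, and the code assumes the PMID is always the numeric part before the first hyphen, as in the example above.

```python
from typing import NamedTuple


class SigID(NamedTuple):
    pmid: str    # PubMed ID of the source article
    source: str  # table or figure label, e.g. "SuppTable2"


def parse_sig_id(sig_id: str) -> SigID:
    """Split a SigID of the assumed form '<PMID>-<table or figure label>'."""
    pmid, sep, source = sig_id.partition("-")
    if not sep or not pmid.isdigit() or not source:
        raise ValueError(f"not a recognizable SigID: {sig_id!r}")
    return SigID(pmid, source)


print(parse_sig_id("11823860-SuppTable2"))
```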
We provide two search tools for finding signatures in the database, one based on publications and the other based on genes. The publication search tool allows users to enter one or more search terms, such as author name, article title, journal name or keywords (such as disease type), and these are then searched against the full text of articles represented in the database to find those best meeting the search criteria. In release 4.0 we added the ability to search Medical Subject Headings (MeSH) terms associated with each publication.
Users can also search for genes and their annotated properties, including gene names and synonyms, functional classifications such as GO terms, InterPro domains, KEGG or Reactome Pathways or almost any valid gene identifiers including gene symbol, Entrez gene ID, EnsEMBL gene ID, RefSeq ID or common commercial microarray probe IDs.
Results can be further refined by applying additional search criteria or using faceted terms associated with publication or gene search results (Figure 3). For either gene-based or publication-based searches, users can collect their results and load these into a ‘Shopping Basket’, allowing signatures to be collected using a variety of independent criteria and then viewed, downloaded or compared (Figure 3).
Clicking on a publication, gene or gene signature will open up a data type-specific view for each of these. The publication view provides information about the published article, its authors, an abstract and a list of gene signatures associated with that publication. The gene view provides annotation on a gene and a list of gene signatures which contain that gene. Each of these pages includes links to one or more gene signatures. The gene signature view now provides a dynamic table presenting both the original and standardized gene lists, each of which can be used to sort and filter the signature.
One common question is whether a particular gene set overlaps with others that have been reported. For published gene lists, we have pre-computed the pairwise similarity between all gene signatures using a one-tailed Fisher's exact test (which is equivalent to a hypergeometric test) with P-values corrected for multiple testing. When a user clicks ‘Related Signatures’ from a gene signature view, or ‘Compare’ gene signatures in the Shopping Basket, the most similar signatures are presented both in list and graphical form (Figure 4).
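The overlap test can be sketched as the upper tail of a hypergeometric distribution. This is only an illustration: the function name is hypothetical, the size of the gene universe is an assumption that the text does not specify, and the multiple testing correction is not shown because the method used is not stated.

```python
from math import comb


def overlap_p_value(universe: int, size_a: int, size_b: int, shared: int) -> float:
    """One-tailed hypergeometric P-value for the overlap of two gene sets.

    Returns P(X >= shared), where X is the number of genes from set A found
    when drawing size_b genes without replacement from `universe` genes.
    """
    upper = min(size_a, size_b)
    total = comb(universe, size_b)
    tail = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(shared, upper + 1)
    )
    return tail / total


# Toy numbers: a universe of 20000 genes is an assumed background size.
p = overlap_p_value(universe=20000, size_a=231, size_b=70, shared=12)
print(f"P-value for an overlap of 12 genes: {p:.3g}")
```

The result depends strongly on the assumed universe, so the same background should be used for every pair of signatures being compared.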
A heatmap is used to visualize the overlaps of genes in a selection of gene signatures. In the heatmap, gray pixels indicate no overlap and red indicates two signatures sharing a common gene. Users can reorder, add or remove genes or gene signatures from the heatmap or use a sliding selector to define a minimum number of genes required for overlap. Users can also edit the heatmap column and row labels and export the heatmap as a publication quality image.
Those wishing to compare their own gene lists to the published lists in GeneSigDB can select ‘Analyze My Genes’ from the top menu bar. This brings up an interface that allows them to paste in their own gene list or to upload a file containing a gene list. If the gene list is not a list of EnsEMBL IDs, it is converted using BioMart, tested for overlap with the gene sets in the database, and reported using the table and heatmap views described above.
As described above, users may perform a search to create a custom selection of gene signatures in their basket and these can be downloaded as a compressed file. In addition, all of the data in GeneSigDB is freely available for download. Users selecting ‘Download’ from the top menu bar are taken to a page where they can download the current release and previous versions of the database. The data are available in a variety of formats, including a tab-delimited flat file, GSEA gmt format and as an R/Bioconductor RData file.
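For readers who wish to work with the .gmt download outside of GSEA, the sketch below reads the format into a dictionary. It assumes the usual GMT layout of one gene set per line with tab-separated fields (set name, description, then gene identifiers); the function name and the example content are hypothetical, not actual GeneSigDB data.

```python
def parse_gmt(text: str) -> dict[str, set[str]]:
    """Parse GMT-formatted text into {signature name: set of gene IDs}."""
    signatures = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Fields: name, description, then one column per gene.
        name, _description, *genes = line.split("\t")
        signatures[name] = {gene for gene in genes if gene}
    return signatures


# Made-up example content (placeholder names and identifiers).
example = (
    "sigA\tfirst toy signature\tGENE1\tGENE2\tGENE3\n"
    "sigB\tsecond toy signature\tGENE2\tGENE4\n"
)
sets = parse_gmt(example)
print(sets["sigA"] & sets["sigB"])  # genes shared by the two toy signatures
```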
Release 4.0 of GeneSigDB includes expanded programmatic access to the database through a Java RESTful web service. We use the reference implementation of JAX-RS found at Glassfish (Jersey: https://jersey.dev.java.net) to provide the REST HTTP functionality and the Glassfish reference implementation of JAXB (JAXB: https://jaxb.dev.java.net) for the XML transformation. GeneSigDB provides REST services to retrieve each of the major objects in GeneSigDB (GeneSignature, Gene and Publication) along with all of their ancillary member objects. These objects are returned in either XML or JSON format. The REST request is made over HTTP by creating a URL with an embedded key that will then GET the specified resource. The Accept field of the HTTP request header determines the MIME type (and hence the format) of the response. Further details and examples of how these queries should be constructed are available in the GeneSigDB online documentation.
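The sketch below shows how such a request might be prepared from a client, with the response format chosen through the Accept header. The base URL and the resource path are placeholders invented for illustration and are not the documented GeneSigDB endpoints; the online documentation should be consulted for the real URL patterns.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical base URL; the real service address is given in the documentation.
BASE_URL = "http://example.org/genesigdb/rest"

MIME_TYPES = {"json": "application/json", "xml": "application/xml"}


def build_request(resource: str, key: str, fmt: str = "json") -> Request:
    """Prepare a GET request for one object, selecting the format via Accept.

    `resource` (e.g. "signature") is an assumed path segment, and `key` is the
    identifier embedded in the URL.
    """
    url = f"{BASE_URL}/{resource}/{quote(key, safe='')}"
    return Request(url, headers={"Accept": MIME_TYPES[fmt]}, method="GET")


request = build_request("signature", "11823860-SuppTable2", fmt="xml")
print(request.get_method(), request.full_url, request.get_header("Accept"))
# The request could then be sent with urllib.request.urlopen(request).
```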
In building GeneSigDB, we chose to report the published signatures rather than attempt to re-analyze the original data described in each of the manuscripts we extract from PubMed. Not only is re-analysis technically challenging because of incomplete metadata, but there are other projects that attempt to do this, including Oncomine, Exalt (22), the Gene Expression Atlas (GXA) project (23) and our OncoSurf project, which is focused on finding signatures that predict survival (http://cccb.dfci.harvard.edu/oncosurf/).
The most comprehensive resource is GXA (http://www.ebi.ac.uk/gxa) whose developers have reanalyzed over 5500 gene expression and RNA-sequencing studies to identify expression profiles in 19000 cellular or clinical phenotypes. In partnership with GXA, we are providing ‘link out’ access to GXA, providing visualization of the GeneSigDB gene sets so that users can easily see which genes within a signature are significantly associated with specific phenotypes in GXA.
Although we have been accepting submission to GeneSigDB by email, Release 4.0 includes a web form for signature submission. Users can also use this form to suggest updates to gene signatures currently in GeneSigDB.
Although GeneSigDB is a relatively new database, it has already been used to advance our understanding of cancer and disease in new and interesting ways. Abba and colleagues used GeneSigDB 2.0 to retrieve breast cancer gene signatures (n=42) and to identify the 117 most common genes across those signatures. They found the common genes to be enriched for those associated with response to steroid hormone stimulus and the cell cycle. Their meta-signature of the 42 GeneSigDB gene signatures was capable of predicting overall survival (P<0.0001) and relapse-free survival (P<0.0001) in patients with early-stage breast carcinoma. GeneSigDB has also been used to develop methods for transcription factor binding site analysis (24) and graph theory algorithms (25).
GeneSigDB addresses the important need within the community to standardize gene expression signatures so they can easily be compared to each other and used in GSA. The current release represents an almost 8-fold increase in the number of gene sets from the first version and includes 3515 gene signatures derived from the analysis of cancer, stem cells, immune system function, development and lung disease; at present it is the largest source of cancer-specific gene signatures available. Based on the needs of our users, we have made considerable efforts to increase the functionality of the GeneSigDB website to facilitate mining and analysis of gene signatures.
In the future, we hope to expand GeneSigDB functionality to provide links so that users can analyze connections between genes using our predictive networks application (http://www.predictivenetworks.org) and test whether a gene signature is prognostic or associated with a known SNP using OncoSurf (http://cccb.dfci.harvard.edu/oncosurf).
Although there have been other attempts to catalog gene sets, we believe that GeneSigDB represents a significant advance in both the quantity of gene signature data we have amassed and the quality of the analysis we perform. By standardizing both the way we refer to these gene sets and the manner in which they are mapped to standard formats, we have created an infrastructure that can be scaled and extended to capture the growing number of genomic profiles that are being created and published. In a time when we as a community have become aware of the need for reproducible research, such standardization is essential to assure that the results of genomic studies can be broadly used and replicated in independent analysis. Further, the availability of standardized signatures creates an opportunity for us to more fully leverage the prior knowledge that has been gained by expert analysis of individual studies so that we can more rapidly advance our understanding of the nature of a broad range of human diseases.
This work was supported by the National Institutes of Health National Library of Medicine (1R01 LM010129), National Cancer Institute (1U19 CA148065), Genome Research Institute (1P50 HG004233); Dana-Farber Cancer Institute Women's Cancers Program (to A.C.C); Claudia Adams Barr foundation.
Conflict of interest statement. None declared.
We thank Dr Oliver Hofmann, Prof. Winston Hide and Dr Levi Waldron for their useful suggestions, which have helped us to improve GeneSigDB.