Microarray gene expression profiling and other high-throughput technologies have been applied to investigate and classify thousands of biological conditions. Most studies report one or more gene signatures: lists of genes that are differentially regulated between the cellular states under study, for example in a cell or tissue type, in response to treatment or at a specific time point. The value of these experimentally derived gene signatures often extends beyond their initial publication. A range of applications has been developed to use them, including Gene Set Enrichment Analysis (GSEA), which analyzes gene expression data to look for groups of genes (or gene lists) over-represented among statistically significant genes from a particular experiment (1–3). In breast cancer, a number of experimentally derived gene expression signatures, including Mammaprint and Oncotype DX, have been developed into commercial diagnostic assays (4) and are being validated in large-scale clinical trials (5). Gene signatures are analyzed and validated on new gene expression data (7), and novel computational methods are being developed for meta-analysis of gene signatures. Finally, because published experimentally derived gene signatures are typically selected to differentiate between classes of samples, meta-analysis of multiple gene lists may provide deeper insight into the biological mechanisms underlying a wide range of processes.
While public databases such as ArrayExpress and GEO have been developed to capture gene expression data, there is no existing resource to capture the valuable end-product of the analysis of those data: the gene lists that the analyses produce. Instead, these gene lists are often included in tables or figures embedded in publications or included as supplementary material on the journal’s or the author’s website, making them generally inaccessible to automated computational analysis. If one is able to access these lists, one often finds that they are reported using non-standard gene identifiers, making comparison to other lists, or often to the original data, a significant challenge. To be of maximal value, gene signatures should be available through a resource that provides gene lists in a common standard format that is computationally accessible. In addition, it should provide the original gene signature table as transcribed from the publication. Reproduction of a computationally accessible original transcribed gene signature table may provide additional signature meta-data, such as information and annotation about the experimental conditions and the criteria used in generating gene lists from the data (such as t-statistic scores or other ranking information), which is useful in some gene set analyses.
There have been a number of attempts to collect experimentally derived gene signatures; however, these do not generally retain the original transcribed gene signature data from the publication. The largest collection of gene signatures is available in MSigDB (3). MSigDB (v2.5) provides gene signatures as annotated lists of gene symbols and has curated gene signatures from 344 publications. Curators and users can submit gene lists as a two-column table containing a gene identifier and its gene symbol. However, many MSigDB gene signatures contain only gene symbols, thereby limiting their future re-annotation. The Lists of Lists-Annotated (LOLA; http://www.lola.gwu.edu) database (9) contains 47 gene lists (v1.2, October 2009), and its gene list input format is limited to EntrezGene or Affymetrix probeset identifiers. SignatureDB (http://lymphochip.nih.gov/signaturedb/) provides 147 published and unpublished gene signatures related to haematopoietic cells. Cancer Genes (http://cbio.mskcc.org/cancergenes) contains 26 cancer gene signatures, of which 4 are from the published literature. Because these resources do not retain the original gene list identifiers, the original signatures cannot be remapped as better genome assemblies and annotation become available (12–14). Possibly due to the limitations of the gene signature input formats, little or no information is frequently provided on the process used to map gene signatures to the identifiers that are reported.
To address these issues and facilitate gene list meta-analysis, we have systematically collected published gene signatures from publications indexed in PubMed, mapped them to a common, standardized format, and made them available in GeneSigDB (http://compbio.dfci.harvard.edu/genesigdb). We do not collect or re-analyze the gene expression data, as this is being done by other projects, including the Gene Expression Atlas (http://www.ebi.ac.uk/gxa). For the GeneSigDB initial release, we reviewed over 850 published articles and manually transcribed 575 experimental cancer gene signatures from tables, figures or Supplementary Data from 319 of those papers. Signatures were curated, annotated, and mapped to the genome, providing 560 standardized gene lists. GeneSigDB provides both the original transcribed gene signature and the gene signature in a standard format, and we publish a mapping-trace showing how each gene identifier in the original signature was re-annotated. The GeneSigDB web portal allows users to search for gene signatures and provides tools to compare gene signatures, to convert gene lists to common gene identifiers and to download gene signatures in over 30 different gene identifier formats.
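The idea of re-annotating a signature while keeping a mapping-trace, and then comparing signatures in a common identifier space, can be illustrated with a minimal Python sketch. This is not the GeneSigDB pipeline: the lookup table `ID_TO_SYMBOL`, the `TraceEntry` record, and the functions `standardize` and `jaccard` are hypothetical names introduced here for illustration, and a real mapping would draw on a genome annotation resource rather than a hand-written dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical lookup table (assumption, for illustration only):
# original identifier as reported in a paper -> standard gene symbol.
ID_TO_SYMBOL: Dict[str, str] = {
    "7157": "TP53",   # EntrezGene ID
    "p53": "TP53",    # alias
    "TP53": "TP53",
    "672": "BRCA1",   # EntrezGene ID
    "BRCA1": "BRCA1",
    "2064": "ERBB2",  # EntrezGene ID
    "HER2": "ERBB2",  # alias
}


@dataclass
class TraceEntry:
    """One row of the mapping trace: what was reported and what it became."""
    original: str
    mapped: Optional[str]
    status: str  # "mapped" or "unmapped"


def standardize(
    signature: Iterable[str], lookup: Dict[str, str]
) -> Tuple[Set[str], List[TraceEntry]]:
    """Map reported identifiers to standard symbols, keeping the originals."""
    genes: Set[str] = set()
    trace: List[TraceEntry] = []
    for raw in signature:
        key = raw.strip()
        mapped = lookup.get(key)
        trace.append(TraceEntry(key, mapped, "mapped" if mapped else "unmapped"))
        if mapped:
            genes.add(mapped)
    return genes, trace


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two standardized gene lists (intersection over union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    genes_a, trace_a = standardize(["7157", "BRCA1", "LOC999"], ID_TO_SYMBOL)
    genes_b, trace_b = standardize(["p53", "HER2"], ID_TO_SYMBOL)
    for entry in trace_a + trace_b:
        print(entry.original, "->", entry.mapped, f"({entry.status})")
    print("Jaccard overlap:", round(jaccard(genes_a, genes_b), 3))
```

Because the trace keeps every original identifier, including those that fail to map, a signature can be re-mapped later against an updated lookup table without returning to the publication.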