Accurately curated and annotated gene sets have emerged as essential tools for the analysis of large, complex biological datasets. Gene set analysis (GSA) is widely used in the analysis and interpretation of gene expression profiling data (1–4), evolutionary relationships (5), genomic associations, including QTL analysis (6), genotyping (7) and SNP chips (8), and even for cross-platform integration of genomics data (9). GSA aims to find sets of genes that collectively distinguish two phenotypes, even if the genes in the set are not significantly different when tested individually. This reflects the fact that genes within the cell function as members of complex networks and pathways, often with multiple, overlapping functions. As a result, gene-by-gene comparisons may miss biologically important connections that are seen only when related genes are assessed collectively.
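The idea that a set can be significant when none of its members is can be illustrated with a minimal sketch. It combines per-gene z-scores using Stouffer's method, which is chosen here only for simplicity and is not the method of any study cited above. The gene names and z-scores are hypothetical, and the calculation assumes the per-gene scores are independent, which rarely holds for real expression data (established GSA methods typically use permutation to account for this).

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))


def stouffer_set_z(z_scores):
    """Combine per-gene z-scores into one set-level z-score (Stouffer's method).

    Assumes the per-gene scores are independent; this is an illustrative
    simplification, not a property of real gene expression data.
    """
    return sum(z_scores) / math.sqrt(len(z_scores))


# Hypothetical per-gene z-scores for one gene set (phenotype A vs. phenotype B).
gene_z = {
    "GENE_A": 1.2, "GENE_B": 1.4, "GENE_C": 1.1, "GENE_D": 1.5,
    "GENE_E": 1.3, "GENE_F": 1.2, "GENE_G": 1.4, "GENE_H": 1.3,
}

# Gene-by-gene view: p-value for each gene tested on its own.
per_gene_p = {gene: two_sided_p(z) for gene, z in gene_z.items()}

# Set-level view: one p-value for the whole set.
set_z = stouffer_set_z(list(gene_z.values()))
set_p = two_sided_p(set_z)

for gene, p in per_gene_p.items():
    print(f"{gene}: p = {p:.3f}")
print(f"set-level: z = {set_z:.2f}, p = {set_p:.2g}")
```

In this toy example every gene shifts modestly in the same direction, so no single gene reaches p < 0.05, while the combined statistic for the set does.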
Gene sets have also become invaluable tools for characterizing and distinguishing phenotypic states. In breast cancer, for example, several gene expression signatures have been developed as commercial diagnostic assays (10) and new methods are being developed that combine the predictive strength of multiple gene signatures to increase their prognostic power (11).
Gene set resources can be broadly divided into those that assign a gene to collections based on ‘known’ gene or protein interactions or functional activity and those that include gene lists from high-throughput experimental assays. Functional and pathway databases such as Gene Ontology (GO), KEGG and Reactome capture published descriptions of cellular pathways and gene functions (12), including, in the case of GO, functional predictions inferred from orthologous sequences (13). However, these resources are incomplete because the functions of all genes in the genome have not yet been comprehensively cataloged (13).
High-throughput experiments, such as microarray expression profiling and RNA-seq, have also produced large numbers of potentially informative gene lists. Most genomics papers present one or more gene signatures that reportedly correlate with experimental phenotypes. While there has been some controversy over the value of individual gene sets, because many fail to fully replicate in independent data sets, the analysis of the collected gene lists defined for similar phenotypes has been demonstrated to provide meaningful biological insight (14).
Despite tremendous interest in using gene signatures, public repositories such as GEO and ArrayExpress (15, 16) store primary gene expression data but fail to capture the gene sets that are the end product of published analyses. Without a systematic way of reporting these, the gene sets often appear only in published tables or figures or in supplementary materials hosted on the author's or the journal's website. Because there are no accepted standards for reporting gene sets, they often appear with non-standard gene identifiers, making comparison to other lists, or even to the original data, a significant challenge. As a result, gene sets from published research studies are often inaccessible to automated computational analysis.
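The effect of non-standard identifiers on list comparison can be sketched in a few lines. The example below is an illustration only and does not describe the GeneSigDB mapping protocol. The two signatures are hypothetical, the three-entry alias table is a hand-written stand-in for a real identifier mapping resource, and the helper names `normalize` and `jaccard` are introduced here for the example.

```python
def normalize(genes, alias_to_symbol):
    """Map each identifier to a standard symbol via an alias table.

    Identifiers are upper-cased and stripped first; anything missing from
    the table is kept as-is (an assumption made for this sketch).
    """
    cleaned = (g.strip().upper() for g in genes)
    return {alias_to_symbol.get(g, g) for g in cleaned}


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Illustrative alias table (alias -> standard gene symbol), not a full resource.
ALIASES = {"HER2": "ERBB2", "P53": "TP53", "KI67": "MKI67"}

# Two hypothetical published signatures that use different naming conventions.
sig_one = ["HER2", "p53", "ESR1"]
sig_two = ["ERBB2", "TP53", "MKI67"]

# Comparing the lists as published versus after identifier mapping.
raw_overlap = jaccard(set(sig_one), set(sig_two))
mapped_overlap = jaccard(normalize(sig_one, ALIASES), normalize(sig_two, ALIASES))

print(f"overlap as published: {raw_overlap:.2f}")
print(f"overlap after mapping: {mapped_overlap:.2f}")
```

Compared as published, the two lists appear to share no genes; after mapping to common symbols, two of the four distinct genes are shared.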
In August 2009, we created GeneSigDB (17) as a repository for gene sets that had been systematically collected and manually curated from published articles indexed by PubMed. Our approach in building GeneSigDB was to capture gene signatures from the literature as published, to map them to standard identifiers using transparent, reproducible protocols and to freely provide these to the research community together with some elementary analytical tools. Since its launch, GeneSigDB has received 7918 web hits, with 4404 in 2010 and 3354 so far this year, suggesting that this resource is of value to the biomedical research community.