Microarrays and other high-throughput genomic technologies typically produce long lists of potentially interesting genes, which are not always easy to interpret. Recognizing the importance of coordinately expressed sets of genes, our seminal paper (Mootha et al., 2003) introduced Gene Set Enrichment Analysis (GSEA) to discover metabolic pathways altered in human type 2 diabetes mellitus. GSEA and other enrichment analysis tools summarize genomic data as prioritized lists of higher-level biological features. As underscored by a recent survey of 68 enrichment tools, these tools critically depend on ‘backend annotation databases’ (Huang et al., 2009). Typically, such databases focus on a particular domain of knowledge or annotation procedure. For example, Gene Ontology (GO) (Ashburner et al., 2000) provides a hierarchy of controlled terms for describing individual gene products, while TRANSFAC (Matys et al., 2006) stores information about transcription factor binding sites. A growing number of databases obtain sets from gene expression signatures reported in the literature. These include SignatureDB (Shaffer et al., 2006), GeneSigDB (Culhane et al., 2009), CCancer (Dietmann et al., 2010) and L2L and LOLA (Cahan et al., 2007).
The Molecular Signatures Database (MSigDB) differs from these resources in several respects. (i) MSigDB is explicitly designed to provide gene sets for enrichment analysis methods. As such, it is directly integrated with our GSEA software (Subramanian et al., 2005). (ii) MSigDB covers a wider and more diverse range of gene set sources and types. These include signatures extracted from original research publications, as well as entire collections of sets derived from specialized resources such as GO, KEGG (Kanehisa and Goto, 2000), TRANSFAC and L2L. (iii) MSigDB gene sets are acquired both through manual curation and by automatic computational means, whereas other databases emphasize only one of these approaches. (iv) Finally, MSigDB contains the largest number of gene sets overall.
The initial version of MSigDB, released in 2005 with the GSEA software, contained 1325 sets. By comparison, MSigDB 3.0, released in September 2010, includes 6769 sets and a richer set of annotations. Here, we describe the MSigDB 3.0 gene sets and the accompanying online resource in more detail.