Creating the gene-set libraries
Enrichr contains 35 gene-set libraries; some are borrowed from other tools, while many others are newly created and available only in Enrichr. The gene-set libraries provided by Enrichr are divided into six categories: transcription, pathways, ontologies, diseases/drugs, cell types and miscellaneous. The following is a description of each library and how it was created:
The transcription category provides six gene-set libraries that attempt to link differentially expressed genes with the transcriptional machinery. Four of these libraries identify transcription factors whose target genes are enriched in the input list: 1) ChEA [10]; 2) position weight matrices (PWMs) from TRANSFAC [11] and JASPAR [12]; 3) target genes generated from PWMs downloaded from the UCSC genome browser [13]; and 4) transcription factor targets extracted from the ENCODE project [14]. The two other gene-set libraries in the transcription category are gene sets associated with: 5) histone modifications extracted from the Roadmap Epigenomics Project [16]; and 6) microRNA targets computationally predicted by TargetScan [17].
1. The ChIP-x Enrichment Analysis (ChEA) database [10] is our own resource for storing putative targets of transcription factors, extracted from publications that report experiments profiling transcription factor binding to DNA in mammalian cells. The database is already formatted as a gene-set library in which the functional terms are the transcription factors profiled in each study together with the PubMed identifier (PMID) of the paper from which the genes were extracted. The ChEA gene-set library used in Enrichr is an updated version of the database that contains more than twice as many entries as the originally published version [10].
2. PWMs from TRANSFAC and JASPAR were used to scan the promoters of all human genes in the region from −2000 to +500 relative to the transcription start site (TSS). We retained only the 100% matches to the consensus sequences to call an interaction between a factor and a target gene. This gene-set library was created for a tool we previously published called Expression2Kinases [18].
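The exact-match rule described above can be illustrated with a minimal sketch. It is not the pipeline used for Expression2Kinases: it assumes each factor is represented by a plain consensus string (no PWM scoring and no IUPAC ambiguity codes), that promoter sequences for the −2000 to +500 window have already been extracted, and that both strands are scanned. The names `reverse_complement` and `build_library` are hypothetical.

```python
# Hypothetical sketch: call a factor/target interaction only when the
# factor's consensus sequence occurs verbatim in the gene's promoter.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_library(promoters, consensus_by_factor):
    """Map each factor to the set of genes whose promoter contains its consensus.

    promoters: dict gene -> promoter sequence (assumed to cover -2000..+500)
    consensus_by_factor: dict factor -> consensus sequence (plain A/C/G/T)
    Scanning both strands is an assumption, not stated in the text.
    """
    library = {}
    for factor, consensus in consensus_by_factor.items():
        rc = reverse_complement(consensus)
        targets = {
            gene
            for gene, seq in promoters.items()
            if consensus in seq or rc in seq
        }
        if targets:  # skip factors with no matching promoter
            library[factor] = targets
    return library
```

For example, a factor with consensus `CACGTG` would be linked to every gene whose promoter string contains that hexamer, and to no others.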
3. Transcription factor target genes inferred from PWMs for the human genome were downloaded from the UCSC Genome Browser [13] FTP site, which contains many resources for gene and sequence annotations. We converted this file into a gene-set library and included it in Enrichr because it produces different results from the PWM-based method described above for identifying transcription factor/target interactions.
4. The ENCODE transcription factor gene-set library is the fourth transcription factor/target gene-set library. We processed the newly published data from the Encyclopedia of DNA Elements (ENCODE) project [14]. Using the aligned files for all 646 experiments that profiled transcription factors in mammalian cells, we identified the peaks using the MACS software [19] and then identified the genes targeted by the factors using our own custom processing. We sorted the peaks for each experiment by distance to the transcription start site (TSS) and retained the top 2000 target genes for each experiment.
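The custom processing is not described in detail, so the following is only one plausible reading of the ranking step. It assumes that each peak is reduced to a summit coordinate, that a peak is assigned to the gene with the nearest TSS on the same chromosome, and that a gene hit by several peaks keeps its smallest distance. The function name `top_target_genes` and the input layout are hypothetical.

```python
# Hypothetical sketch: rank genes by the distance of their closest peak
# to the TSS and keep the top N genes for one ChIP-seq experiment.


def top_target_genes(peak_summits, tss_by_gene, max_genes=2000):
    """Return up to max_genes gene names, closest peak first.

    peak_summits: iterable of (chromosome, position) pairs
    tss_by_gene: dict gene -> (chromosome, tss_position)
    """
    best_distance = {}
    for chrom, summit in peak_summits:
        # Distance from this peak to every TSS on the same chromosome.
        candidates = [
            (abs(summit - tss), gene)
            for gene, (gene_chrom, tss) in tss_by_gene.items()
            if gene_chrom == chrom
        ]
        if not candidates:
            continue
        distance, gene = min(candidates)  # nearest gene for this peak
        if gene not in best_distance or distance < best_distance[gene]:
            best_distance[gene] = distance
    # Sort by distance, breaking ties by gene name for reproducibility.
    ranked = sorted(best_distance, key=lambda g: (best_distance[g], g))
    return ranked[:max_genes]
```

The linear scan over all genes for each peak keeps the sketch short; a real implementation would more likely use sorted TSS positions or an interval index.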
5. The histone modification gene-set library was created by processing experiments from the NIH Roadmap Epigenomics project [20]. These experiments were conducted on various types of human cell lines with antibodies targeting over 30 different histone modification marks. ChIP-seq datasets from the Roadmap Epigenomics project deposited in the GEO database were analyzed and converted to gene sets using the SICER software [21]. Previous studies [22] have indicated that the use of a control sample substantially reduces DNA shearing biases and sequencing artifacts; therefore, for each experiment, an input control sample was matched according to the description in GEO. ChIP-seq experiments without a matched input control were not included. The resulting gene-set library contains 27 types of histone modifications for 64 human cell lines from various tissue origins.
6. The microRNA gene-set library was created by processing data from the TargetScan online database [23] and was borrowed from our previous publication, Lists2Networks [24].
The pathways category includes gene-set libraries from well-known pathway databases such as WikiPathways [25], KEGG [26], BioCarta, and Reactome [27], as well as five gene-set libraries we created from our own resources: kinase enrichment analysis (KEA) [28] for kinases and their known substrates; protein-protein interaction hubs [18]; CORUM [29]; complexes from a recent high-throughput IP-MS study [30]; and a manually assembled gene-set library created by extracting lists of phosphoproteins from SILAC phosphoproteomics publications [31].
The pathway associated gene-set libraries were created from each of the above databases by converting members of each pathway from each pathway database to a list of human genes.
5. The Kinase Enrichment Analysis (KEA) gene-set library contains human or mouse kinases and their known substrates collected from literature reports as provided by six kinase-substrate databases: HPRD [32], PhosphoSite [33], PhosphoPoint [34], Phospho.Elm [35], NetworKIN [36], and MINT [37].
6. The protein-protein interaction hubs gene-set library is made from an updated version of a human protein-protein interaction network that we continually update and that was originally published as part of the program Expression2Kinases [18]. From this network, we extracted the proteins with 120 or more interactions. These proteins are the terms in the library, whereas their direct protein interactors are the genes in each gene set.
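The hub extraction can be sketched as follows, assuming the network is available as a list of undirected protein pairs and that an "interaction" is counted once per distinct partner (duplicate pairs and self-interactions are ignored). The function name `hub_library` is hypothetical.

```python
from collections import defaultdict

# Hypothetical sketch: proteins with at least `min_degree` distinct
# interaction partners become library terms; their partners form the set.


def hub_library(interactions, min_degree=120):
    """Return dict hub_protein -> set of direct interactors.

    interactions: iterable of (protein_a, protein_b) undirected pairs
    """
    neighbors = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # assumption: self-interactions do not count
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {
        protein: partners
        for protein, partners in neighbors.items()
        if len(partners) >= min_degree
    }
```

Calling `hub_library(edges)` with the default threshold mirrors the cutoff of 120 interactions stated in the text; a smaller `min_degree` is convenient for testing on toy networks.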
The next two gene-set libraries in the pathways category are protein complexes. The first library was created from a recent study that profiled nuclear complexes in human breast cancer cell lines by applying over 3000 immunoprecipitation followed by mass spectrometry (IP-MS) experiments using over 1000 different antibodies [30]. The second complexes gene-set library was created from the mammalian complexes database, CORUM [29].
9. The SILAC phosphoproteomics gene set library was created by processing tables from the supporting materials of SILAC phosphoproteomics studies. From each supporting table, we extracted lists of up and down proteins without applying any cutoffs. Protein IDs were converted to mammalian gene IDs when necessary using online gene symbol conversion tools. A total of 84 gene lists were extracted from such studies.
The ontology category contains gene-set libraries created from the three gene ontology trees [6] and from the knockout mouse phenotypes ontology developed by the Jackson Lab from their MGI-MP browser [38]. To create such gene-set libraries, we “cut” the tree at either the third or fourth level and created a gene set from the terms and their associated genes downstream of the cut. The details about creating the Gene Ontology gene-set libraries are provided in our previous publication, Lists2Networks [24].
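The "cut" can be illustrated with a small sketch. It assumes the ontology is given as a parent-to-children map plus a term-to-genes annotation map, that the root counts as level 0, and that every term found exactly `level` steps below the root becomes one library term whose gene set is the union of its own genes and those of all its descendants. These conventions and the function names are assumptions; the procedure actually used is described in the Lists2Networks publication.

```python
# Hypothetical sketch: cut an ontology at a fixed depth and pool the genes
# annotated to each cut term and to everything below it.


def terms_at_level(children, root, level):
    """Return the terms reachable from root by a path of exactly `level` edges."""
    frontier = {root}
    for _ in range(level):
        frontier = {c for term in frontier for c in children.get(term, ())}
    return frontier


def genes_below(children, annotations, term):
    """Collect genes annotated to `term` or to any of its descendants."""
    genes, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # ontologies can be DAGs, so a term may be reached twice
        seen.add(current)
        genes |= set(annotations.get(current, ()))
        stack.extend(children.get(current, ()))
    return genes


def cut_library(children, annotations, root, level):
    """Return dict term -> pooled gene set for every term at the cut level."""
    return {
        term: genes_below(children, annotations, term)
        for term in terms_at_level(children, root, level)
    }
```

Cutting closer to the root yields fewer, larger and more general gene sets; cutting deeper yields more numerous and more specific ones.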
The disease/drugs category has gene-set libraries created from the Connectivity Map database [39], GeneSigDB [40], MSigDB [5], OMIM [41], and VirusMINT [42].
The Connectivity Map (CMAP) database [39] contains over 6,000 Affymetrix microarray gene expression experiments in which human cancer cell lines were treated with over 1,300 drugs, many of them FDA approved, and changes in expression were measured after six hours. The drugs were always used as a single treatment but at varying concentrations. The CMAP database provides the results in a table where genes are listed in rank order based on their level of differential expression compared to the untreated state. From this table, we extracted the top 100 and bottom 100 differentially expressed genes to create two gene-set libraries, one for the up genes and one for the down genes for each condition. Each set is associated with a drug name and the four-digit experiment number from CMAP. This four-digit number can be used to locate the concentration, cell type, and batch.
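As a minimal sketch of this extraction, the function below splits one ranked gene list into an "up" and a "down" gene set. It assumes the list is ordered from most up-regulated to most down-regulated, and the term label format (drug name joined to the experiment number with a hyphen) is a hypothetical choice, as is the function name `cmap_gene_sets`.

```python
# Hypothetical sketch: turn one CMAP ranked list into an up set and a down set.


def cmap_gene_sets(drug, experiment_id, ranked_genes, n=100):
    """Return {"up": {term: genes}, "down": {term: genes}} for one experiment.

    ranked_genes: list ordered from most up- to most down-regulated
                  (this orientation is an assumption)
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    term = f"{drug}-{experiment_id}"  # illustrative label format
    return {
        "up": {term: list(ranked_genes[:n])},
        "down": {term: list(ranked_genes[-n:])},
    }
```

Merging the `"up"` dictionaries across all experiments gives one library and merging the `"down"` dictionaries gives the other. Note that if a ranked list holds fewer than `2 * n` genes the two sets overlap.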
3. The GeneSigDB gene-set library was borrowed from the GeneSigDB database [40]. The database contains gene lists extracted manually from the supporting tables of thousands of publications; most are from cancer-related studies.
The OMIM gene-set library was created directly from the NCBI’s OMIM Morbid Map [41]. We removed diseases with only a few genes and merged diseases with similar names because these are likely to be subtypes of the same disease. In addition, since most diseases have only a few genes, we used our tool, Genes2Networks [43], to create the OMIM expanded gene-set library. We entered the disease genes as the seed list and expanded the list by identifying proteins that directly interact with at least two of the disease gene products; in other words, we searched for paths that connect two disease gene products through one intermediate protein, resulting in a sub-network that connects the disease genes with additional proteins/genes. Each sub-network for each disease was converted to a gene set.
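The expansion rule can be sketched as below. This is not the Genes2Networks implementation; it only illustrates the stated criterion, assuming the background interaction network is available as a map from each protein to the set of its direct interactors. The function name `expand_disease_genes` is hypothetical.

```python
# Hypothetical sketch: add every non-seed protein that directly interacts
# with at least two seed (disease) gene products.


def expand_disease_genes(seed_genes, neighbors, min_links=2):
    """Return the seed genes plus qualifying intermediate proteins.

    seed_genes: iterable of disease gene names
    neighbors: dict protein -> set of direct interactors
    """
    seeds = set(seed_genes)
    intermediates = {
        protein
        for protein, partners in neighbors.items()
        if protein not in seeds and len(set(partners) & seeds) >= min_links
    }
    return seeds | intermediates
```

An intermediate with two or more seed partners lies on a path of length two between a pair of seeds, which is the sub-network described in the text.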
6. The VirusMINT gene-set library was created from the VirusMINT database [42], which is made of literature-extracted protein-protein interactions between viral proteins and human proteins. Each term in the library represents a virus, and the genes/proteins in each set are the host proteins that are known to directly interact with the viral proteins of that virus.
The MSigDB computational and MSigDB oncogenic signature gene-set libraries were borrowed from the MSigDB database from categories C4 and C6 [5]. These gene-set libraries contain modules of genes differentially expressed in various cancers.
The cell type category is made of four gene-set libraries: genes highly expressed in human and mouse tissues extracted from the Mouse and Human Gene Atlases [44], and genes highly expressed in cancer cell lines from the Cancer Cell Line Encyclopedia (CCLE) [45] and NCI-60 [46]. The gene-set libraries in this category were all created similarly. The CCLE dataset was derived from the gene-centric RMA-normalized mRNA expression data from the CCLE site. The Human Gene Atlas and Mouse Gene Atlas datasets were derived from averaged GCRMA-normalized mRNA expression data from the BioGPS site. Finally, the Human NCI60 Cell Lines dataset, while also downloaded from the BioGPS site, was raw and not normalized; hence, it was normalized using quantile normalization.
The downloaded datasets were all in a similar format: a table in which the rows are the genes and the columns are the expression values in the different cells. For each gene, the average and standard deviation of the expression values across all samples were computed. For each gene/term data point, a z-score was calculated based on the row’s average and standard deviation. Duplicate gene probes were merged by selecting the highest absolute z-score. Only genes with an absolute z-score greater than 3 were selected to be part of the gene set for a particular cell, which represents the term.
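The z-score selection can be sketched with the Python standard library. The sketch assumes the expression table has already been normalized, uses the sample standard deviation (the text does not say whether the sample or population form was used), and skips rows with zero variance. The function name `cell_type_library` and the input layout are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical sketch: row-wise z-scores, duplicate probes merged by the
# largest |z|, and genes with |z| > threshold assigned to that cell's set.


def cell_type_library(rows, cell_names, threshold=3.0):
    """Return dict cell -> set of genes with |z| above the threshold.

    rows: iterable of (gene, expression_values), one entry per probe;
          expression_values is aligned with cell_names
    """
    best_z = {}  # (gene, cell) -> z-score with the largest magnitude so far
    for gene, values in rows:
        mu, sigma = mean(values), stdev(values)
        if sigma == 0:
            continue  # a flat row carries no information
        for cell, value in zip(cell_names, values):
            z = (value - mu) / sigma
            key = (gene, cell)
            if key not in best_z or abs(z) > abs(best_z[key]):
                best_z[key] = z
    library = {}
    for (gene, cell), z in best_z.items():
        if abs(z) > threshold:
            library.setdefault(cell, set()).add(gene)
    return library
```

With the sample standard deviation, a single value among n samples can reach a z-score of at most (n − 1)/√n, so at least 11 samples are needed before any gene can exceed the cutoff of 3.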
The miscellaneous category has three gene-set libraries: chromosome location, metabolites, and structural domains. The chromosomal location library is made of human genes belonging to chromosomal segments of the human genome. It is derived from MSigDB [5]. The metabolite library was created from HMDB [47], a database that lists metabolites and the genes associated with them. Finally, the structural domains library was created from the PFAM [48] and InterPro [49] databases, where the terms are structural domains and the genes in each set are those encoding proteins that contain the domain.