|Home | About | Journals | Submit | Contact Us | Français|
High content studies that profile mouse and human embryonic stem cells (m/hESCs) using various genome-wide technologies such as transcriptomics and proteomics are constantly being published. However, efforts to integrate such data to obtain a global view of the molecular circuitry in m/hESCs are lagging behind. Here, we present an m/hESC-centered database called Embryonic Stem Cell Atlas from Pluripotency Evidence integrating data from many recent diverse high-throughput studies including chromatin immunoprecipitation followed by deep sequencing, genome-wide inhibitory RNA screens, gene expression microarrays or RNA-seq after knockdown (KD) or overexpression of critical factors, immunoprecipitation followed by mass spectrometry proteomics and phosphoproteomics. The database provides web-based interactive search and visualization tools that can be used to build subnetworks and to identify known and novel regulatory interactions across various regulatory layers. The web-interface also includes tools to predict the effects of combinatorial KDs by additive effects controlled by sliders, or through simulation software implemented in MATLAB. Overall, the Embryonic Stem Cell Atlas from Pluripotency Evidence database is a comprehensive resource for the stem cell systems biology community.
Database URL: http://www.maayanlab.net/ESCAPE
Embryonic stem cells (ESCs) are pluripotent cells characterized by their capability to self-renew and differentiate into all adult cell types. Recent efforts in systematically profiling ESCs have yielded a wealth of high-throughput data. High-throughput technologies including gene expression microarrays, RNA sequencing, chromatin immunoprecipitation followed by deep sequencing (ChIP-chip/seq), genome-wide inhibitory RNA (RNAi) screens, immunoprecipitation followed by mass spectrometry (IP-MS) proteomics and phosphoproteomics, as well as other emerging technologies have been applied to profile the same cell type by many laboratories across the world in the past decade. Several databases and tools have been published to facilitate the integration of such data (1–9), and such efforts pave the way toward an in silico reconstruction of the gene and protein regulatory networks that regulate self-renewal and pluripotency of these important cells. For example, Plurinet (2), FunGenES (4), StemBase (5), SyStemmCell (10), iScMiD (9) and PluriNetWork (1) incorporate stem cell data from several studies and provide web-based interfaces for data query and visualization. However, in general, these databases contain information from a single regulatory layer, mostly transcriptome measurements, and thus overlook other important layers as well as cross-layer interactions. To address the need for further data integration in the field, we constructed a more inclusive database called Embryonic Stem Cell Atlas from Pluripotency Evidence (ESCAPE). This database integrates numerous additional types of data ranging from epigenetics, transcriptomics, to proteomics and phosphoproteomics. These data sets are processed into gene lists, gene–gene and protein–protein interactions, and data tables for easy download and manipulation. In addition, a rich-content web-based application has been developed to enable users to interact with the various types of data in the ESCAPE database. These tools enable users to construct subnetworks, perform enrichment analyses visualized on a canvas and predict lineage specification based on in silico gene KDs or overexpressions.
The ESCAPE database provides web-based user interface to allow easy browsing and querying. From the Browse page of the web interface, users can click on one of the tables listed on the left, and then the contents of the selected table are displayed in the center of the page. The contents of the table can be sorted by clicking the name of the column. In addition, information about the methods used to generate the table and the number of entries are displayed above each table. There are two ways to search the ESCAPE database: (i) a general search for a gene using the search bar displayed on top of any web page of the ESCAPE web interface or (ii) a detailed search within a selected table. The detailed search is provided under the Browse section of the website. In the case of looking for a specific gene name using the global search, a list of all the tables where the gene appears is displayed in the search results page, and direct links to the table are provided. In the second case when searching within a specific table, more complicated queries can be created. For each column of the table, there is a possibility to choose an operator on the column, for example, equal or not equal. In addition, logical operators are provided to select specific records using the operators AND or OR. For instance, if you are searching for all the interactions where NANOG, ESRRB and SOX2 are the source genes, and the interactions are upregulation, and the experiment type is GOF, the parameters are set as follows:
The query will be automatically converted to the SQL statement: ‘SELECT * FROM logof WHERE (sourceName=“srrb” OR sourceName=“Nanog” OR sourceName=“Sox2”) AND (relationship=“1”) AND (perturbType=“GOF”)’. The results will be formatted and displayed in a results page.
All the tables of the ESCAPE database can be freely downloaded from the Download page of the website. The tables are provided as either flat tab-separated text files or as mySQL files.
The web interface provides a subnetwork construction functionality to facilitate connecting genes/proteins of interest using the various types of gene–gene interaction networks from the ESCAPE database and a user provided list of input gene IDs (Figure 3). The tool allows users to construct a subnetwork from a list of seed genes. The links that establish the connections within the subnetwork are determined by the background knowledge networks selected from various gene–gene interaction tables or a combination of them. Interactions in these subnetworks can be from: (i) ChIP-chip/seq, (ii) protein–protein interactions or (iii) LOF/GOF evidence. First, the user inputs a seed gene list in Entrez gene symbol format. Then, the user chooses which background networks to use to connect the seed genes. The program uses the interactions from these networks to find connections between the input seed gene list using the shortest path algorithm. The user can also select the path length between seed nodes. The default path length is set to two, or in other words one intermediate node. The program can filter interactions based on user defined parameters of minimum number of references per interaction or by maximum number of interactions per reference. Furthermore, the intermediates are ranked by significance of specificity to interact with the seed nodes as implemented by our software tool Genes2Networks (15). The resultant subnetwork is visualized using the Flash-based interactive network viewer Cytoscape Web (11) that is embedded within the web page. Based on the various interaction types, edges are colored by the various three possible types of interactions. Additionally, the output sub-networks are made available for download in PNG, SVG, PDF, XGMML, GRAPHML or SIF formats. This provides compatibility with other network visualization software such as Cytoscape (16) and yED (http://www.yworks.com/en/products_yed_about.html).
Another function of the ESCAPE web interface is the ability to perform enrichment analyses (Figure 4A). The enrichment analysis tool performs gene list enrichment analysis using the various experimental modalities that produced gene lists. These include candidate genes from RNAi screens, protein lists from IP-MS pull-downs, genes differentially expressed after KD or overexpression, and target genes for TFs and histone modifications as determined by ChIP-seq/chip. In this web application portion of the site, users can query their own gene lists for overlap with gene lists from the ESCAPE database or analyze their gene list with another external gene list enrichment analysis tool called Enrichr (14). On the left, users can cut and paste lists of Entrez gene symbols and then press Submit to perform the enrichment analysis. In the middle, most of the lists from the ESCAPE database are visualized as a canvas. Each square represents a list. The color indicates the experiment type, and the brightness indicates the level of local similarity among the lists. We use simulated annealing to arrange the lists from the ESCAPE database by their gene content similarity using the Sets2Networks algorithm (17). The enriched terms appear as circles on top of the colored squares representing the gene lists from the ESCAPE database on the canvas: the brighter the circle, the more significant the overlap with the input list. The results are also available in a table with the associated p-values on the right. To compute statistical enrichment, the Fisher exact test is implemented. The resulting lists of enriched experiments only show the enriched terms determined by a cutoff threshold P-value of P < 0.05.
We created two examples to demonstrate how the enrichment analysis with the canvas visualization can be informative for obtaining new insights. We took two lists of genes that when knocked out in mice are causing the phenotypes of ‘embryonic growth arrest’ and ‘abnormal kidney physiology’ based on the MGI-MP ontology (18) terms 1730 and 2136 respectively. The enrichment results for ‘embryonic growth arrest’ show that the enriched terms are clustered in few specific areas on the grid (Figure 4B). The clustering of enriched terms is clearly not random. The input genes contain H3K36ME targets that are also Oct4 interacting proteins. Interestingly, there is also high overlap with TCFC2L1 interacting proteins as determined by proteomics and target genes of TCFC2L1 as determined by ChIP-seq. The enrichment results for ‘abnormal kidney morphology’ are all clustered in the same area, which mostly represents the PRC2 complex members, known to suppress the expression of genes required for terminal differentiation, including those critical genes for the maintenance of kidney morphology (19) (Figure 4C). Overall, such analyses can be used to link relevant phenotypes to specific regulatory mechanisms in embryonic stem cells, as well as help experimental stem cell biologists who perform high throughput experiments to place their results in context of prior studies.
The next function of the ESCAPE web interface is a tool to predict lineage-propensity differentiation outcome on single or combinatorial KD of multiple pluripotency factors (Figure 5). The tool considers the target genes of knocked-down pluripotency factors and predicts the additive expression of lineage markers based on the combinatorial additive predicted levels of these factors. Specifically, effects of gene KDs on lineage commitment are dynamically computed by enrichment analysis for targets of knocked-down factors against lists of lineage-specific marker genes using the Fisher’s exact test. Targets of KD factors were first identified from the LOF/GOF table, and lineage specific components were assembled manually from literature as follows: (i) Trophectoderm: the gene expression data set (GSE11523) reported trophectoderm-like state after depletion of Oct4/Pou5f1 in mESCs. Gene expression was profiled at six time points. Genes were sorted according to average fold change of expression on differentiation related to time point 0. The top 5% of genes with an average fold change of at least two and with a monotone increase in expression at each time point upon differentiation were considered as trophectoderm markers. (ii) Primitive endoderm: the same set of experiments and data processing as described for (i) were conduct after overexpression of Gata6 in mESCs. (iii) Neuroectoderm: the gene expression dataset (GSE12982) isolated Sox1-GFP positive cells from mESCs where Ezh1 and Ezh2 were knocked-down. Genes were sorted according to fold change increase in expression comparing differentiated cells to mESCs. The top 10% genes with a monotonic increase and fold change of at least 1.5 were considered as neuroectoderm markers. (iv) Mesendoderm: the same set of experiments and data processing as described for (iii) were conduct after isolation of T-GFP positive cells (T stands for the gene brachyury). By sliding the bars on the web interface, users can choose the components and level of knockdown of 14 pluripotency factors. Corresponding positive and negative targets of each specific pluripotency factor were extracted from the LOF table within the ESCAPE database. As a result, the enrichment P-values reflecting the significance of differentiation potential toward each specific lineage on knockdown(s) are displayed on top. In addition, the up and down genes are provided in two text boxes below the lineage prediction display. Such lists can be further analyzed using the external enrichment analysis tool Enrichr (14) or any other tool available within ESCAPE or beyond.
The aggregated data in ESCAPE can be used to elucidate functional associations among pluripotency and differentiation components across various regulatory layers. Specifically, to demonstrate the usefulness of the compiled ESCAPE database to dissect the pluripotency machinery, we examined functional correlations among 15 pluripotency factors and 15 differentiation markers selected (20). Heatmaps of degree of similarity were constructed (Figure 6) where we scored pair-wise similarity distance between the components as follows: (i) Shared targets from the ChIP-chip/seq experiments; (ii) Co-expression similarities based on global mRNA measurements after pluripotency TF LOF or GOF; (iii) Histone modification target gene similarities analyzed in mESCs and differentiated cells; (iv) Protein co-occurrence measured after pull-downs of pluripotency TFs followed by MS proteomics; (v) Similarities of miR targets predicted computationally and limited to miRs preferentially expressed in mESCs; and (vi) co-expression similarities during embryoid body differentiation. Additionally, a multi-layer heatmap integrating all six layers was created. As expected, pluripotency regulators and differentiation markers generally cluster into two separate groups. A previous attempt to generate a heatmap for 13 pluripotency regulators based solely based on genomic target binding similarities resulted in slightly different clusters (21). Here, Oct4/Pou5f1 shares greatest functional similarity with Sall4 and Zfp42 (also called Rex1) (Figure 2G). This is consistent with a report that Sall4 and Oct4/Pou5f1 form a regulatory feedback loop (22). In addition, Rex1 is a known target of Oct4/Pou5f1. However, it is interesting that Rex1 is so closely associated with Oct4/Pou5f1 across several layers. Surprisingly, Gli2, a known ectoderm marker, is highly correlated with pluripotency components across numerous layers, suggesting a function in the pluripotent state for this gene. Gli2 is a downstream TF effector of Hedgehog signaling (23), and thus, potentially linking this pathway to pluripotency. Binding of Gli1 and Gli2 to the Nanog regulatory sequences in neural stem cells has been reported (24). Based on a recent genome-wide RNAi screen, another member of the Gli family, Gli3, was among the hits of genes involved in mESC early differentiation (25).
ESCAPE is a freely available online resource that integrates current genome-wide data encompassing several regulatory layers and data types. Through the web interface, the data sets can be browsed, searched and downloaded. Additionally, a set of web-based tools were implemented to interact with the information in the database. Given a set of seed genes, users can perform network expansion, and upstream regulatory factors and downstream targets enrichment analyses, as well as perform combinatorial lineage predictions. Organizing the experimental data into a coherent and interactive framework can potentially enable better utilization of such data for systems-level analyses and construction of dynamical models. ESCAPE has been proven useful already to at least one research study (26).
This work was supported by NIH grants R01GM098316-01A1, P50GM071558-03, R01DK088541-01A1 (A.M.) and R01GM078465-03, RC1GM091176-01 (I.R.L.). Funding for open access charge: Irma T. Hirschl Career Scientist Award (A.M.).