Database (Oxford). 2013; 2013: bat045.
Published online 2013 June 21. doi:  10.1093/database/bat045
PMCID: PMC3689438

ESCAPE: database for integrating high-content published data collected from human and mouse embryonic stem cells

Abstract

High content studies that profile mouse and human embryonic stem cells (m/hESCs) using various genome-wide technologies such as transcriptomics and proteomics are constantly being published. However, efforts to integrate such data to obtain a global view of the molecular circuitry in m/hESCs are lagging behind. Here, we present an m/hESC-centered database called Embryonic Stem Cell Atlas from Pluripotency Evidence integrating data from many recent diverse high-throughput studies including chromatin immunoprecipitation followed by deep sequencing, genome-wide inhibitory RNA screens, gene expression microarrays or RNA-seq after knockdown (KD) or overexpression of critical factors, immunoprecipitation followed by mass spectrometry proteomics and phosphoproteomics. The database provides web-based interactive search and visualization tools that can be used to build subnetworks and to identify known and novel regulatory interactions across various regulatory layers. The web-interface also includes tools to predict the effects of combinatorial KDs by additive effects controlled by sliders, or through simulation software implemented in MATLAB. Overall, the Embryonic Stem Cell Atlas from Pluripotency Evidence database is a comprehensive resource for the stem cell systems biology community.

Database URL: http://www.maayanlab.net/ESCAPE

Introduction

Embryonic stem cells (ESCs) are pluripotent cells characterized by their capability to self-renew and differentiate into all adult cell types. Recent efforts in systematically profiling ESCs have yielded a wealth of high-throughput data. High-throughput technologies including gene expression microarrays, RNA sequencing, chromatin immunoprecipitation followed by deep sequencing (ChIP-chip/seq), genome-wide inhibitory RNA (RNAi) screens, immunoprecipitation followed by mass spectrometry (IP-MS) proteomics and phosphoproteomics, as well as other emerging technologies, have been applied to profile the same cell type by many laboratories across the world in the past decade. Several databases and tools have been published to facilitate the integration of such data (1–9), and such efforts pave the way toward an in silico reconstruction of the gene and protein regulatory networks that regulate self-renewal and pluripotency of these important cells. For example, Plurinet (2), FunGenES (4), StemBase (5), SyStemCell (10), iScMiD (9) and PluriNetWork (1) incorporate stem cell data from several studies and provide web-based interfaces for data query and visualization. However, in general, these databases contain information from a single regulatory layer, mostly transcriptome measurements, and thus overlook other important layers as well as cross-layer interactions. To address the need for further data integration in the field, we constructed a more inclusive database called Embryonic Stem Cell Atlas from Pluripotency Evidence (ESCAPE). This database integrates numerous additional types of data, ranging from epigenetics and transcriptomics to proteomics and phosphoproteomics. These data sets are processed into gene lists, gene–gene and protein–protein interactions, and data tables for easy download and manipulation. In addition, a rich-content web-based application has been developed to enable users to interact with the various types of data in the ESCAPE database.
These tools enable users to construct subnetworks, perform enrichment analyses visualized on a canvas and predict lineage specification based on in silico gene KDs or overexpressions.

Results

A comprehensive embryonic stem cell database constructed from published high-throughput studies

Results from numerous published mouse and human embryonic stem cell (m/hESC) genome-wide profiling studies, as well as loss-of-function/gain-of-function (LOF/GOF) studies, were systematically collected and processed to construct the ESCAPE database. Most data sets are from mouse ESCs, with several from human ESCs. In its current version, ESCAPE contains (i) 206 521 documented protein/DNA interactions from ChIP-chip/seq studies, connecting 61 transcription factors (TFs) to their putative target genes; (ii) 153 920 LOF/GOF interactions connecting 28 TFs from LOF KD/knockout studies followed by genome-wide expression, and 55 TFs from GOF overexpression studies followed by genome-wide expression. These interactions directly or indirectly connect a target gene to an upstream TF regulator, and they are directed (arrow from the factor to the target) and signed (activation/inhibition); (iii) 1037 protein–protein interactions from IP-MS interactome studies centered on 16 bait proteins, as well as from smaller-scale studies; (iv) 813 gene-products functionally identified in five large-scale RNAi screens as key regulators of mESC pluripotency; (v) 19 801 m/hESC and differentiating progeny-specific nuclear proteins from whole nuclear MS proteomic analyses; (vi) 8323 ESC and differentiating progeny-specific phosphoproteins with identified phosphosites extracted from four studies; (vii) three genome-wide microarray mRNA time courses collected during mESC differentiation from one study; (viii) one genome-wide microRNA (miR) expression data set collected from mESCs; and (ix) 18 genome-wide ChIP-chip/seq histone modification studies in ESCs and early differentiated cells. The ESCAPE database descriptive statistics are provided in Table 1. The source references are listed in Table 2. The entity relationship diagram of the database design is shown in Figure 1. The data sets used to construct the ESCAPE database are freely downloadable and searchable online.
The ESCAPE database is stored as a MySQL relational database. The web interface is implemented as a set of PHP scripts running under Apache as well as a set of Java Servlets running under Tomcat, all interacting with the database using SQL. The network viewer used in the network generator page is Cytoscape Web (11), implemented in Flash. The canvas visualization within the enrichment analysis page is implemented with the JavaScript library D3 (12). JavaScript and AJAX are used throughout the site to improve user experience (UX) (13). The web interface contains several modules: (i) an interface to browse and query the data; (ii) an interface to download the data; (iii) a tool to generate subnetworks from an input list of genes using background networks generated from the database; (iv) a tool to perform enrichment analysis on user-entered gene lists using background lists of genes generated from the database and visualized on a canvas, as well as enrichment analysis of user-supplied lists using Enrichr, a tool to visualize enrichment results against 35 gene set libraries (14); (v) an interface to predict lineage commitment on gene KDs or overexpressions; (vi) downloadable MATLAB software with a graphical user interface for learning Boolean functions and simulating subnetwork dynamics given a prior subnetwork topology and experimental measurements of subnetwork node expression levels across many conditions (Figure 2). Details of the modules are described in the following sections.

Figure 1.
Entity relationship diagram of the ESCAPE database. PK, primary key; FK, foreign key.
Figure 2.
Screenshot of the main menu of ESCAPE.
Table 1.
Statistics of the current version of the ESCAPE database
Table 2.
ESCAPE literature references

Browsing and querying data sets within the ESCAPE database

The ESCAPE database provides a web-based user interface to allow easy browsing and querying. From the Browse page of the web interface, users can click on one of the tables listed on the left, and the contents of the selected table are then displayed in the center of the page. The contents of the table can be sorted by clicking the name of a column. In addition, information about the methods used to generate the table and the number of entries is displayed above each table. There are two ways to search the ESCAPE database: (i) a general search for a gene using the search bar displayed at the top of any page of the ESCAPE web interface or (ii) a detailed search within a selected table. The detailed search is provided under the Browse section of the website. When a specific gene name is looked up using the global search, a list of all the tables where the gene appears is displayed in the search results page, and direct links to the tables are provided. In the second case, when searching within a specific table, more complicated queries can be created. For each column of the table, an operator can be chosen, for example, equal or not equal. In addition, the logical operators AND and OR are provided to combine conditions and select specific records. For instance, if you are searching for all the interactions where NANOG, ESRRB and SOX2 are the source genes, the interactions are upregulation and the experiment type is GOF, the parameters are set as follows:

  1. Operator ‘=’ with ‘NANOG, ESRRB, SOX2’ entered in the GeneName column.
  2. Logic operator AND, with operator ‘=’ and ‘1’ entered in the Relationship column.
  3. Logic operator AND, with operator ‘=’ and ‘GOF’ entered in the PerturbType column.

The query will be automatically converted to the SQL statement: ‘SELECT * FROM logof WHERE (sourceName="Esrrb" OR sourceName="Nanog" OR sourceName="Sox2") AND (relationship="1") AND (perturbType="GOF")’. The results will be formatted and displayed in a results page.
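As an illustration of how such a form could be translated into SQL, the following is a minimal Python sketch, not the site's actual PHP code. It assumes the `logof` table and the `sourceName`, `relationship` and `perturbType` columns named in the statement above; the `targetName` column, the `build_query` and `run_demo` helpers and the toy rows are hypothetical. Values are passed as bound parameters rather than concatenated into the statement.

```python
import sqlite3

# Toy rows for illustration only; they are NOT taken from ESCAPE.
TOY_ROWS = [
    ("Nanog", "GeneA", "1", "GOF"),
    ("Esrrb", "GeneB", "-1", "GOF"),
    ("Sox2", "GeneC", "1", "LOF"),
    ("Pou5f1", "GeneD", "1", "GOF"),
]


def build_query(table, conditions):
    """Build a parameterized SELECT statement.

    conditions is a list of (column, [values]) pairs. Values given for the
    same column are OR-ed together; the column groups are AND-ed.
    Table and column names are interpolated directly, so they must come
    from a fixed whitelist and never from free user input.
    """
    clauses, params = [], []
    for column, values in conditions:
        ors = " OR ".join(f"{column} = ?" for _ in values)
        clauses.append(f"({ors})")
        params.extend(values)
    sql = f"SELECT * FROM {table} WHERE " + " AND ".join(clauses)
    return sql, params


def run_demo():
    """Run the example query against an in-memory toy table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE logof (sourceName TEXT, targetName TEXT, "
        "relationship TEXT, perturbType TEXT)"
    )
    conn.executemany("INSERT INTO logof VALUES (?, ?, ?, ?)", TOY_ROWS)
    sql, params = build_query(
        "logof",
        [
            ("sourceName", ["Esrrb", "Nanog", "Sox2"]),
            ("relationship", ["1"]),
            ("perturbType", ["GOF"]),
        ],
    )
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return sql, rows


if __name__ == "__main__":
    statement, result = run_demo()
    print(statement)
    print(result)
```

Only the Nanog row of the toy table satisfies all three conditions, which mirrors how the AND-combined filters narrow the result set.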

All the tables of the ESCAPE database can be freely downloaded from the Download page of the website. The tables are provided either as flat tab-separated text files or as MySQL files.

Subnetwork construction with ESCAPE

The web interface provides a subnetwork construction tool that connects genes/proteins of interest using the various types of gene–gene interaction networks in the ESCAPE database and a user-provided list of input gene IDs (Figure 3). The tool allows users to construct a subnetwork from a list of seed genes. The links that establish the connections within the subnetwork are drawn from background knowledge networks, which are selected from the various gene–gene interaction tables or a combination of them. Interactions in these subnetworks can come from (i) ChIP-chip/seq, (ii) protein–protein interactions or (iii) LOF/GOF evidence. First, the user inputs a seed gene list in Entrez gene symbol format. Then, the user chooses which background networks to use to connect the seed genes. The program uses the interactions from these networks to find connections between the input seed genes using a shortest path algorithm. The user can also select the path length between seed nodes. The default path length is two, in other words, one intermediate node. The program can filter interactions based on user-defined parameters: a minimum number of references per interaction or a maximum number of interactions per reference. Furthermore, the intermediates are ranked by the significance of their specificity to interact with the seed nodes, as implemented by our software tool Genes2Networks (15). The resultant subnetwork is visualized using the Flash-based interactive network viewer Cytoscape Web (11), which is embedded within the web page. Edges are colored according to the three possible types of interactions. Additionally, the output subnetworks are made available for download in PNG, SVG, PDF, XGMML, GRAPHML or SIF formats. This provides compatibility with other network visualization software such as Cytoscape (16) and yEd (http://www.yworks.com/en/products_yed_about.html).
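The idea of linking seed genes through a bounded number of intermediates can be sketched in Python as follows. This is a simplified, depth-limited path search over an undirected toy network, not the Genes2Networks implementation: it omits the reference-based filters and the significance ranking of intermediates. The function names and the toy edges (including the intermediates X, Y and Z) are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Return an undirected adjacency map from a list of (a, b) edges."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def connect_seeds(edges, seeds, max_path_length=2):
    """Keep every edge lying on a simple path of at most max_path_length
    edges that starts at one seed and ends at the first other seed reached."""
    adj = build_adjacency(edges)
    seeds = set(seeds)
    kept = set()

    def extend(path):
        node = path[-1]
        if len(path) > 1 and node in seeds:
            # Reached another seed: record all edges on this path.
            kept.update(frozenset(pair) for pair in zip(path, path[1:]))
            return
        if len(path) - 1 == max_path_length:
            return  # path budget exhausted without reaching a seed
        for nxt in adj[node]:
            if nxt not in path:
                extend(path + [nxt])

    for seed in seeds:
        extend([seed])
    return sorted(tuple(sorted(edge)) for edge in kept)


# Toy background network for illustration only.
TOY_EDGES = [
    ("Nanog", "X"), ("X", "Sox2"),
    ("Nanog", "Y"), ("Y", "Z"), ("Z", "Esrrb"),
    ("Sox2", "Esrrb"),
]

if __name__ == "__main__":
    print(connect_seeds(TOY_EDGES, ["Nanog", "Sox2", "Esrrb"], 2))
    print(connect_seeds(TOY_EDGES, ["Nanog", "Sox2", "Esrrb"], 3))
```

With a path length of two, the route from Nanog to Esrrb through Y and Z is excluded because it needs two intermediates; raising the limit to three brings those edges into the subnetwork.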

Figure 3.
An example from the ‘NETWORKS’ tab for linking seed nodes through other nodes based on the interactions within the database.

Enrichment analysis with ESCAPE

Another function of the ESCAPE web interface is the ability to perform enrichment analyses (Figure 4A). The enrichment analysis tool performs gene list enrichment analysis using the various experimental modalities that produced gene lists. These include candidate genes from RNAi screens, protein lists from IP-MS pull-downs, genes differentially expressed after KD or overexpression, and target genes for TFs and histone modifications as determined by ChIP-chip/seq. In this web application portion of the site, users can query their own gene lists for overlap with gene lists from the ESCAPE database or analyze their gene list with an external gene list enrichment analysis tool called Enrichr (14). On the left, users can cut and paste lists of Entrez gene symbols and then press Submit to perform the enrichment analysis. In the middle, most of the lists from the ESCAPE database are visualized as a canvas. Each square represents a list. The color indicates the experiment type, and the brightness indicates the level of local similarity among the lists. We use simulated annealing to arrange the lists from the ESCAPE database by their gene content similarity using the Sets2Networks algorithm (17). The enriched terms appear as circles on top of the colored squares representing the gene lists from the ESCAPE database on the canvas: the brighter the circle, the more significant the overlap with the input list. The results are also available in a table with the associated P-values on the right. Statistical enrichment is computed with the Fisher exact test. The resulting lists of enriched experiments only show the terms that pass a cutoff threshold of P < 0.05.
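A minimal Python sketch of this kind of overlap test is given below. It computes a one-sided (right-tail) Fisher exact P-value from the hypergeometric distribution using only the standard library. The background size of 20, the library names and the gene identifiers are assumptions made for illustration; the text does not state the background size used by ESCAPE or whether the test is one- or two-sided.

```python
from math import comb


def fisher_right_tail(overlap, list_size, input_size, background_size):
    """One-sided Fisher exact P-value: probability of seeing at least
    `overlap` shared genes when `input_size` genes are drawn at random
    from a background containing `list_size` genes of the library list."""
    max_k = min(list_size, input_size)
    total = comb(background_size, input_size)
    tail = sum(
        comb(list_size, k) * comb(background_size - list_size, input_size - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total


def enrich(input_genes, library, background_size, cutoff=0.05):
    """Return (list name, overlap, P-value) for lists with P < cutoff,
    sorted by increasing P-value."""
    input_genes = set(input_genes)
    hits = []
    for name, genes in library.items():
        genes = set(genes)
        overlap = len(input_genes & genes)
        p = fisher_right_tail(overlap, len(genes), len(input_genes), background_size)
        if p < cutoff:
            hits.append((name, overlap, p))
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical gene lists for illustration only.
TOY_LIBRARY = {
    "A_targets": {"g1", "g2", "g3"},
    "B_targets": {"g7", "g8", "g9"},
}

if __name__ == "__main__":
    for name, overlap, p in enrich({"g1", "g2", "g3"}, TOY_LIBRARY, 20):
        print(name, overlap, p)
```

In the toy example, only the list that fully overlaps the input passes the P < 0.05 cutoff; the list with no shared genes has a P-value of 1 and is dropped.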

Figure 4.
(A) General example from the ‘ENRICHMENT ANALYSIS’ tab for identifying overlap between gene lists from the database and other user-supplied lists. (B) Enrichment results for a list of genes from MGI-MP with a phenotype designation ‘embryonic ...

We created two examples to demonstrate how the enrichment analysis with the canvas visualization can provide new insights. We took two lists of genes that, when knocked out in mice, cause the phenotypes ‘embryonic growth arrest’ and ‘abnormal kidney physiology’ based on the MGI-MP ontology (18) terms 1730 and 2136, respectively. The enrichment results for ‘embryonic growth arrest’ show that the enriched terms are clustered in a few specific areas on the grid (Figure 4B). The clustering of enriched terms is clearly not random. The input genes contain H3K36ME targets that are also Oct4 interacting proteins. Interestingly, there is also high overlap with TCFCP2L1 interacting proteins as determined by proteomics and target genes of TCFCP2L1 as determined by ChIP-seq. The enrichment results for ‘abnormal kidney physiology’ are all clustered in the same area, which mostly represents the PRC2 complex members, known to suppress the expression of genes required for terminal differentiation, including genes critical for the maintenance of kidney morphology (19) (Figure 4C). Overall, such analyses can be used to link relevant phenotypes to specific regulatory mechanisms in embryonic stem cells, and to help experimental stem cell biologists who perform high-throughput experiments place their results in the context of prior studies.

Lineage specification prediction with ESCAPE

The next function of the ESCAPE web interface is a tool to predict lineage-propensity differentiation outcome on single or combinatorial KD of multiple pluripotency factors (Figure 5). The tool considers the target genes of knocked-down pluripotency factors and predicts the additive expression of lineage markers based on the combinatorial additive predicted levels of these factors. Specifically, effects of gene KDs on lineage commitment are dynamically computed by enrichment analysis for targets of knocked-down factors against lists of lineage-specific marker genes using Fisher’s exact test. Targets of KD factors were first identified from the LOF/GOF table, and lineage-specific components were assembled manually from the literature as follows: (i) Trophectoderm: the gene expression data set (GSE11523) reported a trophectoderm-like state after depletion of Oct4/Pou5f1 in mESCs. Gene expression was profiled at six time points. Genes were sorted according to average fold change of expression on differentiation relative to time point 0. The top 5% of genes with an average fold change of at least two and with a monotone increase in expression at each time point upon differentiation were considered as trophectoderm markers. (ii) Primitive endoderm: the same set of experiments and data processing as described for (i) were conducted after overexpression of Gata6 in mESCs. (iii) Neuroectoderm: the gene expression data set (GSE12982) isolated Sox1-GFP positive cells from mESCs where Ezh1 and Ezh2 were knocked down. Genes were sorted according to fold change increase in expression comparing differentiated cells to mESCs. The top 10% of genes with a monotonic increase and a fold change of at least 1.5 were considered as neuroectoderm markers. (iv) Mesendoderm: the same set of experiments and data processing as described for (iii) were conducted after isolation of T-GFP positive cells (T stands for the gene brachyury).
By sliding the bars on the web interface, users can choose the components and level of knockdown of 14 pluripotency factors. Corresponding positive and negative targets of each specific pluripotency factor were extracted from the LOF table within the ESCAPE database. As a result, the enrichment P-values reflecting the significance of differentiation potential toward each specific lineage on knockdown(s) are displayed on top. In addition, the up and down genes are provided in two text boxes below the lineage prediction display. Such lists can be further analyzed using the external enrichment analysis tool Enrichr (14) or any other tool available within ESCAPE or beyond.
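The marker-selection rule described above for the trophectoderm list can be sketched in Python as follows. This is one reading of the procedure, under stated assumptions: genes are first ranked by average fold change relative to time point 0, the top fraction is taken, and the fold-change and monotonicity filters are then applied to that top fraction. Expression values are assumed to be on a linear scale, and the function name and toy profiles are hypothetical.

```python
def select_markers(expression, top_fraction=0.05, min_fold_change=2.0):
    """Pick lineage markers from time-course profiles.

    expression maps a gene symbol to a list of expression values whose
    first entry is time point 0 (linear scale assumed).
    """
    candidates = []
    for gene, values in expression.items():
        baseline, later = values[0], values[1:]
        if baseline <= 0 or not later:
            continue  # fold change is undefined for this profile
        avg_fc = sum(v / baseline for v in later) / len(later)
        # Monotone increase: every time point is higher than the previous one.
        monotone = all(b > a for a, b in zip(values, values[1:]))
        candidates.append((gene, avg_fc, monotone))

    # Rank by average fold change, highest first, and keep the top fraction.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    return [
        gene
        for gene, avg_fc, monotone in ranked[:n_top]
        if avg_fc >= min_fold_change and monotone
    ]


# Toy profiles for illustration only.
TOY_EXPRESSION = {
    "A": [1.0, 2.0, 3.0, 4.0],   # steady increase
    "B": [1.0, 5.0, 4.0, 6.0],   # large but not monotone
    "C": [1.0, 1.1, 1.2, 1.3],   # monotone but small
    "D": [2.0, 2.0, 2.0, 2.0],   # flat
}

if __name__ == "__main__":
    print(select_markers(TOY_EXPRESSION, top_fraction=0.5))
```

The same function covers the neuroectoderm and mesendoderm rule by passing `top_fraction=0.10` and `min_fold_change=1.5`, again under the assumptions listed above.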

Figure 5.
An example from the ‘LINEAGE PREDICTION’ tab showing that LOF of Oct4/Pou5f1, Nanog and Sox2 results mostly in differentiating toward the trophectoderm lineage.

Functional associations among 15 pluripotency regulators and 15 lineage markers

The aggregated data in ESCAPE can be used to elucidate functional associations among pluripotency and differentiation components across various regulatory layers. Specifically, to demonstrate the usefulness of the compiled ESCAPE database for dissecting the pluripotency machinery, we examined functional correlations among 15 pluripotency factors and 15 differentiation markers selected previously (20). Heatmaps of degree of similarity were constructed (Figure 6), where we scored pair-wise similarity distance between the components as follows: (i) shared targets from the ChIP-chip/seq experiments; (ii) co-expression similarities based on global mRNA measurements after pluripotency TF LOF or GOF; (iii) histone modification target gene similarities analyzed in mESCs and differentiated cells; (iv) protein co-occurrence measured after pull-downs of pluripotency TFs followed by MS proteomics; (v) similarities of miR targets predicted computationally and limited to miRs preferentially expressed in mESCs; and (vi) co-expression similarities during embryoid body differentiation. Additionally, a multi-layer heatmap integrating all six layers was created. As expected, pluripotency regulators and differentiation markers generally cluster into two separate groups. A previous attempt to generate a heatmap for 13 pluripotency regulators based solely on genomic target binding similarities resulted in slightly different clusters (21). Here, Oct4/Pou5f1 shares greatest functional similarity with Sall4 and Zfp42 (also called Rex1) (Figure 6G). This is consistent with a report that Sall4 and Oct4/Pou5f1 form a regulatory feedback loop (22). In addition, Rex1 is a known target of Oct4/Pou5f1. However, it is interesting that Rex1 is so closely associated with Oct4/Pou5f1 across several layers. Surprisingly, Gli2, a known ectoderm marker, is highly correlated with pluripotency components across numerous layers, suggesting a function in the pluripotent state for this gene.
Gli2 is a downstream TF effector of Hedgehog signaling (23), which potentially links this pathway to pluripotency. Binding of Gli1 and Gli2 to the Nanog regulatory sequences in neural stem cells has been reported (24). Based on a recent genome-wide RNAi screen, another member of the Gli family, Gli3, was among the hits of genes involved in mESC early differentiation (25).

Figure 6.
Heatmaps of 15 pluripotency components and 15 lineage marker genes. (A–G) Numeric values in the color bars represent similarity distance scores (1 – Similarity). Similarity is calculated using the Jaccard-coefficient for: (A) TF/target-gene ...
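The similarity distance named in the caption (1 – Similarity, with the Jaccard coefficient as the similarity) can be written as d(A, B) = 1 − |A ∩ B| / |A ∪ B| for two gene sets A and B. The Python sketch below computes this distance and a pairwise matrix of the kind that could feed a heatmap. The function names and toy target sets are hypothetical, and returning a distance of 0 for two empty sets is a convention chosen here, not something stated in the source.

```python
def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| for two collections of gene symbols."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention assumed here for two empty sets
    return 1.0 - len(a & b) / len(union)


def distance_matrix(gene_sets):
    """Return (sorted names, square matrix of pairwise Jaccard distances)."""
    names = sorted(gene_sets)
    matrix = [
        [jaccard_distance(gene_sets[row], gene_sets[col]) for col in names]
        for row in names
    ]
    return names, matrix


# Hypothetical target sets for illustration only.
TOY_TARGETS = {
    "FactorA": {"g1", "g2"},
    "FactorB": {"g2", "g3"},
    "FactorC": {"g4"},
}

if __name__ == "__main__":
    labels, dist = distance_matrix(TOY_TARGETS)
    for label, row in zip(labels, dist):
        print(label, [round(value, 3) for value in row])
```

The resulting matrix is symmetric with zeros on the diagonal, so hierarchical clustering or a heatmap routine can consume it directly.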

Conclusions

ESCAPE is a freely available online resource that integrates current genome-wide data encompassing several regulatory layers and data types. Through the web interface, the data sets can be browsed, searched and downloaded. Additionally, a set of web-based tools was implemented to interact with the information in the database. Given a set of seed genes, users can perform network expansion, run enrichment analyses for upstream regulatory factors and downstream targets, and perform combinatorial lineage predictions. Organizing the experimental data into a coherent and interactive framework can potentially enable better utilization of such data for systems-level analyses and construction of dynamical models. ESCAPE has already proven useful in at least one research study (26).

Funding

This work was supported by NIH grants R01GM098316-01A1, P50GM071558-03, R01DK088541-01A1 (A.M.) and R01GM078465-03, RC1GM091176-01 (I.R.L.). Funding for open access charge: Irma T. Hirschl Career Scientist Award (A.M.).

References

1. Som A, Harder C, Greber B, et al. The PluriNetWork: an electronic representation of the network underlying pluripotency in mouse, and its applications. PLoS One. 2010;5:e15165.
2. Muller FJ, Laurent LC, Kostka D, et al. Regulatory networks define phenotypic classes of human stem cell lines. Nature. 2008;455:401–405.
3. Assou S, Le Carrour T, Tondeur S, et al. A meta-analysis of human embryonic stem cells transcriptome integrated into a web-based expression atlas. Stem Cells. 2007;25:961–973.
4. Schulz H, Kolde R, Adler P, et al. The FunGenES database: a genomics resource for mouse embryonic stem cell differentiation. PLoS One. 2009;4:e6804.
5. Porter CJ, Palidwor GA, Sandie R, et al. StemBase: a resource for the analysis of stem cell gene expression data. Methods Mol. Biol. 2007;407:137–148.
6. Glauche I, Herberg M, Roeder I. Nanog variability and pluripotency regulation of embryonic stem cells - insights from a mathematical model analysis. PLoS One. 2010;5:e11238.
7. Chickarmane V, Troein C, Nuber UA, et al. Transcriptional dynamics of the embryonic stem cell switch. PLoS Comput. Biol. 2006;2:e123.
8. MacArthur BD, Please CP, Oreffo ROC. Stochasticity and the molecular mechanisms of induced pluripotency. PLoS One. 2008;3:e3086.
9. Xu H, Schaniel C, Lemischka IR, Ma'ayan A. Toward a complete in silico, multi-layered embryonic stem cell regulatory network. Wiley Interdiscip. Rev. Syst. Biol. Med. 2010;2:708–733.
10. Yu J, Xing X, Zeng L, et al. SyStemCell: a database populated with multiple levels of experimental data from stem cell differentiation research. PLoS One. 2012;7:e35230.
11. Lopes CT, Franz M, Kazi F, et al. Cytoscape Web: an interactive web-based network browser. Bioinformatics. 2010;26:2347–2348.
12. Bostock M, Ogievetsky V, Heer J. D3 data-driven documents. IEEE Trans. Vis. Comput. Graph. 2011;17:2301–2309.
13. Holzinger A, Mayr S, Slany W, Debevc M. The influence of AJAX on Web Usability. In: ICE-B 2010 - Proceedings of the International Conference on e-Business. ICETE The International Joint Conference on e-Business and Telecommunications. INSTICC, Athens, Greece. 2010. pp. 124–127.
14. Chen EY, Tan CM, Kou Y, et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics. 2013;14:128.
15. Berger S, Posner J, Ma'ayan A. Genes2Networks: connecting lists of gene symbols using mammalian protein interactions databases. BMC Bioinformatics. 2007;8:372.
16. Kohl M, Wiese S, Warscheid B. Cytoscape: software for visualization and analysis of biological networks. Methods Mol. Biol. 2011;696:291–303.
17. Clark NR, Dannenfelser R, Tan CM, et al. Sets2Networks: network inference from repeated observations of sets. BMC Syst. Biol. 2012;6:89.
18. Smith CL, Goldsmith CA, Eppig JT. The mammalian phenotype ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biol. 6:R7.
19. He JC, Chuang PY, Ma'ayan A, Iyengar R. Systems biology of kidney diseases. Kidney Int. 2011;81:22–39.
20. Xu H, Lemischka IR, Ma'ayan A. SVM classifier to predict genes important for self-renewal and pluripotency of mouse embryonic stem cells. BMC Syst. Biol. 2010;4:173.
21. Kim J, Chu J, Shen X, et al. An extended transcriptional network for pluripotency of embryonic stem cells. Cell. 132:1049–1061.
22. Yang J, Gao C, Chai L, Ma Y. A novel SALL4/OCT4 transcriptional feedback network for pluripotency of embryonic stem cells. PLoS One. 2010;5:e10766.
23. Ding Q, Motoyama J, Gasca S, et al. Diminished Sonic hedgehog signaling and lack of floor plate differentiation in Gli2 mutant mice. Development. 1998;125:2533–2543.
24. Po A, Ferretti E, Miele E, et al. Hedgehog controls neural stem cells through p53-independent regulation of Nanog. EMBO J. 2010;29:2646–2658.
25. Yang SH, Kalkan T, Morrisroe C, et al. A genome-wide RNAi screen reveals MAP kinase phosphatases as key ERK pathway regulators during embryonic stem cell differentiation. PLoS Genet. 2012;8:e1003112.
26. De Cegli R, Iacobacci S, Flore G, et al. Reverse engineering a mouse embryonic stem cell-specific transcriptional network reveals a new modulator of neuronal differentiation. Nucleic Acids Res. 2013;41:711–726.

Articles from Database: The Journal of Biological Databases and Curation are provided here courtesy of Oxford University Press