Recent efforts in systematically profiling embryonic stem (ES) cells have yielded a wealth of high-throughput data. Complementarily, emerging databases and computational tools facilitate ES cell studies and further pave the way toward the in silico reconstruction of regulatory networks encompassing multiple molecular layers. Here, we briefly survey databases, algorithms, and software tools used to organize and analyze high-throughput experimental data collected to study mammalian cellular systems with a focus on ES cells. The vision of using heterogeneous data to reconstruct a complete multilayered ES cell regulatory network is discussed. This review also provides an accompanying manually extracted dataset of different types of regulatory interactions from low-throughput experimental ES cell studies available at http://amp.pharm.mssm.edu/iscmid/literature.
Pluripotent embryonic stem (ES) cells are derived from the inner cell mass of a developing embryo and can be cultured indefinitely in vitro. In vivo, mouse ES cells can contribute to all adult cell populations, including the germ line. Under defined in vitro conditions, both mouse and human ES cells can differentiate into numerous mammalian cell types, providing great promise for regenerative medicine. Recent studies have shown that adult mouse and human cells can be ‘reprogrammed’ into an induced pluripotent stem (iPS) cell state using simple combinations of transcription factors. To harness the exciting biomedical potential of ES/iPS cells, the molecular regulatory networks responsible for controlling pluripotency/self-renewal, as well as commitment and differentiation into different lineages, need to be characterized. Stem cell research is increasingly employing high-throughput systems biology approaches to define molecular ‘parts lists’ and regulatory interactions between the parts in ES cells and in their more differentiated progeny. How these parts are interconnected into gene and cell signaling regulatory networks ultimately responsible for self-renewal and differentiation is unclear. Approaches that aim to bridge the gap among molecules, network architectures, and dynamics in order to ultimately ‘explain’ phenotypic behavior are in their infancy. To enable these efforts, a pipeline process that couples experimental and computational approaches has emerged. An example of such a pipeline is outlined in Figure 1. First, data are collected from different molecular regulatory layers [for example: epigenomic, messenger RNA (mRNA), and proteomic data] using emerging high-throughput technologies. Second, in order to extract biological knowledge out of such rich, complex, but often noisy experimental datasets, advanced computational tools and databases are being developed.
Moreover, computational methods capable of synthesizing data from numerous experimental platforms with user-friendly interactive interfaces are gradually emerging. The computational methods include tools that convert raw data into standardized database formats/records. Such data records are organized into databases where experiments from different sources can be merged. Algorithms are then used to query such databases and integrate the high-throughput data with annotated data collated from low-throughput studies and other high-throughput studies in order to obtain new biological insights. Here, the organization of experimental data into sets of biochemically related gene products, and ultimately into networks of interacting gene products, is extremely useful. The abstraction/simplification of data into gene-sets and networks is qualitative and as such typically ignores quantitative detail. However, it provides a bird's-eye view of the system as a whole when advanced algorithms are applied to dissect the complexity and rank components. Taken together, computational tools and the algorithms embedded within them are used to make predictions that are translated into rational hypotheses that can be validated using low-throughput functional experiments. Although results from high-throughput experiments provide a global view of the many variables involved and their relationships, current technologies lack accuracy and a direct functional perspective. In contrast, low-throughput techniques, while providing functional understanding of specific components and interactions, do not have the scope needed to understand the multi-factorial complexity of the system’s behavior as a whole.
ES cell research is an area that fits well with the systems biology pipeline process because these cells are relatively easy to handle experimentally, have defined sets of phenotypes that can be experimentally evaluated, and are relatively homogenous in gene expression and morphology (this latter point is discussed in more detail in a subsequent section). The fact that we can differentiate stem cells to different cell types also makes these cells ideal. Stem cells can be considered the ‘ideal model organism’. In this review, we first discuss the systems biology pipeline process by surveying different types of high-throughput experiments with an emphasis on the computational tools and databases associated with each specific type of experimental approach. We then describe how these methodologies have been applied so far to study ES cells at different molecular regulatory layers, whether these layers are epigenetic, transcriptional, mRNA, microRNA, proteomic, and others. We focus on data-mining approaches which include algorithms, software tools, and databases. Such methods are used to reconstruct in silico regulatory networks and develop hypotheses for further experimentation. Finally, we present an initial ES cell regulatory network constructed from low-throughput studies.
Systematic genomic approaches for profiling the state of the cell at the mRNA expression level are the most established high-throughput approach for studying ES cells. At early stages of technology development, expressed sequence tags (EST),1–3 serial analysis of gene expression (SAGE),4,5 and massively parallel signature sequencing (MPSS)6,7 were utilized for probing mRNA expression patterns in ES cells. For example, through global comparisons of mouse versus human ES cells, Wei et al.7 showed similarities as well as discrepancies between the transcriptomes of human and mouse ES cells. Currently, the most established method to profile mRNA levels on a genome-wide scale is microarray technology. Although microarray technology has matured and is now widely used, it is expected to be gradually replaced by next-generation, or deep, sequencing techniques8 which can be used to profile the transcriptome with greater accuracy and at lower cost. Sequencing-based profiling is capable of providing more accurate absolute gene expression values, at least at the cell population level. Simply ‘counting’ the numbers of sequence tags derived from individual genomic loci is also more statistically robust. Although powerful, the ability of deep sequencing to detect extremely low levels of gene-specific transcripts raises its own set of potential complications. For example, running a single lane using the Illumina Solexa or similar technologies can yield up to tens of gigabases of sequence information in segments of nucleotides. Collecting such large volumes of data can lead to an ‘embarrassment of riches’ situation, where transcripts present at less than a single copy per cell are identified. Whether such low abundance transcripts are biologically relevant and reflective of functionally important heterogeneity of the analyzed cell population, or whether they are a product of transcriptional (or other) ‘noise’, needs to be carefully considered.
In recent years, public mRNA expression repositories have been developed for the purpose of sharing and exchanging results from mRNA microarray studies. Gene Expression Omnibus (GEO)9 (http://www.ncbi.nlm.nih.gov/geo/) and ArrayExpress10 (http://www.ebi.ac.uk/microarray-as/ae/) are two leading resources. Although these are general repositories for any type of organism and cell type, more focused repositories exist. For example, Gene Expression Database (GXD) (http://www.informatics.jax.org/)11 is a community-specific resource for mRNA expression in mouse; BloodExpress (http://hscl.cimr.cam.ac.uk/bloodexpress/) is a repository for the mouse hematopoiesis system;12 and Stem Cell Database (SCDB) (http://stemcell.mssm.edu/v2/) and StemBase (http://www.stembase.ca/?path=/) are mRNA expression databases collecting data from different types of stem cells.13 In addition, Stromal Cell Data Base (StroCDB)14 (http://stromalcell.mssm.edu/) contains gene expression data from stromal cell lines with differing hematopoietic stem cell supporting activities. Also, the FunGenES (Functional Genomics in Embryonic Stem Cells) database (http://www.fungenes.org/index.html) includes expression profiles in mouse ES cells with several analytical tools. Data exchange is typically handled through XML and OWL exchange standards such as MIAME.15
Analysis of gene expression data remains an active area of research because the extraction of meaningful biological knowledge out of large volumes of noisy data is challenging. Different algorithms have been employed for identifying expression patterns. Unsupervised clustering methods such as hierarchical clustering,16 principal component analysis,2 and self-organizing maps17 are some of the most popular approaches. The rationale behind these different algorithms is that groups of genes that behave similarly at their mRNA expression levels are co-regulated and are therefore likely to encode components of the same pathway or biological process. Once identified, these clusters or patterns can be linked to prior biological knowledge. Specific tools to perform clustering analyses include TIGR MeV,18 Genes@Work,19 Cluster 3.0,20 BioDiscovery,21 EXPANDER 2.0,22 and the BioConductor project which implements modules using the freely available R statistical package.23
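The clustering rationale described above can be illustrated with a minimal sketch of agglomerative (hierarchical) clustering. It assumes average linkage and a distance of one minus the Pearson correlation, which is one common choice rather than the method of any specific tool named here. The gene names and expression values are hypothetical, and constant profiles (zero variance) are not handled.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cluster_genes(profiles, n_clusters):
    """Agglomerative average-linkage clustering with distance 1 - r."""
    clusters = [[gene] for gene in profiles]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        return mean(1 - pearson(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest average distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical log-expression values at four time points of differentiation.
profiles = {
    "gene_A": [9.0, 7.5, 5.0, 3.0],
    "gene_B": [8.5, 7.0, 4.5, 2.5],
    "gene_C": [8.8, 7.9, 5.5, 3.2],
    "gene_D": [2.0, 3.5, 6.0, 8.0],
    "gene_E": [1.5, 3.0, 6.5, 8.5],
}
clusters = cluster_genes(profiles, n_clusters=2)
```

In this toy example the three genes whose expression falls over time group together, and the two genes whose expression rises form the second group. Real analyses would normally rely on an established package rather than this quadratic-time illustration.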
Gene expression microarray technology has been widely used for profiling the expression level of the mRNA of genes that control self-renewal and differentiation in ES cells. For example, Bhattacharya et al.24 analyzed expression patterns of six human ES cell lines by oligonucleotide arrays and identified an overrepresented subset of over 90 genes as ES cell signature genes, which include Nanog, Sox2, and Oct4. Similarly, Ivanova et al.25 performed gene expression analysis in mouse embryonic and neural stem cells and defined ESC- and NSC-enriched gene-sets; Ramalho-Santos et al.26 defined from transcriptional profiles 216 genes enriched in mouse embryonic, neural, and hematopoietic stem cells. Consistent with such molecular signatures of ES cell genes, Masui et al.27 used mRNA expression microarrays to screen genes that are up- or down-regulated in a Sox2-null ES cell line. They found multiple genes in the down-regulated set that are related to Oct4, pointing to the fact that Sox2 and Oct4 may function in a coordinated manner. In another study, Pritsker et al.25 conducted gain-of-function screens combined with expression microarrays to identify genes whose over-expression causes the maintenance of the undifferentiated status of mES cells cultured without leukemia inhibitory factor (LIF), an extracellular ligand sufficient for maintenance of mES cells. They found a role for genes encoding a wide scope of regulatory proteins including the gene that encodes the kinase Akt1 and several novel transcription factors. Sperger et al.28 compared expression profiles of human ES cells and several carcinoma cells. They identified sets of mRNAs that are specific for human ES cells. These mRNAs showed relative similarity of expression patterns among different ES cell lines. Westfall et al.29 conducted microarray analysis for mRNA expression in human ES cells under either 4% or 20% oxygen cell culture conditions. 
They identified several oxygen-sensitive genes required for the maintenance of self-renewal. Their global gene expression profiling experiments characterized the mRNA signature of ES cells during specific developmental stages and environmental conditions. One of the best characterizations of human ES cells can be found in Adewumi et al.30 In their study, they combined FACS analysis of protein surface expression with mRNA profiling.
Various other studies have reported gene expression profiling of undifferentiated human and mouse ES cells as well as their derivatives (see Refs 31–37 and one of many reviews38). These combined efforts ultimately define specific molecular signatures of ES cells and bring us closer to unraveling the architecture of ES cell regulatory networks at the transcriptional regulatory layer.
Complementing mRNA gene expression profiling, genome-wide knock-down screens are increasingly applied for identifying novel functional genes linked to specific phenotypes or a specific biological process. The primary methodology utilizes short interfering RNA (siRNA) screens that target specific mRNAs for degradation. Commercially available chemically synthesized siRNAs are 21 nucleotides long with symmetric 3′-overhangs of two nucleotides. Alternatively, short hairpin RNAs (shRNA) can be transcribed from expression cassettes inserted into plasmids or viral vectors.39 Another option is to use endoribonuclease-prepared short interfering RNAs (esiRNAs).40 These are produced in vitro from cDNA templates transcribed into double-stranded RNA and subsequently digested by endoribonucleases to a pool of overlapping effectors. It has been reported that this approach results in fewer off-target effects. Robotics is often used to streamline the knock-down process and to measure changes in phenotypic outcomes using microscopy.
siRNA sequences validated by experiments are available through several public databases. For example, siRecords41 is a database that collects over 17,000 validated siRNAs. Another useful resource is siRNAdb,42 which provides both siRNA experimental data and computationally predicted siRNA sequences. siRNAdb provides users with the ability to evaluate siRNAs for potential specific and non-specific targets. In addition, commercially available repositories include validated siRNAs from Qiagen, and Silencer-validated siRNAs from Ambion. Several public or commercial bioinformatics prediction tools have been developed to assist in selecting siRNAs with enhanced hit likelihood and reduced potential for off-target effects. Representative computational tools include siDESIGN43 and siSearch.44 The algorithms used in these tools are based on empirical rules such as sequence asymmetry, stability, and predicted secondary structure, often implementing machine-learning techniques.
Several mid-scale to large-scale siRNA screens have allowed stem cell researchers to systematically identify novel genes essential for ES cell pluripotency/self-renewal and differentiation. Such knockdown approaches are generally complemented with measurements of mRNA expression changes to link the knocked-down gene with the genes it regulates. For example, Ivanova et al.36 explored the effect of shRNA-mediated silencing of seven different transcription factors on mRNA expression profiles in mouse ES cells. They identified four novel genes (Esrrb, Tbx3, Tcl1, and Dppa4) previously unreported as being essential for maintenance of ES cell self-renewal and pluripotency. Based on the expression data and further clustering analysis, Ivanova et al. defined the similarities and differences among the effects of these factors. In a mid-scale screen, an esiRNA approach was used to target 1008 chromatin regulators; this study identified Tip60-p400 as a key regulator of ES cell identity.45 In a separate study, Schaniel et al.46 performed a synthetic RNA-based shRNA screen on 312 genes, identifying several SWI/SNF chromatin remodeling complex components such as Smarcc1/Baf155 as important for early commitment/differentiation events. Recently, two groups performed larger-scale RNAi screens in mouse ES cells to identify genes critical for stem cell self-renewal. Hu et al.47 used synthesized siRNA pools and identified more than 100 novel functional genes implicated in mouse ES cell maintenance. Another study by Ding et al.48 employed an esiRNA library and found over 200 candidate genes important for ES cell self-renewal. These studies illustrate the power of genome-wide RNA interference to functionally and systematically identify the parts-list components important for ES cell self-renewal.
MicroRNAs (miRNAs) are small, approximately 22-nucleotide-long, non-coding endogenous RNAs which play an important role in many biological systems including the regulation of ES cells.49 The binding of mature miRNAs to mRNAs in the RNA-induced silencing complex (RISC) triggers either degradation of the mRNAs or inhibition of translation. miRNA expression and function have been observed to be critical in ES cell regulation.50 Experimental approaches for detecting miRNA expression include miRNA array profiling, northern blots, and quantitative real-time PCR. Such empirical methods are complemented by computational algorithms that can discover miRNAs as well as predict their targets.51
Some of the primary online resources for miRNA sequences, targets, and other annotations are miRBase,52 Miranda (microRNA.org), and TargetScan.53 CoGemiR54 and TarBase55 are examples of two other emerging microRNA databases. RNAdb56 is a database for all non-protein-coding RNAs including microRNAs and small nucleolar RNAs (snoRNAs), the latter of which are participants in the process of ribosomal RNA modification and maturation.57
Several stand-alone miRNA–target prediction tools, such as TargetScanS,53 PicTar,58 and miRanda,59 have been developed. These tools implement algorithms that employ base-pairing rules extracted from known miRNA–target interactions. In addition, cross-species conservation of miRNA–target interactions is used for miRNA–target interaction prediction. Different methods use slightly different scoring schemas, detection criteria, and conservation requirements. For example, TargetScanS requires perfect complementarity with a miRNA seed, whereas DIANA-MicroT60 allows for targets with imperfect seed matching. Recently published tools, e.g., MirTarget261 and NBmirTar,62 implement machine-learning techniques to make predictions directly based on validated miRNA targets. Apart from sequence matching, algorithms have been developed to consider secondary structure; one example is PITA,63 whereas EIMMo predicts miRNA targets using evolutionary sequence conservation across different organisms combined with information about molecular and biochemical pathways.64 Such tools can be evaluated using mRNA expression data. Complementarily, several resources integrate various stand-alone databases and tools for better prediction and comprehensiveness. For example, miRecords65 comprises both manually retrieved experimentally validated miRNA–target interactions and miRNA–target interaction predictions integrated from 11 stand-alone prediction tools. Also, lists of miRNA targets or clusters of miRNAs working as a group can be predicted using enrichment statistics. For example, GeneSet2microRNA66 implements such an approach. In addition to the miRNA–target prediction tools and databases, analyses of functional annotation of predicted targets for specific miRNAs are emerging. 
Such analyses link miRNA–targets to gene ontology (GO) terms or to cell signaling pathways.67 These analyses are best utilized when miRNA data are integrated with mRNA expression data.68 A list of miRNA prediction tools and databases is provided in Table 1.
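The seed-matching step that underlies many of these predictors can be sketched as follows. This is a simplified illustration, not the scoring scheme of TargetScanS, DIANA-MicroT, or any other tool named above: it assumes the seed is nucleotides 2 to 8 of the mature miRNA, treats "imperfect" matching as a simple mismatch count, and ignores conservation, G:U wobble pairing, and site context. The function names and both sequences are illustrative.

```python
def reverse_complement(rna):
    """Reverse complement of an RNA sequence (A, C, G, U only)."""
    pairs = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(rna))


def seed_sites(mirna, utr, allow_mismatches=0):
    """Return 0-based UTR positions whose 7-mer pairs with miRNA bases 2-8."""
    seed = mirna[1:8]                 # nucleotides 2-8 of the mature miRNA
    site = reverse_complement(seed)   # sequence expected in the target mRNA
    hits = []
    for start in range(len(utr) - len(site) + 1):
        window = utr[start:start + len(site)]
        mismatches = sum(a != b for a, b in zip(window, site))
        if mismatches <= allow_mismatches:
            hits.append(start)
    return hits


# Toy sequences for illustration only (5' to 3').
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAACUACCUCAGGGCUACCACAA"

perfect = seed_sites(mirna, utr)                         # strict seed rule
relaxed = seed_sites(mirna, utr, allow_mismatches=1)     # one mismatch allowed
```

Here the strict rule reports one site and the relaxed rule reports an additional site with a single mismatch, which shows how the choice of detection criterion changes the predicted target set.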
miRNAs are regulators of ES cell pluripotency/self-renewal and differentiation.80–82 It was shown, for example, that Dicer-deficient ES cells are defective in proliferation and differentiation.83 Expression profiles of miRNAs in various ES and derived cell lines have already revealed unique signatures in these cells. For example, Thomson et al.84 performed custom microarray-based analysis of miRNA expression in mouse ES cells, embryoid bodies, and adult tissues. Based on their results, the expression profiles of miRNAs in ES cells differ markedly from those in embryoid bodies or adult cells. Babiarz et al.82 identified novel Dicer-dependent non-canonical microRNAs in mouse ES cells by deep sequencing of knockout cell lines; Wu et al.85 applied genomic analysis of miRNA profiling and revealed differences between two human ES cell lines, which, in turn, explain subtype-specific differentiation bias; whereas a miRNA expression microarray study by Cao et al.86 compared miRNA expression between human and mouse ES cells, identifying conserved expression from chromosomes 19 and X. In several other seminal studies, miRNA regulation was linked to key pluripotency transcription factors. For example, Boyer et al.87 showed that Nanog, Oct4, and Sox2 occupy the promoters of a combined 14 miRNAs, only 2 of which are bound by all 3 in hESC, suggesting that regulation of miRNAs by these core pluripotency factors is crucial to maintain the pluripotent state. Card et al.88 showed that Oct4/Sox2 regulate miR-302, which targets the mRNA encoding cyclin D in human ES cells, whereas a study reported by Barroso-delJesus et al.89 suggests that the miR-302–367 cluster is regulated by Nanog, Oct4, Sox2, and Rex1 (all self-renewal transcription factors). 
Xu et al.90 showed that miR-145 targets and represses pluripotency transcription factors, whereas Tay et al.91 demonstrated that miR-134, miR-296, and miR-470, which are induced upon retinoic acid mediated differentiation, target the pluripotency transcription factors Nanog, Oct4, and Sox2.
One application of in silico prediction of miRNA–target interactions specific for ES cells attempted to identify miRNA–mRNA interactions essential for pluripotency and differentiation.92 The authors combined mRNA expression with predicted miRNAs to suggest a list of miRNAs that are important for maintaining pluripotency. Although computational approaches can be used to predict novel miRNA candidates important for ES cell regulation, experimental approaches are necessary to confirm such predictions. For example, Ciaudo et al.93 combined computational prediction of miR-302 targets, using two target-prediction tools (PicTar58 and EIMMo), with experimental validation. They confirmed that Arid4a and Arid4b are targeted by the miR-302 family, which is enriched in male-specific differentiating ES cells. In a study that utilized a miRNA over-expression strategy in Dgcr8−/− ES cells, the authors identified the miR-290 cluster as being capable of rescuing the proliferation defect in these cells.82 Experimental approaches for direct miRNA–mRNA target identification include co-immunoprecipitation of the RISC with target mRNAs.94 This procedure was recently enhanced by applying the cross-linked immunoprecipitation (CLIP) approach for identifying RISC binding sites more precisely.95,96 By expanding the analyses, miRNA–target interactions can be elaborated into regulatory networks that can be combined with other types of regulatory interactions. Specifically, miRNAs can be represented as nodes with outgoing (mostly) negative links to their target mRNAs, although there are exceptions.97 MicroRNA expression can be used to explain discrepancies between mRNA expression and protein levels, and because miRNAs are regulated by the transcriptional machinery, the input links to miRNA nodes originate from the transcriptional regulatory machinery, which is discussed next.
Transcription, the initial step in gene expression, is regulated by the transcriptional machinery involving transcription factors and co-regulatory complexes. Transcription factors bind to cis-regulatory elements in proximity to gene coding sequences. Uncovering the dynamics and complexity of transcriptional regulation through transcription-factor binding to DNA remains an enormous challenge. Large-scale experimental methods can be applied to profile the global interactions of transcription factors with DNA. Such methods include chromatin immunoprecipitation (ChIP) combined with DNA microarrays (ChIP-chip),98 with deep sequencing (ChIP-seq),99 or with paired-end tag sequencing (ChIP-PET).100 A recently developed alternative method, called DamID,101 is based on the expression of a fusion protein consisting of the protein of interest and DNA adenine methyltransferase (Dam). Methylation of adenines by Dam near the protein–DNA interacting sites marks the sites of interest. DamID is applied to eukaryotic cells where adenine methylation does not happen endogenously. DamID relies on selective PCR amplification via adapter oligonucleotides ligated to DNA fragments sequentially digested with DpnI (GAmeTC) and DpnII (GATC) followed by microarray hybridization. In addition, protein/DNA arrays such as those offered by Panomics can be used to measure the activity of canonical transcription factors.102 Protein–DNA interaction profiling using large-scale methods has been used more widely in stem cell research than in comparable experimental systems and cell types.47,87,102,103,103–111
Facilitated by high-throughput protein/DNA interaction studies as well as results from low-throughput studies such as gel-shift assays, tools and databases are being developed for organizing transcription factor/DNA interaction information. Leading transcription-factor-binding-site-resources include JASPAR,112 TRANSFAC,113 and TRED.114 These databases contain collections of transcription factor binding profiles together with information on conserved regulatory elements stored in binding site matrices. Other, organism-specific databases are YEASTRACT115 for Saccharomyces cerevisiae, RegulonDB116 for Escherichia coli, and DBTBS117 for Bacillus subtilis. Recently, MacArthur et al.118 collected results from various ChIP experiments performed using ES cells and created a consolidated ChIP profiling dataset for over 20 transcription factors and their putative targets. Such a dataset can be used as part of ongoing efforts of data consolidation for hypotheses generation and multi-layered regulatory network reconstruction.
One approach to summarize the results from protein/DNA interactions is to develop consensus binding site sequences for individual transcription factors.119,120 Leading databases that provide such consensus binding site sequences as matrices, or logo-motifs, for mammalian cells are JASPAR112 and TRANSFAC.113 Once such matrices have been developed, it is straightforward to use these to map potential transcription factor binding sites across entire genomes. Representative tools that use such data include PASTAA,121 P-Match,122 and Pscan.123 The complete list is much longer. PASTAA implements the TRAP124 method utilizing TRANSFAC to predict the affinity of transcription factors to gene promoters. Because such computational predictions have proven not to be highly reliable, combining predictions with additional experimental data, such as data from ChIP and mRNA expression experiments, significantly improves predictive performance.
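Mapping sites with a binding-site matrix can be sketched as a log-odds scan along a sequence. The sketch below assumes a uniform background, a fixed pseudocount, and forward-strand scanning only; it is a generic position weight matrix scan, not the algorithm of PASTAA, P-Match, Pscan, or TRAP. The count matrix, sequence, and thresholds are hypothetical.

```python
import math

# Hypothetical position count matrix: counts of each base at 4 motif positions.
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 0, 9, 1],
    "G": [1, 0, 0, 7],
    "T": [1, 10, 1, 1],
}


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Convert base counts into per-position log2 odds against the background."""
    width = len(next(iter(counts.values())))
    matrix = []
    for pos in range(width):
        total = sum(counts[b][pos] for b in "ACGT") + 4 * pseudocount
        matrix.append({
            b: math.log2(((counts[b][pos] + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = sum(matrix[i][sequence[start + i]] for i in range(width))
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits


matrix = log_odds_matrix(COUNTS)
promoter = "GGATCGTTATCACC"   # toy sequence
strict_hits = scan(promoter, matrix, threshold=5.0)
lenient_hits = scan(promoter, matrix, threshold=3.5)
```

Lowering the threshold admits a second, weaker site in this toy sequence. That sensitivity to the cutoff is one reason sequence-only predictions benefit from support by ChIP or expression data.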
As mentioned above, high-throughput ChIP experiments have been widely applied to studying transcription-factor-targets in ES cells.47,87,102,103,103–111 For example, Hu et al.47 performed a genome-wide siRNA screen combined with ChIP-chip to elucidate the regulatory networks controlled by the two transcription factors Cnot3 and Trim28, discovering a new role for these factors in ES cell pluripotency/self-renewal; Marson et al.125 used ChIP-seq technology to link important self-renewal transcription factors to promoters of genes and miRNAs. They discovered that miRNAs are highly regulated by self-renewal transcription factors, and are highly expressed in ES cells, while specific groups of miRNAs are only expressed in differentiated cells. Applying ChIP-PET to profile protein/DNA interactions for several transcription factors, Loh et al.104 used an optimized algorithm named NestedMICA126 to predict the motif binding sites for transcription factor pairs such as Oct4/Sox2. This analysis helped to further characterize how heterodimers regulate mouse ES cell self-renewal. Their findings were consistent with two other studies that reported common Sox2/Oct4 heterodimer binding sites upstream of many genes important for self-renewal or differentiation.105,127 In another study, Sharov et al.128 applied ChIP experiments with pull-downs of Oct4, Sox2, and Nanog combined with time-course microarray analyses. They used an inducible Oct4-depletion ES cell line called ZHBTc4 and a Sox2 line called 2TS22C. Combining ChIP experiments with mRNA expression can be used to assign signs (activation or inhibition) to links from transcription factors to the genes they regulate.
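The idea of assigning signs to links can be sketched under a simple assumption: if a ChIP-bound target decreases when the transcription factor is depleted, the link is read as activation, and if it increases, as inhibition. This rule, the fold-change cutoff, the function name, and all data values below are illustrative assumptions rather than the procedure used in the studies cited above.

```python
def signed_edges(chip_targets, depletion_log2fc, min_abs_change=1.0):
    """Label each ChIP-derived link using expression change after TF depletion."""
    edges = []
    for tf, targets in chip_targets.items():
        for gene in sorted(targets):
            change = depletion_log2fc.get(tf, {}).get(gene)
            if change is None or abs(change) < min_abs_change:
                sign = "unknown"      # bound, but no clear expression response
            elif change < 0:
                sign = "activation"   # target drops when the factor is removed
            else:
                sign = "inhibition"   # target rises when the factor is removed
            edges.append((tf, gene, sign))
    return edges


# Hypothetical inputs: bound targets and log2 fold changes after depleting TF_1.
chip_targets = {"TF_1": {"gene_A", "gene_B", "gene_C", "gene_D"}}
depletion_log2fc = {"TF_1": {"gene_A": -2.3, "gene_B": 1.8, "gene_C": 0.2}}

edges = signed_edges(chip_targets, depletion_log2fc)
```

Targets that are bound but show no measured or no substantial expression change are left unsigned, since binding alone does not establish the direction of regulation.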
Protein phosphorylation is a critical post-translational modification used to transfer information from the extracellular environment by affecting protein activity. Phosphorylation is used to regulate various biological processes including pluripotency/self-renewal as well as differentiation of ES cells. Classical experiments to identify phosphorylated sites include radioactive labeling and affinity chromatography. These methods are labor intensive and can only be performed on a small scale. Recently, large-scale collection of phosphorylation data (phospho-proteomics) was made possible with approaches such as tandem mass spectrometry (MS) and antibody phosphoarrays.129,130 For example, a popular strategy in recent years is using stable isotope labeling by amino acids in cell culture (SILAC)131,132 combined with liquid chromatography and tandem mass spectrometry (LC-MS/MS). With SILAC, the whole proteome of a given cell is labeled with stable heavy and light (normal) isotope variants. In this approach, the relative levels of protein phosphorylation from different samples are measured simultaneously by the ratio of intensities of light/heavy amino acids labeled with the distinct isotopes. Whereas SILAC quantitation is done at the MS level, an alternative method called iTRAQ uses stable isotope labeling at the tag level. This allows multiplexing of up to eight samples, and as such, quantitation is done at the MS/MS level.133,134 An alternative to MS-based proteomics is antibody arrays. For example, Kinexus is a specialized method for analyzing the phosphoproteome using antibody-based microarrays (Kinex™ antibody microarrays), kinase substrate, and inhibitor profiling. Kinex is an antibody-based method that relies on sodium dodecyl sulfate (SDS)-polyacrylamide mini-gel electrophoresis and multilane immunoblotters to permit the specific and quantitative simultaneous detection of protein kinases or other signal transduction proteins.135
Protein phosphorylation-centered databases mostly report phospho-sites identified using techniques such as SILAC, and only some databases also list the kinases that are likely responsible for the phosphorylation. Hence, there are many orphan phospho-sites, i.e., sites for which the responsible kinase is not known. Phosphorylation-centered databases can be developed manually by retrieving identified phosphorylations from results reported in low-throughput studies, as well as by consolidating results from high-throughput assays. Such phosphorylation repositories include Swiss-Prot,136 phospho.ELM,137,138 HPRD,139 PhosphoPoint,140 and PhosphoSitePlus.141 A comprehensive list of phosphorylation-centered databases is provided in Table 2.
For data generated from SILAC phospho-proteomics, there are computational tools developed for data processing and analysis. For example, ASCORE189 and Colander190 were developed to process the raw data for phospho-site identification. Databases that record phospho-sites and predictive tools that utilize such data as training sets to predict additional potential phospho-sites are emerging. For instance, NetPhos,152 DISPHOS,154 NetPhosYeast,155 and PHOSIDA142 are such databases. Furthermore, tools to predict the protein kinases that catalyze phosphorylation events on specific peptide motifs are also being developed. Predictors such as NetPhosK,167 PhoScan,172 and NetworKIN197 attempt to link phosphorylation sites to the kinases most likely to phosphorylate those sites. Some tools, such as kinase enrichment analysis (KEA),186 compute the likelihood of specific kinases to phosphorylate a set of proteins based on annotated kinase–substrate interactions. Most predictors are sequence-based and depend on an assortment of machine-learning algorithms. As such, these algorithms require training data from known examples. This category consists of NetPhosK,167 KinasePhos,171 Phosite,158 GPS,169 ScanSite,166 and PhoScan.172 Alternatively, some additional information apart from sequence has also been integrated to augment specificity of substrates, including disorder information implemented by DISPHOS,154 structure information implemented by NetPhos152 and Predikin,165 as well as contextual factors by NetworKIN.197 Different machine-learning algorithms are implemented for kinase–substrate prediction using different tools. For example, support vector machine (SVM) is implemented by PHOSIDA142 and PredPhospho,151 artificial neural networks (ANN) by NetPhos,152 and hidden Markov model (HMM) by KinasePhos.171 In addition, DISPHOS relies on a logistic regression-based linear predictor, whereas GPS169 uses a group-based scoring method. 
The same group that developed GPS also developed an additional strategy called PPSP,170 which implements Bayesian decision theory. Integrating and comparing such complementary efforts is expected to improve predictive specificity and accuracy.
Characterization of the phosphoproteome status of ES cells can provide understanding of the cellular signaling status at the pluripotency/self-renewal state, and how external stimuli drive ES cell signaling toward differentiation. Because phosphorylation is a key mechanism for the regulation of the transcriptional machinery, phosphoproteomic experiments are expected to bridge the gap between the transcriptome and proteome. Recently, Wang et al.198 simultaneously studied the phosphorylation status of 42 receptor tyrosine kinases (RTKs) in human ES cells cultured in conditioned medium by means of membrane arrays with a pan-anti-phosphotyrosine antibody. RTKs such as the IGF1 and insulin receptors contribute to hESC pluripotency/self-renewal. Such approaches help define the contributions of ligands to maintaining stem cell characteristics in heterogeneous culture conditions. More systematically, Brill et al.199 performed an MDLC-MS/MS-based phosphoproteomic study in human ES cells and their differentiated derivatives to identify differentially modified proteins potentially involved in self-renewal or differentiation. Similarly, using SILAC, Prokhorova et al.132 studied the phosphoproteome status of undifferentiated human ES cells and identified 527 unique phosphopeptides, while Van Hoof et al.200 unraveled the phosphoproteome status of hES cells during differentiation induced by BMP. Coupled with the phosphoproteomic results, they utilized the NetworKIN algorithm197 for predicting upstream kinases for these phosphorylated substrates and identified CDK1/2 as an overrepresented kinase during early differentiation of hES cells. In another study, Saxe et al.201 used prediction algorithms implemented by ScanSite to find potential phosphorylation sites on Oct4, a key component in regulating self-renewal. 
Subsequently, based on the computational predictions, they performed experiments to confirm that one of the predicted phosphorylation sites partially controls Oct4-mediated transcriptional activity.
To improve our understanding of the cell signaling and transcriptional complexes that regulate ES cells, further high-throughput characterization of protein–protein interactions is essential. Protein interactions and cell signaling pathways curated from the literature complement high-throughput techniques, such as yeast-two-hybrid (Y2H)202 and MS following co-immunoprecipitation. Besides binary interactions, changes in protein levels are also important. The multiple reaction monitoring-mass spectrometry (MRM/MS) assay quantifies a specific tryptic peptide, selected as a stoichiometric representative of the cleaved protein, against an internal synthetic stable isotope-labeled peptide, giving rise to an absolute measure of protein concentration.203,204
Leading resources for collecting and merging protein–protein interactions identified using different experimental techniques include BioGRID,205 HPRD,206 MINT,207 IntAct,208 and Reactome.209 These protein interaction databases are often organized into cell signaling pathways which also include small messengers such as DAG, cyclic AMP, and calcium, as well as non-covalent interactions and post-translational modifications such as phosphorylation and dephosphorylation. Some protein–protein interaction databases also include computationally predicted interactions.210 For example, the STRING211 database contains interactions predicted from co-localization and phylogenetic profiles. Data exchange and compatibility are handled through standardization of protein IDs and exchange formats such as BioPAX.
The most extensive study to characterize protein–protein interactions in mouse ES cells used affinity purification followed by MS proteomics to unravel the protein interactions associated with Nanog, a key regulatory transcription factor for ES cell pluripotency/self-renewal.212 To date, other protein interaction (interactome) studies in ES cells have been small-scale and focused on individual complexes.
Genome-wide experimental approaches such as mRNA microarrays, proteomics, and ChIP-chip/PET/seq generate large datasets that can be summarized as lists of genes/proteins that have been identified or as displaying changes in expression or activity under different conditions. Such lists can be analyzed using enrichment analysis methods which report the overlap between the experimentally identified lists and previously annotated functionally labeled gene-sets. Systematic annotation of such gene-sets assists researchers in measuring the similarity among different experiments and in identifying the functional signatures or biological themes in newly generated datasets. The most common enrichment analysis applications are based on GO,213 which is a hierarchical tree-structured database of controlled vocabulary terms associated with genes for gene annotation. GO is a useful way of collecting and organizing biological knowledge that can be reused for data analysis. Besides GO annotation, genes/proteins have been grouped into gene-sets using several types of prior biological knowledge, for example: chromosomal location, expression regulation by upstream transcription factors, binding to specific metabolites, shared structural domains, involvement in canonical biological pathways, and association with specific diseases.
Enrichment analysis tools such as DAVID214 and GSEA215 have been developed to handle many types of prior biological knowledge gene-sets. Apart from these two leading enrichment analysis tools, other approaches exist. GeneTrail216 is based on categories such as TRANSFAC for gene regulation, Refseq, KEGG, and GO; GFINDer217 also covers OMIM, which is for association with disease. The FatiGO tool218 allows users to analyze two sets of genes by means of statistical tests based on various criteria, including functional criteria [GO, Biocarta (http://www.biocarta.com/genes/index.asp)], regulatory criteria (miRNA), and chromosomal location. A list containing additional similar enrichment analysis tools is provided in Table 3. Based on the different algorithms used, enrichment tools can be categorized into three classes219: singular enrichment analysis (SEA); gene-set enrichment analysis (GSEA); and modular enrichment analysis (MEA). Common statistical methods such as Fisher Exact, Chi-Squared, and Binomial Proportion tests are used when comparing sets without considering the ranks of the genes within the set. Most tests assume independence for the probability of genes to appear together in an input list, generally a naïve assumption because genes that belong to the same functional category tend to be co-expressed. Tools such as GSEA also consider the ranks of the genes in the input list. Statistical tests, such as Kolmogorov–Smirnov or Rank-Sum, are used to compute the enrichment for such input lists against sets of genes from different categories in an arguably more accurate manner.
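As a minimal sketch of the two families of tests described above, the following Python code computes an over-representation P-value from the upper tail of the hypergeometric distribution (equivalent to a one-sided Fisher exact test) and an unweighted Kolmogorov–Smirnov-like running-sum score for a ranked gene list. The function names, gene identifiers, and numbers are illustrative assumptions; the code does not reproduce the implementation of DAVID, GSEA, or any other tool cited here.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_set, n_input, n_overlap):
    """P(overlap >= n_overlap) when n_input genes are drawn at random,
    without replacement, from n_background genes of which n_in_set
    belong to the annotated gene-set (rank-free over-representation)."""
    max_overlap = min(n_in_set, n_input)
    total = comb(n_background, n_input)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_input - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def ks_enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score for a ranked list: step up at each
    gene-set member, step down otherwise, and return the maximum
    deviation from zero (positive means enrichment near the top)."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(hits) - n_hits
    if n_hits == 0 or n_misses == 0:
        return 0.0
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_misses
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 background genes, a 100-gene category,
    # a 200-gene input list, and 8 genes shared between the two.
    print(hypergeom_enrichment_p(20000, 100, 200, 8))
    # Hypothetical ranked list with placeholder gene identifiers.
    print(ks_enrichment_score(["g1", "g2", "g3", "g4"], {"g1", "g2"}))
```

Note that the hypergeometric test inherits the independence assumption discussed above, so the resulting P-values are best read as a ranking aid rather than as exact probabilities.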
Enrichment analysis has been widely applied to interpret high-throughput profiling results from ES cells. For example, using DAVID, Storm et al.237 identified downstream gene-sets of phosphoinositide 3-kinase (PI3K) that show over-representation of functional groups in transcriptional regulation and DNA binding, including Zfp36, Sox4, Dnmt3a/b. This implicates the PI3K signaling pathway as playing an important role in maintaining ES cell pluripotency. Another example is the over-representation of functional groups involved in biological processes such as cellular growth, proliferation, and embryonic development for targets of c-Myc and Stat3, two key transcriptional factors governing ES cell pluripotency.108
Taken together, functional enrichment analyses can generate hypotheses about potential links between groups of miRNAs and transcription factors and the genes they regulate, and can help identify molecular complexes that function in a coordinated manner during differentiation. Enrichment analyses can also help in computing the significance of signals from high-throughput experiments. For example, if enrichment analysis shows that many phospho-sites for a specific kinase are increasingly phosphorylated during the course of a particular treatment, it is likely that this kinase is more active under the treated versus control conditions. It is then reasonable to assign greater reliability to marginally identified phospho-sites if they are predicted to be substrates of the up-regulated kinase. Such concepts can be applied to similar scenarios where high-throughput screens can be combined with prior knowledge.
Representation of molecular intracellular biological systems as networks is useful for combining results from different studies and obtaining an overall bird's-eye view of the system.238–241 Network representation permits making predictions about undiscovered interactions and identifying functional modules. In principle, a molecular regulatory network consists of nodes and links, in which nodes represent molecular entities such as genes, gene products, miRNAs, or metabolites, and links represent interactions/relationships that can be direct or indirect (physical association or influence), signed or unsigned (activation, inhibition, or neutral). One approach to reconstructing intracellular regulatory networks is to manually extract interactions from the literature. For example, Thiele et al.242 reconstructed a transcriptional machinery for E. coli by studying over 500 publications, whereas Ma’ayan et al.243 extracted over 1200 cell signaling interactions reported in mammalian neurons. For this review, we manually extracted binary interactions from 286 experimental studies specifically focused on ES cells. A regulatory pluripotency/self-renewal sub-network for mouse ES cells based on this manual extraction is presented in Figure 2, whereas the entire network is available online at http://amp.pharm.mssm.edu/iscmid/literature. Note that the interactions on the website include binary interactions identified in mouse ES cells by Wang et al.,212 which were identified using high-throughput techniques. In contrast with previous studies, we combined direct and indirect interactions and included cell signaling interactions and gene regulatory interactions. Such accumulated knowledge-based networks provide a scaffold for further data integration and for the interrogation of data collected in new high-throughput and low-throughput studies. 
For example, changes in expression of mRNA levels detected in microarray experiments can be projected onto knowledge-based networks for inferring and expanding potential interactions among gene products at the mRNA and protein levels.244 Co-clustering245,246 can be applied to identify functional modules based on expression profiles combined with protein interaction networks. The known regulatory topology can also be used to begin to understand how information flows through the system over-time. Here, we projected a time-course mRNA expression profiling after Nanog knock-down in mouse ES cells247 with the literature-based network we developed for this review to illustrate this point (Figure 3).
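The projection of expression changes onto a knowledge-based network can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used to produce Figure 3: the node names, the log2 fold-change values, and the cutoff `min_abs_change` are all hypothetical, and an interaction is retained at a time point only when both of its endpoints change by at least the cutoff.

```python
def project_onto_network(edges, log2_fold_change, min_abs_change=1.0):
    """Keep the literature-derived edges whose two endpoints both change
    in expression by at least min_abs_change (absolute log2 units)."""
    changed = {
        gene
        for gene, value in log2_fold_change.items()
        if abs(value) >= min_abs_change
    }
    return [
        (source, target)
        for source, target in edges
        if source in changed and target in changed
    ]


def project_time_course(edges, time_course, min_abs_change=1.0):
    """Apply the projection independently at every time point."""
    return {
        time_point: project_onto_network(edges, profile, min_abs_change)
        for time_point, profile in time_course.items()
    }


if __name__ == "__main__":
    # Placeholder network and made-up expression changes.
    network = [("A", "B"), ("B", "C"), ("C", "D")]
    profiles = {
        "day1": {"A": -1.5, "B": 1.2, "C": 0.1, "D": 0.0},
        "day3": {"A": -2.0, "B": 1.5, "C": -1.1, "D": 0.3},
    }
    for time_point, active in project_time_course(network, profiles).items():
        print(time_point, active)
```

Comparing the retained sub-networks across time points gives a crude view of how a perturbation spreads through the known topology; more refined schemes would weight edges by the sign and magnitude of the changes.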
Establishing networks from the literature for specific biological processes and linking such networks to genome-wide profiling studies can lead to the identification of novel disease genes or to a better understanding of the pathogenesis of poorly understood complex diseases. For example, a study that profiled a mouse model of prion disease, a disease caused by misfolding and aggregation of prion proteins, with mRNA expression microarrays at different developmental stages projected expression changes onto a network of protein interactions, similarly to what we did for Figure 3, to identify novel molecular origins of the disease.248 It is also possible to extract important disease-relevant information from networks just by examining the network topology, especially when the sign of the links (activation/inhibition) is known. A cross-disciplinary study by Abdi et al.249 showed how circuit fault diagnosis, commonly applied to study electrical circuits, can identify vulnerable genes in literature-based mammalian cell signaling networks. Another example that falls within this category is a study by Mani et al.,250 who developed an algorithm to characterize oncogenic mechanisms in B-cell lymphomas. They used a B-cell interaction map as a scaffold and then projected onto it a large set of microarray expression profiles. Another seminal study by Ideker and colleagues251 used a protein–protein interaction network combined with gene expression profiles to identify markers of metastasis. Their method scored subnetworks based on mutual information between expression activity over tumor samples and class category. In summary, these approaches utilize prior knowledge network topology in conjunction with genome-wide expression profiling to unravel molecular mechanisms of disease.
Reconstruction of regulatory networks can be completely data-driven. Approaches for gene regulatory network inference from microarray data can be categorized into three principal classes: (1) Linear models using differential equations to describe gene expression changes as a function of expression of other genes and external perturbations.252 (2) Information theory-based models, such as ARACNe253 or the CLR.254 With these methods, edges are weighted and filtered based on conditional mutual information. (3) Probabilistic graphical models such as Bayesian Networks255 and Module Networks.256 These methods treat expression of genes as random variables and embody the description of the joint probability distribution of these random variables. Commonly, research groups also attempt to integrate expression data with other data sources such as genome-scale ChIP results in order to augment the inference.257 Most network inference methods do not consider post-translational regulation, which controls transcription factor activity. Post-translational modifications (PTMs) such as phosphorylation, acetylation, ubiquitination, or methylation may change the function and localization of transcription factors. Because experimental approaches for identifying PTMs are limited, computational approaches for inferring this layer of regulation have been developed. For example, the MINDy algorithm258 infers candidate modulators of transcription factors using mutual information.
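To make the information theory-based class more concrete, the sketch below scores every gene pair by mutual information on discretized expression values, discards weak edges, and then removes the weakest edge of every fully connected gene triplet (a data-processing-inequality style pruning step). It is a simplified illustration and not the published ARACNe or CLR implementation: the rank-based binning, the number of bins, the threshold, and the toy expression profiles are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
from math import log2


def discretize(values, n_bins=3):
    """Rank-based (equal-frequency) binning of one expression profile."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(
        (count / n) * log2(count * n / (px[a] * py[b]))
        for (a, b), count in joint.items()
    )


def infer_network(expression, n_bins=3, threshold=0.1):
    """Return {(gene1, gene2): MI} after thresholding and triplet pruning."""
    genes = sorted(expression)
    binned = {g: discretize(expression[g], n_bins) for g in genes}
    mi = {
        frozenset(pair): mutual_information(binned[pair[0]], binned[pair[1]])
        for pair in combinations(genes, 2)
    }
    edges = {edge for edge, weight in mi.items() if weight > threshold}
    # In every fully connected triplet, mark the weakest edge as a
    # probable indirect interaction.
    indirect = set()
    for a, b, c in combinations(genes, 3):
        triangle = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if all(edge in edges for edge in triangle):
            indirect.add(min(triangle, key=lambda edge: mi[edge]))
    return {tuple(sorted(edge)): mi[edge] for edge in edges - indirect}


if __name__ == "__main__":
    # Toy profiles over nine samples; the values are made up.
    toy = {
        "A": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "B": [1, 2, 4, 3, 5, 6, 7, 8, 9],
        "C": [1, 2, 4, 3, 5, 7, 6, 8, 9],
    }
    print(infer_network(toy))
```

With real data, the mutual information estimator and the significance threshold matter considerably more than in this toy case, and a tolerance is usually applied before an edge is declared indirect.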
Several studies used reverse engineering methods to characterize ES cell networks. For example, Chen and Zhong259 inferred a regulatory network in mouse ES cells from time-series microarray data by first finding delayed correlations and then inferring reaction rules that can recapitulate the dynamical changes observed in the data. For each of their identified transcription factor–target pairs, ChIP-chip and RNAi data were used for model verification. Other probabilistic methods for network inference can be used.260–262 For example, Woolf et al.262 inferred a signaling network from a proteomics dataset. They reconstructed a self-renewal signaling network of mouse ES cells by applying a Bayesian-learning algorithm. With respect to defining a signed network, the links can be inferred from transcription factor–target gene interactions identified by ChIP experiments, whereas the signs of links can be drawn from transformation of raw expression data, or inversely related to knockout effects.118,261,263 For instance, Chavez et al.264 applied this methodology by integrating Oct4 ChIP-chip experiments and Oct4 RNAi in human ES cells. Similarly, Chen et al.105 constructed a regulatory network in ES cells inferred from integrated data of transcription factor binding sites and expression profiles in undifferentiated and differentiated ES cells. The resulting network shows high interconnectedness, which reflects the interwoven relationship between the core transcription factors responsible for maintaining the self-renewal state. Recently, Muller et al.265 combined microarray data, classification algorithms, and protein–protein interactions from MATISSE246 to construct a consensus stem cell network called PluriNet from many studies. This network is undirected. 
Hence, more effort is required for a more complete reconstruction of ES cell regulatory networks, expanding both depth and breadth around the key transcription factors and cell signaling pathways responsible for stem cell development. Such networks can then be simulated using different dynamical modeling techniques. However, describing such computational methods is beyond the scope of this review.
Although we do not elaborate on dynamical computational modeling techniques, it is critical to emphasize that changes in cell fate occur over time, so it will be necessary to explore regulatory network dynamics.266 In addition, a number of studies have documented the stochastic, intrinsically noisy nature of mRNA and protein expression in stem cells.267,268 Of course, stochastic components of stem cell regulation were demonstrated more than 40 years ago in the hematopoietic system. In simpler biological systems such as the lytic versus lysogenic decision process of certain bacteriophages (lambda), stochasticity or noise has been shown to be vital for the biological decision process.269 In more complex systems, the problem comes down to the difficulty of discerning biologically important versus unimportant stochastic phenomena. A more general problem is that most gene and protein expression studies provide data that are an average over a population of cells. Thus, intermediate levels of a given gene’s expression can result from on or off expression in single cells, truly graded expression levels in single cells, or a combination of both. Currently, robust technologies are emerging to quantitatively monitor gene expression in single cells, and the next several years will surely see additional insights into these issues. For example, single-cell behavioral complexity has been resolved in quantitative analyses of the Ras-MAPK and other biochemical signaling pathways. It is also noteworthy that during the development of PCR technologies, many of these types of ‘sensitivity’ issues have already been addressed, thus providing important paradigms. The bottom line is that all data collecting technologies have inherent limitations and these need to be considered when moving forward to the development of databases and computational analysis methods.
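The ambiguity of population-averaged measurements noted above can be illustrated with a small numerical example. The two hypothetical cell populations below are made-up values chosen for illustration: one contains cells that are either fully off or fully on, the other contains cells that all express the gene at the same intermediate level. A bulk assay that reports only the mean cannot tell them apart, whereas the single-cell spread can.

```python
from statistics import mean, pstdev

# Hypothetical single-cell expression values (arbitrary units).
on_off_cells = [0.0] * 50 + [10.0] * 50   # half the cells off, half on
graded_cells = [5.0] * 100                # every cell at an intermediate level


def summarize(cells):
    """Return (population mean, population standard deviation)."""
    return mean(cells), pstdev(cells)


if __name__ == "__main__":
    for label, cells in (("on/off", on_off_cells), ("graded", graded_cells)):
        average, spread = summarize(cells)
        print(f"{label}: mean={average:.1f}, sd={spread:.1f}")
```

Both populations have the same mean, so only a measurement that preserves the per-cell distribution distinguishes switch-like from graded expression.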
In this review, we summarized some of the systematic methods to profile intracellular regulatory molecular systems at different layers of regulation, focusing on ES cells in particular. We describe how utilization of high-throughput profiling approaches is paired with related databases and computational tools in a pipeline process that concludes in network reconstruction, data integration, and functional enrichment analyses (Figure 1). Clearly, the trend in the field is to combine several different types of high-throughput methods for interrogating cells at multiple layers of regulation. For example, in a recent study, Rong et al. measured mRNA expression, nuclear protein levels, and chromatin status markers as a time-series in mES cells after a defined perturbation: knock-down of the self-renewal essential transcription factor Nanog.247 Such approaches are gradually unraveling the molecular regulatory complexity of ES cells while requiring advanced computational analysis methods. Drilling down to further characterize the most interesting new components and interactions identified by high-throughput methods coupled with computational analyses should provide low-hanging-fruit for further functional experiments which are all expected to move the field rapidly forward.