HHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal
Nature. Author manuscript; available in PMC 2009 March 18.
Published in final edited form as:
PMCID: PMC2637443

Regulatory networks define phenotypic classes of human stem cell lines


Stem cells are defined as self-renewing cell populations that can differentiate into multiple distinct cell types. However, hundreds of different human cell lines from embryonic, fetal, and adult sources have been called stem cells, even though they range from pluripotent cells, typified by embryonic stem cells, which are capable of virtually unlimited proliferation and differentiation, to adult stem cell lines, which can generate a far more limited repertory of differentiated cell types. The rapid increase in reports of new sources of stem cells and their anticipated value to regenerative medicine1, 2 have highlighted the need for a general, reproducible method for classification of these cells3. We report here the creation and analysis of a database of global gene expression profiles (“Stem Cell Matrix”) that enables the classification of cultured human stem cells in the context of a wide variety of pluripotent, multipotent, and differentiated cell types. Using an unsupervised clustering method4, 5 to categorize a collection of ~150 cell samples, we discovered that pluripotent stem cell lines group together, while other cell types, including brain-derived neural stem cell lines, are very diverse. Using further bioinformatic analysis6 we uncovered a protein-protein network (“PluriNet”) that is shared by the pluripotent cells (embryonic stem cells, embryonal carcinomas, and induced pluripotent cells). Analysis of published data showed that the PluriNet appears to be a common characteristic of pluripotent cells, including mouse ES and iPS cells and human oocytes. Our results offer a new strategy for classifying stem cells and support the idea that pluripotence and self-renewal are under tight control by specific molecular networks.

Cultured cell populations are traditionally classified as having the qualities of stem cells by their expression of immunocytochemical or PCR markers.7 This approach can often be misleading if these markers are used to categorize novel stem cell preparations or predict inherent multi- or pluripotent features.8 To develop a more robust classification system, we created a framework for identifying putative novel stem cell preparations by their whole genome mRNA expression phenotypes (Figure 1). The core reference dataset, which we call the Stem Cell Matrix, includes cultures of human cells that have been reported to have either stem cell or progenitor qualities, including human embryonic stem cells, mesenchymal stem cells, and neural stem cells. To provide the context in which to place the stem cells, we included non-stem cell samples such as fibroblasts and differentiated embryonic stem cell derivatives. To avoid biasing the classification methods, it was critical that we designate the input cell types with terminology that carried as little preconception about their identity as possible. Our nomenclature (“Source Code”) has two components: the first is the tissue or cultured cell line of origin. The second term captures a description of the culture itself. Supplementary Tables 1 – 8 summarize the descriptions of the core samples and their assigned Source Codes.

Figure 1
Sample collection and analysis for the Stem Cell Matrix

To sort the cell types we used an unsupervised machine learning approach to cluster transcriptional profiles of the cell preparations into stable distinct groups. Sparse nonnegative matrix factorization (sNMF) was adjusted for this task by implementing a bootstrapping algorithm to find the most stable groupings (see also Supplementary Discussion 1).4, 5 The stability of the clustering9 indicated that the dataset most likely contained about twelve different types of samples (Figure 2; Supplementary Method 2). The composition of the stable clusters revealed both predictable and unpredicted groupings of a priori designations (Figure 2 and Supplementary Figure 1). The twenty samples identified as undifferentiated human pluripotent stem cell (PSC) preparations were grouped together in one dominant cluster (Figure 2, Cluster 1) and one secondary cluster (Figure 2, Cluster 5). Sixty-two of the samples were brain-derived cells that were described as neural stem or progenitor cells based on their source, culture methods and classical markers. Most of the designated neural stem cells were distributed among multiple clusters, indicating a great deal of diversity in neural stem cell preparations. But one group of the brain-derived lines, those derived from surgical specimens from living patients (HANSE cells, see below), remained together throughout the iterative clusterings (Figure 2, Cluster 6; Supplementary Figure 3; Supplementary Method 1). The HANSE cell group consisted of transcriptional profiles that were derived from neurosurgical specimens following published protocols for multipotent neural progenitor derivation and propagation.10, 11 These cells expressed markers that are commonly used to identify neural stem cells12 (see Supplementary Figure 4), but the clustering clearly separated them from the other samples that had been derived from postmortem brains of prematurely born infants (see Figure 2).10,11
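
The factorization step behind this clustering can be illustrated with a small sketch. It is a plain non-negative matrix factorization with multiplicative updates, not the sparse variant (sNMF) or the R script used in the study: the sparsity constraint and the bootstrapping are omitted, and the toy matrix, the function names and the iteration count are all assumptions made for illustration. Each sample is assigned to the metagene with the largest coefficient after the columns of W are rescaled to unit sum.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(V, k, iters=300, seed=0, eps=1e-9):
    """Factor a nonnegative genes x samples matrix V into W (genes x k) and H (k x samples)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(Wt, matmul(W, H))
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(matmul(W, H), Ht)
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return W, H

def assign_clusters(W, H):
    """Label each sample by its dominant metagene, with W columns rescaled to unit sum."""
    k, m = len(H), len(H[0])
    scale = [sum(row[a] for row in W) for a in range(k)]
    return [max(range(k), key=lambda a: H[a][j] * scale[a]) for j in range(m)]

# Hypothetical toy data: 6 genes x 6 samples with two expression blocks.
V = [[5.0 if (i < 3) == (j < 3) else 0.5 for j in range(6)] for i in range(6)]
W, H = nmf(V, k=2)
labels = assign_clusters(W, H)
print(labels)
```

On this toy matrix the first three samples should receive one label and the last three the other; which label is 0 and which is 1 depends on the random initialization.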

Figure 2
Clusters of samples based on machine learning algorithm

We tested the ability of our dataset to categorize additional preparations by adding 66 samples comprising new cultures derived from PSC lines that were already in the matrix, preparations that were not yet included (but their presumptive cell type was already represented), or new cell types. We chose two new types of cells: a differentiated cell type (umbilical vein endothelial cells [HUVEC]) and a recently developed new source of pluripotent cells, induced pluripotent stem cells13-16 (iPSC, Supplementary Table 9). iPSCs have been generated from somatic cells, including adult fibroblasts, by genetic manipulation of certain transcription factors.13, 15-17 We re-computed clustering results including the test dataset (Supplementary Table 10). All of the HUVEC samples clustered together and formed a distinct group. Most of the additional PSC lines (human ES cells [embryonic PSC; ePSC] and iPSCs) from several different labs were placed into a context that contained solely PSC lines. The three additional germ cell tumor lines clustered together with the tumor-derived pluripotent stem cell (tPSC) line 2102Ep and samples of three human ES cell lines: BG01v (ref. 18), Hues7 (ref. 19), and Hues13 (ref. 19). BG01v is an established aneuploid variant line and the two Hues lines were aneuploid variants of the originally euploid lines (not shown).

We used a combination of analysis tools to explore the basis of the unsupervised classification of the samples in the core dataset. Gene Set Analysis3 (GSA) is a means to identify the underlying themes in transcriptional data in terms of their biological relevance.

GSA uses lists of genes5 that are related in some way; the common criterion is that the relationships among the genes in the lists are supported by empirical evidence.20 GSA highlighted numerous significant differences among the computationally defined categories. (See Supplementary Figure 2, Supplementary Table 11 and Supplementary Online Materials).
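
As a hedged illustration of how a gene-set method of this kind can summarize a list of genes, the sketch below computes a maxmean-style statistic of the sort proposed for GSA:20 the positive and negative parts of the per-gene scores are averaged separately over the whole set, and the larger of the two is reported. The per-gene scores (for example, t-like statistics comparing one cluster with the rest) are hypothetical inputs, the signed return value is a convention chosen here, and the restandardization and permutation steps of the full method are omitted.

```python
def maxmean(scores):
    """Maxmean-style summary of per-gene scores for one gene set.

    Returns the larger of the mean positive part and the mean negative part,
    carrying a minus sign when the negative part dominates.
    """
    n = len(scores)
    pos = sum(max(s, 0.0) for s in scores) / n
    neg = sum(max(-s, 0.0) for s in scores) / n
    return pos if pos >= neg else -neg

# Hypothetical per-gene scores for three small gene sets.
mostly_up = [2.0, 2.0, -1.0, 0.0]
mostly_down = [-3.0, -3.0, 1.0, 1.0]
mixed = [3.0, 3.0, -3.0, -3.0]

for name, s in [("mostly_up", mostly_up), ("mostly_down", mostly_down), ("mixed", mixed)]:
    print(name, maxmean(s), sum(s) / len(s))
```

The mixed set shows why such a statistic can be preferable to a plain average: its mean score is zero, yet the set contains strongly changed genes in both directions.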

While GSA is valuable for discovering specific differences among sample groups, it is limited to curated gene lists and cannot be used to discover new regulatory networks. The MATISSE algorithm6 takes predefined protein-protein interactions (e.g. from yeast-two-hybrid screens) and seeks connected subnetworks that manifest high similarity in sample subsets. The modified version used in this analysis is capable of extracting sub-networks that are co-expressed in many samples but also significantly up- or down-regulated in a specific sample cluster. Since the PSC preparations were consistently clustered together we used MATISSE to look for distinctive molecular networks that might be associated with the unique PSC qualities of pluripotence and self-renewal. A Nanog-associated regulatory network has been outlined in mouse embryonic PSC,21 and we looked for the elements of this network in human PSCs using our unbiased algorithm. We found that the algorithm predicts that human PSC possess a similar NANOG-linked network (Figure 3a; elements labelled in red). However, we also discovered that the human NANOG network appears to be integrated as a small component of a much larger protein-protein interaction network that is up-regulated in human PSCs (Figure 3). Remarkably, this PSC-specific network (termed Pluripotency associated Network, PluriNet) contains key regulators that are involved in the control of cell cycle, DNA replication, DNA repair, DNA methylation, SUMOylation, RNA processing, histone modification and nucleosome positioning (see also Supplementary Discussion 2). Many of the genes in the PluriNet have been linked to embryogenesis, tumorigenesis, and aging (Figure 3c and Supplementary Figure 6).
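
The idea of searching an interaction graph for connected, co-expressed subnetworks can be sketched in a deliberately simplified form. The code below is not MATISSE, which relies on a probabilistic similarity model and seed optimization; it only keeps interaction edges whose two genes have highly correlated expression profiles and then reports the connected components of what remains. The gene names, expression values and thresholds are hypothetical.

```python
from collections import defaultdict, deque
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def coexpressed_modules(edges, expr, min_corr=0.8, min_size=3):
    """Connected components of the interaction graph restricted to co-expressed edges."""
    adj = defaultdict(set)
    for a, b in edges:
        if a in expr and b in expr and pearson(expr[a], expr[b]) >= min_corr:
            adj[a].add(b)
            adj[b].add(a)
    seen, modules = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search over retained edges
            gene = queue.popleft()
            component.add(gene)
            for neighbour in adj[gene]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        if len(component) >= min_size:
            modules.append(sorted(component))
    return modules

# Hypothetical interaction edges and expression profiles across four samples.
edges = [("geneA", "geneB"), ("geneB", "geneC"), ("geneC", "geneD"), ("geneD", "geneE")]
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [1.1, 2.1, 2.9, 4.2],
    "geneD": [4.0, 3.0, 2.0, 1.0],
    "geneE": [5.0, 1.0, 4.0, 2.0],
}
print(coexpressed_modules(edges, expr))
```

In this toy example only the chain geneA, geneB, geneC survives the correlation filter, so it is reported as a single module.
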
We further explored the hypothesis that pluripotency is closely linked to PluriNet expression by analyzing published gene expression datasets from human oocytes, various types of PSCs, and murine embryos (see Table 1 for a summary of our findings in various model systems). Analysis of a microarray dataset22 that spans development from murine oocytes to the late blastocyst stage revealed that the PluriNet expression is dynamic and up-regulated during early mammalian embryogenesis (Table 1; Supplementary Figures 7 - 9).23 Also, our preliminary analyses indicate that the PluriNet is strongly up-regulated in mouse PSCs, mouse iPSCs, and mouse epiblast-derived stem cells24 when compared to somatic cells. Therefore the PluriNet may be useful as a biologically inspired gauge for classifying both murine and human PSC phenotypes (Table 1; Supplementary Figures 10 – 13).

Figure 3
Pluripotent Stem Cell-specific protein-protein interaction network detected by MATISSE

Table 1
PluriNet expression patterns in various model systems for pluripotency

In summary, our data indicate that an unbiased global molecular profiling approach combined with a transcriptional phenotype collection using suitable machine learning algorithms can be used to understand and codify the phenotypes of stem cells.4, 5, 25 Although it is more extensive than any stem cell dataset reported to date, we consider our database and the PluriNet to be a work in progress. As more direct evidence for protein-protein interactions in human cells becomes available, it will be possible to refine the networks we’ve defined and make them more useful for testing hypotheses about the nature of stem cell pluri- and multipotence. Also, our sample collection is limited to pluri- and multipotent stem cell types that grow well in culture, and does not include some of the most well-studied lineages, such as hematopoietic stem cells. Resolution and reliability of a context-based unsupervised classification can be expected to grow with the breadth and depth of the database content.26 Even with these limitations, we have shown that the dataset and PluriNet have already proved useful for categorizing cell types using unbiased criteria. As more stem cell populations become available, cultured by new methods, isolated from new sources, or induced by new methods, we will use the PluriNet and the Stem Cell Matrix as a reference system for phenotyping the cells and comparing them with existing cell lines.

Methods Summary

For an overview of the general workflow, please also refer to Figure 1. A detailed list of the samples, culture methods and reference publications is provided in the Supplementary materials.11 Generally, RNA from each sample was prepared from approximately 1 × 10^6 cultured cells. Sample amplification, labeling and hybridization on Illumina WG8 and WG6 Sentrix BeadChips were performed for all arrays in this study according to the manufacturer’s instructions at a single Illumina BeadStation facility. We used the Consensus Clustering framework9 to cluster transcription profiles and to assess stability of the results. As the clustering algorithm, we used sparse non-negative matrix factorization.5 For data perturbation, 30 sub-sampling runs were performed for each considered number of clusters (k). In each run, 80% of the data was subjected to ten random restarts. The R-script can be downloaded at the accompanying website. Details on the application of GSA,20 PAM,27 MATISSE6 as well as publicly available datasets used in this study can be found in the Methods section. We modified the MATISSE6 computational framework to fit the goals of this study. For the present analysis we used the human physical interaction network that we had previously assembled6 and augmented it with additional interactions from recent publications.21, 28, 29 The 64 interactions in Wang et al. 2006 (ref. 21) were mapped to the corresponding human orthologs using the NCBI Homologene database. The microarray data has been deposited at NCBI GEO (GEO series accession number: GSE11508). It can also be accessed, processed and downloaded at
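
The resampling scheme summarized above (30 sub-sampling runs per k, 80% of the samples per run, ten random restarts) can be sketched as follows. This is not the R script referred to in the text: a small k-means routine stands in for sNMF as the base clusterer, the toy data are hypothetical, and the dispersion-style stability score is one simple summary chosen for illustration rather than the statistic necessarily used in the study.

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, iters=20):
    """Stand-in base clusterer; returns (labels, within-cluster sum of squares)."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    cost = sum(dist2(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, cost

def consensus_matrix(points, k, runs=30, frac=0.8, restarts=10, seed=0):
    """Fraction of sub-sampling runs in which each pair of samples co-clusters."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(runs):
        idx = rng.sample(range(n), max(k, int(frac * n)))
        subset = [points[i] for i in idx]
        # Keep the best of several random restarts, judged by clustering cost.
        labels, _ = min((kmeans(subset, k, rng) for _ in range(restarts)),
                        key=lambda result: result[1])
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def dispersion(C):
    """Mean of 4*(c - 0.5)^2 over sample pairs; 1.0 when every entry is 0 or 1."""
    n = len(C)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(4 * (C[i][j] - 0.5) ** 2 for i, j in pairs) / len(pairs)

# Hypothetical toy data: two well-separated groups of one-dimensional profiles.
points = [[0.0], [0.2], [0.1], [0.3], [0.15], [5.0], [5.2], [5.1], [5.3], [5.15]]
C = consensus_matrix(points, k=2)
print(dispersion(C))
```

Repeating the computation over a range of k and comparing the stability scores is one way to choose the number of clusters; consensus entries close to 0 or 1 indicate that pairs of samples are either consistently separated or consistently grouped.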

Supplementary Material


Supplementary Information is linked to the online version of the paper on


We thank Chris Stubban, Helga Dittmer, Svenja Zapf and Hildegard Meissner for their work with various cell cultures. We are grateful to Dustin Wakeman, Rodolfo Gonzalez, Scott McKercher, Jean Pyo Lee, Hyun-Sook Park, and Shin Yong Moon for sharing their cell preparations for the type collection. We are especially grateful to Robin Wesselschmidt and Martin Pera for their unique GCT lines and George Daley for providing human iPSCs. Arif Murat Kocabas and Jose Cibelli shared their human oocyte expression data with us. Aaron Barsky let us use the CEREBRAL 2.0 plug-in before its publication. Maggie Rosentraeger helped to compile the cell culture meta-data. We thank Josef Aldenhoff, Dunja Hinze-Selch, Manfred Westphal, Katrin Lamszus, Uwe Kehler, David Barker, and Anja Fritz for their support and discussions of this project.

Financial support: This study has been supported by the following grants and awards: Christian-Albrechts University Young Investigator Award (FJM), SFB-654/C5 Sleep and Plasticity (FJM and Dunja Hinze-Selch), Hamburger Krebsgesellschaft Grant (NOS), Edmond J. Safra Bioinformatics program fellowship at Tel-Aviv University (UI), Converging Technologies Program of The Israel Science Foundation Grant No 1767.07 (RS), Raymond and Beverly Sackler Chair in Bioinformatics (RS), Reproductive Scientist Development Program Scholar Award K12 5K12HD000849-20 (LL), California Institute for Regenerative Medicine Clinical Scholar Award (LL), NIH P20 GM075059-01 (JFL), the Alzheimer’s Association (JFL), and anonymous donations in support of stem cell research.


1. Müller FJ, Snyder EY, Loring JF. Gene therapy: can neural stem cells deliver? Nat Rev Neurosci. 2006;7:75–84.
2. Murry CE, Keller G. Differentiation of embryonic stem cells to clinically relevant populations: lessons from embryonic development. Cell. 2008;132:661–80.
3. Adewumi O, et al. Characterization of human embryonic stem cell lines by the International Stem Cell Initiative. Nat Biotechnol. 2007;25:803–16.
4. Brunet JP, Tamayo P, Golub TR, Mesirov JP. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A. 2004;101:4164–9.
5. Gao Y, Church G. Improving molecular cancer class discovery through sparse non-negative matrix factorization. Bioinformatics. 2005;21:3970–5.
6. Ulitsky I, Shamir R. Identification of functional modules using network topology and high-throughput data. BMC Syst Biol. 2007;1:8.
7. Carpenter MK, Rosler E, Rao MS. Characterization and differentiation of human embryonic stem cells. Cloning Stem Cells. 2003;5:79–88.
8. Goldman B. Magic marker myths. Nature Reports Stem Cells. 2008.
9. Monti S, Tamayo P, Mesirov J, Golub T. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning. 2003;52:91–118.
10. Palmer TD, et al. Cell culture. Progenitor cells from human brain after death. Nature. 2001;411:42–3.
11. Schwartz PH, et al. Isolation and characterization of neural progenitor cells from post-mortem human cortex. J Neurosci Res. 2003;74:838–51.
12. Kornblum HI, Geschwind DH. Molecular markers in CNS stem cell research: hitting a moving target. Nat Rev Neurosci. 2001;2:843–6.
13. Takahashi K, Yamanaka S. Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell. 2006;126:663–76.
14. Takahashi K, et al. Induction of pluripotent stem cells from adult human fibroblasts by defined factors. Cell. 2007;131:861–72.
15. Yu J, et al. Induced pluripotent stem cell lines derived from human somatic cells. Science. 2007.
16. Park IH, et al. Reprogramming of human somatic cells to pluripotency with defined factors. Nature. 2008;451:141–6.
17. Okita K, Ichisaka T, Yamanaka S. Generation of germline-competent induced pluripotent stem cells. Nature. 2007.
18. Zeng X, et al. BG01V: a variant human embryonic stem cell line which exhibits rapid growth after passaging and reliable dopaminergic differentiation. Restor Neurol Neurosci. 2004;22:421–8.
19. Cowan CA, et al. Derivation of embryonic stem-cell lines from human blastocysts. N Engl J Med. 2004;350:1353–6.
20. Efron B, Tibshirani R. On testing the significance of sets of genes. The Annals of Applied Statistics. 2007;1:107–129.
21. Wang J, et al. A protein interaction network for pluripotency of embryonic stem cells. Nature. 2006;444:364–8.
22. Wang QT, et al. A genome-wide study of gene activity reveals developmental signaling pathways in the preimplantation mouse embryo. Dev Cell. 2004;6:133–44.
23. Chambers I, et al. Nanog safeguards pluripotency and mediates germline development. Nature. 2007;450:1230–4.
24. Tesar PJ, et al. New cell lines from mouse epiblast share defining features with human embryonic stem cells. Nature. 2007;448:196–9.
25. Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–7.
26. Donoho D, Stodden V. When does non-negative matrix factorization give correct decomposition into parts? Advances in Neural Information Processing Systems NIPS*2003 Online Papers. 2003.
27. Lacayo NJ, et al. Gene expression profiles at diagnosis in de novo childhood AML patients identify FLT3 mutations with good clinical outcomes. Blood. 2004;104:2646–54.
28. Ewing RM, et al. Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol Syst Biol. 2007;3:89.
29. Mishra GR, et al. Human protein reference database--2006 update. Nucleic Acids Res. 2006;34:D411–4.