S-MAP calculates scores for gene modules, sets of related genes, based on the pattern of differential expression of each module’s constituent genes observed in several studies. It uses the scores to classify modules according to whether the genes are associated with up-regulation in stem cells (stemness-on), down-regulated in stem cells (stemness-off) or neither. S-MAP can then use the lists of stemness-related modules to classify additional datasets according to their stemness signatures using a Stemness Index score. We describe the definition of S-MAP’s non-redundant set of modules, collection of the stem cell expression compendium, and the various scores for classifying modules and datasets.
Definition of gene modules
A non-redundant set of gene modules was collected that included both putative homolog families determined by BLAST similarity analysis and genes that participate in common pathways (Figure S1a
–b). Our method used sets of functionally- and evolutionarily-related genes to determine if each set has an expression pattern significantly associated with stemness. We collected a large number of gene modules including putative homolog families and genes that participate in common pathways.
We identified putative homolog modules by determining collections of genes sharing high protein sequence similarity. BLASTP (at an E-value cutoff <0.05) was used to align the entire mouse proteome (UCSC version mm9) containing 45,480 peptides with an associated EntrezGene identifier. For each pair of proteins, only the alignment with the highest BLASTP E-value, as well as the highest overall sequence coverage, was chosen as representative of the gene pair. Only gene pairs whose sequences had an E-value smaller than 10−70
and coverage of more than 50% were connected. A depth-first traversal on the resulting protein-protein similarity network was used to identify all connected components totaling 14,941 in total. Of these, 11,920 had only a single gene in the group. To incorporate more evolutionarily distant relationships and connect genes in singleton groups to more distantly related homologs, unconnected genes were assigned to the module containing the gene with the most similar protein sequence if its best match exceeded a less stringent cutoff of 10−10
and coverage of at least 50%. The process was iteratively repeated until singleton modules could no longer be assigned to larger gene sets. The process converged on a final set 4,659 homolog modules and decreased the number of genes remaining unconnected to 5,249. Of the homolog modules, six represented very large families of over 100 gene members. We found that the homolog module definitions were highly concordant with HomoloGene 
) even though HomoloGene was not designed specifically to delineate paralogous groups.
We collected functional modules from five different data sources including Gene Ontology 
, BioCarta’s curated pathway sets 
, CORUM’s experimentally derived mouse protein complexes 
, and putative protein complexes identified from dense sub-networks derived from BioGRID’s database of interacting proteins 
. For BioGRID, modules were also identified from human datasets and the genes were mapped to mouse using best-reciprocal BLASTP hit analysis.
Non-redundant collection of modules
While the S-MAP procedure could score all collected modules, we defined a non-redundant set of modules to use for assessing the overall significance and distribution of scores. Allowing redundancy could skew the results because of highly characterized pathways. For example, Gene Ontology contains a deep hierarchy with many sub- and sub-sub-categorizations under the cell cycle process because it has been studied extensively in yeast. A redundancy filter on the functional modules helped us avoid redundant annotations of highly characterized pathways and protein complexes from dominating the results.
To filter out redundant functional modules, we first sorted the functional modules from smallest to largest and excluded any having a 25% or greater gene overlap with a smaller functional module. Functional modules with over half of the genes (50% overlap) belonging to the same set of homologs were also excluded because any such module also reflects an evolutionary connection between the genes to a greater degree than the functional connection. The relatively high overlap cutoff (50%) between evolutionary and functional gene modules allows genes with multiple functional roles to be captured by different module types. We collected all homolog and functional modules into the set Ω.
Collection of a compendium of differentially expressed gene lists
We collected lists of differentially expressed genes from 30 transcriptional profiling studies, corresponding to 49 different cell populations and 12 different stem cell types. For each stem cell population in each study, we collected the list of clones found to be up-regulated relative to differentiated cells. Likewise, a differentiated-cell gene list (DGL) was constructed by including genes found to have higher expression in differentiated cells compared to stem cells. The lists were obtained or inferred either directly from the publication or by communication with the authors. To obtain a stem cell gene list (SGL), clones were mapped to their corresponding EntrezGene identifiers. If a gene had even a single matching differentially expressed clone it was included in the SGL. Clones without a corresponding EntrezGene identifier were excluded from further analysis.
For each module, a stem-cell event matrix S
was constructed from the set of all SGLs such that Sij
is empty if gene i
was not tested in study j
, 1 if it was found to be differentially up-regulated, and 0 otherwise. In addition to the lists representing genes over-expressed in stem cells, we also derived lists with genes over-expressed in differentiated cells (or under-expressed in stem cells). Differentiated-cell gene lists (DGLs) were obtained in an analogous fashion using the clones reported to be downregulated in each study. A DGL was obtained for all but four of the studies in the compendium. Each module was also associated with a corresponding differentiation event matrix D
recording the up-regulation events across the DGL compendium for that module. The descriptions of the publications, cell types, and gene list sizes are listed in Dataset S1
The recurrence score used for S-MAP (described subsequently) was motivated by the need to compensate for potential biases in combining heterogeneous results. There are obvious variations in the overlap and sizes of the SGLs and DGLs, which reflect the differences in the compared biological specimens, cell, and RNA isolation techniques, hybridization protocols, and statistical methods applied to identify differentially expressed clones. The recurrence score down-weights the influence of studies reporting highly similar gene lists to avoid a bias due to over-representation of any one stem cell type. Before computing recurrence scores for genes or modules, we calculated the similarity of all gene lists to derive groups of gene lists. Gene lists in the same group could then be collectively down-weighted in the recurrence scoring. The identification of these gene list groups is described next.
Identifying groups of similar gene lists
Groups of similar gene lists from possibly different studies were identified by performing an all-against-all comparison of the collected gene lists, constructing a network from the pair-wise overlaps, and then unifying lists that fell into the same sub-network (Figure S1c
). Gene list groups were determined by clustering similar gene lists on a network. All pairs of gene lists were compared and the significance of their overlap was determined by calculating the hypergeometric P
value reflecting the probability of obtaining k
overlapping genes by drawing two clusters of size n
randomly from a genome of size N
. Gene lists with overlap P
values smaller than 10−50
were linked together to form a network of gene lists. Less stringent cutoffs did not separate gene lists from the same or different stem cell types.
was used to ordinate and visualize the resulting network using the negative logarithm of the hypergeometric P
values as the weights for spring-embedded layout (). The final groups were determined by visual inspection of the network. Subnets of mutually similar gene lists of the same stem cell type were identified with the same group. This yielded three groups of sizes 2, 5, and 7 with the remaining left as degenerate groups of size 1. While the Ramalho-Santos HSC gene list 
had overlaps with many other HSC-derived gene lists and appeared at first to warrant inclusion in the HSC group of size 5, the significance of the overlaps were borderline and less than all of the other HSC overlaps. For this reason, we left the Ramalho-Santos HSC gene list in its own degenerate group. We denoted the set of gene list groups including degenerates as B1, B2,…, BF
, where F
is the number of groups identified by the above procedure. The Bf
's define a mutually exclusive and exhaustive partition of the experiments in the compendium. We write experiment j
's replicate group with the notation b(j)
. For example b(j)
1 is interpreted to mean the jth
experiment was placed in the first group, B1
. Thus, the notation Bb(j)
represents the set of gene lists grouped with a particular gene list j
including gene list j
A score called recurrence was developed to measure the degree to which observed differential expression in a module is replicated across independent gene lists. Two recurrence scores were computed – one measuring repeated up-regulation across the SGLs and the other down-regulation across the DGLs. The DGL recurrence is interpreted as a measure of specificity for a module when considering its association to stemness. The recurrence score uses a meta-analysis that computes a weighted count of the number of upregulated genes across all of the studies, observed for a particular module. The score incorporates a weight for each experiment that is inversely proportionally to the number of genes upregulated in the study. Thus, studies with smaller, more specific gene lists are given higher weight than studies with large, less specific lists. The score also incorporates the redundancy of the gene lists by allowing each group of similar lists to have a single vote no matter how many lists are in a group.
A module M
was considered to have a stemness-associated pattern of expression if some of its genes were found in a high proportion of SGLs. We developed a score called recurrence
that measures the degree to which observed differential expression in a module is replicated in independent gene lists. In practice, we compute two recurrence scores – one measuring repeated up-regulation across the SGLs and the other down-regulation across the DGLs – but here we describe the calculations only for SGLs as the others for the DGLs are analogous. The recurrence score used for module M
computed over the SGL matrix S
is a gene-specific score for gene i
measuring its degree of up-regulatation across all of the studies, Ti(S)
is a normalization constant to compensate for the number of genes in the module, and the q
exponent determines how much weight frequently expressed genes have on the score relative to less frequently expressed genes.
A gene-specific score was used that reflects the frequency with which a gene is up-regulated in the SGL compendium. Because the SGLs are of highly different sizes, we introduced a “signal strength” zj
for each gene list j
to account for the differences in specificities inherent in the experimental results. We computed the gene specific score:
is the total number of SGLs, uij
is the unit of weight contributed by gene list j
in its gene list group, and Sij
encodes the binary event that gene i
is upregulated in SGL j
as defined above. The uij
term forces each gene list to contribute the same amount of weight as any other experiment in its group and is a function of the gene because different genes may be tested in different sets of experiments in the group. This term is defined as:
is 1 if the ith
gene was tested in experiment j
and 0 otherwise. Experiments in large groups get a smaller weight while experiments in small or singleton groups get the largest weight. For example, suppose three gene lists are grouped together and one gene is tested in two of the lists, while a second gene is tested in all three. For the first gene, two experiments would contribute a weight of 1/2 while, for the second gene, all experiments would contribute a weight of 1/3.
Finally, the normalization constant for the recurrence score ensures that the zjuij
products sum to unity to normalize for the overall size of the module; i.e. Ti(S) = Σj zjuij
, where j
ranges from 1
. A completely analogous score was used to compute the recurrence of a module’s down-regulated events across the compendium. Only, in this case, the D
matrix was used in place of S
so that each module also had an associated R(M,D)
recurrence measure as well.
The recurrence score is similar to vote-counting meta-analysis methods that count the number of occurrences of a particular observation in a set of independent tests. In addition, as described next, it incorporates a weight for each experiment similar to the inverse variance weighting used for combining effect sizes (or inverse document frequency weighting used in information retrieval applications).
Simulation to determine recurrence parameters
Due to the lack of appropriate positive controls, we used synthetic data to assess the accuracy of recurrence scores with different parameter choices (q and zj) at identifying significant up-regulation across many stem cell types.
First, the size (n) of each module was randomly sampled from an exponential distribution with a mean number of four genes to match the size distribution of modules in our collection. Stemness-related modules were simulated by randomly sampling three different idealized stemness patterns: 1) Many genes expressed in Many tissues (MM), 2) Single gene expressed in Many tissues (SM), and 3) Single gene expressed in a Single tissue (SS).
For each experiment, MM modules were simulated by independently drawing two genes at random such that all genes were equally likely to be selected. SM modules were simulated by first selecting a random gene from the module to have a high probability (p
0.8) and the rest a low probability (p
0.2/(n-1)). For each of the nine synthetic experiments, a single gene was chosen as up-regulated according to these probabilities. Finally, SS modules were simulated by independently choosing a single gene to be up-regulated in each experiment where each gene was equally likely to be selected.
In addition to the simulated stemness-related modules, two types of non-stemness modules were simulated: 1) tissue-specific (TS) modules in which genes were primarily expressed in a single tissue, and 2) non-related (NR) modules in which genes exhibited up-regulated events expected by chance. First, a set of preferentially up-regulated genes U
was simulated by selecting each gene independently in the module with probability p=
0.5. For TS modules, a random experiment was selected and assigned a higher probability (p=0.8
) relative to the other experiments (p=0.2/(J-1))
. TS modules were then simulated by allowing only genes in U
to be up-regulated, each up-regulated in one tissue chosen according to these biased probabilities. For NR modules, the simulation was similar but all experiments had an equal probability of selection.
Nine experiments were simulated for 2000 modules among which 120 represented stemness modules (MM, SM, and SS in equal proportions). The remaining non-related modules were simulated in the ratio 2:3 (TS:NR).
For the q
exponent, we tested values of 0.5, 1, 2, and 3. For the signal strength zj
, three different functions of the size of gene list j
were tested – 1/vj
– where vj
corresponds to the size of gene list j.
, we also tested an additional zj
1, equivalent to the use of no weighting.
All 13 combinations were evaluated for their ability to find different stemness-associated patterns in synthetic data, and the results were tabulated in Table S2
. The area-under-the-curve (AUC) was calculated for the different q
choices. To estimate the AUC, we ranked all simulated stemness-related and unrelated modules by their scores. Precision and recall were computed by sweeping through cutoffs in the recurrence score. False positives were assumed to be those unrelated modules with scores above the cutoff; false-negatives were assumed to be those related modules with scores below the cutoff.
The critical parameter in the simulation was q
. Smaller values of q
produced more accurate results for modules that use multiple genes in multiple tissues, while higher values of q
produced more accurate results for modules that use a single gene in many tissues. The combination q=2
and zj_= -log(vj)
gave the highest average AUC among all combinations (Table S2
) and was used for the analysis.
A cell-diversity score based on information theory was used to distinguish between modules that are up-regulated in many, compared to few, stem cell types. The score is computed on the relative proportions of up-regulated genes in the module rather than on the absolute proportions. Thus, even a module expressing a small number of genes on average can have a high diversity if the same (small) fraction of its genes is expressed across many cell types. In contrast, a module with a low diversity has a higher relative proportion of up-regulated genes in one or a few stem cell types compared to the relative proportion in other types.
A relative frequency vector was derived from a module’s event matrix by computing, for each cell type l
, the relative proportion of its genes found to be up-regulated across all SGLs of type l
represents the set of experiments of cell type l
is the total number of such cell types. The second equation above defines f’l
to be the relative frequency of the module genes present in any SGLs of type l
. The cell-diversity of a module is defined as the normalized entropy over these relative frequencies:
is the base 2 logarithm of x
ranges between 0 and log |L| “
bits,” with a module obtaining a value near the maximum when its up-regulated events are roughly equal across all cell types.
An analogous gene-diversity score was defined to quantify how evenly utilized the genes are across the studies. The gene-diversity score Dgene(M) was calculated on the transpose of the event matrix. Cell- and gene-diversity scores were computed over the DGLs. We computed normalized versions of cell-diversity, denoted Dcell’(M), by dividing by log |L|, and of gene diversity, denoted Dgene’(M), by dividing by log |n|, to provide an intuitive measure of the diversity ranging between 0 and 1.
Significance of the recurrence and diversity scores
The significance of the recurrence and cell-diversity scores was determined by simulating randomly drawn modules with matched sizes to generate size-specific false discovery rates. Each module could then be compared to a thousand permuted modules of the same size to identify those with significantly scores.
The data within each experiment were permuted to produce 1,000 different SGL compendiums and their corresponding event matrices. The total number of up-regulated genes in each experiment was kept the same, but a random set of genes of the same size was simulated as up-regulated. The permutations preserved the correlation structure between gene lists in the same gene list group (Figure S1c
) by keeping blocks of data values for each gene within each group together. This procedure produced randomly generated event matrices
from which an empirical distribution for the recurrence was obtained by calculating the recurrence for each module and each permuted event matrix. Because we expected the distribution of the recurrence to be influenced by the size of a module, we grouped modules into six size classes 1, 2, 3, 4, 5-10, and >10 and denote the modules in the kth
class as Ω(k)
). From this null distribution, we estimated the false-discovery rate of a recurrence score for a module of size-class k
is the indicator function equal to 1 if the argument x
is true and 0 otherwise. To identify all modules with a score associated with an FDR of 5% or less, we determined a critical value c
for each module size class that gave FDR(c,k)
We used a similar random sampling approach to assess the significance of the cell-diversity scores. Randomized families from 1,000 permutations were used to generate cell-type diversity scores for random modules. For each module size, a false discovery rate was estimated for various cell-diversity cutoffs ranging between 0 and 3.6 bits (log (12) for the 12 stem cell types). Inspection of these cutoffs suggested that the FDR estimates for modules of different sizes were highly comparable (Figure S2c
). Therefore, a single overall cell-diversity cutoff of 2.5, calculated as a weighted average across all modules, was used to achieve a desired FDR rate of 5%. Any module with a cell-diversity exceeding 2.5 was considered “cell type diverse.”
Because modules have a variety of different genes, identifying FDR cutoffs is problematic, especially for small modules. Therefore, a simple cutoff was used to distinguish modules utilizing many genes versus those utilizing a few. Modules with calculated normalized gene diversity scores higher than 0.5 were considered “gene-diverse”.
While individual genes in a module could be exclusively expressed in some stem cells, if a module has many such genes (differentially expressed in a diverse set of cell types), it may not always be stemness-specific. For example, even a recurrent module could have a subset of genes switched on in stem cells and a different subset switched on in differentiated cells. To identify modules with highly specific patterns, we used the recurrence score computed on the DGLs as a measure of specificity. If a module’s SGL recurrence had an FDR less than 5% and its DGL recurrence had an FDR higher than 95%, it was classified as “highly specific”. Modules with DGL recurrence FDR between 5% and 95% were classified as “moderately specific.” Highly and moderately specific modules were included in the stemness categories.
For all modules found to have significant recurrence (FDR <5%), we classified their pattern of expression using cell- and gene-diversity. Restricting our investigation to the recurrent modules is expected to reduce false positive inferences but may miss some interesting biological phenomena. Based on the cell- and gene-diversity scores, modules were assigned to one of six possible categories: all-for-all (AFA), one-for-all (OFA), constitutive module (CM), and constitutive gene (CG), all-for-one (AFO), and one-for-one (OFO). The definitions used were as follows: AFAs, OFAs, CMs and CGs had cell-diversities of at least 2.5 bits of information entropy, while AFOs and OFOs had lower cell-diversity values. AFAs, AFOs and CMs had normalized gene-diversities of at least 0.5 while OFAs, OFOs, and CGs had lower gene-diversity values. CMs and CGs were non-specific, as they showed significant recurrence in both SGLs and DGLs and may reflect housekeeping functions. AFA and OFA modules were defined as stemness modules and are referred to as stemness-on if identified from the SGLs and stemness-off if from the DGLs.
Deriving an expression map of stemness modules
Each module was represented as a vector of fractions of upregulated genes in each cell type. We calculated the Pearson correlation between all pairs of modules. The large number of correlations was reduced by a filtering step, such that only correlations higher than 0.8 were kept. To reduce the number of spurious unrelated connections, at least one of the two compared modules was required to be recurrent. The modules (nodes) and their correlations (edges) were loaded into Cytoscape. A force-directed layout algorithm was used to position the modules onto the X-Y plane such that the correlations were treated as pair-wise force constraints among the modules.
Stemness index (SI)
We used the set of stemness-on modules M+
and stemness-off modules M−
identified by S-MAP to develop a score capable of recognizing that a new query set of differentially expressed genes Q
is enriched for stemness modules. We defined the Stemness Index
) as a contrast between the overlap of the query with modules indicative of stemness compared to those indicative of differentiated cells. To this end, we defined the Stemness Overlap (SO) to be the weighted overlap of Q
with all of the stemness modules – the higher the overlap, the more likely the query was derived from genes up-regulated in stem cells. To increase the sensitivity of the score for identifying new modules involved in core stemness, overlaps with modules of high cell diversity were more highly weighted compared to modules of lower diversity. We defined the weight of influence for a module in proportion to its cell-diversity as
. The SO could then be defined as:
represents the observed fraction of overlap between the query Q
and module M; the denominator in the logarithm represents the fraction of overlap expected by chance. The form of the SO score is similar to a log-likelihood ratio where each term in the summation can be thought of as the log-probability that the query would be observed given it was produced by module M
, contrasted by the log-probability of it being produced by chance. A Differentiation Overlap (DO) was calculated identically except that M−
was used in place of M+
for the calculation of the module weights and the score. For a given list of genes, we compute the Stemness Enrichment (SE) as half of the difference between the SO and DO scores. A positive SE indicates that the gene list Q
has a higher overlap with stemness modules compared to differentiated modules on average.
The stemness index (SI) score assigns a single number to a type-II study, one that compares two populations. Each type-II study has two associated gene lists: a positive gene list containing genes upregulated in the experimental population (e.g. stem cells) and a negative gene list containing genes upregulated in the control population (e.g. differentiated cells). Let Q+
represent the postive and negative gene lists for an experiment respectively. An experiment that induces stemness-related modules to the exclusion of differentiation-related modules will receive a high SE(Q+
). By the same token, the same experiment could also repress differentiation modules to the exclusion of stemness modules detected in the negative list and therefore receive an extremely negative SE(Q−
). To capture both of these indicators of stemness, we defined the stemness index (SI) score as the difference between the SE(Q+
) and the SE(Q−
). A positive SI indicates that the experiment’s positive gene list has higher correspondence with stemness-on compared to stemness-off modules relative to the experiment’s own negative gene list. Values near zero indicate either that an experiment’s queries have little overlap with any stemness-on or –off modules, or that they overlap with modules from both in equal proportions. For this reason, we also consider the total overlap of these gene lists against SI. The total module overlap (TMO
) was computed as the sum of all of the module overlap scores for both the positive and negative gene lists; i.e. TMO(Q+,Q−)
1/2 * (SO(Q+) + SO(Q−) + DO(Q+) + DO(Q−))
To assess the self-consistency of the SI score, we used a 5-fold cross-validation framework. In each cross-validation run, we held out 20% of the SGLs (and their corresponding DGLs). We then identified stemness-on and –off modules using recurrence, cell-diversity, and specificity applied to the remaining 80% of the gene lists. The SI was then calculated on all of the lists in the held-out 20%.
We compared the ability of homolog versus functional modules to contribute accurate information to the SI
. Each set was used without the other. Surprisingly, homolog modules showed higher precision than functional modules at all levels of recall (Figure S6a
). The precision-recall plot showed significant levels of precision compared to a control in which SI
was computed on random modules (Figure S6b
Assessing the significance of the stemness index and total module overlap
We asked whether the observed SI and TMO scores were significant. We constructed a background distribution for these scores by simulating pairs of random gene lists. For both the up-regulated and down-regulated genes in an experimental set, we simulated random gene lists of the same size by drawing without replacement from the set of all genes tested in the experiment. This sampling procedure was repeated twenty times to generate twenty pairs of matched upregulated and downregulated gene lists for each experimental data set. The SI and TMO were calculated on all simulated pairs. Because we expect the variability in the SI to increase with an increase in TMO (because, for example, extremely positive or negative SI’s must necessarily have high TMO levels), we estimated both the mean and standard deviation of the simulated pairs as a function of their TMO. We used windows of width equal to one TMO unit and incremented by 0.1 units.
Detection of Enriched Biological Categories
We compiled a collection of functional categories from Gene Ontology 
, BioCarta 
, KEGG 
and MIPS 
into a single set of functional gene groups. To avoid any categories that may be too general, we restricted the collection to functional categories that had 100 member genes or less. For each module, we assessed the significance of its overlap with every individual functional category using the hypergeometric distribution to estimate a P
-value (PV) as:
is the number of genes in the module that are also in functional group, n
is the total number of genes in the module, K
is the number of genes in the functional group, and N
is the number of total genes with any functional annotation. All P-
values were corrected for multiple hypotheses testing using the Bonferonni correction and a significance level of 0.01 was chosen for further analysis.
We developed a Stemness Index (SI) to contrast the degree to which a study upregulates modules associated with stemness compared to those associated with differentiated cells. For a particular study, we are given two lists of genes: one representing genes up-regulated (Q+) and another representing those down-regulated (Q−) in the conditions of the study. Both Q+ and Q− are overlapped with every S-MAP module to compute a score reflecting the total overlap to stemness and differentiated modules, weighted by each module’s cell-diversity. A total overlap to all S-MAP modules provides an indication of how relevant a prediction of stemness might be to the study. For studies found to have at least some non-random level of overlap, we can then ask whether the upregulated genes show a biased correspondence to stemness or differentiated modules – i.e. Q+ may overlap with stemness modules in higher proportion than Q− or vice versa. To quantify this intuition, we used a stemness enrichment (SE) score to measure the difference in overlap between stemness modules and differentiated modules for both Q+ and Q−. The final SI was then defined as SE(Q+)-SE(Q−).