|Home | About | Journals | Submit | Contact Us | Français|
High-throughput gene-expression studies result in lists of differentially expressed genes. Most current meta-analyses of these gene lists include searching for significant membership of the translated proteins in various signaling pathways. However, such membership enrichment algorithms do not provide insight into which pathways caused the genes to be differentially expressed in the first place. Here, we present an intuitive approach for discovering upstream signaling pathways responsible for regulating these differentially expressed genes. We identify consistently regulated signature genes specific for signal transduction pathways from a panel of single-pathway perturbation experiments. An algorithm that detects overrepresentation of these signature genes in a gene group of interest is used to infer the signaling pathway responsible for regulation. We expose our novel resource and algorithm through a web server called SPEED: Signaling Pathway Enrichment using Experimental Data sets. SPEED can be freely accessed at http://speed.sys-bio.net/.
Signal transduction pathways integrate information about the cellular environment and regulate the activity of a multitude of transcription factors, thereby controlling cellular processes, such as migration, proliferation and differentiation through context-dependent gene expression. While the techniques to measure transcript levels on a genome scale have matured and become affordable, it is not yet possible to get a global picture of the activity of signaling pathways in a high-throughput manner.
Typical strategies to analyze gene-expression profiles include searches for overrepresented biological functions, molecular mechanisms or pathways that the regulated genes are involved in, often utilizing annotation databases like Gene Ontology (1) and pathway databases such as KEGG (2). While these strategies have proven very useful to systematically organize the results, it is often forgotten that these analyses mainly detect pathways that are regulated in response to the perturbation—and not the cause of the observed regulatory pattern. If, for example, the MAPK signaling pathway has been found overrepresented in the regulated genes it only shows that the pathway is regulated in response to a perturbation, but not that the MAPK signaling pathway caused the gene regulation. Although genes regulated by a signaling pathway may involve genes that encode for members of the same pathway (feedback regulators), they may also regulate genes that are members of other pathways (transcriptional cross-talk) (3). Thus, pathway membership of regulated genes is unlikely to unveil the regulating pathway, and other strategies have to be employed in order to probe the origin of a certain gene-expression pattern.
One such computational strategy that is often used is the search of overrepresented binding sites in the promoter region of the regulated genes. This strategy is indeed suitable for finding the transcription factors causing the regulation. However, the method is seriously hampered by the poor specificity of computational prediction of transcription factor binding sites, due in large part to the high degeneracy and low information content of the recognized binding sequences (4). In the future, development in experimental strategies such as ChIP-chip or ChIP-seq will likely improve this situation. Still, links between signaling pathways and the transcription factors are only known for a limited number of cases. Thus, even if one has discovered which transcription factor is causing the regulation, in most cases one still cannot pin down the signaling pathway upstream of this transcription factor.
In order to infer signaling pathways that caused the regulation of a group of genes, we developed Signaling Pathway Enrichment using Experimental Data sets (SPEED), a data collection and algorithm that allows for identifying signaling pathways that cause an observed regulatory pattern. The key idea behind SPEED is that the same signaling pathway typically regulates a similar (small) core set of genes in most cell types, while different signaling pathways typically regulate different core sets of genes. We term such a set of consistently regulated genes ‘signature genes’ for that particular pathway. A list of regulated genes in an experiment can then be compared to a catalog of signature genes for different signaling pathways. In the following section, we present the SPEED web server with application to the analysis of two acute myeloid leukemia (AML) data sets.
Gene-expression data sets were identified manually in the gene-expression omnibus (GEO) database based on their abstracts. We concentrated our search on single-perturbation experiments measured within 4 h after treatment to limit secondary effects. This time span represents the shortest interval, which still allowed collection of a sufficient number of data sets in most pathways. If, however, not enough such data sets were available for a particular pathway, long-term experiments such as stable overexpression or knockdown experiments were included if their effect on signature genes was assumed to be prominent.
Custom R-scripts were written to download and process the selected expression data sets automatically. Data for each micro-array experiment within the data set were subjected to quantile normalization and log-transformation if needed. Finally, a LOESS model for the standard deviation as a function of expression value was trained either on replicate experiments (when available) or on all experiments and was used to compute Z-scores. Z-score (x) = (x − μ)/σ, where μ is the sample mean and σ is the standard deviation. The probes were annotated with Entrez Gene identifiers using GEOquery.
SPEED runs on an Apache server with the pre-processed micro-array data stored in an SQLite 3.0 database (see Supplementary Figure S2 for database schema). The SPEED algorithm is implemented in Python 2.5 with dependency on the scipy 0.7.0 (http://www.scipy.org/) package. Graphical output is dynamically generated using the Google Chart API (http://code.google.com/apis/chart/).
Hierarchical clustering was performed using the statistical package R, clusters were linked using Ward’s method. The distance between two pathways was defined as: −log [# common signature genes/min (# signature genes for each set)].
Gene lists were extracted directly from the manuscripts or supplementary data and were not further processed (see Supplementary Table S1 for references). Raw data were not available for these lists. Every significant hit in SPEED had a false discovery rate (FDR) < 0.05.
The FANTOM4 data set was downloaded as tab delimited text files from the Center for Information Biology gene EXpression (CIBEX) database (http://cibex.nig.ac.jp/cibex2/ExperimentMiame.do?queryExperimentalDesignAccession=CBX47). Each of the 52 transcription factor knockdowns in THP-1 cell line were conducted in three replicates. Probe expression values were divided by appropriate controls and the ratios were log (base 2) transformed. Probes were only considered valid if the transcript was expressed in all three replicates for at least half of the 52 transcription factor knockdowns. Valid probe IDs were converted to Entrez Gene IDs and used as background. Genes with at least 2-fold differential median expression were considered for input to SPEED.
We define signature genes as genes that are consistently regulated by a particular pathway across many experiments. Overrepresentation of such signature genes in a list of differentially expressed genes can then hint at the signaling pathway that caused the regulation. We set up a pipeline to compile publicly available expression data sets from pathway perturbation experiments, store them into a database, extract signature genes, and finally detect significant overlap with a user-supplied gene group via a web server (http://speed.sys-bio.net/, see schematics in Figure 1).
In order to extract signature genes, we selected a set of 11 signaling pathways: TGF-β, H2O2, TLR, IL-1, MAPK, PI3K, MAPK+PI3K, Wnt, JAK-STAT, TNF-α and VEGF. For each signaling pathway, we manually searched the GEO database (5) for gene-expression data sets where this signaling pathway is specifically perturbed (see ‘Materials and Methods’ for selection criteria). The data sets were automatically downloaded from GEO (6), annotated and normalized. The expression changes per probe were then transformed into Z-scores per gene (see ‘Materials and Methods’ section for details). Subsequently, the gene ID, expression rank and Z-score rank were stored in the SPEED database. Currently, 215 sets of micro-array experiments are stored in the database, resulting in almost 6 million expression values for 21 485 human genes.
From this database, signature genes for each pathway can be extracted. We define three criteria for a signature gene: it has to be (i) expressed and (ii) regulated (iii) across several experiments for each pathway. As the stored micro-array data originates from different sources and even platforms, we chose to apply thresholds based on percentiles to select the signature genes. For each of the data sets, genes are defined as expressed if their expression value rank is higher than a pre-determined threshold [criteria (i); default: top 50%). Next, regulated genes are extracted based on the rank of their Z-score [criteria (ii); default: top 1%). Finally, genes are selected as signature genes if they are expressed and regulated in more than a pre-determined percentage of data sets for a pathway [criteria (iii); default: >20%). Users can adjust these parameters to create lists of signature genes on the fly.
The mean number of signature genes per pathway using the default parameters is 139. The signature gene lists for any given parameter set can be downloaded as tab delimited text directly from the SPEED web server. Figure 2 lists the mean number of signature genes per pathway for various parameters.
Since signaling pathways form highly interconnected networks, the definition of distinct signaling pathways is difficult and often arbitrary. We therefore analyzed the overlap between signature genes between all pairs of pathways and noticed that there is considerable overlap. In order to group similar pathways, we performed hierarchical cluster analysis (Figure 3). Clustering revealed that the major clusters separate pathways involved in immune response from pathways controlling cell cycle and cell growth. Furthermore, VEGF signaling seems very closely related to pathways from both clusters (MAPK+PI3K signaling and TNF-α signaling). Also, signaling triggered by IL-1 and TNF-α shows strong overlap in the signature genes, probably due to the shared downstream IKK-β/NF-κB signaling cascade (7). We therefore added a feature in SPEED to select only those signature genes that are unique to each pathway. We note, however, that under such a uniqueness constraint the signature genes depend strongly on the thresholds chosen, and the predictive power of the algorithm does not improve (Figure 2; Supplementary Figure S1). Additionally, since we expect a strong crosstalk between pathways, non-unique signature genes seem to be a more biologically reasonable choice. We also decided to give the user the possibility to restrict their analysis on a subset of pathways. The default setting restricts the analysis on four major pathways JAK-STAT, MAPK+PI3K, TGF-β and TLR. While users have complete control in their choice of parameters, the selected parameters should be biologically reasonable (Figure 2 describes our recommended parameters).
In order to detect overrepresented signature genes, we apply Fisher’s exact test with a multiple hypothesis correction. For each pathway, it is tested whether the corresponding signature genes are overrepresented in the user-supplied list. A P-value is provided to indicate the significance for the overrepresentation. To adjust for multiple hypotheses, an FDR for the P-value (8) is also provided. Users can restrict the analysis to a subset of genes in the database by submitting a list of background genes. For example, if one submits a list of differentially expressed genes from a micro-array study, we would recommend using all significantly expressed genes from that micro-array study as background.
In order to test how well SPEED can predict modulated signaling pathway activity, we collected gene lists from literature with known signaling pathway perturbation. Note that gene-expression data from these literature sources was not integrated into SPEED. A result was considered to be true positive if the top-ranked pathway prediction matched the actual perturbed pathway and the FDR was below 0.05. Table 1 describes the validation results for the default pathways (see Supplementary Table S1 for details). The overall sensitivity, calculated as the fraction of all literature sources with correct pathway predictions, using default parameters was found to be 78%. Since traditional signaling pathway membership analyses do not aim at predicting causal pathways, SPEED outperforms GATHER (9), which correctly predicts the upstream pathway with a sensitivity of 29%.
Due to limited number of available positive sets in literature, we also conducted leave-one-out cross validation to get a better estimate of sensitivity. Once again, we considered a result to be true positive if the top-ranking predicted pathway was indeed the one that was perturbed. Using default parameter values, the sensitivity was calculated to be 83% of 130 tests. The classification problem here is not binary and therefore the expected sensitivity for random predictions is far <50%. In order to estimate the specificity, the fraction of negative input sets correctly identified as negative, we ran SPEED on 200 negative sets comprised of randomly selected genes from the database. The size of the negative sets was randomly chosen between 50 and 200 genes. The overall specificity and precision (the proportion of true positive results against all positive results) was calculated to be 98 and 96%, respectively. Figure 2 notes the sensitivity and specificity for different parameters.
Validation using the literature derived gene sets indicated a difference between genes regulated by a pathway and gene products that are members of the pathway. To verify the disparity between regulated genes and pathway members, we searched regulated genes from each SPEED experimental data set for overrepresentation of acting pathway members in KEGG Pathway (2), BioCarta and Panther (10) databases. 39% of all data sets had significant enrichment of the acting pathway members as determined by KEGG Pathway. 9 and 24% of all data sets had significant enrichment of the corresponding pathway in BioCarta and Panther, respectively. These findings suggest some degree of transcriptional feedback where genes regulated by a pathway translate to protein members within the same pathway. However, only 4 and 7% of all data sets had the acting pathway ranked as the top one in KEGG and Panther, respectively. No experimental data set had the acting pathway ranked as the top one in BioCarta (see Supplementary Tables S4 and S5 and Figures 3 and and4,4, for detailed results). The low top-rank percentage reasserts the unfavorable use of pathway membership databases for identifying pathways regulating a list of genes. Consequently, the SPEED data sets and signature genes are fundamentally different than lists of pathway members found in existing membership databases.
The scope of the SPEED web server is the identification of causal signaling pathways given a list of input human genes. Thus, we implemented the web server with an unencumbered user interface where the user can input gene lists in widely adopted formats including Entrez Gene ID, gene symbol, Uniprot, GI number, Refseq, Ensembl and IPI. Any other gene identifiers can be easily converted to those formats using available conversion tools such as Clone ID converter (11), DAVID bioinformatics resources (12) or the UniProt ID mapping service (13). Upon submission, the user is directed to the output page where the results are presented as a list of signaling pathways ordered by their significance with number of signature genes present in the input, P-values and FDRs for each pathway. Graphical outputs, such as a pie chart representing the distribution of signature genes in the input list and a bar chart describing the relative FDRs, further aid in interpretation of the results (Figure 4). The symbols of the signature genes present in the input are provided with hyperlinks to the respective NCBI Entrez Gene web page. Separate web pages are provided for database access to retrieve signature genes as tab delimited text or to browse through experimental details such as cell type, perturbation strategy and length of perturbation with hyperlinks to the original gene-expression data sources.
In addition to the web interface, we provide download options for the raw data as tab delimited text or as an SQLite database, which is a single cross-platform disk file, together with Python source code to facilitate local programmatic access to SPEED functionality. All SPEED code is open-source and free to use.
In the following sections, we demonstrate the functionality of SPEED through analysis of two AML data sets.
Patients with AML harbor mutations in one or more genes such as FLT3, NPM1, c-KIT and CEBPa (14). Some mutations like internal tandem duplication in FLT3 (FLT3-ITD) are associated with high-risk AML patients and confer poor outcomes (14). Marcucci et al. (14) conducted a study to search for prognostic markers in cytogenetically normal AML patients with high-risk molecular features (FLT3-ITD and/or NPM1 wild-type) with or without CEBPa mutations. The study identified 928 genes that were downregulated in CEBPa mutants as compared to wild-type. We searched the downregulated genes for enriched pathway annotations using GATHER (9) and DAVID (12), using standard parameters (Bayes factor ≥6 and corrected P-value ≤ 0.05 for GATHER and DAVID, respectively). No pathways were identified as containing a significant number of members in the input gene list, which suggests that the downregulated genes have little effect on signaling pathways.
As previously noted, searching for membership enrichment does not provide insight into the signaling pathways that causally regulate a list of genes. We searched the same list of downregulated genes using SPEED (default parameters with all pathways selected) and identified TNF-α as the top-ranking pathway (Figure 4). Interestingly, the top three pathways, TNF-α, IL-1 and TLR, form a single cluster (Figure 3).
In the recently published FANTOM4 study, the consortium knocked down 52 transcription factors in the AML cell line THP-1 (15) and measured changes in the transcriptome. We reasoned that if the target genes of the transcription factors overlap significantly with signature genes of a signaling pathway, the pathway is likely to be upstream of that transcription factor. We therefore used this data to see whether SPEED can be used to systematically infer relationships between signaling pathways, transcription factors and regulated genes. We extracted 52 lists of differentially expressed genes corresponding to each transcription factor knockdown (see ‘Materials and Methods’ section for details) and ran SPEED to identify upstream signaling pathways. SPEED identified at least one signaling pathway upstream of 24 transcription factors. Table 2 describes the top-ranking signaling pathway for each transcription factor (see Supplementary Tables S2 and S3 for all predictions). While for some of the factors, the identified pathway is known to be upstream (indicated in the table), for others our results are predictions that may provide starting points for experiments.
The analysis of gene lists obtained from high-throughput experiments is a classical problem in bioinformatics. Most approaches, however, search for overrepresented functions, processes or pathways in the regulated gene group and are thus only suitable to identify the affected signaling pathways and not the pathways that caused the observed changes in gene expression. With SPEED, we focus on targets of signaling pathways by automatically deriving signature genes, i.e. genes that are typically regulated in a variety of cell types when the activity of a specific pathway is altered. While existing web servers also compare user gene lists with disease related or single-experiment signatures derived from micro-array data (16–21), SPEED, to our knowledge, is the only web server that integrates heterogeneous micro-array data for the purpose of identifying causal signaling pathway relationships. Additionally, SPEED signature gene lists can be readily incorporated into established gene list comparison tools like Gene Set Enrichment Analysis (22).
The number of pathways in SPEED is currently limited by publicly available gene-expression data from pathway perturbation experiments. Currently, SPEED contains 11 signaling pathways. However, the presence of pathway perturbation data in GEO suggests that these 11 pathways are of special interest to researchers and are thus the major signaling pathways responsible for many phenotypes studied using gene-expression technologies. Nevertheless, we expect that the number of SPEED pathways will increase in the future as microarrays continue to become more affordable. Furthermore, we can readily incorporate quantitative next-generation sequencing data as the field matures. To facilitate user suggestions, we have included a form on the web server for recommending new pathways or data.
It is important to note that not all pathway perturbation data should be included in SPEED. The reliability and biological meaning of SPEED signatures crucially depends on the nature of the experiments. For example, we aimed at selecting experiments that measured gene expression at early time points in order to reduce any indirect or long-term effects. To further reduce indirect effects, we selected experiments such that single pathways were primarily perturbed. While some perturbations may affect more than a single pathway, the reproducibility over multiple experiments via the overlap parameter can account for noise in the data as long as the pathway is the primarily perturbed one. However, if pathways are strongly coupled or converge on the same transcription factors, such as the MAPK and PI3K/AKT pathways (23), it is necessary to combine them into a single pathway description. The SPEED pathway labeled MAPK+PI3K is an example of combining coupled pathways.
Validation using literature-derived gene lists indicated that SPEED performs well in predicting upstream causal pathway-perturbation events with a sensitivity of 78%. However, it is possible that other cellular processes also regulate the same genes as SPEED signaling pathways and are the actual causal influences. Since the predicted signaling pathways are derived based on correlations with gene signatures, results should be considered as a starting point for further investigation into the actual events leading to the regulated set of genes.
The identification of signaling pathways that cause observed changes in gene expression is important, since aberrant signaling pathways have been implicated in diseases such as diabetes (24) and several cancers (25,26). We exemplify the utility of SPEED by identifying upstream signaling pathways in two AML studies. For high-risk AML patients with and without CEBPa mutations, SPEED identified TNF-α signaling as the top candidate that caused the observed differences between the two groups of patients. Given that high-risk AML patients with CEBPa mutations have better clinical outcomes, SPEED results may be taken as a starting point to narrow down the search for the differences in outcome by focusing on differences in TNF-α signaling.
Using micro-array data from knockdown of transcription factors in an AML cell line, we aimed at identifying links between transcription factors and upstream signaling pathways. We searched for associations between signature genes of signaling pathways and the targets of the transcription factors. For 24 factors, we could establish such associations, of which some are already known, whereas others, like MAPK+PI3K to ZNF238, are novel predictions. Thus, SPEED facilitates completing the picture of cell signaling events leading to differential expression of genes via a layer of transcription factors.
Supplementary Data are available at NAR Online.
Germany’s Federal Ministry for Education and Research (BMBF) (grant Forsys Partner to N.B.); the European Commission (grant number CancerSys HEALTH-F4-2008-223188 to N.B.); National Science Foundation Integrative Graduate Education and Research Traineeship (IGERT) (grant number DGE-0654108 to J.P.). Funding for open access charge: National Science Foundation Integrative Graduate Education and Research Traineeship (IGERT) (grant number DGE-0654108).
Conflict of interest statement. None declared.
The authors thank Rositsa Koleva and Eric D. Smith for helpful discussions.