|Home | About | Journals | Submit | Contact Us | Français|
A comprehensive analysis of enriched functional categories in differentially expressed genes is important to extract the underlying biological processes of genome-wide expression profiles. Moreover, identification of the network of significant functional modules in these dynamic processes is an interesting challenge. This study introduces DynaMod, a web-based application that identifies significant functional modules reflecting the change of modularity and differential expressions that are correlated with gene expression profiles under different conditions. DynaMod allows the inspection of a wide variety of functional modules such as the biological pathways, transcriptional factor–target gene groups, microRNA–target gene groups, protein complexes and hub networks involved in protein interactome. The statistical significance of dynamic functional modularity is scored based on Z-statistics from the average of mutual information (MI) changes of involved gene pairs under different conditions. Significantly correlated gene pairs among the functional modules are used to generate a correlated network of functional categories. In addition to these main goals, this scoring strategy supports better performance to detect significant genes in microarray analyses, as the scores of correlated genes show the superior characteristics of the significance analysis compared with those of individual genes. DynaMod also offers cross-comparison between different analysis outputs. DynaMod is freely accessible at http://piech.kaist.ac.kr/dynamod.
The analysis of genome-wide gene expression profiles is widely used in the current biomedical research. An important goal of this analysis is to generate biological hypotheses by identifying statistically significant genes and their functions that reflect the different phenotype classes under test. Various useful gene selection tools are available for this purpose (1,2). Typical gene selection approaches focus on the characterization of individual genes distinguishing the biological conditions. These individually selected genes are subjected to annotate enriched functional categories. However, many phenotypes and underlying cellular processes are associated with groups of genes in various functional categories and their networks rather than with isolated genes. Under the typical gene selection procedure, the coordinated effect of correlated genes in the functional modules or networks cannot be well characterized. Furthermore, the statistical significance of individual genes cannot be presented properly when the sets of genes in the test have expression dependencies that may exist in the functionally correlated genes. In fact, statistical test showed that the average significances of functionally correlated genes are closer to a normal distribution than those of individual genes (3). The coordinated changes of functionally correlated gene sets appears to be an improved measure of extracting significant causes from differential gene expression profile in the sense of biological and statistical significance at the same time.
Several gene set enrichment analyses (GSEAs) have been introduced (3–8). These tools emphasize the discovery of functionally correlated genes more reliably and completely than individual gene-based approaches. The typical strategy of the gene set-based approach is to combine the initial significance evaluation of all genes to the coordinated significance of the set of genes obtained from predefined functional categories. Although GSEA approach can more appropriately determine the potent effects of functionally correlated genes than individual gene-based approach, the efficiency of this method can vary depending on the strategy of selecting and scoring the gene sets. Most scoring methods used in GSEA tools remain limited, in that they must use a relatively large size of gene set in conjunction with a parametric approach (3,5). Alternative non-parametric approaches are usually associated with an increase in the computational complexity and a decrease in the level of the generality in comparisons of results from different tests. Another restriction to use current GSEA approaches is on the coverage of functional categories they choose. More importantly, they do not handle the networked characteristic of functional modules. A networked feature of the genes in a module can be characterized by the dynamic changes of correlated activities or biologically dysregulated relationships of the genes under different conditions (9,10). This can be applied to various functional modules to identify dynamic functional modularity (DFM), which could not be described in typical GSEA approaches. Another networked characteristic can be obtained by hub protein networks or protein complexes that connect the functional module sets of the analysis.
DynaMod is a web-based application that identifies significant functional modules reflecting the dynamic changes of correlated activities and differential expressions in gene expression profiles under different conditions. A novel scoring strategy uses the average mutual information (MI) differences of gene pairs in predefined functional modules depending on phenotypes. The gene sets are constructed from a wide variety of functional categories representing cellular pathways, transcription factor regulations, miRNA regulations, hub protein networks and protein complexes. Overlap genes across the functional modules are used to find interconnectivity across functional modules. It also provides a cross-comparison between different analysis outputs that helps users to interpret the results from more than two phenotypes or different functional categories together.
There are five types of functional modules in this system: biological pathways from KEGG (11), transcriptional factor–target gene groups from TRANSFAC (12), microRNA–target gene groups from MsigDB (4), protein complexes and hub networks from protein interactome. Integrated protein complexes from COFECO (13) and combined protein interaction database (14) are used here. Protein complexes are increased by adding human protein complexes from Ewing et al. (15). Hub networks with at least five interacting partners are retrieved from an integrated interactome including BIND (16), BioGrid (17), DIP (18), HPRD (19), IntAct (20), MINT (21) and MIPS MPPI (22). Gene pairs are constructed from the pairwise relations presented in these five types of functional modules and are utilized to compute MI. DynaMod imports flat files related to functional modules and their gene pairs. In addition, three disease databases including OMIM (23), Genetic Association Database (24) and Cancer Gene Census (25) are used to annotate disease genes in functional modules. All gene entries for the functional modules of DynaMod are marked with an Entrez Gene ID. Gene identifiers from various databases including Entrez Gene, UniGene, RefSeq, EMBL, ENSEMBL, SGD, RGD, MGI, HGNC and the microarray probe identifiers of Affymetrix and Agilent are allowed. DynaMod also accepts identifiers of UniProtKB, iProClass and IPI. The same Entrez Gene is frequently represented by several probes in a microarray data set. DynaMod determines a probe representing a gene with the best scored gene probe when there are several probes corresponding to a gene in a microarray data set.
Our proposed DFM analysis works in two steps. We first use an aforementioned compendium of gene pairs to compute MI. MI is a quantity that measures the mutual dependence of two variables in information theory, which is zero if and only if the two variables are independent. The MI of gene pairs shows a correlation or dependence between two genes in the phenotype of interest. We then identify DFM of predefined modules according to the statistical significance of the altered correlation of gene pairs in the functional modules. To perform the latter step, MI difference (ΔMI) is measured by a method similar to that of Mani et al. (9), which proposed an oncogene prediction method using dysregulated interactions that show significant MI differences in the phenotype of interest. MI difference is given by the following equation:
Here, MItotal is the MI calculated from all given samples and MIcontrol is the MI calculated from control sample set that excludes the phenotype samples of interest. The MI differences of gene pairs represent the change of correlated activity or biologically dysregulated relationships among the genes and the dynamic modularity can be measured by the average MI differences of functional modules. It is assumed that given an expression profile, there are N predefined gene pairs in each module type. Subsequently, N MI differences are computed. A null distribution is generated by sampling a subset of gene pairs across 100 equally sized MI difference bins covering N MI difference range in overall gene pairs of an expression profile. For each bin of 100 gene pairs, MI differences for those gene pairs are computed by phenotype randomization of experimental samples. Thereafter, a null distribution composed of resulting 10 000 MI differences is constructed. A normality test was conducted by several standard routines on four microarray data sets. The null distribution of MI differences for individual gene pairs in glioblastoma data set [GSE4290 from Gene Expression Omnibus (GEO) in NCBI] was slightly out of normal distribution according to Kolmogorov–Smirnov normality test (D = 0.0094, P = 0.04493), although it was much closer to normal than the case using other previously used measure such as fold change (D = 0.0819, P = 2.2e-16). After the test of increasing number of gene pairs, the sufficient minimal size of the gene pairs was set to two, which shows confident normality (D = 0.0082, P = 0.2036 according to Kolmogorov–Smirnov normality test). This result is comparable with the normality characteristic of the previous method acquired from the mean of 10 samples with a fold change measure (Supplementary Data). DynaMod identifies significant functional modules in different conditions by using the Z-statistics of the average MI difference. This tool computes a Z-score for the average MI difference of a functional module using the aforementioned null distribution of MI differences and estimates the statistical significance of the Z-score against a standard normal distribution. If a sample size of MI differences is k (i.e. ≥2), the mean of the averages of all MI differences (μ) and the standard deviation of the averages of all of MI differences (σ) are computed from a null distribution. When the mean of the MI differences for a given functional module is and the number of gene pairs in that module is n, the standard error (SE) of and the Z-score is computed as follows:
The statistical significances (P-values) of the functional modules are adjusted for multiple-testing routines, such as with the Bonferroni method or the false discovery rate (FDR) of Benjamini and Hochberg (26).
Module expression activity (MEA) is a score estimating the differential expression of a module and upregulated group (Up_MEA) or downregulated group (Down_MEA) of a module. MEA is given by the following equations.
Here, gi is a gene score acquired by t-test or fold change, sgn is an indicator function and n is the number of member genes in the module. In Up_MEA, sgn results in 1 if gi is positive and 0 otherwise. In Down_MEA, sgn results in −1 if gi is negative and 0 otherwise.
Unions of the functional and neighbor modules are provided for the significant functional modules, whose overlap score exceeds a specific threshold (one overlap gene as the default number). Alternative score is defined as (n×n)/n1×n2, where n, n1, and n2 are the number of genes in the overlap and modules 1 and 2, respectively. Users can acquire the association among functional modules by genes in the overlap.
Although functional module-based analysis is the main goal, users may need occasionally the significance of individual genes. DynaMod evaluates the significance scores of individual genes by t-test or fold changes. DynaMod also provides the significances of ΔMI for individual gene pairs.
The core algorithm of DynaMod was implemented in R and the web interface was implemented in JAVA and Java Server Page on Linux. It runs on Apache Web Server combined with the Tomcat servlet engine. All annotation data for gene entries and functional modules within this system were stored in Oracle 10g RDBMS. As the computation works of DynaMod runs on a Linux-based cluster system, a large number of MI calculations of gene pairs can be parallelized using the Parallel Virtual Machine (PVM) via the rpvm and snow in R packages on a cluster of nine nodes, each with dual quad-core Intel Xeon 2.46 GHz CPUs and 24 GB of RAM. The functional modules, biological gene pairs, annotation resources, organisms and identifiers will be updated periodically.
The input of DynaMod is a genome-wide expression profile. Expression profiles have to contain gene identifiers, expression values and class labels (i.e. 1 or 2) of experimental samples.
The following DynaMod outputs are accessible at specified URL addresses that are sent to user by e-mail. These outputs are also downloadable from the web pages. An example outputs are shown in Figures 1 and and22.
DynaMod produces a summary rank table of significant functional modules and their neighbor modules including their significance scores and Up/Down_MEA scores. The summary table is linked to the further detail tables that contain the information of entry genes, gene pairs and their scores with P-values in the functional modules and networked modules. Biological functions of significant modules can be efficiently annotated by composite functional enrichment of COFECO (13).
Individual functional modules and their neighbor modules are summarized with a network graphical view of the scored genes and gene pairs. Up/down expression of genes and up/down expression correlations are depicted from green to red. The network graph is implemented using the GraphViz library.
A cross-comparison is used to identify changes/trends between the analysis results from different phenotypes. This functionality is useful for comparing various types of outputs from different input sets.
Figures 1 and and22 show the selected DynaMod outputs for an example analysis of glioblastoma data set (GSE4290) that have been submitted to the GEO database at NCBI (27). The functional modularity of PA700-20S-PA28 complex, one of the proteasome complexes, looks decreased mostly [most negative Z-score of MI difference in DFM activity (DFMA) column] and the member genes are separated into up- and downregulated groups during the change from normal to glioblastoma. For the second ranked MCM complex, it increases mostly (most positive Z-score) and all member genes are upregulated during the cancer formation. These results can give new implications of the role of those complexes in conjunction with previous studies indicating that PA700-20S-PA28 complex is involved in tumorigenesis and immune surveillance (28) and MCM complex is highly expressed in malignant human cancer cells (29). The eighth ranked complex, ‘SNARE complex’, is sublocalized in neurons of brain and is responsible for membrane fusion in the secretory pathway (30). All member genes are downregulated with significant decrease of their modularity in glioblastoma. Detail information of SNARE complex is shown in Figure 2. With the inspection of gene relationships (Figure 2B and C) and neighbor modules that have overlapped genes (Figure 2D), further refined association study among the specific pairs of genes in different functional categories can be designed in detail.
DynaMod provides a new method to identify both significant changes of functional modularity by using the average MI differences of gene pairs and differential expressions in predefined functional modules depending on expression profiles under different phenotypes, which could not be fully supported by typical GSEA approaches. Interestingly, this study showed that the average MI differences acquired by phenotype randomization of an expression profile demonstrate normality in small sampling size. Hence, this proposed method supports that parametric tests such as Z-statistics properly analyze functional modules composed of at least three genes (including at least two biological gene pairs). On the basis of this background, a wide variety of functional modules can be collected and analyzed, including protein complexes, hub-partner groups, miRNA–target groups and transcriptional factor-target groups.
In summary, this tool evaluates the significant modular correlations of various functional categories with gene expression profiles of different phenotypes using the average MI differences of gene pairs in the modules. Significantly correlated networks across functional modules can be generated through utilization of overlapping genes among different modules. As this tool ascertains the overall effect of those modules, it provides module-wise interpretation of dynamic cellular behaviors. User can conveniently interpret the dynamic modular activities of various functional categories and their networks depending on phenotype changes.
Supplementary Data are available at NAR Online.
Grant from the Korea Institute of Science and Technology Information (KISTI). Funding for open access charge: the Korea Institute of Science and Technology Information (KISTI).
Conflict of interest statement. None declared.
The authors thank anonymous reviews for constructive criticisms and fruitful discussions.