Results 1-25 (27)
author:("Ruan, jinhua")
1.  Altered expression of Arabidopsis genes in response to a multifunctional geminivirus pathogenicity protein 
BMC Plant Biology  2014;14(1):302.
Geminivirus AC2 is a multifunctional protein that acts as a pathogenicity factor. Transcriptional regulation by AC2 appears to be mediated through interaction with a plant specific DNA binding protein, PEAPOD2 (PPD2), that specifically binds to sequences known to mediate activation of the CP promoter of Cabbage leaf curl virus (CaLCuV) and Tomato golden mosaic virus (TGMV). Suppression of both basal and innate immune responses by AC2 in plants is mediated through inactivation of SnRK1.2, an Arabidopsis SNF1 related protein kinase, and adenosine kinase (ADK). An indirect promoter targeting strategy, via AC2-host dsDNA binding protein interactions, and inactivation of SnRK1.2-mediated defense responses could provide the opportunity for geminiviruses to alter host gene expression and in turn, reprogram the host to support virus infection. The goal of this study was to identify changes in the transcriptome of Arabidopsis induced by the transcription activation function of AC2 and the inactivation of SnRK1.2.
Using full-length and truncated AC2 proteins, microarray analyses identified 834 genes differentially expressed in response to the transcriptional regulatory function of the AC2 protein at one and two days post treatment. We also identified 499 genes differentially expressed in response to inactivation of SnRK1.2 by the AC2 protein at one and two days post treatment. Network analysis of these two sets of differentially regulated genes identified several networks consisting of between four and eight highly connected genes. Quantitative real-time PCR analysis validated the microarray expression results for 10 out of 11 genes tested.
It is becoming increasingly apparent that geminiviruses manipulate the host in several ways to facilitate an environment conducive to infection, predominantly through the use of multifunctional proteins. Our approach of identifying networks of highly connected genes that are potentially co-regulated by geminiviruses during infection will allow us to identify novel pathways of co-regulated genes that are stimulated in response to pathogen infection in general, and virus infection in particular.
Electronic supplementary material
The online version of this article (doi:10.1186/s12870-014-0302-7) contains supplementary material, which is available to authorized users.
PMCID: PMC4253603  PMID: 25403083
Geminiviruses; Microarray; Pathogenesis; Expression; Regulatory networks
2.  Network-based Pathway Enrichment Analysis 
Finding associations between an input gene set, such as genes associated with a certain phenotype, and annotated gene sets, such as known pathways, is a very important problem in modern molecular biology. The existing approaches mainly focus on the overlap between the two, and may miss important but subtle relationships between genes. In this paper, we propose a method, NetPEA, that combines known pathways with high-throughput networks. Our method not only considers the shared genes, but also takes the gene interactions into account. It utilizes a protein-protein interaction network and a random walk procedure to identify hidden relationships between gene sets, and uses a randomization strategy to evaluate the significance of the similarity scores that pathways achieve. Compared with the over-representation based method, our method can identify more relationships. Compared with a state-of-the-art network-based method, EnrichNet, our method not only provides a ranked list of pathways, but also provides statistical significance information. Importantly, through independent tests, we show that our method likely has a higher sensitivity in revealing the true causal pathways, while at the same time achieving a higher specificity. Literature review of selected results indicates that some of the novel pathways reported by our method are biologically relevant and important.
PMCID: PMC4197800  PMID: 25327472
pathway; protein-protein interaction network; enrichment analysis; gene sets; random walk
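The abstract above combines a random walk on a protein-protein interaction network with a randomization test. A minimal sketch of that general idea follows, assuming a random walk with restart, a mean-probability pathway score, and random gene sets of equal size as the null model. The function names, the restart probability, and the scoring rule are illustrative assumptions and are not taken from NetPEA itself.

```python
import random


def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walk that restarts at the seed genes.

    adjacency: dict mapping each gene to the set of its interaction partners.
    seeds: genes of the input gene set; genes absent from the network are ignored.
    """
    nodes = sorted(adjacency)
    seeds = [s for s in seeds if s in adjacency]
    if not seeds:
        raise ValueError("no seed gene is present in the network")
    p0 = {n: 0.0 for n in nodes}
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            nbrs = adjacency[n]
            if not nbrs:
                # Isolated gene: keep its probability mass in place.
                nxt[n] += (1 - restart) * p[n]
                continue
            share = (1 - restart) * p[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


def pathway_score(p, pathway):
    """Mean visiting probability of the pathway genes (assumed scoring rule)."""
    return sum(p.get(g, 0.0) for g in pathway) / len(pathway)


def empirical_p_value(p, pathway, n_perm=1000, seed=0):
    """Fraction of random gene sets of equal size scoring at least as high."""
    rng = random.Random(seed)
    nodes = sorted(p)
    members = [g for g in pathway if g in p]
    if not members:
        raise ValueError("no pathway gene is present in the network")
    observed = pathway_score(p, members)
    hits = sum(
        pathway_score(p, rng.sample(nodes, len(members))) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```

Scoring a pathway by its mean visiting probability lets pathways that share no gene with the input set still rank highly when they sit close to it in the network, which is the kind of relationship an overlap test cannot see.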
3.  An Ensemble Approach for Drug Side Effect Prediction 
In silico prediction of drug side-effects at an early stage of drug development is becoming more popular nowadays, as it reduces both the time for drug design and the drug development costs. In this article we propose an ensemble approach to predict side-effects of drug molecules based on their chemical structure. Our idea originates from the observation that similar drugs have similar side-effects. Based on this observation we design an ensemble approach that combines the results from different classification models, where each model is generated from a different set of similar drugs. We applied our approach to 1385 side-effects in the SIDER database for 888 drugs. Results show that our approach outperformed previously published approaches and standard classifiers. Furthermore, we applied our method to a number of uncharacterized drug molecules in the DrugBank database and predicted their side-effect profiles for future usage. Results from various sources confirm that our method is able to predict the side-effects of uncharacterized drugs and, more importantly, is able to predict rare side-effects, which are often ignored by other approaches. The method described in this article can be useful for predicting side-effects at an early stage of drug design to reduce experimental cost and time.
PMCID: PMC4197807  PMID: 25327524
adverse side-effect; drug development; chemical substructure; uncharacterized drug
4.  A Fully Automated Method for Discovering Community Structures in High Dimensional Data 
Identifying modules, or natural communities, in large complex networks is fundamental in many fields, including social sciences, biological sciences and engineering. Recently several methods have been developed to automatically identify communities from complex networks by optimizing the modularity function. The advantage of this type of approach is that the algorithm does not require any parameter to be tuned. However, the modularity-based methods for community discovery assume that the network structure is given explicitly and is correct. In addition, these methods work best if the network is unweighted and/or sparse. In reality, networks are often not directly defined, or may be given as an affinity matrix. In the first case, each node of the network is defined as a point in a high dimensional space and different networks can be obtained with different network construction methods, resulting in different community structures. In the second case, an affinity matrix may define a dense weighted graph, for which modularity-based methods do not perform well. In this work, we propose a very simple algorithm to automatically identify community structures from these two types of data. Our approach utilizes a k-nearest-neighbor network construction method to capture the topology embedded in high dimensional data, and applies a modularity-based algorithm to identify the optimal community structure. A key to our approach is that the network construction is integrated with the community identification process and is totally parameter-free. Furthermore, our method can suggest appropriate preprocessing/normalization of the data to improve the results of community identification. We tested our method on several synthetic and real data sets, and evaluated its performance by internal or external accuracy indices. Compared with several existing approaches, our method is not only fully automatic, but also has the best accuracy overall.
PMCID: PMC4185921  PMID: 25296858
community structure; modularity; image clustering
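The two building blocks named in the abstract above, k-nearest-neighbor network construction and the modularity function, can be sketched as follows. This is an illustration only: the choice of Euclidean distance, the fixed `k` argument, and the function names are assumptions, and the paper's parameter-free selection of the network and its modularity optimizer are not reproduced here.

```python
import math


def knn_graph(points, k):
    """Undirected k-nearest-neighbor graph over points (Euclidean distance).

    Returns a set of edges (i, j) with i < j; an edge exists when either
    endpoint lists the other among its k nearest neighbors.
    """
    n = len(points)
    edges = set()
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def modularity(edges, labels):
    """Newman modularity Q of a partition of an unweighted, undirected graph.

    labels: sequence or dict mapping each node to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    degree = {}
    internal = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if labels[u] == labels[v]:
            internal[labels[u]] = internal.get(labels[u], 0) + 1
    degree_sum = {}
    for node, d in degree.items():
        degree_sum[labels[node]] = degree_sum.get(labels[node], 0) + d
    # Q = sum over communities of (fraction of internal edges)
    #     minus (fraction of edge ends in the community) squared.
    return sum(
        internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )
```

One way to connect the two pieces, again as an assumption rather than the published procedure, is to build graphs for several values of `k` and keep the graph and partition that give the highest modularity.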
5.  Promoter hypomethylation of EpCAM-regulated bone morphogenetic protein gene family in recurrent endometrial cancer 
Epigenetic regulation by promoter methylation plays a key role in tumorigenesis. Our goal was to investigate whether altered DNA methylation signatures associated with oncogenic signaling delineate biomarkers predictive of endometrial cancer recurrence.
Experimental Design
Methyl-CpG-capture sequencing was used for global screening of aberrant DNA methylation in our endometrial cancer cohort, followed by validation in an independent The Cancer Genome Atlas (TCGA) cohort. Bioinformatics as well as functional analyses in vitro, using RNA interference (RNAi) knockdown, were performed to examine regulatory mechanisms of candidate gene expression and contribution to aggressive phenotype, such as epithelial–mesenchymal transition (EMT).
We identified 2,302 hypermethylated loci in endometrial tumors compared with control samples. Bone morphogenetic protein (BMP) family genes, including BMP1, 2, 3, 4, and 7, were among the frequently hypermethylated loci. Interestingly, BMP2, 3, 4, and 7 were less methylated in primary tumors with subsequent recurrence and in patients with shorter disease-free interval compared with nonrecurrent tumors, which was validated and associated with poor survival in the TCGA cohort (BMP4, P = 0.009; BMP7, P = 0.007). Stimulation of endometrial cancer cells with epidermal growth factor (EGF) induced EMT and transcriptional activation of these genes, which was mediated by the epithelial cell adhesion molecule (EpCAM). EGF signaling was implicated in maintaining the promoters of candidate BMP genes in an active chromatin configuration and thus subject to transcriptional activation.
Hypomethylation signatures of candidate BMP genes associated with EpCAM-mediated expression present putative biomarkers predictive of poor survival in endometrial cancer.
PMCID: PMC4080631  PMID: 24077349
DNA methylation; EGF; BMP; epithelial-mesenchymal transition; endometrial cancer recurrence
6.  Regulation of adipose oestrogen output by mechanical stress 
Nature communications  2013;4:1821.
Adipose stromal cells are the primary source of local oestrogens in adipose tissue, aberrant production of which promotes oestrogen receptor-positive breast cancer. Here we show that extracellular matrix compliance and cell contractility are two opposing determinants for oestrogen output of adipose stromal cells. Using synthetic extracellular matrix and elastomeric micropost arrays with tunable rigidity, we find that increasing matrix compliance induces transcription of aromatase, a rate-limiting enzyme in oestrogen biosynthesis. This mechanical cue is transduced sequentially by discoidin domain receptor 1, c-Jun N-terminal kinase 1, and phosphorylated JunB, which binds to and activates two breast cancer-associated aromatase promoters. In contrast, elevated cell contractility due to actin stress fibre formation dampens aromatase transcription. Mechanically stimulated stromal oestrogen production enhances oestrogen-dependent transcription in oestrogen receptor-positive tumour cells and promotes their growth. This novel mechanotransduction pathway underlies communications between extracellular matrix, stromal hormone output, and cancer cell growth within the same microenvironment.
PMCID: PMC3921626  PMID: 23652009
7.  A novel link prediction algorithm for reconstructing protein–protein interaction networks by topological similarity 
Bioinformatics  2012;29(3):355-364.
Motivation: Recent advances in technology have dramatically increased the availability of protein–protein interaction (PPI) data and stimulated the development of many methods for improving the systems-level understanding of the cell. However, those efforts have been significantly hindered by the high level of noise, sparseness and highly skewed degree distribution of PPI networks. Here, we present a novel algorithm to reduce the noise present in PPI networks. The key idea of our algorithm is that two proteins sharing some higher-order topological similarities, measured by a novel random walk-based procedure, are likely interacting with each other and may belong to the same protein complex.
Results: Applying our algorithm to a yeast PPI network, we found that the edges in the reconstructed network have higher biological relevance than in the original network, assessed by multiple types of information, including gene ontology, gene expression, essentiality, conservation between species and known protein complexes. Comparison with existing methods shows that the network reconstructed by our method has the highest quality. Using two independent graph clustering algorithms, we found that the reconstructed network has resulted in significantly improved prediction accuracy of protein complexes. Furthermore, our method is applicable to PPI networks obtained with different experimental systems, such as affinity purification, yeast two-hybrid (Y2H) and protein-fragment complementation assay (PCA), and evidence shows that the predicted edges are likely bona fide physical interactions. Finally, an application to a human PPI network increased the coverage of the network by at least 100%.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3562060  PMID: 23235927
8.  Fully automated protein complex prediction based on topological similarity and community structure 
Proteome Science  2013;11(Suppl 1):S9.
To understand the function of protein complexes and their association with biological processes, many studies have analyzed protein-protein interaction (PPI) networks. However, advances in high-throughput technology have produced a very large amount of data for analysis. Moreover, the high level of noise, sparseness, and skewness in the degree distribution of PPI networks limit the performance of many clustering algorithms and further analysis of their interactions.
In addressing and solving these problems we present a novel random walk based algorithm that converts the incomplete and binary PPI network into a protein-protein topological similarity matrix (PP-TS matrix). We believe that if two proteins share some high-order topological similarities they are likely to be interacting with each other. Using the obtained PP-TS matrix, we constructed and used weighted networks to further study and analyze the interaction among proteins. Specifically, we applied a fully automated community structure finding algorithm (Auto-HQcut) on the obtained weighted network to cluster protein complexes. We then analyzed the protein complexes for significance in biological processes. To help visualize and analyze these protein complexes we also developed an interface that displays the resulting complexes as well as the characteristics associated with each complex.
Applying our approach to a yeast protein-protein interaction network, we found that the predicted protein-protein interaction pairs with high topological similarities have more significant biological relevance than the original protein-protein interaction pairs. When we compared our PPI network reconstruction algorithm with other existing algorithms using gene ontology and gene co-expression, our algorithm produced the highest similarity scores. Also, our predicted protein complexes showed higher accuracy than the other protein complex predictions.
PMCID: PMC3908383  PMID: 24564887
PPI network; random walk; protein-protein interaction; protein complex; clustering
9.  Comprehensive methylome analysis of ovarian tumors reveals hedgehog signaling pathway regulators as prognostic DNA methylation biomarkers 
Epigenetics  2013;8(6):624-634.
Women with advanced stage ovarian cancer (OC) have a five-year survival rate of less than 25%. OC progression is associated with accumulation of epigenetic alterations and aberrant DNA methylation in gene promoters acts as an inactivating “hit” during OC initiation and progression. Abnormal DNA methylation in OC has been used to predict disease outcome and therapy response. To globally examine DNA methylation in OC, we used next-generation sequencing technology, MethylCap-sequencing, to screen 75 malignant and 26 normal or benign ovarian tissues. Differential DNA methylation regions (DMRs) were identified, and the Kaplan–Meier method and Cox proportional hazard model were used to correlate methylation with clinical endpoints. Functional role of specific genes identified by MethylCap-sequencing was examined in in vitro assays. We identified 577 DMRs that distinguished (p < 0.001) malignant from non-malignant ovarian tissues; of these, 63 DMRs correlated (p < 0.001) with poor progression free survival (PFS). Concordant hypermethylation and corresponding gene silencing of sonic hedgehog pathway members ZIC1 and ZIC4 in OC tumors was confirmed in a panel of OC cell lines, and ZIC1 and ZIC4 repression correlated with increased proliferation, migration and invasion. ZIC1 promoter hypermethylation correlated (p < 0.01) with poor PFS. In summary, we identified functional DNA methylation biomarkers significantly associated with clinical outcome in OC and suggest our comprehensive methylome analysis has significant translational potential for guiding the design of future clinical investigations targeting the OC epigenome. Methylation of ZIC1, a putative tumor suppressor, may be a novel determinant of OC outcome.
PMCID: PMC3857342  PMID: 23774800
DNA methylation; Hedgehog pathway; ZIC1; ZIC4; ovarian cancer
10.  A genome-wide cis-regulatory element discovery method based on promoter sequences and gene co-expression networks 
BMC Genomics  2013;14(Suppl 1):S4.
Deciphering cis-regulatory networks has become an attractive yet challenging task. This paper presents a simple method for cis-regulatory network discovery which aims to avoid some of the common problems of previous approaches.
Using promoter sequences and gene expression profiles as input, rather than clustering the genes by the expression data, our method utilizes co-expression neighborhood information for each individual gene, thereby overcoming the disadvantages of current clustering-based models, which may miss specific information for individual genes. In addition, rather than using a motif database as an input, it implements a simple motif count table for each enumerated k-mer for each gene promoter sequence. Thus, it can be used for species where no prior knowledge of cis-regulatory motifs is available, and it has the potential to discover new transcription factor binding sites. Applications to Saccharomyces cerevisiae and Arabidopsis have shown that our method has good prediction accuracy and outperforms a phylogenetic footprinting approach. Furthermore, the top ranked gene-motif regulatory clusters are evidently functionally co-regulated, and the regulatory relationships between the motifs and the enriched biological functions can often be confirmed by the literature.
Since this method is simple and gene-specific, it can be readily utilized for insufficiently studied species or flexibly used as an additional step or data source for previous transcription regulatory networks discovery models.
PMCID: PMC3549801  PMID: 23368633
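The motif count table described in the entry above can be illustrated with a short sketch that enumerates the k-mers of each promoter and then summarizes them over a gene's co-expression neighborhood. The function names, the handling of ambiguous bases, and the neighborhood summary (the number of promoters containing each k-mer) are assumptions made for illustration; the paper's actual statistics are not described in the abstract.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def kmer_count_table(promoters, k):
    """Count every k-mer in each promoter sequence.

    promoters: dict mapping gene name to promoter sequence.
    Returns a dict mapping gene name to {k-mer: count}; k-mers that
    contain ambiguous bases such as N are skipped.
    """
    table = {}
    for gene, seq in promoters.items():
        seq = seq.upper()
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        table[gene] = {
            kmer: c for kmer, c in counts.items() if set(kmer) <= VALID_BASES
        }
    return table


def neighborhood_support(table, neighbors, gene):
    """Number of promoters, among the gene and its co-expression
    neighbors, that contain each k-mer at least once."""
    support = Counter()
    for g in [gene, *neighbors.get(gene, [])]:
        support.update(list(table.get(g, {})))
    return support
```

Because the table is built by plain enumeration, no motif database is needed, which matches the abstract's point about species without prior motif knowledge.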
11.  A Steiner tree-based method for biomarker discovery and classification in breast cancer metastasis 
BMC Genomics  2012;13(Suppl 6):S8.
Metastatic breast cancer is a leading cause of cancer-related deaths in women worldwide. DNA microarrays have become an important tool to help identify biomarker genes for improving the prognosis of breast cancer. Recently, it was shown that pathway-level relationships between genes can be incorporated to build more robust classification models and to obtain more useful biological insight from such models. Due to the unavailability of complete pathways, protein-protein interaction (PPI) networks are becoming more popular among researchers and open a new way to investigate the developmental process of breast cancer.
In this study, a network-based method is proposed to combine microarray gene expression profiles and PPI network for biomarker discovery for breast cancer metastasis. The key idea in our approach is to identify a small number of genes to connect differentially expressed genes into a single component in a PPI network; these intermediate genes contain important information about the pathways involved in metastasis and have a high probability of being biomarkers.
We applied this approach to two breast cancer microarray datasets, and in both cases we identified significant numbers of well-known biomarker genes for breast cancer metastasis. The selected genes are significantly enriched in biological processes and pathways related to the carcinogenic process and, importantly, have much higher stability across different datasets than in previous studies. Furthermore, our selected genes significantly increased cross-data classification accuracy of breast cancer metastasis.
The randomized Steiner tree based approach described in this study is a new way to discover biomarker genes for breast cancer, and improves the prediction accuracy of metastasis. Though the analysis is limited here only to breast cancer, it can be easily applied to other diseases.
PMCID: PMC3481447  PMID: 23134806
12.  Systematic identification of functional modules and cis-regulatory elements in Arabidopsis thaliana 
BMC Bioinformatics  2011;12(Suppl 12):S2.
Several large-scale gene co-expression networks have been constructed successfully for predicting gene functional modules and cis-regulatory elements in Arabidopsis (Arabidopsis thaliana). However, these networks are usually constructed and analyzed in an ad hoc manner. In this study, we propose a completely parameter-free and systematic method for constructing gene co-expression networks and predicting functional modules as well as cis-regulatory elements.
Our novel method consists of an automated network construction algorithm, a parameter-free procedure to predict functional modules, and a strategy for finding known cis-regulatory elements that is suitable for consensus scanning without prior knowledge of the allowed extent of degeneracy of the motif. We apply the method to study a large collection of gene expression microarray data in Arabidopsis. We estimate that our co-expression network has an accuracy of ~94%, and has topological properties similar to other biological networks, such as being scale-free and having a high clustering coefficient. Remarkably, among the ~300 predicted modules whose sizes are at least 20, 88% have at least one significantly enriched function, including a few extremely significant ones (ribosome, p < 1E-300, photosynthetic membrane, p < 1.3E-137, proteasome complex, p < 5.9E-126). In addition, we are able to predict cis-regulatory elements for 66.7% of the modules, and the association between the enriched cis-regulatory elements and the enriched functional terms can often be confirmed by the literature. Overall, our results are much more significant than those reported by several previous studies on similar data sets. Finally, we utilize the co-expression network to dissect the promoters of 19 Arabidopsis genes involved in the metabolism and signaling of the important plant hormone gibberellin, and achieve promising results that reveal interesting insights into the biosynthesis and signaling of gibberellin.
The results show that our method is highly effective in finding functional modules from real microarray data. Our application on Arabidopsis leads to the discovery of the largest number of annotated Arabidopsis functional modules in the literature. Given the high statistical significance of functional enrichment and the agreement between cis-regulatory and functional annotations, we believe our Arabidopsis gene modules can be used to predict the functions of unknown genes in Arabidopsis, and to understand the regulatory mechanisms of many genes.
PMCID: PMC3247083  PMID: 22168340
13.  Cell Density-Dependent Transcriptional Activation of Endocrine-Related Genes in Human Adipose Tissue-Derived Stem Cells 
Experimental cell research  2010;316(13):2087-2098.
Adipose tissue is recognized as an endocrine organ that plays an important role in human diseases such as type II diabetes and cancer. Human adipose tissue-derived stem cells (ASCs), a distinct cell population in adipose tissue, are capable of differentiating into multiple lineages including adipogenesis. When cultured in vitro under a confluent condition, ASCs reach a commitment stage for adipogenesis, which can be further induced into terminally differentiated adipocytes by a cocktail of adipogenic factors. Here we report that the confluent state of ASCs triggers transcriptional activation cascades for genes that are responsible for the endocrine function of adipose tissue. These include insulin-like growth factor 1 (IGF-1) and aromatase (Cyp19), a key enzyme in estrogen biosynthesis. Despite similar adipogenic potentials, ASCs from different individuals display huge variations in activation of these endocrine-related genes. Bioinformatics and experimental data suggest that transcription factor Foxo1 controls a large number of “early” confluency-response genes, which subsequently induce “late” response genes. Furthermore, siRNA-mediated knockdown of Foxo1 substantially compromises the ability of committed ASCs to stimulate tumor cell migration in vitro. Thus, our work suggests that cell density is an important determinant of the endocrine potential of ASCs.
PMCID: PMC2900480  PMID: 20420826
adipose stem cells; cell confluency; aromatase; gene expression profiling; SBSN; Foxo1; adipogenesis commitment
14.  A particle swarm optimization-based algorithm for finding gapped motifs 
BioData Mining  2010;3:9.
Identifying approximately repeated patterns, or motifs, in DNA sequences from a set of co-regulated genes is an important step towards deciphering the complex gene regulatory networks and understanding gene functions.
In this work, we develop a novel motif finding algorithm (PSO+) using a population-based stochastic optimization technique called Particle Swarm Optimization (PSO), which has been shown to be effective in optimizing difficult multidimensional problems in continuous domains. We propose a modification of the standard PSO algorithm to handle discrete values, such as characters in DNA sequences. The algorithm provides several features. First, we use both consensus and position-specific weight matrix representations in our algorithm, taking advantage of the efficiency of the former and the accuracy of the latter. Furthermore, many real motifs contain gaps, but the existing methods usually ignore them or assume that the user knows their exact locations and lengths, which is usually impractical for real applications. In comparison, our method models gaps explicitly, and provides an easy solution to find gapped motifs without any detailed knowledge of gaps. Our method allows the presence of input sequences containing zero or multiple binding sites.
Experimental results on synthetic challenge problems as well as real biological sequences show that our method is both more efficient and more accurate than several existing algorithms, especially when gaps are present in the motifs.
PMCID: PMC3022572  PMID: 21144057
15.  A novel swarm intelligence algorithm for finding DNA motifs 
Discovering DNA motifs from co-expressed or co-regulated genes is an important step towards deciphering complex gene regulatory networks and understanding gene functions. Despite significant improvement in the last decade, it still remains one of the most challenging problems in computational molecular biology. In this work, we propose a novel motif finding algorithm that finds consensus patterns using a population-based stochastic optimisation technique called Particle Swarm Optimisation (PSO), which has been shown to be effective in optimising difficult multidimensional problems in continuous domains. We propose to use a word dissimilarity graph to remap the neighborhood structure of the solution space of DNA motifs, and propose a modification of the naive PSO algorithm to accommodate discrete variables. In order to improve efficiency, we also propose several strategies for escaping from local optima and for automatically determining the termination criteria. Experimental results on simulated challenge problems show that our method is both more efficient and more accurate than several existing algorithms. Applications to several sets of real promoter sequences also show that our approach is able to detect known transcription factor binding sites, and outperforms two of the most popular existing algorithms.
PMCID: PMC2975043  PMID: 20090174
DNA motif; optimisation; swarm intelligence; PSO; particle swarm optimisation
16.  Building and analyzing protein interactome networks by cross-species comparisons 
BMC Systems Biology  2010;4:36.
A genomic catalogue of protein-protein interactions is a rich source of information, particularly for exploring the relationships between proteins. Numerous systems-wide and small-scale experiments have been conducted to identify interactions; however, our knowledge of all interactions for any one species is incomplete, and alternative means to expand these network maps are needed. We therefore took a comparative biology approach to predict protein-protein interactions across five species (human, mouse, fly, worm, and yeast) and developed InterologFinder for research biologists to easily navigate this data. We also developed a confidence score for interactions based on available experimental evidence and conservation across species.
The connectivity of the resultant networks was determined to have scale-free distribution, small-world properties, and increased local modularity, indicating that the added interactions do not disrupt our current understanding of protein network structures. We show examples of how these improved interactomes can be used to analyze a genome-scale dataset (RNAi screen) and to assign new function to proteins. Predicted interactions within this dataset were tested by co-immunoprecipitation, resulting in a high rate of validation, suggesting the high quality of networks produced.
Protein-protein interactions were predicted in five species, based on orthology. An InteroScore, a score accounting for homology, number of orthologues with evidence of interactions, and number of unique observations of interactions, is given to each known and predicted interaction. Our website provides research biologists intuitive access to this data.
PMCID: PMC2859380  PMID: 20353594
17.  A Top-Performing Algorithm for the DREAM3 Gene Expression Prediction Challenge 
PLoS ONE  2010;5(2):e8944.
A wealth of computational methods has been developed to address problems in systems biology, such as modeling gene expression. However, to objectively evaluate and compare such methods is notoriously difficult. The DREAM (Dialogue on Reverse Engineering Assessments and Methods) project is a community-wide effort to assess the relative strengths and weaknesses of different computational methods for a set of core problems in systems biology. This article presents a top-performing algorithm for one of the challenge problems in the third annual DREAM (DREAM3), namely the gene expression prediction challenge. In this challenge, participants are asked to predict the expression levels of a small set of genes in a yeast deletion strain, given the expression levels of all other genes in the same strain and complete gene expression data for several other yeast strains. I propose a simple k-nearest-neighbor (KNN) method to solve this problem. Despite its simplicity, this method works well for this challenge, sharing the “top performer” honor with a much more sophisticated method. I also describe several alternative, simple strategies, including a modified KNN algorithm that further improves the performance of the standard KNN method. The success of these methods suggests that complex methods attempting to integrate multiple data sets do not necessarily lead to better performance than simple yet robust methods. Furthermore, none of these top-performing methods, including the one by a different team, are based on gene regulatory networks, which seems to suggest that accurately modeling gene expression using gene regulatory networks is unfortunately still a difficult task.
PMCID: PMC2816205  PMID: 20140212
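The KNN idea in the abstract above can be illustrated with a minimal sketch. The abstract does not specify how neighbors are defined or weighted, so this sketch assumes, purely for illustration, that neighbors are genes whose profiles across the reference strains correlate best with the target gene, and that the prediction is the unweighted mean of those neighbors' values in the deletion strain. The names `pearson` and `knn_predict` and the data layout are hypothetical and are not taken from the article.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def knn_predict(reference, target_known, gene, k=3):
    """Predict the expression of `gene` in the target strain.

    reference:    gene -> expression values across the reference strains
                  (assumed complete for every gene, including `gene`)
    target_known: gene -> measured expression in the target strain
    Assumption: neighbors are the k genes most correlated with `gene`
    in the reference strains; the prediction is their plain average.
    """
    scored = sorted(
        ((pearson(reference[gene], reference[g]), g)
         for g in target_known if g != gene),
        reverse=True,
    )
    neighbors = [g for _, g in scored[:k]]
    return sum(target_known[g] for g in neighbors) / len(neighbors)


# Toy data: gene A is unmeasured in the target strain.
reference = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [1.0, 2.0, 3.0, 4.1],
    "C": [4.0, 3.0, 2.0, 1.0],
    "D": [2.0, 4.0, 6.0, 8.0],
}
target_known = {"B": 10.0, "C": 0.0, "D": 12.0}
prediction = knn_predict(reference, target_known, "A", k=2)
```

With this toy data the two nearest neighbors of A are B and D, so the prediction is the mean of their target-strain values. A similarity-weighted average would be an equally plausible variant.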
18.  A general co-expression network-based approach to gene expression analysis: comparison and applications 
Co-expression network-based approaches have become popular in analyzing microarray data, such as for detecting functional gene modules. However, co-expression networks are often constructed by ad hoc methods, and network-based analyses have not been shown to outperform the conventional cluster analyses, partially due to the lack of an unbiased evaluation metric.
Here, we develop a general co-expression network-based approach for analyzing both genes and samples in microarray data. Our approach consists of a simple but robust rank-based network construction method, a parameter-free module discovery algorithm and a novel reference network-based metric for module evaluation. We report some interesting topological properties of rank-based co-expression networks that are very different from that of value-based networks in the literature. Using a large set of synthetic and real microarray data, we demonstrate the superior performance of our approach over several popular existing algorithms. Applications of our approach to yeast, Arabidopsis and human cancer microarray data reveal many interesting modules, including a fatal subtype of lymphoma and a gene module regulating yeast telomere integrity, which were missed by the existing methods.
We demonstrated that our novel approach is very effective in discovering the modular structures in microarray data, both for genes and for samples. As the method is essentially parameter-free, it may be applied to large data sets where the number of clusters is difficult to estimate. The method is also very general and can be applied to other types of data. A MATLAB implementation of our algorithm can be downloaded from
PMCID: PMC2829495  PMID: 20122284
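As a rough illustration of what a rank-based network construction might look like, the sketch below links each gene to its d most similar genes, instead of keeping every pair whose similarity exceeds a fixed value. This is an assumption about the general technique: the abstract above does not give the construction rule. The function name `rank_based_network`, the parameter `d`, and the nested-dictionary input are hypothetical, and the sketch is in Python rather than the MATLAB the authors mention.

```python
def rank_based_network(similarity, d=2):
    """Build an undirected network from a gene-by-gene similarity table.

    similarity: gene -> {other gene -> similarity score}
    Assumption: each gene is linked to its d highest-ranked partners;
    an edge is kept if either endpoint selects the other.
    Returns a set of (gene, gene) tuples with endpoints in sorted order.
    """
    edges = set()
    for gene, row in similarity.items():
        ranked = sorted(
            (other for other in row if other != gene),
            key=lambda other: row[other],
            reverse=True,
        )
        for other in ranked[:d]:
            edges.add(tuple(sorted((gene, other))))
    return edges


# Toy similarity table for four genes.
sim = {
    "a": {"b": 0.9, "c": 0.1, "d": 0.2},
    "b": {"a": 0.9, "c": 0.8, "d": 0.3},
    "c": {"a": 0.1, "b": 0.8, "d": 0.7},
    "d": {"a": 0.2, "b": 0.3, "c": 0.7},
}
network = rank_based_network(sim, d=1)
```

Because every gene contributes the same number of candidate edges, weakly expressed genes are not left isolated the way they can be under a single global similarity cutoff. This property is one common motivation for rank-based construction, though whether it matches the authors' rationale is not stated in the abstract.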
19.  Correction: A Network of Conserved Damage Survival Pathways Revealed by a Genomic RNAi Screen 
PLoS Genetics  2009;5(10):10.1371/annotation/526db6e9-0ba5-4ec6-a257-2befb76f34b7.
PMCID: PMC2771653
20.  A systems biology approach to the identification and analysis of transcriptional regulatory networks in osteocytes 
BMC Bioinformatics  2009;10(Suppl 9):S5.
The osteocyte is a type of cell that appears to be one of the key endocrine regulators of bone metabolism and a key responder to initiate bone formation and remodeling. Identifying the regulatory networks in osteocytes may lead to new therapies for osteoporosis and loss of bone.
Using microarrays, we identified 269 genes over-expressed in osteocytes, many of which have known functions in bone and muscle differentiation and contractility. We determined the evolutionarily conserved and enriched TF binding sites in the 5 kb promoter regions of these genes. Using these data, a transcriptional regulatory network was constructed and subsequently partitioned to identify cis-regulatory modules.
Our results show that many osteocyte-specific genes, including two well-known osteocyte markers DMP1 and Sost, have highly conserved clustering of muscle-related cis-regulatory modules, thus supporting the concept that a muscle-related gene network is important in osteocyte biology and may play a role in contractility and dynamic movements of the osteocyte.
PMCID: PMC2745692  PMID: 19761575
21.  An ensemble learning approach to reverse-engineering transcriptional regulatory networks from time-series gene expression data 
BMC Genomics  2009;10(Suppl 1):S8.
One of the most challenging tasks in the post-genomic era is to reconstruct the transcriptional regulatory networks. The goal is to reveal, for each gene that responds to a certain biological event, which transcription factors affect its expression, and how a set of transcription factors coordinate to accomplish temporal and spatial specific regulations.
Here we propose a supervised machine learning approach to address these questions. We focus our study on the gene transcriptional regulation of the cell cycle in the budding yeast, thanks to the large amount of data available and relatively well-understood biology, although the main ideas of our method can be applied to other data as well. Our method starts with building an ensemble of decision trees for each microarray data set to capture the association between the expression levels of yeast genes and the binding of transcription factors to gene promoter regions, as determined by chromatin immunoprecipitation microarray (ChIP-chip) experiments. Cross-validation experiments show that the method is more accurate and reliable than the naive decision tree algorithm and several other ensemble learning methods. From the decision tree ensembles, we extract logical rules that explain how a set of transcription factors act in concert to regulate the expression of their targets. We further compute a profile for each rule to show its regulation strengths at different time points. We also propose a spline interpolation method to integrate the rule profiles learned from several time series expression data sets that measure the same biological process. We then combine these rule profiles to build a transcriptional regulatory network for the yeast cell cycle. Compared to the results in the literature, our method correctly identifies all major known yeast cell cycle transcription factors, and assigns them to the appropriate cell cycle phases. Our method also identifies many interesting synergistic relationships among these transcription factors, most of which are well known, while many of the rest are also supported by other evidence.
The high accuracy of our method indicates that our method is valid and robust. As more gene expression and transcription factor binding data become available, we believe that our method is useful for reconstructing large-scale transcriptional regulatory networks in other species as well.
PMCID: PMC2709269  PMID: 19594885
22.  A Network of Conserved Damage Survival Pathways Revealed by a Genomic RNAi Screen 
PLoS Genetics  2009;5(6):e1000527.
Damage initiates a pleiotropic cellular response aimed at cellular survival when appropriate. To identify genes required for damage survival, we used a cell-based RNAi screen against the Drosophila genome and the alkylating agent methyl methanesulphonate (MMS). Similar studies performed in other model organisms report that damage response may involve pleiotropic cellular processes other than the central DNA repair components, yet an intuitive systems level view of the cellular components required for damage survival, their interrelationship, and contextual importance has been lacking. Further, by comparing data from different model organisms, identification of conserved and presumably core survival components should be forthcoming. We identified 307 genes, representing 13 signaling, metabolic, or enzymatic pathways, affecting cellular survival of MMS–induced damage. As expected, the majority of these pathways are involved in DNA repair; however, several pathways with more diverse biological functions were also identified, including the TOR pathway, transcription, translation, proteasome, glutathione synthesis, ATP synthesis, and Notch signaling, and these were equally important in damage survival. Comparison with genomic screen data from Saccharomyces cerevisiae revealed no overlap enrichment of individual genes between the species, but a conservation of the pathways. To demonstrate the functional conservation of pathways, five were tested in Drosophila and mouse cells, with each pathway responding to alkylation damage in both species. Using the protein interactome, a significant level of connectivity was observed between Drosophila MMS survival proteins, suggesting a higher order relationship. This connectivity was dramatically improved by incorporating the components of the 13 identified pathways within the network. 
Grouping proteins into “pathway nodes” qualitatively improved the interactome organization, revealing a highly organized “MMS survival network.” We conclude that identification of pathways can facilitate comparative biology analysis when direct gene/orthologue comparisons fail. A biologically intuitive, highly interconnected MMS survival network was revealed after we incorporated pathway data in our interactome analysis.
Author Summary
Cellular damage is known to elicit a pleiotropic response, but the relative importance of the constituent components in cell survival is poorly understood. To provide an unbiased identification of the proteins utilized in damage survival, we performed an RNAi survival screen in fly cells with methyl methanesulfonate (MMS). The genes identified are involved in 13 biologically diverse pathways. Comparison with analogous yeast data demonstrated a lack of conservation of the individual MMS survival genes but a conservation of pathways. We went on to demonstrate the MMS responsiveness for five representative pathways in both fly and mouse cells. We conclude that identification of pathways can facilitate comparative biology analysis when direct gene/orthologue comparisons fail. Incorporation of pathway data in interactome analysis also improved connectivity and, more importantly, revealed a biologically intuitive, highly inter-connected “MMS survival network.” This pathway conservation and inter-connectivity implies extensive interaction between pathways; for diseases such as cancer, such crosstalk may dictate disparate cellular responses not necessarily expected and confound treatments that are not tailored to the individual molecular context.
PMCID: PMC2688755  PMID: 19543366
23.  Variations in the transcriptome of Alzheimer's disease reveal molecular networks involved in cardiovascular diseases 
Genome Biology  2008;9(10):R148.
Analysis of microarray data reveals extensive links between Alzheimer’s disease and cardiovascular diseases.
Because of its polygenic nature, Alzheimer's disease is believed to be caused not by defects in single genes, but rather by variations in a large number of genes and their complex interactions. A systems biology approach, such as the generation of a network of co-expressed genes and the identification of functional modules and cis-regulatory elements, to extract insights and knowledge from microarray data will lead to a better understanding of complex diseases such as Alzheimer's disease. In this study, we perform a series of analyses using co-expression networks, cis-regulatory elements, and functions of co-expressed gene modules to analyze single-cell gene expression data from normal and Alzheimer's disease-affected subjects.
We identified six co-expressed gene modules, each of which represented a biological process perturbed in Alzheimer's disease. Alzheimer's disease-related genes, such as APOE, A2M, PON2 and MAP4, and cardiovascular disease-associated genes, including COMT, CBS and WNK1, all congregated in a single module. Some of the disease-related genes were hub genes while many of them were directly connected to one or more hub genes. Further investigation of this disease-associated module revealed cis-regulatory elements that match to the binding sites of transcription factors involved in Alzheimer's disease and cardiovascular disease.
Our results show the extensive links between Alzheimer's disease and cardiovascular disease at the co-expression and co-regulation levels, providing further evidence for the hypothesis that cardiovascular disease and Alzheimer's disease are linked. Our results support the notion that diseases in which the same set of biochemical pathways are affected may tend to co-occur with each other.
PMCID: PMC2760875  PMID: 18842138
24.  Plasticity of the Systemic Inflammatory Response to Acute Infection during Critical Illness: Development of the Riboleukogram 
PLoS ONE  2008;3(2):e1564.
Diagnosis of acute infection in the critically ill remains a challenge. We hypothesized that circulating leukocyte transcriptional profiles can be used to monitor the host response to and recovery from infection complicating critical illness.
Methodology/Principal Findings
A translational research approach was employed. Fifteen mice underwent intratracheal injections of live P. aeruginosa, P. aeruginosa endotoxin, live S. pneumoniae, or normal saline. At 24 hours after injury, GeneChip microarray analysis of circulating buffy coat RNA identified 219 genes that distinguished between the pulmonary insults and differences in 7-day mortality. Similarly, buffy coat microarray expression profiles were generated from 27 mechanically ventilated patients every two days for up to three weeks. Significant heterogeneity of VAP microarray profiles was observed secondary to patient ethnicity, age, and gender, yet 85 genes were identified with consistent changes in abundance during the seven days bracketing the diagnosis of VAP. Principal components analysis of these 85 genes appeared to differentiate between the responses of subjects who did versus those who did not develop VAP, as defined by a general trajectory (riboleukogram) for the onset and resolution of VAP. As patients recovered from critical illness complicated by acute infection, the riboleukograms converged, consistent with an immune attractor.
Here we present the culmination of a mouse pneumonia study, demonstrating for the first time that disease trajectories derived from microarray expression profiles can be used to quantitatively track the clinical course of acute disease and identify a state of immune recovery. These data suggest that the onset of an infection-specific transcriptional program may precede the clinical diagnosis of pneumonia in patients. Moreover, riboleukograms may help explain variance in the host response due to differences in ethnic background, gender, and pathogen. Prospective clinical trials are indicated to validate our results and test the clinical utility of riboleukograms.
PMCID: PMC2215774  PMID: 18270561
25.  Characterization and Identification of MicroRNA Core Promoters in Four Model Species 
PLoS Computational Biology  2007;3(3):e37.
MicroRNAs are short, noncoding RNAs that play important roles in post-transcriptional gene regulation. Although many functions of microRNAs in plants and animals have been revealed in recent years, the transcriptional mechanism of microRNA genes is not well-understood. To elucidate the transcriptional regulation of microRNA genes, we study and characterize, in a genome scale, the promoters of intergenic microRNA genes in Caenorhabditis elegans, Homo sapiens, Arabidopsis thaliana, and Oryza sativa. We show that most known microRNA genes in these four species have the same type of promoters as protein-coding genes have. To further characterize the promoters of microRNA genes, we developed a novel promoter prediction method, called common query voting (CoVote), which is more effective than available promoter prediction methods. Using this new method, we identify putative core promoters of most known microRNA genes in the four model species. Moreover, we characterize the promoters of microRNA genes in these four species. We discover many significant, characteristic sequence motifs in these core promoters, several of which match or resemble the known cis-acting elements for transcription initiation. Among these motifs, some are conserved across different species while some are specific to microRNA genes of individual species.
Author Summary
MicroRNAs are a class of short RNA sequences that have many regulatory functions in complex organisms such as plants and animals. However, our knowledge of the transcriptional mechanisms of microRNA genes is limited. Here, we analyze the upstream sequences of known microRNA genes in four model species, i.e., C. elegans, H. sapiens, A. thaliana, and O. sativa, and compare them with the promoter sequences of protein-coding genes and other classes of RNA genes. This analysis provides genome-wide evidence that microRNA genes have the same type of promoter sequences as protein-coding genes, and therefore are likely transcribed by RNA polymerase II (pol II). Second, we present a novel computational method for promoter prediction, which is then applied to locate the core promoters of known microRNA genes in the four model species. Furthermore, we present an analysis of short DNA motifs that appear frequently in the predicted promoters of microRNA genes, and report several interesting motifs that may have some functional meanings. These results are important for understanding the initiation and regulation of microRNA gene transcription.
PMCID: PMC1817659  PMID: 17352530
