author:("Xuan, jinhua")
2.  BADGE: A novel Bayesian model for accurate abundance quantification and differential analysis of RNA-Seq data 
BMC Bioinformatics  2014;15(Suppl 9):S6.
Recent advances in RNA sequencing (RNA-Seq) technology have offered unprecedented scope and resolution for transcriptome analysis. However, precise quantification of mRNA abundance and identification of differentially expressed genes are complicated due to biological and technical variations in RNA-Seq data.
We systematically study the variation in count data and dissect the sources of variation into between-sample variation and within-sample variation. A novel Bayesian framework is developed for joint estimation of gene-level mRNA abundance and differential state, which models the intrinsic variability in RNA-Seq to improve the estimation. Specifically, a Poisson-Lognormal model is incorporated into the Bayesian framework to model within-sample variation; a Gamma-Gamma model is then used to model between-sample variation, which accounts for over-dispersion of read counts among multiple samples. Simulation studies, where sequencing counts are synthesized based on parameters learned from real datasets, have demonstrated the advantage of the proposed method in both quantification of mRNA abundance and identification of differentially expressed genes. Moreover, performance comparison on data from the Sequencing Quality Control (SEQC) Project with ERCC spike-in controls has shown that the proposed method outperforms existing RNA-Seq methods in differential analysis. Application to a breast cancer dataset has further illustrated that the proposed Bayesian model can 'blindly' estimate sources of variation caused by sequencing biases.
We have developed a novel Bayesian hierarchical approach to investigate within-sample and between-sample variations in RNA-Seq data. Simulation and real data applications have validated desirable performance of the proposed method. The software package is available at
PMCID: PMC4168709  PMID: 25252852
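The over-dispersion that the BADGE abstract above refers to can be illustrated with a small simulation. The following is a minimal sketch, not the BADGE implementation: it only draws counts from a Poisson distribution whose rate is itself lognormal and compares the variance-to-mean ratio with that of a plain Poisson. The function names, the rate of 20 and the log-scale standard deviation of 0.5 are illustrative assumptions.

```python
import math
import random
import statistics


def sample_poisson(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method (fine for modest rates)."""
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_poisson_lognormal(n, log_mean, log_sd, rng):
    """Draw n counts whose Poisson rate is lognormal (hypothetical parameters)."""
    counts = []
    for _ in range(n):
        rate = rng.lognormvariate(log_mean, log_sd)
        counts.append(sample_poisson(rate, rng))
    return counts


rng = random.Random(0)
plain = [sample_poisson(20.0, rng) for _ in range(5000)]
mixed = simulate_poisson_lognormal(5000, math.log(20.0), 0.5, rng)

# A plain Poisson has variance close to its mean; the mixture is over-dispersed.
plain_ratio = statistics.variance(plain) / statistics.mean(plain)
mixed_ratio = statistics.variance(mixed) / statistics.mean(mixed)
print(f"plain Poisson variance/mean:     {plain_ratio:.2f}")
print(f"Poisson-lognormal variance/mean: {mixed_ratio:.2f}")
```

The Gamma-Gamma between-sample model is not sketched here; the point is only that a random rate inflates the variance of the counts well beyond the Poisson mean.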
3.  Knowledge-fused differential dependency network models for detecting significant rewiring in biological networks 
BMC Systems Biology  2014;8:87.
Modeling biological networks serves as both a major goal and an effective tool of systems biology in studying mechanisms that orchestrate the activities of gene products in cells. Biological networks are context-specific and dynamic in nature. To systematically characterize the selectively activated regulatory components and mechanisms, modeling tools must be able to effectively distinguish significant rewiring from random background fluctuations. While differential networks cannot be constructed by existing knowledge alone, novel incorporation of prior knowledge into data-driven approaches can improve the robustness and biological relevance of network inference. However, the major unresolved roadblocks include: big solution space but a small sample size; highly complex networks; imperfect prior knowledge; missing significance assessment; and heuristic structural parameter learning.
To address these challenges, we formulated the inference of differential dependency networks that incorporate both conditional data and prior knowledge as a convex optimization problem, and developed an efficient learning algorithm to jointly infer the conserved biological network and the significant rewiring across different conditions. We used a novel sampling scheme to estimate the expected error rate due to “random” knowledge. Based on that scheme, we developed a strategy that fully exploits the benefit of this data-knowledge integrated approach. We demonstrated and validated the principle and performance of our method using synthetic datasets. We then applied our method to yeast cell line and breast cancer microarray data and obtained biologically plausible results. The open-source R software package and the experimental data are freely available at
Experiments on both synthetic and real data demonstrate the effectiveness of the knowledge-fused differential dependency network in revealing the statistically significant rewiring in biological networks. The method efficiently leverages data-driven evidence and existing biological knowledge while remaining robust to the false positive edges in the prior knowledge. The identified network rewiring events are supported by previous studies in the literature and also provide new mechanistic insight into the biological systems. We expect the knowledge-fused differential dependency network analysis, together with the open-source R package, to be an important and useful bioinformatics tool in biological network analyses.
PMCID: PMC4131167  PMID: 25055984
Biological networks; Probabilistic graphical models; Differential dependency network; Network rewiring; Network analysis; Systems biology; Knowledge incorporation; Convex optimization
4.  Reconstruction of Transcription Regulatory Networks by Stability-Based Network Component Analysis 
Reliable inference of transcription regulatory networks is still a challenging task in the field of computational biology. Network component analysis (NCA) has become a powerful scheme to uncover the networks behind complex biological processes, especially when gene expression data is integrated with binding motif information. However, the performance of NCA is impaired by the high rate of false connections in binding motif information and the high level of noise in gene expression data. Moreover, in real applications such as cancer research, the performance of NCA in simultaneously analyzing multiple candidate transcription factors (TFs) is further limited by the small sample size of gene expression data. In this paper, we propose a novel scheme, stability-based NCA, to overcome the above-mentioned problems by addressing the inconsistency between gene expression data and motif binding information (i.e., prior network knowledge). This method introduces small perturbations on prior network knowledge and utilizes the variation of estimated TF activities to reflect the stability of TF activities. Such a scheme is less limited by the sample size and is especially capable of identifying condition-specific TFs and their target genes. Experimental results on both simulation data and real breast cancer data demonstrate the efficiency and robustness of the proposed method.
PMCID: PMC3652899  PMID: 24407294
transcription regulatory network; network component analysis; stability analysis; transcription factor activity; target genes identification
5.  Module-Based Breast Cancer Classification 
The reliability and reproducibility of gene biomarkers for classification of cancer patients has been challenged due to measurement noise and biological heterogeneity among patients. In this paper, we propose a novel module-based feature selection framework, which integrates biological network information and gene expression data to identify biomarkers not as individual genes but as functional modules. Results from four breast cancer studies demonstrate that the identified module biomarkers i) achieve higher classification accuracy in independent validation datasets; ii) are more reproducible than individual gene markers; iii) improve the biological interpretability of results; and iv) are enriched in cancer “disease drivers”.
PMCID: PMC3736598  PMID: 23819260
Cancer biomarkers; systems biology; feature selection; disease classification
6.  mAPC-GibbsOS: an integrated approach for robust identification of gene regulatory networks 
BMC Systems Biology  2013;7(Suppl 5):S4.
Identification of cooperative gene regulatory networks is an important topic in biological research, especially in cancer research. Traditional approaches suffer from large noise in gene expression data and false positive connections in motif binding data; they also fail to identify the modularized structure of gene regulatory networks. Methods are needed that can reveal the underlying modularized structure and are robust to noise and false positives.
We proposed and developed an integrated approach to identify gene regulatory networks, which consists of a novel clustering method (namely motif-guided affinity propagation clustering (mAPC)) and a sampling based method (called Gibbs sampler based on outlier sum statistic (GibbsOS)). mAPC is used in the first step to obtain co-regulated gene modules by clustering genes with a similarity measurement taking into account both gene expression data and binding motif information. This clustering method can reduce the noise effect from microarray data to obtain modularized gene clusters. However, due to many false positives in motif binding data, some genes not regulated by certain transcription factors (TFs) will be falsely clustered with true target genes. To overcome this problem, GibbsOS is applied in the second step to refine each cluster for the identification of true target genes. In order to evaluate the performance of the proposed method, we generated simulation data under different signal-to-noise ratios and false positive ratios to test the method. The experimental results show an improved accuracy in terms of clustering and transcription factor identification. Moreover, an improved performance is demonstrated in target gene identification as compared with GibbsOS. Finally, we applied the proposed method to two breast cancer patient datasets to identify cooperative transcriptional regulatory networks associated with recurrence of breast cancer, as supported by their functional annotations.
We have developed a two-step approach for gene regulatory network identification, featuring an integrated method to identify modularized regulatory structures and refine their target genes subsequently. Simulation studies have shown the robustness of the method against noise in gene expression data and false positives in motif binding data. The proposed method has been applied to two breast cancer gene expression datasets to infer the hidden regulation mechanisms. The experimental results demonstrate the efficacy of the method in identifying key regulatory networks related to the progression and recurrence of breast cancer.
PMCID: PMC4028818  PMID: 24564939
7.  Antiestrogen Resistance and the Application of Systems Biology 
Understanding the molecular changes that drive an acquired antiestrogen resistance phenotype is of major clinical relevance. Previous methodologies for addressing this question have taken a single gene/pathway approach and the resulting gains have been limited in terms of their clinical impact. Recent systems biology approaches allow for the integration of data from high throughput “-omics” technologies. We highlight recent advances in the field of antiestrogen resistance with a focus on transcriptomics, proteomics and methylomics.
PMCID: PMC3607389  PMID: 23539064
Systems biology; breast cancer; estrogens; antiestrogens
8.  Computational Analysis of Muscular Dystrophy Sub-types Using A Novel Integrative Scheme 
Neurocomputing  2012;92:9-17.
To construct biologically interpretable gene sets for muscular dystrophy (MD) sub-type classification, we propose a novel computational scheme to integrate protein-protein interaction (PPI) network, functional gene set information, and mRNA profiling data. The workflow of the proposed scheme includes the following three major steps: firstly, we apply an affinity propagation clustering (APC) approach to identify gene sub-networks associated with each MD sub-type, in which a new distance metric is proposed for APC to combine PPI network information and gene-gene co-expression relationship; secondly, we further incorporate functional gene set knowledge, which complements the physical PPI information, into our scheme for biomarker identification; finally, based on the constructed sub-networks and gene set features, we apply multi-class support vector machines (MSVMs) for MD sub-type classification, with which to highlight the biomarkers contributing to sub-type prediction. The experimental results show that our scheme can help identify sub-networks and gene sets that are more relevant to MD than those constructed by other conventional approaches. Moreover, our integrative strategy improves the prediction accuracy substantially, especially for those ’hard-to-classify’ sub-types.
PMCID: PMC3389813  PMID: 22773895
Gene expression; Classification; Muscular dystrophy; Affinity propagation clustering; Biomarker discovery
9.  Identification of condition-specific regulatory modules through multi-level motif and mRNA expression analysis 
Computational methods for identifying transcription regulatory modules often produce many false positives in practice because of noise in binding information and gene expression profiling data. In this paper, we propose a multi-level strategy for condition-specific gene regulatory module identification by integrating motif binding information and gene expression data through support vector regression and significance analysis. We have demonstrated the feasibility of the proposed method on a yeast cell cycle data set. The study on a breast cancer microarray data set shows that it can successfully identify the significant and reliable regulatory modules associated with breast cancer.
PMCID: PMC3749738  PMID: 20054984
transcription regulatory module; motif enrichment analysis; SVR; support vector regression; statistical significance analysis; multi-level regulator identification
10.  Endoplasmic reticulum stress, the unfolded protein response, and gene network modeling in antiestrogen resistant breast cancer 
Lack of understanding of endocrine resistance remains one of the major challenges for breast cancer researchers, clinicians, and patients. Current reductionist approaches to understanding the molecular signaling driving resistance have offered mostly incremental progress over the past 10 years. As the field of systems biology has begun to mature, the approaches and network modeling tools being developed and applied therein offer a different way to think about how molecular signaling and the regulation of critical cellular functions are integrated. To gain novel insights, we first describe some of the key challenges facing network modeling of endocrine resistance, many of which arise from the properties of the data spaces being studied. We then use activation of the unfolded protein response (UPR) following induction of endoplasmic reticulum stress in breast cancer cells by antiestrogens, to illustrate our approaches to computational modeling. Activation of UPR is a key determinant of cell fate decision making and regulation of autophagy and apoptosis. These initial studies provide insight into a small subnetwork topology obtained using differential dependency network analysis and focused on the UPR gene XBP1. The XBP1 subnetwork topology incorporates BCAR3, BCL2, BIK, NFκB, and other genes as nodes; the connecting edges represent the dependency structures amongst these nodes. As data from ongoing cellular and molecular studies become available, we will build detailed mathematical models of this XBP1-UPR network.
PMCID: PMC3734561  PMID: 23930139
Antiestrogen; autophagy; apoptosis; breast cancer; cell signaling; endoplasmic reticulum; estrogens; gene networks; unfolded protein response; computational modeling; mathematical modeling; systems biology
11.  Regulatory component analysis: a semi-blind extraction approach to infer gene regulatory networks with imperfect biological knowledge 
Signal processing  2011;92(8):1902-1915.
With the advent of high-throughput biotechnology capable of monitoring genomic signals, it becomes increasingly promising to understand molecular cellular mechanisms through systems biology approaches. One of the active research topics in systems biology is to infer gene transcriptional regulatory networks using various genomic data; this inference problem can be formulated as a linear model with latent signals associated with some regulatory proteins called transcription factors (TFs). As common statistical assumptions may not hold for genomic signals, typical latent variable algorithms such as independent component analysis (ICA) are incapable of revealing the true underlying regulatory signals. Liao et al. [1] proposed to perform inference using an approach named network component analysis (NCA), the optimization of which is achieved by a least-squares fitting approach with biological knowledge constraints. However, the incompleteness of biological knowledge and its inconsistency with gene expression data are not considered in the original NCA solution, which could greatly affect the inference accuracy. To overcome these limitations, we propose a linear extraction scheme, namely regulatory component analysis (RCA), to infer underlying regulatory signals even with partial biological knowledge. Numerical simulations show a significant improvement of our proposed RCA over NCA, not only when the signal-to-noise ratio (SNR) is low, but also when the given biological knowledge is incomplete and inconsistent with gene expression data. Furthermore, real biological experiments on E. coli are performed for regulatory network inference in comparison with several typical linear latent variable methods, which again demonstrates the effectiveness and improved performance of the proposed algorithm.
PMCID: PMC3367667  PMID: 22685363
Transcriptional regulatory network inference; Source extraction; Gene expression; Genomic signal processing
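For orientation, the linear model and the constrained least-squares fit that the abstract above attributes to NCA can be written as follows. This is a sketch of the general NCA formulation, not of the RCA algorithm itself, and the symbols are assumptions introduced here because the abstract names none: E is the genes-by-samples expression matrix, A the genes-by-TFs connectivity (regulation strength) matrix, P the TFs-by-samples activity matrix, Γ the noise term, and Z_0 the set of matrices whose zero pattern matches the prior biological knowledge.

```latex
E = AP + \Gamma, \qquad
\min_{A,\,P} \; \lVert E - AP \rVert_F^2
\quad \text{subject to } A \in Z_0
```

In this notation, incomplete or inconsistent prior knowledge corresponds to a zero pattern Z_0 that is partly wrong, which is the situation the RCA abstract says the original NCA solution does not consider.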
12.  Robust identification of transcriptional regulatory networks using a Gibbs sampler on outlier sum statistic 
Bioinformatics  2012;28(15):1990-1997.
Motivation: Identification of transcriptional regulatory networks (TRNs) is of significant importance in computational biology for cancer research, providing a critical building block to unravel disease pathways. However, existing methods for TRN identification suffer from the inclusion of excessive ‘noise’ in microarray data and false-positives in binding data, especially when applied to human tumor-derived cell line studies. More robust methods that can counteract the imperfection of data sources are therefore needed for reliable identification of TRNs in this context.
Results: In this article, we propose to establish a link between the quality of one target gene to represent its regulator and the uncertainty of its expression to represent other target genes. Specifically, an outlier sum statistic was used to measure the aggregated evidence for regulation events between target genes and their corresponding transcription factors. A Gibbs sampling method was then developed to estimate the marginal distribution of the outlier sum statistic, hence, to uncover underlying regulatory relationships. To evaluate the effectiveness of our proposed method, we compared its performance with that of an existing sampling-based method using both simulation data and yeast cell cycle data. The experimental results show that our method consistently outperforms the competing method in different settings of signal-to-noise ratio and network topology, indicating its robustness for biological applications. Finally, we applied our method to breast cancer cell line data and demonstrated its ability to extract biologically meaningful regulatory modules related to estrogen signaling and action in breast cancer.
Availability and implementation: The Gibbs sampler MATLAB package is freely available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3400952  PMID: 22595208
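The outlier sum statistic mentioned in the abstract above can be sketched in its generic form: standardize the values by median and median absolute deviation, then add up the standardized values that exceed the third quartile plus one interquartile range. This is a minimal sketch under that assumption; how the paper combines the statistic with Gibbs sampling is not reproduced here, and the function name and example values are hypothetical.

```python
import statistics


def outlier_sum(values):
    """Generic outlier sum: total of robustly standardized values above q3 + IQR."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("median absolute deviation is zero; cannot standardize")
    standardized = [(v - med) / mad for v in values]
    q1, _, q3 = statistics.quantiles(standardized, n=4)
    cutoff = q3 + (q3 - q1)
    return sum(z for z in standardized if z > cutoff)


# Hypothetical expression values: two samples are strongly up-regulated.
with_outliers = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.05, 0.95, 5.0, 6.0]
without_outliers = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(outlier_sum(with_outliers))
print(outlier_sum(without_outliers))
```

A large value indicates that a subset of samples deviates strongly from the bulk, while evenly spread values contribute nothing to the sum.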
Ovarian cancer is often called the ‘silent killer’ because early detection and prognosis are difficult. Understanding the biological mechanisms related to ovarian cancer is therefore extremely important for treatment. We propose an integrative framework to identify pathway-related networks based on large-scale TCGA copy number data and gene expression profiles. The integrative approach first detects highly conserved copy number altered genes and regards them as seed genes, and then applies a network-based method to identify subnetworks that can differentiate gene expression patterns between different phenotypes of ovarian cancer patients. The identified subnetworks are further validated on an independent gene expression data set using a network-based classification method. The experimental results show that our approach can not only achieve good prediction performance across different data sets, but also identify biologically meaningful subnetworks involved in many signaling pathways related to ovarian cancer.
PMCID: PMC3608394  PMID: 22174260
14.  Endoplasmic Reticulum Stress, the Unfolded Protein Response, Autophagy, and the Integrated Regulation of Breast Cancer Cell Fate 
Cancer Research  2012;72(6):1321-1331.
How breast cancer cells respond to the stress of endocrine therapies determines whether they acquire a resistant phenotype or execute a cell death pathway. A successfully executed survival signal then requires determination of whether or not to replicate. How these cell fate decisions are regulated is unclear but evidence suggests that the signals determining these outcomes are highly integrated. Central to the final cell fate decision is signaling from the unfolded protein response, which can be activated following the sensing of stress within the endoplasmic reticulum. Duration of the response to stress is partly mediated by the duration of inositol requiring enzyme-1 (IRE1; ERN) activation following its release from heat shock protein A5 (HSPA5). The resulting signaling appears to use several B-cell lymphoma-2 (BCL2) family members to both suppress apoptosis and activate autophagy. Changes in metabolism induced by cellular stress are key components of this regulatory system, and further adaptation of the metabolome is affected in response to stress. Here we describe the unfolded protein response, autophagy and apoptosis, and how their regulation is integrated. Central topological features of the signaling network that integrate cell fate regulation and decision execution are discussed.
PMCID: PMC3313080  PMID: 22422988
Cell signaling; endoplasmic reticulum; estrogens; unfolded protein response
15.  Identifying protein interaction subnetworks by a bagging Markov random field-based method 
Nucleic Acids Research  2012;41(2):e42.
Identification of differentially expressed subnetworks from protein–protein interaction (PPI) networks has become increasingly important to our global understanding of the molecular mechanisms that drive cancer. Several methods have been proposed for PPI subnetwork identification, but the dependency among network member genes is not explicitly considered, leaving many important hub genes largely unidentified. We present a new method, based on a bagging Markov random field (BMRF) framework, to improve subnetwork identification for mechanistic studies of breast cancer. The method follows a maximum a posteriori principle to form a novel network score that explicitly considers pairwise gene interactions in PPI networks, and it searches for subnetworks with maximal network scores. To improve their robustness across data sets, a bagging scheme based on bootstrapping samples is implemented to statistically select high confidence subnetworks. We first compared the BMRF-based method with existing methods on simulation data to demonstrate its improved performance. We then applied our method to breast cancer data to identify PPI subnetworks associated with breast cancer progression and/or tamoxifen resistance. The experimental results show that not only an improved prediction performance can be achieved by the BMRF approach when tested on independent data sets, but biologically meaningful subnetworks can also be revealed that are relevant to breast cancer and tamoxifen resistance.
PMCID: PMC3553975  PMID: 23161673
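The bagging step described in the abstract above, which keeps only what is selected consistently across bootstrap resamples, can be sketched as follows. This is a minimal illustration, not the BMRF method: the Markov random field network score is replaced by a placeholder selector that ranks genes by the absolute difference in class means, and all names, data and thresholds are hypothetical.

```python
import random
from collections import Counter


def select_genes(cases, controls, top_k):
    """Placeholder selector: top_k genes by absolute difference in class means."""
    n_genes = len(cases[0])

    def mean(rows, gene):
        return sum(row[gene] for row in rows) / len(rows)

    diffs = [(abs(mean(cases, g) - mean(controls, g)), g) for g in range(n_genes)]
    return {g for _, g in sorted(diffs, reverse=True)[:top_k]}


def bagged_selection(cases, controls, top_k, n_rounds, min_freq, rng):
    """Keep genes selected in at least min_freq of the bootstrap rounds."""
    counts = Counter()
    for _ in range(n_rounds):
        # Resample each class with replacement, keeping class sizes fixed.
        boot_cases = [rng.choice(cases) for _ in cases]
        boot_controls = [rng.choice(controls) for _ in controls]
        counts.update(select_genes(boot_cases, boot_controls, top_k))
    return {g for g, c in counts.items() if c / n_rounds >= min_freq}


rng = random.Random(2)
# Toy data: gene 0 differs between classes, genes 1 to 4 are noise.
cases = [[rng.gauss(5.0, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(10)]
controls = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(10)]

stable = bagged_selection(cases, controls, top_k=1, n_rounds=50, min_freq=0.9, rng=rng)
print(stable)
```

Selections that depend on a few particular samples tend to fall below the frequency threshold, which is the sense in which bagging favors high-confidence results.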
16.  DDN: a caBIG® analytical tool for differential network analysis 
Bioinformatics  2011;27(7):1036-1038.
Summary: Differential dependency network (DDN) is a caBIG® (cancer Biomedical Informatics Grid) analytical tool for detecting and visualizing statistically significant topological changes in transcriptional networks representing two biological conditions. Developed under caBIG® 's In Silico Research Centers of Excellence (ISRCE) Program, DDN enables differential network analysis and provides an alternative way for defining network biomarkers predictive of phenotypes. DDN also serves as a useful systems biology tool for users across biomedical research communities to infer how genetic, epigenetic or environment variables may affect biological networks and clinical phenotypes. Besides the standalone Java application, we have also developed a Cytoscape plug-in, CytoDDN, to integrate network analysis and visualization seamlessly.
Availability: The Java and MATLAB source code can be downloaded at the authors' web site
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3065688  PMID: 21296752
17.  PUGSVM: a caBIG™ analytical tool for multiclass gene selection and predictive classification 
Bioinformatics  2010;27(5):736-738.
Summary: Phenotypic Up-regulated Gene Support Vector Machine (PUGSVM) is a cancer Biomedical Informatics Grid (caBIG™) analytical tool for multiclass gene selection and classification. PUGSVM addresses the problem of imbalanced class separability, small sample size and high gene space dimensionality, where multiclass gene markers are defined by the union of one-versus-everyone phenotypic upregulated genes, and used by a well-matched one-versus-rest support vector machine. PUGSVM provides a simple yet more accurate strategy to identify statistically reproducible mechanistic marker genes for characterization of heterogeneous diseases.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3042183  PMID: 21186245
18.  Distinct DNA methylation profiles in ovarian serous neoplasms and their implications in ovarian carcinogenesis 
American journal of obstetrics and gynecology  2010;203(6):584.e1-584.e22.
The purpose of this study was to analyze DNA methylation profiles among different types of ovarian serous neoplasm, an analysis that had not previously been performed.
Study Design
The Illumina beads array was used to profile DNA methylation in enriched tumor cells isolated from 75 benign and malignant serous tumor tissues and six tumor-associated stromal cell cultures.
We found significantly fewer hypermethylated genes in high-grade serous carcinomas than in low-grade serous carcinoma and borderline tumors, which in turn had fewer hypermethylated genes than serous cystadenoma. Unsupervised analysis identified that serous cystadenoma, serous borderline tumor and low-grade serous carcinoma tightly clustered together and were clearly different from high-grade serous carcinomas. We also performed supervised analysis to identify differentially methylated genes that may contribute to group separation.
The findings support the view that low-grade and high-grade serous carcinomas are distinctly different with low-grade, but not high-grade serous carcinomas, related to serous borderline tumor and cystadenoma.
PMCID: PMC2993872  PMID: 20965493
ovarian carcinoma; methylation; classification
19.  Identifying cancer biomarkers by network-constrained support vector machines 
BMC Systems Biology  2011;5:161.
One of the major goals in gene and protein expression profiling of cancer is to identify biomarkers and build classification models for prediction of disease prognosis or treatment response. Many traditional statistical methods, based on microarray gene expression data alone and individual genes' discriminatory power, often fail to identify biologically meaningful biomarkers thus resulting in poor prediction performance across data sets. Nonetheless, the variables in multivariable classifiers should synergistically interact to produce more effective classifiers than individual biomarkers.
We developed an integrated approach, namely network-constrained support vector machine (netSVM), for cancer biomarker identification with an improved prediction performance. The netSVM approach is specifically designed for network biomarker identification by integrating gene expression data and protein-protein interaction data. We first evaluated the effectiveness of netSVM using simulation studies, demonstrating its improved performance over state-of-the-art network-based methods and gene-based methods for network biomarker identification. We then applied the netSVM approach to two breast cancer data sets to identify prognostic signatures for prediction of breast cancer metastasis. The experimental results show that: (1) network biomarkers identified by netSVM are highly enriched in biological pathways associated with cancer progression; (2) prediction performance is much improved when tested across different data sets. Specifically, many genes related to apoptosis, cell cycle, and cell proliferation, which are hallmark signatures of breast cancer metastasis, were identified by the netSVM approach. More importantly, several novel hub genes, biologically important with many interactions in PPI network but often showing little change in expression as compared with their downstream genes, were also identified as network biomarkers; the genes were enriched in signaling pathways such as TGF-beta signaling pathway, MAPK signaling pathway, and JAK-STAT signaling pathway. These signaling pathways may provide new insight to the underlying mechanism of breast cancer metastasis.
We have developed a network-based approach for cancer biomarker identification, netSVM, resulting in an improved prediction performance with network biomarkers. We have applied the netSVM approach to breast cancer gene expression data to predict metastasis in patients. Network biomarkers identified by netSVM reveal potential signaling pathways associated with breast cancer metastasis, and help improve the prediction performance across independent data sets.
PMCID: PMC3214162  PMID: 21992556
20.  Multilevel support vector regression analysis to identify condition-specific regulatory networks 
Bioinformatics  2010;26(11):1416-1422.
Motivation: The identification of gene regulatory modules is an important yet challenging problem in computational biology. While many computational methods have been proposed to identify regulatory modules, their initial success is largely compromised by a high rate of false positives, especially when applied to human cancer studies. New strategies are needed for reliable regulatory module identification.
Results: We present a new approach, namely multilevel support vector regression (ml-SVR), to systematically identify condition-specific regulatory modules. The approach is built upon a multilevel analysis strategy designed for suppressing false positive predictions. With this strategy, a regulatory module becomes ever more significant as more relevant gene sets are formed at finer levels. At each level, a two-stage support vector regression (SVR) method is utilized to help reduce false positive predictions by integrating binding motif information and gene expression data; a significance analysis procedure then assesses the significance of each regulatory module. To evaluate the effectiveness of the proposed strategy, we first compared the ml-SVR approach with other existing methods on simulation data and yeast cell cycle data. The resulting performance shows that the ml-SVR approach outperforms other methods in the identification of both regulators and their target genes. We then applied our method to breast cancer cell line data to identify condition-specific regulatory modules associated with estrogen treatment. Experimental results show that our method can identify biologically meaningful regulatory modules related to estrogen signaling and action in breast cancer.
Availability and implementation: The ml-SVR MATLAB package can be downloaded at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2872001  PMID: 20375112
21.  Applications of Different Weighting Schemes to Improve Pathway-Based Analysis 
Conventionally, pathway-based analysis assumes that genes in a pathway contribute equally to a biological function, thus assigning uniform weight to genes. However, this assumption has been proved incorrect, and applying uniform weight in pathway analysis may not be appropriate for tasks such as molecular classification of diseases, as genes in a functional group may have different predictive power. Hence, we propose assigning different weights to genes in pathway-based analysis and devise four weighting schemes. We applied them in two existing pathway analysis methods using both real and simulated gene expression data for pathways. Among all schemes, the random weighting scheme, which generates random weights and selects the optimal weights that minimize an objective function, performs best in terms of P value or error rate reduction. Weighting changes pathway scoring and brings up some new significant pathways, leading to the detection of disease-related genes that are missed under uniform weight.
PMCID: PMC3114410  PMID: 21687588
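The random weighting idea in entry 21 (generate random gene weights, keep the set that minimizes an objective function) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the pathway score is a weighted sum of gene values, the objective is the negative absolute Welch t-statistic between two classes, and the names `pathway_scores`, `objective`, and `random_weighting` are hypothetical.

```python
import random
import statistics


def pathway_scores(expr, weights):
    """Weighted sum of gene values for each sample (rows are samples)."""
    return [sum(w * x for w, x in zip(weights, row)) for row in expr]


def objective(scores, labels):
    """Negative absolute Welch t-statistic; smaller means better separation.

    This objective is an assumption for the sketch, not the paper's choice.
    """
    a = [s for s, y in zip(scores, labels) if y == 0]
    b = [s for s, y in zip(scores, labels) if y == 1]
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    if se == 0:
        return 0.0  # degenerate case: no spread, treat as uninformative
    return -abs(statistics.mean(a) - statistics.mean(b)) / se


def random_weighting(expr, labels, n_trials=200, seed=0):
    """Try random normalized weight vectors and keep the best one.

    The uniform weighting is the starting candidate, so the result is
    never worse than uniform under the chosen objective.
    """
    rng = random.Random(seed)
    n_genes = len(expr[0])
    best_w = [1.0 / n_genes] * n_genes
    best_obj = objective(pathway_scores(expr, best_w), labels)
    for _ in range(n_trials):
        raw = [rng.random() for _ in range(n_genes)]
        total = sum(raw)
        w = [r / total for r in raw]
        obj = objective(pathway_scores(expr, w), labels)
        if obj < best_obj:
            best_w, best_obj = w, obj
    return best_w, best_obj


# Toy pathway with three genes: only gene 0 differs between the two classes.
data_rng = random.Random(1)
labels = [0] * 6 + [1] * 6
expr = [[data_rng.gauss(2.0 * y, 1.0), data_rng.gauss(0, 1), data_rng.gauss(0, 1)]
        for y in labels]

uniform_obj = objective(pathway_scores(expr, [1 / 3] * 3), labels)
weights, best_obj = random_weighting(expr, labels)
print("uniform objective:", uniform_obj)
print("best objective:   ", best_obj)
print("selected weights: ", weights)
```

Seeding the search with the uniform weights makes the comparison with the conventional approach explicit: any improvement in the objective is attributable to the non-uniform weights alone.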
22.  Motif-guided sparse decomposition of gene expression data for regulatory module identification 
BMC Bioinformatics  2011;12:82.
Genes work coordinately as gene modules or gene networks. Various computational approaches have been proposed to find gene modules based on gene expression data; for example, gene clustering is a popular method for grouping genes with similar gene expression patterns. However, traditional gene clustering often yields unsatisfactory results for regulatory module identification because the resulting gene clusters are co-expressed but not necessarily co-regulated.
We propose a novel approach, motif-guided sparse decomposition (mSD), to identify gene regulatory modules by integrating gene expression data and DNA sequence motif information. The mSD approach is implemented as a two-step algorithm comprising estimates of (1) transcription factor activity and (2) the strength of the predicted gene regulation event(s). Specifically, a motif-guided clustering method is first developed to estimate the transcription factor activity of a gene module; sparse component analysis is then applied to estimate the regulation strength, and so predict the target genes of the transcription factors. The mSD approach was first tested for its improved performance in finding regulatory modules using simulated and real yeast data, revealing functionally distinct gene modules enriched with biologically validated transcription factors. We then demonstrated the efficacy of the mSD approach on breast cancer cell line data and uncovered several important gene regulatory modules related to endocrine therapy of breast cancer.
We have developed a new integrated strategy, namely motif-guided sparse decomposition (mSD) of gene expression data, for regulatory module identification. The mSD method features a novel motif-guided clustering method for transcription factor activity estimation by finding a balance between co-regulation and co-expression. The mSD method further utilizes a sparse decomposition method for regulation strength estimation. The experimental results show that such a motif-guided strategy can provide context-specific regulatory modules in both yeast and breast cancer studies.
PMCID: PMC3072956  PMID: 21426557
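The second step of entry 22, estimating sparse regulation strengths from transcription factor activities, can be sketched as follows. This is not the authors' sparse component analysis: an L1-penalized least-squares fit solved by coordinate descent is swapped in as a simple stand-in, the transcription factor activity profiles are assumed to be already known, and all names (`sparse_strengths`, `tf_activity`, `gene_expr`) and values are hypothetical.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def sparse_strengths(activities, expression, lam=0.5, n_iter=100):
    """Coordinate descent for 0.5 * ||x - sum_k s_k * a_k||^2 + lam * ||s||_1.

    activities: list of K activity profiles (one per transcription factor).
    expression: expression profile of one gene over the same conditions.
    Returns the K estimated regulation strengths (many are exactly zero).
    """
    k = len(activities)
    strengths = [0.0] * k
    norms = [sum(a * a for a in act) for act in activities]
    for _ in range(n_iter):
        for j in range(k):
            # Residual with the contribution of every factor except j removed.
            partial = [
                x - sum(strengths[m] * activities[m][t] for m in range(k) if m != j)
                for t, x in enumerate(expression)
            ]
            rho = sum(a * r for a, r in zip(activities[j], partial))
            strengths[j] = soft_threshold(rho, lam) / norms[j]
    return strengths


# Two hypothetical transcription factors with orthogonal activity profiles.
tf_activity = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]
gene_expr = {
    "gene1": [2.0, -2.0, 2.0, -2.0],   # follows the first factor
    "gene2": [-1.5, -1.5, 1.5, 1.5],   # repressed by the second factor
}

estimates = {g: sparse_strengths(tf_activity, x) for g, x in gene_expr.items()}
for gene, s in estimates.items():
    targets_of = [i for i, v in enumerate(s) if v != 0.0]
    print(gene, s, "regulated by factor index:", targets_of)
```

The L1 penalty drives the strength of unrelated factors to exactly zero, so target genes of a factor can be read off as the genes with a nonzero strength for it.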
23.  Reconstruction of Gene Regulatory Modules in Cancer Cell Cycle by Multi-Source Data Integration 
PLoS ONE  2010;5(4):e10268.
Precise regulation of the cell cycle is crucial to the growth and development of all organisms. Understanding the regulatory mechanism of the cell cycle is crucial to unraveling many complicated diseases, most notably cancer. Multiple sources of biological data are available to study the dynamic interactions among many genes that are related to the cancer cell cycle. Integrating these informative and complementary data sources can help to infer a mutually consistent gene transcriptional regulatory network with strong similarity to the underlying gene regulatory relationships in cancer cells.
Results and Principal Findings
We propose an integrative framework that infers gene regulatory modules from the cell cycle of cancer cells by incorporating multiple sources of biological data, including gene expression profiles, gene ontology, and molecular interaction. Among 846 human genes with putative roles in cell cycle regulation, we identified 46 transcription factors and 39 gene ontology groups. We reconstructed regulatory modules to infer the underlying regulatory relationships. Four regulatory network motifs were identified from the interaction network. The relationship between each transcription factor and predicted target gene groups was examined by training a recurrent neural network whose topology mimics the network motif(s) to which the transcription factor was assigned. Inferred network motifs related to eight well-known cell cycle genes were confirmed by gene set enrichment analysis, binding site enrichment analysis, and comparison with previously published experimental results.
We established a robust method that can accurately infer underlying relationships between a given transcription factor and its downstream target genes by integrating different layers of biological data. Our method could also be beneficial to biologists for predicting the components of regulatory modules in which any candidate gene is involved. Such predictions can then be used to design a more streamlined experimental approach for biological validation. Understanding the dynamics of these modules will shed light on the processes that occur in cancer cells resulting from errors in cell cycle regulation.
PMCID: PMC2858157  PMID: 20422009
24.  Knowledge-guided gene ranking by coordinative component analysis 
BMC Bioinformatics  2010;11:162.
In cancer, gene networks and pathways often exhibit dynamic behavior, particularly during the process of carcinogenesis. Thus, it is important to prioritize those genes that are strongly associated with the functionality of a network. Traditional statistical methods often fail to identify biologically relevant member genes, motivating researchers to incorporate biological knowledge into gene ranking methods. However, current integration strategies are often heuristic and fail to incorporate fully the true interplay between biological knowledge and gene expression data.
To improve knowledge-guided gene ranking, we propose a novel method called coordinative component analysis (COCA) in this paper. COCA explicitly captures those genes within a specific biological context that are likely to be expressed in a coordinative manner. Formulated as an optimization problem to maximize the coordinative effort, COCA is designed to first extract the coordinative components based on partial guidance from knowledge genes and then rank the genes according to their participation strengths. An embedded bootstrapping procedure is implemented to improve the statistical robustness of the solutions. COCA was initially tested on simulation data and then on published gene expression microarray data to demonstrate its improved performance as compared to traditional statistical methods. Finally, the COCA approach has been applied to stem cell data to identify biologically relevant genes in signaling pathways. As a result, the COCA approach uncovers novel pathway members that may shed light on pathway deregulation in cancers.
We have developed a new integrative strategy to combine biological knowledge and microarray data for gene ranking. The method uses knowledge genes as guidance to first extract coordinative components, and then ranks the genes according to their contribution to a network or pathway. The experimental results show that such a knowledge-guided strategy can provide context-specific gene ranking with an improved performance in pathway member identification.
PMCID: PMC2865494  PMID: 20353603
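The overall shape of knowledge-guided ranking in entry 24 (extract a component guided by knowledge genes, rank genes by participation strength, bootstrap for robustness) can be sketched in a heavily simplified form. This is not the COCA optimization itself: as an assumption for illustration, the component is the mean standardized profile of the knowledge genes, the participation strength is the absolute Pearson correlation with that component, and the names `rank_genes`, `k1`, `k2`, `candidate`, and `noise` are hypothetical.

```python
import random
import statistics


def zscore(profile):
    """Standardize a profile; a constant profile maps to all zeros."""
    mu = statistics.mean(profile)
    sd = statistics.pstdev(profile)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in profile]


def pearson(x, y):
    """Pearson correlation (0.0 if either profile is constant)."""
    zx, zy = zscore(x), zscore(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


def rank_genes(expr, knowledge_genes, n_boot=100, seed=0):
    """Rank genes by bootstrap-averaged |correlation| with a guided component."""
    rng = random.Random(seed)
    n_samples = len(next(iter(expr.values())))
    totals = {g: 0.0 for g in expr}
    for _ in range(n_boot):
        # Resample the samples (columns) with replacement.
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        boot = {g: [p[i] for i in idx] for g, p in expr.items()}
        # Guided component: mean standardized profile of the knowledge genes.
        guides = [zscore(boot[g]) for g in knowledge_genes]
        component = [statistics.mean(col) for col in zip(*guides)]
        for g, profile in boot.items():
            totals[g] += abs(pearson(profile, component))
    strengths = {g: t / n_boot for g, t in totals.items()}
    ranking = sorted(strengths, key=strengths.get, reverse=True)
    return ranking, strengths


# Toy data: two knowledge genes, one coordinated candidate, one unrelated gene.
expr = {
    "k1": [1, 2, 3, 4, 5, 6, 7, 8],
    "k2": [2, 3, 4, 5, 6, 7, 8, 10],
    "candidate": [1.1, 2.3, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1],
    "noise": [5, 1, 4, 2, 3, 5, 1, 4],
}
ranking, strengths = rank_genes(expr, knowledge_genes=["k1", "k2"])
print(ranking)
print(strengths)
```

Averaging over bootstrap resamples keeps a gene from ranking highly on the strength of one or two influential samples, which is the role the embedded bootstrapping plays in the description above.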
25.  Gene Network Signaling in Hormone Responsiveness Modifies Apoptosis and Autophagy in Breast Cancer Cells 
Resistance to endocrine therapies, whether de novo or acquired, remains a major limitation in the ability to cure many tumors that express detectable levels of the estrogen receptor alpha protein (ER). While several resistance phenotypes have been described, endocrine unresponsiveness in the context of therapy-induced tumor growth appears to be the most prevalent. The signaling that regulates endocrine resistant phenotypes is poorly understood but it involves a complex signaling network with a topology that includes redundant and degenerative features. To be relevant to clinical outcomes, the most pertinent features of this network are those that ultimately affect the endocrine-regulated components of the cell fate and cell proliferation machineries. We show that autophagy, as supported by the endocrine regulation of monodansylcadaverine staining, increased LC3 cleavage, and reduced expression of p62/SQSTM1, plays an important role in breast cancer cells responding to endocrine therapy. We further show that the cell fate machinery includes both apoptotic and autophagic functions that are potentially regulated through integrated signaling that flows through key members of the BCL2 gene family and beclin-1 (BECN1). This signaling links cellular functions in mitochondria and endoplasmic reticulum, the latter as a consequence of induction of the unfolded protein response. We have taken a seed-gene approach to begin extracting critical nodes and edges that represent central signaling events in the endocrine regulation of apoptosis and autophagy. Three seed nodes were identified from global gene or protein expression analyses and supported by subsequent functional studies that established their abilities to affect cell fate. 
The seed nodes of nuclear factor kappa B (NFκB), interferon regulatory factor-1 (IRF1), and X-box binding protein-1 (XBP1) are linked by directional edges that support signal flow through a preliminary network that is grown to include key regulators of their individual function: NEMO/IKKγ, nucleophosmin and ER respectively. Signaling proceeds through BCL2 gene family members and BECN1 ultimately to regulate cell fate.
PMCID: PMC2768542  PMID: 19444933
Antiestrogen; autophagy; apoptosis; breast cancer; cell signaling; endoplasmic reticulum; estrogens; gene networks; unfolded protein response
