
Results 1-9 (9)

1.  TVNViewer: An interactive visualization tool for exploring networks that change over time or space 
Bioinformatics  2011;27(13):1880-1881.
Summary: The relationship between genes and proteins is a dynamic relationship that changes across time and differs in different cells. The study of these differences can reveal various insights into biological processes and disease progression, especially with the aid of proper tools for network visualization. Toward this purpose, we have developed TVNViewer, a novel visualization tool, which is specifically designed to aid in the exploration and analysis of dynamic networks.
Availability: TVNViewer is freely available with documentation and tutorials on the web at
PMCID: PMC3117350  PMID: 21551142
2.  Leveraging input and output structures for joint mapping of epistatic and marginal eQTLs 
Bioinformatics  2012;28(12):i137-i146.
Motivation: As many complex disease and expression phenotypes are the outcome of intricate perturbation of molecular networks underlying gene regulation resulting from interdependent genome variations, association mapping of causal QTLs or expression quantitative trait loci must consider both additive and epistatic effects of multiple candidate genotypes. This problem poses a significant challenge to contemporary genome-wide-association (GWA) mapping technologies because of its computational complexity. Fortunately, a plethora of recent developments in the biological network community, especially the availability of genetic interaction networks, make it possible to construct informative priors of complex interactions between genotypes, which can substantially reduce the complexity and increase the statistical power of GWA inference.
Results: In this article, we consider the problem of learning a multitask regression model while taking advantage of the prior information on structures on both the inputs (genetic variations) and outputs (expression levels). We propose a novel regularization scheme over multitask regression called jointly structured input–output lasso based on an ℓ1/ℓ2 norm, which allows shared sparsity patterns for related inputs and outputs to be optimally estimated. Such patterns capture multiple related single nucleotide polymorphisms (SNPs) that jointly influence multiple-related expression traits. In addition, we generalize this new multitask regression to structurally regularized polynomial regression to detect epistatic interactions with manageable complexity by exploiting the prior knowledge on candidate SNPs for epistatic effects from biological experiments. We demonstrate our method on simulated and yeast eQTL datasets.
Availability: Software is available at
PMCID: PMC3371859  PMID: 22689753
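To make the ℓ1/ℓ2 norm mentioned in the abstract above concrete, here is a minimal sketch of a generic group penalty of that kind: the sum, over groups of coefficients, of each group's Euclidean norm. The function name `l1_l2_penalty`, the representation of groups as lists of (input, output) index pairs, and the toy numbers are assumptions made for illustration; the paper's actual grouping over input and output structures is not reproduced here.

```python
import math

def l1_l2_penalty(B, groups):
    """Generic l1/l2 penalty: sum over groups of the l2 norm of each group.

    B      -- coefficient matrix as a list of rows, B[j][k] for input j, output k
    groups -- hypothetical grouping: each group is a list of (j, k) index pairs
    """
    total = 0.0
    for group in groups:
        total += math.sqrt(sum(B[j][k] ** 2 for j, k in group))
    return total

# Toy example: two inputs, two outputs, one group per input row.
B = [[3.0, 4.0], [0.0, 0.0]]
groups = [[(0, 0), (0, 1)], [(1, 0), (1, 1)]]
penalty = l1_l2_penalty(B, groups)
```

Because the norm inside each group is not squared, the penalty tends to zero out whole groups at once, which is what yields shared sparsity patterns across related coefficients.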
3.  TREEGL: reverse engineering tree-evolving gene networks underlying developing biological lineages 
Bioinformatics  2011;27(13):i196-i204.
Motivation: Estimating gene regulatory networks over biological lineages is central to a deeper understanding of how cells evolve during development and differentiation. However, one challenge in estimating such evolving networks is that their host cells not only contiguously evolve, but also branch over time. For example, a stem cell evolves into two more specialized daughter cells at each division, forming a tree of networks. Another example is in a laboratory setting: a biologist may apply several different drugs individually to malignant cancer cells to analyze the effects of each drug on the cells; the cells treated by one drug may not be intrinsically similar to those treated by another, but rather to the malignant cancer cells they were derived from.
Results: We propose a novel algorithm, Treegl, an ℓ1 plus total variation penalized linear regression method, to effectively estimate multiple gene networks corresponding to cell types related by a tree-genealogy, based on only a few samples from each cell type. Treegl takes advantage of the similarity between related networks along the biological lineage, while at the same time exposing sharp differences between the networks. We demonstrate that our algorithm performs significantly better than existing methods via simulation. Furthermore we explore an application to a breast cancer dataset, and show that our algorithm is able to produce biologically valid results that provide insight into the progression and reversion of breast cancer cells.
Availability: Software will be available at
PMCID: PMC3117339  PMID: 21685070
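The "ℓ1 plus total variation" penalty described in the abstract above can be sketched as follows, assuming one coefficient vector per cell type and a total variation term taken over parent-child edges of the lineage tree. The function name `tree_tv_penalty`, the dictionary layout, and the weights `lam1` and `lam2` are hypothetical; the exact objective used by Treegl may differ.

```python
def tree_tv_penalty(betas, parent, lam1, lam2):
    """l1 sparsity term plus total variation along tree edges.

    betas  -- dict mapping node name to its coefficient vector (list of floats)
    parent -- dict mapping each non-root node to its parent node
    lam1   -- weight on the l1 (sparsity) term              (assumed name)
    lam2   -- weight on the total variation (smoothness) term (assumed name)
    """
    sparsity = sum(abs(w) for beta in betas.values() for w in beta)
    variation = sum(
        abs(wc - wp)
        for child, par in parent.items()
        for wc, wp in zip(betas[child], betas[par])
    )
    return lam1 * sparsity + lam2 * variation

# Toy lineage: a stem cell with two daughter cell types.
betas = {"stem": [1.0, 0.0], "a": [1.0, 0.5], "b": [0.0, 0.0]}
parent = {"a": "stem", "b": "stem"}
value = tree_tv_penalty(betas, parent, lam1=1.0, lam2=2.0)
```

The ℓ1 term keeps each network sparse, while the edge term penalizes differences between a cell type and its parent, so related networks stay similar except where the data support a sharp change.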
4.  StructHDP: automatic inference of number of clusters and population structure from admixed genotype data 
Bioinformatics  2011;27(13):i324-i332.
Motivation: Clustering of genotype data is an important way of understanding similarities and differences between populations. A summary of populations through clustering allows us to make inferences about the evolutionary history of the populations. Many methods have been proposed to perform clustering on multilocus genotype data. However, most of these methods do not directly address the question of how many clusters the data should be divided into and leave that choice to the user.
Methods: We present StructHDP, which is a method for automatically inferring the number of clusters from genotype data in the presence of admixture. Our method is an extension of two existing methods, Structure and Structurama. Using a Hierarchical Dirichlet Process (HDP), we model the presence of admixture of an unknown number of ancestral populations in a given sample of genotype data. We use a Gibbs sampler to perform inference on the resulting model and infer the ancestry proportions and the number of clusters that best explain the data.
Results: To demonstrate our method, we simulated data from an island model using the neutral coalescent. Comparing the results of StructHDP with Structurama shows the utility of combining HDPs with the Structure model. We used StructHDP to analyze a dataset of 155 Taita thrush, Turdus helleri, which has been previously analyzed using Structure and Structurama. StructHDP correctly picks the optimal number of populations to cluster the data. The clustering based on the inferred ancestry proportions also agrees with that inferred using Structure for the optimal number of populations. We also analyzed data from 1048 individuals from the Human Genome Diversity project from 53 world populations. We found that the clusters obtained correspond with major geographical divisions of the world, which is in agreement with previous analyses of the dataset.
Availability: StructHDP is written in C++. The code will be available for download at
PMCID: PMC3117349  PMID: 21685088
5.  SPEX2: automated concise extraction of spatial gene expression patterns from Fly embryo ISH images 
Bioinformatics  2010;26(12):i47-i56.
Motivation: Microarray profiling of mRNA abundance is often ill suited for temporal–spatial analysis of gene expressions in multicellular organisms such as Drosophila. Recent progress in image-based genome-scale profiling of whole-body mRNA patterns via in situ hybridization (ISH) calls for development of accurate and automatic image analysis systems to facilitate efficient mining of complex temporal–spatial mRNA patterns, which will be essential for functional genomics and network inference in higher organisms.
Results: We present SPEX2, an automatic system for embryonic ISH image processing, which can extract, transform, compare, classify and cluster spatial gene expression patterns in Drosophila embryos. Our pipeline for gene expression pattern extraction outputs the precise spatial locations and strengths of the gene expression. We performed experiments on the largest publicly available collection of Drosophila ISH images, and show that our method achieves excellent performance in automatic image annotation, and also finds clusters that are significantly enriched, both for gene ontology functional annotations, and for annotation terms from a controlled vocabulary used by human curators to describe these images.
Availability: Software will be available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2881357  PMID: 20529936
6.  Multi-population GWA mapping via multi-task regularized regression 
Bioinformatics  2010;26(12):i208-i216.
Motivation: Population heterogeneity through admixing of different founder populations can produce spurious associations in genome-wide association studies that are linked to the population structure rather than the phenotype. Since samples from the same population generally co-evolve, different populations may or may not share the same genetic underpinnings for the seemingly common phenotype. Our goal is to develop a unified framework for detecting causal genetic markers through a joint association analysis of multiple populations.
Results: Based on a multi-task regression principle, we present a multi-population group lasso algorithm using L1/L2-regularized regression for joint association analysis of multiple populations that are stratified either via population survey or computational estimation. Our algorithm combines information from genetic markers across populations, to identify causal markers. It also implicitly accounts for correlations between the genetic markers, thus enabling better control over false positive rates. Joint analysis across populations enables the detection of weak associations common to all populations with greater power than in a separate analysis of each population. At the same time, the regression-based framework allows causal alleles that are unique to a subset of the populations to be correctly identified. We demonstrate the effectiveness of our method on HapMap-simulated and lactase persistence datasets, where we significantly outperform state of the art methods, with greater power for detecting weak associations and reduced spurious associations.
Availability: Software will be available at
PMCID: PMC2881376  PMID: 20529908
7.  DISCOVER: a feature-based discriminative method for motif search in complex genomes 
Bioinformatics  2009;25(12):i321-i329.
Motivation: Identifying transcription factor binding sites (TFBSs) encoding complex regulatory signals in metazoan genomes remains a challenging problem in computational genomics. Due to degeneracy of nucleotide content among binding site instances or motifs, and intricate ‘grammatical organization’ of motifs within cis-regulatory modules (CRMs), extant pattern matching-based in silico motif search methods often suffer from impractically high false positive rates, especially in the context of analyzing large genomic datasets, and noisy position weight matrices which characterize binding sites. Here, we try to address this problem by using a framework to maximally utilize the information content of the genomic DNA in the region of query, taking cues from values of various biologically meaningful genetic and epigenetic factors in the query region such as clade-specific evolutionary parameters, presence/absence of nearby coding regions, etc. We present a new method for TFBS prediction in metazoan genomes that utilizes both the CRM architecture of sequences and a variety of features of individual motifs. Our proposed approach is based on a discriminative probabilistic model known as conditional random fields that explicitly optimizes the predictive probability of motif presence in large sequences, based on the joint effect of all such features.
Results: This model overcomes weaknesses in earlier methods based on less effective statistical formalisms that are sensitive to spurious signals in the data. We evaluate our method on both simulated CRMs and real Drosophila sequences in comparison with a wide spectrum of existing models, and outperform the state of the art by 22% in F1 score.
Availability and Implementation: The code is publicly available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2687984  PMID: 19478006
8.  A multivariate regression approach to association analysis of a quantitative trait network 
Bioinformatics  2009;25(12):i204-i212.
Motivation: Many complex disease syndromes such as asthma consist of a large number of highly related, rather than independent, clinical phenotypes, raising a new technical challenge in identifying genetic variations associated simultaneously with correlated traits. Although a causal genetic variation may influence a group of highly correlated traits jointly, most of the previous association analyses considered each phenotype separately, or combined results from a set of single-phenotype analyses.
Results: We propose a new statistical framework called graph-guided fused lasso to address this issue in a principled way. Our approach represents the dependency structure among the quantitative traits explicitly as a network, and leverages this trait network to encode structured regularizations in a multivariate regression model over the genotypes and traits, so that the genetic markers that jointly influence subgroups of highly correlated traits can be detected with high sensitivity and specificity. While most of the traditional methods examined each phenotype independently, our approach analyzes all of the traits jointly in a single statistical method to discover the genetic markers that perturb a subset of correlated traits jointly rather than a single trait. Using simulated datasets based on the HapMap consortium data and an asthma dataset, we compare the performance of our method with the single-marker analysis, and other sparse regression methods that do not use any structural information in the traits. Our results show that there is a significant advantage in detecting the true causal single nucleotide polymorphisms when we incorporate the correlation pattern in traits using our proposed methods.
Availability: Software for GFlasso is available at
PMCID: PMC2687972  PMID: 19477989
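As a rough illustration of how a trait network can be encoded as a structured regularizer, the sketch below combines a lasso term with a fusion term over edges of the trait graph, weighting each edge by the absolute correlation between the two traits and flipping the sign for negatively correlated pairs. This is one plausible form of a graph-guided fusion penalty, not a transcription of the GFlasso software; the function name `gflasso_penalty`, the edge tuple layout, and the weights `lam` and `gamma` are assumptions.

```python
def gflasso_penalty(B, edges, lam, gamma):
    """Lasso term plus a correlation-weighted fusion term over trait-graph edges.

    B     -- coefficient matrix as a list of rows, B[j][k] for marker j, trait k
    edges -- hypothetical edge list of (trait_m, trait_l, correlation) tuples
    lam   -- weight on the lasso term   (assumed name)
    gamma -- weight on the fusion term  (assumed name)
    """
    lasso = sum(abs(b) for row in B for b in row)
    fusion = 0.0
    for m, l, r in edges:
        sign = 1.0 if r >= 0 else -1.0
        # Encourage a marker's effects on two linked traits to agree
        # (or to be opposite, when the traits are negatively correlated).
        fusion += abs(r) * sum(abs(row[m] - sign * row[l]) for row in B)
    return lam * lasso + gamma * fusion

# Toy example: two markers, three traits, two edges in the trait network.
B = [[1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
edges = [(0, 1, 0.5), (0, 2, -0.5)]
value = gflasso_penalty(B, edges, lam=1.0, gamma=1.0)
```

In the toy example the first edge contributes nothing because each marker has identical effects on traits 0 and 1, while the second edge is penalized because the effects on traits 0 and 2 are not mirror images of each other.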
9.  KELLER: estimating time-varying interactions between genes 
Bioinformatics  2009;25(12):i128-i136.
Motivation: Gene regulatory networks underlying temporal processes, such as the cell cycle or the life cycle of an organism, can exhibit significant topological changes to facilitate the underlying dynamic regulatory functions. Thus, it is essential to develop methods that capture the temporal evolution of the regulatory networks. These methods will be an enabling first step for studying the driving forces underlying the dynamic gene regulation circuitry and predicting the future network structures in response to internal and external stimuli.
Results: We introduce a kernel-reweighted logistic regression method (KELLER) for reverse engineering the dynamic interactions between genes based on their time series of expression values. We apply the proposed method to estimate the latent sequence of temporal rewiring networks of 588 genes involved in the developmental process during the life cycle of Drosophila melanogaster. Our results offer the first glimpse into the temporal evolution of gene networks in a living organism during its full developmental course. Our results also show that many genes exhibit distinctive functions at different stages along the developmental cycle.
Availability: Source codes and relevant data will be made available at
PMCID: PMC2687946  PMID: 19477978
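The idea of kernel reweighting in the abstract above can be sketched as a logistic loss in which each time-stamped sample is weighted by its closeness to the time point of interest. The Gaussian kernel, the `bandwidth` parameter, the ±1 label encoding, and both function names are assumptions chosen for illustration; the regularization that a full estimator would add is omitted, and the actual KELLER formulation may differ.

```python
import math

def kernel_weights(times, t_star, bandwidth):
    """Normalized Gaussian kernel weights centred on the target time t_star."""
    raw = [math.exp(-((t - t_star) ** 2) / bandwidth) for t in times]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_logistic_loss(theta, X, y, weights):
    """Kernel-weighted logistic loss; labels in y are assumed to be -1 or +1."""
    loss = 0.0
    for x, label, w in zip(X, y, weights):
        margin = label * sum(t * xi for t, xi in zip(theta, x))
        loss += w * math.log1p(math.exp(-margin))
    return loss

# Toy example: five time points, estimating the model at t = 2.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
weights = kernel_weights(times, t_star=2.0, bandwidth=1.0)
X = [[1.0, -1.0], [1.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
y = [1, -1, 1, 1, -1]
loss_at_zero = weighted_logistic_loss([0.0, 0.0], X, y, weights)
```

Refitting with weights centred on each time point in turn gives one estimate per time point, with nearby samples borrowing strength from each other.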
