author:("Nakhleh, lay")
1.  Mutations in Global Regulators Lead to Metabolic Selection during Adaptation to Complex Environments 
PLoS Genetics  2014;10(12):e1004872.
Adaptation to ecologically complex environments can provide insights into the evolutionary dynamics and functional constraints encountered by organisms during natural selection. Adaptation to a new environment with abundant and varied resources can be difficult to achieve by small incremental changes if many mutations are required to achieve even modest gains in fitness. Since changing complex environments are quite common in nature, we investigated how such an epistatic bottleneck can be avoided to allow rapid adaptation. We show that adaptive mutations arise repeatedly in independently evolved populations in the context of greatly increased genetic and phenotypic diversity. We go on to show that weak selection requiring substantial metabolic reprogramming can be readily achieved by mutations in the global response regulator arcA and the stress response regulator rpoS. We identified 46 unique single-nucleotide variants of arcA and 18 mutations in rpoS, nine of which resulted in stop codons or large deletions, suggesting that subtle modulations of ArcA function and knockouts of rpoS are largely responsible for the metabolic shifts leading to adaptation. These mutations allow a higher order metabolic selection that eliminates epistatic bottlenecks, which could occur when many changes would be required. Proteomic and carbohydrate analysis of adapting E. coli populations revealed an up-regulation of enzymes associated with the TCA cycle and amino acid metabolism, and an increase in the secretion of putrescine. The overall effect of adaptation across populations is to redirect and efficiently utilize uptake and catabolism of abundant amino acids. Concomitantly, there is a pronounced spread of more ecologically limited strains that results from specialization through metabolic erosion. 
Remarkably, the global regulators arcA and rpoS can provide a “one-step” mechanism of adaptation to a novel environment, which highlights the importance of global resource management as a powerful strategy for adaptation.
Author Summary
Changing environmental conditions are the norm in biology. However, understanding adaptation to complex environments presents many challenges. For example, adaptation to resource-rich environments can potentially have many successful evolutionary trajectories to increased fitness. Even in conditions of plenty, the utilization of numerous but novel resources can require multiple mutations before a benefit is accrued. We evolved two bacterial species isolated from the gut of healthy humans in two different, resource-rich media commonly used in the laboratory. We anticipated that under weak selection the population would evolve tremendous genetic diversity. Despite such a complex genetic background, we were able to identify a strong degree of parallel evolution, and using a combination of population proteomic and population genomic approaches, we show that two global regulators, arcA and rpoS, are the principal targets of selection. Up-regulation of the different metabolic pathways that are controlled by these global regulators, in combination with up-regulation of transporters that transport nutrients into the cell, revealed increased use of the novel resources. Thus, global regulators can provide a one-step model to shift metabolism efficiently and rapidly reprogram the metabolic profile of the cell.
PMCID: PMC4263409  PMID: 25501822
2.  Computational approaches to species phylogeny inference and gene tree reconciliation 
Trends in ecology & evolution  2013;28(12):10.1016/j.tree.2013.09.004.
An intricate relationship exists between gene trees and species phylogenies, due to evolutionary processes that act on the genes within and across the branches of the species phylogeny. From an analytical perspective, gene trees serve as character states for inferring accurate species phylogenies, and species phylogenies serve as a backdrop against which gene trees are contrasted for elucidating evolutionary processes and parameters. In a 1997 paper, Maddison discussed this relationship, reviewed the signatures left by three major evolutionary processes on the gene trees, and surveyed parsimony and likelihood criteria for utilizing these signatures to computationally elucidate this relationship. Here, we review progress that has been made on developing computational methods for analyses under these two criteria, and survey remaining challenges.
PMCID: PMC3855310  PMID: 24094331
3.  Towards accurate characterization of clonal heterogeneity based on structural variation 
BMC Bioinformatics  2014;15(1):299.
Recent advances in deep digital sequencing have unveiled an unprecedented degree of clonal heterogeneity within a single tumor DNA sample. Resolving such heterogeneity depends on accurate estimation of fractions of alleles that harbor somatic mutations. Unlike substitutions or small indels, structural variants such as deletions, duplications, inversions and translocations involve segments of DNA and are potentially more accurate for allele fraction estimations. However, no systematic method exists that can support such analysis.
In this paper, we present a novel maximum-likelihood method that estimates allele fractions of structural variants integratively from various forms of alignment signals. We develop a tool, BreakDown, to estimate the allele fractions of most structural variants including medium size (from 1 kilobase to 1 megabase) deletions and duplications, and balanced inversions and translocations.
Evaluation based on both simulated and real data indicates that our method systematically enables structural variants for clonal heterogeneity analysis and can greatly enhance the characterization of genomically unstable tumors.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-299) contains supplementary material, which is available to authorized users.
PMCID: PMC4165998  PMID: 25201439
Structural variation; Clonal heterogeneity; Variant allele fraction
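As a rough illustration of maximum-likelihood allele fraction estimation, the sketch below pools read counts from several alignment signals under a simple binomial model. This is an assumption made for illustration only, not the BreakDown model: the signal types, the function names, and the choice to treat every signal as an independent binomial draw sharing one allele fraction are all hypothetical.

```python
from math import comb, log


def allele_fraction_mle(signals):
    """Pooled binomial estimate of the variant allele fraction.

    signals: list of (supporting_reads, total_reads) pairs, one pair per
    hypothetical alignment signal. Assumes all signals share one fraction.
    """
    support = sum(k for k, _ in signals)
    total = sum(n for _, n in signals)
    if total == 0:
        raise ValueError("no reads observed")
    return support / total


def log_likelihood(f, signals):
    """Joint binomial log-likelihood of fraction f (valid for 0 < f < 1)."""
    ll = 0.0
    for k, n in signals:
        ll += log(comb(n, k)) + k * log(f) + (n - k) * log(1 - f)
    return ll


# Two hypothetical signals: 12 of 40 reads and 8 of 20 reads support the variant.
signals = [(12, 40), (8, 20)]
estimate = allele_fraction_mle(signals)
```

Under these assumptions the pooled ratio is the value that maximizes the joint likelihood, which can be checked by comparing `log_likelihood` at the estimate against nearby fractions.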
4.  Parsimonious Inference of Hybridization in the Presence of Incomplete Lineage Sorting 
Systematic Biology  2013;62(5):738-751.
Hybridization plays an important evolutionary role in several groups of organisms. A phylogenetic approach to detect hybridization entails sequencing multiple loci across the genomes of a group of species of interest, reconstructing their gene trees, and taking their differences as indicators of hybridization. However, methods that follow this approach mostly ignore population effects, such as incomplete lineage sorting (ILS). Given that hybridization occurs between closely related organisms, ILS may very well be at play and, hence, must be accounted for in the analysis framework. To address this issue, we present a parsimony criterion for reconciling gene trees within the branches of a phylogenetic network, and a local search heuristic for inferring phylogenetic networks from collections of gene-tree topologies under this criterion. This framework enables phylogenetic analyses while accounting for both hybridization and ILS. Further, we propose two techniques for incorporating information about uncertainty in gene-tree estimates. Our simulation studies demonstrate the good performance of our framework in terms of identifying the location of hybridization events, as well as estimating the proportions of genes that underwent hybridization. Also, our framework shows good performance in terms of efficiency on handling large data sets in our experiments. Further, in analysing a yeast data set, we demonstrate issues that arise when analysing real data sets. Although a probabilistic approach was recently introduced for this problem, and although parsimonious reconciliations have accuracy issues under certain settings, our parsimony framework provides a much more computationally efficient technique for this type of analysis. Our framework now allows for genome-wide scans for hybridization, while also accounting for ILS. [Phylogenetic networks; hybridization; incomplete lineage sorting; coalescent; multi-labeled trees.]
PMCID: PMC3739885  PMID: 23736104
5.  An HMM-Based Comparative Genomic Framework for Detecting Introgression in Eukaryotes 
PLoS Computational Biology  2014;10(6):e1003649.
One outcome of interspecific hybridization and subsequent effects of evolutionary forces is introgression, which is the integration of genetic material from one species into the genome of an individual in another species. The evolution of several groups of eukaryotic species has involved hybridization, and cases of adaptation through introgression have been already established. In this work, we report on PhyloNet-HMM—a new comparative genomic framework for detecting introgression in genomes. PhyloNet-HMM combines phylogenetic networks with hidden Markov models (HMMs) to simultaneously capture the (potentially reticulate) evolutionary history of the genomes and dependencies within genomes. A novel aspect of our work is that it also accounts for incomplete lineage sorting and dependence across loci. Application of our model to variation data from chromosome 7 in the mouse (Mus musculus domesticus) genome detected a recently reported adaptive introgression event involving the rodent poison resistance gene Vkorc1, in addition to other newly detected introgressed genomic regions. Based on our analysis, it is estimated that about 9% of all sites within chromosome 7 are of introgressive origin (these cover about 13 Mbp of chromosome 7, and over 300 genes). Further, our model detected no introgression in a negative control data set. We also found that our model accurately detected introgression and other evolutionary processes from synthetic data sets simulated under the coalescent model with recombination, isolation, and migration. Our work provides a powerful framework for systematic analysis of introgression while simultaneously accounting for dependence across sites, point mutations, recombination, and ancestral polymorphism.
Author Summary
Hybridization is the mating between individuals from two different species. While hybridization introduces genetic material into a host genome, this genetic material may be transient and is purged from the population within a few generations after hybridization. However, in other cases, the introduced genetic material persists in the population—a process known as introgression—and can have significant evolutionary implications. In this paper, we introduce a novel method for detecting introgression in genomes using a comparative genomic approach. The method scans multiple aligned genomes for signatures of introgression by incorporating phylogenetic networks and hidden Markov models. The method allows for teasing apart true signatures of introgression from spurious ones that arise due to population effects and resemble those of introgression. Using the new method, we analyzed two sets of variation data from chromosome 7 in mouse genomes. The method detected previously reported introgressed regions as well as new ones in one of the data sets. In the other data set, which was selected as a negative control, the method detected no introgression. Furthermore, our method accurately detected introgression in simulated evolutionary scenarios and accurately inferred related population genetic quantities. Our method enables systematic comparative analyses of genomes where introgression is suspected, and can work with genome-wide data.
PMCID: PMC4055573  PMID: 24922281
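To make the hidden Markov model idea concrete, here is a minimal sketch of Viterbi decoding over two hidden states. It is not PhyloNet-HMM: the state names, the per-site observations ("agree" or "conflict" with the species tree), and all probabilities are hypothetical values chosen for illustration, and the real framework derives its states from a phylogenetic network.

```python
from math import log


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path, computed in log space."""
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this site.
            prev, score = max(
                ((p, best[-1][p] + log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score + log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical parameters: neighboring sites tend to share a state.
states = ("species", "introgressed")
start_p = {"species": 0.9, "introgressed": 0.1}
trans_p = {
    "species": {"species": 0.9, "introgressed": 0.1},
    "introgressed": {"species": 0.1, "introgressed": 0.9},
}
emit_p = {
    "species": {"agree": 0.9, "conflict": 0.1},
    "introgressed": {"agree": 0.2, "conflict": 0.8},
}
sites = ["agree", "agree", "conflict", "conflict", "conflict", "agree", "agree"]
path = viterbi(sites, states, start_p, trans_p, emit_p)
```

The transition probabilities are what capture dependence across neighboring sites: a single conflicting site is less likely to be labeled introgressed than a run of them.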
6.  Mapping Network Motif Tunability and Robustness in the Design of Synthetic Signaling Circuits 
PLoS ONE  2014;9(3):e91743.
Cellular networks are highly dynamic in their function, yet evolutionarily conserved in their core network motifs or topologies. Understanding functional tunability and robustness of network motifs to small perturbations in function and structure is vital to our ability to synthesize controllable circuits. In establishing core sets of network motifs, we selected topologies that are overrepresented in mammalian networks, including the linear, feedback, feed-forward, and bifan circuits. Static and dynamic tunability of network motifs were defined as the motif ability to respectively attain steady-state or transient outputs in response to pre-defined input stimuli. Detailed computational analysis suggested that static tunability is insensitive to the circuit topology, since all of the motifs displayed similar ability to attain predefined steady-state outputs in response to constant inputs. Dynamic tunability, in contrast, was tightly dependent on circuit topology, with some motifs performing superiorly in achieving observed time-course outputs. Finally, we mapped dynamic tunability onto motif topologies to determine robustness of motif structures to changes in topology and identify design principles for the rational assembly of robust synthetic networks.
PMCID: PMC3958390  PMID: 24642504
7.  Modeling Integrated Cellular Machinery Using Hybrid Petri-Boolean Networks 
PLoS Computational Biology  2013;9(11):e1003306.
The behavior and phenotypic changes of cells are governed by a cellular circuitry that represents a set of biochemical reactions. Based on biological functions, this circuitry is divided into three types of networks, each encoding a major biological process: signal transduction, transcription regulation, and metabolism. This division has generally tamed the computational complexity of dealing with the entire system, allowed for the use of modeling techniques that are specific to each of the components, and separated the different time scales at which reactions in each of the three networks occur. Nonetheless, with this division comes a loss of the information and power needed to elucidate certain cellular phenomena. Within the cell, these three types of networks work in tandem, and each produces signals and/or substances that are used by the others to process information and operate normally. Therefore, computational techniques for modeling integrated cellular machinery are needed. In this work, we propose an integrated hybrid model (IHM) that combines Petri nets and Boolean networks to model integrated cellular networks. Coupled with a stochastic simulation mechanism, the model simulates the dynamics of the integrated network, and can be perturbed to generate testable hypotheses. Our model is qualitative, is mostly built upon knowledge from the literature, and requires fine-tuning of very few parameters. We validated our model on two systems: the transcriptional regulation of glucose metabolism in human cells, and cellular osmoregulation in S. cerevisiae. The model produced results that are in very good agreement with experimental data, and produced valid hypotheses. The abstract nature of our model and the ease of its construction make it a very good candidate for modeling integrated networks from qualitative data.
The results it produces can guide the practitioner to zoom into components and interconnections and investigate them using more detailed mathematical models.
Author Summary
Within the cell of an organism, three networks (signaling, transcriptional, and metabolic) are always at work to determine the response of the cell to signals from its environment, and consequently, its fate. Evidence from experimental studies is painting a picture of complex crosstalk among these networks. Thus, while a wide array of computational techniques exists for analyzing each of these network types, there is a clear need for new modeling techniques that allow for simultaneously analyzing integrated networks, which combine elements from all three networks. Here, we provide a step towards achieving this task by combining two popular modeling techniques, Petri nets and Boolean networks, to produce an integrated hybrid model. We demonstrate the accuracy and utility of this model on two biological systems: transcriptional regulation of glucose metabolism in human cells, and cellular osmoregulation in yeast.
PMCID: PMC3820535  PMID: 24244124
8.  Evolution After Whole-Genome Duplication: A Network Perspective 
G3: Genes|Genomes|Genetics  2013;3(11):2049-2057.
Gene duplication plays an important role in the evolution of genomes and interactomes. Elucidating how evolution after gene duplication interplays at the sequence and network level is of great interest. In this work, we analyze a data set of gene pairs that arose through whole-genome duplication (WGD) in yeast. All these pairs have the same duplication time, making them ideal for evolutionary investigation. We investigated the interplay between evolution after WGD at the sequence and network levels and correlated these two levels of divergence with gene expression and fitness data. We find that molecular interactions involving WGD genes evolve at rates that are three orders of magnitude slower than the rates of evolution of the corresponding sequences. Furthermore, we find that divergence of WGD pairs correlates strongly with gene expression and fitness data. Because of the role of gene duplication in determining redundancy in biological systems and particularly at the network level, we investigated the role of interaction networks in elucidating the evolutionary fate of duplicated genes. We find that gene neighborhoods in interaction networks provide a mechanism for inferring these fates, and we developed an algorithm for achieving this task. Further epistasis analysis of WGD pairs categorized by their inferred evolutionary fates demonstrated the utility of these techniques. Finally, we find that WGD pairs and other pairs of paralogous genes of small-scale duplication origin share similar properties, giving good support for generalizing our results from WGD pairs to evolution after gene duplication in general.
PMCID: PMC3815064  PMID: 24048644
whole-genome duplication; protein networks; yeast; duplication rate
9.  Fast algorithms and heuristics for phylogenomics under ILS and hybridization 
BMC Bioinformatics  2013;14(Suppl 15):S6.
Phylogenomic analyses involving whole-genome or multi-locus data often entail dealing with incongruent gene trees. In this paper, we consider two causes of such incongruence, namely, incomplete lineage sorting (ILS) and hybridization, and consider both parsimony and probabilistic criteria for dealing with them.
Under the assumption of ILS, computing the probability of a gene tree given a species tree is a very hard problem. We present a heuristic for speeding up the computation, and demonstrate how it scales up computations to data sizes that are not feasible to analyze using current techniques, while achieving very good accuracy. Further, under the assumption of both ILS and hybridization, computing the probability of a gene tree and parsimoniously reconciling it with a phylogenetic network are both very hard problems. We present two exact algorithms for these two problems that speed up existing techniques significantly and enable analyses of much larger data sets than is currently feasible.
Our heuristics and algorithms enable phylogenomic analyses of larger (in terms of numbers of taxa) data sets than is currently feasible. Further, our methods account for ILS and hybridization, thus allowing analyses of reticulate evolutionary histories.
PMCID: PMC3852049  PMID: 24564257
10.  An Evaluation of Methods for Inferring Boolean Networks from Time-Series Data 
PLoS ONE  2013;8(6):e66031.
Regulatory networks play a central role in cellular behavior and decision making. Learning these regulatory networks is a major task in biology, and devising computational methods and mathematical models for this task is a major endeavor in bioinformatics. Boolean networks have been used extensively for modeling regulatory networks. In this model, the state of each gene can be either ‘on’ or ‘off’, and the next state of a gene is updated, synchronously or asynchronously, according to a Boolean rule that is applied to the current state of the entire system. Inferring a Boolean network from a set of experimental data entails two main steps: first, the experimental time-series data are discretized into Boolean trajectories, and then, a Boolean network is learned from these Boolean trajectories. In this paper, we consider three methods for data discretization, including a new one we propose, and three methods for learning Boolean networks, and study the performance of all nine possible combinations on four regulatory systems of varying dynamic complexity. We find that employing the right combination of methods for data discretization and network learning results in Boolean networks that capture the dynamics well and provide predictive power. Our findings are in contrast to a recent survey that placed Boolean networks on the low end of the “faithfulness to biological reality” and “ability to model dynamics” spectra. Further, contrary to the common argument in favor of Boolean networks, we find that a relatively large number of time points in the time-series data is required to learn good Boolean networks for certain data sets. Last but not least, while methods have been proposed for inferring Boolean networks, as discussed above, publicly available implementations thereof are still missing. Here, we make our implementation of the methods available publicly in open source at
PMCID: PMC3689729  PMID: 23805196
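The two ingredients described in this abstract, discretizing a time series into Boolean values and updating all genes synchronously from the current state, can be sketched as follows. The mean-threshold discretization and the three-gene rule set are hypothetical choices for illustration; they are not the discretization or learning methods evaluated in the paper.

```python
def discretize(series, threshold=None):
    """Binarize a time series; defaults to thresholding at the mean (assumed scheme)."""
    if threshold is None:
        threshold = sum(series) / len(series)
    return [1 if x > threshold else 0 for x in series]


def synchronous_step(state, rules):
    """Every gene reads the same current state, then all genes update together."""
    return {gene: int(rule(state)) for gene, rule in rules.items()}


def trajectory(state, rules, steps):
    """Return the list of states visited, starting with the initial state."""
    states = [dict(state)]
    for _ in range(steps):
        states.append(synchronous_step(states[-1], rules))
    return states


# Hypothetical three-gene network: c represses a, a activates b, a AND b activate c.
rules = {
    "a": lambda s: not s["c"],
    "b": lambda s: s["a"],
    "c": lambda s: s["a"] and s["b"],
}
start = {"a": 1, "b": 0, "c": 0}
run = trajectory(start, rules, 5)
```

With these rules the network cycles back to its starting state after five synchronous steps. An asynchronous scheme would instead update one gene at a time and can visit different states.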
11.  Boosting forward-time population genetic simulators through genotype compression 
BMC Bioinformatics  2013;14:192.
Forward-time population genetic simulations play a central role in deriving and testing evolutionary hypotheses. Such simulations may be data-intensive, depending on the settings to the various parameters controlling them. In particular, for certain settings, the data footprint may quickly exceed the memory of a single compute node.
We develop a novel and general method for addressing the memory issue inherent in forward-time simulations by compressing and decompressing, in real-time, active and ancestral genotypes, while carefully accounting for the time overhead. We propose a general graph data structure for compressing the genotype space explored during a simulation run, along with efficient algorithms for constructing and updating compressed genotypes which support both mutation and recombination. We tested the performance of our method in very large-scale simulations. Results show that our method not only scales well, but that it also overcomes memory issues that would cripple existing tools.
As evolutionary analyses are being increasingly performed on genomes, pathways, and networks, particularly in the era of systems biology, scaling population genetic simulators to handle large-scale simulations is crucial. We believe our method offers a significant step in that direction. Further, the techniques we provide are generic and can be integrated with existing population genetic simulators to boost their performance in terms of memory usage.
PMCID: PMC3700844  PMID: 23763838
12.  Population Genomics in Bacteria: A Case Study of Staphylococcus aureus 
Molecular Biology and Evolution  2011;29(2):797-809.
We analyzed the genome-wide pattern of single nucleotide polymorphisms (SNPs) in a sample with 12 strains of Staphylococcus aureus. Population structure of S. aureus seems to be complex, and the 12 strains were divided into five groups, named A, B, C, D, and E. We conducted a detailed analysis of the topologies of gene genealogies across the genomes and observed a high rate and frequency of tree-shape switching, indicating extensive homologous recombination. Most of the detected recombination occurred in the ancestral population of A, B, and C, whereas there are a number of small regions that exhibit evidence for homologous recombination with a distinct related species. As such regions would contain a number of novel mutations, it is suggested that homologous recombination would play a crucial role to maintain genetic variation within species. In the A-B-C ancestral population, we found multiple lines of evidence that the coalescent pattern is very similar to what is expected in a panmictic population, suggesting that this population is suitable to apply the standard population genetic theories. Our analysis showed that homologous recombination caused a dramatic decay in linkage disequilibrium (LD) and there is almost no LD between SNPs with distance more than 10 kb. Coalescent simulations demonstrated that a high rate of homologous recombination—a relative rate of 0.6 to the mutation rate with an average tract length of about 10 kb—is required to produce patterns similar to those observed in the S. aureus genomes. Our results call for more research into the evolutionary role of homologous recombination in bacterial populations.
PMCID: PMC3350317  PMID: 22009061
population genomics; bacteria; homologous recombination; demography; linkage disequilibrium
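For readers unfamiliar with linkage disequilibrium, the sketch below computes two standard pairwise measures, D and r squared, from phased haplotypes at two biallelic sites. The 0/1 allele coding, the function name, and the toy haplotypes are assumptions for illustration and are not taken from the S. aureus data.

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r_squared) for two biallelic sites.

    haplotypes: list of (allele_at_site_1, allele_at_site_2) with alleles coded 0/1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at site 1
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at site 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("at least one site is monomorphic")
    return d, d * d / denom


# Toy samples of 12 haplotypes each (hypothetical).
linked = [(1, 1)] * 6 + [(0, 0)] * 6
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)] * 3
d_linked, r2_linked = linkage_disequilibrium(linked)
d_unlinked, r2_unlinked = linkage_disequilibrium(unlinked)
```

Completely associated sites give an r squared of 1, and sites whose alleles combine at random give 0. Recombination between sites pushes the value toward 0, which is the decay with distance that the abstract reports.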
13.  Inference of reticulate evolutionary histories by maximum likelihood: the performance of information criteria 
BMC Bioinformatics  2012;13(Suppl 19):S12.
Maximum likelihood has been widely used for over three decades to infer phylogenetic trees from molecular data. When reticulate evolutionary events occur, several genomic regions may have conflicting evolutionary histories, and a phylogenetic network may provide a more adequate model for representing the evolutionary history of the genomes or species. A maximum likelihood (ML) model has been proposed for this case and accounts for both mutation within a genomic region and reticulation across the regions. However, the performance of this model in terms of inferring information about reticulate evolution and properties that affect this performance have not been studied.
In this paper, we study the effect of the evolutionary diameter and height of a reticulation event on its identifiability under ML. We find both of them, particularly the diameter, have a significant effect. Further, we find that the number of genes (which can be generalized to the concept of "non-recombining genomic regions") that are transferred across a reticulation edge affects its detectability. Last but not least, a fundamental challenge with phylogenetic networks is that they allow an arbitrary level of complexity, giving rise to the model selection problem. We investigate the performance of two information criteria, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), for addressing this problem. We find that BIC performs well in general for controlling the model complexity and preventing ML from grossly overestimating the number of reticulation events.
Our results demonstrate that BIC provides a good framework for inferring reticulate evolutionary histories. Nevertheless, the results call for caution when interpreting the accuracy of the inference particularly for data sets with particular evolutionary features.
PMCID: PMC3526433  PMID: 23281614
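The two information criteria named above have standard definitions, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of free parameters, n the number of observations, and L the maximized likelihood. The sketch below applies them to three candidate models. The log-likelihoods, parameter counts, and the value of n are made-up numbers, and how parameters and observations are counted for a phylogenetic network is left as an assumption.

```python
from math import log


def aic(log_likelihood, num_params):
    """Akaike Information Criterion (lower is better)."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood, num_params, num_observations):
    """Bayesian Information Criterion (lower is better)."""
    return num_params * log(num_observations) - 2 * log_likelihood


# Hypothetical candidates: (name, maximized log-likelihood, number of parameters).
candidates = [
    ("tree", -1050.0, 10),
    ("one reticulation", -1040.0, 13),
    ("two reticulations", -1036.0, 16),
]
n = 100  # hypothetical number of observations

best_by_aic = min(candidates, key=lambda c: aic(c[1], c[2]))
best_by_bic = min(candidates, key=lambda c: bic(c[1], c[2], n))
```

In this made-up example AIC prefers the model with two reticulations, while BIC, whose per-parameter penalty ln n exceeds 2 once n is larger than about 7, prefers the model with one.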
14.  Algorithms for MDC-Based Multi-Locus Phylogeny Inference: Beyond Rooted Binary Gene Trees on Single Alleles 
Journal of Computational Biology  2011;18(11):1543-1559.
One of the criteria for inferring a species tree from a collection of gene trees, when gene tree incongruence is assumed to be due to incomplete lineage sorting (ILS), is Minimize Deep Coalescence (MDC). Exact algorithms for inferring the species tree from rooted, binary trees under MDC were recently introduced. Nevertheless, in phylogenetic analyses of biological data sets, estimated gene trees may differ from true gene trees, be incompletely resolved, and not necessarily rooted. In this article, we propose new MDC formulations for the cases where the gene trees are unrooted/binary, rooted/non-binary, and unrooted/non-binary. Further, we prove structural theorems that allow us to extend the algorithms for the rooted/binary gene tree case to these cases in a straightforward manner. In addition, we devise MDC-based algorithms for cases when multiple alleles per species may be sampled. We study the performance of these methods in coalescent-based computer simulations.
PMCID: PMC3216099  PMID: 22035329
algorithms; coalescence; dynamic programming; graph theory; phylogenetic trees
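The quantity that the MDC criterion minimizes can be illustrated by scoring one gene tree against one candidate species tree. The sketch below covers only the simplest setting, rooted trees with a single allele per species, written as nested tuples with species names at the leaves. The representation and function names are hypothetical, and the code scores a given species tree rather than searching for the best one.

```python
def leaves(tree):
    """Set of leaf labels under a node (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """Leaf sets below every non-root node, one per branch of the tree."""
    if isinstance(tree, str):
        return []
    found = []
    for child in tree:
        found.append(leaves(child))
        found.extend(clusters(child))
    return found


def lineages_in(gene_tree, cluster):
    """Count maximal gene-tree clades whose leaves all fall inside the cluster."""
    if leaves(gene_tree) <= cluster:
        return 1
    if isinstance(gene_tree, str):
        return 0
    return sum(lineages_in(child, cluster) for child in gene_tree)


def extra_lineages(species_tree, gene_tree):
    """Sum over species-tree branches of (lineages leaving the branch - 1)."""
    return sum(
        max(lineages_in(gene_tree, cluster) - 1, 0)
        for cluster in clusters(species_tree)
    )


species = (("A", "B"), ("C", "D"))
cost_same = extra_lineages(species, (("A", "B"), ("C", "D")))
cost_conflict = extra_lineages(species, (("A", "C"), ("B", "D")))
```

A gene tree that matches the species tree costs nothing. In the conflicting example neither A and B nor C and D coalesce within their ancestral branch, so each of those two branches carries one extra lineage.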
15.  Convergent evolution of modularity in metabolic networks through different community structures 
It has been reported that the modularity of metabolic networks of bacteria is closely related to the variability of their living habitats. However, given the dependency of the modularity score on the community structure, it remains unknown whether organisms achieve certain modularity via similar or different community structures.
In this work, we studied the relationship between similarities in modularity scores and similarities in community structures of the metabolic networks of 1021 species. Both similarities are then compared against the genetic distances. We revisited the association between modularity and variability of the microbial living environments and extended the analysis to other aspects of their life style such as temperature and oxygen requirements. We also tested both topological and biological intuition of the community structures identified and investigated the extent of their conservation with respect to the taxonomy.
We find that similar modularities are realized by different community structures. We find that such convergent evolution of modularity is closely associated with the number of (distinct) enzymes in the organism’s metabolome, a consequence of different life styles of the species. We find that the order of modularity is the same as the order of the number of the enzymes under the classification based on the temperature preference but not on the oxygen requirement. Besides, inspection of modularity-based communities reveals that these communities are graph-theoretically meaningful yet not reflective of specific biological functions. From an evolutionary perspective, we find that the community structures are conserved only at the level of kingdoms. Our results call for more investigation into the interplay between evolution and modularity: how evolution shapes modularity, and how modularity affects evolution (mainly in terms of fitness and evolvability). Further, our results call for exploring new measures of modularity and network communities that better correspond to functional categorizations.
PMCID: PMC3534581  PMID: 22974099
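Because the modularity score depends on the community structure it is computed for, a small worked example may help. The sketch below uses Newman's modularity for an undirected, unweighted graph without self-loops, Q = sum over communities of (e_c / m - (d_c / 2m)^2), where m is the number of edges, e_c the number of edges inside community c, and d_c the total degree of its nodes. The abstract does not say which modularity definition the study used, so this choice, like the toy graph, is an assumption.

```python
def modularity(edges, community_of):
    """Newman modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = {}    # edges with both endpoints in the same community
    degree_sum = {}  # total degree of the nodes in each community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Toy graph: two triangles joined by a single edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: "x", 2: "x", 3: "x", 4: "y", 5: "y", 6: "y"}
one_group = {node: "all" for node in range(1, 7)}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the toy graph into its two triangles gives Q = 5/14, while placing every node in one community gives Q = 0. Different graphs and partitions can yield the same Q, which is the sense in which similar modularity can arise from different community structures.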
16.  Gene Duplicability-Connectivity-Complexity across Organisms and a Neutral Evolutionary Explanation 
PLoS ONE  2012;7(9):e44491.
Gene duplication has long been acknowledged by biologists as a major evolutionary force shaping genomic architectures and characteristics across the Tree of Life. Much research has been conducted on elucidating the fate of duplicated genes in a variety of organisms, as well as factors that affect a gene’s duplicability, that is, the tendency of certain genes to retain more duplicates than others. In particular, two studies have looked at the correlation between a gene’s duplicability and its degree in a protein-protein interaction network in yeast, mouse, and human, and another has looked at the correlation between a gene’s duplicability and its complexity (length, number of domains, etc.) in yeast. In this paper, we extend these studies to six species, and two trends emerge. There is an increase in the duplicability-connectivity correlation that agrees with the increase in the genome size as well as the phylogenetic relationship of the species. Further, the duplicability-complexity correlation seems to be constant across the species. We argue that the observed correlations can be explained by neutral evolutionary forces acting on the genomic regions containing the genes. For the duplicability-connectivity correlation, we show through simulations that an increasing trend can be obtained by adjusting parameters to approximate genomic characteristics of the respective species. Our results call for more research into factors, adaptive and non-adaptive alike, that determine a gene’s duplicability.
PMCID: PMC3439388  PMID: 22984517
17.  ncDNA and drift drive binding site accumulation 
The number of transcription factor binding sites (TFBS) in an organism’s genome correlates positively with the complexity of the organism’s regulatory network. However, the manner in which TFBS arise and accumulate in genomes, and the effects of regulatory network complexity on the organism’s fitness, are far from understood. The availability of TFBS data from many organisms provides an opportunity to explore these issues, particularly from an evolutionary perspective.
We analyzed TFBS data from five model organisms (E. coli K12, S. cerevisiae, C. elegans, D. melanogaster, and A. thaliana) and found a positive correlation between the amount of non-coding DNA (ncDNA) in the organism’s genome and regulatory complexity. Based on this finding, we hypothesize that the amount of ncDNA, combined with the population size, can explain the patterns of regulatory complexity across organisms. To test this hypothesis, we devised a genome-based regulatory pathway model and subjected it to the forces of evolution through population genetic simulations. The results support our hypothesis, showing that neutral evolutionary forces alone can explain TFBS patterns, and that selection on the regulatory network function does not alter this finding.
The cis-regulome is not a clean functional network crafted by adaptive forces alone, but instead a data source filled with the noise of non-adaptive forces. From a regulatory perspective, this evolutionary noise manifests as complexity at both the binding site and pathway levels, which has significant implications for many directions in microbiology, genetics, and synthetic biology.
PMCID: PMC3556125  PMID: 22935101
18.  The strength of chemical linkage as a criterion for pruning metabolic graphs 
Bioinformatics  2011;27(14):1957-1963.
Motivation: A metabolic graph represents the connectivity patterns of a metabolic system, and provides a powerful framework within which the organization of metabolic reactions can be analyzed and elucidated. A common practice is to prune (i.e. remove nodes and edges) the metabolic graph prior to any analysis in order to eliminate confounding signals from the representation. Currently, this pruning process is carried out in an ad hoc fashion, resulting in discrepancies and ambiguities across studies.
Results: We propose a biochemically informative criterion, the strength of chemical linkage (SCL), for a systematic pruning of metabolic graphs. By analyzing the metabolic graph of Escherichia coli, we show that thresholding SCL is powerful in selecting the conventional pathways' connectivity out of the raw network connectivity when the network is restricted to the reactions collected from these pathways. Further, we argue that the root of ambiguity in pruning metabolic graphs is in the continuity of the amount of chemical content that can be conserved in reaction transformation patterns. Finally, we demonstrate how biochemical pathways can be inferred efficiently if the search procedure is guided by SCL.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3129522  PMID: 21551141
19.  Kinome siRNA-phosphoproteomic screen identifies networks regulating AKT signaling 
Oncogene  2011;30(45):4567-4577.
To identify regulators of intracellular signaling, we targeted 541 kinases and kinase-related molecules with siRNAs and determined their effects on signaling with a functional proteomics reverse phase protein array (RPPA) platform assessing 42 phospho and total proteins. The kinome-wide screen demonstrated a strong inverse correlation between phosphorylation of AKT and MAPK, with 115 genes that, when targeted by siRNAs, showed opposite effects on MAPK and AKT phosphorylation. Network-based analysis identified the MAPK subnetwork of genes, along with p70S6K and FRAP1, as the most prominent targets that increased phosphorylation of AKT, a key regulator of cell survival. The regulatory loops induced by the MAPK pathway are dependent on TSC2 but demonstrate a lesser dependence on p70S6K than the previously identified FRAP1 feedback loop. The siRNA screen also revealed novel bi-directionality in the AKT and GSK3 interaction, whereby genetic ablation of GSK3 significantly blocks AKT phosphorylation, an unexpected observation as GSK3 has only been predicted to be downstream of AKT. This method uncovered novel modulators of AKT phosphorylation and facilitated the mapping of regulatory loops.
PMCID: PMC3175328  PMID: 21666717
AKT; MAPK; proteomics; signaling networks; siRNA
20.  The Probability of a Gene Tree Topology within a Phylogenetic Network with Applications to Hybridization Detection 
PLoS Genetics  2012;8(4):e1002660.
Gene tree topologies have proven to be a powerful data source for various tasks, including species tree inference and species delimitation. Consequently, methods for computing probabilities of gene trees within species trees have been developed and widely used in probabilistic inference frameworks. All these methods assume an underlying multispecies coalescent model. However, when reticulate evolutionary events such as hybridization occur, these methods are inadequate, as they do not account for such events. Methods that account for both hybridization and deep coalescence in computing the probability of a gene tree topology currently exist for very limited cases. However, no such methods exist for general cases, owing primarily to the fact that it is currently unknown how to compute the probability of a gene tree topology within the branches of a phylogenetic network. Here we present a novel method for computing the probability of gene tree topologies on phylogenetic networks and demonstrate its application to the inference of hybridization in the presence of incomplete lineage sorting. We reanalyze a Saccharomyces species data set for which multiple analyses had converged on a species tree candidate. Using our method, though, we show that an evolutionary hypothesis involving hybridization in this group has better support than one of strict divergence. A similar reanalysis on a group of three Drosophila species shows that the data are consistent with hybridization. Further, using extensive simulation studies, we demonstrate the power of gene tree topologies for obtaining accurate estimates of branch lengths and hybridization probabilities of a given phylogenetic network. Finally, we discuss identifiability issues with detecting hybridization, particularly in cases that involve extinction or incomplete sampling of taxa.
Author Summary
Species trees depict how species split and diverge. Within the branches of a species tree, gene trees, which depict the evolutionary histories of different genomic regions in the species, grow. Evolutionary analyses of the genomes of closely related organisms have highlighted the phenomenon that gene trees may disagree with each other as well as with the species tree that contains them due to deep coalescence. Furthermore, for several groups of organisms, hybridization plays an important role in their evolution and diversification. This evolutionary event also results in gene tree incongruence and gives rise to a species phylogeny that is a network. Thus, inferring the evolutionary histories of groups of organisms where hybridization is known, or suspected, to play an evolutionary role requires dealing simultaneously with hybridization and other sources of gene tree incongruence. Currently, no methods exist for doing this with general scenarios of hybridization. In this paper, we propose the first method for this task and demonstrate its performance. We revisit the analysis of a set of yeast species and another of Drosophila species, and show that evolutionary histories involving hybridization have higher support than the strictly diverging evolutionary histories estimated when not incorporating hybridization in the analysis.
PMCID: PMC3330115  PMID: 22536161
21.  Coalescent Histories on Phylogenetic Networks and Detection of Hybridization Despite Incomplete Lineage Sorting 
Systematic Biology  2011;60(2):138-149.
Analyses of the increasingly available genomic data continue to reveal the extent of hybridization and its role in the evolutionary diversification of various groups of species. We show, through extensive coalescent-based simulations of multilocus data sets on phylogenetic networks, how divergence times before and after hybridization events can result in incomplete lineage sorting with gene tree incongruence signatures identical to those exhibited by hybridization. Evolutionary analysis of such data under the assumption of a species tree model can miss all hybridization events, whereas analysis under the assumption of a species network model would grossly overestimate hybridization events. These issues necessitate a paradigm shift in evolutionary analysis under these scenarios, from a model that assumes a priori a single source of gene tree incongruence to one that integrates multiple sources in a unifying framework. We propose a framework of coalescence within the branches of a phylogenetic network and show how this framework can be used to detect hybridization despite incomplete lineage sorting. We apply the model to simulated data and show that the signature of hybridization can be revealed as long as the interval between the divergence times of the species involved in hybridization is not too small. We reanalyze a data set of 106 loci from 7 in-group Saccharomyces species for which a species tree with no hybridization has been reported in the literature. Our analysis supports the hypothesis that hybridization occurred during the evolution of this group, explaining a large amount of the incongruence in the data. Our findings show that an integrative approach to gene tree incongruence and its reconciliation is needed. Our framework will help in systematically analyzing genomic data for the occurrence of hybridization and elucidating its evolutionary role. [Coalescent history; incomplete lineage sorting; hybridization; phylogenetic network.]
PMCID: PMC3167682  PMID: 21248369
22.  Properties of metabolic graphs: biological organization or representation artifacts? 
BMC Bioinformatics  2011;12:132.
Standard graphs, where each edge links two nodes, have been extensively used to represent the connectivity of metabolic networks. It is based on this representation that properties of metabolic networks, such as hierarchical and small-world structures, have been elucidated and null models have been proposed to derive biological organization hypotheses. However, these graphs provide a simplistic model of a metabolic network's connectivity map, since metabolic reactions often involve more than two reactants. In other words, this map is better represented as a hypergraph. Consequently, a question that naturally arises in this context is whether these properties truly reflect biological organization or are merely an artifact of the representation.
In this paper, we address this question by reanalyzing topological properties of the metabolic network of Escherichia coli under a hypergraph representation, as well as standard graph abstractions. We find that when clustering is properly defined for hypergraphs and subsequently used to analyze metabolic networks, the scaling of clustering, and thus the hierarchical structure hypothesis in metabolic networks, become unsupported. Moreover, we find that incorporating the distribution of reaction sizes into the null model further weakens the support for the scaling patterns.
These results combined suggest that the reported scaling of the clustering coefficients in the metabolic graphs and its specific power coefficient may be an artifact of the graph representation, and may not be supported when biochemical reactions are atomically treated as hyperedges. This study highlights the implications of the way a biological system is represented and the null model employed on the elucidated properties, along with their support, of the system.
PMCID: PMC3098788  PMID: 21542923
23.  Bootstrap-based Support of HGT Inferred by Maximum Parsimony 
Maximum parsimony is one of the most commonly used criteria for reconstructing phylogenetic trees. Recently, Nakhleh and co-workers extended this criterion to enable reconstruction of phylogenetic networks, and demonstrated its application to detecting reticulate evolutionary relationships. However, one of the major problems with this extension has been that it favors more complex evolutionary relationships over simpler ones, thus having the potential for overestimating the amount of reticulation in the data. An ad hoc solution to this problem that has been used entails inspecting the improvement in the parsimony length as more reticulation events are added to the model, and stopping when the improvement is below a certain threshold.
In this paper, we address this problem in a more systematic way, by proposing a nonparametric bootstrap-based measure of support of inferred reticulation events, and using it to determine the number of those events, as well as their placements. A number of samples is generated from the given sequence alignment, and reticulation events are inferred based on each sample. Finally, the support of each reticulation event is quantified based on the inferences made over all samples.
We have implemented our method in the publicly available NEPAL software tool and studied its performance on both biological and simulated data sets. While our studies show very promising results, they also highlight issues that are inherently challenging when applying the maximum parsimony criterion to detect reticulate evolution.
PMCID: PMC2874802  PMID: 20444286
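The bootstrap procedure summarized in entry 23 (resample the alignment, infer events on each sample, tally support) can be sketched roughly as follows. This is a minimal illustration, not the NEPAL implementation: the alignment is assumed to be a dict mapping taxon names to equal-length sequences, and `infer_events` is a hypothetical caller-supplied function that returns the reticulation events inferred from one alignment.

```python
import random
from collections import Counter


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (nonparametric bootstrap)."""
    n_cols = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same column indices are used for every taxon, so columns stay intact.
    return {taxon: "".join(seq[c] for c in cols) for taxon, seq in alignment.items()}


def bootstrap_support(alignment, infer_events, n_samples=100, seed=0):
    """Fraction of bootstrap replicates in which each inferred event appears.

    `infer_events` is a hypothetical callback (an assumption of this sketch):
    it takes an alignment and returns an iterable of hashable event labels.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        replicate = bootstrap_alignment(alignment, rng)
        # Count each event at most once per replicate.
        counts.update(set(infer_events(replicate)))
    return {event: c / n_samples for event, c in counts.items()}
```

Events whose support falls below a chosen cutoff could then be discarded, which is one way to decide how many reticulation events to keep; the cutoff itself is a modeling choice that the abstract does not specify.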
24.  Species Tree Inference by Minimizing Deep Coalescences 
PLoS Computational Biology  2009;5(9):e1000501.
In a 1997 seminal paper, W. Maddison proposed minimizing deep coalescences, or MDC, as an optimization criterion for inferring the species tree from a set of incongruent gene trees, assuming the incongruence is exclusively due to lineage sorting. In a subsequent paper, Maddison and Knowles provided and implemented a search heuristic for optimizing the MDC criterion, given a set of gene trees. However, the heuristic is not guaranteed to compute optimal solutions, and its hill-climbing search makes it slow in practice. In this paper, we provide two exact solutions to the problem of inferring the species tree from a set of gene trees under the MDC criterion. In other words, our solutions are guaranteed to find the tree that minimizes the total number of deep coalescences from a set of gene trees. One solution is based on a novel integer linear programming (ILP) formulation, and another is based on a simple dynamic programming (DP) approach. Powerful ILP solvers, such as CPLEX, make the first solution appealing, particularly for very large-scale instances of the problem, whereas the DP-based solution eliminates dependence on proprietary tools, and its simplicity makes it easy to integrate with other genomic events that may cause gene tree incongruence. Using the exact solutions, we analyze a data set of 106 loci from eight yeast species, a data set of 268 loci from eight Apicomplexan species, and several simulated data sets. We show that the MDC criterion provides very accurate estimates of the species tree topologies, and that our solutions are very fast, thus allowing for the accurate analysis of genome-scale data sets. Further, the efficiency of the solutions allows for quick exploration of sub-optimal solutions, which is important for a parsimony-based criterion such as MDC, as we show.
We show that searching for the species tree in the compatibility graph of the clusters induced by the gene trees may be sufficient in practice, a finding that helps ameliorate the computational requirements of optimization solutions. Further, we study the statistical consistency and convergence rate of the MDC criterion, as well as its optimality in inferring the species tree. Finally, we show how our solutions can be used to identify potential horizontal gene transfer events that may have caused some of the incongruence in the data, thus augmenting Maddison's original framework. We have implemented our solutions in the freely available PhyloNet software package.
Author Summary
Inferring the evolutionary history of a set of species, known as the species tree, is a task of utmost significance in biology and beyond. The traditional approach to accomplishing this task from molecular sequences entails sequencing a gene in the set of species under consideration, reconstructing the gene's evolutionary history, and declaring it to be the species tree. However, recent analyses of multiple gene data sets, made available thanks to advances in sequencing technologies, have indicated that gene trees in the same group of species may disagree with each other, as well as with the species tree. Therefore, the development of methods for inferring the species tree despite such disagreements is imperative.
In this paper, we propose such a method, which seeks the tree that minimizes the amount of disagreement between the input set of gene trees and the inferred one. We have implemented our method and studied its performance, in terms of accuracy and computational efficiency, on two biological data sets and a large number of simulated data sets. Our analyses, of both the biological and synthetic data sets, indicate high accuracy of the method, as well as computationally efficient solutions in practice. Hence, our method makes a good candidate for inferring accurate species trees, despite gene tree disagreements, at a genomic scale.
PMCID: PMC2729383  PMID: 19749978
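To make the MDC criterion of entry 24 concrete, the sketch below scores one candidate species tree against a set of gene trees by counting extra lineages. It is only the scoring step, not the ILP or DP search described in the abstract, and it is not PhyloNet code. Assumptions of this sketch: trees are rooted and written as nested tuples with string leaf names, every gene tree has exactly one leaf per species, and all function names are illustrative.

```python
def leaf_set(tree):
    """Leaves under a tree given as nested tuples with string leaf names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def lineages_leaving(gene_tree, cluster):
    """Number of maximal gene-tree clades whose leaves all lie in `cluster`.

    This equals the number of gene lineages that exit the top of the
    species-tree branch sitting above `cluster`.
    """
    if leaf_set(gene_tree) <= cluster:
        return 1
    if isinstance(gene_tree, str):
        return 0
    return sum(lineages_leaving(child, cluster) for child in gene_tree)


def species_clusters(species_tree):
    """Leaf sets below every non-root branch of the species tree."""
    if isinstance(species_tree, str):
        return []
    clusters = []
    for child in species_tree:
        clusters.append(leaf_set(child))
        clusters.extend(species_clusters(child))
    return clusters


def extra_lineages(species_tree, gene_tree):
    """Deep-coalescence cost of one gene tree: extra lineages summed over branches."""
    return sum(lineages_leaving(gene_tree, c) - 1
               for c in species_clusters(species_tree))


def mdc_cost(species_tree, gene_trees):
    """Total cost of a candidate species tree; MDC seeks the tree minimizing this."""
    return sum(extra_lineages(species_tree, g) for g in gene_trees)
```

For example, with species tree `(("A", "B"), "C")`, a gene tree of the same shape costs 0, while the gene tree `(("A", "C"), "B")` costs 1 because two lineages must leave the branch above A and B.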
25.  GS2: an efficiently computable measure of GO-based similarity of gene sets 
Bioinformatics  2009;25(9):1178-1184.
Motivation: The growing availability of genome-scale datasets has attracted increasing attention to the development of computational methods for automated inference of functional similarities among genes and their products. One class of such methods measures the functional similarity of genes based on their distance in the Gene Ontology (GO). To measure the functional relatedness of a gene set, these measures consider every pair of genes in the set, and the average of all pairwise distances is calculated. However, as more data becomes available and gene sets used for analysis become larger, such pair-based calculation becomes prohibitive.
Results: In this article, we propose GS2 (GO-based similarity of gene sets), a novel GO-based measure of gene set similarity that is computable in linear time in the size of the gene set. The measure quantifies the similarity of the GO annotations among a set of genes by averaging the contribution of each gene's GO terms and their ancestor terms with respect to the GO vocabulary graph. To study the performance of our method, we compared our measure with an established pair-based measure when run on gene sets with varying degrees of functional similarities. In addition to a significant speed improvement, our method produced comparable similarity scores to the established method. Our method is available as a web-based tool and an open-source Python library.
Availability: The web-based tool and open-source Python code are available online.
PMCID: PMC2672633  PMID: 19289444
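The idea behind entry 25, replacing all-pairs comparison with a single pass over the genes, can be illustrated with the sketch below. It is an illustrative stand-in and not the published GS2 formula or library API: the scoring rule (each term contributes the fraction of other genes that share it) is an assumption of this sketch, `parents` is assumed to map each GO term to its parent terms, and the term names used in any example are hypothetical.

```python
from collections import Counter


def with_ancestors(terms, parents):
    """Close a set of GO terms under the parent relation of the GO graph."""
    closed, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def gene_set_similarity(annotations, parents):
    """Similarity of a gene set from shared GO terms, without pairwise loops.

    `annotations` maps gene -> iterable of GO terms. Term counts are tallied
    in one pass over the genes, so the work grows linearly with the number
    of genes (times the size of each gene's ancestor-closed annotation).
    """
    genes = list(annotations)
    if len(genes) < 2:
        raise ValueError("need at least two genes")
    closed = {g: with_ancestors(annotations[g], parents) for g in genes}
    term_counts = Counter(t for terms in closed.values() for t in terms)
    others = len(genes) - 1
    per_gene = [
        # Unannotated genes contribute 0 rather than dividing by zero.
        sum((term_counts[t] - 1) / others for t in terms) / len(terms) if terms else 0.0
        for terms in closed.values()
    ]
    return sum(per_gene) / len(genes)
```

A pair-based measure would instead evaluate a distance for every one of the n(n-1)/2 gene pairs, which is the cost the abstract describes as prohibitive for large gene sets.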
