Abstract
DrML is a software program for inferring evolutionary scenarios from a gene tree and a species tree with speciation time estimates that is based on a general maximum likelihood model. The program implements novel algorithms that efficiently infer most likely scenarios of gene duplication and loss events. Our comparative studies suggest that the general maximum likelihood model provides more credible estimates than standard parsimony reconciliation, especially when speciation times differ significantly. DrML is an open source project written in Python, and along with an on-line manual and sample data sets publicly available.
doi:10.1089/cmb.2013.0078
PMCID: PMC3880059
PMID: 24073895
reconciliation; gene tree; species tree; evolutionary scenario; gene duplication
Abstract
Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D2 statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D2 statistic under the null hypothesis of identically and independently distributed letters have been studied extensively, but no comprehensive study of the D2 distribution for biologically more realistic higher-order Markovian sequences exists. Here we derive exact formulas for the mean and variance of the D2 statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate D2 distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical D2 distribution from the human genome.
doi:10.1089/cmb.2012.0277
PMCID: PMC3880068
PMID: 24160839
Markov chains; sequence analysis; statistical models
Abstract
Next-generation sequencing (NGS) technologies need new methodologies for alternative splicing (AS) analysis. Current computational methods for AS analysis from NGS data are mainly based on aligning short reads against a reference genome, while methods that do not need a reference genome are mostly underdeveloped. In this context, the main developed tools for NGS data focus on de novo transcriptome assembly (Grabherr et al., 2011; Schulz et al., 2012). While these tools are extensively applied for biological investigations and often show intrinsic shortcomings from the obtained results, a theoretical investigation of the inherent computational limits of transcriptome analysis from NGS data, when a reference genome is unknown or highly unreliable, is still missing. On the other hand, we still lack methods for computing the gene structures due to AS events under the above assumptions—a problem that we start to tackle with this article. More precisely, based on the notion of isoform graph (Lacroix et al., 2008), we define a compact representation of gene structures—called splicing graph—and investigate the computational problem of building a splicing graph that is (i) compatible with NGS data and (ii) isomorphic to the isoform graph. We characterize when there is only one representative splicing graph compatible with input data, and we propose an efficient algorithmic approach to compute this graph.
doi:10.1089/cmb.2013.0112
PMCID: PMC3880078
PMID: 24200390
alternative splicing; splicing graph
Abstract
Ontology is widely used in semantic computing and reasoning, and various biomedicine ontologies have become institutionalized to make the heterogeneous knowledge computationally amenable. Relation words, especially verbs, play an important role when describing the interaction between biological entities in molecular function, biological process, and cellular component; however, comprehensive research and analysis are still lacking. In this article, we propose an automatic method to build interaction relation ontology by investigating relation verbs, analyzing the syntactic relation of PubMed abstracts to perform relation vocabulary expansion, and integrating WordNet into our method to construct the hierarchy of relation vocabulary. Five attributes are populated automatically for each word in interaction relation ontology. As a result, the interaction relation ontology is constructed; it contains a total of 963 words and covers the most relation words used in existing methods of proteins interaction relation.
doi:10.1089/cmb.2012.0009
PMCID: PMC3880112
PMID: 24377790
interaction relation; ontology learning; relation words; text mining
Abstract
We study the complexity of rearrangement problems in the generalized breakpoint model of Tannier et al. and settle several open questions. We improve the algorithm for the median problem and show that it is equivalent to the problem of finding maximum cardinality nonbipartite matching (under linear reduction). On the other hand, we prove that the more general small phylogeny problem is NP-hard. Surprisingly, we show that it is already NP-hard (or even APX-hard) for a quartet phylogeny. We also show that in the unichromosomal and the multilinear breakpoint model the halving problem is NP-hard, refuting the conjecture of Tannier et al. Interestingly, this is the first problem that is harder in the breakpoint model than in the double cut and join or reversal models.
doi:10.1089/cmb.2013.0004
PMCID: PMC3887456
PMID: 24200391
breakpoint distance; halving; matching; median; NP-hard; phylogeny
Abstract
We observe n sequences at each of m sites and assume that they have evolved from an ancestral sequence that forms the root of a binary tree of known topology and branch lengths, but the sequence states at internal nodes are unknown. The topology of the tree and branch lengths are the same for all sites, but the parameters of the evolutionary model can vary over sites. We assume a piecewise constant model for these parameters, with an unknown number of change-points and hence a transdimensional parameter space over which we seek to perform Bayesian inference. We propose two novel ideas to deal with the computational challenges of such inference. Firstly, we approximate the model based on the time machine principle: the top nodes of the binary tree (near the root) are replaced by an approximation of the true distribution; as more nodes are removed from the top of the tree, the cost of computing the likelihood is reduced linearly in n. The approach introduces a bias, which we investigate empirically. Secondly, we develop a particle marginal Metropolis-Hastings (PMMH) algorithm, that employs a sequential Monte Carlo (SMC) sampler and can use the first idea. Our time-machine PMMH algorithm copes well with one of the bottle-necks of standard computational algorithms: the transdimensional nature of the posterior distribution. The algorithm is implemented on simulated and real data examples, and we empirically demonstrate its potential to outperform competing methods based on approximate Bayesian computation (ABC) techniques.
doi:10.1089/cmb.2014.0218
PMCID: PMC4283070
PMID: 25506749
approximate Bayesian computation; binary trees; change-point models; particle marginal Metropolis-Hastings; sequential Monte Carlo samplers; time machine
Abstract
The graph orientation problem calls for orienting the edges of an undirected graph so as to maximize the number of prespecified source-target vertex pairs that admit a directed path from the source to the target. Most algorithmic approaches to this problem share a common preprocessing step, in which the input graph is reduced to a tree by repeatedly contracting its cycles. Although this reduction is valid from an algorithmic perspective, the assignment of directions to the edges of the contracted cycles becomes arbitrary and, consequently, the connecting source-target paths may be arbitrarily long. In the context of biological networks, the connection of vertex pairs via shortest paths is highly motivated, leading to the following variant: Given an undirected graph and a collection of source-target vertex pairs, assign directions to the edges so as to maximize the number of pairs that are connected by a shortest (in the original graph) directed path. Here we study this variant, provide strong inapproximability results for it, and propose approximation algorithms for the problem, as well as for relaxations where the connecting paths need only be approximately shortest.
doi:10.1089/cmb.2013.0064
PMCID: PMC3842894
PMID: 24073924
approximation algorithms; graph orientation; inapproximability; shortest paths
Abstract
The recent advances in high-throughput sequencing technologies bring the potential of a better characterization of the genetic variation in humans and other organisms. In many occasions, either by design or by necessity, the sequencing procedure is performed on a pool of DNA samples with different abundances, where the abundance of each sample is unknown. Such a scenario is naturally occurring in the case of metagenomics analysis where a pool of bacteria is sequenced, or in the case of population studies involving DNA pools by design. Particularly, various pooling designs were recently suggested that can identify carriers of rare alleles in large cohorts, dramatically reducing the cost of such large-scale sequencing projects. A fundamental problem with such approaches for population studies is that the uncertainty of DNA proportions from different individuals in the pools might lead to spurious associations. Fortunately, it is often the case that the genotype data of at least some of the individuals in the pool is known. Here, we propose a method (eALPS) that uses the genotype data in conjunction with the pooled sequence data in order to accurately estimate the proportions of the samples in the pool, even in cases where not all individuals in the pool were genotyped (eALPS-LD). Using real data from a sequencing pooling study of non-Hodgkin's lymphoma, we demonstrate that the estimation of the proportions is crucial, since otherwise there is a risk for false discoveries. Additionally, we demonstrate that our approach is also applicable to the problem of quantification of species in metagenomics samples (eALPS-BCR) and is particularly suitable for metagenomic quantification of closely related species.
doi:10.1089/cmb.2013.0105
PMCID: PMC4013753
PMID: 24144111
algorithms; alignment; cancer genomics; NP-completeness
Abstract
Inference of gene interaction networks from expression data usually focuses on either supervised or unsupervised edge prediction from a single data source. However, in many real world applications, multiple data sources, such as microarray and ISH (in situ hybridization) measurements of mRNA abundances, are available to offer multiview information about the same set of genes. We propose ISH to estimate a gene interaction network that is consistent with such multiple data sources, which are expected to reflect the same underlying relationships between the genes. NP-MuScL casts the network estimation problem as estimating the structure of a sparse undirected graphical model. We use the semiparametric Gaussian copula to model the distribution of the different data sources, with the different copulas sharing the same precision (i.e., inverse covariance) matrix, and we present an efficient algorithm to estimate such a model in the high-dimensional scenario. Results are reported on synthetic data, where NP-MuScL outperforms baseline algorithms significantly, even in the presence of noisy data sources. Experiments are also run on two real-world scenarios: two yeast microarray datasets and three Drosophila embryonic gene expression datasets, where NP-MuScL predicts a higher number of known gene interactions than existing techniques.
doi:10.1089/cmb.2013.0093
PMCID: PMC3822365
PMID: 24134391
copula; Gaussian graphical models; gene expression; interaction networks; multi-source learning; nonparanormal; sparsity
Abstract
Next-generation sequencing technologies provide a powerful tool for studying genome evolution during progression of advanced diseases such as cancer. Although many recent studies have employed new sequencing technologies to detect mutations across multiple, genetically related tumors, current methods do not exploit available phylogenetic information to improve the accuracy of their variant calls. Here, we present a novel algorithm that uses somatic single-nucleotide variations (SNVs) in multiple, related tissue samples as lineage markers for phylogenetic tree reconstruction. Our method then leverages the inferred phylogeny to improve the accuracy of SNV discovery. Experimental analyses demonstrate that our method achieves up to 32% improvement for somatic SNV calling of multiple, related samples over the accuracy of GATK's Unified Genotyper, the state-of-the-art multisample SNV caller.
doi:10.1089/cmb.2013.0106
PMCID: PMC3822366
PMID: 24195709
cancer evolution; genetic variations; tumor phylogeny
Abstract
The analysis of the sequence–structure relationship in RNA molecules is not only essential for evolutionary studies but also for concrete applications such as error-correction in next generation sequencing (NGS) technologies. The prohibitive sizes of the mutational and conformational landscapes, combined with the volume of data to process, require efficient algorithms to compute sequence–structure properties. In this article, we address the correction of NGS errors by calculating which mutations most increase the likelihood of a sequence to a given structure and RNA family. We introduce RNApyro, an efficient, linear time and space inside–outside algorithm that computes exact mutational probabilities under secondary structure and evolutionary constraints given as a multiple sequence alignment with a consensus structure. We develop a scoring scheme combining classical stacking base-pair energies to novel isostericity scores and apply our techniques to correct pointwise errors in 5s and 16s rRNA sequences. Our results suggest that RNApyro is a promising algorithm to complement existing tools in the NGS error-correction pipeline.
doi:10.1089/cmb.2013.0085
PMCID: PMC3822367
PMID: 24134390
mutations; RNA; secondary structure
Abstract
Binding of one protein to another in a highly specific manner to form stable complexes is critical in most biological processes, yet the mechanisms involved in the interaction of proteins are not fully clear. The identification of hot spots, a small subset of binding interfaces that account for the majority of binding free energy, is becoming increasingly important in understanding the principles of protein interactions. Despite experiments like alanine scanning mutagenesis and a variety of computational methods that have been applied to this problem, comparative studies suggest that the development of accurate and reliable solutions is still in its infant stage. We developed PredHS (Prediction of Hot Spots), a computational method that can effectively identify hot spots on protein-binding interfaces by using 38 optimally chosen properties. The optimal combination of features was selected from a set of 324 novel structural neighborhood properties by a two-step feature selection method consisting of a random forest algorithm and a sequential backward elimination method. We evaluated the performance of PredHS using a benchmark of 265 alanine-mutated interface residues (Dataset I) and a trimmed subset (Dataset II) with 10-fold cross-validation. Compared with the state-of-the art approaches, PredHS achieves a significant improvement on the prediction quality, which stems from the new structural neighborhood properties, the novel way of feature generation, as well as the selection power of the proposed two-step method. We further validated the capability of our method by an independent test and obtained promising results.
doi:10.1089/cmb.2013.0083
PMCID: PMC3822376
PMID: 24134392
Abstract
Mapping reads to a reference genome is a routine yet computationally intensive task in research based on high-throughput sequencing. In recent years, the sequencing reads of the Illumina platform have become longer and their quality scores higher. According to our calculation, this allows perfect k-mer seed match for almost all reads when a close reference genome is available subject to reasonable specificity. Our other observation is that the majority reads contain at most one short INDEL polymorphism. Based on these observations, we propose a fast-mapping approach, referred to as “SEME,” which has two core steps: First it scans a read sequentially in a specific order for a k-mer exact match seed; next it extends the alignment on both sides allowing, at most, one short INDEL each using a novel method called “auto-match function.” We decompose the evaluation of the sensitivity and specificity into two parts corresponding to the seed and extension step, and the composite result provides an approximate overall reliability estimate of each mapping. We compare SEME with some existing mapping methods on several datasets, and SEME shows better performance in terms of both running time and mapping rates.
doi:10.1089/cmb.2013.0111
PMCID: PMC3822393
PMID: 24195707
auto-match function high-throughput sequencing; INDEL; mapping; perfect match
Abstract
Our current understanding of cellular networks is rather incomplete. We over look important but so far unknown genes and mechanisms in the pathways. Moreover, we often only have a partial account of the molecular interactions and modifications of the known players. When analyzing the cell, we look through narrow windows leaving potentially important events in blind spots. Network reconstruction is naturally confined to what we have observed. Little is known on how the incompleteness of our observations confounds our interpretation of the available data. Here we ask which features of a network can be confounded by incomplete observations and which cannot. In the context of nested effects models, we show that in the presence of missing observations or hidden factors a reliable reconstruction of the full network is not feasible. Nevertheless, we can show that certain characteristics of signaling networks like the existence of cross-talk between certain branches of the network can be inferred in a nonconfoundable way. We derive a test for inferring such nonconfoundable characteristics of signaling networks. Next, we introduce a new data structure to represent partially reconstructed signaling networks. Finally, we evaluate our method both on simulated data and in the context of a study on early stem cell differentiation in mice.
doi:10.1089/cmb.2013.0119
PMCID: PMC3822397
PMID: 24195708
biological networks; hidden variables; nested effects models; network reconstruction
Abstract
Epigenetic landscapes represent how cells regulate gene activity. To understand their effect on gene regulation, it is important to detect their occupancy in the genome. Unlike transcription factors whose binding regions are limited to narrow regions, histone modification marks are enriched over broader areas. The stochastic characteristics unique to each mark make it hard to detect their enrichment. Classically, a predefined window has been used to detect their enrichment. However, these approaches heavily rely on the predetermined parameters. Also, the window-based approaches cannot handle the enrichment of multiple marks. We propose a novel algorithm, called SeqW, to detect enrichment of multiple histone modification marks. SeqW applies a zooming approach to detect a broadly enriched domain. The zooming approach helps domain detection by increasing signal-to-noise ratio. The borders of the domains are detected by studying the characteristics of signals in the wavelet domain. We show that SeqW outperformed previous predictors in detecting broad peaks. Also, we applied SeqW in studying spatial combinations of histone modification patterns.
doi:10.1089/cmb.2014.0095
PMCID: PMC4224055
PMID: 25072902
wavelet; zero-crossing; enriched region; histone modification
Abstract
Recent advances in the automation of metabolic model reconstruction have led to the availability of draft-quality metabolic models (predicted reaction complements) for multiple bacterial species. These reaction complements can be considered as trait representations and can be used for ancestral state reconstruction to infer the most likely metabolic complements of common ancestors of all bacteria with generated metabolic models. We present here an ancestral state reconstruction for 141 extant bacteria and analyze the reaction gains and losses for these bacteria with respect to their lifestyles and pathogenic nature. A simulated annealing approach is used to look at coordinated metabolic gains and losses in two bacteria. The main losses of Onion yellows phytoplasma OY-M, an obligate intracellular pathogen, are shown (as expected) to be in cell wall biosynthesis. The metabolic gains made by Clostridium difficile CD196 in adapting to its current habitat in the human colon is also analyzed. Our analysis shows that the capability to utilize N-Acetyl-neuraminic acid as a carbon source has been gained, rather than having been present in the Clostridium ancestor, as has the capability to synthesize phthiocerol dimycocerosate, which could potentially aid the evasion of the host immune response. We have shown that the availability of large numbers of metabolic models, along with conventional approaches, has enabled a systematic method to analyze metabolic evolution in the bacterial domain.
doi:10.1089/cmb.2013.0079
PMCID: PMC3791050
PMID: 23992299
biochemical networks; evolution; metabolic network; microbial ecology; phylogenetic analyses; systems biology
Abstract
Over the past several years, genome-wide association studies (GWAS) have implicated hundreds of genes in common disease. More recently, the GWAS approach has been utilized to identify regions of the genome that harbor variation affecting gene expression or expression quantitative trait loci (eQTLs). Unlike GWAS applied to clinical traits, where only a handful of phenotypes are analyzed per study, in eQTL studies, tens of thousands of gene expression levels are measured, and the GWAS approach is applied to each gene expression level. This leads to computing billions of statistical tests and requires substantial computational resources, particularly when applying novel statistical methods such as mixed models. We introduce a novel two-stage testing procedure that identifies all of the significant associations more efficiently than testing all the single nucleotide polymorphisms (SNPs). In the first stage, a small number of informative SNPs, or proxies, across the genome are tested. Based on their observed associations, our approach locates the regions that may contain significant SNPs and only tests additional SNPs from those regions. We show through simulations and analysis of real GWAS datasets that the proposed two-stage procedure increases the computational speed by a factor of 10. Additionally, efficient implementation of our software increases the computational speed relative to the state-of-the-art testing approaches by a factor of 75.
doi:10.1089/cmb.2013.0087
PMCID: PMC3791046
PMID: 24033261
genetics; genomics; haplotypes; machine learning; statistical models
Abstract
Conformational changes frequently occur when proteins interact with other proteins. How to detect such changes in silico is a major problem. Existing methods for docking with conformational changes remain time-consuming, and they solve only a small portion of protein complexes accurately. This work presents a more accurate method (FlexDoBi) for docking with conformational changes. FlexDoBi generates the possible conformational changes of the interface residues that transform the proteins from their unbound states to bound states. Based on the generated conformational changes, multidimensional scaling is performed to construct candidates for the bound structure. We develop a new energy item for determining the orientation of docking subunits and selecting of plausible conformational changes. Experimental results illustrate that FlexDoBi achieves better results. On 20 complexes, we obtained an average iRMSD of 1.55Å, which compares favorably with the average iRMSD of 1.94Å for FiberDock. Compared to ZDOCK, our results are of 0.27Å less in average iRMSD of the medium difficulty group.
doi:10.1089/cmb.2013.0069
PMCID: PMC3791054
PMID: 24093228
backbone flexibility; database method; energy function; flexible docking; weighted multidimensional scaling
Abstract
Phylogenetic tree reconciliation is a powerful approach for inferring evolutionary events like gene duplication, horizontal gene transfer, and gene loss, which are fundamental to our understanding of molecular evolution. While duplication–loss (DL) reconciliation leads to a unique maximum-parsimony solution, duplication-transfer-loss (DTL) reconciliation yields a multitude of optimal solutions, making it difficult to infer the true evolutionary history of the gene family. This problem is further exacerbated by the fact that different event cost assignments yield different sets of optimal reconciliations. Here, we present an effective, efficient, and scalable method for dealing with these fundamental problems in DTL reconciliation. Our approach works by sampling the space of optimal reconciliations uniformly at random and aggregating the results. We show that even gene trees with only a few dozen genes often have millions of optimal reconciliations and present an algorithm to efficiently sample the space of optimal reconciliations uniformly at random in O(mn2) time per sample, where m and n denote the number of genes and species, respectively. We use these samples to understand how different optimal reconciliations vary in their node mappings and event assignments and to investigate the impact of varying event costs. We apply our method to a biological dataset of approximately 4700 gene trees from 100 taxa and observe that 93% of event assignments and 73% of mappings remain consistent across different multiple optima. Our analysis represents the first systematic investigation of the space of optimal DTL reconciliations and has many important implications for the study of gene family evolution.
doi:10.1089/cmb.2013.0073
PMCID: PMC3791060
PMID: 24033262
gene duplication; gene family evolution; gene-tree/species-tree reconciliation; horizontal gene transfer; host-parasite cophylogeny; phylogenetics
Abstract
Protein structure and function are largely specified by the distribution of different atoms and residues relative to the core and surface of the molecule. Relative depths of atoms therefore are key attributions that have been widely used in protein structure modeling and function annotation. However, accurate calculation of depth is time consuming. Here, we developed an algorithm which uses Euclidean distance transform (EDT) to convert the target protein structure into a 3D gray-scale image, where depths of atoms in the protein can be conveniently and precisely derived from the minimum distance of the pixels to the surface of the protein. We tested the proposed EDT-based method on a set of 261 non-redundant protein structures, which shows that the method is 2.6 times faster than the widely used method proposed by Chakravarty and Varadarajan. Depth values by EDT method are highly accurate with a Pearson's correlation coefficient ≈1 compared to the calculations from exhaustive search. To explore the usefulness of the method in protein structure prediction, we add the calculated residue depth to the scoring function of the state of the art, profile–profile alignment based fold-recognition program, which shows an additional 3% improvement in the TM-score of the alignments. The data demonstrate that the EDT-based depth calculation program can be used as an efficient tool to assist protein structure analysis and structure-based function annotation.
doi:10.1089/cmb.2013.0071
PMCID: PMC3837563
PMID: 23992298
Nurk, Sergey | Bankevich, Anton | Antipov, Dmitry | Gurevich, Alexey A. | Korobeynikov, Anton | Lapidus, Alla | Prjibelski, Andrey D. | Pyshkin, Alexey | Sirotkin, Alexander | Sirotkin, Yakov | Stepanauskas, Ramunas | Clingenpeel, Scott R. | Woyke, Tanja | Mclean, Jeffrey S. | Lasken, Roger | Tesler, Glenn | Alekseyev, Max A. | Pevzner, Pavel A.
Abstract
Recent advances in single-cell genomics provide an alternative to largely gene-centric metagenomics studies, enabling whole-genome sequencing of uncultivated bacteria. However, single-cell assembly projects are challenging due to (i) the highly nonuniform read coverage and (ii) a greatly elevated number of chimeric reads and read pairs. While recently developed single-cell assemblers have addressed the former challenge, methods for assembling highly chimeric reads remain poorly explored. We present algorithms for identifying chimeric edges and resolving complex bulges in de Bruijn graphs, which significantly improve single-cell assemblies. We further describe applications of the single-cell assembler SPAdes to a new approach for capturing and sequencing “microbial dark matter” that forms small pools of randomly selected single cells (called a mini-metagenome) and further sequences all genomes from the mini-metagenome at once. On single-cell bacterial datasets, SPAdes improves on the recently developed E+V-SC and IDBA-UD assemblers specifically designed for single-cell sequencing. For standard (cultivated monostrain) datasets, SPAdes also improves on A5, ABySS, CLC, EULER-SR, Ray, SOAPdenovo, and Velvet. Thus, recently developed single-cell assemblers not only enable single-cell sequencing, but also improve on conventional assemblers on their own turf. SPAdes is available for free online download under a GPLv2 license.
doi:10.1089/cmb.2013.0084
PMCID: PMC3791033
PMID: 24093227
bacterial assembly; chimeric reads; de Bruijn graph; multiple displacement amplification (MDA); single cell
Abstract
The problem of inference of family trees, or pedigree reconstruction, for a group of individuals is a fundamental problem in genetics. Various methods have been proposed to automate the process of pedigree reconstruction given the genotypes or haplotypes of a set of individuals. Current methods, unfortunately, are very time-consuming and inaccurate for complicated pedigrees, such as pedigrees with inbreeding. In this work, we propose an efficient algorithm that is able to reconstruct large pedigrees with reasonable accuracy. Our algorithm reconstructs the pedigrees generation by generation, backward in time from the extant generation. We predict the relationships between individuals in the same generation using an inheritance path-based approach implemented with an efficient dynamic programming algorithm. Experiments show that our algorithm runs in linear time with respect to the number of reconstructed generations, and therefore, it can reconstruct pedigrees that have a large number of generations. Indeed it is the first practical method for reconstruction of large pedigrees from genotype data.
doi:10.1089/cmb.2013.0080
PMCID: PMC3791035
PMID: 24093229
dynamic programming; genotype data; identity by descent; inheritance path; pedigree reconstruction
Abstract
A phylogenetic network is a model for reticulate evolution. A hybridization network is one type of phylogenetic network for a set of discordant gene trees and “displays” each gene tree. A central computational problem on hybridization networks is: given a set of gene trees, reconstruct the minimum (i.e., most parsimonious) hybridization network that displays each given gene tree. This problem is known to be NP-hard, and existing approaches for this problem are either heuristics or making simplifying assumptions (e.g., work with only two input trees or assume some topological properties).
In this article, we develop an exact algorithm (called PIRNC) for inferring the minimum hybridization networks from multiple gene trees. The PIRNC algorithm does not rely on structural assumptions (e.g., the so-called galled networks). To the best of our knowledge, PIRNC is the first exact algorithm implemented for this formulation. When the number of reticulation events is relatively small (say, four or fewer), PIRNC runs reasonably efficient even for moderately large datasets. For building more complex networks, we also develop a heuristic version of PIRNC called PIRNCH. Simulation shows that PIRNCH usually produces networks with fewer reticulation events than those by an existing method.
PIRNC and PIRNCH have been implemented as part of the software package called PIRN and is available online.
doi:10.1089/cmb.2013.0072
PMCID: PMC3791036
PMID: 24093230
algorithms; hybridization network; parsimony; phylogenetic network; phylogenetics; reticulate evolution
Abstract
The reconstruction of the history of evolutionary genome-wide events among a set of related organisms is of great biological interest since it can help to reveal the genomic basis of phenotypes. The sequencing of whole genomes faciliates the study of gene families that vary in size through duplication and loss events, like transfer RNA. However, a high sequence similarity often does not allow one to distinguish between orthologs and paralogs. Previous methods have addressed this difficulty by taking into account flanking regions of members of a family independently. We go one step further by inferring the order of genes of (a set of) families for ancestral genomes by considering the order of these genes on sequenced genomes. We present a novel branch-and-cut algorithm to solve the two species small phylogeny problem in the evolutionary model of duplications and losses. On average, our implementation, DupLoCut, improves the running time of a recently proposed method in the experiments on six Vibrionaceae lineages by a factor of ∼200. Besides the mere improvement in running time, the efficiency of our approach allows us to extend our model from cherries of a species tree, that is, subtrees with two leaves, to the median of three species setting. Being able to determine the median of three species is of key importance to one of the most common approaches to ancestral reconstruction, and our experiments show that its repeated computation considerably reduces the number of duplications and losses along the tree both on simulated instances comprising 128 leaves and a set of Bacillus genomes. Furthermore, in our simulations we show that a reduction in cost goes hand in hand with an improvement of the predicted ancestral genomes. Finally, we prove that the small phylogeny problem in the duplication-loss model is NP-complete already for two species.
doi:10.1089/cmb.2013.0057
PMCID: PMC3761417
PMID: 24000925
Abstract
Comparative approaches in genomics have long relied on rigorous mathematical models of sequence evolution. Such models provide the basis for formulating and solving well-defined computational problems, in turn yielding key insights into the evolutionary processes acting on the genome. Analogous model-based approaches for analyzing biological networks are still under development. Here we describe a model-based approach for estimating the probability of network rewiring events during evolution. Our method builds on the standard duplication-and-divergence model and incorporates phylogenetic analysis to guide the comparison of protein networks across species. We apply our algorithm to study the evolution of functional modules and unconstrained network regions in seven available eukaryotic interactomes. Based on this analysis we identify a map of co-functioning protein families whose members participate in strongly conserved interactions and form major complexes and pathways in the eukaryotic cell. The proposed approach provides principled means for inferring the probability of network rewiring events, enabling insights into the conservation and divergence of protein interactions and the formation of functional modules in protein networks.
doi:10.1089/cmb.2012.0295
PMCID: PMC3761421
PMID: 23931333
discrimination between selective and neutral evolutionary dynamics; dynamics of evolution; evolution of protein networks