Search tips
Search criteria

Results 1-25 (37)

Clipboard (0)

Select a Filter Below

more »
Year of Publication
1.  Scaling statistical multiple sequence alignment to large datasets 
BMC Genomics  2016;17(Suppl 10):135-144.
Multiple sequence alignment is an important task in bioinformatics, and alignments of large datasets containing hundreds or thousands of sequences are increasingly of interest. While many alignment methods exist, the most accurate alignments are likely to be based on stochastic models where sequences evolve down a tree with substitutions, insertions, and deletions. While some methods have been developed to estimate alignments under these stochastic models, only the Bayesian method BAli-Phy has been able to run on even moderately large datasets, containing 100 or so sequences. A technique to extend BAli-Phy to enable alignments of thousands of sequences could potentially improve alignment and phylogenetic tree accuracy on large-scale data beyond the best-known methods today.
We use simulated data with up to 10,000 sequences representing a variety of model conditions, including some that are significantly divergent from the statistical models used in BAli-Phy and elsewhere. We give a method for incorporating BAli-Phy into PASTA and UPP, two strategies for enabling alignment methods to scale to large datasets, and give alignment and tree accuracy results measured against the ground truth from simulations. Comparable results are also given for other methods capable of aligning this many sequences.
Extensions of BAli-Phy using PASTA and UPP produce significantly more accurate alignments and phylogenetic trees than the current leading methods.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-016-3101-8) contains supplementary material, which is available to authorized users.
PMCID: PMC5123300
Multiple sequence alignment; Boosting; MCMC
2.  An analytical upper bound on the number of loci required for all splits of a species tree to appear in a set of gene trees 
BMC Bioinformatics  2016;17(Suppl 14):241-250.
Many methods for species tree inference require data from a sufficiently large sample of genomic loci in order to produce accurate estimates. However, few studies have attempted to use analytical theory to quantify “sufficiently large”.
Using the multispecies coalescent model, we report a general analytical upper bound on the number of gene trees n required such that with probability q, each bipartition of a species tree is represented at least once in a set of n random gene trees. This bound employs a formula that is straightforward to compute, depends only on the minimum internal branch length of the species tree and the number of taxa, and applies irrespective of the species tree topology. Using simulations, we investigate numerical properties of the bound as well as its accuracy under the multispecies coalescent.
Our results are helpful for conservatively bounding the number of gene trees required by the ASTRAL inference method, and the approach has potential to be extended to bound other properties of gene tree sets under the model.
PMCID: PMC5123308
Bipartitions; Coalescent; Gene trees; Species trees
3.  HIPPI: highly accurate protein family classification with ensembles of HMMs 
BMC Genomics  2016;17(Suppl 10):89-100.
Given a new biological sequence, detecting membership in a known family is a basic step in many bioinformatics analyses, with applications to protein structure and function prediction and metagenomic taxon identification and abundance profiling, among others. Yet family identification of sequences that are distantly related to sequences in public databases or that are fragmentary remains one of the more difficult analytical problems in bioinformatics.
We present a new technique for family identification called HIPPI (Hierarchical Profile Hidden Markov Models for Protein family Identification). HIPPI uses a novel technique to represent a multiple sequence alignment for a given protein family or superfamily by an ensemble of profile hidden Markov models computed using HMMER. An evaluation of HIPPI on the Pfam database shows that HIPPI has better overall precision and recall than blastp, HMMER, and pipelines based on HHsearch, and maintains good accuracy even for fragmentary query sequences and for protein families with low average pairwise sequence identity, both conditions where other methods degrade in accuracy.
HIPPI provides accurate protein family identification and is robust to difficult model conditions. Our results, combined with observations from previous studies, show that ensembles of profile Hidden Markov models can better represent multiple sequence alignments than a single profile Hidden Markov model, and thus can improve downstream analyses for various bioinformatic tasks. Further research is needed to determine the best practices for building the ensemble of profile Hidden Markov models. HIPPI is available on GitHub at
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-016-3097-0) contains supplementary material, which is available to authorized users.
PMCID: PMC5123343
Protein family identification; Ensemble of profile hidden Markov models; Pfam
4.  High diversity of picornaviruses in rats from different continents revealed by deep sequencing 
Outbreaks of zoonotic diseases in humans and livestock are not uncommon, and an important component in containment of such emerging viral diseases is rapid and reliable diagnostics. Such methods are often PCR-based and hence require the availability of sequence data from the pathogen. Rattus norvegicus (R. norvegicus) is a known reservoir for important zoonotic pathogens. Transmission may be direct via contact with the animal, for example, through exposure to its faecal matter, or indirectly mediated by arthropod vectors. Here we investigated the viral content in rat faecal matter (n=29) collected from two continents by analyzing 2.2 billion next-generation sequencing reads derived from both DNA and RNA. Among other virus families, we found sequences from members of the Picornaviridae to be abundant in the microbiome of all the samples. Here we describe the diversity of the picornavirus-like contigs including near-full-length genomes closely related to the Boone cardiovirus and Theiler's encephalomyelitis virus. From this study, we conclude that picornaviruses within R. norvegicus are more diverse than previously recognized. The virome of R. norvegicus should be investigated further to assess the full potential for zoonotic virus transmission.
PMCID: PMC5034103  PMID: 27530749
cardiovirus; metagenomics; picornavirus; Rattus norvegicus; sequencing; viral discovery
5.  Large-Scale Multiple Sequence Alignment and Tree Estimation Using SATé 
SATé is a method for estimating multiple sequence alignments and trees that has been shown to produce highly accurate results for datasets with large numbers of sequences. Running SATé using its default settings is very simple, but improved accuracy can be obtained by modifying its algorithmic parameters. We provide a detailed introduction to the algorithmic approach used by SATé, and instructions for running a SATé analysis using the GUI under default settings. We also provide a discussion of how to modify these settings to obtain improved results, and how to use SATé in a phylogenetic analysis pipeline.
PMCID: PMC4894340  PMID: 24170406
Multiple sequence alignment; Maximum likelihood; Phylogenetics; SATé; Species tree estimation; Gene tree estimation; Phylogenomics
7.  PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences 
Journal of Computational Biology  2015;22(5):377-386.
We introduce PASTA, a new multiple sequence alignment algorithm. PASTA uses a new technique to produce an alignment given a guide tree that enables it to be both highly scalable and very accurate. We present a study on biological and simulated data with up to 200,000 sequences, showing that PASTA produces highly accurate alignments, improving on the accuracy and scalability of the leading alignment methods (including SATé). We also show that trees estimated on PASTA alignments are highly accurate—slightly better than SATé trees, but with substantial improvements relative to other methods. Finally, PASTA is faster than SATé, highly parallelizable, and requires relatively little memory.
PMCID: PMC4424971  PMID: 25549288
algorithms; metagenomics; molecular evolution; multiple alignment; phylogenetic trees
8.  The Interrelationships of Placental Mammals and the Limits of Phylogenetic Inference 
Genome Biology and Evolution  2016;8(2):330-344.
Placental mammals comprise three principal clades: Afrotheria (e.g., elephants and tenrecs), Xenarthra (e.g., armadillos and sloths), and Boreoeutheria (all other placental mammals), the relationships among which are the subject of controversy and a touchstone for debate on the limits of phylogenetic inference. Previous analyses have found support for all three hypotheses, leading some to conclude that this phylogenetic problem might be impossible to resolve due to the compounded effects of incomplete lineage sorting (ILS) and a rapid radiation. Here we show, using a genome scale nucleotide data set, microRNAs, and the reanalysis of the three largest previously published amino acid data sets, that the root of Placentalia lies between Atlantogenata and Boreoeutheria. Although we found evidence for ILS in early placental evolution, we are able to reject previous conclusions that the placental root is a hard polytomy that cannot be resolved. Reanalyses of previous data sets recover Atlantogenata + Boreoeutheria and show that contradictory results are a consequence of poorly fitting evolutionary models; instead, when the evolutionary process is better-modeled, all data sets converge on Atlantogenata. Our Bayesian molecular clock analysis estimates that marsupials diverged from placentals 157–170 Ma, crown Placentalia diverged 86–100 Ma, and crown Atlantogenata diverged 84–97 Ma. Our results are compatible with placental diversification being driven by dispersal rather than vicariance mechanisms, postdating early phases in the protracted opening of the Atlantic Ocean.
PMCID: PMC4779606  PMID: 26733575
placental; phylogeny; mammalian; genome; microRNA; palaeontology
9.  TIPP: taxonomic identification and phylogenetic profiling 
Bioinformatics  2014;30(24):3548-3555.
Motivation: Abundance profiling (also called ‘phylogenetic profiling’) is a crucial step in understanding the diversity of a metagenomic sample, and one of the basic techniques used for this is taxonomic identification of the metagenomic reads.
Results: We present taxon identification and phylogenetic profiling (TIPP), a new marker-based taxon identification and abundance profiling method. TIPP combines SAT\'e-enabled phylogenetic placement a phylogenetic placement method, with statistical techniques to control the classification precision and recall, and results in improved abundance profiles. TIPP is highly accurate even in the presence of high indel errors and novel genomes, and matches or improves on previous approaches, including NBC, mOTU, PhymmBL, MetaPhyler and MetaPhlAn.
Availability and implementation: Software and supplementary materials are available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4253836  PMID: 25359891
10.  ASTRID: Accurate Species TRees from Internode Distances 
BMC Genomics  2015;16(Suppl 10):S3.
Incomplete lineage sorting (ILS), modelled by the multi-species coalescent (MSC), is known to create discordance between gene trees and species trees, and lead to inaccurate species tree estimations unless appropriate methods are used to estimate the species tree. While many statistically consistent methods have been developed to estimate the species tree in the presence of ILS, only ASTRAL-2 and NJst have been shown to have good accuracy on large datasets. Yet, NJst is generally slower and less accurate than ASTRAL-2, and cannot run on some datasets.
We have redesigned NJst to enable it to run on all datasets, and we have expanded its design space so that it can be used with different distance-based tree estimation methods. The resultant method, ASTRID, is statistically consistent under the MSC model, and has accuracy that is competitive with ASTRAL-2. Furthermore, ASTRID is much faster than ASTRAL-2, completing in minutes on some datasets for which ASTRAL-2 used hours.
ASTRID is a new coalescent-based method for species tree estimation that is competitive with the best current method in terms of accuracy, while being much faster. ASTRID is available in open source form on github.
PMCID: PMC4602181  PMID: 26449326
incomplete lineage sorting; phylogenomics; species trees; ASTRAL; NJst; MP-EST; FastME; PhyD*; neighbor joining
11.  A comparative study of SVDquartets and other coalescent-based species tree estimation methods 
BMC Genomics  2015;16(Suppl 10):S2.
Species tree estimation is challenging in the presence of incomplete lineage sorting (ILS), which can make gene trees different from the species tree. Because ILS is expected to occur and the standard concatenation approach can return incorrect trees with high support in the presence of ILS, "coalescent-based" summary methods (which first estimate gene trees and then combine gene trees into a species tree) have been developed that have theoretical guarantees of robustness to arbitrarily high amounts of ILS. Some studies have suggested that summary methods should only be used on "c-genes" (i.e., recombination-free loci) that can be extremely short (sometimes fewer than 100 sites). However, gene trees estimated on short alignments can have high estimation error, and summary methods tend to have high error on short c-genes. To address this problem, Chifman and Kubatko introduced SVDquartets, a new coalescent-based method. SVDquartets takes multi-locus unlinked single-site data, infers the quartet trees for all subsets of four species, and then combines the set of quartet trees into a species tree using a quartet amalgamation heuristic. Yet, the relative accuracy of SVDquartets to leading coalescent-based methods has not been assessed.
We compared SVDquartets to two leading coalescent-based methods (ASTRAL-2 and NJst), and to concatenation using maximum likelihood. We used a collection of simulated datasets, varying ILS levels, numbers of taxa, and number of sites per locus. Although SVDquartets was sometimes more accurate than ASTRAL-2 and NJst, most often the best results were obtained using ASTRAL-2, even on the shortest gene sequence alignments we explored (with only 10 sites per locus). Finally, concatenation was the most accurate of all methods under low ILS conditions.
ASTRAL-2 generally had the best accuracy under higher ILS conditions, and concatenation had the best accuracy under the lowest ILS conditions. However, SVDquartets was competitive with the best methods under conditions with low ILS and small numbers of sites per locus. The good performance under many conditions of ASTRAL-2 in comparison to SVDquartets is surprising given the known vulnerability of ASTRAL-2 and similar methods to short gene sequences.
PMCID: PMC4602346  PMID: 26449249
species tree inference; SNP; multilocus; SVDquartets; QMC; ASTRAL; NJst; RAxML
12.  Phylogenomic species tree estimation in the presence of incomplete lineage sorting and horizontal gene transfer 
BMC Genomics  2015;16(Suppl 10):S1.
Species tree estimation is challenged by gene tree heterogeneity resulting from biological processes such as duplication and loss, hybridization, incomplete lineage sorting (ILS), and horizontal gene transfer (HGT). Mathematical theory about reconstructing species trees in the presence of HGT alone or ILS alone suggests that quartet-based species tree methods (known to be statistically consistent under ILS, or under bounded amounts of HGT) might be effective techniques for estimating species trees when both HGT and ILS are present.
We evaluated several publicly available coalescent-based methods and concatenation under maximum likelihood on simulated datasets with moderate ILS and varying levels of HGT. Our study shows that two quartet-based species tree estimation methods (ASTRAL-2 and weighted Quartets MaxCut) are both highly accurate, even on datasets with high rates of HGT. In contrast, although NJst and concatenation using maximum likelihood are highly accurate under low HGT, they are less robust to high HGT rates.
Our study shows that quartet-based species-tree estimation methods can be highly accurate under the presence of both HGT and ILS. The study suggests the possibility that some quartet-based methods might be statistically consistent under phylogenomic models of gene tree heterogeneity with both HGT and ILS.
PMCID: PMC4603753  PMID: 26450506
phylogenomics; HGT; ILS; summary methods; concatenation
13.  Weighted Statistical Binning: Enabling Statistically Consistent Genome-Scale Phylogenetic Analyses 
PLoS ONE  2015;10(6):e0129183.
Because biological processes can result in different loci having different evolutionary histories, species tree estimation requires multiple loci from across multiple genomes. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause for gene tree heterogeneity. Coalescent-based methods have been developed to estimate species trees, many of which operate by combining estimated gene trees, and so are called "summary methods". Because summary methods are generally fast (and much faster than more complicated coalescent-based methods that co-estimate gene trees and species trees), they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate in biologically realistic conditions. Mirarab et al. (Science 2014) presented the "statistical binning" technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple heuristic to evaluate "combinability" and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomic pipeline does not have the desirable property of being statistically consistent. We show that weighting the re-calculated gene trees by the bin sizes makes statistical binning statistically consistent under the multispecies coalescent, and maintains the good empirical performance. Thus, "weighted statistical binning" enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multi-species coalescent model. New data used in this study are available at DOI:, and the software is available at
PMCID: PMC4472720  PMID: 26086579
14.  Ultra-large alignments using phylogeny-aware profiles 
Genome Biology  2015;16(1):124.
Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, a multiple sequence alignment method that uses a new machine learning technique, the ensemble of hidden Markov models, which we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-015-0688-z) contains supplementary material, which is available to authorized users.
PMCID: PMC4492008  PMID: 26076734
15.  ASTRAL-II: coalescent-based species tree estimation with many hundreds of taxa and thousands of genes 
Bioinformatics  2015;31(12):i44-i52.
Motivation: The estimation of species phylogenies requires multiple loci, since different loci can have different trees due to incomplete lineage sorting, modeled by the multi-species coalescent model. We recently developed a coalescent-based method, ASTRAL, which is statistically consistent under the multi-species coalescent model and which is more accurate than other coalescent-based methods on the datasets we examined. ASTRAL runs in polynomial time, by constraining the search space using a set of allowed ‘bipartitions’. Despite the limitation to allowed bipartitions, ASTRAL is statistically consistent.
Results: We present a new version of ASTRAL, which we call ASTRAL-II. We show that ASTRAL-II has substantial advantages over ASTRAL: it is faster, can analyze much larger datasets (up to 1000 species and 1000 genes) and has substantially better accuracy under some conditions. ASTRAL’s running time is O(n2k|X|2), and ASTRAL-II’s running time is O(nk|X|2), where n is the number of species, k is the number of loci and X is the set of allowed bipartitions for the search space.
Availability and implementation: ASTRAL-II is available in open source at and datasets used are available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4765870  PMID: 26072508
16.  Concatenation Analyses in the Presence of Incomplete Lineage Sorting 
PLoS Currents  2015;7:ecurrents.currents.tol.8d41ac0f13d1abedf4c4a59f5d17b1f7.
Incomplete lineage sorting (ILS), modelled by the multi-species coalescent, is a process that results in a gene tree being different from the species tree. Because ILS is expected to occur for at least some loci within genome-scale analyses, the evaluation of species tree estimation methods in the presence of ILS is of great interest. Performance on simulated and biological data have suggested that concatenation analyses can result in the wrong tree with high support under some conditions, and a recent theoretical result by Roch and Steel proved that concatenation using unpartitioned maximum likelihood analysis can be statistically inconsistent in the presence of ILS. In this study, we survey the major species tree estimation methods, including the newly proposed “statistical binning” methods, and discuss their theoretical properties. We also note that there are two interpretations of the term “statistical consistency”, and discuss the theoretical results proven under both interpretations.
PMCID: PMC4450984  PMID: 26064786
17.  Whole-genome analyses resolve early branches in the tree of life of modern birds 
Jarvis, Erich D. | Mirarab, Siavash | Aberer, Andre J. | Li, Bo | Houde, Peter | Li, Cai | Ho, Simon Y. W. | Faircloth, Brant C. | Nabholz, Benoit | Howard, Jason T. | Suh, Alexander | Weber, Claudia C. | da Fonseca, Rute R. | Li, Jianwen | Zhang, Fang | Li, Hui | Zhou, Long | Narula, Nitish | Liu, Liang | Ganapathy, Ganesh | Boussau, Bastien | Bayzid, Md. Shamsuzzoha | Zavidovych, Volodymyr | Subramanian, Sankar | Gabaldón, Toni | Capella-Gutiérrez, Salvador | Huerta-Cepas, Jaime | Rekepalli, Bhanu | Munch, Kasper | Schierup, Mikkel | Lindow, Bent | Warren, Wesley C. | Ray, David | Green, Richard E. | Bruford, Michael W. | Zhan, Xiangjiang | Dixon, Andrew | Li, Shengbin | Li, Ning | Huang, Yinhua | Derryberry, Elizabeth P. | Bertelsen, Mads Frost | Sheldon, Frederick H. | Brumfield, Robb T. | Mello, Claudio V. | Lovell, Peter V. | Wirthlin, Morgan | Schneider, Maria Paula Cruz | Prosdocimi, Francisco | Samaniego, José Alfredo | Velazquez, Amhed Missael Vargas | Alfaro-Núñez, Alonzo | Campos, Paula F. | Petersen, Bent | Sicheritz-Ponten, Thomas | Pas, An | Bailey, Tom | Scofield, Paul | Bunce, Michael | Lambert, David M. | Zhou, Qi | Perelman, Polina | Driskell, Amy C. | Shapiro, Beth | Xiong, Zijun | Zeng, Yongli | Liu, Shiping | Li, Zhenyu | Liu, Binghang | Wu, Kui | Xiao, Jin | Yinqi, Xiong | Zheng, Qiuemei | Zhang, Yong | Yang, Huanming | Wang, Jian | Smeds, Linnea | Rheindt, Frank E. | Braun, Michael | Fjeldsa, Jon | Orlando, Ludovic | Barker, F. Keith | Jønsson, Knud Andreas | Johnson, Warren | Koepfli, Klaus-Peter | O’Brien, Stephen | Haussler, David | Ryder, Oliver A. | Rahbek, Carsten | Willerslev, Eske | Graves, Gary R. | Glenn, Travis C. | McCormack, John | Burt, Dave | Ellegren, Hans | Alström, Per | Edwards, Scott V. | Stamatakis, Alexandros | Mindell, David P. | Cracraft, Joel | Braun, Edward L. | Warnow, Tandy | Jun, Wang | Gilbert, M. Thomas P. | Zhang, Guojie
Science (New York, N.Y.)  2014;346(6215):1320-1331.
To better determine the history of modern birds, we performed a genome-scale phylogenetic analysis of 48 species representing all orders of Neoaves using phylogenomic methods created to handle genome-scale data. We recovered a highly resolved tree that confirms previously controversial sister or close relationships. We identified the first divergence in Neoaves, two groups we named Passerea and Columbea, representing independent lineages of diverse and convergently evolved land and water bird species. Among Passerea, we infer the common ancestor of core landbirds to have been an apex predator and confirm independent gains of vocal learning. Among Columbea, we identify pigeons and flamingoes as belonging to sister clades. Even with whole genomes, some of the earliest branches in Neoaves proved challenging to resolve, which was best explained by massive protein-coding sequence convergence and high levels of incomplete lineage sorting that occurred during a rapid radiation after the Cretaceous-Paleogene mass extinction event about 66 million years ago.
PMCID: PMC4405904  PMID: 25504713
18.  Phylogenomic analyses data of the avian phylogenomics project 
GigaScience  2015;4:4.
Determining the evolutionary relationships among the major lineages of extant birds has been one of the biggest challenges in systematic biology. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders. We used these genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomic analyses.
Here we present the datasets associated with the phylogenomic analyses, which include sequence alignment files consisting of nucleotides, amino acids, indels, and transposable elements, as well as tree files containing gene trees and species trees. Inferring an accurate phylogeny required generating: 1) A well annotated data set across species based on genome synteny; 2) Alignments with unaligned or incorrectly overaligned sequences filtered out; and 3) Diverse data sets, including genes and their inferred trees, indels, and transposable elements. Our total evidence nucleotide tree (TENT) data set (consisting of exons, introns, and UCEs) gave what we consider our most reliable species tree when using the concatenation-based ExaML algorithm or when using statistical binning with the coalescence-based MP-EST algorithm (which we refer to as MP-EST*). Other data sets, such as the coding sequence of some exons, revealed other properties of genome evolution, namely convergence.
The Avian Phylogenomics Project is the largest vertebrate phylogenomics project to date that we are aware of. The sequence, alignment, and tree data are expected to accelerate analyses in phylogenomics and other related areas.
Electronic supplementary material
The online version of this article (doi:10.1186/s13742-014-0038-1) contains supplementary material, which is available to authorized users.
PMCID: PMC4349222  PMID: 25741440
Avian genomes; Phylogenomics; Sequence alignments; Species tree; Gene trees; Indels; Transposable elements
19.  Data access for the 1,000 Plants (1KP) project 
GigaScience  2014;3:17.
The 1,000 plants (1KP) project is an international multi-disciplinary consortium that has generated transcriptome data from over 1,000 plant species, with exemplars for all of the major lineages across the Viridiplantae (green plants) clade. Here, we describe how to access the data used in a phylogenomics analysis of the first 85 species, and how to visualize our gene and species trees. Users can develop computational pipelines to analyse these data, in conjunction with data of their own that they can upload. Computationally estimated protein-protein interactions and biochemical pathways can be visualized at another site. Finally, we comment on our future plans and how they fit within this scalable system for the dissemination, visualization, and analysis of large multi-species data sets.
PMCID: PMC4306014  PMID: 25625010
Viridiplantae; Biodiversity; Transcriptomes; Phylogenomics; Interactions; Pathways
20.  BBCA: Improving the scalability of *BEAST using random binning 
BMC Genomics  2014;15(Suppl 6):S11.
Species tree estimation can be challenging in the presence of gene tree conflict due to incomplete lineage sorting (ILS), which can occur when the time between speciation events is short relative to the population size. Of the many methods that have been developed to estimate species trees in the presence of ILS, *BEAST, a Bayesian method that co-estimates the species tree and gene trees given sequence alignments on multiple loci, has generally been shown to have the best accuracy. However, *BEAST is extremely computationally intensive so that it cannot be used with large numbers of loci; hence, *BEAST is not suitable for genome-scale analyses.
We present BBCA (boosted binned coalescent-based analysis), a method that can be used with *BEAST (and other such co-estimation methods) to improve scalability. BBCA partitions the loci randomly into subsets, uses *BEAST on each subset to co-estimate the gene trees and species tree for the subset, and then combines the newly estimated gene trees together using MP-EST, a popular coalescent-based summary method. We compare time-restricted versions of BBCA and *BEAST on simulated datasets, and show that BBCA is at least as accurate as *BEAST, and achieves better convergence rates for large numbers of loci.
Phylogenomic analysis using *BEAST is currently limited to datasets with a small number of loci, and analyses with even just 100 loci can be computationally challenging. BBCA uses a very simple divide-and-conquer approach that makes it possible to use *BEAST on datasets containing hundreds of loci. This study shows that BBCA provides excellent accuracy and is highly scalable.
PMCID: PMC4239591  PMID: 25572469
multi-species coalescent; phylogenomics; incomplete lineage sorting; binning
21.  Disk covering methods improve phylogenomic analyses 
BMC Genomics  2014;15(Suppl 6):S7.
With the rapid growth rate of newly sequenced genomes, species tree inference from multiple genes has become a basic bioinformatics task in comparative and evolutionary biology. However, accurate species tree estimation is difficult in the presence of gene tree discordance, which is often due to incomplete lineage sorting (ILS), modelled by the multi-species coalescent. Several highly accurate coalescent-based species tree estimation methods have been developed over the last decade, including MP-EST. However, the running time for MP-EST increases rapidly as the number of species grows.
We present divide-and-conquer techniques that improve the scalability of MP-EST so that it can run efficiently on large datasets. Surprisingly, this technique also improves the accuracy of species trees estimated by MP-EST, as our study shows on a collection of simulated and biological datasets.
PMCID: PMC4239662  PMID: 25572610
multi-species coalescent process; incomplete lineage sorting; MP-EST; disk covering methods; divide-and-conquer
22.  Algorithms for MDC-Based Multi-Locus Phylogeny Inference: Beyond Rooted Binary Gene Trees on Single Alleles 
Journal of Computational Biology  2011;18(11):1543-1559.
One of the criteria for inferring a species tree from a collection of gene trees, when gene tree incongruence is assumed to be due to incomplete lineage sorting (ILS), is Minimize Deep Coalescence (MDC). Exact algorithms for inferring the species tree from rooted, binary trees under MDC were recently introduced. Nevertheless, in phylogenetic analyses of biological data sets, estimated gene trees may differ from true gene trees, be incompletely resolved, and not necessarily rooted. In this article, we propose new MDC formulations for the cases where the gene trees are unrooted/binary, rooted/non-binary, and unrooted/non-binary. Further, we prove structural theorems that allow us to extend the algorithms for the rooted/binary gene tree case to these cases in a straightforward manner. In addition, we devise MDC-based algorithms for cases when multiple alleles per species may be sampled. We study the performance of these methods in coalescent-based computer simulations.
PMCID: PMC3216099  PMID: 22035329
algorithms; coalescence; dynamic programming; graph theory; phylogenetic trees
23.  DACTAL: divide-and-conquer trees (almost) without alignments 
Bioinformatics  2012;28(12):i274-i282.
Motivation: While phylogenetic analyses of datasets containing 1000–5000 sequences are challenging for existing methods, the estimation of substantially larger phylogenies poses a problem of much greater complexity and scale.
Methods: We present DACTAL, a method for phylogeny estimation that produces trees from unaligned sequence datasets without ever needing to estimate an alignment on the entire dataset. DACTAL combines iteration with a novel divide-and-conquer approach, so that each iteration begins with a tree produced in the prior iteration, decomposes the taxon set into overlapping subsets, estimates trees on each subset, and then combines the smaller trees into a tree on the full taxon set using a new supertree method. We prove that DACTAL is guaranteed to produce the true tree under certain conditions. We compare DACTAL to SATé and maximum likelihood trees on estimated alignments using simulated and real datasets with 1000–27 643 taxa.
Results: Our studies show that on average DACTAL yields more accurate trees than the two-phase methods we studied on very large datasets that are difficult to align, and has approximately the same accuracy on the easier datasets. The comparison to SATé shows that both have the same accuracy, but that DACTAL achieves this accuracy in a fraction of the time. Furthermore, DACTAL can analyze larger datasets than SATé, including a dataset with almost 28 000 sequences.
Availability: DACTAL source code and results of dataset analyses are available at
PMCID: PMC3371850  PMID: 22689772
24.  Standard maximum likelihood analyses of alignments with gaps can be statistically inconsistent 
PLoS Currents  2012;4:RRN1308.
Most statistical methods for phylogenetic estimation in use today treat a gap (generally representing an insertion or deletion, i.e., indel) within the input sequence alignment as missing data. However, the statistical properties of this treatment of indels have not been fully investigated.
We prove that maximum likelihood phylogeny estimation, treating indels as missing data, can be statistically inconsistent for a general (and rather simple) model of sequence evolution, even when given the true alignment. Therefore, accurate phylogeny estimation cannot be guaranteed for maximum likelihood analyses, even given arbitrarily long sequences, when indels are present and treated as missing data.
Our result shows that the standard statistical techniques used to estimate phylogenies from sequence alignments may have unfavorable statistical properties, even when the sequence alignment is accurate and the assumed substitution model matches the generation model. This suggests that the recent research focus on developing statistical methods that treat indel events properly is an important direction for phylogeny estimation.
PMCID: PMC3299439  PMID: 22453901
25.  Treelength Optimization for Phylogeny Estimation 
PLoS ONE  2012;7(3):e33104.
The standard approach to phylogeny estimation uses two phases, in which the first phase produces an alignment on a set of homologous sequences, and the second phase estimates a tree on the multiple sequence alignment. POY, a method which seeks a tree/alignment pair minimizing the total treelength, is the most widely used alternative to this two-phase approach. The topological accuracy of trees computed under treelength optimization is, however, controversial. In particular, one study showed that treelength optimization using simple gap penalties produced poor trees and alignments, and suggested the possibility that if POY were used with an affine gap penalty, it might be able to be competitive with the best two-phase methods. In this paper we report on a study addressing this possibility. We present a new heuristic for treelength, called BeeTLe (Better Treelength), that is guaranteed to produce trees at least as short as POY. We then use this heuristic to analyze a large number of simulated and biological datasets, and compare the resultant trees and alignments to those produced using POY and also maximum likelihood (ML) and maximum parsimony (MP) trees computed on a number of alignments. In general, we find that trees produced by BeeTLe are shorter and more topologically accurate than POY trees, but that neither POY nor BeeTLe produces trees as topologically accurate as ML trees produced on standard alignments. These findings, taken as a whole, suggest that treelength optimization is not as good an approach to phylogenetic tree estimation as maximum likelihood based upon good alignment methods.
PMCID: PMC3307723  PMID: 22442677

Results 1-25 (37)