Search tips
Search criteria

Results 1-5 (5)

Clipboard (0)

Select a Filter Below

more »
Year of Publication
Document Types
1.  Systematic exploration of guide-tree topology effects for small protein alignments 
BMC Bioinformatics  2014;15(1):338.
Guide-trees are used as part of an essential heuristic to enable the calculation of multiple sequence alignments. They have been the focus of much method development but there has been little effort at determining systematically, which guide-trees, if any, give the best alignments. Some guide-tree construction schemes are based on pair-wise distances amongst unaligned sequences. Others try to emulate an underlying evolutionary tree and involve various iteration methods.
We explore all possible guide-trees for a set of protein alignments of up to eight sequences. We find that pairwise distance based default guide-trees sometimes outperform evolutionary guide-trees, as measured by structure derived reference alignments. However, default guide-trees fall way short of the optimum attainable scores. On average chained guide-trees perform better than balanced ones but are not better than default guide-trees for small alignments.
Alignment methods that use Consistency or hidden Markov models to make alignments are less susceptible to sub-optimal guide-trees than simpler methods, that basically use conventional sequence alignment between profiles. The latter appear to be affected positively by evolutionary based guide-trees for difficult alignments and negatively for easy alignments. One phylogeny aware alignment program can strongly discriminate between good and bad guide-trees. The results for randomly chained guide-trees improve with the number of sequences.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-338) contains supplementary material, which is available to authorized users.
PMCID: PMC4287568  PMID: 25282640
Multiple sequence alignment; Guide-tree topology; Alignment accuracy; Benchmarking
2.  Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega 
Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega
Multiple sequence alignments are fundamental to many sequence analysis methods. The new program Clustal Omega can align virtually any number of protein sequences quickly and has powerful features for adding sequences to existing precomputed alignments.
Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.
PMCID: PMC3261699  PMID: 21988835
bioinformatics; hidden Markov models; multiple sequence alignment
3.  A Complete Analysis of HA and NA Genes of Influenza A Viruses 
PLoS ONE  2010;5(12):e14454.
More and more nucleotide sequences of type A influenza virus are available in public databases. Although these sequences have been the focus of many molecular epidemiological and phylogenetic analyses, most studies only deal with a few representative sequences. In this paper, we present a complete analysis of all Haemagglutinin (HA) and Neuraminidase (NA) gene sequences available to allow large scale analyses of the evolution and epidemiology of type A influenza.
Methodology/Principal Findings
This paper describes an analysis and complete classification of all HA and NA gene sequences available in public databases using multivariate and phylogenetic methods.
We analyzed 18975 HA sequences and divided them into 280 subgroups according to multivariate and phylogenetic analyses. Similarly, we divided 11362 NA sequences into 202 subgroups. Compared to previous analyses, this work is more detailed and comprehensive, especially for the bigger datasets. Therefore, it can be used to show the full and complex phylogenetic diversity and provides a framework for studying the molecular evolution and epidemiology of type A influenza virus. For more than 85% of type A influenza HA and NA sequences into GenBank, they are categorized in one unambiguous and unique group. Therefore, our results are a kind of genetic and phylogenetic annotation for influenza HA and NA sequences. In addition, sequences of swine influenza viruses come from 56 HA and 45 NA subgroups. Most of these subgroups also include viruses from other hosts indicating cross species transmission of the viruses between pigs and other hosts. Furthermore, the phylogenetic diversity of swine influenza viruses from Eurasia is greater than that of North American strains and both of them are becoming more diverse. Apart from viruses from human, pigs, birds and horses, viruses from other species show very low phylogenetic diversity. This might indicate that viruses have not become established in these species. Based on current evidence, there is no simple pattern of inter-hemisphere transmission of avian influenza viruses and it appears to happen sporadically. However, for H6 subtype avian influenza viruses, such transmissions might have happened very frequently and multiple and bidirectional transmission events might exist.
PMCID: PMC3012125  PMID: 21209922
4.  Sequence embedding for fast construction of guide trees for multiple sequence alignment 
The most widely used multiple sequence alignment methods require sequences to be clustered as an initial step. Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires memory and time proportional to N2 for N sequences. When N grows larger than 10,000 or so, this becomes increasingly prohibitive and can form a significant barrier to carrying out very large multiple alignments.
In this paper, we have tested variations on a class of embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances.
We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignment. Source code is available for download from
PMCID: PMC2893182  PMID: 20470396
5.  Haplotype frequency estimation error analysis in the presence of missing genotype data 
BMC Bioinformatics  2004;5:188.
Increasingly researchers are turning to the use of haplotype analysis as a tool in population studies, the investigation of linkage disequilibrium, and candidate gene analysis. When the phase of the data is unknown, computational methods, in particular those employing the Expectation-Maximisation (EM) algorithm, are frequently used for estimating the phase and frequency of the underlying haplotypes. These methods have proved very successful, predicting the phase-known frequencies from data for which the phase is unknown with a high degree of accuracy. Recently there has been much speculation as to the effect of unknown, or missing allelic data – a common phenomenon even with modern automated DNA analysis techniques – on the performance of EM-based methods. To this end an EM-based program, modified to accommodate missing data, has been developed, incorporating non-parametric bootstrapping for the calculation of accurate confidence intervals.
Here we present the results of the analyses of various data sets in which randomly selected known alleles have been relabelled as missing. Remarkably, we find that the absence of up to 30% of the data in both biallelic and multiallelic data sets with moderate to strong levels of linkage disequilibrium can be tolerated. Additionally, the frequencies of haplotypes which predominate in the complete data analysis remain essentially the same after the addition of the random noise caused by missing data.
These findings have important implications for the area of data gathering. It may be concluded that small levels of drop out in the data do not affect the overall accuracy of haplotype analysis perceptibly, and that, given recent findings on the effect of inaccurate data, ambiguous data points are best treated as unknown.
PMCID: PMC544188  PMID: 15574202

Results 1-5 (5)